mon,osd,osdc: refactor snap trimming (phase 1) #18276

Merged
merged 32 commits into from Dec 7, 2017

@liewegas
Member

commented Oct 12, 2017

Phase 1 of plan D on the pad http://pad.ceph.com/p/removing_removed_snaps

TODO

  • throttle reporting of (potentially large) purged_snaps immediately after upgrade
  • fix pg upgrade path on first mimic epoch
  • test upgrade with very large existing removed_snaps set

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch from d30c2e2 to 2e7f411 Oct 30, 2017

@liewegas liewegas changed the title WIP: mon,osd,osdc: reimplement handling of removed_snaps mon,osd,osdc: reimplement handling of removed_snaps Oct 30, 2017

@liewegas liewegas changed the title mon,osd,osdc: reimplement handling of removed_snaps mon,osd,osdc: reimplement handling of removed_snaps (phase 1) Oct 30, 2017

@liewegas liewegas changed the title mon,osd,osdc: reimplement handling of removed_snaps (phase 1) mon,osd,osdc: refactor snap trimming (phase 1) Oct 30, 2017

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch from 2e7f411 to 59a5215 Oct 30, 2017

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch from 59a5215 to d250bdf Oct 31, 2017

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch 3 times, most recently from 73e001d to b551f30 Nov 3, 2017

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch 5 times, most recently from 27a2b0c to 84ddfc3 Nov 8, 2017

@gregsfortytwo gregsfortytwo self-requested a review Nov 21, 2017

@gregsfortytwo
Member

left a comment

The overall structure here looks good, but there are some issues.

And the missing data on the wire (and other broken checks) make clear that the current QA suite is not sufficient to validate this PR, so it needs some good tests added.

@@ -167,6 +173,28 @@ inline ostream& operator<<(ostream& out, const set<A, Comp, Alloc>& iset) {
return out;
}

template<class A, class Comp, class Alloc>
inline ostream& operator<<(ostream& out, const boost::container::flat_set<A, Comp, Alloc>& iset) {

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

Why is this boost-specific instead of using the generic ceph-namespaced set/map?

@liewegas

liewegas Nov 27, 2017

Author Member

flat_set and flat_map != set and map

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

I meant ceph::flat_map instead of boost::flat_map, guess I left off too many words!

@liewegas

liewegas Nov 27, 2017

Author Member

oh! because they're not aliased in the ceph namespace. i'm not a real fan of doing that unless there is a reason we'd swap implementations (like we had to with shared_ptr forever ago); it just obscures things for someone reading the code.

@gregsfortytwo

gregsfortytwo Nov 28, 2017

Member

Oh hmm, I misread the mempool setup and thought you were adding an alias.

I always kind of liked them thanks to the shared_ptr experience. I thought @wjwithagen had taken advantage of that pattern for some porting work as well, but maybe it doesn't matter for boost bits.

@@ -4465,16 +4465,6 @@ struct SnapSet {
return out;
}

// return min element of snaps > after, return max if no such element
snapid_t get_first_snap_after(snapid_t after, snapid_t max) const {

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

Just killing this because nobody uses it, right? It's always helpful to state a reason in the commit (unused, broken, whatever)

@liewegas

liewegas Nov 27, 2017

Author Member

yeah, implied no users because the code still compiles without it. i'll update the commit msg

@@ -412,7 +412,7 @@ class pool_allocator {
\
template<typename k, typename v, typename cmp = std::less<k> > \
using flat_map = boost::container::flat_map<k,v,cmp, \
- pool_allocator<std::pair<const k,v>>>; \
+ pool_allocator<std::pair<k,v>>>; \

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

Squash this patch?

@@ -4872,6 +4900,34 @@ void OSDMonitor::clear_pool_flags(int64_t pool_id, uint64_t flags)
pool->unset_flag(flags);
}

string OSDMonitor::make_snap_epoch_key(int64_t pool, epoch_t epoch)
{
char k[40];

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

So this is pretty stupid/unlikely but there are 15 characters in the string plus up to 10 digits for the epoch plus up to 19 for the pool, which doesn’t quite add up...

More pertinently, perhaps we should try and save some space in the name given @markhpc’s recent concerns about sizing and our having disabled compression.

@liewegas

liewegas Nov 27, 2017

Author Member

rocksdb and leveldb still do prefix encoding, so the full key isn't stored for every item--only the suffix that changes. i'm not worried about the size (or even performance) here since we aren't doing many queries over this data (it's just the items that are in the process of being purged)... i'm more worried about code and data schema clarity.

@liewegas

liewegas Nov 27, 2017

Author Member

(fixed the buffer sizes tho!)
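
The buffer-size arithmetic raised in this thread can be checked mechanically. The following is a minimal sketch, not the code from the PR: the function name `make_epoch_key_sketch` and the key layout (`removed_epoch_<pool>_<epoch in hex>`) are assumptions chosen for illustration. It sizes the buffer for the worst case and verifies the `snprintf` return value so that truncation cannot go unnoticed.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical key layout, for illustration only:
//   "removed_epoch_<pool>_<epoch as 8 hex digits>"
std::string make_epoch_key_sketch(int64_t pool, uint32_t epoch) {
  // Worst case: 14 (prefix "removed_epoch_") + 20 (int64_t with sign)
  // + 1 ('_') + 8 (32-bit epoch in hex) + 1 (NUL) = 44 bytes.
  char k[48];
  int n = std::snprintf(k, sizeof(k), "removed_epoch_%" PRId64 "_%08" PRIx32,
                        pool, epoch);
  // snprintf returns the length it wanted to write; a value >= sizeof(k)
  // would mean the key was truncated.
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof(k));
  return std::string(k, static_cast<std::size_t>(n));
}
```

Writing the worst-case sum next to the declaration makes a mismatch like the one spotted in review easy to catch when the format string changes.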

// removed_snaps
if (tmp.require_osd_release >= CEPH_RELEASE_MIMIC) {
for (auto& i : pending_inc.new_removed_snaps) {
auto poolid = i.first;

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

You don’t always use this below, and I’m not quite sure why. Be consistent?

interval_set<snapid_t> intersection;
intersection.intersection_of(snap_trimq, info.purged_snaps);
snap_trimq.subtract(intersection);
info.purged_snaps.swap(intersection);

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

I’m not getting how we can do this swap on pre-mimic?

@liewegas

liewegas Nov 27, 2017

Author Member

this is actually identical behavior to the old code, except the old code had different debug output if the old purged_snaps wasn't a subset of removed_snaps (if they are == then subtracting intersection is the same thing):

  // initialize snap_trimq
   if (is_primary()) {
-    dout(20) << "activate - purged_snaps " << info.purged_snaps
-	     << " cached_removed_snaps " << pool.cached_removed_snaps << dendl;
-    snap_trimq = pool.cached_removed_snaps;
-    interval_set<snapid_t> intersection;
-    intersection.intersection_of(snap_trimq, info.purged_snaps);
-    if (intersection == info.purged_snaps) {
-      snap_trimq.subtract(info.purged_snaps);
     } else {
-      dout(0) << "warning: info.purged_snaps (" << info.purged_snaps
-	      << ") is not a subset of removed_snaps" << dendl;
-      snap_trimq.subtract(intersection);
-      assert(!cct->_conf->osd_debug_verify_cached_snaps);
     }
 }

@gregsfortytwo

gregsfortytwo Nov 28, 2017

Member

I never liked this code (it's too many swaps), but I'm not seeing how we want to swap the new intersection set with the info's purged_snaps. I don't think we did that before?

@liewegas

liewegas Nov 28, 2017

Author Member

It's because removed_snaps_queue will shrink as the purged snaps get pruned from the osdmap. when that happens our purged_snaps here also needs to shrink (so we don't keep reporting old news).

@gregsfortytwo

gregsfortytwo Nov 29, 2017

Member

I am good with us getting the intersection and then trimming the snap_trimq by that value.

But the part where we swap our intersection into the purged_snaps is confusing me. On Mimic code I think it makes sense, because we have set up the snap_trimq from what the monitor considers removed_snaps (so we can dump purged_snaps the monitor agrees are purged). But for pre-Mimic, I think we're just tossing out information about which snapshots still exist or not.

@liewegas

liewegas Nov 29, 2017

Author Member

Yeah, that makes sense. made that condition on >= mimic. It should never shrink pre-mimic.
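
The set arithmetic debated in this thread, as it stood at this point in the review, can be sketched as follows. This is an illustration only: `std::set<uint64_t>` stands in for `interval_set<snapid_t>`, and `TrimState`, `init_trim_state`, and the `mimic` flag are hypothetical names that model the condition described in the comment above.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>

using Snaps = std::set<uint64_t>;  // stand-in for interval_set<snapid_t>

struct TrimState {
  Snaps snap_trimq;    // snaps still waiting to be trimmed
  Snaps purged_snaps;  // snaps this PG has already trimmed
};

// removed: snaps the map reports as removed.
// purged:  snaps the PG has recorded as purged.
TrimState init_trim_state(const Snaps& removed, const Snaps& purged, bool mimic) {
  Snaps overlap;
  std::set_intersection(removed.begin(), removed.end(),
                        purged.begin(), purged.end(),
                        std::inserter(overlap, overlap.end()));
  TrimState st;
  // Queue only what is removed but not yet purged.
  std::set_difference(removed.begin(), removed.end(),
                      overlap.begin(), overlap.end(),
                      std::inserter(st.snap_trimq, st.snap_trimq.end()));
  // Pre-mimic, purged_snaps must never shrink; on mimic it is narrowed
  // to the part the map still lists.
  st.purged_snaps = mimic ? overlap : purged;
  return st;
}
```

With `removed = {1, 2, 3}` and `purged = {2, 9}`, the trim queue is `{1, 3}` in both modes, while `purged_snaps` stays `{2, 9}` pre-mimic and becomes `{2}` when the flag is set.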

out << " snaptrimq=" << pg.snap_trimq;
if (!pg.snap_trimq.empty() ||
pg.info.purged_snaps.size()) {
out << " rsq/ps=";

@gregsfortytwo
overlap.intersection_of(pg->snap_trimq, added);
lderr(pg->cct) << __func__ << " removed_snaps already contains "
<< overlap << dendl;
bad = false;

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

Supposed to be true?

}
ldout(pg->cct,10) << __func__ << " new removed_snaps " << i->second
<< ", snap_trimq now " << pg->snap_trimq << dendl;
assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps);

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

We are...being less careful in debug mode? I don’t think that’s how we use this config elsewhere.
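
To make the polarity concern concrete: the old code quoted earlier in this thread asserted `!osd_debug_verify_cached_snaps` on the inconsistent path, so it aborted only when the debug option was enabled. The sketch below contrasts that with the reviewed expression. Both helper names are hypothetical and the booleans stand in for the real state and config lookup.

```cpp
// Old behavior: abort only for a real inconsistency while strict
// verification is enabled, i.e. assert(!bad || !debug_verify).
bool old_check_aborts(bool bad, bool debug_verify) {
  return bad && debug_verify;
}

// Reviewed expression: assert(!bad || debug_verify) fails when the
// state is bad and verification is *disabled*.
bool reviewed_check_aborts(bool bad, bool debug_verify) {
  return !(!bad || debug_verify);
}
```

For `bad = true`, the old check aborts with the option on and tolerates the problem with it off; the reviewed expression does the reverse, which is the inversion the comment points out.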

}
ldout(pg->cct,10) << __func__ << " new purged_snaps " << j->second
<< ", now " << pg->info.purged_snaps << dendl;
assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps);

@gregsfortytwo

gregsfortytwo Nov 27, 2017

Member

bad is never set (outside of setup) in this block.

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch from 84ddfc3 to ce518a5 Nov 27, 2017

@liewegas

Member Author

commented Nov 28, 2017

@gregsfortytwo addressed comments and fixed the upgrade case. haven't squashed yet--when I do that i'll fix up the commit messages appropriately.

@gregsfortytwo

Member

commented Nov 29, 2017

Those fixes look good to me, @liewegas.

Think you said there was still one bug outstanding. And I'm still perplexed about the snap_trimq/purged_snaps interval_set intersection swapping code.

@liewegas

Member Author

commented Nov 30, 2017

Oh, pretty sure you're right, the purged_snaps swap doesn't belong there at all. /facepalm

liewegas added some commits Oct 13, 2017

osd/osd_types: note about removed_snaps hack
We add in the new snap_seq just to try to keep the interval_set
contiguous.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: reset OSDMap state before decode
This ensures we don't have any cruft left over in fields that decode()
assumes are initialized from the ctor (and not a previous instance).

Signed-off-by: Sage Weil <sage@redhat.com>
include/mempool: add flat_set alias
Signed-off-by: Sage Weil <sage@redhat.com>
include/types: flat_set operator<<
Signed-off-by: Sage Weil <sage@redhat.com>

liewegas added some commits Oct 12, 2017

mon/OSDMonitor: share snaps removed during a map gap
If a client requests a map older than the mon's oldest, share with
them snaps deleted during the gap too.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: rename _scan_requests force_resend -> skipped_map
This is what the caller is passing.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: apply removed_snaps from gap to in-flight requests
If we are so laggy that we aren't contiguous with the mon's latest
map, the mon will provide a summary of removed_snaps for the gap.
Apply those to our in-flight ops.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: record removed_snaps by epoch outside of the osdmap
Index by snap and by epoch; separate out pools.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/MgrStatMonitor: dump PGMapDigest at debug level 20
Signed-off-by: Sage Weil <sage@redhat.com>
osd,mon: add 'nosnaptrim' osd flag
Now

 ceph osd {set,unset} nosnaptrim

will suspend or resume snap trimming.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: add purged_snaps to pg_stat_t
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: share purged_snaps with mgr at mimic
Signed-off-by: Sage Weil <sage@redhat.com>
mon/PGMap: add purged_snaps map to PGMapDigest
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: move debug_verify_cached_snaps check into PGPool::update
Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: pg_pool_t: add FLAG_{SELFMANAGED,POOL}_SNAPS flags
Explicitly track whether we are a pool snaps pool or a selfmanaged
snaps pool.  This was inferred from removed_snaps.empty() before, but
that was fragile and kludgey and removed_snaps is going away.

The upgrade/compat behavior is a bit tricky:

- on decode, we set the flags based on the legacy condition.  This lets us
use and rely on the flags in memory.
- on encode, we exclude the flags if decoding an older pg_pool_t

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: convert removed_snaps on first mimic map
On the first mimic map, consider previously removed_snaps to be removed
in that epoch (since we don't easily know when it happened).

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: some whitespace
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: use new mimic osdmap structures for removed, pruned snaps
- update snap_trimq and purged_snaps based on new mimic OSDMap fields
- improve debug output to include both trimq and purged

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: simplify replica purged_snaps update
This dependency on the ondisk version dates back before argonaut, and no
longer makes sense.  Once the snap is trimmed by the primary, and
purged_snaps is updated, the replica can (must!) blindly follow suit.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: break out of Active AdvMap handler if interval change
If we are about to lose our primary status, we don't want to do *any*
of this stuff... especially share_pg_info(), which would get tagged with
the current epoch but confuse our peers!

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: prune purged snaps
Be a bit careful here because the mon has to do some bookkeeping to avoid
pruning things twice.  If the PGMapDigest set appears obviously stale,
skip some work (looking at this particular interval) until it is not
obviously stale--move onto the next interval instead.

Signed-off-by: Sage Weil <sage@redhat.com>
mon/OSDMonitor: propagate new_removed_snaps to other tiers
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch from 8877135 to be0442d Dec 3, 2017

osd/PG: ignore purged_snaps inconsistencies for now
These are possible because we update purged_snaps, part of the pg_info_t,
but we do not bump the pg version or match it with a log entry, which
means that the change does not reliably propagate to new OSDs during
peering etc.

Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-removed-snaps branch from be0442d to 8c44dab Dec 3, 2017

@liewegas

Member Author

commented Dec 4, 2017

Okay, I think this is ready now.

I took the "easy" path on the purged_snaps updates. The scenario is:

  • pg log is empty (version is 0'0)
  • acting purges some snap S, adds to purged_snaps
  • reports S as purged to mgr
  • peering, pg moves to different primary
  • new primary does not get purged_snaps because pg info does not update because the pg version matches
  • mon publishes S in new_purged_snaps, but new primary doesn't have it in its purged_snaps

Basically, there are a few options:

  1. Easy. Relax the debug assertion that's there to catch bugs and just trust the OSDMap. S is purged, the new primary just didn't know about it. This is what the PR does now. Note that the same thing is possible even for a non-empty PG, but it is much more rare--the PG has to exist with the same version on the next primary-to-be but not be in the acting set when S is purged. That can happen, but is a race with stray PG removal.
  2. Hacky. Add a timestamp or something for just the purged_snaps and conditionally update that even if the rest of the pg info does not need an update. This would be gross, and still wouldn't catch everything, I don't think.
  3. Pedantic. Update the pg version when we update purged_snaps. I tried the simple version of this (update just the PG but no log entry), which ought to work (thanks to split we don't have to have log entries for this sort of thing) but it broke with EC read/modify/write update pipeline. This would take a fair bit more work to implement properly.

I went for 1 instead of 3 because even if we did 3, I'm not sure we can reenable the debug assertion without also carefully ordering our purged_snaps reporting to the mgr to after we have safely persisted the purged_snaps updates on disk. It feels like a lot of work and complexity for very little benefit (a debug assertion).
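
One way to read option 1 is as a tolerant update: when the map publishes purged snaps, drop whatever overlaps the local `purged_snaps` and merely count the entries that were not known locally, rather than asserting on them. The sketch below is an assumption-laden illustration, not the PR's code: `std::set<uint64_t>` stands in for `interval_set<snapid_t>` and `apply_new_purged` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

using Snaps = std::set<uint64_t>;  // stand-in for interval_set<snapid_t>

// Removes from `local` every snap the map reports as purged.
// Returns how many reported snaps were missing locally; the caller can
// log that number instead of treating it as a fatal inconsistency.
std::size_t apply_new_purged(Snaps& local, const Snaps& reported) {
  std::size_t unknown = 0;
  for (uint64_t s : reported) {
    if (local.erase(s) == 0) {
      ++unknown;  // e.g. snap S purged by a previous primary
    }
  }
  return unknown;
}
```

In the scenario above, the new primary would see S counted as unknown and carry on, trusting the OSDMap.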

@liewegas

Member Author

commented Dec 5, 2017

@gregsfortytwo
Member

left a comment

I am good with option 1 here.
Branch looks good inasmuch as I could identify new code. A little confused by one bit.

advmap.osdmap)) {
ldout(pg->cct, 10) << "Active advmap interval change, fast return" << dendl;
return forward_event();
}

@gregsfortytwo

gregsfortytwo Dec 6, 2017

Member

Why isn’t this caught somewhere else? We haven’t changed the peering algorithms here...
I presume it’s because of the map processing change, but I’m not quite seeing how.

@liewegas

liewegas Dec 6, 2017

Author Member

This was one of those cases where I was surprised we hadn't hit it before. Until now, the deeply-nested states' AdvMap handler didn't do anything important, so the fact that the outer-state handler that detects the interval change runs after didn't matter. Now, I've added processing to that handler that gets royally confused when it isn't (yet) aware of the interval change. I forget now which crash I saw, but I think it was that purged_snaps (in pg_info_t) was being updated differently. I suspect the option 1 tolerates that better now, but the rest of this function is still all work that shouldn't be run at all if the interval just changed.

@gregsfortytwo

gregsfortytwo Dec 6, 2017

Member

Sounds good.
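
The ordering problem discussed in this thread can be illustrated with a toy handler. This is a simplified sketch under stated assumptions, not the real state machine: `AdvMapSketch`, `active_on_advmap`, and the action strings are hypothetical, and the only point is that the inner handler bails out before doing any per-map work when the interval has changed.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the AdvMap event plus the result of the
// interval-change check.
struct AdvMapSketch {
  bool interval_changed;
};

// Returns the actions the inner (Active) handler performs, in order.
std::vector<std::string> active_on_advmap(const AdvMapSketch& ev) {
  std::vector<std::string> done;
  if (ev.interval_changed) {
    // Fast return: leave everything to the outer state, which detects
    // and handles the interval change.
    done.push_back("forward_event");
    return done;
  }
  done.push_back("update_snap_trimq");
  done.push_back("share_pg_info");
  done.push_back("forward_event");
  return done;
}
```

With `interval_changed` set, only the forward happens, so nothing gets shared with peers under an interval the PG is about to leave.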

@liewegas liewegas merged commit f3b2eb9 into ceph:master Dec 7, 2017

5 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details
make check (arm64) make check succeeded
Details

@liewegas liewegas deleted the liewegas:wip-removed-snaps branch Dec 7, 2017
