Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mon,osd,osdc: refactor snap trimming (phase 1) #18276

Merged
merged 32 commits into from Dec 7, 2017

Conversation

liewegas
Copy link
Member

@liewegas liewegas commented Oct 12, 2017

Phase 1 of plan D on the pad http://pad.ceph.com/p/removing_removed_snaps

TODO

  • throttle reporting of (potentially large) purged_snaps immediately after upgrade
  • fix pg upgrade path on first mimic epoch
  • test upgrade with very large existing removed_snaps set

@liewegas liewegas changed the title WIP: mon,osd,osdc: reimplement handling of removed_snaps mon,osd,osdc: reimplement handling of removed_snaps Oct 30, 2017
@liewegas liewegas changed the title mon,osd,osdc: reimplement handling of removed_snaps mon,osd,osdc: reimplement handling of removed_snaps (phase 1) Oct 30, 2017
@liewegas liewegas changed the title mon,osd,osdc: reimplement handling of removed_snaps (phase 1) mon,osd,osdc: refactor snap trimming (phase 1) Oct 30, 2017
@liewegas liewegas force-pushed the wip-removed-snaps branch 3 times, most recently from 73e001d to b551f30 Compare November 5, 2017 06:39
@liewegas liewegas force-pushed the wip-removed-snaps branch 5 times, most recently from 27a2b0c to 84ddfc3 Compare November 15, 2017 16:47
Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure here looks good, but there are some issues.

And the missing data on the wire (and other broken checks) make clear that the current QA suite is not sufficient to validate this PR, so it needs some good tests added.

@@ -167,6 +173,28 @@ inline ostream& operator<<(ostream& out, const set<A, Comp, Alloc>& iset) {
return out;
}

template<class A, class Comp, class Alloc>
inline ostream& operator<<(ostream& out, const boost::container::flat_set<A, Comp, Alloc>& iset) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this boost-specific instead of using the generic ceph-namespaced set/map?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flat_set and flat_map != set and map

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant ceph::flat_map instead of boost::flat_map, guess I left off too many words!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh! because they're not aliased in the ceph namespace. i'm not a real fan of doing that unless there is a reason we'd swap implementations (like we had to with shared_ptr forever ago); it just obscures things for someone reading the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh hmm, I misread the mempool setup and thought you were adding an alias.

I always kind of liked them thanks to the shared_ptr experience. I thought @wjwithagen had taken advantage of that pattern for some porting work as well, but maybe it doesn't matter for boost bits.

@@ -4465,16 +4465,6 @@ struct SnapSet {
return out;
}

// return min element of snaps > after, return max if no such element
snapid_t get_first_snap_after(snapid_t after, snapid_t max) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just killing this because nobody uses it, right? It's always helpful to state a reason in the commit (unused, broken, whatever)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, implied no users because the code still compiles without it. i'll update the commit msg

@@ -412,7 +412,7 @@ class pool_allocator {
\
template<typename k, typename v, typename cmp = std::less<k> > \
using flat_map = boost::container::flat_map<k,v,cmp, \
pool_allocator<std::pair<const k,v>>>; \
pool_allocator<std::pair<k,v>>>; \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Squash this patch?

@@ -4872,6 +4900,34 @@ void OSDMonitor::clear_pool_flags(int64_t pool_id, uint64_t flags)
pool->unset_flag(flags);
}

string OSDMonitor::make_snap_epoch_key(int64_t pool, epoch_t epoch)
{
char k[40];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is pretty stupid/unlikely but there are 15 characters in the string plus up to 10 digits for the epoch plus up to 19 for the pool, which doesn’t quite add up...

More pertinently, perhaps we should try and save some space in the name given @markhpc’s recent concerns about sizing and our having disabled compression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rocksdb and leveldb still do prefix encoding, so the full key isn't stored for every item--only the suffix that changes. i'm not worried about the size (or even performance) here since we aren't doing many queries over this data (it's just the items that are in the process of being purged)... i'm more worried about code and data schema clarify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(fixed the buffer sizes tho!)

// removed_snaps
if (tmp.require_osd_release >= CEPH_RELEASE_MIMIC) {
for (auto& i : pending_inc.new_removed_snaps) {
auto poolid = i.first;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don’t always use this below, and I’m not quite sure why. Be consistent?

src/osd/PG.cc Outdated
interval_set<snapid_t> intersection;
intersection.intersection_of(snap_trimq, info.purged_snaps);
snap_trimq.subtract(intersection);
info.purged_snaps.swap(intersection);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not getting how we can do this swap on pre-mimic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually identical behavior to the old code, except the old code had different debug output if the old purged_snaps wasn't a subset of removed_snaps (if they are == then substracting intersection is the same thing):

  // initialize snap_trimq
   if (is_primary()) {
-    dout(20) << "activate - purged_snaps " << info.purged_snaps
-	     << " cached_removed_snaps " << pool.cached_removed_snaps << dendl;
-    snap_trimq = pool.cached_removed_snaps;
-    interval_set intersection;
-    intersection.intersection_of(snap_trimq, info.purged_snaps);
-    if (intersection == info.purged_snaps) {
-      snap_trimq.subtract(info.purged_snaps);
     } else {
-      dout(0) << "warning: info.purged_snaps (" << info.purged_snaps
-	      << ") is not a subset of removed_snaps" << dendl;
-      snap_trimq.subtract(intersection);
-      assert(!cct->_conf->osd_debug_verify_cached_snaps);
     }
 }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never liked this code (it's too many swaps), but I'm not seeing how we want to swap the new intersection set with the info's purged_snaps. I don't think we did that before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because removed_snaps_queue will shrink as the purged snaps get pruned from the osdmap. when that happens our purged_snaps here also needs to shrink (so we don't keep reporting old news).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am good with us getting the intersection and then trimming the snap_trimq by that value.

But the part where we swap our intersection into the purged_snaps is confusing me. On Mimic code I think it makes sense, because we have set up the snap_trimq from what the monitor knows considers removed_snaps (so we can dump purged_snaps the monitor agrees are purged). But for pre-Mimic, I think we're just tossing out information about which snapshots still exist or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. made that condition on >= mimic. It should never shrink pre-mimic.

src/osd/PG.cc Outdated
out << " snaptrimq=" << pg.snap_trimq;
if (!pg.snap_trimq.empty() ||
pg.info.purged_snaps.size()) {
out << " rsq/ps=";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rsq?

src/osd/PG.cc Outdated
overlap.intersection_of(pg->snap_trimq, added);
lderr(pg->cct) << __func__ << " removed_snaps already contains "
<< overlap << dendl;
bad = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supposed to be true?

src/osd/PG.cc Outdated
}
ldout(pg->cct,10) << __func__ << " new removed_snaps " << i->second
<< ", snap_trimq now " << pg->snap_trimq << dendl;
assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are...being less careful in debug mode? I don’t think that’s how we use this config elsewhere.

src/osd/PG.cc Outdated
}
ldout(pg->cct,10) << __func__ << " new purged_snaps " << j->second
<< ", now " << pg->info.purged_snaps << dendl;
assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad is never set (outside of setup) in this block.

@liewegas
Copy link
Member Author

@gregsfortytwo addressed comments and fixed the upgrade case. haven't squashed yet--when I do that i'll fix up the commit messages appropriately.

@gregsfortytwo
Copy link
Member

Those fixes look good to me, @liewegas.

Think you said there was still one bug outstanding. And I'm still perplexed about the snap_trimq/purged_snaps interval_set intersection swapping code.

@liewegas
Copy link
Member Author

Oh, pretty sure you're right, the purged_snaps swap doesn't belong there at all. /facepalm

We add in the new snap_seq just to try to keep the interval_set
contiguous.

Signed-off-by: Sage Weil <sage@redhat.com>
In reality we only call this when the PG is peered and thus past_intervals
is empty, but this is more defensive in case that changes someday!

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Index by snap and by epoch; separate out pools.

Signed-off-by: Sage Weil <sage@redhat.com>
If a client requests a map older than the mon's oldest, share with
them snaps deleted during the gap too.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
If an in-flight Op has a snapc referencing a deleted snap, remove it
from the snapc.

Signed-off-by: Sage Weil <sage@redhat.com>
This is what the caller is passing.

Signed-off-by: Sage Weil <sage@redhat.com>
If we are so laggy that we aren't contiguous with the mon's latest
map, the mon will provide a summary of removed_snaps for the gap.
Apply those to our in-flight ops.

Signed-off-by: Sage Weil <sage@redhat.com>
Now

 ceph osd {set,unset} nosnaptrim

will suspend or resume snap trimming.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
If we are about to lose our primary status, we don't want to do *any*
of this stuff... especially share_pg_info(), which would get tagged with
the current epoch but confuse our peers!

Signed-off-by: Sage Weil <sage@redhat.com>
This dependency on the ondisk version dates back before argonaut, and no
longer makes sense.  Once the snap is trimmed by the primary, and
purged_snaps is updated, the replica can (must!) blindly follow suit.

Signed-off-by: Sage Weil <sage@redhat.com>
- update snap_trimq and purged_snaps based on new mimic OSDMap fields
- improve debug output to include both trimq and purged

Signed-off-by: Sage Weil <sage@redhat.com>
Explicitly track whether we are a pool snaps pool or a selfmanaged
snaps pool.  This was inferred from removed_snaps.empty() before, but
that was fragile and kludgey and removed_snaps is going away.

The upgrade/compat behavior is a bit tricky:

- on decode, we set the flags based on the legacy condition.  This lets us
use and rely on the flags in memory.
- on encode, we exclude the flags if decoding an older pg_pool_t

Signed-off-by: Sage Weil <sage@redhat.com>
On the first mimic map, consider previously removed_snaps to be removed
in that epoch (since we don't easily know when it happened).

Signed-off-by: Sage Weil <sage@redhat.com>
Be a bit careful here because the mon has to do some bookkeeping to avoid
pruning things twice.  If the PGMapDigest set appears obviously stale,
skip some work (looking at this particular interval) until it is not
obviously stale--move onto the next interval instead.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
These are possible because we update purged_snaps, part of the pg_info_t,
but we do not bump the pg version or match it with a log entry, which
means that the change does not reliably propagate to new OSDs during
peering etc.

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas
Copy link
Member Author

liewegas commented Dec 4, 2017

Okay, I think this is ready now.

I took the "easy" path on the purged_snaps updates. The scenario is:

  • pg log is empty (version is 0'0)
  • acting purges some snap S, adds to purged_snaps
  • reports S as purged to mgr
  • peering, pg moves to different primary
  • new primary does not get purged_snaps because pg info does not update because the pg version matches
  • mon publishes S in new_purged_snaps, but new primary doesn't have it in its purged_snaps

Basically, there are a few options:

  1. Easy. Relax the debug assertion that's there to catch bugs and just trust the OSDMap. S is purged, the new primary just didn't know about it. This is what the PR does now. Note that the same thing is possible even for a non-empty PG, but it is much more rare--the PG has to exist with the same version on the next primary-to-be but not be in the acting set when S is purged. That can happen, but is a race with stray PG removal.
  2. Hacky. Add a timestamp or something for just the purged_snaps and conditionally update that even if the rest of the pg info does not need an update. This would be gross, and still wouldn't catch everything, I don't think.
  3. Pedantic. Update the pg version when we update purged_snaps. I tried the simple version of this (update just the PG but no log entry), which ought to work (thanks to split we don't have to have log entries for this sort of thing) but it broke with EC read/modify/write update pipeline. This would take a fair bit more work to implement properly.

I went for 1 instead of 3 because even if we did 3, I'm not sure we can reenable the debug assertion without also carefully ordering our purged_snaps reporting to the mgr to after we have safely persisted the purged_snaps updates on disk. It feels like a lot of work and complexity for very little benefit (a debug assertion).

@liewegas
Copy link
Member Author

liewegas commented Dec 5, 2017

@gregsfortytwo @jdurgin ping

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am good with option 1 here.
Branch looks good inasmuch as I could identify new code. A little confused by one bit.

advmap.osdmap)) {
ldout(pg->cct, 10) << "Active advmap interval change, fast return" << dendl;
return forward_event();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn’t this caught somewhere else? We haven’t changed the peering algorithms here...
I presume it’s because of the map processing change, but I’m not quite seeing how.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was one of those cases where I was surprised we hadn't hit it before. Until now, the deeply-nested states' AdvMap handler didn't do anything important, so the fact that the outer-state handler that detects the interval change runs after didn't matter. Now, I've added processing to that handler that gets royally confused when it isn't (yet) aware of the interval change. I forget now which crash I saw, but I think it was that purged_snaps (in pg_info_t) was being updated differently. I suspect the option 1 tolerates that better now, but the rest of this function is still all work that shouldn't be run at all if the interval just changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@liewegas liewegas merged commit f3b2eb9 into ceph:master Dec 7, 2017
@liewegas liewegas deleted the wip-removed-snaps branch December 7, 2017 03:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants