mon,osd,osdc: refactor snap trimming (phase 1) #18276

liewegas · 2017-10-12T21:17:03Z

Phase 1 of plan D on the pad http://pad.ceph.com/p/removing_removed_snaps

TODO

throttle reporting of (potentially large) purged_snaps immediately after upgrade
fix pg upgrade path on first mimic epoch
test upgrade with very large existing removed_snaps set

gregsfortytwo

The overall structure here looks good, but there are some issues.

And the missing data on the wire (and other broken checks) make clear that the current QA suite is not sufficient to validate this PR, so it needs some good tests added.

gregsfortytwo · 2017-11-22T19:36:56Z

src/include/types.h

@@ -167,6 +173,28 @@ inline ostream& operator<<(ostream& out, const set<A, Comp, Alloc>& iset) {
  return out;
 }

+template<class A, class Comp, class Alloc>
+inline ostream& operator<<(ostream& out, const boost::container::flat_set<A, Comp, Alloc>& iset) {


Why is this boost-specific instead of using the generic ceph-namespaced set/map?

flat_set and flat_map != set and map

I meant ceph::flat_map instead of boost::flat_map, guess I left off too many words!

oh! because they're not aliased in the ceph namespace. i'm not a real fan of doing that unless there is a reason we'd swap implementations (like we had to with shared_ptr forever ago); it just obscures things for someone reading the code.

Oh hmm, I misread the mempool setup and thought you were adding an alias.

I always kind of liked them thanks to the shared_ptr experience. I thought @wjwithagen had taken advantage of that pattern for some porting work as well, but maybe it doesn't matter for boost bits.

gregsfortytwo · 2017-11-22T19:41:02Z

src/osd/osd_types.h

@@ -4465,16 +4465,6 @@ struct SnapSet {
    return out;
  }

-  // return min element of snaps > after, return max if no such element
-  snapid_t get_first_snap_after(snapid_t after, snapid_t max) const {


Just killing this because nobody uses it, right? It's always helpful to state a reason in the commit (unused, broken, whatever)

yeah, implied no users because the code still compiles without it. i'll update the commit msg

gregsfortytwo · 2017-11-22T19:41:58Z

src/include/mempool.h

@@ -412,7 +412,7 @@ class pool_allocator {
 									\
    template<typename k, typename v, typename cmp = std::less<k> >	\
    using flat_map = boost::container::flat_map<k,v,cmp,		\
-						pool_allocator<std::pair<const k,v>>>; \
+						pool_allocator<std::pair<k,v>>>; \


Squash this patch?

gregsfortytwo · 2017-11-27T10:37:31Z

src/mon/OSDMonitor.cc

@@ -4872,6 +4900,34 @@ void OSDMonitor::clear_pool_flags(int64_t pool_id, uint64_t flags)
  pool->unset_flag(flags);
 }

+string OSDMonitor::make_snap_epoch_key(int64_t pool, epoch_t epoch)
+{
+  char k[40];


So this is pretty stupid/unlikely but there are 15 characters in the string plus up to 10 digits for the epoch plus up to 19 for the pool, which doesn’t quite add up...

More pertinently, perhaps we should try and save some space in the name given @markhpc’s recent concerns about sizing and our having disabled compression.

rocksdb and leveldb still do prefix encoding, so the full key isn't stored for every item--only the suffix that changes. i'm not worried about the size (or even performance) here since we aren't doing many queries over this data (it's just the items that are in the process of being purged)... i'm more worried about code and data schema clarity.

(fixed the buffer sizes tho!)
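
The sizing arithmetic from this thread can be checked at compile time. This is a minimal sketch, not the code from the PR: the `removed_epoch_` prefix, the decimal format string, and the name `make_snap_epoch_key_sketch` are assumptions made for illustration (the actual format string is not shown above), and the epoch is assumed to be a 32-bit unsigned value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Worst-case sizes quoted in the review comment.
constexpr std::size_t kLiteralChars   = 15;  // fixed characters in the key
constexpr std::size_t kMaxEpochDigits = 10;  // UINT32_MAX = 4294967295
constexpr std::size_t kMaxPoolDigits  = 19;  // INT64_MAX = 9223372036854775807
constexpr std::size_t kWorstCase =
    kLiteralChars + kMaxEpochDigits + kMaxPoolDigits + 1;  // +1 for the NUL

static_assert(kWorstCase == 45, "15 + 10 + 19 + NUL");
static_assert(kWorstCase > 40, "char k[40] cannot hold the worst case");

// Hypothetical key layout: "removed_epoch_<pool>_<epoch>" (15 literal chars).
std::string make_snap_epoch_key_sketch(std::int64_t pool, std::uint32_t epoch)
{
  char k[kWorstCase + 1];  // one spare byte in case pool is negative
  int n = std::snprintf(k, sizeof(k), "removed_epoch_%lld_%u",
                        static_cast<long long>(pool),
                        static_cast<unsigned>(epoch));
  // snprintf reports the untruncated length, so this catches overflow.
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof(k));
  return std::string(k, static_cast<std::size_t>(n));
}
```

Deriving the buffer size from named constants keeps the assumption visible; if the key layout changes, the `static_assert` lines fail instead of the buffer silently truncating.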

gregsfortytwo · 2017-11-27T10:46:14Z

src/mon/OSDMonitor.cc

+  // removed_snaps
+  if (tmp.require_osd_release >= CEPH_RELEASE_MIMIC) {
+    for (auto& i : pending_inc.new_removed_snaps) {
+      auto poolid = i.first;


You don’t always use this below, and I’m not quite sure why. Be consistent?

gregsfortytwo · 2017-11-27T12:13:47Z

src/osd/PG.cc

+    interval_set<snapid_t> intersection;
+    intersection.intersection_of(snap_trimq, info.purged_snaps);
+    snap_trimq.subtract(intersection);
+    info.purged_snaps.swap(intersection);


I’m not getting how we can do this swap on pre-mimic?

this is actually identical behavior to the old code, except the old code had different debug output if the old purged_snaps wasn't a subset of removed_snaps (if they are == then subtracting intersection is the same thing):

   // initialize snap_trimq
   if (is_primary()) {
-    dout(20) << "activate - purged_snaps " << info.purged_snaps
-	     << " cached_removed_snaps " << pool.cached_removed_snaps << dendl;
-    snap_trimq = pool.cached_removed_snaps;
-    interval_set<snapid_t> intersection;
-    intersection.intersection_of(snap_trimq, info.purged_snaps);
-    if (intersection == info.purged_snaps) {
-      snap_trimq.subtract(info.purged_snaps);
     } else {
-      dout(0) << "warning: info.purged_snaps (" << info.purged_snaps
-	      << ") is not a subset of removed_snaps" << dendl;
-      snap_trimq.subtract(intersection);
-      assert(!cct->_conf->osd_debug_verify_cached_snaps);
     }
   }

I never liked this code (it's too many swaps), but I'm not seeing how we want to swap the new intersection set with the info's purged_snaps. I don't think we did that before?

It's because removed_snaps_queue will shrink as the purged snaps get pruned from the osdmap. when that happens our purged_snaps here also needs to shrink (so we don't keep reporting old news).

I am good with us getting the intersection and then trimming the snap_trimq by that value.

But the part where we swap our intersection into the purged_snaps is confusing me. On Mimic code I think it makes sense, because we have set up the snap_trimq from what the monitor considers removed_snaps (so we can dump purged_snaps the monitor agrees are purged). But for pre-Mimic, I think we're just tossing out information about which snapshots still exist or not.

Yeah, that makes sense. Made that conditional on >= mimic. It should never shrink pre-mimic.
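
The behaviour agreed at this point in the thread can be sketched as follows. This is an illustration only: `std::set` stands in for Ceph's `interval_set<snapid_t>`, and `TrimStateSketch`, `init_snap_trimq_sketch`, and the `require_mimic` parameter are hypothetical names, not the PR's code. Note that a later comment in this conversation concludes the swap should not be there at all.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>

using SnapIds = std::set<std::uint64_t>;  // stand-in for interval_set<snapid_t>

struct TrimStateSketch {
  SnapIds snap_trimq;    // snaps still waiting to be trimmed
  SnapIds purged_snaps;  // snaps this PG has already trimmed
};

void init_snap_trimq_sketch(TrimStateSketch& s, bool require_mimic)
{
  // intersection = snap_trimq AND purged_snaps
  SnapIds intersection;
  std::set_intersection(s.snap_trimq.begin(), s.snap_trimq.end(),
                        s.purged_snaps.begin(), s.purged_snaps.end(),
                        std::inserter(intersection, intersection.end()));

  // Never re-trim something that is already purged.
  for (auto snap : intersection) {
    s.snap_trimq.erase(snap);
  }

  // Only on mimic may purged_snaps shrink to what the map still lists;
  // pre-mimic it keeps every entry.
  if (require_mimic) {
    s.purged_snaps.swap(intersection);
  }
}
```

With `snap_trimq = {1,2,3,4}` and `purged_snaps = {3,4,9}`, both modes leave `snap_trimq = {1,2}`; only the mimic path drops `9` from `purged_snaps`.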

gregsfortytwo · 2017-11-27T12:16:12Z

src/osd/PG.cc

-    out << " snaptrimq=" << pg.snap_trimq;
+  if (!pg.snap_trimq.empty() ||
+      pg.info.purged_snaps.size()) {
+    out << " rsq/ps=";


gregsfortytwo · 2017-11-27T12:20:40Z

src/osd/PG.cc

+	  overlap.intersection_of(pg->snap_trimq, added);
+	  lderr(pg->cct) << __func__ << " removed_snaps already contains "
+			 << overlap << dendl;
+	  bad = false;


Supposed to be true?

gregsfortytwo · 2017-11-27T12:21:25Z

src/osd/PG.cc

+      }
+      ldout(pg->cct,10) << __func__ << " new removed_snaps " << i->second
+			<< ", snap_trimq now " << pg->snap_trimq << dendl;
+      assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps);


We are...being less careful in debug mode? I don’t think that’s how we use this config elsewhere.
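
The two possible readings of that assertion can be compared directly. A minimal sketch, assuming `verify` stands for `osd_debug_verify_cached_snaps`; the function names are hypothetical. The negated form mirrors the removed code quoted earlier in this review, which asserted only when the debug option was set.

```cpp
#include <cassert>

// Condition as written in the diff: assert(!bad || verify).
constexpr bool holds_as_written(bool bad, bool verify)
{
  return !bad || verify;
}

// Alternative with the option negated: assert(!bad || !verify).
constexpr bool holds_if_negated(bool bad, bool verify)
{
  return !bad || !verify;
}

// As written, a bad state aborts only when the debug option is OFF.
static_assert(!holds_as_written(true, false), "aborts with debug off");
static_assert(holds_as_written(true, true), "passes with debug on");

// Negated, a bad state aborts only when the debug option is ON.
static_assert(holds_if_negated(true, false), "passes with debug off");
static_assert(!holds_if_negated(true, true), "aborts with debug on");
```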

gregsfortytwo · 2017-11-27T12:24:32Z

src/osd/PG.cc

+      }
+      ldout(pg->cct,10) << __func__ << " new purged_snaps " << j->second
+			<< ", now " << pg->info.purged_snaps << dendl;
+      assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps);


bad is never set (outside of setup) in this block.

liewegas · 2017-11-28T17:30:52Z

@gregsfortytwo addressed comments and fixed the upgrade case. haven't squashed yet--when I do that i'll fix up the commit messages appropriately.

gregsfortytwo · 2017-11-29T21:19:24Z

Those fixes look good to me, @liewegas.

Think you said there was still one bug outstanding. And I'm still perplexed about the snap_trimq/purged_snaps interval_set intersection swapping code.

liewegas · 2017-11-30T15:13:01Z

Oh, pretty sure you're right, the purged_snaps swap doesn't belong there at all. /facepalm

We add in the new snap_seq just to try to keep the interval_set contiguous. Signed-off-by: Sage Weil <sage@redhat.com>

In reality we only call this when the PG is peered and thus past_intervals is empty, but this is more defensive in case that changes someday! Signed-off-by: Sage Weil <sage@redhat.com>

Signed-off-by: Sage Weil <sage@redhat.com>

Index by snap and by epoch; separate out pools. Signed-off-by: Sage Weil <sage@redhat.com>

If a client requests a map older than the mon's oldest, share with them snaps deleted during the gap too. Signed-off-by: Sage Weil <sage@redhat.com>

Signed-off-by: Sage Weil <sage@redhat.com>

If an in-flight Op has a snapc referencing a deleted snap, remove it from the snapc. Signed-off-by: Sage Weil <sage@redhat.com>

This is what the caller is passing. Signed-off-by: Sage Weil <sage@redhat.com>

If we are so laggy that we aren't contiguous with the mon's latest map, the mon will provide a summary of removed_snaps for the gap. Apply those to our in-flight ops. Signed-off-by: Sage Weil <sage@redhat.com>

Now ceph osd {set,unset} nosnaptrim will suspend or resume snap trimming. Signed-off-by: Sage Weil <sage@redhat.com>

Signed-off-by: Sage Weil <sage@redhat.com>

If we are about to lose our primary status, we don't want to do *any* of this stuff... especially share_pg_info(), which would get tagged with the current epoch but confuse our peers! Signed-off-by: Sage Weil <sage@redhat.com>

This dependency on the ondisk version dates back before argonaut, and no longer makes sense. Once the snap is trimmed by the primary, and purged_snaps is updated, the replica can (must!) blindly follow suit. Signed-off-by: Sage Weil <sage@redhat.com>

- update snap_trimq and purged_snaps based on new mimic OSDMap fields
- improve debug output to include both trimq and purged

Signed-off-by: Sage Weil <sage@redhat.com>

Explicitly track whether we are a pool snaps pool or a selfmanaged snaps pool. This was inferred from removed_snaps.empty() before, but that was fragile and kludgey and removed_snaps is going away.

The upgrade/compat behavior is a bit tricky:
- on decode, we set the flags based on the legacy condition. This lets us use and rely on the flags in memory.
- on encode, we exclude the flags if decoding an older pg_pool_t

Signed-off-by: Sage Weil <sage@redhat.com>

On the first mimic map, consider previously removed_snaps to be removed in that epoch (since we don't easily know when it happened). Signed-off-by: Sage Weil <sage@redhat.com>

Be a bit careful here because the mon has to do some bookkeeping to avoid pruning things twice. If the PGMapDigest set appears obviously stale, skip some work (looking at this particular interval) until it is not obviously stale--move onto the next interval instead. Signed-off-by: Sage Weil <sage@redhat.com>

Signed-off-by: Sage Weil <sage@redhat.com>

These are possible because we update purged_snaps, part of the pg_info_t, but we do not bump the pg version or match it with a log entry, which means that the change does not reliably propagate to new OSDs during peering etc. Signed-off-by: Sage Weil <sage@redhat.com>

liewegas · 2017-12-04T14:50:15Z

Okay, I think this is ready now.

I took the "easy" path on the purged_snaps updates. The scenario is:

1. pg log is empty (version is 0'0)
2. acting purges some snap S, adds to purged_snaps
3. reports S as purged to mgr
4. peering, pg moves to different primary
5. new primary does not get purged_snaps because pg info does not update because the pg version matches
6. mon publishes S in new_purged_snaps, but new primary doesn't have it in its purged_snaps

Basically, there are a few options:

1. Easy. Relax the debug assertion that's there to catch bugs and just trust the OSDMap. S is purged, the new primary just didn't know about it. This is what the PR does now. Note that the same thing is possible even for a non-empty PG, but it is much more rare--the PG has to exist with the same version on the next primary-to-be but not be in the acting set when S is purged. That can happen, but is a race with stray PG removal.
2. Hacky. Add a timestamp or something for just the purged_snaps and conditionally update that even if the rest of the pg info does not need an update. This would be gross, and still wouldn't catch everything, I don't think.
3. Pedantic. Update the pg version when we update purged_snaps. I tried the simple version of this (update just the PG but no log entry), which ought to work (thanks to split we don't have to have log entries for this sort of thing) but it broke with EC read/modify/write update pipeline. This would take a fair bit more work to implement properly.

I went for 1 instead of 3 because even if we did 3, I'm not sure we can reenable the debug assertion without also carefully ordering our purged_snaps reporting to the mgr to after we have safely persisted the purged_snaps updates on disk. It feels like a lot of work and complexity for very little benefit (a debug assertion).
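
Option 1 amounts to treating a mismatch as stale local knowledge instead of a bug. A rough sketch under stated assumptions: `std::set` stands in for `interval_set<snapid_t>`, the name `apply_new_purged_snaps_sketch` is hypothetical, and the exact bookkeeping in the PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

using SnapIds = std::set<std::uint64_t>;  // stand-in for interval_set<snapid_t>

// Apply the snaps the mon reports as purged in a new map epoch.
// Returns how many of them this PG did not know it had purged.
std::size_t apply_new_purged_snaps_sketch(SnapIds& purged_snaps,
                                          const SnapIds& new_purged_snaps)
{
  std::size_t unknown = 0;
  for (auto snap : new_purged_snaps) {
    // The mon has recorded the purge, so the PG can stop reporting it.
    if (purged_snaps.erase(snap) == 0) {
      // A new primary may never have received this entry (see the
      // scenario above); trust the OSDMap and carry on, no assert.
      ++unknown;
    }
  }
  return unknown;
}
```

Returning the count of unknown snaps, instead of asserting, leaves room for a log line at the call site without making the mismatch fatal.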

liewegas · 2017-12-05T15:23:12Z

@gregsfortytwo @jdurgin ping

gregsfortytwo

I am good with option 1 here.
Branch looks good inasmuch as I could identify new code. A little confused by one bit.

gregsfortytwo · 2017-12-06T07:50:27Z

src/osd/PG.cc

+	advmap.osdmap)) {
+    ldout(pg->cct, 10) << "Active advmap interval change, fast return" << dendl;
+    return forward_event();
+  }


Why isn’t this caught somewhere else? We haven’t changed the peering algorithms here...
I presume it’s because of the map processing change, but I’m not quite seeing how.

This was one of those cases where I was surprised we hadn't hit it before. Until now, the deeply-nested states' AdvMap handler didn't do anything important, so the fact that the outer-state handler that detects the interval change runs after didn't matter. Now, I've added processing to that handler that gets royally confused when it isn't (yet) aware of the interval change. I forget now which crash I saw, but I think it was that purged_snaps (in pg_info_t) was being updated differently. I suspect the option 1 tolerates that better now, but the rest of this function is still all work that shouldn't be run at all if the interval just changed.
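
The ordering problem described here can be illustrated with a toy handler. This is only a sketch: the types and names below are hypothetical and far simpler than the PR's state machine.

```cpp
#include <cassert>

enum class Outcome { ForwardedToOuter, ProcessedThenForwarded };

struct AdvMapSketch {
  bool interval_changed;  // would come from comparing old and new maps
};

struct ActiveHandlerSketch {
  int snap_updates = 0;

  Outcome react(const AdvMapSketch& ev)
  {
    if (ev.interval_changed) {
      // The outer state, which runs after this one, deals with the
      // interval change; skip all per-map snap bookkeeping.
      return Outcome::ForwardedToOuter;
    }
    ++snap_updates;  // placeholder for the snap_trimq/purged_snaps work
    return Outcome::ProcessedThenForwarded;
  }
};
```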

Sounds good.

liewegas added core performance labels Oct 12, 2017

liewegas force-pushed the wip-removed-snaps branch from d30c2e2 to 2e7f411 Compare October 30, 2017 21:55

liewegas changed the title ~~WIP: mon,osd,osdc: reimplement handling of removed_snaps~~ mon,osd,osdc: reimplement handling of removed_snaps Oct 30, 2017

liewegas changed the title ~~mon,osd,osdc: reimplement handling of removed_snaps~~ mon,osd,osdc: reimplement handling of removed_snaps (phase 1) Oct 30, 2017

liewegas changed the title ~~mon,osd,osdc: reimplement handling of removed_snaps (phase 1)~~ mon,osd,osdc: refactor snap trimming (phase 1) Oct 30, 2017

liewegas force-pushed the wip-removed-snaps branch from 2e7f411 to 59a5215 Compare October 30, 2017 21:56

liewegas added the wip-sage-testing label Oct 30, 2017

liewegas force-pushed the wip-removed-snaps branch from 59a5215 to d250bdf Compare October 31, 2017 18:50

liewegas removed the wip-sage-testing label Nov 2, 2017

liewegas force-pushed the wip-removed-snaps branch 3 times, most recently from 73e001d to b551f30 Compare November 5, 2017 06:39

liewegas added the wip-sage-testing label Nov 8, 2017

liewegas force-pushed the wip-removed-snaps branch 5 times, most recently from 27a2b0c to 84ddfc3 Compare November 15, 2017 16:47

zmedico mentioned this pull request Nov 17, 2017

osd/PGPool::update: optimize with deleting_snaps #18147

Closed

gregsfortytwo self-requested a review November 21, 2017 01:31

gregsfortytwo requested changes Nov 27, 2017

liewegas force-pushed the wip-removed-snaps branch from 84ddfc3 to ce518a5 Compare November 27, 2017 20:59

liewegas added 4 commits December 1, 2017 21:15

osd/osd_types: note about removed_snaps hack

c536d4c

We add in the new snap_seq just to try to keep the interval_set contiguous. Signed-off-by: Sage Weil <sage@redhat.com>

osd/PG: share_pg_info shares past_itnervals, not PastIntervals()

c8bfe3f

In reality we only call this when the PG is peered and thus past_intervals is empty, but this is more defensive in case that changes someday! Signed-off-by: Sage Weil <sage@redhat.com>

osd/OSDMap: improve osdmap flag dumping in json

81d63f2

Signed-off-by: Sage Weil <sage@redhat.com>

qa/suites/rados/singleton/all/thrash-eio: more whitelist

df7523b

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas added 19 commits December 1, 2017 21:26

mon/OSDMonitor: record removed_snaps by epoch outside of the osdmap

9d606c5

Index by snap and by epoch; separate out pools. Signed-off-by: Sage Weil <sage@redhat.com>

mon/OSDMonitor: share snaps removed during a map gap

49833c3

If a client requests a map older than the mon's oldest, share with them snaps deleted during the gap too. Signed-off-by: Sage Weil <sage@redhat.com>

mon/MgrStatMonitor: dump PGMapDigest at debug level 20

38e96ec

Signed-off-by: Sage Weil <sage@redhat.com>

osdc/Objecter: prune new_removed_snaps from active op snapc's

32d7538

If an in-flight Op has a snapc referencing a deleted snap, remove it from the snapc. Signed-off-by: Sage Weil <sage@redhat.com>

osdc/Objecter: rename _scan_requests force_resend -> skipped_map

b1b8fc6

This is what the caller is passing. Signed-off-by: Sage Weil <sage@redhat.com>

osdc/Objecter: apply removed_snaps from gap to in-flight requests

192a8dc

If we are so laggy that we aren't contiguous with the mon's latest map, the mon will provide a summary of removed_snaps for the gap. Apply those to our in-flight ops. Signed-off-by: Sage Weil <sage@redhat.com>

osd,mon: add 'nosnaptrim' osd flag

a53ba73

Now ceph osd {set,unset} nosnaptrim will suspend or resume snap trimming. Signed-off-by: Sage Weil <sage@redhat.com>

osd/osd_types: add purged_snaps to pg_stat_t

345d3b6

Signed-off-by: Sage Weil <sage@redhat.com>

osd/PG: share purged_snaps with mgr at mimic

6df912b

Signed-off-by: Sage Weil <sage@redhat.com>

mon/PGMap: add purged_snaps map to PGMapDigest

86f0b81

Signed-off-by: Sage Weil <sage@redhat.com>

osd/PG: move debug_verify_cached_snaps check into PGPool::update

e5f62fb

Signed-off-by: Sage Weil <sage@redhat.com>

osd/PG: some whitespace

33c9907

Signed-off-by: Sage Weil <sage@redhat.com>

osd/PG: use new mimic osdmap structures for removed, pruned snaps

6e1b7c4

- update snap_trimq and purged_snaps based on new mimic OSDMap fields
- improve debug output to include both trimq and purged

Signed-off-by: Sage Weil <sage@redhat.com>

mon/OSDMonitor: convert removed_snaps on first mimic map

fd6a59e

On the first mimic map, consider previously removed_snaps to be removed in that epoch (since we don't easily know when it happened). Signed-off-by: Sage Weil <sage@redhat.com>

mon/OSDMonitor: propagate new_removed_snaps to other tiers

f2d602a

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas force-pushed the wip-removed-snaps branch from 8877135 to be0442d Compare December 3, 2017 03:02

liewegas force-pushed the wip-removed-snaps branch from be0442d to 8c44dab Compare December 3, 2017 17:36

gregsfortytwo approved these changes Dec 6, 2017

liewegas added needs-qa wip-sage-testing and removed wip-sage-testing labels Dec 6, 2017

liewegas merged commit f3b2eb9 into ceph:master Dec 7, 2017

liewegas deleted the wip-removed-snaps branch December 7, 2017 03:33
