New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mon,osd,osdc: refactor snap trimming (phase 1) #18276
Conversation
d30c2e2
to
2e7f411
Compare
2e7f411
to
59a5215
Compare
59a5215
to
d250bdf
Compare
73e001d
to
b551f30
Compare
27a2b0c
to
84ddfc3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall structure here looks good, but there are some issues.
And the missing data on the wire (and other broken checks) make clear that the current QA suite is not sufficient to validate this PR, so it needs some good tests added.
@@ -167,6 +173,28 @@ inline ostream& operator<<(ostream& out, const set<A, Comp, Alloc>& iset) { | |||
return out; | |||
} | |||
|
|||
template<class A, class Comp, class Alloc> | |||
inline ostream& operator<<(ostream& out, const boost::container::flat_set<A, Comp, Alloc>& iset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this boost-specific instead of using the generic ceph-namespaced set/map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flat_set and flat_map != set and map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant ceph::flat_map instead of boost::flat_map, guess I left off too many words!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh! because they're not aliased in the ceph namespace. i'm not a real fan of doing that unless there is a reason we'd swap implementations (like we had to with shared_ptr forever ago); it just obscures things for someone reading the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh hmm, I misread the mempool setup and thought you were adding an alias.
I always kind of liked them thanks to the shared_ptr experience. I thought @wjwithagen had taken advantage of that pattern for some porting work as well, but maybe it doesn't matter for boost bits.
@@ -4465,16 +4465,6 @@ struct SnapSet { | |||
return out; | |||
} | |||
|
|||
// return min element of snaps > after, return max if no such element | |||
snapid_t get_first_snap_after(snapid_t after, snapid_t max) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just killing this because nobody uses it, right? It's always helpful to state a reason in the commit (unused, broken, whatever)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, implied no users because the code still compiles without it. i'll update the commit msg
src/include/mempool.h
Outdated
@@ -412,7 +412,7 @@ class pool_allocator { | |||
\ | |||
template<typename k, typename v, typename cmp = std::less<k> > \ | |||
using flat_map = boost::container::flat_map<k,v,cmp, \ | |||
pool_allocator<std::pair<const k,v>>>; \ | |||
pool_allocator<std::pair<k,v>>>; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Squash this patch?
src/mon/OSDMonitor.cc
Outdated
@@ -4872,6 +4900,34 @@ void OSDMonitor::clear_pool_flags(int64_t pool_id, uint64_t flags) | |||
pool->unset_flag(flags); | |||
} | |||
|
|||
string OSDMonitor::make_snap_epoch_key(int64_t pool, epoch_t epoch) | |||
{ | |||
char k[40]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is pretty stupid/unlikely but there are 15 characters in the string plus up to 10 digits for the epoch plus up to 19 for the pool, which doesn’t quite add up...
More pertinently, perhaps we should try and save some space in the name given @markhpc’s recent concerns about sizing and our having disabled compression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rocksdb and leveldb still do prefix encoding, so the full key isn't stored for every item--only the suffix that changes. i'm not worried about the size (or even performance) here since we aren't doing many queries over this data (it's just the items that are in the process of being purged)... i'm more worried about code and data schema clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(fixed the buffer sizes tho!)
src/mon/OSDMonitor.cc
Outdated
// removed_snaps | ||
if (tmp.require_osd_release >= CEPH_RELEASE_MIMIC) { | ||
for (auto& i : pending_inc.new_removed_snaps) { | ||
auto poolid = i.first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don’t always use this below, and I’m not quite sure why. Be consistent?
src/osd/PG.cc
Outdated
interval_set<snapid_t> intersection; | ||
intersection.intersection_of(snap_trimq, info.purged_snaps); | ||
snap_trimq.subtract(intersection); | ||
info.purged_snaps.swap(intersection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not getting how we can do this swap on pre-mimic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually identical behavior to the old code, except the old code had different debug output if the old purged_snaps wasn't a subset of removed_snaps (if they are == then substracting intersection is the same thing):
// initialize snap_trimq if (is_primary()) { - dout(20) << "activate - purged_snaps " << info.purged_snaps - << " cached_removed_snaps " << pool.cached_removed_snaps << dendl; - snap_trimq = pool.cached_removed_snaps; - interval_set intersection; - intersection.intersection_of(snap_trimq, info.purged_snaps); - if (intersection == info.purged_snaps) { - snap_trimq.subtract(info.purged_snaps); } else { - dout(0) << "warning: info.purged_snaps (" << info.purged_snaps - << ") is not a subset of removed_snaps" << dendl; - snap_trimq.subtract(intersection); - assert(!cct->_conf->osd_debug_verify_cached_snaps); } }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never liked this code (it's too many swaps), but I'm not seeing how we want to swap the new intersection set with the info's purged_snaps. I don't think we did that before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because removed_snaps_queue will shrink as the purged snaps get pruned from the osdmap. when that happens our purged_snaps here also needs to shrink (so we don't keep reporting old news).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am good with us getting the intersection and then trimming the snap_trimq by that value.
But the part where we swap our intersection into the purged_snaps is confusing me. On Mimic code I think it makes sense, because we have set up the snap_trimq from what the monitor knows considers removed_snaps (so we can dump purged_snaps the monitor agrees are purged). But for pre-Mimic, I think we're just tossing out information about which snapshots still exist or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense. made that condition on >= mimic. It should never shrink pre-mimic.
src/osd/PG.cc
Outdated
out << " snaptrimq=" << pg.snap_trimq; | ||
if (!pg.snap_trimq.empty() || | ||
pg.info.purged_snaps.size()) { | ||
out << " rsq/ps="; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rsq?
src/osd/PG.cc
Outdated
overlap.intersection_of(pg->snap_trimq, added); | ||
lderr(pg->cct) << __func__ << " removed_snaps already contains " | ||
<< overlap << dendl; | ||
bad = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supposed to be true?
src/osd/PG.cc
Outdated
} | ||
ldout(pg->cct,10) << __func__ << " new removed_snaps " << i->second | ||
<< ", snap_trimq now " << pg->snap_trimq << dendl; | ||
assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are...being less careful in debug mode? I don’t think that’s how we use this config elsewhere.
src/osd/PG.cc
Outdated
} | ||
ldout(pg->cct,10) << __func__ << " new purged_snaps " << j->second | ||
<< ", now " << pg->info.purged_snaps << dendl; | ||
assert(!bad || pg->cct->_conf->osd_debug_verify_cached_snaps); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ad is never set (outside of setup) in this block.
84ddfc3
to
ce518a5
Compare
@gregsfortytwo addressed comments and fixed the upgrade case. haven't squashed yet--when I do that i'll fix up the commit messages appropriately. |
Those fixes look good to me, @liewegas. Think you said there was still one bug outstanding. And I'm still perplexed about the snap_trimq/purged_snaps interval_set intersection swapping code. |
Oh, pretty sure you're right, the purged_snaps swap doesn't belong there at all. /facepalm |
We add in the new snap_seq just to try to keep the interval_set contiguous. Signed-off-by: Sage Weil <sage@redhat.com>
In reality we only call this when the PG is peered and thus past_intervals is empty, but this is more defensive in case that changes someday! Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Index by snap and by epoch; separate out pools. Signed-off-by: Sage Weil <sage@redhat.com>
If a client requests a map older than the mon's oldest, share with them snaps deleted during the gap too. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
If an in-flight Op has a snapc referencing a deleted snap, remove it from the snapc. Signed-off-by: Sage Weil <sage@redhat.com>
This is what the caller is passing. Signed-off-by: Sage Weil <sage@redhat.com>
If we are so laggy that we aren't contiguous with the mon's latest map, the mon will provide a summary of removed_snaps for the gap. Apply those to our in-flight ops. Signed-off-by: Sage Weil <sage@redhat.com>
Now ceph osd {set,unset} nosnaptrim will suspend or resume snap trimming. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
If we are about to lose our primary status, we don't want to do *any* of this stuff... especially share_pg_info(), which would get tagged with the current epoch but confuse our peers! Signed-off-by: Sage Weil <sage@redhat.com>
This dependency on the ondisk version dates back before argonaut, and no longer makes sense. Once the snap is trimmed by the primary, and purged_snaps is updated, the replica can (must!) blindly follow suit. Signed-off-by: Sage Weil <sage@redhat.com>
- update snap_trimq and purged_snaps based on new mimic OSDMap fields - improve debug output to include both trimq and purged Signed-off-by: Sage Weil <sage@redhat.com>
Explicitly track whether we are a pool snaps pool or a selfmanaged snaps pool. This was inferred from removed_snaps.empty() before, but that was fragile and kludgey and removed_snaps is going away. The upgrade/compat behavior is a bit tricky: - on decode, we set the flags based on the legacy condition. This lets us use and rely on the flags in memory. - on encode, we exclude the flags if decoding an older pg_pool_t Signed-off-by: Sage Weil <sage@redhat.com>
On the first mimic map, consider previously removed_snaps to be removed in that epoch (since we don't easily know when it happened). Signed-off-by: Sage Weil <sage@redhat.com>
Be a bit careful here because the mon has to do some bookkeeping to avoid pruning things twice. If the PGMapDigest set appears obviously stale, skip some work (looking at this particular interval) until it is not obviously stale--move onto the next interval instead. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
8877135
to
be0442d
Compare
These are possible because we update purged_snaps, part of the pg_info_t, but we do not bump the pg version or match it with a log entry, which means that the change does not reliably propagate to new OSDs during peering etc. Signed-off-by: Sage Weil <sage@redhat.com>
be0442d
to
8c44dab
Compare
Okay, I think this is ready now. I took the "easy" path on the purged_snaps updates. The scenario is:
Basically, there are a few options:
I went for 1 instead of 3 because even if we did 3, I'm not sure we can reenable the debug assertion without also carefully ordering our purged_snaps reporting to the mgr to after we have safely persisted the purged_snaps updates on disk. It feels like a lot of work and complexity for very little benefit (a debug assertion). |
@gregsfortytwo @jdurgin ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am good with option 1 here.
Branch looks good inasmuch as I could identify new code. A little confused by one bit.
advmap.osdmap)) { | ||
ldout(pg->cct, 10) << "Active advmap interval change, fast return" << dendl; | ||
return forward_event(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn’t this caught somewhere else? We haven’t changed the peering algorithms here...
I presume it’s because of the map processing change, but I’m not quite seeing how.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was one of those cases where I was surprised we hadn't hit it before. Until now, the deeply-nested states' AdvMap handler didn't do anything important, so the fact that the outer-state handler that detects the interval change runs after didn't matter. Now, I've added processing to that handler that gets royally confused when it isn't (yet) aware of the interval change. I forget now which crash I saw, but I think it was that purged_snaps (in pg_info_t) was being updated differently. I suspect the option 1 tolerates that better now, but the rest of this function is still all work that shouldn't be run at all if the interval just changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
Phase 1 of plan D on the pad http://pad.ceph.com/p/removing_removed_snaps
TODO