Add a new stretch mode for 2-site Ceph clusters #35906
Conversation
This PR obviously doesn't meet our usual testing standards. :( But I'd like to get it in-tree early for downstream development reasons — this is directly driven by Red Hat's product goals. I've used the set_up_stretch_mode.sh script and manual testing to verify that we do correctly transition through the various healthy/degraded/recovery stretch modes, and that the appropriate peering rules are enforced at a basic level. I am reasonably confident that this won't break any existing functionality, but I'm waiting on the lab to catch up so I can run a suite through to confirm that.
This PR is built on a (trivially-rebased) #32336; we can close that one and merge this if we're happy doing so.
@@ -693,7 +694,19 @@ void OSDMap::Incremental::encode(ceph::buffer::list& bl, uint64_t features) const
  if (target_v >= 9) {
    encode(new_device_class_flags, bl);
  }
  ENCODE_FINISH(bl); // osd-only data
  if (target_v >= 10) {
    encode(change_stretch_mode, bl);
Looks to me like the other fields are only interpreted if change_stretch_mode is set; could we choose not to encode/decode them if !change_stretch_mode?
Talked this over and sounds like we don't want to — Sam has visions of switching normal peering to use some of these fields to solve our mis-placement troubles for Pacific.
Indeed, at least peering_crush_bucket_barrier (which should be renamed peering_crush_failure_domain) should be totally independent if we wish to make use of it normally.
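For reference, a minimal sketch of the conditional encoding floated above (which the thread decides against, so that fields like peering_crush_bucket_barrier remain usable by normal peering). Every field name here other than change_stretch_mode is an assumption, not the PR's actual layout:

```cpp
// Hypothetical gating: only encode the stretch fields when they matter.
if (target_v >= 10) {
  encode(change_stretch_mode, bl);
  if (change_stretch_mode) {
    encode(new_stretch_bucket_count, bl);     // assumed field name
    encode(new_degraded_stretch_mode, bl);    // assumed field name
    encode(new_recovering_stretch_mode, bl);  // assumed field name
  }
}
// decode() would need the symmetric change_stretch_mode check, or old
// and new daemons would disagree about the buffer layout.
```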
@@ -5369,6 +5516,7 @@ PeeringState::Recovering::react(const RequestBackfill &evt)
  if (!ps->async_recovery_targets.empty()) {
    pg_shard_t auth_log_shard;
    bool history_les_bound = false;
    // FIXME: Uh-oh we have to check this return value; choose_acting can fail!
Can it? Our current acting set (and up set, presumably?) meets the requirements, so moving async recovery/backfill targets into acting should always work, right?
You're probably right; this was more of a note-to-self than a careful evaluation. I'll look more closely at what's going on (though if it's not possible to fail, shouldn't we assert that?).
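To make that concrete, a tiny sketch of the assert-instead-of-FIXME option; the choose_acting signature is assumed from the surrounding diff:

```cpp
// If choose_acting() truly cannot fail here (the current acting/up set
// already meets the requirements), make that assumption loud:
pg_shard_t auth_log_shard;
bool history_les_bound = false;
bool ok = ps->choose_acting(auth_log_shard, true, &history_les_bound);
ceph_assert(ok);  // promoting async-recovery targets must not fail
```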
// The per-bucket replica count is calculated with this "target"
// instead of the above crush_bucket_count. This means we can maintain a
// target size of 4 without attempting to place them all in 1 DC
uint32_t peering_crush_bucket_target = 0;
s/peering_crush_bucket_count/peering_crush_bucket_min_size
s/peering_crush_bucket_target/peering_crush_bucket_target_size
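For clarity, the declarations with the suggested renames applied; this is a sketch, where the first comment is one reading of the field's meaning and the second restates the original snippet:

```cpp
// Minimum number of distinct CRUSH buckets the acting set must span
// before a PG can go active.
uint32_t peering_crush_bucket_min_size = 0;    // was peering_crush_bucket_count
// The per-bucket replica count is calculated against this target rather
// than min_size, so a pool can keep size 4 without trying to place all
// 4 replicas in one DC.
uint32_t peering_crush_bucket_target_size = 0; // was peering_crush_bucket_target
```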
src/osd/osd_types.h (Outdated)
// of this bucket type...
uint32_t peering_crush_bucket_barrier = 0;
// including this one
int32_t peering_crush_mandatory_member = 0;
I may have missed something, but for this to be effective, PastIntervals::check_new_interval needs to consider this when dealing with maybe_went_rw.
Same with bucket_count/barrier.
Will have patches for check_new_interval and for the encoding-related make check failures later today.
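A rough sketch of the check_new_interval change being requested; acting_set_satisfies_stretch is a made-up helper standing in for whatever the real constraint check ends up being:

```cpp
// PastIntervals::check_new_interval (sketch): an interval may only have
// gone rw if the old acting set also satisfied the stretch constraints,
// i.e. spanned enough buckets of the barrier type and contained any
// mandatory member -- not merely num_acting >= min_size.
maybe_went_rw =
  num_acting >= old_pg_pool.min_size &&
  acting_set_satisfies_stretch(old_acting, old_pg_pool, *osdmap);  // assumed helper
```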
src/osd/osd_types.h (Outdated)
// of this bucket type...
uint32_t peering_crush_bucket_barrier = 0;
// including this one
int32_t peering_crush_mandatory_member = 0;
I don't think you want to use 0 as default/null here. osd.0 is actually valid. CRUSH_ITEM_NONE seems like it would be the right answer.
I've pushed possibly a more robust approach to calc_replicated_acting for stretch clusters: https://github.com/athanatos/ceph/tree/sjust/wip-stretch-peering. Note that this is entirely 100% untested -- it needs unit testing and debugging. I think the above approach generalizes nicely to fixing the existing deficiency with calc_replicated_acting -- it doesn't necessarily respect failure domains when constructing a temp mapping. If we allow peering_crush_bucket_barrier to be used independently of bucket_count and rename it to something like peering_crush_failure_domain, we could set it on all clusters and eliminate the previous implementation entirely in the future.
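Concretely, the sentinel change might look like the following sketch; CRUSH_ITEM_NONE is the real CRUSH "no item" constant, while acting_contains is a made-up helper:

```cpp
// 0 is osd.0, a perfectly valid OSD id, so it cannot double as "unset".
int32_t peering_crush_mandatory_member = CRUSH_ITEM_NONE;

// Later, when deciding whether an acting set is acceptable (sketch):
if (pool.peering_crush_mandatory_member != CRUSH_ITEM_NONE &&
    !acting_contains(acting, pool.peering_crush_mandatory_member)) {
  return false;  // the required member is absent; this set cannot go active
}
```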
… peers This lets us build up a long-term view of how reliable our peers are. Presently it's not used for anything, but soon we will make elections look at these reliability scores! Still to-do: persist the ConnectionTracker state locally so we don't lose it on restart. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…ders If they are disallowed, they always defer to others and are never deferred to themselves. You can use this if you have a high-latency monitor you want as a peon but not a leader, for instance. Currently there is no way to configure it within the wider ecosystem, though. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This means we can implement a new strategy (or change an existing one) without potentially adding bugs to the older, stable implementations. And admins can eventually select a strategy that works for them. Right now it defaults to CLASSIC (which I brought back after mauling it for the DISALLOW mode) and there's no plumbing to change it in a real monitor. Also still to-do: efficiently invoke all the unit tests on each applicable strategy. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
The leader_acked value resets to -1 once the election completes and we need to be able to check it after that. Whoops! Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We defer to other leaders based on our understanding of their total connectivity score to mons with which we peer. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…nStrategy Switch the ElectionLogic to require an ElectionStrategy input on construction. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This will let us queue up other messages (like pings) that don't imply the election is unstable. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…n_stable This lets us check that the quorum hasn't changed within a given number of timesteps. This is distinguished from election_stable() by ignoring whether non-quorum members are happy. See for instance the test where we isolate two monitors and election_stable() fails but quorum_stable() succeeds. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
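Taken together, the election commits above split leader selection into pluggable strategies. A self-contained sketch of how that selection might behave; the three enum values come from the commit messages, and everything else (the struct, field, and method names) is assumed:

```cpp
#include <set>

enum election_strategy {
  CLASSIC = 1,      // lowest-ranked mon wins, as before
  DISALLOW = 2,     // CLASSIC, but listed mons may never lead
  CONNECTIVITY = 3, // defer based on peer-connectivity scores
};

struct ElectionSketch {
  election_strategy strategy = CLASSIC;
  std::set<int> disallowed_leaders;  // ranks that must stay peons

  // Stand-in for the ConnectionTracker reliability score of a mon.
  double score(int /*rank*/) const { return 1.0; }

  // Should we defer to `candidate` rather than propose ourselves (`me`)?
  bool defer_to(int candidate, int me) const {
    if (disallowed_leaders.count(me))
      return true;  // a disallowed mon always defers
    switch (strategy) {
    case CLASSIC:
      return candidate < me;  // pure rank order
    case DISALLOW:
      return candidate < me && !disallowed_leaders.count(candidate);
    case CONNECTIVITY:
      return score(candidate) > score(me);
    }
    return false;
  }
};
```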
Force-pushed from 76d4195 to 6b4f1d6
Updated the OSDMap feature bit handling; removed it from SIGNIFICANT_FEATURES as that was not actually what we want. The OSDMonitor now additionally prevents boot of OSDs without CEPH_FEATUREMASK_STRETCH_MODE when it's enabled in the OSDMap, and OSDMap::get_features() returns it as a needed OSD feature. Poked a little more at the ElectionLogic first-election zero-score handling so it now passes make check (whoops about that last push, though I don't think it was likely to show up anywhere outside the unit test framework), and I think I fixed the syntax errors in the docs. More tests should happen overnight, though I had to filter out the cephadm jobs as they currently seem to be broken in master.
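A loose sketch of the boot gate described above; OSDMonitor's boot preprocessing is the real entry point, but the exact message fields used here are assumed:

```cpp
// Refuse to let a pre-stretch OSD join once stretch mode is enabled.
if (osdmap.stretch_mode_enabled &&
    !(m->osd_features & CEPH_FEATUREMASK_STRETCH_MODE)) {
  dout(0) << "osd." << m->sb.whoami
          << " lacks STRETCH_MODE feature; holding its boot" << dendl;
  return true;  // ignore the boot request until the OSD is upgraded
}
// OSDMap::get_features() likewise reports the bit as required, so the
// map itself advertises that old OSDs can no longer participate.
```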
https://pulpito.ceph.com/gregf-2020-07-20_09:30:49-rados-wip-stretch-mode-4-distro-basic-smithi/ Two jobs hung so I killed them: 5243222 was running ceph_test_msgr and the process was stuck in MessengerTest.ConnectionRaceReuseBannerTest; 5243339 seemed to have a network hiccup and then the DaemonWatchdog tried and failed to kill the test (I see a hangup coming from osd.0 despite it running happily when I looked, and an issue getting command descriptions from osd.6, but no other indications it had failed). Of the failures: I think the upgrade failures are a result of more encoding issues — current code is unconditionally encoding the new stretch values in pg_pool_t and the OSDMap, assuming the features support it. But that doesn't work when the cluster includes a mix of versions — whoops!
Hmm, yep, my local testing of a patch to fix the encoding is turning up the same bug now. Besides the immediate mismatched-CRC bug, at first glance this looks to me like the OSD's handling of bad CRCs and fetching full maps is broken. :(
…based Previously we compared it to zero, but we could technically want to require osd.0 as a member, maybe? In any case we have a "DNE" indicator in CRUSH_ITEM_NONE, so use it. Also, for osd_types.h, declare a pg_pool_t::pg_CRUSH_ITEM_NONE to use, since apparently we can't import crush.h there and hard-coding it is bad. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
I was erroneously making a copy of the existing OSDMap's pg_pool_t and then putting it into pending_inc's pools member, but there might have already been a modified one there. Use the convenient get_new_pool() function instead. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
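Sketched, the bug and the fix described in this commit; get_new_pool() is the real Incremental helper, while the pool-id and field names are illustrative:

```cpp
// Buggy shape: copies the *committed* pool, silently discarding any
// modifications already queued in pending_inc for this epoch:
//   pg_pool_t p = *osdmap.get_pg_pool(pool_id);
//   pending_inc.new_pools[pool_id] = p;

// Fixed shape: get_new_pool() returns the already-pending copy if one
// exists, otherwise seeds one from the committed pool, so edits stack:
pg_pool_t *p = pending_inc.get_new_pool(pool_id, osdmap.get_pg_pool(pool_id));
p->peering_crush_bucket_count = bucket_count;  // mutate the pending copy
```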
…new compat This struct gets sent out to clients as well as OSDs, and we need them to be able to decode it without crashing/failing. So we can't require it, and happily the OSDs which might act on it will be gated by the OSDMap. Additionally, we don't want older servers to fail OSDMap crc checks, so don't encode the stretch data members if they aren't in use. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We were previously encoding them unconditionally, but that led to OSDMap crc mismatches between old and new daemons, which was bad. Instead, we only set target_v to encode stretch mode when it's in use, and we assert safety upon doing so. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
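In outline, the compat fix in these two commits works as below; the version numbers, feature name, and field names are assumptions for illustration:

```cpp
// Only emit the stretch fields -- and the higher struct version -- when
// stretch mode is actually in use, so old daemons decode identical bytes
// and compute the same OSDMap crc.
uint8_t target_v = 9;
if (stretch_mode_enabled) {
  // Stretch mode can only be enabled once every daemon understands it,
  // so requiring the feature here is safe; assert rather than skew crcs.
  ceph_assert(HAVE_FEATURE(features, STRETCH_MODE));  // assumed feature name
  target_v = 10;
}
ENCODE_START(target_v, 1, bl);
// ... the v9 fields, unchanged ...
if (target_v >= 10) {
  encode(stretch_mode_enabled, bl);
  encode(stretch_bucket_count, bl);  // assumed field name, etc.
}
ENCODE_FINISH(bl);
```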
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Force-pushed from 6b4f1d6 to 0a86fe2
latest updates look good to me
@gregsfortytwo how are the tests looking now? Is this ready to go?
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Conflicts: src/include/ceph_features.h Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@jdurgin @neha-ojha I still have more testing I want to write for this, but it's looking good against our current suite: I tracked all the failures/hangs down to already-identified issues in the tracker, except for one new one I filed as https://tracker.ceph.com/issues/47453, and one case where the initial monitor quorum startup took a couple rounds to quiesce, so it flagged MON_DOWN in the log.
@gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new and I need to check with @ifed01; 5433261/5433382 are also failing abruptly, but I don't think this PR is related.
@neha-ojha, is that an approval? I need you or @jdurgin to actually click the review button! :)
Honestly, I have no clue where https://tracker.ceph.com/issues/47453 comes from, nor any idea how to catch it... Would a QA re-run reveal it again?
update looks good
Actually, I'd like to run this PR on top of latest master, since your upgrade test runs are encountering a bug that's now fixed.
…se.py The "mon add" command now lets you pass in arbitrary numbers of strings, so that you can include locations, so this test is invalid. I considered updating it to only allow a single non-spaced string, but datacenter=site1 rack=abc host=host1 is accepted elsewhere, so let's keep that consistent and just remove this test instead. Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Okay, Yuri's run had 2 upgrade tests fail, but they were both due to finding CephFS client evictions in the logs; things still ran to completion. I pushed a final new patch to deal with the issue in ceph_test_argparse.py that was discovered while backporting to nautilus, and I created a ticket so that we can avoid discovering such things at that late stage in the future: https://tracker.ceph.com/issues/47509
@gregsfortytwo https://pulpito.ceph.com/yuriw-2020-09-16_17:34:03-rados-wip-yuri7-testing-2020-09-16-1533-master-distro-basic-smithi/ - the upgrade tests look good and there are no other related failures; go ahead with the merge.
This PR adds a new "stretch mode" to RADOS. It targets 2-site clusters (with a tiebreaker monitor elsewhere), extends the existing peering rules so that we can directly guarantee peers in each site, and handles netsplits between the sites. More details are in the included dev and user documentation.