
Add a new stretch mode for 2-site Ceph clusters #35906

Merged
merged 101 commits (Sep 18, 2020)

Conversation

gregsfortytwo
Member

This PR adds a new "stretch mode" to RADOS.

It targets 2-site clusters (with a tiebreaker mon elsewhere) and extends the existing rules
so that we can directly guarantee peers in each site, and handles netsplits between them.
More details are in the included dev and user documentation.

@gregsfortytwo
Member Author

This PR obviously doesn't meet our usual testing standards. :( But I'd like to get it in-tree early for downstream development reasons — this is directly driven by Red Hat's product goals. I've used the set_up_stretch_mode.sh script and manual testing to verify that we do correctly transition through the various healthy/degraded/recovery stretch modes, and that the appropriate peering rules are enforced at a basic level. I am reasonably confident that this won't break any existing functionality, but I'm waiting on the lab to catch up so I can run a suite through to confirm that.

@gregsfortytwo
Member Author

This PR is built on a (trivially-rebased) #32336; we can close that one and merge this if we're happy doing so.

@@ -693,7 +694,19 @@ void OSDMap::Incremental::encode(ceph::buffer::list& bl, uint64_t features) cons
if (target_v >= 9) {
encode(new_device_class_flags, bl);
}
ENCODE_FINISH(bl); // osd-only data
if (target_v >= 10) {
encode(change_stretch_mode, bl);
Contributor


Looks to me like the other fields are only interpreted if change_stretch_mode; could we choose not to encode/decode them if !change_stretch_mode?

Member Author


Talked this over and sounds like we don't want to — Sam has visions of switching normal peering to use some of these fields to solve our mis-placement troubles for Pacific.

Contributor


Indeed, at least peering_crush_bucket_barrier (which should be renamed peering_crush_failure_domain) should be totally independent if we wish to make use of it normally.
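
To make the suggestion concrete, here is a minimal, self-contained toy of the gating being discussed: only serialize the optional fields when the change flag is set, so unchanged incrementals stay byte-identical for old daemons. Field names loosely mirror the PR; the buffer type is invented for illustration and is not Ceph's bufferlist machinery.

#include <cassert>
#include <cstdint>
#include <vector>

struct Buf {  // toy stand-in for ceph::buffer::list
  std::vector<uint8_t> data;
  size_t rpos = 0;
  void put32(uint32_t v) { for (int i = 0; i < 4; i++) data.push_back(uint8_t(v >> (8 * i))); }
  uint32_t get32() { uint32_t v = 0; for (int i = 0; i < 4; i++) v |= uint32_t(data[rpos++]) << (8 * i); return v; }
};

struct IncrementalToy {
  bool change_stretch_mode = false;
  uint32_t new_stretch_bucket_count = 0;   // stand-ins for the stretch fields
  uint32_t new_stretch_mode_bucket = 0;

  void encode(Buf& bl, uint32_t target_v) const {
    if (target_v >= 10) {
      bl.put32(change_stretch_mode);
      if (change_stretch_mode) {           // the suggestion: skip these when unset
        bl.put32(new_stretch_bucket_count);
        bl.put32(new_stretch_mode_bucket);
      }
    }
  }
  void decode(Buf& bl, uint32_t struct_v) {
    if (struct_v >= 10) {
      change_stretch_mode = bl.get32();
      if (change_stretch_mode) {
        new_stretch_bucket_count = bl.get32();
        new_stretch_mode_bucket = bl.get32();
      }
    }
  }
};

int main() {
  IncrementalToy inc;                      // change_stretch_mode defaults to false
  Buf bl;
  inc.encode(bl, 10);
  assert(bl.data.size() == 4);             // an unchanged map carries only the flag
  return 0;
}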

@@ -5369,6 +5516,7 @@ PeeringState::Recovering::react(const RequestBackfill &evt)
if (!ps->async_recovery_targets.empty()) {
pg_shard_t auth_log_shard;
bool history_les_bound = false;
// FIXME: Uh-oh we have to check this return value; choose_acting can fail!
Contributor


Can it? Our current acting set (and up set, presumably?) meets the requirements, so moving async recovery/backfill targets into acting should always work, right?

Member Author


You're probably right; I didn't evaluate these too carefully, it was more of a note to check. I'll look more closely at what's going on (though if it's not possible to fail, shouldn't we assert that?).

// The per-bucket replica count is calculated with this "target"
// instead of the above crush_bucket_count. This means we can maintain a
// target size of 4 without attempting to place them all in 1 DC
uint32_t peering_crush_bucket_target = 0;
Contributor


s/peering_crush_bucket_count/peering_crush_bucket_min_size
s/peering_crush_bucket_target/peering_crush_bucket_target_size
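
To make the comment above concrete, a toy of the per-bucket arithmetic; the ceiling-division rule and the example values are assumptions for illustration, not the PR's exact formula.

#include <cassert>
#include <cstdint>

// With pool size 4 and bucket_target 2, each DC gets at most ceil(4/2) = 2
// replicas. In degraded stretch mode the live bucket count may drop to 1,
// but because the cap is computed from the target, the surviving DC still
// receives only 2 replicas rather than all 4.
uint32_t per_bucket_cap(uint32_t pool_size, uint32_t bucket_target) {
  return (pool_size + bucket_target - 1) / bucket_target;  // ceiling division
}

int main() {
  assert(per_bucket_cap(4, 2) == 2);
  return 0;
}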

// of this bucket type...
uint32_t peering_crush_bucket_barrier = 0;
// including this one
int32_t peering_crush_mandatory_member = 0;
Contributor


I may have missed something, but for this to be effective, PastIntervals::check_new_interval needs to consider this when dealing with maybe_went_rw.

Contributor


Same with bucket_count/barrier.

@gregsfortytwo
Member Author

Will have patches for check_new_interval and for the encoding "make check" failures later today.
Sounds like @athanatos has a new, better implementation of calc_replicated_acting incoming as well.
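
On the check_new_interval point: a rough sketch of the kind of test maybe_went_rw would need under stretch rules. With these constraints, an interval whose acting set did not span enough distinct CRUSH buckets could never have gone read-write. All names and the bucket lookup here are illustrative assumptions, not the actual patch.

#include <cstdint>
#include <functional>
#include <set>
#include <vector>

constexpr int CRUSH_ITEM_NONE_TOY = 0x7fffffff;  // mirrors crush.h's sentinel

bool stretch_rules_allow_rw(
    const std::vector<int>& acting,              // acting OSDs in the interval
    uint32_t bucket_count,                       // buckets that must be represented
    int mandatory_member,                        // bucket that must appear, or none
    const std::function<int(int)>& bucket_of) {  // osd -> CRUSH bucket id
  std::set<int> buckets;
  for (int osd : acting)
    buckets.insert(bucket_of(osd));
  if (mandatory_member != CRUSH_ITEM_NONE_TOY && !buckets.count(mandatory_member))
    return false;                                // mandatory site not represented
  return buckets.size() >= bucket_count;         // enough distinct failure domains
}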

// of this bucket type...
uint32_t peering_crush_bucket_barrier = 0;
// including this one
int32_t peering_crush_mandatory_member = 0;
Contributor


I don't think you want to use 0 as default/null here. osd.0 is actually valid. CRUSH_ITEM_NONE seems like it would be the right answer.
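
For reference, crush.h already defines such a sentinel, so the default could read as below (the value is copied from the mainline header; the initializer line is the suggested change, not the PR's code as posted):

#define CRUSH_ITEM_NONE 0x7fffffff  /* from src/crush/crush.h */

int32_t peering_crush_mandatory_member = CRUSH_ITEM_NONE;  // unset by default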

@athanatos
Contributor

athanatos commented Jul 7, 2020

I've pushed a possibly more robust approach to calc_replicated_acting for stretch clusters: https://github.com/athanatos/ceph/tree/sjust/wip-stretch-peering . Note that this is entirely untested -- it needs unit testing and debugging.

I think the above approach generalizes nicely to fixing the existing deficiency in calc_replicated_acting -- the current code doesn't necessarily respect failure domains when constructing a temp mapping. If we allow peering_crush_bucket_barrier to be used independently of bucket_count and rename it to something like peering_crush_failure_domain, we could set it on all clusters and eliminate the previous implementation entirely in the future.
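
A condensed sketch of the direction described, building the acting set by round-robining across failure domains so no single bucket dominates. Purely illustrative; the real calc_replicated_acting also weighs log completeness, primary selection, and up/acting history.

#include <cstdint>
#include <map>
#include <vector>

std::vector<int> pick_acting(
    const std::map<int, std::vector<int>>& candidates_by_bucket,  // bucket -> OSDs, best first
    uint32_t size) {
  std::vector<int> acting;
  size_t round = 0;
  bool progressed = true;
  while (acting.size() < size && progressed) {
    progressed = false;
    for (const auto& kv : candidates_by_bucket) {  // take one OSD per bucket per round
      const std::vector<int>& osds = kv.second;
      if (round < osds.size() && acting.size() < size) {
        acting.push_back(osds[round]);
        progressed = true;
      }
    }
    ++round;
  }
  return acting;
}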

… peers

This lets us build up a long-term view of how reliable our peers are. Presently
it's not used for anything, but soon we will make elections look at these
reliability scores!

Still to-do: persist the ConnectionTracker state locally so we don't lose it
on restart.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
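
A toy rendering of the reliability score this commit describes: an exponentially weighted fraction of pings answered, per peer. The decay constant and the update rule are assumptions, not the ConnectionTracker's exact math.

#include <map>

struct ConnectionScoreToy {
  std::map<int, double> score;  // mon rank -> reliability in [0, 1]
  double decay = 0.9;           // per-interval retention (assumed constant)

  void report(int peer, bool ping_answered) {
    double& s = score[peer];    // new peers start at 0
    s = s * decay + (ping_answered ? 1.0 - decay : 0.0);
  }
  double total() const {        // what an election could compare across mons
    double t = 0;
    for (const auto& kv : score) t += kv.second;
    return t;
  }
};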
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…ders

If they are disallowed, they always defer to others and are never
themselves deferred to. You can use this if you have a high-latency
monitor you want as a peon but not a leader, for instance.
Currently there is no way to configure it from the wider ecosystem, though.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
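
A toy illustration of the rule stated above; the identifiers are made up for the sketch, and the classic lowest-rank-wins tiebreak is assumed.

#include <set>

bool should_defer_to(int me, int candidate,
                     const std::set<int>& disallowed_leaders) {
  if (disallowed_leaders.count(candidate))
    return false;              // never defer to a disallowed mon
  if (disallowed_leaders.count(me))
    return true;               // a disallowed mon always defers
  return candidate < me;       // classic rule: lowest rank wins
}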
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This means we can implement a new strategy (or change an existing one) without
potentially adding bugs to the older, stable implementations. And admins
can eventually select a strategy that works for them. Right now it defaults to
CLASSIC (which I brought back after mauling it for the DISALLOW mode) and
there's no plumbing to change it in a real monitor.

Also still to-do: efficiently invoke all the unit tests on each applicable
strategy.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
The leader_acked value resets to -1 once the election completes and we
need to be able to check it after that. Whoops!

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We defer to other leaders based on our understanding of their total
connectivity score to mons with which we peer.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
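
Pulling the last few commits together, a toy dispatch over the strategies named in this PR. The enum values and hooks are stand-ins for illustration; the real ElectionLogic plumbing is richer than this.

#include <functional>

enum class election_strategy { CLASSIC = 1, DISALLOW = 2, CONNECTIVITY = 3 };

bool defer_to(election_strategy s, int me, int candidate,
              const std::function<bool(int)>& is_disallowed,   // DISALLOW hook
              const std::function<double(int)>& score) {       // CONNECTIVITY hook
  switch (s) {
  case election_strategy::CLASSIC:
    return candidate < me;                       // lowest rank wins
  case election_strategy::DISALLOW:
    if (is_disallowed(candidate)) return false;  // never elect a disallowed mon
    if (is_disallowed(me)) return true;
    return candidate < me;
  case election_strategy::CONNECTIVITY:
    if (score(candidate) != score(me))           // higher total score wins
      return score(candidate) > score(me);
    return candidate < me;                       // fall back to rank on ties
  }
  return false;
}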
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…nStrategy

Switch the ElectionLogic to require an ElectionStrategy input on construction.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This will let us queue up other messages (like pings) that don't
imply the election is unstable.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…n_stable

This lets us check that the quorum hasn't changed within a given number
of timesteps. This is distinguished from election_stable() by ignoring
whether non-quorum members are happy. See for instance the test where
we isolate two monitors and election_stable() fails but quorum_stable()
succeeds.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
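
A toy rendering of the distinction drawn above: election_stable() asks whether everyone has settled, while quorum_stable(n) only asks whether quorum membership has been unchanged for the last n timesteps. The types here are invented for the sketch.

#include <cstddef>
#include <set>
#include <vector>

using Quorum = std::set<int>;  // ranks currently in quorum

bool quorum_stable(const std::vector<Quorum>& history, std::size_t n) {
  if (history.size() < n) return false;
  for (std::size_t i = history.size() - n; i < history.size(); ++i)
    if (history[i] != history.back())
      return false;            // membership changed within the window
  return true;                 // same quorum for n steps; unhappy outsiders ignored
}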
@gregsfortytwo
Copy link
Member Author

Updated the OSDMap feature bit handling; removed it from SIGNIFICANT_FEATURES as that was not actually what we want. The OSDMonitor now additionally prevents boot of OSDs without CEPH_FEATUREMASK_STRETCH_MODE when it's enabled in the OSDMap, and OSDMap::get_features() returns it as a needed OSD feature.

Poked a little more at the ElectionLogic first-election zero-score handling so it now passes make check (whoops about that last push, though I don't think it was likely to show up anywhere outside the unit test framework), and I think I fixed the syntax errors in the docs. More tests should happen overnight, though I had to filter out the cephadm jobs as they currently seem to be broken in master.

@gregsfortytwo
Member Author

https://pulpito.ceph.com/gregf-2020-07-20_09:30:49-rados-wip-stretch-mode-4-distro-basic-smithi/

Two jobs hung, so I killed them. 5243222 was running ceph_test_msgr and the process was stuck in MessengerTest.ConnectionRaceReuseBannerTest; 5243339 seemed to have a network hiccup and then the DaemonWatchdog tried and failed to kill the test (I see a hangup coming from osd.0 despite it running happily when I looked, and an issue getting command descriptions from osd.6, but no other indications it had failed).

Of the failures:
5243114: Nautilus OSDs are crashing with a segfault somewhere in/under OSD::_committed_osd_maps. I see they are also failing to encode OSD maps with the expected CRC after applying an incremental, but I'm not sure if this is related.
5243163: a mon election interval where mon.c got left out?
5243172: cephadm failures
5243221: OSD incorrectly marked down, but no Tracebacks.
5243224: a mon election interval where one monitor (of 21) got left out, but no Tracebacks.
5243241: cephadm failure
5243272: cephadm failure
5243287: OSDMap commit crashes
5243406: cephadm failure
5243441: cephadm failure

I think the upgrade failures are a result of more encoding issues — the current code unconditionally encodes the new stretch values in pg_pool_t and the OSDMap, assuming the features support it. But that doesn't work when the cluster includes a mix of versions — whoops!

@gregsfortytwo
Member Author

Hmm, yep: my local testing of a patch to fix the encoding is turning up the same bug now. Besides the immediate mismatched-CRC bug, at first glance this looks to me like the OSD's handling of bad CRCs and fetching full maps is broken. :(

…based

Previously we compared it to zero, but we could technically want to require
osd.0 as a member, maybe? In any case we have a "DNE" indicator
in CRUSH_ITEM_NONE, so use it.

Also, for osd_types.h, declare a pg_pool_t::pg_CRUSH_ITEM_NONE to use,
since apparently we can't import crush.h there and hard-coding it is bad.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
I was erroneously making a copy of the existing OSDMap's pg_pool_t and
then putting it into pending_inc's pools member, but there might
have already been a modified one there. Use the convenient
get_new_pool() function instead.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…new compat

This struct gets sent out to clients as well as OSDs, and we need them
to be able to decode it without crashing/failing. So we can't require it,
and happily the OSDs which might act on it will be gated by the OSDMap.

Additionally, we don't want older servers to fail OSDMap crc checks, so don't
encode the stretch data members if they aren't in use.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We were previously encoding them unconditionally, but that led to
OSDMap crc mismatches between old and new daemons, which was bad.

Instead, we only set target_v to encode stretch mode when it's in use,
and we assert safety upon doing so.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
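
A minimal sketch of the scheme this commit describes: pick the struct version based on whether stretch mode is actually in use, so pre-stretch daemons compute identical CRCs for maps that never touch the feature. Version numbers follow the earlier diff hunk; the feature check is a stand-in for the real feature-bit plumbing.

#include <cassert>
#include <cstdint>

uint8_t choose_target_v(bool stretch_in_use, bool all_daemons_have_stretch_feature) {
  if (!stretch_in_use)
    return 9;   // old encoding: byte-identical, so old and new daemons agree on CRCs
  // Only safe because the monitor refuses to enable stretch mode (or to boot
  // pre-stretch OSDs afterwards) until the whole cluster advertises the feature.
  assert(all_daemons_have_stretch_feature);
  return 10;    // new encoding with the stretch fields appended
}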
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@jdurgin
Member

jdurgin commented Aug 6, 2020

latest updates look good to me

@jdurgin
Member

jdurgin commented Aug 18, 2020

@gregsfortytwo how are the tests looking now? Is this ready to go?

@gregsfortytwo
Member Author

@jdurgin @neha-ojha I still have more testing I want to write for this, but it's looking good against our current suite:
https://pulpito.ceph.com/gregf-2020-09-14_05:25:36-rados-wip-stretch-mode-distro-basic-smithi/
and I think it needs to go in before more conflicts get introduced (and, well, so I can put it into our downstream builds for testing).

I tracked all the failures/hangs down to issues already identified in the tracker, except for one I filed as https://tracker.ceph.com/issues/47453 and one where the initial monitor quorum startup took a couple of rounds to quiesce, so it flagged MON_DOWN in the log.

@neha-ojha
Member

> @jdurgin @neha-ojha I still have more testing I want to write for this, but it's looking good against our current suite:
> https://pulpito.ceph.com/gregf-2020-09-14_05:25:36-rados-wip-stretch-mode-distro-basic-smithi/
> and I think it needs to go in before more conflicts get introduced (and, well, so I can put it into our downstream builds for testing).
>
> I tracked all the failures/hangs down to issues already identified in the tracker, except for one I filed as https://tracker.ceph.com/issues/47453 and one where the initial monitor quorum startup took a couple of rounds to quiesce, so it flagged MON_DOWN in the log.

@gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.

@gregsfortytwo
Member Author

> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.

@neha-ojha, is that an approval? I need you or @jdurgin to actually click the review button! :)

@ifed01
Contributor

ifed01 commented Sep 16, 2020

>> @jdurgin @neha-ojha I still have more testing I want to write for this, but it's looking good against our current suite:
>> https://pulpito.ceph.com/gregf-2020-09-14_05:25:36-rados-wip-stretch-mode-distro-basic-smithi/
>> and I think it needs to go in before more conflicts get introduced (and, well, so I can put it into our downstream builds for testing).
>> I tracked all the failures/hangs down to issues already identified in the tracker, except for one I filed as https://tracker.ceph.com/issues/47453 and one where the initial monitor quorum startup took a couple of rounds to quiesce, so it flagged MON_DOWN in the log.

> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.

Honestly, I have no clue where https://tracker.ceph.com/issues/47453 comes from, nor any idea how to catch it... Would a QA re-run reveal it again?
Generally we get complaints about corrupted RocksDB from time to time across various Ceph releases, so it's probably not a recent regression. No evidence whether all of them have the same root cause, though...
E.g. here is another log I got yesterday (note that this one is from a monitor on an Octopus release, hence no BlueFS or lower-level stuff):
debug 2020-09-15T12:49:48.523+0000 7fdb85d8f700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 3388771841, got 3472779923 in /var/lib/ceph/mon/ceph-b/store.db/005294.sst offset 108089 size 3925 code = 2 Rocksdb transaction:
Put( Prefix = m key = 'n_sync'0x006c6174'est_monmap' Value size = 379)
Put( Prefix = m key = 'n_sync'0x00696e5f'sync' Value size = 8)
Put( Prefix = m key = 'n_sync'0x006c6173't_committed_floor' Value size = 8)
/home/abuild/rpmbuild/BUILD/ceph-15.2.4-827-g318de690ed/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' thread 7fdb85d8f700 time 2020-09-15T12:49:48.529255+0000
/home/abuild/rpmbuild/BUILD/ceph-15.2.4-827-g318de690ed/src/mon/MonitorDBStore.h: 354: ceph_abort_msg("failed to write to db")
ceph version 15.2.4-827-g318de690ed (318de69) octopus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fdb92e284c9]
2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x107f) [0x55af70694b3f]
3: (Monitor::sync_start(entity_addrvec_t&, bool)+0x1ec) [0x55af706a38ec]
4: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0xc01) [0x55af706be611]
5: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x1af) [0x55af706bfd1f]
6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x10c8) [0x55af706d4868]
7: (Monitor::_ms_dispatch(Message*)+0x4fa) [0x55af706d513a]
8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x58) [0x55af707032b8]
9: (DispatchQueue::entry()+0x11c2) [0x7fdb93019d42]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fdb930b87bd]
11: (()+0x84f9) [0x7fdb91c7c4f9]
12: (clone()+0x3f) [0x7fdb90e7ffbf]

@jdurgin
Member

update looks good

@neha-ojha
Member

>> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.
>
> @neha-ojha, is that an approval? I need you or @jdurgin to actually click the review button! :)

Actually, I'd like to run this PR on top of latest master, since your upgrade test runs are encountering a bug that's now fixed.

@yuriw
Contributor

yuriw commented Sep 16, 2020

…se.py

The "mon add" command now lets you pass in arbitrary numbers of strings,
so that you can include locations, so this test is invalid.

I considered updating it to only allow a single non-spaced string, but

datacenter=site1 rack=abc host=host1

is accepted elsewhere, so let's keep that consistent and just remove
this test instead.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo
Member Author

Okay, Yuri's run had 2 upgrade tests fail but they were both due to finding CephFS client evictions in the logs; things still ran to completion.

I pushed one final patch to deal with the issue in ceph_test_argparse.py that was discovered while backporting to nautilus, and I created a ticket so that we can avoid discovering such things at that late stage in the future: https://tracker.ceph.com/issues/47509

@neha-ojha
Member

>>> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.
>>
>> @neha-ojha, is that an approval? I need you or @jdurgin to actually click the review button! :)
>
> Actually, I'd like to run this PR on top of latest master, since your upgrade test runs are encountering a bug that's now fixed.

@gregsfortytwo https://pulpito.ceph.com/yuriw-2020-09-16_17:34:03-rados-wip-yuri7-testing-2020-09-16-1533-master-distro-basic-smithi/ - the upgrade tests look good and there are no other related failures; go ahead with the merge.
