
Add a new stretch mode for 2-site Ceph clusters #35906

Merged
merged 101 commits (Sep 18, 2020)

Conversation

gregsfortytwo
Member

This PR adds a new "stretch mode" to RADOS.

It targets 2-site clusters (with a tiebreaker mon elsewhere) and extends the existing rules
so that we can directly guarantee peers in each site, and handles netsplits between them.
More details are in the included dev and user documentation.

@gregsfortytwo
Member Author

This PR obviously doesn't meet our usual testing standards. :( But I'd like to get it in-tree early for downstream development reasons — this is directly driven by Red Hat's product goals. I've used the set_up_stretch_mode.sh script and manual testing to verify that we do correctly transition through the various healthy/degraded/recovery stretch modes, and that the appropriate peering rules are enforced at a basic level. I am reasonably confident that this won't break any existing functionality, but I'm waiting on the lab to catch up so I can run a suite through to confirm that.

@gregsfortytwo
Member Author

This PR is built on a (trivially-rebased) #32336; we can close that one and merge this if we're happy doing so.

@@ -693,7 +694,19 @@ void OSDMap::Incremental::encode(ceph::buffer::list& bl, uint64_t features) cons
if (target_v >= 9) {
encode(new_device_class_flags, bl);
}
ENCODE_FINISH(bl); // osd-only data
if (target_v >= 10) {
encode(change_stretch_mode, bl);
Contributor


Looks to me like the other fields are only interpreted if change_stretch_mode; could we choose not to encode/decode them if !change_stretch_mode?

Member Author


Talked this over and sounds like we don't want to — Sam has visions of switching normal peering to use some of these fields to solve our mis-placement troubles for Pacific.

Contributor


Indeed, at least peering_crush_bucket_barrier (which should be renamed peering_crush_failure_domain) should be totally independent if we wish to make use of it normally.
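
To make the suggestion concrete, here is a minimal, self-contained toy of the gating being discussed: only serialize the optional fields when the change flag is set, so unchanged incrementals stay byte-identical for old daemons. Field names loosely mirror the PR; the buffer type is invented for illustration and is not Ceph's bufferlist machinery.

#include <cassert>
#include <cstdint>
#include <vector>

struct Buf {  // toy stand-in for ceph::buffer::list
  std::vector<uint8_t> data;
  size_t rpos = 0;
  void put32(uint32_t v) { for (int i = 0; i < 4; i++) data.push_back(uint8_t(v >> (8 * i))); }
  uint32_t get32() { uint32_t v = 0; for (int i = 0; i < 4; i++) v |= uint32_t(data[rpos++]) << (8 * i); return v; }
};

struct IncrementalToy {
  bool change_stretch_mode = false;
  uint32_t new_stretch_bucket_count = 0;   // stand-ins for the stretch fields
  uint32_t new_stretch_mode_bucket = 0;

  void encode(Buf& bl, uint32_t target_v) const {
    if (target_v >= 10) {
      bl.put32(change_stretch_mode);
      if (change_stretch_mode) {           // the suggestion: skip these when unset
        bl.put32(new_stretch_bucket_count);
        bl.put32(new_stretch_mode_bucket);
      }
    }
  }
  void decode(Buf& bl, uint32_t struct_v) {
    if (struct_v >= 10) {
      change_stretch_mode = bl.get32();
      if (change_stretch_mode) {
        new_stretch_bucket_count = bl.get32();
        new_stretch_mode_bucket = bl.get32();
      }
    }
  }
};

int main() {
  IncrementalToy inc;                      // change_stretch_mode defaults to false
  Buf bl;
  inc.encode(bl, 10);
  assert(bl.data.size() == 4);             // an unchanged map carries only the flag
  return 0;
}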

@@ -5369,6 +5516,7 @@ PeeringState::Recovering::react(const RequestBackfill &evt)
if (!ps->async_recovery_targets.empty()) {
pg_shard_t auth_log_shard;
bool history_les_bound = false;
// FIXME: Uh-oh we have to check this return value; choose_acting can fail!
Contributor


Can it? Our current acting set (and up set, presumably?) meets the requirements, so moving async recovery/backfill targets into acting should always work, right?

Member Author


You're probably right; I didn't evaluate these too carefully, it was more of a note to check. I'll look more closely at what's going on (though if it's not possible to fail, shouldn't we assert that?).

// The per-bucket replica count is calculated with this "target"
// instead of the above crush_bucket_count. This means we can maintain a
// target size of 4 without attempting to place them all in 1 DC
uint32_t peering_crush_bucket_target = 0;
Contributor


s/peering_crush_bucket_count/peering_crush_bucket_min_size
s/peering_crush_bucket_target/peering_crush_bucket_target_size
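
To make the comment above concrete, a toy of the per-bucket arithmetic; the ceiling-division rule and the example values are assumptions for illustration, not the PR's exact formula.

#include <cassert>
#include <cstdint>

// With pool size 4 and bucket_target 2, each DC gets at most ceil(4/2) = 2
// replicas. In degraded stretch mode the live bucket count may drop to 1,
// but because the cap is computed from the target, the surviving DC still
// receives only 2 replicas rather than all 4.
uint32_t per_bucket_cap(uint32_t pool_size, uint32_t bucket_target) {
  return (pool_size + bucket_target - 1) / bucket_target;  // ceiling division
}

int main() {
  assert(per_bucket_cap(4, 2) == 2);
  return 0;
}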

// of this bucket type...
uint32_t peering_crush_bucket_barrier = 0;
// including this one
int32_t peering_crush_mandatory_member = 0;
Contributor


I may have missed something, but for this to be effective, PastIntervals::check_new_interval needs to consider this when dealing with maybe_went_rw.

Contributor


Same with bucket_count/barrier.

@gregsfortytwo
Member Author

Will have patches for check_new_interval and for the encoding "make check" failures later today.
Sounds like @athanatos has a new, better implementation of calc_replicated_acting incoming as well.
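
On the check_new_interval point: a rough sketch of the kind of test maybe_went_rw would need under stretch rules. With these constraints, an interval whose acting set did not span enough distinct CRUSH buckets could never have gone read-write. All names and the bucket lookup here are illustrative assumptions, not the actual patch.

#include <cstdint>
#include <functional>
#include <set>
#include <vector>

constexpr int CRUSH_ITEM_NONE_TOY = 0x7fffffff;  // mirrors crush.h's sentinel

bool stretch_rules_allow_rw(
    const std::vector<int>& acting,              // acting OSDs in the interval
    uint32_t bucket_count,                       // buckets that must be represented
    int mandatory_member,                        // bucket that must appear, or none
    const std::function<int(int)>& bucket_of) {  // osd -> CRUSH bucket id
  std::set<int> buckets;
  for (int osd : acting)
    buckets.insert(bucket_of(osd));
  if (mandatory_member != CRUSH_ITEM_NONE_TOY && !buckets.count(mandatory_member))
    return false;                                // mandatory site not represented
  return buckets.size() >= bucket_count;         // enough distinct failure domains
}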

// of this bucket type...
uint32_t peering_crush_bucket_barrier = 0;
// including this one
int32_t peering_crush_mandatory_member = 0;
Contributor


I don't think you want to use 0 as default/null here. osd.0 is actually valid. CRUSH_ITEM_NONE seems like it would be the right answer.
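
For reference, crush.h already defines such a sentinel, so the default could read as below (the value is copied from the mainline header; the initializer line is the suggested change, not the PR's code as posted):

#define CRUSH_ITEM_NONE 0x7fffffff  /* from src/crush/crush.h */

int32_t peering_crush_mandatory_member = CRUSH_ITEM_NONE;  // unset by default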

@athanatos
Contributor

athanatos commented Jul 7, 2020

I've pushed a possibly more robust approach to calc_replicated_acting for stretch clusters: https://github.com/athanatos/ceph/tree/sjust/wip-stretch-peering . Note that this is entirely untested -- it needs unit testing and debugging.

I think the above approach generalizes nicely to fixing the existing deficiency in calc_replicated_acting -- the current code doesn't necessarily respect failure domains when constructing a temp mapping. If we allow peering_crush_bucket_barrier to be used independently of bucket_count and rename it to something like peering_crush_failure_domain, we could set it on all clusters and eliminate the previous implementation entirely in the future.
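
A condensed sketch of the direction described, building the acting set by round-robining across failure domains so no single bucket dominates. Purely illustrative; the real calc_replicated_acting also weighs log completeness, primary selection, and up/acting history.

#include <cstdint>
#include <map>
#include <vector>

std::vector<int> pick_acting(
    const std::map<int, std::vector<int>>& candidates_by_bucket,  // bucket -> OSDs, best first
    uint32_t size) {
  std::vector<int> acting;
  size_t round = 0;
  bool progressed = true;
  while (acting.size() < size && progressed) {
    progressed = false;
    for (const auto& kv : candidates_by_bucket) {  // take one OSD per bucket per round
      const std::vector<int>& osds = kv.second;
      if (round < osds.size() && acting.size() < size) {
        acting.push_back(osds[round]);
        progressed = true;
      }
    }
    ++round;
  }
  return acting;
}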

… peers

This lets us build up a long-term view of how reliable our peers are. Presently
it's not used for anything, but soon we will make elections look at these
reliability scores!

Still to-do: persist the ConnectionTracker state locally so we don't lose it
on restart.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
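
A toy rendering of the reliability score this commit describes: an exponentially weighted fraction of pings answered, per peer. The decay constant and the update rule are assumptions, not the ConnectionTracker's exact math.

#include <map>

struct ConnectionScoreToy {
  std::map<int, double> score;  // mon rank -> reliability in [0, 1]
  double decay = 0.9;           // per-interval retention (assumed constant)

  void report(int peer, bool ping_answered) {
    double& s = score[peer];    // new peers start at 0
    s = s * decay + (ping_answered ? 1.0 - decay : 0.0);
  }
  double total() const {        // what an election could compare across mons
    double t = 0;
    for (const auto& kv : score) t += kv.second;
    return t;
  }
};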
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…ders

If they are disallowed, they always defer to others and are never
themselves deferred to. You can use this if you have a high-latency
monitor you want as a peon but not a leader, for instance.
Currently there is no way to configure it from the wider ecosystem, though.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
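
A toy illustration of the rule stated above; the identifiers are made up for the sketch, and the classic lowest-rank-wins tiebreak is assumed.

#include <set>

bool should_defer_to(int me, int candidate,
                     const std::set<int>& disallowed_leaders) {
  if (disallowed_leaders.count(candidate))
    return false;              // never defer to a disallowed mon
  if (disallowed_leaders.count(me))
    return true;               // a disallowed mon always defers
  return candidate < me;       // classic rule: lowest rank wins
}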
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This means we can implement a new strategy (or change an existing one) without
potentially adding bugs to the older, stable implementations. And admins
can eventually select a strategy that works for them. Right now it defaults to
CLASSIC (which I brought back after mauling it for the DISALLOW mode) and
there's no plumbing to change it in a real monitor.

Also still to-do: efficiently invoke all the unit tests on each applicable
strategy.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
The leader_acked value resets to -1 once the election completes and we
need to be able to check it after that. Whoops!

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We defer to other leaders based on our understanding of their total
connectivity score to mons with which we peer.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
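
Pulling the last few commits together, a toy dispatch over the strategies named in this PR. The enum values and hooks are stand-ins for illustration; the real ElectionLogic plumbing is richer than this.

#include <functional>

enum class election_strategy { CLASSIC = 1, DISALLOW = 2, CONNECTIVITY = 3 };

bool defer_to(election_strategy s, int me, int candidate,
              const std::function<bool(int)>& is_disallowed,   // DISALLOW hook
              const std::function<double(int)>& score) {       // CONNECTIVITY hook
  switch (s) {
  case election_strategy::CLASSIC:
    return candidate < me;                       // lowest rank wins
  case election_strategy::DISALLOW:
    if (is_disallowed(candidate)) return false;  // never elect a disallowed mon
    if (is_disallowed(me)) return true;
    return candidate < me;
  case election_strategy::CONNECTIVITY:
    if (score(candidate) != score(me))           // higher total score wins
      return score(candidate) > score(me);
    return candidate < me;                       // fall back to rank on ties
  }
  return false;
}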
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…nStrategy

Switch the ElectionLogic to require an ElectionStrategy input on construction.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
This will let us queue up other messages (like pings) that don't
imply the election is unstable.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…n_stable

This lets us check that the quorum hasn't changed within a given number
of timesteps. This is distinguished from election_stable() by ignoring
whether non-quorum members are happy. See for instance the test where
we isolate two monitors and election_stable() fails but quorum_stable()
succeeds.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
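
A toy rendering of the distinction drawn above: election_stable() asks whether everyone has settled, while quorum_stable(n) only asks whether quorum membership has been unchanged for the last n timesteps. The types here are invented for the sketch.

#include <cstddef>
#include <set>
#include <vector>

using Quorum = std::set<int>;  // ranks currently in quorum

bool quorum_stable(const std::vector<Quorum>& history, std::size_t n) {
  if (history.size() < n) return false;
  for (std::size_t i = history.size() - n; i < history.size(); ++i)
    if (history[i] != history.back())
      return false;            // membership changed within the window
  return true;                 // same quorum for n steps; unhappy outsiders ignored
}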
@gregsfortytwo
Copy link
Member Author

Updated the OSDMap feature bit handling; removed it from SIGNIFICANT_FEATURES as that was not actually what we want. The OSDMonitor now additionally prevents boot of OSDs without CEPH_FEATUREMASK_STRETCH_MODE when it's enabled in the OSDMap, and OSDMap::get_features() returns it as a needed OSD feature.

Poked a little more at the ElectionLogic first-election zero-score handling so it now passes make check (whoops about that last push, though I don't think it was likely to show up anywhere outside the unit test framework), and I think I fixed the syntax errors in the docs. More tests should happen overnight, though I had to filter out the cephadm jobs as they currently seem to be broken in master.

@gregsfortytwo
Member Author

https://pulpito.ceph.com/gregf-2020-07-20_09:30:49-rados-wip-stretch-mode-4-distro-basic-smithi/

Two jobs hung, so I killed them. 5243222 was running ceph_test_msgr and the process was stuck in MessengerTest.ConnectionRaceReuseBannerTest; 5243339 seemed to have a network hiccup and then the DaemonWatchdog tried and failed to kill the test (I see a hangup coming from osd.0 despite it running happily when I looked, and an issue getting command descriptions from osd.6, but no other indications it had failed).

Of the failures:
5243114: Nautilus OSDs are crashing with a segfault somewhere in/under OSD::_committed_osd_maps. I see they are also failing to encode OSD maps with the expected CRC after applying an incremental, but I'm not sure if this is related.
5243163: a mon election interval where mon.c got left out?
5243172: cephadm failures
5243221: OSD incorrectly marked down, but no Tracebacks.
5243224: a mon election interval where one monitor (of 21) got left out, but no Tracebacks.
5243241: cephadm failure
5243272: cephadm failure
5243287: OSDMap commit crashes
5243406: cephadm failure
5243441: cephadm failure

I think the upgrade failures are a result of more encoding issues — the current code unconditionally encodes the new stretch values in pg_pool_t and the OSDMap, assuming the features support it. But that doesn't work when the cluster includes a mix of versions — whoops!

@gregsfortytwo
Member Author

Hmm, yep: my local testing of a patch to fix the encoding is turning up the same bug now. Besides the immediate mismatched-CRC bug, at first glance this looks to me like the OSD's handling of bad CRCs and fetching full maps is broken. :(

…based

Previously we compared it to zero, but we could technically want to require
osd.0 as a member, maybe? In any case we have a "DNE" indicator
in CRUSH_ITEM_NONE, so use it.

Also, for osd_types.h, declare a pg_pool_t::pg_CRUSH_ITEM_NONE to use,
since apparently we can't import crush.h there and hard-coding it is bad.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
I was erroneously making a copy of the existing OSDMap's pg_pool_t and
then putting it into pending_inc's pools member, but there might
have already been a modified one there. Use the convenient
get_new_pool() function instead.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
…new compat

This struct gets sent out to clients as well as OSDs, and we need them
to be able to decode it without crashing/failing. So we can't require it,
and happily the OSDs which might act on it will be gated by the OSDMap.

Additionally, we don't want older servers to fail OSDMap crc checks, so don't
encode the stretch data members if they aren't in use.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
We were previously encoding them unconditionally, but that led to
OSDMap crc mismatches between old and new daemons, which was bad.

Instead, we only set target_v to encode stretch mode when it's in use,
and we assert safety upon doing so.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
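
A minimal sketch of the scheme this commit describes: pick the struct version based on whether stretch mode is actually in use, so pre-stretch daemons compute identical CRCs for maps that never touch the feature. Version numbers follow the earlier diff hunk; the feature check is a stand-in for the real feature-bit plumbing.

#include <cassert>
#include <cstdint>

uint8_t choose_target_v(bool stretch_in_use, bool all_daemons_have_stretch_feature) {
  if (!stretch_in_use)
    return 9;   // old encoding: byte-identical, so old and new daemons agree on CRCs
  // Only safe because the monitor refuses to enable stretch mode (or to boot
  // pre-stretch OSDs afterwards) until the whole cluster advertises the feature.
  assert(all_daemons_have_stretch_feature);
  return 10;    // new encoding with the stretch fields appended
}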
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@jdurgin
Member

jdurgin commented Aug 6, 2020

latest updates look good to me

@jdurgin
Member

jdurgin commented Aug 18, 2020

@gregsfortytwo how are the tests looking now? Is this ready to go?

@gregsfortytwo
Member Author

@jdurgin @neha-ojha I still have more testing I want to write for this, but it's looking good against our current suite:
https://pulpito.ceph.com/gregf-2020-09-14_05:25:36-rados-wip-stretch-mode-distro-basic-smithi/
and I think it needs to go in before more conflicts get introduced (and, well, so I can put it into our downstream builds for testing).

I tracked all the failures/hangs down to issues already identified in the tracker, except for one I filed as https://tracker.ceph.com/issues/47453 and one where the initial monitor quorum startup took a couple of rounds to quiesce, so it flagged MON_DOWN in the log.

@neha-ojha
Member

> @jdurgin @neha-ojha I still have more testing I want to write for this, but it's looking good against our current suite:
> https://pulpito.ceph.com/gregf-2020-09-14_05:25:36-rados-wip-stretch-mode-distro-basic-smithi/
> and I think it needs to go in before more conflicts get introduced (and, well, so I can put it into our downstream builds for testing).
>
> I tracked all the failures/hangs down to issues already identified in the tracker, except for one I filed as https://tracker.ceph.com/issues/47453 and one where the initial monitor quorum startup took a couple of rounds to quiesce, so it flagged MON_DOWN in the log.

@gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.

@gregsfortytwo
Member Author

> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.

@neha-ojha, is that an approval? I need you or @jdurgin to actually click the review button! :)

@ifed01
Contributor

ifed01 commented Sep 16, 2020

>> @jdurgin @neha-ojha I still have more testing I want to write for this, but it's looking good against our current suite:
>> https://pulpito.ceph.com/gregf-2020-09-14_05:25:36-rados-wip-stretch-mode-distro-basic-smithi/
>> and I think it needs to go in before more conflicts get introduced (and, well, so I can put it into our downstream builds for testing).
>> I tracked all the failures/hangs down to issues already identified in the tracker, except for one I filed as https://tracker.ceph.com/issues/47453 and one where the initial monitor quorum startup took a couple of rounds to quiesce, so it flagged MON_DOWN in the log.

> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.

Honestly, I have no clue where https://tracker.ceph.com/issues/47453 comes from, nor any idea how to catch it... Would a QA re-run reveal it again?
Generally we get complaints about corrupted RocksDB from time to time across various Ceph releases, so it's probably not a recent regression. No evidence whether all of them have the same root cause, though...
E.g. here is another log I got yesterday (note that this one is from a monitor on an Octopus release, hence no BlueFS or lower-level stuff):
debug 2020-09-15T12:49:48.523+0000 7fdb85d8f700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 3388771841, got 3472779923 in /var/lib/ceph/mon/ceph-b/store.db/005294.sst offset 108089 size 3925 code = 2 Rocksdb transaction:
Put( Prefix = m key = 'n_sync'0x006c6174'est_monmap' Value size = 379)
Put( Prefix = m key = 'n_sync'0x00696e5f'sync' Value size = 8)
Put( Prefix = m key = 'n_sync'0x006c6173't_committed_floor' Value size = 8)
/home/abuild/rpmbuild/BUILD/ceph-15.2.4-827-g318de690ed/src/mon/MonitorDBStore.h: In function 'int MonitorDBStore::apply_transaction(MonitorDBStore::TransactionRef)' thread 7fdb85d8f700 time 2020-09-15T12:49:48.529255+0000
/home/abuild/rpmbuild/BUILD/ceph-15.2.4-827-g318de690ed/src/mon/MonitorDBStore.h: 354: ceph_abort_msg("failed to write to db")
ceph version 15.2.4-827-g318de690ed (318de69) octopus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fdb92e284c9]
2: (MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x107f) [0x55af70694b3f]
3: (Monitor::sync_start(entity_addrvec_t&, bool)+0x1ec) [0x55af706a38ec]
4: (Monitor::handle_probe_reply(boost::intrusive_ptr<MonOpRequest>)+0xc01) [0x55af706be611]
5: (Monitor::handle_probe(boost::intrusive_ptr<MonOpRequest>)+0x1af) [0x55af706bfd1f]
6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x10c8) [0x55af706d4868]
7: (Monitor::_ms_dispatch(Message*)+0x4fa) [0x55af706d513a]
8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x58) [0x55af707032b8]
9: (DispatchQueue::entry()+0x11c2) [0x7fdb93019d42]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fdb930b87bd]
11: (()+0x84f9) [0x7fdb91c7c4f9]
12: (clone()+0x3f) [0x7fdb90e7ffbf]

@jdurgin
Member

update looks good

@neha-ojha
Member

>> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.
>
> @neha-ojha, is that an approval? I need you or @jdurgin to actually click the review button! :)

Actually, I'd like to run this PR on top of latest master, since your upgrade test runs are encountering a bug that's now fixed.

@yuriw
Contributor

yuriw commented Sep 16, 2020

…se.py

The "mon add" command now lets you pass in arbitrary numbers of strings,
so that you can include locations, so this test is invalid.

I considered updating it to only allow a single non-spaced string, but

datacenter=site1 rack=abc host=host1

is accepted elsewhere, so let's keep that consistent and just remove
this test instead.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo
Member Author

Okay, Yuri's run had 2 upgrade tests fail but they were both due to finding CephFS client evictions in the logs; things still ran to completion.

I pushed one final patch to deal with the issue in ceph_test_argparse.py that was discovered while backporting to nautilus, and I created a ticket so that we can avoid discovering such things at that late stage in the future: https://tracker.ceph.com/issues/47509

@neha-ojha
Member

>>> @gregsfortytwo I am not seeing any new failures in the FAILED jobs, although some have already been fixed in latest master. Among the DEAD jobs, https://tracker.ceph.com/issues/47453 is definitely new (need to check with @ifed01), and 5433261/5433382 are also failing abruptly, but I don't think this PR is related.
>>
>> @neha-ojha, is that an approval? I need you or @jdurgin to actually click the review button! :)
>
> Actually, I'd like to run this PR on top of latest master, since your upgrade test runs are encountering a bug that's now fixed.

@gregsfortytwo https://pulpito.ceph.com/yuriw-2020-09-16_17:34:03-rados-wip-yuri7-testing-2020-09-16-1533-master-distro-basic-smithi/ - the upgrade tests look good and there are no other related failures; go ahead with the merge.
