
release-20.2: kvserver: fix delaying of splits with uninitialized followers #65500

Merged
merged 2 commits into cockroachdb:release-20.2 from backport20.2-64060 on Jun 1, 2021

Conversation

tbg
Member

@tbg tbg commented May 20, 2021

This backport was requested by @mwang1026 on behalf of the Bulk I/O team.

Backport 2/2 commits from #64060.

/cc @cockroachdb/release


Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot, and additionally imposes an
ordering between the two snapshots (one cannot apply a snapshot for the
post-split replica until the pre-split replica has moved out of the
way). The result of a long burst of splits, such as occurs during
RESTORE and IMPORT operations, is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (#32594) with the desired effect of ensuring
that all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later. To address this,
in #32594 we started unconditionally dropping roughly the first few
seconds' worth of requests to an uninitialized range; the mechanism was
improved in #32782 to only do this if a local neighboring replica is
expected to perform the split soon.
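
For intuition, here is a minimal sketch of that drop decision. All names (`shouldDropMsgApp`, `uninitReplica`) and the grace period are illustrative stand-ins, not the actual kvserver implementation:

```go
// Sketch only: hypothetical names and types, not the actual kvserver code.
package sketch

import "time"

// uninitReplica models the state the drop decision looks at.
type uninitReplica struct {
	createdAt    time.Time
	initialized  bool
	lhsWillSplit bool // a local left-hand neighbor is expected to split into this range
}

// shouldDropMsgApp models the behavior from #32594/#32782: appends to an
// uninitialized replica are dropped for a short grace period, but only while
// a local neighbor is expected to initialize it via a pending split.
func shouldDropMsgApp(r uninitReplica, now time.Time) bool {
	const grace = 2 * time.Second // illustrative; "roughly the first few seconds' worth"
	if r.initialized {
		return false // initialized replicas handle appends normally
	}
	if now.Sub(r.createdAt) > grace {
		return false // patience exhausted; let the MsgApp through (may lead to a snapshot)
	}
	return r.lhsWillSplit // #32782: only drop if the split is expected locally
}
```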

With all this in place, you would've expected us to have all bases
covered, but it turns out that we were still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds once it has been initialized via
its left neighbor's split and the leader reaches out again, which
happens an average of ~500ms after initialization. However, by that
time, it is itself already at the bottom of a long chain of splits, and
the 500ms delay adds to how long it takes for the rest of the chain to
get initialized. Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.
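
To make the compounding concrete, here is a back-of-the-envelope model; the ~500ms per link is the average quoted above, and the chain depths are arbitrary examples:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Each link in the split chain waits, on average, ~500ms for the leader's
	// next periodic contact after being initialized by its left neighbor.
	const perLinkDelay = 500 * time.Millisecond

	for _, depth := range []int{10, 100, 1000} {
		total := time.Duration(depth) * perLinkDelay
		fmt.Printf("chain of %4d splits -> ~%v of accumulated delay\n", depth, total)
	}
	// Output (approximate): 5s, 50s, 8m20s. As described above, this eventually
	// exceeds the patience of the mechanism suppressing the snapshots, and
	// snapshots get requested anyway.
}
```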

The mechanism in #32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While the raft group is still campaigning for leadership, the delay is
upheld. Once it has collapsed into a definite state, the existing logic
pretty much does the right thing, as it waits for the right-hand side to
be initialized.
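
A rough sketch of the adjusted delay loop; all names here are hypothetical, and the real change lives in the kvserver split-delay helper:

```go
// Sketch only: hypothetical names and types, not the actual kvserver code.
package sketch

import "time"

// groupState is a hypothetical snapshot of what the delay loop inspects.
type groupState struct {
	campaigning             bool // the raft group has not yet settled on a leader
	allFollowersInitialized bool // every follower has caught up on the prior splits
}

// delaySplit waits, up to maxDelay, before letting a split proceed. The change
// in this commit: a group that is still campaigning no longer passes for
// "everyone is caught up"; the delay is upheld until the group collapses into
// a definite state, after which the pre-existing check takes over.
func delaySplit(poll func() groupState, maxDelay time.Duration) {
	deadline := time.Now().Add(maxDelay)
	for time.Now().Before(deadline) {
		st := poll()
		if !st.campaigning && st.allFollowersInitialized {
			return // safe to carry out the split now
		}
		time.Sleep(50 * time.Millisecond) // illustrative polling interval
	}
	// Deadline hit: give up delaying and let the split proceed anyway.
}
```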

Closes #61396.

cc @cockroachdb/kv

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.

tbg added 2 commits May 20, 2021 09:32
I noticed that TestSplitBurstWithSlowFollower would average only
<10 splits per second even when no lagging replica was introduced (i.e.
with the `time.Sleep` commented out). Investigating the raft chatter
suggested that after campaigning, the leaseholder of the right-hand
side would not be processed by its Store for a `Ready` handling cycle
until ~50ms (the coalesced heartbeat timeout in the test) had passed,
and a similar delay was observed when it was sending out its votes.
Adding a call to `enqueueRaftUpdateCheck` (sketched below) fixes both;
we end up at ~100 splits per second.

Release note: None
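
A sketch of the idea behind that second commit. Only the `enqueueRaftUpdateCheck` name comes from the commit message; the interface, signature, and surrounding scaffolding here are assumptions for illustration:

```go
// Sketch only: illustrative scaffolding, not the actual kvserver code paths.
package sketch

// raftScheduler is a hypothetical stand-in for the store-side scheduler.
type raftScheduler interface {
	// enqueueRaftUpdateCheck is named after the method mentioned in the
	// commit message; the signature here is an assumption.
	enqueueRaftUpdateCheck(rangeID int64)
}

// campaignAndNudge models the fix: after the right-hand side campaigns, ask
// the scheduler to process its raft Ready immediately instead of waiting for
// the next periodic tick (~50ms, the coalesced heartbeat timeout in the test).
func campaignAndNudge(sched raftScheduler, rangeID int64, campaign func()) {
	campaign()
	// Without this nudge, the votes (and later appends) sit idle until the
	// next scheduler tick, adding ~50ms per split in the test.
	sched.enqueueRaftUpdateCheck(rangeID)
}
```
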
@tbg tbg requested review from a team and adityamaru and removed request for a team May 20, 2021 07:33
@cockroach-teamcity
Member

This change is Reviewable

@adityamaru
Contributor

I'm out of my depth when it comes to reviewing the code changes, but assuming it was a clean backport it LGTM 👍

@tbg tbg requested a review from erikgrinaker May 25, 2021 10:04
@tbg
Member Author

tbg commented May 31, 2021

Friendly ping @erikgrinaker

@tbg tbg merged commit 31cc103 into cockroachdb:release-20.2 Jun 1, 2021
@tbg tbg deleted the backport20.2-64060 branch June 1, 2021 10:56