
storage: delay manual splits that would result in more snapshots #32594

Merged
merged 1 commit into cockroachdb:master from tbg:fix/raft-delayed-snaps on Nov 30, 2018

Conversation

@tbg
Member

tbg commented Nov 26, 2018

This is unpolished, but I had used an earlier version of this with what at the
time looked like success. At this point I suspect that this is the best way to
suppress Raft snapshot growth in IMPORT/RESTORE.

(Definitely needs tests).


When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in #31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once.

Release note: None

@tbg tbg requested a review from cockroachdb/core-prs as a code owner Nov 26, 2018

@cockroach-teamcity

Member

cockroach-teamcity commented Nov 26, 2018

This change is Reviewable

@tbg tbg requested a review from petermattis Nov 26, 2018

@tbg tbg force-pushed the tbg:fix/raft-delayed-snaps branch from 063f372 to 21da18d Nov 26, 2018

@petermattis
Contributor

petermattis left a comment

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 282 at r2 (raw file):

	}
	if tPostWait == (time.Time{}) {
		tPostWait = timeutil.Now()

I'm not understanding what is motivating this handling of tPostWait. I think you want a boolean indicating whether you ever had a time.After call, and only if so output the "delayed split for ..." message. As it is, it appears you'll always output that message.
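
A minimal, hypothetical sketch of that suggestion, standard library only; delaySplit, stillBehind, and the tick parameters are invented for illustration and are not the PR's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// delaySplit remembers whether the loop ever blocked in time.After and
// only then attaches the "delayed split" detail to the log message.
// stillBehind stands in for the real "does a follower still need to be
// caught up?" check.
func delaySplit(ctx context.Context, maxTicks int, tick time.Duration, stillBehind func() bool) string {
	tPreWait := time.Now()
	waited := false
	for i := 0; i < maxTicks && stillBehind(); i++ {
		waited = true
		select {
		case <-time.After(tick):
		case <-ctx.Done():
			return ""
		}
	}
	if !waited {
		return "" // never blocked, so don't mention a delay at all
	}
	return fmt.Sprintf("; delayed split for %s to avoid Raft snapshot", time.Since(tPreWait))
}

func main() {
	n := 0
	extra := delaySplit(context.Background(), 5, 10*time.Millisecond, func() bool {
		n++
		return n <= 2 // pretend the follower catches up after two ticks
	})
	fmt.Println("initiating a split of this range" + extra)
}
```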


pkg/storage/replica_command.go, line 393 at r2 (raw file):

	log.Infof(ctx, "initiating a split of this range at key %s [r%d]%s",
		extra, splitKey, rightDesc.RangeID)

Is the argument reordering intentional? Seems like that messes with the log message strangely.

@tbg tbg force-pushed the tbg:fix/raft-delayed-snaps branch 2 times, most recently from 02d4795 to 3f1a1f4 Nov 26, 2018

@tbg
Member

tbg left a comment

The small fry missing from this change aside, assuming this addresses the problem, is this what we want? Does any alternative approach come to mind? I concede defeat on avoiding Raft snapshots.

The RaftPostSplitSuppressSnapshotTicks mechanism makes snapshots much worse when it delays without the split trigger showing up, and yet it seems impossible to do much better. I tried (unpushed WIP) to move that detection to the receiver (i.e. the Raft leader of the uninitialized RHS), and this works very well unless the Raft leader of the RHS isn't colocated with that of the LHS. That colocation mostly holds today (though not in the test that I added for this mechanism, which is now unfixably flaky), and I imagine that in the future we won't want splits to pin the RHS and LHS leaseholders together as is done today.

(And even if that were somehow circumvented, the problem remains that a single Raft snapshot, for whatever reason, is multiplied by whatever number of splits hit the range in that state, unless there's a mechanism such as this one.)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 282 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not understanding what is motivating this handling of tPostWait. I think you want a boolean indicating whether you ever had a time.After call, and only if so output the "delayed split for ..." message. As it is, it appears you'll always output that message.

I think this works as intended. Note that we initialize tPostWait := tPreWait and zero it out only when we're actually waiting (i.e. when ticks > 0, though that should really be ticks = 1). I agree that a bool is saner though, done.


pkg/storage/replica_command.go, line 393 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is the argument reordering intentional? Seems like that messes with the log message strangely.

No, and I just fixed this because the output looked really confusing.

@tbg tbg force-pushed the tbg:fix/raft-delayed-snaps branch from 3f1a1f4 to 0e46399 Nov 26, 2018

@petermattis
Contributor

petermattis left a comment

The other mitigation we might try is to speed up Raft snapshots when they are very small. For restore / import, the Raft snapshots should be mostly empty. For a range with real data that requires a snapshot, splitting will double the number of snapshots, but each snapshot should be half the size.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 290 at r3 (raw file):

	}

	if elapsed := timeutil.Since(tPreWait); waited {

Nit: this should probably be if waited { extra += fmt.Sprintf("...", timeutil.Since(tPreWait)) }

@petermattis
Contributor

petermattis left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@petermattis
Contributor

petermattis left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 235 at r3 (raw file):

	// up. If we run out of time, we'll resume the split no matter what.
	_ = r.maybeDropMsgAppResp
	maxDelaySplitToAvoidSnapshotTicks := 5 + r.store.cfg.RaftPostSplitSuppressSnapshotTicks

I'm not sure this is long enough to wait. If the Raft snapshot queue is backed up it can take minutes to clear. So this just seems like it will wait for a bit, then add more work to the queue.

@bdarnell
Member

bdarnell left a comment

I don't love this but if it can put an end to the saga of raft snapshots I'm for it.

One more idea is whether we can tune the order in which splits happen. Would reversing the order of splits make us less likely to hit pathological cases? Or if both ascending and descending are problematic, maybe a pattern like binary search to avoid hitting the same ranges repeatedly?
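
A hypothetical sketch of that ordering idea; the function name and the use of plain strings for keys are illustrative, not the actual split code:

```go
package main

import "fmt"

// binarySearchOrder emits split keys midpoint-first instead of in
// ascending order, so that successive splits tend to land on different
// ranges rather than repeatedly hitting the same ever-shrinking
// right-hand side.
func binarySearchOrder(keys []string) []string {
	out := make([]string, 0, len(keys))
	var recurse func(lo, hi int)
	recurse = func(lo, hi int) {
		if lo > hi {
			return
		}
		mid := (lo + hi) / 2
		out = append(out, keys[mid])
		recurse(lo, mid-1)
		recurse(mid+1, hi)
	}
	recurse(0, len(keys)-1)
	return out
}

func main() {
	fmt.Println(binarySearchOrder([]string{"a", "b", "c", "d", "e", "f", "g"}))
	// prints [d b a c f e g]
}
```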

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 234 at r3 (raw file):

	// plus a bit of padding to give a snapshot a chance to catch the follower
	// up. If we run out of time, we'll resume the split no matter what.
	_ = r.maybeDropMsgAppResp

Is this just to provide a grep/guru target or did you mean to call this method? (if it's just a guru target say so in the comment)

@petermattis

Contributor

petermattis commented Nov 27, 2018

I don't love this but if it can put an end to the saga of raft snapshots I'm for it.

Note that the settings in this PR delay splits by up to 5s. In testing, @tbg and I have found this to be insufficient for getting rid of Raft snapshot backlog. I've been testing with settings that delay splits for up to 3.5m. The longest split I've seen delayed is 34.5s, though that comes with the benefit of no Raft snapshot backlog. Is that still palatable?

One thing that leaves me a bit dissatisfied is that I don't have a clear picture of what is going on here. This is my own failing: I haven't taken the time to trace through how these Raft snapshots are accumulating. I'm hoping to either do that, or wait until @tbg produces a clear writeup.

@bdarnell

Member

bdarnell commented Nov 27, 2018

The longest split I've seen delayed is 34.5s, though that comes with the benefit of no Raft snapshot backlog. Is that still palatable?

As long as that worst-case only happens either during large restore/imports or during network distress, I think it's bearable.

@tbg

Member

tbg commented Nov 27, 2018

I second what @petermattis said: it's still not 100% clear what's going on. I think we might actually be seeing priority inversion at work here, because preemptive snapshots routinely take 15+ seconds. The kind of logging added in this PR has been helpful. I'd like to confirm this (or a similar theory) before merging, and while we're at it, the Raft snapshot queue (which is very likely the bottleneck here) needs to get better introspection.

@tbg tbg force-pushed the tbg:fix/raft-delayed-snaps branch 2 times, most recently from 4173ec2 to f4040c1 Nov 27, 2018

@petermattis
Contributor

petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 235 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not sure this is long enough to wait. If the Raft snapshot queue is backed up it can take minutes to clear. So this just seems like it will wait for a bit, then add more work to the queue.

In my experimentation with this change, bumping this value significantly produces a benefit. In particular, without the bump I see a number of messages about delaying a split for 5s only to let it through despite it possibly causing a Raft snapshot. If I bump the value, I see splits delayed for longer (17.2s in one example), but no spike in Raft snapshots.


pkg/storage/replica_command.go, line 296 at r9 (raw file):

				return false, nil
			})
			r.raftMu.Unlock()

Don't you want to do this before the select?

@tbg
Member

tbg left a comment

No code updates in this round of comments. More tomorrow.

Using the backoff only for certain AdminSplit requests (those issued by backup/restore) is something we can possibly backport, so I'm planning to make that change (after asserting that it fixes all the problems and making sure that long backoffs are traced back to a root cause via a log message).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 234 at r3 (raw file):

see the _ assignment below)

What should I change this to? (This helps navigate and also prevents naming rot)

_ = ... // guru assignment

?


pkg/storage/replica_command.go, line 235 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

In my experimentation with this change, bumping this value significantly produces a benefit. In particular, without the bump I see a number of messages about delaying a split for 5s only to let it through despite it possibly causing a Raft snapshot. If I bump the value, I see splits delayed for longer (17.2s in one example), but no spike in Raft snapshots.

I've seen the same. I think ~15s is sort of a magic number because that's how long the average snapshot will be blocked. A preemptive snapshot is usually ~30 MB in this workload; snapshots are rate limited to 2 MB/s and can block a Raft snapshot to the same target, so a Raft snapshot will often have to wait ~15s.


pkg/storage/replica_command.go, line 296 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Don't you want to do this before the select?

Good point, will do.

@tbg
Member

tbg left a comment

(I'm also going to experiment with splitting the receiver-side snapshot semaphore into two, so that we always allow one Raft snapshot to be in flight, even if there's already a preemptive one).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)
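
A minimal, hypothetical sketch of that split-semaphore idea; buffered channels stand in for the real receiver-side semaphores, and the names and capacities are illustrative only:

```go
package main

import "fmt"

// snapshotSemaphores gives Raft snapshots their own receiver-side slot so
// a slow preemptive snapshot can never block them entirely.
type snapshotSemaphores struct {
	preemptive chan struct{}
	raft       chan struct{}
}

func newSnapshotSemaphores() *snapshotSemaphores {
	return &snapshotSemaphores{
		preemptive: make(chan struct{}, 1), // one preemptive snapshot at a time
		raft:       make(chan struct{}, 1), // one Raft snapshot at a time, independently
	}
}

// acquire blocks until a slot of the requested kind is free and returns a
// release function.
func (s *snapshotSemaphores) acquire(isRaftSnapshot bool) (release func()) {
	sem := s.preemptive
	if isRaftSnapshot {
		sem = s.raft
	}
	sem <- struct{}{}
	return func() { <-sem }
}

func main() {
	sems := newSnapshotSemaphores()
	releasePreemptive := sems.acquire(false) // long-running preemptive snapshot
	releaseRaft := sems.acquire(true)        // a Raft snapshot still gets through
	fmt.Println("both snapshots in flight")
	releaseRaft()
	releasePreemptive()
}
```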

@tbg

Member

tbg commented Nov 28, 2018

For the record (I think I only mentioned this privately so far): testing this PR has uncovered what I think is a Raft bug, where a follower isn't tracked as replicating after it has applied a snapshot that is up-to-date with the leader's index. This is mostly inconsequential, but it would trip up the wait logic. I'll link to the upstream PR once it's out (and I still think I found a bug).

@tbg tbg force-pushed the tbg:fix/raft-delayed-snaps branch from 7b0f789 to 417df15 Nov 28, 2018

tbg added a commit to tbg/cockroach that referenced this pull request Nov 28, 2018

storage: improve split-snapshot warning
It now differentiates between the case in which the follower needs a
snapshot vs the one in which it is still being probed. We're not
expecting the probing case once the bug described here is fixed:

cockroachdb#32594 (comment)

Release note: None
@tbg
Member

tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 234 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…
see the _ assignment below)

What should I change this to? (This helps navigate and also prevents naming rot)

_ = ... // guru assignment

?

Done.


pkg/storage/replica_command.go, line 290 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: this should probably be if waited { extra += fmt.Sprintf("...", timeutil.Since(tPreWait)) }

Done, I think.


pkg/storage/replica_command.go, line 296 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Good point, will do.

Done.

@tbg tbg force-pushed the tbg:fix/raft-delayed-snaps branch from 417df15 to 2eb4f12 Nov 28, 2018

tbg added a commit to tbg/cockroach that referenced this pull request Nov 28, 2018

storage: improve split-snapshot warning
It now differentiates between the case in which the follower needs a
snapshot vs the one in which it is still being probed. We're not
expecting the probing case once the bug described here is fixed:

cockroachdb#32594 (comment)

Release note: None
@tbg
Member

tbg left a comment

This PR needs lots of cleanup love, but I think the basic mechanism now works. I've run the first few minutes of restore2TB repeatedly, and while I always see snapshots and sometimes see the backoff mechanism "give up" (14s in my current testing branch), the reason is always that a follower still needs a snapshot. I definitely saw the Raft bug pop up in almost every run until I had tuned all the Raft proposal knobs required (need to wake the leader), and it's been reliably gone since. I think we're getting somewhere here!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@petermattis
Contributor

petermattis left a comment

Progress! Let me know when you want me to give this another thorough look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 305 at r23 (raw file):

			if pr.State != raft.ProgressStateReplicate {
				extra += fmt.Sprintf("; replica r%d/%d not caught up: %+v", r.RangeID, replicaID, &pr)

Adding on to extra on every iteration of the tick loop seems excessive. I think you want to construct a newExtra string for the progress each time through, and then after the loop, if !done, set extra = newExtra.
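
A hypothetical sketch of the suggested restructuring; the followerProgress type and progressReport function are simplified stand-ins for the PR's actual Raft progress handling:

```go
package main

import "fmt"

// followerProgress is a simplified stand-in for the per-replica progress
// tracked by Raft; only "replicating or not" matters for this sketch.
type followerProgress struct {
	replicaID   int
	replicating bool
}

// progressReport builds the report from scratch on each pass, so the
// caller attaches it to extra only once, after the wait loop has given up,
// instead of appending to extra on every tick.
func progressReport(followers []followerProgress) (done bool, newExtra string) {
	done = true
	for _, pr := range followers {
		if !pr.replicating {
			newExtra += fmt.Sprintf("; replica %d not caught up", pr.replicaID)
			done = false
		}
	}
	return done, newExtra
}

func main() {
	extra := ""
	done, newExtra := progressReport([]followerProgress{{1, true}, {2, false}})
	if !done {
		extra = newExtra
	}
	fmt.Println(done, extra)
}
```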

@tbg tbg force-pushed the tbg:fix/raft-delayed-snaps branch from 96f7f0b to 2ab0f5b Nov 30, 2018

@tbg

Member

tbg commented Nov 30, 2018

bors r=petermattis

craig bot pushed a commit that referenced this pull request Nov 30, 2018

Merge #32594
32594: storage: delay manual splits that would result in more snapshots r=petermattis a=tbg

This is unpolished, but I had used an earlier version of this with what at the
time looked like success. At this point I suspect that this is the best way to
suppress Raft snapshot growth in IMPORT/RESTORE.

(Definitely needs tests).

----

When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in #31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig


craig bot commented Nov 30, 2018

Build failed

@tbg

Member

tbg commented Nov 30, 2018

@knz I'm not sure what's going on here, but TC thinks the test above simply "didn't print anything" for 1h. I'm fairly sure the storage package failed (my change is pretty invasive there), but it has an 8min timeout and was run with -v, so it shouldn't have buffered. Do you have any idea whether your recent line buffering change could have anything to do with this? (I'm thinking no.)

Going to re-run for now since TC is green -- maybe I caught a bad agent?

bors r=petermattis

craig bot pushed a commit that referenced this pull request Nov 30, 2018

Merge #32594
32594: storage: delay manual splits that would result in more snapshots r=petermattis a=tbg

This is unpolished, but I had used an earlier version of this with what at the
time looked like success. At this point I suspect that this is the best way to
suppress Raft snapshot growth in IMPORT/RESTORE.

(Definitely needs tests).

----

When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in #31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig


craig bot commented Nov 30, 2018

Build succeeded

@craig craig bot merged commit 2ab0f5b into cockroachdb:master Nov 30, 2018

3 checks passed

GitHub CI (Cockroach): TeamCity build finished
bors: Build succeeded
license/cla: Contributor License Agreement is signed.

@tbg tbg deleted the tbg:fix/raft-delayed-snaps branch Nov 30, 2018

tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018

storage: improve split-snapshot warning
It now differentiates between the case in which the follower needs a
snapshot vs the one in which it is still being probed. We're not
expecting the probing case once the bug described here is fixed:

cockroachdb#32594 (comment)

Release note: None

tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018

storage: add comment to replicaGC with pending split
Persists a recent finding I made experimenting with cockroachdb#32594.

Release note: None

tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018

backport-2.1:
Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within.

/cc @cockroachdb/release

Release note: None

tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018

backport-2.1: storage: delay manual splits that would result in more snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note: None

tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018

backport-2.1: storage: delay manual splits that would result in more snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.

tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018

backport-2.1: storage: delay manual splits that would result in more snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.

knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019

storage: bump RaftDelaySplitToSuppressSnapshotTicks
The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (cockroachdb#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Note that this scenario is not likely to occur now that cockroachdb#32782 has
been merged (this reduces the need for raft snapshots during splits);
however in release-2.1 where that patch was not applied, the scenario
is more likely.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
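
A quick sanity check of the 32s worst-case figure cited above, assuming the default maximum range size of 64 MB at the time (the 64 MB value is an assumption; the 2 MB/s rate limit is from the commit message):

```go
package main

import "fmt"

func main() {
	const maxRangeSizeMB = 64.0  // assumed default maximum range size
	const snapshotRateMBps = 2.0 // snapshot rate limit cited in the commit message
	// Worst case: a full-sized range streamed at the rate limit.
	fmt.Printf("worst-case snapshot application: %.0fs\n", maxRangeSizeMB/snapshotRateMBps) // prints 32s
}
```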

knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019

storage: bump RaftDelaySplitToSuppressSnapshotTicks
The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (cockroachdb#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Note that this scenario is not likely to occur now that cockroachdb#32782 has
been merged (this reduces the need for raft snapshots during splits);
however in release-2.1 where that patch was not applied, the scenario
is more likely.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.

craig bot pushed a commit that referenced this pull request Jan 9, 2019

Merge #33573 #33582
33573: roachtest: add MinVersion to scrub tests r=lucy-zhang a=lucy-zhang

The `scrub` tests have been passing on master, but failing on 2.1 because there
was a bug (fixed in #32908).

Release note: None

33582: storage: bump RaftDelaySplitToSuppressSnapshotTicks r=knz a=knz

The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Note that this scenario is not likely to occur now that #32782 has
been merged (this reduces the need for raft snapshots during splits);
however in release-2.1 where that patch was not applied, the scenario
is more likely.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.

Co-authored-by: Lucy Zhang <lucy-zhang@users.noreply.github.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>

knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019

storage: bump RaftDelaySplitToSuppressSnapshotTicks
The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (cockroachdb#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment