roachtest: replace kv/restart/nodes=12 with perturbation/long/restart by tbg · Pull Request #170089 · cockroachdb/cockroach

tbg · 2026-05-11T07:34:49Z

Replace kv/restart/nodes=12 with a new heavyweight variant of the
existing perturbation/full/restart test.

The first commit drops kv/restart/nodes=12. The second adds an
addLong() registration helper and uses it to register
perturbation/long/restart (Weekly suite, 3h timeout) — same setup as
perturbation/full/restart but with a 2h fill duration so the cluster
reaches a non-trivial steady state before the perturbation. A fixup
commit (squash before merge) adds in-source napkin math next to
addLong explaining what 10 minutes of downtime means for the raft
backlog at the cluster's measured throughput.

How the new test compares to `kv/restart/nodes=12`

The perturbation framework gives us baseline / perturbation / recovery
windows with throughput- and latency-impact ratios that flow through to
roachperf, plus phase markers (phases.json) that align with the
workload histograms — much more useful for tracking restart-related
regressions than the original test's continuous "QPS stays above 50%"
assertion.

The shape of the workload is different, but the new test is still a
heavy stressor for the recovering node, in some ways more so:

| | `kv/restart/nodes=12` (removed) | `perturbation/long/restart` (new) |
|---|---|---|
| Duration | 3h test (2h fill, 10m downtime) | ~2.5h (2h fill, 10m downtime, 5m windows) |
| Workload mix | 90% writes / 10% reads, 8 KiB blocks | 50% writes / 50% reads (50% follower-reads), 4 KiB blocks |
| Cluster | 12 × 8 vcpu, PD-SSD (no localSSD) | 12 × 16 vcpu, 2 × localSSD |
| Rate cap | `--max-rate=5000` cluster-wide | `ratioOfMax=0.5` of measured throughput |
| Sustained writes | ~4.5k writes/s (capped) | ~20k writes/s (uncapped, measured) |
| Cluster-wide raft accumulated during downtime | ~22 GiB | ~48 GiB |
| Down-node raft backlog (RF=3, 12 nodes) | ~5–6 GiB | ~12 GiB |
| Pass criteria | QPS ≥ 50% of `--max-rate` continuously | Throughput drop ≤ 25% of baseline in perturbation/recovery windows |

So the new test puts roughly 2× more raft data in front of the
recovering node than the test it replaces, despite smaller blocks and a
lower write fraction, because there is no artificial cluster-wide rate
cap — the workload self-tunes to half of the measured cluster maximum.
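
The backlog rows in the table can be roughly reproduced with a few lines of arithmetic. The sketch below is not the in-source napkin math from the fixup commit; it is a minimal reconstruction that assumes the backlog is just writes/s × block size × downtime, and that a node holds a replica of RF/nodes of all ranges. The function names are hypothetical. It ignores raft and encoding overhead, and the exact figures depend on whether a "4 KiB" block is counted as 4,096 or 4,000 bytes, so the output only lands in the same ballpark as the table.

```go
package main

import "fmt"

const (
	kib = 1 << 10
	gb  = 1e9 // decimal gigabyte, used only for printing
)

// raftBytesDuringDowntime estimates the payload bytes proposed cluster-wide
// while one node is down. Assumption: payload only, no raft/encoding overhead.
func raftBytesDuringDowntime(writesPerSec, blockBytes, downtimeSec int64) int64 {
	return writesPerSec * blockBytes * downtimeSec
}

// downNodeBacklog estimates the share of that volume the down node missed.
// Assumption: replicas are evenly placed, so a node has a replica of
// rf/nodes of all ranges.
func downNodeBacklog(clusterBytes, rf, nodes int64) int64 {
	return clusterBytes * rf / nodes
}

func main() {
	const downtimeSec = 10 * 60 // 10m downtime, as in both tests
	tests := []struct {
		name         string
		writesPerSec int64
		blockBytes   int64
	}{
		{"kv/restart/nodes=12", 4500, 8 * kib},
		{"perturbation/long/restart", 20000, 4 * kib},
	}
	for _, tc := range tests {
		total := raftBytesDuringDowntime(tc.writesPerSec, tc.blockBytes, downtimeSec)
		backlog := downNodeBacklog(total, 3, 12)
		fmt.Printf("%-26s cluster=%.1f GB down-node=%.1f GB\n",
			tc.name, float64(total)/gb, float64(backlog)/gb)
	}
}
```

The ratio between the two down-node figures is what the "roughly 2×" claim rests on; the absolute numbers are only as good as the throughput measurement fed in.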

Per-range raft log vs `RaftLogTruncationThreshold`

What matters for how the down node recovers (log replay vs raft
snapshot) is not the cluster-wide volume but the per-range raft log size
when the leader decides to truncate. With splits=10000 on a 12-node
RF=3 cluster, each node holds a replica of ~25% of the ranges (3/12),
so the down node has replicas of ~2500 ranges. At
~12 GiB of raft data spread across those ranges, the average per-range
log is ~5 MiB, well below RaftLogTruncationThreshold (16 MiB; see
pkg/base/config.go). With the default kv workload's near-uniform key
distribution, the variance across ranges is tight enough that most
ranges should stay below the threshold and recover via log replay, not
snapshot. The shape would shift toward snapshot ingest if any of these
grew: the per-range write skew (zipfian, hot ranges), the cluster
throughput, or perturbationDuration.
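
The per-range estimate above can be checked with a short calculation. This is a sketch under the same assumptions as the paragraph (even replica placement, uniform write distribution); the helper names are hypothetical, and the 16 MiB threshold is the value quoted in this description rather than one read from pkg/base/config.go.

```go
package main

import "fmt"

const (
	mib = 1 << 20
	gib = 1 << 30
)

// replicasOnNode returns how many ranges have a replica on one node,
// assuming replicas are spread evenly across the cluster.
func replicasOnNode(ranges, rf, nodes int64) int64 {
	return ranges * rf / nodes
}

// avgLogBytesPerRange spreads the down node's backlog evenly across its
// replicas. Assumption: near-uniform key distribution, no hot ranges.
func avgLogBytesPerRange(backlogBytes, replicas int64) int64 {
	return backlogBytes / replicas
}

func main() {
	const truncationThreshold = 16 * mib // value quoted in the PR description

	replicas := replicasOnNode(10000, 3, 12)
	avg := avgLogBytesPerRange(12*gib, replicas)

	fmt.Printf("replicas on down node: %d\n", replicas)
	fmt.Printf("avg per-range log: %.1f MiB (threshold %d MiB)\n",
		float64(avg)/mib, truncationThreshold/mib)
	fmt.Printf("below threshold: %t\n", avg < truncationThreshold)
}
```

Plugging in a higher throughput, a longer downtime, or a smaller number of effective hot ranges shows how quickly the average approaches the threshold, which is the shift toward snapshot ingest described above.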

What the new test does not reproduce as faithfully is the original
test's IO-bandwidth-bound recovery scenario: the original used PD-SSD
specifically so that bandwidth was the gating resource on recovery. The
new variant runs on localSSD with significantly more headroom, so
recovery is more likely to be CPU- or raft-pipeline-bound than
disk-bound. If a follow-up wants to faithfully reproduce the
IO-overload-on-recovery scenario, an additional disk-bandwidth-limited
variant (e.g. `addLong` with `diskBandwidthLimit` set) would be the
way to do it.

First run on master

A first run of the test on this branch passed comfortably; impact
ratios from the run (for record):

| op | window | p99 latency | p50 latency | throughput |
|---|---|---|---|---|
| write | perturbation | 1.24x | 1.11x | 1.00x |
| write | recovery | 2.01x | 1.16x | 1.01x |
| read | perturbation | 1.94x | 1.00x | 1.00x |
| read | recovery | 3.67x | 1.10x | 1.01x |
| follower-read | perturbation | 2.05x | 1.09x | 1.00x |
| follower-read | recovery | 4.05x | 1.18x | 1.00x |

Throughput stayed flat throughout (well within the 1.25x limit). The
notable observation is that recovery ratios are higher than
perturbation ratios for every operation — the catchup work on the
rejoining node degrades foreground latency more than the downtime
itself. This is exactly the IO-overload-on-recovery shape that the
original test was after, just at a milder amplitude (no throughput
crash, only latency).
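
The two observations above (throughput within the limit, recovery latency worse than perturbation latency) can be verified mechanically against the table. The sketch below hard-codes the ratios from this run; the `impact` type and both helpers are hypothetical and are not part of the perturbation framework, and treating the pass criterion as "throughput ratio ≤ 1.25" is an assumption about how the limit is expressed.

```go
package main

import "fmt"

// impact is one row of the first-run table: ratios relative to baseline.
type impact struct {
	op, window           string
	p99, p50, throughput float64
}

var firstRun = []impact{
	{"write", "perturbation", 1.24, 1.11, 1.00},
	{"write", "recovery", 2.01, 1.16, 1.01},
	{"read", "perturbation", 1.94, 1.00, 1.00},
	{"read", "recovery", 3.67, 1.10, 1.01},
	{"follower-read", "perturbation", 2.05, 1.09, 1.00},
	{"follower-read", "recovery", 4.05, 1.18, 1.00},
}

// throughputWithinLimit reports whether every throughput ratio is at or
// below limit (assumed form of the pass criterion).
func throughputWithinLimit(rows []impact, limit float64) bool {
	for _, r := range rows {
		if r.throughput > limit {
			return false
		}
	}
	return true
}

// recoveryLatencyWorse reports whether, for every op, both the p99 and p50
// ratios are strictly higher in the recovery window than during perturbation.
func recoveryLatencyWorse(rows []impact) bool {
	byOp := map[string]map[string]impact{}
	for _, r := range rows {
		if byOp[r.op] == nil {
			byOp[r.op] = map[string]impact{}
		}
		byOp[r.op][r.window] = r
	}
	for _, windows := range byOp {
		p, okP := windows["perturbation"]
		r, okR := windows["recovery"]
		if !okP || !okR || r.p99 <= p.p99 || r.p50 <= p.p50 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("throughput within 1.25x:", throughputWithinLimit(firstRun, 1.25))
	fmt.Println("recovery latency worse than perturbation:", recoveryLatencyWorse(firstRun))
}
```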

Touches #170047.

Epic: none
Release note: None

trunk-io · 2026-05-11T07:34:53Z

😎 Merged successfully - details.

blathers-crl · 2026-05-11T11:34:00Z

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

Drop kv/restart/nodes=12. The next commit registers
perturbation/long/restart as a replacement using the perturbation
framework's measurement infrastructure (baseline/perturbation/recovery
with roachperf integration), which is more useful for tracking
restart-related regressions than the QPS-floor assertion in this test.

Touches cockroachdb#170047.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

Register perturbation/long/restart as a Weekly variant of the existing
perturbation/full/restart test, with a 2h fill duration so the target
node accumulates enough behind it during the perturbation to make
recovery non-trivial.

This replaces kv/restart/nodes=12 (removed in the previous commit), at
lower write intensity than the test it replaces (50/50 r/w with 4KB
blocks vs 90% writes with 8KB blocks on PD-SSD), but relying on the
perturbation framework's baseline/perturbation/recovery roachperf
instrumentation to surface regressions.

The new addLong() helper is wired only for restart{} in RegisterTests
rather than from inside register(), so other perturbations (backup,
intents, decommission, ...) don't grow long variants by default. Future
heavyweight perturbations can opt in the same way.

Touches cockroachdb#170047.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

arulajmani

@arulajmani reviewed 3 files and all commit messages, and made 1 comment.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on angeladietz).

tbg force-pushed the tbg/roachtest-perturbation-long-restart branch 2 times, most recently from 82658eb to 40bab44 Compare May 11, 2026 12:43

tbg force-pushed the tbg/roachtest-perturbation-long-restart branch from 40bab44 to 9bdf947 Compare May 11, 2026 12:53

tbg marked this pull request as ready for review May 11, 2026 12:54

tbg requested a review from angeladietz May 11, 2026 12:54

arulajmani approved these changes May 11, 2026

trunk-io Bot merged commit 18eb14f into cockroachdb:master May 12, 2026
29 checks passed

celeste-cockroachdb Bot added the target-release-26.3.0 label May 12, 2026

tbg mentioned this pull request May 18, 2026

roachtest: kv/restart/nodes=12 failed #170047

Closed

trunk-io[bot] merged 2 commits into cockroachdb:master from tbg:tbg/roachtest-perturbation-long-restart
