Skip to content

roachtest: replace kv/restart/nodes=12 with perturbation/long/restart #170089

Merged
trunk-io[bot] merged 2 commits into cockroachdb:master from tbg:tbg/roachtest-perturbation-long-restart
May 12, 2026

@tbg tbg commented May 11, 2026

Replace kv/restart/nodes=12 with a new heavyweight variant of the
existing perturbation/full/restart test.

The first commit drops kv/restart/nodes=12. The second adds an
addLong() registration helper and uses it to register
perturbation/long/restart (Weekly suite, 3h timeout) — same setup as
perturbation/full/restart but with a 2h fill duration so the cluster
reaches a non-trivial steady state before the perturbation. A fixup
commit (squash before merge) adds in-source napkin math next to
addLong explaining what 10 minutes of downtime means for the raft
backlog at the cluster's measured throughput.

How the new test compares to kv/restart/nodes=12

The perturbation framework gives us baseline / perturbation / recovery
windows with throughput- and latency-impact ratios that flow through to
roachperf, plus phase markers (phases.json) that align with the
workload histograms — much more useful for tracking restart-related
regressions than the original test's continuous "QPS stays above 50%"
assertion.

The shape of the workload is different, but the new test is still a
heavy stressor for the recovering node, in some ways more so:

| | kv/restart/nodes=12 (removed) | perturbation/long/restart (new) |
|---|---|---|
| Duration | 3h test (2h fill, 10m downtime) | ~2.5h (2h fill, 10m downtime, 5m windows) |
| Workload mix | 90% writes / 10% reads, 8 KiB blocks | 50% writes / 50% reads (50% follower-reads), 4 KiB blocks |
| Cluster | 12 × 8 vcpu, PD-SSD (no localSSD) | 12 × 16 vcpu, 2 × localSSD |
| Rate cap | --max-rate=5000 cluster-wide | ratioOfMax=0.5 of measured throughput |
| Sustained writes | ~4.5k writes/s (capped) | ~20k writes/s (uncapped, measured) |
| Cluster-wide raft accumulated during downtime | ~22 GiB | ~48 GiB |
| Down-node raft backlog (RF=3, 12 nodes) | ~5–6 GiB | ~12 GiB |
| Pass criteria | QPS ≥ 50% of --max-rate continuously | Throughput drop ≤ 25% of baseline in perturbation/recovery windows |

So the new test puts roughly 2× more raft data in front of the
recovering node than the test it replaces, despite smaller blocks and a
lower write fraction, because there is no artificial cluster-wide rate
cap — the workload self-tunes to half of the measured cluster maximum.
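
The backlog figures in the table can be approximated with a few lines of Go. This is a minimal sketch, not the in-source napkin math from the fixup commit: the function names are hypothetical, and it counts only the raw write payload (writes/s × block size × downtime), ignoring raft and MVCC overhead and assuming replicas are spread evenly across nodes. Under those assumptions the cluster-wide totals come out near 21 GiB and 46 GiB, slightly below the table's rounded values, and the new/old ratio is a little over 2×.

```go
package main

import "fmt"

const (
	kib = 1024.0
	gib = 1024.0 * 1024.0 * 1024.0
)

// raftBytesDuringDowntime estimates the write payload the cluster appends to
// raft logs while one node is down. Payload only: no raft/MVCC overhead.
func raftBytesDuringDowntime(writesPerSec, blockBytes, downtimeSec float64) float64 {
	return writesPerSec * blockBytes * downtimeSec
}

// downNodeShare is the fraction of ranges that have a replica on any single
// node, assuming replicas are evenly distributed.
func downNodeShare(replicationFactor, nodes int) float64 {
	return float64(replicationFactor) / float64(nodes)
}

func main() {
	const downtimeSec = 10 * 60 // 10m downtime, from the table
	share := downNodeShare(3, 12)

	// Inputs taken from the comparison table.
	oldTest := raftBytesDuringDowntime(4500, 8*kib, downtimeSec)
	newTest := raftBytesDuringDowntime(20000, 4*kib, downtimeSec)

	fmt.Printf("old: cluster %.1f GiB, down node %.1f GiB\n", oldTest/gib, oldTest*share/gib)
	fmt.Printf("new: cluster %.1f GiB, down node %.1f GiB\n", newTest/gib, newTest*share/gib)
	fmt.Printf("new/old: %.1fx\n", newTest/oldTest)
}
```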

Per-range raft log vs RaftLogTruncationThreshold

What matters for how the down node recovers (log replay vs raft
snapshot) is not the cluster-wide volume but the per-range raft log size
when the leader decides to truncate. With splits=10000 on a 12-node RF=3
cluster, the down node holds a replica of ~25% of ranges (3/12), i.e. ~2500 ranges. At
~12 GiB of raft data spread across those ranges, the average per-range
log is ~5 MiB, well below RaftLogTruncationThreshold (16 MiB; see
pkg/base/config.go). With the default kv workload's near-uniform key
distribution, the variance across ranges is tight enough that most
ranges should stay below the threshold and recover via log replay, not
snapshot. The shape would shift toward snapshot ingest if any of these
grew: the per-range write skew (zipfian, hot ranges), the cluster
throughput, or perturbationDuration.
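
The per-range estimate above can be checked the same way. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the 16 MiB threshold is the value quoted in this description rather than read from pkg/base/config.go, and writes are assumed to be spread evenly across ranges. It also prints how far above the average a single range's write share would have to be before its log reaches the threshold.

```go
package main

import "fmt"

const (
	mib int64 = 1 << 20
	gib int64 = 1 << 30

	// Threshold as quoted in the PR description (assumed, not imported).
	raftLogTruncationThreshold = 16 * mib
)

// rangesOnNode returns how many ranges have a replica on one node, assuming
// an even replica distribution.
func rangesOnNode(totalRanges, replicationFactor, nodes int) int {
	return totalRanges * replicationFactor / nodes
}

// avgPerRangeLogBytes spreads the down node's backlog evenly over its ranges.
func avgPerRangeLogBytes(backlogBytes int64, ranges int) int64 {
	return backlogBytes / int64(ranges)
}

func main() {
	ranges := rangesOnNode(10000, 3, 12)       // splits=10000, RF=3, 12 nodes
	avg := avgPerRangeLogBytes(12*gib, ranges) // ~12 GiB down-node backlog

	fmt.Printf("ranges on down node: %d\n", ranges)
	fmt.Printf("average per-range log: %.1f MiB\n", float64(avg)/float64(mib))
	fmt.Printf("skew needed to reach threshold: %.1fx the average\n",
		float64(raftLogTruncationThreshold)/float64(avg))
}
```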

What the new test does not reproduce as faithfully is the original
test's IO-bandwidth-bound recovery scenario: the original used PD-SSD
specifically so that bandwidth was the gating resource on recovery. The
new variant runs on localSSD with significantly more headroom, so
recovery is more likely to be CPU- or raft-pipeline-bound than
disk-bound. If a follow-up wants to faithfully reproduce the
IO-overload-on-recovery scenario, an additional disk-bandwidth-limited
variant (e.g. `addLong` with `diskBandwidthLimit` set) would be the
way to do it.

First run on master

A first run of the test on this branch passed comfortably; impact
ratios from the run (for record):

| op | window | p99 latency | p50 latency | throughput |
|---|---|---|---|---|
| write | perturbation | 1.24x | 1.11x | 1.00x |
| write | recovery | 2.01x | 1.16x | 1.01x |
| read | perturbation | 1.94x | 1.00x | 1.00x |
| read | recovery | 3.67x | 1.10x | 1.01x |
| follower-read | perturbation | 2.05x | 1.09x | 1.00x |
| follower-read | recovery | 4.05x | 1.18x | 1.00x |

Throughput stayed flat throughout (well within the 1.25x limit). The
notable observation is that recovery latency ratios are higher than
perturbation latency ratios for every operation: the catchup work on the
rejoining node degrades foreground latency more than the downtime
itself. This is exactly the IO-overload-on-recovery shape that the
original test was after, just at a milder amplitude (no throughput
crash, only latency).
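
As a cross-check of the two observations above, the reported ratios can be encoded and tested directly. This is a sketch only: the `impact` type and helper names are hypothetical and are not the perturbation framework's own types, and the data is copied from the table for this single run.

```go
package main

import "fmt"

// impact holds one row of the reported impact-ratio table (hypothetical type).
type impact struct {
	op, window           string
	p99, p50, throughput float64
}

// firstRun is copied from the table above.
var firstRun = []impact{
	{"write", "perturbation", 1.24, 1.11, 1.00},
	{"write", "recovery", 2.01, 1.16, 1.01},
	{"read", "perturbation", 1.94, 1.00, 1.00},
	{"read", "recovery", 3.67, 1.10, 1.01},
	{"follower-read", "perturbation", 2.05, 1.09, 1.00},
	{"follower-read", "recovery", 4.05, 1.18, 1.00},
}

// recoveryLatencyWorse reports whether, for every op, both latency ratios are
// strictly higher in the recovery window than in the perturbation window.
func recoveryLatencyWorse(rows []impact) bool {
	perturbation := map[string]impact{}
	for _, r := range rows {
		if r.window == "perturbation" {
			perturbation[r.op] = r
		}
	}
	for _, r := range rows {
		if r.window != "recovery" {
			continue
		}
		p, ok := perturbation[r.op]
		if !ok || r.p99 <= p.p99 || r.p50 <= p.p50 {
			return false
		}
	}
	return true
}

// maxThroughputRatio returns the largest throughput impact ratio in rows.
func maxThroughputRatio(rows []impact) float64 {
	max := 0.0
	for _, r := range rows {
		if r.throughput > max {
			max = r.throughput
		}
	}
	return max
}

func main() {
	fmt.Println("recovery latency worse for every op:", recoveryLatencyWorse(firstRun))
	fmt.Printf("max throughput ratio: %.2fx (limit 1.25x)\n", maxThroughputRatio(firstRun))
}
```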

Touches #170047.

Epic: none
Release note: None

trunk-io Bot commented May 11, 2026

😎 Merged successfully - details.

blathers-crl Bot commented May 11, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

Drop kv/restart/nodes=12. The next commit registers
perturbation/long/restart as a replacement using the perturbation
framework's measurement infrastructure (baseline/perturbation/recovery
with roachperf integration), which is more useful for tracking
restart-related regressions than the QPS-floor assertion in this test.

Touches cockroachdb#170047.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@tbg tbg force-pushed the tbg/roachtest-perturbation-long-restart branch 2 times, most recently from 82658eb to 40bab44 Compare May 11, 2026 12:43
Register perturbation/long/restart as a Weekly variant of the existing
perturbation/full/restart test, with a 2h fill duration so the target
node falls far enough behind during the perturbation to make
recovery non-trivial. This replaces kv/restart/nodes=12 (removed in the
previous commit). It runs at lower write intensity than the test it
replaces (50/50 r/w with 4 KiB blocks vs 90% writes with 8 KiB blocks on
PD-SSD) but relies on the perturbation framework's
baseline/perturbation/recovery roachperf instrumentation to surface
regressions.

The new addLong() helper is wired only for restart{} in RegisterTests
rather than from inside register(), so other perturbations (backup,
intents, decommission, ...) don't grow long variants by default. Future
heavyweight perturbations can opt in the same way.

Touches cockroachdb#170047.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@tbg tbg force-pushed the tbg/roachtest-perturbation-long-restart branch from 40bab44 to 9bdf947 Compare May 11, 2026 12:53
@tbg tbg marked this pull request as ready for review May 11, 2026 12:54
@tbg tbg requested a review from angeladietz May 11, 2026 12:54
@arulajmani arulajmani left a comment

:lgtm:

@arulajmani reviewed 3 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on angeladietz).

@trunk-io trunk-io Bot merged commit 18eb14f into cockroachdb:master May 12, 2026
29 checks passed