roachtest: replace kv/restart/nodes=12 with perturbation/long/restart#170089
Merged
trunk-io[bot] merged 2 commits intoMay 12, 2026
Merged
Conversation
Contributor
|
😎 Merged successfully - details. |
Member
|
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link) |
Drop kv/restart/nodes=12. The next commit registers perturbation/long/restart as a replacement using the perturbation framework's measurement infrastructure (baseline/perturbation/recovery with roachperf integration), which is more useful for tracking restart-related regressions than the QPS-floor assertion in this test. Touches cockroachdb#170047. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
82658eb to
40bab44
Compare
Register perturbation/long/restart as a Weekly variant of the existing
perturbation/full/restart test, with a 2h fill duration so the target
node accumulates enough behind it during the perturbation to make
recovery non-trivial. This replaces kv/restart/nodes=12 (removed in the
previous commit) — at lower write intensity than the test it replaces
(50/50 r/w with 4KB blocks vs 90% writes with 8KB blocks on PD-SSD), but
relying on the perturbation framework's baseline/perturbation/recovery
roachperf instrumentation to surface regressions.
The new addLong() helper is wired only for restart{} in RegisterTests
rather than from inside register(), so other perturbations (backup,
intents, decommission, ...) don't grow long variants by default. Future
heavyweight perturbations can opt in the same way.
Touches cockroachdb#170047.
Epic: none
Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
40bab44 to
9bdf947
Compare
arulajmani
approved these changes
May 11, 2026
Collaborator
arulajmani
left a comment
There was a problem hiding this comment.
@arulajmani reviewed 3 files and all commit messages, and made 1 comment.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on angeladietz).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replace
kv/restart/nodes=12with a new heavyweight variant of theexisting
perturbation/full/restarttest.The first commit drops
kv/restart/nodes=12. The second adds anaddLong()registration helper and uses it to registerperturbation/long/restart(Weekly suite, 3h timeout) — same setup asperturbation/full/restartbut with a 2h fill duration so the clusterreaches a non-trivial steady state before the perturbation. A fixup
commit (squash before merge) adds in-source napkin math next to
addLongexplaining what 10 minutes of downtime means for the raftbacklog at the cluster's measured throughput.
How the new test compares to
kv/restart/nodes=12The perturbation framework gives us baseline / perturbation / recovery
windows with throughput- and latency-impact ratios that flow through to
roachperf, plus phase markers (
phases.json) that align with theworkload histograms — much more useful for tracking restart-related
regressions than the original test's continuous "QPS stays above 50%"
assertion.
The shape of the workload is different, but the new test is still a
heavy stressor for the recovering node, in some ways more so:
kv/restart/nodes=12(removed)perturbation/long/restart(new)--max-rate=5000cluster-wideratioOfMax=0.5of measured throughput--max-ratecontinuouslySo the new test puts roughly 2× more raft data in front of the
recovering node than the test it replaces, despite smaller blocks and a
lower write fraction, because there is no artificial cluster-wide rate
cap — the workload self-tunes to half of the measured cluster maximum.
Per-range raft log vs
RaftLogTruncationThresholdWhat matters for how the down node recovers (log replay vs raft
snapshot) is not the cluster-wide volume but the per-range raft log size
when the leader decides to truncate. With
splits=10000and ~25% ofreplicas on a 12-node RF=3 cluster, the down node owns ~2500 ranges. At
~12 GiB of raft data spread across those ranges, the average per-range
log is ~5 MiB, well below
RaftLogTruncationThreshold(16 MiB; seepkg/base/config.go). With the defaultkvworkload's near-uniform keydistribution, the variance across ranges is tight enough that most
ranges should stay below the threshold and recover via log replay, not
snapshot. The shape would shift toward snapshot ingest if any of these
grew: the per-range write skew (zipfian, hot ranges), the cluster
throughput, or
perturbationDuration.What the new test does not reproduce as faithfully is the original
test's IO-bandwidth-bound recovery scenario: the original used PD-SSD
specifically so that bandwidth was the gating resource on recovery. The
new variant runs on localSSD with significantly more headroom, so
recovery is more likely to be CPU- or raft-pipeline-bound than
disk-bound. If a follow-up wants to faithfully reproduce the
IO-overload-on-recovery scenario, an additional disk-bandwidth-limited
variant (e.g. `addLong` with `diskBandwidthLimit` set) would be the
way to do it.
First run on master
A first run of the test on this branch passed comfortably; impact
ratios from the run (for record):
Throughput stayed flat throughout (well within the 1.25x limit). The
notable observation is that recovery ratios are higher than
perturbation ratios for every operation — the catchup work on the
rejoining node degrades foreground latency more than the downtime
itself. This is exactly the IO-overload-on-recovery shape that the
original test was after, just at a milder amplitude (no throughput
crash, only latency).
Touches #170047.
Epic: none
Release note: None