Skip to content

testcluster: fix CrashNode isolation using Partitioner#166174

Open
pav-kv wants to merge 3 commits intocockroachdb:masterfrom
pav-kv:partitioner-crashnode
Open

testcluster: fix CrashNode isolation using Partitioner#166174
pav-kv wants to merge 3 commits intocockroachdb:masterfrom
pav-kv:partitioner-crashnode

Conversation

@pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Mar 19, 2026

CrashNode's circuit breaker isolation is insufficient: it only blocks outbound RPCs from the crashing node. Server-side responses on existing gRPC streams (e.g. MsgAppResp sent during raft snapshot application) can still escape after CrashClone, leaking false durability signals into the cluster.

This PR replaces circuit breakers with the Partitioner's bidirectional stream interceptors, which block both SendMsg and RecvMsg on client streams.

Commit 1 moves the Partitioner from kvnemesis into TestCluster:

  • New EnablePartitioner flag on TestClusterArgs
  • TestCluster handles interceptor registration (AddServer) and address mapping (Start, startServer)
  • kvnemesis no longer manages the Partitioner directly; uses tc.Partitioner()

Commit 2 fixes CrashNode isolation:

  • Replaces isolateNodeFromPeers (circuit breakers) with bidirectional partitions via AddPartition/RemovePartition
  • Partitions are added before CrashClone and removed after stopServerLocked

Fixes #166145

@trunk-io
Copy link
Contributor

trunk-io bot commented Mar 19, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv pav-kv changed the title Partitioner crashnode testcluster,kvnemesis: support node partitioning properly Mar 19, 2026
@pav-kv pav-kv force-pushed the partitioner-crashnode branch 7 times, most recently from 5d0f28d to 5fb068f Compare March 19, 2026 16:53
Add a reproducer for the race condition where MsgAppResp escapes after
CrashClone during raft snapshot application. The race occurs because
CrashNode's circuit breaker isolation only blocks outbound RPCs, while
server-side responses on existing gRPC streams can still be sent.

The repro uses BeforeSnapshotSSTIngestion to block SST ingestion after
MsgAppResp has already been sent but before the data is persisted.
PostCrashCloneFn then releases the blocked snapshot, allowing the
MsgAppResp to escape and advance the leader's match index beyond what
the crash snapshot contains.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pav-kv pav-kv force-pushed the partitioner-crashnode branch from 5fb068f to d763d35 Compare March 19, 2026 18:04
pav-kv and others added 2 commits March 19, 2026 18:09
Move the Partitioner setup from kvnemesis into TestCluster, controlled
by the EnablePartitioner flag on TestClusterArgs. This encapsulates
interceptor registration and node address mapping, simplifying test
setup for any test that needs network partitions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace circuit breaker isolation in CrashNode with the Partitioner's
bidirectional stream interceptors. Circuit breakers only block outbound
RPCs from the crashing node, but server-side responses on existing gRPC
streams (e.g., MsgAppResp sent during raft snapshot application) can
still escape after CrashClone. The Partitioner's interceptors block both
SendMsg and RecvMsg on client streams, preventing peers from reading
responses sent by the crashing node's server-side handlers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pav-kv pav-kv force-pushed the partitioner-crashnode branch from d763d35 to 48bd787 Compare March 19, 2026 18:13
@pav-kv pav-kv changed the title testcluster,kvnemesis: support node partitioning properly testcluster: fix CrashNode isolation using Partitioner Mar 19, 2026
// is because we need to install the testing knobs / RPC interceptors before
// the server is started. Find a way around this.
nodeID := roachpb.NodeID(len(tc.Servers) + 1)
tc.partitioner.RegisterTestingKnobs(nodeID, &sk.ContextTestingKnobs)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe RegisterTestingKnobs should panic if it tries to override existing interceptors, just so that there is no unexpected side effects.

@pav-kv pav-kv marked this pull request as ready for review March 19, 2026 18:19
@pav-kv pav-kv requested review from a team as code owners March 19, 2026 18:19
@pav-kv pav-kv requested review from miraradeva and removed request for a team March 19, 2026 18:19
@pav-kv
Copy link
Collaborator Author

pav-kv commented Mar 19, 2026

@miraradeva the first commit is the repro for messages escaping a partitioned node, and causing #166145. I verified that it reliably fails, and the failure disappears after the fix. I will remove it when merging.

@blathers-crl
Copy link

blathers-crl bot commented Mar 19, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

release-26.2: kv/kvnemesis: TestKVNemesisMultiNode_Crash failed

2 participants