testcluster: fix CrashNode isolation using Partitioner #166174
Open
pav-kv wants to merge 3 commits into cockroachdb:master from
5d0f28d to 5fb068f
Add a reproducer for the race condition where MsgAppResp escapes after CrashClone during raft snapshot application. The race occurs because CrashNode's circuit breaker isolation only blocks outbound RPCs, while server-side responses on existing gRPC streams can still be sent.

The repro uses BeforeSnapshotSSTIngestion to block SST ingestion after MsgAppResp has already been sent but before the data is persisted. PostCrashCloneFn then releases the blocked snapshot, allowing the MsgAppResp to escape and advance the leader's match index beyond what the crash snapshot contains.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
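The ordering described in this commit message can be illustrated with a toy model. This is only a sketch: the `follower` and `leader` types, the `simulate` function, and the index values are hypothetical and do not correspond to CockroachDB code. It shows how an acknowledgement sent before persistence can leave the leader's match index ahead of what the crash snapshot contains.

```go
package main

import "fmt"

// follower is a toy stand-in for a raft follower applying a snapshot.
// Hypothetical type, used only to illustrate the ordering problem.
type follower struct {
	durableIndex uint64 // highest index actually persisted to storage
}

// leader tracks the highest index it believes the follower has persisted.
type leader struct {
	matchIndex uint64
}

// handleAck models the leader processing a MsgAppResp-style acknowledgement.
func (l *leader) handleAck(index uint64) {
	if index > l.matchIndex {
		l.matchIndex = index
	}
}

// simulate replays the problematic ordering. It returns the leader's match
// index and the durable index captured by the crash snapshot. ackEscapes
// controls whether the in-flight ack reaches the leader after the snapshot.
func simulate(ackEscapes bool) (match, cloned uint64) {
	f := &follower{durableIndex: 10}
	l := &leader{matchIndex: 10}

	// 1. The follower acks index 20 before the snapshot data is persisted.
	inFlightAck := uint64(20)

	// 2. The crash snapshot is taken while ingestion is still blocked, so it
	//    only contains data up to the old durable index.
	cloned = f.durableIndex

	// 3. The ack is either delivered to the leader or dropped by isolation.
	if ackEscapes {
		l.handleAck(inFlightAck)
	}
	return l.matchIndex, cloned
}

func main() {
	match, cloned := simulate(true)
	fmt.Printf("ack escapes: match=%d cloned=%d\n", match, cloned)
	match, cloned = simulate(false)
	fmt.Printf("ack blocked: match=%d cloned=%d\n", match, cloned)
}
```

When the ack escapes, the leader believes index 20 is durable on a node whose crash snapshot only holds index 10. That mismatch is the false durability signal the repro targets.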
5fb068f to d763d35
Move the Partitioner setup from kvnemesis into TestCluster, controlled by the EnablePartitioner flag on TestClusterArgs. This encapsulates interceptor registration and node address mapping, simplifying test setup for any test that needs network partitions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace circuit breaker isolation in CrashNode with the Partitioner's bidirectional stream interceptors. Circuit breakers only block outbound RPCs from the crashing node, but server-side responses on existing gRPC streams (e.g., MsgAppResp sent during raft snapshot application) can still escape after CrashClone.

The Partitioner's interceptors block both SendMsg and RecvMsg on client streams, preventing peers from reading responses sent by the crashing node's server-side handlers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
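A minimal sketch of what blocking both directions on a client stream could look like. It is not the Partitioner's actual implementation: `partitionState`, `partitionedStream`, `fakeStream`, and `errPartitioned` are hypothetical names, and the `clientStream` interface only mirrors the `SendMsg`/`RecvMsg` methods of a gRPC client stream so that the example needs no external packages.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// clientStream mirrors the two message methods of a gRPC client stream.
type clientStream interface {
	SendMsg(m any) error
	RecvMsg(m any) error
}

// errPartitioned is a hypothetical error returned for blocked messages.
var errPartitioned = errors.New("partitioned")

// partitionState records which node pairs are currently cut off.
type partitionState struct {
	mu      sync.Mutex
	blocked map[[2]int]bool
}

func newPartitionState() *partitionState {
	return &partitionState{blocked: map[[2]int]bool{}}
}

// set blocks or unblocks traffic between a and b in both directions.
func (p *partitionState) set(a, b int, blocked bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, k := range [][2]int{{a, b}, {b, a}} {
		if blocked {
			p.blocked[k] = true
		} else {
			delete(p.blocked, k)
		}
	}
}

func (p *partitionState) isBlocked(a, b int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.blocked[[2]int{a, b}]
}

// partitionedStream wraps a stream opened by node local to node remote.
type partitionedStream struct {
	clientStream
	state         *partitionState
	local, remote int
}

// SendMsg refuses to send while the pair is partitioned.
func (s *partitionedStream) SendMsg(m any) error {
	if s.state.isBlocked(s.local, s.remote) {
		return errPartitioned
	}
	return s.clientStream.SendMsg(m)
}

// RecvMsg checks the partition after the inner receive returns, so a response
// that was already in flight is not handed to the caller as a success.
func (s *partitionedStream) RecvMsg(m any) error {
	if err := s.clientStream.RecvMsg(m); err != nil {
		return err
	}
	if s.state.isBlocked(s.remote, s.local) {
		return errPartitioned
	}
	return nil
}

// fakeStream counts how many messages reached the underlying transport.
type fakeStream struct{ sent, received int }

func (f *fakeStream) SendMsg(any) error { f.sent++; return nil }
func (f *fakeStream) RecvMsg(any) error { f.received++; return nil }

func main() {
	st := newPartitionState()
	s := &partitionedStream{clientStream: &fakeStream{}, state: st, local: 1, remote: 2}
	fmt.Println("before partition:", s.SendMsg("req"), s.RecvMsg(nil))
	st.set(1, 2, true)
	fmt.Println("during partition:", s.SendMsg("req"), s.RecvMsg(nil))
}
```

The receive-side check is the part that circuit breakers lack: a peer holding an open stream to the crashing node never gets a successful read while the partition is active. Whether the real interceptors return an error, block, or drop silently is not stated in this PR; returning an error is an assumption of the sketch.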
d763d35 to 48bd787
pav-kv commented Mar 19, 2026
	// is because we need to install the testing knobs / RPC interceptors before
	// the server is started. Find a way around this.
	nodeID := roachpb.NodeID(len(tc.Servers) + 1)
	tc.partitioner.RegisterTestingKnobs(nodeID, &sk.ContextTestingKnobs)
Maybe RegisterTestingKnobs should panic if it would override existing interceptors, so that there are no unexpected side effects.
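One way the suggested guard could look, as a sketch only. The `knobs` struct, the `interceptor` type, and `registerKnobs` are hypothetical stand-ins; the real `RegisterTestingKnobs` signature and the fields of `ContextTestingKnobs` are not shown in this diff.

```go
package main

import "fmt"

// interceptor is a stand-in for an RPC stream interceptor.
type interceptor func(method string) error

// knobs is a hypothetical stand-in for a testing-knobs struct with one
// interceptor slot.
type knobs struct {
	StreamClientInterceptor interceptor
}

// registerKnobs installs ic for the given node and panics if an interceptor
// is already present, so a test cannot silently lose one it installed itself.
func registerKnobs(nodeID int, k *knobs, ic interceptor) {
	if k.StreamClientInterceptor != nil {
		panic(fmt.Sprintf("n%d: stream client interceptor already registered", nodeID))
	}
	k.StreamClientInterceptor = ic
}

// didPanic reports whether f panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	noop := func(string) error { return nil }
	k := &knobs{}
	fmt.Println("first registration panics:", didPanic(func() { registerKnobs(1, k, noop) }))
	fmt.Println("second registration panics:", didPanic(func() { registerKnobs(1, k, noop) }))
}
```

An alternative to panicking would be to chain the existing interceptor with the new one, but that changes test behavior implicitly, which is the side effect the comment wants to avoid.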
@miraradeva the first commit is a repro for messages escaping a partitioned node and causing #166145. I verified that it fails reliably and that the failure disappears after the fix. I will remove it when merging.
CrashNode's circuit breaker isolation is insufficient: it only blocks outbound RPCs from the crashing node. Server-side responses on existing gRPC streams (e.g. MsgAppResp sent during raft snapshot application) can still escape after CrashClone, leaking false durability signals into the cluster.

This PR replaces circuit breakers with the Partitioner's bidirectional stream interceptors, which block both SendMsg and RecvMsg on client streams.

Commit 1 moves the Partitioner from kvnemesis into TestCluster:
- EnablePartitioner flag on TestClusterArgs
- TestCluster handles interceptor registration (AddServer) and address mapping (Start, startServer)
- kvnemesis no longer manages the Partitioner directly; it uses tc.Partitioner()

Commit 2 fixes CrashNode isolation:
- Replaces isolateNodeFromPeers (circuit breakers) with bidirectional partitions via AddPartition/RemovePartition
- Partitions are added before CrashClone and removed after stopServerLocked

Fixes #166145
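The ordering in the second commit can be sketched as follows. This is an illustration under stated assumptions: the `cluster` type, its `addPartition`/`removePartition` methods, and the event strings are hypothetical, and the real `AddPartition`/`RemovePartition` signatures may differ.

```go
package main

import "fmt"

// cluster is a hypothetical test cluster that records the order of
// crash-related steps.
type cluster struct {
	events      []string
	partitioned map[[2]int]bool
}

func newCluster() *cluster {
	return &cluster{partitioned: map[[2]int]bool{}}
}

// addPartition cuts traffic between a and b in both directions.
func (c *cluster) addPartition(a, b int) {
	c.partitioned[[2]int{a, b}] = true
	c.partitioned[[2]int{b, a}] = true
	c.events = append(c.events, fmt.Sprintf("partition n%d<->n%d", a, b))
}

// removePartition restores traffic between a and b in both directions.
func (c *cluster) removePartition(a, b int) {
	delete(c.partitioned, [2]int{a, b})
	delete(c.partitioned, [2]int{b, a})
	c.events = append(c.events, fmt.Sprintf("heal n%d<->n%d", a, b))
}

// crashNode isolates the node first, then takes the crash clone and stops
// the server, and only afterwards removes the partitions.
func (c *cluster) crashNode(node int, peers []int) {
	for _, p := range peers {
		c.addPartition(node, p)
	}
	c.events = append(c.events, "crash clone") // stand-in for CrashClone
	c.events = append(c.events, "stop server") // stand-in for stopServerLocked
	for _, p := range peers {
		c.removePartition(node, p)
	}
}

func main() {
	c := newCluster()
	c.crashNode(1, []int{2, 3})
	for _, e := range c.events {
		fmt.Println(e)
	}
}
```

Keeping the partition in place until the server has stopped means no message from the crashing node can be read by a peer between the clone and the shutdown.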