
roachtest: splits/largerange/size=32GiB,nodes=6 failed #114349

Closed
cockroach-teamcity opened this issue Nov 13, 2023 · 15 comments · Fixed by #117614
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-release-23.2 Used to mark GA and release blockers and technical advisories for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv-replication KV Replication Team

cockroach-teamcity commented Nov 13, 2023

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on release-23.2 @ c702128fa39c4aa84e810b898064fd2d66ace765:

(sql_runner.go:255).Query: error executing 'SHOW RANGES FROM TABLE bank.bank': pq: replica unavailable: (n2,s2):2VOTER_DEMOTING_LEARNER unable to serve request to r462:/Table/106/1/9{7953640-8211751} [(n5,s5):5, (n2,s2):2VOTER_DEMOTING_LEARNER, (n6,s6):3, (n3,s3):6VOTER_INCOMING, next=7, gen=399]: closed timestamp: 1699893138.952467078,0 (2023-11-13 16:32:18); raft status: {"id":"2","term":8,"vote":"2","commit":50,"lead":"2","raftState":"StateLeader","applied":50,"progress":{"2":{"match":115,"next":116,"state":"StateReplicate"},"3":{"match":0,"next":1,"state":"StateSnapshot"},"5":{"match":114,"next":116,"state":"StateReplicate"},"6":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: encountered poisoned latch /Local/RangeID/462/r/AbortSpan/"509778ae-c62c-45eb-9006-67a7fe602666"@0,0
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_metamorphicBuild=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage


Jira issue: CRDB-33473

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers and technical advisories for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Nov 13, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Nov 13, 2023
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV Nov 13, 2023

shralex commented Nov 14, 2023

@kvoli looks like a test infra thing?

@andrewbaptist

I'm not sure. What does "encountered poisoned latch" mean here?


erikgrinaker commented Nov 15, 2023

The replica circuit breakers tripped, poisoning latches for in-flight writes to avoid latch waiters getting permanently stuck.

Sounds like we lost quorum. Haven't looked at artifacts here at all.
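
Conceptually, the mechanism looks something like this toy sketch (illustrative only, not CockroachDB's actual spanlatch code; all names invented): once the replica's circuit breaker trips, poisoning the latch turns would-be waiters into immediate errors instead of letting them block forever behind a proposal that will never complete.

package main

import (
	"errors"
	"fmt"
)

// latch is a toy stand-in for a per-key write latch. A waiter either acquires
// it once the holder releases it, or fails fast once it has been poisoned.
type latch struct {
	released chan struct{}
	poisoned chan struct{}
}

func newLatch() *latch {
	return &latch{released: make(chan struct{}), poisoned: make(chan struct{})}
}

// release lets waiters proceed normally (the in-flight write completed).
func (l *latch) release() { close(l.released) }

// poison marks the latch as held by a proposal that will never complete,
// e.g. because the range lost quorum and the circuit breaker tripped.
func (l *latch) poison() { close(l.poisoned) }

// wait blocks until the latch is released or poisoned.
func (l *latch) wait() error {
	select {
	case <-l.released:
		return nil
	case <-l.poisoned:
		return errors.New("encountered poisoned latch")
	}
}

func main() {
	l := newLatch()
	go l.poison()         // the circuit breaker trips instead of the write completing
	fmt.Println(l.wait()) // the waiter returns an error instead of hanging
}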

@andrewbaptist

I took one more look at the logs, and you are correct that it is tripping a circuit breaker, but this is unexpected. We see this in the logs:

from n1:

I231113 16:32:18.650939 332 13@kv/kvserver/store_rebalancer.go:658 ⋮ [T1,Vsystem,n1,s1,store-rebalancer,obj=‹cpu›] 3112  rebalancing r462 (‹(queries-per-second=0.4 cpu-per-second=12ms)› load) to better balance load: voters from (n1,s1):1,(n2,s2):2,(n6,s6):3 to [n2,s2 n5,s5 n6,s6]; non-voters from  to []
I231113 16:32:21.485026 332 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n1,s1,r462/1:‹/Table/106/1/9{79536…-82117…}›,*kvpb.AdminChangeReplicasRequest] 3255  change replicas (add [] remove []): existing descriptor r462:‹/Table/106/1/9{7953640-8211751}› [(n1,s1):1VOTER_DEMOTING_LEARNER, (n2,s2):2, (n6,s6):3, (n5,s5):5VOTER_INCOMING, next=6, gen=395]

from n2:

I231113 16:32:21.582311 52870 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r462/2:‹/Table/106/1/9{79536…-82117…}›] 1485  proposing SIMPLE(l6) [(n3,s3):6LEARNER]: after=[(n5,s5):5 (n2,s2):2 (n6,s6):3 (n3,s3):6LEARNER] next=7
I231113 16:32:21.952380 53123 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r462/2:‹/Table/106/1/9{79536…-82117…}›] 1529  proposing ENTER_JOINT(r2 l2 v6) [(n3,s3):6VOTER_INCOMING], [(n2,s2):2VOTER_DEMOTING_LEARNER]: after=[(n5,s5):5 (n2,s2):2VOTER_DEMOTING_LEARNER (n6,s6):3 (n3,s3):6VOTER_INCOMING] next=7
W231113 16:32:36.962659 53083 kv/kvserver/spanlatch/manager.go:577 ⋮ [T1,Vsystem,n2,s2,r462/2:‹/Table/106/1/9{79536…-82117…}›] 758  have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0

I don't think this is overload, because this is the only range that has an issue. It appears that n2 got the lease and then "got stuck".

from n3:

I231113 16:32:21.938107 33243 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n3,s3,r462/6:{-}] 327  applying snapshot 2f9b8182 from (n5,s5):5 at applied index 48
I231113 16:32:21.943148 33243 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r462/6:‹/Table/106/1/9{79536…-82117…}›] 216  applied snapshot 2f9b8182 from (n5,s5):5 at applied index 48 (total=5ms data=29 MiB ingestion=6@4ms)

from n5:

I231113 16:32:18.667483 238 kv/kvserver/store_remove_replica.go:290 ⋮ [T1,Vsystem,n5] 536  removing uninitialized replica [n5,s5,r462/4:{-}]
I231113 16:32:21.382503 24758 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n5,s5,r462/5:‹/Table/106/1/9{79536…-82117…}›] 582  applied snapshot 9c88efbd from (n1,s1):1 at applied index 33 (total=4ms data=29 MiB ingestion=6@3ms)
I231113 16:32:21.534164 24946 kv/kvserver/replica_command.go:2282 ⋮ [T1,Vsystem,n5,replicate,s5,r462/5:‹/Table/106/1/9{79536…-82117…}›] 583  we were trying to exit a joint config but found that we are no longer in one; skipping
E231113 16:33:21.555797 24972 kv/kvserver/queue.go:1160 ⋮ [T1,Vsystem,n5,replicate,s5,r462/5:‹/Table/106/1/9{79536…-82117…}›] 855  operation ‹"replicate queue process replica 462"› timed out after 1m0.006s (given timeout 1m0s): change replicas of r462 failed: txn exec: context deadline exceeded

from n6:

I231113 16:32:22.346644 31320 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n6,s6,r462/3:{-}] 157  applying snapshot 799a0858 from (n5,s5):5VOTER_INCOMING at applied index 37
W231113 16:32:21.430857 242 kv/kvserver/split_trigger_helper.go:146 ⋮ [T1,Vsystem,n6,s6,r462/3:{-},raft] 379  would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 45.248846985s (>20s) ago

I can take a bit more of a look later this afternoon unless someone picks this up. But this seems like a deadlock...

@andrewbaptist andrewbaptist self-assigned this Nov 15, 2023

andrewbaptist commented Nov 15, 2023

Jotting notes for the timeline:

Unknown: n6 is not a learner but needs a snapshot for most of the time. How did it get out of learner status?

16:31:32.965206 n1 - r70 rebalancing ‹voter› n2,s2 to n5,s5 - this is what causes n5 to become a learner since it doesn't get the snapshot when the splits start

16:31:36.039374 n1 - r461 splits to create r462 (size is 4.2GiB) - replicas n6/3  and n5/4 are still waiting for a snapshot

16:32:18.650939 n1 - store rebalancer begins rebalances from 1,2,6 (5 LEARNER) -> 2,5,6 - creating n5/4
16:32:18.652743 n1 - store rebalancer - remove n5/4 => gen 393
16:32:18.657009 n1 - add [n5/5-LEARNER] remove []): existing  [n1/1, n2/2, n6/3] => gen 394

16:32:21.366871 n1 - streamed snapshot at applied index 33 to n5/5
16:32:21.377980 n5 - applying snapshot from n1/1 at applied index 33

16:32:21.398427 n1 - change replicas add n5/5-VOTER_INCOMING remove n1/1-VOTER_DEMOTING_LEARNER 
16:32:21.427 - n1 - lease transfer to n5
16:32:21.536843 n1 - change replicas remove n1/1-LEARNER => n2/2, n6/3, n5/5 

16:32:21.558 n5 - n1 admin transfer lease transfer to n2 (last step of store rebalancing)

16:32:21.551888 n5 - change replicas add n3/6 to [n5/5, n2/2, n6/3]  (fails and retries since not leaseholder)

16:32:21.563416 n6 - would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 45.381409855s (>20s) ago - BUG?

16:32:21.570487 n5 - change replicas add n3/6 remove n2/2 to [n5/5, n2/2, n6/3] (succeeds)
16:32:21.582311 n2 -  proposing SIMPLE(l6) [(n3,s3):6LEARNER]: after=[n5/5 n2/2 n6/3 n3/6LEARNER 

16:32:21.943148 n3 - applied snapshot 2f9b8182 from (n5,s5):5 at applied index 48

16:32:21.952380 n2 - proposing ENTER_JOINT add n3, remove n2 after=[n5/5 n2/2-VOTER_DEMOTING_LEARNER n6/3 n3/6-VOTER_INCOMING] - HUNG
16:32:21.956000 n5 - change replicas - leave joint => gen 400 -> HUNG


16:32:22.332579 n5 -  streamed snapshot 799a0858 at applied index 37 to n6/3 - after 0.9s
16:32:22.346644 n6 - applying snapshot 799a0858 from (n5,s5):5VOTER_INCOMING at applied index 37

... system hung

16:32:36.962659 n2 - have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0

16:33:21.555153 n5 - error processing replica: change replicas of r462 failed: txn exec: context deadline exceeded

Lease history:

lease
n1 - 1699893085784953569 - 1699893145217695207
16:31:25.784 - 16:32:25.217

n5 - 1699893141427306670 - 1699893147427207021
16:32:21.427 - 16:32:27.427

n2 - 1699893141558284520 - 1699893147558234018
16:32:21.558 - 16:32:27.558

Raft Status

n2 -         "raft_applied_index": 50,
n3 -         "raft_applied_index": 48,
n5 -         "raft_applied_index": 50,
n6 -         "raft_applied_index": 37,

I understand why n6 is behind, since it hit the 45s timeout, but it's not clear why n3 is behind; maybe it's because of the structure of the joint configs.

At this point the joint configs are:
2, 5, 6 and 3, 5, 6
The second one can't make progress since there isn't a quorum.
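
For reference, a minimal sketch of the joint-consensus arithmetic (illustrative only, not CockroachDB or raft code): while in a joint configuration, an entry commits only if a majority of both the outgoing and the incoming voter sets acknowledges it, and with n3 and n6 both waiting on snapshots only n2 and n5 can.

package main

import "fmt"

// majorityReached reports whether a majority of the given voter set is
// contained in the set of replicas that can currently acknowledge entries.
func majorityReached(voters []string, acked map[string]bool) bool {
	n := 0
	for _, v := range voters {
		if acked[v] {
			n++
		}
	}
	return n > len(voters)/2
}

// jointCanCommit mirrors joint-consensus semantics: an entry commits only if
// both the outgoing and the incoming configurations reach a majority.
func jointCanCommit(outgoing, incoming []string, acked map[string]bool) bool {
	return majorityReached(outgoing, acked) && majorityReached(incoming, acked)
}

func main() {
	outgoing := []string{"n2", "n5", "n6"} // voters before the change
	incoming := []string{"n3", "n5", "n6"} // voters after the change
	// n3 and n6 are stuck in StateSnapshot, so only n2 and n5 can ack.
	acked := map[string]bool{"n2": true, "n5": true}
	fmt.Println(jointCanCommit(outgoing, incoming, acked)) // false: the incoming side only has n5
}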

@erikgrinaker

@pavelkalinnikov pointed out that this looks similar to #104588. Replicas 3 and 6 (confusingly on nodes n6 and n3 respectively) are both waiting for snapshots, which stalls the second quorum. The leader 2 says:

  "raft_state": {
    "replica_id": 2,
    "hard_state": {
      "term": 8,
      "vote": 2,
      "commit": 50
    },
    "lead": 2,
    "state": "StateLeader",
    "applied": 50,
    "progress": {
      "2": {
        "match": 159,
        "next": 160,
        "state": "StateReplicate"
      },
      "3": {
        "next": 1,
        "state": "StateSnapshot",
        "paused": true,
        "pending_snapshot": 45
      },
      "5": {
        "match": 159,
        "next": 160,
        "state": "StateReplicate"
      },
      "6": {
        "next": 1,
        "state": "StateSnapshot",
        "paused": true,
        "pending_snapshot": 48
      }
    }

@erikgrinaker

Yeah, this looks a lot like etcd-io/raft#84. n6 did apply a snapshot from n5 at applied index 37:

I231113 16:32:22.346644 31320 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n6,s6,r462/3:{-}] 157  applying snapshot 799a0858 from (n5,s5):5VOTER_INCOMING at applied index 37

This is a perfectly fine snapshot, because the leader's log is truncated at index 10:

      "truncated_state": {
        "index": 10,
        "term": 5
      },

But the leader isn't cool with this, because it insists that n6 (replica 3) must apply a snapshot at or above the tracked pending snapshot index 45:

      "3": {
        "next": 1,
        "state": "StateSnapshot",
        "paused": true,
        "pending_snapshot": 45
      }
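
The shape of the problem, and of the upstream change, as a hedged sketch (invented names and a simplified condition; the real logic lives in raft's progress tracking, see etcd-io/raft#84): the strict rule only leaves StateSnapshot for a reported snapshot at or above PendingSnapshot, while the relaxed rule accepts any snapshot that reconnects the follower to the leader's log.

package main

import "fmt"

// progress is a cut-down view of the leader's per-follower tracking state.
type progress struct {
	state           string // "StateSnapshot" while a snapshot is in flight
	pendingSnapshot uint64 // leader's last index when the snapshot was requested
}

// canResumeStrict is the inflexible rule: only a snapshot at or beyond
// PendingSnapshot lets the follower leave StateSnapshot.
func canResumeStrict(p progress, reportedSnapIndex uint64) bool {
	return p.state == "StateSnapshot" && reportedSnapIndex >= p.pendingSnapshot
}

// canResumeRelaxed is the behavior the fix aims for: any snapshot that covers
// everything the leader has already truncated reconnects the follower to the
// log and is good enough to resume replication.
func canResumeRelaxed(p progress, reportedSnapIndex, leaderFirstIndex uint64) bool {
	return p.state == "StateSnapshot" && reportedSnapIndex+1 >= leaderFirstIndex
}

func main() {
	p := progress{state: "StateSnapshot", pendingSnapshot: 45}
	const reported = 37    // n6 applied a snapshot at index 37
	const leaderFirst = 11 // the leader's log is truncated at index 10
	fmt.Println(canResumeStrict(p, reported))               // false: 37 < 45, so the follower stays stuck
	fmt.Println(canResumeRelaxed(p, reported, leaderFirst)) // true: index 37 reconnects it to the log
}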

@erikgrinaker

While the immediate problem here is #106813 and etcd-io/raft#84, I've been trying to understand why we even allowed entering this joint quorum that wouldn't be able to make progress.

We have some safety checks when executing a configuration change, but these only take into account the liveness status of replicas, not their replication status. As the comment points out, callers generally don't check this at all, which is why we do it so far down the stack. My guess is that the replicate queue and allocator don't do any checks like this either elsewhere -- is that accurate @kvoli?

We could consider strengthening the checks here to consider replicas not in StateReplicate to be unavailable, but that would require access to the leader's Raft state. The replicate queue appears to be the only caller here though, and it only runs on the leaseholder, so maybe that's a reasonable assumption -- at the very least, we could do a best-effort check that's skipped if we don't have leader state.

Does this all seem reasonable @kvoli? Do we have any existing issues for this, or should I write one up?

// Run (and retry for a bit) a sanity check that the configuration
// resulting from this change is able to meet quorum. It's
// important to do this at this low layer as there are multiple
// entry points that are not generally too careful. For example,
// before the below check existed, the store rebalancer could
// carry out operations that would lead to a loss of quorum.
//
// See:
// https://github.com/cockroachdb/cockroach/issues/54444#issuecomment-707706553
replicas := crt.Desc.Replicas()
// We consider stores marked as "suspect" to be alive for the purposes of
// determining whether the range can achieve quorum since these stores are
// known to be currently live but have failed a liveness heartbeat in the
// recent past.
//
// Note that the allocator will avoid rebalancing to stores that are
// currently marked suspect. See uses of StorePool.getStoreList() in
// allocator.go.
liveReplicas, _ := args.liveAndDeadReplicas(replicas.Descriptors(),
	true /* includeSuspectAndDrainingStores */)
if !replicas.CanMakeProgress(
	func(rDesc roachpb.ReplicaDescriptor) bool {
		for _, inner := range liveReplicas {
			if inner.ReplicaID == rDesc.ReplicaID {
				return true
			}
		}
		return false
	}) {
	// NB: we use newQuorumError which is recognized by the replicate queue.
	return newQuorumError("range %s cannot make progress with proposed changes add=%v del=%v "+
		"based on live replicas %v", crt.Desc, crt.Added(), crt.Removed(), liveReplicas)
}
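
A best-effort version of such a check might look like the following toy sketch (invented names and a simplified quorum rule, not a real patch): count a replica toward quorum only if it is live and, when leader state is available, also in StateReplicate; without leader state, fall back to the liveness-only behavior.

package main

import "fmt"

// replicaHealth is the per-replica information the check would need: liveness
// (what the existing check already uses) plus whether the leader sees the
// replica actively replicating.
type replicaHealth struct {
	live        bool
	replicating bool // leader-side Raft progress is StateReplicate
}

// canMakeProgress is a toy stand-in for ReplicaSet.CanMakeProgress: the
// proposed voter set can commit if a majority of voters pass the predicate.
func canMakeProgress(voters []int, ok func(id int) bool) bool {
	n := 0
	for _, v := range voters {
		if ok(v) {
			n++
		}
	}
	return n > len(voters)/2
}

// strengthenedCheck counts a replica toward quorum only if it is live and,
// when leader state is available, also replicating. haveLeaderState models
// the best-effort fallback when we are not the Raft leader.
func strengthenedCheck(voters []int, health map[int]replicaHealth, haveLeaderState bool) bool {
	return canMakeProgress(voters, func(id int) bool {
		h := health[id]
		if !h.live {
			return false
		}
		return !haveLeaderState || h.replicating
	})
}

func main() {
	// Proposed incoming config {3, 5, 6} by replica ID: all live, but replicas
	// 3 and 6 are still waiting on snapshots, so only 5 is replicating.
	voters := []int{3, 5, 6}
	health := map[int]replicaHealth{
		3: {live: true, replicating: false},
		5: {live: true, replicating: true},
		6: {live: true, replicating: false},
	}
	fmt.Println(strengthenedCheck(voters, health, false)) // true: the liveness-only check passes
	fmt.Println(strengthenedCheck(voters, health, true))  // false: the replication-aware check rejects the change
}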

@erikgrinaker

I'm going to downgrade this to a GA blocker, since it's a known, pre-existing problem. It seems unlikely that we'll be able to get the upstream Raft changes properly tested and merged in time for 23.2, so I think it's unlikely we'll actually block 23.2 over this.

@erikgrinaker erikgrinaker added GA-blocker C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 16, 2023
@blathers-crl blathers-crl bot added the T-kv-replication KV Replication Team label Nov 16, 2023

blathers-crl bot commented Nov 16, 2023

cc @cockroachdb/replication


kvoli commented Nov 16, 2023

My guess is that the replicate queue and allocator don't do any checks like this either elsewhere -- is that accurate @kvoli?

They do checks for replica removal:

candidates = allocatorimpl.FilterUnremovableReplicas(ctx, raftStatus, existingVoters, lastReplAdded)
log.KvDistribution.VEventf(ctx, 3, "filtered unremovable replicas from %v to get %v as candidates for removal: %s",
	existingVoters, candidates, rangeRaftProgress(raftStatus, existingVoters))
if len(candidates) > 0 {
	break
}

and for rebalancing:

if targetType == VoterTarget && raftStatus != nil && raftStatus.Progress != nil {
	replicaCandidates = simulateFilterUnremovableReplicas(
		ctx, raftStatus, replicaCandidates, newReplica.ReplicaID)
}

The replica rebalancing check assumes the newly added replica is up to date. This could be changed to not consider the newly added replica -- however, this would pause any rebalancing when a node goes down (temporarily), which doesn't appear desirable for load-bearing systems.
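
To illustrate that assumption with a toy sketch (invented names and a much simplified quorum rule, not the allocator's actual code): treating the incoming replica as already caught up makes removing an up-to-date voter look safe, even though the incoming replica still needs a snapshot.

package main

import "fmt"

// voter captures the leader's view of a voter for this toy example.
type voter struct {
	id       int
	caughtUp bool // in StateReplicate from the leader's perspective
}

// removableCandidates returns the voters whose removal still leaves a majority
// of caught-up voters. If assumeNewCaughtUp is set, the newly added replica is
// treated as caught up even if it is still waiting on a snapshot, mirroring
// the rebalancing check described above.
func removableCandidates(voters []voter, newID int, assumeNewCaughtUp bool) []int {
	caughtUp := func(v voter) bool {
		return v.caughtUp || (assumeNewCaughtUp && v.id == newID)
	}
	var out []int
	for _, cand := range voters {
		n := 0
		for _, v := range voters {
			if v.id != cand.id && caughtUp(v) {
				n++
			}
		}
		// Removing cand leaves len(voters)-1 voters; require a majority of
		// them to be caught up for the range to keep making progress.
		if n > (len(voters)-1)/2 {
			out = append(out, cand.id)
		}
	}
	return out
}

func main() {
	// Existing voters n2, n5, n6 plus incoming n3, which still needs a snapshot.
	voters := []voter{{2, true}, {5, true}, {6, false}, {3, false}}
	fmt.Println(removableCandidates(voters, 3, true))  // [2 5 6 3]: with n3 assumed caught up, even removing n2 looks safe
	fmt.Println(removableCandidates(voters, 3, false)) // [6 3]: without that assumption, removing n2 or n5 would stall the range
}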

Does this all seem reasonable @kvoli? Do we have any existing issues for this, or should I write one up?

Seems reasonable to me, I filed #114570.

@erikgrinaker

The replica rebalancing check assumes the newly added replica is up to date. This could be changed to not consider the newly added replica -- however this would pause any rebalancing when a node goes down (temporarily), which doesn't appear desirable for load bearing systems.

Yeah, let's not do that. It's reasonable to assume the snapshot will catch it up, which it will given #106813. I'll see if I can get etcd-io/raft#84 ready tomorrow.

erikgrinaker pushed a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
Prior to this commit, the leader would not take into account snapshots
reported by a follower unless they matched or exceeded the tracked
PendingSnapshot index (which is the leader's last index at the time of
requesting the snapshot). This is too inflexible: the leader should take
into account any snapshot that reconnects the follower to its log. This
PR adds a config option ResumeReplicateBelowPendingSnapshot that enables
this behavior.

In doing so, it addresses long-standing problems that we've encountered
in CockroachDB. Unless you create the snapshot immediately and locally
when raft emits an MsgSnap, it's difficult/impossible to later
synthesize a snapshot at the requested index. It is possible to get one
above the requested index which raft always accepted, but CockroachDB
delegates snapshots to followers who might be behind on applying the
log, and it is awkward to have to wait for log application to send the
snapshot just to satisfy an overly strict condition in raft.
Additionally, CockroachDB also sends snapshots preemptively when adding
a new replica since there are qualitative differences between an initial
snapshot and one needed to reconnect to the log and one does not want to
wait for raft to round-trip to the follower to realize that a snapshot
is needed. In this case, the sent snapshot is commonly behind the
PendingSnapshot since the leader transitions the follower into
StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
Touches cockroachdb/cockroach#114349.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Erik Grinaker <grinaker@cockroachlabs.com>
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>

erikgrinaker commented Nov 17, 2023

Submitted upstream fix in etcd-io/raft#110, CRDB fix in #114640.

@erikgrinaker

Removing GA-blocker, since this is a pre-existing problem.


kvoli commented Dec 1, 2023

Assigning P2, as this is a pre-existing problem but somewhat specific to this test and we would expect the system to recover shortly in most settings.

@kvoli kvoli added the P-2 Issues/test failures with a fix SLA of 3 months label Dec 1, 2023
@erikgrinaker erikgrinaker self-assigned this Jan 3, 2024
@craig craig bot closed this as completed in 60092e0 Jan 10, 2024
KV automation moved this from roachtest/unit test backlog to Closed Jan 10, 2024