Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestReplicaTombstone failed #104477

Closed
cockroach-teamcity opened this issue Jun 7, 2023 · 6 comments · Fixed by #105044
Closed

kv/kvserver: TestReplicaTombstone failed #104477

cockroach-teamcity opened this issue Jun 7, 2023 · 6 comments · Fixed by #105044
Assignees
Labels
A-testing Testing tools and infrastructure branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 7, 2023

kv/kvserver.TestReplicaTombstone failed with artifacts on release-23.1 @ 1a5dc2aad16ca4185afcda75d923804324dcfeb9:

=== RUN   TestReplicaTombstone
--- FAIL: TestReplicaTombstone (40.99s)
=== RUN   TestReplicaTombstone/(4)_(4.1)_raft_messages_to_newer_replicaID_
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestReplicaTombstone_(4)_(4.1)_raft_messages_to_newer_replicaID_2468918913
    test_log_scope.go:79: use -show-logs to present logs inline
    client_replica_test.go:3654: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_replica_test.go:3654
        	Error:      	Expect "error sending couldn't accept range_id:64 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:3 store_id:3 replica_id:4 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:REBALANCE type:INITIAL term:9 first_index:41 descriptor_generation:9 queue_on_delegate_len:-1 snap_id:b792b7c5-3e44-4ca3-b924-2e16ba3f5ed2: sender: (n1,s1):1 is not fit to send snapshot for (n1,s1):1; sender term: 10, coordinator term: 9" to match "boom"
        	Test:       	TestReplicaTombstone/(4)_(4.1)_raft_messages_to_newer_replicaID_
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestReplicaTombstone_(4)_(4.1)_raft_messages_to_newer_replicaID_2468918913
    --- FAIL: TestReplicaTombstone/(4)_(4.1)_raft_messages_to_newer_replicaID_ (7.44s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-28551

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team labels Jun 7, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 7, 2023
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV Jun 7, 2023
@irfansharif
Copy link
Contributor

irfansharif commented Jun 7, 2023

It's running with b42f862 from #102228.

@pav-kv
Copy link
Collaborator

pav-kv commented Jun 13, 2023

1632 runs so far, 0 failures, over 31m35s

Upd: failed after 35 min of stressing.

@pav-kv
Copy link
Collaborator

pav-kv commented Jun 13, 2023

The error that a successful run expects and sees:

error sending couldn't accept range_id:64 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:3 store_id:3 replica_id:4 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:REBALANCE type:INITIAL term:6 first_index:29 descriptor_generation:9 queue_on_delegate_len:-1 snap_id:6ccbbd1e-af05-4a5e-9b1c-09b7501fd2a8 : recv msg error: grpc: boom [code 2/Unknown]

Failing runs see:

error sending couldn't accept range_id:64 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:3 store_id:3 replica_id:4 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:REBALANCE type:INITIAL term:9 first_index:41 descriptor_generation:9 queue_on_delegate_len:-1 snap_id:b792b7c5-3e44-4ca3-b924-2e16ba3f5ed2: sender: (n1,s1):1 is not fit to send snapshot for (n1,s1):1; sender term: 10, coordinator term: 9

I.e. the "is not fit to send snapshot" error from a delegated snapshot machinery is not expected.

log.Infof(
ctx,
"sender: %v is not fit to send snapshot for %v; sender term: %v coordinator term: %v",
req.DelegatedSender, req.CoordinatorReplica, replTerm, req.Term,
)
return errors.Errorf(
"sender: %v is not fit to send snapshot for %v; sender term: %v, coordinator term: %v",
req.DelegatedSender, req.CoordinatorReplica, replTerm, req.Term,
)

Not sure why delegated snapshots kick in in this test.

@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv Anything in KV that doesn't belong in a more specific category. labels Jun 13, 2023
@pav-kv pav-kv removed the A-kv Anything in KV that doesn't belong in a more specific category. label Jun 13, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Jun 13, 2023

@andrewbaptist Could you take a look? Do we expect delegated snapshots here? LMK if you need full logs of the good/bad runs.

@pav-kv
Copy link
Collaborator

pav-kv commented Jun 13, 2023

Logs from a good run contain:

[T1,n1,s1,r63/1:‹/{Table/62-Max}›,*kvpb.AdminSplitRequest] initiating a split of this range at key /Table/Max [r64] (‹manual›)
[T1,n2,s2,r64/2:‹/{Table/Max-Max}›] applied INITIAL snapshot ebb2d197 from (n1,s1):1 at applied index 15 (total=1ms data=844 B ingestion=6@1ms)
[T1,n3,s3,r64/3:‹/{Table/Max-Max}›] applied INITIAL snapshot 4ab0f270 from (n1,s1):1 at applied index 16 (total=1ms data=844 B ingestion=6@0ms)

// a bunch of raft message drops
[T1,n3] dropping r64 Raft message ‹1->3 MsgApp Term:6 Log:6/22 Commit:22›
...

[T1,n3] removing replica r64/3
[T1,n3] removed 9 (0+9) keys in 0ms [clear=0ms commit=0ms]

[T1,n1,s1,r64/1:‹/{Table/Max-Max}›,*kvpb.AdminChangeReplicasRequest] attempt 1: delegate snapshot ‹range_id:64 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:3 store_id:3 replica_id:4 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:REBALANCE type:INITIAL term:6 first_index:29 descriptor_generation:9 queue_on_delegate_len:-1 snap_id:6ccbbd1e-af05-4a5e-9b1c-09b7501fd2a8> : recv msg error: grpc: boom [code 2/Unknown]
[T1,n1,s1,r64/1:‹/{Table/Max-Max}›,*kvpb.AdminChangeReplicasRequest] could not successfully add and upreplicate LEARNER replica(s) on [n3,s3], rolling back: error sending couldn't accept <...>

Logs from a bad run contain:

[T1,n1,s1,r63/1:‹/{Table/62-Max}›,*kvpb.AdminSplitRequest] initiating a split of this range at key /Table/Max [r64] (‹manual›)
[T1,n2,s2,r64/2:‹/{Table/Max-Max}›] applied INITIAL snapshot 1f5d4fa8 from (n1,s1):1 at applied index 15 (total=1ms data=823 B ingestion=6@0ms)
[T1,n3,s3,r64/3:‹/{Table/Max-Max}›] applied INITIAL snapshot 6bc76a9f from (n1,s1):1 at applied index 15 (total=1ms data=823 B ingestion=6@0ms)

// a bunch of raft message drops
[T1,n3] dropping r64 Raft message ‹1->3 MsgApp Term:6 Log:6/21 Commit:21 Entries:[6/22 EntryNormal <omitted>]›
...

[T1,n1,s1,r64/1:‹/{Table/Max-Max}›] sender: (n1,s1):1 is not fit to send snapshot for (n1,s1):1; sender term: 8 coordinator term: 7
[T1,n1,s1,r64/1:‹/{Table/Max-Max}›,*kvpb.AdminChangeReplicasRequest] attempt 1: delegate snapshot ‹range_id:64 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:3 store_id:3 replica_id:4 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:REBALANCE type:INITIAL term:7 first_index:30 descriptor_generation:9 queue_on_delegate_len:-1 snap_id:c7b035b5-97ec-4d60-9e68-cc0c065cbc90 › request failed error sending couldn't accept ‹range_id:64 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:3 store_id:3 replica_id:4 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:REBALANCE type:INITIAL term:7 first_index:30 descriptor_generation:9 queue_on_delegate_len:-1 snap_id:c7b035b5-97ec-4d60-9e68-cc0c065cbc90 ›: sender: (n1,s1):1 is not fit to send snapshot for (n1,s1):1; sender term: 8, coordinator term: 7

[T1,n1,s1,r64/1:‹/{Table/Max-Max}›,*kvpb.AdminChangeReplicasRequest] could not successfully add and upreplicate LEARNER replica(s) on [n3,s3], rolling back: error sending couldn't accept ‹...›: sender: (n1,s1):1 is not fit to send snapshot for (n1,s1):1; 
sender term: 8, coordinator term: 7

[T1,n3] 281  removing replica r64/3
[T1,n3] 282  removed 9 (0+9) keys in 0ms [clear=0ms commit=0ms]

[T1,n1,s1,r64/1:‹/{Table/Max-Max}›,*kvpb.AdminChangeReplicasRequest] 284  rolled back LEARNER n3,s3 in r64:‹/{Table/Max-Max}› [(n1,s1):1, (n2,s2):2, (n3,s3):4LEARNER, next=5, gen=9, sticky=9223372036.854775807,2147483647]

@pav-kv pav-kv added the A-testing Testing tools and infrastructure label Jun 13, 2023
@andrewbaptist
Copy link
Collaborator

I'll take a look. In this case it is not a "true" delegated snapshot, since the coordinator and the sender are both n1. It just always uses the same machinery to send snapshots. The fact that the term changed here is a little surprising, but it could have changed for a different reason. I might need to weaken that check to not apply it in the case of self-delegation, or change it so that it allows a newer Term

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Jun 16, 2023
Previously delegated snapshots had a check where it would not allow
delegation if the term or generation were different between the sender
and the recipient. This was incorrect as on the coordinator it was
possible that the term and generation could increment. This change
allows the sending of changes as long as the sender (leaseholder or
delegate) has an equal or greater term or generation.

Epic: none
Fixes: cockroachdb#104477

Release note: None
craig bot pushed a commit that referenced this issue Jun 26, 2023
105044: kvserver: allow term and generation increments r=pavelkalinnikov a=andrewbaptist

Previously delegated snapshots had a check where it would not allow delegation if the term or generation were different between the sender and the recipient. This was incorrect as on the coordinator it was possible that the term and generation could increment. This change allows the sending of changes as long as the sender (leaseholder or delegate) has an equal or greater term or generation.

Epic: none
Fixes: #104477

Release note: None

105561: backupccl: skip timing based regression test under deadlock r=rhu713 a=adityamaru

TestCleanupIntentsDuringBackupPerformanceRegression ensures that intent resolution does not take >20s when running in a controlled environment. We have seen instances of this test when run under deadlock taking 22-24s causing the test to fail. Given that this test is very sensitive to timing, it does not make sense to run it under deadlock.

Fixes: #100034
Release note: None

105562: concurrency: small bug fix r=nvanbenschoten a=arulajmani

Previously, as seen in the testing diff, an inactive waiter/claimant on a lock could continue to remain a distinguished waiter on that lock. This is now fixed.

Epic: none

Release note: None

Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
@craig craig bot closed this as completed in fe3f6c2 Jun 26, 2023
KV automation moved this from roachtest/unit test backlog to Closed Jun 26, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 26, 2024
Previously delegated snapshots had a check where it would not allow
delegation if the term or generation were different between the sender
and the recipient. This was incorrect as on the coordinator it was
possible that the term and generation could increment. This change
allows the sending of changes as long as the sender (leaseholder or
delegate) has an equal or greater term or generation.

Epic: none
Fixes: #104477

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
KV
Closed
Development

Successfully merging a pull request may close this issue.

4 participants