Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/colflow: TestDrainingAfterRemoteError failed #109729

Closed
cockroach-teamcity opened this issue Aug 30, 2023 · 9 comments · Fixed by #110806
Closed

sql/colflow: TestDrainingAfterRemoteError failed #109729

cockroach-teamcity opened this issue Aug 30, 2023 · 9 comments · Fixed by #110806
Assignees
Labels
branch-master Failures on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. T-kv-replication KV Replication Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Aug 30, 2023

sql/colflow.TestDrainingAfterRemoteError failed with artifacts on master @ f464c0607bdd695aceec4e51704b9dc11d465204:

=== RUN   TestDrainingAfterRemoteError
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/7f1b06f9801c2789f14134f543b8b883/logTestDrainingAfterRemoteError3264099528
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:132: automatically injected virtual cluster under test; see comment at top of test_server_shim.go for details.
    draining_test.go:83: error executing 'SELECT count(*) FROM small;': pq: 10 pool closed: released
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/7f1b06f9801c2789f14134f543b8b883/logTestDrainingAfterRemoteError3264099528
--- FAIL: TestDrainingAfterRemoteError (10.13s)
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-31069

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team labels Aug 30, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Aug 30, 2023
@rharding6373
Copy link
Collaborator

If my code sleuthing is correct, the pool that was closed and reused is from the tenant rate limiter: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/tenantrate/limiter.go#L167

That seems to be where the next stage of investigation should happen.

Attached are the TC logs I captured before the artifacts expired.
colflowtest.log
test.log

@yuzefovich
Copy link
Member

That sounds right. My hypothesis is that because this test involves a test cluster with multiple nodes and we currently don't wait for all pods to be online for the test tenant (I believe we're currently only guaranteed that only the test tenant pod on node 1 is up), we might be attempting to connect to the remote pod that is not up yet, or something along those lines.

@yuzefovich yuzefovich self-assigned this Sep 9, 2023
@yuzefovich
Copy link
Member

I spent some time looking into this. (For some reason, it appears that enabling sql.distsql.direct_columnar_scans.enabled makes the reproduction faster (but with it disabled the problem still reproduces).)

I added debug.PrintStack() to see who is releasing the tenant rate limiter:

goroutine 680 [running]:
runtime/debug.Stack()
	GOROOT/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
	GOROOT/src/runtime/debug/stack.go:16 +0x19
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Close(0xc00723e9a0, {0x50c8ee8, 0x8})
	github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:157 +0xda
github.com/cockroachdb/cockroach/pkg/kv/kvserver/tenantrate.(*limiter).Release(0xc003bb3508)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/tenantrate/limiter.go:167 +0x45
github.com/cockroachdb/cockroach/pkg/kv/kvserver/tenantrate.(*LimiterFactory).Release(0xc001fe7200, {0x6416ff0?, 0xc003bb3508})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/tenantrate/factory.go:136 +0x166
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).postDestroyRaftMuLocked(0xc002a07900, {0x64409c8?, 0xc005d9d830?}, {0x0, 0x178319bc6bf43370, 0x0, 0x103a2, 0x109a2, 0x19e, 0x4a45, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_destroy.go:101 +0xd5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleChangeReplicasResult(0xc002a07900, {0x64409c8, 0xc005d9d830}, 0xc009735a00)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:576 +0x390
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc002a07a08, {0x64409c8, 0xc005d9d830}, 0xc009c13398)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:340 +0x617
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc002a07a08, {0x64409c8, 0xc005d9d830}, {0x647f748?, 0xc003be2f08})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:186 +0x627
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7fd1aa04e720?, 0xc002a07dd8}, 0xc0087e45e8)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:214 +0x158
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc0087e4b38, {0x64409c8, 0xc005d9d830}, {0x646f740, 0xc002a07d78})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:295 +0x1f8
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc0087e4b38, {0x64409c8, 0xc005d9d830})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251 +0x9a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1052 +0x1399
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:739 +0x19b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc00457f500, 0xc00228ac00?)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:689 +0x145
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker(0xc0011072d0, {0x64409c8, 0xc00261e540}, {0x64266e0, 0xc00457f500}, 0xc00457ea80)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418 +0x1f2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2({0x64409c8?, 0xc00261e540?})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

and then a different goroutine attempts to use that already-released tenant rate limiter:

goroutine 6819 [running]:
runtime/debug.Stack()
	GOROOT/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
	GOROOT/src/runtime/debug/stack.go:16 +0x19
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).acquireFastPath(0xc00723e9a0, {0x64409c8?, 0xc00891f830?}, {0x6417040, 0xc005d2c1a0})
	github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:327 +0x274
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(0xc00723e9a0, {0x64409c8, 0xc00891f830}, {0x6417040, 0xc005d2c1a0})
	github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:218 +0x187
github.com/cockroachdb/cockroach/pkg/kv/kvserver/tenantrate.(*limiter).Wait(0xc003bb3508, {0x64409c8, 0xc00891f830}, {0x64253c0?, 0xc005fb0f40?, 0x0?, 0x1?})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/tenantrate/limiter.go:128 +0x129
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).maybeRateLimitBatch(0xc002a07900, {0x64409c8, 0xc00891f830}, 0xc0079146c0?)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_rate_limit.go:33 +0xca
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc002a07900, {0x64409c8?, 0xc00891f800?}, 0xc0079146c0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:159 +0x43b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes(0xc00457f500, {0x64409c8?, 0xc00891f7d0?}, 0xc0079146c0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:193 +0x885
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes(0xa?, {0x64409c8, 0xc00891f7d0}, 0xc0079146c0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:202 +0x10f
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc0003c6800, {0x64409c8?, 0xc00891f7a0?}, {0x6?}, 0xc0079146c0)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1337 +0x55b
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc0003c6800, {0x64409c8, 0xc00891f710}, 0xc0079146c0)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1468 +0x2f5
github.com/cockroachdb/cockroach/pkg/kv/kvpb._Internal_Batch_Handler.func1({0x64409c8, 0xc00891f710}, {0x50599e0?, 0xc0079146c0})
	github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10300 +0x78
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ServerInterceptor.func1({0x64409c8, 0xc00891f710}, {0x50599e0, 0xc0079146c0}, 0xc0081ba540, 0xc00697da70)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:97 +0x254
google.golang.org/grpc.getChainUnaryHandler.func1({0x64409c8, 0xc00891f710}, {0x50599e0, 0xc0079146c0})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1163 +0xb9
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func3({0x64409c8, 0xc00891f710}, {0x50599e0, 0xc0079146c0}, 0xc0081ba540?, 0xc008b1ce80)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:169 +0x83
google.golang.org/grpc.getChainUnaryHandler.func1({0x64409c8, 0xc00891f710}, {0x50599e0, 0xc0079146c0})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1163 +0xb9
github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.unaryInterceptor({0xc001f2e000?, {{0x4398c7?}, {0x6478070?, 0xc0025cb620?}}}, {0x64409c8, 0xc00891f620}, {0x50599e0, 0xc0079146c0}, 0xc0081ba540, 0xc008b1ce40)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:105 +0x288
google.golang.org/grpc.getChainUnaryHandler.func1({0x64409c8, 0xc00891f620}, {0x50599e0, 0xc0079146c0})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1163 +0xb9
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1.1({0x64409c8?, 0xc00891f620?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:136 +0x3c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc001043050, {0x64409c8, 0xc00891f620}, {0x0?, 0xc00a40f980?}, 0xc00a40f978)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:336 +0xd1
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1({0x64409c8?, 0xc00891f620?}, {0x50599e0?, 0xc0079146c0?}, 0xc0081ba540?, 0xc00697da70?)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:134 +0x95
google.golang.org/grpc.chainUnaryInterceptors.func1({0x64409c8, 0xc00891f620}, {0x50599e0, 0xc0079146c0}, 0xc009217a50?, 0x499cce0?)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1154 +0x8f
github.com/cockroachdb/cockroach/pkg/kv/kvpb._Internal_Batch_Handler({0x5037fe0?, 0xc0003c6800}, {0x64409c8, 0xc00891f620}, 0xc005618af0, 0xc002603460)
	github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10302 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0001421e0, {0x648ac40, 0xc0066d8820}, 0xc0079145a0, 0xc0025dd980, 0x8fae540, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1336 +0xd33
google.golang.org/grpc.(*Server).handleStream(0xc0001421e0, {0x648ac40, 0xc0066d8820}, 0xc0079145a0, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1704 +0xa36
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:965 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:963 +0x28a

Not sure where to go from here though. It appears that there is some kind of race between the replica change in the first goroutine and a BatchRequest touching that replica being rate limited.

@erikgrinaker @pavelkalinnikov do I understand it correctly that postDestroyRaftMuLocked is called when a replica is being destroyed?

@cockroach-teamcity
Copy link
Member Author

sql/colflow.TestDrainingAfterRemoteError failed with artifacts on master @ 2e849209c9d39e8fdb5001bdb53c99fa63e7981c:

=== RUN   TestDrainingAfterRemoteError
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/0fc162134fb21f43faaa92b5fbbb2da0/logTestDrainingAfterRemoteError1194033033
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:124: automatically injected virtual cluster under test; see comment at top of test_server_shim.go for details.
    draining_test.go:82: error executing 'ALTER TABLE small EXPERIMENTAL_RELOCATE VALUES (ARRAY[2], 2)': pq: while carrying out changes [{ADD_VOTER n2,s2} {REMOVE_VOTER n1,s1}]: change replicas of r65 failed: descriptor changed: [expected] r65:/Tenant/1{0-1} [(n1,s1):1, next=2, gen=2, sticky=1694532623.028812140,0] != [actual] r65:/Tenant/1{0-1} [(n1,s1):1, next=3, gen=4, sticky=1694532623.028812140,0]
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/0fc162134fb21f43faaa92b5fbbb2da0/logTestDrainingAfterRemoteError1194033033
--- FAIL: TestDrainingAfterRemoteError (13.76s)

Parameters: TAGS=bazel,gss , stress=true

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@erikgrinaker
Copy link
Contributor

@erikgrinaker @pavelkalinnikov do I understand it correctly that postDestroyRaftMuLocked is called when a replica is being destroyed?

That's right. My memory of the life cycle details here are a bit fuzzy, but we typically use Replica.mu.destroyStatus, Replica.raftMu, and Replica.readOnlyCmdMu to synchronize with new and existing requests respectively.

I'm not immediately seeing any synchronization in Store.SendWithWriteBytes() and Replica.SendWithWriteBytes(), so it seems plausible that we've destroyed the replica between the store looking up the replica with GetReplicaIfExists and checking the rate limiter.

I can dig into this a bit more tomorrow.

@erikgrinaker
Copy link
Contributor

I'll reassign this to myself, so I don't forget about it, since this points towards a replica lifecycle race anyway. Let me know if you'd prefer to keep it.

@erikgrinaker erikgrinaker added the T-kv-replication KV Replication Team label Sep 12, 2023
@blathers-crl
Copy link

blathers-crl bot commented Sep 12, 2023

cc @cockroachdb/replication

@erikgrinaker erikgrinaker removed the T-sql-queries SQL Queries Team label Sep 12, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Sep 12, 2023

Previously we had a similar bug fixed in #95524, where we had a nil deref of this limiter. Can be related.

@erikgrinaker erikgrinaker added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 13, 2023
@erikgrinaker erikgrinaker assigned pav-kv and unassigned erikgrinaker Sep 18, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Sep 18, 2023

The error is returned because there is a race between destroying a replica (which releases the rate limiter), and the incoming requests to the replica (which use the rate limiter, and return an error if it's released). The rate limiter is used without proper synchronization with the destruction status for performance reasons: we don't want to be holding any mutexes while being blocked on the limiter.

The solution @erikgrinaker and I converged on is to check the return error of the limiter Wait function. The 10 pool closed: released error we're seeing in this test is an ErrClosed error that propagated all the way up to SQL, but instead should have been treated as a "replica removed" error, and cause a retry.


Some logs from a failing test run below. Logs indicate that the replica removal is due to moving voters between nodes (which the test does with the ALTER TABLE small EXPERIMENTAL_RELOCATE VALUES ... statement 1 line above).

I230918 11:18:41.419727 5816 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n1,s1,r65/1:‹/Tenant/1{0-1}›,*kvpb.AdminChangeReplicasRequest] 1  change replicas (add [(n2,s2):2LEARNER] remove []): existing descriptor r65:‹/Tenant/1{0-1}› [(n1,s1):1, next=2, gen=2, sticky=1695039520.174954000,0]
I230918 11:18:41.426184 5816 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n1,s1,r65/1:‹/Tenant/1{0-1}›,*kvpb.AdminChangeReplicasRequest] 5  change replicas (add [(n2,s2):2VOTER_INCOMING] remove [(n1,s1):1VOTER_DEMOTING_LEARNER]): existing descriptor r65:‹/Tenant/1{0-1}› [(n1,s1):1, (n2,s2):2LEARNER, next=3, gen=3, sticky=1695039520.174954000,0]
I230918 11:18:41.428176 5816 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n1,s1,r65/1:‹/Tenant/1{0-1}›,*kvpb.AdminChangeReplicasRequest] 7  change replicas (add [] remove []): existing descriptor r65:‹/Tenant/1{0-1}› [(n1,s1):1VOTER_DEMOTING_LEARNER, (n2,s2):2VOTER_INCOMING, next=3, gen=4, sticky=1695039520.174954000,0]
I230918 11:18:41.436776 5816 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n1,s1,r65/1:‹/Tenant/1{0-1}›,*kvpb.AdminChangeReplicasRequest] 9  change replicas (add [] remove [(n1,s1):1LEARNER]): existing descriptor r65:‹/Tenant/1{0-1}› [(n1,s1):1LEARNER, (n2,s2):2, next=3, gen=5, sticky=1695039520.174954000,0]

I230918 11:18:41.446554 432 kv/kvserver/store_remove_replica.go:157 ⋮ [T1,Vsystem,n1,s1,r65/1:‹/Tenant/1{0-1}›,raft] 368  removing replica r65/1

@craig craig bot closed this as completed in c3c93c0 Sep 20, 2023
@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. T-kv-replication KV Replication Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants