sql/colflow: TestDrainingAfterRemoteError failed #109729
If my code sleuthing is correct, the pool that was closed and reused is from the tenant rate limiter: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/tenantrate/limiter.go#L167 That seems to be where the next stage of investigation should happen. Attached are the TC logs I captured before the artifacts expired.
That sounds right. My hypothesis is that because this test involves a test cluster with multiple nodes, and we currently don't wait for all pods of the test tenant to be online (I believe we're currently only guaranteed that the test tenant pod on node 1 is up), we might be attempting to connect to a remote pod that is not up yet, or something along those lines.
I spent some time looking into this. I added
and then a different goroutine attempts to use that already-released tenant rate limiter:
Not sure where to go from here, though. It appears that there is some kind of race between the replica change in the first goroutine and a BatchRequest touching that replica being rate limited. @erikgrinaker @pavelkalinnikov do I understand it correctly that
sql/colflow.TestDrainingAfterRemoteError failed with artifacts on master @ 2e849209c9d39e8fdb5001bdb53c99fa63e7981c:
Parameters:
That's right. My memory of the life cycle details here is a bit fuzzy, but we typically use
I'm not immediately seeing any synchronization in
I can dig into this a bit more tomorrow.
I'll reassign this to myself, so I don't forget about it, since this points towards a replica lifecycle race anyway. Let me know if you'd prefer to keep it. |
cc @cockroachdb/replication |
Previously we had a similar bug, fixed in #95524, where we had a nil deref of this limiter. It can be related.
The error is returned because there is a race between destroying a replica (which releases the rate limiter) and incoming requests to the replica (which use the rate limiter and return an error if it has been released). For performance reasons, the rate limiter is used without proper synchronization with the destruction status: we don't want to be holding any mutexes while blocked on the limiter. The solution @erikgrinaker and I converged on is to check the return error of the limiter

Some logs from a failing test run are below. The logs indicate that the replica removal is due to moving voters between nodes (which the test does with the
sql/colflow.TestDrainingAfterRemoteError failed with artifacts on master @ f464c0607bdd695aceec4e51704b9dc11d465204:
Jira issue: CRDB-31069