ccl/changefeedccl: TestChangefeedWithSimpleDistributionStrategy failed #121408
Comments
This test expects the 64 ranges to be evenly distributed across 6 nodes, but they were distributed across only 5 (note that node 4 is missing from the "found partitions" list). The balanced distribution algorithm only distributes to nodes initially assigned by distsql PartitionSpans, so most likely distsql used only 5 of the 6 nodes. All 64 ranges are accounted for, so distsql may have skipped node 4 for some reason. Did something happen to that node during the test? I was unable to reproduce the failure using the random seed. I think the test could be made more resilient, and we should look into whether we can enable the expensive logging for the test so that we have useful debug information should this happen again. Before removing the release blocker label, let's stress the test with the new rebalancer logging enabled and see if we can get a repro. I don't think this is a bug in the changefeed rebalancer, but it would be great to confirm that from the logs.
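To make a failure like this easier to diagnose, the test could report which expected nodes received no ranges and whether every range is still accounted for. The following is a minimal, standalone sketch of that check, not code from changefeedccl: the function names `missingNodes` and `totalRanges` are hypothetical, and the per-node counts in `main` are made-up values chosen only so that they sum to 64 with node 4 absent.

```go
package main

import (
	"fmt"
	"sort"
)

// missingNodes returns, in ascending order, the node IDs from expected that
// were assigned zero ranges in rangesPerNode (hypothetical helper).
func missingNodes(expected []int, rangesPerNode map[int]int) []int {
	var missing []int
	for _, id := range expected {
		if rangesPerNode[id] == 0 {
			missing = append(missing, id)
		}
	}
	sort.Ints(missing)
	return missing
}

// totalRanges sums the ranges assigned across all nodes (hypothetical helper).
func totalRanges(rangesPerNode map[int]int) int {
	total := 0
	for _, n := range rangesPerNode {
		total += n
	}
	return total
}

func main() {
	// Illustrative counts only: 64 ranges spread over 5 of the 6 nodes.
	found := map[int]int{1: 13, 2: 13, 3: 13, 5: 13, 6: 12}
	expected := []int{1, 2, 3, 4, 5, 6}

	fmt.Println("missing nodes:", missingNodes(expected, found))
	fmt.Println("total ranges:", totalRanges(found))
}
```

Logging both values on failure would separate "a node was never planned" from "ranges were lost", which is the distinction drawn above.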
I think Jay mentioned that the planner could return fewer nodes than are available, in the thread here: https://cockroachlabs.slack.com/archives/C065X5307U3/p1709060892451439?thread_ts=1709052881.656199&cid=C065X5307U3 Do you know why?
There are a few reasons why this might happen:
I could see 1 or 2 happening in this test. Maybe something happened to node 4 when the test ran, so it was determined to be unhealthy during planning and the ranges it would have had were planned on a different node instead. Or maybe the leaseholders changed from their original assignments during the test, so the oracle didn't return node 4 for any spans. Alternatively, maybe there is some nondeterminism in the test setup that splits the ranges across the 6 nodes, so the oracle didn't return node 4 because it never held leases for any ranges. I took a cursory look at this code and it seemed deterministic, but maybe I missed something. None of these point to a bug in the database code, just things that could be hardened in the test. So I'm going to remove the release blocker label.
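One possible way to harden the test, sketched here under assumptions, is to assert balance only across the nodes the planner actually used rather than across all 6. This is not the existing test logic: the helper name `balancedAcrossUsedNodes`, the tolerance parameter, and the sample counts are all hypothetical, and whether relaxing the assertion this way is acceptable is a call for the test owners.

```go
package main

import "fmt"

// balancedAcrossUsedNodes reports whether the range counts of nodes that
// received at least one range differ from each other by at most tolerance.
// Nodes with zero ranges are ignored; an empty assignment counts as balanced.
func balancedAcrossUsedNodes(rangesPerNode map[int]int, tolerance int) bool {
	first := true
	lo, hi := 0, 0
	for _, n := range rangesPerNode {
		if n == 0 {
			continue // node was not used by the plan
		}
		if first {
			lo, hi = n, n
			first = false
			continue
		}
		if n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	return hi-lo <= tolerance
}

func main() {
	// Illustrative counts only: 64 ranges on 5 nodes, evenly spread.
	even := map[int]int{1: 13, 2: 13, 3: 13, 5: 13, 6: 12}
	// Illustrative counts only: 64 ranges on 5 nodes, heavily skewed.
	skewed := map[int]int{1: 30, 2: 10, 3: 8, 5: 8, 6: 8}

	fmt.Println(balancedAcrossUsedNodes(even, 1))
	fmt.Println(balancedAcrossUsedNodes(skewed, 1))
}
```

A check of this shape would still catch a rebalancer that skews ranges, while tolerating a plan in which one node was skipped for the health or leaseholder reasons described above.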
This comment was marked as duplicate.
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 737df7fe75c5698d9ba384ad322f30963582cc42:
Parameters:
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 347cdc76d4c5abb2e872f325e944337a46b5883f:
Parameters:
Skipping it in #122814.
ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 2a5e231716c436781f12452d800651f51c6383b7:
Parameters:
attempt=1
run=20
shard=1
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-37230