ccl/changefeedccl: TestChangefeedWithSimpleDistributionStrategy failed #121408

Open
cockroach-teamcity opened this issue Mar 30, 2024 · 10 comments
Labels
A-cdc Change Data Capture branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-cdc

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 30, 2024

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 2a5e231716c436781f12452d800651f51c6383b7:

=== RUN   TestChangefeedWithSimpleDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy41154234
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:345: found partitions: [{1 /Tenant/10/Table/104/1{-/2}, /Tenant/10/Table/104/1/{9-10}, /Tenant/10/Table/104/1/1{2-4}, /Tenant/10/Table/104/1/{41-50} true 14} {2 /Tenant/10/Table/104/1/{2-4}, /Tenant/10/Table/104/1/1{0-2}, /Tenant/10/Table/104/1/1{5-6}, /Tenant/10/Table/104/1/{32-41} true 14} {3 /Tenant/10/Table/104/1/{4-9}, /Tenant/10/Table/104/1/1{4-5}, /Tenant/10/Table/104/1/1{6-8} true 8} {6 /Tenant/10/Table/104/{1/50-2} true 14} {5 /Tenant/10/Table/104/1/{18-32} true 14}]
    changefeed_dist_test.go:458: range counts: [14 14 8 0 14 14 0 0]
    changefeed_dist_test.go:522: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
        	Error:      	"14" is not less than or equal to "12"
        	Test:       	TestChangefeedWithSimpleDistributionStrategy
        	Messages:   	counts [14 14 8 0 14 14 0 0] contains value greater than upper bound 12
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy41154234
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (168.55s)

Parameters:

  • attempt=1
  • run=20
  • shard=1
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-37230

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cdc labels Mar 30, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 30, 2024
@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Mar 30, 2024
@rharding6373 rharding6373 added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 2, 2024
@rharding6373
Collaborator

This test expects the 64 ranges to be evenly distributed across 6 nodes, but they were only distributed to 5 (note that node 4 is missing from the "found partitions" list). The balanced distribution algorithm only distributes to nodes initially assigned by distsql PartitionSpans, so it's most likely that distsql only used 5/6 nodes. All 64 ranges are accounted for, so distsql may not have used node 4 for some reason. Did something happen to that node during the test? I was unable to replicate this test using the random seed.

I think the test could be made more resilient, and we should look into whether we can enable the expensive logging for the test so that we have useful debug information should this happen again.

Before removing the release blocker label, let's try to stress the test with the new logging around the rebalancer enabled and see if we can get a repro. I don't think this is a bug in the changefeed rebalancer, but it would be great to confirm that using the logs.
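
As a quick sanity check of the arithmetic behind this, here is a minimal Go sketch. The helper names `minPossibleMax` and `summarize` are made up for illustration and are not part of the CockroachDB test; the range counts and the upper bound of 12 are copied from the failure reports in this issue. It computes the best possible maximum per-node count (ceiling division) for the number of nodes that actually received ranges. With 64 ranges on 5 nodes that best case is 13, so the bound of 12 cannot be met no matter how well the rebalancer does.

```go
package main

import "fmt"

// minPossibleMax returns the smallest achievable maximum per-node range count
// when totalRanges are spread over the given number of participating nodes.
// This is ceiling division; nodes must be greater than zero.
func minPossibleMax(totalRanges, nodes int) int {
	return (totalRanges + nodes - 1) / nodes
}

// summarize returns the total number of ranges, the number of nodes that
// received at least one range, and the largest per-node count.
func summarize(counts []int) (total, participants, maxCount int) {
	for _, c := range counts {
		total += c
		if c > 0 {
			participants++
		}
		if c > maxCount {
			maxCount = c
		}
	}
	return total, participants, maxCount
}

func main() {
	const upperBound = 12 // taken from the assertion message in the reports

	// "range counts" lines copied from the failure reports in this issue.
	reports := [][]int{
		{14, 14, 8, 0, 14, 14, 0, 0},
		{0, 0, 0, 0, 32, 32, 0, 0},
		{14, 0, 14, 10, 13, 13, 0, 0},
	}
	for _, counts := range reports {
		total, participants, maxCount := summarize(counts)
		best := minPossibleMax(total, participants)
		fmt.Printf("total=%d participants=%d max=%d bestPossibleMax=%d boundReachable=%t\n",
			total, participants, maxCount, best, best <= upperBound)
	}
}
```

For comparison, `minPossibleMax(64, 6)` is 11, which is why the bound is reachable only when all 6 nodes take part in the plan.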

@wenyihu6
Contributor

wenyihu6 commented Apr 3, 2024

I think Jay mentioned that the planner could return fewer nodes than are available in the thread here https://cockroachlabs.slack.com/archives/C065X5307U3/p1709060892451439?thread_ts=1709052881.656199&cid=C065X5307U3. Do you happen to know why?

@rharding6373
Collaborator

There are a few reasons why this might happen:

  1. The oracle didn't use a node (e.g., not a leaseholder, or isn't a replica depending on the type of oracle)
  2. The node is determined to be unhealthy at the time of planning
  3. There's a locality filter that doesn't include a node (CDC doesn't care about this, since we don't want to plan changefeeds on nodes outside of the locality filter)

I could see 1 or 2 happening in this test. Maybe something happened to node 4 when the test ran, so it was determined to be unhealthy during planning and the ranges it would have had were planned on a different node instead. Or maybe during the test the leaseholders changed from their original assignments and so the oracle didn't return node 4 for any spans.

Alternatively, maybe there is some nondeterminism in the test setup that splits the ranges across the 6 nodes, so the oracle didn't return node 4 because it never had leases for any ranges. I took a cursory look at this code and it seemed deterministic, but maybe I missed something.

None of these point to a bug in the database code, just things that could be hardened in the test. So I'm going to remove the release blocker label.
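
One possible shape for that hardening, as a rough Go sketch only: `checkDistribution` is a hypothetical helper, not something that exists in `changefeed_dist_test.go`, and it assumes that index i of the "range counts" slice corresponds to node i+1, as the logs above suggest. The idea is to report "the planner used fewer nodes than expected" separately from "the ranges were poorly balanced", so that a failure caused by reasons 1 or 2 is not misread as a rebalancer problem.

```go
package main

import "fmt"

// checkDistribution is a hypothetical helper for illustration. It first checks
// how many nodes received any ranges, and only applies the balance bound when
// the expected number of nodes took part in the plan.
func checkDistribution(counts []int, expectedNodes, upperBound int) error {
	participants := 0
	for _, c := range counts {
		if c > 0 {
			participants++
		}
	}
	if participants < expectedNodes {
		// Planning-side cause (oracle, node health): not a balance failure.
		return fmt.Errorf("planner used %d of %d expected nodes; balance bound %d not checked",
			participants, expectedNodes, upperBound)
	}
	for i, c := range counts {
		if c > upperBound {
			// Assumes counts[i] belongs to node i+1.
			return fmt.Errorf("node %d has %d ranges, above upper bound %d", i+1, c, upperBound)
		}
	}
	return nil
}

func main() {
	// Counts from the first failure report in this issue.
	fmt.Println(checkDistribution([]int{14, 14, 8, 0, 14, 14, 0, 0}, 6, 12))
	// A made-up balanced distribution of 64 ranges over 6 nodes.
	fmt.Println(checkDistribution([]int{11, 11, 11, 11, 10, 10, 0, 0}, 6, 12))
}
```

Whether the test should skip, retry, or fail with a distinct message in the first case is a design choice left open here; the sketch only separates the two conditions.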

@rharding6373 rharding6373 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 3, 2024
@cockroach-teamcity

This comment was marked as duplicate.

@rharding6373 rharding6373 added the P-2 Issues/test failures with a fix SLA of 3 months label Apr 17, 2024
@wenyihu6 wenyihu6 removed their assignment Apr 17, 2024
@cockroach-teamcity
Member Author

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 737df7fe75c5698d9ba384ad322f30963582cc42:

=== RUN   TestChangefeedWithSimpleDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2950077343
    test_log_scope.go:81: use -show-logs to present logs inline
    changefeed_dist_test.go:345: found partitions: [{5 /Table/104/1{-/32} true 32} {6 /Table/104/{1/32-2} true 32}]
    changefeed_dist_test.go:458: range counts: [0 0 0 0 32 32 0 0]
    changefeed_dist_test.go:522: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
        	Error:      	"32" is not less than or equal to "12"
        	Test:       	TestChangefeedWithSimpleDistributionStrategy
        	Messages:   	counts [0 0 0 0 32 32 0 0] contains value greater than upper bound 12
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy2950077343
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (104.66s)

Parameters:

  • attempt=1
  • run=12
  • shard=1

@cockroach-teamcity
Member Author

ccl/changefeedccl.TestChangefeedWithSimpleDistributionStrategy failed on master @ 347cdc76d4c5abb2e872f325e944337a46b5883f:

=== RUN   TestChangefeedWithSimpleDistributionStrategy
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy1006945489
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    changefeed_dist_test.go:345: found partitions: [{1 /Tenant/10/Table/104/1{-/9}, /Tenant/10/Table/104/1/1{4-8}, /Tenant/10/Table/104/1/3{1-2} true 14} {3 /Tenant/10/Table/104/1/4{0-5}, /Tenant/10/Table/104/1/4{8-9}, /Tenant/10/Table/104/1/5{0-1}, /Tenant/10/Table/104/1/5{2-3}, /Tenant/10/Table/104/1/5{5-7}, /Tenant/10/Table/104/1/{59-60}, /Tenant/10/Table/104/{1/61-2} true 14} {4 /Tenant/10/Table/104/1/{9-14}, /Tenant/10/Table/104/1/3{3-4}, /Tenant/10/Table/104/1/3{5-7}, /Tenant/10/Table/104/1/{38-40} true 10} {5 /Tenant/10/Table/104/1/{18-31} true 13} {6 /Tenant/10/Table/104/1/3{2-3}, /Tenant/10/Table/104/1/3{4-5}, /Tenant/10/Table/104/1/3{7-8}, /Tenant/10/Table/104/1/4{5-8}, /Tenant/10/Table/104/1/{49-50}, /Tenant/10/Table/104/1/5{1-2}, /Tenant/10/Table/104/1/5{3-5}, /Tenant/10/Table/104/1/5{7-9}, /Tenant/10/Table/104/1/6{0-1} true 13}]
    changefeed_dist_test.go:458: range counts: [14 0 14 10 13 13 0 0]
    changefeed_dist_test.go:522: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist_test.go:522
        	Error:      	"14" is not less than or equal to "12"
        	Test:       	TestChangefeedWithSimpleDistributionStrategy
        	Messages:   	counts [14 0 14 10 13 13 0 0] contains value greater than upper bound 12
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestChangefeedWithSimpleDistributionStrategy1006945489
--- FAIL: TestChangefeedWithSimpleDistributionStrategy (167.45s)

Parameters:

  • attempt=1
  • run=1
  • shard=1

@wenyihu6
Contributor

Skipping it in #122814.
