
ccl/streamingccl/streamingest: TestStreamingReplanOnLag failed #120688

Closed
cockroach-teamcity opened this issue Mar 19, 2024 · 6 comments · Fixed by #124983
Labels: branch-master (Failures and bugs on the master branch) · C-test-failure (Broken test, automatically or manually discovered) · O-robot (Originated from a bot) · P-3 (Issues/test failures with no fix SLA) · skipped-test · T-disaster-recovery


cockroach-teamcity commented Mar 19, 2024

ccl/streamingccl/streamingest.TestStreamingReplanOnLag failed on master @ a2f1f379ee52ceee2b2aa6769f1fa162f8d6b8a7:

=== RUN   TestStreamingReplanOnLag
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestStreamingReplanOnLag4262448888
    test_log_scope.go:81: use -show-logs to present logs inline
    testutils.go:488: condition failed to evaluate within 45s: from testutils.go:521: waiting for stream ingestion job progress 1710827875.000000000,0 to advance beyond 1710827877.574577007,0
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestStreamingReplanOnLag4262448888
--- FAIL: TestStreamingReplanOnLag (53.50s)

Parameters:

  • attempt=1
  • run=1
  • shard=15

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/disaster-recovery


Jira issue: CRDB-36820

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Mar 19, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 19, 2024
@msbutler (Collaborator):

The failure is unrelated to the actual test, so it's likely an infra flake.

@msbutler msbutler added P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 19, 2024
@rickystewart (Collaborator):

@msbutler I'm not sure about that.

  1. This test failed in 80% of the 30 runs that happened last night. I find "infra flake" to be a dubious explanation for why that happened.
  2. The test seems to be failing with similar symptoms elsewhere, e.g. example

@stevendanna (Collaborator):

I took a peek at the logs on this one and the behaviour is fairly confusing to me. I'll backfill some notes, but I'm going to bump this to P-1.

@stevendanna stevendanna added P-1 Issues/test failures with a fix SLA of 1 month and removed P-3 Issues/test failures with no fix SLA labels Mar 19, 2024
@stevendanna (Collaborator):

Here is what I find confusing:

I240319 05:58:13.895615 26373 ccl/streamingccl/streamproducer/event_stream.go:93 ⋮ [T1,Vsystem,n1,client=127.0.0.1:47792,hostssl,user=root] 778  starting physical replication event stream: tenant=10 initial_scan_timestamp=1710827892.699424453,0 previous_replicated_time=0,0
I240319 05:58:13.895679 26373 ccl/streamingccl/streamproducer/event_stream.go:140 ⋮ [T1,Vsystem,n1,client=127.0.0.1:47792,hostssl,user=root] 779  starting event stream with initial scan at 1710827892.699424453,0
I240319 05:58:13.901532 26258 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=952754151792803843›] 780  flushing full span configuration state (49 records)
I240319 05:58:13.908532 26389 ccl/streamingccl/streamproducer/event_stream.go:93 ⋮ [T1,Vsystem,n3,client=127.0.0.1:39846,hostssl,user=root] 781  starting physical replication event stream: tenant=10 initial_scan_timestamp=1710827892.699424453,0 previous_replicated_time=0,0
I240319 05:58:13.908600 26389 ccl/streamingccl/streamproducer/event_stream.go:140 ⋮ [T1,Vsystem,n3,client=127.0.0.1:39846,hostssl,user=root] 782  starting event stream with initial scan at 1710827892.699424453,0
I240319 05:58:13.950779 26380 ccl/streamingccl/streamproducer/event_stream.go:93 ⋮ [T1,Vsystem,n2,client=127.0.0.1:56462,hostssl,user=root] 783  starting physical replication event stream: tenant=10 initial_scan_timestamp=1710827892.699424453,0 previous_replicated_time=0,0
I240319 05:58:13.950848 26380 ccl/streamingccl/streamproducer/event_stream.go:140 ⋮ [T1,Vsystem,n2,client=127.0.0.1:56462,hostssl,user=root] 784  starting event stream with initial scan at 1710827892.699424453,0

But then we time out waiting to reach that initial scan timestamp:

I240319 05:58:16.915729 62 ccl/streamingccl/replicationtestutils/testutils.go:488 ⋮ [-] 821  SucceedsSoon: waiting for stream ingestion job progress 1710827890.000000000,0 to advance beyond 1710827892.699424453,0

That is odd in itself. But even odder is the timestamp in our job progress. At least, it was odd to me until I remembered that we started quantising our checkpoints. My guess is that we forgot to update this test.

@cockroach-teamcity (Member Author):

ccl/streamingccl/streamingest.TestStreamingReplanOnLag failed on master @ f0116ea373a2b87155e7f0264df4f783ce177360:

=== RUN   TestStreamingReplanOnLag
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestStreamingReplanOnLag964593691
    test_log_scope.go:81: use -show-logs to present logs inline
    testutils.go:488: condition failed to evaluate within 45s: from testutils.go:521: waiting for stream ingestion job progress 1710915230.000000000,0 to advance beyond 1710915230.064217292,0
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestStreamingReplanOnLag964593691
--- FAIL: TestStreamingReplanOnLag (53.39s)

Parameters:

  • attempt=1
  • run=1
  • shard=15

See also: How To Investigate a Go Test Failure (internal)


rickystewart added a commit to rickystewart/cockroach that referenced this issue Mar 20, 2024
This test is very flaky.

See cockroachdb#120688

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Mar 20, 2024
120419: sqlstats: simplify transaction latency test r=abarganier,xinhaoz a=dhartunian

Remove need for test case counter which causes a data race.

Fixes: #119580
Epic: None
Release note: None

120633: physicalplan: bias towards streaks in bulk planning r=dt a=dt

The bulk oracle is used when planning large bulk jobs that are expected to involve many or all ranges in a cluster, where every node is likely to be assigned a large number of spans, and where the overall plan, and the specs that represent it, will include a very large number of distinct spans.

These large numbers of distinct spans can increase the cost of executing such a plan. In particular, processes that maintain a frontier of spans processed or not processed, or the time at which they were processed, such as CDC and PCR, have to track far more distinct spans in large clusters.

We can, however, sometimes reduce the number of distinct spans by biasing the assignment of key ranges to nodes during replica selection to pick the same node for sequential ranges. By assigning, say, ten spans to node one, then ten to node two, then ten to node three, each node may end up tracking a single logical span that is 10x wider, rather than ten distinct spans.

We can only bias towards such streaks when the candidate replicas for a span include one on the node that would extend the streak, so this is an opportunistic optimization that depends on replica placement making it an option.

Additionally, we need to be careful when applying such a bias that we still *distribute* work roughly evenly, to achieve the desired overall utilization of the cluster. Thus we only bias towards streaks when the streak is short, or when the node extending the streak remains within some multiple of the least-assigned node's load, reverting to the normal random selection otherwise.

Release note: none.
Epic: none.
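The streak bias described in that commit message can be sketched as a small selection helper. The function name, the streak-length cutoff, and the skew bound below are illustrative assumptions, not the actual physicalplan oracle:

```go
package main

import "fmt"

// pickNode chooses a node for the next span, biasing towards extending the
// current streak (re-picking prevNode) when that node is among the span's
// candidate replicas and is not too far ahead of the least-loaded node.
// Hypothetical sketch; names and thresholds are not CockroachDB's API.
func pickNode(candidates []int, prevNode, streakLen int, load map[int]int, maxSkew int) int {
	minLoad := int(^uint(0) >> 1) // max int
	for _, l := range load {
		if l < minLoad {
			minLoad = l
		}
	}
	for _, c := range candidates {
		if c == prevNode {
			// Extend the streak only while it is short, or while the
			// streaking node stays within maxSkew of the least-loaded node.
			if streakLen < 3 || load[c]-minLoad <= maxSkew {
				return c
			}
		}
	}
	// Otherwise fall back to the least-loaded candidate (the real oracle
	// reverts to its normal random selection here).
	best := candidates[0]
	for _, c := range candidates[1:] {
		if load[c] < load[best] {
			best = c
		}
	}
	return best
}

func main() {
	load := map[int]int{1: 10, 2: 0, 3: 0}
	// Short streak: extend it even though node 1 carries more load.
	fmt.Println(pickNode([]int{1, 2}, 1, 1, load, 3))
	// Long streak with node 1 far ahead of the least-loaded node: abandon it.
	fmt.Println(pickNode([]int{1, 2}, 1, 5, load, 3))
}
```

This captures the trade-off in the commit message: sequential ranges collapse into wider logical spans when placement permits, while the skew bound keeps overall assignment roughly even.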

120766: sql: increase raft command size limit for some tests r=DrewKimball a=DrewKimball

The tests `TestLargeDynamicRows` and `TestLogic_upsert_non_metamorphic` occasionally flake because they set the raft command size limit to the minimum `4MiB`, and their batch size limiting is inexact. This commit prevents the flake by increasing the limit to `5MiB`. Making the batch size limit exact will still be tracked by #117070.

Informs #117070

Release note: None

120769: sqlstats: skip TestSQLStatsCompactor r=abarganier a=dhartunian

Release note: None

120781: streamingest: skip `TestStreamingReplanOnLag` r=rail a=rickystewart

This test is very flaky.

See #120688

Epic: none
Release note: None

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
@stevendanna stevendanna added P-3 Issues/test failures with no fix SLA and removed P-1 Issues/test failures with a fix SLA of 1 month labels May 22, 2024
@stevendanna (Collaborator):

@dt I downgraded this for now, but I think we skipped this after the quantization PR went in. I think some of your subsequent updates probably fixed this and we can unskip it now.

@msbutler msbutler added P-1 Issues/test failures with a fix SLA of 1 month P-3 Issues/test failures with no fix SLA and removed P-3 Issues/test failures with no fix SLA P-1 Issues/test failures with a fix SLA of 1 month labels May 31, 2024
@msbutler msbutler self-assigned this May 31, 2024
craig bot pushed a commit that referenced this issue Jun 3, 2024
124983: streamingccl: unskip TestTenantStreamingReplanOnLag r=kev-cao a=msbutler

Fixes #120688

Release note: none

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@craig craig bot closed this as completed in 8358a3d Jun 3, 2024
Dhruv-Sachdev1313 pushed a commit to Dhruv-Sachdev1313/cockroach that referenced this issue Jun 7, 2024