
ccl/streamingccl/streamingest: TestStreamingReplanOnLag failed #120688

Closed
cockroach-teamcity opened this issue Mar 19, 2024 · 6 comments · Fixed by #124983
Labels: branch-master (Failures and bugs on the master branch) · C-test-failure (Broken test, automatically or manually discovered) · O-robot (Originated from a bot) · P-3 (Issues/test failures with no fix SLA) · skipped-test · T-disaster-recovery


cockroach-teamcity commented Mar 19, 2024

ccl/streamingccl/streamingest.TestStreamingReplanOnLag failed on master @ a2f1f379ee52ceee2b2aa6769f1fa162f8d6b8a7:

=== RUN   TestStreamingReplanOnLag
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestStreamingReplanOnLag4262448888
    test_log_scope.go:81: use -show-logs to present logs inline
    testutils.go:488: condition failed to evaluate within 45s: from testutils.go:521: waiting for stream ingestion job progress 1710827875.000000000,0 to advance beyond 1710827877.574577007,0
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestStreamingReplanOnLag4262448888
--- FAIL: TestStreamingReplanOnLag (53.50s)

Parameters:

  • attempt=1
  • run=1
  • shard=15

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/disaster-recovery


Jira issue: CRDB-36820

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Mar 19, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 19, 2024
@msbutler (Collaborator):

The failure is unrelated to the actual test, so it's likely an infra flake.

@msbutler msbutler added P-3 Issues/test failures with no fix SLA and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 19, 2024
@rickystewart (Collaborator):

@msbutler I'm not sure about that.

  1. This test failed in 80% of the 30 runs that happened last night. I find "infra flake" to be a dubious explanation for why that happened.
  2. The test seems to be failing with similar symptoms elsewhere, e.g. example

@stevendanna (Collaborator):

I took a peek at the logs on this one and the behaviour is fairly confusing to me. I'll backfill some notes, but I'm going to bump this to P-1.

@stevendanna stevendanna added P-1 Issues/test failures with a fix SLA of 1 month and removed P-3 Issues/test failures with no fix SLA labels Mar 19, 2024
@stevendanna (Collaborator):

Here is what I find confusing:

I240319 05:58:13.895615 26373 ccl/streamingccl/streamproducer/event_stream.go:93 ⋮ [T1,Vsystem,n1,client=127.0.0.1:47792,hostssl,user=root] 778  starting physical replication event stream: tenant=10 initial_scan_timestamp=1710827892.699424453,0 previous_replicated_time=0,0
I240319 05:58:13.895679 26373 ccl/streamingccl/streamproducer/event_stream.go:140 ⋮ [T1,Vsystem,n1,client=127.0.0.1:47792,hostssl,user=root] 779  starting event stream with initial scan at 1710827892.699424453,0
I240319 05:58:13.901532 26258 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=952754151792803843›] 780  flushing full span configuration state (49 records)
I240319 05:58:13.908532 26389 ccl/streamingccl/streamproducer/event_stream.go:93 ⋮ [T1,Vsystem,n3,client=127.0.0.1:39846,hostssl,user=root] 781  starting physical replication event stream: tenant=10 initial_scan_timestamp=1710827892.699424453,0 previous_replicated_time=0,0
I240319 05:58:13.908600 26389 ccl/streamingccl/streamproducer/event_stream.go:140 ⋮ [T1,Vsystem,n3,client=127.0.0.1:39846,hostssl,user=root] 782  starting event stream with initial scan at 1710827892.699424453,0
I240319 05:58:13.950779 26380 ccl/streamingccl/streamproducer/event_stream.go:93 ⋮ [T1,Vsystem,n2,client=127.0.0.1:56462,hostssl,user=root] 783  starting physical replication event stream: tenant=10 initial_scan_timestamp=1710827892.699424453,0 previous_replicated_time=0,0
I240319 05:58:13.950848 26380 ccl/streamingccl/streamproducer/event_stream.go:140 ⋮ [T1,Vsystem,n2,client=127.0.0.1:56462,hostssl,user=root] 784  starting event stream with initial scan at 1710827892.699424453,0

But then we time out waiting to reach that initial scan timestamp:

I240319 05:58:16.915729 62 ccl/streamingccl/replicationtestutils/testutils.go:488 ⋮ [-] 821  SucceedsSoon: waiting for stream ingestion job progress 1710827890.000000000,0 to advance beyond 1710827892.699424453,0

That is odd in itself. But even odder is the timestamp in our job progress. At least, it was odd to me until I remembered that we started quantising our checkpoints. My guess is that we forgot to update this test.

@cockroach-teamcity (Member Author):

ccl/streamingccl/streamingest.TestStreamingReplanOnLag failed on master @ f0116ea373a2b87155e7f0264df4f783ce177360:

=== RUN   TestStreamingReplanOnLag
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestStreamingReplanOnLag964593691
    test_log_scope.go:81: use -show-logs to present logs inline
    testutils.go:488: condition failed to evaluate within 45s: from testutils.go:521: waiting for stream ingestion job progress 1710915230.000000000,0 to advance beyond 1710915230.064217292,0
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestStreamingReplanOnLag964593691
--- FAIL: TestStreamingReplanOnLag (53.39s)

Parameters:

  • attempt=1
  • run=1
  • shard=15

See also: How To Investigate a Go Test Failure (internal)


rickystewart added a commit to rickystewart/cockroach that referenced this issue Mar 20, 2024
This test is very flaky.

See cockroachdb#120688

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Mar 20, 2024
120419: sqlstats: simplify transaction latency test r=abarganier,xinhaoz a=dhartunian

Remove need for test case counter which causes a data race.

Fixes: #119580
Epic: None
Release note: None

120633: physicalplan: bias towards streaks in bulk planning r=dt a=dt

The bulk oracle is used when planning large bulk jobs that are expected to involve many or all ranges in a cluster, where every node is likely to be assigned a large number of spans, and where the overall plan, and the specs that represent it, will include a very large number of distinct spans.

These large numbers of distinct spans can increase the cost of executing such a plan. In particular, processes that maintain a frontier of spans processed or not processed, or the time at which they were processed, such as CDC and PCR, have to track far more distinct spans in large clusters.

We can, however, sometimes reduce the number of distinct spans by biasing the assignment of key ranges to nodes during replica selection to pick the same node for sequential ranges. By assigning, say, ten spans to node one, then ten to node two, then ten to node three, each node may end up tracking a single logical span that is 10x wider, rather than ten distinct spans.

We can only bias towards such streaks when the candidate replicas for a span include one on the node that would extend the streak, so this is an opportunistic optimization that depends on replica placement making it an option.

Additionally, we need to be careful when applying such a bias that we still *distribute* work roughly evenly, to achieve the desired overall utilization of the cluster. Thus we only bias towards streaks when the streak is short, or when the node extending the streak remains within some multiple of the least-assigned node's load, reverting to the normal random selection otherwise.

Release note: none.
Epic: none.
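The streak bias described in that commit message can be sketched as a small selection helper. The function name, the streak-length cutoff, and the skew bound below are illustrative assumptions, not the actual physicalplan oracle:

```go
package main

import "fmt"

// pickNode chooses a node for the next span, biasing towards extending the
// current streak (re-picking prevNode) when that node is among the span's
// candidate replicas and is not too far ahead of the least-loaded node.
// Hypothetical sketch; names and thresholds are not CockroachDB's API.
func pickNode(candidates []int, prevNode, streakLen int, load map[int]int, maxSkew int) int {
	minLoad := int(^uint(0) >> 1) // max int
	for _, l := range load {
		if l < minLoad {
			minLoad = l
		}
	}
	for _, c := range candidates {
		if c == prevNode {
			// Extend the streak only while it is short, or while the
			// streaking node stays within maxSkew of the least-loaded node.
			if streakLen < 3 || load[c]-minLoad <= maxSkew {
				return c
			}
		}
	}
	// Otherwise fall back to the least-loaded candidate (the real oracle
	// reverts to its normal random selection here).
	best := candidates[0]
	for _, c := range candidates[1:] {
		if load[c] < load[best] {
			best = c
		}
	}
	return best
}

func main() {
	load := map[int]int{1: 10, 2: 0, 3: 0}
	// Short streak: extend it even though node 1 carries more load.
	fmt.Println(pickNode([]int{1, 2}, 1, 1, load, 3))
	// Long streak with node 1 far ahead of the least-loaded node: abandon it.
	fmt.Println(pickNode([]int{1, 2}, 1, 5, load, 3))
}
```

This captures the trade-off in the commit message: sequential ranges collapse into wider logical spans when placement permits, while the skew bound keeps overall assignment roughly even.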

120766: sql: increase raft command size limit for some tests r=DrewKimball a=DrewKimball

The tests `TestLargeDynamicRows` and `TestLogic_upsert_non_metamorphic` occasionally flake because they set the raft command size limit to the minimum `4MiB`, and their batch size limiting is inexact. This commit prevents the flake by increasing the limit to `5MiB`. Making the batch size limit exact will still be tracked by #117070.

Informs #117070

Release note: None

120769: sqlstats: skip TestSQLStatsCompactor r=abarganier a=dhartunian

Release note: None

120781: streamingest: skip `TestStreamingReplanOnLag` r=rail a=rickystewart

This test is very flaky.

See #120688

Epic: none
Release note: None

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
@stevendanna stevendanna added P-3 Issues/test failures with no fix SLA and removed P-1 Issues/test failures with a fix SLA of 1 month labels May 22, 2024
@stevendanna (Collaborator):

@dt I downgraded this for now, but I think we skipped this after the quantization PR went in. I think some of your subsequent updates probably fixed this and we can unskip it now.

@msbutler msbutler added P-1 Issues/test failures with a fix SLA of 1 month P-3 Issues/test failures with no fix SLA and removed P-3 Issues/test failures with no fix SLA P-1 Issues/test failures with a fix SLA of 1 month labels May 31, 2024
@msbutler msbutler self-assigned this May 31, 2024
craig bot pushed a commit that referenced this issue Jun 3, 2024
124983: streamingccl: unskip TestTenantStreamingReplanOnLag r=kev-cao a=msbutler

Fixes #120688

Release note: none

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@craig craig bot closed this as completed in 8358a3d Jun 3, 2024
Dhruv-Sachdev1313 pushed a commit to Dhruv-Sachdev1313/cockroach that referenced this issue Jun 7, 2024