Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: TestLogic_upsert_non_metamorphic failed due to too large batch size #117070

Closed
cockroach-teamcity opened this issue Dec 25, 2023 · 10 comments
Closed
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Dec 25, 2023

pkg/sql/logictest/tests/fakedist-vec-off/fakedist-vec-off_test.TestLogic_upsert_non_metamorphic failed on master @ 53e4efd389167110d687e5dea8bf915fe2582a25:

    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/testutils/physicalplanutils/fake_resolver.go:27: (FakeResolverForTestCluster)
	NOTICE: .NodeID() called via implicit interface StorageLayerInterface;
HINT: consider using .StorageLayer().NodeID() instead.

pkg/testutils/physicalplanutils/fake_resolver.go:28: (FakeResolverForTestCluster)
	NOTICE: .AdvRPCAddr() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().AdvRPCAddr() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/testutils/physicalplanutils/fake_resolver.go:27: (FakeResolverForTestCluster)
	NOTICE: .NodeID() called via implicit interface StorageLayerInterface;
HINT: consider using .StorageLayer().NodeID() instead.

pkg/testutils/physicalplanutils/fake_resolver.go:28: (FakeResolverForTestCluster)
	NOTICE: .AdvRPCAddr() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().AdvRPCAddr() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
[10:13:42] --- progress: /var/lib/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/fakedist-vec-off/fakedist-vec-off_test_/fakedist-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert_non_metamorphic: 2 statements
[10:13:48] --- done: /var/lib/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/fakedist-vec-off/fakedist-vec-off_test_/fakedist-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert_non_metamorphic with config fakedist-vec-off: 3 tests, 0 failures
    logic.go:4005: 
        /var/lib/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/fakedist-vec-off/fakedist-vec-off_test_/fakedist-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert_non_metamorphic:22: error while processing
    logic.go:4005: 
        /var/lib/engflow/worker/work/3/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/fakedist-vec-off/fakedist-vec-off_test_/fakedist-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert_non_metamorphic:22: 
        expected success, but found
        (XXUUU) command is too large: 4207837 bytes (max: 4194304)
        replica_raft.go:271: in evalAndPropose()
    panic.go:541: -- test log scope end --
test logs left over in: outputs.zip/logTestLogic_upsert_non_metamorphic2229304595
--- FAIL: TestLogic_upsert_non_metamorphic (26.41s)

Parameters:

  • attempt=1
  • race=true
  • run=1
  • shard=41
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-34955

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team labels Dec 25, 2023
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Dec 25, 2023
@michae2
Copy link
Collaborator

michae2 commented Dec 29, 2023

Looks like this is a dupe of #116762, but that was closed by skipping 3-node-tenant under race.

I was not able to reproduce this by running for an hour on a gceworker with:

./dev test pkg/sql/logictest/tests/fakedist-vec-off --filter=TestLogic_upsert_non_metamorphic --race --stress --count 10000

@DrewKimball
Copy link
Collaborator

I notice the mutations continue adding to a batch until the condition *.ApproximateMutationBytes() >= maxBatchByteSize returns true, which means we'll tend to overrun the limit in memory-intensive cases. We could fix that, but since the default mutation batch size is significantly smaller than the default raft command size, it shouldn't be too much of a problem in practice. So, maybe we should just fix the test by slightly increasing the command size limit, say, to 5mb.

@rharding6373 rharding6373 changed the title pkg/sql/logictest/tests/fakedist-vec-off/fakedist-vec-off_test: TestLogic_upsert_non_metamorphic failed sql: TestLogic_upsert_non_metamorphic failed due to too large batch size Jan 7, 2024
@rharding6373 rharding6373 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jan 9, 2024
@rharding6373
Copy link
Collaborator

Removing release-blocker label since we believe this to be an unrealistic batch size setup.

@rharding6373 rharding6373 added the P-3 Issues/test failures with no fix SLA label Jan 9, 2024
@DrewKimball
Copy link
Collaborator

I haven't been able to reproduce, but I'm pretty sure the issue I mentioned here is what's causing the failure. The test uses rows of size ~100KB, and the failures show the limit exceeded by ~10KB.

For the related copy test TestLargeDynamicRows failures, the rows are ~2MB. Copy conservatively uses 1/3 of the max command size as its limit, but also suffers from the problem that it checks whether it's exceeded the limit after adding a row to the next batch. Since a single row is already half the max command size, its easy to create a batch that exceeds the limit.

@rytaft
Copy link
Collaborator

rytaft commented Jan 30, 2024

[triage] maybe we could make limit 5 MB instead of 4 in the test?

DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Mar 20, 2024
The tests `TestLargeDynamicRows` and `TestLogic_upsert_non_metamorphic`
occasionally flake because they set the raft command size limit to the
minimum `4MiB`, and their batch size limiting is inexact. This commit
prevents the flake by increasing the limit to `5MiB`. Making the batch
size limit exact will still be tracked by cockroachdb#117070.

Informs cockroachdb#117070

Release note: None
craig bot pushed a commit that referenced this issue Mar 20, 2024
120419: sqlstats: simplify transaction latency test r=abarganier,xinhaoz a=dhartunian

Remove need for test case counter which causes a data race.

Fixes: #119580
Epic: None
Release note: None

120633: physicalplan: bias towards streaks in bulk planning r=dt a=dt

The bulk oracle is used when planning large, bulk jobs, that are expected to involve many or all ranges in a cluster, where all nodes are likely to be assigned a large number of spans, and overall plan and the specs that represent it will include a very large number of distinct spans.

These large numbers of distinct spans in the specs can increase the cost of executing such a plan. In particular, processes that maintain a frontier of spans processes or not processed or time at which they are processed, such as CDC and PCR, have to track far more distinct spans in large clusters.

We can, however, in some cases reduce this number of distinct spans, by biasing the assignment of key ranges to nodes during replica selection to pick the same node for sequential ranges. By assigning, say, 10 spans to node one, then ten to two, then ten to three, potentially each node is now only tracking one logical span, that is 10x wider, instead of ten distinct spans.

We can bias towards such streaks only when the candidate replicas for a span include one on the node that would extend the streak, so this is an opportunistic optimization that depends on replica placement making it an option.

Additionally we need to be careful when applying such a bias that we still *distribute* work roughly evenly to achieve our desired overall utilization of the cluster. Thus we only bias towards streaks when the streak length is short or when the node on which we are extending a streak remains within some multiple of the least assigned node, reverting to the normal random selection if this is not the case.

Release note: none.
Epic: none.

120766: sql: increase raft command size limit for some tests r=DrewKimball a=DrewKimball

The tests `TestLargeDynamicRows` and `TestLogic_upsert_non_metamorphic` occasionally flake because they set the raft command size limit to the minimum `4MiB`, and their batch size limiting is inexact. This commit prevents the flake by increasing the limit to `5MiB`. Making the batch size limit exact will still be tracked by #117070.

Informs #117070

Release note: None

120769: sqlstats: skip TestSQLStatsCompactor r=abarganier a=dhartunian

Release note: None

120781: streamingest: skip `TestStreamingReplanOnLag` r=rail a=rickystewart

This test is very flaky.

See #120688

Epic: none
Release note: None

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
@DrewKimball DrewKimball removed their assignment Apr 1, 2024
@rytaft rytaft added P-2 Issues/test failures with a fix SLA of 3 months and removed P-3 Issues/test failures with no fix SLA labels Apr 17, 2024
@rytaft
Copy link
Collaborator

rytaft commented Apr 17, 2024

This keeps failing, so increasing the priority to P-2

@mgartner
Copy link
Collaborator

We need to:

@DrewKimball
Copy link
Collaborator

It looks like the roachtest failures are using the default command size limit, so the same fix might not apply.

@yuzefovich
Copy link
Member

I'm lowering this to P-3 since we haven't seen this failure in logic tests in a while and moving it to the backlog. Note that there is a separate issue for the COPY roachtest #121413 which seems more frequent, so that one is still P-2.

@yuzefovich yuzefovich added P-3 Issues/test failures with no fix SLA and removed P-2 Issues/test failures with a fix SLA of 3 months labels May 21, 2024
@yuzefovich
Copy link
Member

We actually haven't had a failure in unit tests since #120766 merged about two months ago. All recent failures were in the roachtest (tracked in #121413 the fix to which is being merged right now), so I think it's safe to close this issue as having been fixed by #120766

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team
Projects
Archived in project
Development

No branches or pull requests

7 participants