Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl: TestTenantLogic_create_index timed out #88202

Closed
yuzefovich opened this issue Sep 20, 2022 · 11 comments · Fixed by #88468
Closed

ccl: TestTenantLogic_create_index timed out #88202

yuzefovich opened this issue Sep 20, 2022 · 11 comments · Fixed by #88468
Assignees
Labels
branch-release-22.2 Used to mark release blockers and technical advisories for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Sep 20, 2022

On the staging build here:

=== RUN   TestTenantLogic_create_index/resume-with-diff-tenant-resume-spans
    logic.go:2819: let $schema_changer_state = on
    logic.go:3069: 
        
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/7835/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/create_index:264: 
        expected success, but found
        (XXUUU) setting updated but timed out waiting to read new value
        set_cluster_setting.go:572: in waitForSettingUpdate()
    logic.go:2015: 
         pq: setting updated but timed out waiting to read new value
[00:06:01] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/7835/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/create_index with config 3node-tenant: 51 tests, 2 failures
panic: test timed out after 59m55s

Seems related to #87201, cc @ajwerner

Jira issue: CRDB-19728

@yuzefovich yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 20, 2022
@yuzefovich yuzefovich added this to Triage in SQL Foundations via automation Sep 20, 2022
@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Sep 20, 2022
@ajwerner ajwerner self-assigned this Sep 20, 2022
@ajwerner
Copy link
Contributor

There's something fishy about this one. I say that because we log the side effect with an event logger before we wait for the new setting value. We don't see a log event for that reset in the test logs.

@ajwerner
Copy link
Contributor

I rescind my analysis. We attach the logging to a different transaction than we use to write the cluster setting (for better or for worse, and most likely, for worse).

@ajwerner
Copy link
Contributor

The timeout here is because the pausepoint was still set when we went to do post-test dropping of the database.

@postamar postamar moved this from Triage to General in SQL Foundations Sep 20, 2022
@blathers-crl
Copy link

blathers-crl bot commented Sep 22, 2022

Hi @ajwerner, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner added the branch-release-22.2 Used to mark release blockers and technical advisories for 22.2 label Sep 22, 2022
@ajwerner
Copy link
Contributor

There's something very fishy going on here. I've added more logging, and it appears that the rangefeed is just missing the deletion event straight. It looks like there's a catch-up scan that somehow misses it or something. I'm honestly not sure. The repro is pretty slow. I'll spend more time digging tomorrow, but it's worrying.

@ajwerner
Copy link
Contributor

I think this might be related to logictest-use-mvcc-range-tombstones-for-point-deletes. @erikgrinaker does that make sense to you? I notice that in all the cases, it seems like the rangefeed does not see the delete operation and it seems like this metamorphic variable is true. At first I thought there was no way this could be it because we'd see it failing all the time, but then I looked more closely, and I noticed that it'll only be possible if the transaction leaves an intent to be resolved. That will only happen if we miss the 1PC optimization. In general, the transaction for set cluster setting will be 1PC, so I think that's what makes this rare.

Right now I'm headed to the office and have it running with that disabled, but I'm realizing as I type this that I could just as well disable 1PC and see whether it fails readily.

@erikgrinaker
Copy link
Contributor

It looks like there's a catch-up scan that somehow misses it or something.

I think this checks out. This is the only externally visible behavior that will differ with logictest-use-mvcc-range-tombstones-for-point-deletes. We still emit real-time point delete rangefeed events, since the tombstone replacement happens after the write (or after intent resolution), but the catchup scan won't see the point tombstones because they no longer exist, and range tombstones are ignored.

Is that the only issue here? If so, I can add a testing knob which replaces rangefeed events for point-sized range tombstones with regular point tombstone events. Alternatively, we can skip the relevant tests when that parameter is enabled.

@ajwerner
Copy link
Contributor

It will mean that any attempt to reset a cluster setting might fail in the rare case that a catch-up scan occurs.

This actually explains some other failures we saw related to enabling settings.

Another option is to somehow make the settingwatcher observe the range deletion tombstones. I suspect that the spanconfig watcher will also be sad about missing entries.

@erikgrinaker
Copy link
Contributor

Ok, I'll look into it, but may not get around to it until tomorrow.

@ajwerner
Copy link
Contributor

Honestly, I'm inclined to do something hacky here like avoid this tombstone on system tables in the resolve intent code.

@erikgrinaker
Copy link
Contributor

That's fine too.

@craig craig bot closed this as completed in #88468 Sep 22, 2022
@craig craig bot closed this as completed in d390148 Sep 22, 2022
SQL Foundations automation moved this from General to Closed Sep 22, 2022
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023
@blathers-crl blathers-crl bot added this to Triage in SQL Foundations May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-22.2 Used to mark release blockers and technical advisories for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
SQL Foundations
  
Done [after migration]
Development

Successfully merging a pull request may close this issue.

3 participants