kv/kvserver: TestLeasePreferencesDuringOutage failed #120605

Closed
cockroach-teamcity opened this issue Mar 17, 2024 · 1 comment · Fixed by #120643
Assignees
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 17, 2024

kv/kvserver.TestLeasePreferencesDuringOutage failed on master @ 067e48d29b9093038f6fcf2074cd761ffdcd4fe2:

=== RUN   TestLeasePreferencesDuringOutage
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestLeasePreferencesDuringOutage2038843802
    test_log_scope.go:81: use -show-logs to present logs inline
    client_lease_test.go:1131: condition failed to evaluate within 45s: from client_lease_test.go:1141: expected no replicas in dc=sf, but found replica in dc=sf node_id=2 desc=r68:/{Table/100-Max} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=5, sticky=9223372036.854775807,2147483647]
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestLeasePreferencesDuringOutage2038843802
--- FAIL: TestLeasePreferencesDuringOutage (49.73s)

Parameters:

  • attempt=1
  • deadlock=true
  • run=1
  • shard=25
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-36769

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Mar 17, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 17, 2024
@kvoli kvoli self-assigned this Mar 18, 2024
@kvoli
Collaborator

kvoli commented Mar 18, 2024

I can reproduce this in under 200 runs with:

dev test pkg/kv/kvserver -f TestLeasePreferencesDuringOutage -v --stress -- --define -gotags=bazel,gss,deadlock

This is failing on the up-replication step, where there should be no voters on n2 or n3.

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 18, 2024
kvoli added a commit to kvoli/cockroach that referenced this issue Mar 26, 2024
A replica lease queue was introduced in cockroachdb#119155, which processes lease
transfers for replicas. Previously, the replicate queue handled lease
transfers. The linked change omitted updating the `ReplicationManual`
test cluster mode to disable the lease queue, resulting in unexpected
lease transfers in some tests.

Disable the lease queue under `ReplicationManual` replication mode. Misc
tests are also updated to disable/enable the lease queue appropriately.

Epic: None
Touches: cockroachdb#120605
Release note: None
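
The relationship described in this commit message can be sketched as a mapping from replication mode to the background queues that get switched off. This is a minimal, self-contained model, not CockroachDB's code: `ReplicationMode`, `QueueKnobs`, and `KnobsFor` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// ReplicationMode is a hypothetical stand-in for a test cluster's
// replication mode. It is not CockroachDB's actual type.
type ReplicationMode int

const (
	// ReplicationAuto leaves all background queues running.
	ReplicationAuto ReplicationMode = iota
	// ReplicationManual means the test drives replica and lease changes itself.
	ReplicationManual
)

// QueueKnobs is a hypothetical set of switches for background queues.
type QueueKnobs struct {
	DisableReplicateQueue bool
	DisableLeaseQueue     bool
}

// KnobsFor returns the queue switches implied by a replication mode.
func KnobsFor(mode ReplicationMode) QueueKnobs {
	if mode == ReplicationManual {
		return QueueKnobs{
			DisableReplicateQueue: true,
			// Lease transfers now live in their own queue, so disabling
			// only the replicate queue would still allow lease movement.
			DisableLeaseQueue: true,
		}
	}
	return QueueKnobs{}
}

func main() {
	fmt.Printf("manual: %+v\n", KnobsFor(ReplicationManual))
	fmt.Printf("auto:   %+v\n", KnobsFor(ReplicationAuto))
}
```

The point of the sketch is that once lease transfers move into a separate queue, any mode that promises "no automatic changes" has to switch off both queues, otherwise tests see lease transfers they did not request.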
kvoli added a commit to kvoli/cockroach that referenced this issue Mar 26, 2024
After cockroachdb#118966, lease preference satisfaction is no longer tied to
up-replication. As such, there is no guarantee that a range will
up-replicate before the lease is transferred to satisfy a preference.
This caused `TestLeasePreferencesDuringOutage` to occasionally fail as
the test relies on the replica scanner to enqueue replicas into the
replicate queue to up-replicate, whereas previously replicas would initially be
enqueued into the replicate queue for lease preferences.

Only assert on the lease preference, as this is all that is guaranteed
to occur quickly.

Fixes: cockroachdb#120605
Release note: None
craig bot pushed a commit that referenced this issue Mar 26, 2024
120857: compose: remove PG ComposeCompare test r=rafiss a=rafiss

This test does not provide us much value and is too flaky to be useful. Most of its failures are due to minor differences in things like names, formatting, or precision, and accommodating each of these differences is not worth it.

fixes #109400
fixes #116150
fixes #112154
Release note: None


121052: base,testutils,kvserver: disable lease queue in replication manual  r=arulajmani a=kvoli

A replica lease queue was introduced in #119155, which processes lease
transfers for replicas. Previously, the replicate queue handled lease
transfers. The linked change omitted updating the `ReplicationManual`
test cluster mode to disable the lease queue, resulting in unexpected
lease transfers in some tests.

Disable the lease queue under `ReplicationManual` replication mode. Misc
tests are also updated to disable/enable the lease queue appropriately.

Epic: None
Touches: #120605
Release note: None

121120: roachtest: deflake admission-control/multitenant-fairness r=sumeerbhola a=aadityasondhi

In #120236, we removed the `failed` metadata from the `crdb_internal.statement_statistics` table.

Fixes #120586.
Fixes #120587.
Fixes #120588.
Fixes #120589.

Release note: None

121132: batcheval: move test to `large` RBE pool r=rail a=rickystewart

Epic: CRDB-8308
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Mar 27, 2024
119719: kv: disable circuit breaker on destroyed replica r=lyang24 a=lyang24

This commit disables the circuit breaker on a replica that is destroyed or in the process of being destroyed.

Informs #104567

Release note: None

120643: kvserver: don't wait for replication in outage lease pref test r=andrewbaptist a=kvoli

After #118966, lease preference satisfaction is no longer tied to
up-replication. As such, there is no guarantee that a range will
up-replicate before the lease is transferred to satisfy a preference.
This caused `TestLeasePreferencesDuringOutage` to occasionally fail as
the test relies on the replica scanner to enqueue replicas into the
replicate queue to up-replicate, whereas previously replicas would initially be
enqueued into the replicate queue for lease preferences.

Only assert on the lease preference, as this is all that is guaranteed
to occur quickly.

Fixes: #120605
Release note: None

120727: sql: add crdb_internal.protect_cluster builtin r=dt a=stevendanna

Migration tooling uses historical queries that may need to run for many minutes or hours. Ensuring that these queries can continue to completion requires protecting the table's data from garbage collection.

The new builtin crdb_internal.protect_cluster creates a cluster-wide PTS record at the given timestamp. The timestamp will expire in 24 hours or when the returned job ID is canceled.

We re-purpose the stream ingestion producer job as the owner of this timestamp. This has a couple of advantages over a session-scoped builtin for now:

- Ability for an operator to remove the PTS record explicitly using CANCEL JOB.

- Some visibility into the source of the PTS record via the jobs table.

We use a cluster-wide PTS record rather than a table-specific PTS record because running a historical query as of a given timestamp also requires access to at least the descriptor and namespace tables at that timestamp.

We've re-used the stream ingestion producer job for now as it also gives a simple way for the caller to extend the PTS record beyond 24 hours with crdb_internal.replication_stream_progress(job_id, protectTS).

Epic: CC-27068

Release note: None

Co-authored-by: lyang24 <lanqingy@usc.edu>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig craig bot closed this as completed in 6d92706 Mar 27, 2024