Skip to content

Conversation

@tbg
Copy link
Member

@tbg tbg commented Oct 31, 2025

The test was checking on replication by asserting that the number of ranges
times the replication factor is equal to the number of replicas.

Replicas can persist after no longer being part of a range (waiting for
replicaGC), which is why this test was flaky.

I first tried to fix it by lowering the replicaGC queue's interval at
which it checks ranges (defaults to 12h), but looking more closely at
what the test was doing, I decided to change that instead by no longer
caring about what replicas each store reports. What matters is what
the meta ranges report, and now the test checks replication similar
to how many other tests do, by querying ranges_no_leases.

Fixes #146435.
Fixes #147763.
Fixes #156447.

Epic: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member Author

tbg commented Oct 31, 2025

@tbg tbg requested review from nicktrav and pav-kv October 31, 2025 09:58
@tbg tbg marked this pull request as ready for review October 31, 2025 10:00
The test was checking on replication by asserting that the number of ranges
times the replication factor is equal to the number of replicas.

Replicas can persist after no longer being part of a range (waiting for
replicaGC), which is why this test was flaky.

I first tried to fix it by lowering the replicaGC queue's interval at
which it checks ranges (defaults to 12h), but looking more closely at
what the test was doing, I decided to change that instead by no longer
caring about what replicas each store reports. What matters is what
the meta ranges report, and now the test checks replication similar
to how many other tests do, by querying ranges_no_leases.

Fixes cockroachdb#146435.
Fixes cockroachdb#147763.
Fixes cockroachdb#156447.

Epic: none
@tbg tbg force-pushed the multi-store-remove branch from 17758fb to 956b1b8 Compare October 31, 2025 12:16
Copy link
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % nit

@tbg
Copy link
Member Author

tbg commented Oct 31, 2025

TFTR! I'll wait for the roachtest run to clear before merge.

Re your inline comment: cnt provides the global count, so that the test prints the actual amount of outstanding work (alongside a few concrete rangeIDs to look into).

@tbg
Copy link
Member Author

tbg commented Oct 31, 2025

Beautiful

2025/10/31 13:35:58 multi_store_remove.go:162: test status: awaiting range relocation
2025/10/31 13:35:58 multi_store_remove.go:184: 541 ranges remain on removed store; for example: {5,6,7,8,15,17,18,19,20,23}
2025/10/31 13:36:03 multi_store_remove.go:184: 541 ranges remain on removed store; for example: {5,6,7,8,15,17,18,19,20,23}
2025/10/31 13:36:09 multi_store_remove.go:184: 541 ranges remain on removed store; for example: {5,6,7,8,15,17,18,19,20,23}
2025/10/31 13:36:14 multi_store_remove.go:184: 541 ranges remain on removed store; for example: {5,6,7,8,15,17,18,19,20,23}
2025/10/31 13:36:19 multi_store_remove.go:184: 472 ranges remain on removed store; for example: {5,6,7,8,15,17,18,20,23,25}
2025/10/31 13:36:24 multi_store_remove.go:184: 375 ranges remain on removed store; for example: {5,6,7,8,15,17,18,20,23,25}
2025/10/31 13:36:29 multi_store_remove.go:184: 280 ranges remain on removed store; for example: {5,6,7,15,17,18,20,23,29,30}
2025/10/31 13:36:34 multi_store_remove.go:184: 189 ranges remain on removed store; for example: {5,7,17,20,23,29,30,31,32,41}
2025/10/31 13:36:39 multi_store_remove.go:184: 129 ranges remain on removed store; for example: {5,17,23,29,30,31,41,43,49,59}
2025/10/31 13:36:44 multi_store_remove.go:184: 86 ranges remain on removed store; for example: {17,23,31,41,43,59,75,88,97,98}
2025/10/31 13:36:49 multi_store_remove.go:184: 72 ranges remain on removed store; for example: {17,23,31,41,43,59,97,98,102,145}
2025/10/31 13:36:54 multi_store_remove.go:184: 52 ranges remain on removed store; for example: {17,23,31,41,43,59,97,98,102,145}
2025/10/31 13:36:59 multi_store_remove.go:184: 26 ranges remain on removed store; for example: {23,31,59,145,176,220,227,232,259,264}
2025/10/31 13:37:04 multi_store_remove.go:184: 7 ranges remain on removed store; for example: {31,266,308,419,1000,1046,1084}
2025/10/31 13:37:10 multi_store_remove.go:180: all ranges moved off removed store

bors r+

craig bot pushed a commit that referenced this pull request Oct 31, 2025
156620: roachtest: deflake multi-store-remove r=tbg a=tbg

The test was checking on replication by asserting that the number of ranges
times the replication factor is equal to the number of replicas.

Replicas can persist after no longer being part of a range (waiting for
replicaGC), which is why this test was flaky.

I first tried to fix it by lowering the replicaGC queue's interval at
which it checks ranges (defaults to 12h), but looking more closely at
what the test was doing, I decided to change that instead by no longer
caring about what replicas each store reports. What matters is what
the meta ranges report, and now the test checks replication similar
to how many other tests do, by querying ranges_no_leases.

Fixes #146435.
Fixes #147763.
Fixes #156447.

Epic: none


156626: concurrency: fix a few typos r=arulajmani a=stevendanna

Release note: none
Epic: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Oct 31, 2025

Build failed (retrying...):

craig bot pushed a commit that referenced this pull request Oct 31, 2025
156368: kvcoord: more tracing for leaseholder backoffs r=tbg a=tbg

Informs #156284.

Epic: none

156620: roachtest: deflake multi-store-remove r=tbg a=tbg

The test was checking on replication by asserting that the number of ranges
times the replication factor is equal to the number of replicas.

Replicas can persist after no longer being part of a range (waiting for
replicaGC), which is why this test was flaky.

I first tried to fix it by lowering the replicaGC queue's interval at
which it checks ranges (defaults to 12h), but looking more closely at
what the test was doing, I decided to change that instead by no longer
caring about what replicas each store reports. What matters is what
the meta ranges report, and now the test checks replication similar
to how many other tests do, by querying ranges_no_leases.

Fixes #146435.
Fixes #147763.
Fixes #156447.

Epic: none


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@craig
Copy link
Contributor

craig bot commented Oct 31, 2025

Build failed (retrying...):

craig bot pushed a commit that referenced this pull request Oct 31, 2025
156472: kvserver: add destroy replica to TestReplicaLifecycleDataDriven  r=pav-kv a=arulajmani

See individual commits for details.

156546: acceptance: Add COCKROACH_ACCEPTANCE_ALLOW_UNSAFE env var r=alyshanjahani-crl a=alyshanjahani-crl

Adding an environment variable to override the allow_unsafe_internals session variable is this simplest way to get the acceptance tests to pass when the default will turn to false in 26.1.

This was tested in #156398, and we see the CI acceptance job pass.

Fixes: #154663
Release note: None

156573: opt: fix LTREE <@ index constraint spans r=mgartner a=mgartner

Fixes #156478

Release note (bug fix): A bug has been fixed that caused incorrect
results for queries that filter indexed `LTREE` columns with the `<`@``
(contained-by) operator. This bug has been present since v25.4.0.

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>


156581: roachtest: set allow_unsafe_internals=true for roachtests r=alyshanjahani-crl a=alyshanjahani-crl

This commit instruments the connections used by roachtests with the allow_unsafe_internals session variable set to true.

This is in preparation for when the default value of the session variable gets set to false in 26.1.

Fixes: #154674
Release note: None

156583: opt/props: make MinSelectivity variadic r=mgartner a=mgartner

`MinSelectivity` is now variadic. `MinSelectivity3` is obsolete and has
been removed.

Release note: None


156584: copy: fix vectorized auto commit behavior r=yuzefovich a=yuzefovich

When we implemented the vectorized INSERT which supports COPY in some cases, we missed one condition for auto-committing the txn that is present in the regular `tableWriterBase` path. Namely, we need to check whether the deadline that might be set on the txn hasn't expired yet, and if it has, we shouldn't be auto-committing and should be leaving it up to the connExecutor (which will try to refresh the deadline). The impact of the bug is that often if COPY took longer than 40s (controlled via `server.sqlliveness.ttl`), we'd hit the txn retry error and propagate it to the client.

Fixes: #155300.

Release note (bug fix): Previously, the "atomic" COPY command (controlled via `copy_from_atomic_enabled`, which is `true` by default) could encounter RETRY_COMMIT_DEADLINE_EXCEEDED txn errors if the whole command took 1 minute or more. This was the case only when the vectorized engine was used for COPY and is now fixed.

156620: roachtest: deflake multi-store-remove r=tbg a=tbg

The test was checking on replication by asserting that the number of ranges
times the replication factor is equal to the number of replicas.

Replicas can persist after no longer being part of a range (waiting for
replicaGC), which is why this test was flaky.

I first tried to fix it by lowering the replicaGC queue's interval at
which it checks ranges (defaults to 12h), but looking more closely at
what the test was doing, I decided to change that instead by no longer
caring about what replicas each store reports. What matters is what
the meta ranges report, and now the test checks replication similar
to how many other tests do, by querying ranges_no_leases.

Fixes #146435.
Fixes #147763.
Fixes #156447.

Epic: none


156654: sql: add optimizer_use_max_frequency_selectivity session setting r=mgartner a=mgartner

The `optimizer_use_max_frequency_selectivity` session setting has been
added. It is enabled by default. Disabling it reverts the selectivity
improvements added in #151409.

Release note: None


Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: alyshanjahani-crl <alyshan@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@craig
Copy link
Contributor

craig bot commented Oct 31, 2025

Build failed (retrying...):

craig bot pushed a commit that referenced this pull request Oct 31, 2025
155133: scbuild: allow multiple column renames r=rafiss a=rafiss

Please review each commit individually.

informs #148340

### schemaexpr: add isComputed to ColumnLookupFn

### scbuild: move function definitions to helpers.go

### schemaexpr: add iterColsWithLookupFn

This function allows the computed column validation to be used from the
declarative schema changer -- it uses a lookup function to access
columns rather than referring to the table descriptor directly.

### scbuild: use ValidateComputedColumnExpressionWithLookup function

This validates the computed column by referencing the builder state
rather than the table descriptor. The table descriptor would not reflect
the schema changes that are being staged by the declarative schema
changer.

### scplan: relax constraint for cleaning up non-index-backed constraints

This rule was matching all non-index-backed constraints, but it only
needs to apply to UNIQUE WITHOUT INDEX constraints.

### scplan: tighten rule for transient check constraints

This rule was only needed for the ALTER COLUMN TYPE op, so the rule is
updated to only apply in that case.

### scbuild: allow multiple column renames

This includes new dependency rules so that we fully remove the old name
before allowing a new column to take on that name, and another one to
make sure that a name is available before we create a computed column
expression that uses that name.

Release note: None

156472: kvserver: add destroy replica to TestReplicaLifecycleDataDriven  r=pav-kv a=arulajmani

See individual commits for details.

156584: copy: fix vectorized auto commit behavior r=yuzefovich a=yuzefovich

When we implemented the vectorized INSERT which supports COPY in some cases, we missed one condition for auto-committing the txn that is present in the regular `tableWriterBase` path. Namely, we need to check whether the deadline that might be set on the txn hasn't expired yet, and if it has, we shouldn't be auto-committing and should be leaving it up to the connExecutor (which will try to refresh the deadline). The impact of the bug is that often if COPY took longer than 40s (controlled via `server.sqlliveness.ttl`), we'd hit the txn retry error and propagate it to the client.

Fixes: #155300.

Release note (bug fix): Previously, the "atomic" COPY command (controlled via `copy_from_atomic_enabled`, which is `true` by default) could encounter RETRY_COMMIT_DEADLINE_EXCEEDED txn errors if the whole command took 1 minute or more. This was the case only when the vectorized engine was used for COPY and is now fixed.

156620: roachtest: deflake multi-store-remove r=tbg a=tbg

The test was checking on replication by asserting that the number of ranges
times the replication factor is equal to the number of replicas.

Replicas can persist after no longer being part of a range (waiting for
replicaGC), which is why this test was flaky.

I first tried to fix it by lowering the replicaGC queue's interval at
which it checks ranges (defaults to 12h), but looking more closely at
what the test was doing, I decided to change that instead by no longer
caring about what replicas each store reports. What matters is what
the meta ranges report, and now the test checks replication similar
to how many other tests do, by querying ranges_no_leases.

Fixes #146435.
Fixes #147763.
Fixes #156447.

Epic: none


156654: sql: add optimizer_use_max_frequency_selectivity session setting r=mgartner a=mgartner

The `optimizer_use_max_frequency_selectivity` session setting has been
added. It is enabled by default. Disabling it reverts the selectivity
improvements added in #151409.

Release note: None


156673: build: update `MungeTestXML` and `MergeTestXMLs` behavior r=rail a=rickystewart

The `test.xml` produced by `rules_go` changed with the latest upgrade; this change restores the previous behavior.

Epic: DEVINF-1477
Closes #156597
Release note: none

156685: sql: append copies of system column descriptors r=bghal a=bghal

Previously, the shared descriptors common to system columns meant
changes in one could cause side-effects in another. This was of
particular concern for the schema changer removing the `ColumnHidden`
element during a table drop and consequently exposing the column in
other tables. This change appends copies to avoid that side-effect.

Part of: #139605

Release note: None


Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Brendan Gerrity <brendan.gerrity@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Oct 31, 2025

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 31, 2025

@craig craig bot merged commit 98557a6 into cockroachdb:master Oct 31, 2025
24 checks passed
@tbg
Copy link
Member Author

tbg commented Nov 10, 2025

blathers backport 25.4 25.3 25.2

@blathers-crl
Copy link

blathers-crl bot commented Nov 10, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #146435: branch-release-25.4.


Issue #147763: branch-release-25.2, branch-release-25.3, branch-release-25.4.


Issue #156447: branch-release-25.2, branch-release-25.3, branch-release-25.4.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: multi-store-remove failed roachtest: multi-store-remove failed roachtest: multi-store-remove failed

3 participants