-
Notifications
You must be signed in to change notification settings - Fork 4k
roachtest: deflake multi-store-remove #156620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The test was checking on replication by asserting that the number of ranges times the replication factor is equal to the number of replicas. Replicas can persist after no longer being part of a range (waiting for replicaGC), which is why this test was flaky. I first tried to fix it by lowering the replicaGC queue's interval at which it checks ranges (defaults to 12h), but looking more closely at what the test was doing, I decided to change that instead by no longer caring about what replicas each store reports. What matters is what the meta ranges report, and now the test checks replication similar to how many other tests do, by querying ranges_no_leases. Fixes cockroachdb#146435. Fixes cockroachdb#147763. Fixes cockroachdb#156447. Epic: none
17758fb to
956b1b8
Compare
pav-kv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % nit
|
TFTR! I'll wait for the roachtest run to clear before merge. Re your inline comment: |
|
Beautiful bors r+ |
156620: roachtest: deflake multi-store-remove r=tbg a=tbg The test was checking on replication by asserting that the number of ranges times the replication factor is equal to the number of replicas. Replicas can persist after no longer being part of a range (waiting for replicaGC), which is why this test was flaky. I first tried to fix it by lowering the replicaGC queue's interval at which it checks ranges (defaults to 12h), but looking more closely at what the test was doing, I decided to change that instead by no longer caring about what replicas each store reports. What matters is what the meta ranges report, and now the test checks replication similar to how many other tests do, by querying ranges_no_leases. Fixes #146435. Fixes #147763. Fixes #156447. Epic: none 156626: concurrency: fix a few typos r=arulajmani a=stevendanna Release note: none Epic: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Steven Danna <danna@cockroachlabs.com>
|
Build failed (retrying...): |
156368: kvcoord: more tracing for leaseholder backoffs r=tbg a=tbg Informs #156284. Epic: none 156620: roachtest: deflake multi-store-remove r=tbg a=tbg The test was checking on replication by asserting that the number of ranges times the replication factor is equal to the number of replicas. Replicas can persist after no longer being part of a range (waiting for replicaGC), which is why this test was flaky. I first tried to fix it by lowering the replicaGC queue's interval at which it checks ranges (defaults to 12h), but looking more closely at what the test was doing, I decided to change that instead by no longer caring about what replicas each store reports. What matters is what the meta ranges report, and now the test checks replication similar to how many other tests do, by querying ranges_no_leases. Fixes #146435. Fixes #147763. Fixes #156447. Epic: none Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
|
Build failed (retrying...): |
156472: kvserver: add destroy replica to TestReplicaLifecycleDataDriven r=pav-kv a=arulajmani See individual commits for details. 156546: acceptance: Add COCKROACH_ACCEPTANCE_ALLOW_UNSAFE env var r=alyshanjahani-crl a=alyshanjahani-crl Adding an environment variable to override the allow_unsafe_internals session variable is this simplest way to get the acceptance tests to pass when the default will turn to false in 26.1. This was tested in #156398, and we see the CI acceptance job pass. Fixes: #154663 Release note: None 156573: opt: fix LTREE <@ index constraint spans r=mgartner a=mgartner Fixes #156478 Release note (bug fix): A bug has been fixed that caused incorrect results for queries that filter indexed `LTREE` columns with the `<`@`` (contained-by) operator. This bug has been present since v25.4.0. Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com> 156581: roachtest: set allow_unsafe_internals=true for roachtests r=alyshanjahani-crl a=alyshanjahani-crl This commit instruments the connections used by roachtests with the allow_unsafe_internals session variable set to true. This is in preparation for when the default value of the session variable gets set to false in 26.1. Fixes: #154674 Release note: None 156583: opt/props: make MinSelectivity variadic r=mgartner a=mgartner `MinSelectivity` is now variadic. `MinSelectivity3` is obsolete and has been removed. Release note: None 156584: copy: fix vectorized auto commit behavior r=yuzefovich a=yuzefovich When we implemented the vectorized INSERT which supports COPY in some cases, we missed one condition for auto-committing the txn that is present in the regular `tableWriterBase` path. Namely, we need to check whether the deadline that might be set on the txn hasn't expired yet, and if it has, we shouldn't be auto-committing and should be leaving it up to the connExecutor (which will try to refresh the deadline). The impact of the bug is that often if COPY took longer than 40s (controlled via `server.sqlliveness.ttl`), we'd hit the txn retry error and propagate it to the client. Fixes: #155300. Release note (bug fix): Previously, the "atomic" COPY command (controlled via `copy_from_atomic_enabled`, which is `true` by default) could encounter RETRY_COMMIT_DEADLINE_EXCEEDED txn errors if the whole command took 1 minute or more. This was the case only when the vectorized engine was used for COPY and is now fixed. 156620: roachtest: deflake multi-store-remove r=tbg a=tbg The test was checking on replication by asserting that the number of ranges times the replication factor is equal to the number of replicas. Replicas can persist after no longer being part of a range (waiting for replicaGC), which is why this test was flaky. I first tried to fix it by lowering the replicaGC queue's interval at which it checks ranges (defaults to 12h), but looking more closely at what the test was doing, I decided to change that instead by no longer caring about what replicas each store reports. What matters is what the meta ranges report, and now the test checks replication similar to how many other tests do, by querying ranges_no_leases. Fixes #146435. Fixes #147763. Fixes #156447. Epic: none 156654: sql: add optimizer_use_max_frequency_selectivity session setting r=mgartner a=mgartner The `optimizer_use_max_frequency_selectivity` session setting has been added. It is enabled by default. Disabling it reverts the selectivity improvements added in #151409. Release note: None Co-authored-by: Arul Ajmani <arulajmani@gmail.com> Co-authored-by: alyshanjahani-crl <alyshan@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
|
Build failed (retrying...): |
155133: scbuild: allow multiple column renames r=rafiss a=rafiss Please review each commit individually. informs #148340 ### schemaexpr: add isComputed to ColumnLookupFn ### scbuild: move function definitions to helpers.go ### schemaexpr: add iterColsWithLookupFn This function allows the computed column validation to be used from the declarative schema changer -- it uses a lookup function to access columns rather than referring to the table descriptor directly. ### scbuild: use ValidateComputedColumnExpressionWithLookup function This validates the computed column by referencing the builder state rather than the table descriptor. The table descriptor would not reflect the schema changes that are being staged by the declarative schema changer. ### scplan: relax constraint for cleaning up non-index-backed constraints This rule was matching all non-index-backed constraints, but it only needs to apply to UNIQUE WITHOUT INDEX constraints. ### scplan: tighten rule for transient check constraints This rule was only needed for the ALTER COLUMN TYPE op, so the rule is updated to only apply in that case. ### scbuild: allow multiple column renames This includes new dependency rules so that we fully remove the old name before allowing a new column to take on that name, and another one to make sure that a name is available before we create a computed column expression that uses that name. Release note: None 156472: kvserver: add destroy replica to TestReplicaLifecycleDataDriven r=pav-kv a=arulajmani See individual commits for details. 156584: copy: fix vectorized auto commit behavior r=yuzefovich a=yuzefovich When we implemented the vectorized INSERT which supports COPY in some cases, we missed one condition for auto-committing the txn that is present in the regular `tableWriterBase` path. Namely, we need to check whether the deadline that might be set on the txn hasn't expired yet, and if it has, we shouldn't be auto-committing and should be leaving it up to the connExecutor (which will try to refresh the deadline). The impact of the bug is that often if COPY took longer than 40s (controlled via `server.sqlliveness.ttl`), we'd hit the txn retry error and propagate it to the client. Fixes: #155300. Release note (bug fix): Previously, the "atomic" COPY command (controlled via `copy_from_atomic_enabled`, which is `true` by default) could encounter RETRY_COMMIT_DEADLINE_EXCEEDED txn errors if the whole command took 1 minute or more. This was the case only when the vectorized engine was used for COPY and is now fixed. 156620: roachtest: deflake multi-store-remove r=tbg a=tbg The test was checking on replication by asserting that the number of ranges times the replication factor is equal to the number of replicas. Replicas can persist after no longer being part of a range (waiting for replicaGC), which is why this test was flaky. I first tried to fix it by lowering the replicaGC queue's interval at which it checks ranges (defaults to 12h), but looking more closely at what the test was doing, I decided to change that instead by no longer caring about what replicas each store reports. What matters is what the meta ranges report, and now the test checks replication similar to how many other tests do, by querying ranges_no_leases. Fixes #146435. Fixes #147763. Fixes #156447. Epic: none 156654: sql: add optimizer_use_max_frequency_selectivity session setting r=mgartner a=mgartner The `optimizer_use_max_frequency_selectivity` session setting has been added. It is enabled by default. Disabling it reverts the selectivity improvements added in #151409. Release note: None 156673: build: update `MungeTestXML` and `MergeTestXMLs` behavior r=rail a=rickystewart The `test.xml` produced by `rules_go` changed with the latest upgrade; this change restores the previous behavior. Epic: DEVINF-1477 Closes #156597 Release note: none 156685: sql: append copies of system column descriptors r=bghal a=bghal Previously, the shared descriptors common to system columns meant changes in one could cause side-effects in another. This was of particular concern for the schema changer removing the `ColumnHidden` element during a table drop and consequently exposing the column in other tables. This change appends copies to avoid that side-effect. Part of: #139605 Release note: None Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com> Co-authored-by: Arul Ajmani <arulajmani@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Brendan Gerrity <brendan.gerrity@cockroachlabs.com>
|
Build failed (retrying...): |
|
blathers backport 25.4 25.3 25.2 |
|
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #146435: branch-release-25.4. Issue #147763: branch-release-25.2, branch-release-25.3, branch-release-25.4. Issue #156447: branch-release-25.2, branch-release-25.3, branch-release-25.4. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
The test was checking on replication by asserting that the number of ranges
times the replication factor is equal to the number of replicas.
Replicas can persist after no longer being part of a range (waiting for
replicaGC), which is why this test was flaky.
I first tried to fix it by lowering the replicaGC queue's interval at
which it checks ranges (defaults to 12h), but looking more closely at
what the test was doing, I decided to change that instead by no longer
caring about what replicas each store reports. What matters is what
the meta ranges report, and now the test checks replication similar
to how many other tests do, by querying ranges_no_leases.
Fixes #146435.
Fixes #147763.
Fixes #156447.
Epic: none