Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logictest: select_for_update is flaky #124197

Open
cockroach-teamcity opened this issue May 15, 2024 · 14 comments
Open

logictest: select_for_update is flaky #124197

cockroach-teamcity opened this issue May 15, 2024 · 14 comments
Assignees
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months skipped-test T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 15, 2024

pkg/sql/logictest/tests/local/local_test.TestLogic_select_for_update failed with artifacts on master @ 6300c3c3367ad46ac48bf24915cf0d73cae446a0:

=== RUN   TestLogic_select_for_update
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/e30685d5351bb37f7931bdadd013af1b/logTestLogic_select_for_update2572762351
    test_log_scope.go:81: use -show-logs to present logs inline
    logic.go:2968: 
         
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/11928/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:575: SELECT k, u FROM t3 WHERE u = 2 FOR UPDATE SKIP LOCKED
        expected:
            2  2
            4  2
            
        but found (query options: "rowsort" -> ignore the following ordering of rows) :
            
[04:00:43] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/11928/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update with config local: 141 tests, 1 failures
    logic.go:4135: 
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/11928/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: error while processing
    logic.go:4135: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/11928/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: too many errors encountered, skipping the rest of the input
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e30685d5351bb37f7931bdadd013af1b/logTestLogic_select_for_update2572762351
--- FAIL: TestLogic_select_for_update (1.27s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

Jira issue: CRDB-38757

@mgartner
Copy link
Collaborator

This reproduces quickly with ./dev test pkg/sql/logictest/tests/local -f TestLogic_select_for_update --stress. It looks like something in the last ~50 commits on master is causing this. I'm bisecting now.

@mgartner
Copy link
Collaborator

I'm having trouble bisecting because I think I was overly optimistic on how quickly the failure occurs under stress. Sometimes it will fail in the first few seconds, but other times I'll make it through all 1000 iterations without a failure, on a commit that I believe contains the change.

@cockroach-teamcity
Copy link
Member Author

pkg/sql/logictest/tests/local/local_test.TestLogic_select_for_update failed on master @ 992d4573fb8b71e28717b4d14c4c7ebccb38c412:

=== RUN   TestLogic_select_for_update
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestLogic_select_for_update2654335675
    test_log_scope.go:81: use -show-logs to present logs inline
[07:36:53] setting distsql_workmem='38107B';
    logic.go:2968: 
         
        /var/lib/engflow/worker/work/2/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:575: SELECT k, u FROM t3 WHERE u = 2 FOR UPDATE SKIP LOCKED
        expected:
            2  2
            4  2
            
        but found (query options: "rowsort" -> ignore the following ordering of rows) :
            
[07:36:53] --- done: /var/lib/engflow/worker/work/2/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update with config local: 141 tests, 1 failures
    logic.go:4135: 
        /var/lib/engflow/worker/work/2/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: error while processing
    logic.go:4135: /var/lib/engflow/worker/work/2/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: too many errors encountered, skipping the rest of the input
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestLogic_select_for_update2654335675
--- FAIL: TestLogic_select_for_update (0.91s)

Parameters:

  • attempt=1
  • run=23
  • shard=15
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

Copy link

pkg/sql/logictest/tests/local/local_test.TestLogic_select_for_update failed on master @ e6b993509e6ecdc0a0a88a8363cdd8a206ecf7f3:

=== RUN   TestLogic_select_for_update
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestLogic_select_for_update3919272814
    test_log_scope.go:81: use -show-logs to present logs inline
[19:45:39] setting distsql_workmem='36530B';
    logic.go:2968: 
         
        /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:575: SELECT k, u FROM t3 WHERE u = 2 FOR UPDATE SKIP LOCKED
        expected:
            2  2
            4  2
            
        but found (query options: "rowsort" -> ignore the following ordering of rows) :
            
[19:45:39] --- done: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update with config local: 141 tests, 1 failures
    logic.go:4135: 
        /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: error while processing
    logic.go:4135: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: too many errors encountered, skipping the rest of the input
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestLogic_select_for_update3919272814
--- FAIL: TestLogic_select_for_update (0.77s)

Parameters:

  • attempt=1
  • run=1
  • shard=15
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

pkg/sql/logictest/tests/local/local_test.TestLogic_select_for_update failed with artifacts on master @ c511a3284cd821c40145f151537dd6b971dce1eb:

=== RUN   TestLogic_select_for_update
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/e30685d5351bb37f7931bdadd013af1b/logTestLogic_select_for_update1937186316
    test_log_scope.go:81: use -show-logs to present logs inline
[10:25:50] setting distsql_workmem='75581B';
    logic.go:2968: 
         
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/15423/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:575: SELECT k, u FROM t3 WHERE u = 2 FOR UPDATE SKIP LOCKED
        expected:
            2  2
            4  2
            
        but found (query options: "rowsort" -> ignore the following ordering of rows) :
            
[10:25:51] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/15423/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update with config local: 141 tests, 1 failures
    logic.go:4135: 
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/15423/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: error while processing
    logic.go:4135: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/15423/execroot/com_github_cockroachdb_cockroach/bazel-out/aarch64-fastbuild/bin/pkg/sql/logictest/tests/local/local_test_/local_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/select_for_update:581: too many errors encountered, skipping the rest of the input
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e30685d5351bb37f7931bdadd013af1b/logTestLogic_select_for_update1937186316
--- FAIL: TestLogic_select_for_update (1.51s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@RaduBerinde
Copy link
Member

Can we skip the test for now?

yuzefovich added a commit to yuzefovich/cockroach that referenced this issue May 17, 2024
Informs: cockroachdb#124197.
Epic: None

Release note: None
@yuzefovich yuzefovich changed the title pkg/sql/logictest/tests/local/local_test: TestLogic_select_for_update failed logictest: select_for_update is flaky May 17, 2024
craig bot pushed a commit that referenced this issue May 17, 2024
124344: logictest: skip flaky select_for_update test r=yuzefovich a=yuzefovich

Informs: #124197.
Epic: None

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
miyamo2 added a commit to miyamo2/cockroach that referenced this issue May 17, 2024
mod comment

lease: remove outdated NightlyStress call

This test config is not used anymore.

Release note: None

roachtest: increase timeout of acceptance/multitenant-multiregion

This has flaked due to timeout in CI.

Release note: None

sqlerrors: move gloval var errNoZoneConfigApplies into sqlerrors package

This commit is a mechnical change that moves the gloval variable
`errNoZoneConfigApplies` from package `sql` to package `sqlerrors`, as
it will be used by the DSC when we support zone config in it.

Informs: cockroachdb#117574
Release note: None

scbuildstmt: Rerewrite a few function with better APIs

This commit cleaned up a few functions using the preferred .FilterXxx()
API to retrieve element from an element set. No functional change is
introduced.

Informs: cockroachdb#117574
Release note: None

roachtest: update dsc job compat to mixed version v241-v242

This patch updates the dsc job compat roachtest to
`declarative_schema_changer/job-compatibility-mixed-version-V241-V242`.

Epic: CRDB-35306
Informs: cockroachdb#116395

Release note: None

admission: introduce elastic io token exhausted duration metric

This patch adds a new metric `elastic_io_tokens_exhausted_duration.kv`.
This is similar to the existing
`admission.granter.io_tokens_exhausted_duration.kv`, but for elastic
traffic.

The patch also does some code refactoring to make it easier to use both
regular and elastic equivalent metrics.

Informs cockroachdb#121574.

Release note: None

sql: expose an ability to request redacted stmt bundles

This commit extends the stmt bundle collection infrastructure to support
requesting redacted stmt bundles. Previously, this was only possible via
an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds
the support via overloads to `crdb_internal.request_statement_bundle`
builtin as well as the server API. This paves the way for requesting the
redacted bundles via the DB Console, but it'll be done separately.

To support this we need to add another boolean column to
`system.statement_diagnostics_requests` table to indicate whether
a request is for the redacted bundle or not. This requires adding
a migration in which we populate the existing rows with all `false`
values.

In order to reduce the amount of code churn and simplify this change,
this commit repurposes some of the existing code that we introduced the
last time we added a migration for plan-gist matching in stmt bundles.
In other words, this commit simultenously removes some stale code and
handles the new column.

Release note: None

sql: simplify usage of some enum settings

This commit adjusts a few enum settings to use `EnumSetting.String` when
defining the session variables to clean up the code a bit.

Release note: None

execinfra: add a sanity check that DistSQL version is not bumped

The last bump was in 23.1, and we won't ever bump DistSQL version in
backwards-incompatible going forward.

Epic: None

Release note: None

Updated the AUTHORS file to include Naveen Setlur

sqlstats: remove unnecessary StatementFingerprintID construction

Previously we were constructing the stmt fingerprint id unnecessarily
in 2 functions in sqlstats:
- The Next function for the StmtStatsIterator
- PopAllStats, which returns the sql stats collected thus far and
resets the in-memory sql stats containers.

The ID is available in the recorded statement in both of these cases-
there's no need to reconstruct this value and doing is needlessly
expensive.

Epic: none

Release note: None

pgwire: add a metric for number of pipelined requests

Release note (ops change): Added the sql.pgwire.pipeline.count metric,
which is a gauge that measures how many wire protocol commands have been
received by the server, but have not yet begun processing. This metric
will only grow if clients are using the "pipeline mode" of the Postgres
wire protocol.

roachtest: skip multi-store-remove while debugging

This roachtest has failed consistently for the past few days, since
being enabled in cockroachdb#123694. Skip while we debug, to avoid noise.

Touches: cockroachdb#123989.

Release note: None.

clusterversion: remove deprecated internal versions

Remove deprecated 23.2 gates - these features will always be enabled
in any cluster that a 24.2 node is part of.

`TODODelete_V23_1` has a lot of uses and will be removed separately.

Epic: none
Release note: None

sql: fix the type of copy_num_retries_per_batch

We were incorrectly using the boolean getter for this integer variable.
The bug was introduced when this variable was added in
1c1d313.

Release note: None

sql: fix cost_scans_with_default_col_size

This commit fixes an omission where `cost_scans_with_default_col_size`
session variable didn't actually use the corresponding cluster setting
for its default value. In other words, the cluster setting actually had
no effect. The bug has been present since this variable was introduced
in af32d9d.

Release note: None

sql: fix up defaults for a few session variables

This commit adjusts global defaults for a few session variables to use
the postgres style (so that the global default value and session
variable value would be stored alike).

Release note: None

sql: improve env collection in stmt bundles

This commit refactors how we collect session variables and cluster
settings into `env.sql` file in the stmt bundle. Previously, we used
a hard-coded list of session variables that were always included while
ignoring all variables not in the list. If a variable had a value
different from the "binary default" (meaning the value that you got
right after the cluster startup, before any cluster setting
modifications are applied), then it would be in a SET statement that
would run on the `bundle recreate`; if the value was the same as
"binary default", then the SET statement was commented out. Also, all
cluster setting values were included into the file as a comment.

This commit makes it so that we include all session variables and
cluster settings that differ from their defaults. This allows us to
recreate the environment while also reducing the overhead of including
everything into `env.sql`.

For session variables further clarification is warranted:
- all read-only variables have been audited, and most are explicitly
excluded. Out of 23 currently present session variables, only 9 were
deemed to be possibly useful during investigations, so they are not
excluded (one of 9 is `crdb_version` which we print out separately, so
it's actually excluded too).
- for writable variables, we include it if its value differs from the
"binary" default (the one that you get on a fresh cluster) or from the
"cluster" default (the one that you get on a fresh session on the
current cluster). The latter kind is obtained via the provided
`settings.Values` container whereas the former is obtained via a global
singleton container.

Additionally, this commit adjusts the SET and SET CLUSTER SETTING
statements to have single quotes around the setting values that need them.
For cluster settings all types work with having single quotes, but for
session variables some (like integers) don't work with quotes while
others (like strings) need them. This commit adds a map for those that
need them as well as a simple test to catch some of the missing ones
(the list might be incomplete given that the test exercises the default
config). All values are adjusted to fit on a single line (we have some
cluster settings that might span multiple lines).

It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster
settings with non-default values.

Release note: None

roachprod: fetch secrets from cloud store

Due to the complexity of fetching the secrets from the secrets
manager, the secrets are now maintained in cloud storage.

Fixes: cockroachdb#117125
Epic: none

roachprod: deploy cockroach

Previously, to upgrade a cluster it had to be done manually by running several
commands to copy over new binaries, drain and stop cockroach and then running
the same start command.

This change introduces a `deploy` command that works similar as `stage` but also
deploys the new version to the cluster by doing all of the above. It does it
node for node to ensure a cluster remains in a healthy state especially if a
load balancer is used to manage connections.

Epic: None
Release Note: None

roachprod: add deploy applications explanation

Epic: None
Release Notes: None

roachtest: add cluster settings operations

This change adds operations to mutate cluster settings. The values are mutated
based on a preset frequency and RNG. For example, if the frequency of a cluster
setting operation is set to "30 minutes" it will only change to a new value
every 30 minutes, even if the operation has been run multiple times within the
same 30-minute window. Running in the same window will just set the same value
again.

Epic: None
Release Note: None

roachtest: fix `OperationRequiresNodes` dependency check

Epic: None
Release Note: None

roachprod: cluster settings operation owners

Epic: None
Release Notes: None

roachprod: fix UT to ignore the yaml format

currently, the test is looking for the exact match of the yaml.
this can fail as the yaml is generated from a map and the values
may not be in the same sequence. a better approach is to use
a yaml parser, which will be done very soon.

Fixes: cockroachdb#124276
Epic: none

pgwire: deflake TestPipelineMetric

The wrong context was used for the server, which meant that it could get
cancelled too early.

Release note: None

roachprod, roachtest: use same cluster name sanitization

Previously, roachtest had it's own function to sanitize
cluster names, while roachprod had it's own function to
verify cluster names. This change removes both and opts
instead to use vm.DNSSafeName.

This change also introduces MalformedClusterNameError
which gives a hint on what is wrong with the name and
tells roachtest not to retry cluster creation.

intentresolver: mark the test heavy to prevent OOM

Epic: none
Release note: none

ui: move enum type to avoid circular dependencies

When importing the ViewMode enum into a unit test, an error is thrown in
localStorage.reducer.ts relating to ViewMode being undefined. This is
occurring due to a circular dependency. To fix, ViewMode is moved to
a separate types.ts file

Release note: None

ui: prevent /_status/nodes calls for secondary tenants

/_status/nodes is not implemented for secondary tenants, resulting in 501
responses in cloud for serverless tenants. To fix, this api is now gated by
a tenant check and will no longer be called by serverless / secondary tenants.

Fixes: CC-26900
Epic: None
Release note: None

cli/doctor: doctor should read from the right jobs table

In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None

doctor: skip validation for dropped descriptors

In some cases, dropped descriptors appear in our `system.descriptors`
table with dangling job mutations without an associated job.
This patch teaches debug doctor examine to skip validation
on such dropped descriptors.

Epic: none

Fixes: cockroachdb#123477
Fixes: cockroachdb#122956

Release note: none

backupccl: test that we actually downloaded all data during online restore

Release note: none

Epic: none

roachtest: check downloaded spans in online restore roundtrip test

Epic: none

Release note: none

workflows: set timeouts for GitHub Actions Essential CI jobs

I've selected the timeouts according to how long each job seems to run
in practice, with a big buffer to allow for load on the remote execution
cluster, etc. The longest-running jobs I've given 60 minutes, with less
time for the shorter jobs.

Epic: CRDB-8308
Release note: None

master: Update pkg/testutils/release/cockroach_releases.yaml

Update pkg/testutils/release/cockroach_releases.yaml with recent values.

Epic: None
Release note: None
Release justification: test-only updates

cdctest: simplify orderValidator.NoteRow code

This patch eliminates unnecessary nesting in orderValidator's
NoteRow method.

Release note: None

cdctest: rename MakeCountValidator to NewCountValidator

This patch renames `MakeCountValidator` to `NewCountValidator` to
reflect that it returns a pointer, not a struct.

Release note: None

roachtest: rename cdc/sink-chaos to cdc/kafka-chaos and add more logging

This patch renames the `cdc/sink-chaos` test to `cdc/kafka-chaos` to
more accurately reflect that it is a Kafka-only test. It also adds
logging for the Kafka chaos loop iteration number so that we can tell
when the Kafka cluster is restarting from the logs.

Release note: None

roachtest: add order validation to CDC Kafka roachtests

This patch adds order validation to CDC Kafka roachtests so that we
can build more confidence in our ordering guarantees. It can be enabled
for a roachtest either by directly setting the `validateOrder` flag on a
`kafkaManager` before creating consumers, or indirectly by setting the
`validateOrder` flag on `kafkaFeedArgs` for tests that use `cdcTester`.

Release note: None

sql: alter primary key hangs if index references exist

Previously, ALTER PRIMARY KEY could hang if an index reference existed
from either a view or function using the index selection syntax. This
was becasue neither the declarative schema changer or legacy schema
changer support updating these references. To address this, this patch
will prevent ALTER PRIMARY KEY if such references exist to any indexes
that will be recreated.

Fixes: cockroachdb#123017

Release note (bug fix): ALTER PRIMARY KEY could hang if the table had
any indexes which were referred to by views or functions using the force
index syntax.

roachprod: remove logging from updatePrometheusTargets

It's hard to tell what the logging is about when it happens in the
context of a roachtest. It's also verbose as it prints one log line
for each VM whenever a test starts cockroach.

Epic: none

Release note: None

roachtest: explicitly close monitor reader after closing session

Previously the reader did not close, even though the session is closed. This
change ensures the reader is closed by closing it after the session has been
closed. This seems to have only been a problem with local clusters.

Fixes: cockroachdb#116314

Epic: None
Release Note: None

roachtest: lower max-upgrades in `tpcc/mixed-headroom`

Avoids timeouts.

Fixes: cockroachdb#124264

Release note: None

scripts: update `drtprod` create drt-large

Bring the creation in line with what is currently running on cockroach-drt. Uses
zone expansion to balance the nodes evenly across regions.

Epic: None
Release Note: None

scripts: update `drtprod` drt-large region

In an effort to save cost on regional network transfers the Warsaw Europe Zone
will now be switched to Columbus.

Epic: None
Release Note: None

backupccl: deflake TestDataDriven_external_connections_privileges

This test already tests the permissions end-to-end by asserting that
the user can or cannot complete the expected actions. No need to print
the system database as well.

Fixes cockroachdb#123379
Release note: None

externalconn: implement `SHOW EXTERNAL CONNECTIONS`.

This new query displays redacted URIs of external connections along
with other valuable information.

Epic: CRDB-15001
Fixes: cockroachdb#85905

Release note (sql change): New queries `SHOW EXTERNAL CONNECTIONS` and
`SHOW EXTERNAL CONNECTION <connection name>` have been added. These queries
display redacted connection URIs and other useful information such as the
connection type. Access to these queries is restricted to the owner of the
connection or users with `USAGE` privilege. Admin or root users will have
unrestricted access to all connections.

kvstorage: speed up scan for range descriptors

On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. This involves going through all
`/Local/Range/Transaction` records and tombstones so we end up reading
through a lot of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - after we decode a descriptor, we seek straight to the next possible
   range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in
   the descriptor.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be
detrimental even when there aren't a lot of keys to skip.

I experimented on a store captured after running kv0 with batch size 2
against a single node. Before this change we step through 353 keys but
the iterator has to internally step through 160k tombstones:
```
range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]);
  scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)...
```

After, we seek around the transaction records and end up doing just a
seek per range:
```
range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones
+‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),›
+‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))...
```

Fixes: cockroachdb#109740

kvserver: read lease under mutex when switching lease type

A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None

logictest: skip flaky select_for_update test

Informs: cockroachdb#124197.
Epic: None

Release note: None

mod comment

roachtest: save cluster earlier if debug-always is set

When the --debug-always flag is passed, we mark the cluster
as saved to let the cluster destroy know not to destroy it.

Previously, we only marked a cluster as saved at the end of
a test. However, if the caller terminated the test through
ctrl c, which also destroys all clusters, it would lead to a
race condition where sometimes the cluster would not be saved
in time and get destroyed.

This change moves the cluster save to immediantly after the
cluster is created.

Release note: none
Fixes: none
Epic: none
@michae2
Copy link
Collaborator

michae2 commented May 17, 2024

I wonder if this is related to #121917?

aadityasondhi pushed a commit to aadityasondhi/cockroach that referenced this issue May 17, 2024
@yuzefovich
Copy link
Member

I bisected this to 25edb96, cc @Dedej-Bergin @rafiss @fqazi thoughts on how that change can affect this test? (I used ./dev test ./pkg/sql/logictest/tests/local -f=TestLogic_select_for_update --stress --cpus=48 --count=1000 on the gceworker for the bisect.)

@yuzefovich yuzefovich added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-queries SQL Queries Team labels May 18, 2024
@blathers-crl blathers-crl bot added this to Triage in SQL Foundations May 18, 2024
@rafiss rafiss self-assigned this May 21, 2024
@rafiss rafiss added the P-1 Issues/test failures with a fix SLA of 1 month label May 21, 2024
@exalate-issue-sync exalate-issue-sync bot removed the P-1 Issues/test failures with a fix SLA of 1 month label May 21, 2024
@rafiss
Copy link
Collaborator

rafiss commented May 23, 2024

I was able to repro with this smaller testdata file:

statement ok
CREATE TABLE t (k INT PRIMARY KEY, v int, FAMILY (k, v))

statement ok
GRANT admin TO testuser

# The SKIP LOCKED wait policy skip rows when a conflicting lock is encountered.

statement ok
INSERT INTO t VALUES (1,1), (2, 2), (3, 3), (4, 4)

statement ok
CREATE TABLE t3 (
  k INT PRIMARY KEY,
  v INT,
  u INT,
  INDEX (u),
  FAMILY (k, v, u)
);
INSERT INTO t3 VALUES (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 2);

statement ok
BEGIN; UPDATE t SET v = 3 WHERE k = 2; UPDATE t3 SET v = 3 WHERE k = 2

user testuser

statement ok
BEGIN

query II rowsort
SELECT * FROM t FOR UPDATE SKIP LOCKED
----
1  1
3  3
4  4

statement ok
UPDATE t SET v = 4 WHERE k = 3

query II rowsort
SELECT * FROM t FOR UPDATE SKIP LOCKED
----
1  1
3  4
4  4

# All columns are available from the secondary index on u, so no index join is
# needed. The secondary index can produce the row where k=2 since the row is
# only locked in the primary index.
query II rowsort
SELECT k, u FROM t3 WHERE u = 2 FOR UPDATE SKIP LOCKED
----
2  2
4  2

That's pretty surprising to me -- this example doesn't use any GRANT <privilege> ... statements, which were the ones that 25edb96 modified.

@exalate-issue-sync exalate-issue-sync bot added the P-1 Issues/test failures with a fix SLA of 1 month label May 24, 2024
@rafiss
Copy link
Collaborator

rafiss commented May 24, 2024

I was able to reproduce this on 9203985 as well, which is the parent commit of 25edb96. I'm starting another bisect to find the first failing commit.

@michae2
Copy link
Collaborator

michae2 commented May 24, 2024

@rafiss is this entirely serializable? If it's read committed, it might be worth waiting until #121917 is fixed.

@rafiss
Copy link
Collaborator

rafiss commented May 24, 2024

This fails all the way back to d93923b from over a year ago (I still haven't found the first failing commit), so it does seem likely to be #121917, and unrelated to the change we made in 25edb96. I will add this back to the SQL Queries board.

@rafiss rafiss removed their assignment May 24, 2024
@rafiss rafiss added T-sql-queries SQL Queries Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels May 24, 2024
@yuzefovich yuzefovich removed this from Triage in SQL Foundations May 24, 2024
@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 24, 2024
@yuzefovich
Copy link
Member

Ack, thanks Rafi. I'm guessing that 25edb96 somehow made the flake more likely.

@rafiss
Copy link
Collaborator

rafiss commented May 24, 2024

My repro with the example above in pkg/sql/logictest/testdata/logic_test/_tmp running with ./dev test ./pkg/sql/logictest/tests/local -f=TestLogic_tmp --stress --count=1000 fails just as quickly before or after 25edb96.

@yuzefovich yuzefovich added P-2 Issues/test failures with a fix SLA of 3 months and removed P-1 Issues/test failures with a fix SLA of 1 month labels May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months skipped-test T-sql-queries SQL Queries Team
Projects
Status: Active
Development

No branches or pull requests

6 participants