Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: migrate jobs to write their Payload and Progress to the jobs_info table #97762

Closed
adityamaru opened this issue Feb 27, 2023 · 6 comments · Fixed by #98608
Closed

jobs: migrate jobs to write their Payload and Progress to the jobs_info table #97762

adityamaru opened this issue Feb 27, 2023 · 6 comments · Fixed by #98608
Assignees
Labels
A-disaster-recovery A-jobs branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker T-jobs

Comments

@adityamaru
Copy link
Contributor

adityamaru commented Feb 27, 2023

In #84866 we introduced a system.jobs_info table that is to be used to store job-specific state. This table allows a job to access and update information about its execution without adversely impacting the execution of other jobs. This is explained in detail in #82638.

As a first step, we want to teach jobs to write their Payload and Progress in their current form but to the jobs_info table instead of the jobs table. This migration will roughly involve:

  1. Double writes for the Payload and Progress of jobs to both the jobs_info and jobs table.

  2. A migration to backfill job_info with the Payload and Progress of all job records in the system.jobs table.

  3. A change to read methods in the registry to pull the Payload and Progress from the job_info table instead of the jobs table.

In the future, jobs may want to decompose their Payload and Progress protos further and write different fields as their own rows in the jobs_info table. This is outside the scope of this issue.

Jira issue: CRDB-24874

Epic CRDB-8964

@adityamaru adityamaru added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-jobs labels Feb 27, 2023
@adityamaru adityamaru self-assigned this Feb 27, 2023
@blathers-crl blathers-crl bot added the T-jobs label Feb 27, 2023
@blathers-crl blathers-crl bot added this to Triage in Jobs Feb 27, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Mar 12, 2023
This change adds a migration and corresponding cluster version
after which every job entry in the system.jobs table will have its
Payload and Progress written to two rows in the system.job_info table.

Informs: cockroachdb#97762

Release note: None
@adityamaru adityamaru added GA-blocker branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 A-disaster-recovery labels Mar 13, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 13, 2023

cc @cockroachdb/disaster-recovery

@shermanCRL
Copy link
Contributor

cc @amruss

adityamaru added a commit to adityamaru/cockroach that referenced this issue Mar 14, 2023
This change adds a migration and corresponding cluster version
after which every job entry in the system.jobs table will have its
Payload and Progress written to two rows in the system.job_info table.

Informs: cockroachdb#97762

Release note: None
craig bot pushed a commit that referenced this issue Mar 14, 2023
97138: ui: add error code to stmt and txn insights details pages r=gtr a=gtr

Part of: #87785.

Previously, the stmt and txn insights details pages did not show any
further information for failed executions. This commit adds an "error
code" column to the insights table for a failed execution in the stmt
and txn insights details pages. Additionally, a "status" column was
added to the stmt and txn workload insights tables which is either
"Completed" or "Failed".

Future work involves adding the error message string in addition to the
error code but it needs to be redacted first. Additionally, the txn
status is missing the implementation of a "Cancelled" status.

Note to reviewers: only consider the second commit, as the first is 
required to get the txn status.

- Loom [demo](https://www.loom.com/share/e82b97ff9f034d82b98640170eb54408).

Release note (ui change): Adds error code column to the insights table
for a failed execution in the stmt and txn insights details page. Adds
status column to the stmt and txn workload insights tables.

98410: cluster-ui: tenants use sqlstats-supplied regions r=matthewtodd a=matthewtodd

Fixes #98056.

As of #95449, the SQL Activity pages in the DB Console can draw regions information directly from the sqlstats tables, rather than having to translate node IDs to regions on page load.
    
Here, we make that switch, but for non-system tenants only, because:
    
1. The ephemeral nature of serverless nodes made this view-time mapping especially problematic in that context. (See further notes in #95449.)
    
2. The system-tenant views also include KV node IDs in a special Regions/Nodes column, which we are unable to recreate given the backend storage structure. (Future design work might suggest removing these node IDs altogether, for a unified UI.)

# Screenshots!
## Statements, with and without regions filter
<img width="1372" alt="statements" src="https://user-images.githubusercontent.com/5261/225033247-739df90a-9173-4aab-a666-a61a1ceeb579.png">
<img width="1372" alt="statements - filtered" src="https://user-images.githubusercontent.com/5261/225033271-1c0d0f82-3dd4-48ea-bdef-11f19af97a85.png">

## Statement details
<img width="1372" alt="statement details" src="https://user-images.githubusercontent.com/5261/225033338-6dff4a6e-a4a3-48c6-863a-84f1375b0a61.png">

## Transactions, with and without regions filter
<img width="1372" alt="transactions" src="https://user-images.githubusercontent.com/5261/225033366-65f44e95-3549-47cc-b0f2-67ad48a1a1fa.png">
<img width="1372" alt="transactions - filtered" src="https://user-images.githubusercontent.com/5261/225033391-50b9a2dc-e9a1-457b-84b1-837426eba35e.png">

## Transaction details
<img width="1372" alt="transaction details" src="https://user-images.githubusercontent.com/5261/225033505-3fdeceef-35dc-4e06-af25-ab4d0c53518f.png">

Release note: None

98464: jobs,upgrades: add migration to backfill job_info table r=dt a=adityamaru

This change adds a migration and corresponding cluster version
after which every job entry in the system.jobs table will have its
Payload and Progress written to two rows in the system.job_info table.

Informs: #97762

Release note: None

98510: backupccl: update restore/nodeshutdown tests to use new roachtest framework r=adityamaru a=msbutler

The restore/nodeshutdown tests have been using a very old workload that will not be restorable when #93804 lands. This patch changes the restore/nodeshutdown workload to a 80GB tpce restore and moves the tests to run on aws instead of gcp.

Release note: None

Epic: None

98579: upgrade/upgrades: skip TestUpgradeSchemaChangerElements r=smg260 a=smg260

Refs: #98062

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
Epic: None

Co-authored-by: gtr <gerardo@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
adityamaru added a commit to adityamaru/cockroach that referenced this issue Mar 16, 2023
This change touches all the parts of the code that were
relying on the `system.jobs` table to read the payload and/or
progress corresponding to a job. If the cluster version has advanced
past V23_1JobInfoTableIsBackfilled, every job in the system.jobs table
has a payload and progress entry written to the `job_info` table.
This change migrates callers to use this new table when reading the
payload and progress of a job.

The most important changes are in the logic that drives `crdb_internal.system_jobs`,
which in turn drives `crdb_internal.jobs` and `SHOW JOBS`. Additionally,
there are changes in how the registry resolves the progress and payload
when loading or resuming a job.

Several tests that read the payload and progress from `system.jobs` now
rely on `crdb_internal.system_jobs` to fetch this information from either
the jobs table or job_info table depending on the cluster version.

Fixes: cockroachdb#97762

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Mar 31, 2023
This change introduces a cluster version after which the
payload and progress of a job will not be written to the
system.jobs table. This will ensure that the system.job_info
table is the single, source of truth for these two pieces of
information.

This cluster version has an associated upgrade that schema changes
the `payload` column of the `system.jobs` table to be nullable,
thereby allowing us to stop writing to it. This upgrade step
is necessary for a future patch where we will drop the payload
and progress columns. Without this intermediate upgrade step the
`ALTER TABLE ... DROP COLUMN` upgrade job will attempt to write
to dropped columns as part of its execution thereby failing to
run the upgrade.

Informs: cockroachdb#97762

Release note: None
@adityamaru
Copy link
Contributor Author

I'm going to mark this as done, all the relevant pieces have landed in 23.1. Any future improvements can be tracked separately.

Disaster Recovery Backlog automation moved this from Triage to Done Mar 31, 2023
annrpom added a commit to annrpom/cockroach that referenced this issue Apr 30, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: #

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue May 2, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: #

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue May 2, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: #

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue May 2, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue May 6, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue May 14, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue May 15, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
craig bot pushed a commit that referenced this issue May 16, 2024
123298: cli/doctor, doctor: use right jobs table, skip dropped descs r=fqazi a=annrpom

### cli/doctor: doctor should read from the right jobs table
In #97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: #122675

Release note: None


---

### doctor: skip validation for dropped descriptors
In some cases, dropped descriptors appear in our `system.descriptors`
table with dangling job mutations without an associated job.
This patch teaches debug doctor examine to skip validation
on such dropped descriptors.

Epic: none

Fixes: #123477
Fixes: #122956

Release note: none

Co-authored-by: Annie Pompa <annie@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue May 16, 2024
In #97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: #122675

Release note: None
miyamo2 added a commit to miyamo2/cockroach that referenced this issue May 17, 2024
mod comment

lease: remove outdated NightlyStress call

This test config is not used anymore.

Release note: None

roachtest: increase timeout of acceptance/multitenant-multiregion

This has flaked due to timeout in CI.

Release note: None

sqlerrors: move gloval var errNoZoneConfigApplies into sqlerrors package

This commit is a mechnical change that moves the gloval variable
`errNoZoneConfigApplies` from package `sql` to package `sqlerrors`, as
it will be used by the DSC when we support zone config in it.

Informs: cockroachdb#117574
Release note: None

scbuildstmt: Rerewrite a few function with better APIs

This commit cleaned up a few functions using the preferred .FilterXxx()
API to retrieve element from an element set. No functional change is
introduced.

Informs: cockroachdb#117574
Release note: None

roachtest: update dsc job compat to mixed version v241-v242

This patch updates the dsc job compat roachtest to
`declarative_schema_changer/job-compatibility-mixed-version-V241-V242`.

Epic: CRDB-35306
Informs: cockroachdb#116395

Release note: None

admission: introduce elastic io token exhausted duration metric

This patch adds a new metric `elastic_io_tokens_exhausted_duration.kv`.
This is similar to the existing
`admission.granter.io_tokens_exhausted_duration.kv`, but for elastic
traffic.

The patch also does some code refactoring to make it easier to use both
regular and elastic equivalent metrics.

Informs cockroachdb#121574.

Release note: None

sql: expose an ability to request redacted stmt bundles

This commit extends the stmt bundle collection infrastructure to support
requesting redacted stmt bundles. Previously, this was only possible via
an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds
the support via overloads to `crdb_internal.request_statement_bundle`
builtin as well as the server API. This paves the way for requesting the
redacted bundles via the DB Console, but it'll be done separately.

To support this we need to add another boolean column to
`system.statement_diagnostics_requests` table to indicate whether
a request is for the redacted bundle or not. This requires adding
a migration in which we populate the existing rows with all `false`
values.

In order to reduce the amount of code churn and simplify this change,
this commit repurposes some of the existing code that we introduced the
last time we added a migration for plan-gist matching in stmt bundles.
In other words, this commit simultenously removes some stale code and
handles the new column.

Release note: None

sql: simplify usage of some enum settings

This commit adjusts a few enum settings to use `EnumSetting.String` when
defining the session variables to clean up the code a bit.

Release note: None

execinfra: add a sanity check that DistSQL version is not bumped

The last bump was in 23.1, and we won't ever bump DistSQL version in
backwards-incompatible going forward.

Epic: None

Release note: None

Updated the AUTHORS file to include Naveen Setlur

sqlstats: remove unnecessary StatementFingerprintID construction

Previously we were constructing the stmt fingerprint id unnecessarily
in 2 functions in sqlstats:
- The Next function for the StmtStatsIterator
- PopAllStats, which returns the sql stats collected thus far and
resets the in-memory sql stats containers.

The ID is available in the recorded statement in both of these cases-
there's no need to reconstruct this value and doing is needlessly
expensive.

Epic: none

Release note: None

pgwire: add a metric for number of pipelined requests

Release note (ops change): Added the sql.pgwire.pipeline.count metric,
which is a gauge that measures how many wire protocol commands have been
received by the server, but have not yet begun processing. This metric
will only grow if clients are using the "pipeline mode" of the Postgres
wire protocol.

roachtest: skip multi-store-remove while debugging

This roachtest has failed consistently for the past few days, since
being enabled in cockroachdb#123694. Skip while we debug, to avoid noise.

Touches: cockroachdb#123989.

Release note: None.

clusterversion: remove deprecated internal versions

Remove deprecated 23.2 gates - these features will always be enabled
in any cluster that a 24.2 node is part of.

`TODODelete_V23_1` has a lot of uses and will be removed separately.

Epic: none
Release note: None

sql: fix the type of copy_num_retries_per_batch

We were incorrectly using the boolean getter for this integer variable.
The bug was introduced when this variable was added in
1c1d313.

Release note: None

sql: fix cost_scans_with_default_col_size

This commit fixes an omission where `cost_scans_with_default_col_size`
session variable didn't actually use the corresponding cluster setting
for its default value. In other words, the cluster setting actually had
no effect. The bug has been present since this variable was introduced
in af32d9d.

Release note: None

sql: fix up defaults for a few session variables

This commit adjusts global defaults for a few session variables to use
the postgres style (so that the global default value and session
variable value would be stored alike).

Release note: None

sql: improve env collection in stmt bundles

This commit refactors how we collect session variables and cluster
settings into `env.sql` file in the stmt bundle. Previously, we used
a hard-coded list of session variables that were always included while
ignoring all variables not in the list. If a variable had a value
different from the "binary default" (meaning the value that you got
right after the cluster startup, before any cluster setting
modifications are applied), then it would be in a SET statement that
would run on the `bundle recreate`; if the value was the same as
"binary default", then the SET statement was commented out. Also, all
cluster setting values were included into the file as a comment.

This commit makes it so that we include all session variables and
cluster settings that differ from their defaults. This allows us to
recreate the environment while also reducing the overhead of including
everything into `env.sql`.

For session variables further clarification is warranted:
- all read-only variables have been audited, and most are explicitly
excluded. Out of 23 currently present session variables, only 9 were
deemed to be possibly useful during investigations, so they are not
excluded (one of 9 is `crdb_version` which we print out separately, so
it's actually excluded too).
- for writable variables, we include it if its value differs from the
"binary" default (the one that you get on a fresh cluster) or from the
"cluster" default (the one that you get on a fresh session on the
current cluster). The latter kind is obtained via the provided
`settings.Values` container whereas the former is obtained via a global
singleton container.

Additionally, this commit adjusts the SET and SET CLUSTER SETTING
statements to have single quotes around the setting values that need them.
For cluster settings all types work with having single quotes, but for
session variables some (like integers) don't work with quotes while
others (like strings) need them. This commit adds a map for those that
need them as well as a simple test to catch some of the missing ones
(the list might be incomplete given that the test exercises the default
config). All values are adjusted to fit on a single line (we have some
cluster settings that might span multiple lines).

It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster
settings with non-default values.

Release note: None

roachprod: fetch secrets from cloud store

Due to the complexity of fetching the secrets from the secrets
manager, the secrets are now maintained in cloud storage.

Fixes: cockroachdb#117125
Epic: none

roachprod: deploy cockroach

Previously, to upgrade a cluster it had to be done manually by running several
commands to copy over new binaries, drain and stop cockroach and then running
the same start command.

This change introduces a `deploy` command that works similar as `stage` but also
deploys the new version to the cluster by doing all of the above. It does it
node for node to ensure a cluster remains in a healthy state especially if a
load balancer is used to manage connections.

Epic: None
Release Note: None

roachprod: add deploy applications explanation

Epic: None
Release Notes: None

roachtest: add cluster settings operations

This change adds operations to mutate cluster settings. The values are mutated
based on a preset frequency and RNG. For example, if the frequency of a cluster
setting operation is set to "30 minutes" it will only change to a new value
every 30 minutes, even if the operation has been run multiple times within the
same 30-minute window. Running in the same window will just set the same value
again.

Epic: None
Release Note: None

roachtest: fix `OperationRequiresNodes` dependency check

Epic: None
Release Note: None

roachprod: cluster settings operation owners

Epic: None
Release Notes: None

roachprod: fix UT to ignore the yaml format

currently, the test is looking for the exact match of the yaml.
this can fail as the yaml is generated from a map and the values
may not be in the same sequence. a better approach is to use
a yaml parser, which will be done very soon.

Fixes: cockroachdb#124276
Epic: none

pgwire: deflake TestPipelineMetric

The wrong context was used for the server, which meant that it could get
cancelled too early.

Release note: None

roachprod, roachtest: use same cluster name sanitization

Previously, roachtest had it's own function to sanitize
cluster names, while roachprod had it's own function to
verify cluster names. This change removes both and opts
instead to use vm.DNSSafeName.

This change also introduces MalformedClusterNameError
which gives a hint on what is wrong with the name and
tells roachtest not to retry cluster creation.

intentresolver: mark the test heavy to prevent OOM

Epic: none
Release note: none

ui: move enum type to avoid circular dependencies

When importing the ViewMode enum into a unit test, an error is thrown in
localStorage.reducer.ts relating to ViewMode being undefined. This is
occurring due to a circular dependency. To fix, ViewMode is moved to
a separate types.ts file

Release note: None

ui: prevent /_status/nodes calls for secondary tenants

/_status/nodes is not implemented for secondary tenants, resulting in 501
responses in cloud for serverless tenants. To fix, this api is now gated by
a tenant check and will no longer be called by serverless / secondary tenants.

Fixes: CC-26900
Epic: None
Release note: None

cli/doctor: doctor should read from the right jobs table

In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None

doctor: skip validation for dropped descriptors

In some cases, dropped descriptors appear in our `system.descriptors`
table with dangling job mutations without an associated job.
This patch teaches debug doctor examine to skip validation
on such dropped descriptors.

Epic: none

Fixes: cockroachdb#123477
Fixes: cockroachdb#122956

Release note: none

backupccl: test that we actually downloaded all data during online restore

Release note: none

Epic: none

roachtest: check downloaded spans in online restore roundtrip test

Epic: none

Release note: none

workflows: set timeouts for GitHub Actions Essential CI jobs

I've selected the timeouts according to how long each job seems to run
in practice, with a big buffer to allow for load on the remote execution
cluster, etc. The longest-running jobs I've given 60 minutes, with less
time for the shorter jobs.

Epic: CRDB-8308
Release note: None

master: Update pkg/testutils/release/cockroach_releases.yaml

Update pkg/testutils/release/cockroach_releases.yaml with recent values.

Epic: None
Release note: None
Release justification: test-only updates

cdctest: simplify orderValidator.NoteRow code

This patch eliminates unnecessary nesting in orderValidator's
NoteRow method.

Release note: None

cdctest: rename MakeCountValidator to NewCountValidator

This patch renames `MakeCountValidator` to `NewCountValidator` to
reflect that it returns a pointer, not a struct.

Release note: None

roachtest: rename cdc/sink-chaos to cdc/kafka-chaos and add more logging

This patch renames the `cdc/sink-chaos` test to `cdc/kafka-chaos` to
more accurately reflect that it is a Kafka-only test. It also adds
logging for the Kafka chaos loop iteration number so that we can tell
when the Kafka cluster is restarting from the logs.

Release note: None

roachtest: add order validation to CDC Kafka roachtests

This patch adds order validation to CDC Kafka roachtests so that we
can build more confidence in our ordering guarantees. It can be enabled
for a roachtest either by directly setting the `validateOrder` flag on a
`kafkaManager` before creating consumers, or indirectly by setting the
`validateOrder` flag on `kafkaFeedArgs` for tests that use `cdcTester`.

Release note: None

sql: alter primary key hangs if index references exist

Previously, ALTER PRIMARY KEY could hang if an index reference existed
from either a view or function using the index selection syntax. This
was becasue neither the declarative schema changer or legacy schema
changer support updating these references. To address this, this patch
will prevent ALTER PRIMARY KEY if such references exist to any indexes
that will be recreated.

Fixes: cockroachdb#123017

Release note (bug fix): ALTER PRIMARY KEY could hang if the table had
any indexes which were referred to by views or functions using the force
index syntax.

roachprod: remove logging from updatePrometheusTargets

It's hard to tell what the logging is about when it happens in the
context of a roachtest. It's also verbose as it prints one log line
for each VM whenever a test starts cockroach.

Epic: none

Release note: None

roachtest: explicitly close monitor reader after closing session

Previously the reader did not close, even though the session is closed. This
change ensures the reader is closed by closing it after the session has been
closed. This seems to have only been a problem with local clusters.

Fixes: cockroachdb#116314

Epic: None
Release Note: None

roachtest: lower max-upgrades in `tpcc/mixed-headroom`

Avoids timeouts.

Fixes: cockroachdb#124264

Release note: None

scripts: update `drtprod` create drt-large

Bring the creation in line with what is currently running on cockroach-drt. Uses
zone expansion to balance the nodes evenly across regions.

Epic: None
Release Note: None

scripts: update `drtprod` drt-large region

In an effort to save cost on regional network transfers the Warsaw Europe Zone
will now be switched to Columbus.

Epic: None
Release Note: None

backupccl: deflake TestDataDriven_external_connections_privileges

This test already tests the permissions end-to-end by asserting that
the user can or cannot complete the expected actions. No need to print
the system database as well.

Fixes cockroachdb#123379
Release note: None

externalconn: implement `SHOW EXTERNAL CONNECTIONS`.

This new query displays redacted URIs of external connections along
with other valuable information.

Epic: CRDB-15001
Fixes: cockroachdb#85905

Release note (sql change): New queries `SHOW EXTERNAL CONNECTIONS` and
`SHOW EXTERNAL CONNECTION <connection name>` have been added. These queries
display redacted connection URIs and other useful information such as the
connection type. Access to these queries is restricted to the owner of the
connection or users with `USAGE` privilege. Admin or root users will have
unrestricted access to all connections.

kvstorage: speed up scan for range descriptors

On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. This involves going through all
`/Local/Range/Transaction` records and tombstones so we end up reading
through a lot of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - after we decode a descriptor, we seek straight to the next possible
   range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in
   the descriptor.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be
detrimental even when there aren't a lot of keys to skip.

I experimented on a store captured after running kv0 with batch size 2
against a single node. Before this change we step through 353 keys but
the iterator has to internally step through 160k tombstones:
```
range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]);
  scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)...
```

After, we seek around the transaction records and end up doing just a
seek per range:
```
range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones
+‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),›
+‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))...
```

Fixes: cockroachdb#109740

kvserver: read lease under mutex when switching lease type

A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None

logictest: skip flaky select_for_update test

Informs: cockroachdb#124197.
Epic: None

Release note: None

mod comment

roachtest: save cluster earlier if debug-always is set

When the --debug-always flag is passed, we mark the cluster
as saved to let the cluster destroy know not to destroy it.

Previously, we only marked a cluster as saved at the end of
a test. However, if the caller terminated the test through
ctrl c, which also destroys all clusters, it would lead to a
race condition where sometimes the cluster would not be saved
in time and get destroyed.

This change moves the cluster save to immediantly after the
cluster is created.

Release note: none
Fixes: none
Epic: none
annrpom added a commit to annrpom/cockroach that referenced this issue May 17, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue May 20, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue Jun 25, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
annrpom added a commit to annrpom/cockroach that referenced this issue Jun 25, 2024
In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery A-jobs branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker T-jobs
Projects
Jobs
Triage
Development

Successfully merging a pull request may close this issue.

3 participants