Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: add multi-store-remove roachtest #123694

Merged
merged 1 commit into from
May 10, 2024

Conversation

nicktrav
Copy link
Collaborator

@nicktrav nicktrav commented May 6, 2024

While CRDB technically supports multi-store nodes, our test coverage is limited. Add a new roachtest that validates that a store on a node can be removed, and its ranges will eventually be moved to other stores after the removed store has been declared "dead".

This provides confidence that in the event of failure / corruption of a specific store, the node can be brought back with the problematic store removed, allowing ranges present in the remaining stores on the node to continue to function.

Touches #106620.

Release note: None.

Epic: None.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nicktrav nicktrav force-pushed the nickt.multi-store-decomm branch 2 times, most recently from d1989ac to 2dcf6ff Compare May 6, 2024 21:59
@nicktrav nicktrav requested review from itsbilal and jbowens May 7, 2024 14:45
@nicktrav nicktrav marked this pull request as ready for review May 7, 2024 14:45
@nicktrav nicktrav requested a review from a team as a code owner May 7, 2024 14:45
@nicktrav nicktrav requested review from herkolategan and renatolabs and removed request for a team May 7, 2024 14:45
Comment on lines +132 to +137
// Wait for up-replication.
// NOTE: At the time of writing, under-replicated ranges are computed using
// node liveness, rather than store liveness, so we instead compare the range
// count to the current per-store replica count to compute whether all there
// are still under-replicated ranges from the dead store.
// TODO(travers): Once #123561 is solved, re-work this.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @nvanbenschoten @andrewbaptist - related to the conversation last week. Let me know if you have thoughts on a better way, for now.

Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nicktrav
Copy link
Collaborator Author

nicktrav commented May 9, 2024

TFTR!

bors r=itsbilal

@andrewbaptist
Copy link
Collaborator

bors r-

There is a failure in Bazel extended

  pkg/cmd/roachtest/tests/multi_store_remove.go:113:3: (github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.Test).Fatal call has possible Printf formatting directive %s (printf)

@craig
Copy link
Contributor

craig bot commented May 9, 2024

Canceled.

While CRDB technically supports multi-store nodes, our test coverage is
limited. Add a new roachtest that validates that a store on a node can
be removed, and its ranges will eventually be moved to other stores
after the removed store has been declared "dead".

This provides confidence that in the event of failure / corruption of a
specific store, the node can be brought back with the problematic store
removed, allowing ranges present in the remaining stores on the node to
continue to function.

Touches cockroachdb#106620.

Release note: None.
@nicktrav nicktrav added backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. labels May 10, 2024
@nicktrav
Copy link
Collaborator Author

Fixed lint issue. Retrying.

bors r=itsbilal

@craig craig bot merged commit f171ff6 into cockroachdb:master May 10, 2024
22 checks passed
Copy link

blathers-crl bot commented May 10, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 94faed5 to blathers/backport-release-23.2-123694: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


error creating merge commit from 94faed5 to blathers/backport-release-24.1-123694: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nicktrav nicktrav deleted the nickt.multi-store-decomm branch May 14, 2024 14:03
nicktrav added a commit to nicktrav/cockroach that referenced this pull request May 15, 2024
This roachtest has failed consistently for the past few days, since
being enabled in cockroachdb#123694. Skip while we debug, to avoid noise.

Touches: cockroachdb#123989.

Release note: None.
craig bot pushed a commit that referenced this pull request May 15, 2024
124177: roachtest: skip multi-store-remove while debugging r=itsbilal,sumeerbhola a=nicktrav

This roachtest has failed consistently for the past few days, since being enabled in #123694. Skip while we debug, to avoid noise.

Touches: #123989.

Release note: None.

Epic: None.

Co-authored-by: Nick Travers <travers@cockroachlabs.com>
miyamo2 added a commit to miyamo2/cockroach that referenced this pull request May 17, 2024
mod comment

lease: remove outdated NightlyStress call

This test config is not used anymore.

Release note: None

roachtest: increase timeout of acceptance/multitenant-multiregion

This has flaked due to timeout in CI.

Release note: None

sqlerrors: move gloval var errNoZoneConfigApplies into sqlerrors package

This commit is a mechnical change that moves the gloval variable
`errNoZoneConfigApplies` from package `sql` to package `sqlerrors`, as
it will be used by the DSC when we support zone config in it.

Informs: cockroachdb#117574
Release note: None

scbuildstmt: Rerewrite a few function with better APIs

This commit cleaned up a few functions using the preferred .FilterXxx()
API to retrieve element from an element set. No functional change is
introduced.

Informs: cockroachdb#117574
Release note: None

roachtest: update dsc job compat to mixed version v241-v242

This patch updates the dsc job compat roachtest to
`declarative_schema_changer/job-compatibility-mixed-version-V241-V242`.

Epic: CRDB-35306
Informs: cockroachdb#116395

Release note: None

admission: introduce elastic io token exhausted duration metric

This patch adds a new metric `elastic_io_tokens_exhausted_duration.kv`.
This is similar to the existing
`admission.granter.io_tokens_exhausted_duration.kv`, but for elastic
traffic.

The patch also does some code refactoring to make it easier to use both
regular and elastic equivalent metrics.

Informs cockroachdb#121574.

Release note: None

sql: expose an ability to request redacted stmt bundles

This commit extends the stmt bundle collection infrastructure to support
requesting redacted stmt bundles. Previously, this was only possible via
an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds
the support via overloads to `crdb_internal.request_statement_bundle`
builtin as well as the server API. This paves the way for requesting the
redacted bundles via the DB Console, but it'll be done separately.

To support this we need to add another boolean column to
`system.statement_diagnostics_requests` table to indicate whether
a request is for the redacted bundle or not. This requires adding
a migration in which we populate the existing rows with all `false`
values.

In order to reduce the amount of code churn and simplify this change,
this commit repurposes some of the existing code that we introduced the
last time we added a migration for plan-gist matching in stmt bundles.
In other words, this commit simultenously removes some stale code and
handles the new column.

Release note: None

sql: simplify usage of some enum settings

This commit adjusts a few enum settings to use `EnumSetting.String` when
defining the session variables to clean up the code a bit.

Release note: None

execinfra: add a sanity check that DistSQL version is not bumped

The last bump was in 23.1, and we won't ever bump DistSQL version in
backwards-incompatible going forward.

Epic: None

Release note: None

Updated the AUTHORS file to include Naveen Setlur

sqlstats: remove unnecessary StatementFingerprintID construction

Previously we were constructing the stmt fingerprint id unnecessarily
in 2 functions in sqlstats:
- The Next function for the StmtStatsIterator
- PopAllStats, which returns the sql stats collected thus far and
resets the in-memory sql stats containers.

The ID is available in the recorded statement in both of these cases-
there's no need to reconstruct this value and doing is needlessly
expensive.

Epic: none

Release note: None

pgwire: add a metric for number of pipelined requests

Release note (ops change): Added the sql.pgwire.pipeline.count metric,
which is a gauge that measures how many wire protocol commands have been
received by the server, but have not yet begun processing. This metric
will only grow if clients are using the "pipeline mode" of the Postgres
wire protocol.

roachtest: skip multi-store-remove while debugging

This roachtest has failed consistently for the past few days, since
being enabled in cockroachdb#123694. Skip while we debug, to avoid noise.

Touches: cockroachdb#123989.

Release note: None.

clusterversion: remove deprecated internal versions

Remove deprecated 23.2 gates - these features will always be enabled
in any cluster that a 24.2 node is part of.

`TODODelete_V23_1` has a lot of uses and will be removed separately.

Epic: none
Release note: None

sql: fix the type of copy_num_retries_per_batch

We were incorrectly using the boolean getter for this integer variable.
The bug was introduced when this variable was added in
1c1d313.

Release note: None

sql: fix cost_scans_with_default_col_size

This commit fixes an omission where `cost_scans_with_default_col_size`
session variable didn't actually use the corresponding cluster setting
for its default value. In other words, the cluster setting actually had
no effect. The bug has been present since this variable was introduced
in af32d9d.

Release note: None

sql: fix up defaults for a few session variables

This commit adjusts global defaults for a few session variables to use
the postgres style (so that the global default value and session
variable value would be stored alike).

Release note: None

sql: improve env collection in stmt bundles

This commit refactors how we collect session variables and cluster
settings into `env.sql` file in the stmt bundle. Previously, we used
a hard-coded list of session variables that were always included while
ignoring all variables not in the list. If a variable had a value
different from the "binary default" (meaning the value that you got
right after the cluster startup, before any cluster setting
modifications are applied), then it would be in a SET statement that
would run on the `bundle recreate`; if the value was the same as
"binary default", then the SET statement was commented out. Also, all
cluster setting values were included into the file as a comment.

This commit makes it so that we include all session variables and
cluster settings that differ from their defaults. This allows us to
recreate the environment while also reducing the overhead of including
everything into `env.sql`.

For session variables further clarification is warranted:
- all read-only variables have been audited, and most are explicitly
excluded. Out of 23 currently present session variables, only 9 were
deemed to be possibly useful during investigations, so they are not
excluded (one of 9 is `crdb_version` which we print out separately, so
it's actually excluded too).
- for writable variables, we include it if its value differs from the
"binary" default (the one that you get on a fresh cluster) or from the
"cluster" default (the one that you get on a fresh session on the
current cluster). The latter kind is obtained via the provided
`settings.Values` container whereas the former is obtained via a global
singleton container.

Additionally, this commit adjusts the SET and SET CLUSTER SETTING
statements to have single quotes around the setting values that need them.
For cluster settings all types work with having single quotes, but for
session variables some (like integers) don't work with quotes while
others (like strings) need them. This commit adds a map for those that
need them as well as a simple test to catch some of the missing ones
(the list might be incomplete given that the test exercises the default
config). All values are adjusted to fit on a single line (we have some
cluster settings that might span multiple lines).

It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster
settings with non-default values.

Release note: None

roachprod: fetch secrets from cloud store

Due to the complexity of fetching the secrets from the secrets
manager, the secrets are now maintained in cloud storage.

Fixes: cockroachdb#117125
Epic: none

roachprod: deploy cockroach

Previously, to upgrade a cluster it had to be done manually by running several
commands to copy over new binaries, drain and stop cockroach and then running
the same start command.

This change introduces a `deploy` command that works similar as `stage` but also
deploys the new version to the cluster by doing all of the above. It does it
node for node to ensure a cluster remains in a healthy state especially if a
load balancer is used to manage connections.

Epic: None
Release Note: None

roachprod: add deploy applications explanation

Epic: None
Release Notes: None

roachtest: add cluster settings operations

This change adds operations to mutate cluster settings. The values are mutated
based on a preset frequency and RNG. For example, if the frequency of a cluster
setting operation is set to "30 minutes" it will only change to a new value
every 30 minutes, even if the operation has been run multiple times within the
same 30-minute window. Running in the same window will just set the same value
again.

Epic: None
Release Note: None

roachtest: fix `OperationRequiresNodes` dependency check

Epic: None
Release Note: None

roachprod: cluster settings operation owners

Epic: None
Release Notes: None

roachprod: fix UT to ignore the yaml format

currently, the test is looking for the exact match of the yaml.
this can fail as the yaml is generated from a map and the values
may not be in the same sequence. a better approach is to use
a yaml parser, which will be done very soon.

Fixes: cockroachdb#124276
Epic: none

pgwire: deflake TestPipelineMetric

The wrong context was used for the server, which meant that it could get
cancelled too early.

Release note: None

roachprod, roachtest: use same cluster name sanitization

Previously, roachtest had it's own function to sanitize
cluster names, while roachprod had it's own function to
verify cluster names. This change removes both and opts
instead to use vm.DNSSafeName.

This change also introduces MalformedClusterNameError
which gives a hint on what is wrong with the name and
tells roachtest not to retry cluster creation.

intentresolver: mark the test heavy to prevent OOM

Epic: none
Release note: none

ui: move enum type to avoid circular dependencies

When importing the ViewMode enum into a unit test, an error is thrown in
localStorage.reducer.ts relating to ViewMode being undefined. This is
occurring due to a circular dependency. To fix, ViewMode is moved to
a separate types.ts file

Release note: None

ui: prevent /_status/nodes calls for secondary tenants

/_status/nodes is not implemented for secondary tenants, resulting in 501
responses in cloud for serverless tenants. To fix, this api is now gated by
a tenant check and will no longer be called by serverless / secondary tenants.

Fixes: CC-26900
Epic: None
Release note: None

cli/doctor: doctor should read from the right jobs table

In cockroachdb#97762, we started writing a job's payload (and progress)
information to the `system.jobs_info` table. As a result, we
had to change the parts of our code that relied on the `system.jobs`
table to use `crdb_internal.system_jobs` instead (since that table
would inaccurately report that some `payload`s were `NULL`).
This change did not occur for the in-memory representation of the jobs
table created by debug doctor -- which can result in missing job
false-positives. This patch updates debug doctor's representation of
the jobs table by referring to `crdb_internal.system_jobs` instead.

Epic: none
Fixes: cockroachdb#122675

Release note: None

doctor: skip validation for dropped descriptors

In some cases, dropped descriptors appear in our `system.descriptors`
table with dangling job mutations without an associated job.
This patch teaches debug doctor examine to skip validation
on such dropped descriptors.

Epic: none

Fixes: cockroachdb#123477
Fixes: cockroachdb#122956

Release note: none

backupccl: test that we actually downloaded all data during online restore

Release note: none

Epic: none

roachtest: check downloaded spans in online restore roundtrip test

Epic: none

Release note: none

workflows: set timeouts for GitHub Actions Essential CI jobs

I've selected the timeouts according to how long each job seems to run
in practice, with a big buffer to allow for load on the remote execution
cluster, etc. The longest-running jobs I've given 60 minutes, with less
time for the shorter jobs.

Epic: CRDB-8308
Release note: None

master: Update pkg/testutils/release/cockroach_releases.yaml

Update pkg/testutils/release/cockroach_releases.yaml with recent values.

Epic: None
Release note: None
Release justification: test-only updates

cdctest: simplify orderValidator.NoteRow code

This patch eliminates unnecessary nesting in orderValidator's
NoteRow method.

Release note: None

cdctest: rename MakeCountValidator to NewCountValidator

This patch renames `MakeCountValidator` to `NewCountValidator` to
reflect that it returns a pointer, not a struct.

Release note: None

roachtest: rename cdc/sink-chaos to cdc/kafka-chaos and add more logging

This patch renames the `cdc/sink-chaos` test to `cdc/kafka-chaos` to
more accurately reflect that it is a Kafka-only test. It also adds
logging for the Kafka chaos loop iteration number so that we can tell
when the Kafka cluster is restarting from the logs.

Release note: None

roachtest: add order validation to CDC Kafka roachtests

This patch adds order validation to CDC Kafka roachtests so that we
can build more confidence in our ordering guarantees. It can be enabled
for a roachtest either by directly setting the `validateOrder` flag on a
`kafkaManager` before creating consumers, or indirectly by setting the
`validateOrder` flag on `kafkaFeedArgs` for tests that use `cdcTester`.

Release note: None

sql: alter primary key hangs if index references exist

Previously, ALTER PRIMARY KEY could hang if an index reference existed
from either a view or function using the index selection syntax. This
was becasue neither the declarative schema changer or legacy schema
changer support updating these references. To address this, this patch
will prevent ALTER PRIMARY KEY if such references exist to any indexes
that will be recreated.

Fixes: cockroachdb#123017

Release note (bug fix): ALTER PRIMARY KEY could hang if the table had
any indexes which were referred to by views or functions using the force
index syntax.

roachprod: remove logging from updatePrometheusTargets

It's hard to tell what the logging is about when it happens in the
context of a roachtest. It's also verbose as it prints one log line
for each VM whenever a test starts cockroach.

Epic: none

Release note: None

roachtest: explicitly close monitor reader after closing session

Previously the reader did not close, even though the session is closed. This
change ensures the reader is closed by closing it after the session has been
closed. This seems to have only been a problem with local clusters.

Fixes: cockroachdb#116314

Epic: None
Release Note: None

roachtest: lower max-upgrades in `tpcc/mixed-headroom`

Avoids timeouts.

Fixes: cockroachdb#124264

Release note: None

scripts: update `drtprod` create drt-large

Bring the creation in line with what is currently running on cockroach-drt. Uses
zone expansion to balance the nodes evenly across regions.

Epic: None
Release Note: None

scripts: update `drtprod` drt-large region

In an effort to save cost on regional network transfers the Warsaw Europe Zone
will now be switched to Columbus.

Epic: None
Release Note: None

backupccl: deflake TestDataDriven_external_connections_privileges

This test already tests the permissions end-to-end by asserting that
the user can or cannot complete the expected actions. No need to print
the system database as well.

Fixes cockroachdb#123379
Release note: None

externalconn: implement `SHOW EXTERNAL CONNECTIONS`.

This new query displays redacted URIs of external connections along
with other valuable information.

Epic: CRDB-15001
Fixes: cockroachdb#85905

Release note (sql change): New queries `SHOW EXTERNAL CONNECTIONS` and
`SHOW EXTERNAL CONNECTION <connection name>` have been added. These queries
display redacted connection URIs and other useful information such as the
connection type. Access to these queries is restricted to the owner of the
connection or users with `USAGE` privilege. Admin or root users will have
unrestricted access to all connections.

kvstorage: speed up scan for range descriptors

On startup we currently scan through all `/Local/Range/` keys in order
to find the range descriptors. This involves going through all
`/Local/Range/Transaction` records and tombstones so we end up reading
through a lot of data.

This change switches to using `SeekGE`:
 - when we encounter a key that has a suffix before `rdsc` like
   `/Local/Range/foo/prbe`, we seek directly to the corresponding
   descriptor key `/Local/Range/foo/rdsc`;
 - after we decode a descriptor, we seek straight to the next possible
   range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in
   the descriptor.

Note that inside Pebble, seeks to keys that are very close to the
current position are optimized to try to iterate through the next few
keys before doing a real seek. So this change should not be
detrimental even when there aren't a lot of keys to skip.

I experimented on a store captured after running kv0 with batch size 2
against a single node. Before this change we step through 353 keys but
the iterator has to internally step through 160k tombstones:
```
range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]);
  scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)...
```

After, we seek around the transaction records and end up doing just a
seek per range:
```
range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones
+‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),›
+‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))...
```

Fixes: cockroachdb#109740

kvserver: read lease under mutex when switching lease type

A race could occur when a replica queue and post lease application both
attempted to switch the lease type. This race would cause the queue to
not process the replica because the lease type had already changed. As a
result, lease preference violations might not have been quickly
resolved by the lease queue.

Read the lease under the same mutex used for requesting the lease, when
possibly switching the lease type.

Resolves: cockroachdb#123998
Release note: None

logictest: skip flaky select_for_update test

Informs: cockroachdb#124197.
Epic: None

Release note: None

mod comment

roachtest: save cluster earlier if debug-always is set

When the --debug-always flag is passed, we mark the cluster
as saved to let the cluster destroy know not to destroy it.

Previously, we only marked a cluster as saved at the end of
a test. However, if the caller terminated the test through
ctrl c, which also destroys all clusters, it would lead to a
race condition where sometimes the cluster would not be saved
in time and get destroyed.

This change moves the cluster save to immediantly after the
cluster is created.

Release note: none
Fixes: none
Epic: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants