-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: add multi-store-remove roachtest #123694
Conversation
d1989ac
to
2dcf6ff
Compare
// Wait for up-replication. | ||
// NOTE: At the time of writing, under-replicated ranges are computed using | ||
// node liveness, rather than store liveness, so we instead compare the range | ||
// count to the current per-store replica count to compute whether all there | ||
// are still under-replicated ranges from the dead store. | ||
// TODO(travers): Once #123561 is solved, re-work this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @nvanbenschoten @andrewbaptist - related to the conversation last week. Let me know if you have thoughts on a better way, for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
TFTR! bors r=itsbilal |
bors r- There is a failure in Bazel extended
|
Canceled. |
While CRDB technically supports multi-store nodes, our test coverage is limited. Add a new roachtest that validates that a store on a node can be removed, and its ranges will eventually be moved to other stores after the removed store has been declared "dead". This provides confidence that in the event of failure / corruption of a specific store, the node can be brought back with the problematic store removed, allowing ranges present in the remaining stores on the node to continue to function. Touches cockroachdb#106620. Release note: None.
2dcf6ff
to
94faed5
Compare
Fixed lint issue. Retrying. bors r=itsbilal |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 94faed5 to blathers/backport-release-23.2-123694: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2.x failed. See errors above. error creating merge commit from 94faed5 to blathers/backport-release-24.1-123694: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 24.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This roachtest has failed consistently for the past few days, since being enabled in cockroachdb#123694. Skip while we debug, to avoid noise. Touches: cockroachdb#123989. Release note: None.
124177: roachtest: skip multi-store-remove while debugging r=itsbilal,sumeerbhola a=nicktrav This roachtest has failed consistently for the past few days, since being enabled in #123694. Skip while we debug, to avoid noise. Touches: #123989. Release note: None. Epic: None. Co-authored-by: Nick Travers <travers@cockroachlabs.com>
mod comment lease: remove outdated NightlyStress call This test config is not used anymore. Release note: None roachtest: increase timeout of acceptance/multitenant-multiregion This has flaked due to timeout in CI. Release note: None sqlerrors: move gloval var errNoZoneConfigApplies into sqlerrors package This commit is a mechnical change that moves the gloval variable `errNoZoneConfigApplies` from package `sql` to package `sqlerrors`, as it will be used by the DSC when we support zone config in it. Informs: cockroachdb#117574 Release note: None scbuildstmt: Rerewrite a few function with better APIs This commit cleaned up a few functions using the preferred .FilterXxx() API to retrieve element from an element set. No functional change is introduced. Informs: cockroachdb#117574 Release note: None roachtest: update dsc job compat to mixed version v241-v242 This patch updates the dsc job compat roachtest to `declarative_schema_changer/job-compatibility-mixed-version-V241-V242`. Epic: CRDB-35306 Informs: cockroachdb#116395 Release note: None admission: introduce elastic io token exhausted duration metric This patch adds a new metric `elastic_io_tokens_exhausted_duration.kv`. This is similar to the existing `admission.granter.io_tokens_exhausted_duration.kv`, but for elastic traffic. The patch also does some code refactoring to make it easier to use both regular and elastic equivalent metrics. Informs cockroachdb#121574. Release note: None sql: expose an ability to request redacted stmt bundles This commit extends the stmt bundle collection infrastructure to support requesting redacted stmt bundles. Previously, this was only possible via an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds the support via overloads to `crdb_internal.request_statement_bundle` builtin as well as the server API. This paves the way for requesting the redacted bundles via the DB Console, but it'll be done separately. To support this we need to add another boolean column to `system.statement_diagnostics_requests` table to indicate whether a request is for the redacted bundle or not. This requires adding a migration in which we populate the existing rows with all `false` values. In order to reduce the amount of code churn and simplify this change, this commit repurposes some of the existing code that we introduced the last time we added a migration for plan-gist matching in stmt bundles. In other words, this commit simultenously removes some stale code and handles the new column. Release note: None sql: simplify usage of some enum settings This commit adjusts a few enum settings to use `EnumSetting.String` when defining the session variables to clean up the code a bit. Release note: None execinfra: add a sanity check that DistSQL version is not bumped The last bump was in 23.1, and we won't ever bump DistSQL version in backwards-incompatible going forward. Epic: None Release note: None Updated the AUTHORS file to include Naveen Setlur sqlstats: remove unnecessary StatementFingerprintID construction Previously we were constructing the stmt fingerprint id unnecessarily in 2 functions in sqlstats: - The Next function for the StmtStatsIterator - PopAllStats, which returns the sql stats collected thus far and resets the in-memory sql stats containers. The ID is available in the recorded statement in both of these cases- there's no need to reconstruct this value and doing is needlessly expensive. Epic: none Release note: None pgwire: add a metric for number of pipelined requests Release note (ops change): Added the sql.pgwire.pipeline.count metric, which is a gauge that measures how many wire protocol commands have been received by the server, but have not yet begun processing. This metric will only grow if clients are using the "pipeline mode" of the Postgres wire protocol. roachtest: skip multi-store-remove while debugging This roachtest has failed consistently for the past few days, since being enabled in cockroachdb#123694. Skip while we debug, to avoid noise. Touches: cockroachdb#123989. Release note: None. clusterversion: remove deprecated internal versions Remove deprecated 23.2 gates - these features will always be enabled in any cluster that a 24.2 node is part of. `TODODelete_V23_1` has a lot of uses and will be removed separately. Epic: none Release note: None sql: fix the type of copy_num_retries_per_batch We were incorrectly using the boolean getter for this integer variable. The bug was introduced when this variable was added in 1c1d313. Release note: None sql: fix cost_scans_with_default_col_size This commit fixes an omission where `cost_scans_with_default_col_size` session variable didn't actually use the corresponding cluster setting for its default value. In other words, the cluster setting actually had no effect. The bug has been present since this variable was introduced in af32d9d. Release note: None sql: fix up defaults for a few session variables This commit adjusts global defaults for a few session variables to use the postgres style (so that the global default value and session variable value would be stored alike). Release note: None sql: improve env collection in stmt bundles This commit refactors how we collect session variables and cluster settings into `env.sql` file in the stmt bundle. Previously, we used a hard-coded list of session variables that were always included while ignoring all variables not in the list. If a variable had a value different from the "binary default" (meaning the value that you got right after the cluster startup, before any cluster setting modifications are applied), then it would be in a SET statement that would run on the `bundle recreate`; if the value was the same as "binary default", then the SET statement was commented out. Also, all cluster setting values were included into the file as a comment. This commit makes it so that we include all session variables and cluster settings that differ from their defaults. This allows us to recreate the environment while also reducing the overhead of including everything into `env.sql`. For session variables further clarification is warranted: - all read-only variables have been audited, and most are explicitly excluded. Out of 23 currently present session variables, only 9 were deemed to be possibly useful during investigations, so they are not excluded (one of 9 is `crdb_version` which we print out separately, so it's actually excluded too). - for writable variables, we include it if its value differs from the "binary" default (the one that you get on a fresh cluster) or from the "cluster" default (the one that you get on a fresh session on the current cluster). The latter kind is obtained via the provided `settings.Values` container whereas the former is obtained via a global singleton container. Additionally, this commit adjusts the SET and SET CLUSTER SETTING statements to have single quotes around the setting values that need them. For cluster settings all types work with having single quotes, but for session variables some (like integers) don't work with quotes while others (like strings) need them. This commit adds a map for those that need them as well as a simple test to catch some of the missing ones (the list might be incomplete given that the test exercises the default config). All values are adjusted to fit on a single line (we have some cluster settings that might span multiple lines). It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster settings with non-default values. Release note: None roachprod: fetch secrets from cloud store Due to the complexity of fetching the secrets from the secrets manager, the secrets are now maintained in cloud storage. Fixes: cockroachdb#117125 Epic: none roachprod: deploy cockroach Previously, to upgrade a cluster it had to be done manually by running several commands to copy over new binaries, drain and stop cockroach and then running the same start command. This change introduces a `deploy` command that works similar as `stage` but also deploys the new version to the cluster by doing all of the above. It does it node for node to ensure a cluster remains in a healthy state especially if a load balancer is used to manage connections. Epic: None Release Note: None roachprod: add deploy applications explanation Epic: None Release Notes: None roachtest: add cluster settings operations This change adds operations to mutate cluster settings. The values are mutated based on a preset frequency and RNG. For example, if the frequency of a cluster setting operation is set to "30 minutes" it will only change to a new value every 30 minutes, even if the operation has been run multiple times within the same 30-minute window. Running in the same window will just set the same value again. Epic: None Release Note: None roachtest: fix `OperationRequiresNodes` dependency check Epic: None Release Note: None roachprod: cluster settings operation owners Epic: None Release Notes: None roachprod: fix UT to ignore the yaml format currently, the test is looking for the exact match of the yaml. this can fail as the yaml is generated from a map and the values may not be in the same sequence. a better approach is to use a yaml parser, which will be done very soon. Fixes: cockroachdb#124276 Epic: none pgwire: deflake TestPipelineMetric The wrong context was used for the server, which meant that it could get cancelled too early. Release note: None roachprod, roachtest: use same cluster name sanitization Previously, roachtest had it's own function to sanitize cluster names, while roachprod had it's own function to verify cluster names. This change removes both and opts instead to use vm.DNSSafeName. This change also introduces MalformedClusterNameError which gives a hint on what is wrong with the name and tells roachtest not to retry cluster creation. intentresolver: mark the test heavy to prevent OOM Epic: none Release note: none ui: move enum type to avoid circular dependencies When importing the ViewMode enum into a unit test, an error is thrown in localStorage.reducer.ts relating to ViewMode being undefined. This is occurring due to a circular dependency. To fix, ViewMode is moved to a separate types.ts file Release note: None ui: prevent /_status/nodes calls for secondary tenants /_status/nodes is not implemented for secondary tenants, resulting in 501 responses in cloud for serverless tenants. To fix, this api is now gated by a tenant check and will no longer be called by serverless / secondary tenants. Fixes: CC-26900 Epic: None Release note: None cli/doctor: doctor should read from the right jobs table In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None doctor: skip validation for dropped descriptors In some cases, dropped descriptors appear in our `system.descriptors` table with dangling job mutations without an associated job. This patch teaches debug doctor examine to skip validation on such dropped descriptors. Epic: none Fixes: cockroachdb#123477 Fixes: cockroachdb#122956 Release note: none backupccl: test that we actually downloaded all data during online restore Release note: none Epic: none roachtest: check downloaded spans in online restore roundtrip test Epic: none Release note: none workflows: set timeouts for GitHub Actions Essential CI jobs I've selected the timeouts according to how long each job seems to run in practice, with a big buffer to allow for load on the remote execution cluster, etc. The longest-running jobs I've given 60 minutes, with less time for the shorter jobs. Epic: CRDB-8308 Release note: None master: Update pkg/testutils/release/cockroach_releases.yaml Update pkg/testutils/release/cockroach_releases.yaml with recent values. Epic: None Release note: None Release justification: test-only updates cdctest: simplify orderValidator.NoteRow code This patch eliminates unnecessary nesting in orderValidator's NoteRow method. Release note: None cdctest: rename MakeCountValidator to NewCountValidator This patch renames `MakeCountValidator` to `NewCountValidator` to reflect that it returns a pointer, not a struct. Release note: None roachtest: rename cdc/sink-chaos to cdc/kafka-chaos and add more logging This patch renames the `cdc/sink-chaos` test to `cdc/kafka-chaos` to more accurately reflect that it is a Kafka-only test. It also adds logging for the Kafka chaos loop iteration number so that we can tell when the Kafka cluster is restarting from the logs. Release note: None roachtest: add order validation to CDC Kafka roachtests This patch adds order validation to CDC Kafka roachtests so that we can build more confidence in our ordering guarantees. It can be enabled for a roachtest either by directly setting the `validateOrder` flag on a `kafkaManager` before creating consumers, or indirectly by setting the `validateOrder` flag on `kafkaFeedArgs` for tests that use `cdcTester`. Release note: None sql: alter primary key hangs if index references exist Previously, ALTER PRIMARY KEY could hang if an index reference existed from either a view or function using the index selection syntax. This was becasue neither the declarative schema changer or legacy schema changer support updating these references. To address this, this patch will prevent ALTER PRIMARY KEY if such references exist to any indexes that will be recreated. Fixes: cockroachdb#123017 Release note (bug fix): ALTER PRIMARY KEY could hang if the table had any indexes which were referred to by views or functions using the force index syntax. roachprod: remove logging from updatePrometheusTargets It's hard to tell what the logging is about when it happens in the context of a roachtest. It's also verbose as it prints one log line for each VM whenever a test starts cockroach. Epic: none Release note: None roachtest: explicitly close monitor reader after closing session Previously the reader did not close, even though the session is closed. This change ensures the reader is closed by closing it after the session has been closed. This seems to have only been a problem with local clusters. Fixes: cockroachdb#116314 Epic: None Release Note: None roachtest: lower max-upgrades in `tpcc/mixed-headroom` Avoids timeouts. Fixes: cockroachdb#124264 Release note: None scripts: update `drtprod` create drt-large Bring the creation in line with what is currently running on cockroach-drt. Uses zone expansion to balance the nodes evenly across regions. Epic: None Release Note: None scripts: update `drtprod` drt-large region In an effort to save cost on regional network transfers the Warsaw Europe Zone will now be switched to Columbus. Epic: None Release Note: None backupccl: deflake TestDataDriven_external_connections_privileges This test already tests the permissions end-to-end by asserting that the user can or cannot complete the expected actions. No need to print the system database as well. Fixes cockroachdb#123379 Release note: None externalconn: implement `SHOW EXTERNAL CONNECTIONS`. This new query displays redacted URIs of external connections along with other valuable information. Epic: CRDB-15001 Fixes: cockroachdb#85905 Release note (sql change): New queries `SHOW EXTERNAL CONNECTIONS` and `SHOW EXTERNAL CONNECTION <connection name>` have been added. These queries display redacted connection URIs and other useful information such as the connection type. Access to these queries is restricted to the owner of the connection or users with `USAGE` privilege. Admin or root users will have unrestricted access to all connections. kvstorage: speed up scan for range descriptors On startup we currently scan through all `/Local/Range/` keys in order to find the range descriptors. This involves going through all `/Local/Range/Transaction` records and tombstones so we end up reading through a lot of data. This change switches to using `SeekGE`: - when we encounter a key that has a suffix before `rdsc` like `/Local/Range/foo/prbe`, we seek directly to the corresponding descriptor key `/Local/Range/foo/rdsc`; - after we decode a descriptor, we seek straight to the next possible range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in the descriptor. Note that inside Pebble, seeks to keys that are very close to the current position are optimized to try to iterate through the next few keys before doing a real seek. So this change should not be detrimental even when there aren't a lot of keys to skip. I experimented on a store captured after running kv0 with batch size 2 against a single node. Before this change we step through 353 keys but the iterator has to internally step through 160k tombstones: ``` range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]); scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)... ``` After, we seek around the transaction records and end up doing just a seek per range: ``` range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones +‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),› +‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))... ``` Fixes: cockroachdb#109740 kvserver: read lease under mutex when switching lease type A race could occur when a replica queue and post lease application both attempted to switch the lease type. This race would cause the queue to not process the replica because the lease type had already changed. As a result, lease preference violations might not have been quickly resolved by the lease queue. Read the lease under the same mutex used for requesting the lease, when possibly switching the lease type. Resolves: cockroachdb#123998 Release note: None logictest: skip flaky select_for_update test Informs: cockroachdb#124197. Epic: None Release note: None mod comment roachtest: save cluster earlier if debug-always is set When the --debug-always flag is passed, we mark the cluster as saved to let the cluster destroy know not to destroy it. Previously, we only marked a cluster as saved at the end of a test. However, if the caller terminated the test through ctrl c, which also destroys all clusters, it would lead to a race condition where sometimes the cluster would not be saved in time and get destroyed. This change moves the cluster save to immediantly after the cluster is created. Release note: none Fixes: none Epic: none
While CRDB technically supports multi-store nodes, our test coverage is limited. Add a new roachtest that validates that a store on a node can be removed, and its ranges will eventually be moved to other stores after the removed store has been declared "dead".
This provides confidence that in the event of failure / corruption of a specific store, the node can be brought back with the problematic store removed, allowing ranges present in the remaining stores on the node to continue to function.
Touches #106620.
Release note: None.
Epic: None.