kvserver,cli,roachtest,sql: introduce a fully decommissioned bit #50329
Commits on Jul 11, 2020
kvserver: rename members of `livenessUpdate`
Pull a rename refactor out of the future commits that make use of it.

Release note: None
Commit: 89ce060
kvserver,cli,roachtest,sql: introduce a decommissioned bit
This PR introduces a fully decommissioned bit to CRDB. Previously our Liveness schema only contained a `decommissioning` bool, and consequently had no ability to disambiguate between a node currently undergoing decommissioning and a node that was fully decommissioned. We used some combination of the store dead threshold to surface "fully decommissioned" nodes in our UI, but that was never quite accurate. We need this specificity for the Connect RPC.

---

We wire up a new `MembershipStatus` enum that's now part of the liveness record. In doing so we elide usage of the `decommissioning` bool used in v20.1. We're careful to maintain an on-the-wire representation of the Liveness record that will be understood by v20.1 nodes, and do so by ensuring the encoding of the enum type is parsed into the semantically equivalent v20.1 representation. Usage of the fully decommissioned bit is gated behind a version flag. A future commit will introduce a mixed-version roachtest testing cross-version compatibility. A future commit will also re-register/unskip an improved version of the `acceptance/decommission` roachtest.

We repurpose the `AdminServer.Decommission` RPC to persist `MembershipStatus`es to KV through the lifetime of a node decommissioning/recommissioning. See `cli/node.go` for where that's done. For recommissioning a node, it suffices to simply persist an `ACTIVE` status. When decommissioning a node, since it's a longer running process, we first persist an in-progress `DECOMMISSIONING` status, and once we've moved all the replicas off the node, we finalize the decommissioning process by persisting the `DECOMMISSIONED` status.

When transitioning between `MembershipStatus`es, we CPut against what's already there, disallowing illegal state transitions. The appropriate error codes are surfaced back to the user. An example would be attempting to recommission a fully decommissioned node, in which case we'd error out with the following:

> ERROR: can only recommission a decommissioning node; n4 found to be
> decommissioned

Note that this is a behavioral change for `cockroach node recommission`. Previously it was able to recommission any "fully decommissioned" node, regardless of how long ago it was removed from the cluster. Now recommission serves only to cancel an accidental decommissioning process that wasn't finalized.

The `decommissioning` column in `crdb_internal.gossip_liveness` is now powered by this new `MembershipStatus` instead, and we introduce a new `membership` column to it that should be preferred going forward. We also introduce the same column to the output generated by `cockroach node status --decommission`. The `is_decommissioning` column still exists, but is also powered by this `MembershipStatus`.

While here, we iron out the events plumbed into `system.eventlog`: it now has a dedicated event for "node decommissioning".

---

Release note (general change): `cockroach node recommission` has new semantics. Previously it was able to recommission any decommissioning node, regardless of how long ago it was decommissioned or removed from the cluster. Now recommission serves only to cancel an accidental in-flight decommissioning process that wasn't finalized.

Release note (cli change): We introduce a `membership` column to the output generated by `cockroach node status --decommission`. It should be used in favor of the `is_decommissioning` column going forward.

Release note (cli change): The v20.2 cli `cockroach node` family of subcommands will not work against servers running older versions of cockroach, but the v20.1 cli `cockroach node` subcommands will work against v20.2 servers.

Release note (cli change): The `is_decommissioning` column found in the output of `cockroach node decommission` is slated for removal in v20.1. Operators should instead use the new `membership` column to determine node membership status.
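The transition rules described in this commit message can be sketched as a small state machine guarded by a compare-and-put. This is only an illustration and not the code in this PR: the identifiers `livenessStore`, `cput`, `setMembership`, `validateTransition`, and `legacyDecommissioning` are hypothetical, the store is an in-memory map rather than KV, and two behaviors are assumptions: that any non-`ACTIVE` status maps to `decommissioning = true` for v20.1 readers, and that re-persisting the current status is a no-op while skipping straight from `ACTIVE` to `DECOMMISSIONED` is rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// MembershipStatus mirrors the three states named in the commit message.
// The Go identifiers are illustrative, not CockroachDB's actual ones.
type MembershipStatus int

const (
	Active MembershipStatus = iota
	Decommissioning
	Decommissioned
)

func (s MembershipStatus) String() string {
	return [...]string{"active", "decommissioning", "decommissioned"}[s]
}

// legacyDecommissioning derives the v20.1-style bool from the enum.
// Assumption: every non-active status reads as decommissioning == true.
func legacyDecommissioning(s MembershipStatus) bool { return s != Active }

// validateTransition rejects transitions the commit message calls illegal.
// Assumption: from == to is a no-op, and ACTIVE -> DECOMMISSIONED is rejected.
func validateTransition(nodeID int, from, to MembershipStatus) error {
	switch {
	case from == to,
		from == Active && to == Decommissioning,
		from == Decommissioning && to == Active,
		from == Decommissioning && to == Decommissioned:
		return nil
	case to == Active:
		return fmt.Errorf("can only recommission a decommissioning node; n%d found to be %s", nodeID, from)
	default:
		return fmt.Errorf("illegal membership transition for n%d: %s -> %s", nodeID, from, to)
	}
}

var errConditionFailed = errors.New("cput: unexpected existing value")

// livenessStore is a hypothetical in-memory stand-in for the KV-backed records.
type livenessStore struct{ records map[int]MembershipStatus }

// cput writes next only if the stored value still equals expected.
func (s *livenessStore) cput(nodeID int, expected, next MembershipStatus) error {
	if s.records[nodeID] != expected {
		return errConditionFailed
	}
	s.records[nodeID] = next
	return nil
}

// setMembership reads the current status, validates the move, then CPuts.
func (s *livenessStore) setMembership(nodeID int, to MembershipStatus) error {
	from := s.records[nodeID]
	if err := validateTransition(nodeID, from, to); err != nil {
		return err
	}
	return s.cput(nodeID, from, to)
}

func main() {
	s := &livenessStore{records: map[int]MembershipStatus{4: Active}}
	_ = s.setMembership(4, Decommissioning) // start decommissioning n4
	_ = s.setMembership(4, Decommissioned)  // finalize once replicas are gone
	// Recommissioning a fully decommissioned node is refused.
	fmt.Println(s.setMembership(4, Active))
	fmt.Println(legacyDecommissioning(s.records[4])) // still true for v20.1 readers
}
```

Validating against the value that was read and then writing with a compare-and-put means a concurrent change to the record makes the write fail instead of silently overwriting it.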
Commit: 57033b4
roachtest: add `decommission/mixed-versions`
Add a roachtest stressing randomized `cockroach node {decommission,recommission}` usage in multi-version clusters. Release note: None
Commit: 6826494
roachtest: improve decommissioning roachtests
Rewrite the previously skipped `acceptance/decommission` to account for new node {d,r}ecommissioning semantics. The minimum version to run this test against is v20.2.

```go
// runDecommissionRecommission tests a bunch of node
// decommissioning/recommissioning procedures, all the while checking for
// replica movement and appropriate membership status detection behavior. We go
// through partial decommissioning of random nodes, ensuring we're able to undo
// those operations. We then fully decommission nodes, verifying it's an
// irreversible operation.
```

Release note: None
Commit: 5887553
cli: improve help prompt for --wait={all,none} for node decommissioning
Now that we have a fully decommissioned bit, we clarify the mechanics of how that is interfaced with through the `--wait` flag.

Release note (cli change): We slightly change the mechanics of how the `--wait` flag, as used by `cockroach node decommission`, behaves. Copying over from the help prompt:

```
Specifies when to return during the decommissioning process.
Takes any of the following values:

  - all   waits until all target nodes' replica counts have dropped to zero and
          marks the nodes as fully decommissioned. This is the default.
  - none  marks the targets as decommissioning, but does not wait for the
          replica counts to drop to zero before returning. If the replica counts
          are found to be zero, nodes are marked as fully decommissioned. Use
          when polling manually from an external system.
```
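The two `--wait` modes in the help text can be read as the following decision loop. This is a rough sketch under stated assumptions, not the CLI's implementation: `decommission`, `waitMode`, the `replicaCount` callback, and the `maxPolls` bound are all hypothetical names introduced for illustration, and the real command would pause between polls rather than give up after a fixed count.

```go
package main

import "fmt"

// waitMode models the two documented values of --wait.
type waitMode string

const (
	waitAll  waitMode = "all"
	waitNone waitMode = "none"
)

// decommission returns the membership status the targets end up in.
// replicaCount is a hypothetical callback reporting how many replicas remain
// on the target nodes; maxPolls bounds the loop for this illustration only.
func decommission(mode waitMode, replicaCount func() int, maxPolls int) (string, error) {
	if mode != waitAll && mode != waitNone {
		return "", fmt.Errorf("invalid --wait value %q", mode)
	}
	// Targets are marked as decommissioning before any polling happens.
	status := "decommissioning"
	for i := 0; i < maxPolls; i++ {
		if replicaCount() == 0 {
			// Both modes finalize once the replica count is found to be zero.
			return "decommissioned", nil
		}
		if mode == waitNone {
			// "none" returns right away, leaving the node in progress.
			return status, nil
		}
	}
	return status, fmt.Errorf("replicas still present after %d polls", maxPolls)
}

func main() {
	remaining := 2
	draining := func() int { n := remaining; remaining--; return n }

	status, err := decommission(waitAll, draining, 10)
	fmt.Println(status, err)

	status, err = decommission(waitNone, func() int { return 5 }, 10)
	fmt.Println(status, err)
}
```

With `none`, an external system would call the command repeatedly and treat a `decommissioned` result as the signal that the process has been finalized.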
Commit: c045ad8