kvserver,cli,roachtest,sql: introduce a fully decommissioned bit #50329

Merged

Commits on Jul 11, 2020

  1. kvserver: rename members of livenessUpdate

    Pulling a rename refactor out of future commits that then make use of it.
    
    Release note: None
    irfansharif committed Jul 11, 2020
    89ce060
  2. kvserver,cli,roachtest,sql: introduce a decommissioned bit

    This PR introduces a fully decommissioned bit to CRDB. Previously our
    Liveness schema only contained a `decommissioning` bool, with
    consequently no ability to disambiguate between a node currently
    undergoing decommissioning, and a node that was fully decommissioned. We
    used a heuristic based on the store dead threshold to surface "fully
    decommissioned" nodes in our UI, but it was never exact. We need this
    specificity for the Connect RPC.
    
    ---
    
    We wire up a new `MembershipStatus` enum that's now part of the liveness
    record. In doing so we elide usage of the `decommissioning` bool used
    in v20.1. We're careful to maintain an on-the-wire representation of the
    Liveness record that will be understood by v20.1 nodes, and do so by
    ensuring the encoding of the enum type is parsed into the semantically
    equivalent v20.1 representation. Usage of the fully decommissioned bit
    is gated behind a version flag. A future commit will introduce a
    mixed-version roachtest testing cross-version compatibility. A future
    commit will also re-register/unskip an improved version of the
    `acceptance/decommission` roachtest.
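    
    As a rough illustration of the compatibility idea above (not the actual
    CRDB code), the sketch below maps a three-state membership enum onto the
    single bool a v20.1 reader understands. The numeric enum values and the
    `legacyDecommissioning` helper are assumptions made for this example; the
    only thing taken from the description is that both non-active states
    should read as "decommissioning" to an older node.
    
    ```go
    package main
    
    import "fmt"
    
    // MembershipStatus models the three states named in the commit message.
    // The numeric values are an assumption made for this sketch.
    type MembershipStatus int32
    
    const (
    	Active          MembershipStatus = 0
    	Decommissioning MembershipStatus = 1
    	Decommissioned  MembershipStatus = 2
    )
    
    // String returns a lowercase name for the status.
    func (s MembershipStatus) String() string {
    	switch s {
    	case Active:
    		return "active"
    	case Decommissioning:
    		return "decommissioning"
    	case Decommissioned:
    		return "decommissioned"
    	default:
    		return fmt.Sprintf("unknown(%d)", int32(s))
    	}
    }
    
    // legacyDecommissioning is a hypothetical helper: it derives the single
    // bool a v20.1-style reader would see. Any non-active status reads as true.
    func legacyDecommissioning(s MembershipStatus) bool {
    	return s != Active
    }
    
    func main() {
    	for _, s := range []MembershipStatus{Active, Decommissioning, Decommissioned} {
    		fmt.Printf("membership=%s is_decommissioning=%t\n", s, legacyDecommissioning(s))
    	}
    }
    ```
    
    Under this assumed mapping an older node cannot tell `DECOMMISSIONING`
    from `DECOMMISSIONED`, which is why usage of the new bit needs to be
    version gated.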
    
    We repurpose the `AdminServer.Decommission` RPC to persist
    `MembershipStatus`es to KV through the lifetime of a node
    decommissioning/recommissioning. See `cli/node.go` for where that's done.
    For recommissioning a node, it suffices to simply persist an `ACTIVE`
    status. When decommissioning a node, since it's a longer running process,
    we first persist an in-progress `DECOMMISSIONING` status, and once we've
    moved off all the Replicas in the node, we finalize the decommissioning
    process by persisting the `DECOMMISSIONED` status.
    
    When transitioning between `MembershipStatus`es, we CPut against
    what's already there, disallowing illegal state transitions. The
    appropriate error codes are surfaced back to the user. For example,
    attempting to recommission a fully decommissioned node errors out with
    the following:
    
    > ERROR: can only recommission a decommissioning node; n4 found to be
    > decommissioned
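    
    The transition checks can be sketched roughly as below. This is a minimal
    in-memory model, not the kvserver implementation: `store`, `cput`,
    `validate` and `setMembership` are hypothetical names, and a Go map stands
    in for KV. Only the recommission error text is taken from above; the
    other error messages, and the choice to reject every transition other
    than active to decommissioning, decommissioning to active, and
    decommissioning to decommissioned, are assumptions.
    
    ```go
    package main
    
    import (
    	"errors"
    	"fmt"
    )
    
    type Status string
    
    const (
    	Active          Status = "active"
    	Decommissioning Status = "decommissioning"
    	Decommissioned  Status = "decommissioned"
    )
    
    // store is a hypothetical in-memory stand-in for liveness records in KV.
    type store map[int]Status
    
    var errConditionFailed = errors.New("conditional put failed: unexpected existing value")
    
    // cput writes next only if the stored value still equals expected.
    func (s store) cput(nodeID int, expected, next Status) error {
    	if s[nodeID] != expected {
    		return errConditionFailed
    	}
    	s[nodeID] = next
    	return nil
    }
    
    // validate rejects the transitions this sketch treats as illegal.
    func validate(nodeID int, cur, next Status) error {
    	switch {
    	case next == Active && cur != Decommissioning:
    		return fmt.Errorf("can only recommission a decommissioning node; n%d found to be %s", nodeID, cur)
    	case next == Decommissioning && cur != Active:
    		return fmt.Errorf("can only decommission an active node; n%d found to be %s", nodeID, cur)
    	case next == Decommissioned && cur != Decommissioning:
    		return fmt.Errorf("can only finalize a decommissioning node; n%d found to be %s", nodeID, cur)
    	}
    	return nil
    }
    
    // setMembership reads the current status, validates the transition, and
    // then writes the new status conditionally on what was read.
    func setMembership(s store, nodeID int, next Status) error {
    	cur, ok := s[nodeID]
    	if !ok {
    		return fmt.Errorf("n%d has no liveness record", nodeID)
    	}
    	if err := validate(nodeID, cur, next); err != nil {
    		return err
    	}
    	return s.cput(nodeID, cur, next)
    }
    
    func main() {
    	s := store{4: Active}
    	for _, next := range []Status{Decommissioning, Decommissioned, Active} {
    		if err := setMembership(s, 4, next); err != nil {
    			fmt.Println("ERROR:", err)
    			continue
    		}
    		fmt.Printf("n4 is now %s\n", next)
    	}
    }
    ```
    
    The conditional write matters when two operators race: whichever request
    read a stale status fails its `cput` instead of silently overwriting the
    other's transition.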
    
    Note that this is a behavioral change for `cockroach node recommission`.
    Previously it was able to recommission any "fully decommissioned" node,
    regardless of how long ago it was removed from the cluster. Now
    recommission serves to only cancel an accidental decommissioning process
    that wasn't finalized.
    
    The `decommissioning` column in `crdb_internal.gossip_liveness` is now
    powered by this new `MembershipStatus` instead, and we introduce a new
    `membership` column to it that should be preferred going forward. We
    also introduce the same column to the output generated by `cockroach
    node status --decommission`. The `is_decommissioning` column still exists,
    but is also powered by this `MembershipStatus`.
    
    While here, we iron out the events plumbed into `system.eventlog`: it
    now has a dedicated event for "node decommissioning".
    
    ---
    
    Release note (general change): `cockroach node recommission` has new
    semantics. Previously it was able to recommission any decommissioning node,
    regardless of how long ago it was decommissioned, or removed from the
    cluster. Now recommission serves to only cancel an accidental
    inflight decommissioning process that wasn't finalized.
    
    Release note (cli change): We introduce a `membership` column to the
    output generated by `cockroach node status --decommission`. It should be
    used in favor of the `is_decommissioning` column going forward.
    
    Release note (cli change): The v20.2 cli `cockroach node` family of
    subcommands will not work against servers running older versions of
    cockroach, but the v20.1 cli `cockroach node` subcommands will work
    against v20.2 servers.
    
    Release note (cli change): The `is_decommissioning` column found in the
    output of `cockroach node decommission` is slated for removal in v20.1.
    Operators should instead use the new `membership` column to determine
    node membership status.
    irfansharif committed Jul 11, 2020
    57033b4
  3. roachtest: add decommission/mixed-versions

    Add a roachtest stressing randomized `cockroach node
    {decommission,recommission}` usage in multi-version clusters.
    
    Release note: None
    irfansharif committed Jul 11, 2020
    6826494
  4. roachtest: improve decommissioning roachtests

    Re-write the previously skipped `acceptance/decommission` to
    account for new node {d,r}ecommissioning semantics. The minimum version
    to run this test against is v20.2.
    
    ```go
    // runDecommissionRecommission tests a bunch of node
    // decommissioning/recommissioning procedures, all the while checking for
    // replica movement and appropriate membership status detection behavior. We go
    // through partial decommissioning of random nodes, ensuring we're able to undo
    // those operations. We then fully decommission nodes, verifying it's an
    // irreversible operation.
    ```
    
    Release note: None
    irfansharif committed Jul 11, 2020
    5887553
  5. cli: improve help prompt for --wait={all,none} for node decommissioning

    Now that we have a fully decommissioned bit, we clarify the mechanics of
    how that is interfaced with through the `--wait` flag.
    
    Release note (cli change): We slightly change how the `--wait` flag, as
    used by `cockroach node decommission`, behaves.
    Copying over from the help prompt:
    
    ```
      Specifies when to return during the decommissioning process. Takes any
      of the following values:
    
        - all   waits until all target nodes' replica counts have dropped to zero and
                marks the nodes as fully decommissioned. This is the default.
        - none  marks the targets as decommissioning, but does not wait for the
                replica counts to drop to zero before returning. If the replica counts
                are found to be zero, nodes are marked as fully decommissioned. Use
                when polling manually from an external system.
    ```
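    
    Read as control flow, the two modes in the help text above could look
    roughly like the sketch below. It is an illustration only: `decommission`
    and `draining` are hypothetical, `replicaCount` stands in for polling the
    target's remaining replicas, and a real loop would pause between polls
    and handle multiple targets.
    
    ```go
    package main
    
    import "fmt"
    
    type Status string
    
    const (
    	Decommissioning Status = "decommissioning"
    	Decommissioned  Status = "decommissioned"
    )
    
    // decommission is a hypothetical sketch of the two --wait modes for a
    // single target node.
    func decommission(wait string, replicaCount func() int) (Status, error) {
    	if wait != "all" && wait != "none" {
    		return "", fmt.Errorf("invalid --wait value %q", wait)
    	}
    	// Both modes first mark the target as decommissioning.
    	status := Decommissioning
    	for {
    		// A zero replica count finalizes the process in either mode.
    		if replicaCount() == 0 {
    			return Decommissioned, nil
    		}
    		// --wait=none returns after a single check.
    		if wait == "none" {
    			return status, nil
    		}
    	}
    }
    
    // draining returns a fake replica counter that yields the given counts in
    // order and then keeps returning the last one. It needs at least one count.
    func draining(counts ...int) func() int {
    	i := 0
    	return func() int {
    		c := counts[i]
    		if i < len(counts)-1 {
    			i++
    		}
    		return c
    	}
    }
    
    func main() {
    	all, _ := decommission("all", draining(3, 1, 0))
    	fmt.Println("--wait=all:", all)
    
    	none, _ := decommission("none", draining(3, 1, 0))
    	fmt.Println("--wait=none:", none)
    
    	drained, _ := decommission("none", draining(0))
    	fmt.Println("--wait=none, already drained:", drained)
    }
    ```
    
    With `--wait=none`, an external poller would re-issue the command until it
    observes the fully decommissioned status.
    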
    irfansharif committed Jul 11, 2020
    c045ad8