Commit
port #7304 node decommissioning updates to 20.2
taroface committed Jun 16, 2020
1 parent f1f9d96 commit 4824a64
Showing 22 changed files with 213 additions and 132 deletions.
2 changes: 1 addition & 1 deletion _includes/v20.2/faq/planned-maintenance.md
@@ -11,7 +11,7 @@ After completing the maintenance work and [restarting the nodes](cockroach-start

{% include copy-clipboard.html %}
~~~ sql
> SET CLUSTER SETTING server.time_until_store_dead = '5m0s';
> RESET CLUSTER SETTING server.time_until_store_dead;
~~~

It's also important to ensure that load balancers do not send client traffic to a node about to be shut down, even if it will only be down for a few seconds. If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the `server.shutdown.drain_wait` setting, which tells the node to wait in an unready state for the specified duration. For example:
3 changes: 3 additions & 0 deletions _includes/v20.2/prod-deployment/node-shutdown.md
@@ -0,0 +1,3 @@
- If the node was started with a process manager like [systemd](https://www.freedesktop.org/wiki/Software/systemd/), stop the node using the process manager. The process manager should be configured to send `SIGTERM` and then, after about 1 minute, `SIGKILL`.
- If the node was started using [`cockroach start`](cockroach-start.html) and is running in the foreground, press `ctrl-c` in the terminal.
- If the node was started using [`cockroach start`](cockroach-start.html) and the `--background` and `--pid-file` flags, run `kill <pid>`, where `<pid>` is the process ID of the node.
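The `SIGTERM`-then-`SIGKILL` escalation described for process managers can also be sketched by hand for a node stopped with `kill <pid>`. This is a minimal illustration, not a substitute for a process manager: the helper name `stop_gracefully` and its 60-second default grace period are assumptions chosen to match the "about 1 minute" guidance above.

```shell
# Hypothetical helper: send SIGTERM, wait up to a grace period, then SIGKILL.
# Usage: stop_gracefully <pid> [grace_seconds]
stop_gracefully() {
  sg_pid="$1"
  sg_grace="${2:-60}"   # assumed default, mirroring "about 1 minute"
  sg_waited=0

  # Ask the process to shut down cleanly; nothing to do if it is already gone.
  kill -TERM "$sg_pid" 2>/dev/null || return 0

  # Poll once per second until the process exits or the grace period ends.
  while kill -0 "$sg_pid" 2>/dev/null; do
    if [ "$sg_waited" -ge "$sg_grace" ]; then
      # Grace period exhausted: force termination.
      kill -KILL "$sg_pid" 2>/dev/null
      return 1
    fi
    sleep 1
    sg_waited=$((sg_waited + 1))
  done
  return 0
}
```

A return status of `0` means the process exited after `SIGTERM`; `1` means it had to be killed. For example, `stop_gracefully "$(cat /path/to/pid-file)"`, where the path is a placeholder for whatever was passed to `--pid-file`.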
Binary file modified images/v20.2/after-decommission1.png
Binary file modified images/v20.2/after-decommission2.png
Binary file added images/v20.2/before-decommission0.png
Binary file modified images/v20.2/before-decommission1.png
Binary file modified images/v20.2/before-decommission2.png
Binary file modified images/v20.2/cluster-status-after-decommission1.png
Binary file modified images/v20.2/cluster-status-after-decommission2.png
Binary file modified images/v20.2/decommission-multiple1.png
Binary file modified images/v20.2/decommission-multiple2.png
Binary file modified images/v20.2/decommission-multiple3.png
Binary file modified images/v20.2/decommission-multiple4.png
Binary file modified images/v20.2/decommission-multiple5.png
Binary file modified images/v20.2/decommission-multiple6.png
Binary file modified images/v20.2/decommission-multiple7.png
Binary file modified images/v20.2/remove-dead-node1.png
2 changes: 1 addition & 1 deletion v20.1/remove-nodes.md
@@ -67,7 +67,7 @@ If you try to decommission a node, the cluster will successfully rebalance all r

<div style="text-align: center;"><img src="{{ 'images/v20.1/decommission-scenario3.2.png' | relative_url }}" alt="Decommission Scenario 1" style="max-width:50%" /></div>

To successfully decommission a node in this cluster, you need to first add a 6th node. The decommissioning process can then complete:
To successfully decommission a node in this cluster, you need to add a 6th node. The decommissioning process can then complete:

<div style="text-align: center;"><img src="{{ 'images/v20.1/decommission-scenario3.3.png' | relative_url }}" alt="Decommission Scenario 1" style="max-width:50%" /></div>

8 changes: 4 additions & 4 deletions v20.2/admin-ui-cluster-overview-page.md
@@ -65,7 +65,7 @@ Node Status | Description
`LIVE` | Node is online and updating its liveness record.
`SUSPECT` | Node has an [unavailable liveness status](cluster-setup-troubleshooting.html#node-liveness-issues).
`DECOMMISSIONING` | Node is in the [process of decommissioning](remove-nodes.html#how-it-works).
`DECOMMISSIONED` | Node has been decommissioned for permanent removal from the cluster.
`DECOMMISSIONED` | Node has completed decommissioning, has been stopped, and has not [updated its liveness record](cluster-setup-troubleshooting.html#node-liveness-issues) for 5 minutes.
`DEAD` | Node has not [updated its liveness record](cluster-setup-troubleshooting.html#node-liveness-issues) for 5 minutes.

{{site.data.alerts.callout_info}}
@@ -81,7 +81,7 @@ The following details are also shown.
Column | Description
-------|------------
Node Count | Number of nodes in the locality.
Nodes | Nodes are grouped by locality and displayed with their address. Click the address to view node statistics. Hover over a row and click **Logs** to see the node's log.
Nodes | Nodes are grouped by locality and displayed with their address and node ID (the ID is the number prefixed with `n`). Click the address to view node statistics. Hover over a row and click **Logs** to see the node's log.
Uptime | Amount of time the node has been running.
Replicas | Number of replicas on the node or in the locality.
Capacity Usage | Percentage of usable disk space occupied by CockroachDB data on the node or in the locality. See [Capacity metrics](#capacity-metrics).
@@ -91,12 +91,12 @@ Version | Build tag of the CockroachDB version installed on the node.

### Decommissioned Nodes

Nodes that have recently been decommissioned for permanent removal from the cluster are listed in the table of **Recently Decommissioned Nodes**. You can see the full history of decommissioned nodes by clicking "View all decommissioned nodes".
Nodes that have been [decommissioned](remove-nodes.html#how-it-works) will be listed in the table of **Recently Decommissioned Nodes**, indicating that they are removed from the cluster. You can see the full history of decommissioned nodes by clicking "View all decommissioned nodes".

<img src="{{ 'images/v20.2/admin-ui-decommissioned-nodes.png' | relative_url }}" alt="CockroachDB Admin UI node list" style="border:1px solid #eee;max-width:100%" />

{{site.data.alerts.callout_info}}
When you [decommission a node](remove-nodes.html), CockroachDB lets the node finish in-flight requests, rejects any new requests, and transfers all range replicas and range leases off the node so that it can be safely shut down.
When you initiate the [decommissioning process](remove-nodes.html#how-it-works) on a node, CockroachDB transfers all range replicas and range leases off the node so that it can be safely shut down.
{{site.data.alerts.end}}

## Node Map (Enterprise)
43 changes: 28 additions & 15 deletions v20.2/cockroach-node.md
@@ -8,16 +8,17 @@ key: view-node-details.html

To view details for each node in the cluster, use the `cockroach node` [command](cockroach-commands.html) with the appropriate subcommands and flags.

The `cockroach node` command is also used in the process of decommissioning nodes for permanent removal. See [Remove Nodes](remove-nodes.html) for more details.
The `cockroach node` command is also used in the process of decommissioning nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.

## Subcommands

Subcommand | Usage
-----------|------
`ls` | List the ID of each node in the cluster, excluding those that have been decommissioned and are offline.
`status` | View the status of one or all nodes, excluding nodes that have been decommissioned and taken offline. Depending on flags used, this can include details about range/replicas, disk usage, and decommissioning progress.
`decommission` | Decommission nodes for permanent removal. See [Remove Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that were accidentally decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`decommission` | Decommission nodes for removal from the cluster. See [Decommission Nodes](remove-nodes.html) for more details.
`recommission` | Recommission nodes that have been decommissioned. See [Recommission Nodes](remove-nodes.html#recommission-nodes) for more details.
`drain` | Drain nodes of SQL clients and [distributed SQL](architecture/sql-layer.html#distsql) queries, and prevent ranges from rebalancing onto the node. This is usually done prior to [stopping the node](cockroach-quit.html).

## Synopsis

@@ -75,6 +76,12 @@ Recommission nodes:
$ cockroach node recommission <node IDs> <flags>
~~~

Drain nodes:

~~~ shell
$ cockroach node drain <flags>
~~~

View help:

~~~ shell
@@ -114,7 +121,13 @@ The `node decommission` subcommand also supports the following general flag:

Flag | Description
-----|------------
`--wait` | When to return to the client. Possible values: `all`, `none`.<br><br>If `all`, the command returns to the client only after all specified nodes are fully decommissioned. If any specified nodes are offline, the command will not return to the client until those nodes are back online.<br><br>If `none`, the command does not wait for decommissioning to finish; it returns to the client after starting the decommissioning process on all specified nodes that are online. Any specified nodes that are offline will automatically be marked as decommissioned; if they come back online, the cluster will recognize this status and will not rebalance data to the nodes.<br><br>**Default:** `all`
`--wait` | When to return to the client. Possible values: `all`, `none`.<br><br>If `all`, the command returns to the client only after all replicas on all specified nodes have been transferred to other nodes. If any specified nodes are offline, the command will not return to the client until those nodes are back online.<br><br>If `none`, the command does not wait for the decommissioning process to complete; it returns to the client after starting the decommissioning process on all specified nodes that are online. Any specified nodes that are offline will automatically be marked as decommissioning; if they come back online, the cluster will recognize this status and will not rebalance data to the nodes.<br><br>**Default:** `all`
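With `--wait=none`, the command returns before replicas have moved, so progress has to be checked separately. The sketch below assumes that `cockroach node status --decommission --format=tsv` prints a header row containing `id` and `gossiped_replicas` columns. The helper name `replicas_left`, the node ID, and the connection flags in the usage comment are illustrative assumptions.

```shell
# Hypothetical helper: read tab-separated `node status --decommission` output
# on stdin and print the gossiped_replicas value for the given node ID.
# Columns are located by header name rather than by position.
replicas_left() {
  awk -F'\t' -v id="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["id"]) == id { print $(col["gossiped_replicas"]) }
  '
}

# Usage (assumes an insecure cluster at localhost:26257 and node ID 4):
#   cockroach node decommission 4 --wait=none --insecure --host=localhost:26257
#   cockroach node status --decommission --format=tsv --insecure \
#     --host=localhost:26257 | replicas_left 4
```

Once the helper prints `0` for the node, all of its replicas have been transferred, which matches the `gossiped_replicas` description in the response tables.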

The `node drain` subcommand also supports the following general flag:

Flag | Description
-----|------------
`--drain-wait` | Amount of time to wait for the node to drain before returning to the client. <br><br>**Default:** `10m`
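As a rough illustration of the `--drain-wait` flag, the wrapper below assembles a `cockroach node drain` invocation with a custom timeout. The function name `drain_node`, the `DRY_RUN` switch, and the use of `--insecure` with `--host` are assumptions made for this sketch. A secure cluster would pass `--certs-dir` instead of `--insecure`.

```shell
# Hypothetical wrapper around `cockroach node drain`.
# Usage: drain_node <host:port> [drain_wait]
# Set DRY_RUN=1 to print the command instead of running it.
drain_node() {
  dn_host="$1"
  dn_wait="${2:-10m}"   # matches the documented default for --drain-wait

  # Build the argument list once so the printed and executed forms are identical.
  set -- cockroach node drain --host="$dn_host" --drain-wait="$dn_wait" --insecure

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For instance, `DRY_RUN=1 drain_node localhost:26257 5m` prints the command that would be run with a 5-minute drain timeout.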

### Client connection

@@ -160,29 +173,29 @@ Field | Description
`system_bytes` | The amount of data used just by the CockroachDB system.<br><br>**Required flag:** `--stats` or `--all`
`is_available` | If `true`, the node is currently available.<br><br>**Required flag:** None
`is_live` | If `true`, the node is currently live. <br><br>For unavailable clusters (with an unresponsive Admin UI), running the `node status` command and monitoring the `is_live` field is the only way to identify the live nodes in the cluster. However, you need to run the `node status` command on a live node to identify the other live nodes in an unavailable cluster. Figuring out a live node to run the command is a trial-and-error process, so run the command against each node until you get one that responds. <br><br> See [Identify live nodes in an unavailable cluster](#identify-live-nodes-in-an-unavailable-cluster) for more details. <br><br>**Required flag:** None
`gossiped_replicas` | The number of replicas on the node that are active members of a range. After decommissioning, this should be 0.<br><br>**Required flag:** `--decommission` or `--all`
`is_decommissioning` | If `true`, the node is marked for [decommissioning](remove-nodes.html).<br><br>**Required flag:** `--decommission` or `--all`
`is_draining` | If `true`, the range replicas and range leases are being moved off the node. This happens when a live node is being [decommissioned](remove-nodes.html).<br><br>**Required flag:** `--decommission` or `--all`
`gossiped_replicas` | The number of replicas on the node that are active members of a range. After the decommissioning process completes, this should be 0.<br><br>**Required flag:** `--decommission` or `--all`
`is_decommissioning` | If `true`, the node's range replicas are being transferred to other nodes. This happens when a live node is marked for [decommissioning](remove-nodes.html).<br><br>**Required flag:** `--decommission` or `--all`
`is_draining` | If `true`, the node is being drained of in-flight SQL connections, new SQL connections are rejected, and the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status. This happens when a live node is being [stopped](cockroach-quit.html).<br><br>**Required flag:** `--decommission` or `--all`
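The readiness behavior described for `is_draining` can be observed from outside the node by probing the `/health?ready=1` endpoint. The sketch below is an assumption-laden example: it presumes an insecure node serving HTTP on `localhost:8080`, and the function names are hypothetical. Only the `503` response is taken from the text above. Treating `200` as ready is an assumption of this sketch.

```shell
# Map an HTTP status code from /health?ready=1 to a readiness label.
readiness_from_status() {
  case "$1" in
    200) echo "ready" ;;       # assumed: node accepts SQL connections
    503) echo "not-ready" ;;   # per the text: returned while the node drains
    *)   echo "unknown" ;;     # connection failure or unexpected status
  esac
}

# Probe a node's readiness endpoint and print the label.
# Usage: node_readiness [host:port]   (defaults to the assumed localhost:8080)
node_readiness() {
  nr_code=$(curl -s -o /dev/null -w '%{http_code}' \
    "http://${1:-localhost:8080}/health?ready=1")
  readiness_from_status "$nr_code"
}
```

A load balancer health check performs essentially the same probe, which is why a draining node stops receiving new client traffic.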

### `node decommission`

Field | Description
------|------------
`id` | The ID of the node.
`is_live` | If `true`, the node is live.
`replicas` | The number of replicas on the node that are active members of a range. After decommissioning, this should be 0.
`is_decommissioning` | If `true`, the node is marked for [decommissioning](remove-nodes.html)
`is_draining` | If `true`, the range replicas and range leases are being moved off the node. This happens when a live node is being [decommissioned](remove-nodes.html).
`replicas` | The number of replicas on the node that are active members of a range. After the decommissioning process completes, this should be 0.
`is_decommissioning` | If `true`, the node's range replicas are being transferred to other nodes. This happens when a live node is marked for [decommissioning](remove-nodes.html).
`is_draining` | If `true`, the node is being drained of in-flight SQL connections, new SQL connections are rejected, and the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status. This happens when a live node is being [stopped](cockroach-quit.html).

### `node recommission`

Field | Description
------|------------
`id` | The ID of the node.
`is_live` | If `true`, the node is live.
`replicas` | The number of replicas on the node that are active members of a range. After decommissioning, this should be 0.
`is_decommissioning` | If `true`, the node is marked for [decommissioning](remove-nodes.html).
`is_draining` | If `true`, the range replicas and range leases are being moved off the node. This happens when a live node is being [decommissioned](remove-nodes.html).
`replicas` | The number of replicas on the node that are active members of a range. After the decommissioning process completes, this should be 0.
`is_decommissioning` | If `true`, the node's range replicas are being transferred to other nodes. This happens when a live node is marked for [decommissioning](remove-nodes.html).
`is_draining` | If `true`, the node is being drained of in-flight SQL connections, new SQL connections are rejected, and the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) starts returning a `503 Service Unavailable` status. This happens when a live node is being [stopped](cockroach-quit.html).

## Examples

@@ -286,7 +299,7 @@ You need to run the `node status` command on a live node to identify the other l

### Decommission nodes

See [Remove Nodes](remove-nodes.html)
See [Decommission Nodes](remove-nodes.html)

### Recommission nodes

@@ -295,4 +308,4 @@ See [Recommission Nodes](remove-nodes.html#recommission-nodes)
## See also

- [Other Cockroach Commands](cockroach-commands.html)
- [Remove Nodes](remove-nodes.html)
- [Decommission Nodes](remove-nodes.html)
28 changes: 18 additions & 10 deletions v20.2/cockroach-quit.md
@@ -6,9 +6,19 @@ redirect_from: stop-a-node.html
key: stop-a-node.html
---

This page shows you how to use the `cockroach quit` [command](cockroach-commands.html) to temporarily stop a node that you plan to restart, for example, during the process of [upgrading your cluster's version of CockroachDB](upgrade-cockroach-version.html) or to perform planned maintenance (e.g., upgrading system software).
{{site.data.alerts.callout_danger}}
`cockroach quit` is deprecated. To stop a node, it's best to first run [`cockroach node drain`](cockroach-node.html) and then do one of the following:

For information about permanently removing nodes to downsize a cluster or react to hardware failures, see [Remove Nodes](remove-nodes.html).
{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}
{{site.data.alerts.end}}

This page shows you how to use the `cockroach quit` [command](cockroach-commands.html) to temporarily stop a node that you plan to restart.

You might do this, for example, during the process of [upgrading your cluster's version of CockroachDB](upgrade-cockroach-version.html) or to perform planned maintenance (e.g., upgrading system software).

{{site.data.alerts.callout_info}}
In other scenarios, such as when downsizing a cluster or reacting to hardware failures, it's best to remove nodes from your cluster entirely. For information about this, see [Decommission Nodes](remove-nodes.html).
{{site.data.alerts.end}}

## Overview

@@ -17,9 +27,7 @@ For information about permanently removing nodes to downsize a cluster or react
When you stop a node, it performs the following steps:

- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `server.shutdown.query_wait` [cluster setting](cluster-settings.html).
- Transfers all **range leases** and Raft leadership to other nodes.
- Gossips its draining state to the cluster, so that other nodes do not try to distribute query planning to the draining node, and no leases are transferred to the draining node. Note that this is a best effort that times out after the duration specified by the `server.shutdown.drain_wait` [cluster setting](cluster-settings.html), so other nodes may not receive the gossip info in time.
- No new ranges are transferred to the draining node, to avoid a possible loss of quorum after the node shuts down.
- Gossips its draining state to the cluster, so that other nodes do not try to distribute query planning to the draining node. Note that this is a best effort that times out after the duration specified by the `server.shutdown.drain_wait` [cluster setting](cluster-settings.html), so other nodes may not receive the gossip info in time.

If the node then stays offline for a certain amount of time (5 minutes by default), the cluster considers the node dead and starts to transfer its **range replicas** to other nodes as well.

@@ -29,7 +37,6 @@ Basic terms:

- **Range**: CockroachDB stores all user data and almost all system data in a giant sorted map of key value pairs. This keyspace is divided into "ranges", contiguous chunks of the keyspace, so that every key can always be found in a single range.
- **Range Replica:** CockroachDB replicates each range (3 times by default) and stores each replica on a different node.
- **Range Lease:** For each range, one of the replicas holds the "range lease". This replica, referred to as the "leaseholder", is the one that receives and coordinates all read and write requests for the range.

### Considerations

@@ -57,7 +64,8 @@ The `quit` command supports the following [general-use](#general), [client conne

Flag | Description
-----|------------
`--decommission` | If specified, the node will be permanently removed instead of temporarily stopped. See [Remove Nodes](remove-nodes.html) for more details.
`--decommission` | If specified, the node will be removed from the cluster instead of temporarily stopped. <br><br>The `--decommission` flag is deprecated. If you want to remove a node from the cluster, start with the [`cockroach node decommission`](cockroach-node.html) command. See [Decommission Nodes](remove-nodes.html) for more details.
`--drain-wait` | Amount of time to wait for the node to drain before stopping the node. See [`cockroach node drain`](cockroach-node.html) for more details.<br><br>**Default:** `10m`

### Client connection

@@ -109,7 +117,7 @@ If you need to troubleshoot this command's behavior, you can change its [logging
2. Create a `certs` directory and copy the CA certificate and the client certificate and key for the `root` user into the directory.
3. Run the `cockroach quit` command without the `--decommission` flag:
3. Run the `cockroach quit` command:
{% include copy-clipboard.html %}
~~~ shell
@@ -120,7 +128,7 @@ If you need to troubleshoot this command's behavior, you can change its [logging
<section class="filter-content" markdown="1" data-scope="insecure">
1. [Install the `cockroach` binary](install-cockroachdb.html) on a machine separate from the node.
2. Run the `cockroach quit` command without the `--decommission` flag:
2. Run the `cockroach quit` command:
{% include copy-clipboard.html %}
~~~ shell
@@ -131,5 +139,5 @@ If you need to troubleshoot this command's behavior, you can change its [logging
## See also
- [Other Cockroach Commands](cockroach-commands.html)
- [Permanently Remove Nodes from a Cluster](remove-nodes.html)
- [Decommission Nodes](remove-nodes.html)
- [Upgrade a Cluster's Version](upgrade-cockroach-version.html)