
revise node decommissioning guidance and semantics #7304

Merged · 7 commits into master from decommission-nodes · Jun 16, 2020

Conversation

@taroface (Contributor) commented May 7, 2020

Fixes #6935.
Fixes #7384.
Fixes #7422.

  • Updated node decommissioning guidance to use cockroach node decommission over cockroach quit --decommission in 20.1, 19.2, 19.1.
  • Introduced technical/semantic distinction between "decommissioning" and "decommissioned" nodes. Made some related corrections.
  • Updated node decommissioning guidance to explicitly require stopping a node after decommissioning.
  • Documented cockroach node drain and --drain-wait flag.
  • Removed references to cockroach quit in 20.1 node decommissioning doc. Added cockroach quit deprecation announcement in 20.1.
  • Corrected node recommissioning guidance.
  • Added "restarting a decommissioned node" section.
  • Updated & corrected how this is represented by the Admin UI.
  • New Admin UI flows and images for 20.1.

@taroface requested a review from jseldess May 7, 2020 23:37

@tbg (Member) commented May 19, 2020

Looks good from a high level. I'm not sure I left the comments about cockroach quit in the right places (I only realized later that there are multiple copies of the docs for the different versions), so please disregard them in that case. You are aware that we don't want to recommend quit in 20.1+, so if that's what the docs do, that's great.

@tbg (Member) left a comment


And sorry it took so long.

@@ -1,4 +1,4 @@
- By default, if a node stays offline for more than 5 minutes, the cluster will consider it dead and will rebalance its data to other nodes. Before temporarily stopping nodes for planned maintenance (e.g., upgrading system software), if you expect any nodes to be offline for longer than 5 minutes, you can prevent the cluster from unnecessarily rebalancing data off the nodes by increasing the `server.time_until_store_dead` [cluster setting](cluster-settings.html) to match the estimated maintenance window.
+ By default, if a node stays offline for more than 5 minutes, the cluster will consider it dead and will rebalance its data to other nodes. Before [temporarily stopping nodes](../../cockroach-quit.html) for planned maintenance (e.g., upgrading system software), if you expect any nodes to be offline for longer than 5 minutes, you can prevent the cluster from unnecessarily rebalancing data off the nodes by increasing the `server.time_until_store_dead` [cluster setting](cluster-settings.html) to match the estimated maintenance window.

For example, let's say you want to maintain a group of servers, and the nodes running on the servers may be offline for up to 15 minutes as a result. Before shutting down the nodes, you would change the `server.time_until_store_dead` cluster setting as follows:
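The code block that follows this sentence in the doc is not shown in the diff view. As a hedged illustration of the change described, assuming a 15-minute window and placeholder connection flags:

~~~ shell
# Raise the "dead" threshold to cover the 15-minute maintenance window.
cockroach sql --insecure --host=<address of any live node> \
  --execute="SET CLUSTER SETTING server.time_until_store_dead = '15m0s';"
~~~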

For the "resetting back to default" (further below, sorry, can't comment there) I think we'd want to advertise RESET CLUSTER SETTING x instead.


- When you decommission a node, CockroachDB lets the node finish in-flight requests, rejects any new requests, and transfers all range replicas and range leases off the node so that it can be safely shut down. See [Remove Nodes](remove-nodes.html) for more information.
+ When you initiate the [decommissioning process](remove-nodes.html#how-it-works) on a node, CockroachDB lets the node finish in-flight requests, rejects any new requests, and transfers all range replicas and range leases off the node so that it can be safely shut down.

Neither the old nor the new one are true. Decommissioning simply moves all data off the node (which includes moving all range leases). SQL sessions keep working forever, until the node is actually shut down. Shutting down the node gracefully will then let the node finish in-flight requests and drain clients etc.

You might do this, for example, when downsizing a cluster or reacting to hardware failures.

{{site.data.alerts.callout_info}}
In other scenarios, such as when [upgrading your cluster's version of CockroachDB](upgrade-cockroach-version.html) or performing planned maintenance (e.g., upgrading system software), you will want to temporarily [stop the node](stop-a-node.html) and restart it later.

Might wanna call out explicitly that that is not what decommissioning is for.

1. The node has completed the decommissioning process.
2. The node has been stopped and has not [updated its liveness record](cluster-setup-troubleshooting.html#node-liveness-issues) for 5 minutes.

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. This behavior has been fixed in v20.1.
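As a hedged way to observe the endpoint behavior described above (the address is a placeholder; 8080 is the default HTTP port):

~~~ shell
# Readiness probe used by load balancers; on a decommissioning 19.x node
# this is expected to return HTTP 503.
curl -i http://<node address>:8080/health?ready=1
~~~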

what has been fixed in v20.1?


During and after decommissioning, the node continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. This behavior has been fixed in v20.1.
After all range replicas have been transferred, it's typical to stop the node via a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use the [`cockroach quit`](stop-a-node.html) command to stop the node. When stopped, the node is drained of in-flight SQL connections and new SQL connections are rejected. At this point, the node is decommissioned.
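As a hedged sketch of the manual stop described above (the process ID is a placeholder; under a process manager you would stop the service unit instead):

~~~ shell
# Send SIGTERM to the cockroach process to begin a graceful shutdown.
kill -TERM <pid of the cockroach process>
~~~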

Let's not mention ./cockroach quit since we wish to get rid of it.


At this point, the node is decommissioned.

You say earlier that it needs to meet two criteria above, but it will take five minutes until it meets the second.

@@ -225,19 +277,19 @@ Select the **Replication** dashboard, and hover over the **Replicas per Store**

### Step 3. Decommission the nodes

- SSH to any live node in the cluster and run the [`cockroach node decommission`](view-node-details.html) command with the IDs of the nodes to officially decommission:
+ SSH to any live node in the cluster and run the [`cockroach node decommission`](cockroach-node.html) command with the IDs of the nodes to decommission:
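For reference, the shape of the command in the updated guidance (node IDs and flags are placeholders; see also the --host discussion later in this review, which removes the need to SSH to the nodes themselves):

~~~ shell
# Decommission nodes 4 and 5 by ID, connecting to any live node.
cockroach node decommission 4 5 --insecure --host=<address of any live node>
~~~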

Ditto


- For each decommissioned node, SSH to the machine running the node and execute the `cockroach quit` command:
+ For each decommissioning node, SSH to the machine running the node and execute the `cockroach quit` command:

Ditto

At this point, the nodes are decommissioned and will no longer appear in timeseries graphs unless you are viewing a time range during which the nodes were live. However, they will never disappear from the **Decommissioned Nodes** list.

{{site.data.alerts.callout_info}}
If you want to utilize a decommissioned node again, first [recommission](#recommission-nodes) the node to have the cluster rebalance data to the node. Then [restart](start-a-node.html) the node so that it accepts new SQL connections.

Ditto

@@ -394,6 +456,11 @@ $ cockroach node status --decommission --insecure --host=<address of any live no
(5 rows)
~~~

- `is_decommissioning` will be `true` for a node that is undergoing or has completed the decommissioning process. This indicates that replicas are being or have been transferred to other nodes.
- `is_draining` will be `true` for a node that has been stopped. This indicates that SQL connections are no longer being accepted by the node.

Nodes may be stopped and is_draining == false as well. Just in case this may cause confusion.

@@ -4,9 +4,13 @@ summary: Learn how to temporarily stop a CockroachDB node.
toc: true
---

- This page shows you how to use the `cockroach quit` [command](cockroach-commands.html) to temporarily stop a node that you plan to restart, for example, during the process of [upgrading your cluster's version of CockroachDB](upgrade-cockroach-version.html) or to perform planned maintenance (e.g., upgrading system software).
+ This page shows you how to use the `cockroach quit` [command](cockroach-commands.html) to temporarily stop a node that you plan to restart.

Ditto about quit

@taroface (Contributor, author) left a comment



_includes/v20.1/faq/planned-maintenance.md, line 3 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

For the "resetting back to default" (further below, sorry, can't comment there) I think we'd want to advertise RESET CLUSTER SETTING x instead.

Done.


v19.1/admin-ui-cluster-overview-page.md, line 58 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Neither the old nor the new one are true. Decommissioning simply moves all data off the node (which includes moving all range leases). SQL sessions keep working forever, until the node is actually shut down. Shutting down the node gracefully will then let the node finish in-flight requests and drain clients etc.

Oops, right. Done!


v19.1/remove-nodes.md, line 12 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Might wanna call out explicitly that that is not what decommissioning is for.

Done.


v19.1/remove-nodes.md, line 24 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what has been fixed in v20.1?

Done. Clarified that the health endpoint behavior was fixed.


v19.1/remove-nodes.md, line 26 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

At this point, the node is decommissioned.

You say earlier that it needs to meet two criteria above, but it will take five minutes until it meets the second.

Thank you! Fixed.


v19.1/remove-nodes.md, line 47 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You say the same thing just below so I'm confused

Clarified the wording here (and in a following example). It means that adding the extra node won't fix a hung decommissioning process, and that you have to recommission the hung node first.


v19.1/remove-nodes.md, line 75 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm same question here. Maybe I'm just misunderstanding how this renders (the HTML previews are 404)

Done. See above comment.


v19.1/remove-nodes.md, line 115 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is a bit weird - we usually do SSH to the node because it's easiest from there to run commands against that node. But if you have the host, and it's reachable from wherever you are, then you can run the command from there instead via the --host flag. You specify that below, so it's weird that redundantly you ask users to ssh to a machine. I personally would focus on the --host version and avoid recommending to ssh to a machine that runs CRDB since that isn't even always possible (think k8s)

Done. I had been confused by this and should have asked about it.


v19.1/remove-nodes.md, line 169 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Let's avoid cockroach quit.

cockroach quit is removed from this doc starting with 20.1.


v19.1/remove-nodes.md, line 192 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In the future this will be impossible unless the data dir is wiped (i.e. it counts as a new node). May want to incorporate this in the docs already (it doesn't hurt to do this today already)

Done, added to "Recommission nodes" subheading.


v19.1/remove-nodes.md, line 214 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 246 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 280 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 338 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 361 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 460 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Nodes may be stopped and is_draining == false as well. Just in case this may cause confusion.

@tbg I'm not quite following here. In what situations will a stopped node have is_draining == false?


v19.1/stop-a-node.md, line 7 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto about quit

I've added a hard-to-miss callout at the top of the 20.1 article that provides the latest guidance on stopping a node. When #6951 is complete I'll link to that article instead.

@taroface (Contributor, author) commented
TFTR, @tbg! I had one question for you above.

@tbg self-requested a review June 3, 2020 12:37
@tbg (Member) left a comment




v19.1/remove-nodes.md, line 460 at r1 (raw file):
is_draining is probably true for a node that has been stopped, but it doesn't have to be that way. For example, if you hard-kill a node, it will never mark itself down as draining. A more correct statement is

is_draining == true implies that the node is no longer accepting SQL connections. It is either in the process of shutting down or has already done so.

@taroface requested a review from knz June 9, 2020 19:49
@taroface (Contributor, author) commented Jun 9, 2020

@knz Tobias has reviewed the node decommissioning changes.

Requesting your review as I've also folded your updates from cockroachdb/cockroach#47692 into this PR. In addition, I have updated the node recommissioning guidance as per this thread.

FYI cockroach quit is removed from the guidance starting in 20.1, but remains in 19.2 and 19.1.

@knz (Contributor) left a comment




v19.1/remove-nodes.md, line 24 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Done. Clarified that the health endpoint behavior was fixed.

I don't exactly know if that's true?

The /health?ready=0 endpoint reports the node as live (but I think it did that already)

The /health?ready=1 endpoint, hopefully reports the node as non-ready (that's what it should do) - because you want traffic to be redirected to other nodes I think?


v19.1/remove-nodes.md, line 26 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Thank you! Fixed.

change "5 minutes" to "the configured value of .... (the cluster setting)"


v19.1/remove-nodes.md, line 47 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Clarified the wording here (and in a following example). It means that adding the extra node won't fix a hung decommissioning process, and that you have to recommission the hung node first.

I am still confused by the text, but now maybe Tobias knows something I don't.

In my book, if there are 3 nodes and you start decommissioning 1 of them, then the moment you add a 4th node, the decommissioning node will proceed to decommission successfully.

The callout's suggestion to "recommission" the node is, in that view, incorrect.

Or is this behavior specific to 19.1 and has been fixed in a later version?


v19.1/remove-nodes.md, line 75 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Done. See above comment.

see my response above.


v19.1/remove-nodes.md, line 22 at r4 (raw file):

1. The node has completed the decommissioning process.
2. The node has been stopped and has not [updated its liveness record](architecture/replication-layer.html#epoch-based-leases-table-data) for 5 minutes.

replace "for 5 minutes" with "for the duration configured via .... which defaults to 5 minutes"


v19.1/remove-nodes.md, line 94 at r4 (raw file):

- Before decommissioning each node verify that there are no [underreplicated or unavailable ranges](admin-ui-replication-dashboard.html).
- If you have a decommissioning node that appears to be hung, you can [recommission](#recommission-nodes) or [restart](#restart-decommissioned-nodes) the node. If you notice any issues persisting, [contact our support team](support-resources.html).

see my comment above


v19.1/remove-nodes.md, line 350 at r4 (raw file):

</div>

In about 5 minutes, you'll see the nodes move to the **Decommissioned Nodes** list.

change "5 minutes" to "the configured value of ..."


v19.1/remove-nodes.md, line 357 at r4 (raw file):

{{site.data.alerts.callout_info}}
If you want to utilize a decommissioned node again, first [recommission](#recommission-nodes) the node to have the cluster rebalance data to the node. Then [restart](start-a-node.html) the node so that it accepts new SQL connections.

I think the best practice here is to never reuse a decommissioned node again.
In fact, we're going to remove this option completely starting in 20.2.

What's the point of keeping this option documented here?


v19.1/remove-nodes.md, line 427 at r4 (raw file):

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

After about 5 minutes, the stopped nodes will be considered [decommissioned](#how-it-works).

"5 minutes"


v19.1/stop-a-node.md, line 18 at r4 (raw file):

- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `server.shutdown.query_wait` [cluster setting](cluster-settings.html).
- Transfers all **range leases** and Raft leadership to other nodes.

I may have missed something - why did you choose to remove the explanation about range leases? From a technical perspective, that part of it remains true.
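For context on the quoted hunk above, a hedged sketch of tuning the drain timeout it references (the value is illustrative):

~~~ shell
# Lengthen the best-effort window for finishing in-flight requests on shutdown.
cockroach sql --insecure --host=<address of any live node> \
  --execute="SET CLUSTER SETTING server.shutdown.query_wait = '30s';"
~~~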

@taroface (Contributor, author) left a comment




v19.1/remove-nodes.md, line 24 at r1 (raw file):
@knz The current 20.1 docs state that:

During and after decommissioning, the node continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the /health?ready=1 monitoring endpoint continues to consider the node "ready" so load balancers can continue directing traffic to the node.

Is this incorrect? I took it to mean that since the node will accept SQL connections until it is drained, the health endpoint should still indicate "ready". (And then becomes "unready" when cockroach node drain is run?)
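For reference, the shape of the 20.1 drain command this PR documents, as a hedged sketch (the address and wait value are placeholders):

~~~ shell
# Drain the node of SQL clients and leases, waiting up to 10 minutes
# (--drain-wait is the flag documented in this PR).
cockroach node drain --insecure --host=<address of node to drain> --drain-wait=10m
~~~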


v19.1/remove-nodes.md, line 26 at r1 (raw file):

Previously, knz (kena) wrote…

change "5 minutes" to "the configured value of .... (the cluster setting)"

Done.


v19.1/remove-nodes.md, line 47 at r1 (raw file):

Previously, knz (kena) wrote…

I am still confused by the text, but now maybe Tobias knows something I don't.

In my book, if there are 3 nodes, and you start decommissioning 1 of them, then the moment you add a 4th node the 3rd will proceed to decommission syccessfully.

The callout's suggestion to "recommission" the node is, in that view, incorrect.

Or is this behavior specific to 19.1 and has been fixed in a later version?

The 20.2 alpha behaves as you describe, so maybe the original doc was incorrect on this. @jseldess do you know if the currently documented behavior (decommissioning 1 node in a 3-node cluster will hang, unless you recommission that node, add a 4th node, and decommission again) was true of an earlier version? I'm able to simply add the 4th node to complete the decommissioning.


v19.1/remove-nodes.md, line 460 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

is_draining is probably true for a node that has been stopped, but it doesn't have to be that way. For example, if you hard-kill a node, it will never mark itself down as draining. A more correct statement is

is_draining == true implies that the node is no longer accepting SQL connections. It is either in the process of shutting down or has already done so.

Done.


v19.1/remove-nodes.md, line 22 at r4 (raw file):

Previously, knz (kena) wrote…

replace "for 5 minutes" with "for the duration configured via .... which defaults to 5 minutes"

Done, assuming this is server.time_until_store_dead.


v19.1/remove-nodes.md, line 350 at r4 (raw file):

Previously, knz (kena) wrote…

change "5 minutes" to "the configured value of ..."

Done.


v19.1/remove-nodes.md, line 427 at r4 (raw file):

Previously, knz (kena) wrote…

"5 minutes"

Done.


v19.1/stop-a-node.md, line 18 at r4 (raw file):

Previously, knz (kena) wrote…

I may have missed something - why did you choose to remove the explanation about range leases? From a technical perspective, that part of it remains true.

I removed "leases" for the reason Andrei gave here, on Jesse's prior update to this doc: #6825 (review)

@taroface (Contributor, author) left a comment




v19.1/remove-nodes.md, line 357 at r4 (raw file):

Previously, knz (kena) wrote…

I think the best practice here is to never reuse a decommissioned node again.
In fact, we're going to remove this option completely starting in 20.2.

What's the point to keep this option documented here?

Noted - I have removed this option and references to restarting/reusing decommissioned nodes.

@tbg (Member) commented Jun 12, 2020 via email

@jseldess (Contributor) commented
Looks like that note was added to the 19.2 docs by @taroface: #6402 to address #4915. It looks like this behavior was reported by @robert-s-lee and confirmed by @knz. I'm not sure if @taroface or another writer tested this as well. I'll test across the last bunch of versions and report.

@knz (Contributor) commented Jun 12, 2020

I'm looking at that other PR. In that PR I see two conflicting sentences:

If you try to decommission a node, the process will hang indefinitely because the cluster cannot move the decommissioning node's replicas to the other 2 nodes, which already have a replica of each range:

This sentence is correct.

then the callout:

Adding a 4th node to the cluster at this point will neither enable the decommissioning process to complete nor change the Suspect node status. You can recommission the node to return it to a healthy state.

I believe this sentence is incorrect.

The thing is, I'm not sure. We have changed a few things in the code around this multiple times in the past. It may be that this sentence was correct at some point, and it is not any more (or vice-versa).

@jseldess (Contributor) commented
Yeah, in that issue, @robert-s-lee reported: once the cluster is in this state, when a 4th node is added, the suspect flag does not go away. I believe this is referring to the "Suspect" indication in the Admin UI. We may have misunderstood the issue here. I'll try to test soon.

@jseldess (Contributor) left a comment


I tested this on 19.1. I'm pretty certain this is all about the suspect status in the UI. It works as expected from a server perspective. If you have 3 nodes and decommission one, decommissioning will hang because there's no other place for the data to be moved, and the node will be "suspect" in the UI. When you add a fourth node, the decommission resumes and completes fine. The decommissioned node, however, remains "suspect" in the UI until the node is stopped. Only at that point (or 5 min later) does the node move to the "Decommissioned Nodes" table. (A minimal repro sketch follows at the end of this comment.)

We need to completely remove the note that says:

Adding a 4th node to the cluster at this point will neither enable the decommissioning process to complete nor change the Suspect node status. You can recommission the node to return it to a healthy state.

In addition, if we do anything else right now, I think we may want to look for ways to clarify that, in the UI, a fully decommissioned node will continue to be shown as "suspect" until the node is stopped. That's counterintuitive; if we have a "Decommissioned Nodes" list, it should show nodes that have been decommissioned, whether they are live or not. @knz tells me he and @irfansharif are working on this for 20.2.
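A minimal local repro of the behavior described above, as a hedged sketch (a throwaway insecure cluster; store paths and ports are illustrative, and the decommission command blocks while waiting, so run it in its own terminal):

~~~ shell
# Start a throwaway insecure 3-node cluster; node IDs here match start order.
for i in 1 2 3; do
  cockroach start --insecure --store=node$i \
    --listen-addr=localhost:$((26256+i)) --http-addr=localhost:$((8079+i)) \
    --join=localhost:26257,localhost:26258,localhost:26259 --background
done
cockroach init --insecure --host=localhost:26257

# Decommissioning node 3 hangs: each range already has a replica
# on the other two nodes, so there is nowhere to move the data.
cockroach node decommission 3 --insecure --host=localhost:26257

# Adding a 4th node lets the stalled decommission resume and complete.
cockroach start --insecure --store=node4 \
  --listen-addr=localhost:26260 --http-addr=localhost:8083 \
  --join=localhost:26257,localhost:26258,localhost:26259 --background
~~~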


@jseldess (Contributor) commented
On a side note, I'm not sure it makes sense to categorize a decommissioning or decommissioned node as "suspect", which indicates something potentially fishy or unintentional.

@knz (Contributor) commented Jun 12, 2020

I'm not sure it makes sense to categorize a decommissioning or decommissioned node as "suspect"

I think it's because the UI only has 3 states: "good" (green), "off" (red/absent), and yellow (everything else).

It might make sense to add a new indicator color for "node currently being removed", which would catch both decommissioning nodes and nodes being shut down, and keep yellow for "suspect / don't know".

That's squarely UI design though, cc @dhartunian @piyush-singh

@taroface (Contributor, author) commented Jun 15, 2020

I think we may want to look for ways to clarify that, in the UI, a fully decommissioned node will continue to be shown as "suspect" until the node is stopped. That's counterintuitive; if we have a "Decommissioned Nodes" list, it should show nodes that have been decommissioned, whether they are live or not.

In the 20.1 UI, the Node List has a separate status for DECOMMISSIONING and SUSPECT (e.g., liveness not available and/or node was drained but is not decommissioning) nodes. However, the Node Status in the top module still conflates the two as "suspect", which is reflected in our Cluster Overview docs. @piyush-singh @Annebirzin Maybe another word would be less misleading here? Also not sure if the new Overview dashboard will still need to use "suspect" for both statuses?

[screenshot: 20.1 Admin UI Node List showing separate DECOMMISSIONING and SUSPECT statuses]

Once a node is moved into "Recently Decommissioned Nodes" with a DECOMMISSIONED status (e.g., it is drained and shut down), it is no longer counted as "suspect" in the Node Status module.

[screenshot: Recently Decommissioned Nodes list showing a DECOMMISSIONED node]

I think versions 19.1-20.1 of this doc should clarify that a DECOMMISSIONING node will be shown as "suspect" until it becomes DECOMMISSIONED and is moved to the list of Recently Decommissioned Nodes (or "Decommissioned Nodes" in earlier versions). (Will make that change in the next update.)

@taroface requested a review from jseldess June 16, 2020 16:40
@jseldess (Contributor) left a comment


Thanks, @taroface. LGTM, but there are some broken links to fix before merging:

  d4838dd1769bd861fac0b8b2bde286d124964d22/v19.1/remove-nodes.html
    target does not exist --> cockroach-quit.html
    target does not exist --> cockroach-node.html
  d4838dd1769bd861fac0b8b2bde286d124964d22/v19.1/view-node-details.html
    target does not exist --> cockroach-quit.html


@taroface merged commit f1f9d96 into master Jun 16, 2020
@taroface deleted the decommission-nodes branch June 16, 2020 18:02
taroface added a commit that referenced this pull request Jun 16, 2020
taroface added a commit that referenced this pull request Jun 16, 2020
port #7304 node decommissioning updates to 20.2