
revise node decommissioning guidance and semantics #7304

Merged · 7 commits into master from decommission-nodes · Jun 16, 2020

Conversation

@taroface (Contributor) commented May 7, 2020

Fixes #6935.
Fixes #7384.
Fixes #7422.

  • Updated node decommissioning guidance to use cockroach node decommission over cockroach quit --decommission in 20.1, 19.2, 19.1.
  • Introduced technical/semantic distinction between "decommissioning" and "decommissioned" nodes. Made some related corrections.
  • Updated node decommissioning guidance to explicitly require stopping a node after decommissioning.
  • Documented cockroach node drain and --drain-wait flag.
  • Removed references to cockroach quit in 20.1 node decommissioning doc. Added cockroach quit deprecation announcement in 20.1.
  • Corrected node recommissioning guidance.
  • Added "restarting a decommissioned node" section.
  • Updated & corrected how this is represented by the Admin UI.
  • New Admin UI flows and images for 20.1.

@taroface requested a review from jseldess May 7, 2020 23:37

@tbg (Member) commented May 19, 2020

Looks good from a high level. I'm not sure I left the comments about cockroach quit in the right places (I only realized later that there are multiple copies of the docs for the different versions), so please disregard them in that case. You are aware that we don't want to recommend quit in 20.1+, so if that's what the docs do, that's great.

@tbg (Member) left a comment


And sorry it took so long.

@@ -1,4 +1,4 @@
- By default, if a node stays offline for more than 5 minutes, the cluster will consider it dead and will rebalance its data to other nodes. Before temporarily stopping nodes for planned maintenance (e.g., upgrading system software), if you expect any nodes to be offline for longer than 5 minutes, you can prevent the cluster from unnecessarily rebalancing data off the nodes by increasing the `server.time_until_store_dead` [cluster setting](cluster-settings.html) to match the estimated maintenance window.
+ By default, if a node stays offline for more than 5 minutes, the cluster will consider it dead and will rebalance its data to other nodes. Before [temporarily stopping nodes](../../cockroach-quit.html) for planned maintenance (e.g., upgrading system software), if you expect any nodes to be offline for longer than 5 minutes, you can prevent the cluster from unnecessarily rebalancing data off the nodes by increasing the `server.time_until_store_dead` [cluster setting](cluster-settings.html) to match the estimated maintenance window.

For example, let's say you want to maintain a group of servers, and the nodes running on the servers may be offline for up to 15 minutes as a result. Before shutting down the nodes, you would change the `server.time_until_store_dead` cluster setting as follows:
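The code block that follows this sentence in the doc is not shown in the diff view. As a hedged illustration of the change described, assuming a 15-minute window and placeholder connection flags:

~~~ shell
# Raise the "dead" threshold to cover the 15-minute maintenance window.
cockroach sql --insecure --host=<address of any live node> \
  --execute="SET CLUSTER SETTING server.time_until_store_dead = '15m0s';"
~~~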

For the "resetting back to default" (further below, sorry, can't comment there) I think we'd want to advertise RESET CLUSTER SETTING x instead.


- When you decommission a node, CockroachDB lets the node finish in-flight requests, rejects any new requests, and transfers all range replicas and range leases off the node so that it can be safely shut down. See [Remove Nodes](remove-nodes.html) for more information.
+ When you initiate the [decommissioning process](remove-nodes.html#how-it-works) on a node, CockroachDB lets the node finish in-flight requests, rejects any new requests, and transfers all range replicas and range leases off the node so that it can be safely shut down.

Neither the old nor the new one are true. Decommissioning simply moves all data off the node (which includes moving all range leases). SQL sessions keep working forever, until the node is actually shut down. Shutting down the node gracefully will then let the node finish in-flight requests and drain clients etc.

You might do this, for example, when downsizing a cluster or reacting to hardware failures.

{{site.data.alerts.callout_info}}
In other scenarios, such as when [upgrading your cluster's version of CockroachDB](upgrade-cockroach-version.html) or performing planned maintenance (e.g., upgrading system software), you will want to temporarily [stop the node](stop-a-node.html) and restart it later.

Might wanna call out explicitly that that is not what decommissioning is for.

1. The node has completed the decommissioning process.
2. The node has been stopped and has not [updated its liveness record](cluster-setup-troubleshooting.html#node-liveness-issues) for 5 minutes.

The decommissioning process transfers all range replicas on the node to other nodes. During and after this process, the node is considered "decommissioning" and continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. This behavior has been fixed in v20.1.
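As a hedged way to observe the endpoint behavior described above (the address is a placeholder; 8080 is the default HTTP port):

~~~ shell
# Readiness probe used by load balancers; on a decommissioning 19.x node
# this is expected to return HTTP 503.
curl -i http://<node address>:8080/health?ready=1
~~~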

what has been fixed in v20.1?


During and after decommissioning, the node continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. However, note that the [`/health?ready=1` monitoring endpoint](monitoring-and-alerting.html#health-ready-1) considers the node "unready" and returns a `503 Service Unavailable` status response code so load balancers stop directing traffic to the node. This behavior has been fixed in v20.1.
After all range replicas have been transferred, it's typical to stop the node via a process manager or orchestration tool, or by sending `SIGTERM` manually. You can also use the [`cockroach quit`](stop-a-node.html) command to stop the node. When stopped, the node is drained of in-flight SQL connections and new SQL connections are rejected. At this point, the node is decommissioned.
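As a hedged sketch of the manual stop described above (the process ID is a placeholder; under a process manager you would stop the service unit instead):

~~~ shell
# Send SIGTERM to the cockroach process to begin a graceful shutdown.
kill -TERM <pid of the cockroach process>
~~~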

Let's not mention ./cockroach quit since we wish to get rid of it.


At this point, the node is decommissioned.

You say earlier that it needs to meet two criteria above, but it will take five minutes until it meets the second.

@@ -225,19 +277,19 @@ Select the **Replication** dashboard, and hover over the **Replicas per Store**

### Step 3. Decommission the nodes

- SSH to any live node in the cluster and run the [`cockroach node decommission`](view-node-details.html) command with the IDs of the nodes to officially decommission:
+ SSH to any live node in the cluster and run the [`cockroach node decommission`](cockroach-node.html) command with the IDs of the nodes to decommission:
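For reference, the shape of the command in the updated guidance (node IDs and flags are placeholders; see also the --host discussion later in this review, which removes the need to SSH to the nodes themselves):

~~~ shell
# Decommission nodes 4 and 5 by ID, connecting to any live node.
cockroach node decommission 4 5 --insecure --host=<address of any live node>
~~~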

Ditto


- For each decommissioned node, SSH to the machine running the node and execute the `cockroach quit` command:
+ For each decommissioning node, SSH to the machine running the node and execute the `cockroach quit` command:

Ditto

At this point, the nodes are decommissioned and will no longer appear in timeseries graphs unless you are viewing a time range during which the nodes were live. However, they will never disappear from the **Decommissioned Nodes** list.

{{site.data.alerts.callout_info}}
If you want to utilize a decommissioned node again, first [recommission](#recommission-nodes) the node to have the cluster rebalance data to the node. Then [restart](start-a-node.html) the node so that it accepts new SQL connections.

Ditto

@@ -394,6 +456,11 @@ $ cockroach node status --decommission --insecure --host=<address of any live no
(5 rows)
~~~

- `is_decommissioning` will be `true` for a node that is undergoing or has completed the decommissioning process. This indicates that replicas are being or have been transferred to other nodes.
- `is_draining` will be `true` for a node that has been stopped. This indicates that SQL connections are no longer being accepted by the node.

Nodes may be stopped and is_draining == false as well. Just in case this may cause confusion.

@@ -4,9 +4,13 @@ summary: Learn how to temporarily stop a CockroachDB node.
toc: true
---

- This page shows you how to use the `cockroach quit` [command](cockroach-commands.html) to temporarily stop a node that you plan to restart, for example, during the process of [upgrading your cluster's version of CockroachDB](upgrade-cockroach-version.html) or to perform planned maintenance (e.g., upgrading system software).
+ This page shows you how to use the `cockroach quit` [command](cockroach-commands.html) to temporarily stop a node that you plan to restart.

Ditto about quit

@taroface (Contributor, author) left a comment



_includes/v20.1/faq/planned-maintenance.md, line 3 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

For the "resetting back to default" (further below, sorry, can't comment there) I think we'd want to advertise RESET CLUSTER SETTING x instead.

Done.


v19.1/admin-ui-cluster-overview-page.md, line 58 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Neither the old nor the new one are true. Decommissioning simply moves all data off the node (which includes moving all range leases). SQL sessions keep working forever, until the node is actually shut down. Shutting down the node gracefully will then let the node finish in-flight requests and drain clients etc.

Oops, right. Done!


v19.1/remove-nodes.md, line 12 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Might wanna call out explicitly that that is not what decommissioning is for.

Done.


v19.1/remove-nodes.md, line 24 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what has been fixed in v20.1?

Done. Clarified that the health endpoint behavior was fixed.


v19.1/remove-nodes.md, line 26 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

At this point, the node is decommissioned.

You say earlier that it needs to meet two criteria above, but it will take five minutes until it meets the second.

Thank you! Fixed.


v19.1/remove-nodes.md, line 47 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You say the same thing just below so I'm confused

Clarified the wording here (and in a following example). It means that adding the extra node won't fix a hung decommissioning process, and that you have to recommission the hung node first.


v19.1/remove-nodes.md, line 75 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm same question here. Maybe I'm just misunderstanding how this renders (the HTML previews are 404)

Done. See above comment.


v19.1/remove-nodes.md, line 115 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is a bit weird - we usually do SSH to the node because it's easiest from there to run commands against that node. But if you have the host, and it's reachable from wherever you are, then you can run the command from there instead via the --host flag. You specify that below, so it's weird that redundantly you ask users to ssh to a machine. I personally would focus on the --host version and avoid recommending to ssh to a machine that runs CRDB since that isn't even always possible (think k8s)

Done. I had been confused by this and should have asked about it.


v19.1/remove-nodes.md, line 169 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Let's avoid cockroach quit.

cockroach quit is removed from this doc starting with 20.1.


v19.1/remove-nodes.md, line 192 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In the future this will be impossible unless the data dir is wiped (i.e. it counts as a new node). May want to incorporate this in the docs already (it doesn't hurt to do this today already)

Done, added to "Recommission nodes" subheading.


v19.1/remove-nodes.md, line 214 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 246 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 280 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 338 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 361 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


v19.1/remove-nodes.md, line 460 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Nodes may be stopped and is_draining == false as well. Just in case this may cause confusion.

@tbg I'm not quite following here. In what situations will a stopped node have is_draining == false?


v19.1/stop-a-node.md, line 7 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto about quit

I've added a hard-to-miss callout at the top of the 20.1 article that provides the latest guidance on stopping a node. When #6951 is complete I'll link to that article instead.

@taroface (Contributor, author) commented
TFTR, @tbg! I had one question for you above.

@tbg self-requested a review June 3, 2020 12:37
@tbg (Member) left a comment




v19.1/remove-nodes.md, line 460 at r1 (raw file):
is_draining is probably true for a node that has been stopped, but it doesn't have to be that way. For example, if you hard-kill a node, it will never mark itself down as draining. A more correct statement is

is_draining == true implies that the node is no longer accepting SQL connections. It is either in the process of shutting down or has already done so.

@taroface requested a review from knz June 9, 2020 19:49
@taroface (Contributor, author) commented Jun 9, 2020

@knz Tobias has reviewed the node decommissioning changes.

Requesting your review as I've also folded your updates from cockroachdb/cockroach#47692 into this PR. In addition, I have updated the node recommissioning guidance as per this thread.

FYI cockroach quit is removed from the guidance starting in 20.1, but remains in 19.2 and 19.1.

@knz (Contributor) left a comment




v19.1/remove-nodes.md, line 24 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Done. Clarified that the health endpoint behavior was fixed.

I don't exactly know if that's true?

The /health?ready=0 endpoint reports the node as live (but I think it did that already)

The /health?ready=1 endpoint, hopefully reports the node as non-ready (that's what it should do) - because you want traffic to be redirected to other nodes I think?


v19.1/remove-nodes.md, line 26 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Thank you! Fixed.

change "5 minutes" to "the configured value of .... (the cluster setting)"


v19.1/remove-nodes.md, line 47 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Clarified the wording here (and in a following example). It means that adding the extra node won't fix a hung decommissioning process, and that you have to recommission the hung node first.

I am still confused by the text, but now maybe Tobias knows something I don't.

In my book, if there are 3 nodes and you start decommissioning 1 of them, then the moment you add a 4th node, the decommissioning node will proceed to decommission successfully.

The callout's suggestion to "recommission" the node is, in that view, incorrect.

Or is this behavior specific to 19.1 and has been fixed in a later version?


v19.1/remove-nodes.md, line 75 at r1 (raw file):

Previously, taroface (Ryan Kuo) wrote…

Done. See above comment.

see my response above.


v19.1/remove-nodes.md, line 22 at r4 (raw file):

1. The node has completed the decommissioning process.
2. The node has been stopped and has not [updated its liveness record](architecture/replication-layer.html#epoch-based-leases-table-data) for 5 minutes.

replace "for 5 minutes" with "for the duration configured via .... which defaults to 5 minutes"


v19.1/remove-nodes.md, line 94 at r4 (raw file):

- Before decommissioning each node verify that there are no [underreplicated or unavailable ranges](admin-ui-replication-dashboard.html).
- If you have a decommissioning node that appears to be hung, you can [recommission](#recommission-nodes) or [restart](#restart-decommissioned-nodes) the node. If you notice any issues persisting, [contact our support team](support-resources.html).

see my comment above


v19.1/remove-nodes.md, line 350 at r4 (raw file):

</div>

In about 5 minutes, you'll see the nodes move to the **Decommissioned Nodes** list.

change "5 minutes" to "the configured value of ..."


v19.1/remove-nodes.md, line 357 at r4 (raw file):

{{site.data.alerts.callout_info}}
If you want to utilize a decommissioned node again, first [recommission](#recommission-nodes) the node to have the cluster rebalance data to the node. Then [restart](start-a-node.html) the node so that it accepts new SQL connections.

I think the best practice here is to never reuse a decommissioned node again.
In fact, we're going to remove this option completely starting in 20.2.

What's the point of keeping this option documented here?


v19.1/remove-nodes.md, line 427 at r4 (raw file):

{% include {{ page.version.version }}/prod-deployment/node-shutdown.md %}

After about 5 minutes, the stopped nodes will be considered [decommissioned](#how-it-works).

"5 minutes"


v19.1/stop-a-node.md, line 18 at r4 (raw file):

- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the `server.shutdown.query_wait` [cluster setting](cluster-settings.html).
- Transfers all **range leases** and Raft leadership to other nodes.

I may have missed something - why did you choose to remove the explanation about range leases? From a technical perspective, that part of it remains true.
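For context on the quoted hunk above, a hedged sketch of tuning the drain timeout it references (the value is illustrative):

~~~ shell
# Lengthen the best-effort window for finishing in-flight requests on shutdown.
cockroach sql --insecure --host=<address of any live node> \
  --execute="SET CLUSTER SETTING server.shutdown.query_wait = '30s';"
~~~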

@taroface (Contributor, author) left a comment




v19.1/remove-nodes.md, line 24 at r1 (raw file):
@knz The current 20.1 docs state that:

During and after decommissioning, the node continues to accept new SQL connections. Even without replicas, the node can still function as a gateway to route connections to relevant data. For this reason, the /health?ready=1 monitoring endpoint continues to consider the node "ready" so load balancers can continue directing traffic to the node.

Is this incorrect? I took it to mean that since the node will accept SQL connections until it is drained, the health endpoint should still indicate "ready". (And then becomes "unready" when cockroach node drain is run?)
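For reference, the shape of the 20.1 drain command this PR documents, as a hedged sketch (the address and wait value are placeholders):

~~~ shell
# Drain the node of SQL clients and leases, waiting up to 10 minutes
# (--drain-wait is the flag documented in this PR).
cockroach node drain --insecure --host=<address of node to drain> --drain-wait=10m
~~~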


v19.1/remove-nodes.md, line 26 at r1 (raw file):

Previously, knz (kena) wrote…

change "5 minutes" to "the configured value of .... (the cluster setting)"

Done.


v19.1/remove-nodes.md, line 47 at r1 (raw file):

Previously, knz (kena) wrote…

I am still confused by the text, but now maybe Tobias knows something I don't.

In my book, if there are 3 nodes, and you start decommissioning 1 of them, then the moment you add a 4th node the 3rd will proceed to decommission syccessfully.

The callout's suggestion to "recommission" the node is, in that view, incorrect.

Or is this behavior specific to 19.1 and has been fixed in a later version?

The 20.2 alpha behaves as you describe, so maybe the original doc was incorrect on this. @jseldess do you know if the currently documented behavior (decommissioning 1 node in a 3-node cluster will hang, unless you recommission that node, add a 4th node, and decommission again) was true of an earlier version? I'm able to simply add the 4th node to complete the decommissioning.


v19.1/remove-nodes.md, line 460 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

is_draining is probably true for a node that has been stopped, but it doesn't have to be that way. For example, if you hard-kill a node, it will never mark itself down as draining. A more correct statement is

is_draining == true implies that the node is no longer accepting SQL connections. It is either in the process of shutting down or has already done so.

Done.


v19.1/remove-nodes.md, line 22 at r4 (raw file):

Previously, knz (kena) wrote…

replace "for 5 minutes" with "for the duration configured via .... which defaults to 5 minutes"

Done, assuming this is server.time_until_store_dead.


v19.1/remove-nodes.md, line 350 at r4 (raw file):

Previously, knz (kena) wrote…

change "5 minutes" to "the configured value of ..."

Done.


v19.1/remove-nodes.md, line 427 at r4 (raw file):

Previously, knz (kena) wrote…

"5 minutes"

Done.


v19.1/stop-a-node.md, line 18 at r4 (raw file):

Previously, knz (kena) wrote…

I may have missed something - why did you choose to remove the explanation about range leases? From a technical perspective, that part of it remains true.

I removed "leases" for the reason Andrei gave here, on Jesse's prior update to this doc: #6825 (review)

@taroface (Contributor, author) left a comment




v19.1/remove-nodes.md, line 357 at r4 (raw file):

Previously, knz (kena) wrote…

I think the best practice here is to never reuse a decommissioned node again.
In fact, we're going to remove this option completely starting in 20.2.

What's the point to keep this option documented here?

Noted - I have removed this option and references to restarting/reusing decommissioned nodes.

@tbg (Member) commented Jun 12, 2020 via email

@jseldess (Contributor) commented
Looks like that note was added to the 19.2 docs by @taroface: #6402 to address #4915. It looks like this behavior was reported by @robert-s-lee and confirmed by @knz. I'm not sure if @taroface or another writer tested this as well. I'll test across the last bunch of versions and report.

@knz (Contributor) commented Jun 12, 2020

I'm looking at that other PR. In that PR I see two conflicting sentences:

If you try to decommission a node, the process will hang indefinitely because the cluster cannot move the decommissioning node's replicas to the other 2 nodes, which already have a replica of each range:

This sentence is correct.

then the callout:

Adding a 4th node to the cluster at this point will neither enable the decommissioning process to complete nor change the Suspect node status. You can recommission the node to return it to a healthy state.

I believe this sentence is incorrect.

The thing is, I'm not sure. We have changed a few things in the code around this multiple times in the past. It may be that this sentence was correct at some point, and it is not any more (or vice-versa).

@jseldess (Contributor) commented
Yeah, in that issue, @robert-s-lee reported: once the cluster is in this state, when a 4th node is added, the suspect flag does not go away. I believe this is referring to the "Suspect" indication in the Admin UI. We may have misunderstood the issue here. I'll try to test soon.

@jseldess (Contributor) left a comment


I tested this on 19.1. I'm pretty certain this is all about the suspect status in the UI. It works as expected from a server perspective. If you have 3 nodes and decommission one, decommissioning will hang because there's no other place for the data to be moved, and the node will be "suspect" in the UI. When you add a fourth node, the decommission resumes and completes fine. The decommissioned node, however, remains "suspect" in the UI until the node is stopped. Only at that point (or 5 min later) does the node move to the "Decommissioned Nodes" table. (A minimal repro sketch follows at the end of this comment.)

We need to completely remove the note that says:

Adding a 4th node to the cluster at this point will neither enable the decommissioning process to complete nor change the Suspect node status. You can recommission the node to return it to a healthy state.

In addition, if we do anything else right now, I think we may want to look for ways to clarify that, in the UI, a fully decommissioned node will continue to be shown as "suspect" until the node is stopped. That's counterintuitive; if we have a "Decommissioned Nodes" list, it should show nodes that have been decommissioned, whether they are live or not. @knz tells me he and @irfansharif are working on this for 20.2.
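A minimal local repro of the behavior described above, as a hedged sketch (a throwaway insecure cluster; store paths and ports are illustrative, and the decommission command blocks while waiting, so run it in its own terminal):

~~~ shell
# Start a throwaway insecure 3-node cluster; node IDs here match start order.
for i in 1 2 3; do
  cockroach start --insecure --store=node$i \
    --listen-addr=localhost:$((26256+i)) --http-addr=localhost:$((8079+i)) \
    --join=localhost:26257,localhost:26258,localhost:26259 --background
done
cockroach init --insecure --host=localhost:26257

# Decommissioning node 3 hangs: each range already has a replica
# on the other two nodes, so there is nowhere to move the data.
cockroach node decommission 3 --insecure --host=localhost:26257

# Adding a 4th node lets the stalled decommission resume and complete.
cockroach start --insecure --store=node4 \
  --listen-addr=localhost:26260 --http-addr=localhost:8083 \
  --join=localhost:26257,localhost:26258,localhost:26259 --background
~~~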


@jseldess (Contributor) commented
On a side note, I'm not sure it makes sense to categorize a decommissioning or decommissioned node as "suspect", which indicates something potentially fishy or unintentional.

@knz (Contributor) commented Jun 12, 2020

I'm not sure it makes sense to categorize a decommissioning or decommissioned node as "suspect"

I think it's because the UI only has 3 states: "good" (green), "off" (red/absent), and yellow (everything else).

It might make sense to add a new indicator color for "node currently being removed", which would catch both decommissioning nodes and nodes being shut down, and keep yellow for "suspect / don't know".

That's squarely UI design though, cc @dhartunian @piyush-singh

@taroface (Contributor, author) commented Jun 15, 2020

I think we may want to look for ways to clarify that, in the UI, a fully decommissioned node will continue to be shown as "suspect" until the node is stopped. That's counterintuitive; if we have a "Decommissioned Nodes" list, it should show nodes that have been decommissioned, whether they are live or not.

In the 20.1 UI, the Node List has a separate status for DECOMMISSIONING and SUSPECT (e.g., liveness not available and/or node was drained but is not decommissioning) nodes. However, the Node Status in the top module still conflates the two as "suspect", which is reflected in our Cluster Overview docs. @piyush-singh @Annebirzin Maybe another word would be less misleading here? Also not sure if the new Overview dashboard will still need to use "suspect" for both statuses?

[screenshot: 20.1 Admin UI Node List showing separate DECOMMISSIONING and SUSPECT statuses]

Once a node is moved into "Recently Decommissioned Nodes" with a DECOMMISSIONED status (e.g., it is drained and shut down), it is no longer counted as "suspect" in the Node Status module.

[screenshot: Recently Decommissioned Nodes list showing a DECOMMISSIONED node]

I think versions 19.1-20.1 of this doc should clarify that a DECOMMISSIONING node will be shown as "suspect" until it becomes DECOMMISSIONED and is moved to the list of Recently Decommissioned Nodes (or "Decommissioned Nodes" in earlier versions). (Will make that change in the next update.)

@taroface requested a review from jseldess June 16, 2020 16:40
@jseldess (Contributor) left a comment


Thanks, @taroface. LGTM, but there are some broken links to fix before merging:

  d4838dd1769bd861fac0b8b2bde286d124964d22/v19.1/remove-nodes.html
    target does not exist --> cockroach-quit.html
    target does not exist --> cockroach-node.html
  d4838dd1769bd861fac0b8b2bde286d124964d22/v19.1/view-node-details.html
    target does not exist --> cockroach-quit.html


@taroface merged commit f1f9d96 into master Jun 16, 2020
@taroface deleted the decommission-nodes branch June 16, 2020 18:02
taroface added a commit that referenced this pull request Jun 16, 2020
taroface added a commit that referenced this pull request Jun 16, 2020
port #7304 node decommissioning updates to 20.2