
server: freshly created nodes appear as dead in UI; also abort during node startup causes unused+dead node IDs #70015

Open
knz opened this issue Sep 10, 2021 · 3 comments
Labels

  • A-server-start-drain: Pertains to server startup and shutdown sequences
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-23.2-scale-testing: issues found during 23.2 scale testing
  • O-testcluster: Issues found or occurred on a test cluster, i.e. a long-running internal cluster
  • S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
  • T-server-and-security: DB Server & Security

Comments

@knz
Contributor

knz commented Sep 10, 2021

Summary

The following two situations occur and share the same root cause:

  • Freshly created nodes that have not yet heartbeat their liveness record immediately show up as "dead" in the UI and other places that report liveness, before they are marked as "live" a while later.

    This is a UX surprise because a newly added node should either not show up in the UI yet, or show up as live. The fact that it's reported as "dead" is not expected.

  • Additionally, under certain circumstances (details below), a freshly added node can fail to initialize and crash, yet still acquire a node ID and cause a node descriptor to exist. When this happens, the node will allocate a new node ID during its next start. After that, the first node ID that was allocated will appear to belong to a dead node and will need to be decommissioned manually.

    This is an operational inconvenience because if there is a crash loop during initialization, dozens of node IDs can be allocated and immediately appear as dead, and they all need to be cleaned up manually afterwards.

Desired resolution

  1. A newly created node status record should be annotated with a special status "newly created", and subsequently ignored when computing node liveness, UI node reports, etc. (a sketch of this follows the list).

    The status "newly created" should then be removed (and replaced by "live") the first time the node reports liveness successfully.

  2. (lower priority) We should try to find a way to persist a node ID that was allocated during node startup but before the store directory was initialized, and reuse it when starting up again after a crash.
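Here is a minimal Go sketch of how resolution 1 could look. All names (MembershipStatus, classify, timeUntilStoreDead) are hypothetical and do not match the actual CockroachDB liveness code; this only illustrates the proposed classification logic.

```go
package liveness

import "time"

// MembershipStatus is a hypothetical display status for a node.
type MembershipStatus int

const (
	// StatusNewlyCreated marks a node that has a node ID and status
	// record but has never heartbeat its liveness record.
	StatusNewlyCreated MembershipStatus = iota
	StatusLive
	StatusDead
)

// classify derives a display status. A node that has never heartbeat
// stays "newly created" instead of being compared against the dead
// threshold, so the UI never reports it as dead.
func classify(lastHeartbeat, now time.Time, timeUntilStoreDead time.Duration) MembershipStatus {
	if lastHeartbeat.IsZero() {
		return StatusNewlyCreated
	}
	if now.Sub(lastHeartbeat) > timeUntilStoreDead {
		return StatusDead
	}
	return StatusLive
}
```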

Detail of how the situation occurs

For context, when a new node is added to a cluster, the following happens:

  1. the new node sends a "join" RPC to another pre-existing node;
  2. the pre-existing node allocates a node ID for the new node and creates a node status record for it;
  3. the pre-existing node sends the node ID back to the new node;
  4. the new node finalizes its startup and starts heartbeating its liveness into its status record.

There are two problems with this:

  • steps 3-4 can last for multiple seconds. During that time, the newly added node will show up as "dead" in the web UI and other places where operators can inspect liveness.

    This is surprising.

  • additionally, a node can crash in step 4 before it finishes initializing. This is possible e.g. when there is clock skew: the clock skew detection kicks in when the node re-connects to the cluster after it gets its node ID and causes a crash, and this crash occurs before the node has finished writing its initial data files to the store directory (and persisting its newly allocated node ID).

    Because the data directory is not ready, when the node starts again it appears as if the node has never initialized, so it starts over at step 1. This results in an unused node ID which will be forever-dead (see the sketch after these bullets).
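To make the crash window concrete, here is a hedged Go sketch of the startup sequence above. All function names (sendJoinRPC, validateClocks, persistNodeID) are made up for illustration and are not the actual CockroachDB APIs.

```go
package main

import (
	"errors"
	"fmt"
)

func startNode() error {
	// Steps 1-3: the join RPC. The pre-existing node allocates a node ID
	// and creates a node status record before replying.
	nodeID := sendJoinRPC()

	// Step 4 begins: the node re-connects to the cluster. A clock skew
	// check (among other things) can crash the process here...
	if err := validateClocks(); err != nil {
		// ...and at this point nodeID has never been written to the
		// store directory, so the next start repeats step 1 and
		// allocates a fresh ID, orphaning this one forever.
		return err
	}

	// Only now is the allocated ID persisted; a crash above this line
	// leaves behind a node ID that will always look dead.
	persistNodeID(nodeID)
	return nil
}

func sendJoinRPC() int      { return 42 }                                // placeholder
func validateClocks() error { return errors.New("clock skew detected") } // always fails, to show the window
func persistNodeID(id int)  { fmt.Println("persisted node ID", id) }

func main() {
	if err := startNode(); err != nil {
		fmt.Println("startup aborted:", err)
	}
}
```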

gz#9577

Jira issue: CRDB-9900

@knz added the C-bug, S-3-ux-surprise, and A-server-start-drain labels Sep 10, 2021
@knz added this to To do in DB Server & Security via automation Sep 10, 2021
@blathers-crl bot added the T-server-and-security label Sep 10, 2021
@knz changed the title from "server: abort node startup results in unused + dead node IDs" to "server: freshly created nodes appear as dead in UI; also abort during node startup causes unused+dead node IDs" Sep 10, 2021
@irfansharif
Contributor

In our UI we could stop consulting the node status keys and look at liveness records instead. To distinguish between nodes that were able to get their liveness records installed and nodes that were properly booted up after installing liveness records, we could look at the last heartbeat timestamp. The liveness records are created with an empty timestamp -- we rely on the joining node to heartbeat itself once it's fully loaded. #50707

@knz
Contributor Author

knz commented Sep 10, 2021

ok so "zero timestamp" would explain why they appear as dead: the difference between the zero timestamp and the current time will always be greater than the "time until store dead" threshold.

at least that checks out.
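The arithmetic in the two comments above can be checked with a few lines of Go. The 5-minute "time until store dead" value is a made-up example; the point is only that any realistic threshold is dwarfed by the distance to the zero timestamp.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var lastHeartbeat time.Time // zero value: the node never heartbeat
	timeUntilStoreDead := 5 * time.Minute

	elapsed := time.Since(lastHeartbeat)                          // roughly two millennia
	fmt.Println("considered dead:", elapsed > timeUntilStoreDead) // always true
}
```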

@knz moved this from To do to Queued for roadmapping in DB Server & Security Sep 13, 2021
@tbg
Member

tbg commented May 24, 2022

This just occurred again on a customer deployment.

@erikgrinaker points out: if we didn't show these entries as prominently, there might not be an issue.

@williamkulju added the O-testcluster and O-23.2-scale-testing labels Nov 10, 2023