server: freshly created nodes appear as dead in UI; also abort during node startup causes unused+dead node IDs #70015
Labels
A-server-start-drain
Pertains to server startup and shutdown sequences
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-23.2-scale-testing
issues found during 23.2 scale testing
O-testcluster
Issues found or occurred on a test cluster, i.e. a long-running internal cluster
S-3-ux-surprise
Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
T-server-and-security
DB Server & Security
Summary
The following two situations occur and share the same root cause:
Freshly created nodes that have not yet heartbeaten their liveness show up immediately as "dead" in the UI and other places that report liveness, before they are marked as "live" a while later.
This is a UX surprise because a newly added node should either not show up yet in the UI, or show up as live. Reporting it as "dead" is not expected.
Additionally, under certain circumstances (details below), a freshly added node can fail to initialize and crash, yet still acquire a node ID and cause a node descriptor to exist. When this happens, the node allocates a new node ID during its next start. After that, the first node ID that had been allocated appears as a dead node and needs to be decommissioned manually.
This is an operational inconvenience because if there is a crash loop during initialization, it's possible for dozens of node IDs to be allocated and immediately appear as dead, and they all need to be cleaned manually afterwards.
Desired resolution
A newly created node status record should be annotated with a special status "newly created", and subsequently ignored when computing node liveness, UI node reports, etc.
The status "newly created" should then be removed (and replaced by "live") the first time the node reports liveness successfully.
(lower priority) we should try to find a way to persist a node ID that's been allocated during node startup before the store directory has been initialized, and reuse it when starting up again after a crash.
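The first two points above could be sketched as a small status state machine. This is only an illustration of the proposal: `NodeStatus`, `NodeRecord`, `RecordHeartbeat`, `MarkExpired` and `DeadNodes` are hypothetical names made up for this sketch, not CockroachDB's actual liveness types or APIs.

```go
package main

// NodeStatus is a hypothetical status enum used only for this sketch.
type NodeStatus int

const (
	// StatusNewlyCreated: a node ID was allocated but the node has not yet
	// heartbeaten its liveness successfully.
	StatusNewlyCreated NodeStatus = iota
	StatusLive
	StatusDead
)

// NodeRecord is a hypothetical per-node status record.
type NodeRecord struct {
	NodeID int
	Status NodeStatus
}

// NewNodeRecord creates the record at node ID allocation time. It starts as
// "newly created" rather than as an expired (dead) liveness record.
func NewNodeRecord(id int) *NodeRecord {
	return &NodeRecord{NodeID: id, Status: StatusNewlyCreated}
}

// RecordHeartbeat is called on a successful liveness heartbeat. The first
// call replaces "newly created" with "live".
func (r *NodeRecord) RecordHeartbeat() {
	r.Status = StatusLive
}

// MarkExpired is called when a liveness record is found expired. Only nodes
// that were live at some point can become dead; "newly created" is kept.
func (r *NodeRecord) MarkExpired() {
	if r.Status == StatusLive {
		r.Status = StatusDead
	}
}

// DeadNodes returns the node IDs that a UI or liveness report would show as
// dead. Records that are still "newly created" are ignored.
func DeadNodes(records []*NodeRecord) []int {
	var dead []int
	for _, r := range records {
		if r.Status == StatusDead {
			dead = append(dead, r.NodeID)
		}
	}
	return dead
}
```

With this shape, a record created by `NewNodeRecord` never appears in `DeadNodes` until `RecordHeartbeat` has been called at least once. One consequence of the sketch is that a node ID allocated by a node that crashed before its first heartbeat stays "newly created" indefinitely, so it would still need some form of cleanup.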
Detail of how the situation occurs
For context, when a new node is added to a cluster, the following happens:
There are two problems with this:
steps 3-4 can last for multiple seconds. During that time, the newly added node will show up as "dead" in the web UI and other places where operators can inspect liveness.
This is surprising.
additionally, a node can crash in step 4, before it finishes initializing. This is possible e.g. when there is clock skew: clock skew detection kicks in when the node re-connects to the cluster after it gets its node ID and causes a crash, and this crash occurs before the node has finished writing its initial data files in the store directory (and persisting its newly allocated node ID).
Because the data directory is not ready, when the node starts again, it appears as if the node has not initialized yet, so it starts again at step 1. This results in an unused node ID which will be forever-dead.
gz#9577
Jira issue: CRDB-9900