
server: freshly created nodes appear as dead in UI; also abort during node startup causes unused+dead node IDs #70015

Open
knz opened this issue Sep 10, 2021 · 3 comments
Labels

  • A-server-start-drain: Pertains to server startup and shutdown sequences
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-23.2-scale-testing: issues found during 23.2 scale testing
  • O-testcluster: Issues found or occurred on a test cluster, i.e. a long-running internal cluster
  • S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
  • T-server-and-security: DB Server & Security

Comments

@knz
Contributor

knz commented Sep 10, 2021

Summary

The following two situations occur and share the same root cause:

  • Freshly created nodes that have not yet heartbeat their liveness record immediately show up as "dead" in the UI and other places that report liveness, before they are marked as "live" a while later.

    This is a UX surprise because a newly added node should either not show up in the UI yet, or show up as live. The fact that it's reported as "dead" is not expected.

  • Additionally, under certain circumstances (details below), a freshly added node can fail to initialize and crash, yet still acquire a node ID and cause a node descriptor to exist. When this happens, the node will allocate a new node ID during its next start. After that, the first node ID that was allocated will appear to belong to a dead node and will need to be decommissioned manually.

    This is an operational inconvenience because if there is a crash loop during initialization, dozens of node IDs can be allocated and immediately appear as dead, and they all need to be cleaned up manually afterwards.

Desired resolution

  1. A newly created node status record should be annotated with a special status "newly created", and subsequently ignored when computing node liveness, UI node reports, etc. (a sketch of this follows the list).

    The status "newly created" should then be removed (and replaced by "live") the first time the node reports liveness successfully.

  2. (lower priority) We should try to find a way to persist a node ID that was allocated during node startup but before the store directory was initialized, and reuse it when starting up again after a crash.
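Here is a minimal Go sketch of how resolution 1 could look. All names (MembershipStatus, classify, timeUntilStoreDead) are hypothetical and do not match the actual CockroachDB liveness code; this only illustrates the proposed classification logic.

```go
package liveness

import "time"

// MembershipStatus is a hypothetical display status for a node.
type MembershipStatus int

const (
	// StatusNewlyCreated marks a node that has a node ID and status
	// record but has never heartbeat its liveness record.
	StatusNewlyCreated MembershipStatus = iota
	StatusLive
	StatusDead
)

// classify derives a display status. A node that has never heartbeat
// stays "newly created" instead of being compared against the dead
// threshold, so the UI never reports it as dead.
func classify(lastHeartbeat, now time.Time, timeUntilStoreDead time.Duration) MembershipStatus {
	if lastHeartbeat.IsZero() {
		return StatusNewlyCreated
	}
	if now.Sub(lastHeartbeat) > timeUntilStoreDead {
		return StatusDead
	}
	return StatusLive
}
```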

Detail of how the situation occurs

For context, when a new node is added to a cluster, the following happens:

  1. the new node sends a "join" RPC to another pre-existing node;
  2. the pre-existing node allocates a node ID for the new node and creates a node status record for it;
  3. the pre-existing node sends the node ID back to the new node;
  4. the new node finalizes its startup and starts heartbeating its liveness into its status record.

There are two problems with this:

  • steps 3-4 can last for multiple seconds. During that time, the newly added node will show up as "dead" in the web UI and other places where operators can inspect liveness.

    This is surprising.

  • additionally, a node can crash in step 4 before it finishes initializing. This is possible e.g. when there is clock skew: the clock skew detection kicks in when the node re-connects to the cluster after it gets its node ID and causes a crash, and this crash occurs before the node has finished writing its initial data files to the store directory (and persisting its newly allocated node ID).

    Because the data directory is not ready, when the node starts again it appears as if the node has never initialized, so it starts over at step 1. This results in an unused node ID which will be forever-dead (see the sketch after these bullets).
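To make the crash window concrete, here is a hedged Go sketch of the startup sequence above. All function names (sendJoinRPC, validateClocks, persistNodeID) are made up for illustration and are not the actual CockroachDB APIs.

```go
package main

import (
	"errors"
	"fmt"
)

func startNode() error {
	// Steps 1-3: the join RPC. The pre-existing node allocates a node ID
	// and creates a node status record before replying.
	nodeID := sendJoinRPC()

	// Step 4 begins: the node re-connects to the cluster. A clock skew
	// check (among other things) can crash the process here...
	if err := validateClocks(); err != nil {
		// ...and at this point nodeID has never been written to the
		// store directory, so the next start repeats step 1 and
		// allocates a fresh ID, orphaning this one forever.
		return err
	}

	// Only now is the allocated ID persisted; a crash above this line
	// leaves behind a node ID that will always look dead.
	persistNodeID(nodeID)
	return nil
}

func sendJoinRPC() int      { return 42 }                                // placeholder
func validateClocks() error { return errors.New("clock skew detected") } // always fails, to show the window
func persistNodeID(id int)  { fmt.Println("persisted node ID", id) }

func main() {
	if err := startNode(); err != nil {
		fmt.Println("startup aborted:", err)
	}
}
```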

gz#9577

Jira issue: CRDB-9900

@knz added the C-bug, S-3-ux-surprise, and A-server-start-drain labels Sep 10, 2021
@knz added this to To do in DB Server & Security via automation Sep 10, 2021
@blathers-crl bot added the T-server-and-security label Sep 10, 2021
@knz changed the title from "server: abort node startup results in unused + dead node IDs" to "server: freshly created nodes appear as dead in UI; also abort during node startup causes unused+dead node IDs" Sep 10, 2021
@irfansharif
Contributor

In our UI we could stop consulting the node status keys and look at liveness records instead. To distinguish between nodes that were able to get their liveness records installed and nodes that were properly booted up after installing liveness records, we could look at the last heartbeat timestamp. The liveness records are created with an empty timestamp -- we rely on the joining node to heartbeat itself once it's fully loaded. #50707

@knz
Contributor Author

knz commented Sep 10, 2021

ok so "zero timestamp" would explain why they appear as dead: the difference between the zero timestamp and the current time will always be greater than the "time until store dead" threshold.

at least that checks out.
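The arithmetic in the two comments above can be checked with a few lines of Go. The 5-minute "time until store dead" value is a made-up example; the point is only that any realistic threshold is dwarfed by the distance to the zero timestamp.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var lastHeartbeat time.Time // zero value: the node never heartbeat
	timeUntilStoreDead := 5 * time.Minute

	elapsed := time.Since(lastHeartbeat)                          // roughly two millennia
	fmt.Println("considered dead:", elapsed > timeUntilStoreDead) // always true
}
```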

@knz moved this from To do to Queued for roadmapping in DB Server & Security Sep 13, 2021
@tbg
Member

tbg commented May 24, 2022

This just occurred again on a customer deployment.

@erikgrinaker points out: if we didn't show these entries as prominently, there might not be an issue.

@williamkulju added the O-testcluster and O-23.2-scale-testing labels Nov 10, 2023