server: create a mechanism to gracefully and quickly shut down an entire cluster #58417

knz · 2021-01-04T15:15:28Z

tldr: we want a way to reliably shut down an entire cluster.

Today, during multi-user demos and tests, users run afoul of production-level rules when trying to shut down an entire cluster: the shutdown process is designed and optimized to ensure that the cluster remains available when one node is shut down.

This means, in particular, that a node does not let itself shut down if it is unable to find a replacement live node to transfer range leases to. This logic is needed to preserve cluster availability through rolling restarts and other production operations on live clusters.
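To make the failure mode concrete, here is a toy model of the rule above. It is not CockroachDB's actual code: `Node`, `drain`, and the lease bookkeeping are hypothetical names made up for illustration. A node only drains if some other live node can take over its leases, so draining nodes one at a time gets stuck on the last one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node holding range leases."""
    node_id: int
    live: bool = True
    leases: set = field(default_factory=set)


def drain(node, cluster):
    """Try to shut down `node` gracefully; return True on success.

    Toy rule: leases must move to another live node first, otherwise
    the node refuses to shut down.
    """
    targets = [n for n in cluster if n.live and n is not node]
    if node.leases and not targets:
        return False  # nobody left to take the leases: refuse
    if targets:
        # Hand every lease to the first live peer (real placement is smarter).
        targets[0].leases |= node.leases
        node.leases = set()
    node.live = False
    return True


cluster = [Node(1, leases={"r1"}), Node(2, leases={"r2"}), Node(3, leases={"r3"})]
results = [drain(n, cluster) for n in cluster]
print(results)  # [True, True, False]: the last node refuses
```

In this sketch the first two nodes stop cleanly, while the third still holds every lease and cannot drain, which matches the "nodes refuse to shut down" experience described in this issue.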

However, this logic is also incompatible with interactive use, when a human user attempts to stop an entire cluster at the end of a tutorial/guide/lesson. Their experience is that "nodes refuse to shut down" and they have to resort to ungraceful shutdowns.

We would generally prefer not to encourage (or teach) ungraceful shutdowns, as they are more likely to be detrimental to cluster health and so should not be used for routine operations.

So we really want a tool / method / operation to "shut down an entire cluster gracefully", separate from the incremental node shutdown/restart which preserves cluster health.

For several reasons (not detailed here), it is unreasonable to expect that the same mechanism can be used for both availability-preserving node shutdowns/restarts, and availability-destroying whole-cluster shutdowns.

Additionally, there is at least one reason to desire different mechanisms: a whole-cluster shutdown should preserve the location of range leases, so that the patterns of data locality and traffic do not change significantly if/when the cluster is restarted. Graceful individual node shutdowns, by definition, are designed to change this traffic by redirecting it to other live nodes.
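The difference in lease placement can be sketched with a small toy model. This is an illustration under stated assumptions, not a proposed implementation: `drain_one_by_one` and `stop_all_at_once` are hypothetical helpers, and the "hand everything to the first survivor" policy is a simplification.

```python
def drain_one_by_one(placement, order):
    """Per-node drain: each stopping node hands its leases to a survivor."""
    placement = {node: set(ranges) for node, ranges in placement.items()}
    live = list(order)
    for node in order[:-1]:  # the last node has nowhere to send leases
        live.remove(node)
        placement[live[0]] |= placement[node]
        placement[node] = set()
    return placement


def stop_all_at_once(placement):
    """Hypothetical whole-cluster stop: no transfers, placement is kept."""
    return {node: set(ranges) for node, ranges in placement.items()}


before = {"n1": {"r1"}, "n2": {"r2"}, "n3": {"r3"}}
after_rolling = drain_one_by_one(before, ["n1", "n2", "n3"])
after_cluster_stop = stop_all_at_once(before)

print(after_rolling)       # all leases end up on n3
print(after_cluster_stop)  # same placement as `before`
```

With node-by-node drains, every lease piles up on the last node, so a restarted cluster starts from a skewed layout. A dedicated whole-cluster stop could leave the placement untouched.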

Jira issue: CRDB-3395

blathers-crl · 2021-01-04T15:15:29Z

Hi @knz, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

knz · 2021-01-25T18:10:02Z

Pushing this to v21.2.

@piyush-singh can you keep this on the radar for v21.2 roadmapping? Might necessitate a Jira ticket. Thanks.

joshimhoff · 2021-10-25T14:03:50Z

I don't understand the rationale for this. I don't really see how it will benefit cloud. Can someone explain?

Between graceful & ungraceful shutdown, I feel we have the tools that are needed.

Is this just about tutorials? If yes, then is that really a priority? If not, what other use cases exist?

knz · 2021-10-27T16:53:36Z

> I don't really see how it will benefit cloud. Can someone explain?

It greatly simplifies the task of cleaning up at the end of automated tests.

joshimhoff · 2021-10-27T17:02:29Z

Why can't you use ungraceful shutdown for automated tests? I am surely just not understanding the tests you have in mind :)

knz · 2021-10-27T17:06:09Z

> Why can't you use ungraceful shutdown for automated tests?

We can, but it is not desirable. We have a lot of concerns today about graceful shutdown not completing on time. These concerns exist because nearly all our tests use ungraceful shutdown, and therefore almost never exercise graceful shutdown.

We should have tests exercise graceful shutdown by default so that defects there are surfaced more quickly as test failures.

However, to achieve this, we need a way to shut down entire clusters via graceful shutdown, which is not possible yet.

joshimhoff · 2021-10-27T17:09:45Z

Interesting! I think I understand now. This is about making CRDB better by increasing test coverage. It's not really about usage in production.

knz · 2021-10-27T17:13:51Z

correct

jseldess · 2021-10-27T18:03:45Z

I think the only other angle is for education:

> ...when a human user following a tutorial attempts to stop an entire cluster at the end of a tutorial/guide/lesson. Their experience is that "nodes refuse to shut down" and they have to resort to ungraceful shutdowns.

But it's not high priority.

knz added the A-kv-server (Relating to the KV-level RPC server) and C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) labels Jan 4, 2021

knz added this to To do in DB Server & Security via automation Jan 4, 2021

knz moved this from To do to Hot and loose in DB Server & Security Jan 18, 2021

jlinder added the T-server-and-security (DB Server & Security) label Jun 16, 2021

knz moved this from Hot and loose to To do in DB Server & Security Jun 22, 2021

knz added the A-server-start-drain (Pertains to server startup and shutdown sequences), A-server-architecture (Relates to the internal APIs and src org for server code), and A-cc-enablement (Pertains to current CC production issues or short-term projects) labels Jul 29, 2021

knz moved this from To do to Linked issues (from the roadmap columns on the right) in DB Server & Security Jul 29, 2021

knz mentioned this issue Oct 13, 2022

cli: last node fails to drain #89872

Closed
