server: create a mechanism to gracefully and quickly shut down an entire cluster #58417

Open
knz opened this issue Jan 4, 2021 · 9 comments
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects A-kv-server Relating to the KV-level RPC server A-server-architecture Relates to the internal APIs and src org for server code A-server-start-drain Pertains to server startup and shutdown sequences C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-server-and-security DB Server & Security

Comments

@knz
Contributor

knz commented Jan 4, 2021

Requested by @awoods187 and @jseldess

tldr: we want a way to reliably shut down an entire cluster.

Today, during multi-user demos and tests, users run afoul of production-level rules when trying to shut down an entire cluster: the shutdown process is designed and optimized to ensure that the cluster remains available when one node is shut down.

This means, in particular, that a node does not let itself shut down if it is unable to find a replacement live node to transfer range leases to. This logic is needed to preserve cluster availability through rolling restarts and other production operations on live clusters.
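To make the consequence concrete, here is a minimal toy model of that rule, not CockroachDB's actual drain code. The `node` type, the `drain` function, and `errNoTarget` are all hypothetical names invented for this sketch. It assumes only what the paragraph above states: a node must hand its leases to another live node before it may stop.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a toy stand-in for a cluster node (hypothetical, not a real
// CockroachDB type).
type node struct {
	id     int
	live   bool
	leases int // number of range leases currently held
}

var errNoTarget = errors.New("no live node available to receive leases")

// drain models the availability-preserving rule: a node may stop only
// after transferring all of its leases to some other live node.
func drain(n *node, cluster []*node) error {
	if n.leases > 0 {
		var target *node
		for _, c := range cluster {
			if c != n && c.live {
				target = c
				break
			}
		}
		if target == nil {
			return errNoTarget // the node refuses to shut down
		}
		target.leases += n.leases
		n.leases = 0
	}
	n.live = false
	return nil
}

func main() {
	cluster := []*node{
		{id: 1, live: true, leases: 3},
		{id: 2, live: true, leases: 3},
		{id: 3, live: true, leases: 3},
	}
	// Try to stop every node in turn, as a user would at the end of a tutorial.
	for _, n := range cluster {
		if err := drain(n, cluster); err != nil {
			fmt.Printf("n%d: refused to shut down: %v\n", n.id, err)
			continue
		}
		fmt.Printf("n%d: drained\n", n.id)
	}
}
```

In this model the first two nodes drain, but the last one has nowhere to send its leases and refuses, which is the "nodes refuse to shut down" experience described in this issue.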

However, this logic is also incompatible with interactive use, when a human user following a tutorial attempts to stop an entire cluster at the end of a tutorial/guide/lesson. Their experience is that "nodes refuse to shut down" and they have to resort to ungraceful shutdowns.

We generally would prefer to not encourage (nor teach) ungraceful shutdowns, as they are more likely to be detrimental to cluster health, and thus certainly should not be used for routine operations.

So we really want a tool / method / operation to "shut down an entire cluster gracefully", separate from the incremental node shutdown/restart which preserves cluster health.

For several reasons (not detailed here), it is unreasonable to expect that the same mechanism can be used for both availability-preserving node shutdowns/restarts, and availability-destroying whole-cluster shutdowns.

Additionally, there is at least one reason to desire different mechanisms: a whole-cluster shutdown should preserve the location of range leases, so that the patterns of data locality and traffic do not change significantly if/when the cluster is restarted. Graceful individual node shutdowns, by definition, are designed to change this traffic by redirecting it to other live nodes.
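The difference in lease placement can be illustrated with a small sketch. Everything here is hypothetical: `placement`, `drainNode`, and `stopCluster` are names made up for the illustration, and `stopCluster` stands for the mechanism this issue requests, which does not exist yet.

```go
package main

import "fmt"

// placement maps a range ID to the ID of the node holding its lease.
// Both IDs are plain ints in this toy model.
type placement map[int]int

// drainNode models an individual graceful shutdown: every lease held by
// `leaving` moves to `target`, so the resulting placement differs.
func drainNode(p placement, leaving, target int) placement {
	out := placement{}
	for r, holder := range p {
		if holder == leaving {
			holder = target
		}
		out[r] = holder
	}
	return out
}

// stopCluster models the requested whole-cluster shutdown: all nodes stop
// together and no lease moves, so the placement is copied unchanged.
func stopCluster(p placement) placement {
	out := placement{}
	for r, holder := range p {
		out[r] = holder
	}
	return out
}

func main() {
	before := placement{1: 1, 2: 2, 3: 3}
	fmt.Println("after draining n1:  ", drainNode(before, 1, 2))
	fmt.Println("after cluster stop: ", stopCluster(before))
}
```

Under these assumptions, only the whole-cluster path leaves each range's lease on its original node, so traffic patterns would look the same after a restart.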

Jira issue: CRDB-3395

@blathers-crl

blathers-crl bot commented Jan 4, 2021

Hi @knz, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@knz knz added A-kv-server Relating to the KV-level RPC server C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Jan 4, 2021
@knz knz added this to To do in DB Server & Security via automation Jan 4, 2021
@knz knz moved this from To do to Hot and loose in DB Server & Security Jan 18, 2021
@knz
Contributor Author

knz commented Jan 25, 2021

Pushing this to v21.2.

@piyush-singh can you keep this on the radar for v21.2 roadmapping? Might necessitate a Jira ticket. Thanks.

@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@knz knz moved this from Hot and loose to To do in DB Server & Security Jun 22, 2021
@knz knz added A-server-start-drain Pertains to server startup and shutdown sequences A-server-architecture Relates to the internal APIs and src org for server code A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
@knz knz moved this from To do to Linked issues (from the roadmap columns on the right) in DB Server & Security Jul 29, 2021
@joshimhoff
Collaborator

I don't understand the rationale for this. I don't really see how it will benefit cloud. Can someone explain?

Between graceful & ungraceful shutdown, I feel we have the tools that are needed.

Is this just about tutorials? If yes, then is that really a priority? If not, what other use cases exist?

@knz
Contributor Author

knz commented Oct 27, 2021

> I don't really see how it will benefit cloud. Can someone explain?

It greatly simplifies the task of cleaning up at the end of automated tests.

@joshimhoff
Collaborator

Why can't you use ungraceful shutdown for automated tests? I am surely just not understanding the tests you have in mind :)

@knz
Contributor Author

knz commented Oct 27, 2021

> Why can't you use ungraceful shutdown for automated tests?

We can, but it is not desirable. We have a lot of concerns today about graceful shutdown not completing on time. These concerns exist because nearly all our tests use ungraceful shutdown, and therefore almost never exercise graceful shutdown.

We should have tests exercise graceful shutdown by default so that defects there are surfaced more quickly as test failures.

However, to achieve this, we need to ensure there's a way to shut down entire clusters via graceful shutdown, which is not possible yet.

@joshimhoff
Collaborator

Interesting! I think I understand now. This is about making CRDB better by increasing test coverage. It's not really about usage in production.

@knz
Contributor Author

knz commented Oct 27, 2021

correct

@jseldess
Contributor

jseldess commented Oct 27, 2021

I think the only other angle is for education:

> ...when a human user following a tutorial attempts to stop an entire cluster at the end of a tutorial/guide/lesson. Their experience is that "nodes refuse to shut down" and they have to resort to ungraceful shutdowns.

But it's not high priority.

Projects
DB Server & Security
  
Linked issues (from the roadmap colum...
Development

No branches or pull requests

4 participants