server: create a mechanism to gracefully and quickly shut down an entire cluster #58417
Comments
Hi @knz, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Pushing this to v21.2. @piyush-singh can you keep this on the radar for v21.2 roadmapping? Might necessitate a Jira ticket. Thanks.
I don't understand the rationale for this. I don't really see how it will benefit cloud. Can someone explain? Between graceful & ungraceful shutdown, I feel we have the tools that are needed. Is this just about tutorials? If yes, then is that really a priority? If not, what other use cases exist?
It greatly simplifies the task of cleaning up at the end of automated tests.
Why can't you use ungraceful shutdown for automated tests? I am surely just not understanding the tests you have in mind :)
We can, but it is not desirable. We have a lot of concerns today about graceful shutdown not completing on time. These concerns exist because nearly all our tests use ungraceful shutdown, and therefore almost never exercise graceful shutdown. We should have tests exercise graceful shutdown by default, so that defects there surface more quickly as test failures. However, to achieve this we need a way to shut down entire clusters gracefully, which is not possible yet.
Interesting! I think I understand now. This is about making CRDB better by increasing test coverage. It's not really about usage in production.
correct
I think the only other angle is for education:
But it's not high priority.
Requested by @awoods187 and @jseldess
tldr: we want a way to reliably shut down an entire cluster.
Today, during multi-user demos and tests, users run afoul of production-level rules when trying to shut down an entire cluster: the shutdown process is designed and optimized to ensure that the cluster remains available when one node is shut down.
This means, in particular, that a node does not let itself shut down if it is unable to find a replacement live node to transfer range leases to. This logic is needed to preserve cluster availability through rolling restarts and other production operations on live clusters.
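The rule described above can be illustrated with a toy model. This is only a sketch, not CockroachDB's implementation: `Node`, `drain_node`, and the round-robin lease transfer are hypothetical names and simplifications invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node; not a CockroachDB type."""
    node_id: int
    live: bool = True
    leases: set = field(default_factory=set)  # range IDs whose lease this node holds


def drain_node(node, cluster):
    """Try to gracefully drain `node`; return True if it may shut down.

    Toy version of the rule above: every lease must move to some other
    live node. If no such node exists, the node refuses to shut down.
    """
    targets = [n for n in cluster if n.live and n.node_id != node.node_id]
    if node.leases and not targets:
        return False  # no live replacement: the node stays up
    for i, range_id in enumerate(sorted(node.leases)):
        targets[i % len(targets)].leases.add(range_id)  # round-robin transfer
    node.leases.clear()
    node.live = False
    return True


# Stop a three-node cluster one node at a time, as a tutorial user would.
cluster = [Node(1, leases={10, 11}), Node(2, leases={12}), Node(3, leases={13})]
results = [drain_node(n, cluster) for n in cluster]
```

In this model the first two nodes drain, and the last one ends up holding every lease with nowhere to send them, so it refuses to stop. That matches the "nodes refuse to shut down" experience described below.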
However, this logic is also incompatible with interactive use, when a human user attempts to stop an entire cluster at the end of a tutorial/guide/lesson. Their experience is that "nodes refuse to shut down" and they have to resort to ungraceful shutdowns.
We would generally prefer not to encourage (or teach) ungraceful shutdowns, as they are more likely to be detrimental to cluster health, and thus certainly should not be used for routine operations.
So we really want a tool / method / operation to "shut down an entire cluster gracefully", separate from the incremental node shutdown/restart which preserves cluster health.
For several reasons (not detailed here), it is unreasonable to expect that the same mechanism can be used for both availability-preserving node shutdowns/restarts, and availability-destroying whole-cluster shutdowns.
Additionally, there is at least one reason to desire different mechanisms: a whole-cluster shutdown should preserve the location of range leases, so that the patterns of data locality and traffic do not change significantly if/when the cluster is restarted. Graceful individual node shutdowns, by definition, are designed to change this traffic by redirecting it to other live nodes.
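To make the contrast concrete, here is a small sketch of how lease placement differs between the two kinds of shutdown. It is an illustration under stated assumptions: the cluster is modelled as a plain dict from node ID to the set of range IDs it holds leases for, and `drain_one` and `shutdown_all` are hypothetical helpers, not existing CockroachDB functions or commands.

```python
def lease_holders(cluster):
    """Map each range ID to the node that holds its lease."""
    return {r: node_id for node_id, ranges in cluster.items() for r in ranges}


def drain_one(cluster, node_id):
    """Node-level graceful shutdown: leases move to the remaining nodes."""
    remaining = {n: set(r) for n, r in cluster.items() if n != node_id}
    targets = sorted(remaining)
    if cluster[node_id] and not targets:
        raise ValueError("no other node can take over the leases")
    for i, r in enumerate(sorted(cluster[node_id])):
        remaining[targets[i % len(targets)]].add(r)  # round-robin, for illustration
    return remaining


def shutdown_all(cluster):
    """Hypothetical whole-cluster shutdown: stop every node, move nothing.

    Returns the lease placement the cluster would come back with.
    """
    return {n: set(r) for n, r in cluster.items()}


before = {1: {10, 11}, 2: {12}, 3: {13}}
after_drain = drain_one(before, 1)       # leases of node 1 are redistributed
after_restart = shutdown_all(before)     # placement is unchanged
```

Here `lease_holders(after_restart)` equals `lease_holders(before)`, whereas `lease_holders(after_drain)` does not: ranges 10 and 11 now live on nodes 2 and 3. That difference is the locality and traffic shift that a whole-cluster shutdown should avoid.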
Jira issue: CRDB-3395