New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli,server: distinguish graceful vs non-graceful shutdown requests #108612
Conversation
TLDR: this patch fixes a bug whereby the top-level shutdown logic in `cli/start.go` mistakenly attempted a graceful drain when that, in fact, was impossible. This would show up as a storm of errors in logs and, sometimes, block server shutdown until a non-graceful shutdown was initiated by other means. A more detailed explanation follows. *For context*, we have a notification mechanism (called "stop trigger") through which internal components in a server can request a shutdown of the overall server. As of this writing, this mechanism is used for example: - when an error is encountered very early during server startup. - at the tail end of the `Drain` RPC, when the `Shutdown` request flag is true, to indicate the process can now exit. - when a SQL liveness record is found to have been deleted or expired without heartbeat. The way this works is that the internal component, given a reference to `*server.stopTrigger`, when it needs a server shutdown calls `signalStop()` and passes a "shutdown request" (`serverctl.MakeShutdownRequest`). At the very outer orchestration layer, a loop monitor a channel from the `stopTrigger` and processes shutdown requests. As of this writing, we have at least the following orchestration points: 1. the top-level shutdown logic in `cli/start.go`. 2. the top-level `testServer` shutdown monitor task (initiated from `(*server.testServer).Start()`) 3. the tenant server orchestration logic in `(*server.channelOrchestrator) startControlledServer()`. Prior to this patch, a bug existed in `cli/start.go` as follows: - the shutdown logic considered that any "shutdown reason" other than `ShutdownReasonServerStartupError` could be processed by performing a graceful drain on the server. - meanwhile, in fact *none* of the current shutdown reasons can be correctly processed via a graceful drain (they all signal conditions that make the current server unhealthy and unable to shut down gracefully). - as a result, if a shutdown request other than "server startup error" was received in `cli/start.go`, it would attempt a graceful drain and the graceful drain process would generally fail with all kinds of error messages in logs. It would also sometime hang, blocking the overall shutdown. This patch fixes this by avoiding the graceful drain unless the shutdown request is for a graceful drain. Through this, care is taken to ensure that if a hard shutdown is triggered after a graceful shutdown, the hard shutdown is also "delivered" to the orchestration. That allows hard shutdown to "catch up" with and shortcut graceful drains initiated earlier. Release note (bug fix): A bug was fixed whereby `cockroach start` would sometimes incorrectly hang upon shutting down a server after encountering an internal error. This bug had been introduced some time in v22.x.
033703f
to
29668e1
Compare
TFYR! bors r=stevendanna |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 29668e1 to blathers/backport-release-23.1-108612: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/server/serverctl/shutdown.go
line 25 at r1 (raw file):
} // Graceful determines whether the shutdown should be effected via a
nit: s/Graceful/TerminateUsingGracefulDrain/
.
108613: server,cli/start: gracefully drain tenant servers upon service disable r=yuzefovich,stevendanna a=knz First commit from #108612. Thanks to `@stevendanna` for brainstorming solutions. TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by #96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None Fixes #107856 Epic: CRDB-26691 108681: sql/resolver: rename AvoidLeased to AssertNotLeased r=fqazi a=fqazi Previously, we had a flag to avoid leased descriptors, which did not do anything within look-up contexts. This was because this flag was never really used since skipping leased descriptors happens at a higher level with flags on the resolver. To address this, this the patch will make this flag more useful by making it an assertion that no leased descriptor is being used in these contexts. Release note: None 108723: serverctl: fix a docstring r=yuzefovich a=knz Requested by `@yuzefovich` in #108612 (review) Release note: None Epic: CRDB-26691 Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Needed to fix #107856.
TLDR: this patch fixes a bug whereby the top-level shutdown logic in
cli/start.go
mistakenly attempted a graceful drain when that, in fact, was impossible. This would show up as a storm of errors in logs and, sometimes, block server shutdown until a non-graceful shutdown was initiated by other means.A more detailed explanation follows.
For context, we have a notification mechanism (called "stop trigger") through which internal components in a server can request a shutdown of the overall server. As of this writing, this mechanism is used for example:
Drain
RPC, when theShutdown
request flag is true, to indicate the process can now exit.The way this works is that the internal component, given a reference to
*server.stopTrigger
, when it needs a server shutdown callssignalStop()
and passes a "shutdownrequest" (
serverctl.MakeShutdownRequest
).At the very outer orchestration layer, a loop monitor a channel from the
stopTrigger
and processes shutdown requests. As of this writing, we have at least the following orchestration points:cli/start.go
.testServer
shutdown monitor task (initiated from(*server.testServer).Start()
)(*server.channelOrchestrator) startControlledServer()
.Prior to this patch, a bug existed in
cli/start.go
as follows:ShutdownReasonServerStartupError
could be processed by performing a graceful drain on the server.cli/start.go
, it would attempt a graceful drain and the graceful drain process would generally fail with all kinds of error messages in logs. It would also sometime hang, blocking the overall shutdown.This patch fixes this by avoiding the graceful drain unless the shutdown request is for a graceful drain.
Through this, care is taken to ensure that if a hard shutdown is triggered after a graceful shutdown, the hard shutdown is also "delivered" to the orchestration. That allows hard shutdown to "catch up" with and shortcut graceful drains initiated earlier.
Release note (bug fix): A bug was fixed whereby
cockroach start
would sometimes incorrectly hang upon shutting down a server after encountering an internal error. This bug had been introduced some time in v22.x.Fixes #108611.
Epic: CRDB-28893