cli,server: distinguish graceful vs non-graceful shutdown requests #108612

TLDR: this patch fixes a bug whereby the top-level shutdown logic in `cli/start.go` mistakenly attempted a graceful drain when that was, in fact, impossible. This would show up as a storm of errors in logs and would sometimes block server shutdown until a non-graceful shutdown was initiated by other means.

A more detailed explanation follows.

*For context*, we have a notification mechanism (called "stop trigger") through which internal components in a server can request a shutdown of the overall server. As of this writing, this mechanism is used, for example:

- when an error is encountered very early during server startup.
- at the tail end of the `Drain` RPC, when the `Shutdown` request flag is true, to indicate the process can now exit.
- when a SQL liveness record is found to have been deleted or expired without heartbeat.

The way this works is that the internal component, given a reference to `*server.stopTrigger`, calls `signalStop()` when it needs a server shutdown and passes a "shutdown request" (`serverctl.MakeShutdownRequest`). At the outermost orchestration layer, a loop monitors a channel from the `stopTrigger` and processes shutdown requests.

As of this writing, we have at least the following orchestration points:

1. the top-level shutdown logic in `cli/start.go`.
2. the top-level `testServer` shutdown monitor task (initiated from `(*server.testServer).Start()`).
3. the tenant server orchestration logic in `(*server.channelOrchestrator) startControlledServer()`.

Prior to this patch, a bug existed in `cli/start.go` as follows:

- the shutdown logic considered that any "shutdown reason" other than `ShutdownReasonServerStartupError` could be processed by performing a graceful drain on the server.
- meanwhile, in fact *none* of the current shutdown reasons can be correctly processed via a graceful drain (they all signal conditions that make the current server unhealthy and unable to shut down gracefully).
- as a result, if a shutdown request other than "server startup error" was received in `cli/start.go`, it would attempt a graceful drain, and the drain would generally fail with all kinds of error messages in logs. It would also sometimes hang, blocking the overall shutdown.

This patch fixes this by avoiding the graceful drain unless the shutdown request is explicitly for a graceful drain. In doing so, care is taken to ensure that if a hard shutdown is triggered after a graceful shutdown, the hard shutdown is also "delivered" to the orchestration. That allows a hard shutdown to "catch up" with and shortcut graceful drains initiated earlier.

Release note (bug fix): A bug was fixed whereby `cockroach start` would sometimes incorrectly hang upon shutting down a server after encountering an internal error. This bug had been introduced some time in v22.x.
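
To illustrate the behavior described above, here is a minimal, self-contained sketch of a stop trigger that distinguishes graceful from non-graceful requests and lets a later hard request catch up with an earlier graceful one. It is not the code from this patch: the `shutdownRequest` fields, the `orchestrate` function, and the `drainDone` channel are hypothetical stand-ins chosen for the example, and only the names `stopTrigger` and `signalStop` are borrowed from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownRequest is a hypothetical stand-in for the real request type.
type shutdownRequest struct {
	reason   string
	graceful bool // assumption: requests carry an explicit graceful flag
}

// stopTrigger delivers shutdown requests to the orchestration layer.
type stopTrigger struct {
	mu      sync.Mutex
	sawAny  bool
	sawHard bool
	c       chan shutdownRequest
}

func newStopTrigger() *stopTrigger {
	// Buffer of 2: at most one initial request plus one catch-up hard request.
	return &stopTrigger{c: make(chan shutdownRequest, 2)}
}

// signalStop always delivers the first request. A later hard request is
// delivered once more if only graceful requests were seen so far; every
// other duplicate is dropped.
func (t *stopTrigger) signalStop(req shutdownRequest) {
	t.mu.Lock()
	defer t.mu.Unlock()
	switch {
	case !t.sawAny:
		t.sawAny = true
		t.sawHard = !req.graceful
	case !req.graceful && !t.sawHard:
		t.sawHard = true
	default:
		return
	}
	t.c <- req // never blocks: at most two sends can reach this line
}

// orchestrate models the outer loop: it drains only when explicitly asked
// to, and lets a hard request interrupt a drain that is still in progress.
// drainDone is closed when the (simulated) graceful drain completes.
func orchestrate(t *stopTrigger, drainDone <-chan struct{}) string {
	req := <-t.c
	if !req.graceful {
		return "hard stop: " + req.reason
	}
	select {
	case <-drainDone:
		return "drained: " + req.reason
	case hard := <-t.c:
		return "hard stop (drain interrupted): " + hard.reason
	}
}

func main() {
	// A hard request arrives while the graceful drain has not finished.
	trig := newStopTrigger()
	trig.signalStop(shutdownRequest{reason: "drain requested", graceful: true})
	trig.signalStop(shutdownRequest{reason: "liveness record expired"})
	neverDone := make(chan struct{})
	fmt.Println(orchestrate(trig, neverDone))
}
```

The key design point in this sketch is that the trigger deduplicates requests without ever swallowing an escalation: a graceful request followed by a hard one yields two deliveries, while a hard request followed by anything yields one.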


Commits on Aug 11, 2023