TLDR: this patch fixes a bug whereby the top-level shutdown logic in
`cli/start.go` mistakenly attempted a graceful drain when that, in
fact, was impossible. This would show up as a storm of errors in logs
and, sometimes, block server shutdown until a non-graceful shutdown was
initiated by other means.
A more detailed explanation follows.
*For context*, we have a notification mechanism (called "stop
trigger") through which internal components in a server can request a
shutdown of the overall server. As of this writing, this mechanism is
used for example:
- when an error is encountered very early during server startup.
- at the tail end of the `Drain` RPC, when the `Shutdown` request
flag is true, to indicate the process can now exit.
- when a SQL liveness record is found to have been deleted or
expired without heartbeat.
The way this works is that the internal component, given a reference
to `*server.stopTrigger`, when it needs a server shutdown calls
`signalStop()` and passes a "shutdown
request" (`serverctl.MakeShutdownRequest`).
At the very outer orchestration layer, a loop monitor a channel from
the `stopTrigger` and processes shutdown requests. As of this writing,
we have at least the following orchestration points:
1. the top-level shutdown logic in `cli/start.go`.
2. the top-level `testServer` shutdown monitor task
(initiated from `(*server.testServer).Start()`)
3. the tenant server orchestration logic in
`(*server.channelOrchestrator) startControlledServer()`.
Prior to this patch, a bug existed in `cli/start.go` as follows:
- the shutdown logic considered that any "shutdown reason" other than
`ShutdownReasonServerStartupError` could be processed by performing a
graceful drain on the server.
- meanwhile, in fact *none* of the current shutdown reasons can be
correctly processed via a graceful drain (they all signal conditions
that make the current server unhealthy and unable to shut down
gracefully).
- as a result, if a shutdown request other than "server startup error"
was received in `cli/start.go`, it would attempt a graceful
drain and the graceful drain process would generally fail
with all kinds of error messages in logs. It would also
sometime hang, blocking the overall shutdown.
This patch fixes this by avoiding the graceful drain unless the
shutdown request is for a graceful drain.
Through this, care is taken to ensure that if a hard shutdown is
triggered after a graceful shutdown, the hard shutdown is also
"delivered" to the orchestration. That allows hard shutdown to "catch
up" with and shortcut graceful drains initiated earlier.
Release note (bug fix): A bug was fixed whereby `cockroach start`
would sometimes incorrectly hang upon shutting down a server after
encountering an internal error. This bug had been introduced some time
in v22.x.