Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli,server: distinguish graceful vs non-graceful shutdown requests #108612

Merged
merged 1 commit into from Aug 11, 2023

Commits on Aug 11, 2023

  1. cli,server: distinguish graceful vs non-graceful shutdown requests

    TLDR: this patch fixes a bug whereby the top-level shutdown logic in
    `cli/start.go` mistakenly attempted a graceful drain when that, in
    fact, was impossible. This would show up as a storm of errors in logs
    and, sometimes, block server shutdown until a non-graceful shutdown was
    initiated by other means.
    
    A more detailed explanation follows.
    
    *For context*, we have a notification mechanism (called "stop
    trigger") through which internal components in a server can request a
    shutdown of the overall server. As of this writing, this mechanism is
    used for example:
    
    - when an error is encountered very early during server startup.
    - at the tail end of the `Drain` RPC, when the `Shutdown` request
      flag is true, to indicate the process can now exit.
    - when a SQL liveness record is found to have been deleted or
      expired without heartbeat.
    
    The way this works is that the internal component, given a reference
    to `*server.stopTrigger`, when it needs a server shutdown calls
    `signalStop()` and passes a "shutdown
    request" (`serverctl.MakeShutdownRequest`).
    
    At the very outer orchestration layer, a loop monitor a channel from
    the `stopTrigger` and processes shutdown requests. As of this writing,
    we have at least the following orchestration points:
    
    1. the top-level shutdown logic in `cli/start.go`.
    2. the top-level `testServer` shutdown monitor task
       (initiated from `(*server.testServer).Start()`)
    3. the tenant server orchestration logic in
       `(*server.channelOrchestrator) startControlledServer()`.
    
    Prior to this patch, a bug existed in `cli/start.go` as follows:
    
    - the shutdown logic considered that any "shutdown reason" other than
      `ShutdownReasonServerStartupError` could be processed by performing a
      graceful drain on the server.
    - meanwhile, in fact *none* of the current shutdown reasons can be
      correctly processed via a graceful drain (they all signal conditions
      that make the current server unhealthy and unable to shut down
      gracefully).
    - as a result, if a shutdown request other than "server startup error"
      was received in `cli/start.go`, it would attempt a graceful
      drain and the graceful drain process would generally fail
      with all kinds of error messages in logs. It would also
      sometime hang, blocking the overall shutdown.
    
    This patch fixes this by avoiding the graceful drain unless the
    shutdown request is for a graceful drain.
    
    Through this, care is taken to ensure that if a hard shutdown is
    triggered after a graceful shutdown, the hard shutdown is also
    "delivered" to the orchestration. That allows hard shutdown to "catch
    up" with and shortcut graceful drains initiated earlier.
    
    Release note (bug fix): A bug was fixed whereby `cockroach start`
    would sometimes incorrectly hang upon shutting down a server after
    encountering an internal error. This bug had been introduced some time
    in v22.x.
    knz committed Aug 11, 2023
    Configuration menu
    Copy the full SHA
    29668e1 View commit details
    Browse the repository at this point in the history