cli,server: distinguish graceful vs non-graceful shutdown requests #108612

knz · 2023-08-11T16:14:13Z

Needed to fix #107856.

TLDR: this patch fixes a bug whereby the top-level shutdown logic in cli/start.go mistakenly attempted a graceful drain when that, in fact, was impossible. This would show up as a storm of errors in logs and, sometimes, block server shutdown until a non-graceful shutdown was initiated by other means.

A more detailed explanation follows.

For context, we have a notification mechanism (called "stop trigger") through which internal components in a server can request a shutdown of the overall server. As of this writing, this mechanism is used for example:

when an error is encountered very early during server startup.
at the tail end of the Drain RPC, when the Shutdown request flag is true, to indicate the process can now exit.
when a SQL liveness record is found to have been deleted or expired without heartbeat.

The way this works is that the internal component, given a reference to *server.stopTrigger, when it needs a server shutdown calls signalStop() and passes a "shutdown
request" (serverctl.MakeShutdownRequest).

At the very outer orchestration layer, a loop monitor a channel from the stopTrigger and processes shutdown requests. As of this writing, we have at least the following orchestration points:

the top-level shutdown logic in cli/start.go.
the top-level testServer shutdown monitor task (initiated from (*server.testServer).Start())
the tenant server orchestration logic in (*server.channelOrchestrator) startControlledServer().

Prior to this patch, a bug existed in cli/start.go as follows:

the shutdown logic considered that any "shutdown reason" other than ShutdownReasonServerStartupError could be processed by performing a graceful drain on the server.
meanwhile, in fact none of the current shutdown reasons can be correctly processed via a graceful drain (they all signal conditions that make the current server unhealthy and unable to shut down gracefully).
as a result, if a shutdown request other than "server startup error" was received in cli/start.go, it would attempt a graceful drain and the graceful drain process would generally fail with all kinds of error messages in logs. It would also sometime hang, blocking the overall shutdown.

This patch fixes this by avoiding the graceful drain unless the shutdown request is for a graceful drain.

Through this, care is taken to ensure that if a hard shutdown is triggered after a graceful shutdown, the hard shutdown is also "delivered" to the orchestration. That allows hard shutdown to "catch up" with and shortcut graceful drains initiated earlier.

Release note (bug fix): A bug was fixed whereby cockroach start would sometimes incorrectly hang upon shutting down a server after encountering an internal error. This bug had been introduced some time in v22.x.

Fixes #108611.
Epic: CRDB-28893

cockroach-teamcity · 2023-08-11T16:14:23Z

This change is

TLDR: this patch fixes a bug whereby the top-level shutdown logic in `cli/start.go` mistakenly attempted a graceful drain when that, in fact, was impossible. This would show up as a storm of errors in logs and, sometimes, block server shutdown until a non-graceful shutdown was initiated by other means. A more detailed explanation follows. *For context*, we have a notification mechanism (called "stop trigger") through which internal components in a server can request a shutdown of the overall server. As of this writing, this mechanism is used for example: - when an error is encountered very early during server startup. - at the tail end of the `Drain` RPC, when the `Shutdown` request flag is true, to indicate the process can now exit. - when a SQL liveness record is found to have been deleted or expired without heartbeat. The way this works is that the internal component, given a reference to `*server.stopTrigger`, when it needs a server shutdown calls `signalStop()` and passes a "shutdown request" (`serverctl.MakeShutdownRequest`). At the very outer orchestration layer, a loop monitor a channel from the `stopTrigger` and processes shutdown requests. As of this writing, we have at least the following orchestration points: 1. the top-level shutdown logic in `cli/start.go`. 2. the top-level `testServer` shutdown monitor task (initiated from `(*server.testServer).Start()`) 3. the tenant server orchestration logic in `(*server.channelOrchestrator) startControlledServer()`. Prior to this patch, a bug existed in `cli/start.go` as follows: - the shutdown logic considered that any "shutdown reason" other than `ShutdownReasonServerStartupError` could be processed by performing a graceful drain on the server. - meanwhile, in fact *none* of the current shutdown reasons can be correctly processed via a graceful drain (they all signal conditions that make the current server unhealthy and unable to shut down gracefully). - as a result, if a shutdown request other than "server startup error" was received in `cli/start.go`, it would attempt a graceful drain and the graceful drain process would generally fail with all kinds of error messages in logs. It would also sometime hang, blocking the overall shutdown. This patch fixes this by avoiding the graceful drain unless the shutdown request is for a graceful drain. Through this, care is taken to ensure that if a hard shutdown is triggered after a graceful shutdown, the hard shutdown is also "delivered" to the orchestration. That allows hard shutdown to "catch up" with and shortcut graceful drains initiated earlier. Release note (bug fix): A bug was fixed whereby `cockroach start` would sometimes incorrectly hang upon shutting down a server after encountering an internal error. This bug had been introduced some time in v22.x.

knz · 2023-08-11T20:08:19Z

TFYR!

bors r=stevendanna

craig · 2023-08-11T20:48:15Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-08-11T20:48:23Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 29668e1 to blathers/backport-release-23.1-108612: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

yuzefovich

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/server/serverctl/shutdown.go line 25 at r1 (raw file):

}

// Graceful determines whether the shutdown should be effected via a

nit: s/Graceful/TerminateUsingGracefulDrain/.

108613: server,cli/start: gracefully drain tenant servers upon service disable r=yuzefovich,stevendanna a=knz First commit from #108612. Thanks to `@stevendanna` for brainstorming solutions. TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by #96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None Fixes #107856 Epic: CRDB-26691 108681: sql/resolver: rename AvoidLeased to AssertNotLeased r=fqazi a=fqazi Previously, we had a flag to avoid leased descriptors, which did not do anything within look-up contexts. This was because this flag was never really used since skipping leased descriptors happens at a higher level with flags on the resolver. To address this, this the patch will make this flag more useful by making it an assertion that no leased descriptor is being used in these contexts. Release note: None 108723: serverctl: fix a docstring r=yuzefovich a=knz Requested by `@yuzefovich` in #108612 (review) Release note: None Epic: CRDB-26691 Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>

knz requested review from stevendanna and yuzefovich August 11, 2023 16:14

knz requested review from a team as code owners August 11, 2023 16:14

knz requested review from rachitgsrivastava and DarrylWong and removed request for a team August 11, 2023 16:14

knz added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 11, 2023

knz mentioned this pull request Aug 11, 2023

server,cli/start: gracefully drain tenant servers upon service disable #108613

Merged

stevendanna approved these changes Aug 11, 2023

View reviewed changes

knz force-pushed the 20230811-graceful-drain branch from 033703f to 29668e1 Compare August 11, 2023 20:07

knz added the db-cy-23 label Aug 11, 2023

craig bot merged commit f32fc17 into cockroachdb:master Aug 11, 2023
6 of 7 checks passed

knz mentioned this pull request Aug 11, 2023

release-23.1: cli,server: distinguish graceful vs non-graceful shutdown requests #108640

Closed

knz deleted the 20230811-graceful-drain branch August 11, 2023 20:59

yuzefovich reviewed Aug 12, 2023

View reviewed changes

knz mentioned this pull request Aug 14, 2023

serverctl: fix a docstring #108723

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cli,server: distinguish graceful vs non-graceful shutdown requests #108612

cli,server: distinguish graceful vs non-graceful shutdown requests #108612

knz commented Aug 11, 2023 •

edited

cockroach-teamcity commented Aug 11, 2023

knz commented Aug 11, 2023

craig bot commented Aug 11, 2023

blathers-crl bot commented Aug 11, 2023

yuzefovich left a comment

cli,server: distinguish graceful vs non-graceful shutdown requests #108612

cli,server: distinguish graceful vs non-graceful shutdown requests #108612

Conversation

knz commented Aug 11, 2023 • edited

cockroach-teamcity commented Aug 11, 2023

knz commented Aug 11, 2023

craig bot commented Aug 11, 2023

blathers-crl bot commented Aug 11, 2023

yuzefovich left a comment

Choose a reason for hiding this comment

knz commented Aug 11, 2023 •

edited