Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli,server: distinguish graceful vs non-graceful shutdown requests #108612

Merged
merged 1 commit into from Aug 11, 2023

Conversation

knz
Copy link
Contributor

@knz knz commented Aug 11, 2023

Needed to fix #107856.

TLDR: this patch fixes a bug whereby the top-level shutdown logic in cli/start.go mistakenly attempted a graceful drain when that, in fact, was impossible. This would show up as a storm of errors in logs and, sometimes, block server shutdown until a non-graceful shutdown was initiated by other means.

A more detailed explanation follows.

For context, we have a notification mechanism (called "stop trigger") through which internal components in a server can request a shutdown of the overall server. As of this writing, this mechanism is used for example:

  • when an error is encountered very early during server startup.
  • at the tail end of the Drain RPC, when the Shutdown request flag is true, to indicate the process can now exit.
  • when a SQL liveness record is found to have been deleted or expired without heartbeat.

The way this works is that the internal component, given a reference to *server.stopTrigger, when it needs a server shutdown calls signalStop() and passes a "shutdown
request" (serverctl.MakeShutdownRequest).

At the very outer orchestration layer, a loop monitor a channel from the stopTrigger and processes shutdown requests. As of this writing, we have at least the following orchestration points:

  1. the top-level shutdown logic in cli/start.go.
  2. the top-level testServer shutdown monitor task (initiated from (*server.testServer).Start())
  3. the tenant server orchestration logic in (*server.channelOrchestrator) startControlledServer().

Prior to this patch, a bug existed in cli/start.go as follows:

  • the shutdown logic considered that any "shutdown reason" other than ShutdownReasonServerStartupError could be processed by performing a graceful drain on the server.
  • meanwhile, in fact none of the current shutdown reasons can be correctly processed via a graceful drain (they all signal conditions that make the current server unhealthy and unable to shut down gracefully).
  • as a result, if a shutdown request other than "server startup error" was received in cli/start.go, it would attempt a graceful drain and the graceful drain process would generally fail with all kinds of error messages in logs. It would also sometime hang, blocking the overall shutdown.

This patch fixes this by avoiding the graceful drain unless the shutdown request is for a graceful drain.

Through this, care is taken to ensure that if a hard shutdown is triggered after a graceful shutdown, the hard shutdown is also "delivered" to the orchestration. That allows hard shutdown to "catch up" with and shortcut graceful drains initiated earlier.

Release note (bug fix): A bug was fixed whereby cockroach start would sometimes incorrectly hang upon shutting down a server after encountering an internal error. This bug had been introduced some time in v22.x.

Fixes #108611.
Epic: CRDB-28893

@knz knz requested review from a team as code owners August 11, 2023 16:14
@knz knz requested review from rachitgsrivastava and DarrylWong and removed request for a team August 11, 2023 16:14
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 11, 2023
TLDR: this patch fixes a bug whereby the top-level shutdown logic in
`cli/start.go` mistakenly attempted a graceful drain when that, in
fact, was impossible. This would show up as a storm of errors in logs
and, sometimes, block server shutdown until a non-graceful shutdown was
initiated by other means.

A more detailed explanation follows.

*For context*, we have a notification mechanism (called "stop
trigger") through which internal components in a server can request a
shutdown of the overall server. As of this writing, this mechanism is
used for example:

- when an error is encountered very early during server startup.
- at the tail end of the `Drain` RPC, when the `Shutdown` request
  flag is true, to indicate the process can now exit.
- when a SQL liveness record is found to have been deleted or
  expired without heartbeat.

The way this works is that the internal component, given a reference
to `*server.stopTrigger`, when it needs a server shutdown calls
`signalStop()` and passes a "shutdown
request" (`serverctl.MakeShutdownRequest`).

At the very outer orchestration layer, a loop monitor a channel from
the `stopTrigger` and processes shutdown requests. As of this writing,
we have at least the following orchestration points:

1. the top-level shutdown logic in `cli/start.go`.
2. the top-level `testServer` shutdown monitor task
   (initiated from `(*server.testServer).Start()`)
3. the tenant server orchestration logic in
   `(*server.channelOrchestrator) startControlledServer()`.

Prior to this patch, a bug existed in `cli/start.go` as follows:

- the shutdown logic considered that any "shutdown reason" other than
  `ShutdownReasonServerStartupError` could be processed by performing a
  graceful drain on the server.
- meanwhile, in fact *none* of the current shutdown reasons can be
  correctly processed via a graceful drain (they all signal conditions
  that make the current server unhealthy and unable to shut down
  gracefully).
- as a result, if a shutdown request other than "server startup error"
  was received in `cli/start.go`, it would attempt a graceful
  drain and the graceful drain process would generally fail
  with all kinds of error messages in logs. It would also
  sometime hang, blocking the overall shutdown.

This patch fixes this by avoiding the graceful drain unless the
shutdown request is for a graceful drain.

Through this, care is taken to ensure that if a hard shutdown is
triggered after a graceful shutdown, the hard shutdown is also
"delivered" to the orchestration. That allows hard shutdown to "catch
up" with and shortcut graceful drains initiated earlier.

Release note (bug fix): A bug was fixed whereby `cockroach start`
would sometimes incorrectly hang upon shutting down a server after
encountering an internal error. This bug had been introduced some time
in v22.x.
@knz
Copy link
Contributor Author

knz commented Aug 11, 2023

TFYR!

bors r=stevendanna

@knz knz added the db-cy-23 label Aug 11, 2023
@craig
Copy link
Contributor

craig bot commented Aug 11, 2023

Build succeeded:

@craig craig bot merged commit f32fc17 into cockroachdb:master Aug 11, 2023
6 of 7 checks passed
@blathers-crl
Copy link

blathers-crl bot commented Aug 11, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 29668e1 to blathers/backport-release-23.1-108612: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/serverctl/shutdown.go line 25 at r1 (raw file):

}

// Graceful determines whether the shutdown should be effected via a

nit: s/Graceful/TerminateUsingGracefulDrain/.

craig bot pushed a commit that referenced this pull request Aug 14, 2023
108613: server,cli/start: gracefully drain tenant servers upon service disable r=yuzefovich,stevendanna a=knz

First commit from #108612.
Thanks to `@stevendanna` for brainstorming solutions.

TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via `cockroach mt start-sql`) and those started via the
internal server controller.

More details follow.

For context, the reader is invited to first refer to the commit
immediately prior.

Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.

This caused two separate bugs:

- it created a race condition between the server controller (for
  shared-process servers) and the service mode checker task inside the
  SQL server (`startCheckService`). When the service mode was flipped to
  'NONE', the server controller initiated a graceful drain but then that
  was caught up by the service mode checker and converted to a
  hard shutdown. This particular bug was a regression caused by #96144
  and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`.

- it was generally causing shutdowns upon changes in service mode to
  always be run as hard shutdowns (both for external-process and shared-process
  ervers). This is generally undesirable when the data state is still
  READY, as it prevents cleaning up the SQL liveness record and can
  cause delays during subsequent server startups.

This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.

Release note: None
Fixes #107856
Epic: CRDB-26691

108681: sql/resolver: rename AvoidLeased to AssertNotLeased r=fqazi a=fqazi

Previously, we had a flag to avoid leased descriptors, which did not do anything within look-up contexts. This was because this flag was never really used since skipping leased descriptors happens at a higher level with flags on the resolver. To address this, this
the patch will make this flag more useful by making it an assertion that no leased descriptor is being used in these contexts.

Release note: None

108723: serverctl: fix a docstring r=yuzefovich a=knz

Requested by `@yuzefovich` in #108612 (review)

Release note: None
Epic: CRDB-26691

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 db-cy-23
Projects
None yet
4 participants