cli/start: remove the 1-minute hard shutdown timeout #44074

knz · 2020-01-16T16:01:14Z

Prior to this patch, after CockroachDB received an instruction to
gracefully shut down (signal, `Drain` request, etc.), the code for
`cockroach start` would start a 1-minute countdown. If the graceful
shutdown did not complete within that time, a hard shutdown was
triggered instead.

This behavior was neither necessary nor desirable.

It is not necessary because process managers already have "process
shutdown timeout" logic to force-shutdown a process that does not
terminate in a timely manner. It is not the db's responsibility to do
the service manager's job (in fact, the redundancy in behavior can be
confusing to troubleshoot).

It is not desirable either because in large clusters, a graceful
shutdown may truly last longer than a minute. Graceful shutdowns are
also rather important to ensure a smooth transition during e.g. a
rolling upgrade, as they guarantee a transition without latency
blips. Even though this `cockroach start` timeout is not the
only such timeout in the code, it is one obstacle to painless
graceful shutdowns and thus ought to be removed.

This patch achieves just that.

Release note (cli change): The CockroachDB node
command (`start`/`start-single-node`) no longer initiates a
1-minute hard shutdown countdown after a request to gracefully
terminate. This means that graceful shutdowns are now free to take
longer than one minute. It also means that deployments where a
maximum shutdown time must be enforced must now use a service manager
that is suitably configured to do so.

cockroach-teamcity · 2020-01-16T16:01:24Z

This change is


tbg

Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained

knz · 2020-01-17T10:46:53Z

tfyr

bors r=tbg

44074: cli/start: remove the 1-minute hard shutdown timeout r=tbg a=knz

Fixes #43902.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

craig · 2020-01-17T11:14:51Z

Build succeeded

GitHub CI (Cockroach)

knz requested a review from tbg January 16, 2020 16:01

knz requested a review from a team as a code owner January 16, 2020 16:01

knz added this to To do in DB Server & Security via automation Jan 16, 2020

knz moved this from To do to In progress in DB Server & Security Jan 16, 2020

knz force-pushed the 20200116-cli-start-timeout branch from d1a0cbc to d697c92 Compare January 16, 2020 18:11

tbg approved these changes Jan 17, 2020

craig bot merged commit d697c92 into cockroachdb:master Jan 17, 2020

DB Server & Security automation moved this from In progress to Done 20.1 Jan 17, 2020

jseldess mentioned this pull request Feb 19, 2020

cli/start: remove the 1-minute hard shutdown timeout cockroachdb/docs#6630

Closed

knz mentioned this pull request Mar 24, 2020

release-19.2: cli/start: remove the 1-minute hard shutdown timeout #46483

Merged

knz deleted the 20200116-cli-start-timeout branch March 24, 2020 14:00
