Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli/start: remove the 1-minute hard shutdown timeout #44074

Merged
merged 1 commit into from
Jan 17, 2020

Conversation

knz
Copy link
Contributor

@knz knz commented Jan 16, 2020

Fixes #43902.

Prior to this patch, after CockroachDB receives an instruction to
gracefully shut down (signal, Drain request etc), the code for
cockroach start would start a 1-minute countdown. If the graceful
shutdown did not complete within that time, a hard shutdown was
triggered instead.

This behavior was neither necessary nor desirable.

It is not necessary because process managers already have "process
shutdown timeout" logic to force-shutdown a process that does not
terminate in a timely manner. It is not the db's responsibility to do
the service manager's job (in fact, the redundancy in behavior can be
confusing to troubleshoot).

It is not desirable either because in large clusters, a graceful
shutdown may truly last longer than a minute. Graceful shutdowns are
also rather important to ensure a smooth transition during e.g. a
rolling upgrade, as they guarantee a transition without latency
blips. Even though this cockroach start timeout is not the
only such timeout through the code, it is one obstacle to painless
graceful shutdowns and thus ought to be removed.

This patch achieves just that.

Release note (cli change): The CockroachDB node
command (start/start-single-node) does not any more initiate a
1-minute hard shutdown countdown after a request to gracefully
terminates. This means that graceful shutdowns are now free to take
longer than one minute. It also means that deployments where a
maximum shutdown time must be enforced must now use a service manager
that is suitably configured to do so.

@knz knz requested a review from tbg January 16, 2020 16:01
@knz knz requested a review from a team as a code owner January 16, 2020 16:01
@knz knz added this to To do in DB Server & Security via automation Jan 16, 2020
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz moved this from To do to In progress in DB Server & Security Jan 16, 2020
Prior to this patch, after CockroachDB receives an instruction to
gracefully shut down (signal, `Drain` request etc), the code for
`cockroach start` would start a 1-minute countdown. If the graceful
shutdown did not complete within that time, a hard shutdown was
triggered instead.

This behavior was neither necessary nor desirable.

It is not necessary because process managers already have "process
shutdown timeout" logic to force-shutdown a process that does not
terminate in a timely manner. It is not the db's responsibility to do
the service manager's job (in fact, the redundancy in behavior can be
confusing to troubleshoot).

It is not desirable either because in large clusters, a graceful
shutdown may truly last longer than a minute. Graceful shutdowns are
also rather important to ensure a smooth transition during e.g. a
rolling upgrade, as they guarantee a transition without latency
blips. Even though this `cockroach start` timeout is not the
only such timeout through the code, it is one obstacle to painless
graceful shutdowns and thus ought to be removed.

This patch achieves just that.

Release note (cli change): The CockroachDB node
command (`start`/`start-single-node`) does not any more initiate a
1-minute hard shutdown countdown after a request to gracefully
terminates. This means that graceful shutdowns are now free to take
longer than one minute. It also means that deployments where a
maximum shutdown time must be enforced must now use a service manager
that is suitably configured to do so.
@knz knz force-pushed the 20200116-cli-start-timeout branch from d1a0cbc to d697c92 Compare January 16, 2020 18:11
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@knz
Copy link
Contributor Author

knz commented Jan 17, 2020

tfyr

bors r=tbg

craig bot pushed a commit that referenced this pull request Jan 17, 2020
44074: cli/start: remove the 1-minute hard shutdown timeout r=tbg a=knz

Fixes #43902.

Prior to this patch, after CockroachDB receives an instruction to
gracefully shut down (signal, `Drain` request etc), the code for
`cockroach start` would start a 1-minute countdown. If the graceful
shutdown did not complete within that time, a hard shutdown was
triggered instead.

This behavior was neither necessary nor desirable.

It is not necessary because process managers already have "process
shutdown timeout" logic to force-shutdown a process that does not
terminate in a timely manner. It is not the db's responsibility to do
the service manager's job (in fact, the redundancy in behavior can be
confusing to troubleshoot).

It is not desirable either because in large clusters, a graceful
shutdown may truly last longer than a minute. Graceful shutdowns are
also rather important to ensure a smooth transition during e.g. a
rolling upgrade, as they guarantee a transition without latency
blips. Even though this `cockroach start` timeout is not the
only such timeout through the code, it is one obstacle to painless
graceful shutdowns and thus ought to be removed.

This patch achieves just that.

Release note (cli change): The CockroachDB node
command (`start`/`start-single-node`) does not any more initiate a
1-minute hard shutdown countdown after a request to gracefully
terminates. This means that graceful shutdowns are now free to take
longer than one minute. It also means that deployments where a
maximum shutdown time must be enforced must now use a service manager
that is suitably configured to do so.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@craig
Copy link
Contributor

craig bot commented Jan 17, 2020

Build succeeded

@craig craig bot merged commit d697c92 into cockroachdb:master Jan 17, 2020
DB Server & Security automation moved this from In progress to Done 20.1 Jan 17, 2020
@knz knz deleted the 20200116-cli-start-timeout branch March 24, 2020 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cli,server: remove the shutdown timeout upon the first SIGTERM
3 participants