Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: separate signals for passive, active, and forced shutdown #12358

Merged
merged 7 commits into from
Mar 15, 2024

Conversation

kylecarbs
Copy link
Member

@kylecarbs kylecarbs commented Feb 28, 2024

This is from a customer request to gracefully drain instances instead of forcefully killing the jobs.

SIGTERM: Passive shutdown stopping provisioner daemons from accepting new
jobs but waiting for existing jobs to be successfully completed.

SIGINT (old existing behavior): Notify provisioner daemons to cancel in-flight jobs, wait for 5s for jobs to be exited, then force quit.

SIGKILL: Untouched from before, will force-quit.

@kylecarbs kylecarbs self-assigned this Feb 28, 2024
@kylecarbs kylecarbs force-pushed the gracefulshutdown branch 2 times, most recently from 5715f44 to e823bae Compare February 28, 2024 22:01
`SIGTERM`: Passive shutdown stopping provisioner daemons from accepting new
jobs but waiting for existing jobs to successfully complete.

`SIGINT` (old existing behavior): Notify provisioner daemons to cancel in-flight jobs, wait 5s for jobs to be exited, then force quit.

`SIGKILL`: Untouched from before, will force-quit.
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like there being an option to let in-flight jobs complete, but I think it would be worthwhile to consolidate around one shutdown method. (I'm fine with wait for jobs being the default.)

Also, I think we should merge back "interrupt" and "stop" signal slices so that we don't break other commands.

If we still want interrupt in coderd to be "faster" (only useful for development, really?), we could use two additional slices.

var StopSignals = []os.Signal{ // new default (prev. interruptsignals renamed)
	os.Interrupt,
	syscall.SIGTERM,
	syscall.SIGHUP,
}

var StopSignalsNoInterrupt = []os.Signal{
	syscall.SIGTERM,
	syscall.SIGHUP,
}
var InterruptSignals = []os.Signal{
	os.Interrupt,
}

That's my proposal, also moved HUP since there's no inherent reason for HUP to use the fast-sequence either.

(We don't even need to take HUP as a stop signal, but it can be useful for development. If we don't use it for stop we should probably ignore it though to protect users from abrupt shutdown.)

cli/server.go Outdated
exitErr = notifyCtx.Err()
case <-stopCtx.Done():
exitErr = stopCtx.Err()
graceful = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel both methods are graceful, so perhaps a rename here is appropriate?

Suggested change
graceful = true
waitForProvisionerJobs = true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's reasonable.

cli/signal_unix.go Outdated Show resolved Hide resolved
@@ -225,7 +225,7 @@ func (r *RootCmd) provisionerDaemonStart() *clibase.Cmd {
cliui.Errorf(inv.Stderr, "Unexpected error, shutting down server: %s\n", exitErr)
}

err = srv.Shutdown(ctx)
err = srv.Shutdown(ctx, false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're changing the default behavior for interrupt, and since we split interrupt and stop, we're also not handling SIGTERM here anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

os.Interrupt,
syscall.SIGTERM,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change in all CLI commands that handle signals. They used to handle TERM but no longer do.

@aaronlehmann
Copy link
Contributor

It would be really fantastic to get this finished and included in a release! That would make deployments much safer since they wouldn't risk interrupting the provisioner.

@github-actions github-actions bot added the stale This issue is like stale bread. label Mar 14, 2024
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🎉

os.Interrupt,
}

var StopSignalsNoInterrupt = []os.Signal{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love how Windows leaves you SOL. 🥲

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No doubt!

@kylecarbs kylecarbs removed the stale This issue is like stale bread. label Mar 15, 2024
@kylecarbs kylecarbs enabled auto-merge (squash) March 15, 2024 12:55
@kylecarbs kylecarbs disabled auto-merge March 15, 2024 12:55
@kylecarbs kylecarbs enabled auto-merge (squash) March 15, 2024 12:56
@kylecarbs kylecarbs merged commit 895df54 into main Mar 15, 2024
21 checks passed
@kylecarbs kylecarbs deleted the gracefulshutdown branch March 15, 2024 13:16
@github-actions github-actions bot locked and limited conversation to collaborators Mar 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants