Change the signal handler to ensure the agent quits after the grace period #3200
…eriod

This uses two new calls, which are only triggered after the second quit attempt, to ensure the agent exits:
1. After half the cancel grace period (defaults to at least 10s) the agent will cancel the context.
2. At the end of the cancel grace period, if the agent is stuck it will call os.Exit(1).
1cb3aa3 to 44074a8
DrJosh9000 left a comment
Mostly LGTM, just some small things.
💭 This little helper could be moved to a different package, given it will no longer be job-specific.
I am open to suggestions here. I agree it should move out, I'm just not sure where best to put it.
Not urgent, we can come back to it later.
clicommand/agent_start.go
Outdated
    func cancelGracePeriodDuration(cancelGracePeriod int) time.Duration {
        if cancelGracePeriod < 10 {
            return 10 // minimum 10 seconds for forceful shutdown
The function's return type is time.Duration, but time.Duration(10) means 10 nanoseconds. If we keep this function it ought to return a time.Duration that represents the intended duration (as in, multiply by time.Second within).
clicommand/agent_start.go
Outdated
    l.Info("Forced agent(s) to stop")
    pool.Stop(false) // one last chance to stop

    time.Sleep(cancelGracePeriodDuration(cancelGracePeriod/2) * time.Second)
cancelGracePeriodDuration could be eliminated with the new max builtin:
    - time.Sleep(cancelGracePeriodDuration(cancelGracePeriod/2) * time.Second)
    + time.Sleep(max(cancelGracePeriod/2, 10) * time.Second)
@DrJosh9000 I added your suggestions and tweaked it to remove the helper function, see what you think.
Description
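This uses two new calls, which are only triggered after the second quit attempt, to ensure the agent exits:
1. After half the cancel grace period (defaults to at least 10s) the agent will cancel the context.
2. At the end of the cancel grace period, if the agent is stuck it will call os.Exit(1).
Calling pool.Stop with graceful set to false should, if at all possible, trigger disconnect calls for all the agents.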
Note
This logic is only invoked after a second exit signal.
Context
If the agent is unable to communicate with the agent API, it can take almost an hour to stop.
Changes
Adds more logic to the signal handler to ensure the agent exits.
Testing
… go test ./...). Buildkite employees may check this if the pipeline has run automatically.
… go fmt ./...)

To test this logic I used https://github.com/smarty/cproxy as an HTTP proxy with the agent configured to use it. Once a job was started and the process was running, I killed the proxy, interrupted the agent twice, and waited for the service to exit.