Skip to content

Change the signal handler to ensure the agent quits after the grace period#3200

Merged
wolfeidau merged 2 commits intomainfrom
feat_force_shutdown
Feb 24, 2025
Merged

Change the signal handler to ensure the agent quits after the grace period#3200
wolfeidau merged 2 commits intomainfrom
feat_force_shutdown

Conversation

@wolfeidau
Copy link
Member

@wolfeidau wolfeidau commented Feb 24, 2025

Description

This uses two new calls to which are only triggered after the second quit attempt to ensure the agent exits:

  1. After half the cancel grace period (defaults to at least 10s) the agent will cancel the context.
  2. At the end of the cancel grace period, if the agent is stuck it will call os.Exit(1).

The pool stop with graceful false should, if at all possible, trigger disconnect calls for all the agents.

Note

This code logic is only invoked after a second exit signal.

Context

If the agent is unable to commicate with the agent API it can take almost an hour to stop.

Changes

Adds more logic to the signal handler to ensure the agent exits.

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

To test this logic I used https://github.com/smarty/cproxy as a HTTP proxy with the agent configured to use it, then once a job was started and the process was running I killed the proxy, interupted the agent twice and waited for the service to exit.

@wolfeidau wolfeidau requested review from a team and DrJosh9000 February 24, 2025 02:34
…eriod

This uses two new calls to which are only triggered after the second
quit attempt to ensure the agent exits:

1. After half the cancel grace period (defaults to at least 10s) the
   agent will cancel the context.
2. At the end of the cancel grace period, if the agent is stuck it will
   call os.Exit(1).
@wolfeidau wolfeidau marked this pull request as ready for review February 24, 2025 02:37
Copy link
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, just some small things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 This little helper be moved to a different package, given it will be no longer job-specific.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to suggestions here, agree it should move out, just not sure were best to put it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not urgent, we can come back to it later.


func cancelGracePeriodDuration(cancelGracePeriod int) time.Duration {
if cancelGracePeriod < 10 {
return 10 // minimum 10 seconds for forceful shutdown
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function's return type is time.Duration, but time.Duration(10) means 10 nanoseconds. If we keep this function it ought to return a time.Duration that represents the intended duration (as in, multiply by time.Second within).

l.Info("Forced agent(s) to stop")
pool.Stop(false) // one last chance to stop

time.Sleep(cancelGracePeriodDuration(cancelGracePeriod/2) * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancelGracePeriodDuration could be eliminated with the new max builtin:

Suggested change
time.Sleep(cancelGracePeriodDuration(cancelGracePeriod/2) * time.Second)
time.Sleep(max(cancelGracePeriod/2, 10) * time.Second)

@wolfeidau
Copy link
Member Author

@DrJosh9000 I added your suggestions and tweaked it to remove the helper function, see what you think.

@wolfeidau wolfeidau requested a review from DrJosh9000 February 24, 2025 05:31
Copy link
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM 👍

@wolfeidau wolfeidau merged commit f751db7 into main Feb 24, 2025
1 check passed
@wolfeidau wolfeidau deleted the feat_force_shutdown branch February 24, 2025 05:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants