Continue heartbeat after stop signal #991

Closed
wants to merge 1 commit into from

Conversation

stevenmatthewt

This addresses a regression that was added in #971.

Heartbeats were no longer sent after a SIGTERM or SIGINT was received, which interferes with the graceful shutdown. While the buildkite-agent shuts down gracefully, buildkite.com will display "Exited with status -1 (agent lost)" after ~5 minutes. This limits the graceful shutdown feature to a duration of only 5 minutes.

I understand the previous change addressed a race condition, so please double check that I haven't accidentally reintroduced that issue.

lox (Contributor) commented Apr 19, 2019

Argh, sorry about this regression! We need to figure out a way to test some of these edge cases.

Looking at the code, it seems like there might be a race condition on checking a.running. Whilst this is preferable to the bug I introduced, I wonder how we can fix it.

The intention of my change in #971 was to use a channel closure to signal the heartbeater and the pinger. Perhaps we just need to move the closure into a defer so it doesn't run until the agent actually terminates?
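The deferred-close idea can be sketched roughly as follows. This is a minimal illustration, not the buildkite-agent code: `agentWorker`, `isStopped`, `start`, and `runDeferDemo` are hypothetical names, and `work` stands in for whatever the agent does during its graceful shutdown period.

```go
package main

import "fmt"

// agentWorker is a hypothetical stand-in, not the real AgentWorker type.
type agentWorker struct {
	stop chan struct{} // closed once the worker has fully terminated
}

// isStopped reports whether the stop channel has been closed.
func (a *agentWorker) isStopped() bool {
	select {
	case <-a.stop:
		return true
	default:
		return false
	}
}

// start runs work and closes stop only when it returns, so any loop
// selecting on stop keeps running for the whole duration of work.
func (a *agentWorker) start(work func()) {
	defer close(a.stop)
	work()
}

// runDeferDemo records whether stop was closed during and after work.
func runDeferDemo() (during, after bool) {
	a := &agentWorker{stop: make(chan struct{})}
	a.start(func() { during = a.isStopped() })
	after = a.isStopped()
	return during, after
}

func main() {
	during, after := runDeferDemo()
	fmt.Println("stopped during work:", during)  // false
	fmt.Println("stopped after return:", after) // true
}
```

Because the close is deferred, a heartbeat loop waiting on `stop` would only exit once `start` returns, rather than at the moment the stop signal arrives.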

@@ -149,10 +154,6 @@ func (a *AgentWorker) Start() error {
a.logger.Error("Failed to heartbeat %s. Will try again in %s. (Last successful was %v ago)",
err, heartbeatInterval, time.Now().Sub(lastHeartbeat))
}

case <-a.stop:

The other way to do this would be with a context with cancel:

ctx, cancel := context.WithCancel(context.Background())
defer cancel()
...
case <-ctx.Done():
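A slightly fuller sketch of the context approach, under stated assumptions: `heartbeatLoop` and `runContextDemo` are hypothetical names, the one-millisecond interval is chosen only to keep the demo fast, and the `beat` callback stands in for the real heartbeat call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// heartbeatLoop calls beat on every tick until ctx is cancelled and
// returns the number of heartbeats sent.
func heartbeatLoop(ctx context.Context, interval time.Duration, beat func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	sent := 0
	for {
		select {
		case <-ctx.Done():
			return sent
		case <-ticker.C:
			// select picks at random when both cases are ready, so
			// re-check the context before sending another heartbeat.
			if ctx.Err() != nil {
				return sent
			}
			beat()
			sent++
		}
	}
}

// runContextDemo cancels the context from inside the third heartbeat.
func runContextDemo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	return heartbeatLoop(ctx, time.Millisecond, func() {
		calls++
		if calls == 3 {
			cancel()
		}
	})
}

func main() {
	fmt.Println("heartbeats sent:", runContextDemo()) // 3
}
```

The point of the design is that cancellation is decided by whoever owns `cancel`, so the loop keeps heartbeating through a stop signal as long as the owner delays the call until the agent has actually terminated.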

lox (Contributor) commented Apr 19, 2019

I think the introduction of a.stop was confusing. What we wanted was to signal that the agent had entered a stopping state, not a stopped state. That should end the ping timer but not the heartbeat timer, which should only stop when the agent reaches a stopped state.
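The stopping/stopped distinction could look something like the sketch below. It is an illustration only: `countTicks`, `runStateDemo`, and the two channels are hypothetical, and ticks are sent by hand over unbuffered channels instead of timers so the outcome is deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// countTicks counts ticks until done is closed. The ping loop and the
// heartbeat loop both use it, each with a different done channel.
func countTicks(ticks <-chan struct{}, done <-chan struct{}) int {
	n := 0
	for {
		select {
		case <-done:
			return n
		case <-ticks:
			n++
		}
	}
}

// runStateDemo walks through running, stopping, and stopped.
func runStateDemo() (pings, heartbeats int) {
	stopping := make(chan struct{}) // closed when a stop signal arrives
	stopped := make(chan struct{})  // closed when the agent has terminated
	pingTicks := make(chan struct{})
	beatTicks := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); pings = countTicks(pingTicks, stopping) }()
	go func() { defer wg.Done(); heartbeats = countTicks(beatTicks, stopped) }()

	pingTicks <- struct{}{} // running: ping for new work
	beatTicks <- struct{}{} // running: heartbeat
	close(stopping)         // stop signal: pinging ends here
	beatTicks <- struct{}{} // stopping: heartbeats continue
	beatTicks <- struct{}{}
	close(stopped) // terminated: heartbeats end
	wg.Wait()
	return pings, heartbeats
}

func main() {
	pings, heartbeats := runStateDemo()
	fmt.Println("pings:", pings, "heartbeats:", heartbeats) // pings: 1 heartbeats: 3
}
```

Keeping two separate signals means the ping loop can exit as soon as shutdown begins, while the heartbeat loop is tied to the later, terminal event.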

lox (Contributor) commented Apr 19, 2019

I wonder if a state machine would make this clearer 🤔

lox (Contributor) commented Apr 19, 2019

I made this change with a context.Context in #992.

stevenmatthewt (Author)

Changes in #992 look great. Closing this in favor of it.
