Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let the agent serve a status page via HTTP. #1066

Draft
wants to merge 6 commits into
base: master
from

Conversation

@philwo
Copy link
Contributor

commented Aug 6, 2019

This feature can optionally be enabled via the --health-check-addr addr:port flag or the equivalent BUILDKITE_HEALTH_CHECK_ADDR environment variable.

When enabled, the agent starts an HTTP server on the given addr:port that can be used for pull-based health checks, like Google Cloud's instance group autohealing. This allows the instance group to automatically detect when a VM isn't running a healthy Buildkite agent and recreate it.

It exports two endpoints:

  • /: This is an overall status page for the agent itself.
  • /agent/<spawn_id>: This is the status page of the agent worker with the numeric index spawn_id, starting from 1. If you use --spawn=N, there will be one status page for each running agent. This one actually takes the last heartbeat into account and if the heartbeat fails, this will return a failed HTTP response, signaling that the agent isn't healthy.

I'll still do some polish and testing of this, but am sending it already for review to get your feedback on this idea. :)

cli.StringFlag{
Name: "health-check-addr",
Usage: "Start an HTTP server on this addr:port that returns whether the agent is healthy, disabled by default",
EnvVar: "BUILDKITE_HEALTH_CHECK_ADDR",

This comment has been minimized.

Copy link
@lox

lox Aug 7, 2019

Contributor

For modern params we use BUILDKITE_AGENT_ as a prefix, it's long, but let's go with BUILDKITE_AGENT_HEALTH_CHECK_ADDR

@lox

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2019

This looks awesome! Very keen on the idea.

My biggest hesitation is the use of the default http server. It took me a while to click that a handler was being added in the agent package. I know this is how some of the standard library stuff works, but it's always seemed like a bad use of global scope to me.

What are your thoughts on that @philwo? I'd be happy to accept this implementation and we can refactor the next time we need to touch it, but figured it was worth discussing.

@lox

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2019

Oh, my only other very minor observation was there might be a small window of time where the / route replies with OK before the agent worker has actually spun up and registered.

@philwo

This comment has been minimized.

Copy link
Contributor Author

commented Aug 13, 2019

@lox Thanks for the review and good feedback! I'm on vacation this week, will follow up next week :)

@philwo philwo force-pushed the philwo:health-check branch from 954efa0 to fb5fed6 Aug 22, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.