New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support Docker HEALTHCHECK #2087

Closed
bradrydzewski opened this Issue Jun 23, 2017 · 16 comments

Comments

Projects
8 participants
@bradrydzewski
Copy link
Member

bradrydzewski commented Jun 23, 2017

support a healthcheck for the server and agent

HEALTHCHECK CMD curl --fail http://localhost:3000/ || exit 1
@gtaylor

This comment has been minimized.

Copy link
Member

gtaylor commented Jun 23, 2017

This will be great for Kubernetes livelieness/readiness checks.

@dellintosh

This comment has been minimized.

Copy link

dellintosh commented Jun 23, 2017

Also Mesos... ;) 👍

@tboerger

This comment has been minimized.

Copy link
Member

tboerger commented Jun 29, 2017

And even plain docker-compose ;)

@mrueg

This comment has been minimized.

Copy link
Contributor

mrueg commented Jul 12, 2017

Agents could probably use https://github.com/grpc/grpc/blob/master/doc/health-checking.md and report it via a local debug socket.

@omerxx

This comment has been minimized.

Copy link

omerxx commented Aug 29, 2017

Hey everyone, any chance this gets priority anytime soon?
I'm trying to deploy 0.8 on AWS ECS.

@vanadium23

This comment has been minimized.

Copy link
Contributor

vanadium23 commented Sep 6, 2017

@bradrydzewski @gtaylor it's just adding this line to Dockerfile? Can I take it?
Also may be we need separate endpoint for healtcheck as described in Mozilla Dockerflow.

@bradrydzewski

This comment has been minimized.

Copy link
Member

bradrydzewski commented Sep 6, 2017

The first step is to propose a design. The server image does not have curl or any utilities available to ping the user interface (by design), which means a command needs to be added to the drone-server binary. For example drone-server --ping

The agent image does not have any ports that we can ping for healthcheck data. So the agent needs to be modified to add some sort of hooks for healthchecks. Similar to the server, we then need a command to ping the agent. For example drone-agent --ping

So the first step is to create a detailed design proposal. Once the design is approved we can begin coding, and yes, @vanadium23 would love to have your help coding this up. Thanks!

@tboerger

This comment has been minimized.

Copy link
Member

tboerger commented Sep 6, 2017

I'm also really interested in that as drone is one of only a few services for me without a healthcheck. What should the agent --ping or --healthcheck flag check for the agent? If a process on the system (container) exists since we are not opening any port?

@tboerger

This comment has been minimized.

Copy link
Member

tboerger commented Sep 6, 2017

Or should we just check if the agent is able to connect to the server?

@bradrydzewski

This comment has been minimized.

Copy link
Member

bradrydzewski commented Sep 6, 2017

Or should we just check if the agent is able to connect to the server?

I propose the initial implementation of the agent healthcheck should just return a 200 response if the agent is running.

Future versions of the agent should absolutely perform internal checks and return 500 if there are issues pinging the server or blocked builds, but I would like to keep that out of scope for the initial implementation since the agent grpc code is new and still evolving.

@gtaylor

This comment has been minimized.

Copy link
Member

gtaylor commented Sep 6, 2017

👍 A naive 200 would at least be be immediately helpful for detecting complete deadlock or other weird states.

@metalmatze

This comment has been minimized.

Copy link
Member

metalmatze commented Sep 11, 2017

That naive health check can also happen on some normal http page. ☺️

@bradrydzewski

This comment has been minimized.

Copy link
Member

bradrydzewski commented Sep 12, 2017

I started working on an initial healthcheck for the agent, however, I am not sure that a simple endpoint that returns a 200 will be enough here. Even if the agent is deadlocked or no longer processing builds, the http endpoint will likely still serve requests. If the agent dies, the docker container will exit with a non-zero exit code, which I presume kubenerets will detect, and then restart the image (based on whatever policy you have configured).

So I think to make the healthcheck meaningful we need to come up with a list of failure points and how we can detect these failure points. This information can then be used with the healthcheck endpoint.

Here are some things that can go wrong, off the top of my head:

  1. agent cannot connect to server, will perform infinite backoff to reconnect
  2. agent cannot make grpc call (usually due to server connection), will perform infinite retries
  3. docker daemon locks
  4. some internal lock / race / bug that we don't know about yet

I am not sure 1 and 2 are an error, since the agent will continue to retry. If the grpc connection suddenly starts failing it is likely a server connectivity issue that would be resolved by restoring server connectivity, so I'm not sure we want these errors to signify an unhealthy agent.

Number 3 and 4 could in theory be detected by keeping track of how long the build has been running. If the runtime has exceeded the timeout + buffer and the agent is still processing the build, we know there is probably an issue, and the healthcheck can return a 500

@gtaylor

This comment has been minimized.

Copy link
Member

gtaylor commented Sep 12, 2017

I like the conditions you've outlined. As an side, having a stupid 200 healthcheck would still be useful for Readiness probes, since we could at least see that the process is in good enough shape to return the 200s (rather than deadlocked entirely or in some other weird state).

@bradrydzewski

This comment has been minimized.

Copy link
Member

bradrydzewski commented Sep 12, 2017

sure thing. can I get a quick review #2209 ?

@bradrydzewski

This comment has been minimized.

Copy link
Member

bradrydzewski commented Sep 12, 2017

ok, merged #2209 with the more advanced logic mentioned above. This will be included in 0.8.0-rc.5 and should therefore get bundled with 0.8.0-final pending we don't discover any last minute issues.

note that this change also gives you the option to expose port 3000 for the agent to access the following endpoint:

  • /varz returns some basic agent statistics, including currently running builds
  • /healthz returns 200 if agent is healthy
  • /version returns returns the version information
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment