Add support for user-defined healthchecks #22719

talex5 · 2016-05-13T12:16:03Z

This PR adds support for user-defined health-check probes for Docker containers. It adds a HEALTHCHECK instruction to the Dockerfile syntax plus some corresponding "docker run" options. It can be used with a restart policy to automatically restart a container if the check fails.

The HEALTHCHECK instruction has two forms (more may be added later, e.g. HTTP):

HEALTHCHECK [OPTIONS] CMD command (check container health by running a command inside the container)
HEALTHCHECK NONE (disable any healthcheck inherited from the base image)

The HEALTHCHECK instruction tells Docker how to test a container to check that it is still working. This can detect cases such as a web server that is stuck in an infinite loop and unable to handle new connections, even though the server process is still running.

When a container has a healthcheck specified, it has a health status in addition to its normal status. This status is initially starting. When a health check passes, it becomes healthy. After a certain number of failures, it becomes unhealthy.

The options that can appear before CMD are:

--interval=DURATION (default: 30s)
--timeout=DURATION (default: 30s)
--grace=DURATION (default: 30s)
--retries=N (default: 1)
--exit-on-unhealthy=X (default: true)

The health check will first run interval seconds after the container is started, and then again interval seconds after each previous check completes.

If a single run of the check takes longer than timeout seconds then the check is considered to have failed.

If the health state is starting and a check started within grace seconds of the container's start time fails, the failure is ignored and the health state remains starting.

It takes retries consecutive failures of the health check for the container to be considered unhealthy.

If --exit-on-unhealthy is true then the container will exit as soon as it becomes unhealthy. The container may then be automatically restarted, depending on its restart policy.

For example, to check every five minutes or so that a web-server is able to serve the site's main page within three seconds:

HEALTHCHECK --interval=5m --grace=20s --timeout=3s --exit-on-unhealthy \
  CMD curl -f http://localhost/

(from https://github.com/talex5/docker/blob/healthcheck/docs/reference/builder.md#healthcheck)

The changes to "docker run" are described here:

https://github.com/talex5/docker/blob/healthcheck/docs/reference/run.md#healthcheck

Example

$ docker run --name=test -d \
    --health-cmd='stat /etc/passwd' \
    --health-interval=2s \
    --exit-on-unhealthy=false \
    busybox sleep 1d
$ sleep 2; docker inspect --format='{{.State.Health.Status}}' test
healthy
$ docker exec test rm /etc/passwd
$ sleep 2; docker inspect --format='{{json .State.Health}}' test
{
  "Status":"unhealthy",
  "FailingStreak":1,
  "LastCheckStart":"2016-05-09T11:09:09.673709108Z",
  "LastCheckEnd":"2016-05-09T11:09:09.786146142Z",
  "LastExitCode":1,
  "LastOutput":"stat: can't stat '/etc/passwd': No such file or directory\n"
}

The health status is also displayed in the docker ps output.
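For a client that wants to consume the JSON shown above, a minimal Go sketch might look like this. The field names are taken from the example output; the `health` struct and `parseHealth` helper are hypothetical names for illustration and are not part of Docker's API packages.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// health mirrors the fields in the example `docker inspect` output.
// It is a hypothetical type for this sketch.
type health struct {
	Status         string
	FailingStreak  int
	LastCheckStart time.Time
	LastCheckEnd   time.Time
	LastExitCode   int
	LastOutput     string
}

// parseHealth decodes the JSON printed by
// docker inspect --format='{{json .State.Health}}'.
func parseHealth(data []byte) (health, error) {
	var h health
	err := json.Unmarshal(data, &h)
	return h, err
}

func main() {
	sample := []byte(`{
	  "Status":"unhealthy",
	  "FailingStreak":1,
	  "LastCheckStart":"2016-05-09T11:09:09.673709108Z",
	  "LastCheckEnd":"2016-05-09T11:09:09.786146142Z",
	  "LastExitCode":1,
	  "LastOutput":"stat: can't stat '/etc/passwd': No such file or directory\n"
	}`)
	h, err := parseHealth(sample)
	if err != nil {
		panic(err)
	}
	// Report the status, streak, and how long the last check took.
	fmt.Println(h.Status, h.FailingStreak, h.LastCheckEnd.Sub(h.LastCheckStart))
}
```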

Description for the changelog: Add support for user-defined healthchecks

Closes #21142 and #21143.

thaJeztah · 2016-05-13T12:41:39Z

thanks @talex5!

ping @icecrime 😄

cpuguy83 · 2016-05-13T13:05:46Z

docs/reference/builder.md

+
+The options that can appear before `CMD` are:
+
+* `--interval=DURATION` (default: `30s`)


These all seem to be runtime configurations that ought not be in the Dockerfile.

It does seem a bit problematic. I might test my container on a fast, unloaded machine with an SSD, and then you run it on a slow, loaded machine with a hard drive where startup takes much longer, so it fails the health check (e.g. MySQL seems to take a long time to start serving, I noticed recently). It does seem odd to have values that are related to runtime performance in the Dockerfile.

I think I'm really a hard -1 on these flags in the Dockerfile.

cpuguy83 · 2016-05-13T13:23:45Z

There doesn't seem to be anything documented on the protocol between docker and the probe.
I imagine this is just exit statuses, but would be good to document.

Do we want this to be a simple 0 == OK, non-zero == bad, or something more robust, à la Nagios-style checks (ok, warning, critical, and status messages for reporting to the user)?

cpuguy83 · 2016-05-13T13:25:35Z

Images also often define the ports they are listening on (EXPOSE). Would we want to provide a flag that runs some default check on these ports (like opening a TCP connection)? This would be similar to how -P automatically creates port forwards for all EXPOSE'd ports.

crosbymichael · 2016-05-13T18:12:24Z

daemon/health.go

+
+// exec the healthcheck command in the container.
+// Returns the exit code and error message (if any)
+func (*cmdProbe) run(ctx context.Context, d *Daemon, container *container.Container) (int, string) {


In Go, your second return value here should be an error, not a string.

The error here is normally the error string returned by the probe rather than a Go error, although there are a few places where I handle the (unlikely) case of a Go error starting the probe by pretending that the error came from the probe itself. If we return these errors as a third return value, we'll just turn it into a string at the next opportunity anyway.

@talex5 Errors in Go are just values. You can add whatever content to the error makes sense for the application. I would expect these to build on exec.ExitError or define a special error type to encapsulate the error. Using a stringly-typed approach is just asking for bugs.

Ya, this return value should be an error 100%. Even if it is just a string you can use errors.New("my string") or fmt.Errorf("my %s", "string") but you should not return string especially for how you are using it.

aluzzardi · 2016-05-31T19:11:42Z

You can have a []byte but it gets base64 encoded by the marshaler.

That's true in Go but it's very language specific.

What I meant by "you can't have a []byte there" is, at the end, that field is going to be a string no matter what. Whether we base64 encode or not is another question.

How do we currently encode output for other commands (e.g. logs)?

stevvooe · 2016-05-31T19:22:06Z

What if we did Raw with []byte and Output with string?
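To make the trade-off concrete, here is a small Go sketch of a result carrying both fields. The `result` struct and `encode` helper are hypothetical; the behaviour shown, that `encoding/json` emits a `[]byte` as a base64 string, is standard library behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// result is a hypothetical shape for the proposal: a human-readable
// Output string alongside the unprocessed Raw bytes.
type result struct {
	Output string `json:"Output"`
	Raw    []byte `json:"Raw"`
}

// encode marshals the same probe output into both fields.
func encode(out []byte) (string, error) {
	b, err := json.Marshal(result{Output: string(out), Raw: out})
	return string(b), err
}

func main() {
	s, err := encode([]byte("ok"))
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // {"Output":"ok","Raw":"b2s="}
}
```

Either way the field is a JSON string on the wire; the only difference is whether the consumer has to base64-decode it.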

thaJeztah · 2016-05-31T19:56:56Z

What if we did Raw with []byte and Output with string?

SGTM, would give both options; would Raw be limited in size?

thaJeztah · 2016-06-01T00:08:08Z

ping @aluzzardi @crosbymichael WDYT on @stevvooe's proposal?


talex5 · 2016-06-01T19:59:45Z

I'm not sure what you mean about having both. Would we include the same data twice in each message, or somehow know where to send which data?

Another possibility: if a probe wants to generate binary data then it base64 encodes it itself. That way, probes that only want text don't need to do anything and the messages are human-readable. Probes that return binary data would need something at the client end to interpret it anyway, so decoding it wouldn't be much extra work. The data on the wire would be the same as if we'd used []byte in that case.

stevvooe · 2016-06-01T20:44:14Z

@talex5 Yes, have both. One that is "human-readable", and hopefully properly sanitized, and one that is the raw, unprocessed output.

This will meet both the use cases of sending binary data and debugging problems with health check output during development. It is unreasonable to make users go through so much work just to get unprocessed command output.

Let's make sure this feature isn't limited to our imaginations but, instead, enables others'.

cpuguy83 · 2016-06-02T20:18:26Z

I do not see the purpose of supporting more than just simple output from a healthcheck, and certainly not storing both binary and human-readable....

I think we should support only something very simple for a human to use as debug info, or potentially remove support for capturing output from the healthcheck from this PR and implement separately.

cpuguy83 · 2016-06-02T20:47:57Z

docs/reference/builder.md

+
+- 0: success - the container is healthy and ready for use
+- 1: unhealthy - the container is not working correctly
+- 2: starting - the container is not ready for use yet, but is working correctly


Isn't this state really just any failure before the first success?
I would recommend removing it.

Without this, whatever is monitoring the status will need some timeout to decide that the container has failed to start. @justincormack (I think) pointed out that a simple timeout is not enough to cover e.g. a database that is busy performing recovery actions. So the idea of this state is to allow the container to tell the monitor that it needs more time, and avoid the monitor having to guess what the grace period should be.

A nit about the labels/description attached to the return codes. Some containers are "run once and die" in that they produce some artifact and exit. Take for example a container that generates a TLS certificate and writes it to a volume, which other containers subsequently mount in order to use the certificate.

I would propose something like:
0: succeeded - the container is healthy or has exited successfully
1: failed - the container is unhealthy or has exited unsuccessfully
2: starting (in-progress?) - the container is not ready for use yet or has not completed its task
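The exit-code protocol under discussion can be written as a small mapping. This is a sketch only: `statusFor` is a hypothetical helper, and treating exit codes other than 0, 1, and 2 as unhealthy is an assumption, since the thread only defines those three.

```go
package main

import "fmt"

// statusFor maps a probe exit code to a health status, following the
// three codes listed in the docs diff above.
func statusFor(exitCode int) string {
	switch exitCode {
	case 0:
		return "healthy"
	case 2:
		return "starting"
	default:
		// 1 is documented as unhealthy; other codes are an assumption.
		return "unhealthy"
	}
}

func main() {
	for _, code := range []int{0, 1, 2} {
		fmt.Println(code, statusFor(code))
	}
}
```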

crosbymichael · 2016-06-02T21:32:42Z

For the output, in this PR we should keep it as is. The size has already been increased to 4096, which is much better than what it was before. Most commands that do health checks should be encouraged to write a simple reason why they are reporting as they did: "I returned unhealthy BECAUSE I could not connect to the database." We don't want to encourage checks to do a yolo operation and puke up some stacktrace to decipher.

println("could not connect to database");
return 1;

In the future, after this has some users and they want some type of more structured response than a limited string, it would be effortless to add an unbounded []byte array to the output for this. It is something that is safe to defer until we gain feedback.
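A bounded output field of the kind described above could be enforced with a simple cap. The 4096 figure comes from this thread; the `limitOutput` helper and the choice to cut the tail silently are assumptions, and the PR's actual truncation logic may differ.

```go
package main

import "fmt"

// maxOutputLen is the limit mentioned in the thread.
const maxOutputLen = 4096

// limitOutput is a hypothetical helper that keeps at most
// maxOutputLen bytes of the probe's output.
func limitOutput(out []byte) string {
	if len(out) > maxOutputLen {
		out = out[:maxOutputLen]
	}
	return string(out)
}

func main() {
	short := limitOutput([]byte("could not connect to database"))
	long := limitOutput(make([]byte, 5000))
	fmt.Println(len(short), len(long))
}
```

Cutting at a byte offset can split a multi-byte UTF-8 character, which is one reason to keep probe messages short and plain.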

For the three states, simple health checks don't have to use the starting state if they don't want to. The normal developer can just return 1 or 0. If the app has a long startup or some type of complexity on boot, then it can take advantage of this state, and we don't have to make the guesses that we would otherwise be making. Let's not write software that guesses things; let the things that know the state tell us. You are not forced to use this state if you don't want to.

After looking at the code again, I think all these things are already implemented.

LGTM

thaJeztah · 2016-06-02T21:35:20Z

ok, enough bike-shedding. Let's go for it LGTM

dongluochen · 2016-06-02T22:21:56Z

builder/dockerfile/parser/testfiles/health/Dockerfile

+ADD check.sh main.sh /app/
+CMD /app/main.sh
+HEALTHCHECK
+HEALTHCHECK --interval=5s --timeout=3s --retries=1 \


nit: I think a good practice is multiple failures like 3 to tolerate random failures. Giving an example of --retries=1 might not be appropriate.

I suggest we set default retries to 3.

dongluochen · 2016-06-02T22:26:04Z

PR looks good to me.

thaJeztah · 2016-06-02T22:43:04Z

@dongluochen I carried the PR, rebased/squashed and with vendor update in #23218. Let me know if you need those addressed in the PR, otherwise I'll do a follow-up

dongluochen · 2016-06-02T23:06:18Z

Thanks @thaJeztah. Follow-up is good.


mageddo · 2017-07-26T13:57:49Z

Guys what about the restart policy? I think that is very important, the reason was explained here. So will this feature be implemented?

stevvooe · 2017-07-26T20:57:53Z

@mageddo This is a long closed pull request, carried elsewhere. If you have questions or a feature request, please file a new issue.

thaJeztah · 2017-07-26T21:48:32Z

Locking the conversation in this issue for the reason @stevvooe mentioned above; comments on closed issues and PRs easily go unnoticed, so I'm locking the conversation to prevent that from happening.

GordonTheTurtle added area/distribution status/0-triage labels May 13, 2016

thaJeztah added this to the 1.12.0 milestone May 13, 2016

thaJeztah added status/1-design-review and removed status/0-triage labels May 13, 2016

This was referenced May 13, 2016

Proposal - Application-defined "alive probe" #21142

Closed

Proposal - Image defined probe #21143

Closed

cpuguy83 reviewed May 13, 2016

crosbymichael reviewed May 13, 2016

icecrime added status/2-code-review and removed area/distribution status/1-design-review labels May 13, 2016

Remove the arbitrary output limitation of 255 bytes

33ffb3a

Use 4096 instead. Signed-off-by: Thomas Leonard <thomas.leonard@docker.com>

cpuguy83 reviewed Jun 2, 2016

dongluochen reviewed Jun 2, 2016

thaJeztah mentioned this pull request Jun 2, 2016

[Carry 22719] healthcheck feature #23218

Merged

crosbymichael closed this in #23218 Jun 2, 2016

crosbymichael added a commit that referenced this pull request Jun 2, 2016

Add User defined Healthchecks for Containers

ce255f7

Carry of #22719

thaJeztah mentioned this pull request Jun 3, 2016

Healthcheck: set default retries to 3 #23232

Merged

nishanttotla mentioned this pull request Jun 10, 2016

Healthcheck Support moby/swarmkit#641

Closed

thaJeztah mentioned this pull request Jun 10, 2016

remove unused defaultExitOnUnhealthy constant #23442

Merged

augi mentioned this pull request Jul 8, 2016

Check health state of containers Closes #31 avast/gradle-docker-compose-plugin#33

Merged

thekid mentioned this pull request Oct 23, 2016

More booted checks tueftler/boot#2

Open

talex5 mentioned this pull request Nov 15, 2016

trigger restart from unhealthy status #28400

Open

willfarrell mentioned this pull request Feb 13, 2017

Add support for unhealthy status mcasimir/docker-autoheal#3

Open

moby locked and limited conversation to collaborators Jul 26, 2017
