-
Notifications
You must be signed in to change notification settings - Fork 18.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for user-defined healthchecks #22719
Conversation
|
||
The options that can appear before `CMD` are: | ||
|
||
* `--interval=DURATION` (default: `30s`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These all seem to be runtime configurations that ought not be in the Dockerfile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem a bit problematic. I might test my container on a fast unloaded machine with SSD and then you run it on a slow loaded machine with a hard drive and the startup time is much longer so it fails the health check (eg mysql seems to take a long time to start serving I noticed recently). It does seem odd to have values that are related to runtime performance in the dockerfile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delay
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm really a hard -1 on these flags in the Dockerfile.
There doesn't seem to be anything documented on the protocol between docker and the probe. Do we want this to be a simple 0 == OK, not zero == bad or more robust ala nagios-style checks (ok, warning, critical, and status messages for reporting to the user)? |
Images also often define the ports they listening on ( |
|
||
// exec the healthcheck command in the container. | ||
// Returns the exit code and error message (if any) | ||
func (*cmdProbe) run(ctx context.Context, d *Daemon, container *container.Container) (int, string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In go, your second return value should be error
here and not return strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error here is normally the error string returned by the probe rather than a Go error, although there are a few places where I handle the (unlikely) case of a Go error starting the probe by pretending that the error came from the probe itself. If we return these errors as a third return value, we'll just turn it into a string at the next opportunity anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@talex5 Errors in Go are just values. You can add the error content to the error that makes sense for the application. I would expect these to build on exec.ExitError
or define a special error type to encapsulate the error. Use a stringly-typed approach is just calling for bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, this return value should be an error
100%. Even if it is just a string you can use errors.New("my string")
or fmt.Errorf("my %s", "string")
but you should not return string
especially for how you are using it.
That's true in Go but it's very language specific. What I meant by "you can't have a []byte there" is, at the end, that field is going to be a string no matter what. Whether we base64 encode or not is another question. How do we currently encode output for other commands (e.g. logs)? |
What if we did |
SGTM, would give both options; would |
ping @aluzzardi @crosbymichael WDYT on @stevvooe's proposal? |
Use 4096 instead. Signed-off-by: Thomas Leonard <thomas.leonard@docker.com>
I'm not sure what you mean about having both. Would we include the same data twice in each message, or somehow know where to send which data? Another possibility: if a probe wants to generate binary data then it base64 encodes it itself. That way, probes that only want text don't need to do anything and the messages are human-readable. Probes that return binary data would need something at the client end to interpret it anyway, so decoding it wouldn't be much extra work. The data on the wire would be the same as if we'd used |
@talex5 Yes, have both. One that is "human-readable", and hopefully properly sanitized, and one that is the raw, unprocessed output. This will meet both the use cases of sending binary data and debugging problems with health check output during development. It is unreasonable to make users go through so much work just to get unprocessed command output. Let's make sure this feature isn't limited to our imaginations but, instead, enables others'. |
I do not see the purpose of supporting more than just simple output from a healthcheck, and certainly not storing both binary and human-readable.... I think we should support only something very simple for a human to use as debug info, or potentially remove support for capturing output from the healthcheck from this PR and implement separately. |
|
||
- 0: success - the container is healthy and ready for use | ||
- 1: unhealthy - the container is not working correctly | ||
- 2: starting - the container is not ready for use yet, but is working correctly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this state really just any failure before the first success
?
I would recommend removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this, whatever is monitoring the status will need some timeout to decide that the container has failed to start. @justincormack (I think) pointed out that a simple timeout is not enough to cover e.g. a database that is busy performing recovery actions. So the idea of this state is to allow the container to tell the monitor that it needs more time, and avoid the monitor having to guess what the grace period should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit about the labels/description attached to the return codes. Some containers are "run once and die" in that they produce some artifact and exit. Take for example a container that generates a TLS certificate and writes it to a volume, which other containers subsequently mount in order to use the certificate.
I would propose something like:
0: succeeded - the container is healthy or has exited successfully
1: failed - the container is unhealthy or has exited unsuccessfully
2: starting (in-progress?) - the container is not ready for use yet or has not completed its task
For the output, in this PR we should keep it as is. The size has already been increased to 4096 which is much better than what it was before. Most commands that do health checks should be encouraged to write a simple reason why they are reporting as they did. "I returned unhealthy BECAUSE i could not connect to the database." We don't want to encourage checks to do a yolo operation and puke up some stacktrace to decipher. println("could not connect to database");
return 1; In the future after this has some user and they are wanting some type of more structured response than a limited string it would be effortless to add an unbounded []byte array to the output for this. It is something that is safe to defer until we gain feedback. For the three states, simple health checks don't have to use the starting state if they don't want to. To the normal developer they can just return 1 or 0. If the app has a long startup or some type of complexity on boot then it can take advantage of this state and we don't have to infer any type of guesses that we would otherwise be doing. Lets to write software that guesses things, let the things that know the state tell us. You are not forced to use this state if you don't want to. After looking at the code again, I think all these things are already implemented. LGTM |
ok, enough bike-shedding. Let's go for it LGTM |
ADD check.sh main.sh /app/ | ||
CMD /app/main.sh | ||
HEALTHCHECK | ||
HEALTHCHECK --interval=5s --timeout=3s --retries=1 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think a good practice is multiple failures like 3 to tolerate random failures. Giving an example of --retries=1
might not be appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we set default retries to 3.
PR looks good to me. |
@dongluochen I carried the PR, rebased/squashed and with vendor update in #23218. Let me know if you need those addressed in the PR, otherwise I'll do a follow-up |
Thanks @thaJeztah. Follow-up is good. |
Guys what about the restart policy? I think that is very important, the reason was explained here. So will this feature be implemented? |
@mageddo This is a long closed pull request, carried elsewhere. If you have questions or a feature request, please file a new issue. |
Locking the conversation in this issue for the reason @stevvoe mentioned above; comments on closed issues and PRs easily go unnoticed - I'm locking the conversation to prevent that from happening |
This PR adds support for user-defined health-check probes for Docker containers. It adds a
HEALTHCHECK
instruction to the Dockerfile syntax plus some corresponding "docker run" options. It can be used with a restart policy to automatically restart a container if the check fails.The
HEALTHCHECK
instruction has two forms (more may be added later, e.g.HTTP
):HEALTHCHECK [OPTIONS] CMD command
(check container health by running a command inside the container)HEALTHCHECK NONE
(disable any healthcheck inherited from the base image)The
HEALTHCHECK
instruction tells Docker how to test a container to check that it is still working. This can detect cases such as a web server that is stuck in an infinite loop and unable to handle new connections, even though the server process is still running.When a container has a healthcheck specified, it has a health status in addition to its normal status. This status is initially
starting
. When a health check passes, it becomeshealthy
. After a certain number of failures, it becomesunhealthy
.The options that can appear before
CMD
are:--interval=DURATION
(default:30s
)--timeout=DURATION
(default:30s
)--grace=DURATION
(default:30s
)--retries=N
(default:1
)--exit-on-unhealthy=X
(default:true
)The health check will first run interval seconds after the container is started, and then again interval seconds after each previous check completes.
If a single run of the check takes longer than timeout seconds then the check is considered to have failed.
If the health state is
starting
and a check started within grace seconds of the container's start time fails, the failure is ignored and the health state remainsstarting
.It takes retries consecutive failures of the health check for the container to be considered
unhealthy
.If
--exit-on-unhealthy
istrue
then the container will exit as soon as it becomes unhealthy. The container may then be automatically restarted, depending on its restart policy.For example, to check every five minutes or so that a web-server is able to serve the site's main page within three seconds:
(from https://github.com/talex5/docker/blob/healthcheck/docs/reference/builder.md#healthcheck)
The changes to "docker run" are described here:
https://github.com/talex5/docker/blob/healthcheck/docs/reference/run.md#healthcheck
Example
The health status is also displayed in the
docker ps
output.Description for the changelog: Add support for user-defined healthchecks
Closes #21142 and #21143.