add health check in docker build-in swarm mode #24139
Conversation
```go
// run wait and checkHealth
go runner(r.adapter.wait, waitErrCh)
go runner(r.checkHealth, healthErrCh)
```
@tonistiigi, two goroutines are used instead of one as you suggested, because with a single goroutine I ran into some tricky synchronization problems.
I tested the health check with two restart conditions, `any` and `on-failure`. It works correctly, and an unhealthy container that gets killed exits with code 137.
Is it OK to keep two goroutines here?
I'm not sure what problems appeared. I think if our only action on an unhealthy event is to shut down the container, we should wait here until the shutdown has actually happened, and return an error that combines the wait error with the "unhealthy" cause.
If you check the exit code only on the container, then this is not handled here. We always set the exit code, but in this case we don't seem to be setting it as a task failure reason.
@tonistiigi updated. Now we wait for the unhealthy container to fully shut down.
@runshenzhu how about something like:
```go
// Wait on the container to exit.
func (r *controller) Wait(pctx context.Context) error {
	if err := r.checkClosed(); err != nil {
		return err
	}

	ctx, cancel := context.WithCancel(pctx)
	defer cancel()

	healthErr := make(chan error, 1)
	go func() {
		ectx, cancel := context.WithCancel(ctx) // cancel event context on first event
		defer cancel()
		if err := r.checkHealth(ectx); err == ErrContainerUnhealthy {
			healthErr <- ErrContainerUnhealthy
			if err := r.Shutdown(ctx); err != nil {
				log.G(ctx).WithError(err).Debug("shutdown failed on unhealthy")
			}
		}
	}()

	err := r.adapter.wait(ctx)
	if ctx.Err() != nil {
		return ctx.Err()
	}

	if err != nil {
		ee := &exitError{}
		if ec, ok := err.(exec.ExitCoder); ok {
			ee.code = ec.ExitCode()
		}
		select {
		case e := <-healthErr:
			ee.cause = e
		default:
			if err.Error() != "" {
				ee.cause = err
			}
		}
		return ee
	}

	return nil
}
```
I will update it. The previous concern was how to synchronize on the `shutdown failed on unhealthy` error correctly.
Now it's much easier, since we just log that error.
Yes, I checked that in `controller.Remove()` we also only log in the same case, so it should be fine. I'm not even sure it is possible for `Shutdown` to return an error.
This can be in a follow-up PR, but it would be great to get an integration test for this as well. We can use the same template that the regular healthcheck test uses, just through a service.
@tonistiigi Test is added. PTAL
The task should not move out of starting until the container is healthy. This means that
I was thinking of an integration test, but it doesn't matter for this PR.
I agree, but we should implement that in swarmkit first. LGTM. ping @aaronlehmann
@tonistiigi CI is broken, but I have no idea how to fix it. :(
@runshenzhu I restarted it
@@ -0,0 +1,102 @@
// +build !windows |
@tonistiigi maybe I should add this.
Did this test fail before on Windows?
No. I tested on Linux and it passed. After adding this test, there was one run where all tests passed except on Windows.
Also, I noticed that both `docker_cli_swarm_test` and `docker_cli_service_update_test` are built with `!windows`.
> Also I noticed that both `docker_cli_swarm_test` and `docker_cli_service_update_test` are built with `!windows`

This is because they use a second daemon binary. There may be something here as well that doesn't work on Windows that I'm missing.
```go
	}()

	err := r.adapter.wait(ctx)
```
Does `adapter.wait` exit when the health check is failing?
Yes, it will exit. The unhealthy container will be shut down, which causes `adapter.wait` to exit.
I am having issues with an image that uses health checks: it runs `curl localhost:port`, and from my logs and the swarm service inspect output, the error I got came from the swarm node hosting the container being checked.
I expected something more like a `docker exec` run inside the container being checked.
Thanks for any feedback.
Signed-off-by: runshenzhu <runshen.zhu@gmail.com>
ping @LK4D4
@tonistiigi once swarmkit#1122 gets merged, should I port it to the Docker engine?
@runshenzhu Yes, but that can be in a separate PR.
@tonistiigi @stevvooe Are you good with this one?
I LGTM'ed
LGTM.
The approach is to listen on the event queue to monitor the container's status. When an unhealthy container is detected, it is shut down, and a corresponding error is returned to the upper layer.
`SubscribeToEvents` is introduced to the backend to implement `adapter.events`.
@dongluochen @tonistiigi @stevvooe
Signed-off-by: runshenzhu <runshen.zhu@gmail.com>