add health check in docker build-in swarm mode #24139
Conversation
```go
// run wait and checkHealth
go runner(r.adapter.wait, waitErrCh)
go runner(r.checkHealth, healthErrCh)
```
@tonistiigi, two goroutines are used instead of one as you suggested, because with a single goroutine I ran into some tricky synchronization problems.
I tested the health check with two restart conditions, `any` and `on-failure`. It works correctly, and an unhealthy container that gets killed exits with code 137.
Is it OK to keep two goroutines here?
I'm not sure what problems appeared. I think if our only action on an unhealthy event is to shut down the container, we should wait here until the shutdown has actually happened, and return an error that combines the wait error with the "unhealthy" cause.
If you check the exit code only on the container, then this is not handled here. We always set the exit code, but in this case we don't seem to be setting it as a task failure reason.
@tonistiigi updated. Now we wait for the unhealthy container to fully shut down.
@runshenzhu how about something like:
```go
// Wait on the container to exit.
func (r *controller) Wait(pctx context.Context) error {
	if err := r.checkClosed(); err != nil {
		return err
	}

	ctx, cancel := context.WithCancel(pctx)
	defer cancel()

	healthErr := make(chan error, 1)
	go func() {
		ectx, cancel := context.WithCancel(ctx) // cancel event context on first event
		defer cancel()
		if err := r.checkHealth(ectx); err == ErrContainerUnhealthy {
			healthErr <- ErrContainerUnhealthy
			if err := r.Shutdown(ctx); err != nil {
				log.G(ctx).WithError(err).Debug("shutdown failed on unhealthy")
			}
		}
	}()

	err := r.adapter.wait(ctx)
	if ctx.Err() != nil {
		return ctx.Err()
	}

	if err != nil {
		ee := &exitError{}
		if ec, ok := err.(exec.ExitCoder); ok {
			ee.code = ec.ExitCode()
		}
		select {
		case e := <-healthErr:
			ee.cause = e
		default:
			if err.Error() != "" {
				ee.cause = err
			}
		}
		return ee
	}

	return nil
}
```
I will update it. The previous concern was how to synchronize on the `shutdown failed on unhealthy` error correctly.
Now it's much easier, since we just log that error.
Yes, I checked that in `controller.Remove()` we also only log in the same case, so it should be fine. I'm not even sure it is possible for `Shutdown` to return an error.
This can be in a follow-up PR, but it would be great to get an integration test for this as well. We can use the same template that the regular healthcheck test uses, just through a service.
@tonistiigi Test is added. PTAL
The task should not move out of starting until the container is healthy. This means that
I was thinking of an integration test, but it doesn't matter for this PR.
I agree, but we should implement that in swarmkit first. LGTM. ping @aaronlehmann
@tonistiigi CI is broken, but I have no idea how to fix it. :(
@runshenzhu I restarted it
@@ -0,0 +1,102 @@
// +build !windows |
@tonistiigi maybe I should add this.
Did this test fail before on Windows?
No. I tested on Linux and it passed. After adding this test, there was one run where all tests passed except on Windows.
Also, I noticed that both `docker_cli_swarm_test` and `docker_cli_service_update_test` are built with `!windows`.
> Also I noticed that both `docker_cli_swarm_test` and `docker_cli_service_update_test` are built with `!windows`

This is because they use a second daemon binary. There may be something here as well that doesn't work on Windows that I'm missing.
```go
	}()

	err := r.adapter.wait(ctx)
```
Does `adapter.wait` exit when the health check is failing?
Yes, it will exit. The unhealthy container will be shut down, which causes `adapter.wait` to exit.
I am having issues with an image that uses health checks: it runs `curl localhost:port`, and from my logs and the swarm service inspect output, the error I got came from the swarm node hosting the container being checked.
I expected something more like a `docker exec` run inside the container being checked.
Thanks for any feedback.
Signed-off-by: runshenzhu <runshen.zhu@gmail.com>
ping @LK4D4
@tonistiigi once swarmkit#1122 gets merged, should I port it to the Docker engine?
@runshenzhu Yes, but that can be in a separate PR.
@tonistiigi @stevvooe Are you good with this one?
I LGTM'ed
LGTM.
The approach is to listen on the event queue to monitor the container's status. When an unhealthy container is detected, it is shut down, and a corresponding error is returned to the upper layer.
`SubscribeToEvents` is introduced to the backend to implement `adapter.events`.
@dongluochen @tonistiigi @stevvooe
Signed-off-by: runshenzhu <runshen.zhu@gmail.com>