add health check in docker built-in swarm mode #24139

Merged

merged 1 commit on Jul 11, 2016

Conversation

runshenzhu
Contributor

The approach is to listen on the event queue to monitor the container's status. When an unhealthy container is detected, it is shut down, and a corresponding error is returned to the upper layer.

SubscribeToEvents is introduced to the backend to implement adapter.events.

@dongluochen @tonistiigi @stevvooe
Signed-off-by: runshenzhu <runshen.zhu@gmail.com>
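
For context, a minimal sketch of the event-loop approach described above; the event type and the isUnhealthy/shutdown helpers here are illustrative stand-ins, not this PR's actual API:

// Sketch of the monitoring loop: watch the container's event stream and
// shut the container down on the first unhealthy event.
package health

import (
    "context"
    "errors"
)

var errContainerUnhealthy = errors.New("container unhealthy")

type event struct{ status string } // e.g. "health_status: unhealthy"

func isUnhealthy(e event) bool { return e.status == "health_status: unhealthy" }

// checkHealth blocks until the container reports unhealthy, then shuts it
// down and returns the error to the upper layer.
func checkHealth(ctx context.Context, events <-chan event, shutdown func(context.Context) error) error {
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case e, ok := <-events:
            if !ok {
                return nil // event stream closed
            }
            if isUnhealthy(e) {
                _ = shutdown(ctx) // best effort; a failure here is only logged
                return errContainerUnhealthy
            }
        }
    }
}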


// run wait and checkHealth
go runner(r.adapter.wait, waitErrCh)
go runner(r.checkHealth, healthErrCh)
Contributor Author

@tonistiigi, two goroutines are used instead of one as you suggested, because with a single goroutine I ran into some tricky synchronization problems.

I tested the health check with two restart conditions, any and on-failure. It works correctly, and an exited unhealthy container gets exit code 137 (128 + SIGKILL).

Is it OK to keep two goroutines here?
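
For context, a sketch of the two-goroutine fan-in described here; the runner shape is inferred from the hunk above, and the function names are assumptions:

// Hypothetical shape of the surrounding code: runner pipes each call's
// result onto its channel so the caller can select on whichever finishes
// first.
func waitBoth(ctx context.Context, waitFn, healthFn func(context.Context) error) error {
    waitErrCh := make(chan error, 1)
    healthErrCh := make(chan error, 1)

    runner := func(fn func(context.Context) error, ch chan error) {
        ch <- fn(ctx)
    }
    go runner(waitFn, waitErrCh)
    go runner(healthFn, healthErrCh)

    select {
    case err := <-waitErrCh:
        return err // container exited (or wait failed) first
    case err := <-healthErrCh:
        return err // health monitor fired first
    }
}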

Member

I'm not sure what problems appeared. I think if our only action for an unhealthy event is to shut down the container, we should wait here until the shutdown has actually happened and return an error that combines the wait error with the "unhealthy" cause.

If you only check the exit code on the container, then this is not handled here. We always set the exit code, but in this case we don't seem to be setting it as the task failure reason.

Contributor Author

@tonistiigi updated. Now the unhealthy container is waited on until it has fully shut down.

Member

@runshenzhu how about something like:

// Wait on the container to exit.
func (r *controller) Wait(pctx context.Context) error {
    if err := r.checkClosed(); err != nil {
        return err
    }

    ctx, cancel := context.WithCancel(pctx)
    defer cancel()

    healthErr := make(chan error, 1) // buffered so the monitor goroutine never blocks
    go func() {
        ectx, cancel := context.WithCancel(ctx) // cancel event context on first event
        defer cancel()
        // on unhealthy, record the cause and shut the container down
        if err := r.checkHealth(ectx); err == ErrContainerUnhealthy {
            healthErr <- ErrContainerUnhealthy
            if err := r.Shutdown(ctx); err != nil {
                log.G(ctx).WithError(err).Debug("shutdown failed on unhealthy")
            }
        }
    }()

    // wait returns once the container exits, including after a shutdown
    // triggered by the health monitor above.
    err := r.adapter.wait(ctx)
    if ctx.Err() != nil {
        return ctx.Err()
    }

    if err != nil {
        ee := &exitError{}
        if ec, ok := err.(exec.ExitCoder); ok {
            ee.code = ec.ExitCode()
        }
        select {
        case e := <-healthErr:
            ee.cause = e // prefer the health failure as the task's failure cause
        default:
            if err.Error() != "" {
                ee.cause = err // otherwise fall back to the wait error itself
            }
        }
        return ee
    }

    return nil
}

Contributor Author

I will update it. The previous concern was how to correctly synchronize with the "shutdown failed on unhealthy" error.
Now it's much easier, since we just log that error.

Member

Yes, I noticed that in controller.Remove() we also only log in the same case, so it should be fine. I'm not even sure it is possible for Shutdown to return an error.

@tonistiigi
Member

This can be in a follow-up PR, but it would be great to get an integration test for this as well. We can use the same template that the regular healthcheck test uses, just through a service.

@runshenzhu
Contributor Author

runshenzhu commented Jun 30, 2016

@tonistiigi Test is added. PTAL

@GordonTheTurtle added the dco/no label Jun 30, 2016
@GordonTheTurtle removed the dco/no label Jun 30, 2016
@stevvooe
Contributor

The task should not move out of starting until the container is healthy. This means that Start should block until the container is healthy. Wait should then exit when the container goes unhealthy.
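
A sketch of what that contract could look like, assuming a hypothetical firstHealthy channel that is closed on the container's first healthy event (this is not the PR's implementation):

// Start blocks until the container reports its first healthy status or the
// context is cancelled; only then would the task leave the "starting" state.
func start(ctx context.Context, startContainer func(context.Context) error, firstHealthy <-chan struct{}) error {
    if err := startContainer(ctx); err != nil {
        return err
    }
    select {
    case <-firstHealthy:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}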

@tonistiigi
Member

Test is added. PTAL

I was thinking of an integration test, but it doesn't matter for this PR.

The task should not move out of starting until the container is healthy. This means that Start should block until the container is healthy. Wait should then exit when the container goes to unhealthy.

I agree but we should implement that in swarmkit first.

LGTM

ping @aaronlehmann

@thaJeztah thaJeztah added this to the 1.12.0 milestone Jul 1, 2016
@runshenzhu
Contributor Author

@tonistiigi CI is broken, but I have no idea how to fix it. :(
Any suggestions?

@tonistiigi
Member

@runshenzhu I restarted it

@@ -0,0 +1,102 @@
// +build !windows
Contributor Author

@tonistiigi maybe I should add this.

Member

Did this test fail before on Windows?

Contributor Author

No. I tested on Linux and it passed. After adding this test, there was one run where all tests passed except on Windows.

Also, I notice that both docker_cli_swarm_test and docker_cli_service_update_test are built with !windows.

Member

Also, I notice that both docker_cli_swarm_test and docker_cli_service_update_test are built with !windows.

This is because they use a second daemon binary. There may be something here as well that doesn't work on Windows and that I'm missing.

}
}
}()

err := r.adapter.wait(ctx)
Contributor

Does adapter.wait exit when the health check is failing?

Contributor Author

Yes, it will exit. The unhealthy container will be shut down, which causes adapter.wait to exit.
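
To illustrate the causal chain (reusing the event helpers from the sketch earlier in this thread; the names are assumptions): the health monitor triggers the shutdown, the daemon kills the container, and that in turn unblocks the wait.

func monitor(ctx context.Context, events <-chan event, shutdown func(context.Context) error, waitDone <-chan error) error {
    for {
        select {
        case e := <-events:
            if isUnhealthy(e) {
                _ = shutdown(ctx) // kills the container...
            }
        case err := <-waitDone:
            return err // ...which makes adapter.wait return here
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}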


I am having issues with an image that uses health checks: it runs curl localhost:port, and from the logs of the swarm service inspect command, the error I get comes from the swarm node hosting the container being checked.
I expected the check to run more like a docker exec inside the container being checked.
Thanks for any feedback.

Signed-off-by: runshenzhu <runshen.zhu@gmail.com>
@tonistiigi
Member

ping @LK4D4

@runshenzhu
Contributor Author

@tonistiigi once swarmkit/#1122 gets merged, should I port it to the Docker engine?

@tonistiigi
Member

@runshenzhu Yes, but that can be in a separate PR.

@cpuguy83
Member

cpuguy83 commented Jul 8, 2016

@tonistiigi @stevvooe Are you good with this one?
Looks OK to me, but you are more familiar with daemon/cluster.

@tonistiigi
Member

Are you good with this one?

I LGTM'ed

@cpuguy83
Member

LGTM.
If there are any issues, we can fix them separately.
Thanks @runshenzhu
