
Swarm does not consider health checks while updating #23962

Closed
otbe opened this issue Jun 26, 2016 · 18 comments
Labels: area/swarm, priority/P2, version/1.12
Milestone: 1.12.0

@otbe

otbe commented Jun 26, 2016

Hi,

I'm playing around with Docker 1.12 and the HEALTHCHECK directive in Dockerfiles.
What I do is:

docker swarm init

Dockerfile

FROM nginx:latest

RUN touch /marker

ADD ./check_running.sh /check_running.sh
RUN chmod +x /check_running.sh

HEALTHCHECK --interval=5s --timeout=3s CMD ./check_running.sh

check_running.sh

#!/usr/bin/env bash

# The first check after container start finds /marker, removes it, and exits 2;
# every later check exits 0 (healthy).
if [[ -e /marker ]]; then
    rm /marker
    exit 2
else
    exit 0
fi
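
As a side note, the marker script above only simulates health transitions; a check that actually probes nginx could look like this (a sketch: --retries is a standard HEALTHCHECK option, but curl is not installed in the stock nginx image, so you would have to add it or use another probe):

HEALTHCHECK --interval=5s --timeout=3s --retries=3 \
  CMD curl -f http://localhost/ || exit 1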

Build some images from this Dockerfile:

$ docker build -t mynginx:1 .
$ docker build -t mynginx:2 .

Start a service with 5 replicas

$ docker service create --name web --replicas=5 -p 8080:80 mynginx:1

Wait some time (5s) until all replicas are healthy. Now I want to update the image to v2. The update procedure should be rolling with a parallelism of 1. So I run:

$ docker service update --image mynginx:2 --update-parallelism 1 web

From my understanding, swarm should execute the update based on this algorithm:

  1. swarm stops one container
  2. swarm starts a container with mynginx:2 and waits until it becomes healthy again (not just starting)
  3. go to 1)
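
One way to watch whether step 2 actually waits is to poll the health state of all running containers during the update (a sketch using docker inspect; it assumes every listed container was built with a HEALTHCHECK, since .State.Health is absent otherwise):

$ docker inspect --format '{{.Name}}: {{.State.Health.Status}}' $(docker ps -q)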

But what I get is:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED                  STATUS                                     PORTS               NAMES
eee2b1b184ac        mynginx:2           "nginx -g 'daemon off"   Less than a second ago   Up Less than a second (health: starting)   80/tcp, 443/tcp     web.3.6kn82dul8gnyvzgz6cje2byfj
b3797cdb5e05        mynginx:2           "nginx -g 'daemon off"   3 seconds ago            Up 2 seconds (health: starting)            80/tcp, 443/tcp     web.5.46ll6b2lhrhr98utowuifccei
0c85128b031e        mynginx:2           "nginx -g 'daemon off"   6 seconds ago            Up 5 seconds (health: starting)            80/tcp, 443/tcp     web.1.apenirwkjp96xiyhhspx134wx
9dcec24ae32a        mynginx:2           "nginx -g 'daemon off"   9 seconds ago            Up 8 seconds (health: starting)            80/tcp, 443/tcp     web.2.5oi6fk1c6u25s3rpqbal39i3b
f37332162d40        mynginx:2           "nginx -g 'daemon off"   12 seconds ago           Up 11 seconds (healthy)                    80/tcp, 443/tcp     web.4.7mgs62e6s78iijcqwv4cose11

Sometimes all replicas are in the starting state, which results in downtime for this service. My expectation was that I could get "native" zero-downtime deployments with HEALTHCHECK and rolling updates on my swarm cluster :)
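
A quick way to see that downtime from the outside is to hammer the published port while the update runs (a sketch, assuming port 8080 from the docker service create above):

$ while true; do curl -fs -o /dev/null http://localhost:8080 || echo "down at $(date)"; sleep 1; done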

What's wrong with my attempt?

Thanks!

Update: Sorry for closing/opening this issue. My mobile is somehow broken today :/

@vdemeester vdemeester added this to the 1.12.0 milestone Jun 26, 2016
@otbe otbe closed this as completed Jun 26, 2016
@otbe otbe reopened this Jun 26, 2016
@cpuguy83
Member

What happens if you exit 1 instead of exit 2?

@otbe
Author

otbe commented Jun 27, 2016

@cpuguy83
Replicas will not run at all, because the swarm scheduler reschedules the containers as soon as they are reported unhealthy. The result is a loop: Accepted -> Running -> Unhealthy -> Accepted.
Therefore I can't update my service properly :/
Am I missing something?
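
For anyone reproducing this, the cycle is easy to see in the task history of the service (a sketch; web is the service name from the original report, and watch simply re-runs the command every second):

$ watch -n 1 "docker service ps web"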

@cpuguy83
Member

@otbe So it sounds like swarm is not taking into account the "starting" state specifically.

@otbe
Author

otbe commented Jun 27, 2016

@cpuguy83 yes, I think so. That's what I observe while using HEALTHCHECKs with swarm.

@tiborvass tiborvass added the priority/P3 Best effort: those are nice to have / minor issues. label Jun 27, 2016
@cpuguy83
Member

Testing this myself, I'm not sure that swarm is taking health checks into account at all at the moment.

@otbe
Author

otbe commented Jun 27, 2016

It is not mentioned explicitly in the docs, but it was my expectation. :)

When exiting with 1, it works for me: swarm will reschedule unhealthy containers.

This is my most wanted feature because it enables native zero-downtime deployments for the first time.

@jhorwit2
Contributor

It's also important for updates to take health checks into account, because if the image code has a bug, you have just rolled out an unhealthy image to your whole service, which means downtime. It would be nice if the update would stop when a container reports an unhealthy status, to avoid this scenario.
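
For reference, later Docker releases added flags for exactly this behavior on docker service update; a sketch of how it can be expressed there (these flags are not available in 1.12, and the image/service names are taken from the original report):

$ docker service update --image mynginx:2 --update-parallelism 1 \
    --update-failure-action pause web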

@cpuguy83 cpuguy83 added priority/P2 Normal priority: default priority applied. priority/P1 Important: P1 issues are a top priority and a must-have for the next release. and removed priority/P3 Best effort: those are nice to have / minor issues. priority/P2 Normal priority: default priority applied. labels Jun 27, 2016
@runshenzhu
Contributor

@otbe maybe you could try --update-delay to avoid the downtime.
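
Concretely, that would look like this with the images from the original report (a sketch; the 30s delay is an arbitrary guess at the bootstrap time):

$ docker service update --image mynginx:2 --update-parallelism 1 --update-delay 30s web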

As far as I know, instead of checking the health_status of the container (e.g. starting, healthy, or unhealthy), SwarmKit uses the container's status (running, exited) as the indicator to start the next container.

@otbe
Author

otbe commented Jun 27, 2016

@runshenzhu
Yeah, that is my non-native workaround. But it is very hard to predict bootstrap time in some circumstances, so real health checking would be much better.

@tiborvass
Contributor

Ping @tonistiigi @stevvooe @aluzzardi

@tiborvass tiborvass added priority/P2 Normal priority: default priority applied. and removed priority/P1 Important: P1 issues are a top priority and a must-have for the next release. labels Jun 27, 2016
@ChristianKniep

If @runshenzhu is right and the 'move-forward' indicator is running rather than healthy (in case a health check is present), I would really love to see this change...
It seems there is even a PR open for SwarmKit: moby/swarmkit#1122

@ChristianKniep

I built Docker from master, which should include the PR mentioned above (moby/swarmkit#1122), but it still takes down the containers regardless of the health-check status... :(

$ docker version
Client:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   63186c0
 Built:        Thu Jul  7 01:55:16 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   63186c0
 Built:        Fri Jul  8 08:45:34 2016
 OS/Arch:      linux/amd64

I created a service like this:

$ docker service create --name httpcheck -p 8080:80 --replicas=3 qnib/httpcheck:docker-1.12

If I then update the service:

$ docker service update --image httpcheck:docker-1.12-1 --update-parallelism 1 httpcheck

It does not wait for the health check, but iterates through all tasks while the one at hand is still in the starting state.

@runshenzhu
Contributor

@ChristianKniep Health checks in Docker's built-in swarm mode are different from those in SwarmKit; they haven't been enabled in built-in swarm mode yet.

The first step is to incorporate the health check into the container's running state, as described in #24139.
Once #24139 is merged, I will port docker/swarmkit#1122 to the Docker engine as a follow-up.

@ChristianKniep

@runshenzhu Ahh, OK... I thought that this change would directly affect the engine - that it is a dependency of the engine - but understood. Thanks for the clarification.
Can't wait for the PR to the engine to get this included.

@aaronlehmann
Contributor

@runshenzhu: Will this be fixed by #24545?

@runshenzhu
Contributor

@aaronlehmann Yes. Once #24545 gets merged, swarm service updates will take health checks into account.

@aaronlehmann
Contributor

@runshenzhu: Great! I went ahead and edited #24545 to include "Fixes: #23962" so that this issue will be closed automatically when that PR is merged.

@runshenzhu
Contributor

@aaronlehmann oh, thanks! I didn't know about this great GitHub feature.
