
Swarm does not consider health checks while updating #23962

Closed
otbe opened this issue Jun 26, 2016 · 18 comments
Labels: area/swarm, priority/P2, version/1.12
Milestone: 1.12.0

@otbe

otbe commented Jun 26, 2016

Hi,

I'm playing around with Docker 1.12 and the HEALTHCHECK directive in Dockerfiles.
What I do is:

docker swarm init

Dockerfile

FROM nginx:latest

RUN touch /marker

ADD ./check_running.sh /check_running.sh
RUN chmod +x /check_running.sh

HEALTHCHECK --interval=5s --timeout=3s CMD ./check_running.sh

check_running.sh

#!/usr/bin/env bash

# The first check after container start finds /marker, removes it, and exits 2;
# every later check exits 0 (healthy).
if [[ -e /marker ]]; then
    rm /marker
    exit 2
else
    exit 0
fi
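
As a side note, the marker script above only simulates health transitions; a check that actually probes nginx could look like this (a sketch: --retries is a standard HEALTHCHECK option, but curl is not installed in the stock nginx image, so you would have to add it or use another probe):

HEALTHCHECK --interval=5s --timeout=3s --retries=3 \
  CMD curl -f http://localhost/ || exit 1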

Build some images from this Dockerfile:

$ docker build -t mynginx:1 .
$ docker build -t mynginx:2 .

Start a service with 5 replicas

$ docker service create --name web --replicas=5 -p 8080:80 mynginx:1

Wait some time (5s) until all replicas are healthy. Now I want to update the image to v2. The update procedure should be rolling with a parallelism of 1. So I run:

$ docker service update --image mynginx:2 --update-parallelism 1 web

From my understanding, swarm should execute the update based on this algorithm:

  1. swarm stops one container
  2. swarm starts a container with mynginx:2 and waits until it becomes healthy again (not just starting)
  3. go to 1)
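
One way to watch whether step 2 actually waits is to poll the health state of all running containers during the update (a sketch using docker inspect; it assumes every listed container was built with a HEALTHCHECK, since .State.Health is absent otherwise):

$ docker inspect --format '{{.Name}}: {{.State.Health.Status}}' $(docker ps -q)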

But what I get is:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED                  STATUS                                     PORTS               NAMES
eee2b1b184ac        mynginx:2           "nginx -g 'daemon off"   Less than a second ago   Up Less than a second (health: starting)   80/tcp, 443/tcp     web.3.6kn82dul8gnyvzgz6cje2byfj
b3797cdb5e05        mynginx:2           "nginx -g 'daemon off"   3 seconds ago            Up 2 seconds (health: starting)            80/tcp, 443/tcp     web.5.46ll6b2lhrhr98utowuifccei
0c85128b031e        mynginx:2           "nginx -g 'daemon off"   6 seconds ago            Up 5 seconds (health: starting)            80/tcp, 443/tcp     web.1.apenirwkjp96xiyhhspx134wx
9dcec24ae32a        mynginx:2           "nginx -g 'daemon off"   9 seconds ago            Up 8 seconds (health: starting)            80/tcp, 443/tcp     web.2.5oi6fk1c6u25s3rpqbal39i3b
f37332162d40        mynginx:2           "nginx -g 'daemon off"   12 seconds ago           Up 11 seconds (healthy)                    80/tcp, 443/tcp     web.4.7mgs62e6s78iijcqwv4cose11

Sometimes all replicas are in the starting state, which results in downtime for this service. My expectation was that I could get "native" zero-downtime deployments with HEALTHCHECK and rolling updates on my swarm cluster :)
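
A quick way to see that downtime from the outside is to hammer the published port while the update runs (a sketch, assuming port 8080 from the docker service create above):

$ while true; do curl -fs -o /dev/null http://localhost:8080 || echo "down at $(date)"; sleep 1; done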

What's wrong with my attempt?

Thanks!

Update: Sorry for closing/opening this issue. My mobile is somehow broken today :/

@vdemeester vdemeester added this to the 1.12.0 milestone Jun 26, 2016
@otbe otbe closed this as completed Jun 26, 2016
@otbe otbe reopened this Jun 26, 2016
@cpuguy83
Member

What happens if you exit 1 instead of exit 2?

@otbe
Author

otbe commented Jun 27, 2016

@cpuguy83
Replicas will not run at all, because the swarm scheduler reschedules the containers as soon as they are reported unhealthy. The result is a loop: Accepted -> Running -> Unhealthy -> Accepted.
Therefore I can't update my service properly :/
Am I missing something?
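
For anyone reproducing this, the cycle is easy to see in the task history of the service (a sketch; web is the service name from the original report, and watch simply re-runs the command every second):

$ watch -n 1 "docker service ps web"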

@cpuguy83
Member

@otbe So it sounds like swarm is not taking into account the "starting" state specifically.

@otbe
Author

otbe commented Jun 27, 2016

@cpuguy83 yes, I think so. That's what I observe while using HEALTHCHECKs with swarm.

@tiborvass tiborvass added the priority/P3 Best effort: those are nice to have / minor issues. label Jun 27, 2016
@cpuguy83
Member

Testing this myself, I'm not sure that swarm is taking health checks into account at all at the moment.

@otbe
Author

otbe commented Jun 27, 2016

It is not mentioned explicitly in the docs, but it was my expectation. :)

When exiting with 1, it works for me: swarm will reschedule unhealthy containers.

This is my most wanted feature because it enables native zero-downtime deployments for the first time.

@jhorwit2
Contributor

It's also important for updates to take health checks into account, because if the image code has a bug, you have just rolled out an unhealthy image to your whole service, which means downtime. It would be nice if the update would stop when a container reports an unhealthy status, to avoid this scenario.
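
For reference, later Docker releases added flags for exactly this behavior on docker service update; a sketch of how it can be expressed there (these flags are not available in 1.12, and the image/service names are taken from the original report):

$ docker service update --image mynginx:2 --update-parallelism 1 \
    --update-failure-action pause web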

@cpuguy83 cpuguy83 added priority/P2 Normal priority: default priority applied. priority/P1 Important: P1 issues are a top priority and a must-have for the next release. and removed priority/P3 Best effort: those are nice to have / minor issues. priority/P2 Normal priority: default priority applied. labels Jun 27, 2016
@runshenzhu
Contributor

@otbe maybe you could try --update-delay to avoid the downtime.
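
Concretely, that would look like this with the images from the original report (a sketch; the 30s delay is an arbitrary guess at the bootstrap time):

$ docker service update --image mynginx:2 --update-parallelism 1 --update-delay 30s web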

As far as I know, instead of checking the health_status of the container (e.g. starting, healthy, or unhealthy), SwarmKit uses the container's status (running, exited) as the indicator to start the next container.

@otbe
Author

otbe commented Jun 27, 2016

@runshenzhu
Yeah, that is my non-native workaround. But it is very hard to predict bootstrap time in some circumstances, so real health checking would be much better.

@tiborvass
Contributor

Ping @tonistiigi @stevvooe @aluzzardi

@tiborvass tiborvass added priority/P2 Normal priority: default priority applied. and removed priority/P1 Important: P1 issues are a top priority and a must-have for the next release. labels Jun 27, 2016
@ChristianKniep

If @runshenzhu is right and the 'move-forward' indicator is running rather than healthy (in case a health check is present), I would really love to see this change...
It seems there is even a PR open for SwarmKit: moby/swarmkit#1122

@ChristianKniep

I built Docker from master, which should include the PR mentioned above (moby/swarmkit#1122), but it still takes down the containers regardless of the health-check status... :(

$ docker version
Client:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   63186c0
 Built:        Thu Jul  7 01:55:16 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   63186c0
 Built:        Fri Jul  8 08:45:34 2016
 OS/Arch:      linux/amd64

I created a service like this:

$ docker service create --name httpcheck -p 8080:80 --replicas=3 qnib/httpcheck:docker-1.12

If I then update the service:

$ docker service update --image httpcheck:docker-1.12-1 --update-parallelism 1 httpcheck

It does not wait for the health check, but iterates through all tasks while the one at hand is still in the starting state.

@runshenzhu
Contributor

@ChristianKniep Health checks in Docker's built-in swarm mode are different from those in SwarmKit; they haven't been enabled in built-in swarm mode yet.

The first step is to incorporate the health check into the container's running state, as described in #24139.
Once #24139 is merged, I will port docker/swarmkit#1122 to the Docker engine as a follow-up.

@ChristianKniep

@runshenzhu Ahh, OK... I thought that this change would directly affect the engine - that it is a dependency of the engine - but understood. Thanks for the clarification.
Can't wait for the PR to the engine to get this included.

@aaronlehmann
Contributor

@runshenzhu: Will this be fixed by #24545?

@runshenzhu
Contributor

@aaronlehmann Yes. Once #24545 gets merged, swarm service updates will take health checks into account.

@aaronlehmann
Contributor

@runshenzhu: Great! I went ahead and edited #24545 to include "Fixes: #23962" so that this issue will be closed automatically when that PR is merged.

@runshenzhu
Contributor

@aaronlehmann oh, thanks! I didn't know about this great GitHub feature.
