latest 'dind' tag (19.03) gives error on Gitlab CI "failed to dial gRPC: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?" #170

Closed
daenny opened this issue Jul 23, 2019 · 34 comments

Comments

@daenny commented Jul 23, 2019

We are running a GitLab server and several gitlab-ci-runners. Today we woke up to several failed builds.
We ran several tests and found that the most likely culprits are the newest tags of the docker:dind and docker:git images.
We tested with docker:18-dind and docker:18-git and the errors do not occur anymore.

The error is given below:

time="2019-07-23T06:52:31Z" level=error msg="failed to dial gRPC: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial tcp 172.17.0.3:2375: connect: connection refused"

The gitlab-runners are running in privileged mode.

EDIT: This is not a bug or unresolved issue:
see: #170 (comment), https://about.gitlab.com/2019/07/31/docker-in-docker-with-docker-19-dot-03/

@frbl commented Jul 23, 2019

Confirming, the workaround suggested by jubel-han fixes it (tested on GitLab CI).

@peter-c-larsson commented Jul 23, 2019

Since jubel-han's comments are no longer here, I will repost the part that did the trick for me.

Adding the following variable:

  DOCKER_TLS_CERTDIR: ''

I also changed the docker image tag from stable to stable-dind, but I was not sure if that was needed. Edit: after further testing, it was not needed.
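
For anyone wondering where that variable goes, here is a minimal sketch of a .gitlab-ci.yml job using it (job name and image tags are illustrative, not from my setup):

build:
  image: docker:latest
  services:
    - docker:dind
  variables:
    # opt out of the 19.03 TLS change and keep the plain TCP socket on port 2375
    DOCKER_TLS_CERTDIR: ''
  script:
    - docker info
    - docker build -t example/image .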

@joch commented Jul 23, 2019

Same thing here. We reverted to the 18-dind tag in GitLab in the meantime.

@mymusise commented Jul 23, 2019

We hit this error using the stable and stable-dind image tags, but 18-dind works.

@jelgersma commented Jul 23, 2019

Specifying 18-dind as the tag fixed it for us for now :)

@JanMikes commented Jul 23, 2019

This will literally break half of the GitLab CI builds that build Docker images and don't pin a version but use the docker:dind tag 😄

@janw commented Jul 23, 2019

@JanMikes as the person responsible for the CI runners at our company: it’s already in full effect. People are assuming the runners are broken. 😣

Best example of why going with “blanket tags” like latest is a no-no.

@REBELinBLUE commented Jul 23, 2019

@janw Yep, exactly the same here, everyone was all over me this morning. My fault, I shouldn't have set the Jenkins slaves to use stable-dind; setting them to 18-dind as suggested has fixed the issue. 🤦‍♂️

@jozuenoon commented Jul 23, 2019

This also affected builds on Jenkins! Reverted to 18-dind.

@buianhthang commented Jul 23, 2019

It's been a few hours since we found this problem; why is the stable-dind version still 19.xx? Are we able to revert to 18.xx?

@DavidBadura commented Jul 23, 2019

Maintainers also have to sleep sometimes and are not available around the clock ;-)

repomaa pushed a commit to repomaa/slide_server that referenced this issue Jul 23, 2019

@bachka commented Jul 23, 2019

Same thing here. All pipelines with dind failed...

@Karreg commented Jul 23, 2019

Confirmed on my side too. Targeting 18-dind is a workaround until the fix, so maintainers don't have to stay awake for the next 24 hours ;)

@tianon (Member) commented Jul 23, 2019

There will not be an update in this repository to "fix" this as 19.03.0 is now released and GA and the TLS behavioral change was intentional (and applied to 19.03+ only by default to give folks two separate escape hatches to opt out -- environment variable or downgrade).

See https://gitlab.com/gitlab-org/gitlab-runner/issues/4501#note_194648542 for a comment from a GitLab team member that sums up my thoughts even better than I could.

@Mike-Dunton commented Jul 23, 2019

We were able to work around this for now. We sync the Docker images to an internal registry, so we overrode the docker:dind tag with docker:18-dind and turned the sync off. This way we did not have to update all of our configs.

@kinghuang commented Jul 23, 2019

@tianon While I appreciate that using the stable or latest tags on the docker image runs the risk of breaking changes, Docker 19.03 has been in beta and RC for over 4 months and this change to the image was made just 6 days ago. I've been testing the docker:19.03.0-rc* images in my GitLab CI pipelines for months in preparation for the release, and didn't run into this breaking change because it wasn't in any of the RCs.

I think it's very poor form to introduce such a breaking change in the last few days of a major release without any notifications.

@llech commented Jul 23, 2019

@kinghuang IMHO it's always poor practice to introduce a breaking change where it could easily be avoided.
In this case we have a new feature that breaks old functionality if a variable is set to true, and the problem is that the default value is true. I don't quite understand what people doing such things have in mind. Unfortunately, it's not the first time I've seen something like this in a stable and broadly used open source project.

@AkihiroSuda commented Jul 23, 2019

TCP connections without tlsverify have been discouraged for years.

@JordanP commented Jul 23, 2019

TCP connections without tlsverify have been discouraged for years.

If you are running everything locally, it's okay though.

My company and I got caught by this issue; that's fine, but that commit seems a bit rushed, like a big breaking change just before the release... (plz don't revert though)

@tianon (Member) commented Jul 23, 2019

For some values of "okay" -- any container you run (either inside or outside the DinD instance) can guess/scan for the IP, connect, and do whatever it wants to your host machine as root. 😅

@REBELinBLUE commented Jul 23, 2019

I thought I'd try to set up my Jenkins slaves to work correctly, using this manifest generated by the K8S plugin: https://gist.github.com/REBELinBLUE/97a5c13c2589bb1f3df5a5b330718eb0

But it doesn't seem to generate all the certificates before the job starts. I added ls /certs/** to the start of the job and I end up with:

/certs/ca:
cert.pem
cert.srl
key.pem

/certs/client:
key.pem

/certs/server:
ca.pem
cert.pem
csr.pem
key.pem
openssl.cnf

If I add the liveness probe, it seems to generate the certificates before it fully starts, but then when I try to run docker commands I end up with Error response from daemon: Client sent an HTTP request to an HTTPS server. (Yes, I set the ports to 2376.)

In the end I gave up and just set DOCKER_TLS_CERTDIR to an empty value and set the ports back to 2375, but I'd like to get it working properly.

@tianon (Member) commented Jul 23, 2019

Besides setting DOCKER_HOST to use port 2376, you need to set DOCKER_TLS_VERIFY=1 and DOCKER_CERT_PATH=/certs/client to tell Docker to use TLS (and where to get certificates to handshake with).

Also, you should only share /certs/client with your client containers.

See also:

# if DOCKER_HOST isn't set and we don't have the default unix socket, let's set DOCKER_HOST to a sane remote value
if [ -z "${DOCKER_HOST:-}" ] && [ ! -S /var/run/docker.sock ]; then
    if _should_tls || [ -n "${DOCKER_TLS_VERIFY:-}" ]; then
        export DOCKER_HOST='tcp://docker:2376'
    else
        export DOCKER_HOST='tcp://docker:2375'
    fi
fi
if [ -n "${DOCKER_HOST:-}" ] && _should_tls; then
    export DOCKER_TLS_VERIFY=1
    export DOCKER_CERT_PATH="$DOCKER_TLS_CERTDIR/client"
fi
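
A hypothetical sketch of what that could look like for the client container in a Kubernetes pod template (container name, volume name, and the localhost address are assumptions, not taken from the gist):

  - name: docker-client
    image: docker:19.03
    env:
      - name: DOCKER_HOST
        value: tcp://localhost:2376   # the dind container runs in the same pod, so it is reachable via localhost
      - name: DOCKER_TLS_VERIFY
        value: "1"
      - name: DOCKER_CERT_PATH
        value: /certs/client
    volumeMounts:
      - name: docker-certs-client     # hypothetical volume shared with the dind container at /certs/client
        mountPath: /certs/client
        readOnly: true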

@REBELinBLUE commented Jul 23, 2019

Thanks! That works.

Still need to figure out how to mount only the client certificates on the client; annoyingly, the Jenkins K8S plugin doesn't appear to allow you to configure different mounts per container.

Also, I need to add a sleep to the beginning of my jobs; I still need to figure out how to have the slave not be ready until Docker has generated the certificates and is running.
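
One idea for the readiness part (a sketch only, and I have not verified how the Jenkins K8S plugin treats readiness): give the dind container a readinessProbe on the TLS port, since dockerd only starts listening there after the entrypoint has generated the certificates:

    readinessProbe:
      tcpSocket:
        port: 2376            # dockerd binds this port only once certificate generation has finished
      initialDelaySeconds: 5
      periodSeconds: 2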

tmwack added a commit to tmwack/gitlab-continuous-integration that referenced this issue Jul 24, 2019

bugfix: docker:stable-dind is broken, locking to 18-dind instead.
Updated all references to docker:stable-dind because the latest
image pushed to stable-dind updates to Docker 19.03. Unlike 18.x,
19.03 enables TLS by default -- which, for some reason, does not
agree with GitLab runners.

Docker Issue: docker-library/docker#170
GitLab Runner Issue: https://gitlab.com/gitlab-org/gitlab-runner/issues/4501

We can revert this, or upgrade to 19.x, once the issue is resolved.
For now, we should stick to 18.x.

briangweber added a commit to Cimpress-MCP/gitlab-continuous-integration that referenced this issue Jul 24, 2019

bugfix: docker:stable-dind is broken, locking to 18-dind instead. (#30)

@tarampampam commented Jul 25, 2019

I fixed my self-hosted runners (Debian, runners installed using apt-get):

$ nano /etc/gitlab-runner/config.toml
[[runners]]
-  environment = ["DOCKER_DRIVER=overlay2"]
+  environment = ["DOCKER_DRIVER=overlay2","DOCKER_TLS_VERIFY=1","DOCKER_CERT_PATH=/certs/client"]
  [runners.docker]
-    tls_verify = false
    image = "docker:dind"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
-    volumes = ["/cache"]
+    volumes = ["/cache","/certs"]

And then:

$ service gitlab-runner restart

@mqu commented Jul 29, 2019

Thanks @tarampampam; this definitely fixed the issue.

@wileyj commented Jul 30, 2019

using docker:dind:

INFO[2019-07-30T01:16:04.239203878Z] Starting up                                  
WARN[2019-07-30T01:16:04.241829113Z] could not change group /var/run/docker.sock to docker: group docker not found 

Reverting to 18-dind resolved this.
I plan on creating a new image based on the official one to manually add the user/group and see if that makes a difference.

@AkihiroSuda commented Jul 30, 2019

@wileyj The warning you are seeing is unrelated

@tianon (Member) commented Jul 30, 2019

Even docker:dind doesn't have ppc64le support, and hasn't since at least 6001c15.

(Docker stopped publishing releases for it quite a while ago; see https://download.docker.com/linux/static/stable/ppc64le/)

@wileyj commented Jul 30, 2019

@wileyj The warning you are seeing is unrelated

Indeed, you are correct.
The following seemed to resolve the issue for me for now, based on this, but I'll have to find a better way to enable TLS in the future.

The key was this env var from that link: DOCKER_TLS_CERTDIR=""

11:38:59 ~$ sudo docker run -d \
>   --rm \
>   --privileged  \
>   -p 12375:2375  \
>   -p 12376:2376 \
>   -e DOCKER_TLS_CERTDIR="" \
>   --name dind \
> docker:dind
35e89597e5fde593449f6f01027c5ae240388d0c2a030bc4bd13a42f5e1b2e2d
11:39:04 ~$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS                 PORTS                                                      NAMES
35e89597e5fd        docker:dind         "dockerd-entrypoint.…"   3 seconds ago       Up 2 seconds           0.0.0.0:12375->2375/tcp, 0.0.0.0:12376->2376/tcp           dind
11:39:06 ~$ docker exec -it dind sh
/ # docker run -d alpine tail -f /dev/null
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
050382585609: Pull complete 
Digest: sha256:6a92cd1fcdc8d8cdec60f33dda4db2cb1fcdcacf3410a8e05b3741f44a9b5998
Status: Downloaded newer image for alpine:latest
82ee20601c943827168c62edb01d9e571749293435d2594059dd1710188ff428
/ # docker ps
CONTAINER ID        IMAGE               COMMAND               CREATED             STATUS              PORTS               NAMES
82ee20601c94        alpine              "tail -f /dev/null"   3 seconds ago       Up 1 second                             sad_boyd
/ #
11:39:25 ~$ docker -H tcp://:12375 ps
CONTAINER ID        IMAGE               COMMAND               CREATED             STATUS              PORTS               NAMES
82ee20601c94        alpine              "tail -f /dev/null"   17 seconds ago      Up 15 seconds                           sad_boyd

@AkihiroSuda commented Jul 30, 2019

Any reason to keep this issue still open?

@tianon (Member) commented Jul 31, 2019

Currently just using it to catch affected users (hopefully avoiding duplicates).

@AkihiroSuda commented Jul 31, 2019

Could you edit the top comment to clarify this is not an unresolved bug?

@tianon (Member) commented Jul 31, 2019

GitLab now has a really nice blog post up describing the situation and how to fix it if your environment is affected: https://about.gitlab.com/2019/07/31/docker-in-docker-with-docker-19-dot-03/ 👍
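
The short version from that post, as a sketch (job name and exact tags are illustrative): pin the Docker version, keep TLS enabled, and point DOCKER_TLS_CERTDIR at the shared certificate directory so the dind service and the job's docker client can find the generated certs:

build:
  image: docker:19.03.0
  services:
    - docker:19.03.0-dind
  variables:
    # the dind service generates certificates under /certs and the client reads them from /certs/client
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker info
    - docker build -t example/image .

The post also walks through the corresponding runner-side change (sharing the client certificate directory via the runner's volumes) for self-managed runners.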
