docker ps hangs #1654

Closed
crawford opened this Issue Nov 9, 2016 · 33 comments
@crawford
Member
crawford commented Nov 9, 2016

Issue Report

Bug

CoreOS Version

1185.3.0

Environment

AWS and GCE confirmed. Likely all environments.

Expected Behavior

docker ps properly returns the list of running containers.

Actual Behavior

docker ps eventually hangs (not sure which conditions cause it yet).

Reproduction Steps

In @matti's case:

$ docker ps -n 17
# works

$ docker ps -n 18
# hangs

Other Information

strace:

$ strace -s 128 docker ps -n 18
clock_gettime(CLOCK_MONOTONIC, {484793, 504192514}) = 0
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
read(5, 0xc820400000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json?limit=18 HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 98) = 98
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL
@crawford
Member
crawford commented Nov 9, 2016

@chancez pointed out that docker/docker#13885 and docker/libnetwork#1507 might be related.

@matti
matti commented Nov 10, 2016

We run 5 machines and get this hang ~weekly (for the last 3 months). Is there anything we could run on the machines when the hang happens again?

@matti
matti commented Nov 10, 2016

Machines are 8 CPU / 30 GB RAM.

During the last hang we were able to start new containers while docker ps was hanging.

@Jwpe
Jwpe commented Nov 10, 2016 edited

I am also experiencing this issue with the following setup:

CoreOS Version: 1185.3.0
Docker Server Version: 1.11.2

Not only does docker ps hang, but docker-compose commands fail with the following error message:

Nov 10 10:59:20 <aws_dns>.eu-west-1.compute.internal docker-compose[5872]: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
Nov 10 10:59:20 <aws_dns>.eu-west-1.compute.internal docker-compose[5872]: If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

In the Docker service logs these are the only errors I can see:

Nov 10 09:38:36 <aws_dns>.eu-west-1.compute.internal dockerd[972]: time="2016-11-10T09:38:36.701843769Z" level=error msg="attach: stdout: write unix /var/run/docker.sock->@: write: broken pipe"
Nov 10 09:38:36 <aws_dns>.eu-west-1.compute.internal dockerd[972]: time="2016-11-10T09:38:36.704819504Z" level=error msg="attach: stderr: write unix /var/run/docker.sock->@: write: broken pipe"

I am able to temporarily resolve the issue by manually restarting docker.service.
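For reference, the manual restart is just the standard systemctl call (it only buys time until the daemon deadlocks again):

$ sudo systemctl restart docker.service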

@matti
matti commented Nov 10, 2016 edited

@Jwpe when this happens again, try docker ps -n 1, docker ps -n 2 ... until it hangs just to verify that the symptoms are the same.

docker-compose uses the same HTTP API, so it will hang as well. So, to be clear, the title of this bug is somewhat misleading: docker ps is just a good command for detecting the bug because it iterates through all the containers in the API.
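A rough way to automate that check is a small loop that stops at the first -n value that stalls (sketch only; the 5-second cutoff and the upper bound of 50 are arbitrary):

$ for n in $(seq 1 50); do
    echo "checking -n $n"
    timeout 5 docker ps -n "$n" > /dev/null || { echo "docker ps -n $n hangs or fails"; break; }
  done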

@Jwpe
Jwpe commented Nov 10, 2016

@matti thanks for the tip, will do!

@Jwpe
Jwpe commented Nov 10, 2016

@matti for me docker ps -n 3 works correctly, but docker ps -n 4 hangs...

@matti
matti commented Nov 10, 2016

@Jwpe (or anyone else reading this) -- please send SIGUSR1 to the docker daemon process and paste in the logs.

@Jwpe
Jwpe commented Nov 10, 2016

@matti I get a huge goroutine stack dump on sending that signal:

ps -C docker
  PID TTY          TIME CMD
20791 ?        00:00:01 docker
sudo kill -s USR1 20791
journalctl -u docker -n 100
Nov 10 12:32:08 <aws_dns>.eu-west-1.compute.internal dockerd[20791]: time="2016-11-10T12:32:08.436895136Z" level=info msg="=== BEGIN goroutine stack dump ===\ngoroutine 20 [running]:\n
...
@matti
matti commented Nov 10, 2016

@Jwpe yes... please paste the rest of it...

@juhazi formatted our stack trace here: https://gist.github.com/juhazi/fbf22602561b719e9480f8be4f8a4740

@matti
matti commented Nov 10, 2016

And for the record: this time all daemon APIs were jammed:

getpeername(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc820024d08, FUTEX_WAKE, 1)      = 1
read(5, 0xc8203d6000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 89) = 89
futex(0xc820024d08, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL
@basvdlei
basvdlei commented Nov 10, 2016 edited

We are also seeing this behavior on some 1185.3.0 nodes (on VMware ESXi) where we are using systemd units with docker kill, docker rm, and docker pull ExecStartPre steps as described here.
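For context, the kind of unit meant here looks roughly like the sketch below (names and image are placeholders, not our real services):

[Unit]
Description=example-app container (sketch of the kill/rm/pull ExecStartPre pattern)
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
# the "-" prefix lets the pre-steps fail (e.g. nothing to kill/remove) without failing the unit
ExecStartPre=-/usr/bin/docker kill example-app
ExecStartPre=-/usr/bin/docker rm example-app
ExecStartPre=/usr/bin/docker pull example/image:latest
ExecStart=/usr/bin/docker run --name example-app example/image:latest
ExecStop=/usr/bin/docker stop example-app
Restart=always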

Here are some goroutine stack dumps from 3 different machines that had the issue where docker ps didn't respond unless given a lower -n value.

In our case the units also created the following error messages:

Handler for POST /v1.23/containers/create returned error: Conflict. The name \"/xxxxxxxxxxx\" is already in use by container 0bf826dafe929711c98d65fa812ed75c4086dc3075e1c148fd1ebfd5c28b0544. You have to remove (or rename) that container to be able to reuse that name."
Handler for POST /v1.23/containers/xxxxxxxxxxx/stop returned error: No such container: xxxxxxxxxxx"
Handler for POST /v1.23/containers/xxxxxxxxxxx/kill returned error: Cannot kill container xxxxxxxxxxx: No such container: xxxxxxxxxxx"
Handler for DELETE /v1.23/containers/xxxxxxxxxxx returned error: No such container: xxxxxxxxxxx"

So the create failed with a name conflict even though stop/kill/rm said the container was not there, which looks a lot like docker/docker#21198 and might be related.

To be complete, we are also running cadvisor and a custom container that calls the list-containers API about every 5 minutes.

@matti
matti commented Nov 10, 2016

We are also running cadvisor (as the cluster runs http://www.kontena.io)

@Tomdarkness
Tomdarkness commented Nov 11, 2016 edited

Having the same issue since upgrading to 1185.3.0; here are the logs: https://gist.github.com/anonymous/0ad15cacc028c38ff7759abba7ace198

@bkleef
bkleef commented Nov 14, 2016 edited

Same here at Digital Ocean. Let me know if you need more logs, happy to help.

$ docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.6.3
 Git commit:   bac3bae
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.6.3
 Git commit:   bac3bae
 Built:
 OS/Arch:      linux/amd64
$ uname -a
Linux 4.7.3-coreos-r2 #1 SMP Tue Nov 1 01:38:43 UTC 2016 x86_64 Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz GenuineIntel GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=1185.3.0
DISTRIB_CODENAME="MoreOS"
DISTRIB_DESCRIPTION="CoreOS 1185.3.0 (MoreOS)"
@bkleef
bkleef commented Nov 14, 2016

I've updated CoreOS 1185.3.0 to 1192.2.0 (Docker 1.11.2 to 1.12.1). Will keep you guys posted!

@Jwpe
Jwpe commented Nov 16, 2016

Has anyone had any joy here? I'm seeing this every two or three restarts of my containers, and it's currently limiting my ability to repeatably deploy.

@victorgp

We had to roll back to 1122.3.0 and it is working.

Version 1185.3.0 is not usable; every ~24 hours the docker daemon becomes unresponsive.

@matti
matti commented Nov 21, 2016

Daemon jammed again, this time not even docker ps -n 1 worked.

@Jwpe
Jwpe commented Nov 23, 2016

@bkleef are you seeing this issue still on 1192.2.0? Thanks!

@bkleef
bkleef commented Nov 23, 2016 edited

@Jwpe yes, it is fixed in 1192.2.0!

@Raffo
Raffo commented Nov 29, 2016

@Jwpe It's not fixed for us in 1192.2.0.

@wallneradam

It is related to this Docker issue: docker/docker#13885

I found a workaround until it is fixed: you can use ctr from containerd instead of docker exec or docker run. I wrote it up in more detail in this comment: docker/docker#13885 (comment)

@victorgp

I don't think that bug is related; it was opened in June 2015 and this started happening recently with the latest CoreOS versions. If that were the bug, it should also have happened on older CoreOS versions.

@wallneradam
wallneradam commented Nov 30, 2016 edited

@victorgp It is almost certainly related if you read all the comments, or at least the latest ones. It just happens more often under Docker v1.11.0 and up, which is only in the latest CoreOS (stable), not the previous one.
Not only does docker ps hang, you cannot start or stop containers either.

Anyway, my suggestion has been working for me for about a week, and it was not too hard to rewrite my services that need to run processes or new containers to use ctr, which is integrated into CoreOS...
(The ctr utility is the CLI for containerd, which Docker itself has used behind the scenes since v1.11.0.)

@Jwpe
Jwpe commented Nov 30, 2016

@Raffo I haven't encountered the issue since upgrading to 1192.2.0, but I might just not have recreated the scenario where it occurs yet.

@philk
philk commented Dec 1, 2016 edited

We're also seeing this on 1185.3.0. At least on the one system it's currently happening on, -n 9 works but -n 10 fails. The ctr command can still communicate with containerd, but the failing docker daemon breaks the kubelet, making this a completely failed node. Specifically, pods get stuck in Pending/ContainerCreating/Terminating.

@crawford crawford added this to the CoreOS Alpha 1263.0.0 milestone Dec 2, 2016
@hjacobs
hjacobs commented Dec 5, 2016

We had a hanging docker ps as well (docker/docker#28889), we could reproduce it by heavily using docker exec. We fixed the issue for us by manually upgrading to Docker 1.13 (zalando-incubator/kubernetes-on-aws#164).
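If anyone wants to try the same thing, a crude way to generate a lot of docker exec calls (not our exact workload; container name and counts are arbitrary) is to hammer a long-running container with parallel execs:

$ docker run -d --name exec-stress busybox sleep 3600
$ for i in $(seq 1 500); do docker exec exec-stress true & done; wait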

We are now waiting for Docker 1.13 release and inclusion in CoreOS: zalando-incubator/kubernetes-on-aws#167

@seanknox
seanknox commented Dec 6, 2016

I was also able to reproduce this issue using https://github.com/crosbymichael/docker-stress on a Kubernetes worker node running CoreOS Stable 1185.3.0.

Running docker_stress_linux_amd64 -k 3s -c 5 --containers 1000 (5 concurrent workers creating/deleting containers, max container lifetime of 3s, creating up to 1000 containers) on an m4.large instance on AWS would leave the Docker daemon unresponsive after about three minutes.
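For anyone wanting to copy-paste, that invocation on its own line (binary name as downloaded from the docker-stress releases):

$ ./docker_stress_linux_amd64 -k 3s -c 5 --containers 1000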

Upgraded to CoreOS Beta 1235.1.0 and I haven't been able to reproduce. Whereas running 5 concurrent docker_stress workers would kill CoreOS Stable after a few minutes, I was able to run with 10 and 15 concurrent workers until test completion using CoreOS Beta.

CoreOS Stable 1185.3.0

kernel: 4.7.3

docker: 1.11.2

CoreOS Beta 1235.1.0

kernel: 4.8.6

docker: 1.12.3

@mward29
mward29 commented Dec 7, 2016

Just as an FYI, I'm not entirely sure this is CoreOS specific. We are running into the same issues on CentOS. It feels more related to Docker at the moment, but I've been unable to get a dump out of Docker on failure to validate that.

@hjacobs
hjacobs commented Dec 7, 2016

@mward29 yes, it's a Docker bug and is fixed in 1.13 (and will be backported to 1.12 AFAIK, see docker/docker#28889). Still, it should be fixed in CoreOS by upgrading to the fixed Docker as soon as it's released.

@dm0- dm0- self-assigned this Dec 12, 2016
@dm0- dm0- referenced this issue in coreos/coreos-overlay Dec 14, 2016
Merged

app-emulation/docker: bump to v1.12.4 #2317

@dm0-
Member
dm0- commented Dec 14, 2016

We are updating to Docker 1.12.4, which contains the upstream fixes for the docker daemon deadlocks (docker/docker#29095, docker/docker#29141). It will be available in the alpha later this week. You can reopen this if problems persist.

@dm0- dm0- closed this Dec 14, 2016
@Bekt
Bekt commented Jan 10, 2017 edited

This kept happening to me about 10% of the time. For non-alpha releases, I worked around it by adding a unit that periodically checks docker version and restarts Docker if it hangs.

- name: ping-docker.service
  command: start
  content: |
    [Unit]
    Description=Make sure docker is always running.
    [Service]
    ExecStart=/usr/bin/bash -c "(timeout 2s docker version > /dev/null) || (systemctl restart docker)"
    Restart=always
    RestartSec=10
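To spell out how this works: timeout 2s docker version exits non-zero whenever the daemon doesn't answer within 2 seconds, which triggers the systemctl restart docker; since the check itself exits immediately, Restart=always plus RestartSec=10 makes it re-run roughly every 10 seconds.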
