docker ps hangs #965

Closed
rwky opened this Issue Nov 4, 2015 · 39 comments

Projects

None yet
@rwky
rwky commented Nov 4, 2015

After a while of running various containers for no apparent reason docker ps hangs. All other docker commands work

Docker info:

Server:
Version: 1.8.3
API version: 1.20
Go version: go1.5.1
Git commit: cedd534-dirty
Built: Fri Oct 16 04:20:25 UTC 2015
OS/Arch: linux/amd64

Containers: 23
Images: 520
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos
Operating System: CoreOS 835.1.0
CPUs: 8
Total Memory: 31.46 GiB
ID: KOW4:OCYI:PTET:G2F7:BPHH:Y5GU:XMXT:U2D3:5SKA:MO63:SE2Y:I42C

I ran strace docker ps via toolbox to see if there's a clue of what's going on and it produced

socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, 30) = -1 ENOENT (No such file or directory)
close(3)                                = 0
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
stat("/root/.docker/config.json", 0xc82008a240) = -1 ENOENT (No such file or directory)
stat("/root/.dockercfg", 0xc82008a2d0)  = -1 ENOENT (No such file or directory)
futex(0xc8201a2110, FUTEX_WAKE, 1)      = 1
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/media/root/var/run/docker.sock"}, 34) = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=172969648, u64=140205085380272}}) = 0
getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc8201a2110, FUTEX_WAKE, 1)      = 1
read(3, 0xc8200ab000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(3, "GET /v1.20/containers/json HTTP/"..., 93) = 93
epoll_wait(4, {{EPOLLOUT, {u32=172969648, u64=140205085380272}}}, 128, 0) = 1
futex(0x181f250, FUTEX_WAIT, 0, NULL

So it looks like it's writing the get command but the server isn't responding.

@crawford
Member
crawford commented Nov 5, 2015

@yifan-gu you were showing me this behavior the other day, but I forget what was causing it. Can you refresh my memory?

@yifan-gu
Member
yifan-gu commented Nov 5, 2015

@crawford @rwky
I reported in this docker issue.
docker/docker#13885 (comment)

@spiffxp
spiffxp commented Nov 23, 2015

Running into this as well, see kubernetes/kubernetes#17525

@mischief
Member

@rwky have you experienced this after your initial report?

@rwky
rwky commented Nov 23, 2015

@mischief yep at least once a week.

@mischief
Member

@rwky any tips on reproducing it? is it intermittent, or are you doing something to work around it?

@rwky
rwky commented Nov 24, 2015

I don't know how to predictably replicate it, however the server in question spawns and destroys (using the --rm flag) several docker containers every minute. I suspect that just running a CoreOS server that does that for a few days will trigger the hang.

The only way to resolve is to reboot the server (you could probably kill and restart docker but rebooting does the job).

@pierreozoux

I'm seeing the same problem now (I'm on IRC, and will post the content of IRC here), I'll have to reboot the server in 2 hours, if in the mean time, you want me to run commands, let me know!
(I run CoreOS stable)

Ok, I'm rebooting now, here is the IRC discussion: https://framabin.org/?3b8e73977c597b06#7wIlBFnWRaGO0+o34QkfauXwhm2kSvbk+NInpxe6kes=

@leonty
leonty commented Feb 19, 2016

Same for me on docker 1.10.1 after intensive using of dockerui web app (https://hub.docker.com/r/crosbymichael/dockerui/).

@vpal
vpal commented Feb 25, 2016

I experienced similar hang with CoreOS stable (835.12.0).
I could not kill -9 the docker daemon or any of its processes.
Had to reboot the machine.

Okay, I have no idea if this is the same issue, but it is reproducible with this command:
/usr/bin/docker run --pid=host --privileged -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.io/google_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests

@gopinatht

+1 for a rapid resolution of this issue on CoreOS 970.1.0 with Docker 1.10.1 with kubernetes 1.1.7. Is there an active investigation of the problem? There was a suggestion that it might be a kernel issue (docker/docker#13885 (comment)) but I am not sure how it is addressed or should be addressed in CoreOS.

@crawford
Member
crawford commented Mar 2, 2016

@gopinatht can you give the Beta channel a shot? That has the upstream fix for device mapper.

@gopinatht

@crawford Thanks for the info. I will give it a try.

Just to confirm, the alpha channel does not have the fix (I am using the absolute latest version of Alpha)? Also, docker info output says the storage driver is overlay:
Containers: 26 Running: 10 Paused: 0 Stopped: 16 Images: 12 Server Version: 1.10.1 Storage Driver: overlay Backing Filesystem: extfs Execution Driver: native-0.2 Logging Driver: json-file Plugins: Volume: local Network: host bridge null Kernel Version: 4.4.1-coreos Operating System: CoreOS 970.1.0 (Coeur Rouge) OSType: linux Architecture: x86_64 CPUs: 2 Total Memory: 3.862 GiB Name: ip-1x-xx-xx-xx.ec2.internal ID: UVON:7XXW:5AZD:QJTG:2DAW:6Y52:CKN3:EVZZ:WESO:COSQ:4UGC:VKNO

@crawford
Member
crawford commented Mar 2, 2016

Sorry, I missed the CoreOS version before. That kernel also has the patch. I doubt it will behave any differently.

@gopinatht

@crawford No worries. Any other insights on where I can investigate? I am kind of stuck on what I should do next. This issue has become a blocker for my team using a CoreOS/kubernetes/docker solution.

@gopinatht

Guys, I logged an issue with Docker team and they are indicating that the docker build on CoreOS is different enough (different Go version and custom build on CoreOS) (docker/docker#20871 (comment)) for this to be more of a CoreOS issue.

Help or pointers from CoreOS team would be appreciated.

@crawford
Member
crawford commented Mar 3, 2016

I really don't know where to start on this one. We are going to bump to Docker 1.10.2 in the next Alpha and then bump to Go 1.5.3 in the following one. So, by this time next week, we will have a mostly vanilla Docker shipping in CoreOS. Given how long this bug has been open, I don't expect the version bump to change anything, but at least we'll be able to debug with the Docker folks a bit easier. Let's reevaluate the situation next week.

@vpal
vpal commented Mar 3, 2016

To help debugging. If I add the -v /sys:/sys:ro option to the command I mentioned above the kubelet container and docker daemon does not hang.
So the full command that does not seem to hang:
/usr/bin/docker run --pid=host --privileged -v /sys:/sys:ro -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.igoogle_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests

@gopinatht

@vpal Thanks for this info. My system is a little different in that I am not using hyperkube on my AWS cluster. But I wonder why mounting /sys as read only mount should solve the docker hang issue. I am not able to make the connection.

@gopinatht

@crawford I investigated the issue further and posted the results of my investigation there: docker/docker#20871 (comment)

FYI.

@gopinatht

I asked this question on the docker bug page but thought it would be relevant here as well: Could this issue be related in any way to the hairpin NAT setting? Is is possible there is some sort of thread deadlocks on a device? When I ran systemctl on my CoreOS distribution, it did not show any device with name vethbfe8f8f. Could that be the source of the issue?

@alexkappa alexkappa referenced this issue in coreos/coreos-kubernetes Mar 12, 2016
Closed

Kubernetes workers fail after a few hours #306

@crawford
Member

@gopinatht as of 983.0.0, we are shipping Docker built with Go 1.5.3, so that should eliminate a few variables.

@gopinatht

@crawford Thanks for the heads-up. I will check it out.

@abhisheyke

I came across same issue with docker 1.8.3. I noticed docker built with Go Version 1.4.2 .
is it a known bug in docker 1.8.3 ?
@gopinatht any luck using docker built with higher go version? Thanks

[root@abhi ]# docker version
Client:
Version: 1.8.3
API version: 1.20
Go version: go1.4.2
Git commit: f4bf5c7
Built: Mon Oct 12 06:06:01 UTC 2015
OS/Arch: linux/amd64

Server:
Version: 1.8.3
API version: 1.20
Go version: go1.4.2
Git commit: f4bf5c7
Built: Mon Oct 12 06:06:01 UTC 2015
OS/Arch: linux/amd64
[root@localhost abhi]# docker info
Containers: 20
Images: 324
Storage Driver: devicemapper
Pool Name: docker-8:7-268591717-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: xfs
Data file: /dev/vg-abhi/data
Metadata file: /dev/vg-abhi/metadata
Data Space Used: 16.12 GB
Data Space Total: 107.4 GB
Data Space Available: 91.25 GB
Metadata Space Used: 26.04 MB
Metadata Space Total: 5.369 GB
Metadata Space Available: 5.343 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 48
Total Memory: 125.6 GiB
Name: localhost.localdomain
ID: K76J:2RP6:HKQX:NMJS:AA5D:EXY2:VGGD:5YB4:KRFI:3HW6:WWG4:6QDY
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

@crawford
Member

Is this still an issue with 1.10.3? I haven't seen this one in a while (in fact, I forgot about this bug).

@rwky
rwky commented Jun 13, 2016 edited

@crawford I've not seen this on 1.10.3 so far as I can remember, the server I normally see it on runs the beta channel so has been on 1.10.3 for a while.

Just checked my logs, not seen it since May 3rd.

@crawford
Member

@rwky Thank you for checking. Let's go ahead and close this one. We can reopen it if it turns out to still be a problem.

@crawford crawford closed this Jun 13, 2016
@feelobot

seeing this still:

core@ip-172-20-13-165 ~ $ docker info
Containers: 50
 Running: 42
 Paused: 0
 Stopped: 8
Images: 241
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.5.7-coreos
Operating System: CoreOS 1010.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.44 GiB
Name: ip-172-20-13-165.ec2.internal
ID: SS63:BOHR:326B:NZUQ:BOMB:43U2:XOLL:UCT7:BDOS:LBD3:DQJK:O52K
@zihaoyu
zihaoyu commented Jul 30, 2016 edited

We see this quite often on 1068.6.0 and Kubernetes 1.3.3. Please consider reopening this issue.

core@ip-10-72-16-158 ~ $ docker info
Containers: 135
 Running: 77
 Paused: 0
 Stopped: 58
Images: 59
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 157.4 GiB
Name: ip-10-72-16-158.us-west-2.compute.internal
ID: F6BA:V7OB:IYIQ:QGDZ:GQCP:2ZUW:A7S7:GE73:AG6Z:SLND:CFKU:A7R3

Also seeing these a lot in journal -u docker.service:

Jul 30 00:40:29 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:40:29.198751686Z" level=info msg="Container bb8b577470e1ec6be491414d7d39df767a113e4fb6efbc3178c7b73ea14ff15e failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564310Z" level=info msg="Container d5677abe86893968f12f1d3db15ef0097dbd218d59c52e4c06d809d8c0205664 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470599085Z" level=info msg="Container 58b3ac90b2fef33b7cb76731c2df1d56d5459be0e7f0312169e6d47b599f6d1d failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564635Z" level=info msg="Container 85d9338bddc0553f079651c737fa3c165df53ffe35a852e579a11eb1732316c0 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:10 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:10.627024947Z" level=error msg="Handler for POST /containers/36b70ef81bd78b765f58245a49883b4e11b3ab291ce4e67535c4e9314921af10/stop returned error: Container already stopped"
@rwky
rwky commented Aug 4, 2016

This just happened to me again, however at the same time I'm experiencing docker/docker#5618 which I suspect is the cause @feelobot @zihaoyu perhaps you are experiencing the same thing?

@zihaoyu
zihaoyu commented Aug 4, 2016

@rwky Not sure if it is related, but I did see

Aug 04 11:10:00 ip-10-72-16-158.us-west-2.compute.internal kernel: unregister_netdevice: waiting for veth9c879be to become free. Usage count = 1```
@rwky
rwky commented Aug 4, 2016

Yep that looks like it's the problem.

@nizar-m
nizar-m commented Aug 9, 2016

This is happening for me too.

Docker version 1.10.3, build 3cd164c

CoreOS stable (1068.8.0)

Kernel Version: 4.6.3-coreos
OS Image: CoreOS 1068.8.0 (MoreOS)
Container Runtime Version: docker://1.10.3
Kubelet Version: v1.2.4.7+cb08580860a02a-dirty
Kube-Proxy Version: v1.2.4.7+cb08580860a02a-dirty

docker info

Containers: 131
Running: 90
Paused: 0
Stopped: 41
Images: 17
Server Version: 1.10.3
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.8.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 13.7 GiB

dmesg shows the folowing

[163194.652062] unregister_netdevice: waiting for vethdd22a32 to become free. Usage count = 1

We are using Flannal driver

@alph486
alph486 commented Aug 31, 2016 edited

Echo @rwky: reported details here: docker/docker#5618 (comment) . I'm probably going to try @sercand 's suggestion (docker/docker#5618 (comment))

It should probably also be mentioned that this of course affects, by default, clusters build by the kube-aws project.

@mischief
Member

sounds like a dup of #254

@liquid-sky

@mischief, doesn't look like a dup, because we run container the same way as @rwky (every minute with --rm) and I can see that docker hangs exactly on the container, long before unregistered lo message happens. On some machines there's no unregistered lo message at all, but docker ps is hanging.

docker -D info
Containers: 6
 Running: 2
 Paused: 0
 Stopped: 4
Images: 6
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: <hostname>
ID: GFHQ:Y6IV:XCQI:Y2NA:PCQN:DEII:OSPZ:LENL:OCFU:6CBI:MDFV:EMD7
Docker Root Dir: /var/lib/docker
Debug mode (client): true
Debug mode (server): false
Username: <username>
Registry: https://index.docker.io/v1/
@matti
matti commented Nov 9, 2016

I don't have unregister_netdevice in dmesg, but docker ps hangs like this:

$ docker ps -n 17
# works

$ docker ps -n 18
# hangs

strace:

$ strace -s  128 docker ps -n 18
clock_gettime(CLOCK_MONOTONIC, {484793, 504192514}) = 0
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
read(5, 0xc820400000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json?limit=18 HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 98) = 98
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL

so $ curl --unix-socket /var/run/docker.sock http://localhost/v1.23/containers/json?limit=17 works, but curl --unix-socket /var/run/docker.sock http://dolantrump/v1.23/containers/json?limit=18 hangs.

Coreos: VERSION=1185.3.0 BUILD_ID=2016-11-01-0605

$ docker info
Containers: 52
 Running: 28
 Paused: 0
 Stopped: 24
Images: 110
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 29.46 GiB
Name: gcp-1-5.c.kontena-1162.internal
ID: CTTM:L5IQ:K24M:GNOZ:JOMO:CZEG:Z756:EI3M:ZGPS:HYPO:SXVO:GIO5
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

running in google cloud platform.

@crawford
Member
crawford commented Nov 9, 2016

Since this is a new bug, let's move this conversation over to #1654.

@klawed
klawed commented Jan 6, 2017

systemctl restart docker
totally worked for me (which is definitely nicer than a reboot).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment