Docker daemon hangs #1681

Open
pjknkda opened this Issue Nov 29, 2016 · 5 comments
@pjknkda
pjknkda commented Nov 29, 2016 edited

Issue Report

Bug

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1235.0.0
VERSION_ID=1235.0.0
BUILD_ID=2016-11-17-0416
PRETTY_NAME="CoreOS 1235.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

KVM on CentOS 7 (3.10.0-327.36.3.el7.x86_64)

Expected Behavior

The Docker daemon should respond to commands such as ps, run, stop, etc.

Actual Behavior

The Docker daemon hangs on commands such as ps, run, stop, etc.

Reproduction Steps

I have not found exact reproduction steps yet (I am building a simplified version of my real service, but have not yet managed to reproduce the issue with it). However, the hang occurs almost every time I repeatedly create and remove a large number of containers (usually more than about 1,000, though sometimes fewer; each container lives 1 s to 120 s).
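As a rough stand-in until exact steps are found, the workload can be approximated with a churn loop along these lines (a sketch only: the busybox image, the serialized run/wait/rm cycle, and the random 1-120 s lifetime are placeholders, not the reporter's real service):

```shell
# Sketch of the container churn described above: create and remove
# many short-lived containers in a loop.
churn() {
  local n=$1
  local i life
  for i in $(seq 1 "$n"); do
    life=$(( (RANDOM % 120) + 1 ))            # each container lives 1-120 s
    docker run -d --name "churn-$i" busybox sleep "$life" >/dev/null
    docker wait "churn-$i" >/dev/null         # block until the container exits
    docker rm "churn-$i" >/dev/null           # then remove it
  done
}

# Run only where a Docker daemon is actually available:
if command -v docker >/dev/null 2>&1; then
  churn 1000
fi
```

The real service presumably creates containers concurrently rather than one at a time, so this loop may need to be parallelized to trigger the hang.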

Other Information

Stack trace of dockerd: sigusr1.txt
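For context, dockerd dumps all goroutine stack traces to its log when it receives SIGUSR1, which is how a dump like sigusr1.txt is typically captured. A sketch, assuming dockerd runs under systemd as the docker unit (as on CoreOS):

```shell
# Sending SIGUSR1 to dockerd makes it dump all goroutine stacks to
# its log (the "docker" journald unit on CoreOS).
pid="$(pidof dockerd 2>/dev/null || true)"
if [ -n "$pid" ]; then
  sudo kill -SIGUSR1 "$pid"
  # Collect the dump from the journal shortly afterwards:
  journalctl -u docker --since "1 min ago" > sigusr1.txt
fi
echo "dockerd pid: ${pid:-not running}"
```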

Journalctl log,

Nov 28 07:25:54 localhost docker[3564]: [Worker 0] Accpet request (Id: 2076410)
Nov 28 07:25:54 localhost docker[3564]: Get (Id: 2076410)
Nov 28 07:25:55 localhost docker[3564]: New container: aad1a33df311 (Elapsed: 46ms)
Nov 28 07:25:55 localhost systemd-udevd[14298]: Could not generate persistent MAC address for veth208af43: No such file or directory
Nov 28 07:25:55 localhost systemd-udevd[14299]: Could not generate persistent MAC address for veth34418f0: No such file or directory
Nov 28 07:25:55 localhost systemd-networkd[3188]: veth208af43: IPv6 enabled for interface: Success
Nov 28 07:25:55 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): veth208af43: link is not ready
Nov 28 07:25:55 localhost systemd-networkd[3188]: veth34418f0: IPv6 enabled for interface: Success
Nov 28 07:25:55 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth208af43: link becomes ready
Nov 28 07:25:55 localhost systemd-timesyncd[3123]: Network configuration changed, trying to establish connection.
Nov 28 07:25:55 localhost systemd-networkd[3188]: veth34418f0: Gained carrier
Nov 28 07:25:55 localhost systemd-networkd[3188]: veth208af43: Gained carrier
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered blocking state
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered disabled state
Nov 28 07:25:55 localhost kernel: device veth34418f0 entered promiscuous mode
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered blocking state
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered forwarding state
Nov 28 07:25:57 localhost systemd-networkd[3188]: veth208af43: Gained IPv6LL
Nov 28 07:25:57 localhost systemd-networkd[3188]: veth34418f0: Gained IPv6LL

Nov 28 07:26:06 localhost kernel: INFO: task bash:14181 blocked for more than 120 seconds.
Nov 28 07:26:06 localhost kernel:       Not tainted 4.8.6-coreos #1
Nov 28 07:26:06 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 07:26:06 localhost kernel: bash            D ffff8b963e5a7c48     0 14181  14166 0x00000182
Nov 28 07:26:06 localhost kernel:  ffff8b963e5a7c48 0000000000000000 ffffffff96a0d540 ffff8b960ded1d40
Nov 28 07:26:06 localhost kernel:  ffffffff9607bb30 ffff8b963e5a8000 ffff8b960bd33420 0000000000000001
Nov 28 07:26:06 localhost kernel:  ffff8b96091f65a8 0000000000000008 ffff8b963e5a7c60 ffffffff965b9bf5
Nov 28 07:26:06 localhost kernel: Call Trace:
Nov 28 07:26:06 localhost kernel:  [<ffffffff9607bb30>] ? task_stopped_code+0x50/0x50
Nov 28 07:26:06 localhost kernel:  [<ffffffff965b9bf5>] schedule+0x35/0x80
Nov 28 07:26:06 localhost kernel:  [<ffffffff9611318f>] zap_pid_ns_processes+0x13f/0x1a0
Nov 28 07:26:06 localhost kernel:  [<ffffffff9607dcd6>] do_exit+0xa96/0xb10
Nov 28 07:26:06 localhost kernel:  [<ffffffff9607ddd5>] do_group_exit+0x45/0xb0
Nov 28 07:26:06 localhost kernel:  [<ffffffff96088ff2>] get_signal+0x282/0x610
Nov 28 07:26:06 localhost kernel:  [<ffffffff9602e197>] do_signal+0x37/0x6c0
Nov 28 07:26:06 localhost kernel:  [<ffffffff961ff9bf>] ? __vfs_read+0xdf/0x130
Nov 28 07:26:06 localhost kernel:  [<ffffffff9611bcf1>] ? __audit_syscall_entry+0xb1/0x100
Nov 28 07:26:06 localhost kernel:  [<ffffffff9600357c>] exit_to_usermode_loop+0x8c/0xd0
Nov 28 07:26:06 localhost kernel:  [<ffffffff96003bc9>] do_syscall_64+0xa9/0x160
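The trace above is the kernel's hung-task detector firing on a bash process stuck in uninterruptible sleep (D state) inside zap_pid_ns_processes. A quick way to spot such tasks while the hang is in progress, as a sketch:

```shell
# List tasks in uninterruptible sleep (STAT starting with "D"), like
# the hung bash process in the trace above. As root, the kernel-side
# stack of any such task can be read from /proc/<pid>/stack.
ps -eo pid,stat,comm | awk '$2 ~ /^D/ {print $0}'
```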

I first tried CoreOS stable (1185.3.0) and hit a different issue, so I moved to CoreOS alpha (1235.0.0), where this issue occurred instead. Here is the error log I got on the stable channel.

Nov 28 05:26:32 localhost dockerd[1748]: 2016/11/28 05:26:32 http: panic serving @: runtime error: index out of range
Nov 28 05:26:32 localhost dockerd[1748]: goroutine 55942 [running]:
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*conn).serve.func1(0xc822768700)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:1389 +0xc1
Nov 28 05:26:32 localhost dockerd[1748]: panic(0x1532c80, 0xc82000e040)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/runtime/panic.go:443 +0x4e9
Nov 28 05:26:32 localhost dockerd[1748]: github.com/docker/docker/docker/hack.(*MalformedHostHeaderOverrideConn).Read(0xc823434300, 0xc826507000, 0x1000, 0x1000, 0x4, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/docker/hack/malformed_host_override.go:58 +0x7b5
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*connReader).Read(0xc823434400, 0xc826507000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:526 +0x196
Nov 28 05:26:32 localhost dockerd[1748]: bufio.(*Reader).fill(0xc822f8daa0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/bufio/bufio.go:97 +0x1e9
Nov 28 05:26:32 localhost dockerd[1748]: bufio.(*Reader).ReadSlice(0xc822f8daa0, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/bufio/bufio.go:328 +0x21a
Nov 28 05:26:32 localhost dockerd[1748]: bufio.(*Reader).ReadLine(0xc822f8daa0, 0x0, 0x0, 0x0, 0x1709400, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/bufio/bufio.go:357 +0x53
Nov 28 05:26:32 localhost dockerd[1748]: net/textproto.(*Reader).readLineSlice(0xc822f75200, 0x0, 0x0, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/textproto/reader.go:55 +0x81
Nov 28 05:26:32 localhost dockerd[1748]: net/textproto.(*Reader).ReadLine(0xc822f75200, 0x0, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/textproto/reader.go:36 +0x40
Nov 28 05:26:32 localhost dockerd[1748]: net/http.readRequest(0xc822f8daa0, 0xc81fe0c500, 0xc82585fb20, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/request.go:721 +0xb6
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*conn).readRequest(0xc822768700, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:705 +0x359
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*conn).serve(0xc822768700)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:1425 +0x947
Nov 28 05:26:32 localhost dockerd[1748]: created by net/http.(*Server).Serve
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:2137 +0x44e
@pjknkda
pjknkda commented Nov 29, 2016

I tested the same service on CentOS (3.10.0-327.36.3.el7.x86_64) with Docker 1.12.3, the same version as the current CoreOS alpha channel, using the zfs storage driver (because the overlay driver simply does not work there), and the issue did not reproduce (~7 days with no hang). Therefore, I posted the issue here rather than on github.com/docker/docker.

@crawford crawford added this to the CoreOS Alpha 1263.0.0 milestone Dec 2, 2016
@wkruse
wkruse commented Dec 5, 2016

Seems like a duplicate of #1654.

@dm0- dm0- self-assigned this Dec 12, 2016
@dm0- dm0- referenced this issue in coreos/coreos-overlay Dec 14, 2016
Merged

app-emulation/docker: bump to v1.12.4 #2317

@dm0-
Member
dm0- commented Dec 14, 2016

We are updating to Docker 1.12.4, which contains the upstream fixes for the docker daemon deadlocks (docker/docker#29095, docker/docker#29141). It will be available in the alpha later this week. You can reopen this if problems persist.

@dm0- dm0- closed this Dec 14, 2016
@pjknkda
pjknkda commented Dec 16, 2016 edited

The issue still exists after updating CoreOS to 1262.0.0 (alpha).

localhost elice # cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1262.0.0
VERSION_ID=1262.0.0
BUILD_ID=2016-12-14-2334
PRETTY_NAME="CoreOS 1262.0.0 (Ladybug)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
localhost elice # docker info
Containers: 91
 Running: 0
 Paused: 0
 Stopped: 91
Images: 1256
Server Version: 1.12.4
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.0-coreos
Operating System: CoreOS 1262.0.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 55.05 GiB
Name: localhost
ID: 4NGY:OJIT:ZJGD:FWKT:URNJ:MN3C:4COS:HNOU:SNM7:SQVG:7YQM:TUPS
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

hang.sigusr1.decoded.txt

@crawford crawford reopened this Dec 18, 2016
@pjknkda
pjknkda commented Dec 21, 2016 edited

I have been testing with CoreOS 1262.0.0 and Docker v1.13.0-rc4 for 3 days, and the issue does not seem to occur in this configuration.

(update 1/3/2017) No hangs for 15 days.
