New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker daemon hangs #1681

Closed
pjknkda opened this Issue Nov 29, 2016 · 8 comments

Comments

Projects
None yet
7 participants
@pjknkda

pjknkda commented Nov 29, 2016

Issue Report

Bug

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1235.0.0
VERSION_ID=1235.0.0
BUILD_ID=2016-11-17-0416
PRETTY_NAME="CoreOS 1235.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

KVM on CentOS 7 (3.10.0-327.36.3.el7.x86_64)

Expected Behavior

Docker daemon should be responsible for commands such as ps, run, stop, etc..

Actual Behavior

Docker daemon hangs on commands such as ps, run, stop, etc.

Reproduction Steps

I cannot find exact reproduction step yet (I'm making simplified version of my real service, but yet success to reproduce the issue). However, it occurs almost every time when I repeat creating and removing containers a lot (more than about 1,000 containers, but sometimes with fewer containers. Each container lives 1s ~ 120s).

Other Information

Stacktrace of dockerd : sigusr1.txt

Journalctl log,

Nov 28 07:25:54 localhost docker[3564]: [Worker 0] Accpet request (Id: 2076410)
Nov 28 07:25:54 localhost docker[3564]: Get (Id: 2076410)
Nov 28 07:25:55 localhost docker[3564]: New container: aad1a33df311 (Elapsed: 46ms)
Nov 28 07:25:55 localhost systemd-udevd[14298]: Could not generate persistent MAC address for veth208af43: No such file or directory
Nov 28 07:25:55 localhost systemd-udevd[14299]: Could not generate persistent MAC address for veth34418f0: No such file or directory
Nov 28 07:25:55 localhost systemd-networkd[3188]: veth208af43: IPv6 enabled for interface: Success
Nov 28 07:25:55 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): veth208af43: link is not ready
ov 28 07:25:55 localhost systemd-networkd[3188]: veth34418f0: IPv6 enabled for interface: Success
Nov 28 07:25:55 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth208af43: link becomes ready
Nov 28 07:25:55 localhost systemd-timesyncd[3123]: Network configuration changed, trying to establish connection.
Nov 28 07:25:55 localhost systemd-networkd[3188]: veth34418f0: Gained carrier
Nov 28 07:25:55 localhost systemd-networkd[3188]: veth208af43: Gained carrier
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered blocking state
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered disabled state
Nov 28 07:25:55 localhost kernel: device veth34418f0 entered promiscuous mode
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered blocking state
Nov 28 07:25:55 localhost kernel: docker0: port 3(veth34418f0) entered forwarding state
Nov 28 07:25:57 localhost systemd-networkd[3188]: veth208af43: Gained IPv6LL
Nov 28 07:25:57 localhost systemd-networkd[3188]: veth34418f0: Gained IPv6LL

Nov 28 07:26:06 localhost kernel: INFO: task bash:14181 blocked for more than 120 seconds.
Nov 28 07:26:06 localhost kernel:       Not tainted 4.8.6-coreos #1
Nov 28 07:26:06 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 28 07:26:06 localhost kernel: bash            D ffff8b963e5a7c48     0 14181  14166 0x00000182
Nov 28 07:26:06 localhost kernel:  ffff8b963e5a7c48 0000000000000000 ffffffff96a0d540 ffff8b960ded1d40
Nov 28 07:26:06 localhost kernel:  ffffffff9607bb30 ffff8b963e5a8000 ffff8b960bd33420 0000000000000001
Nov 28 07:26:06 localhost kernel:  ffff8b96091f65a8 0000000000000008 ffff8b963e5a7c60 ffffffff965b9bf5
Nov 28 07:26:06 localhost kernel: Call Trace:
Nov 28 07:26:06 localhost kernel:  [<ffffffff9607bb30>] ? task_stopped_code+0x50/0x50
Nov 28 07:26:06 localhost kernel:  [<ffffffff965b9bf5>] schedule+0x35/0x80
Nov 28 07:26:06 localhost kernel:  [<ffffffff9611318f>] zap_pid_ns_processes+0x13f/0x1a0
Nov 28 07:26:06 localhost kernel:  [<ffffffff9607dcd6>] do_exit+0xa96/0xb10
Nov 28 07:26:06 localhost kernel:  [<ffffffff9607ddd5>] do_group_exit+0x45/0xb0
Nov 28 07:26:06 localhost kernel:  [<ffffffff96088ff2>] get_signal+0x282/0x610
Nov 28 07:26:06 localhost kernel:  [<ffffffff9602e197>] do_signal+0x37/0x6c0
Nov 28 07:26:06 localhost kernel:  [<ffffffff961ff9bf>] ? __vfs_read+0xdf/0x130
Nov 28 07:26:06 localhost kernel:  [<ffffffff9611bcf1>] ? __audit_syscall_entry+0xb1/0x100
Nov 28 07:26:06 localhost kernel:  [<ffffffff9600357c>] exit_to_usermode_loop+0x8c/0xd0
Nov 28 07:26:06 localhost kernel:  [<ffffffff96003bc9>] do_syscall_64+0xa9/0x160

I first tried with coreos stable(1185.3.0) and suffered from another issue, and moved to coreos alpha (1235.0.0) and then this issue is occurred. Here is the error log what I got in stable channel.

Nov 28 05:26:32 localhost dockerd[1748]: 2016/11/28 05:26:32 http: panic serving @: runtime error: index out of range
Nov 28 05:26:32 localhost dockerd[1748]: goroutine 55942 [running]:
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*conn).serve.func1(0xc822768700)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:1389 +0xc1
Nov 28 05:26:32 localhost dockerd[1748]: panic(0x1532c80, 0xc82000e040)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/runtime/panic.go:443 +0x4e9
Nov 28 05:26:32 localhost dockerd[1748]: github.com/docker/docker/docker/hack.(*MalformedHostHeaderOverrideConn).Read(0xc823434300, 0xc826507000, 0x1000, 0x1000, 0x4, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/docker/hack/malformed_host_override.go:58 +0x7b5
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*connReader).Read(0xc823434400, 0xc826507000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:526 +0x196
Nov 28 05:26:32 localhost dockerd[1748]: bufio.(*Reader).fill(0xc822f8daa0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/bufio/bufio.go:97 +0x1e9
Nov 28 05:26:32 localhost dockerd[1748]: bufio.(*Reader).ReadSlice(0xc822f8daa0, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/bufio/bufio.go:328 +0x21a
Nov 28 05:26:32 localhost dockerd[1748]: bufio.(*Reader).ReadLine(0xc822f8daa0, 0x0, 0x0, 0x0, 0x1709400, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/bufio/bufio.go:357 +0x53
Nov 28 05:26:32 localhost dockerd[1748]: net/textproto.(*Reader).readLineSlice(0xc822f75200, 0x0, 0x0, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/textproto/reader.go:55 +0x81
Nov 28 05:26:32 localhost dockerd[1748]: net/textproto.(*Reader).ReadLine(0xc822f75200, 0x0, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/textproto/reader.go:36 +0x40
Nov 28 05:26:32 localhost dockerd[1748]: net/http.readRequest(0xc822f8daa0, 0xc81fe0c500, 0xc82585fb20, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/request.go:721 +0xb6
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*conn).readRequest(0xc822768700, 0x0, 0x0, 0x0)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:705 +0x359
Nov 28 05:26:32 localhost dockerd[1748]: net/http.(*conn).serve(0xc822768700)
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:1425 +0x947
Nov 28 05:26:32 localhost dockerd[1748]: created by net/http.(*Server).Serve
Nov 28 05:26:32 localhost dockerd[1748]:         /usr/lib/go1.6/src/net/http/server.go:2137 +0x44e
@pjknkda

This comment has been minimized.

Show comment
Hide comment
@pjknkda

pjknkda Nov 29, 2016

I had tested the same service with CentOS (3.10.0-327.36.3.el7.x86_64) + Docker 1.12.3, which is the same version of current CoreOS alpha channel, with zfs storage driver (because overlay driver simply does not work) and found that the issue is not reproduced (~7 days with no hang). Therefore, I posted the issue here rather than github.com/docker/docker.

pjknkda commented Nov 29, 2016

I had tested the same service with CentOS (3.10.0-327.36.3.el7.x86_64) + Docker 1.12.3, which is the same version of current CoreOS alpha channel, with zfs storage driver (because overlay driver simply does not work) and found that the issue is not reproduced (~7 days with no hang). Therefore, I posted the issue here rather than github.com/docker/docker.

@wkruse

This comment has been minimized.

Show comment
Hide comment
@wkruse

wkruse Dec 5, 2016

Seems like a duplicate of #1654.

wkruse commented Dec 5, 2016

Seems like a duplicate of #1654.

@dm0-

This comment has been minimized.

Show comment
Hide comment
@dm0-

dm0- Dec 14, 2016

Member

We are updating to Docker 1.12.4, which contains the upstream fixes for the docker daemon deadlocks (moby/moby#29095, moby/moby#29141). It will be available in the alpha later this week. You can reopen this if problems persist.

Member

dm0- commented Dec 14, 2016

We are updating to Docker 1.12.4, which contains the upstream fixes for the docker daemon deadlocks (moby/moby#29095, moby/moby#29141). It will be available in the alpha later this week. You can reopen this if problems persist.

@dm0- dm0- closed this Dec 14, 2016

@pjknkda

This comment has been minimized.

Show comment
Hide comment
@pjknkda

pjknkda Dec 16, 2016

The issue still exists after updating coreOS to 1262.0.0 (alpha).

localhost elice # cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1262.0.0
VERSION_ID=1262.0.0
BUILD_ID=2016-12-14-2334
PRETTY_NAME="CoreOS 1262.0.0 (Ladybug)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
localhost elice # docker info
Containers: 91
 Running: 0
 Paused: 0
 Stopped: 91
Images: 1256
Server Version: 1.12.4
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.0-coreos
Operating System: CoreOS 1262.0.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 55.05 GiB
Name: localhost
ID: 4NGY:OJIT:ZJGD:FWKT:URNJ:MN3C:4COS:HNOU:SNM7:SQVG:7YQM:TUPS
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

hang.sigusr1.decoded.txt

pjknkda commented Dec 16, 2016

The issue still exists after updating coreOS to 1262.0.0 (alpha).

localhost elice # cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1262.0.0
VERSION_ID=1262.0.0
BUILD_ID=2016-12-14-2334
PRETTY_NAME="CoreOS 1262.0.0 (Ladybug)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
localhost elice # docker info
Containers: 91
 Running: 0
 Paused: 0
 Stopped: 91
Images: 1256
Server Version: 1.12.4
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.0-coreos
Operating System: CoreOS 1262.0.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 55.05 GiB
Name: localhost
ID: 4NGY:OJIT:ZJGD:FWKT:URNJ:MN3C:4COS:HNOU:SNM7:SQVG:7YQM:TUPS
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

hang.sigusr1.decoded.txt

@pjknkda

This comment has been minimized.

Show comment
Hide comment
@pjknkda

pjknkda Dec 21, 2016

I've tested with CoreOS 1262.0 and Docker v1.13.0-rc4 for 3 days and it seems that the issue does not occur in this configuration.

(update 1/3/2017) No hangs for 15 days.

pjknkda commented Dec 21, 2016

I've tested with CoreOS 1262.0 and Docker v1.13.0-rc4 for 3 days and it seems that the issue does not occur in this configuration.

(update 1/3/2017) No hangs for 15 days.

@crawford crawford modified the milestones: Container Linux Stable 1235.7.0, CoreOS Alpha 1300.0.0 Jan 24, 2017

@crawford crawford removed this from the Container Linux Stable 1235.8.0 milestone Jan 31, 2017

@mikebryant

This comment has been minimized.

Show comment
Hide comment
@mikebryant

mikebryant Feb 20, 2017

We're getting hangs, on CoreOS 1235.9.0, docker version 1.12.6

We're getting hangs, on CoreOS 1235.9.0, docker version 1.12.6

@someword

This comment has been minimized.

Show comment
Hide comment
@someword

someword Feb 25, 2017

We are getting hans on 1235.0.0 - 1235.12.0 which encompasses coreos docker versions 1.12.3 - 1.12.6

here is stack trace from 1235.12.0 https://gist.github.com/someword/6ca8abc4782dcc6c94c8803cab32a93a

I had to pull the stack trace from journalctl. Not sure if the formatting is in the desired format.
I used the following to extract the stack trace from journalctl

journalctl -u docker --no-pager --since="17:53:48" --until="17:54:00" -o cat >/var/tmp/docker-usr1.txt

Os and docker details below

core@ip-10-20-2-103 ~ $ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.12.0
VERSION_ID=1235.12.0
BUILD_ID=2017-02-23-0222
PRETTY_NAME="Container Linux by CoreOS 1235.12.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

docker info
core@ip-10-20-2-103 ~ $ docker info
Containers: 98
Running: 78
Paused: 0
Stopped: 20
Images: 45
Server Version: 1.12.6
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge null host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.7.3-coreos-r3
Operating System: Container Linux by CoreOS 1235.12.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 36
Total Memory: 58.97 GiB
Name: ip-10-20-2-103.ec2.internal
ID: K5X2:AFJQ:IMWN:HC6V:OJ72:DWUZ:LJWI:5PAV:RITI:GK6M:QWOC:DENL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8

We are getting hans on 1235.0.0 - 1235.12.0 which encompasses coreos docker versions 1.12.3 - 1.12.6

here is stack trace from 1235.12.0 https://gist.github.com/someword/6ca8abc4782dcc6c94c8803cab32a93a

I had to pull the stack trace from journalctl. Not sure if the formatting is in the desired format.
I used the following to extract the stack trace from journalctl

journalctl -u docker --no-pager --since="17:53:48" --until="17:54:00" -o cat >/var/tmp/docker-usr1.txt

Os and docker details below

core@ip-10-20-2-103 ~ $ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.12.0
VERSION_ID=1235.12.0
BUILD_ID=2017-02-23-0222
PRETTY_NAME="Container Linux by CoreOS 1235.12.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

docker info
core@ip-10-20-2-103 ~ $ docker info
Containers: 98
Running: 78
Paused: 0
Stopped: 20
Images: 45
Server Version: 1.12.6
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge null host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.7.3-coreos-r3
Operating System: Container Linux by CoreOS 1235.12.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 36
Total Memory: 58.97 GiB
Name: ip-10-20-2-103.ec2.internal
ID: K5X2:AFJQ:IMWN:HC6V:OJ72:DWUZ:LJWI:5PAV:RITI:GK6M:QWOC:DENL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8

@bgilbert

This comment has been minimized.

Show comment
Hide comment
@bgilbert

bgilbert May 16, 2018

Member

Closing due to inactivity. Current versions of Container Linux include a much newer version of Docker. If you're still seeing this problem with current Container Linux, please open a new bug.

Member

bgilbert commented May 16, 2018

Closing due to inactivity. Current versions of Container Linux include a much newer version of Docker. If you're still seeing this problem with current Container Linux, please open a new bug.

@bgilbert bgilbert closed this May 16, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment