Random containerd crashes #1909

Open
Eshakk opened this Issue Apr 11, 2017 · 28 comments

@Eshakk

Eshakk commented Apr 11, 2017

Issue Report

Bug

We are observing random containerd crashes across multiple nodes. Each node has around 8 to 10 containers. containerd spikes to ~100% CPU usage and, after some time, crashes, killing all containerd-shim processes. The node keeps working as expected while containerd is down, but as a result docker becomes unresponsive after a couple of days.

We have no idea how to reproduce this, since it happens on random nodes.
All the nodes in our network have the same configuration.

Container Linux Version

$ cat /etc/os-release 
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.6.0
VERSION_ID=1298.6.0
BUILD_ID=2017-03-14-2119
PRETTY_NAME="Container Linux by CoreOS 1298.6.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Mar 14 21:10:49 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Mar 14 21:10:49 2017
 OS/Arch:      linux/amd64

docker info:

Containers: 9
 Running: 9
 Paused: 0
 Stopped: 0
Images: 14
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.9-coreos-r1
Operating System: Container Linux by CoreOS 1298.6.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 70
Total Memory: 58.98 GiB
Name: Server01
ID: PYQ3:JC2K:TNDY:QT6J:5RCU:6V64:TRPC:E44I:5WUB:B3RW:V3VF:GLGI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

Docker and containerd journalctl logs:
sudo journalctl --no-pager -b -u docker -u containerd
journal_b_docker.txt

Docker daemon journalctl logs captured by sending SIGUSR1 to dockerd at the time of the containerd crash:
sudo kill -s SIGUSR1 <dockerd-pid>
journal_dockerd_sigusr1.txt
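
(For reference, the full commands used to capture these look roughly like the following; the pidof lookup for the dockerd PID is an assumption on my part.)

# Dump docker and containerd unit logs for the current boot.
sudo journalctl --no-pager -b -u docker -u containerd > journal_b_docker.txt
# Ask dockerd to dump its goroutine stacks into the daemon log, then collect them.
sudo kill -s SIGUSR1 $(pidof dockerd)
sudo journalctl --no-pager -b -u docker > journal_dockerd_sigusr1.txt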

If you need any specific info, I'll be happy to provide.

@euank

euank (Member) commented Apr 11, 2017

This looks like a go runtime bug (fatal error: free list corrupted).

I'm updating containerd to be built with go1.7 which should hopefully fix this.

@Eshakk

Eshakk commented Apr 13, 2017

Thank you very much for the response, @euank.
We are hitting this scenario frequently. Could you please let us know approximately when we can expect this fix in the stable channel?

@Eshakk

Eshakk commented Apr 14, 2017

@euank Another instance of this issue:
containerd_crash.txt

dockerd_with_sigusr1.txt

@euank

euank (Member) commented Apr 14, 2017

fatal error: bad g->status in ready, which is probably fixed by golang/go@b902a63

That fix is included in 1.8.x, and was also backported to 1.7.5 golang/go#18708

We should probably backport a go1.7.5 update to stable and beta for our next releases of them.

@BugRoger

BugRoger commented Apr 18, 2017

This is also a serious issue for us. We need to manually restart all affected containerd instances and clean up the resulting inconsistencies. :(

@Eshakk

Eshakk commented Apr 18, 2017

We are seeing these crashes more frequently now. As a workaround, every time we have to restart a node we have to drop all the activities currently running on it :(. We have monitoring services in place to do that, but sometimes, after a containerd crash, docker runs cleanly as if nothing changed. It takes a while for docker to become unresponsive.

@BugRoger Are you also seeing a CPU/memory spike before the containerd crash?

@BugRoger

BugRoger commented Apr 18, 2017

We're not monitoring containerd directly. Our monitoring just picks up the resulting fallout: kubelet unresponsive, node gets evicted... So no, I haven't seen that. If it's a bug in the Go runtime, a memory spike is likely a coincidence or a last hurrah.

Anecdotally, this just resolves itself by restarting containerd. I'm wondering why containerd.service doesn't restart on its own. I added /etc/systemd/system/containerd.service.d/10-restart.conf:

[Service]
Restart=always

Whether there's a deeper reasoning behind this omission, I don't know, but it helps with a segfaulting process until a proper fix lands in stable... :)
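
(To apply a drop-in like this, something along these lines should work; the RestartSec value is my own addition, not required.)

# Create the drop-in, then reload systemd so it picks up the new restart policy.
sudo mkdir -p /etc/systemd/system/containerd.service.d
sudo tee /etc/systemd/system/containerd.service.d/10-restart.conf >/dev/null <<'EOF'
[Service]
Restart=always
# Optional: wait a few seconds between restart attempts.
RestartSec=5
EOF
sudo systemctl daemon-reload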

@dm0-

dm0- (Member) commented Apr 25, 2017

Closing, since Go 1.7.5 (with this fix) was used to build containerd in the stable release that should start rolling out tomorrow. Beta and alpha will be updated later in the week. Please reopen if the problem persists.

@dm0- dm0- closed this Apr 25, 2017

@databus23

databus23 commented Apr 27, 2017

@dm0- Can we reopen this? I just got another crash on 1353.6.0, which contains the changes above.

This gist contains the full beauty: https://gist.github.com/databus23/e2a4e2e5ec12cc07667d1a58c43cd0ff

It starts with this:

Apr 26 18:52:16 containerd[2286]: runtime: newstack sp=0xc420087710 stack=[0xc420087000, 0xc4200877e0]
Apr 26 18:52:16 containerd[2286]:         morebuf={pc:0xc9cfd8 sp:0xc420087718 lr:0x0}
Apr 26 18:52:16 containerd[2286]:         sched={pc:0x444e93 sp:0xc420087710 lr:0x0 ctxt:0x0}
Apr 26 18:52:16 containerd[2286]: fatal error: runtime: stack split at bad time
Apr 26 18:52:16 containerd[2286]: fatal error: workbuf is not empty
Apr 26 18:52:16 containerd[2286]: fatal error: unexpected signal during runtime execution

@euank euank reopened this Apr 27, 2017

@zihaoyu

zihaoyu commented Apr 29, 2017

containerd just crashed on one of the minions in our production kubernetes cluster:

Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: acquirep: p->m=859539461120(126) p->status=1
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: fatal error: acquirep: invalid p state
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0
Apr 29 00:57:11 ip-10-72-34-115.us-west-2.compute.internal containerd[1738]: fatal error: workbuf is empty

@BugRoger

BugRoger commented Apr 29, 2017

This is causing catastrophic errors for us. We continually have to recover node after node by hand. Some of our applications (especially database clients) have a hard time recovering from this problem. There also seems to be a conntrack issue in Kubernetes related to this.

Any chance we can roll back to a stable containerd/docker until this can be resolved?

@dm0- dm0- removed their assignment May 1, 2017

@BugRoger

BugRoger commented May 3, 2017

Just logged another crash. Interestingly enough, with a new error this time:

May 03 16:11:34 containerd[19317]: fatal error: exitsyscall: syscall frame is no longer valid
May 03 16:11:34 containerd[19317]: runtime stack:
May 03 16:11:34 containerd[19317]: runtime.throw(0x8925d6, 0x2d)
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/panic.go:566 +0x95
May 03 16:11:34 containerd[19317]: runtime.exitsyscall.func1()
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/proc.go:2470 +0x36
May 03 16:11:34 containerd[19317]: runtime.systemstack(0xc420b1c380)
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/asm_amd64.s:298 +0x79
May 03 16:11:34 containerd[19317]: runtime.mstart()
May 03 16:11:34 containerd[19317]:         /usr/lib/go1.7/src/runtime/proc.go:1079

Full trace: https://gist.github.com/BugRoger/d4735f2390274f9bbda68d1a503bb43d

@ichekrygin ichekrygin referenced this issue in containerd/containerd May 3, 2017

Closed

containerd.service fails #789

@euank

euank (Member) commented May 4, 2017

I've been unable to reproduce this, and the stack traces don't make it obvious to me what's actually causing these crashes.

If anyone has a reproduction or more information about what sort of setup leads to this, it would be appreciated!

@BugRoger I don't think rolling back docker/containerd necessarily helps here. If these started recently, well, docker/containerd haven't changed from 1.12.x for a while, so there's not much to roll back to.

@Eshakk

Eshakk commented May 5, 2017

@euank We have only 10 containers running. Among these, one container was logging aggressively to stdout (around 150k messages per 30 seconds). We thought this might have led to the containerd crash, so we redirected the logging to a local log file and reduced the number of messages logged. This only reduced the frequency of containerd crashes. Below are the most recent crashes we have seen:

containerd_stack_trace.txt

containerd_stack_trace2.txt

If you need any further details, or anything specific (like debug logs) from just before the crash, please let me know and I'll set up the required config.

@Eshakk

Eshakk commented May 10, 2017

@euank Another instance of containerd crash:
containerd_crash_may10.txt

Could you please let me know if you need any specific info?

@euank

euank (Member) commented May 10, 2017

@Eshakk That crash references go1.6 compilation strings, which means it happened on a previous Container Linux version, right? If you're seeing these across different Container Linux versions, please include the version along with the reports.

containerd does support a --debug flag (which can be added via a drop-in for the containerd.service unit), and it's possible that will contain more helpful information.
If there's anything in dmesg (e.g. OOM messages, filesystem-related events, etc.), that would be useful to know too!

I did try some heavily logging containers, but still couldn't reproduce any crashes like these.

Is there anything else unusual about your environment? Any other information which might help reproduce this would be welcome.
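
(A rough sketch of such a drop-in, e.g. /etc/systemd/system/containerd.service.d/20-debug.conf; the ExecStart path and arguments below are placeholders. Check systemctl cat containerd on your image for the real command line and just append --debug to it.)

[Service]
# Clear the unit's ExecStart, then re-declare it with --debug appended.
ExecStart=
ExecStart=/usr/bin/containerd --debug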

@Eshakk

Eshakk commented May 16, 2017

@euank Sorry for not including the details.
I'm attaching details of the crash which happened this morning (May 16 02:44 UTC).
docker info:

Containers: 10
 Running: 10
 Paused: 0
 Stopped: 0
Images: 16
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.6.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 46
Total Memory: 58.98 GiB
Name: Server01
ID: TMCK:VRAG:6YYY:EKUK:EVF6:FOC5:DFD5:2XDR:ASSW:PXRT:5YC6:6LPJ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Apr 25 02:09:35 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue Apr 25 02:09:35 2017
 OS/Arch:      linux/amd64

In the stack trace, however, I'm seeing go1.7?

dmesg:
dmesg.txt

journalctl:
journalctl.txt

dockerd config:
dockerd_config.txt

We do not have anything unusual in our network config. We are planning to enable the debug flag for containerd and will post the debug logs as soon as we get them.

@databus23

databus23 commented May 26, 2017

From the release notes, this bug is supposedly fixed in 1395.0.0.
From this issue and all the referenced ones, it's not clear to me how it was fixed. If it was fixed, why is this bug still open?
@euank Could you shed some light on this?

@databus23

databus23 commented May 26, 2017

Ah, sorry. Now I see this was mentioned as fixed in all channels when bumping to Go 1.7, which didn't fix it. Disregard my previous comment.

@Eshakk

Eshakk commented Jun 5, 2017

@euank We have enabled the debug flag for the containerd service. Below are the logs captured from the most recent containerd crash.
docker info:

Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 17
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host overlay bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.8.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 46
Total Memory: 58.98 GiB
Name: Server01
ID: XW6K:MQM5:SLJ5:57JF:C52N:G6J4:FT3R:NGAP:7NBC:KVF3:MHYM:GEWH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

journalctl:
containerd_crash_jun01.txt

On another node, we have also seen a problem with containerd while rebooting the machines. This might help:
docker info:

Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 17
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge overlay null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp selinux
Kernel Version: 4.9.24-coreos
Operating System: Container Linux by CoreOS 1353.8.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 70
Total Memory: 58.98 GiB
Name: Server02
ID: PYQ3:JC2K:TNDY:QT6J:5RCU:6V64:TRPC:E44I:5WUB:B3RW:V3VF:GLGI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   d5236f0
 Built:        Tue May 30 23:15:08 2017
 OS/Arch:      linux/amd64

containerd_stop_sigterm_timedout.txt

Could you please point me to documentation, if any, for understanding the containerd debug information?

Please let me know if you need any further information.

@Eshakk

Eshakk commented Jun 8, 2017

@euank Could you please let me know if there is any update on this?

@databus23

databus23 commented Jun 28, 2017

@Eshakk As reported by @BugRoger, we are also seeing this bug frequently. I happened to notice that your machines have a lot of CPUs. We also have a lot of CPUs (80) on our bare-metal nodes.

Given that this seems to be a Go runtime bug and not many people seem to suffer from it, maybe it is only triggered by a sufficiently large GOMAXPROCS setting, which defaults to the number of CPUs AFAIK.

It might be worth reducing GOMAXPROCS for the containerd process and seeing if that improves things (a sketch of a drop-in for this is below).

@zihaoyu Do you also have a lot of CPUs in the nodes where you are experiencing those crashes?
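
(Something like this drop-in should do it; the file name and the value of 8 are arbitrary examples, not a recommendation.)

# /etc/systemd/system/containerd.service.d/30-gomaxprocs.conf
[Service]
# Cap the Go scheduler at 8 OS threads instead of one per CPU.
Environment=GOMAXPROCS=8

Followed by sudo systemctl daemon-reload and a restart of containerd.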

@krmayankk

krmayankk commented Aug 9, 2017

I am also seeing this on 1.12.6 on CentOS 7. Has this issue been resolved, and is the fix available in 1.13.1?

@boardcode

boardcode commented Aug 31, 2017

I am also experiencing this crash across nodes running docker version 1.12.6 on CoreOS 1353.7.0.

@Eshakk

Eshakk commented Aug 31, 2017

@boardcode Please check my last comment in this issue: https://github.com/moby/moby/issues/32537.
Unfortunately, this issue has not been successfully reproduced by anyone.

@brenix

brenix commented Sep 30, 2017

We had this issue and followed @databus23's suggestion of limiting GOMAXPROCS. After creating a systemd drop-in for containerd which sets the GOMAXPROCS environment variable to a lower value, we have not seen this issue occur again (2w+).
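
(One quick way to confirm the variable actually reached the running process; this assumes a single containerd process.)

# Print the environment of the running containerd and look for GOMAXPROCS.
sudo cat /proc/$(pidof containerd)/environ | tr '\0' '\n' | grep GOMAXPROCS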

@pfhayes

pfhayes commented Oct 5, 2017

I also encountered this issue on a machine with 16 cores, and setting GOMAXPROCS resolved it.

@euank

euank (Member) commented Nov 17, 2017

I think all the evidence here still points to this being a go runtime bug. It's possible it's golang/go#20427, which resulted in these sorts of random crashes; there are certainly some similarities here.

Our next step should be to start building containerd with a newer Go version and see if anyone still hits this.
