Docker does not catch container exit #2306

Closed
Raffo opened this Issue Dec 28, 2017 · 11 comments

Raffo commented Dec 28, 2017

Issue Report

Bug

Docker does not correctly catch the container exit.
The same issue is described in moby/moby#33820. It's unclear at this stage whether it is related to the Docker build shipped in Container Linux, which is why I am opening this here.

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1576.4.0
VERSION_ID=1576.4.0
BUILD_ID=2017-12-06-0449
PRETTY_NAME="Container Linux by CoreOS 1576.4.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

AWS EC2 m4.large

Expected Behavior

Docker should correctly catch the container exit.

Actual Behavior

Docker does not catch the container exit: an exited container can no longer be found in the process tree while it is still visible via docker ps.

Reproduction Steps

Once the problem happens:

$ docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit
exit

[nothing happens]
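A rough way to confirm the symptom (container visible in `docker ps` but gone from the process tree) can be sketched as below. The helper `shim_missing` and the shim process name `docker-containerd-shim` are assumptions for this Docker version, not part of the original report:

```shell
# Hypothetical helper: given a container ID and a process listing, report
# whether the containerd shim that should back the container is still
# present in the process tree.
shim_missing() {
  local cid="$1" ps_output="$2"
  if printf '%s\n' "$ps_output" | grep -q "shim.*${cid}"; then
    echo "shim present for ${cid}"
  else
    # Container shows up in `docker ps` but its shim is gone from the tree.
    echo "shim MISSING for ${cid}"
  fi
}

# On an affected node one might run something like:
#   for cid in $(docker ps -q --no-trunc); do
#     shim_missing "$cid" "$(ps -ef)"
#   done
```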

Other Information

Docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.4
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:24:58 2017
 OS/Arch:      linux/amd64
 Experimental: false

docker info:

Containers: 33
 Running: 26
 Paused: 0
 Stopped: 7
Images: 21
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.13.16-coreos-r2
Operating System: Container Linux by CoreOS 1576.4.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.799GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Live Restore Enabled: false

Deshke commented Jan 11, 2018

could be the same thing here

DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=1576.5.0
DISTRIB_CODENAME="Ladybug"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 1576.5.0 (Ladybug)"

PLEG is going crazy because it cannot connect to a container that does not exist anymore

Jan 11 09:48:35 ip-172-25-104-55.us-east-2.compute.internal env[1157]: time="2018-01-11T09:48:35.606000889Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 8824044248a7efceb04e1744fbd2b4ea4faa7d4b8394a8b70b29ef93e8f2d59c: rpc error: code = Unknown desc = containerd: container not found"

docker info

docker info
Containers: 106
 Running: 51
 Paused: 0
 Stopped: 55
Images: 27
Server Version: 17.09.0-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: v0.13.2 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.14.11-coreos
Operating System: Container Linux by CoreOS 1576.5.0 (Ladybug)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.43GiB
Name: ip-172-25-104-55.us-east-2.compute.internal
ID: XQ2B:UBZW:PXBF:6ME4:IFA2:H4LQ:H7WN:UUWS:AESL:FVM5:W2V6:VNZ4
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: livevideocloud
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Raffo commented Jan 11, 2018

@Deshke You can make sure that it is the same problem by running:

docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit

If your terminal just hangs, you have the same problem.


Member

lucab commented Jan 11, 2018

For reference, I tried to reproduce the ubuntu-bash-exit hang on all current versions of docker across stable (17.09.0-ce), beta (17.09.1-ce) and alpha (17.11.0-ce) without luck so far, so there may be additional environmental factors triggering this (or increasing race chances).
For those who can semi-reliably reproduce this, it may be helpful to check if this also happens on beta and alpha.

However, the original report on Debian suggests that this is a generic docker upstream issue which is better triaged on the Moby tracker; if you have additional details please follow up at moby/moby#33820. I'm keeping this ticket open to track future resolution status in CL channels.


Raffo commented Jan 11, 2018

Thanks for having a look, @lucab. I think the thing we have in common is running lots of Docker containers on the same host (and using Kubernetes). Maybe in such an environment the issue becomes more frequent? Anyway, we have currently rolled back to 1.12.06 and the issue disappeared with the same version of Container Linux.


dada941 commented Jan 11, 2018

@lucab - I couldn't reproduce it either manually.
Our k8s clusters run our CICD builds, so we schedule thousands of pods a day. We start seeing the issue after a few hours on some nodes and eventually all nodes show the symptoms.

CoreOS Beta is affected too; I'm going to put a few nodes on Alpha and will report back with the results.


Deshke commented Jan 12, 2018

@Raffo I can reproduce this on an instance that already has a zombie container running.

So far I can reproduce this with the current stable, beta, and alpha images. On the alpha image it currently takes a day until Docker is unresponsive.

[chart omitted] (instance response time = time of the PLEG health check)


chrisferry commented Feb 12, 2018

We are seeing this issue as well, which then causes PLEG issues and eventually general k8s cluster instability.
Docker version 17.09.1-ce, build 19e2cf6
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0

Going to revert to 1.12.06, hopefully that will solve the problem for now.


tyranron commented Feb 13, 2018

Same problem here. Reverting to 1.12.06 solved the PLEG issue.


Member

lucab commented Apr 11, 2018

A runc race has been recently fixed via opencontainers/runc#1698, and that has been backported to docker 17.12.1 which we are currently shipping in beta and stable channels.

I suspect it may be related to this bug and thus fixing it, but I have no way to verify that. It would be good if anybody previously affected by this could check if the issue is still present with docker 17.12.1.


Contributor

euank commented Apr 16, 2018

I agree with @lucab's assessment that this may be that runc issue that should be fixed in all current channels with docker-ce 17.12.1 and newer.

To confirm whether that's the bug you're encountering, once dockerd has hung, send it and containerd a SIGUSR1 to collect stack traces.

If that's the bug, dockerd's stack trace should include ones similar to those here.

@chrisferry since you saw an issue similar to this on 17.12.1, can you open a new issue (here or against the upstream docker project as appropriate) with the Container Linux version information + goroutine stacks?
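The stack-collection step above can be sketched as follows. This is a minimal sketch assuming root access and a systemd journal; the use of `pidof`, the process name `docker-containerd`, and the unit name `docker.service` are assumptions for this setup:

```shell
# Go daemons such as dockerd dump all goroutine stacks to their logs when
# they receive SIGUSR1, which is useful for diagnosing a hung daemon.
dump_stacks() {
  kill -USR1 "$1"
}

# On an affected node (as root), one might run something like:
#   dump_stacks "$(pidof dockerd)"
#   dump_stacks "$(pidof docker-containerd)"
#   journalctl -u docker.service --since "-5min" | grep -A20 goroutine
```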


Contributor

euank commented May 16, 2018

Per my previous comment, I'm closing this with the hope it's fixed as of docker-ce 17.12.1, and thus fixed in all channels. If you still see this issue, let us know.

euank closed this May 16, 2018
