Cannot stop the container: stop timeout #3125

Closed
sofat1989 opened this issue Mar 25, 2019 · 7 comments

@sofat1989

Containerd: v1.2.4
Error Message:

time="2019-03-25T02:07:08.210779178-07:00" level=error msg="An error occurs during waiting for container "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" to be stopped" error="wait container "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" stop timeout"

The container is OOMKilled soon after it starts. crictl ps shows the container as running, but the container has actually exited.
I checked the containerd logs; these are the relevant entries:

Mar 22 01:56:43  containerd[9598]: time="2019-03-22T01:56:43.650896850-07:00" level=info msg="StartContainer for "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3""
Mar 22 01:56:43 containerd[9598]: time="2019-03-22T01:56:43.652175173-07:00" level=info msg="shim containerd-shim started" address="/containerd-shim/k8s.io/50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3/shim.sock" debug=true pid=28058
Mar 22 01:56:44 containerd[9598]: time="2019-03-22T01:56:44.476147620-07:00" level=info msg="StartContainer for "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" returns successfully"
Mar 22 01:58:04 containerd[9598]: time="2019-03-22T01:58:04-07:00" level=error msg="post event" error="failed to publish event" namespace=k8s.io path="/run/containerd/io.containerd.runtime.v1.linux/k8s.io/50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" pid=28058
Mar 22 02:01:23 containerd[9598]: time="2019-03-22T02:01:23.970668962-07:00" level=info msg="Finish piping stderr of container "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3""
Mar 22 02:01:23  containerd[9598]: time="2019-03-22T02:01:23.970856941-07:00" level=info msg="Finish piping stdout of container "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3""
Mar 22 02:01:24 containerd[9598]: time="2019-03-22T02:01:24.013183916-07:00" level=info msg="TaskOOM event &TaskOOM{ContainerID:50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3,}"
Mar 22 02:07:16  containerd[9598]: time="2019-03-22T02:07:16.799368410-07:00" level=info msg="StopContainer for "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" with timeout 30 (s)"
Mar 22 02:07:16 containerd[9598]: time="2019-03-22T02:07:16.818108390-07:00" level=info msg="Stop container "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" with signal terminated"
Mar 22 02:07:46  containerd[9598]: time="2019-03-22T02:07:46.834222655-07:00" level=error msg="An error occurs during waiting for container "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" to be stopped" error="wait container "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" stop timeout"

This is the error that stands out:

Mar 22 01:58:04  containerd[9598]: time="2019-03-22T01:58:04-07:00" level=error msg="post event" error="failed to publish event" namespace=k8s.io path="/run/containerd/io.containerd.runtime.v1.linux/k8s.io/50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" pid=28058

I think there should be a TaskExit event here. We cannot find out why the event failed to publish. We suspect the container cannot be stopped because containerd never received the TaskExit event.

After containerd is restarted, its log shows:

Mar 25 02:07:28 containerd[25806]: time="2019-03-25T02:07:28.877990557-07:00" level=info msg="TaskExit event &TaskExit{ContainerID:50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3,ID:50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3,Pid:28097,ExitStatus:137,ExitedAt:2019-03-25 09:07:26.071348787 +0000 UTC,}"
Mar 25 02:07:36  containerd[25806]: time="2019-03-25T02:07:36.494666976-07:00" level=info msg="Container to stop "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" must be in running or unknown state, current state "CONTAINER_EXITED""
Mar 25 02:07:36  containerd[25806]: time="2019-03-25T02:07:36.516380707-07:00" level=info msg="Container to stop "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" must be in running or unknown state, current state "CONTAINER_EXITED""
Mar 25 02:07:36  containerd[25806]: time="2019-03-25T02:07:36.537179076-07:00" level=info msg="RemoveContainer for "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3""
Mar 25 02:07:36 containerd[25806]: time="2019-03-25T02:07:36.563730617-07:00" level=info msg="RemoveContainer for "50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" returns successfully"
Mar 25 02:07:36  containerd[25806]: time="2019-03-25T02:07:36.766274471-07:00" level=debug msg="removed snapshot" key="k8s.io/4711/50226b57f4b79111737460f54cc3e142631d8803efa6ecef58d808f8b56ed0a3" snapshotter=overlayfs

Q1: Under what conditions does publishing an event fail?
Q2: Why can't the container be stopped?

@estesp
Member

estesp commented Mar 25, 2019

I don't know about the failed event publish, but Q2 has at least one fairly easy answer; if we're talking about Linux (which I assume we are), then uninterruptible sleep (e.g. in-kernel I/O) is one reason for a container process to refuse to be killed, even after the 30s delay followed by SIGKILL. I have seen this with containerd and the use of NFS mounted filesystems in several cases--more detail in this 2015 blog.

Can you look at the PID mentioned on the host and validate what state it is in? Looks like it is pid 28058
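
One way to do that check, sketched in Go under the assumption of a Linux host with /proc mounted: read /proc/<pid>/stat and look at the state field, where D means uninterruptible sleep. `parseProcState` is a hypothetical helper; `ps -o stat= -p <pid>` gives the same information from the shell.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// parseProcState (hypothetical helper) extracts the one-letter state field
// from the contents of /proc/<pid>/stat. The command name is wrapped in
// parentheses and may itself contain spaces or parentheses, so the state is
// taken from the first field after the last ')'.
func parseProcState(stat string) (byte, error) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, errors.New("malformed stat line: no ')' found")
	}
	fields := strings.Fields(stat[i+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, errors.New("malformed stat line: no state field")
	}
	return fields[0][0], nil
}

func main() {
	pid := "28058" // pid from the log above; pass another pid as the first argument
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	data, err := os.ReadFile("/proc/" + pid + "/stat")
	if err != nil {
		fmt.Println("cannot read stat:", err)
		return
	}
	state, err := parseProcState(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("pid %s is in state %c (D means uninterruptible sleep)\n", pid, state)
}
```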

@keyingliu

keyingliu commented Mar 26, 2019

@estesp the container process has been killed and is gone, but the containerd-shim for this container is still there; 28058 is the pid of containerd-shim. According to ctr c ls and crictl ps, the container is still in the running state.
After restarting containerd, the containerd-shim process disappeared.

@Random-Liu
Member

Random-Liu commented Mar 28, 2019

Q1: Under what conditions does publishing an event fail?

We need to figure this out.

Based on the code, this happens when the event publishing process exits with a non-zero exit code. https://github.com/containerd/containerd/blob/master/cmd/containerd-shim/main_unix.go#L286

We should:

  1. Print the error together with stdout/stderr from the event publishing process for debugging;
  2. Retry sending the event, especially for TaskExit, which is very important in container lifecycle handling.

Q2: Why can't the container be stopped?

If containerd doesn't receive the TaskExit event, it won't know the container exited.

@sofat1989
Author

@Random-Liu We reproduced this issue. The error message from stderr is

stderr: containerd: transport is closing: unavailable

stdout is empty
the exit code is 1

@Random-Liu
Member

@sofat1989 IIRC, if containerd is down at that time, it is fine, because when containerd restarts it recovers the correct container state by querying the latest state from the shims.

However, if containerd is running but the event publish fails, containerd will never know that the container exited. In this case we need to do something.
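
To illustrate the idea of recovering state by querying instead of relying only on events, here is a hypothetical Go sketch. It is not the fix that landed in containerd/cri; the `State` type, `reconcile`, and the `query` callback are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a simplified, hypothetical container state.
type State string

const (
	Running State = "RUNNING"
	Exited  State = "EXITED"
)

// reconcile (hypothetical) asks query for the real state of every container
// that the cache still believes is running, and corrects the cache when they
// differ. It returns the sorted IDs whose cached state was corrected.
func reconcile(cached map[string]State, query func(id string) (State, error)) []string {
	var corrected []string
	for id, st := range cached {
		if st != Running {
			continue
		}
		actual, err := query(id)
		if err != nil {
			continue // state unknown; leave the cached value untouched
		}
		if actual != st {
			cached[id] = actual
			corrected = append(corrected, id)
		}
	}
	sort.Strings(corrected)
	return corrected
}

func main() {
	cached := map[string]State{"50226b57": Running, "other": Running}
	// Stand-in for asking the shim: the first container has in fact exited.
	query := func(id string) (State, error) {
		if id == "50226b57" {
			return Exited, nil
		}
		return Running, nil
	}
	fmt.Println("corrected:", reconcile(cached, query))
}
```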

@Random-Liu
Member

This should be fixed by containerd/cri#1133 and containerd/cri#1136

Closing for now. Feel free to reopen if you encounter this issue again after the next containerd upgrade.

thaJeztah added a commit to thaJeztah/docker that referenced this issue Jun 14, 2019
From the release notes: https://github.com/containerd/containerd/releases/tag/v1.2.7

> Welcome to the v1.2.7 release of containerd!
>
> The seventh patch release for containerd 1.2 introduces OCI image
> descriptor annotation support and contains fixes for containerd shim logs,
> container stop/deletion, cri plugin and selinux.
>
> It also contains several important bug fixes for goroutine and file
> descriptor leakage in containerd and containerd shims.
>
> Notable Updates
>
> - Support annotations in the OCI image descriptor, and filtering image by annotations. containerd/containerd#3254
> - Support context timeout in ttrpc which can help avoid containerd hangs when a shim is unresponsive. containerd/ttrpc#31
> - Fix a bug that containerd shim leaks goroutine and file descriptor after containerd restarts. containerd/ttrpc#37
> - Fix a bug that a container can't be deleted if first deletion attempt is canceled or timeout. containerd/containerd#3264
> - Fix a bug that containerd leaks file descriptor when using v2 containerd shims, e.g. containerd-shim-runc-v1. containerd/containerd#3273
> - Fix a bug that a container with lingering processes can't terminate when it shares pid namespace with another container. moby/moby#38978
> - Fix a bug that containerd can't read shim logs after restart. containerd/containerd#3282
> - Fix a bug that shim_debug option is not honored for existing containerd shims after containerd restarts. containerd/containerd#3283
> - cri: Fix a bug that a container can't be stopped when the exit event is not successfully published by the containerd shim. containerd/containerd#3125, containerd/containerd#3177
> - cri: Fix a bug that exec process is not cleaned up if grpc context is canceled or timeout. containerd/cri#1159
> - Fix a selinux keyring labeling issue by updating runc to v1.0.0-rc.8 and selinux library to v1.2.2. opencontainers/selinux#50
> - Update ttrpc to f82148331ad2181edea8f3f649a1f7add6c3f9c2. containerd/containerd#3316
> - Update cri to 49ca74043390bc2eeea7a45a46005fbec58a3f88. containerd/containerd#3330

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Jun 17, 2019
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: d5669ec1c6eedcd5dd8b0ecd615638934561daa4
Component: engine
thaJeztah added a commit to thaJeztah/docker that referenced this issue Sep 12, 2019
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit d5669ec)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Sep 12, 2019
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit d5669ec1c6eedcd5dd8b0ecd615638934561daa4)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: 768923199f89246ff51039ae030e4b492f8d4555
Component: engine
thaJeztah added a commit to thaJeztah/docker that referenced this issue Sep 27, 2019
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit d5669ec)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Sep 27, 2019
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit d5669ec1c6eedcd5dd8b0ecd615638934561daa4)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: 8c7928adaa83264947b6e296eeb068b99843822e
Component: engine
burnMyDread pushed a commit to burnMyDread/moby that referenced this issue Oct 21, 2019
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: zach <Zachary.Joyner@linux.com>
@huangjiasingle

containerd 1.5.11 has the same problem.
