Containers failing to terminate and complete in containerd 1.1.4 and 1.2 #2744
We expected it to be fixed by these lines, but the problem is still occurring in 1.1.4. |
The containerd version is 1.1.4. The Kubernetes version is 1.11 (1.11.3). |
Port 10010 is the (old) fixed streaming server port that the kubelet uses to talk to the CRI plugin for various features (exec, attach, and port-forward, I think); for k8s 1.11 it doesn't have to be a fixed port anymore (see containerd/cri#858), but that is most likely unrelated to the issue. The fact that this communication broke down might be part of the issue, but I don't know enough to be sure. @Random-Liu @mikebrow |
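For context, that fixed port came from the CRI plugin's streaming server settings in containerd's config.toml; a hedged sketch of what that section looked like in the 1.1.x era (field names from memory, values illustrative, worth verifying against your version):

```toml
# /etc/containerd/config.toml (containerd 1.1.x era, hedged sketch)
[plugins.cri]
  # Address and port the CRI streaming server (exec/attach/port-forward)
  # listens on; 10010 was the old fixed default mentioned above.
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
```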
Sorry, I'm on paternity leave. I'll take a look next week. |
I believe for the first problem you need this fix:
I don't know whether the second problem is related to the first or not. |
The problem is fully on containerd's end, nothing to do with Kubernetes. The issue is that containerd, after some time, stops sending kill signals properly to the containerd-shim processes. The result is that even when crictl stop is called, it sits in an endless loop waiting for the container to terminate.
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:
Nov 13 18:26:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:20.083927741Z" level=error msg="StopContainer for "9758a90e48241" failed" error="failed to kill container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430": context canceled: unknown"
root 30841 0.0 0.1 10820 5184 ? Sl 16:56 0:00 containerd-shim -namespace k8s.io -workdir /var/data/cripersistentstorage/io.containerd.runtime.v1.linux/k8s.io/9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430 -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd
|
The containerd-shim process shown above is still around even after manually stopping the container (per the containerd logs): containerd-shim never sent the signal. |
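To make the symptom concrete, a hedged sketch of checking a wedged container from the node (container ID taken from the log above):

```sh
# The stop request hangs because the shim never delivers the signal;
# the shim process survives and the container never leaves Running.
crictl stop 9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430  # blocks, then fails with "context canceled"
crictl ps | grep 9758a90e                       # container still listed afterwards
ps aux | grep containerd-shim | grep 9758a90e   # shim process still alive
```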
This looks very much like #2438 and #2709. I expect them to be fixed in the coming 1.1.5 release by:
If not, let's revisit this. |
@Random-Liu is there a workaround in the interim? I tried setting KillMode= in the systemd unit file to mixed, and that works (it basically has systemd send the signal properly to all the containerd-shim processes, and everything gets cleaned up), but it also shuts down every container as well. I am looking for some sort of process I can run in the interim to do the reaping properly over time. |
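For reference, a sketch of the drop-in being described (path and file name assumed), including why it is a blunt instrument:

```ini
# /etc/systemd/system/containerd.service.d/10-killmode.conf (assumed path)
# KillMode=mixed sends SIGTERM to the main containerd process and, on the
# final kill, SIGKILL to everything left in the unit's cgroup, which
# includes every containerd-shim, so all containers go down with the daemon.
[Service]
KillMode=mixed
```

Applying it requires `systemctl daemon-reload` plus a containerd restart, which is exactly the "shuts down every container" side effect mentioned above.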
That sounds good about the fix, though. I will look at the PRs... |
This is also happening in 1.2 (not sure if that matters or not). |
I'm getting a test build of 1.1.5 from @estesp to see if the issue is resolved. |
cc @estesp and @Random-Liu this issue is still occurring in
Pod list
|
Kubelet log snippet and full logs
|
So far I have not been able to replicate it in 1.1.5. |
I can't find a commit
Hope it is fixed in 1.1.5. |
@Random-Liu I created a “potential” 1.2.1 for @relyt0925 to test by:
|
There could be other post-1.2.0 GA fixes I missed, but I did try to look at the list of post-1.2.0 merged PRs and nothing else jumped out at me. |
@Random-Liu wow.. that was a pretty important and embarrassing miss in my work :) @relyt0925 I can generate a new build with that included so you can test again. I’ll also create a cherry-pick PR for |
The new release looks good so far. I was able to get a node into a state where containers were failing to terminate:
I checked the kubelet logs to ensure it was hitting the problem.
Then I went in, shut containerd down, and replaced the binaries with the new test binaries @estesp gave me. I then rebooted the node and all the terminating containers were cleaned out. I'm doing more testing to ensure that didn't happen just due to the reboot, but it looks promising. |
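The swap procedure described above, as a hedged sketch (staging path assumed; the install path comes from the earlier `ps` output):

```sh
# Stop the daemon, swap in the test binaries, then reboot; the reboot itself
# clears stuck shims, so the real signal is whether pods get stuck again later.
systemctl stop kubelet
systemctl stop containerd
cp /tmp/containerd-test/containerd /usr/local/bin/containerd            # assumed staging path
cp /tmp/containerd-test/containerd-shim /usr/local/bin/containerd-shim
systemctl reboot
```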
Actually still seeing it in the 1.2.2 test release
|
|
That is the containerd version |
I’ll look into the log more today.
Does the pod need anything special in IBM Cloud? Is it possible to make a
pod yaml we can test in OSS? It would be much easier if I could deploy the
pod and reproduce the issue. :)
…On Fri, Nov 16, 2018 at 10:33 AM relyt0925 ***@***.***> wrote:
containerd github.com/containerd/containerd v1.2.0-13-g3c81b6c7 3c81b6c72fd06b39781840b93dc25b1a43b07adc
***@***.***:/#
That is the containerd version
|
It doesn't! The yamls are below
|
|
I mainly just scheduled a bunch of DaemonSets, did deletions across all of them, and then just waited to make sure they all deleted. |
|
test script run:
|
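The script itself is elided above; a minimal sketch of that kind of churn test, assuming a DaemonSet manifest `ds.yaml` whose pods carry the label `app=repro` (both names are assumptions):

```sh
#!/bin/sh
# Create and delete a DaemonSet repeatedly; a pod that sticks in
# Terminating keeps the inner wait loop spinning, exposing the bug.
for i in $(seq 1 20); do
  kubectl apply -f ds.yaml
  sleep 30
  kubectl delete -f ds.yaml
  while kubectl get pods -l app=repro 2>/dev/null | grep -q Terminating; do
    echo "still terminating (iteration $i)..."
    sleep 5
  done
done
```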
@ivan4th Sorry, I was using 1.1.5. I can reproduce with containerd 1.2.0-rc.0. I'll look into it. I'm not sure whether this is the same as the original issue, but this is definitely something we should figure out and fix. |
Here's the stack trace (obtained via
|
I guess the problem is this one:
I think the reason is that the
but the
The reason is that since containerd 1.2, for non-host-pid containers, we won't call
When
This is pretty bad. And to fix this, we should explicitly call |
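The exact call is elided above, but for context, here is a hedged sketch of an explicit kill-everything step using the containerd Go client's `WithKillAll` option (a sketch of the idea, not necessarily the shape of the actual fix; the package and function names are mine):

```go
package shimfix

import (
	"context"
	"syscall"

	"github.com/containerd/containerd"
)

// stopAll signals every process in the container rather than just the init
// process. With a shared pid namespace, init can be long gone while other
// processes keep the namespace (and the shim's wait) alive, so a plain
// SIGKILL aimed only at pid 1 never settles things.
func stopAll(ctx context.Context, task containerd.Task) error {
	return task.Kill(ctx, syscall.SIGKILL, containerd.WithKillAll)
}
```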
As a side note, I distilled the test case from the Virtlet DaemonSet, which uses
{
"metadata": {
"name": "testpod",
"uid": "0e7f9b63-d7f4-4dfb-b805-665dddd3c7c8",
"namespace": "default"
},
"log_directory": "/tmp",
"linux": {
"security_context": {
"namespace_options": {
"pid": 2
}
}
}
}
and it also did hang (checked with 1.2.1-rc0), but you may recheck just in case |
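For anyone rechecking, a hedged sketch of exercising that sandbox config directly with crictl (the file name is assumed):

```sh
# Save the JSON above as pod.json, then:
crictl runp pod.json         # create the sandbox with the pod-level pid namespace
crictl pods                  # grab the sandbox ID
crictl stopp <sandbox-id>    # with the bug present, this hangs on termination
```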
@ivan4th I see. The reason is that
So you are actually running a pod with a shared pid namespace. In that case, even if the init process
I'm going to send out a fix soon. And I think I couldn't reproduce this on my desktop because I haven't updated containerd-shim in too long. |
Thanks folks! Will try containerd master / release-1.2 with Virtlet when the changes land. |
I confirm that Virtlet works just fine after the changes, with the pod terminating correctly when Virtlet is removed from a node. Verified with the 1.2.1 release. |
This is a temporary change until new docker-ce/containerd.io package versions come out. There's a critical bug in containerd 1.2.0: containerd/containerd#2744
@Random-Liu is this one resolved now that #2860 was merged? (I see GitHub didn't auto-close the ticket for some reason.) Or are changes still needed for the 1.1 branch? |
I'm going to close this one for now. Thanks for looking into it for us @ivan4th |
I'm still seeing this for some reason. It would help if I gave a little bit of context, because I think my scenario is a little unique and can cause some confusion. I'm building a Linux distro designed to run Kubernetes. It uses containerd as the
The init container seems to be holding it up from moving out of this state; however, the container runs successfully:
I'm seeing the following in the kubelet logs after deleting
I have tried containerd 1.2.1 and 1.2.2. I saw the same thing when attempting to use flannel instead of calico. Both use init containers. Anything with an init container seems to hang. I'm not clear on how moving from
EDIT: more kubelet logs:
|
@andrewrynhard It may not be the same issue. Containerd and containerd-shim stack traces would be helpful here. You can get the stack trace by sending |
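The signal name is elided above; from what I recall of containerd in this era, SIGUSR1 makes the daemon dump goroutine stacks into its log, so a hedged sketch (treat the signal choice and log location as assumptions):

```sh
kill -USR1 "$(pidof containerd)"            # daemon writes goroutine stacks to its log
journalctl -u containerd -n 500 --no-pager  # read the dumped stacks
kill -USR1 <containerd-shim-pid>            # assumed: the v1 shim behaves similarly
```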
@Random-Liu I will work on grabbing the stack traces. In the meantime I am seeing this:
Notice
EDIT:
Still hoping there is a way to explicitly set it.
EDIT 2: |
@Random-Liu it looks like it was indeed due to the bad value |
The main issue is simply that containers will stay stuck in Terminating, or, in the case of init containers in Kubernetes, complete but just stay in the Init phase. What that causes is that the pod will never transition to its main container when that occurs (basically as if the init container ran forever). For the Terminating problem the pod just hangs around until it is force deleted.
An example of a pod that was stuck in terminating
It looks like potentially the kubelet can have problems talking to containerd, which might result in this?
The other thing we saw was an init container run fully but stay stuck in Init:0/1. I suspect containerd never fully terminated the init container even though the container exited, and therefore the pod stayed in this state.
Yaml for the pod is the following