
Containers failing to terminate and complete in containerd 1.1.4 and 1.2 #2744

Closed
relyt0925 opened this issue Oct 24, 2018 · 58 comments

@relyt0925

The main issue is that containers stay stuck in Terminating, or, in the case of Kubernetes init containers, complete but stay in the Init phase. When that happens the pod never transitions to its main container (basically as if the init container ran forever). For the Terminating problem, the pod just hangs around until it is force deleted.
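(For reference, "force deleted" here means the standard kubectl force delete, shown below with the pod name from the logs that follow; it removes the API object but does not clean up whatever is stuck underneath.)

kubectl delete pod hubperf-0 --grace-period=0 --force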

An example of a pod that was stuck in Terminating:

Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:04.146167    1747 kubelet.go:1853] SyncLoop (DELETE, "api"): "hubperf-0_default(e6166efd-d2b0-11e8-b9e2-261fec5456d5)"
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:04.146415    1747 kuberuntime_container.go:553] Killing container "containerd://fb3c811af15cf95ed631baca4b07960c223994449584f75468e80bf1b4f012f4" with 30 second grace period
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 systemd[1]: Starting Logrotate Cronjob...
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 systemd[1]: Started Logrotate Cronjob.
Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: E1018 13:06:04.373608    1747 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 10.185.18.132:31409->10.185.18.132:10010: write tcp 10.185.18.132:31409->10.185.18.132:10010: write: broken pipe
Oct 18 13:06:38 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: I1018 13:06:38.523728    1747 kubelet.go:1853] SyncLoop (DELETE, "api"): "hubperf-0_default(e6166efd-d2b0-11e8-b9e2-261fec5456d5)"

It looks like the kubelet can potentially have problems talking to containerd, which might result in this:

Oct 18 13:06:04 kube-dal12-cree61ddf4b2934227a8b166434d7403f8-w2 kubelet.service[1747]: E1018 13:06:04.373608    1747 upgradeaware.go:310] Error proxying data from client to backend: readfrom tcp 10.185.18.132:31409->10.185.18.132:10010: write tcp 10.185.18.132:31409->10.185.18.132:10010: write: broken pipe

The other behavior we saw was an init container running fully but staying stuck in Init:0/1. I suspect containerd never fully terminated the init container even though it exited, and therefore the pod stayed in this state.

kube-system     ibm-master-proxy-kcq4x                                            0/1       Init:0/1            0          15h       10.116.131.16    10.116.131.16   <none>
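One way to confirm the init container really did exit (standard kubectl; pod name taken from the listing above) is to read its status directly:

kubectl -n kube-system get pod ibm-master-proxy-kcq4x \
  -o jsonpath='{.status.initContainerStatuses[0].state}'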

The YAML for the pod is the following:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"k8s-app":"ibm-master-proxy"},"name":"ibm-master-proxy","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"ibm-master-proxy"}},"template":{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\":\"CriticalAddonsOnly\", \"operator\":\"Exists\"}]"},"labels":{"k8s-app":"ibm-master-proxy"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"ibm-cloud.kubernetes.io/ha-worker","operator":"DoesNotExist"}]}]}}},"containers":[{"command":["/bin/sh","-c","mkdir /cache \u0026\u0026 cp /scripts/applyinterfaces.sh /cache \u0026\u0026 chmod +x /cache/applyinterfaces.sh \u0026\u0026 /cache/applyinterfaces.sh \u0026\u0026 /sbin/syslogd -O /proc/1/fd/1  \u0026\u0026 haproxy -f /usr/local/etc/haproxy/haproxy.cfg -V -dR"],"image":"registry.ng.bluemix.net/armada-master/haproxy:1.8.12-alpine","imagePullPolicy":"IfNotPresent","livenessProbe":{"failureThreshold":6,"httpGet":{"host":"172.20.0.1","path":"/healthz","port":2040,"scheme":"HTTPS"},"initialDelaySeconds":10,"periodSeconds":60,"successThreshold":1,"timeoutSeconds":10},"name":"ibm-master-proxy","ports":[{"containerPort":2040,"hostPort":2040,"name":"apiserver","protocol":"TCP"},{"containerPort":2041,"hostPort":2041,"name":"etcd","protocol":"TCP"}],"resources":{"limits":{"cpu":"300m","memory":"512M"},"requests":{"cpu":"25m","memory":"32M"}},"securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/usr/local/etc/haproxy","name":"etc-config","readOnly":true},{"mountPath":"/scripts","name":"ibm-network-interfaces"},{"mountPath":"/host","name":"host-path"}]}],"hostNetwork":true,"priorityClassName":"system-cluster-critical","tolerations":[{"operator":"Exists"}],"volumes":[{"configMap":{"name":"ibm-master-proxy-config"},"name":"etc-config"},{"configMap":{"name":"ibm-network-interfaces"},"name":"ibm-network-interfaces"},{"hostPath":{"path":"/"},"name":"host-path"}]}}}}
  creationTimestamp: 2018-10-24T18:26:10Z
  generation: 1
  labels:
    k8s-app: ibm-master-proxy
  name: ibm-master-proxy
  namespace: kube-system
  resourceVersion: "79560"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/ibm-master-proxy
  uid: 47a5e5ae-d7ba-11e8-9128-ca4f5830ba32
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: ibm-master-proxy
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly",
          "operator":"Exists"}]'
      creationTimestamp: null
      labels:
        k8s-app: ibm-master-proxy
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: ibm-cloud.kubernetes.io/ha-worker
                operator: DoesNotExist
      containers:
      - command:
        - /bin/sh
        - -c
        - mkdir /cache && cp /scripts/applyinterfaces.sh /cache && chmod +x /cache/applyinterfaces.sh
          && /cache/applyinterfaces.sh && /sbin/syslogd -O /proc/1/fd/1  && haproxy
          -f /usr/local/etc/haproxy/haproxy.cfg -V -dR
        image: registry.ng.bluemix.net/armada-master/haproxy:1.8.12-alpine
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            host: 172.20.0.1
            path: /healthz
            port: 2040
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 10
        name: ibm-master-proxy
        ports:
        - containerPort: 2040
          hostPort: 2040
          name: apiserver
          protocol: TCP
        - containerPort: 2041
          hostPort: 2041
          name: etcd
          protocol: TCP
        resources:
          limits:
            cpu: 300m
            memory: 512M
          requests:
            cpu: 25m
            memory: 32M
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy
          name: etc-config
          readOnly: true
        - mountPath: /scripts
          name: ibm-network-interfaces
        - mountPath: /host
          name: host-path
      dnsPolicy: ClusterFirst
      hostNetwork: true
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: ibm-master-proxy-config
        name: etc-config
      - configMap:
          defaultMode: 420
          name: ibm-network-interfaces
        name: ibm-network-interfaces
      - hostPath:
          path: /
          type: ""
        name: host-path
  templateGeneration: 1
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
@relyt0925
Author

We expected it to be fixed by these lines:
https://github.com/containerd/cri/blob/0367114b46d71343451ec5ba53f0393a33e95691/pkg/server/container_stop.go#L118-L131

But the problem is still occurring in 1.1.4.
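For context, the stop path those lines belong to follows the usual SIGTERM-wait-SIGKILL escalation; a rough shell paraphrase of that pattern (PID and GRACE_SECONDS are placeholders, and this is not the actual Go code) looks like:

kill -TERM "$PID"
for _ in $(seq "$GRACE_SECONDS"); do
  kill -0 "$PID" 2>/dev/null || exit 0   # exited within the grace period
  sleep 1
done
kill -KILL "$PID"                        # escalate after the timeout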

@relyt0925
Author

relyt0925 commented Oct 24, 2018

The containerd version is 1.1.4.

The Kubernetes version is 1.11 (1.11.3).

@estesp
Member

estesp commented Oct 24, 2018

Port 10010 is the (old) fixed streaming server port, used by the kubelet to talk to the CRI plugin for various features (exec, attach, and port-forward, I think); for k8s 1.11 it doesn't have to be a fixed port anymore (see containerd/cri#858), but that is most likely unrelated to the issue.
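(Not from the original thread, just a standard sanity check: you can see which process is bound to that fixed port on the node.)

ss -tlnp | grep ':10010'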

The fact that this communication broke down might be part of the issue, but I don't know enough to be sure. @Random-Liu @mikebrow

@Random-Liu
Member

Sorry, I'm on paternity leave. I'll take a look next week.

@mikebrow
Member

mikebrow commented Oct 25, 2018

I believe for the first problem you need this fix:
kubernetes/kubernetes#68041

I don't know if the 2nd problem is related to the first or not.

@relyt0925 relyt0925 changed the title Containers failing to terminate and complete in containerd 1.1.4 Containers failing to terminate and complete in containerd 1.1.4 and 1.2 Nov 9, 2018
@relyt0925
Author

The problem is fully on containerd's end, nothing to do with Kubernetes.

The issue is that after some time containerd stops sending kill signals properly to the containerd-shim processes. As a result, even when crictl stop is called, it sits in an endless loop waiting for the container to terminate.

Nov 13 18:14:39 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: E1113 18:14:39.189544   10153 pod_workers.go:186] Error syncing pod fbb6fe6e-e763-11e8-a585-d63ae9c16d17 ("hello-daemonset-nhk9x_default(fbb6fe6e-e763-11e8-a585-d63ae9c16d17)"), skipping: error killing pod: [failed to "KillContainer" for "hello-kubernetes" with KillContainerError: "rpc error: code = Unknown desc = an error occurs during waiting for container \"25feb03c275a67c5a97155259dbe06ff2cc6172eddf73919e8657f8a8fa6484a\" init process to be killed: wait container \"25feb03c275a67c5a97155259dbe06ff2cc6172eddf73919e8657f8a8fa6484a\" stop timeout"
Nov 13 18:10:09 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 18:10:09.138839   10153 kubelet.go:1899] SyncLoop (DELETE, "api"): "hello-daemonset-nhk9x_default(fbb6fe6e-e763-11e8-a585-d63ae9c16d17)"
Nov 13 16:48:51 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 16:48:51.033122   10153 kubelet.go:1928] SyncLoop (PLEG): "hello-daemonset-nhk9x_default(fbb6fe6e-e763-11e8-a585-d63ae9c16d17)", event: &pleg.PodLifecycleEvent{ID:"fbb6fe6e-e763-11e8-a585-d63ae9c16d17", Type:"ContainerStarted", Data:"25feb03c275a67c5a97155259dbe06ff2cc6172eddf73919e8657f8a8fa6484a"}
Nov 13 16:48:46 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 16:48:46.023545   10153 kubelet.go:1928] SyncLoop (PLEG): "hello-daemonset-nhk9x_default(fbb6fe6e-e763-11e8-a585-d63ae9c16d17)", event: &pleg.PodLifecycleEvent{ID:"fbb6fe6e-e763-11e8-a585-d63ae9c16d17", Type:"ContainerStarted", Data:"9b133938b6584f83eb07e18187fd1d7b7dd0a8106e42d68ab3cfbc0c1a08b0fa"}
Nov 13 16:48:44 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 16:48:44.896508   10153 kuberuntime_manager.go:396] No sandbox for pod "hello-daemonset-nhk9x_default(fbb6fe6e-e763-11e8-a585-d63ae9c16d17)" can be found. Need to start a new one
Nov 13 16:48:44 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 16:48:44.875136   10153 operation_generator.go:568] MountVolume.SetUp succeeded for volume "default-token-v67cl" (UniqueName: "kubernetes.io/secret/fbb6fe6e-e763-11e8-a585-d63ae9c16d17-default-token-v67cl") pod "hello-daemonset-nhk9x" (UID: "fbb6fe6e-e763-11e8-a585-d63ae9c16d17")
Nov 13 16:48:44 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 16:48:44.843124   10153 reconciler.go:252] operationExecutor.MountVolume started for volume "default-token-v67cl" (UniqueName: "kubernetes.io/secret/fbb6fe6e-e763-11e8-a585-d63ae9c16d17-default-token-v67cl") pod "hello-daemonset-nhk9x" (UID: "fbb6fe6e-e763-11e8-a585-d63ae9c16d17")
Nov 13 16:48:44 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 16:48:44.742577   10153 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-v67cl" (UniqueName: "kubernetes.io/secret/fbb6fe6e-e763-11e8-a585-d63ae9c16d17-default-token-v67cl") pod "hello-daemonset-nhk9x" (UID: "fbb6fe6e-e763-11e8-a585-d63ae9c16d17")
Nov 13 16:48:44 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm kubelet.service[10153]: I1113 16:48:44.588640   10153 kubelet.go:1883] SyncLoop (ADD, "api"): "hello-daemonset-nhk9x_default(fbb6fe6e-e763-11e8-a585-d63ae9c16d17)"
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:~# ps aux | grep

^ Failed on that one. Note that the "wait container" call fails (stop timeout).

crictl ps
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:~# crictl ps
CONTAINER ID        IMAGE               CREATED             STATE               NAME                                ATTEMPT             POD ID
89fec9b1cad4f       efdec82af25a7       About an hour ago   Running             runon                               0                   99d779b9d65b9
9758a90e48241       efdec82af25a7       About an hour ago   Running             runon                               0                   39987b86fd957
25feb03c275a6       d9b314c4a9a57       2 hours ago         Running             hello-kubernetes                    0                   9b133938b6584
979f42892c6b7       448e9ce2f225d       2 hours ago         Running             ssh-daemonset                       0                   eb80684b3d142
b15879baae7ec       655f7833fcbf1       2 hours ago         Running             ingress-auth                        0                   cb2242a22d5c7
ec12f99f4ddc3       d4bac35d555c5       2 hours ago         Running             nginx-ingress                       0                   cb2242a22d5c7
b6a0857014837       c4729023cbd19       2 hours ago         Running             ibm-cloud-provider-ip-169-61-27-6   0                   4983f7d7df36c
9d1e8ecd36179       20282b8475a84       2 hours ago         Running             install-cni                         0                   f8ae129b3d05e
f4b5d75eb30ee       1da681c87e27c       2 hours ago         Running             keepalived-watcher                  0                   b2c356df4c317
86c52a9e831b5       986dbf23b28e3       2 hours ago         Running             calico-node                         0                   f8ae129b3d05e
4664e52037b74       9f8092b1c3428       2 hours ago         Running             fluentd                             0                   db241f0c00a50
124797d61fbde       448e9ce2f225d       2 hours ago         Running             ssh-daemonset                       0                   e2cf7038ec933
bcd09a726a6ce       da86e6ba6ca19       2 hours ago         Running             pause                               0                   15114745f4671
355eedeab2771       72ae71163d38c       2 hours ago         Running             ibm-master-proxy-static             0                   15114745f4671
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:~# crictl ps

^ crictl ps still shows the daemonset pod running

root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:~# crictl stop 89fec9b1cad4f
^C
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:~# journalctl -u containerd -r > containerd.logs
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:~# cat containerd.logs | grep 89fe
Nov 13 18:24:26 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:26.039214509Z" level=error msg="StopContainer for "89fec9b1cad4f" failed" error="failed to kill container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" init process: context canceled: unknown"
Nov 13 18:24:26 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:26.038987778Z" level=info msg="Kill container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" init process"
Nov 13 18:24:26 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:26.038948567Z" level=error msg="An error occurs during waiting for container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" to be killed" error="wait container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" is cancelled"
Nov 13 18:24:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:20.874811346Z" level=info msg="Finish piping stdout of container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f""
Nov 13 18:24:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:20.874427280Z" level=info msg="Finish piping stderr of container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f""
Nov 13 18:24:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:20.850922245Z" level=info msg="Kill container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f""
Nov 13 18:24:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:20.850882417Z" level=error msg="An error occurs during waiting for container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" to be stopped" error="wait container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" stop timeout"
Nov 13 18:24:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:10.843266380Z" level=info msg="Stop container "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" with signal terminated"
Nov 13 18:24:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:10.832816711Z" level=info msg="StopContainer for "89fec9b1cad4f" with timeout 10 (s)"
Nov 13 17:50:34 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:50:34.520269732Z" level=info msg="Exec process "30d167070067589fef825ae34aac30bec48e1d938b6cfacb0d16a55186edfed8" exits with exit code 0 and error <nil>"
Nov 13 17:50:34 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:50:34.518523514Z" level=info msg="Finish piping "stdout" of container exec "30d167070067589fef825ae34aac30bec48e1d938b6cfacb0d16a55186edfed8""
Nov 13 17:50:34 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:50:34.517934314Z" level=info msg="Finish piping "stderr" of container exec "30d167070067589fef825ae34aac30bec48e1d938b6cfacb0d16a55186edfed8""
Nov 13 17:19:32 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:19:32.185183743Z" level=info msg="StartContainer for "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f" returns successfully"
Nov 13 17:19:32 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:19:32.060435335Z" level=info msg="shim containerd-shim started" address="/containerd-shim/k8s.io/89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f/shim.sock" debug=false pid=17550
Nov 13 17:19:32 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:19:32.058818270Z" level=info msg="StartContainer for "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f""
Nov 13 17:19:32 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:19:32.057902069Z" level=info msg="CreateContainer within sandbox "99d779b9d65b9f98e40c6f1f5d2fb7c73d0a28f4b0eae726dc105560f782157e" for &ContainerMetadata{Name:runon,Attempt:0,} returns container id "89fec9b1cad4f27d229c25ec649fde5003ef7b2e30f4da2351b3c675acfc5b4f""
root@kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3:~#

^ Manually stopped it and grepped the containerd logs (containerd.logs)

Nov 13 18:26:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:20.083927741Z" level=error msg="StopContainer for "9758a90e48241" failed" error="failed to kill container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430": context canceled: unknown"
Nov 13 18:26:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:20.083787400Z" level=info msg="Kill container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430""
Nov 13 18:26:20 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:20.083718955Z" level=error msg="An error occurs during waiting for container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" to be stopped" error="wait container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" is cancelled"
Nov 13 18:26:17 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:17.673077249Z" level=info msg="Stop container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" with signal terminated"
Nov 13 18:26:17 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:17.663389712Z" level=info msg="StopContainer for "9758a90e48241" with timeout 10 (s)"
Nov 13 18:26:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:10.406541470Z" level=info msg="Kill container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" init process"
Nov 13 18:26:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:26:10.406497060Z" level=error msg="An error occurs during waiting for container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" to be killed" error="wait container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" stop timeout"
Nov 13 18:25:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:25:10.405727840Z" level=info msg="Kill container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430""
Nov 13 18:25:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:25:10.394185449Z" level=error msg="StopContainer for "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" failed" error="an error occurs during waiting for container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" init process to be killed: wait container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" stop timeout"
Nov 13 18:24:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:10.392988366Z" level=info msg="Kill container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" init process"
Nov 13 18:24:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:24:10.392941991Z" level=error msg="An error occurs during waiting for container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" to be killed" error="wait container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" stop timeout"
Nov 13 18:23:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:23:10.392340651Z" level=info msg="Kill container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430""
Nov 13 18:23:10 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:23:10.392301238Z" level=error msg="An error occurs during waiting for container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" to be stopped" error="wait container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" stop timeout"
Nov 13 18:22:40 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:22:40.402229289Z" level=info msg="Finish piping stdout of container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430""
Nov 13 18:22:40 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:22:40.402178284Z" level=info msg="Finish piping stderr of container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430""
Nov 13 18:22:40 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:22:40.382836800Z" level=info msg="Stop container "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" with signal terminated"
Nov 13 18:22:40 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T18:22:40.375070037Z" level=info msg="StopContainer for "9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430" with timeout 30 (s)"
Nov 13 17:40:24 kube-dal13-cr7e2fe902cba8449bbdb4eae11738aafb-w3.cloud.ibm containerd[8214]: time="2018-11-13T17:40:24.861975860Z" level=error msg="ReopenContainerLog for "124797d61fbde3cf9dc74296260d9a0b9ac24ffcea0ce8be30c4696d1cce6e11" failed" error="failed to create and open log file: open /var/log/pods/2318f7fc-7563-11e8-a64d-f2d338a3ff89/ssh-daemonset/0.log: no such file or directory"

^ Repeatedly fails to stop the container

root     30841  0.0  0.1  10820  5184 ?        Sl   16:56   0:00 containerd-shim -namespace k8s.io -workdir /var/data/cripersistentstorage/io.containerd.runtime.v1.linux/k8s.io/9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430 -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd

^ The containerd-shim process is still around.

Containerd-shim never sent the signal.
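A quick way to spot this state on a node (container ID taken from the logs above; a rough check, not a fix) is that the shim is still alive while the CRI still reports the container as Running:

CID=9758a90e4824184fc010e6234a9de2f0789c969e7187fbc77687f1d73280b430
ps -eo pid,args | grep '[c]ontainerd-shim' | grep "$CID"   # shim still running
crictl inspect "$CID" | grep -i state                      # still reported Running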

@Random-Liu
Member

Random-Liu commented Nov 13, 2018

@relyt0925
Author

@Random-Liu is there a workaround in the interim?

I tried setting KillMode= in the systemd unit file to mixed, and that works (it basically has systemd send the signal properly to all the containerd-shim processes, so everything gets cleaned up), but it also shuts down every container as well.

In the interim, I am looking for some sort of process I can run to do the reaping properly over time.
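For reference, the KillMode workaround above would be a systemd drop-in along these lines (paths and file name are illustrative, assuming containerd runs as a systemd unit named containerd):

mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' > /etc/systemd/system/containerd.service.d/10-killmode.conf
[Service]
# mixed: SIGTERM goes to the main process only, but the final SIGKILL hits
# every process left in the unit's cgroup, cleaning up stuck containerd-shim
# processes at the cost of killing every running container too.
KillMode=mixed
EOF
systemctl daemon-reload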

@relyt0925
Author

But that sounds good about the fix; I will look at the PRs...

@relyt0925
Author

This is happening in 1.2 as well (not sure if that matters or not).

@relyt0925
Author

I'm getting a test build of 1.1.5 from @estesp to see if the issue is resolved.

@relyt0925
Author

cc @estesp and @Random-Liu: this issue is still occurring in
containerd://1.2.0-6-g07b5d4dd

Pod list

10.177.180.202   Ready     <none>    1h        v1.12.2+IKS   169.48.174.113   Ubuntu 16.04.5 LTS   4.4.0-139-generic   containerd://1.2.0-6-g07b5d4dd
root@stage-dal10-cr516deb11a40c492084ff7f8350084063-w1:~# kubectl get pods -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/psp: ibm-privileged-psp
    creationTimestamp: 2018-11-15T22:42:40Z
    deletionGracePeriodSeconds: 30
    deletionTimestamp: 2018-11-15T22:52:46Z
    generateName: containerd-installer-ds-
    labels:
      controller-revision-hash: 7b864fdbd9
      name: containerd-installer-ds
      pod-template-generation: "1"
    name: containerd-installer-ds-s2mn9
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: DaemonSet
      name: containerd-installer-ds
      uid: a4012532-e926-11e8-83d8-727f1bfab525
    resourceVersion: "13206"
    selfLink: /api/v1/namespaces/default/pods/containerd-installer-ds-s2mn9
    uid: c23d57c0-e927-11e8-83d8-727f1bfab525
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchFields:
            - key: metadata.name
              operator: In
              values:
              - 10.177.180.202
    containers:
    - image: registry.ng.bluemix.net/armada-master/pause:3.1
      imagePullPolicy: IfNotPresent
      name: pause
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-48lwt
        readOnly: true
    dnsPolicy: ClusterFirst
    hostPID: true
    imagePullSecrets:
    - name: bluemix-default-secret
    - name: bluemix-default-secret-regional
    - name: bluemix-default-secret-international
    initContainers:
    - command:
      - /bin/bash
      - -c
      - mkdir /cache && cp /scripts/installer.sh /cache && chmod +x /cache/installer.sh
        && /bin/bash /cache/installer.sh
      image: relyt0925/containerd115
      imagePullPolicy: Always
      name: runon
      resources: {}
      securityContext:
        privileged: true
        procMount: Default
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /scripts
        name: script-config
      - mountPath: /host
        name: host-volume
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-48lwt
        readOnly: true
    nodeName: 10.177.180.202
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/disk-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      operator: Exists
    volumes:
    - hostPath:
        path: /
        type: ""
      name: host-volume
    - configMap:
        defaultMode: 420
        name: containerd-installer-cm
      name: script-config
    - name: default-token-48lwt
      secret:
        defaultMode: 420
        secretName: default-token-48lwt
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:42:42Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:42:43Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:42:43Z
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:42:40Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://6f2a23749d9746b8c8c50271a6ea5621157c5e384787758ef7afe92a3dcdf63f
      image: k8s.gcr.io/pause:3.1
      imageID: k8s.gcr.io/pause@sha256:f78411e19d84a252e53bff71a4407a5686c46983a2c2eeed83929b888179acea
      lastState: {}
      name: pause
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: 2018-11-15T22:42:42Z
    hostIP: 10.177.180.202
    initContainerStatuses:
    - containerID: containerd://3010933de4ca7766509c41cd304c0d78e49fa58d76e98abd6f08955ef71e94f8
      image: docker.io/relyt0925/containerd115:latest
      imageID: docker.io/relyt0925/containerd115@sha256:2e00108fa21dc0842aa8228c082025215e3a6f37bd0ea1d5c7487dd71c91aa5b
      lastState: {}
      name: runon
      ready: true
      restartCount: 0
      state:
        terminated:
          containerID: containerd://3010933de4ca7766509c41cd304c0d78e49fa58d76e98abd6f08955ef71e94f8
          exitCode: 0
          finishedAt: 2018-11-15T22:42:42Z
          reason: Completed
          startedAt: 2018-11-15T22:42:42Z
    phase: Running
    podIP: 172.30.254.244
    qosClass: BestEffort
    startTime: 2018-11-15T22:42:40Z
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/psp: ibm-privileged-psp
    creationTimestamp: 2018-11-15T22:51:02Z
    deletionGracePeriodSeconds: 30
    deletionTimestamp: 2018-11-15T22:51:57Z
    generateName: ssh-daemonset-
    labels:
      controller-revision-hash: 8c588549b
      name: ssh-daemonset
      pod-template-generation: "1"
    name: ssh-daemonset-fl94q
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: DaemonSet
      name: ssh-daemonset
      uid: 057f0276-e927-11e8-9de7-16dd5373ae32
    resourceVersion: "13136"
    selfLink: /api/v1/namespaces/default/pods/ssh-daemonset-fl94q
    uid: ed16204b-e928-11e8-83d8-727f1bfab525
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchFields:
            - key: metadata.name
              operator: In
              values:
              - 10.177.180.202
    containers:
    - image: kitch/sshdaemonset
      imagePullPolicy: Always
      name: ssh-daemonset
      resources: {}
      securityContext:
        privileged: true
        procMount: Default
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /host
        name: host-volume
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-48lwt
        readOnly: true
    dnsPolicy: ClusterFirst
    hostIPC: true
    hostNetwork: true
    hostPID: true
    imagePullSecrets:
    - name: bluemix-default-secret
    - name: bluemix-default-secret-regional
    - name: bluemix-default-secret-international
    nodeName: 10.177.180.202
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/disk-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/network-unavailable
      operator: Exists
    volumes:
    - hostPath:
        path: /
        type: ""
      name: host-volume
    - name: default-token-48lwt
      secret:
        defaultMode: 420
        secretName: default-token-48lwt
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:51:02Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:51:03Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:51:03Z
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: 2018-11-15T22:51:02Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://50002c571090f3b737ef203ef98282fde331cec7c51bdedc959db483b7acb6f2
      image: docker.io/kitch/sshdaemonset:latest
      imageID: docker.io/kitch/sshdaemonset@sha256:8dc9a25b9843309b51c798343574840f4c9a5253cc09c64e335cdbc009a0e6b1
      lastState: {}
      name: ssh-daemonset
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: 2018-11-15T22:51:03Z
    hostIP: 10.177.180.202
    phase: Running
    podIP: 10.177.180.202
    qosClass: BestEffort
    startTime: 2018-11-15T22:51:02Z
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
root@stage-dal10-cr516deb11a40c492084ff7f8350084063-w1:~# 

@relyt0925
Author

root@stage-dal10-cr516deb11a40c492084ff7f8350084063-w1:~# kubectl get pods
NAME                            READY     STATUS        RESTARTS   AGE
containerd-installer-ds-s2mn9   1/1       Terminating   0          15m
ssh-daemonset-fl94q             1/1       Terminating   0          7m
root@stage-dal10-cr516deb11a40c492084ff7f8350084063-w1:~# 

Kubelet log snippet and full logs:

root@stage-dal10-cr516deb11a40c492084ff7f8350084063-w1:~# cat kubelet | grep ssh-daemonset-fl94q
Nov 15 22:55:57 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: E1115 22:55:57.411896    1860 pod_workers.go:186] Error syncing pod ed16204b-e928-11e8-83d8-727f1bfab525 ("ssh-daemonset-fl94q_default(ed16204b-e928-11e8-83d8-727f1bfab525)"), skipping: error killing pod: [failed to "KillContainer" for "ssh-daemonset" with KillContainerError: "rpc error: code = Unknown desc = an error occurs during waiting for container \"50002c571090f3b737ef203ef98282fde331cec7c51bdedc959db483b7acb6f2\" init process to be killed: wait container \"50002c571090f3b737ef203ef98282fde331cec7c51bdedc959db483b7acb6f2\" stop timeout"
Nov 15 22:51:27 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:27.369413    1860 kubelet.go:1899] SyncLoop (DELETE, "api"): "ssh-daemonset-fl94q_default(ed16204b-e928-11e8-83d8-727f1bfab525)"
Nov 15 22:51:03 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:03.421191    1860 kubelet.go:1928] SyncLoop (PLEG): "ssh-daemonset-fl94q_default(ed16204b-e928-11e8-83d8-727f1bfab525)", event: &pleg.PodLifecycleEvent{ID:"ed16204b-e928-11e8-83d8-727f1bfab525", Type:"ContainerStarted", Data:"4428fe9c76e226957d524b51707a00605548d49ec7d49b746de7b3ce38583411"}
Nov 15 22:51:03 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:03.421123    1860 kubelet.go:1928] SyncLoop (PLEG): "ssh-daemonset-fl94q_default(ed16204b-e928-11e8-83d8-727f1bfab525)", event: &pleg.PodLifecycleEvent{ID:"ed16204b-e928-11e8-83d8-727f1bfab525", Type:"ContainerStarted", Data:"50002c571090f3b737ef203ef98282fde331cec7c51bdedc959db483b7acb6f2"}
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.388015    1860 kuberuntime_manager.go:396] No sandbox for pod "ssh-daemonset-fl94q_default(ed16204b-e928-11e8-83d8-727f1bfab525)" can be found. Need to start a new one
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.369029    1860 operation_generator.go:568] MountVolume.SetUp succeeded for volume "default-token-48lwt" (UniqueName: "kubernetes.io/secret/ed16204b-e928-11e8-83d8-727f1bfab525-default-token-48lwt") pod "ssh-daemonset-fl94q" (UID: "ed16204b-e928-11e8-83d8-727f1bfab525")
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.339526    1860 operation_generator.go:568] MountVolume.SetUp succeeded for volume "host-volume" (UniqueName: "kubernetes.io/host-path/ed16204b-e928-11e8-83d8-727f1bfab525-host-volume") pod "ssh-daemonset-fl94q" (UID: "ed16204b-e928-11e8-83d8-727f1bfab525")
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.339449    1860 reconciler.go:252] operationExecutor.MountVolume started for volume "host-volume" (UniqueName: "kubernetes.io/host-path/ed16204b-e928-11e8-83d8-727f1bfab525-host-volume") pod "ssh-daemonset-fl94q" (UID: "ed16204b-e928-11e8-83d8-727f1bfab525")
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.339367    1860 reconciler.go:252] operationExecutor.MountVolume started for volume "default-token-48lwt" (UniqueName: "kubernetes.io/secret/ed16204b-e928-11e8-83d8-727f1bfab525-default-token-48lwt") pod "ssh-daemonset-fl94q" (UID: "ed16204b-e928-11e8-83d8-727f1bfab525")
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.238752    1860 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "host-volume" (UniqueName: "kubernetes.io/host-path/ed16204b-e928-11e8-83d8-727f1bfab525-host-volume") pod "ssh-daemonset-fl94q" (UID: "ed16204b-e928-11e8-83d8-727f1bfab525")
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.238660    1860 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-48lwt" (UniqueName: "kubernetes.io/secret/ed16204b-e928-11e8-83d8-727f1bfab525-default-token-48lwt") pod "ssh-daemonset-fl94q" (UID: "ed16204b-e928-11e8-83d8-727f1bfab525")
Nov 15 22:51:02 stage-dal10-cr516deb11a40c492084ff7f8350084063-w1.cloud.ibm kubelet.service[1860]: I1115 22:51:02.079383    1860 kubelet.go:1883] SyncLoop (ADD, "api"): "ssh-daemonset-fl94q_default(ed16204b-e928-11e8-83d8-727f1bfab525)"
root@stage-dal10-cr516deb11a40c492084ff7f8350084063-w1:~# containerd -v
containerd github.com/containerd/containerd v1.2.0-6-g07b5d4dd 07b5d4ddb32202f6f5d0488313d563e68c23360b
root@stage-dal10-cr516deb11a40c492084ff7f8350084063-w1:~# 

@relyt0925

KubeletLogs.txt

@relyt0925

So far I have not been able to replicate in 1.1.5

@Random-Liu

containerd github.com/containerd/containerd v1.2.0-6-g07b5d4dd 07b5d4ddb32202f6f5d0488313d563e68c23360b

I can't find commit 07b5d4ddb32202f6f5d0488313d563e68c23360b in upstream. And please note that even containerd 1.2.0 doesn't have all the required fixes; we need to cherry-pick those fixes and cut a containerd 1.2.1 release.

So far I have not been able to replicate in 1.1.5

Hope it is fixed in 1.1.5.

@estesp

estesp commented Nov 16, 2018

@Random-Liu I created a “potential” 1.2.1 for @relyt0925 to test by:

@estesp

estesp commented Nov 16, 2018

There could be other post-1.2.0 GA fixes I missed, but I did look through the list of post-1.2.0 merged PRs and nothing else jumped out at me.

@Random-Liu

@estesp Have you cherry-picked #2743? That is one of the most important fixes.

@estesp

estesp commented Nov 16, 2018

@Random-Liu wow, that was a pretty important and embarrassing miss in my work :) @relyt0925 I can generate a new build with that included so you can test again.

I’ll also create a cherry-pick PR for the release/1.2 branch.

@relyt0925

The new release looks good so far. Before updating, I was able to get a node into a state where containers were failing to terminate:

root@stage-dal10-cre695783dbebb402c87a69de606d92841-w1:~# kubectl get pods
NAME                            READY     STATUS        RESTARTS   AGE
containerd-installer-ds-42q9m   1/1       Running       0          5m
hello-daemonset-h5nk6           1/1       Running       1          18m
nsenter-ds-rkrwn                1/1       Running       1          13m
nsenter-nohostip-fzlsd          0/1       Terminating   1          27m
ssh-daemonset-6dj5l             0/1       Terminating   1          27m
ssh-daemonsett-7vd2g            1/1       Running       1          7m
root@stage-dal10-cre695783dbebb402c87a69de606d92841-w1:~# kubectl get pods

I checked the kubelet logs to confirm it was hitting the problem:
journalctl -u kubelet -r

ring waiting for container "335060a7c6a280d21ec714537bb6c3e3fdc95d7c4714d500356f7830688929c5" init process to be killed: wait container "335060a7c6a280d21ec714537bb6c3e3fd
"335060a7c6a280d21ec714537bb6c3e3fdc95d7c4714d500356f7830688929c5" init process to be killed: wait container "335060a7c6a280d21ec714537bb6c3e3fdc95d7c4714d500356f783068892
pods/dc84ff19-e9aa-11e8-bad4-92321bac2769/ansible-playbook-runner with error exit status 1
open container log failed: rpc error: code = Unknown desc = failed to create and open log file: open /var/log/pods/dc84ff19-e9aa-11e8-bad4-92321bac2769/ansible-playbook-ru
 /var/log/pods/dc84ff19-e9aa-11e8-bad4-92321bac2769/ansible-playbook-runner/1.log: no such file or directory
pods/dc84ff19-e9aa-11e8-bad4-92321bac2769/ansible-playbook-runner with error exit status 1
ring waiting for container "c780d2a3363be3ade7f0caebe69e9c564520fe132628ea23968e46d6365d929a" init process to be killed: wait container "c780d2a3363be3ade7f0caebe69e9c5645
"c780d2a3363be3ade7f0caebe69e9c564520fe132628ea23968e46d6365d929a" init process to be killed: wait container "c780d2a3363be3ade7f0caebe69e9c564520fe132628ea23968e46d6365d9
ring waiting for container "4fe08c24f6886ec6bd76929a9795ac75e011ee46339891c76787a99e44d389d2" init process to be killed: wait container "4fe08c24f6886ec6bd76929a9795ac75e0
"4fe08c24f6886ec6bd76929a9795ac75e011ee46339891c76787a99e44d389d2" init process to be killed: wait container "4fe08c24f6886ec6bd76929a9795ac75e011ee46339891c76787a99e44d38
pods/dc84ff19-e9aa-11e8-bad4-92321bac2769/ansible-playbook-runner with error exit status 1
open container log failed: rpc error: code = Unknown desc = failed to create and open log file: open /var/log/pods/dc84ff19-e9aa-11e8-bad4-92321bac2769/ansible-playbook-ru

Then I went in, shut containerd down, and replaced the binaries with the new test binaries @estesp gave me. I then rebooted the node and all the terminating containers were cleaned out. I'm doing more testing to ensure that didn't happen just due to the reboot, but it looks promising.

@relyt0925

Actually, I'm still seeing it in the 1.2.2 test release:

(slmonitor) Tylers-MBP-2:containerdmanualinstall tlisowski3$ kubectl get pods
NAME                            READY     STATUS        RESTARTS   AGE
containerd-installer-ds-2r2ws   0/1       Init:0/1      0          17m
hello-daemonset-thktp           1/1       Running       0          15m
nsenter-ds-pb98z                1/1       Running       0          15m
nsenter-nohostip-lcv64          1/1       Terminating   0          21m
ssh-daemonset-m65qh             1/1       Terminating   0          21m
ssh-daemonsett-7vd2g            1/1       Terminating   1          29m
(slmonitor) Tylers-MBP-2:containerdmanualinstall tlisowski3$ 

112_containerd_logs.txt
112_release_kubelet_logs.txt

@relyt0925

  Type     Reason         Age                From                    Message
  ----     ------         ----               ----                    -------
  Normal   Scheduled      22m                default-scheduler       Successfully assigned default/ssh-daemonset-m65qh to 10.93.252.184
  Normal   Pulling        22m                kubelet, 10.93.252.184  pulling image "kitch/sshdaemonset"
  Normal   Pulled         22m                kubelet, 10.93.252.184  Successfully pulled image "kitch/sshdaemonset"
  Normal   Created        22m                kubelet, 10.93.252.184  Created container
  Normal   Started        22m                kubelet, 10.93.252.184  Started container
  Normal   Killing        2m (x4 over 16m)   kubelet, 10.93.252.184  Killing container with id containerd://ssh-daemonset:Need to kill Pod
  Warning  FailedKillPod  37s (x4 over 14m)  kubelet, 10.93.252.184  error killing pod: [failed to "KillContainer" for "ssh-daemonset" with KillContainerError: "rpc error: code = Unknown desc = an error occurs during waiting for container \"e2a2250575a73db4d6689b53d18cc53d326bbecff03586b0e2307a6ce3350fa4\" init process to be killed: wait container \"e2a2250575a73db4d6689b53d18cc53d326bbecff03586b0e2307a6ce3350fa4\" stop timeout"
, failed to "KillPodSandbox" for "cc0e4a9b-e9c0-11e8-bad4-92321bac2769" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container \"e2a2250575a73db4d6689b53d18cc53d326bbecff03586b0e2307a6ce3350fa4\": an error occurs during waiting for container \"e2a2250575a73db4d6689b53d18cc53d326bbecff03586b0e2307a6ce3350fa4\" init process to be killed: wait container \"e2a2250575a73db4d6689b53d18cc53d326bbecff03586b0e2307a6ce3350fa4\" stop timeout"

@relyt0925

containerd github.com/containerd/containerd v1.2.0-13-g3c81b6c7 3c81b6c72fd06b39781840b93dc25b1a43b07adc
root@nsenter-ds-pb98z:/# 

That is the containerd version

@Random-Liu

Random-Liu commented Nov 16, 2018 via email

@relyt0925

It doesn't! The YAMLs are below:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    name: nsenter-ds
  name: nsenter-ds
spec:
  template:
    metadata:
      labels:
        name: nsenter-ds
    spec:
      tolerations:
      - operator: "Exists"
      hostPID: true
      containers:
      - name: runon
        image: "debian:jessie"
        command: ['/bin/bash', '-c', 'sleep 100000']
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /host
          name: host-volume
      volumes:
      - name: host-volume
        hostPath:
          # directory location on host
          path: /
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    name: nsenter-ds-nohostip
  name: nsenter-nohostip
spec:
  template:
    metadata:
      labels:
        name: nsenter-ds-nohostip
    spec:
      tolerations:
      - operator: "Exists"
      containers:
      - name: runon
        image: "debian:jessie"
        command: ['/bin/bash', '-c', 'sleep 100000']
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /host
          name: host-volume
      volumes:
      - name: host-volume
        hostPath:
          # directory location on host
          path: /

@relyt0925

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    name: mtr-daemonset
  name: mtr-daemonset
spec:
  template:
    metadata:
      labels:
        name: mtr-daemonset
    spec:
      containers:
      - image: relyt0925/network-debugger
        name: network-debugger
        command: ["/bin/sh"]
        args: ["-c", "mtr -rwc 200 -i 0.5 -rw iam.ng.bluemix.net; sleep 7200;"]

@relyt0925

I mainly just scheduled a bunch of daemonsets, deleted pods across all of them, and then waited to make sure they all actually got deleted (a rough sketch of that loop is below).
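
This is just a minimal sketch of that loop, assuming a recent client-go; the clientset wiring, label selector, and two-minute timeout are illustrative, not from the original report.

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deleteAndWait(ctx context.Context, cs *kubernetes.Clientset, ns, selector string) error {
	pods := cs.CoreV1().Pods(ns)

	// Delete every pod matching the daemonset's label selector.
	list, err := pods.List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, p := range list.Items {
		if err := pods.Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}

	// Poll until nothing is left stuck in Terminating.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		left, err := pods.List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return err
		}
		if len(left.Items) == 0 {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("pods matching %q did not terminate in time", selector)
}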

@ivan4th

ivan4th commented Dec 3, 2018

@Random-Liu

INFO[2018-12-03T21:09:13.471760732Z] StopContainer for "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" with timeout 10 (s)
INFO[2018-12-03T21:09:13.480194210Z] Stop container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" with signal terminated
INFO[2018-12-03T21:09:13.541933086Z] TaskExit event &TaskExit{ContainerID:99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33,ID:99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33,Pid:3602,ExitStatus:143,ExitedAt:2018-12-03 21:09:13.487617708 +0000 UTC,}
ERRO[2018-12-03T21:09:23.488485051Z] An error occurs during waiting for container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" to be stopped  error="wait container \"99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33\" stop timeout"
INFO[2018-12-03T21:09:23.488890401Z] Kill container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33"
ERRO[2018-12-03T21:10:23.490699109Z] An error occurs during waiting for container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" to be killed  error="wait container \"99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33\" stop timeout"
INFO[2018-12-03T21:10:23.491367704Z] Kill container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" init process
ERRO[2018-12-03T21:11:23.492601128Z] StopContainer for "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" failed  error="an error occurs during waiting for container \"99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33\" init process to be killed: wait container \"99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33\" stop timeout"
^CINFO[2018-12-03T21:13:54.014785381Z] Stop CRI service
root@ubuntu-vm:~# containerd --version
containerd github.com/containerd/containerd v1.2.0-112-g2175efcf 2175efcf99f548bb662d32a3b58cdaeab1375eac

@ivan4th

ivan4th commented Dec 3, 2018

test script run:

root@ubuntu-vm:~# ./test-containerd.sh
Image is up to date for sha256:a51debf7e1eb2018400cef1e4b01f2e9f591f6c739de7b5d6c142f954f3715a7
99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33
pod id fac245b86a5e9683ba89a5910b0cc8595f15c738802615f17950ca0207300806 container 99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33
crictl stop will now hang. Kill 'sleep Infinity' to continue
FATA[0130] Stopping the container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" failed: rpc error: code = Unknown desc = an error occurs during waiting for container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" init process to be killed: wait container "99a97d5c709457dfd4b281869e7c40b15dcc6d86617628210b90c5054c9b6c33" stop timeout
root@ubuntu-vm:~# ps aux|grep sleep
root      3619  0.0  0.0   4376   648 ?        S    21:09   0:00 sleep 600

@Random-Liu

Random-Liu commented Dec 3, 2018

@ivan4th Sorry, I was using 1.1.5. I can reproduce with containerd 1.2.0-rc.0.

I'll look into it. I'm not sure whether this is the same as the original issue, but it is definitely something we should figure out and fix.

@ivan4th

ivan4th commented Dec 3, 2018

Here's a stack trace (obtained via kill -USR1) from a hung containerd (master, 2175efc), if this helps:

=== BEGIN goroutine stack dump ===
goroutine 21 [running]:
github.com/containerd/containerd/cmd/containerd/command.dumpStacks()
	/go/src/github.com/containerd/containerd/cmd/containerd/command/main.go:272 +0x8a
github.com/containerd/containerd/cmd/containerd/command.handleSignals.func1(0xc0004f71a0, 0xc0004f7140, 0x558bdc0a9660, 0xc00003e0a8, 0xc0000887e0)
	/go/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:51 +0x25a
created by github.com/containerd/containerd/cmd/containerd/command.handleSignals
	/go/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:41 +0x8b

goroutine 1 [chan receive]:
github.com/containerd/containerd/cmd/containerd/command.App.func1(0xc0004fa6e0, 0xc0004fa6e0, 0xc000535d17)
	/go/src/github.com/containerd/containerd/cmd/containerd/command/main.go:199 +0x927
github.com/containerd/containerd/vendor/github.com/urfave/cli.HandleAction(0x558bdbe83f00, 0x558bdc082ad0, 0xc0004fa6e0, 0xc0004f70e0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:502 +0xca
github.com/containerd/containerd/vendor/github.com/urfave/cli.(*App).Run(0xc000352540, 0xc00003a1b0, 0x1, 0x1, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:268 +0x5b9
main.main()
	github.com/containerd/containerd/cmd/containerd/main.go:33 +0x51

goroutine 5 [syscall]:
os/signal.signal_recv(0x558bdc098fe0)
	/usr/local/go/src/runtime/sigqueue.go:139 +0x9e
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:23 +0x24
created by os/signal.init.0
	/usr/local/go/src/os/signal/signal_unix.go:29 +0x43

goroutine 6 [chan receive]:
github.com/containerd/containerd/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x558bdcb2af00)
	/go/src/github.com/containerd/containerd/vendor/github.com/golang/glog/glog.go:879 +0x8d
created by github.com/containerd/containerd/vendor/github.com/golang/glog.init.0
	/go/src/github.com/containerd/containerd/vendor/github.com/golang/glog/glog.go:410 +0x205

goroutine 38 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265f00, 0x72, 0xc00055fbb0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc00034a418, 0x72, 0xffffffffffffff00, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc00034a418, 0xc0002fe000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc00034a400, 0xc0002fe000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
net.(*netFD).Read(0xc00034a400, 0xc0002fe000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc00000e190, 0xc0002fe000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x6a
bufio.(*Reader).Read(0xc0004f7c80, 0xc0004ff720, 0xa, 0xa, 0x0, 0xc00005fdf0, 0x558bdab24f9b)
	/usr/local/go/src/bufio/bufio.go:216 +0x231
io.ReadAtLeast(0x558bdc08e7a0, 0xc0004f7c80, 0xc0004ff720, 0xa, 0xa, 0xa, 0x558bdab736d0, 0xc000000300, 0x4)
	/usr/local/go/src/io/io.go:310 +0x8a
io.ReadFull(0x558bdc08e7a0, 0xc0004f7c80, 0xc0004ff720, 0xa, 0xa, 0x558bdab484f1, 0x558bdc0858c8, 0x558bdb9824f8)
	/usr/local/go/src/io/io.go:329 +0x5a
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.readMessageHeader(0xc0004ff720, 0xa, 0xa, 0x558bdc08e7a0, 0xc0004f7c80, 0x558bdab1e058, 0x558bdab73020, 0x8, 0xc00005fe50)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:54 +0x62
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*channel).recv(0xc0004ff700, 0x558bdc0a9660, 0xc00003e0c0, 0xc00005ff40, 0x1, 0x558bdaeb5450, 0x558bdb88fc01, 0xc00005ff68, 0x558bdaeb5450, 0xc000088120)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:102 +0x6f
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run.func1(0xc000089d40, 0xc0004f7ce0, 0xc000539620, 0xc000089ce0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:164 +0x8d
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:159 +0x13f

goroutine 23 [select]:
github.com/containerd/containerd/vendor/github.com/docker/go-events.(*Broadcaster).run(0xc000508af0)
	/go/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:117 +0x19d
created by github.com/containerd/containerd/vendor/github.com/docker/go-events.NewBroadcaster
	/go/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:39 +0x1b0

goroutine 25 [select]:
github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0004f7a40, 0x558bdc0a9660, 0xc00003e0a8)
	/go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:268 +0x1c2
created by github.com/containerd/containerd/gc/scheduler.init.0.func1
	/go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x488

goroutine 26 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run(0xc0004f7ce0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:191 +0x2b7
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.NewClient
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:62 +0x12e

goroutine 40 [select]:
github.com/containerd/containerd/events/exchange.(*Exchange).Subscribe.func3(0xc00053cab0, 0xc000209780, 0xc000088240, 0x558bdc0a9620, 0xc0002ef040, 0xc00007e180)
	/go/src/github.com/containerd/containerd/events/exchange/exchange.go:171 +0x130
created by github.com/containerd/containerd/events/exchange.(*Exchange).Subscribe
	/go/src/github.com/containerd/containerd/events/exchange/exchange.go:165 +0x25e

goroutine 29 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265e30, 0x72, 0xc00005b520)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc0004f7e78, 0x72, 0xffffffffffffff01, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc0004f7e78, 0xc000560001, 0x8000, 0x8000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc0004f7e60, 0xc000560000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
os.(*File).read(0xc00000e1a8, 0xc000560000, 0x8000, 0x8000, 0x7f24c32c1001, 0x0, 0xc00005b628)
	/usr/local/go/src/os/file_unix.go:249 +0x50
os.(*File).Read(0xc00000e1a8, 0xc000560000, 0x8000, 0x8000, 0xc00016bc00, 0x558bdbd82503, 0x7f24c32c1000)
	/usr/local/go/src/os/file.go:108 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/fifo.(*fifo).Read(0xc0004f7da0, 0xc000560000, 0x8000, 0x8000, 0x8000, 0x8000, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:154 +0x1f0
io.copyBuffer(0x558bdc0905a0, 0xc00000e020, 0x7f24c3273de0, 0xc0004f7da0, 0xc000560000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/io.go:402 +0x127
io.Copy(0x558bdc0905a0, 0xc00000e020, 0x7f24c3273de0, 0xc0004f7da0, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/io.go:364 +0x5c
created by github.com/containerd/containerd/runtime/v1/linux.(*Runtime).loadTasks
	/go/src/github.com/containerd/containerd/runtime/v1/linux/runtime.go:357 +0x100f

goroutine 39 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0005561d0, 0xc000000004)
	/usr/local/go/src/runtime/sema.go:510 +0xef
sync.(*Cond).Wait(0xc0005561c0)
	/usr/local/go/src/sync/cond.go:56 +0x94
github.com/containerd/containerd/vendor/github.com/docker/go-events.(*Queue).next(0xc00053ca50, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/queue.go:103 +0x82
github.com/containerd/containerd/vendor/github.com/docker/go-events.(*Queue).run(0xc00053ca50)
	/go/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/queue.go:68 +0x60
created by github.com/containerd/containerd/vendor/github.com/docker/go-events.NewQueue
	/go/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/queue.go:29 +0x106

goroutine 32 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265d60, 0x72, 0xc000392520)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc0004f7f98, 0x72, 0xffffffffffffff01, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc0004f7f98, 0xc000568001, 0x8000, 0x8000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc0004f7f80, 0xc000568000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
os.(*File).read(0xc00000e1b8, 0xc000568000, 0x8000, 0x8000, 0x7f24c32c1001, 0x0, 0xc000392628)
	/usr/local/go/src/os/file_unix.go:249 +0x50
os.(*File).Read(0xc00000e1b8, 0xc000568000, 0x8000, 0x8000, 0xc00016bc00, 0x558bdbef2b80, 0x7f24c32c1000)
	/usr/local/go/src/os/file.go:108 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/fifo.(*fifo).Read(0xc0004f7f20, 0xc000568000, 0x8000, 0x8000, 0x8000, 0x8000, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:154 +0x1f0
io.copyBuffer(0x558bdc0905a0, 0xc00000e028, 0x7f24c3273de0, 0xc0004f7f20, 0xc000568000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/io.go:402 +0x127
io.Copy(0x558bdc0905a0, 0xc00000e028, 0x7f24c3273de0, 0xc0004f7f20, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/io.go:364 +0x5c
created by github.com/containerd/containerd/runtime/v1/linux.(*Runtime).loadTasks
	/go/src/github.com/containerd/containerd/runtime/v1/linux/runtime.go:368 +0x1471

goroutine 33 [syscall]:
syscall.Syscall6(0xe8, 0xa, 0xc00006a9b8, 0x80, 0xffffffffffffffff, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
github.com/containerd/containerd/vendor/golang.org/x/sys/unix.EpollWait(0xa, 0xc00006a9b8, 0x80, 0x80, 0xffffffffffffffff, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux_amd64.go:1520 +0x74
github.com/containerd/containerd/metrics/cgroups.(*oomCollector).start(0xc000165aa0)
	/go/src/github.com/containerd/containerd/metrics/cgroups/oom.go:114 +0x7d
created by github.com/containerd/containerd/metrics/cgroups.newOOMCollector
	/go/src/github.com/containerd/containerd/metrics/cgroups/oom.go:50 +0xed

goroutine 35 [sleep]:
time.Sleep(0x2540be400)
	/usr/local/go/src/runtime/time.go:105 +0x155
github.com/containerd/containerd/runtime/restart/monitor.(*monitor).run(0xc00000e210, 0x2540be400)
	/go/src/github.com/containerd/containerd/runtime/restart/monitor/monitor.go:155 +0x4e
created by github.com/containerd/containerd/runtime/restart/monitor.init.0.func1
	/go/src/github.com/containerd/containerd/runtime/restart/monitor/monitor.go:85 +0x12b

goroutine 36 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*criService).Run(0xc0003fc6c0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/service.go:214 +0x402
github.com/containerd/containerd/vendor/github.com/containerd/cri.initCRIService.func1(0x558bdc0bb0c0, 0xc0003fc6c0, 0x558bdc0a9660, 0xc00003e0a8)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/cri.go:103 +0x3a
created by github.com/containerd/containerd/vendor/github.com/containerd/cri.initCRIService
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/cri.go:102 +0x63d

goroutine 37 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265c90, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc00034b798, 0x72, 0xc0002ee400, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc00034b798, 0xffffffffffffff00, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Accept(0xc00034b780, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:384 +0x1a2
net.(*netFD).accept(0xc00034b780, 0xc00082a798, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:238 +0x44
net.(*UnixListener).accept(0xc000112ae0, 0xc00006ddc8, 0xc00006ddd0, 0x18)
	/usr/local/go/src/net/unixsock_posix.go:162 +0x34
net.(*UnixListener).Accept(0xc000112ae0, 0x558bdc0848f8, 0xc000352fc0, 0x558bdc0b1da0, 0xc00082a798)
	/usr/local/go/src/net/unixsock.go:257 +0x49
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).Serve(0xc000352fc0, 0x558bdc0a80a0, 0xc000112ae0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:544 +0x212
github.com/containerd/containerd/services/server.(*Server).ServeGRPC(0xc00041ecf0, 0x558bdc0a80a0, 0xc000112ae0, 0x18, 0xc000394738)
	/go/src/github.com/containerd/containerd/services/server/server.go:172 +0x6b
github.com/containerd/containerd/services/server.(*Server).ServeGRPC-fm(0x558bdc0a80a0, 0xc000112ae0, 0xc000112ae0, 0x0)
	/go/src/github.com/containerd/containerd/cmd/containerd/command/main.go:196 +0x40
github.com/containerd/containerd/cmd/containerd/command.serve.func1(0x558bdc0a80a0, 0xc000112ae0, 0xc000225310, 0x558bdc0a9660, 0xc00003e0a8, 0xc0000412e0, 0x1f)
	/go/src/github.com/containerd/containerd/cmd/containerd/command/main.go:210 +0x77
created by github.com/containerd/containerd/cmd/containerd/command.serve
	/go/src/github.com/containerd/containerd/cmd/containerd/command/main.go:208 +0x1be

goroutine 41 [semacquire]:
sync.runtime_Semacquire(0xc00079fc38)
	/usr/local/go/src/runtime/sema.go:56 +0x3b
sync.(*WaitGroup).Wait(0xc00079fc30)
	/usr/local/go/src/sync/waitgroup.go:130 +0x66
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*wgCloser).Wait(0xc000825280)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/helpers.go:63 +0x31
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*ContainerIO).Wait(0xc000825140)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/container_io.go:224 +0x31
github.com/containerd/containerd.(*task).Delete(0xc00094bf80, 0x558bdc0a96a0, 0xc00091f7a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/task.go:291 +0x386
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.handleContainerExit(0x558bdc0a96a0, 0xc00091f7a0, 0xc00095ee40, 0xc0008d8e80, 0x40, 0xc0008d8f00, 0x3a, 0xc000534ac0, 0x40, 0xc000988000, ...)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:282 +0x308
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*eventMonitor).handleEvent(0xc0002ef0c0, 0x558bdbfd3de0, 0xc00095ee40, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:206 +0x66d
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*eventMonitor).start.func1(0xc000088d80, 0xc0002ef0c0, 0xc000322f60)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:154 +0x865
created by github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*eventMonitor).start
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:138 +0x9c

goroutine 42 [chan receive]:
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*snapshotsSyncer).start.func1(0xc0000bda40, 0xc00050dba0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/snapshots.go:65 +0x69
created by github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*snapshotsSyncer).start
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/snapshots.go:57 +0x5d

goroutine 43 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265bc0, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc00034b398, 0x72, 0xc000557900, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc00034b398, 0xffffffffffffff00, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Accept(0xc00034b380, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:384 +0x1a2
net.(*netFD).accept(0xc00034b380, 0x7f24c32c1000, 0x0, 0xc00055de00)
	/usr/local/go/src/net/fd_unix.go:238 +0x44
net.(*TCPListener).accept(0xc00000e778, 0xc00055de08, 0x558bdab2583a, 0x30)
	/usr/local/go/src/net/tcpsock_posix.go:139 +0x30
net.(*TCPListener).Accept(0xc00000e778, 0x558bdbfce200, 0xc0004f9b30, 0x558bdbecad60, 0x558bdcacc4a0)
	/usr/local/go/src/net/tcpsock.go:260 +0x49
net/http.(*Server).Serve(0xc0002b31e0, 0x558bdc0a8060, 0xc00000e778, 0x0, 0x0)
	/usr/local/go/src/net/http/server.go:2826 +0x231
github.com/containerd/containerd/vendor/k8s.io/kubernetes/pkg/kubelet/server/streaming.(*server).Start(0xc000106750, 0x558bdc085701, 0xc000088de0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/k8s.io/kubernetes/pkg/kubelet/server/streaming/server.go:247 +0x149
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*criService).Run.func1(0xc000088de0, 0xc0003fc6c0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/service.go:203 +0x6f
created by github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*criService).Run
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/service.go:201 +0x332

goroutine 104 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0x108e, 0xc000394dc8, 0x1000004, 0x0, 0x0, 0x558bdab4f691, 0xc000179e00, 0x200000003)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc000800840, 0xc000338f00, 0x4, 0xc000394eb0)
	/usr/local/go/src/os/wait_waitid.go:31 +0x9a
os.(*Process).wait(0xc000800840, 0x0, 0xc0007a0dd0, 0xc000394fb0)
	/usr/local/go/src/os/exec_unix.go:22 +0x3b
os.(*Process).Wait(0xc000800840, 0xc000089138, 0x558bdab71870, 0xc0000890e0)
	/usr/local/go/src/os/exec.go:125 +0x2d
os/exec.(*Cmd).Wait(0xc0008866e0, 0x554854610, 0x558bdab4eb8b)
	/usr/local/go/src/os/exec/exec.go:465 +0x5d
github.com/containerd/containerd/runtime/v1/shim/client.WithStart.func1.2(0xc0008866e0, 0xc0006783c0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/runtime/v1/shim/client/client.go:96 +0x2d
created by github.com/containerd/containerd/runtime/v1/shim/client.WithStart.func1
	/go/src/github.com/containerd/containerd/runtime/v1/shim/client/client.go:95 +0x40b

goroutine 133 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265af0, 0x72, 0xc0009d2b38)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc000920498, 0x72, 0xffffffffffffff00, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc000920498, 0xc0009e2000, 0x8000, 0x8000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc000920480, 0xc0009e2000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
net.(*netFD).Read(0xc000920480, 0xc0009e2000, 0x8000, 0x8000, 0x11, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc00082a768, 0xc0009e2000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x6a
bufio.(*Reader).Read(0xc00091f200, 0xc00099cff8, 0x9, 0x9, 0x20, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:216 +0x231
io.ReadAtLeast(0x558bdc08e7a0, 0xc00091f200, 0xc00099cff8, 0x9, 0x9, 0x9, 0xc01150ba79, 0x1150ba79dab1a820, 0x5c05ae1e)
	/usr/local/go/src/io/io.go:310 +0x8a
io.ReadFull(0x558bdc08e7a0, 0xc00091f200, 0xc00099cff8, 0x9, 0x9, 0xbef989679150ba79, 0x56436259a, 0x558bdcb2ab80)
	/usr/local/go/src/io/io.go:329 +0x5a
github.com/containerd/containerd/vendor/golang.org/x/net/http2.readFrameHeader(0xc00099cff8, 0x9, 0x9, 0x558bdc08e7a0, 0xc00091f200, 0x0, 0x558b00000000, 0xc0005a6108, 0xc0009d2e90)
	/go/src/github.com/containerd/containerd/vendor/golang.org/x/net/http2/frame.go:237 +0x7d
github.com/containerd/containerd/vendor/golang.org/x/net/http2.(*Framer).ReadFrame(0xc00099cfc0, 0xc0006dd7a0, 0xc0006dd7a0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/golang.org/x/net/http2/frame.go:492 +0xa5
github.com/containerd/containerd/vendor/google.golang.org/grpc/transport.(*http2Server).HandleStreams(0xc0009a6800, 0xc00094ad20, 0x558bdc084950)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/transport/http2_server.go:426 +0x7e
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).serveStreams(0xc000352fc0, 0x558bdc0b3a80, 0xc0009a6800)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:676 +0xdf
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).handleRawConn.func2()
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:631 +0x3e
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).handleRawConn.func3(0xc0006d7900, 0xc000352fc0, 0x7f24c3279bb8, 0xc0009a6800)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:639 +0x29
created by github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).handleRawConn
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:638 +0x5fa

goroutine 132 [select]:
github.com/containerd/containerd/vendor/google.golang.org/grpc/transport.(*http2Server).keepalive(0xc0009a6800)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/transport/http2_server.go:867 +0x1ec
created by github.com/containerd/containerd/vendor/google.golang.org/grpc/transport.newHTTP2Server
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/transport/http2_server.go:280 +0xd62

goroutine 134 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*criService).waitContainerStop(0xc0003fc6c0, 0x558bdc0a96e0, 0xc00094b0b0, 0xc0008d8e80, 0x40, 0xc0008d8f00, 0x3a, 0xc000534ac0, 0x40, 0xc000988000, ...)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/container_stop.go:144 +0x13c
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*criService).stopContainer(0xc0003fc6c0, 0x558bdc0a96e0, 0xc00094b0b0, 0xc0008d8e80, 0x40, 0xc0008d8f00, 0x3a, 0xc000534ac0, 0x40, 0xc000988000, ...)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/container_stop.go:101 +0xb26
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*criService).StopContainer(0xc0003fc6c0, 0x558bdc0a96e0, 0xc00094b0b0, 0xc0006d7aa0, 0x2, 0x558bdc0ae6e0, 0xc0002c4b00)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/container_stop.go:49 +0x298
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*instrumentedService).StopContainer(0xc00000e368, 0x558bdc0a96e0, 0xc00094b020, 0xc0006d7aa0, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/instrumented_service.go:217 +0x1ba
github.com/containerd/containerd/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2._RuntimeService_StopContainer_Handler.func1(0x558bdc0a96e0, 0xc00094b020, 0x558bdbfe45c0, 0xc0006d7aa0, 0xc0005a65a0, 0x558bdbee8e20, 0x558bdcb49290, 0xc00006caa0)
	/go/src/github.com/containerd/containerd/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2/api.pb.go:4298 +0x8b
github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.UnaryServerInterceptor(0x558bdc0a96e0, 0xc00094b020, 0x558bdbfe45c0, 0xc0006d7aa0, 0xc0006d7ac0, 0xc0006d7ae0, 0x558bdbee8e20, 0x558bdcb49290, 0x558bdc041800, 0xc00010c5a0)
	/go/src/github.com/containerd/containerd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:29 +0xa6
github.com/containerd/containerd/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2._RuntimeService_StopContainer_Handler(0x558bdc057aa0, 0xc00000e368, 0x558bdc0a96e0, 0xc00094b020, 0xc000503420, 0x558bdc084378, 0x0, 0x0, 0x72, 0x558bdab9acb2)
	/go/src/github.com/containerd/containerd/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2/api.pb.go:4300 +0x15a
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc000352fc0, 0x558bdc0b3a80, 0xc0009a6800, 0xc00010c5a0, 0xc000112900, 0x558bdcadc480, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:1011 +0x4cf
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).handleStream(0xc000352fc0, 0x558bdc0b3a80, 0xc0009a6800, 0xc00010c5a0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:1249 +0x130a
github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0007c83c0, 0xc000352fc0, 0x558bdc0b3a80, 0xc0009a6800, 0xc00010c5a0)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:680 +0xa1
created by github.com/containerd/containerd/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/server.go:678 +0xa3

goroutine 105 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run(0xc0007e97a0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:191 +0x2b7
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.NewClient
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:62 +0x12e

goroutine 56 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265950, 0x72, 0xc0009d5858)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc0005fc598, 0x72, 0xffffffffffffff00, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc0005fc598, 0xc0000fec00, 0x400, 0x400)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc0005fc580, 0xc0000fec00, 0x400, 0x400, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
net.(*netFD).Read(0xc0005fc580, 0xc0000fec00, 0x400, 0x400, 0x558bdab21d5d, 0xc000030000, 0x558bdbf03880)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc00000e890, 0xc0000fec00, 0x400, 0x400, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x6a
crypto/tls.(*block).readFromUntil(0xc0005bb200, 0x558bdc090220, 0xc00000e890, 0x5, 0xc00000e890, 0x558bdab659dc)
	/usr/local/go/src/crypto/tls/conn.go:492 +0x8b
crypto/tls.(*Conn).readRecord(0xc00057ca80, 0x558bdc085a17, 0xc00057cba0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:593 +0xdf
crypto/tls.(*Conn).Read(0xc00057ca80, 0xc0005bd000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:1145 +0xf3
net/http.(*persistConn).Read(0xc0004add40, 0xc0005bd000, 0x1000, 0x1000, 0x558bdcadaf80, 0xc0004add40, 0x0)
	/usr/local/go/src/net/http/transport.go:1497 +0x77
bufio.(*Reader).fill(0xc000ccfb60)
	/usr/local/go/src/bufio/bufio.go:100 +0x108
bufio.(*Reader).Peek(0xc000ccfb60, 0x1, 0x2, 0x0, 0x0, 0xc000089a40, 0x0)
	/usr/local/go/src/bufio/bufio.go:132 +0x41
net/http.(*persistConn).readLoop(0xc0004add40)
	/usr/local/go/src/net/http/transport.go:1645 +0x1a4
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1338 +0x943

goroutine 57 [select]:
net/http.(*persistConn).writeLoop(0xc0004add40)
	/usr/local/go/src/net/http/transport.go:1885 +0x115
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1339 +0x968

goroutine 64 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265880, 0x72, 0xc000558858)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc0005fca98, 0x72, 0xffffffffffffff00, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc0005fca98, 0xc00076a000, 0x2000, 0x2000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc0005fca80, 0xc00076a000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
net.(*netFD).Read(0xc0005fca80, 0xc00076a000, 0x2000, 0x2000, 0x558bdab21d5d, 0xc000030000, 0x558bdbf03880)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc00000e200, 0xc00076a000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x6a
crypto/tls.(*block).readFromUntil(0xc00053d170, 0x558bdc090220, 0xc00000e200, 0x5, 0xc00000e200, 0xc000060700)
	/usr/local/go/src/crypto/tls/conn.go:492 +0x8b
crypto/tls.(*Conn).readRecord(0xc000482000, 0x558bdc085a17, 0xc000482120, 0x558bdab736d0)
	/usr/local/go/src/crypto/tls/conn.go:593 +0xdf
crypto/tls.(*Conn).Read(0xc000482000, 0xc000766000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:1145 +0xf3
net/http.(*persistConn).Read(0xc00040ed80, 0xc000766000, 0x1000, 0x1000, 0xc000558c70, 0x558bdab1ddb5, 0xc000089860)
	/usr/local/go/src/net/http/transport.go:1497 +0x77
bufio.(*Reader).fill(0xc00007ecc0)
	/usr/local/go/src/bufio/bufio.go:100 +0x108
bufio.(*Reader).Peek(0xc00007ecc0, 0x1, 0x0, 0x0, 0x1, 0xc000089740, 0x0)
	/usr/local/go/src/bufio/bufio.go:132 +0x41
net/http.(*persistConn).readLoop(0xc00040ed80)
	/usr/local/go/src/net/http/transport.go:1645 +0x1a4
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1338 +0x943

goroutine 65 [select]:
net/http.(*persistConn).writeLoop(0xc00040ed80)
	/usr/local/go/src/net/http/transport.go:1885 +0x115
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1339 +0x968

goroutine 131 [select]:
github.com/containerd/containerd/vendor/google.golang.org/grpc/transport.(*controlBuffer).get(0xc00095e600, 0x1, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/transport/controlbuf.go:289 +0x105
github.com/containerd/containerd/vendor/google.golang.org/grpc/transport.(*loopyWriter).run(0xc00091f260)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/transport/controlbuf.go:374 +0x1c2
github.com/containerd/containerd/vendor/google.golang.org/grpc/transport.newHTTP2Server.func2(0xc0009a6800)
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/transport/http2_server.go:276 +0xc5
created by github.com/containerd/containerd/vendor/google.golang.org/grpc/transport.newHTTP2Server
	/go/src/github.com/containerd/containerd/vendor/google.golang.org/grpc/transport/http2_server.go:273 +0xd3d

goroutine 106 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265a20, 0x72, 0xc000070bb0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc0005d2998, 0x72, 0xffffffffffffff00, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc0005d2998, 0xc0008b3000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc0005d2980, 0xc0008b3000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
net.(*netFD).Read(0xc0005d2980, 0xc0008b3000, 0x1000, 0x1000, 0xdf, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc00000ff88, 0xc0008b3000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x6a
bufio.(*Reader).Read(0xc0007e9740, 0xc0005e97a0, 0xa, 0xa, 0x3, 0xc000070f58, 0xc000588900)
	/usr/local/go/src/bufio/bufio.go:216 +0x231
io.ReadAtLeast(0x558bdc08e7a0, 0xc0007e9740, 0xc0005e97a0, 0xa, 0xa, 0xa, 0x0, 0x0, 0xc000070e20)
	/usr/local/go/src/io/io.go:310 +0x8a
io.ReadFull(0x558bdc08e7a0, 0xc0007e9740, 0xc0005e97a0, 0xa, 0xa, 0x40, 0x0, 0xc000195c80)
	/usr/local/go/src/io/io.go:329 +0x5a
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.readMessageHeader(0xc0005e97a0, 0xa, 0xa, 0x558bdc08e7a0, 0xc0007e9740, 0xc000070f60, 0x2, 0x2, 0xc000070f44)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:54 +0x62
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*channel).recv(0xc0005e9780, 0x558bdc0a9660, 0xc00003e0c0, 0x0, 0xc000390000, 0xc0001207e0, 0xd5, 0xd5, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:102 +0x6f
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run.func1(0xc000195ce0, 0xc0007e97a0, 0xc000770ba0, 0xc000195c80)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:164 +0x8d
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:159 +0x13f

goroutine 125 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c3265610, 0x72, 0xc0009d3cd0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc00091e678, 0x72, 0xffffffffffffff01, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc00091e678, 0xc000924001, 0x8000, 0x8000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc00091e660, 0xc000924000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
os.(*File).read(0xc00082a660, 0xc000924000, 0x8000, 0x8000, 0x7f24c32c1000, 0x85, 0xc00005add8)
	/usr/local/go/src/os/file_unix.go:249 +0x50
os.(*File).Read(0xc00082a660, 0xc000924000, 0x8000, 0x8000, 0xc00057c000, 0x558bdbd82503, 0x7f24c32c1000)
	/usr/local/go/src/os/file.go:108 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/fifo.(*fifo).Read(0xc00083c780, 0xc000924000, 0x8000, 0x8000, 0x8000, 0x8000, 0x558bdab1e2a3)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:159 +0x17d
io.copyBuffer(0x558bdc08ef20, 0xc0007ce760, 0x7f24c3273de0, 0xc00083c780, 0xc000924000, 0x8000, 0x8000, 0x558bdbf1ac00, 0x558bdab1dd00, 0x7f24c3273de0)
	/usr/local/go/src/io/io.go:402 +0x127
io.Copy(0x558bdc08ef20, 0xc0007ce760, 0x7f24c3273de0, 0xc00083c780, 0xc00083c780, 0xc00005af88, 0xbef98967895f56f2)
	/usr/local/go/src/io/io.go:364 +0x5c
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*ContainerIO).Pipe.func1(0xc000825140, 0xc00079fc30)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/container_io.go:110 +0x85
created by github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*ContainerIO).Pipe
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/container_io.go:109 +0x73

goroutine 126 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c32657b0, 0x72, 0xc0009d4cd0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc00091e6d8, 0x72, 0xffffffffffffff01, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc00091e6d8, 0xc00092c001, 0x8000, 0x8000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc00091e6c0, 0xc00092c000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
os.(*File).read(0xc00082a668, 0xc00092c000, 0x8000, 0x8000, 0x7f24c32c1000, 0xc00016b285, 0xc00005a5d8)
	/usr/local/go/src/os/file_unix.go:249 +0x50
os.(*File).Read(0xc00082a668, 0xc00092c000, 0x8000, 0x8000, 0xc00057c000, 0x558bdbef2b80, 0x7f24c32c1000)
	/usr/local/go/src/os/file.go:108 +0x6b
github.com/containerd/containerd/vendor/github.com/containerd/fifo.(*fifo).Read(0xc00083c840, 0xc00092c000, 0x8000, 0x8000, 0x8000, 0x8000, 0x558bdaf9aca5)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:159 +0x17d
io.copyBuffer(0x558bdc08ef20, 0xc0007ce780, 0x7f24c3273de0, 0xc00083c840, 0xc00092c000, 0x8000, 0x8000, 0x558bdbf1ac00, 0xc0009cac00, 0x7f24c3273de0)
	/usr/local/go/src/io/io.go:402 +0x127
io.Copy(0x558bdc08ef20, 0xc0007ce780, 0x7f24c3273de0, 0xc00083c840, 0xc00083c840, 0xc000352fc0, 0x558bdc0b3a80)
	/usr/local/go/src/io/io.go:364 +0x5c
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*ContainerIO).Pipe.func2(0xc000825140, 0xc00079fc30)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/container_io.go:122 +0x85
created by github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*ContainerIO).Pipe
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/container_io.go:121 +0xce

goroutine 127 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0x10c8, 0xc000059dc8, 0x1000004, 0x0, 0x0, 0x558bdcb2ab80, 0x5f5e100, 0xbef98961d09d18c9)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc000801620, 0xc000059ea8, 0x558bdab1e547, 0x558bdab4861e)
	/usr/local/go/src/os/wait_waitid.go:31 +0x9a
os.(*Process).wait(0xc000801620, 0xc0004f7620, 0xc0004f69c0, 0xc000059fb0)
	/usr/local/go/src/os/exec_unix.go:22 +0x3b
os.(*Process).Wait(0xc000801620, 0xc0004f7560, 0x0, 0x0)
	/usr/local/go/src/os/exec.go:125 +0x2d
os/exec.(*Cmd).Wait(0xc000887b80, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:465 +0x5d
github.com/containerd/containerd/runtime/v1/shim/client.WithStart.func1.2(0xc000887b80, 0xc0006799a0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/runtime/v1/shim/client/client.go:96 +0x2d
created by github.com/containerd/containerd/runtime/v1/shim/client.WithStart.func1
	/go/src/github.com/containerd/containerd/runtime/v1/shim/client/client.go:95 +0x40b

goroutine 128 [select]:
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run(0xc00091e120)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:191 +0x2b7
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.NewClient
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:62 +0x12e

goroutine 129 [IO wait]:
internal/poll.runtime_pollWait(0x7f24c32656e0, 0x72, 0xc000939bb0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x68
internal/poll.(*pollDesc).wait(0xc0005d3f98, 0x72, 0xffffffffffffff00, 0x558bdc091b20, 0x558bdcae3e58)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9c
internal/poll.(*pollDesc).waitRead(0xc0005d3f98, 0xc000922000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc0005d3f80, 0xc000922000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x17b
net.(*netFD).Read(0xc0005d3f80, 0xc000922000, 0x1000, 0x1000, 0x27d, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc00082a648, 0xc000922000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:177 +0x6a
bufio.(*Reader).Read(0xc00091e0c0, 0xc0009cd560, 0xa, 0xa, 0x3, 0xc000939f58, 0xc0009ac600)
	/usr/local/go/src/bufio/bufio.go:216 +0x231
io.ReadAtLeast(0x558bdc08e7a0, 0xc00091e0c0, 0xc0009cd560, 0xa, 0xa, 0xa, 0x1, 0x0, 0xc000939e20)
	/usr/local/go/src/io/io.go:310 +0x8a
io.ReadFull(0x558bdc08e7a0, 0xc00091e0c0, 0xc0009cd560, 0xa, 0xa, 0x40, 0x0, 0xc000839c80)
	/usr/local/go/src/io/io.go:329 +0x5a
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.readMessageHeader(0xc0009cd560, 0xa, 0xa, 0x558bdc08e7a0, 0xc00091e0c0, 0xc000939f60, 0x2, 0x2, 0xc000939f44)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:54 +0x62
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*channel).recv(0xc0009cd540, 0x558bdc0a9660, 0xc00003e0c0, 0x0, 0x0, 0xc000683400, 0x273, 0x273, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/channel.go:102 +0x6f
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run.func1(0xc000839ce0, 0xc00091e120, 0xc000780da0, 0xc000839c80)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:164 +0x8d
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:159 +0x13f

=== END goroutine stack dump ===

@Random-Liu

Random-Liu commented Dec 3, 2018

I guess the problem is this one:

goroutine 41 [semacquire]:
sync.runtime_Semacquire(0xc00079fc38)
	/usr/local/go/src/runtime/sema.go:56 +0x3b
sync.(*WaitGroup).Wait(0xc00079fc30)
	/usr/local/go/src/sync/waitgroup.go:130 +0x66
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*wgCloser).Wait(0xc000825280)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/helpers.go:63 +0x31
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io.(*ContainerIO).Wait(0xc000825140)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/io/container_io.go:224 +0x31
github.com/containerd/containerd.(*task).Delete(0xc00094bf80, 0x558bdc0a96a0, 0xc00091f7a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/task.go:291 +0x386
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.handleContainerExit(0x558bdc0a96a0, 0xc00091f7a0, 0xc00095ee40, 0xc0008d8e80, 0x40, 0xc0008d8f00, 0x3a, 0xc000534ac0, 0x40, 0xc000988000, ...)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:282 +0x308
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*eventMonitor).handleEvent(0xc0002ef0c0, 0x558bdbfd3de0, 0xc00095ee40, 0x0, 0x0)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:206 +0x66d
github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*eventMonitor).start.func1(0xc000088d80, 0xc0002ef0c0, 0xc000322f60)
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:154 +0x865
created by github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server.(*eventMonitor).start
	/go/src/github.com/containerd/containerd/vendor/github.com/containerd/cri/pkg/server/events.go:138 +0x9c

task.Delete is blocked on IO wait: https://github.com/containerd/containerd/blob/master/task.go#L291. A minimal illustration of that blocking pattern is sketched below.
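
Illustrative only, not containerd source: a small Go program reproducing the shape of the hang in goroutine 41. io.Copy returns only once every writer of the pipe has closed it, so a lingering child that keeps the write end open blocks Wait forever, just like wgCloser.Wait inside task.Delete.

package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

func main() {
	r, w, _ := os.Pipe() // stands in for the container stdout FIFO
	defer w.Close()      // never reached: the "sleep 600" child keeps w open

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		io.Copy(io.Discard, r) // hits EOF only once every writer closes w
	}()

	fmt.Println("waiting for container IO to drain...")
	wg.Wait() // hangs forever, the analogue of task.Delete blocking
}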

I think the reason is that the sh process is killed:

INFO[2018-12-03T22:25:33.219549980Z] Stop container "9d8424cb113f85c21d204ed6ab75a2f94f928e042ba52649e0ca34f61eab4edf" with signal terminated 
INFO[2018-12-03T22:25:33.263655226Z] TaskExit event &TaskExit{ContainerID:9d8424cb113f85c21d204ed6ab75a2f94f928e042ba52649e0ca34f61eab4edf,ID:9d8424cb113f85c21d204ed6ab75a2f94f928e042ba52649e0ca34f61eab4edf,Pid:19748,ExitStatus:143,ExitedAt:2018-12-03 22:25:33.226573862 +0000 UTC,} 

but the sleep 600 process is left running.

# ps aux | grep sleep
root     19775  0.0  0.0   4376   676 ?        S    22:25   0:00 sleep 600

The reason is that since containerd 1.2, for non-host-PID containers, we no longer call KillAll by default (#2597); a sketch of the explicit opt-in is shown after this paragraph.
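
For context, a hedged sketch of what the explicit opt-in looks like with the containerd Go client; the task variable is assumed to be set up already.

package main

import (
	"context"
	"syscall"

	"github.com/containerd/containerd"
)

// killAll signals every process in the task, not just init; since containerd
// 1.2 a client must request this explicitly instead of getting it by default.
func killAll(ctx context.Context, task containerd.Task) error {
	return task.Kill(ctx, syscall.SIGKILL, containerd.WithKillAll)
}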

When crictl stop is called, the following things happened:

  1. We send SIGTERM to the container, which kills the init process: https://github.com/containerd/cri/blob/master/pkg/server/container_stop.go#L97
  2. We get the TaskExit event and try to delete the container without KillAll. The delete gets stuck forever because the sleep 600 process is still running and holding the IO.
  3. We time out waiting for the process to be killed, and then try to run our own KillAll: https://github.com/containerd/cri/blob/master/pkg/server/container_stop.go#L108. Somehow that doesn't work in this case, I guess because the init process is already gone.

This is pretty bad. To fix it, we should explicitly pass WithProcessKill when calling Delete, given that KillAll is no longer called by default.
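
A sketch of what that fix looks like with the containerd client API (illustrative, not the merged patch itself; deleteTask is a made-up name):

package sketch

import (
	"context"

	"github.com/containerd/containerd"
)

// deleteTask passes WithProcessKill so Delete first SIGKILLs any
// remaining processes, which closes their IO, instead of blocking
// forever on the IO wait.
func deleteTask(ctx context.Context, task containerd.Task) error {
	_, err := task.Delete(ctx, containerd.WithProcessKill)
	return err
}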

@ivan4th

ivan4th commented Dec 3, 2018

As a side note, I distilled the test case from the Virtlet DaemonSet, which uses hostPID: true and also suffers from the termination problem. I also tried this CRI pod sandbox config, which uses the host PID namespace:

{
  "metadata": {
    "name": "testpod",
    "uid": "0e7f9b63-d7f4-4dfb-b805-665dddd3c7c8",
    "namespace": "default"
  },
  "log_directory": "/tmp",
  "linux": {
    "security_context": {
      "namespace_options": {
        "pid": 2
      }
    }
  }
}

and it also hung (checked with 1.2.1-rc0), but you may want to recheck just in case.

@Random-Liu
Member

Random-Liu commented Dec 3, 2018

@ivan4th I see. The reason is that 0 means a pod-level PID namespace: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L210.
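
For reference, the relevant values from the generated Go bindings of that proto (v1alpha2 CRI API):

// k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2
type NamespaceMode int32

const (
	NamespaceMode_POD       NamespaceMode = 0 // shared by all containers in the pod
	NamespaceMode_CONTAINER NamespaceMode = 1 // private to the container
	NamespaceMode_NODE      NamespaceMode = 2 // the host's namespace
)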

So you are actually running a pod with a shared PID namespace. In that case, even after the init process sh dies, the child process sleep 600 is left running, and we no longer handle that case correctly now that the default KillAll has been removed from containerd.

I'm going to send out a fix soon.

And I think I can't reproduce this on my desktop because I haven't updated containerd-shim in a long time.

@estesp
Member

estesp commented Dec 4, 2018

#2854 and #2856 are merged (into master and release/1.2, respectively), and containerd/cri#983 is in flight. Once all are merged, we can generate a 1.2.1 build and verify the fix.

@ivan4th

ivan4th commented Dec 4, 2018

Thanks folks! Will try containerd master / release-1.2 with Virtlet when the changes land.

@ivan4th

ivan4th commented Dec 6, 2018

I confirm that Virtlet works fine after the changes, with pods terminating correctly when Virtlet is removed from a node. Verified with the 1.2.1 release.

ivan4th pushed a commit to ivan4th/kubeadm-dind-cluster that referenced this issue Dec 6, 2018
This is a temporary change until new docker-ce/containerd.io package
versions come out.
There's a critical bug in containerd 1.2.0:
containerd/containerd#2744
@thaJeztah
Member

@Random-Liu is this one resolved now that #2860 was merged? (I see GitHub didn't auto-close the ticket for some reason.) Or are changes still needed for the 1.1 branch?

@crosbymichael
Member

I'm going to close this one for now. Thanks for looking into it for us @ivan4th

@andrewrynhard

andrewrynhard commented Jan 11, 2019

I'm still seeing this for some reason. A bit of context may help, since my scenario is somewhat unusual and could cause confusion. I'm building a Linux distro designed to run Kubernetes. It uses containerd as the CRI runtime, and the kubelet runs as a container using hyperkube. I did not see this issue until I started moving the distro from glibc to musl. I noticed it when installing Calico 3.4 and Flannel 0.10. In the Calico case, the calico-node pods are stuck in Init:0/1 status.

$ kubectl get pods
NAME                                      READY   STATUS              RESTARTS   AGE
calico-node-kgq6g                         0/1     Init:0/1            0          21m
calico-node-nx599                         0/1     Init:0/1            0          21m

The init container seems to be keeping the pod from moving out of this state; however, the container itself runs successfully:

⟩ kubectl logs calico-node-kgq6g -c install-cni
ls: /calico-secrets: No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.4.0
/host/secondary-bin-dir is non-writeable, skipping
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
CNI config: {
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "192.168.124.201",
      "mtu": 1440,
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
          "type": "k8s"
      },
      "kubernetes": {
          "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}
Created CNI config 10-calico.conflist
Done configuring CNI.  Sleep=false

I'm seeing the following in the kubelet logs after deleting calico-node-nx599:

I0111 15:11:38.266224    2965 desired_state_of_world_populator.go:255] Pod "calico-node-nx599_kube-system(ac40b716-15ad-11e9-8e43-42c9c8e816b4)" has been removed from pod manager. However, it still has one or more containers in the non-exited state. Therefore, it will not be removed from volume manager.

I have tried containerd 1.2.1 and 1.2.2, and I saw the same thing when using Flannel instead of Calico. Both use init containers, and anything with an init container seems to hang. I'm not clear on how moving from glibc to musl could cause this, so any insight would be helpful.

EDIT:
Probably worth mentioning that I am using gcompat to run containerd on top of musl.

More kubelet logs:

time="2019-01-11T15:38:15.916557150Z" level=error msg="ReopenContainerLog for "062122cc9c744e375bfd0d3d0e551d9b7078814fde3522619841812c6749aab1" failed" error="failed to create and open log file: open /var/log/pods/ac40b716-15ad-11e9-8e43-42c9c8e816b4/install-cni/0.log: no such file or directory"
time="2019-01-11T15:38:17.180916115Z" level=error msg="StopContainer for "062122cc9c744e375bfd0d3d0e551d9b7078814fde3522619841812c6749aab1" failed" error="an error occurs during waiting for container "062122cc9c744e375bfd0d3d0e551d9b7078814fde3522619841812c6749aab1" to be killed: wait container "062122cc9c744e375bfd0d3d0e551d9b7078814fde3522619841812c6749aab1" stop timeout"
time="2019-01-11T15:38:17.200624041Z" level=info msg="StopPodSandbox for "b7cddcd52904d5a3f7ff2ec7a69e007e517a4dbec1db74ba6dbefb28780e62bb""
time="2019-01-11T15:38:17.215536080Z" level=info msg="Kill container "062122cc9c744e375bfd0d3d0e551d9b7078814fde3522619841812c6749aab1""

@Random-Liu
Member

@andrewrynhard It may not be the same issue. Containerd and containerd-shim stack traces would be helpful here.

You can get the stack traces by sending SIGUSR1 to containerd and containerd-shim, but note that you need to enable shim_debug (shim_debug = true under [plugins.linux] in containerd's config.toml) to see the containerd-shim logs.

@andrewrynhard

andrewrynhard commented Jan 12, 2019

@Random-Liu I will work on grabbing the stack traces. In the meantime, I am seeing this:

root     30540  0.0  0.2   9340  4888 ?        Sl   00:40   0:00 containerd-shim -namespace k8s.io -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/74c72edce740074eeed56be52750c772de7526a486fb22b48b4df645456b0721 -address /run/containerd/containerd.sock -containerd-binary /lib/libc.so

Notice -containerd-binary /lib/libc.so. That doesn't seem right; looking at the code, I'm not sure how it ends up as /lib/libc.so. Is there a way to set it explicitly via the containerd config?

EDIT:
Missed this :)

The daemon invokes `containerd-shim -containerd-binary ...` with its own os.Executable() path.

Still hoping there is a way to explicitly set it.

EDIT 2:
Looking at this, it seems I cannot override os.Executable. Would you be open to a PR for this?
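
A quick way to check what the daemon sees (my assumption: under gcompat the glibc binary is re-executed through musl's dynamic loader, and musl's loader is libc.so itself, so /proc/self/exe resolves to /lib/libc.so rather than to the containerd binary):

package main

import (
	"fmt"
	"os"
)

func main() {
	// On Linux, os.Executable reads /proc/self/exe. Run this under
	// gcompat to see whether it reports the loader path instead of
	// the path of the binary itself.
	exe, err := os.Executable()
	fmt.Println(exe, err)
}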

@andrewrynhard

@Random-Liu it looks like it was indeed due to the bad -containerd-binary value. I have opened #2922.
