
failed to start or create containerd task #4068

Closed
rtheis opened this issue Feb 27, 2020 · 22 comments


rtheis commented Feb 27, 2020

Description

Running Kubernetes conformance testing against a cluster with the containerd runtime sometimes fails because a pod does not start during one of the test cases. The general error is "failed to start containerd task" or "failed to create containerd task". More detailed errors include the following:

  • ttrpc: closed: unknown
  • read: connection reset by peer: unknown
  • failed to start io pipe copy: unable to copy pipes: containerd-shim: opening w/o fifo ... failed: context deadline exceeded

Steps to reproduce the issue:

Option 1: Follow https://github.com/cncf/k8s-conformance/blob/master/instructions.md#running to run Kubernetes conformance testing via sonobuoy.

Option 2: Follow https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md#running-conformance-tests to run Kubernetes conformance testing via kubetest.

Increasing the load on the cluster (e.g., running conformance tests in parallel) makes the problem easier to reproduce. However, the problem is generally difficult to reproduce since the failure rate is low; for example, re-running the conformance tests after a failure is usually successful.

Describe the results you received:

See description.

Describe the results you expected:

The Kubernetes conformance test passes because containerd retries the failed task.

Output of containerd --version:

We've seen this on various containerd 1.2.x and 1.3.x versions.

Any other relevant information:

We've noticed and have been monitoring these failures since October 2019, although they could have started long before that.


dims commented Feb 27, 2020

@rtheis this smells like the one @liggitt fixed in opencontainers/runc#2183. Please use containerd 1.3.3, which uses runc v1.0.0-rc10, or just drop that version of runc directly into your environment and retry the tests.

thanks,
Dims


rtheis commented Feb 27, 2020

@dims Thanks for the pointer. Unfortunately, our latest failure from today used containerd 1.3.3.

+ kubectl get nodes -o wide
NAME           STATUS   ROLES    AGE    VERSION       INTERNAL-IP    EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
10.240.0.102   Ready    <none>   173m   v1.17.3+IKS   10.240.0.102   10.240.0.102   Ubuntu 18.04.4 LTS   4.15.0-76-generic   containerd://1.3.3
10.240.0.46    Ready    <none>   174m   v1.17.3+IKS   10.240.0.46    10.240.0.46    Ubuntu 18.04.4 LTS   4.15.0-76-generic   containerd://1.3.3
10.240.0.57    Ready    <none>   174m   v1.17.3+IKS   10.240.0.57    10.240.0.57    Ubuntu 18.04.4 LTS   4.15.0-76-generic   containerd://1.3.3
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:09 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {default-scheduler } Scheduled: Successfully assigned secrets-4830/pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221 to 10.240.0.102
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:10 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/mounttest:1.0" already present on machine
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:10 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Created: Created container dels-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Started: Started container dels-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/mounttest:1.0" already present on machine
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Created: Created container upds-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Started: Started container upds-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/mounttest:1.0" already present on machine
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Created: Created container creates-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:41 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Failed: Error: failed to create containerd task: failed to start io pipe copy: unable to copy pipes: containerd-shim: opening w/o fifo "/run/containerd/io.containerd.grpc.v1.cri/containers/creates-volume-test/io/965388781/creates-volume-test-stdout" failed: context deadline exceeded


dims commented Feb 27, 2020

@rtheis one more thing to check ... can you please try 1.2.7? (It looks like commit a2a4241 may be the one that added the 30-second timeout, and it is only in 1.3.x AFAICT.)


liggitt commented Feb 27, 2020

I see this occurring prior to our recent containerd/runc bump, so it appears to be an independent, pre-existing issue: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-01-01&pr=1&text=opening%20w%2Fo%20fifo


rtheis commented Feb 27, 2020

@dims Sorry, I don't have an environment running containerd 1.2.7 at this time.


liggitt commented Mar 11, 2020

tracking on the Kubernetes side in kubernetes/kubernetes#89064


liggitt commented Mar 11, 2020

comes from:

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
if socket != nil {
	console, err := socket.ReceiveMaster()
	if err != nil {
		return errors.Wrap(err, "failed to retrieve console master")
	}
	if e.console, err = e.parent.Platform.CopyConsole(ctx, console, e.stdio.Stdin, e.stdio.Stdout, e.stdio.Stderr, &e.wg); err != nil {
		return errors.Wrap(err, "failed to start console copy")
	}
} else {
	if err := pio.Copy(ctx, &e.wg); err != nil {
		return errors.Wrap(err, "failed to start io pipe copy")
	}
}
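
For intuition about why this copy setup can hit a deadline at all, here is a minimal standalone sketch (plain Go, not the containerd-shim or containerd/fifo implementation): opening the write-only end of a FIFO blocks until a reader opens the other side, so if the consumer never attaches in time, a deadline-bounded setup gives up with the context's error, much like the "opening w/o fifo ... context deadline exceeded" message above.

package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

func main() {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "stdout")
	if err := syscall.Mkfifo(path, 0600); err != nil {
		panic(err)
	}

	// Bound the setup with a deadline, loosely analogous to the shim's 30-second timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	done := make(chan error, 1)
	go func() {
		// O_WRONLY on a FIFO blocks until a reader opens the other end.
		f, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err == nil {
			f.Close()
		}
		done <- err
	}()

	select {
	case err := <-done:
		fmt.Println("open finished:", err)
	case <-ctx.Done():
		// No reader ever attached; report the context's error.
		// (The blocked open simply leaks here; the demo process is about to exit.)
		fmt.Println("opening w/o fifo failed:", ctx.Err())
	}
}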


liggitt commented Mar 11, 2020

It's unclear from the error whether the deadline that was exceeded was the 30-second one or a deadline inherited from the wrapped context.
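
For reference, a minimal sketch (again not containerd code) of how nested contexts behave: context.WithTimeout keeps whichever deadline is earlier, so an inherited deadline can fire well before the 30-second one.

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Parent context with a short deadline, e.g. one inherited from the calling RPC.
	parent, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// A wrapper adds its own 30-second timeout on top, like the shim code above.
	child, cancel2 := context.WithTimeout(parent, 30*time.Second)
	defer cancel2()

	// The effective deadline is whichever is earlier (~5s here, not 30s),
	// so "context deadline exceeded" can originate from either layer.
	deadline, _ := child.Deadline()
	fmt.Println("effective deadline in:", time.Until(deadline).Round(time.Second))
}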


fuweid commented Mar 19, 2020

Will take a look at this. Thanks for reporting it!


fuweid commented Mar 25, 2020

In my Vagrant box with 4 CPUs / 4 GB and high load generated with resource-consumer, I didn't reproduce it 😂 @rtheis Do you have any historical metrics data about I/O?


rtheis commented Mar 25, 2020

@fuweid I don't have any I/O metrics data for these failures.

alpeb added a commit to linkerd/linkerd2 that referenced this issue Jun 18, 2020
In #4595 we stopped failing integration tests whenever a pod restarted
just once, which is being caused by containerd/containerd#4068.

But we forgot to remove the warning event corresponding to that
containerd failure, and such unexpected event continues to fail the
tests. So this change adds that event to the list of expected ones.
@BenTheElder

We still see this in Kubernetes sometimes; it seems to happen more with certain test cases:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=failed%20to%20start%20io%20pipe%20copy%3A%20unable%20to%20copy%20pipes%3A%20containerd-shim%3A%20opening%20w%2Fo%20fifo%20

Oddly specific: EmptyDir volumes should support (non-root,0666,default) [LinuxOnly] [NodeConformance] [Conformance]


pohly commented Sep 19, 2020

I also encountered "failed to create containerd task: failed to start io pipe copy: unable to copy pipes: containerd-shim: opening w/o fifo ...: context deadline exceeded" in the CI tests for the PMEM-CSI driver. FWIW, I only saw it after updating to containerd 1.3.7 from 1.2.13.

pohly added a commit to pohly/pmem-CSI that referenced this issue Sep 21, 2020
containerd/containerd#4068 caused a
container start to fail and get retried, which then broke tests
because of our "no container restart" check. By treating this
particular failure as non-fatal we get our tests to run reliably
again.

estesp commented Feb 23, 2022

It's been about 1.5 years since anyone commented on or noted this issue. I also don't see any mentions of the 1.4.x or later releases, which are the only ones still in support.

Do we have any data on this happening with containerd 1.4.x or above? If we don't, we may as well close this out.


rtheis commented Feb 23, 2022

@estesp things look good with containerd 1.4, 1.5 and 1.6.


BenTheElder commented Feb 24, 2022

Kubernetes is currently using 1.5.9 in KIND and 1.6.0 most everywhere else; as far as I can find, we are not seeing this anymore.


estesp commented Feb 24, 2022

Thanks for the feedback! Closing.

estesp closed this as completed Feb 24, 2022
@timchenxiaoyu

containerd 1.6.1 also has this problem:

8:00" level=error msg="collecting metrics for 00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02" error="ttrpc: closed: unknown"
time="2022-04-12T16:34:21+08:00" level=info msg="shim disconnected" id=00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02
time="2022-04-12T16:34:21+08:00" level=warning msg="cleaning up after shim disconnected" id=00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02 namespace=k8s.io
time="2022-04-12T16:34:21+08:00" level=info msg="cleaning up dead shim"
time="2022-04-12T16:34:21+08:00" level=error msg="collecting metrics for 00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02" error="ttrpc: closed: unknown"
time="2022-04-12T16:34:21+08:00" level=error msg="StartContainer for "00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02" failed" error="failed to create containerd task: failed to create shim task: context deadline exceeded: unknown"


javad87 commented Nov 9, 2022

I deployed Kubernetes via MicroK8s (Canonical's snap project) and got this error. It seems similar to the error mentioned here and related to containerd:

kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: can't copy bootstrap data to pipe: write init-p: broken pipe: unknown

Is there any solution for it?
I upgraded my kernel version but still have the problem...

@sreenivas-ps

I'm on GKE version v1.23.16-gke.1400, which uses containerd://1.5.13, and I'm seeing such issues quite often:

    State:          Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim: context deadline exceeded: unknown

This is a CronJob, and it just fails with the StartError.

Any suggestions as to why this is happening?


shreben commented May 16, 2023

Hello, I also noticed this issue on an EKS node:

  Kernel Version:             5.10.178-162.673.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.24.11-eks-a59e1f0
  Kube-Proxy Version:         v1.24.11-eks-a59e1f0

Pod's status:

Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: context canceled: unknown
      Exit Code:    128
