
failed to start or create containerd task #4068

Closed
rtheis opened this issue Feb 27, 2020 · 22 comments


rtheis commented Feb 27, 2020

Description

Running Kubernetes conformance testing against a cluster with the containerd runtime sometimes fails because a pod does not start during one of the test cases. The general error is "failed to start containerd task" or "failed to create containerd task". More detailed errors include the following:

  • ttrpc: closed: unknown
  • read: connection reset by peer: unknown
  • failed to start io pipe copy: unable to copy pipes: containerd-shim: opening w/o fifo ... failed: context deadline exceeded

Steps to reproduce the issue:

Option 1: Follow https://github.com/cncf/k8s-conformance/blob/master/instructions.md#running to run Kubernetes conformance testing via sonobuoy.

Option 2: Follow https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md#running-conformance-tests to run Kubernetes conformance testing via kubetest.

Increasing the load on the cluster (e.g., running conformance tests in parallel) makes the problem easier to reproduce. However, the problem is generally difficult to reproduce since the failure rate is low; for example, re-running the conformance tests after a failure is usually successful.

Describe the results you received:

See description.

Describe the results you expected:

The Kubernetes conformance test passes because containerd retries the failed task.

Output of containerd --version:

We've seen this on various containerd 1.2.x and 1.3.x versions.

Any other relevant information:

We've noticed and have been monitoring these failures since October 2019, although they could have started long before that.


dims commented Feb 27, 2020

@rtheis this smells like the one @liggitt fixed in opencontainers/runc#2183. Please use containerd 1.3.3, which uses runc v1.0.0-rc10, or just drop that version of runc directly into your environment and retry the tests.

thanks,
Dims


rtheis commented Feb 27, 2020

@dims Thanks for the pointer. Unfortunately, our latest failure from today used containerd 1.3.3.

+ kubectl get nodes -o wide
NAME           STATUS   ROLES    AGE    VERSION       INTERNAL-IP    EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
10.240.0.102   Ready    <none>   173m   v1.17.3+IKS   10.240.0.102   10.240.0.102   Ubuntu 18.04.4 LTS   4.15.0-76-generic   containerd://1.3.3
10.240.0.46    Ready    <none>   174m   v1.17.3+IKS   10.240.0.46    10.240.0.46    Ubuntu 18.04.4 LTS   4.15.0-76-generic   containerd://1.3.3
10.240.0.57    Ready    <none>   174m   v1.17.3+IKS   10.240.0.57    10.240.0.57    Ubuntu 18.04.4 LTS   4.15.0-76-generic   containerd://1.3.3
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:09 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {default-scheduler } Scheduled: Successfully assigned secrets-4830/pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221 to 10.240.0.102
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:10 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/mounttest:1.0" already present on machine
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:10 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Created: Created container dels-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Started: Started container dels-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/mounttest:1.0" already present on machine
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Created: Created container upds-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Started: Started container upds-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Pulled: Container image "gcr.io/kubernetes-e2e-test-images/mounttest:1.0" already present on machine
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:11 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Created: Created container creates-volume-test
Feb 27 04:49:43.732: INFO: At 2020-02-27 04:45:41 +0000 UTC - event for pod-secrets-f8b06507-6b55-47aa-8ef9-c3082378a221: {kubelet 10.240.0.102} Failed: Error: failed to create containerd task: failed to start io pipe copy: unable to copy pipes: containerd-shim: opening w/o fifo "/run/containerd/io.containerd.grpc.v1.cri/containers/creates-volume-test/io/965388781/creates-volume-test-stdout" failed: context deadline exceeded


dims commented Feb 27, 2020

@rtheis one more thing to check ... can you please try 1.2.7? (It looks like commit a2a4241 may be the one that added the 30-second timeout, and it is only in 1.3.x AFAICT.)


liggitt commented Feb 27, 2020

I see this occurring prior to our recent containerd/runc bump, so it appears to be an independent, pre-existing issue: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-01-01&pr=1&text=opening%20w%2Fo%20fifo


rtheis commented Feb 27, 2020

@dims Sorry, I don't have an environment running containerd 1.2.7 at this time.


liggitt commented Mar 11, 2020

tracking on the Kubernetes side in kubernetes/kubernetes#89064


liggitt commented Mar 11, 2020

comes from:

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
if socket != nil {
	console, err := socket.ReceiveMaster()
	if err != nil {
		return errors.Wrap(err, "failed to retrieve console master")
	}
	if e.console, err = e.parent.Platform.CopyConsole(ctx, console, e.stdio.Stdin, e.stdio.Stdout, e.stdio.Stderr, &e.wg); err != nil {
		return errors.Wrap(err, "failed to start console copy")
	}
} else {
	if err := pio.Copy(ctx, &e.wg); err != nil {
		return errors.Wrap(err, "failed to start io pipe copy")
	}
}
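
For intuition about why this copy setup can hit a deadline at all, here is a minimal standalone sketch (plain Go, not the containerd-shim or containerd/fifo implementation): opening the write-only end of a FIFO blocks until a reader opens the other side, so if the consumer never attaches in time, a deadline-bounded setup gives up with the context's error, much like the "opening w/o fifo ... context deadline exceeded" message above.

package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

func main() {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "stdout")
	if err := syscall.Mkfifo(path, 0600); err != nil {
		panic(err)
	}

	// Bound the setup with a deadline, loosely analogous to the shim's 30-second timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	done := make(chan error, 1)
	go func() {
		// O_WRONLY on a FIFO blocks until a reader opens the other end.
		f, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err == nil {
			f.Close()
		}
		done <- err
	}()

	select {
	case err := <-done:
		fmt.Println("open finished:", err)
	case <-ctx.Done():
		// No reader ever attached; report the context's error.
		// (The blocked open simply leaks here; the demo process is about to exit.)
		fmt.Println("opening w/o fifo failed:", ctx.Err())
	}
}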


liggitt commented Mar 11, 2020

It's unclear from the error whether the deadline that was exceeded was the 30-second one or a deadline inherited from the wrapped context.
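
For reference, a minimal sketch (again not containerd code) of how nested contexts behave: context.WithTimeout keeps whichever deadline is earlier, so an inherited deadline can fire well before the 30-second one.

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Parent context with a short deadline, e.g. one inherited from the calling RPC.
	parent, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// A wrapper adds its own 30-second timeout on top, like the shim code above.
	child, cancel2 := context.WithTimeout(parent, 30*time.Second)
	defer cancel2()

	// The effective deadline is whichever is earlier (~5s here, not 30s),
	// so "context deadline exceeded" can originate from either layer.
	deadline, _ := child.Deadline()
	fmt.Println("effective deadline in:", time.Until(deadline).Round(time.Second))
}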


fuweid commented Mar 19, 2020

Will take a look at this. Thanks for reporting it!


fuweid commented Mar 25, 2020

In my Vagrant box with 4 CPUs / 4 GB and high load generated with resource-consumer, I didn't reproduce it 😂 @rtheis Do you have any historical metrics data about I/O?


rtheis commented Mar 25, 2020

@fuweid I don't have any I/O metrics data for these failures.

alpeb added a commit to linkerd/linkerd2 that referenced this issue Jun 18, 2020
In #4595 we stopped failing integration tests whenever a pod restarted
just once, which is being caused by containerd/containerd#4068.

But we forgot to remove the warning event corresponding to that
containerd failure, and such unexpected event continues to fail the
tests. So this change adds that event to the list of expected ones.
@BenTheElder

We still see this in Kubernetes sometimes; it seems to happen more with certain test cases:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=failed%20to%20start%20io%20pipe%20copy%3A%20unable%20to%20copy%20pipes%3A%20containerd-shim%3A%20opening%20w%2Fo%20fifo%20

Oddly specific: EmptyDir volumes should support (non-root,0666,default) [LinuxOnly] [NodeConformance] [Conformance]


pohly commented Sep 19, 2020

I also encountered "failed to create containerd task: failed to start io pipe copy: unable to copy pipes: containerd-shim: opening w/o fifo ...: context deadline exceeded" in the CI tests for the PMEM-CSI driver. FWIW, I only saw it after updating to containerd 1.3.7 from 1.2.13.

pohly added a commit to pohly/pmem-CSI that referenced this issue Sep 21, 2020
containerd/containerd#4068 caused a
container start to fail and get retried, which then broke tests
because of our "no container restart" check. By treating this
particular failure as non-fatal we get our tests to run reliably
again.

estesp commented Feb 23, 2022

It's been about 1.5 years since anyone commented on or noted this issue. I also don't see any mentions of the 1.4.x or later releases, which are the only ones still in support.

Do we have any data on this happening with containerd 1.4.x or above? If we don't, we may as well close this out.


rtheis commented Feb 23, 2022

@estesp things look good with containerd 1.4, 1.5 and 1.6.


BenTheElder commented Feb 24, 2022

Kubernetes is currently using 1.5.9 in KIND and 1.6.0 most everywhere else; as far as I can find, we are not seeing this anymore.


estesp commented Feb 24, 2022

Thanks for the feedback! Closing.

estesp closed this as completed Feb 24, 2022
@timchenxiaoyu

containerd 1.6.1 also has this problem:

8:00" level=error msg="collecting metrics for 00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02" error="ttrpc: closed: unknown"
time="2022-04-12T16:34:21+08:00" level=info msg="shim disconnected" id=00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02
time="2022-04-12T16:34:21+08:00" level=warning msg="cleaning up after shim disconnected" id=00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02 namespace=k8s.io
time="2022-04-12T16:34:21+08:00" level=info msg="cleaning up dead shim"
time="2022-04-12T16:34:21+08:00" level=error msg="collecting metrics for 00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02" error="ttrpc: closed: unknown"
time="2022-04-12T16:34:21+08:00" level=error msg="StartContainer for "00f39eb23e6de53f353385bf3adfc55c4e101f8cf5ace562a04dae02de867d02" failed" error="failed to create containerd task: failed to create shim task: context deadline exceeded: unknown"


javad87 commented Nov 9, 2022

I deployed Kubernetes via MicroK8s (Canonical's snap project) and got this error. It seems similar to the error mentioned here and related to containerd:

kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: can't copy bootstrap data to pipe: write init-p: broken pipe: unknown

Is there any solution for it?
I upgraded my kernel version but still have the problem...

@sreenivas-ps

I'm on GKE version v1.23.16-gke.1400, which uses containerd://1.5.13, and I'm seeing such issues quite often:

    State:          Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim: context deadline exceeded: unknown

This is a CronJob, and it just fails with the StartError.

Any suggestions as to why this is happening?


shreben commented May 16, 2023

Hello, I also noticed this issue on an EKS node:

  Kernel Version:             5.10.178-162.673.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.24.11-eks-a59e1f0
  Kube-Proxy Version:         v1.24.11-eks-a59e1f0

Pod's status:

Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: context canceled: unknown
      Exit Code:    128
