Containerd occasionally appears to get stuck while creating/removing a sandbox pod, after which retry attempts to create the sandbox pod fail. #9947

Open
sandeepbudanur opened this issue Mar 8, 2024 · 2 comments

sandeepbudanur commented Mar 8, 2024

Description

In a Kubernetes cluster, kubelet's attempt to create a pod sometimes times out after 4 minutes. Kubelet then retries the pod creation, and the retry fails with error="failed to reserve sandbox name".

kubelet and containerd logs:

2024-02-07T02:45:40.739274+00:00 NodeA containerd 22938 - - time="2024-02-07T02:45:40.738965836Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:podX-b181ca13-z7d7p,Uid:f4fc0242-4fb4-4015-a7eb-3e4a9cd73293,Namespace:default,Attempt:0,}

2024-02-07T02:49:40.739405+00:00 NodeA kubelet 24578 - - E0207 02:49:40.738973 24578 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

2024-02-07T02:49:40.739537+00:00 NodeA kubelet 24578 - - E0207 02:49:40.739041 24578 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="default/podX-b181ca13-z7d7p"

2024-02-07T02:49:52.035922+00:00 NodeA containerd 22938 - - time="2024-02-07T02:49:52.035460535Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:podX-b181ca13-z7d7p,Uid:f4fc0242-4fb4-4015-a7eb-3e4a9cd73293,Namespace:default,Attempt:0,}"

2024-02-07T02:49:52.036144+00:00 NodeA containerd 22938 - - time="2024-02-07T02:49:52.035550152Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:podX-b181ca13-z7d7p,Uid:f4fc0242-4fb4-4015-a7eb-3e4a9cd73293,Namespace:default,Attempt:0,} failed, error" error="failed to reserve sandbox name "podX-b181ca13-z7d7p_cm_f4fc0242-4fb4-4015-a7eb-3e4a9cd73293_0": name "podX-b181ca13-z7d7p_cm_f4fc0242-4fb4-4015-a7eb-3e4a9cd73293_0" is reserved for "3897d260724a7061999dbc3b03243ae672a6bd7d287784009ddfad9428029b92""
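When scanning large kubelet or containerd logs for this failure, it can help to pull out the reserved name and the sandbox ID that still holds it. The following is a minimal sketch; the regular expression is an assumption based only on the message format in the log lines above, not on containerd's source, and `parse_reserved_error` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log lines in this report (an assumption, not
# taken from containerd's source). Optional backslashes allow for logs in
# which the inner quotes are escaped.
RESERVED_RE = re.compile(
    r'failed to reserve sandbox name \\?"(?P<name>[^"\\]+)\\?": '
    r'name \\?"(?P=name)\\?" is reserved for \\?"(?P<holder>[0-9a-f]+)'
)


def parse_reserved_error(line):
    """Return (sandbox_name, holder_id) if the line carries the error, else None."""
    match = RESERVED_RE.search(line)
    if match is None:
        return None
    return match.group("name"), match.group("holder")
```

The holder ID returned here is the sandbox whose earlier create or remove request should be looked up in the logs.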

The same issue was reproduced more frequently using a 'crictl' test script. The log snippet below is from one of the test runs where "failed to reserve sandbox name" was observed. Containerd did NOT RECOVER from this state and kept returning the same error for subsequent pod creation requests with the SAME POD NAME. It looks like containerd is not cleaning up the stale pod entries from its database.

Log snippet from Iteration: 17, Test Instance ID: 2

Sandbox Pod ID: fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017
Creating container inside pod fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017
b3c019d108f1800a7d2552d6c053ca07690d7d886db2b4131eee38cb16688922
Starting container b3c019d108f1800a7d2552d6c053ca07690d7d886db2b4131eee38cb16688922 inside pod fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017
Stopped sandbox fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017

# Remove-pod command timed out
E0307 21:36:57.596856 69228 remote_runtime.go:274] "RemovePodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017"
removing the pod sandbox "fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017": rpc error: code = DeadlineExceeded desc = context deadline exceeded

# On retry with the same sandbox pod name, create pod fails with the error mentioned below. Containerd has not cleaned up the stale pod entries after remove-pod timed out.
E0307 21:36:57.940424 69750 remote_runtime.go:201] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to reserve sandbox name "bb-sandbox-2b_default_2b_1": name "bb-sandbox-2b_default_2b_1" is reserved for "fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017""
time="2024-03-07T21:36:57Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "bb-sandbox-2b_default_2b_1": name "bb-sandbox-2b_default_2b_1" is reserved for "fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017""

In this example, remove-pod timed out after the pod had been stopped. Similarly, create-pod can get stuck or time out, and the following create-pod attempt then fails with the "failed to reserve sandbox name" error.

It was also observed that adding a 10-second sleep between the 'crictl' commands reduced the error rate significantly, but did not eliminate it.
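The suspected failure mode can be illustrated with a toy model of a name reservation table. This is only a sketch of one possible reading of the logs, not containerd's implementation: `NameRegistry`, `remove_sandbox`, and `demo` are hypothetical names, and the assumption is that the name is released only once the removal finishes, so a removal that never completes leaves the name reserved.

```python
class NameRegistry:
    """Toy name -> sandbox ID reservation table (not containerd's code)."""

    def __init__(self):
        self._owner_by_name = {}

    def reserve(self, name, sandbox_id):
        owner = self._owner_by_name.get(name)
        if owner is not None and owner != sandbox_id:
            # Mirrors the wording of the error seen in the logs.
            raise RuntimeError(f'name "{name}" is reserved for "{owner}"')
        self._owner_by_name[name] = sandbox_id

    def release(self, name):
        self._owner_by_name.pop(name, None)


def remove_sandbox(registry, name, teardown):
    """Release the name only after teardown() returns (assumed ordering)."""
    teardown()  # may raise TimeoutError, standing in for DeadlineExceeded
    registry.release(name)


def demo():
    """Reserve, fail to remove, then retry the create with a new ID."""
    registry = NameRegistry()
    name = "bb-sandbox-2b_default_2b_1"
    registry.reserve(name, "old-id")

    def stuck_teardown():
        raise TimeoutError("context deadline exceeded")

    try:
        remove_sandbox(registry, name, stuck_teardown)
    except TimeoutError:
        pass  # the caller gives up, but the reservation is still in place

    try:
        registry.reserve(name, "new-id")
    except RuntimeError as err:
        return str(err)
    return None
```

In this model the retry keeps failing until something releases the name, which matches the observation that containerd did not recover on its own.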

Steps to reproduce the issue

Prerequisites for running the crictl commands successfully:
(a) In /etc/containerd/config.toml, comment out the line "SystemdCgroup: ..."
(b) systemctl stop kubelet

Five instances of the test script run in parallel. Each instance repeats the following steps for 1000 iterations:

  1. Create sandbox pod
  2. Create container inside the sandbox pod
  3. Start the container
  4. Stop the sandbox pod
  5. Remove the sandbox pod
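The five steps above can be sketched as a single iteration driven through `crictl`. This is not the attached test script; it is a minimal illustration that assumes the standard `crictl` subcommands `runp`, `create`, `start`, `stopp`, and `rmp`. The config file paths, the `run` hook, and the `pause_s` parameter are assumptions added for illustration.

```python
import subprocess
import time


def crictl(*args):
    """Run one crictl command and return its stdout (raises on failure)."""
    result = subprocess.run(
        ["crictl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def run_iteration(pod_config, container_config, run=crictl, pause_s=0.0):
    """One pass through steps 1-5; `run` is injectable so the flow can be dry-run."""
    pod_id = run("runp", pod_config)  # 1. create sandbox pod
    time.sleep(pause_s)
    # 2. create container inside the sandbox pod
    container_id = run("create", pod_id, container_config, pod_config)
    time.sleep(pause_s)
    run("start", container_id)  # 3. start the container
    time.sleep(pause_s)
    run("stopp", pod_id)  # 4. stop the sandbox pod
    time.sleep(pause_s)
    run("rmp", pod_id)  # 5. remove the sandbox pod
    return pod_id, container_id
```

Calling `run_iteration` in a loop from several processes at once approximates the parallel setup described here, and a non-zero `pause_s` corresponds to the sleep between commands mentioned in the description.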

Describe the results you received and expected

Expected:
Successful creation and deletion of pods and containers.

Received:
The issue was reproduced on v1.6.20, v1.7.0, and v1.7.3.
"failed to reserve sandbox name" was observed after crictl commands timed out.
See the attached containerd backtrace, captured when the sandbox pod name error was observed.

What version of containerd are you using?

1.6.20, 1.7.0, 1.7.3

Any other relevant information

crictl version

Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: v1.6.20
RuntimeApiVersion: v1

runc -v

runc version 1.1.5
commit: v1.1.5-0-gf19387a6
spec: 1.0.2-dev
go: go1.19.7
libseccomp: 2.5.1

uname -a

Linux smdev-esx1 5.16.0-0.bpo.4-amd64 #1 SMP PREEMPT Debian 5.16.12-1~bpo11+1 (2022-03-08) x86_64 GNU/Linux

crictl.txt
TestScripts.zip
containerd_backtrace.txt

Show configuration if it is related to CRI plugin.

config_toml.txt

@sandeepbudanur
Author

Trying to fix the issue and test containerd privately. Any input would be greatly appreciated.

@yylt
Contributor

yylt commented Jul 5, 2024

Maybe a duplicate of #9545.

@dosubot dosubot bot added the Stale label Oct 4, 2024