Containerd appears to get stuck occasionally while creating/removing a sandbox pod, after which retry attempts to create the sandbox pod fail. #9947

sandeepbudanur · 2024-03-08T01:19:25Z

Description

In a Kubernetes cluster, kubelet tries to create a pod, which sometimes times out after 4 minutes. Kubelet then retries creating the pod, and the retry fails with error="failed to reserve sandbox name".

kubelet and containerd logs:

2024-02-07T02:45:40.739274+00:00 NodeA containerd 22938 - - time="2024-02-07T02:45:40.738965836Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:podX-b181ca13-z7d7p,Uid:f4fc0242-4fb4-4015-a7eb-3e4a9cd73293,Namespace:default,Attempt:0,}

2024-02-07T02:49:40.739405+00:00 NodeA kubelet 24578 - - E0207 02:49:40.738973 24578 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

2024-02-07T02:49:40.739537+00:00 NodeA kubelet 24578 - - E0207 02:49:40.739041 24578 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="default/podX-b181ca13-z7d7p"

2024-02-07T02:49:52.035922+00:00 NodeA containerd 22938 - - time="2024-02-07T02:49:52.035460535Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:podX-b181ca13-z7d7p,Uid:f4fc0242-4fb4-4015-a7eb-3e4a9cd73293,Namespace:default,Attempt:0,}"

2024-02-07T02:49:52.036144+00:00 NodeA containerd 22938 - - time="2024-02-07T02:49:52.035550152Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:podX-b181ca13-z7d7p,Uid:f4fc0242-4fb4-4015-a7eb-3e4a9cd73293,Namespace:default,Attempt:0,} failed, error" error="failed to reserve sandbox name "podX-b181ca13-z7d7p_cm_f4fc0242-4fb4-4015-a7eb-3e4a9cd73293_0": name "podX-b181ca13-z7d7p_cm_f4fc0242-4fb4-4015-a7eb-3e4a9cd73293_0" is reserved for "3897d260724a7061999dbc3b03243ae672a6bd7d287784009ddfad9428029b92""
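When triaging many such log lines, it can help to pull out the reserved sandbox name and the ID of the sandbox that still holds it. The following is a minimal sketch, assuming the error text has exactly the shape shown in the logs in this issue; `RESERVED_RE` and `find_reserved_names` are hypothetical helper names, not part of containerd or crictl.

```python
import re

# Assumption: the message format matches the log lines quoted in this issue.
RESERVED_RE = re.compile(
    r'failed to reserve sandbox name "(?P<name>[^"]+)": '
    r'name "[^"]+" is reserved for "(?P<holder>[0-9a-f]+)"'
)


def find_reserved_names(log_text):
    """Return {sandbox_name: holder_id} for every reservation failure found."""
    return {m.group("name"): m.group("holder") for m in RESERVED_RE.finditer(log_text)}


# Sample line copied from the crictl test run reported in this issue.
sample = (
    'E0307 21:36:57.940424 69750 remote_runtime.go:201] '
    '"RunPodSandbox from runtime service failed" '
    'err="rpc error: code = Unknown desc = failed to reserve sandbox name '
    '"bb-sandbox-2b_default_2b_1": name "bb-sandbox-2b_default_2b_1" is reserved for '
    '"fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017""'
)

for name, holder in find_reserved_names(sample).items():
    print(f"{name} is held by sandbox {holder}")
```

The holder ID can then be matched against earlier RunPodSandbox or RemovePodSandbox requests that timed out.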

The same issue was reproduced more frequently using a 'crictl' test script. The log snippet below is from one of the test runs where "failed to reserve sandbox name" was observed. Containerd did NOT RECOVER from this state and kept returning the same error for subsequent pod creation requests with the SAME POD NAME. It looks like containerd is not cleaning up the stale pod entries from its database.

Log snippet from Iteration: 17, Test Instance ID: 2

Sandbox Pod ID: fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017
Creating container inside pod fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017
b3c019d108f1800a7d2552d6c053ca07690d7d886db2b4131eee38cb16688922
Starting container b3c019d108f1800a7d2552d6c053ca07690d7d886db2b4131eee38cb16688922 inside pod fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017
Stopped sandbox fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017

# Remove-pod command timed out
E0307 21:36:57.596856 69228 remote_runtime.go:274] "RemovePodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017"
removing the pod sandbox "fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017": rpc error: code = DeadlineExceeded desc = context deadline exceeded

# On retry with the same sandbox pod name, create pod fails with the error mentioned below. Containerd has not cleaned up the stale pod entries after remove-pod timed out.
E0307 21:36:57.940424 69750 remote_runtime.go:201] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to reserve sandbox name "bb-sandbox-2b_default_2b_1": name "bb-sandbox-2b_default_2b_1" is reserved for "fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017""
time="2024-03-07T21:36:57Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "bb-sandbox-2b_default_2b_1": name "bb-sandbox-2b_default_2b_1" is reserved for "fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017""

In this example, remove-pod timed out after stopping the pod. Similarly, create-pod can get stuck or time out, and the following create-pod attempt then fails with the "failed to reserve sandbox name" error.
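The suspected failure mode can be illustrated with a toy model. This is only a sketch under the assumption that the runtime keeps a name-to-ID reservation table and that a timed-out remove or create never reaches the release step; `NameRegistrar`, `reserve` and `release` are hypothetical names for illustration, not containerd's actual code.

```python
import threading


class NameRegistrar:
    """Toy name -> sandbox ID reservation table (illustrative, not containerd's code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._name_to_id = {}

    def reserve(self, name, sandbox_id):
        """Reserve name for sandbox_id; fail if another ID already holds it."""
        with self._lock:
            holder = self._name_to_id.get(name)
            if holder is not None and holder != sandbox_id:
                raise RuntimeError(f'name "{name}" is reserved for "{holder}"')
            self._name_to_id[name] = sandbox_id

    def release(self, name):
        """Drop the reservation so the name can be reused."""
        with self._lock:
            self._name_to_id.pop(name, None)


registrar = NameRegistrar()
old_id = "fbb5843175d06d7c84d66ad3f5a07c4327910585c1157e8b71bdaa4dd0247017"
registrar.reserve("bb-sandbox-2b_default_2b_1", old_id)

# Assumption: remove-pod timed out before release() was called, so the entry is stale.
try:
    registrar.reserve("bb-sandbox-2b_default_2b_1", "new-sandbox-id")
    message = None
except RuntimeError as err:
    message = str(err)
print(message)
```

In this model, every later create with the same name keeps failing until something calls `release`, which matches the "did not recover" behaviour reported above.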

It was also observed that adding sleep(10s) between the 'crictl' commands reduces the error rate significantly but does not eliminate it.

Steps to reproduce the issue

Prerequisites for successfully running the crictl commands:
(a) In /etc/containerd/config.toml, comment out the line "SystemdCgroup: ..."
(b) systemctl stop kubelet

Five instances of the test script run in parallel, each repeating the following steps for 1000 iterations:

1. Create sandbox pod
2. Create container inside the sandbox pod
3. Start the container
4. Stop the sandbox pod
5. Remove the sandbox pod
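For readers without the attached TestScripts.zip, one iteration of the steps above might look roughly like the sketch below. It is not the attached script: the config file names `pod.json` and `container.json`, the `pause_s` parameter, and the helper names are assumptions made for illustration. The crictl subcommands used are `runp`, `create`, `start`, `stopp` and `rmp`.

```python
import subprocess
import time


def run_crictl(args):
    """Run one crictl command and return its stripped stdout (raises on failure)."""
    result = subprocess.run(
        ["crictl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def run_iteration(pod_config, container_config, run=run_crictl, pause_s=0.0):
    """Create, start, stop and remove one sandbox pod; return (pod_id, container_id)."""
    pod_id = run(["runp", pod_config])                       # create sandbox pod
    time.sleep(pause_s)
    container_id = run(["create", pod_id, container_config, pod_config])
    time.sleep(pause_s)
    run(["start", container_id])                             # start the container
    time.sleep(pause_s)
    run(["stopp", pod_id])                                   # stop the sandbox pod
    time.sleep(pause_s)
    run(["rmp", pod_id])                                     # remove the sandbox pod
    return pod_id, container_id


def dry_run(args):
    """Stand-in runner: print the command instead of executing it."""
    print("crictl " + " ".join(args))
    return "<id>"


# Dry run only; pass run=run_crictl (the default) on a node with containerd.
run_iteration("pod.json", "container.json", run=dry_run)
```

Setting `pause_s=10` would correspond to the sleep(10s) experiment mentioned earlier. Running five instances in parallel would presumably need a separate pod config per instance so that each uses its own sandbox name.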

Describe the results you received and expected

Expected:
Successful creation and deletion of pods and containers.

Received:
The issue was reproduced on v1.6.20, v1.7.0 and v1.7.3.
Observed "failed to reserve sandbox name" after crictl commands timed out.
See the attached backtrace of containerd, obtained when the sandbox pod name error was observed.

What version of containerd are you using?

1.6.20, 1.7.0, 1.7.3

Any other relevant information

crictl version

Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: v1.6.20
RuntimeApiVersion: v1

runc -v

runc version 1.1.5
commit: v1.1.5-0-gf19387a6
spec: 1.0.2-dev
go: go1.19.7
libseccomp: 2.5.1

uname -a

Linux smdev-esx1 5.16.0-0.bpo.4-amd64 #1 SMP PREEMPT Debian 5.16.12-1~bpo11+1 (2022-03-08) x86_64 GNU/Linux

crictl.txt
TestScripts.zip
containerd_backtrace.txt

Show configuration if it is related to CRI plugin.

config_toml.txt

sandeepbudanur · 2024-03-13T16:51:44Z

Trying to fix the issue and test containerd privately. Any input would be greatly appreciated.

yylt · 2024-07-05T01:57:41Z

Maybe a duplicate of #9545.

sandeepbudanur added the kind/bug label Mar 8, 2024

dosubot bot added the Stale label Oct 4, 2024
