
CrashLoopBackOff - After update to containerd 1.5.5 #6009

Closed
bmcentos opened this issue Sep 15, 2021 · 10 comments
Labels
area/cri (Container Runtime Interface), status/more-info-needed (Awaiting contributor information)

Comments

@bmcentos

Hi.

After updating my binaries with the containerd-1.5.5-linux-amd64.tar.gz package and restarting the kubelet and containerd services, everything ran fine, but after rebooting the node the pods kube-proxy-56k4n and kube-flannel-ds-95422 went into CrashLoopBackOff, as shown below:

Containerd version that does not run after the reboot:

Client:
  Version:  v1.5.5
  Revision: 72cec4be58a9eb6b2910f5d10f1c01ca47d231c0
  Go version: go1.16.6

Server:
  Version:  v1.5.5
  Revision: 72cec4be58a9eb6b2910f5d10f1c01ca47d231c0
  UUID: 5159f238-f8dc-43f5-bec4-cdcb8b779e19

Containerd version that runs fine:

Client:
  Version:  v1.4.3
  Revision: 269548fa27e0089a8b8278fc4fc781d7f65a939b
  Go version: go1.15.5

Server:
  Version:  v1.4.3
  Revision: 269548fa27e0089a8b8278fc4fc781d7f65a939b
  UUID: 2ab18082-55a6-493a-96c5-effc31482fef

Pod status after the reboot:

coredns-85d9df8444-r9qqr           0/1     Unknown            2          57d    <none>          srvXX   <none>           <none>
etcd-srvXX                      1/1     Running            938        95d    XX.XX.201.14   srvXX   <none>           <none>
kube-apiserver-srvXX            1/1     Running            1299       94d    XX.XX.201.14   srvXX   <none>           <none>
kube-controller-manager-srvXX   1/1     Running            201        95d    XX.XX.201.14   srvXX   <none>           <none>
kube-flannel-ds-95422              0/1     CrashLoopBackOff   13         95d    XX.XX.201.14   srvXX   <none>           <none>
kube-proxy-56k4n                   0/1     CrashLoopBackOff   12         95d    XX.XX.201.14   srvXX   <none>           <none>
kube-scheduler-srvXX            1/1     Running            187        94d    XX.XX.201.14   srvXX   <none>           <none>

My node is:

NAME       STATUS   ROLES                  AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
srvXX   Ready    control-plane,master   231d   v1.21.1   XX.XX.201.14   <none>        Red Hat Enterprise Linux 8.4 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   containerd://1.5.5

After rolling back the binaries to 1.4.2 and restarting my services, everything comes back up and runs.

One log line got my attention:

kube-proxy-vcfxr_kube-system(4f8dba71-4b34-48eb-854e-7fdc4d0cc345): RunContainerError: failed to create containerd task: failed to create shim: OCI runtime create failed: container_linux.go:370: starting container process caused: unknown capability "CAP_PERFMON": unknown
/var/log/messages:Sep 15 10:50:13 srv36464 kubelet[6248]: E0915 10:50:13.490279    6248 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with RunContainerError: \"failed to create containerd task: failed to create shim: OCI runtime create failed: container_linux.go:370: starting container process caused: unknown capability \\\"CAP_PERFMON\\\": unknown\"" pod="kube-system/kube-proxy-vcfxr" podUID=4f8dba71-4b34-48eb-854e-7fdc4d0cc345

The log from the failing flannel pod is:

I0915 14:11:12.567582       1 main.go:520] Determining IP address of default interface
I0915 14:11:12.569722       1 main.go:533] Using interface with name ens192 and address XX.XX.201.14
I0915 14:11:12.569794       1 main.go:550] Defaulting external address to interface address (XX.XX.201.14)
W0915 14:11:12.569885       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0915 14:11:12.668731       1 main.go:251] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-95422': Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-95422": dial tcp 10.96.0.1:443: connect: connection refused

/var/log/messages:

Sep 15 11:16:31 srvXX kubelet[1453]: E0915 11:16:31.693454    1453 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"0af1c3669b008873c66a9593c41dea8875531cc3b5200451c227411ff6c07c60\": open /run/flannel/subnet.env: no such file or directory"
Sep 15 11:16:31 srvXX kubelet[1453]: E0915 11:16:31.693583    1453 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"0af1c3669b008873c66a9593c41dea8875531cc3b5200451c227411ff6c07c60\": open /run/flannel/subnet.env: no such file or directory" pod="kube-system/coredns-85d9df8444-r9qqr"
Sep 15 11:16:31 srvXX kubelet[1453]: E0915 11:16:31.693645    1453 kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"0af1c3669b008873c66a9593c41dea8875531cc3b5200451c227411ff6c07c60\": open /run/flannel/subnet.env: no such file or directory" pod="kube-system/coredns-85d9df8444-r9qqr"
Sep 15 11:16:31 srvXX kubelet[1453]: E0915 11:16:31.693811    1453 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"coredns-85d9df8444-r9qqr_kube-system(caa28be8-7d83-4c09-b672-5c1068adcdfc)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"coredns-85d9df8444-r9qqr_kube-system(caa28be8-7d83-4c09-b672-5c1068adcdfc)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"0af1c3669b008873c66a9593c41dea8875531cc3b5200451c227411ff6c07c60\\\": open /run/flannel/subnet.env: no such file or directory\"" pod="kube-system/coredns-85d9df8444-r9qqr" podUID=caa28be8-7d83-4c09-b672-5c1068adcdfc
Sep 15 11:16:33 srvXX kubelet[1453]: I0915 11:16:33.627202    1453 scope.go:111] "RemoveContainer" containerID="65c9c7008172d9523629048a8a74eeeca57b63e3107be608f16b9269ea674e49"
Sep 15 11:16:33 srvXX kubelet[1453]: E0915 11:16:33.627915    1453 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-56k4n_kube-system(11f6dcfc-f6a0-4291-8ec0-95c5645b5275)\"" pod="kube-system/kube-proxy-56k4n" podUID=11f6dcfc-f6a0-4291-8ec0-95c5645b5275
Sep 15 11:16:41 srvXX kubelet[1453]: I0915 11:16:41.626864    1453 scope.go:111] "RemoveContainer" containerID="c749c058d179a620eb8473a96995b556c2c81b8d7a29efcf1499a88c45be2501"
Sep 15 11:16:41 srvXX kubelet[1453]: E0915 11:16:41.627501    1453 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-flannel\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-flannel pod=kube-flannel-ds-95422_kube-system(176df0eb-c640-4b82-9fba-8ef150ec12fe)\"" pod="kube-system/kube-flannel-ds-95422" podUID=176df0eb-c640-4b82-9fba-8ef150ec12fe
Sep 15 11:16:42 srvXX containerd[1464]: time="2021-09-15T11:16:42.627525937-03:00" level=info msg="StopPodSandbox for \"54e9600b731c8165281a916bc9e99d96cece5eb8beef5841e8282e40c430bf18\""
Sep 15 11:16:42 srvXX containerd[1464]: time="2021-09-15T11:16:42.627680288-03:00" level=info msg="Container to stop \"84d114db2fbbb20744f8487bdf2fa190f4136f4f58ba0fcb4155f441bbfe3bb3\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
Sep 15 11:16:42 srvXX containerd[1464]: time="2021-09-15T11:16:42.642623026-03:00" level=info msg="TearDown network for sandbox \"54e9600b731c8165281a916bc9e99d96cece5eb8beef5841e8282e40c430bf18\" successfully"
Sep 15 11:16:42 srvXX containerd[1464]: time="2021-09-15T11:16:42.642707701-03:00" level=info msg="StopPodSandbox for \"54e9600b731c8165281a916bc9e99d96cece5eb8beef5841e8282e40c430bf18\" returns successfully"
Sep 15 11:16:42 srvXX containerd[1464]: time="2021-09-15T11:16:42.644396817-03:00" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:coredns-85d9df8444-r9qqr,Uid:caa28be8-7d83-4c09-b672-5c1068adcdfc,Namespace:kube-system,Attempt:3,}"
Sep 15 11:16:42 srvXX systemd[1765]: run-netns-cni\x2dad810ae1\x2d8f5d\x2d17a4\x2df76e\x2d9b529e1c4dbb.mount: Succeeded.
Sep 15 11:16:42 srvXX systemd[1]: run-netns-cni\x2dad810ae1\x2d8f5d\x2d17a4\x2df76e\x2d9b529e1c4dbb.mount: Succeeded.
Sep 15 11:16:42 srvXX containerd[1464]: time="2021-09-15T11:16:42.682314843-03:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:coredns-85d9df8444-r9qqr,Uid:caa28be8-7d83-4c09-b672-5c1068adcdfc,Namespace:kube-system,Attempt:3,} failed, error" error="failed to setup network for sandbox \"6c3252109b5c2fa7ad337f432ad150f0191cdb357dc781f2c4adb3f218fb8ffb\": open /run/flannel/subnet.env: no such file or directory"
Sep 15 11:16:42 srvXX kubelet[1453]: E0915 11:16:42.685495    1453 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"6c3252109b5c2fa7ad337f432ad150f0191cdb357dc781f2c4adb3f218fb8ffb\": open /run/flannel/subnet.env: no such file or directory"
Sep 15 11:16:42 srvXX kubelet[1453]: E0915 11:16:42.685651    1453 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"6c3252109b5c2fa7ad337f432ad150f0191cdb357dc781f2c4adb3f218fb8ffb\": open /run/flannel/subnet.env: no such file or directory" pod="kube-system/coredns-85d9df8444-r9qqr"
Sep 15 11:16:42 srvXX kubelet[1453]: E0915 11:16:42.685736    1453 kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"6c3252109b5c2fa7ad337f432ad150f0191cdb357dc781f2c4adb3f218fb8ffb\": open /run/flannel/subnet.env: no such file or directory" pod="kube-system/coredns-85d9df8444-r9qqr"
Sep 15 11:16:42 srvXX kubelet[1453]: E0915 11:16:42.685910    1453 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"coredns-85d9df8444-r9qqr_kube-system(caa28be8-7d83-4c09-b672-5c1068adcdfc)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"coredns-85d9df8444-r9qqr_kube-system(caa28be8-7d83-4c09-b672-5c1068adcdfc)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"6c3252109b5c2fa7ad337f432ad150f0191cdb357dc781f2c4adb3f218fb8ffb\\\": open /run/flannel/subnet.env: no such file or directory\"" pod="kube-system/coredns-85d9df8444-r9qqr" podUID=caa28be8-7d83-4c09-b672-5c1068adcdfc

So, were there any changes to capabilities or dependencies in the new version of containerd, or some new limitation? Can anyone help me understand this error?
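
The unknown capability "CAP_PERFMON" message is raised by the OCI runtime (runc), not by containerd itself; it usually means the runc binary on the node predates CAP_PERFMON support (added around runc 1.0.0-rc93), while containerd 1.5.x hands that capability to privileged pods such as kube-proxy and flannel. A minimal sketch of what to check on the affected node, assuming runc is on PATH and is the runtime the containerd shim invokes:

# CAP_PERFMON support landed around runc 1.0.0-rc93; older builds reject it as unknown
runc --version
# Confirm which containerd client/server versions the node is actually running
containerd --version
ctr version

If runc turns out to be older than that, updating it alongside the containerd binaries would be the first thing to try.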

@bmcentos
Author

/kind/bug

@fuweid transferred this issue from containerd/cri Sep 15, 2021
@fuweid added the area/cri label Sep 15, 2021
@bmcentos
Author

Sep 15 11:25:15 srv36463 kubelet[1459]: E0915 11:25:15.415072    1459 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"cattle-cluster-agent-6cdccd9d6f-7w4j9_cattle-system(4b6faf61-ddc2-4da2-8aef-167243c55af9)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"cattle-cluster-agent-6cdccd9d6f-7w4j9_cattle-system(4b6faf61-ddc2-4da2-8aef-167243c55af9)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"741160b2bf4eb4e7e92324463644e4517f03b424ec261a1392ce431bc4515dbf\\\": open /run/flannel/subnet.env: no such file or directory\"" pod="cattle-system/cattle-cluster-agent-6cdccd9d6f-7w4j9" podUID=4b6faf61-ddc2-4da2-8aef-167243c55af9

@AkihiroSuda
Member

open /run/flannel/subnet.env: no such file or directory

Doesn’t seem related to containerd.
Maybe you updated flannel too?

@bmcentos
Author

Hi, my flannel is already updated to flannel:v0.14.0 (screenshot attached).

@zouyee
Contributor

zouyee commented Sep 16, 2021

open /run/flannel/subnet.env: no such file or directory

@fuweid
Member

fuweid commented Sep 16, 2021

@bmcentos could you show the contents of /proc/$(pidof containerd)/status? Thanks
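
For anyone gathering the same information, a sketch of the commands involved (the capsh step is optional and assumes libcap's capsh utility is installed; the hex value below is only an example CapEff mask):

cat /proc/$(pidof containerd)/status
# Just the capability masks
grep Cap /proc/$(pidof containerd)/status
# Decode a mask reported on the CapEff line, e.g.:
capsh --decode=000001ffffffffff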

@fuweid added the status/more-info-needed label Sep 17, 2021
@ousiax

ousiax commented Dec 7, 2021

Dec 07 17:19:32 node-2 kubelet[18728]: E1207 17:19:32.302776   18728 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-spmfv_kube-system(929cc7ea-33d8-4a37-881c-1f6e8266a36f)\"" pod="kube-system/kube-proxy-spmfv" podUID=929cc7ea-33d8-4a37-881c-1f6e8266a36f
Dec 07 17:19:41 node-2 kubelet[18728]: I1207 17:19:41.284190   18728 scope.go:110] "RemoveContainer" containerID="d2871155e72ad981f5d241aad57a9d1a08ce80e6fbd445040719ca37553e9ab5"
Dec 07 17:19:41 node-2 kubelet[18728]: E1207 17:19:41.284820   18728 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"node-exporter\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=node-exporter pod=node-exporter-j6qmk_monitor(7ccca029-7cce-47b7-8bb5-732b944fcb84)\"" pod="monitor/node-exporter-j6qmk" podUID=7ccca029-7cce-47b7-8bb5-732b944fcb84
Dec 07 17:19:44 node-2 kubelet[18728]: I1207 17:19:44.273365   18728 scope.go:110] "RemoveContainer" containerID="bc787d9132b0cf7072343972978d672e8fa3b683adfae8be36a9478edaeb13a0"
Dec 07 17:19:44 node-2 kubelet[18728]: E1207 17:19:44.274017   18728 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-spmfv_kube-system(929cc7ea-33d8-4a37-881c-1f6e8266a36f)\"" pod="kube-system/kube-proxy-spmfv" podUID=929cc7ea-33d8-4a37-881c-1f6e8266a36f

root@node-2:/tmp# cat /run/flannel/subnet.env 
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.3.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

root@node-2:/tmp# cat /proc/$(pidof containerd)/status
Name:	containerd
Umask:	0022
State:	S (sleeping)
Tgid:	18729
Ngid:	0
Pid:	18729
PPid:	1
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	128
Groups:	 
NStgid:	18729
NSpid:	18729
NSpgid:	18729
NSsid:	18729
VmPeak:	 1355064 kB
VmSize:	 1355064 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	   52240 kB
VmRSS:	   40780 kB
RssAnon:	   27112 kB
RssFile:	   13668 kB
RssShmem:	       0 kB
VmData:	  190036 kB
VmStk:	     132 kB
VmExe:	   17300 kB
VmLib:	    1532 kB
VmPTE:	     268 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
CoreDumping:	0
THP_enabled:	1
Threads:	11
SigQ:	0/3652
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	fffffffe3bfa2800
SigIgn:	0000000000000000
SigCgt:	ffffffffffc1feff
CapInh:	0000000000000000
CapPrm:	000001ffffffffff
CapEff:	000001ffffffffff
CapBnd:	000001ffffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Seccomp_filters:	0
Speculation_Store_Bypass:	thread vulnerable
Cpus_allowed:	00000000,00000000,00000000,00000001
Cpus_allowed_list:	0
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	119
nonvoluntary_ctxt_switches:	203
root@node-2:/tmp# containerd --version
containerd github.com/containerd/containerd 1.4.5~ds1 1.4.5~ds1-2+deb11u1
root@node-2:/tmp# kubelet --version
Kubernetes v1.22.4
root@node-2:/tmp# cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

@padraigconnolly

The only way I was able to replicate this issue was with docker (20.10.12-0ubuntu4) + containerd (1.5.9-0ubuntu3) installed and K8s 1.24 running on top of them (1.24 now uses the containerd CRI by default instead of Dockershim). In this default state containerd kept shutting down the K8s containers (reporting level=warning msg="cleaning up after shim disconnected" and level=info msg="cleaning up dead shim"), and the kubelet would then start them up again, creating an endless loop.

To fix it, I created the containerd config file:

sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml

Then I set SystemdCgroup = true in /etc/containerd/config.toml (this bit is probably not essential to the fix).
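
For reference, in the config generated above the toggle lives under the runc runtime options table. A sketch of locating and flipping it, assuming the generated file already contains SystemdCgroup = false (the default in recent containerd releases; otherwise add the key by hand):

# The setting sits under:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#     SystemdCgroup = true
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
grep -n SystemdCgroup /etc/containerd/config.toml   # verify the change

A mismatch between the kubelet's cgroup driver (systemd with recent kubeadm defaults) and containerd's (cgroupfs unless SystemdCgroup is set) is a common cause of exactly this kind of restart loop, which is why this setting can matter more than it looks.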

Then restarted containerd and the kubelet:

systemctl restart containerd
systemctl restart kubelet

I feel like this is a special edge case for me, as I was just testing what happens if I install Docker + containerd (using apt install docker.io) the old way (pre K8s 1.24) to see what the transition would be like for container noobs like myself.

It would be great if a containerd expert could explain why, when I install containerd out of the box alongside Docker, I need to create the default config.toml for containerd to work directly with K8s. Is the CRI plugin disabled when no config file is present, so it just works with Docker?
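
One thing worth checking (an assumption on my part, not something confirmed in this thread) is whether the packaged /etc/containerd/config.toml ships with the CRI plugin listed under disabled_plugins; that would explain containerd working for Docker but not for the kubelet until the file is replaced with the generated default:

# Hypothetical culprit to look for: disabled_plugins = ["cri"] in the packaged config
grep -n disabled_plugins /etc/containerd/config.toml
# Inspect the configuration containerd actually loaded
containerd config dump | grep -n disabled_plugins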

@coreybx

coreybx commented May 15, 2022

Fixed for my vanilla F36 install, which had the same symptoms. Setting SystemdCgroup = true actually solved the issue for me; without it the symptoms persisted.

@fuweid
Member

fuweid commented Aug 16, 2023

It would be great if a containerd expert could explain why, when I install containerd out of the box alongside Docker, I need to create the default config.toml for containerd to work directly with K8s. Is the CRI plugin disabled when no config file is present, so it just works with Docker?

@padraigconnolly I don't know the reason. I think you can report it to the Docker community, because the package is built by them :)

It seems like there was a mismatch between kubelet and containerd. Closing.

@fuweid closed this as not planned Aug 16, 2023