kubelet service doesn't restart on failure #2512

Closed · Fixed by #2774
tobias-zeptio opened this issue Oct 24, 2022 · 8 comments

Labels: area/kubernetes (K8s including EKS, EKS-A, and including VMW), status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

tobias-zeptio commented Oct 24, 2022

We are running Bottlerocket managed nodes in EKS, and last week we had a node failure where kubelet.service stopped with the following log. That node was running Bottlerocket 1.9.2 on Kubernetes 1.22.

Oct 21 05:10:31 ip-10-1-22-8.eu-west-1.compute.internal kubelet[520747]: W1021 05:10:31.864779  520747 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {/run/dockershim.sock /run/dockershim.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: failed to write client preface: write unix @->/run/dockershim.sock: write: broken pipe".Reconnecting...

After this failure the node status in kubectl became "NotReady" and the node group status "Unknown", which leads to a state where nothing happens: the EC2 instance doesn't get terminated, since the kubelet status is not part of the health check, and the kubelet service is not restarted automatically, so the node stays unreachable.

The containers on the node kept running, so nothing was triggered beyond the node status change.

After logging into the node and manually starting the kubelet service, everything came back, but later the kubelet service failed again. After that we terminated the node; a new one was created with version 1.10.1 and has been stable since.

What I'd like:

Add a restart condition to the kubelet service.

Any alternatives you've considered:

AWS could add a health monitor for EKS nodes that takes kubelet health into consideration, but that would lead to a complete node termination, which might not be required.

gthao313 (Member) commented Oct 24, 2022

Thanks for opening this ticket!

Can you share with us more details about how to reproduce this issue? Did this happen frequently to you or just once?

Add a restart condition to the kubelet service.

We actually support kubelet restart on failure. Can you provide more kubelet logs?

tobias-zeptio (Author) commented:

This has happened at least 3 times since moving to Bottlerocket from the AWS AMI a couple of weeks ago. I have no way to reproduce it; there is nothing special about the workload we are running: a mix of workloads with about 20 containers per node.

Now I see I was looking at the "Drop-in" line in the systemctl status output. But why is the service not restarted then? These are the last logs before the shutdown:

Oct 21 05:10:31 ip-10-1-22-8.eu-west-1.compute.internal kubelet[520747]: W1021 05:10:31.865082  520747 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {/run/dockershim.sock localhost 0xc000eca180 0 <nil>}. Err: connection error: desc = "transport: failed to write client preface: write unix @->/run/dockershim.sock: write: broken pipe". Reconnecting...
Oct 21 05:10:31 ip-10-1-22-8.eu-west-1.compute.internal kubelet[520747]: I1021 05:10:31.884977  520747 dynamic_cafile_content.go:170] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Oct 21 05:10:31 ip-10-1-22-8.eu-west-1.compute.internal systemd[1]: Stopping Kubelet...
Oct 21 05:10:31 ip-10-1-22-8.eu-west-1.compute.internal systemd[1]: kubelet.service: Deactivated successfully.
Oct 21 05:10:31 ip-10-1-22-8.eu-west-1.compute.internal systemd[1]: Stopped Kubelet.
Oct 21 05:10:31 ip-10-1-22-8.eu-west-1.compute.internal systemd[1]: kubelet.service: Consumed 32min 55.770s CPU time.
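
From a root shell on the host (via the admin container and sheltie), the effective restart policy can be confirmed with systemctl; a minimal check, assuming the unit is named kubelet.service as in the status output above:

systemctl cat kubelet.service                              # full unit text plus any drop-ins
systemctl show kubelet.service -p Restart -p RestartSec    # effective values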

tobias-zeptio changed the title from "Add kubelet service restart on failure" to "kubelet service doesn't restart on failure" on Oct 26, 2022
bcressey (Contributor) commented:

The kubelet service is configured with Restart=on-failure, which among other things means it will not restart if the exit code is zero (which seems like it might be the case from the logs).

It also appears that containerd is falling over given the pipe error on dockershim.sock (which is where the containerd socket was until more recent images).

That is worth digging into. My guess would be that it's getting killed or starved due to resource overcommit, but the journal hopefully has more detail. If that's the case, then addressing the underlying root cause might involve tuning the kube-reserved or system-reserved settings.
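
If resource starvation does turn out to be the cause, a minimal sketch of what tuning those reservations could look like on Bottlerocket; the key paths follow the settings.kubernetes.kube-reserved and system-reserved settings, and the values are placeholders to size per instance type:

# placeholder reservations; check the exact key names against the Bottlerocket settings docs
apiclient set \
  kubernetes.kube-reserved.cpu="100m" \
  kubernetes.kube-reserved.memory="500Mi" \
  kubernetes.system-reserved.cpu="100m" \
  kubernetes.system-reserved.memory="300Mi"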

For the purposes of this issue, since we're observing kubelet dying "successfully" after an error, a Restart=always policy seems better from a self-healing perspective.
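
A minimal sketch of that change in the unit's [Service] section (RestartSec is an illustrative value):

[Service]
Restart=always
RestartSec=5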

armujahid commented Dec 27, 2022

I have faced this issue multiple times using both managed node groups and Karpenter-provisioned Bottlerocket nodes.

Suspected workload:
The workload I think is causing this is prometheus-server or grafana: each time the kubelet crashed and failed to restart, the node was running prometheus-server and grafana pods, both with mounted PVs, and I was viewing multiple Grafana dashboards over port-forward. Note that neither pod has any requests or limits set, and I think a sporadic increase of CPU load across multiple containers can bring down the kubelet (a sketch of adding requests/limits follows below). Also mentioned at https://kubernetes.slack.com/archives/C8SH2GSL9/p1664873301475189
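
A minimal sketch of adding requests and limits to those workloads, assuming Helm-default Deployment names in a monitoring namespace; names, namespace, and values are placeholders:

kubectl -n monitoring set resources deployment prometheus-server \
  --requests=cpu=250m,memory=512Mi --limits=memory=2Gi
kubectl -n monitoring set resources deployment grafana \
  --requests=cpu=100m,memory=256Mi --limits=memory=512Mi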

Temporary Resolution (Verified in case of node group node failure):
Restart kubelet or terminate that node manually :(

  1. Go to Fleet Manager and connect to that instance
  2. enable-admin-container
  3. apiclient exec admin bash
  4. sheltie
  5. systemctl status kubelet
  6. systemctl restart kubelet

I will collect kubelet logs in the future using:

journalctl -u kubelet > kubelet.log

Currently, I don't have kubelet logs :(

Behavior in case of node group node failure:
The node remained stuck in Unknown status and the managed node group wasn't replacing it. The EC2 health checks of that node were OK (because EC2 doesn't check kubelet status). I had to manually restart kubelet to resolve the issue.

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Tue, 27 Dec 2022 16:14:47 +0500   Tue, 27 Dec 2022 16:17:49 +0500   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 27 Dec 2022 16:14:47 +0500   Tue, 27 Dec 2022 16:17:49 +0500   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 27 Dec 2022 16:14:47 +0500   Tue, 27 Dec 2022 16:17:49 +0500   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Tue, 27 Dec 2022 16:14:47 +0500   Tue, 27 Dec 2022 16:17:49 +0500   NodeStatusUnknown   Kubelet stopped posting node status.

Behavior in case of Karpenter-provisioned node failure:
Karpenter eventually replaces that node after waiting for some time (a timeout).

Events:
  Type     Reason                         Age                    From             Message
  ----     ------                         ----                   ----             -------
  Normal   NodeNotReady                   13m                    node-controller  Node <node>.ap-south-1.compute.internal status is now: NodeNotReady
  Normal   DeprovisioningTerminatingNode  7m55s                  karpenter        Deprovisioning node via delete, terminating 1 nodes <node>.ap-south-1.compute.internal/c6a.2xlarge/on-demand
  Warning  FailedDraining                 7m55s                  karpenter        Failed to drain node, 25 pods are waiting to be evicted
  Warning  FailedDraining                 3m53s (x2 over 5m53s)  karpenter        Failed to drain node, 15 pods are waiting to be evicted
  Warning  FailedDraining                 113s                   karpenter        Failed to drain node, 10 pods are waiting to be evicted
  Normal   DeprovisioningWaitingDeletion  111s (x4 over 7m55s)   karpenter        Waiting on deletion to continue deprovisioning

Note that I was unable to connect to that Karpenter node using SSM Session Manager; I was getting an "i- is not connected." error.

Related:
aws/containers-roadmap#928

tobias-zeptio (Author) commented:

We were also running high load at the time, for load testing. We are no longer doing that and haven't had the issue since, so that could possibly have been a cause of the instability.

armujahid commented Jan 7, 2024

Noticed this issue again today and had to manually restart kubelet.
The managed node group node was in Unknown state for about 6 hours without any self-healing.
[screenshot: node conditions showing NodeStatusUnknown, 2024-01-07 17-30-30]

AMI release version: 1.16.1-763f6d4c
Kubernetes version: 1.24 (This is an old cluster that will soon be upgraded)

Kubelet logs before crash:

Jan 06 20:18:11 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 20:18:11.497205    1375 scope.go:110] "RemoveContainer" containerID="bade0a6134c7211412ab952ca64c84c4123a0f3c8b1d0caac2e3d26b7de75507"
Jan 06 20:18:13 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 20:18:13.477969    1375 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=4ad3a8ac-c487-46a2-9d05-ec14e126dd2f path="/var/lib/kubelet/pods/4ad3a8ac-c487-46a2-9d05-ec14e126dd2f/volumes"
Jan 06 23:15:29 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 23:15:29.609362    1375 scope.go:110] "RemoveContainer" containerID="b5e542c0c48535b9bd29260661564c897f58a2aa7c784610cd3cfeca99431dd1"
Jan 06 23:15:29 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 23:15:29.609817    1375 scope.go:110] "RemoveContainer" containerID="79ed1926b923f72077544f141a8126d6e6e5a18ab0985a91510119fc4405f883"
Jan 06 23:15:34 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 23:15:34.632998    1375 scope.go:110] "RemoveContainer" containerID="78161b987c2fffa81f0e8be7a6b5a6423221e372bc71e8088f381c07b8cf6e79"
Jan 06 23:15:34 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 23:15:34.634330    1375 scope.go:110] "RemoveContainer" containerID="6de8f299db9ba795cab5df55f5142a8c805105de7d3c9950e16dd9a738ad6996"
Jan 06 23:54:20 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 23:54:20.244083    1375 scope.go:110] "RemoveContainer" containerID="79ed1926b923f72077544f141a8126d6e6e5a18ab0985a91510119fc4405f883"
Jan 06 23:54:20 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0106 23:54:20.244649    1375 scope.go:110] "RemoveContainer" containerID="7fbdab74b77e665fb8dfd8c909bbee064d9931f387fd48f7c7b7b45d618adfa1"
Jan 07 00:55:38 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0107 00:55:38.355987    1375 scope.go:110] "RemoveContainer" containerID="6de8f299db9ba795cab5df55f5142a8c805105de7d3c9950e16dd9a738ad6996"
Jan 07 00:55:38 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0107 00:55:38.356612    1375 scope.go:110] "RemoveContainer" containerID="248048c92fd0d96ab8fab7728d006459454f12b3a94c36669d3865882ed71ea7"
Jan 07 01:57:39 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0107 01:57:39.652683    1375 scope.go:110] "RemoveContainer" containerID="7fbdab74b77e665fb8dfd8c909bbee064d9931f387fd48f7c7b7b45d618adfa1"
Jan 07 01:57:39 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0107 01:57:39.653175    1375 scope.go:110] "RemoveContainer" containerID="47a71e731e46da3598d485058591c8faf19874ff22870c68c5233b6269182726"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.507834    1375 upgradeaware.go:426] Error proxying data from client to backend: readfrom tcp 127.0.0.1:58550->127.0.0.1:40817: write tcp 127.0.0.1:58550->127.0.0.1:40817: write: connection reset by peer
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.512290    1375 remote_image.go:299] "ImageFsInfo from image service failed" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.512334    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.513133    1375 eviction_manager.go:254] "Eviction manager: failed to get summary stats" err="failed to get imageFs stats: rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.513903    1375 remote_image.go:132] "ListImages with filter from image service failed" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="nil"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.515280    1375 kuberuntime_image.go:101] "Failed to list images" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: I0107 05:55:22.515332    1375 image_gc_manager.go:199] "Failed to update image list" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.515395    1375 remote_runtime.go:549] "ListContainers with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.515465    1375 remote_runtime.go:549] "ListContainers with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.515552    1375 kuberuntime_container.go:447] "ListContainers failed" err="rpc error: code = Unavailabledesc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.516300    1375 remote_runtime.go:370] "ListPodSandbox with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.516358    1375 kuberuntime_sandbox.go:292] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.516380    1375 kubelet_pods.go:1115] "Error listing containers" err="rpc error: code = Unavailable desc= error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.516407    1375 kubelet.go:2180] "Failed cleaning pods" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.516435    1375 kubelet.go:2184] "Housekeeping took longer than 15s" err="housekeeping took too long" seconds=19.0475632
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: W0107 05:55:22.518009    1375 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/containerd/containerd.sock /run/containerd/containerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused". Reconnecting...
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518181    1375 remote_runtime.go:370] "ListPodSandbox with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="nil"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518302    1375 kuberuntime_sandbox.go:292] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518390    1375 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518490    1375 remote_runtime.go:549] "ListContainers with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518584    1375 log_metrics.go:66] "Failed to get pod stats" err="failed to get pod or container map: failed to list all containers: rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: W0107 05:55:22.518791    1375 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/containerd/containerd.sock /run/containerd/containerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused". Reconnecting...
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518891    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518981    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.519071    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.519128    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.519183    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.519458    1375 remote_runtime.go:549] "ListContainers with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.519508    1375 container_log_manager.go:183] "Failed to rotate container logs" err="failed to list containers: rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.520056    1375 remote_runtime.go:150] "Version from runtime service failed" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.522721    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" containerID="58e300c057aa4c8469e4a494dbf3091025e8cd744e9c48412c0167ef83ca7117" cmd=[/grpc_health_probe -addr=:9111]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.522973    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="58e300c057aa4c8469e4a494dbf3091025e8cd744e9c48412c0167ef83ca7117" cmd=[/grpc_health_probe -addr=:9111]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.523184    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="58e300c057aa4c8469e4a494dbf3091025e8cd744e9c48412c0167ef83ca7117" cmd=[/grpc_health_probe -addr=:9111]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.523355    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="58e300c057aa4c8469e4a494dbf3091025e8cd744e9c48412c0167ef83ca7117" cmd=[/grpc_health_probe -addr=:9111]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.523496    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="58e300c057aa4c8469e4a494dbf3091025e8cd744e9c48412c0167ef83ca7117" cmd=[/grpc_health_probe -addr=:9111]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.523637    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="58e300c057aa4c8469e4a494dbf3091025e8cd744e9c48412c0167ef83ca7117" cmd=[/grpc_health_probe -addr=:9111]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.523794    1375 remote_runtime.go:549] "ListContainers with filter from runtime service failed" err="rpcerror: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.523909    1375 kuberuntime_container.go:447] "ListContainers failed" err="rpc error: code = Unavailabledesc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524196    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524264    1375 remote_runtime.go:370] "ListPodSandbox with filter from runtime service failed" err="rpcerror: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" filter="&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524296    1375 kuberuntime_sandbox.go:292] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524305    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524317    1375 kubelet_pods.go:1115] "Error listing containers" err="rpc error: code = Unavailable desc= connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524341    1375 kubelet.go:2180] "Failed cleaning pods" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524352    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524413    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524455    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524516    1375 remote_runtime.go:711] "ExecSync cmd from runtime service failed" err="rpc error: code =Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="7d4d65dd400a988803df7ec47ac90a524ba715617bdfa78f48dbf32b02984ccd" cmd=[/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s]
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524643    1375 remote_runtime.go:946] "Status from runtime service failed" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524695    1375 kubelet.go:2356] "Container runtime sanity check failed" err="rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524725    1375 remote_runtime.go:549] "ListContainers with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.524750    1375 resource_metrics.go:126] "Error getting summary for resourceMetric prometheus endpoint" err="failed to list pod stats: failed to get pod or container map: failed to list all containers: rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.525897    1375 kubelet.go:1279] "Container garbage collection failed" err="[rpc error: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"]"
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: W0107 05:55:22.526826    1375 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/run/containerd/containerd.sock localhost 0xc000872268 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused". Reconnecting...
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal systemd[1]: Stopping Kubelet...
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal systemd[1]: kubelet.service: Deactivated successfully.
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal systemd[1]: Stopped Kubelet.
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal systemd[1]: kubelet.service: Consumed 1d 8h 39min 4.731s CPU time.

I think we should reopen this issue.

arnaldo2792 (Contributor) commented:

From the logs, it seems like something was going on with containerd's socket:

Err: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused". Reconnecting...
Jan 07 05:55:22 ip-10-20-3-58.ap-south-1.compute.internal kubelet[1375]: E0107 05:55:22.518181    1375 remote_runtime.go:370] "ListPodSandbox with filter from runtime service failed" err="rpcerror: code = Unavailable desc = error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer" filter="nil"

Do you by any chance have the logs for containerd.service? 😅

armujahid commented:

@arnaldo2792 Nope. I will collect containerd logs as well next time. That node might still be available but logs might have rotated.
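
For next time, a minimal sketch of grabbing both logs from the host, via the same admin container and sheltie steps as above:

journalctl -u containerd > containerd.log
journalctl -u kubelet > kubelet.log
# or both units together, limited to the current boot:
journalctl -b -u containerd -u kubelet > node.log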
