kubelet service doesn't restart on failure #2512
Thanks for opening this ticket! Can you share with us more details about how to reproduce this issue? Did this happen frequently to you or just once?
We actually support kubelet restart on failure. Can you provide more logs about kubelet?
This has happened at least 3 times since moving to Bottlerocket from the AWS AMI a couple of weeks ago. I have no way to reproduce it; there is nothing special about the workload we are running, just a mix of workloads and about 20 containers per node. Now I see I was looking at the "Drop-in" line in the systemctl status. But why is the service not restarted then? These are the last logs before shutdown.
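If it helps, the drop-ins that apply to the unit and the effective restart policy can be checked from a root shell on the node; this is a generic systemd sketch, not anything Bottlerocket-specific:

```sh
# List the unit file together with any drop-in overrides.
systemctl cat kubelet.service

# Show the restart policy systemd resolved, plus how often it has restarted the unit.
systemctl show kubelet.service -p Restart,NRestarts,Result
```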
It also appears that kubelet is being killed rather than exiting cleanly, and that is worth digging into. My guess would be that it's getting killed or starved due to resource overcommit, but the journal hopefully has more detail. If that's the case, then addressing the underlying root cause might involve tuning the resource reservations on the node. For the purposes of this issue, since we're observing that the service is not restarted after it stops, adding a restart condition seems like the right focus.
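If starvation is the suspicion, a rough way to look for OOM kills around the failure, assuming journal access from a root shell on the node (generic journalctl queries, not specific to this report):

```sh
# Kubelet's own journal leading up to the stop.
journalctl -u kubelet.service --since "2 hours ago" --no-pager

# Kernel messages mentioning the OOM killer.
journalctl -k --no-pager | grep -iE 'out of memory|oom-kill|killed process'
```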
I have faced this issue multiple times using both managed node groups and Karpenter-provisioned Bottlerocket nodes.
Known workload:
Temporary resolution (verified in the case of a node group node failure):
```sh
enable-admin-container
apiclient exec admin bash
sheltie
systemctl status kubelet
systemctl restart kubelet
```

I will collect kubelet logs next time.
Currently, I don't have kubelet logs :(
Behavior in case of node group node failure:
Behavior in case of Karpenter-provisioned node failure:
Note that I was unable to connect to that Karpenter node using SSM; I was getting an "i- is not connected." error. Related:
We were also running a high load at the time, for load testing. We are no longer doing that and haven't had the issue since, so that could possibly be a cause of the instability.
From the logs, it seems like something was going on with containerd's socket:
Do you by any chance have the logs for containerd?
@arnaldo2792 Nope. I will collect containerd logs as well next time. That node might still be available, but the logs might have rotated.
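For next time, one way the logs could be captured from a Bottlerocket node before it gets recycled; this assumes root access via the admin container plus `sheltie`, and the `logdog` collector that recent Bottlerocket versions ship:

```sh
# From a root shell on the host (admin container -> sheltie):
journalctl -u kubelet.service --no-pager > /tmp/kubelet.log
journalctl -u containerd.service --no-pager > /tmp/containerd.log

# logdog bundles these and other diagnostics into a single tarball
# (written under /var/log/support/ on the versions I've seen).
logdog
```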
We are running Bottlerocket managed nodes in EKS, and last week we got a node failure where kubelet.service stopped with the following log. That node was running version 1.9.2 on Kubernetes 1.22.
After this failure the node status in kubectl became "NotReady" and the node group status became "Unknown", which leads to a state where nothing happens. The EC2 instance doesn't get terminated, since the kubelet status is not part of the health check, and the kubelet service is not restarted automatically, so the node is not reachable.
The containers on the node kept running, so nothing was triggered beyond the node status change.
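For reference, this is roughly how the stuck state can be confirmed from the cluster side (plain kubectl; the node name is a placeholder):

```sh
# The broken node shows NotReady while its pods keep running.
kubectl get nodes

# The Ready condition's message usually points at kubelet no longer posting status.
kubectl describe node <node-name> | grep -A 10 'Conditions:'
```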
After logging into the node and manually starting the kubelet service again, everything came back, but later the kubelet service failed again. After that we terminated the node, a new one was created with version 1.10.1, and it has been stable since then.
What I'd like:
Add a restart condition to the kubelet service.
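A minimal sketch of what such a restart condition could look like, written as a generic systemd override rather than the actual Bottlerocket change; the file name and values are only illustrative, and on Bottlerocket an override placed under /etc likely will not survive a reboot:

```sh
# Illustrative only: restart kubelet whenever it stops, with a short back-off.
mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' > /etc/systemd/system/kubelet.service.d/10-restart.conf
[Service]
Restart=always
RestartSec=5
EOF

# Make systemd pick up the new drop-in.
systemctl daemon-reload
```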
Any alternatives you've considered:
AWS adding a health monitor for EKS nodes which takes the kubelet health into consideration, but that would lead to a complete node termination, which might not be required.