Liveness controller doesn't terminate nodes in all cases #1135
Comments
This is such a good find. Currently, the liveness controller terminates nodes that fail to connect: https://github.com/aws/karpenter/blob/main/pkg/controllers/node/liveness.go#L49. However, for nodes that do connect but don't become ready, this can happen! We've been aware of this gap, but haven't had an example of it happening, so we hadn't prioritized it. I think we probably want to apply liveness to all node unreadiness, with the ability for users to configure longer static stability. You can protect against this by setting limits.
Are limits actually implemented? I had read the design doc here: https://github.com/aws/karpenter/blob/main/designs/limits.md but was under the impression they aren't implemented yet? We actually have them commented out in our current Provisioner configuration, as I was waiting until they were available 😄
They are implemented. You can limit via cpu and memory. We've got a doc update in PR right now that should hopefully clear up that confusion - #1117
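(For reference, a minimal sketch of what such a limits failsafe can look like on a Provisioner, assuming the karpenter.sh/v1alpha5 API used by the v0.5.x releases and the field shape described in the linked limits design; the values are illustrative, not a recommendation:)

```yaml
# Hedged sketch only: caps the total resources a single Provisioner may stand up,
# so a provisioning loop like the one in this issue cannot launch nodes indefinitely.
# Field shape follows the limits design doc linked above; values are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "100"      # total vCPU across all nodes launched by this Provisioner
      memory: 400Gi   # total memory across all nodes launched by this Provisioner
```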
Awesome, thanks. I'll enable those in the meantime as a failsafe.
Released as part of https://github.com/aws/karpenter/releases/tag/v0.5.6 |
Version
Karpenter: v0.5.4
Kubernetes: v1.21.2
Expected Behavior
Karpenter should gracefully handle scenarios where nodes are unable to join the cluster. It could wait for the nodes to join (either indefinitely or for some configurable timeout period), or it could continually delete the nodes that were unable to join the cluster and launch new ones in their place.
Actual Behavior
Every ~5 minutes Karpenter batches pods and tries to launch a new set of instances. When following the Getting Started guide, using the inflate deployment to test the autoscaling functionality with 10 replicas, I ended up with ~30 m5.xlarge instances running, unable to join the cluster, after 1 hour of testing. Old nodes were never deleted; only new ones were ever created. Additionally, after scaling down the inflate deployment, these empty nodes did not have a TTL applied. I had to manually delete them.
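(For context, the empty-node TTL referred to here is the Provisioner's ttlSecondsAfterEmpty field; a minimal sketch, again assuming the karpenter.sh/v1alpha5 API, with an illustrative value:)

```yaml
# Sketch: the Provisioner setting that expires empty nodes. If it is unset,
# Karpenter does not apply an empty-node TTL at all. The value is illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsAfterEmpty: 30   # delete a node once it has been empty for 30 seconds
```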
Steps to Reproduce the Problem
The easiest way I've found to do this is to intentionally misconfigure the EKS VPC CNI plugin while following the Getting Started guide - https://karpenter.sh/docs/getting-started/

At this point, ipamd should be unable to create/attach new ENIs to the nodes Karpenter is trying to create, which will cause them to fail to join the cluster (the aws-node pod on the new nodes should get stuck in CrashLoopBackOff).

When scaling the inflate deployment to 10 replicas, Karpenter should get stuck in a loop which constantly creates new nodes.
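(For completeness, a sketch of the inflate test deployment from the Getting Started guide used in this reproduction, reproduced from memory of the guide, so the image tag and resource requests may differ slightly:)

```yaml
# Sketch of the Getting Started "inflate" test deployment (details may differ
# from the current guide). Scaling it to 10 replicas triggers the loop above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
```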
Resource Specs and Logs
I've attached the logs from the controller during an example run. The pod used was the pause image provided in the Getting Started example deployment - https://karpenter.sh/docs/getting-started/#automatic-node-provisioning

Here is the provisioner spec:
KarpenterEndlessLaunchLogs.txt