Karpenter is not respecting per-node Daemonsets #1649
Comments
I believe this was fixed in v0.7.3. Can you try upgrading to the latest v0.8.1 and see if that fixes this issue?
@dewjam is probably correct in what you're seeing @snorlaX-sleeps . This patch will be released next week in v0.8.2
Hey @snorlaX-sleeps ! https://github.com/aws/karpenter/releases/tag/v0.8.2 Thanks again for reporting the problem!
We are still seeing this behaviour after deploying v0.8.2 |
@dewjam - I deployed the update on 4/14 but didn't get to test the change. I will get back to you once I can test it
Hey @dewjam / @bwagner5 Using the same initial deployments + daemonsets mentioned in the original comment, I created a new provisioner for these services and redeployed them. I deleted and redeployed the Helm Charts for all these services to recreate a fresh deployment for testing (the initial conditions for this error) and it is again creating the issue. Deleting all the instances created with this provisioner (call it rebalancing) still causes the issue to occur - it would bring up 3 instances. Errors on the daemonset pods are generally one of the following:
or
It seems to pick larger instances that will also support all of the daemonsets; it's only at the smaller instance sizes that this happens - but again, that could be down to CPU/mem thresholds separating the different instance types, rather than being part of the calculation. As a note, we are using Terraform with the Helm and K8s providers to control the deployment of these services (so everything gets deployed in batches)
@snorlaX-sleeps Can you point me to a helm chart that has one of these daemon sets so I can try to reproduce, or paste the YAML for the daemonset?
Hey @tzneal We are deploying the following Daemonsets:
It would mainly be Datadog and the secrets-store-csi that may not be scheduled / get stuck in pending. Datadog resources:
Also deploying the ALB ingress controller as 3 replicas via Helm (creating the small instance sizes), but this could be replaced with any other small pod definition
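For illustration, here is a minimal sketch of the kind of per-node daemonset involved (the name, image, and request values are hypothetical stand-ins, not the actual Datadog or secrets-store-csi manifests - the point is simply a per-node agent with non-trivial requests):

```yaml
# Hypothetical per-node agent; real charts (Datadog, CSI drivers, etc.) set their own names and requests.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-node-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-node-agent
  template:
    metadata:
      labels:
        app: example-node-agent
    spec:
      containers:
        - name: agent
          image: example.com/node-agent:latest
          resources:
            requests:
              cpu: 500m    # roughly the per-daemonset footprint described in this thread
              memory: 1Gi
```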
Hey there @snorlaX-sleeps,
Thanks @dewjam - as I said in the initial post, there are ways to work around it (deploying everything, then destroying all the nodes, but that's only good for non-critical / non-production clusters), but I just want to help identify this issue 👍
Do you think it's possible Deployments are being applied to the cluster before some of the DaemonSets?
To give some context on my question above: in short, DaemonSets should be applied before Deployments, otherwise DaemonSet pods may remain in a Pending state.
That could potentially be true during the initial deployment; however, I have seen this happening when deleting an existing Karpenter-managed node to get it recreated as well.
I understand. I can rerun the tests again tomorrow and make sure I haven't missed something (something like missing limits or whatever, it's very late here). edit: @dewjam
@dewjam - I was able to replicate this again today. I will outline the steps below, as this is possibly just an edge case; it seemed to work fine in a natural migration to the new karp provisioner. What seems to be happening:
I believe Karp logs for the node:
Looking at the node in question, it has 6 pods running in total but it is missing 3 daemonset pods, which is what is leading me to believe the
Thanks for the info. If you don't mind, I would like to see the manifest for
Do you have debug logging enabled in Karpenter? https://karpenter.sh/v0.8.2/development-guide/#change-log-level
No, but I now know that's a thing!!!
It doesn't have tolerations. After that change I have been unable to repeat the error.
That makes perfect sense. As you identified, if a pod has a higher priority-class than a DaemonSet pod, then the DS pod could end up displaced, which would leave it in a Pending state. I think you're right that you were probably seeing multiple issues. Some were fixed by updating the priority-class, but others were likely fixed through the v0.8.2 release. Let me know if we can close this out. Thanks again for reporting the issue!
Thanks for your help with this @dewjam, sorry for making you test stuff yesterday :(
I want to reopen the issue since the daemon set related to EBS CSI does not start on the node provisioned by Karpenter.
@thanh-tran-techx Can you create a new issue to track the problem you are running into? Yours may be something different.
This limitation presents a serious problem - it means that new DaemonSet workloads can never be reliably scheduled on existing clusters. If the workaround is that all DaemonSets need to have a PriorityClass assigned, that should be documented at a minimum.
Hey, what's the current situation? Is it intended that new DaemonSet pods will be Pending forever instead of making room for them by migrating other pods to new nodes? I just tried creating a test DaemonSet, pods were Pending. Then re-applied with
Facing exactly the same issue. Existing nodes will not have room for DaemonSets added in a later helm chart installation. With consolidation enabled, nothing at the moment will trigger Karpenter to create a new, larger node to replace the old one. Anything else required to make this happen automatically?
Facing the same issue. Any update?
Facing this problem. Not sure Karpenter is ready for prime time (honoring daemonsets is foundational).
Karpenter does calculate daemonset resources and take them into account for any future node launches. What it currently does not do is terminate running nodes that were launched prior to the daemonset being created if a newly created daemonset pod can't run on the node. This behavior may not be desired by all customers and normally isn't an issue, since daemonsets typically have a high priority and will evict running pods so that they can run.
@tzneal In my case these are brand new nodes; I am not patching the old nodes with a new daemonset. In the below image, the Karpenter namespace runs in a Fargate Profile.
@missourian55 If you think that Karpenter is not calculating daemonset resources correctly for daemonsets that existed prior to the node being launched, please file another issue and include Karpenter logs and daemonset specs. This particular issue is about daemonsets that were created after the nodes were already launched.
You can easily resolve this by assigning your daemonsets a higher priorityClass. The k8s scheduler will then evict other pods to make room for the daemonset and move those other pods to new nodes. Eventually Karpenter's consolidation feature might even decide to merge 2 nodes. It all works perfectly as long as you give daemonsets a higher priority.
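As a sketch of that workaround (the class name and value below are illustrative; anything that outranks the competing workload pods will do), the PriorityClass itself is a tiny object, and the DaemonSet pod template then references it via priorityClassName:

```yaml
# Illustrative PriorityClass; daemonset pods referencing it can preempt lower-priority pods on a full node.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: node-agent-critical   # hypothetical name
value: 1000000
globalDefault: false
description: "Per-node agents should preempt ordinary workloads when a node is full."
# In the DaemonSet: spec.template.spec.priorityClassName: node-agent-critical
```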
@marcofranssen Thank you. I will look into the priority class. However, for EKS add-ons like
There's nothing inherent to daemonsets that makes this statement true; users have to specifically assign a priorityClass to their daemonsets to make it so.
That's a good point. I think generally we see that users tend to put priority classes on their daemonsets to guarantee that they schedule. Otherwise it's possible, even if the DS is running ahead of time, for it to get preempted by some other pod with a higher priority, and Karpenter won't trigger a scale-up for that DS that is now pending.
This seems like the core issue, no? Requiring a high priorityClass for daemonsets is a workaround for this behaviour. I guess it's possible that users could encounter a similar condition on non-karpenter clusters, but since karpenter tries to right-size nodes, users are significantly more likely to encounter it on karpenter nodes.
I agree with this. It's not something that Karpenter (or other autoscalers) support today, but I'd love to see some design work to make this happen.
I wonder if it's worth exploring a KEP to implement support for this upstream.
Agreed. We've had some ideas around consolidation/drift that would enable us to discover that a DS is stuck in a pending state and then spin up a new node that expands the capacity of the current node, so that all the pods that need to be scheduled on the node will get scheduled.
@jonathan-innis It would be great if, even when you deploy the DS afterwards, Karpenter were able to detect such a situation and recombine pods and nodes to make sure that the new DS runs as expected. When do you plan to release this new logic?
We're seeing this issue in our dev environment. As we're deploying our telemetry sensor as a daemonset, this issue hit us hard: no new nodes are added for additional capacity when a developer deploys into a newly created namespace. Are there recommended workarounds for such cases? We have (for ex.) ~20 daemonsets running per node. It's possible that sum(daemonset.requirements) will outgrow the node capacity. This case does not initiate a scale-up in Karpenter and leaves the cluster in an unusable state.
@maximveksler you should set your provisioner to have at least the capacity to host your daemonsets
Thanks, do you mean setting the machine type selection such that there will remain room for the daemonsets, or is there a way to configure the provisioner to allocate "extra" space during planning (which is the better workaround option, I'd say)? An example here would be helpful IMHO. For reference, here is my installation: https://gist.github.com/maximveksler/38ec0cefa0ca2acccab748e71e5aebc0
This would be all I would tweak.
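The exact tweak isn't shown above, but as a rough sketch (assuming the v1alpha5 Provisioner API; the instance types listed are only examples), restricting the provisioner to instance types large enough to leave headroom for the per-node daemonsets looks something like:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Keep the smallest sizes out of the mix so every node can fit the per-node daemonsets
    # in addition to the workload pods Karpenter is scheduling for.
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge", "m5.2xlarge", "r5.xlarge"]
  ttlSecondsAfterEmpty: 30
```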
I'm closing this issue since it looks like the original issue was resolved by v0.8.2. We are tracking some of the discussion around daemonset-driven consolidation through kubernetes-sigs/karpenter#731, and we have updated our docs around daemonset priority and ordering here. It will make it easier for us to track if we continue most of this discussion over in kubernetes-sigs/karpenter#731.
Version
Karpenter: v0.7.2
Kubernetes: v1.21.5
Context
We run several different daemonsets on a per-node basis: metrics, logging, EBS CSI, secrets-store CSI.
These need to be present on every node as they provide their functionality to every pod on a node.
(This could be a configuration / unset flag issue, looking for more information)
Expected Behavior
When choosing an instance type to provision for pending pods, Karpenter should take into account any Daemonsets that will be running on the node, not just the pending service pods that it will schedule there.
Actual Behavior
This is most noticeable in a brand new cluster, but has also been seen with mature clusters:
When Karpenter brings up a node, it will correctly calculate the resources required to support the new service pod / replica. The aws-node and kube-proxy pods will be started and then the service pod.
When using a larger metrics / logging / CSI pod with requests of e.g. 1Gb RAM / 0.5-1 CPU each, these pods will be perpetually stuck in a pending state and will never start, as there isn't enough room on the node for them.
This was most noticeable when creating a new cluster, when the aws-load-balancer-controller was deployed, which only requires 0.05 CPU. Therefore, even with 3 replicas, Karpenter spun up a t3a.small instance to support these.
Even when adding more replicas (tested with 25 replicas), it continued to spin up t3a.small instances, presumably because they were the cheapest option, but leaving all the daemonset pods in a pending state, apart from one node where there was only one aws-load-balancer-controller pod - in this case one of the daemonset pods started, the rest were stuck in pending.
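To put rough numbers on that sizing gap (assuming a t3a.small's 2 vCPU / 2 GiB of memory, of which some is reserved for the kubelet and system pods, and roughly three daemonset pods at the ~1 GiB requests mentioned above - the exact counts here are illustrative):

$$\underbrace{3 \times 1\ \text{GiB}}_{\text{daemonset requests}} \;>\; \underbrace{2\ \text{GiB} - \text{system reserved}}_{\text{t3a.small allocatable memory}}$$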
I believe this is due to how Karpenter is scheduling the pods on the node (something about node-binding in the docs?):
- aws-node and kube-proxy are in the system-node-critical priority_class, so they are always scheduled first
- the service pod is scheduled next
- the daemonset pods come after the service pod and therefore get stuck in a pending state if there is not enough room for them
Steps to Reproduce the Problem
- Deploy n daemonsets with a highish resource consumption that will run on every node
- Deploy a service deployment for a service with very low resource consumption, using the node selector for a karpenter provisioner
- Karpenter creates nodes that fit the service pods, but are not able to support the daemonset(s)
Resource Specs and Logs
Logs
Do not have access to these logs at this time - but it was correctly trying to schedule the pending pods, and calculating the instance size based on the service pod requests.