
Karpenter not respecting daemonsets resources #2751

Closed

JoseAlvarezSonos opened this issue Oct 28, 2022 · 20 comments

@JoseAlvarezSonos

Version

Karpenter Version: v0.18.1

Kubernetes Version: v1.23.9 (EKS)

Expected Behavior

When scaling up, Karpenter should calculate the "reserved" CPU needed by all DaemonSets across all namespaces, and from there calculate the number of instances needed for the unscheduled pods. The expected behaviour is that on every new node all DaemonSets end up running without any resource issues.
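
For example (illustrative numbers only): if the DaemonSets on a node together request about 1.5 vCPU, then on an m6i.xlarge (4 vCPU) only roughly 2.5 vCPU are left for workload pods, so Karpenter should already subtract that overhead when deciding how many and which instances to launch.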

Actual Behavior

From time to time we have scale-ups where many pods are created at the same time, anywhere from 15 to 4k Pods. In these situations, some DaemonSet Pods don't get scheduled and fail with the error "0/X nodes are available: 1 Insufficient cpu, X node(s) didn't match Pod's node affinity/selector.".

We have tried adding a priorityClass and nothing really improved. All of the DaemonSets have proper resource definitions. Also, the error seems random, so there is no deterministic way to narrow down other causes.

It's worth mentioning that I found some similar old issues which were technically solved in older versions. Also, this happens on Provisioners regardless of whether the "consolidation" feature is enabled.

Steps to Reproduce the Problem

Install multiple DaemonSets like:

  • aws-node
  • ebs-csi-node
  • efs-csi-node
  • fluentd
  • kube-proxy
  • node-local-dns
  • aws-otel-agent
  • falco
  • falco-exporter
  • prometheus-prometheus-node-exporter

Install Karpenter, then try to autoscale hundreds or thousands of Pods multiple times; the error should appear.
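
A minimal DaemonSet along these lines should be enough for the reproduction (the name, namespace, image, and request values are illustrative, not from our setup; the toleration has to match the Provisioner taint shown below):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent            # illustrative name
  namespace: monitoring          # illustrative namespace
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      tolerations:
        - effect: NoSchedule
          key: key_name          # must tolerate the Provisioner taint below
          operator: Exists
      containers:
        - name: agent
          image: public.ecr.aws/example/agent:latest   # illustrative image
          resources:
            requests:
              cpu: 200m
              memory: 256Mi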

Resource Specs and Logs

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: example
  namespace: karpenter
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/part-of: karpenter
    helm.sh/chart: karpenter-1.0.0
    helm.sh/release: "karpenter"
    helm.sh/heritage: "Helm"
spec:
  ttlSecondsAfterEmpty: 300
  weight: 100
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: karpenter.k8s.aws/instance-hypervisor
      operator: In
      values: ["nitro"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values:
      - m6i.xlarge
      - m6i.2xlarge
      - m6i.4xlarge
      - m6i.8xlarge
      - m6i.12xlarge
  limits:
    resources:
      cpu: "5k"
  providerRef:
    name: default
  ttlSecondsUntilExpired: 10080
  taints:
    - effect: NoSchedule
      key: key_name
      value: value_name
  labels:
    key_name: value_name

There aren't any warnings or errors in the controller logs.

JoseAlvarezSonos added the "bug" label on Oct 28, 2022
@spring1843
Contributor

It's worth mentioning that I found some similar old issues which were technically solved in older versions.

Can you link to those issues?

spring1843 removed the "bug" label on Oct 31, 2022
@spring1843
Contributor

spring1843 commented Oct 31, 2022

Sometimes when this happens it's because the daemonset being created doesn't have a high priority, so it won't cause an eviction on an existing node that has no more capacity. The daemonset controller will still create a pod for that daemonset on the existing node, but it will fail to schedule.

Does this issue happen when you have given them a high priority?
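
For anyone trying this, a minimal sketch of what that could look like; the class name and value here are illustrative (the built-in system-node-critical class discussed later in this thread also works):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high-priority   # illustrative name
value: 1000000                    # user-defined classes must stay below the system-* values
globalDefault: false
description: "High priority for daemonset pods so they can preempt lower-priority workloads"

The DaemonSet would then reference it via spec.template.spec.priorityClassName: daemonset-high-priority.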

@mqsoh

mqsoh commented Nov 1, 2022

I have the same issue as OP; I'm on a new EKS cluster. I have two nodes from the initial managed node group and a third node created by Karpenter. I had a daemonset that was failing to schedule on two of the nodes. One of them was the node created by Karpenter.

Adding the system-node-critical priority to the daemonset caused it to schedule; however, it ended up evicting the Karpenter pods that had been running on the managed node group. After removing the priority, one Karpenter pod was able to reschedule, and the daemonset pod that had been running on that node was no longer able to schedule (because it was replaced by a pod with no priority class).

So, I think you're talking about a different issue.

  1. As the OP pointed out, Karpenter isn't reserving the resources needed by any daemonsets that will run on nodes it creates. In a way it makes sense because Karpenter's creating a node to fulfill the needs of unschedulable pods but the daemonset pods aren't being scheduled until after Karpenter creates the node.

  2. A daemonset pod that can't be scheduled on a node should cause other pods to be evicted to make room. Your solution with the priority class fixes that problem, but not the OP's.

@ellistarn
Contributor

We check the schedulability of daemonsets and include it in simulations. Sometimes, daemonsets with specific scheduling constraints can cause us to not know whether or not they will schedule, so they don't get included in the node sizing decision. However, if you use a high priority on your daemonset, then some of your workload pods will fail to schedule, and will simply get caught/healed in the next provisioning loop (and potentially later consolidated).
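
To illustrate the kind of constraint that can make the simulation unsure (the label key here is hypothetical, not from the OP's setup): a DaemonSet that requires a node label the Provisioner doesn't set can't be assumed to land on the new node, so its resources may be left out of the sizing decision.

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: example.com/special-hardware   # hypothetical label the Provisioner never sets
                    operator: In
                    values: ["true"]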

@mqsoh

mqsoh commented Nov 4, 2022

Ah, you're right. The priority class name works fine for me. My secondary issue was actually that I was adding another daemonset, but my managed node group's nodes were too small to accommodate anything else.

@JoseAlvarezSonos
Author

JoseAlvarezSonos commented Nov 8, 2022

It's worth mentioning that I found some similar old issues which were technically solved in older versions.

Can you link to those issues?

Hello @spring1843, sure, here are the ones I found related to this issue:

Regarding the priority class, yeah, all of our DaemonSets have either system-node-critical or system-cluster-critical, but personally I feel it shouldn't matter: if a Pod, whether from a DaemonSet or a Deployment, is meant to run on a node, Karpenter should proactively account for it instead of waiting for kube to evict some other Pods and then for Karpenter to spawn another node.
For example, in our particular situation, some of these workloads are ML related and have a long termination grace period because of the cleanup they do and the results they push to different places. And once they start, oh boy, they go all-in, so even if kube tries to evict them, it won't succeed; we designed them like this because we needed to make sure each Pod always finished certain things that can take a while.
So when we added the high priority classes, I have to admit the frequency of the issue was reduced, but it still happens from time to time: when the "Pod surge" is small it's much less likely, but when the "Pod surge" is big it will almost certainly happen for at least 1 Pod. Again, it would be nice if Karpenter calculated the proper "daemonset reserved CPU" and took it into account, or if there were a configuration where you could explicitly say "always reserve X% CPU and Y% memory when calculating the required capacity". Not sure if it makes sense, just throwing ideas out there.

Thanks for the follow up btw 👍

@tzneal
Contributor

tzneal commented Nov 9, 2022

Are you using VPA on your daemonsets? We do calculate the sum of all of the daemonset resources to ensure that there is enough space. If a DS is modified after that, it could lead to pods not scheduling.

@cdenneen

Running 0.19.0, still an issue.
Currently I'm hitting it with the DaemonSet for newrelic-bundle-nrk8s-kubelet:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-11-18T02:06:16Z"
    message: '0/7 nodes are available: 1 Insufficient cpu. preemption: 0/7 nodes are
      available: 7 No preemption victims found for incoming pod.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

If I set spec.template.spec.priorityClassName: system-node-critical on the DS, it actually schedules, but it booted another pod from another DS off that node. I don't think it's realistic that every DS will set a priorityClassName, so Karpenter should scale and rebalance instead.

@JoseAlvarezSonos
Author

Hello @tzneal, no, we don't use VPA at all, so the resources are pretty much fixed.

@ellistarn
Contributor

Interesting. Looks like our scheduler doesn't think that the daemonset can schedule. I see your provisioner has

  taints:
    - effect: NoSchedule
      key: key_name
      value: value_name

Does your daemonset tolerate this? Can you share the pod spec?

@JoseAlvarezSonos
Author

@ellistarn yes, all of our daemonsets have:

tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Exists

@ellistarn
Contributor

From the snippet you shared, you'd need

tolerations:
    - effect: NoSchedule
      key: key_name
      operator: Exists

@JoseAlvarezSonos
Author

@ellistarn yes, sorry, my mistake: I changed it to "key_name" in the original post, but the actual key we use is "dedicated".

@ellistarn
Contributor

Can you share your AWSNodeTemplate? It's possible that you're setting things in userdata that can impact scheduling calculations.

@github-actions
Contributor

Labeled for closure due to inactivity in 10 days.

@wdonne

wdonne commented Jan 23, 2023

Hi,

I'm using Karpenter 0.22.1 and I still have this issue. Wasn't it supposed to be fixed in PR #1155?

Best regards,

Werner.

@tzneal
Contributor

tzneal commented Jan 23, 2023

Yes, it's fixed as far as we are aware. The only daemonset resource issue I'm aware of is that we don't currently support a LimitRange supplying default resources for daemonsets correctly, but that's in progress. If you are still experiencing a problem, please file a new issue with logs and daemonset/pod specs so we can investigate.
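
For context, this is roughly the unsupported case; a sketch of a LimitRange that injects default requests (namespace and numbers are illustrative). A DaemonSet in this namespace that omits its own requests gets these defaults at admission time, and those defaulted requests are what reportedly aren't accounted for yet:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-requests
  namespace: monitoring            # illustrative namespace
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container specifies no requests
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container specifies no limits
        cpu: 250m
        memory: 256Mi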

@ospiegel91

ospiegel91 commented Jan 26, 2023

(quotes @JoseAlvarezSonos's earlier comment in full)

I second this notion, and I still experience issues where daemonset pods don't have room on a node due to insufficient CPU.
I also feel Karpenter should reserve some CPU and RAM headroom for daemonsets when determining the node size needed.
All my apps have CPU and RAM requests set. This shouldn't have to happen.

@ahmedfourti

Hello, I am using Karpenter v0.23.0 and having the same problem with a daemonset not being deployed.
I'm seeing this with the Prometheus stack.

@Hronom
Contributor

Hronom commented May 7, 2023

Hello, same here with fluentd on v0.27.3
