
Karpenter not respecting daemonsets resources #2751

Closed

JoseAlvarezSonos opened this issue Oct 28, 2022 · 20 comments

@JoseAlvarezSonos

Version

Karpenter Version: v0.18.1

Kubernetes Version: v1.23.9 (EKS)

Expected Behavior

When scaling up, Karpenter should calculate the "reserved" CPU needed by all DaemonSets across all namespaces, and from there calculate the number of instances needed for the unscheduled pods. The expected behaviour is that on every new node all DaemonSets end up running without any resource issues.
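
For example (illustrative numbers only): if the DaemonSets on a node together request about 1.5 vCPU, then on an m6i.xlarge (4 vCPU) only roughly 2.5 vCPU are left for workload pods, so Karpenter should already subtract that overhead when deciding how many and which instances to launch.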

Actual Behavior

From time to time we have scale-ups where many pods are created at the same time, anywhere from 15 to 4k Pods. In these situations, some DaemonSet Pods don't get scheduled and fail with the error "0/X nodes are available: 1 Insufficient cpu, X node(s) didn't match Pod's node affinity/selector.".

We have tried adding a priorityClass and nothing really improved. All of the DaemonSets have proper resource definitions. Also, the error seems random, so there is no deterministic way to narrow down other causes.

It's worth mentioning that I found some similar old issues which were technically solved in older versions. Also, this happens on Provisioners regardless of whether the "consolidation" feature is enabled.

Steps to Reproduce the Problem

Install multiple DaemonSets like:

  • aws-node
  • ebs-csi-node
  • efs-csi-node
  • fluentd
  • kube-proxy
  • node-local-dns
  • aws-otel-agent
  • falco
  • falco-exporter
  • prometheus-prometheus-node-exporter

Install Karpenter, then try to autoscale hundreds or thousands of Pods multiple times; the error should appear.
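
A minimal DaemonSet along these lines should be enough for the reproduction (the name, namespace, image, and request values are illustrative, not from our setup; the toleration has to match the Provisioner taint shown below):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent            # illustrative name
  namespace: monitoring          # illustrative namespace
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      tolerations:
        - effect: NoSchedule
          key: key_name          # must tolerate the Provisioner taint below
          operator: Exists
      containers:
        - name: agent
          image: public.ecr.aws/example/agent:latest   # illustrative image
          resources:
            requests:
              cpu: 200m
              memory: 256Mi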

Resource Specs and Logs

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: example
  namespace: karpenter
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/version: "1.0.0"
    app.kubernetes.io/part-of: karpenter
    helm.sh/chart: karpenter-1.0.0
    helm.sh/release: "karpenter"
    helm.sh/heritage: "Helm"
spec:
  ttlSecondsAfterEmpty: 300
  weight: 100
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: karpenter.k8s.aws/instance-hypervisor
      operator: In
      values: ["nitro"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values:
      - m6i.xlarge
      - m6i.2xlarge
      - m6i.4xlarge
      - m6i.8xlarge
      - m6i.12xlarge
  limits:
    resources:
      cpu: "5k"
  providerRef:
    name: default
  ttlSecondsUntilExpired: 10080
  taints:
    - effect: NoSchedule
      key: key_name
      value: value_name
  labels:
    key_name: value_name

There aren't any warnings or errors in the controller logs.

JoseAlvarezSonos added the "bug" label on Oct 28, 2022
@spring1843
Contributor

It's worth mentioning that I found some similar old issues which were technically solved in older versions.

Can you link to those issues?

spring1843 removed the "bug" label on Oct 31, 2022
@spring1843
Contributor

spring1843 commented Oct 31, 2022

Sometimes when this happens it's because the daemonset being created doesn't have a high priority, so it won't cause an eviction on an existing node that has no more capacity. The daemonset controller will still create a pod for that daemonset on the existing node, but it will fail to schedule.

Does this issue happen when you have given them a high priority?
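
For anyone trying this, a minimal sketch of what that could look like; the class name and value here are illustrative (the built-in system-node-critical class discussed later in this thread also works):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high-priority   # illustrative name
value: 1000000                    # user-defined classes must stay below the system-* values
globalDefault: false
description: "High priority for daemonset pods so they can preempt lower-priority workloads"

The DaemonSet would then reference it via spec.template.spec.priorityClassName: daemonset-high-priority.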

@mqsoh

mqsoh commented Nov 1, 2022

I have the same issue as OP; I'm on a new EKS cluster. I have two nodes from the initial managed node group and a third node created by Karpenter. I had a daemonset that was failing to schedule on two of the nodes. One of them was the node created by Karpenter.

Adding the system-node-critical priority to the daemonset caused it to schedule; however, it ended up evicting the Karpenter pods that had been running on the managed node group. After removing the priority, one Karpenter pod was able to reschedule, and the daemonset pod that had been running on that node was no longer able to schedule (because it was replaced by a pod with no priority class).

So, I think you're talking about a different issue.

  1. As the OP pointed out, Karpenter isn't reserving the resources needed by any daemonsets that will run on nodes it creates. In a way it makes sense because Karpenter's creating a node to fulfill the needs of unschedulable pods but the daemonset pods aren't being scheduled until after Karpenter creates the node.

  2. A daemonset pod that can't be scheduled on a node should cause other pods to be evicted to make room. Your solution with the priority class fixes that problem, but not the OP's.

@ellistarn
Contributor

We check the schedulability of daemonsets and include it in simulations. Sometimes, daemonsets with specific scheduling constraints can cause us to not know whether or not they will schedule, so they don't get included in the node sizing decision. However, if you use a high priority on your daemonset, then some of your workload pods will fail to schedule, and will simply get caught/healed in the next provisioning loop (and potentially later consolidated).
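
To illustrate the kind of constraint that can make the simulation unsure (the label key here is hypothetical, not from the OP's setup): a DaemonSet that requires a node label the Provisioner doesn't set can't be assumed to land on the new node, so its resources may be left out of the sizing decision.

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: example.com/special-hardware   # hypothetical label the Provisioner never sets
                    operator: In
                    values: ["true"]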

@mqsoh

mqsoh commented Nov 4, 2022

Ah, you're right. The priority class name works fine for me. My secondary issue was actually that I was adding another daemonset, but my managed node group's nodes were too small to accommodate anything else.

@JoseAlvarezSonos
Author

JoseAlvarezSonos commented Nov 8, 2022

It's worth mentioning that I found some similar old issues which were technically solved in older versions.

Can you link to those issues?

Hello @spring1843, sure, here are the ones I found related to this issue:

Regarding the priority class, yeah, all of our DaemonSets have either system-node-critical or system-cluster-critical, but personally I feel it shouldn't matter: if a Pod, whether from a DaemonSet or a Deployment, is meant to run on a node, Karpenter should proactively account for it instead of waiting for kube to evict some other Pods and then for Karpenter to spawn another node.
For example, in our particular situation, some of these workloads are ML related and have a long termination grace period because of the cleanup they do and the results they push to different places. And once they start, oh boy, they go all-in, so even if kube tries to evict them, it won't succeed; we designed them like this because we needed to make sure each Pod always finished certain things that can take a while.
So when we added the high priority classes, I have to admit the frequency of the issue was reduced, but it still happens from time to time: when the "Pod surge" is small it's much less likely, but when the "Pod surge" is big it will almost certainly happen for at least 1 Pod. Again, it would be nice if Karpenter calculated the proper "daemonset reserved CPU" and took it into account, or if there were a configuration where you could explicitly say "always reserve X% CPU and Y% memory when calculating the required capacity". Not sure if it makes sense, just throwing ideas out there.

Thanks for the follow up btw 👍

@tzneal
Contributor

tzneal commented Nov 9, 2022

Are you using VPA on your daemonsets? We do calculate the sum of all of the daemonset resources to ensure that there is enough space. If a DS is modified after that, it could lead to pods not scheduling.

@cdenneen

Running 0.19.0, still an issue.
Currently I'm hitting it with the DaemonSet for newrelic-bundle-nrk8s-kubelet:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-11-18T02:06:16Z"
    message: '0/7 nodes are available: 1 Insufficient cpu. preemption: 0/7 nodes are
      available: 7 No preemption victims found for incoming pod.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

If I set spec.template.spec.priorityClassName: system-node-critical on the DS, it actually schedules, but it booted another pod from another DS off that node. I don't think it's realistic that every DS will set a priorityClassName, so Karpenter should scale and rebalance instead.

@JoseAlvarezSonos
Author

Hello @tzneal, no, we don't use VPA at all, so the resources are pretty much fixed.

@ellistarn
Contributor

Interesting. Looks like our scheduler doesn't think that the daemonset can schedule. I see your provisioner has

  taints:
    - effect: NoSchedule
      key: key_name
      value: value_name

Does your daemonset tolerate this? Can you share the pod spec?

@JoseAlvarezSonos
Author

@ellistarn yes, all of our daemonsets have:

tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Exists

@ellistarn
Contributor

From the snippet you shared, you'd need

tolerations:
    - effect: NoSchedule
      key: key_name
      operator: Exists

@JoseAlvarezSonos
Author

@ellistarn yes, sorry, my mistake: I changed it to "key_name" in the original post, but the actual key we use is "dedicated".

@ellistarn
Contributor

Can you share your AWSNodeTemplate? It's possible that you're setting things in userdata that can impact scheduling calculations.

@github-actions
Contributor

Labeled for closure due to inactivity in 10 days.

@wdonne

wdonne commented Jan 23, 2023

Hi,

I'm using Karpenter 0.22.1 and I still have this issue. Wasn't it supposed to be fixed in PR #1155?

Best regards,

Werner.

@tzneal
Contributor

tzneal commented Jan 23, 2023

Yes, it's fixed as far as we are aware. The only daemonset resource issue I'm aware of is that we don't currently support a LimitRange supplying default resources for daemonsets correctly, but that's in progress. If you are still experiencing a problem, please file a new issue with logs and daemonset/pod specs so we can investigate.
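
For context, this is roughly the unsupported case; a sketch of a LimitRange that injects default requests (namespace and numbers are illustrative). A DaemonSet in this namespace that omits its own requests gets these defaults at admission time, and those defaulted requests are what reportedly aren't accounted for yet:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-requests
  namespace: monitoring            # illustrative namespace
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container specifies no requests
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container specifies no limits
        cpu: 250m
        memory: 256Mi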

@ospiegel91

ospiegel91 commented Jan 26, 2023

(quotes @JoseAlvarezSonos's earlier comment in full)

I second this notion, and I still experience issues where daemonset pods don't have room on a node due to insufficient CPU.
I also feel Karpenter should reserve some CPU and RAM headroom for daemonsets when determining the node size needed.
All my apps have CPU and RAM requests set. This shouldn't have to happen.

@ahmedfourti

Hello, I am using Karpenter v0.23.0 and having the same problem with a daemonset not being deployed.
I'm seeing this with the Prometheus stack.

@Hronom
Contributor

Hronom commented May 7, 2023

Hello, same here with fluentd on v0.27.3
