
Karpenter is not respecting per-node Daemonsets #1649

Closed
snorlaX-sleeps opened this issue Apr 8, 2022 · 43 comments

@snorlaX-sleeps
Contributor

Version

Karpenter: v0.7.2

Kubernetes: v1.21.5

Context

We run several different daemonsets on a per-node basis: metrics, logging, EBS CSI, secrets-store CSI.
These need to be present on every node as they provide their functionality to every pod on a node.

(This could be a configuration / unset flag issue, looking for more information)

Expected Behavior

When choosing an instance type to provision for pending pods, Karpenter should take into account any Daemonsets that will be running on the node, not just the pending service pods that it will schedule there.

Actual Behavior

This is most noticeable in a brand new cluster, but has also been seen with mature clusters:
When Karpenter brings up a node, it will correctly calculate the resources required to support the new service pod / replica. The aws-node and kube-proxy pods will be started and then the service pod.

When using larger metrics / logging / CSI pods with requests of e.g. 1Gi RAM / 0.5-1 CPU each, these pods will be perpetually stuck in a Pending state and will never start, as there isn't enough room on the node for them.

This was most noticeable when creating a new cluster where the aws-load-balancer-controller was deployed, which only requests 0.05 CPU; even with 3 replicas, Karpenter spun up a t3a.small instance to support them.
Even when adding more replicas (tested with 25), it continued to spin up t3a.small instances, presumably because they were the cheapest option, leaving all the daemonset pods in a Pending state - except on one node that held only a single aws-load-balancer-controller pod, where one of the daemonset pods started and the rest were stuck in Pending.

I believe this is due to how Karpenter is scheduling the pods on the node (something about node-binding in the docs?):

  • As aws-node and kube-proxy are in the system-node-critical priority_class, they are always scheduled first
  • Potentially Karpenter is then scheduling the service pod next
  • The other daemonsets, some with a much higher priority_class, are not scheduled until after the service pod and therefore get stuck in a pending state if there is not enough room for them

Steps to Reproduce the Problem

  • Create a fresh cluster with Karpenter deployed and a default provisioner
  • Create n daemonsets with a highish resource consumption that will run on every node (see the sketch after this list)
  • Create a service deployment for a service with very low resource consumption, using the node selector for a karpenter provisioner
  • Karpenter selects an instance type suitable for the service pods, but not large enough to also support the daemonset(s)
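
For steps 2 and 3, manifests along these lines reproduce the behaviour - a hypothetical sketch where the names, image, and request sizes are illustrative, and the nodeSelector assumes the env: karpenter-default label applied by the provisioner spec below:

# Hypothetical repro manifests - everything here is illustrative
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heavy-agent                 # stands in for Datadog / secrets-store CSI / etc.
spec:
  selector:
    matchLabels:
      app: heavy-agent
  template:
    metadata:
      labels:
        app: heavy-agent
    spec:
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"                # "highish" per-node overhead
            memory: 1Gi
          limits:
            cpu: "1"
            memory: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tiny-service                # stands in for the aws-load-balancer-controller
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tiny-service
  template:
    metadata:
      labels:
        app: tiny-service
    spec:
      nodeSelector:
        env: karpenter-default      # label applied by the Karpenter provisioner below
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 50m
            memory: 50Mi

With manifests like these, Karpenter sizes the node for the 50m/50Mi service pods and the heavy-agent daemonset pods end up Pending.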

Resource Specs and Logs

### Default Provisioner
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
    meta.helm.sh/release-name: karpenter-default-provisioner-chart
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/managed-by: Helm
  name: karpenter-default
spec:
  labels:
    env: karpenter-default
  provider:
    apiVersion: extensions.karpenter.sh/v1alpha1
    kind: AWS
    launchTemplate: <launch_template>
    subnetSelector:
      Service: Private
      kubernetes.io/cluster/<cluster_name>: '*'
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
    - spot
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30

Logs

I do not have access to these logs at this time - but Karpenter was correctly trying to schedule the pending pods, and calculating the instance size based on the service pod requests

@snorlaX-sleeps snorlaX-sleeps added the bug Something isn't working label Apr 8, 2022
@bwagner5
Contributor

bwagner5 commented Apr 8, 2022

I believe this was fixed in v0.7.3. Can you try upgrading to the latest v0.8.1 and see if that fixes this issue?

@dewjam
Contributor

dewjam commented Apr 8, 2022

This may also be related to the issue outlined in #1573, which was fixed by #1616 but not yet released.

@bwagner5
Contributor

bwagner5 commented Apr 8, 2022

This may also be related to the issue outlined in #1573, which was fixed by #1616 but not yet released.

@dewjam is probably correct in what you're seeing @snorlaX-sleeps. This patch will be released next week in v0.8.2

@snorlaX-sleeps
Contributor Author

Thanks @dewjam @bwagner5 - the issue outlined in #1573 sounds like it touches upon some of the same areas, so I am glad there is a fix in 🙂
I will test it out when it gets released and close this issue out after.

@dewjam
Contributor

dewjam commented Apr 13, 2022

Hey @snorlaX-sleeps !
v0.8.2 is out now. When you get a moment, would you be willing to test out the new version to make sure the issue is resolved?

https://github.com/aws/karpenter/releases/tag/v0.8.2

Thanks again for reporting the problem!

@shaunfink

We are still seeing this behaviour after deploying v0.8.2

@snorlaX-sleeps
Contributor Author

Hey @snorlaX-sleeps ! v0.8.2 is out now. When you get a moment, would you be willing to test out the new version to make sure the issue is resolved?

https://github.com/aws/karpenter/releases/tag/v0.8.2

Thanks again for reporting the problem!

@dewjam - I deployed the update on 4/14 but didn't get to test the change. I will get back to you once I can test it

@snorlaX-sleeps
Contributor Author

Hey @dewjam / @bwagner5
After testing v0.8.2 (with Helm Chart v0.8.2) the issue is still occurring.

Using the same initial deployments + daemonsets mentioned in the original comment, I created a new provisioner for these services and redeployed them.
Initially this looked fine, until all the pods had been restarted and some were stuck in a pending state (daemonsets with a lower or no priority_class)

I deleted and redeployed the Helm Charts for all these services to recreate a fresh deployment for testing (the initial conditions for this error) and it is again creating t3a.small instance sizes, with pods failing to be scheduled.

Deleting all the instances created with this provisioner (call it rebalancing) still causes the issue to occur - it brought up 3 t3a.small instances and 2 t3a.medium instances, due to PDBs for some of the services.
Several of these t3a.small nodes are running only the daemonsets and a single service pod - if the pods are batched as 2+, then at least one of the daemonset pods fails to start.

Errors on the daemonset pods are generally one of the following:

Warning  FailedScheduling  38s (x13 over 11m)  default-scheduler  0/9 nodes are available: 1 Insufficient memory, 8 node(s) didn't match Pod's node affinity/selector

or

Warning  FailedScheduling  34s (x6 over 2m17s)  default-scheduler  0/10 nodes are available: 1 Insufficient memory, 1 Too many pods, 9 node(s) didn't match Pod's node affinity/selector

It does seem to pick larger instances that will also support all of the daemonsets; the problem only appears at the smaller instance sizes - but again, that could be down to the CPU/memory thresholds separating the different instance types, rather than the daemonsets being part of the calculation.

As a note, we are using Terraform with the Helm and K8s providers to control the deployment of these services (so everything gets deployed in batches)

@tzneal
Contributor

tzneal commented Apr 19, 2022

@snorlaX-sleeps Can you point me to a helm chart that has one of these daemon sets so I can try to reproduce, or paste the YAML for the daemonset?

@tzneal tzneal added the burning Time sensitive issues label Apr 19, 2022
@snorlaX-sleeps
Contributor Author

snorlaX-sleeps commented Apr 19, 2022

Hey @tzneal

We are deploying the DaemonSets listed in the original description (metrics/Datadog, logging, EBS CSI, secrets-store CSI).

It is mainly Datadog and the secrets-store CSI driver that may not be scheduled / get stuck in Pending.

Datadog resources

{
    cpu_requests    = "200m"
    cpu_limits      = "1"
    memory_requests = "256Mi"
    memory_limits   = "1Gi"
  }

We also deploy the ALB ingress controller as 3 replicas via Helm (which is what produces the small instance sizes), but this could be replaced with any other small pod definition:

{
    cpu_requests    = "50m"
    cpu_limits      = "300m"
    memory_requests = "50Mi"
    memory_limits   = "256Mi"
  }

@dewjam dewjam self-assigned this Apr 19, 2022
@dewjam
Contributor

dewjam commented Apr 19, 2022

Hey there @snorlaX-sleeps ,
I'm working on reproducing this with the daemon sets you provided. I'll let you know what I find.

@snorlaX-sleeps
Contributor Author

Thanks @dewjam - as I said in the initial post, there are ways to work around it (deploying everything, then destroying all the nodes, but that's only good for non-critical / non-production clusters), but I just want to help identify this issue 👍

@dewjam
Contributor

dewjam commented Apr 19, 2022

As a note, we are using Terraform with the Helm and K8s providers to control the deployment of these services (so everything gets deployed in batches)

Do you think it's possible Deployments are being applied to the cluster before some of the DaemonSets?

@dewjam
Contributor

dewjam commented Apr 19, 2022

To give some context on my question above:
Karpenter only takes action when it sees pending pods. For a Deployment, Karpenter will look at the resource requests/limits of pending pods to determine which instance type to launch. When it comes to Pods managed by DaemonSets, Karpenter instead has to look at the PodSpecTemplate for DaemonSets to get resource request/limit info (because the pods have yet to be created).

In short, DaemonSets should be applied before Deployments, otherwise DaemonSet pods may remain in a Pending state.
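
To make that concrete, for a DaemonSet the only place Karpenter can read the per-node overhead from is the pod template - roughly the path marked in this illustrative sketch (name and numbers are made up):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent             # illustrative
spec:
  selector:
    matchLabels:
      app: example-agent
  template:                       # Karpenter reads requests from this PodTemplateSpec,
    metadata:                     # since the pods for a not-yet-launched node don't exist
      labels:
        app: example-agent
    spec:
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 200m             # counted as per-node overhead when sizing a launch
            memory: 256Mi

So if a Deployment's pending pods trigger a launch before a given DaemonSet object exists in the cluster, that DaemonSet's overhead cannot be part of the calculation.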

@snorlaX-sleeps
Contributor Author

snorlaX-sleeps commented Apr 20, 2022

Do you think it's possible Deployments are being applied to the cluster before some of the DaemonSets?

That could potentially be true during the initial deployment; however, I have also seen this happen when deleting an existing Karpenter-managed node to get it recreated.

Karpenter only takes action when it sees pending pods. For a Deployment, Karpenter will look at the resource requests/limits of pending pods to determine which instance type to launch. When it comes to Pods managed by DaemonSets, Karpenter instead has to look at the PodSpecTemplate for DaemonSets to get resource request/limit info (because the pods have yet to be created).

In short, DaemonSets should be applied before Deployments, otherwise DaemonSet pods may remain in a Pending state

I understand.
AFAICT, when deploying the services and seeing this issue, the "bound" service pods (the ones that are scaled) seem to start before daemonsets of a similar priority (or even slightly higher priority), going by the fact they don't schedule.

I can rerun the tests again tomorrow and make sure I haven't missed something (something like missing limits or whatever, it's very late here)

edit: @dewjam

@snorlaX-sleeps
Contributor Author

snorlaX-sleeps commented Apr 20, 2022

@dewjam - I was able to replicate this again today. I will outline the steps below, as this is possibly just an edge case; it seemed to work fine in a natural migration to the new Karpenter provisioner.

What seems to be happening:

  • Deleted all nodes in the provisioner group, causing pods to restart. However due to PDBs, not all pods could be automatically restarted until other replicas were available on the new nodes
  • Some pods went pending, forcing karp to create nodes and batch them
  • Karpenter brought up a t3a.small for a smaller service
  • However, at the time the node became available another deployment (called big_logging_pod below) seemed to put one of its new pods on that node, which started before the daemonsets - this deployment had been waiting for a node to start but I see no mention of it in the Karp logs.

I believe big_logging_pod was not originally scheduled for this node (external-dns was), and it has somehow been scheduled and started before the daemonsets - I will need to look into the priority_class of this pod

Karp logs for the node:

<time>	INFO	controller.provisioning	Batched 1 pods in 1.000245506s	{"commit": "c115db3", "provisioner": "karpenter-on-demand-test"}
<time>	INFO	controller.provisioning	Launched instance: <instance_id>, hostname: <hostname>, type: t3a.small, zone: us-east-1b, capacityType: on-demand	{"commit": "c115db3", "provisioner": "karpenter-on-demand-test"}
<time>	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"1245m","memory":"1330Mi","pods":"8"} from types t3a.small, c4.large, c3.large, c6i.large, c5.large and 306 other(s)	{"commit": "c115db3", "provisioner": "karpenter-on-demand-test"}

Looking at the node in question, it has 6 pods running in total but is missing 3 daemonset pods, which is what leads me to believe the big_logging_pod is the cause of the issue. You can also compare the current resource usage with the expected usage in the logs:

Non-terminated Pods:          (6 in total)
  Namespace                   Name                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                     ------------  ----------  ---------------  -------------  ---
  kube-system                 aws-node-8hwcw                           25m (1%)      0 (0%)      0 (0%)           0 (0%)         19m
  kube-system                 ebs-csi-node-tk84k                       300m (15%)    300m (15%)  384Mi (25%)      384Mi (25%)    18m
  kube-system                 kube-proxy-8sp2t                         100m (5%)     0 (0%)      0 (0%)           0 (0%)         19m
  logging                     fluentd-fpcg4                            450m (23%)    600m (31%)  400Mi (26%)      800Mi (52%)    18m
  logging                     <big_logging_pod>-774bd5cf76-lb4m6         500m (25%)    500m (25%)  600Mi (39%)      900Mi (59%)    16m
  p.                          external-dns-7d4d7c478d-48hjn            50m (2%)      300m (15%)  50Mi (3%)        256Mi (16%)    21m

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1425m (73%)   1700m (88%)
  memory                      1434Mi (94%)  2340Mi (154%)

@dewjam
Contributor

dewjam commented Apr 20, 2022

Thanks for the info. If you don't mind, I would like to see the manifest for <big_logging_pod>-774bd5cf76-lb4m6. Mainly curious about Tolerations and Priority Class (as you mentioned).

I see no mention of it in the Karp logs

Do you have debug logging enabled in Karpenter?

https://karpenter.sh/v0.8.2/development-guide/#change-log-level

@snorlaX-sleeps
Contributor Author

snorlaX-sleeps commented Apr 20, 2022

Do you have debug logging enabled in Karpenter?

No, but I now know that's a thing!!!

Mainly curious about Tolerations and Priority Class (as you mentioned)

It doesn't have tolerations.
It's a 2-replica deployment given the same priority_class as the fluentd daemonset, as they are both part of the logging setup.
I believe it was the priority-class - I downgraded its priority class to be lower than the daemonsets' (secrets-csi and datadog) and everything scheduled correctly.

After that change I have been unable to repeat the error.
I don't believe this was the original issue, as it was more widespread and was fixed by the v0.8.2 release

@dewjam
Contributor

dewjam commented Apr 20, 2022

That makes perfect sense. As you identified, if a pod has a higher priority-class than a DaemonSet pod, then the DS pod could end up displaced, which would leave it in a Pending state.

I think you're right that you probably were seeing multiple issues. Some were fixed by updating priority-class, but others likely were fixed through the v0.8.2 release.

Let me know if we can close this out. Thanks again for reporting the issue!

@snorlaX-sleeps
Contributor Author

Thanks for your help with this @dewjam, sorry for making you test stuff yesterday :(
Thanks for releasing v0.8.2 so quickly, I will close this out

@thanh-tran-techx

I want to reopen the issue, since the DaemonSet related to the EBS CSI driver does not start on nodes provisioned by Karpenter.

@tzneal
Contributor

tzneal commented May 31, 2022

@thanh-tran-techx Can you create a new issue to track the problem you are running into? Yours may be something different.

@pdf

pdf commented Jul 18, 2022

In short, DaemonSets should be applied before Deployments, otherwise DaemonSet pods may remain in a Pending state.

This limitation presents a serious problem - it means that new DaemonSet workloads can never be reliably scheduled on existing clusters. If the workaround is that all DaemonSets need to have a PriorityClass assigned, that should probably be documented at a minimum.

@gchait

gchait commented Sep 22, 2022

Hey, what's the current situation? Is it intended that new DaemonSet pods will be Pending forever instead of making room for them by migrating other pods to new nodes?
Can't this behavior already be justified with the current spec.consolidation.enabled field?

I just tried creating a test DaemonSet, pods were Pending. Then re-applied with spec.template.spec.priorityClassName: system-node-critical, same result.
If there is a clear workaround (not that it's ever logical to make pods Pending by design imo), please document it.

@marcofranssen

Hey, what's the current situation? Is it intended that new DaemonSet pods will be Pending forever instead of making room for them by migrating other pods to new nodes?

Can't this behavior already be justified with the current spec.consolidation.enabled field?

I just tried creating a test DaemonSet, pods were Pending. Then re-applied with spec.template.spec.priorityClassName: system-node-critical, same result.

If there is a clear workaround (not that it's ever logical to make pods Pending by design imo), please document it.

Facing exactly the same issue. Existing nodes will not have room for DaemonSets added by a later Helm chart installation. With consolidation enabled, nothing currently triggers Karpenter to create a new, larger node to replace the old one. Is anything else required to make this happen automatically?

@ellistarn ellistarn reopened this Feb 1, 2023
@jonathan-innis jonathan-innis removed the burning Time sensitive issues label Feb 6, 2023
@billrayburn billrayburn assigned njtran and unassigned dewjam Mar 23, 2023
@beatrizdemiguelperez

Facing the same issue. Any update?

@missourian55

Facing this problem. Not sure Karpenter is ready for prime time (Honoring daemonset is foundational)

@tzneal
Contributor

tzneal commented Apr 25, 2023

Facing this problem. Not sure Karpenter is ready for prime time (Honoring daemonset is foundational)

Karpenter does calculate daemonset resources and takes them into account for any future node launches. What it currently does not do is terminate running nodes that were launched prior to the daemonset being created if a newly created daemonset pod can't run on the node. This behavior may not be desired by all customers and normally isn't an issue since daemonsets typically have a high priority and will evict running pods so that they can run.

@missourian55

@tzneal In my case these are brand new nodes; I am not patching old nodes with a new daemonset.

In the screenshot referenced below, the karpenter namespace runs on a Fargate profile.
[screenshot: daemonset_stuck_pending]

@tzneal
Contributor

tzneal commented Apr 25, 2023

@missourian55 If you think that Karpenter is not calculating daemonset resources correctly for daemonsets that existed prior to the node being launched, please file another issue and include Karpenter logs and daemonset specs. This particular issue is about daemonsets that were created after the nodes were already launched.

@marcofranssen

You can easily resolve this by assigning your daemonsets a higher priorityClass. The k8s scheduler will then evict other pods to make room for the daemonset and move those other pods to new nodes.

Eventually Karpenter's consolidation feature might even decide to merge 2 nodes.

All works perfectly as long as you give daemonsets a higher priority.
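
For anyone else hitting this, a minimal sketch of that workaround (the class name and value are illustrative - the value only needs to be higher than that of the workloads you are willing to displace):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-priority        # illustrative name
value: 1000000                    # higher than ordinary workloads, below the system-* classes
globalDefault: false
description: "Lets DaemonSet pods preempt ordinary pods instead of sitting in Pending."
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent             # illustrative
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      priorityClassName: daemonset-priority   # set in the DaemonSet's pod template
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 200m
            memory: 256Mi

The preempted pods then go Pending themselves, which gives Karpenter a signal to provision replacement capacity for them.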

@missourian55

@marcofranssen Thank you. I will look into the priority class. However, for EKS add-ons like amazon-guardduty I am not sure how to set the priority class (we enable the add-on in the console or a CloudFormation template and it just deploys the daemonsets; I don't have control over it).

@pdf

pdf commented Apr 25, 2023

normally isn't an issue since daemonsets typically have a high priority and will evict running pods so that they can run.

There's nothing inherent to daemonsets that makes this statement true, users have to specifically assign a priorityClass to their daemonsets to make it so.

@jonathan-innis
Contributor

There's nothing inherent to daemonsets that makes this statement true

That's a good point. I think generally we see that users tend to put priority classes on their daemonsets to guarantee that they schedule. Otherwise it's possible, even if the DS is running ahead of time, for it to get preempted by some other pod with a higher priority, and Karpenter won't trigger a scale-up for that DS that is now pending.

@pdf

pdf commented Apr 26, 2023

Karpenter won't trigger a scale-up for that DS that is now pending.

This seems like the core issue, no? Requiring a high priorityclass for daemonsets is a workaround for this behaviour. I guess it's possible that users could encounter a similar condition on non-karpenter clusters, but since karpenter tries to right-size nodes users are significantly more likely to encounter it on karpenter nodes.

@ellistarn
Contributor

ellistarn commented Apr 26, 2023

This seems like the core issue, no?

I agree with this. It's not something that Karpenter (or other autoscalers) support today, but I'd love to see some design work to make this happen.

There's nothing inherent to daemonsets that makes this statement true

I wonder if it's worth exploring a KEP to implement support for this upstream.

@jonathan-innis
Contributor

I'd love to see some design work to make this happen

Agreed. We've had some ideas around consolidation/drift that would enable us to discover that a DS is stuck in a pending state and then spin up a new node with expanded capacity, so that all the pods that need to be scheduled on the node will get scheduled.

@billrayburn billrayburn assigned jonathan-innis and unassigned njtran Apr 26, 2023
@Hronom
Contributor

Hronom commented May 7, 2023

@jonathan-innis It would be great if, even when you deploy a DS afterwards, Karpenter were able to detect this situation and re-combine pods and nodes to make sure that the new DS runs as expected.

When do you plan to release this new logic?

@maximveksler

We're seeing this issue in our dev environment. As we deploy our telemetry sensor as a daemon set, this issue hit us hard: no new nodes are added for additional capacity when a developer deploys into a newly created namespace.

Are there recommended workarounds for such cases? We have, for example, ~20 daemonsets running per node. It's possible that sum(daemonset.requirements) will outgrow the node capacity. This case does not initiate a scale-up in Karpenter and leaves the cluster in an unusable state.

@FernandoMiguel
Contributor

@maximveksler you should set your provisioner to have at least the capacity to host your daemonsets

@maximveksler

maximveksler commented Jul 4, 2023

@FernandoMiguel

Thanks. Do you mean setting the machine type selection so that there is always room left for the daemonsets, or is there a way to configure the provisioner to allocate "extra" space during planning (which is the better workaround option, in my view)?

An example here would be helpful IMHO.

For reference here is my installation https://gist.github.com/maximveksler/38ec0cefa0ca2acccab748e71e5aebc0

@FernandoMiguel
Contributor

- key: "karpenter.k8s.aws/instance-cpu"
      operator: Gt
      values:
        - '10' 

This is all I would tweak.
You know what you always want to load on your nodes in terms of DS, so set a minimum amount of CPU and/or RAM in your provisioner.
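
Concretely, on the v1alpha5 API used earlier in this issue, that could look something like the snippet below. The thresholds are illustrative, Gt excludes the listed value itself, and I believe karpenter.k8s.aws/instance-memory is expressed in MiB (worth verifying against the docs for your version):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: karpenter-default
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
    - spot
  - key: "karpenter.k8s.aws/instance-cpu"    # floor on vCPUs so the DaemonSets always fit
    operator: Gt
    values:
    - "3"                                    # i.e. 4 vCPUs or more; illustrative
  - key: "karpenter.k8s.aws/instance-memory" # floor on memory so the DaemonSets always fit
    operator: Gt
    values:
    - "8191"                                 # i.e. 8 GiB (8192 MiB) or more; illustrative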

@jonathan-innis
Contributor

jonathan-innis commented Jul 5, 2023

I'm closing this issue since it looks like the original issue was resolved by v0.8.2. We are tracking some of the discussion around daemonset-driven consolidation through: kubernetes-sigs/karpenter#731 and we have updated our docs around daemonset priority and ordering here. It will make it easier for us to track if we continue most of this discussion over in kubernetes-sigs/karpenter#731.
