
Karpenter workload consolidation/defragmentation #1091

Closed
felix-zhe-huang opened this issue Jan 6, 2022 · 60 comments · Fixed by #2123
Assignees
Labels
api (Issues that require API changes) · coming-soon (Issues that we're working on and will be released soon) · consolidation · documentation (Improvements or additions to documentation) · feature (New feature or request)

Comments

@felix-zhe-huang
Contributor

Tell us about your request
As a cluster admin, I want Karpenter to consolidate the application workloads by moving pods onto fewer worker nodes and scaling down the cluster, so that I can improve the cluster's resource utilization.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In an under-utilized cluster, application pods are spread across worker nodes that have excess resources. This waste can be reduced by carefully packing the pods onto a smaller number of right-sized worker nodes. The current version of Karpenter does not support rearranging pods to continuously improve cluster utilization. Workload consolidation is the important missing component that completes the cluster scaling life-cycle management loop.

This workload consolidation feature is nontrivial because of the following coupled problems.

  • Pod Packing:
    The pod packing problem determines which pods should be hosted together on the same worker node according to their taints and constraints. The goal is to produce fewer, well-balanced groups of pods that can be hosted by right-sized worker nodes.
  • Instance Type Selection:
    Given a pod packing solution, the instance type selection problem determines which combination of instance types should be used to host the pods after the rearrangement.

The above problems are deeply coupled, so the solution to one affects the other. Together they form a variant of the bin packing problem, which is NP-complete. A practical solution will implement a fast heuristic that exploits the special structure of the problem for specific use cases and user preferences. Therefore, thorough discussion with customers is important.

Are you currently working around this issue?
Currently Karpenter will scale down empty nodes automatically. However, it does not actively move pods around to create empty nodes.
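
For reference, the existing empty-node scale-down is driven by the provisioner's ttlSecondsAfterEmpty field; a minimal sketch (values are illustrative):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Nodes that have held no non-daemonset pods for this long are terminated.
  ttlSecondsAfterEmpty: 30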

Additional context
Currently the workload consolidation feature is in the design phase. We should gather input from customers about their objectives and preferences.

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@felix-zhe-huang felix-zhe-huang added feature New feature or request consolidation help-wanted Extra attention is needed labels Jan 6, 2022
@ellistarn ellistarn added api Issues that require API changes documentation Improvements or additions to documentation and removed help-wanted Extra attention is needed labels Jan 6, 2022
@matti

matti commented Jan 7, 2022

what about https://github.com/kubernetes-sigs/descheduler which already implements some of this?

@olemarkus
Contributor

I've been using descheduler for this, but note that you can only use the HighNodeUtilization strategy. If you enable the other strategies, you're in for a great time of Karpenter illegally binding pods to nodes with taints they don't tolerate (the unready taints, for example), descheduler terminating those pods, Karpenter spinning up new pods due to #1044, descheduler again evicting those pods ...

Also note that this is really sub-optimal compared to what Cluster Autoscaler (CAS) does. With CAS I can set a threshold of 70-80% and it will really condense the cluster. CAS gets scheduling mostly correct because it simulates kube-scheduler.
With descheduler, however, you need really low thresholds, because it will just evict everything on a node once you hit the threshold, hoping the pods reschedule elsewhere. So with aggressive limits, it will just terminate pods over and over again.
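
(For anyone trying this, a policy restricted to HighNodeUtilization looks roughly like the following sketch; the thresholds are illustrative and, for the reasons above, need to stay low:)

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes whose requests fall below these percentages are treated as
        # under-utilized and their pods are evicted in the hope that they
        # pack onto other nodes.
        thresholds:
          "cpu": 20
          "memory": 20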

I think the CAS approach is the correct one. I need to really condense the cluster, and the only way to do that is to simulate what will happen after evictions. At least evicting pods that will just reschedule is meaningless.

I also think changing instance types is overkill. If I have one remaining node with some CPU to spare, that is fine. At the very least, I would like a strategy that tries to put pods on existing nodes first, and then checks, for the remaining node (the one with the lowest utilisation), whether another instance type fits.

A fancier thing to do, which I think comes much later, is to look at the overall memory vs CPU balance in the cluster. E.g. if the cluster shifts from generally CPU-bound to memory-bound, it would be nice if Karpenter could adjust for that. But then we hit the knapsack-like problem, which can get a bit tricky to work out.

@akestner akestner changed the title Karpenter workload consolidation/defragmentation feature Karpenter workload consolidation/defragmentation Jan 13, 2022
@Anto450

Anto450 commented Feb 4, 2022

Currently Karpenter will scale down empty nodes automatically. However, it does not actively move pods around to create empty nodes
We are hesitant to adopt Karpenter because of this missing major feature, which is needed to ensure we run right-sized instances at all times. Can someone let me know the progress here? We need this feature badly.

@stevehipwell
Contributor

I'd like Karpenter to terminate (or request termination of) a node when it has a low density of pods and there is another node that could take its pods (#1491).

@imagekitio

Karpenter looks exciting, but for large-scale K8s cluster deployments, this is pretty much a prerequisite.

Is there any discussion or design document about the possible approaches that can be taken for bin packing of the existing workload?

@ellistarn
Contributor

ellistarn commented Apr 6, 2022

We're currently laying the foundation by implementing the remaining scheduling spec (affinity/antiaffinity). After that, we plan to make rapid progress on defrag design. I expect we'll start small (e.g. 1 node at a time, simple compactions) and get more sophisticated over time. This is pending design, though.

In the short term, you can combine PodDisruptionBudgets and ttlSecondsUntilExpired to achieve a soft defrag.
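
A minimal sketch of that combination (all values illustrative): the provisioner expires nodes on a schedule so their pods get repacked, while a PodDisruptionBudget limits how many replicas move at once.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Nodes older than this are drained and replaced, forcing pods to be rescheduled.
  ttlSecondsUntilExpired: 604800   # 7 days
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                 # hypothetical workload
spec:
  maxUnavailable: 1                # only one replica may be evicted at a time
  selector:
    matchLabels:
      app: my-app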

@jcogilvie

In the end my requirement is the same as the others in this thread, but one of the reasons for the requirement, which I have not yet seen captured, is that minimizing (to some reasonable limit) the number of hosts reduces the cost of anything billed per host (e.g., Datadog APM).

@dragosrosculete

Hi, this is really important; Karpenter needs to have the same functionality as cluster-autoscaler. This is preventing me from switching to Karpenter.

@ryan4yin

ryan4yin commented Apr 25, 2022

We also need this feature; the cluster's cost increased when we migrated to Karpenter due to the increased node count.

In our scenario, we use Karpenter in an EMR on EKS cluster, where EMR creates CRs (job batches) on the EKS cluster and those CRs create pods, so we cannot simply add PodDisruptionBudgets for those workloads.

@mandeepgoyat

mandeepgoyat commented Jun 3, 2022

We also need this feature. In our scenario, we would like to terminate under-utilized nodes by actively moving pods around to create empty nodes.

Any idea about its release date?

@BrewedCoffee

BrewedCoffee commented Jun 3, 2022

In the short term, you can combine PodDisruptionBudgets and ttlSecondsUntilExpired to achieve a soft defrag.

Am I understanding correctly: this could in theory terminate pods that had recently been spun up on an existing node?

Are there any workarounds right now to move affected pods to another node so there is no interruption? Or would this only be achieved with the tracked feature?

@ellistarn
Contributor

We've laid down most of the groundwork for this feature with in-flight nodes, not binding pods, topology, affinity, etc. This is one of our highest priorities right now, and we'll be sure to make early builds available to y'all. If you're interested in discussing the design, feel free to drop by https://github.com/aws/karpenter/blob/main/WORKING_GROUP.md

@tzneal tzneal self-assigned this Jun 12, 2022
@dennisme

public.ecr.aws/karpenter/controller:571b507deb9e8fad8b4d7189ba8cdc1bf095d465
public.ecr.aws/karpenter/webhook:571b507deb9e8fad8b4d7189ba8cdc1bf095d465

Running the previous helm command produces a YAML file with these images, which cause the webhook to fail. After swapping out the images it worked fine.

$ ag image:
karpenter.yaml
305:          image: public.ecr.aws/karpenter/controller:v0.13.2@sha256:af463b2ab0a9b7b1fdf0991ee733dd8bcf5eabf80907f69ceddda28556aead31
344:          image: public.ecr.aws/karpenter/webhook:v0.13.2@sha256:e10488262a58173911d2b17d6ef1385979e33334807efd8783e040aa241dd239

error

Status:Failure,Message:mutation failed: cannot decode incoming new object: json: unknown field \"consolidation\",Reason:BadRequest,Details:nil,Code:400,}"}

@tzneal
Contributor

tzneal commented Jul 25, 2022

@dennisme I can't reproduce this:

$ export COMMIT="63e6d43d0c6b30b260c63dc01afcb58baabc8020"
$ helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version v0-${COMMIT} --namespace karpenter \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set clusterName=${CLUSTER_NAME} \
  --set clusterEndpoint=${CLUSTER_ENDPOINT} \
  --set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --wait
Release "karpenter" has been upgraded. Happy Helming!
NAME: karpenter
LAST DEPLOYED: Mon Jul 25 15:37:46 2022
NAMESPACE: karpenter
STATUS: deployed
REVISION: 55
TEST SUITE: None

$ k get deployment -n karpenter -o yaml | grep image
          image: public.ecr.aws/karpenter/controller:63e6d43d0c6b30b260c63dc01afcb58baabc8020@sha256:b66a0943cb07f2fcbd9dc072bd90e5dc3fd83896a95aedf3b41d994172d1f96b
          imagePullPolicy: IfNotPresent
          image: public.ecr.aws/karpenter/webhook:63e6d43d0c6b30b260c63dc01afcb58baabc8020@sha256:208a715b5774e10d1f70f48ef931c19541ef3c2d31c52f04e648af53fd767692
          imagePullPolicy: IfNotPresent

@dennisme

@tzneal Yep, it was a local Helm version issue and the OCI registry vs https://charts.karpenter.sh/. My output is consistent with yours. Thanks for the reply.

@anguslees

anguslees commented Aug 5, 2022

However, I faced the issue of Karpenter leaving no room at all on the nodes, so the cron jobs' runs make the cluster scale every other minute. The cluster has been scaling a node up/down every ~90 seconds on average for the last day. I am working around this issue by having the cluster-overprovisioner take some resources at the moment.

@offzale: I think you want to increase your provisioner.spec.ttlSecondsAfterEmpty to longer than your cron job period. This will keep the idle nodes around from the 'last' cron job run.
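
For example (illustrative value), if the cron jobs fire every 5 minutes, something like the following keeps the otherwise-empty node around between runs:

spec:
  # Keep empty nodes for 10 minutes, longer than the 5-minute cron interval,
  # so the next run lands on the existing node instead of triggering a scale-up.
  ttlSecondsAfterEmpty: 600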

Alternatively, maybe shutting them down and recreating them is actually the right thing to do? This depends on the time intervals involved, cost of idle resources, desired 'cold' responsiveness, and instance shutdown/bootup overhead. Point being that I don't think there's a general crystal-ball strategy here that we can use for everyone... Unfortunately, I think you will need to tune it based on what you know about your predictable-future-workload and your desired response delay vs cost tradeoffs.

@liorfranko

liorfranko commented Aug 8, 2022

I restarted the pod after a couple of minutes and that did the trick indeed. Thanks!

I will leave it running and keep an eye on it to gather some feedback. What I have noticed so far is an increase in CPU usage of about +640% compared to the resources it normally takes.

Also, I believe it would be handy to have some sort of threshold configuration, e.g. I want Karpenter to consider that a node cannot take any further load at 85% resource allocation, to leave some room for cron job runs. Otherwise, I could imagine the cluster constantly scaling up and down every time a few cron jobs run at once. But this could be a future improvement of course :)

I had a couple of things I needed to solve before testing Karpenter, and once I solved them, here is the first comparison:

I'm testing it by replacing an ASG with ~60 m5n.8xlarge instances.
I’m running 27 deployments and 1 daemonset, a total of ~430 very diverse pods.
Each deployment has anti-affinity so that each pod will not be deployed with other pods from the same deployment on the same node.
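
(For reference, the anti-affinity on each deployment is roughly of this shape; the labels are illustrative, not the actual manifests:)

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: example-consumer   # hypothetical label, one rule per deployment
        topologyKey: kubernetes.io/hostname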

The total CPU request of all the pods is ~1800, the total Memory request is 6.8TB.
On the ASG, the allocatable CPU was 1950 (150 unallocated cores), and the allocatable Memory 7.93TB (1.13TB unallocated Memory)

With Karpenter the allocatable CPU is 2400 (510 unallocated cores), and the allocatable Memory 13.8TB (6.18TB unallocated Memory)

Here is the diversity of the nodes with Karpenter:

   2 r6a.8xlarge
   2 r5.2xlarge
   1 c4.8xlarge
   2 r5ad.4xlarge
   3 r5ad.8xlarge
   6 r5ad.4xlarge
   6 r5ad.8xlarge
   1 r5ad.4xlarge
   1 r5ad.8xlarge
   1 r5ad.4xlarge
   2 r5ad.8xlarge
   1 c6i.12xlarge
   2 r5ad.4xlarge
   1 r5ad.8xlarge
   1 r5ad.4xlarge
   4 r5ad.8xlarge
   1 r5ad.4xlarge
   3 r5ad.8xlarge
   1 c6i.12xlarge
   1 r5ad.4xlarge
   3 r5ad.8xlarge
   1 r5ad.4xlarge
   1 r5n.4xlarge
   1 r5.2xlarge
   1 r5dn.4xlarge
   1 r5.2xlarge
   2 c6i.12xlarge
   1 r6a.12xlarge
   2 c6i.12xlarge
   2 r6a.8xlarge
   1 r5.2xlarge
   5 c6i.12xlarge
   1 r6a.8xlarge
   1 r5.2xlarge
   1 r5n.4xlarge
   3 c6i.12xlarge
   1 r6a.8xlarge
   2 c6i.12xlarge
   1 r5.2xlarge
   1 r5n.4xlarge
   2 r5.2xlarge
   1 r5n.4xlarge
   1 r6a.8xlarge
   1 r5n.4xlarge
   1 c6i.12xlarge
   2 r5.2xlarge
   1 c5.2xlarge
   2 c6i.12xlarge

@tzneal
Contributor

tzneal commented Aug 8, 2022

Thanks for the info @liorfranko. What does your provisioner look like? Karpenter implements two forms of consolidation. The first is where it will delete a node if the pods on that node can run elsewhere. Due to the anti-affinity rules on your pods, it sounds like this isn't possible.

The second is where it will replace a node with a cheaper node if possible. This should be happening in your case unless the provisioner is overly constrained to larger types only. Since you've got a few 2xlarge types there, that doesn't appear to be the case either.

That looks to be 85 nodes that Karpenter has launched. Do your workloads have preferred anti-affinities or node selectors?

@liorfranko

Here is the provisioner spec:

spec:
  consolidation:
    enabled: true
  labels:
    intent: apps
    nodegroup-name: delivery-network-consumers-spot
    project: mobile-delivery-network-consumers
  providerRef:
    name: delivery-network-consumers-spot
  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
        - us-east-1b
        - us-east-1d
        - us-east-1e
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values:
        - nano
        - micro
        - small
        - large
        - 16xlarge
        - 18xlarge
        - 24xlarge
        - 32xlarge
        - 48xlarge
        - metal
    - key: karpenter.k8s.aws/instance-family
      operator: NotIn
      values:
        - t3
        - t3a
        - im4gn
        - is4gen
        - i4i
        - i3
        - i3en
        - d2
        - d3
        - d3en
        - h1
        - c4
        - r4
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64

Here is an example of possible consolidation:

kubectl describe nodes ip-10-206-7-199.ec2.internal


Name:               ip-10-206-7-199.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=r5.2xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    intent=apps
                    karpenter.k8s.aws/instance-cpu=8
                    karpenter.k8s.aws/instance-family=r5
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-memory=65536
                    karpenter.k8s.aws/instance-pods=58
                    karpenter.k8s.aws/instance-size=2xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/provisioner-name=delivery-network-consumers-spot
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-206-7-199.ec2.internalec2ssa.info
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=r5.2xlarge
                    nodegroup-name=delivery-network-consumers-spot
                    project=mobile-delivery-network-consumers
                    topology.ebs.csi.aws.com/zone=us-east-1b
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-08f8f5e78bc941095"}
                    node.alpha.kubernetes.io/ttl: 15
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 08 Aug 2022 20:24:27 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-206-7-199.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 08 Aug 2022 20:41:06 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:06 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:06 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:06 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 08 Aug 2022 20:38:36 +0300   Mon, 08 Aug 2022 20:25:36 +0300   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.206.7.199
  ExternalIP:   54.197.31.63
  Hostname:     ip-10-206-7-199.ec2.internal
  InternalDNS:  ip-10-206-7-199.ec2.internal
  InternalDNS:  ip-10-206-7-199.ec2ssa.info
  ExternalDNS:  ec2-54-197-31-63.compute-1.amazonaws.com
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         8
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      65047656Ki
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         7910m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      64030824Ki
  pods:                        58
System Info:
  Machine ID:                 ec2d577aa3820e3e2f33858018d3bd99
  System UUID:                ec2d577a-a382-0e3e-2f33-858018d3bd99
  Boot ID:                    8ce73586-0ff6-4c6b-bbef-01172b11230c
  Kernel Version:             5.4.204-113.362.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.13
  Kubelet Version:            v1.20.15-eks-99076b2
  Kube-Proxy Version:         v1.20.15-eks-99076b2
ProviderID:                   aws:///us-east-1b/i-08f8f5e78bc941095
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                   ----                                                    ------------  ----------   ---------------  -------------  ---
  delivery-apps               kafka-backup-mrmjz                                      500m (6%)     1 (12%)      600Mi (0%)       1Gi (1%)       16m
  delivery-apps               taskschd-consumer-78569c5c67-fcvqp                      2200m (27%)   3200m (40%)  3572Mi (5%)      3572Mi (5%)    12m
  istio-system                istio-cni-node-kwz56                                    0 (0%)        0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 aws-node-gwp4n                                          10m (0%)      0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 aws-node-termination-handler-7k5qb                      0 (0%)        0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 ebs-csi-node-fqdvr                                      0 (0%)        0 (0%)       0 (0%)           0 (0%)         16m
  kube-system                 kube-proxy-nqk9w                                        100m (1%)     0 (0%)       0 (0%)           0 (0%)         16m
  logging                     filebeat-8czsb                                          500m (6%)     2 (25%)      1Gi (1%)         1Gi (1%)       16m
  monitoring                  kube-prometheus-stack-prometheus-node-exporter-qtlfj    50m (0%)      0 (0%)       100Mi (0%)       100Mi (0%)     16m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         3360m (42%)  6200m (78%)
  memory                      5296Mi (8%)  5720Mi (9%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:
  Type     Reason                   Age                From        Message
  ----     ------                   ----               ----        -------
  Normal   Starting                 16m                kubelet     Starting kubelet.
  Warning  InvalidDiskCapacity      16m                kubelet     invalid capacity 0 on image filesystem
  Normal   NodeAllocatableEnforced  16m                kubelet     Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  16m (x3 over 16m)  kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    16m (x3 over 16m)  kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     16m (x3 over 16m)  kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeHasSufficientPID
  Normal   Starting                 15m                kube-proxy  Starting kube-proxy.
  Normal   NodeReady                15m                kubelet     Node ip-10-206-7-199.ec2.internal status is now: NodeReady

The pod taskschd-consumer-78569c5c67-fcvqp is the only application pod on that node; all the rest are daemonsets.
It can be moved to ip-10-206-30-103.ec2.internal:

kubectl describe nodes ip-10-206-30-103.ec2.internal
Name:               ip-10-206-30-103.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=c6i.12xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    intent=apps
                    karpenter.k8s.aws/instance-cpu=48
                    karpenter.k8s.aws/instance-family=c6i
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-memory=98304
                    karpenter.k8s.aws/instance-pods=234
                    karpenter.k8s.aws/instance-size=12xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/provisioner-name=delivery-network-consumers-spot
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-206-30-103.ec2.internalec2ssa.info
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=c6i.12xlarge
                    nodegroup-name=delivery-network-consumers-spot
                    project=mobile-delivery-network-consumers
                    topology.ebs.csi.aws.com/zone=us-east-1b
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1b
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-04e505e8cc763973f"}
                    node.alpha.kubernetes.io/ttl: 15
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 08 Aug 2022 13:26:01 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-206-30-103.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 08 Aug 2022 20:41:58 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  Ready            True    Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:27:10 +0300   KubeletReady                 kubelet is posting ready status
  MemoryPressure   False   Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:26:40 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:26:40 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 08 Aug 2022 20:38:21 +0300   Mon, 08 Aug 2022 13:26:40 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
Addresses:
  InternalIP:   10.206.30.103
  ExternalIP:   54.226.6.184
  Hostname:     ip-10-206-30-103.ec2.internal
  InternalDNS:  ip-10-206-30-103.ec2.internal
  InternalDNS:  ip-10-206-30-103.ec2ssa.info
  ExternalDNS:  ec2-54-226-6-184.compute-1.amazonaws.com
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         48
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      97323012Ki
  pods:                        234
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         47810m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      94323716Ki
  pods:                        234
System Info:
  Machine ID:                 ec237b6d4fea6dcf056164a0fb4aad15
  System UUID:                ec237b6d-4fea-6dcf-0561-64a0fb4aad15
  Boot ID:                    7136840a-db82-4574-8313-2b39acce9907
  Kernel Version:             5.4.204-113.362.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.13
  Kubelet Version:            v1.20.15-eks-99076b2
  Kube-Proxy Version:         v1.20.15-eks-99076b2
ProviderID:                   aws:///us-east-1b/i-04e505e8cc763973f
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                    ------------  ----------  ---------------  -------------  ---
  delivery-apps               capping-consumer-6c77c7fbd8-z8j9w                       6 (12%)       10 (20%)    11Gi (12%)       11Gi (12%)     7h15m
  delivery-apps               device-install-consumer-7f78685654-k25f6                6 (12%)       8 (16%)     15860Mi (17%)    15860Mi (17%)  7h16m
  delivery-apps               kafka-backup-s6l46                                      500m (1%)     1 (2%)      600Mi (0%)       1Gi (1%)       7h15m
  delivery-apps               track-ad-consumer-688c575f45-2gvbn                      7 (14%)       10 (20%)    56820Mi (61%)    56820Mi (61%)  7h15m
  istio-system                istio-cni-node-g9m2x                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 aws-node-8fbx2                                          10m (0%)      0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 aws-node-termination-handler-qspnn                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 ebs-csi-node-cjnrf                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         7h15m
  kube-system                 kube-proxy-bmnxm                                        100m (0%)     0 (0%)      0 (0%)           0 (0%)         7h15m
  logging                     filebeat-2nq4s                                          500m (1%)     2 (4%)      1Gi (1%)         1Gi (1%)       3h10m
  monitoring                  kube-prometheus-stack-prometheus-node-exporter-rlx8c    50m (0%)      0 (0%)      100Mi (0%)       100Mi (0%)     7h15m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         20160m (42%)   31 (64%)
  memory                      85668Mi (93%)  86092Mi (93%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
Events:                       <none>

@tzneal
Contributor

tzneal commented Aug 8, 2022

What does the spec.affinity look like for taskschd-consumer-78569c5c67-fcvqp?

@liorfranko

I think it's related to several PDBs that were configured with minAvailable: 100%.
Let me change them and I'll get back to you.
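
(For context, a PDB with minAvailable: 100% blocks every voluntary eviction, so consolidation can never drain the node. An illustrative comparison, with a hypothetical selector:)

# Blocks all voluntary evictions; the node can never be drained:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: taskschd-consumer-pdb
spec:
  minAvailable: "100%"
  selector:
    matchLabels:
      app: taskschd-consumer
---
# Allows one replica at a time to be evicted:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: taskschd-consumer-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: taskschd-consumer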

@liorfranko

liorfranko commented Aug 8, 2022

It had almost no effect; the total number of cores decreased by 20 and the memory by 200GB.

I think that the problem is related to the chosen instances.
I see many c6i.12xlarge nodes where the CPU allocation is half full, but the memory is fully utilized:

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         48
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      97323012Ki
  pods:                        234
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         22160m (46%)   34 (71%)
  memory                      87716Mi (95%)  88140Mi (95%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

On the other hand, I see many r6a.8xlarge nodes where the CPU allocation is full, and the memory is only 35% utilized.

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         32
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      258598332Ki
  pods:                        234
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         29160m (91%)   40 (125%)
  memory                      88716Mi (35%)  89140Mi (35%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Both of the above can be replaced with an m5.8xlarge each.

@tzneal
Contributor

tzneal commented Aug 8, 2022

I looked at your provisioner and it looks like these are spot nodes. We currently don't replace spot nodes with smaller spot nodes. The reasoning for this is that we don't have a way of knowing if the node we would replace it with is as available or more available than the node that is being replaced. By restricting the instance size we could potentially be moving you from a node that you're likely to keep for a while to a node that will be terminated in a short period of time.

If these were on-demand nodes, then we would replace them with smaller instance types.

@liorfranko

liorfranko commented Aug 8, 2022

Thanks @tzneal

Do you know when it will be supported? Or at least let me choose to enable it.

And is the reason for choosing the c6i.12xlarge over the m5.8xlarge the probability of interruptions?

@tzneal
Contributor

tzneal commented Aug 8, 2022

For spot, we use the capacity-optimized-prioritized strategy when we launch the node. The strategies are documented here, but essentially it trades a slightly more expensive node for a lower chance of interruption.

Node size is also not always related to cost; I just checked the spot pricing for us-east-1 at https://aws.amazon.com/ec2/spot/pricing/ and the r6a.8xlarge was actually cheaper than the m5.8xlarge.

m5.8xlarge      $0.3664 per Hour
r6a.8xlarge     $0.3512 per Hour
c6i.12xlarge    $0.4811 per Hour

@liorfranko

Thanks @tzneal for all the information.

So far everything works well.
I'll monitor it for a couple more days and let you know.
Do you have an estimate of when the current commit will be released?

tzneal added a commit to tzneal/karpenter that referenced this issue Aug 9, 2022
Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances

Fixes aws#1091
tzneal added a commit to tzneal/karpenter that referenced this issue Aug 10, 2022
Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances

Fixes aws#1091
tzneal added a commit that referenced this issue Aug 10, 2022
Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances

Fixes #1091
@kahirokunn

@tzneal
I am using Karpenter v0.18.1.
Spot instances do not scale down when I enable consolidation.
So I found this statement.

#1091 (comment)

In which version will this feature be released?
Thx.

@universam1

The restriction to on-demand is also a disappointment for us, as we have clusters running solely on spot instances.
Would @tzneal be willing to make it available via a feature flag?
Thanks

@FernandoMiguel
Contributor

@tzneal I am using Karpenter v0.18.1. Spot instances do not scale down when I enable consolidation. So I found this statement.

#1091 (comment)

In which version will this feature be released? Thx.

@kahirokunn you wish consolidation to replace spot nodes with cheaper ones that could actually be reclaimed sooner than the ones your workloads are currently on?

@dekelev

dekelev commented Oct 27, 2022

@FernandoMiguel I'm not sure how complicated this is to implement, but there's an AWS page here showing "Frequency of interruption" per instance type, and for some types (e.g. m6i.large & m6i.xlarge) it is lower than 5%, which should be considered a very low risk when consolidating a very large and mostly idle instance into cheaper instances like m6i.xlarge. BTW, I have a small list of instance types that I run Karpenter with.
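
(For reference, a short allow-list like that can be expressed in the provisioner's requirements; the instance types below are just examples:)

  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - m6i.large
        - m6i.xlarge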

@FernandoMiguel
Contributor

Not sure what workloads you run, but I tend to prefer my hosts not to change often, hence why deeper pools are preferred.

@dekelev

dekelev commented Oct 27, 2022

I'm using Karpenter for non-sensitive workloads with a lot of peaks during the day, which sometimes creates huge servers. That is fine for me, better than having many small ones, but after an hour, when the peak is over, the cluster is left with huge servers that are mostly idle and usually needs to be consolidated into smaller instances.

@matti

matti commented Oct 27, 2022

@dekelev I don't run Karpenter, but I have a similar problem with "plain" EKS and (managed) nodegroups. I've solved it like this:

https://github.com/matti/k8s-nodeflow/blob/main/k8s-nodeflow.sh

this is running in the cluster and it ensures that all machines are constantly drained within N seconds.

This requires proper configuration of PodDisruptionBudgets, and https://github.com/kubernetes-sigs/descheduler is also recommended to "consolidate" low-utilization nodes.

I think this OR something similar could work with Karpenter.

Btw my email is in my github profile if you want to have a call or something about this - my plan is to develop these things further at some point and having valid use cases would be helpful for me.

@Vlaaaaaaad

+1, I'd love a flag to enable consolidation for Spot instances too.

Workloads vary wildly and I agree that by default this flag should be off, but it should be an option for the folks that need it. For example, I often see clusters with a bunch of 24xlarge nodes that have <1% utilization. We use node-problem-detector/descheduler to work around this, but Karpenter consolidating Spot nodes natively would be a much better solution.

@universam1

universam1 commented Oct 27, 2022

@FernandoMiguel Spot instances naturally go with workloads that tolerate interruptions; I guess no one will use spot when the workload is allergic to them!?
We also have different use cases, some where we dislike interruptions and others with high spikes. Consider deployments with anti-affinity across nodes: they get grouped by Karpenter at provisioning time but are not reconsidered later, which leaves those huge instances running.

However, I could picture a case where Karpenter tries to schedule a smaller node that is unavailable, falls back to a bigger instance, and thus runs into a loop!? Not sure if Karpenter does a preflight check in such a case?

@tzneal
Contributor

tzneal commented Oct 27, 2022

The original problem that forced us not to replace spot nodes with smaller spot nodes is that the straightforward approach eliminates the utility of the capacity-optimized-prioritized strategy we use for launching spot nodes, which considers pool depth and leads to launching nodes that may be slightly bigger than necessary but are less likely to be reclaimed.

If consolidation did replace spot nodes, we would often launch a spot node, get one that's slightly too big due to pool depth and interruption rate, and then immediately replace it with a cheaper instance type that has a higher interruption rate. This turns our capacity-optimized-prioritized strategy into a Karpenter-operated lowest-price strategy, which isn't recommended for use with spot.

I'll be locking this issue as it was for the original consolidation feature and it's difficult to track new feature requests on closed issues. Feel free to open a new one specifically for spot replacement for that discussion.

@aws aws locked as resolved and limited conversation to collaborators Oct 27, 2022