
Configurable cluster consolidation #2365

Closed
gaganyaan2 opened this issue Aug 25, 2022 · 10 comments
Labels
feature New feature or request

Comments

@gaganyaan2

gaganyaan2 commented Aug 25, 2022

Tell us about your request
Can we make cluster consolidation configurable so that we can reserve some CPU and memory for better performance and tuned cost savings?

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: provisioner
spec:
  consolidation:
    enabled: true
    reserved:
      cpu: 1000m
      memory: 2G

##### OR in percentages #####
    reserved:
      cpu: 5%
      memory: 5%

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
After enabling the consolidation feature, some of our pods are getting restarted due to failing readiness and liveness probes; when we increase CPU and memory they start running fine.
By adding some reserved CPU and memory to the consolidation calculation, it would act as a soft tuning knob for cost optimization and give users more confidence in using the consolidation feature.

For example, if a node has 10 CPU and 20 GB RAM and we want to reserve 10%, the consolidation feature should only consider 9 CPU and 18 GB RAM when making consolidation decisions.

Are you currently working around this issue?
How are you currently solving this problem?

Possible solution:

  • Increase initialDelaySeconds and the readiness/liveness periodSeconds timeouts (see the sketch after this list)
  • Increase the CPU and memory requests for the Deployment
  • Make cluster consolidation configurable so it acts as a safe tuning knob
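
A minimal sketch of the probe-timing workaround from the first bullet, assuming a hypothetical Deployment named app with a /healthz endpoint on port 8080; the actual values depend on the workload:

# Hypothetical Deployment: loosen probe timing so briefly throttled
# pods are not restarted immediately. Name, image, port, and values
# are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                         # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example/app:latest # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz        # assumed health endpoint
              port: 8080
            initialDelaySeconds: 30 # more time before the first check
            periodSeconds: 15       # probe less frequently
            failureThreshold: 6     # tolerate a few slow responses
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 20
            failureThreshold: 6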

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
gaganyaan2 added the feature (New feature or request) label Aug 25, 2022
@tzneal
Contributor

tzneal commented Aug 25, 2022

The problem with implementing this in consolidation is that it doesn't help generally. kube-scheduler is the entity that is scheduling your pods to nodes and it's doing that based on the resource requests of your pods vs the node's allocatable resources. That pods are failing is a signal that your resource requests for your pods are undersized.

Suppose you had consolidation turned off and Karpenter launched the same set of nodes because of a scale-up event: the pods would schedule the same way across those nodes and then fail similarly. The root cause is the undersized resource requests.

@gaganyaan2
Author

That pods are failing is a signal that your resource requests for your pods are undersized.

This might be the case, as the pods also include the New Relic logging and APM agents, but somehow our stage environment seems a little unstable after enabling consolidation. Before consolidation it was working fine and we never had any readiness probe issues. We will be monitoring it for a few more days.

The other alternative we thought of is creating a DaemonSet with image: google/pause:latest and setting CPU/RAM requests so that some resources stay reserved on each worker node (see the sketch below). If this also does not work, we will update the resources for each Deployment.
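
A minimal sketch of that DaemonSet idea, assuming a hypothetical name reserve-headroom in kube-system; the pause container simply holds the requested CPU/memory so the scheduler leaves that much headroom on every node:

# Hypothetical DaemonSet that reserves headroom on each node by
# requesting CPU/memory that the pause container never actually uses.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: reserve-headroom        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: reserve-headroom
  template:
    metadata:
      labels:
        app: reserve-headroom
    spec:
      containers:
        - name: pause
          image: google/pause:latest
          resources:
            requests:
              cpu: "1"          # headroom to hold per node
              memory: 2Gi
            limits:
              cpu: "1"
              memory: 2Gi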

@tzneal
Contributor

tzneal commented Aug 26, 2022

You were likely in some situation where your pods happened to land on nodes with a bit of extra space so that things worked better. One way of looking at it is that in your scenario your pods are spread across a set of nodes and are crashing or throttled so that they don't respond to probes. If you had created that same set of nodes with a managed node group, your pods would still be crashing/throttled running on them and Karpenter wouldn't be involved at all.

The EKS best practices guide has some information regarding setting requests/limits here that might be helpful. Relevant to this situation:

Correctly sized requests are particularly important when using a node auto-scaling solution like Karpenter or Cluster AutoScaler. These tools look at your workload requests to determine the number and size of nodes to be provisioned. If your requests are too small with larger limits, you may find your workloads evicted or OOM killed if they have been tightly packed on a node.

There is also an EKS log collector script at https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script/linux. It has recently been updated to identify situations where processes are being throttled by cgroups. If you run that script and extract the log archive it creates, you should find a system/cpu_throttling.txt that shows whether any processes are being throttled and by how much.

@anguslees

@koolwithk See also https://karpenter.sh/v0.16.0/provisioner/#system-reserved-resources

@gaganyaan2
Author

gaganyaan2 commented Aug 30, 2022

I tried https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script/linux on a node where pods were failing readiness and liveness probes, and I did see multiple throttled processes in system/cpu_throttling.txt.

Also, I described the same node [kubectl describe node $node_name], and the CPU requests were at 99%. There were multiple nodes with around the same 98-99% CPU requests where pods were failing.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests           Limits
  --------                    --------           ------
  cpu                         15656m (99%)       37762m (240%)
  memory                      32385323392 (49%)  62023377664 (94%)
  ephemeral-storage           0 (0%)             0 (0%)
  hugepages-1Gi               0 (0%)             0 (0%)
  hugepages-2Mi               0 (0%)             0 (0%)
  attachable-volumes-aws-ebs  0                  0

I believe we need to update our resource requests and limits so that the pods are not CPU throttled.

We will also be trying https://karpenter.sh/v0.16.0/provisioner/#system-reserved-resources
OR
a DaemonSet with image: google/pause:latest to reserve CPU/RAM.

Thank you! @tzneal

@gaganyaan2
Author

I had no luck with https://karpenter.sh/v0.16.0/provisioner/#system-reserved-resources using the config below:

  kubeletConfiguration:
    systemReserved:
      cpu: "2"
      memory: 2Gi

Looking at https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable, system-reserved is meant to reserve resources for system daemons like sshd, but in my case it did not appear to be subtracted from the node's Allocatable resources.

Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         16
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      32117840Ki
  pods:                        205
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         15700m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      31491152Ki
  pods:                        205

Because of this, I'm still seeing allocated CPU at 97% and some pods are still failing their readiness and liveness probes.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests           Limits
  --------                    --------           ------
  cpu                         15356m (97%)       34062m (216%)
  memory                      27782074752 (86%)  49459340032 (153%)
  ephemeral-storage           0 (0%)             0 (0%)
  hugepages-1Gi               0 (0%)             0 (0%)
  hugepages-2Mi               0 (0%)             0 (0%)
  attachable-volumes-aws-ebs  0                  0

I am now trying a DaemonSet with image: google/pause:latest with a request of 2 CPU/RAM.

@tzneal
Contributor

tzneal commented Sep 2, 2022

@koolwithk The requested CPU value will always be high as that's how pods are packed on a node. If pods are failing their readiness/liveness checks, it indicates that the pod's requested resources are too low. You should increase the requested CPU for those deployments to a large enough value that they are healthy.
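
A minimal sketch of that adjustment, assuming a hypothetical Deployment named app whose observed usage is around 1.5 CPU; the exact values should come from monitoring data rather than these placeholders:

# Hypothetical Deployment: raise the CPU/memory requests to match
# observed usage so the container is not throttled into failing its
# probes once nodes are tightly packed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example/app:latest  # placeholder image
          resources:
            requests:
              cpu: 1500m             # sized from observed usage
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 2Gi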

@jwcesign

@gaganyaan2 How did you solve this issue?

@gaganyaan2
Author

  • Finding the correct CPU/memory requests and limits for each pod is key.
  • Not recommended: running a DaemonSet with image: google/pause:latest with a request of 2 CPU/memory. It holds node resources so that additional pods cannot be scheduled on the same node, but I did not see much benefit from it; in the end I corrected the CPU/memory requests and limits using Prometheus and Grafana.

@jwcesign

jwcesign commented Jul 16, 2024

The app developers configure the CPU/memory requests, and as a DevOps engineer I can't modify their YAML. Also, there are many YAMLs, and it's quite complex for me to modify them one by one.

So, running a DaemonSet is a good choice for me.

Thanks very much!
