Configurable cluster consolidation #2365
The problem with implementing this in consolidation is that it doesn't help generally. Suppose you had consolidation turned off and Karpenter launched the same set of nodes because of a scale-up event: the pods would schedule in the same way across those nodes and would then fail similarly. The root cause is the undersized resource requests.
This might be the case, as the pods also include the New Relic logging and APM agents, but somehow our stage environment seems a little unstable after enabling consolidation; before consolidation it was working fine and we never got any issue regarding readiness probes. We will be monitoring it for a few more days. The other alternative solution we thought of is creating a DaemonSet with resource requests that reserve some capacity on each node.
You were likely in some situation where your pods happened to land on nodes with a bit of extra space, so things worked better. One way of looking at it is that in your scenario your pods are spread across a set of nodes and are crashing or throttled so that they don't respond to probes. If you had created that same set of nodes with a managed node group, your pods would still be crashing/throttled running on them and Karpenter wouldn't be involved at all. The EKS best practices guide has some information regarding setting requests/limits here that might be helpful. Relevant to this situation:
There is also an EKS log collector script at https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script/linux. It has recently been updated to identify situations where processes are being throttled by cgroups. If you run that script and extract the log archive it creates, you should find a file that lists the throttled processes.
@koolwithk See also https://karpenter.sh/v0.16.0/provisioner/#system-reserved-resources
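For anyone following along, a minimal sketch of where that setting sits in a v1alpha5 Provisioner (assuming Karpenter v0.16.x; the reservation values are illustrative only, not a recommendation from this thread):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true          # the feature being discussed in this issue
  kubeletConfiguration:
    systemReserved:        # subtracted from node capacity when computing allocatable
      cpu: "1"
      memory: 1Gi
```

Note that kubelet settings are applied when a node is launched, so nodes created before the change keep their previous allocatable until they are replaced.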
Tried https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script/linux on the node where pods were failing readiness and liveness probes, and I did see multiple processes in a throttled state. I also described the node:

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests           Limits
  --------                    --------           ------
  cpu                         15656m (99%)       37762m (240%)
  memory                      32385323392 (49%)  62023377664 (94%)
  ephemeral-storage           0 (0%)             0 (0%)
  hugepages-1Gi               0 (0%)             0 (0%)
  hugepages-2Mi               0 (0%)             0 (0%)
  attachable-volumes-aws-ebs  0                  0
```

I believe we need to update our resource requests and limits so that the pods' CPU is not throttled. We will also be trying https://karpenter.sh/v0.16.0/provisioner/#system-reserved-resources. Thank you! @tzneal
I had no luck with https://karpenter.sh/v0.16.0/provisioner/#system-reserved-resources using the config below:

```yaml
kubeletConfiguration:
  systemReserved:
    cpu: "2"
    memory: 2Gi
```

Looking at https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable, the node reports:

```
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         16
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      32117840Ki
  pods:                        205
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         15700m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      31491152Ki
  pods:                        205
```

Because of this, I'm still seeing allocated CPU at 97% and some pods are still failing readiness and liveness probes:

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests           Limits
  --------                    --------           ------
  cpu                         15356m (97%)       34062m (216%)
  memory                      27782074752 (86%)  49459340032 (153%)
  ephemeral-storage           0 (0%)             0 (0%)
  hugepages-1Gi               0 (0%)             0 (0%)
  hugepages-2Mi               0 (0%)             0 (0%)
  attachable-volumes-aws-ebs  0                  0
```

I am now trying a DaemonSet with reserved resource requests instead.
@koolwithk The requested CPU value will always be high as that's how pods are packed on a node. If pods are failing their readiness/liveness checks, it indicates that the pod's requested resources are too low. You should increase the requested CPU for those deployments to a large enough value that they are healthy.
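As a purely illustrative sketch of that advice (the deployment name, image, and numbers below are placeholders, not taken from this thread), raising the request tells the scheduler to reserve enough CPU on the node for the pod to stay healthy:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                        # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example.registry/app:latest   # placeholder image
          resources:
            requests:
              cpu: "1"       # raised so the scheduler reserves real headroom for this pod
              memory: 1Gi
            limits:
              cpu: "2"       # limit above the request allows short bursts
              memory: 1Gi
```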
@gaganyaan2 How did you solve this issue?
The app developers configure the CPU/memory requests, and as a DevOps engineer I can't modify their YAML. Also, there are many YAMLs, and it's quite complex for me to modify them one by one. So running a DaemonSet is a good choice for me. Thanks very much!
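For later readers, here is a minimal sketch of the DaemonSet workaround described in this thread: a pause container whose requests reserve headroom on every node. The name, image tag, and sizes are assumptions for illustration, not a manifest anyone posted here:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-headroom                      # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-headroom
  template:
    metadata:
      labels:
        app: node-headroom
    spec:
      containers:
        - name: reserve
          image: registry.k8s.io/pause:3.9   # does nothing; only its requests matter
          resources:
            requests:
              cpu: "1"       # capacity reserved on every node
              memory: 2Gi
            limits:
              cpu: "1"
              memory: 2Gi
```

Because Karpenter accounts for DaemonSet pods when sizing and consolidating nodes, this effectively removes that much CPU and memory from what other pods can be packed onto each node.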
Tell us about your request
Can we make cluster consolidation configurable so that we can reserve some CPU and memory for better performance and tuned cost savings?
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
After enabling the consolidation feature, some of our pods are getting restarted due to failing readiness and liveness probes; when we increase CPU and memory, they start running fine.
By adding some reserved cpu and memory during the consolidation calculation, it would act as a soft tuning knob for cost optimization and give users more confidence in the consolidation feature. For example, if a node has 10 CPU and 20 GB RAM and we want to reserve 10%, then the consolidation feature should only consider 9 CPU and 18 GB RAM when making consolidation decisions (see the sketch under "Possible solution" below).

Are you currently working around this issue?
How are you currently solving this problem?
Possible solution:
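Purely to illustrate the shape of the request (nothing below exists in the Karpenter API today; the `reservedHeadroom` field name is invented here), the ask is a reservation that consolidation would subtract from each node's allocatable before binpacking:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true
    # Hypothetical field, not part of Karpenter: keep 10% of each node's CPU and
    # memory out of the binpacking calculation when making consolidation decisions,
    # e.g. a 10 CPU / 20 GB node would be treated as 9 CPU / 18 GB.
    reservedHeadroom:
      cpu: "10%"
      memory: "10%"
```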
Community Note