feat: include tainted nodes in utilisation denominator #285

FocalChord wants to merge 1 commit into atlassian:master
Conversation
Thank you for your submission! Like many open source projects, we ask that you sign our CLA (Contributor License Agreement) before we can accept your contribution.
Force-pushed from 87dc37f to e667b75
```go
if c.isScaleOnStarve(nodeGroup, podRequests, nodeCapacity, untaintedNodes) {
```
This is still using `untaintedNodes` instead of `capacityNodes`. Is this intentional?
```go
metrics.NodeGroupMemCapacityLargestAvailableMem.WithLabelValues(nodegroup).Set(float64(nodeCapacity.LargestAvailableMemory.GetMemoryQuantity().MilliValue() / 1000))

// If we ever get into a state where we have less nodes than the minimum
if len(untaintedNodes) < nodeGroup.Opts.MinNodes {
```
Similarly here, we are still using `untaintedNodes` instead of `capacityNodes`. Is this intentional?
```go
// IncludeTaintedInCapacity includes tainted nodes in the capacity denominator
// for utilisation calculations, preventing artificial spikes when nodes are tainted.
IncludeTaintedInCapacity bool `json:"include_tainted_in_capacity,omitempty" yaml:"include_tainted_in_capacity,omitempty"`
```
This adds a new user-facing config option; please document it in https://github.com/atlassian/escalator/blob/master/docs/configuration/nodegroup.md or https://github.com/atlassian/escalator/blob/master/docs/configuration/advanced-configuration.md
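For reference, a sketch of how the documented example might look in a nodegroup config. The surrounding fields and values here are illustrative (only the threshold names come from this PR's description); the exact shape should follow the existing examples in nodegroup.md:

```yaml
node_groups:
  - name: "example"
    min_nodes: 5
    max_nodes: 100
    scale_up_threshold_percent: 70
    taint_lower_capacity_threshold_percent: 40
    taint_upper_capacity_threshold_percent: 60
    # New flag from this PR: count tainted (but never force-tainted or
    # cordoned) nodes in the capacity denominator for utilisation.
    # Defaults to false, leaving existing deployments unchanged.
    include_tainted_in_capacity: true
```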
```go
capacityNodes := untaintedNodes
if nodeGroup.Opts.IncludeTaintedInCapacity {
	capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
	log.WithField("nodegroup", nodegroup).Infof("Including %v tainted nodes in capacity calculation (total capacity nodes: %v)", len(taintedNodes), len(capacityNodes))
}
```
Should this be at info level? This log will run on every tick and could be very noisy. Can it be changed to debug level, or only logged when `len(taintedNodes) > 0`?
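One possible shape for that, as a sketch only (it combines both suggestions and mirrors the existing logrus-style call in the diff):

```go
if nodeGroup.Opts.IncludeTaintedInCapacity {
	capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
	// Drop to debug level and only log when there is something to report,
	// so steady-state ticks with no tainted nodes stay quiet.
	if len(taintedNodes) > 0 {
		log.WithField("nodegroup", nodegroup).Debugf(
			"Including %v tainted nodes in capacity calculation (total capacity nodes: %v)",
			len(taintedNodes), len(capacityNodes))
	}
}
```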
```go
// Determine which nodes count toward capacity for utilisation calculation
capacityNodes := untaintedNodes
if nodeGroup.Opts.IncludeTaintedInCapacity {
	capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
}
```
Just thinking through the premise of the fix: does it make more sense to change the calculation on the numerator side instead?

That is, rather than including tainted nodes in the available capacity (which, really, they aren't, since the scheduler can't place pods on them), we could exclude pods on tainted nodes from the load. That makes some sense, as those pods are draining.

The crux of the utilisation calc is to determine what % of pressure is on the pool of resources available to the scheduler.

The concern could be that computing that exclusion is expensive: we essentially need another filter pass over the pods, checking whether each pod's assigned node is in our tainted list.

@dtnyn thoughts?
I think either way will address the specific problem of the inaccurate numerator, but exclusion does seem more semantically accurate: as you mentioned, tainted nodes aren't actually "available capacity", so giving their workloads the same treatment makes sense.

I don't think the extra check would change the time complexity. It can be done with a membership check against a set of tainted nodes, folded into the loop we're already doing in `mapPodsToNode()`, which goes through all current pods.
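A minimal sketch of that membership check, written as its own pass; the function name and placement are hypothetical, not from this PR:

```go
package controller // illustrative placement only

import v1 "k8s.io/api/core/v1"

// filterDrainingPods returns only the pods not bound to a tainted node, so
// the utilisation numerator counts just the load the scheduler still has to
// place. Pods on tainted nodes are treated as draining and excluded.
func filterDrainingPods(pods []*v1.Pod, taintedNodes []*v1.Node) []*v1.Pod {
	taintedSet := make(map[string]struct{}, len(taintedNodes))
	for _, n := range taintedNodes {
		taintedSet[n.Name] = struct{}{}
	}
	kept := make([]*v1.Pod, 0, len(pods))
	for _, p := range pods {
		if _, onTainted := taintedSet[p.Spec.NodeName]; !onTainted {
			kept = append(kept, p)
		}
	}
	return kept
}
```

Each lookup is O(1), so this stays linear in the number of pods and could just as easily fold into the existing `mapPodsToNode()` loop.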
Closing this PR in favour of #288.
What
This PR adds a new optional config field `include_tainted_in_capacity` (bool, defaults to `false`) that includes tainted nodes alongside untainted nodes in the capacity denominator when calculating utilisation.

When enabled, three call sites in `scaleNodeGroup` switch from using `untaintedNodes` to a combined `capacityNodes` slice:

- `CalculateNodesCapacity` (the denominator)
- `calcPercentUsage` (the node count for scale-from-zero detection)
- `calcScaleUpDelta` (the node set for scale-up calculation)

Force-tainted nodes and cordoned nodes are never included regardless of the flag, since force-tainted nodes are being aggressively removed and cordoned nodes are manually excluded by operators.
Why
Escalator calculates utilisation as total pod requests divided by allocatable capacity of untainted nodes only. When a node is tainted, it drops from the denominator immediately, but its pods continue running and stay in the numerator. On workloads where pods do not drain immediately after tainting, this inflates utilisation artificially.
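To make the effect concrete (numbers invented for illustration): with 10 nodes of 128 vCPU each, allocatable capacity is 1280 vCPU, so pods requesting 640 vCPU read as 50% utilisation. Taint two nodes and the denominator drops to 1024 vCPU while the numerator stays at 640, so the reading jumps to 62.5% with no change in actual demand.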
We observed this in production on a cluster running ~92 bare-metal nodes (m6id.metal, 128 vCPU each) during a demand transition.
With `include_tainted_in_capacity: true`, the denominator stays stable after tainting because tainted nodes still count toward capacity. The utilisation reading reflects actual demand changes rather than the controller's own actions.

Threshold re-tuning required
Enabling this flag changes the meaning of the utilisation metric. A cluster that previously read 70% (untainted-only denominator) might read 55% (full-fleet denominator) because the denominator is larger. Any deployment enabling this flag will need to adjust its `taint_lower_capacity_threshold_percent`, `taint_upper_capacity_threshold_percent`, and `scale_up_threshold_percent` accordingly.

This is why it's a feature flag rather than a default change. Existing deployments are completely unaffected.
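As a first-order sanity check (assuming a roughly homogeneous fleet), the new reading scales by the untainted share of capacity: utilisation_full ≈ utilisation_untainted × (untainted capacity ÷ total capacity). The 70% to 55% example above corresponds to roughly 8 of every 10 nodes being untainted (70% × 0.8 = 56%).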
Testing
Three test cases covering: