
feat: include tainted nodes in utilisation denominator #285

Closed

FocalChord wants to merge 1 commit into atlassian:master from FocalChord:feat/include-tainted-in-capacity

Conversation


@FocalChord FocalChord commented Mar 6, 2026

What

This PR adds a new optional config field include_tainted_in_capacity (bool, defaults to false) that includes tainted nodes alongside untainted nodes in the capacity denominator when calculating utilisation.

When enabled, three call sites in scaleNodeGroup switch from using untaintedNodes to a combined capacityNodes slice:

  • CalculateNodesCapacity (the denominator)
  • calcPercentUsage (the node count for scale-from-zero detection)
  • calcScaleUpDelta (the node set for scale-up calculation)

Force-tainted nodes and cordoned nodes are never included regardless of the flag, since force-tainted nodes are being aggressively removed and cordoned nodes are manually excluded by operators.

Why

Escalator calculates utilisation as total pod requests divided by allocatable capacity of untainted nodes only. When a node is tainted, it drops from the denominator immediately, but its pods continue running and stay in the numerator. On workloads where pods do not drain immediately after tainting, this inflates utilisation artificially.

We observed this in production on a cluster running ~92 bare-metal nodes (m6id.metal, 128 vCPU each). During a demand transition:

  • Utilisation dropped to 47.8%, triggering tainting at 3 nodes per 30-second cycle
  • Over 11 minutes, 24 nodes were tainted
  • The shrinking denominator caused a 26-point utilisation spike in a single scan cycle (48% to 74.5%) with no change in actual demand
  • Escalator reversed and untainted 22 of the 24 nodes, but 12 EC2 instances had already been terminated and replaced

With include_tainted_in_capacity: true, the denominator stays stable after tainting because tainted nodes still count toward capacity. The utilisation reading reflects actual demand changes rather than the controller's own actions.
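To make the denominator effect concrete, here is a worked example with simplified, made-up numbers (these are illustrative, not the production figures above):

```go
package main

import "fmt"

func main() {
    // Hypothetical node group: 10 nodes, 100 cores allocatable each,
    // pod requests totalling 480 cores. All figures are illustrative.
    requests, perNode := 480.0, 100.0
    total, tainted := 10.0, 3.0

    // Untainted-only denominator: tainting 3 nodes inflates utilisation
    // even though pod requests have not changed.
    fmt.Printf("before tainting: %.1f%%\n", requests/(total*perNode)*100)           // 48.0%
    fmt.Printf("after tainting:  %.1f%%\n", requests/((total-tainted)*perNode)*100) // 68.6%

    // Full-fleet denominator (include_tainted_in_capacity: true):
    // tainting no longer moves the reading.
    fmt.Printf("full fleet:      %.1f%%\n", requests/(total*perNode)*100) // 48.0%
}
```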

Threshold re-tuning required

Enabling this flag changes the meaning of the utilisation metric. A cluster that previously read 70% (untainted-only denominator) might read 55% (full-fleet denominator) because the denominator is larger. Any deployment enabling this flag will need to adjust its taint_lower_capacity_threshold_percent, taint_upper_capacity_threshold_percent, and scale_up_threshold_percent accordingly.

This is why it's a feature flag rather than a default change. Existing deployments are completely unaffected.
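For illustration only, a hypothetical node group configuration enabling the flag with re-tuned thresholds (the group name and threshold values are made up; only the option keys come from this PR and the existing config):

```yaml
node_groups:
  - name: example-group          # hypothetical group name
    include_tainted_in_capacity: true
    # Re-tuned downward to account for the larger full-fleet denominator.
    # These values are illustrative, not recommendations.
    taint_lower_capacity_threshold_percent: 35
    taint_upper_capacity_threshold_percent: 45
    scale_up_threshold_percent: 60
```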

Testing

Three test cases covering:

  • Tainted nodes in the denominator prevent an artificial utilisation spike (flag enabled)
  • Existing behavior preserved when flag is disabled (backwards compatibility)
  • Force-tainted and cordoned nodes are never included in capacity, regardless of the flag


@atlassian-cla-bot

Thank you for your submission! Like many open source projects, we ask that you sign our CLA (Contributor License Agreement) before we can accept your contribution.
If your email is listed below, please ensure that you sign the CLA with the same email address.

The following users still need to sign our CLA:
❌ nbhatt-atlassian


@FocalChord FocalChord changed the title from "(feat) Add feature flag to include tainted nodes in utilisation denominator" to "feat: add include_tainted_in_capacity flag" on Mar 6, 2026
@FocalChord FocalChord changed the title from "feat: add include_tainted_in_capacity flag" to "feat: include tainted nodes in utilisation denominator" on Mar 6, 2026
@FocalChord FocalChord force-pushed the feat/include-tainted-in-capacity branch from 87dc37f to e667b75 on March 10, 2026
@FocalChord FocalChord requested a review from awprice March 13, 2026 18:02
}
}

if c.isScaleOnStarve(nodeGroup, podRequests, nodeCapacity, untaintedNodes) {
Collaborator

This is still using untaintedNodes instead of capacityNodes. Is this intentional?

metrics.NodeGroupMemCapacityLargestAvailableMem.WithLabelValues(nodegroup).Set(float64(nodeCapacity.LargestAvailableMemory.GetMemoryQuantity().MilliValue() / 1000))

// If we ever get into a state where we have less nodes than the minimum
if len(untaintedNodes) < nodeGroup.Opts.MinNodes {
Collaborator

Similarly here, we are still using untaintedNodes instead of capacityNodes. Is this intentional?


// IncludeTaintedInCapacity includes tainted nodes in the capacity denominator
// for utilisation calculations, preventing artificial spikes when nodes are tainted.
IncludeTaintedInCapacity bool `json:"include_tainted_in_capacity,omitempty" yaml:"include_tainted_in_capacity,omitempty"`

capacityNodes := untaintedNodes
if nodeGroup.Opts.IncludeTaintedInCapacity {
    capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
    log.WithField("nodegroup", nodegroup).Infof("Including %v tainted nodes in capacity calculation (total capacity nodes: %v)", len(taintedNodes), len(capacityNodes))
Collaborator

Should this be at info level? This log will run every tick and could be very noisy. Can it be changed to debug level, or logged only when len(taintedNodes) > 0?
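A minimal sketch of that suggestion, assuming the logrus-style logger and variables from the quoted hunk:

```go
capacityNodes := untaintedNodes
if nodeGroup.Opts.IncludeTaintedInCapacity {
    capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
    // Log at debug level, and only when tainted nodes actually contribute,
    // so the per-tick loop stays quiet in the common case.
    if len(taintedNodes) > 0 {
        log.WithField("nodegroup", nodegroup).Debugf(
            "Including %v tainted nodes in capacity calculation (total capacity nodes: %v)",
            len(taintedNodes), len(capacityNodes))
    }
}
```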

@cespo cespo closed this Mar 25, 2026
@cespo cespo reopened this Mar 25, 2026
// Determine which nodes count toward capacity for utilisation calculation
capacityNodes := untaintedNodes
if nodeGroup.Opts.IncludeTaintedInCapacity {
    capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
@tomwwright tomwwright commented Mar 26, 2026
Just thinking through the premise of the fix, does it make more sense to instead change the calculation on the numerator side?

That is, rather than including tainted nodes in the available capacity (which really, they aren't, as the scheduler can't place on them) we could exclude pods on tainted nodes from the load (which makes some sense, they are draining pods)

The crux of the utilisation calc is to determine what % pressure is on the pool of resources that are available to the scheduler

The concern is that computing that exclusion could be expensive: we'd essentially need another filter pass over pods, checking whether each pod's assigned node is in our tainted list.

@dtnyn thoughts?
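A rough sketch of the numerator-side exclusion being proposed (the variable names here are hypothetical, not Escalator's actual code):

```go
// Build a set of tainted node names for O(1) membership checks.
taintedSet := make(map[string]struct{}, len(taintedNodes))
for _, n := range taintedNodes {
    taintedSet[n.Name] = struct{}{}
}

// Drop pods scheduled on tainted nodes from the load (the numerator):
// they are draining and no longer represent demand on schedulable capacity.
schedulablePods := make([]*v1.Pod, 0, len(pods))
for _, p := range pods {
    if _, onTainted := taintedSet[p.Spec.NodeName]; !onTainted {
        schedulablePods = append(schedulablePods, p)
    }
}
```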

Collaborator

I think either approach will address the specific problem of an inaccurate numerator, but exclusion does seem more semantically accurate: as you mentioned, tainted nodes aren't actually "available capacity", so giving the same treatment to their workloads makes sense.

I don't think the extra check would change the time complexity. It can be done with a membership check against a set of tainted nodes, folded into the loop we're already doing in mapPodsToNode(), which goes through all current pods.

@dtnyn dtnyn commented Mar 30, 2026

closing this PR in favour of #288

@dtnyn dtnyn closed this Mar 30, 2026
