feat: include tainted nodes in utilisation denominator #285

FocalChord wants to merge 1 commit into atlassian:master
Conversation
Thank you for your submission! Like many open source projects, we ask that you sign our CLA (Contributor License Agreement) before we can accept your contribution.
Force-pushed from 87dc37f to e667b75
```go
if c.isScaleOnStarve(nodeGroup, podRequests, nodeCapacity, untaintedNodes) {
```
This is still using `untaintedNodes` instead of `capacityNodes`. Is this intentional?
```go
metrics.NodeGroupMemCapacityLargestAvailableMem.WithLabelValues(nodegroup).Set(float64(nodeCapacity.LargestAvailableMemory.GetMemoryQuantity().MilliValue() / 1000))

// If we ever get into a state where we have less nodes than the minimum
if len(untaintedNodes) < nodeGroup.Opts.MinNodes {
```
Similarly here, we are still using `untaintedNodes` instead of `capacityNodes`. Is this intentional?
```go
// IncludeTaintedInCapacity includes tainted nodes in the capacity denominator
// for utilisation calculations, preventing artificial spikes when nodes are tainted.
IncludeTaintedInCapacity bool `json:"include_tainted_in_capacity,omitempty" yaml:"include_tainted_in_capacity,omitempty"`
```
This adds a new user-facing config option; please document it in https://github.com/atlassian/escalator/blob/master/docs/configuration/nodegroup.md or https://github.com/atlassian/escalator/blob/master/docs/configuration/advanced-configuration.md
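For reference, a sketch of how the documented example might look in a nodegroup config. The surrounding fields and values here are illustrative (only the threshold names come from this PR's description); the exact shape should follow the existing examples in nodegroup.md:

```yaml
node_groups:
  - name: "example"
    min_nodes: 5
    max_nodes: 100
    scale_up_threshold_percent: 70
    taint_lower_capacity_threshold_percent: 40
    taint_upper_capacity_threshold_percent: 60
    # New flag from this PR: count tainted (but never force-tainted or
    # cordoned) nodes in the capacity denominator for utilisation.
    # Defaults to false, leaving existing deployments unchanged.
    include_tainted_in_capacity: true
```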
```go
capacityNodes := untaintedNodes
if nodeGroup.Opts.IncludeTaintedInCapacity {
	capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
	log.WithField("nodegroup", nodegroup).Infof("Including %v tainted nodes in capacity calculation (total capacity nodes: %v)", len(taintedNodes), len(capacityNodes))
}
```
Should this be at info level? This log will run on every tick and could be very noisy. Can it be changed to debug level, or only logged when `len(taintedNodes) > 0`?
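One possible shape for that, as a sketch only (it combines both suggestions and mirrors the existing logrus-style call in the diff):

```go
if nodeGroup.Opts.IncludeTaintedInCapacity {
	capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
	// Drop to debug level and only log when there is something to report,
	// so steady-state ticks with no tainted nodes stay quiet.
	if len(taintedNodes) > 0 {
		log.WithField("nodegroup", nodegroup).Debugf(
			"Including %v tainted nodes in capacity calculation (total capacity nodes: %v)",
			len(taintedNodes), len(capacityNodes))
	}
}
```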
```go
// Determine which nodes count toward capacity for utilisation calculation
capacityNodes := untaintedNodes
if nodeGroup.Opts.IncludeTaintedInCapacity {
	capacityNodes = append(append([]*v1.Node{}, untaintedNodes...), taintedNodes...)
}
```
Just thinking through the premise of the fix: does it make more sense to change the calculation on the numerator side instead?

That is, rather than including tainted nodes in the available capacity (which, really, they aren't, since the scheduler can't place pods on them), we could exclude pods on tainted nodes from the load. That makes some sense, as those pods are draining.

The crux of the utilisation calc is to determine what % of pressure is on the pool of resources available to the scheduler.

The concern could be that computing that exclusion is expensive: we essentially need another filter pass over the pods, checking whether each pod's assigned node is in our tainted list.

@dtnyn thoughts?
I think either way will address the specific problem of the inaccurate numerator, but exclusion does seem more semantically accurate: as you mentioned, tainted nodes aren't actually "available capacity", so giving their workloads the same treatment makes sense.

I don't think the extra check would change the time complexity. It can be done with a membership check against a set of tainted nodes, folded into the loop we're already doing in `mapPodsToNode()`, which goes through all current pods.
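A minimal sketch of that membership check, written as its own pass; the function name and placement are hypothetical, not from this PR:

```go
package controller // illustrative placement only

import v1 "k8s.io/api/core/v1"

// filterDrainingPods returns only the pods not bound to a tainted node, so
// the utilisation numerator counts just the load the scheduler still has to
// place. Pods on tainted nodes are treated as draining and excluded.
func filterDrainingPods(pods []*v1.Pod, taintedNodes []*v1.Node) []*v1.Pod {
	taintedSet := make(map[string]struct{}, len(taintedNodes))
	for _, n := range taintedNodes {
		taintedSet[n.Name] = struct{}{}
	}
	kept := make([]*v1.Pod, 0, len(pods))
	for _, p := range pods {
		if _, onTainted := taintedSet[p.Spec.NodeName]; !onTainted {
			kept = append(kept, p)
		}
	}
	return kept
}
```

Each lookup is O(1), so this stays linear in the number of pods and could just as easily fold into the existing `mapPodsToNode()` loop.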
Closing this PR in favour of #288.
What
This PR adds a new optional config field `include_tainted_in_capacity` (bool, defaults to `false`) that includes tainted nodes alongside untainted nodes in the capacity denominator when calculating utilisation.

When enabled, three call sites in `scaleNodeGroup` switch from using `untaintedNodes` to a combined `capacityNodes` slice:

- `CalculateNodesCapacity` (the denominator)
- `calcPercentUsage` (the node count for scale-from-zero detection)
- `calcScaleUpDelta` (the node set for scale-up calculation)

Force-tainted nodes and cordoned nodes are never included regardless of the flag, since force-tainted nodes are being aggressively removed and cordoned nodes are manually excluded by operators.
Why
Escalator calculates utilisation as total pod requests divided by allocatable capacity of untainted nodes only. When a node is tainted, it drops from the denominator immediately, but its pods continue running and stay in the numerator. On workloads where pods do not drain immediately after tainting, this inflates utilisation artificially.
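To make the effect concrete (numbers invented for illustration): with 10 nodes of 128 vCPU each, allocatable capacity is 1280 vCPU, so pods requesting 640 vCPU read as 50% utilisation. Taint two nodes and the denominator drops to 1024 vCPU while the numerator stays at 640, so the reading jumps to 62.5% with no change in actual demand.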
We observed this in production on a cluster running ~92 bare-metal nodes (m6id.metal, 128 vCPU each) during a demand transition.
With `include_tainted_in_capacity: true`, the denominator stays stable after tainting because tainted nodes still count toward capacity. The utilisation reading reflects actual demand changes rather than the controller's own actions.

Threshold re-tuning required
Enabling this flag changes the meaning of the utilisation metric. A cluster that previously read 70% (untainted-only denominator) might read 55% (full-fleet denominator) because the denominator is larger. Any deployment enabling this flag will need to adjust its `taint_lower_capacity_threshold_percent`, `taint_upper_capacity_threshold_percent`, and `scale_up_threshold_percent` accordingly.

This is why it's a feature flag rather than a default change. Existing deployments are completely unaffected.
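As a first-order sanity check (assuming a roughly homogeneous fleet), the new reading scales by the untainted share of capacity: utilisation_full ≈ utilisation_untainted × (untainted capacity ÷ total capacity). The 70% to 55% example above corresponds to roughly 8 of every 10 nodes being untainted (70% × 0.8 = 56%).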
Testing
Three test cases covering: