Topology Spread Constraints calculations don't apply across mutually exclusive provisioners #975

Closed
olemarkus opened this issue Dec 13, 2021 · 10 comments
Labels
bug Something isn't working scheduling Issues related to Kubernetes scheduling

Comments

@olemarkus
Contributor

Version

Karpenter: v0.5.2

Kubernetes: v1.23.0

Expected Behavior

A topologySpreadConstraint, at least with whenUnsatisfiable: DoNotSchedule, should never be violated by scheduling decisions.

Actual Behavior

If there are two Provisioners with mutually exclusive subnets, Karpenter will just pick one and schedule all instances and Pods into that Provisioner's subnets, going far above the configured maxSkew. This happens even if there are other Nodes in the cluster with other topologyKey values.

Steps to Reproduce the Problem

Create two Provisioners with mutually exclusive or only partially overlapping subnets.
Create a Deployment with a handful of replicas and a spread constraint with maxSkew: 1 and topologyKey on zone.
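The reported violation can be sketched numerically. A minimal Python illustration (zone names and replica counts are invented for the example, not taken from Karpenter's code): the selected Provisioner spreads pods across only the zones it knows about, so the cluster-wide skew exceeds maxSkew: 1 whenever the cluster has zones the Provisioner lacks.

```python
# Hypothetical illustration of the reported skew violation.
# Zone names and pod counts are made up for this example.

def skew(counts_by_zone):
    """Skew as defined by topologySpreadConstraints:
    max pod count in any topology domain minus the min."""
    counts = list(counts_by_zone.values())
    return max(counts) - min(counts)

# The cluster has nodes in three zones, but the selected Provisioner
# only launches capacity into zone-a and zone-b.
provisioner_zones = {"zone-a", "zone-b"}
cluster_zones = {"zone-a", "zone-b", "zone-c"}

# Six replicas, all placed into the selected Provisioner's zones.
placements = {"zone-a": 3, "zone-b": 3, "zone-c": 0}

# Within the Provisioner's own zones the spread looks fine...
print(skew({z: placements[z] for z in provisioner_zones}))  # 0
# ...but across all cluster zones, maxSkew: 1 is violated.
print(skew({z: placements[z] for z in cluster_zones}))      # 3
```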

@olemarkus olemarkus added the bug Something isn't working label Dec 13, 2021
@ellistarn ellistarn changed the title Karpenter is not respecting topologySpreadConstraints Topology Spread Constraints across mutually exclusion provisioners Dec 13, 2021
@ellistarn ellistarn changed the title Topology Spread Constraints across mutually exclusion provisioners Topology Spread Constraints calculations don't apply across mutually exclusion provisioners Dec 13, 2021
@ellistarn ellistarn added the scheduling Issues related to Kubernetes scheduling label Dec 13, 2021
@ellistarn
Contributor

This is a super tricky problem and definitely an edge case for topology. Thanks for pointing this out.

Can you dig in a little more on the use case for a single topology spread selector across distinct provisioners? Provisioners manage distinct sets of capacity and are unaware of each other. In order to compute topology, we need to know the set of viable zones to spread across. Right now, that uses the zones of whichever provisioner was selected.

That case you're pointing out is cross-provisioner topology awareness. It's possible to construct topology selectors across multiple deployments, but this is not recommended:

So if this is not what you expect, we recommend the workload's topologySpreadConstraints[*].labelSelector to match its own labels.

ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#other-noticeable-semantics

In order to achieve what you're talking about, we would need to look at all of the viable zones for all provisioners that are able to schedule the pod, and then compute topology that way. This seems somewhat computationally expensive and will require significant changes to the code. Anything you can provide to elaborate on the use case will help us decide if and how to prioritize this.
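One way to read "all of the viable zones for all provisioners that are able to schedule the pod" is as a union over the provisioners whose requirements admit the pod. A hedged sketch (the data model here is invented for illustration; Karpenter's internal types differ):

```python
# Hypothetical data model: each provisioner carries the zones it can
# launch into and labels it requires on pods before accepting them.

def viable_zones(pod_labels, provisioners):
    """Union of zones across every provisioner able to schedule the pod."""
    zones = set()
    for p in provisioners:
        # Admit the pod only if it carries all the required labels.
        if p["required_labels"].items() <= pod_labels.items():
            zones |= p["zones"]
    return zones

provisioners = [
    {"name": "west-a",  "zones": {"zone-a"},           "required_labels": {}},
    {"name": "west-bc", "zones": {"zone-b", "zone-c"}, "required_labels": {}},
    {"name": "gpu",     "zones": {"zone-d"},           "required_labels": {"gpu": "true"}},
]

print(viable_zones({"app": "web"}, provisioners))
# zone-a, zone-b, and zone-c (the gpu provisioner does not admit the pod)
```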

@ellistarn ellistarn changed the title Topology Spread Constraints calculations don't apply across mutually exclusion provisioners Topology Spread Constraints calculations don't apply across mutually exclusive provisioners Dec 13, 2021
@bwagner5
Contributor

This is something for us to consider in how topology spread works across provisioners, but I suspect that part of the issue here is that kOps IGs are often set up with only one AZ, which would require multiple provisioners. Ideally, one IG would span all zones available for worker nodes, and you shouldn't run into weird topology spreads like you're seeing.

Is there a use-case for multiple provisioners?

@olemarkus
Contributor Author

So kOps will by default use one AZ per IG because of limitations with ASGs. That reason doesn't apply to karpenter.sh, so there is no particular need to do the same by default.

On one hand, I do have use cases for Provisioners with individual zones (related to avoiding cross-AZ traffic across VPCs and accounts), so with even more Provisioners (one for this workload, one for generic workloads) this can still work.
On the other hand, topologySpreadConstraint is pretty clear about what it means and that it must work considering the state of the cluster, not the provisioner.

So, similar to how Karpenter bails on Pod affinities, it may be that it should bail on whenUnsatisfiable: DoNotSchedule.

I also think the documentation should be very clear about how Karpenter handles spread constraints and that it may not satisfy workload requirements.

@ellistarn
Contributor

Agreed that we need to resolve this longer term, but it seems like the use case is fairly contrived. I'll think a bit about the easiest way to enforce DoNotSchedule across multiple provisioners.

@olemarkus
Contributor Author

One way of doing it is something like

  • Check which values exist for a topology key
  • Check that each provisioner can launch instances matching all of those values; skip those that can't
  • If no single provisioner can launch instances in all zones, bail

To me, this seems reasonable. In our use case we need those zone-specific Provisioners for the zone-specific workloads, and then an extra Provisioner that handles Deployments with zonal spread constraints.
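The three steps above can be sketched as a pre-flight check, assuming a simplified data model (provisioner names and zone sets are invented for illustration):

```python
# Hypothetical sketch of the proposed check: find a single provisioner
# whose zones cover every topology value observed in the cluster,
# and bail (refuse to provision) if none does.

def pick_provisioner(observed_zones, provisioners):
    """Return the name of the first provisioner covering all observed
    topology values, or None to signal that Karpenter should bail."""
    for p in provisioners:
        if observed_zones <= p["zones"]:
            return p["name"]
    return None

observed = {"zone-a", "zone-b", "zone-c"}
provisioners = [
    {"name": "zonal-a", "zones": {"zone-a"}},
    {"name": "general", "zones": {"zone-a", "zone-b", "zone-c"}},
]

print(pick_provisioner(observed, provisioners))      # general
print(pick_provisioner({"zone-d"}, provisioners))    # None -> bail
```

The zone-specific provisioner is skipped because it cannot cover all observed values; only the general provisioner qualifies for pods carrying zonal spread constraints.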

Note that for kOps, we currently hard code the subnet selector to kubernetes.io/role/internal-elb=1, which happens to coincide with subnets that we launch instances into. This way we support those who bring their own subnets and do their own subnet tagging. We will probably allow custom selectors in the future.

@ellistarn ellistarn added the burning Time sensitive issues label Jan 25, 2022
@ellistarn ellistarn added blocked Unable to make progress due to some dependency and removed burning Time sensitive issues labels Feb 2, 2022
@ellistarn
Contributor

Removing burning from this for a couple reasons. Right now we recommend that only one provisioner is viable per pod, and this bug doesn't cause problems for this case. In the future we may introduce other features like provisioner priority (#783), where this would become a bigger deal. Marking this as blocked until then.

@github-actions
Contributor

This issue is stale because it has been open 25 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Feb 28, 2022
@olemarkus
Contributor Author

Bump. 25 days is aggressive

@ellistarn ellistarn added roadmap We're thinking about next steps and removed roadmap We're thinking about next steps labels Mar 1, 2022
@ellistarn
Contributor

#1441

@ellistarn ellistarn removed the stale label Mar 1, 2022
@ellistarn ellistarn mentioned this issue Mar 28, 2022
@ellistarn
Contributor

WIP for this here: #1673

@njtran njtran removed the blocked Unable to make progress due to some dependency label Apr 18, 2022