Topology Spread Constraints calculations don't apply across mutually exclusive provisioners #975

Closed
olemarkus opened this issue Dec 13, 2021 · 10 comments
Labels
bug Something isn't working scheduling Issues related to Kubernetes scheduling

Comments

@olemarkus
Contributor

Version

Karpenter: v0.5.2

Kubernetes: v1.23.0

Expected Behavior

A topologySpreadConstraint, at least with whenUnsatisfiable: DoNotSchedule, should never be violated by scheduling decisions.

Actual Behavior

If there are two Provisioners with mutually exclusive subnets, Karpenter will just pick one and schedule all instances and Pods into that Provisioner's subnets, going far above the configured maxSkew. This happens even if there are other Nodes in the cluster with other topologyKey values.

Steps to Reproduce the Problem

Create two Provisioners with mutually exclusive or only partially overlapping subnets.
Create a Deployment with a handful of replicas and a spread constraint with maxSkew: 1 and topologyKey on zone.
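The reported violation can be sketched numerically. A minimal Python illustration (zone names and replica counts are invented for the example, not taken from Karpenter's code): the selected Provisioner spreads pods across only the zones it knows about, so the cluster-wide skew exceeds maxSkew: 1 whenever the cluster has zones the Provisioner lacks.

```python
# Hypothetical illustration of the reported skew violation.
# Zone names and pod counts are made up for this example.

def skew(counts_by_zone):
    """Skew as defined by topologySpreadConstraints:
    max pod count in any topology domain minus the min."""
    counts = list(counts_by_zone.values())
    return max(counts) - min(counts)

# The cluster has nodes in three zones, but the selected Provisioner
# only launches capacity into zone-a and zone-b.
provisioner_zones = {"zone-a", "zone-b"}
cluster_zones = {"zone-a", "zone-b", "zone-c"}

# Six replicas, all placed into the selected Provisioner's zones.
placements = {"zone-a": 3, "zone-b": 3, "zone-c": 0}

# Within the Provisioner's own zones the spread looks fine...
print(skew({z: placements[z] for z in provisioner_zones}))  # 0
# ...but across all cluster zones, maxSkew: 1 is violated.
print(skew({z: placements[z] for z in cluster_zones}))      # 3
```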

@olemarkus olemarkus added the bug Something isn't working label Dec 13, 2021
@ellistarn ellistarn changed the title Karpenter is not respecting topologySpreadConstraints Topology Spread Constraints across mutually exclusion provisioners Dec 13, 2021
@ellistarn ellistarn changed the title Topology Spread Constraints across mutually exclusion provisioners Topology Spread Constraints calculations don't apply across mutually exclusion provisioners Dec 13, 2021
@ellistarn ellistarn added the scheduling Issues related to Kubernetes scheduling label Dec 13, 2021
@ellistarn
Contributor

This is a super tricky problem and definitely an edge case for topology. Thanks for pointing this out.

Can you dig in a little more on the use case for a single topology spread selector across distinct provisioners? Provisioners manage distinct sets of capacity and are unaware of each other. In order to compute topology, we need to know the set of viable zones to spread across. Right now, that uses the zones of whichever provisioner was selected.

That case you're pointing out is cross-provisioner topology awareness. It's possible to construct topology selectors across multiple deployments, but this is not recommended:

So if this is not what you expect, we recommend the workload's topologySpreadConstraints[*].labelSelector to match its own labels.

ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#other-noticeable-semantics

In order to achieve what you're talking about, we would need to look at all of the viable zones for all provisioners that are able to schedule the pod, and then compute topology that way. This seems somewhat computationally expensive and will require significant changes to the code. Anything you can provide to elaborate on the use case will help us decide if and how to prioritize this.
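One way to read "all of the viable zones for all provisioners that are able to schedule the pod" is as a union over the provisioners whose requirements admit the pod. A hedged sketch (the data model here is invented for illustration; Karpenter's internal types differ):

```python
# Hypothetical data model: each provisioner carries the zones it can
# launch into and labels it requires on pods before accepting them.

def viable_zones(pod_labels, provisioners):
    """Union of zones across every provisioner able to schedule the pod."""
    zones = set()
    for p in provisioners:
        # Admit the pod only if it carries all the required labels.
        if p["required_labels"].items() <= pod_labels.items():
            zones |= p["zones"]
    return zones

provisioners = [
    {"name": "west-a",  "zones": {"zone-a"},           "required_labels": {}},
    {"name": "west-bc", "zones": {"zone-b", "zone-c"}, "required_labels": {}},
    {"name": "gpu",     "zones": {"zone-d"},           "required_labels": {"gpu": "true"}},
]

print(viable_zones({"app": "web"}, provisioners))
# zone-a, zone-b, and zone-c (the gpu provisioner does not admit the pod)
```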

@ellistarn ellistarn changed the title Topology Spread Constraints calculations don't apply across mutually exclusion provisioners Topology Spread Constraints calculations don't apply across mutually exclusive provisioners Dec 13, 2021
@bwagner5
Contributor

This is something for us to consider in how topology spread works across provisioners, but I suspect that part of the issue here is that kOps IGs are often set up with only one AZ, which would require multiple provisioners. Ideally, one IG would span all zones available for worker nodes, and you shouldn't run into weird topology spreads like you're seeing.

Is there a use-case for multiple provisioners?

@olemarkus
Contributor Author

So kOps will by default use one AZ per IG because of limitations with ASGs. That reason doesn't apply to karpenter.sh, so there is no particular need to do the same by default.

On one hand, I do have use cases for Provisioners with individual zones (related to avoiding cross-AZ traffic across VPCs and accounts), so with even more Provisioners (one for this workload, one for generic workloads) this can still work.
On the other hand, topologySpreadConstraint is pretty clear about what it means and that it must work considering the state of the cluster, not the provisioner.

So, similar to how Karpenter bails on Pod affinities, it may be that it should bail on whenUnsatisfiable: DoNotSchedule.

I also think the documentation should be very clear about how Karpenter handles spread constraints and that it may not satisfy workload requirements.

@ellistarn
Contributor

Agreed that we need to resolve this longer term, but it seems like the use case is fairly contrived. I'll think a bit about the easiest way to enforce DoNotSchedule across multiple provisioners.

@olemarkus
Contributor Author

One way of doing it is something like

  • Check which values exist for a topology key
  • Check that each provisioner can launch instances matching all of those values; skip those that can't
  • If no single provisioner can launch instances in all zones, bail

To me, this seems reasonable. In our use case we need those zone-specific Provisioners for the zone-specific workloads, and then an extra Provisioner that handles Deployments with zonal spread constraints.
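The three steps above can be sketched as a pre-flight check, assuming a simplified data model (provisioner names and zone sets are invented for illustration):

```python
# Hypothetical sketch of the proposed check: find a single provisioner
# whose zones cover every topology value observed in the cluster,
# and bail (refuse to provision) if none does.

def pick_provisioner(observed_zones, provisioners):
    """Return the name of the first provisioner covering all observed
    topology values, or None to signal that Karpenter should bail."""
    for p in provisioners:
        if observed_zones <= p["zones"]:
            return p["name"]
    return None

observed = {"zone-a", "zone-b", "zone-c"}
provisioners = [
    {"name": "zonal-a", "zones": {"zone-a"}},
    {"name": "general", "zones": {"zone-a", "zone-b", "zone-c"}},
]

print(pick_provisioner(observed, provisioners))      # general
print(pick_provisioner({"zone-d"}, provisioners))    # None -> bail
```

The zone-specific provisioner is skipped because it cannot cover all observed values; only the general provisioner qualifies for pods carrying zonal spread constraints.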

Note that for kOps, we currently hard code the subnet selector to kubernetes.io/role/internal-elb=1, which happens to coincide with subnets that we launch instances into. This way we support those who bring their own subnets and do their own subnet tagging. We will probably allow custom selectors in the future.

@ellistarn ellistarn added the burning Time sensitive issues label Jan 25, 2022
@ellistarn ellistarn added blocked Unable to make progress due to some dependency and removed burning Time sensitive issues labels Feb 2, 2022
@ellistarn
Contributor

Removing burning from this for a couple reasons. Right now we recommend that only one provisioner is viable per pod, and this bug doesn't cause problems for this case. In the future we may introduce other features like provisioner priority (#783), where this would become a bigger deal. Marking this as blocked until then.

@github-actions
Contributor

This issue is stale because it has been open 25 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Feb 28, 2022
@olemarkus
Contributor Author

Bump. 25 days is aggressive

@ellistarn ellistarn added roadmap We're thinking about next steps and removed roadmap We're thinking about next steps labels Mar 1, 2022
@ellistarn
Contributor

#1441

@ellistarn ellistarn removed the stale label Mar 1, 2022
@ellistarn ellistarn mentioned this issue Mar 28, 2022
@ellistarn
Contributor

WIP for this here: #1673

@njtran njtran removed the blocked Unable to make progress due to some dependency label Apr 18, 2022