
Improve docs on topology spread across zones #2572

Closed
chrisnegus opened this issue Sep 30, 2022 · 3 comments · Fixed by #4476
Assignees: chrisnegus
Labels: documentation (Improvements or additions to documentation)

Comments

@chrisnegus (Member)

Is an existing page relevant?

https://karpenter.sh/v0.16.3/tasks/scheduling/#topology-spread

What karpenter features are relevant?

Using topologySpreadConstraints in pod scheduling to balance pods across multiple zones.

How should the docs be improved?

Until a pod has been scheduled to a zone, the Kubernetes scheduler doesn't know that the zone exists. So if, for example, you wanted to use topologySpreadConstraints to spread pods across zone-a, zone-b, and zone-c, but the Kubernetes scheduler has only scheduled pods to zone-a and zone-b, it would only spread pods across nodes in zone-a and zone-b and never create nodes in zone-c.

I'm proposing to update the documentation to describe workarounds to this problem in the Topology Spread documentation. This could include things like launching a pause container in each zone you want the scheduler to use before launching the pods you want to spread across those zones.
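For context, here is a minimal hypothetical manifest (names and values are illustrative, not taken from the docs) showing the kind of constraint this affects. If the cluster only has nodes in zone-a and zone-b when these pods are scheduled, the skew calculation never takes zone-c into account:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-example # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: spread-example
  template:
    metadata:
      labels:
        app: spread-example
    spec:
      containers:
        - name: pause
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
      topologySpreadConstraints:
        # Intended to spread evenly across zone-a, zone-b, and zone-c,
        # but the scheduler only considers zones that already have nodes
        # at scheduling time.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: spread-example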

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@chrisnegus chrisnegus changed the title Improve maxskew docs on topology spread across zones Improve docs on topology spread across zones Sep 30, 2022
@chrisnegus chrisnegus self-assigned this Sep 30, 2022
@chrisnegus chrisnegus added the documentation (Improvements or additions to documentation) label Sep 30, 2022
@hawkesn (Contributor) commented Oct 24, 2022

I did a bit of a dive into this topic because I was running into this issue. I was able to solve it, as you suggested @chrisnegus, by using the following pause pods:

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: az-headroom
value: -1
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: az-headroom
  labels:
    app: az-headroom
spec:
  replicas: 3 # As many AZs as you have
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: az-headroom
  template:
    metadata:
      labels:
        app: az-headroom
    spec:
      priorityClassName: az-headroom
      containers:
        - image: k8s.gcr.io/pause
          imagePullPolicy: IfNotPresent
          name: az-headroom
          resources:
            requests:
              cpu: 4 # Set to whatever
              memory: 4Gi # Set to whatever
      nodeSelector:
        nodegroup: myNodeGroupIWantAcrossAZs
      tolerations:
      - operator: Exists
        effect: NoSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: az-headroom
            topologyKey: topology.kubernetes.io/zone

This is a similar approach to how the cluster-proportional-autoscaler works for overprovisioning.

However, if you're on Kubernetes 1.25, you may be able to use the new minDomains feature:
https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#spread-constraint-definition

minDomains indicates a minimum number of eligible domains. This field is optional. A domain is a particular instance of a topology. An eligible domain is a domain whose nodes match the node selector.

This would let you use topologySpreadConstraints instead of podAntiAffinity.
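As a rough sketch (assuming Kubernetes 1.25+ with the MinDomainsInPodTopologySpread feature gate available, and reusing the az-headroom label from the example above), the constraint in the pod spec could look something like this:

      # Sketch only: minDomains requires whenUnsatisfiable: DoNotSchedule.
      # If fewer than 3 zones currently have matching nodes, the global
      # minimum is treated as 0, so pods stay pending and Karpenter is
      # pushed to provision capacity in the missing zone(s).
      topologySpreadConstraints:
        - maxSkew: 1
          minDomains: 3
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: az-headroom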

@FernandoMiguel (Contributor)

We've been using this with success, assuming there are pre-existing Karpenter nodes in all AZs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  labels:
    app: inflate
spec:
  selector:
    matchLabels:
      app: inflate
  replicas: 2
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
          resources:
            requests:
              cpu: 3
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: inflate
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
        - labelSelector:
            matchLabels:
              app: inflate
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
      nodeSelector:
        kubernetes.io/os: linux
        kubernetes.io/arch: amd64

We use this deployment, one per AZ in the ENIConfig, to force Karpenter to create one node in each AZ:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate-${az}
  labels:
    app: inflate
spec:
  selector:
    matchLabels:
      app: inflate
  replicas: 1
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: inflate
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
      nodeSelector:
        kubernetes.io/os: linux
        kubernetes.io/arch: amd64
        topology.kubernetes.io/zone: ${az}

@tzneal (Contributor) commented Jun 14, 2023

As for what happens in this scenario: you normally get lucky and it works out, since Karpenter does launch the required nodes. The situation where it doesn't is when:
a. The nodes launched can handle more than the required number of pods (e.g., you launch three pods and they all fit on one node)
b. The newly launched nodes start up at slightly different times, with the first ones to go Ready being those in AZs that your existing MNG nodes are already in
c. You don't have existing nodes in all three AZs

In that case, kube-scheduler will schedule the pods to the node(s) that launch first, since it isn't aware of the new AZ.
