Improve docs on topology spread across zones #2572
Comments
I did a bit of a dive into this topic because I was running into this issue. I was able to solve it with what you suggested, @chrisnegus, using the following pause pods:

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: az-headroom
value: -1
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: az-headroom
  labels:
    app: az-headroom
spec:
  replicas: 3 # As many AZs as you have
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: az-headroom
  template:
    metadata:
      labels:
        app: az-headroom
    spec:
      priorityClassName: az-headroom
      containers:
        - image: k8s.gcr.io/pause
          imagePullPolicy: IfNotPresent
          name: az-headroom
          resources:
            requests:
              cpu: 4 # Set to whatever
              memory: 4Gi # Set to whatever
      nodeSelector:
        nodegroup: myNodeGroupIWantAcrossAZs
      tolerations:
        - operator: Exists
          effect: NoSchedule
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: az-headroom
              topologyKey: topology.kubernetes.io/zone

This is a similar approach to how … works. However, if you're on Kubernetes 1.25, you may be able to use the new …, which would let you use … instead.
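The feature linked in that last sentence was lost from the text above. Assuming it refers to the minDomains field for topologySpreadConstraints (beta and enabled by default in Kubernetes 1.25), a minimal sketch of that alternative could look like the following; the deployment name, labels, image, and zone count of 3 are illustrative, not from the original comment.

```yaml
# Sketch only: assumes the Kubernetes 1.25 feature mentioned above is minDomains.
# Names, the image, and the zone count of 3 are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-example
spec:
  replicas: 6
  selector:
    matchLabels:
      app: spread-example
  template:
    metadata:
      labels:
        app: spread-example
    spec:
      containers:
        - name: pause
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
      topologySpreadConstraints:
        - maxSkew: 1
          minDomains: 3                      # fewer than 3 zones counts as unsatisfiable
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # minDomains only takes effect with DoNotSchedule
          labelSelector:
            matchLabels:
              app: spread-example
```

With minDomains set, pods that cannot be placed across three distinct zones stay Pending instead of piling into the zones that already have nodes, which gives the autoscaler a pending pod to act on for the missing zone.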
We've been using this with success, assuming there are pre-existing Karpenter nodes in all AZs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  labels:
    app: inflate
spec:
  selector:
    matchLabels:
      app: inflate
  replicas: 2
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
          resources:
            requests:
              cpu: 3
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: inflate
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
        - labelSelector:
            matchLabels:
              app: inflate
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
      nodeSelector:
        kubernetes.io/os: linux
        kubernetes.io/arch: amd64

We use this deployment, one per AZ in the ENIConfig, to force Karpenter to create one node in each AZ:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate-${az}
  labels:
    app: inflate
spec:
  selector:
    matchLabels:
      app: inflate
  replicas: 1
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: inflate
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
      nodeSelector:
        kubernetes.io/os: linux
        kubernetes.io/arch: amd64
        topology.kubernetes.io/zone: ${az}
As for what happens in this scenario: you normally get lucky and it works out, since Karpenter does launch the required nodes. The situation where it doesn't work is when nodes in the existing AZs become ready before a node exists in the new AZ; in that case, kube-scheduler will schedule the pods onto the node(s) that launch first, since it isn't aware of the new AZ.
Is an existing page relevant?
https://karpenter.sh/v0.16.3/tasks/scheduling/#topology-spread
What karpenter features are relevant?
Using topologySpreadConstraints in pod scheduling to balance pods across multiple zones.
How should the docs be improved?
Until a node exists in a zone, the Kubernetes scheduler doesn't know that the zone exists. So if, for example, you wanted to use topologySpreadConstraints to spread pods across zone-a, zone-b, and zone-c, but nodes have only ever come up in zone-a and zone-b, the scheduler would only spread pods across nodes in zone-a and zone-b and would never cause nodes to be created in zone-c.
I'm proposing to update the Topology Spread documentation to describe workarounds to this problem. This could include launching a pause container in each zone you want the scheduler to use before launching the pods you want spread across those zones, as sketched below.
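For illustration, a condensed version of the pause-pod approach shared earlier in this thread might look like the following; the PriorityClass, the names, and the us-east-1a zone value are placeholders, and you would repeat (or template) the Deployment once per zone.

```yaml
# Illustrative sketch of the pause-pod workaround; names, the priority value,
# and the zone label value are placeholders. Create one Deployment per zone.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: zone-placeholder
value: -1                 # low priority so real workloads can preempt these pods
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-placeholder-us-east-1a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zone-placeholder
  template:
    metadata:
      labels:
        app: zone-placeholder
    spec:
      priorityClassName: zone-placeholder
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # pins the pod, and hence a node, to this zone
      containers:
        - name: pause
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
```

Once one of these pods is running in each zone, every zone is a visible topology domain, and the real workload's topologySpreadConstraints can spread across all of them.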