**This project is WIP - DO NOT TRY TILL RELEASE**

A scaling simulator that determines which Gardener shoot worker pool must be scaled up to host unschedulable pods.
- Ensure you are using Go version `1.22`. Use `go version` to check your version.
- Run `./hack/setup.sh`. This will generate a `launch.env` file in the project dir.
  - Ex: `./hack/setup.sh -p scalesim` # sets up scalesim for sap-landscape-dev (default) and the scalesim cluster project
  - Ex: `./hack/setup.sh -l staging -p scalesim` # sets up scalesim for sap-landscape-staging and the scalesim cluster project
- Take a look at the generated `launch.env` and change params to your liking if you want.
- Source the `launch.env` file using the command below (only necessary once in a terminal session):
  ```
  set -o allexport && source launch.env && set +o allexport
  ```
- Run the simulation server:
  ```
  go run cmd/scalesim/main.go
  ```
- The `KUBECONFIG` for the simulated control plane should be generated at `/tmp/scalesim-kubeconfig.yaml`:
  ```
  export KUBECONFIG=/tmp/scalesim-kubeconfig.yaml
  kubectl get ns
  ```
- Install the EnvFile plugin.
- There is a run configuration already checked in at `.idea/.idea/runConfigurations/LaunchSimServer.xml`. This will automatically source the generated `launch.env`, leveraging the plugin.
- You should be able to execute using `Run > LaunchSimServer`.
```
curl -XPOST localhost:8080/op/sync/<myShoot>
curl -XDELETE localhost:8080/op/virtual-cluster
curl -XPOST localhost:8080/scenarios/A
```
TODO: REFINE THE BELOW
Given a Gardener shoot configured with different worker pools and Pod(s) to be deployed on the shoot cluster, the simulator will report the following advice:
- In case scale-up is needed, the simulator will recommend which worker pool must be scaled up to host the unschedulable Pod(s).
- The simulator will recommend which node belonging to which worker pool will host the Pod(s).
- ?? A check will then be made against the real shoot cluster on which the Pods will be deployed. The simulator's advice will be verified against real-world node scale-up and pod assignment.

The above will be repeated for different worker pool and Pod specs representing various simulation scenarios.
The simulator works by replicating the shoot cluster into its own virtual cluster, maintaining an independent copy of the api-server and scheduler. The engine then executes various simulation scenarios.

```mermaid
graph LR
engine--1:GetShootWorkerPoolAndClusterData-->ShootCluster
subgraph ScalerSimulator
engine--2:PopulateVirtualCluster-->apiserver
engine--3:RunSimulation-->simulation
simulation--DeployPods-->apiserver
simulation--LaunchNodesIfPodUnschedulable-->apiserver
simulation--QueryAssignedNode-->apiserver
scheduler--AssignPodToNode-->apiserver
simulation--ScalingRecommendation-->advice
end
advice[(ScalingRecommendation)]
```
Other simulations are Work In Progress at the moment.
A simple worker pool with `m5.large` (vCPU:2, 8GB).

```mermaid
graph TB
subgraph WorkerPool-P1
SpecB["machineType: m5.large\n(vCPU:2, 8GB)\nmin:1,max:5"]
end
```
- We taint the existing nodes in the real shoot cluster.
- We create `replicas` number of app pods in the real shoot cluster.
- We get the daemon set pods from the real shoot cluster.
- We get the unscheduled app pods from the real shoot cluster.
- We synchronize the virtual cluster nodes with the real shoot cluster nodes.
- We scale all the virtual worker pools till max. `Node.Allocatable` is now considered.
- We deploy the daemon set pods into the virtual cluster.
- We deploy the unscheduled application pods into the virtual cluster.
- We wait till there are no unscheduled pods or till timeout.
- We "trim" the virtual cluster after the scheduler assigns pods (delete empty nodes and the daemon set pods on those nodes).
- We obtain the Node<->Pod assignments.
- We compute the scaling recommendation and print it.
- We scale up the real shoot cluster and compare our scale-up recommendation against the shoot's actual scale-up.
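The "trim" step above can be sketched as follows. This is a minimal, illustrative model, not the simulator's actual types: a virtual node survives trimming only if it hosts at least one non-daemon-set pod; empty nodes and the daemon set pods on them are deleted.

```go
package main

import "fmt"

// Pod is a simplified stand-in for the Kubernetes Pod type used here only
// to illustrate the trim step; names are hypothetical.
type Pod struct {
	Name      string
	NodeName  string // "" means unscheduled
	DaemonSet bool
}

// trim splits nodes into those hosting at least one non-daemon-set pod
// (kept) and empty ones (deleted along with their daemon set pods).
func trim(nodes []string, pods []Pod) (keep []string, deleted []string) {
	used := map[string]bool{}
	for _, p := range pods {
		if !p.DaemonSet && p.NodeName != "" {
			used[p.NodeName] = true
		}
	}
	for _, n := range nodes {
		if used[n] {
			keep = append(keep, n)
		} else {
			deleted = append(deleted, n)
		}
	}
	return keep, deleted
}

func main() {
	nodes := []string{"n1", "n2", "n3"}
	pods := []Pod{
		{Name: "app-1", NodeName: "n1"},
		{Name: "ds-1", NodeName: "n2", DaemonSet: true}, // daemon-set only: n2 counts as empty
	}
	keep, deleted := trim(nodes, pods)
	fmt.Println(keep, deleted) // [n1] [n2 n3]
}
```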
This is to demonstrate preference for one worker pool over others through a simple order of declaration.

Three worker pools in decreasing order of resources. We ask operators to configure the shoot with a declaration-based priority, paying careful attention to their max bounds.

```mermaid
graph TB
subgraph WorkerPool-P3
SpecC["m5.large\n(vCPU:2, 8GB)\nmin:1,max:2"]
end
subgraph WorkerPool-P2
SpecB["m5.xlarge\n(vCPU:4, 16GB)\nmin:1,max:2"]
end
subgraph WorkerPool-P1
SpecA["m5.2xlarge\n(vCPU:8, 32GB)\nmin:1,max:2"]
end
```
- We sync the virtual cluster nodes with the real shoot cluster nodes.
- We deploy `podA` count of Pod-A's and `podB` count of Pod-B's.
- We go through each worker pool in order of declaration.
  - We scale the worker pool till max.
  - We wait for an interval to permit the scheduler to assign pods to nodes.
  - If there are still unschedulable Pods we continue to the next worker pool, else break.
- We trim the virtual cluster after the scheduler finishes.
- We obtain the Node<->Pod assignments.
- We compute the scaling recommendation and print it.

This mechanism ensures that nodes belonging to a preferred worker pool of higher priority are scaled first, before pools of lower priority.
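The declaration-order loop can be sketched with a toy capacity model; `PodsPerNode` is an illustrative stand-in for real `Node.Allocatable` fitting, and the names are assumptions, not the simulator's API:

```go
package main

import "fmt"

// WorkerPool is a simplified stand-in for a shoot worker pool spec.
type WorkerPool struct {
	Name        string
	Min, Max    int // node count bounds; Min nodes already exist
	PodsPerNode int // toy capacity: pending pods that fit on one new node
}

// scaleByPriority walks pools in declaration order, adding nodes up to each
// pool's max before moving to the next, and returns the per-pool scale-up
// plus the pods left unschedulable once every pool is exhausted.
func scaleByPriority(pools []WorkerPool, pending int) (map[string]int, int) {
	scaleUp := map[string]int{}
	for _, wp := range pools {
		for n := wp.Min; n < wp.Max && pending > 0; n++ {
			scaleUp[wp.Name]++
			pending -= wp.PodsPerNode
		}
		if pending <= 0 {
			break
		}
	}
	if pending < 0 {
		pending = 0
	}
	return scaleUp, pending
}

func main() {
	pools := []WorkerPool{
		{Name: "P1", Min: 1, Max: 2, PodsPerNode: 4}, // highest priority
		{Name: "P2", Min: 1, Max: 2, PodsPerNode: 2},
	}
	up, left := scaleByPriority(pools, 6)
	fmt.Println(up, left) // P1 gets 1 node (4 pods), then P2 gets 1 node (2 pods)
}
```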
TODO: We can also enhance this scenario with a simulated back-off when WPs run out of capacity.
```mermaid
graph TB
subgraph WP-B
SpecB["machineType: m5.large\nmin:1,max:2"]
end
subgraph WP-A
SpecA["machineType: m5.large\nmin:1,max:2,Taint:foo=bar:NoSchedule"]
end
```
- The first worker pool is tainted with `NoSchedule`.
- Two Pod specs, X and Y, are created: one with a toleration to the taint and one without, respectively.
- Replicas of `Pod-X` are deployed which cross the capacity of the tainted node belonging to `WP-A`.
  - The simulation should advise scaling `WP-A` and assign the Pods to tainted nodes of `WP-A`.
- More replicas of `Pod-X` are created which cannot fit into `WP-A` since it has reached its max.
  - The simulator should report that `WP-A` max is exceeded; pod replicas remain unschedulable and no other WP should be scaled.
- Many replicas of `Pod-Y` (the spec without the toleration) are deployed which cross the capacity of the existing node in `WP-B`.
  - The simulation should scale `WP-B` and assign the Pods to nodes of `WP-B`.
One existing worker pool with 3 assigned zones.

```mermaid
graph TB
subgraph WP-A
SpecB["machineType: m5.large\nmin:1,max:3, zones:a,b,c"]
end
```

There is one node started in the first zone `a`.

`Pod-X` has a spec with `replicas: 3`, and `topologySpreadConstraints` with `maxSkew: 1` and `whenUnsatisfiable: DoNotSchedule`.
- Deploy `Pod-X`, mandating distribution of each replica on a separate zone.
- The simulator should recommend scaling nodes for zones `b`, `c`.
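The expected recommendation can be sketched with a toy even-spread model: with `maxSkew: 1` and `whenUnsatisfiable: DoNotSchedule`, 3 replicas across zones a/b/c must land one per zone, so zones with replicas but no node need a scale-up. Function names here are illustrative assumptions:

```go
package main

import "fmt"

// spreadReplicas assigns replicas round-robin across zones, which keeps the
// count difference between any two zones at most 1 (i.e. maxSkew: 1).
func spreadReplicas(zones []string, replicas int) map[string]int {
	counts := map[string]int{}
	for i := 0; i < replicas; i++ {
		counts[zones[i%len(zones)]]++
	}
	return counts
}

// zonesNeedingNodes returns zones that received replicas but have no node yet;
// the zone list is hard-coded here only to keep the demo output stable.
func zonesNeedingNodes(counts map[string]int, nodesPerZone map[string]int) []string {
	var need []string
	for _, z := range []string{"a", "b", "c"} {
		if counts[z] > 0 && nodesPerZone[z] == 0 {
			need = append(need, z)
		}
	}
	return need
}

func main() {
	counts := spreadReplicas([]string{"a", "b", "c"}, 3)     // one replica per zone
	need := zonesNeedingNodes(counts, map[string]int{"a": 1}) // only zone a has a node
	fmt.Println(need)                                         // [b c]
}
```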
Check out how much time such a simulation of node scale-up would take here.
- 400+ pods.
- Scale up WPs in order of priority until max is reached, then move to the next WP in priority.
- Analogous to our CA priority expander.

PROBLEM:
- We need a better algorithm than launching virtual nodes one-by-one across pools with priority.
- We need to measure how fast this approach is using virtual nodes with a large number of Pods and worker pools.
- TODO: Look into whether kube-scheduler has recommendation advice.
- Karpenter-like mechanics.
We have a worker pool with started nodes and min: 0.
- All Pods are un-deployed.
- After `scaleDownThreshold` time, the WP should be scaled down to min.

This requires resource-utilization computation and we won't do this for now.
TODO: Maddy will describe this.
TODO: describe me
Vedran's concerns:
- Load with a large number of Pods and Pools.
- Reduce computational weight when there is a priority expander. Check performance.
- How to determine whether the scheduler ran into an error and failed assignment.
- How easy is it to consume the result of the kube-scheduler in case there is no assigned node?
- The machine selector approach may not be computationally scalable ??
- In order to be computationally feasible we need the node priority scores from the scheduler.

What to demo for Vedran today?
- Daemon set + allocatable is taken care of.
- Declaration-based priority.

We will let him know that we will take up: a) machine cost minimization b) machine resource minimization c) performance load test d) stretch: simple scale-down, and then wind up the POC.