kueue is working!
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Apr 17, 2024
1 parent 3c2bc6c commit bc20c69
Showing 8 changed files with 456 additions and 66 deletions.
72 changes: 14 additions & 58 deletions google/scheduler/run9/README.md
@@ -15,7 +15,6 @@ We can use the [c2d-standard-8](https://cloud.google.com/compute/docs/compute-op
3. If/when a scheduler setup clogs (and the queue stops moving)

For experiment prototyping see [run8](../run8).


## Experiments

@@ -223,17 +222,12 @@ I'm trying Kueue first this time, before installing the cert manager, because I
kubectl apply --server-side -f ./crd/kueue.yaml
```

Then I did:

```console
TOTAL_ALLOCATABLE=$(kubectl get node --selector='!node-role.kubernetes.io/master,!node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.status.allocatable.memory}{"\n"}{end}' | numfmt --from=auto | awk '{s+=$1} END {print s}')
echo $TOTAL_ALLOCATABLE
```
And changed the memory in [crd/cluster-queues.yaml](crd/cluster-queues.yaml) to be double what is in the cluster.
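
As a rough illustration of the arithmetic behind that doubled value, here is a minimal Python sketch of what the `numfmt`/`awk` pipeline computes. The function names `to_bytes` and `doubled_quota_mi` and the two node values are hypothetical, chosen only for the example; they are not part of this repository.

```python
# Multipliers for the suffixes Kubernetes commonly uses in memory quantities.
# Binary suffixes come first so "Mi" is matched before the decimal "M".
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '29085Mi' or '1024' to a number of bytes."""
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: the value is already in bytes


def doubled_quota_mi(allocatable: list[str]) -> str:
    """Sum per-node allocatable memory, double it, and format it in Mi."""
    total = sum(to_bytes(q) for q in allocatable)
    return f"{(2 * total) // 2**20}Mi"


# Hypothetical per-node values, as kubectl would report them one per line.
quota = doubled_quota_mi(["4Gi", "4Gi"])
print(quota)
```

The resulting string is the kind of value that would go into `nominalQuota` for `memory`; the real cluster values will differ.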
Note that the Kueue configuration is matched to this cluster _exactly_, so if you change the cluster you need to change the configuration too! Any resource requested in the Job pod template must also be defined in the cluster queue resources, otherwise the Job (and its pods) cannot be admitted.
Then apply:

```bash
kubectl apply -f cluster-queues.yaml
kubectl apply -f ./crd/cluster-queues.yaml
```
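
The admission rule described above (every resource requested by the pod template has to appear in the cluster queue) can be checked before submitting. The following is a minimal sketch of such a check, not Kueue's own logic: `uncovered_resources` is a hypothetical helper, the `covered` set mirrors the `coveredResources` list in `crd/cluster-queues.yaml` from this commit, and the `requests` dictionary is an invented example.

```python
def uncovered_resources(requests: dict[str, str], covered: set[str]) -> list[str]:
    """Return the requested resource names that the cluster queue does not cover."""
    return sorted(set(requests) - covered)


# Mirrors coveredResources in crd/cluster-queues.yaml after this commit.
covered = {"pods", "cpu"}

# Hypothetical pod template requests, for illustration only.
requests = {"cpu": "3", "memory": "1Gi"}

missing = uncovered_resources(requests, covered)
if missing:
    print("Not covered by the cluster queue:", ", ".join(missing))
```

An empty result means every requested resource is covered by the queue; anything listed would need to be added to the queue or dropped from the pod template.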

Then try running experiments:
@@ -242,16 +236,9 @@ Then try running experiments:
time python run_experiments.py --outdir ./results/mixed/kueue --config-name mixed --batches 1 --iters 10 --kueue
```

Following the guides at https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/ and
https://kueue.sigs.k8s.io/docs/tasks/manage/setup_sequential_admission/. Note that the queue is working now but pods are not scheduling, so there are more configuration issues.

### Install Cert Manager
There is more information [here](https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/) and
[here](https://kueue.sigs.k8s.io/docs/tasks/manage/setup_sequential_admission/). The user experience of Kueue is really nice!

The newer version of fluence requires the certificate manager. There is likely a way to do self-signed certs but we haven't tried it yet.

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.yaml
```

### Fluence

@@ -267,6 +254,14 @@ helm install \
fluence as-a-second-scheduler/
```

### Install Cert Manager

The newer version of fluence requires the certificate manager. There is likely a way to do self-signed certs but we haven't tried it yet.

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.yaml
```


Ensure both scheduler pods are running (they just need to pull, etc).

@@ -384,47 +379,8 @@ helm uninstall fluence
### Analysis

```
python plot-lammps.py
python plot-schedulers.py
```
```console
scheduler experiment
coscheduling size-2-2-2-2 53.928729
size-3-2-2-2 3.83098
size-4-2-2-2 13.949011
size-5-2-2-2 10.058261
default size-2-2-2-2 83.771192
size-3-2-2-2 30.501831
size-4-2-2-2 44.728875
fluence size-2-2-2-2 36.072507
size-3-2-2-2 26.927059
size-4-2-2-2 21.665824
size-5-2-2-2 23.640751
size-6-2-2-2 29.867044
Name: total_time, dtype: object
scheduler experiment
coscheduling size-2-2-2-2 30.544364
size-3-2-2-2 1.042152
size-4-2-2-2 2.257874
size-5-2-2-2 NaN
default size-2-2-2-2 78.296599
size-3-2-2-2 NaN
size-4-2-2-2 NaN
fluence size-2-2-2-2 26.133976
size-3-2-2-2 40.233534
size-4-2-2-2 7.927240
size-5-2-2-2 12.581147
size-6-2-2-2 13.662106
Name: total_time, dtype: float64
fluence: 100
default: 22
cosched: 7
```

Results are in [img](img). Since the default scheduler clogged, coscheduling stopped working, and kueue didn't work, we can't say much from this, but we can compare the 100 jobs from fluence to the 22 from the default scheduler. General patterns I see:

- There is a tradeoff between "run it quickly" and "run it right." The default scheduler ran some jobs quickly, but at the expense of poor scheduling that led to clogging. Fluence took its time and completed all jobs, at the cost of waiting longer for each one (logically).
- I don't know why there would be a difference in lammps runtimes aside from just having too small a sample.


### Clean Up

10 changes: 7 additions & 3 deletions google/scheduler/run9/crd/cluster-queues.yaml
@@ -9,13 +9,17 @@ metadata:
name: "cluster-queue"
spec:
namespaceSelector: {}
resourceGroups: {}
resourceGroups:
- coveredResources: ["memory"]
- coveredResources: ["pods", "cpu"]
flavors:
- name: "default-flavor"
resources:
- name: "memory"
nominalQuota: 58171Mi # double the value of allocatable memory in the cluster
- name: "pods"
nominalQuota: 1000
- name: "cpu"
# 8 nodes * 4 each
nominalQuota: 40
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
6 changes: 3 additions & 3 deletions google/scheduler/run9/crd/job.yaml
@@ -26,9 +26,9 @@ spec:
app: {{ name }}
{% if scheduler == "scheduler-plugins-scheduler" %}scheduling.x-k8s.io/pod-group: {{ name }}{% endif %}
{% if scheduler == "kueue" %}kueue.x-k8s.io/queue-name: "user-queue"
kueue.x-k8s.io/pod-group-name: "{{ name }}"
#kueue-job: "yes"
kueue.x-k8s.io/pod-group-total-count: "{{ size }}"{% endif %}
# kueue.x-k8s.io/pod-group-total-count: "{{ size }}"
# annotations:
# kueue.x-k8s.io/pod-group-name: "{{ name }}"{% endif %}
spec:
subdomain: {{ service_name }}
{% if scheduler %}{% if scheduler != 'kueue' %}
1 change: 0 additions & 1 deletion google/scheduler/run9/crd/kueue.yaml
@@ -11076,7 +11076,6 @@ data:
integrations:
frameworks:
- "batch/job"
- "pod"
# podOptions:
# namespaceSelector:
# matchExpressions:
157 changes: 157 additions & 0 deletions google/scheduler/run9/img/scheduler-times.csv
@@ -0,0 +1,157 @@
,size,experiment,scheduler,log_time,external_recorded_time,submit_to_completion_time
0,2,size-2,fluence,2,0.589017,510.32831
1,6,size-6,fluence,3,0.710151,1817.592267
2,5,size-5,fluence,2,0.898728,2362.197094
3,2,size-2,fluence,1,0.576367,2805.219594
4,4,size-4,fluence,3,0.967244,861.577124
5,2,size-2,fluence,2,0.584673,1277.216029
6,3,size-3,fluence,2,0.947917,526.178264
7,4,size-4,fluence,2,1.011188,2506.716435
8,2,size-2,fluence,2,0.663358,2232.112499
9,6,size-6,fluence,3,1.100947,1518.75398
10,4,size-4,fluence,2,0.490796,2118.974825
11,3,size-3,fluence,3,1.335554,800.415871
12,2,size-2,fluence,1,0.666937,2643.467794
13,4,size-4,fluence,4,1.044488,1347.101095
14,4,size-4,fluence,4,0.483078,2693.506373
15,6,size-6,fluence,3,1.533291,2037.168192
16,6,size-6,fluence,3,0.504361,1035.492243
17,4,size-4,fluence,3,1.221547,1652.170447
18,3,size-3,fluence,2,0.497328,2490.196611
19,5,size-5,fluence,2,0.577249,1959.487034
20,3,size-3,fluence,1,0.538583,31.708529
21,5,size-5,fluence,2,0.473009,2568.866225
22,4,size-4,fluence,1,1.419682,33.509917
23,6,size-6,fluence,4,0.932162,342.384829
24,5,size-5,fluence,3,0.629275,1430.091979
25,6,size-6,fluence,3,0.604219,2800.495662
26,2,size-2,fluence,3,1.333825,2046.897224
27,2,size-2,fluence,3,1.088789,734.356521
28,5,size-5,fluence,2,0.994428,200.822102
29,6,size-6,fluence,2,1.30197,729.184566
30,2,size-2,fluence,2,1.649017,1590.426322
31,4,size-4,fluence,3,1.101769,1888.36977
32,5,size-5,fluence,3,1.069636,2170.253013
33,5,size-5,fluence,4,0.953637,1191.825852
34,6,size-6,fluence,3,1.122108,1264.824395
35,4,size-4,fluence,4,0.397764,2300.008814
36,6,size-6,fluence,2,0.586958,2634.414065
37,4,size-4,fluence,2,0.565239,579.081353
38,2,size-2,fluence,3,1.329715,1828.961675
39,3,size-3,fluence,2,0.705032,2281.31817
40,4,size-4,fluence,2,1.868388,1127.248383
41,2,size-2,fluence,2,1.418063,1047.838346
42,3,size-3,fluence,2,0.377506,2680.405033
43,3,size-3,fluence,3,1.48763,2100.764718
44,3,size-3,fluence,4,0.866908,1106.371179
45,6,size-6,fluence,2,1.068261,2224.082753
46,2,size-2,fluence,3,0.340875,109.813602
47,5,size-5,fluence,3,0.477034,2742.236554
48,5,size-5,fluence,3,1.557414,643.209079
49,5,size-5,fluence,4,0.873065,948.191075
50,3,size-3,fluence,3,0.4244,1323.706434
51,3,size-3,fluence,3,0.716192,1873.146966
52,2,size-2,fluence,3,0.760174,2449.278918
53,5,size-5,fluence,3,0.609066,1732.595296
54,3,size-3,fluence,3,1.344927,1609.728984
55,6,size-6,fluence,2,0.627962,2440.178354
56,2,size-2,kueue,3,0.146829,506.477925
57,6,size-6,kueue,33,0.096887,1017.721932
58,5,size-5,kueue,3,0.189722,26.629374
59,4,size-4,kueue,2,0.165721,715.461503
60,5,size-5,kueue,39,0.315341,348.341686
61,4,size-4,kueue,3,0.15415,138.451462
62,5,size-5,kueue,3,0.097161,583.093251
63,5,size-5,kueue,3,0.227283,261.322345
64,4,size-4,kueue,49,0.165421,61.969014
65,3,size-3,kueue,1,0.093449,1022.582834
66,6,size-6,kueue,38,0.330991,181.319918
67,6,size-6,kueue,35,0.142017,625.392719
68,2,size-2,kueue,3,0.221957,306.372683
69,2,size-2,kueue,2,0.136727,342.48441
70,6,size-6,kueue,38,0.145277,666.438975
71,5,size-5,kueue,39,0.392928,428.741875
72,5,size-5,kueue,3,0.18278,946.017915
73,6,size-6,kueue,35,0.226663,104.852617
74,4,size-4,kueue,2,0.092978,943.540753
75,4,size-4,kueue,36,0.110133,499.241316
76,3,size-3,kueue,35,0.263459,628.427439
77,2,size-2,kueue,1,0.095823,911.95796
78,3,size-3,kueue,4,0.174026,63.210713
79,6,size-6,kueue,29,0.136418,545.233749
80,3,size-3,kueue,1,0.096095,781.871511
81,3,size-3,kueue,1,0.097079,504.558338
82,4,size-4,kueue,11,0.115939,424.541629
83,3,size-3,kueue,2,0.088885,752.461093
84,2,size-2,kueue,1,0.199553,545.973478
85,2,size-2,kueue,2,0.130215,815.413231
86,3,size-3,kueue,2,0.166119,21.179539
87,2,size-2,kueue,2,0.207813,140.905613
88,5,size-5,kueue,42,0.118305,824.452301
89,5,size-5,kueue,39,0.213,224.90303
90,4,size-4,kueue,32,0.12122,466.737051
91,4,size-4,kueue,37,0.083686,985.183364
92,6,size-6,kueue,36,0.139485,389.863842
93,2,size-2,kueue,2,0.144818,750.365121
94,3,size-3,kueue,3,0.233794,470.026376
95,4,size-4,kueue,26,0.116874,227.226818
96,6,size-6,kueue,30,0.15688,304.856931
97,4,size-4,kueue,36,0.162737,749.044844
98,6,size-6,kueue,39,0.096085,787.557149
99,3,size-3,kueue,34,0.180651,264.290258
100,3,size-3,kueue,3,0.159289,338.07696
101,2,size-2,kueue,27,0.123184,67.08784
102,6,size-6,kueue,29,0.110274,867.749182
103,5,size-5,kueue,42,0.260032,910.392438
104,5,size-5,kueue,33,0.151238,710.390347
105,2,size-2,kueue,3,0.126576,19.345697
106,2,size-2,kueue,23,0.290234,522.702194
107,6,size-6,kueue,6,0.235747,968.659423
108,5,size-5,kueue,53,0.356399,64.387946
109,4,size-4,kueue,30,0.647357,701.986525
110,5,size-5,kueue,37,0.588025,406.167425
111,4,size-4,kueue,37,0.578389,153.763932
112,5,size-5,kueue,35,0.327706,658.882435
113,5,size-5,kueue,44,0.369311,327.182841
114,4,size-4,kueue,2,0.369246,19.89736
115,3,size-3,kueue,28,0.254365,960.660927
116,6,size-6,kueue,3,0.610334,196.71446
117,6,size-6,kueue,38,1.046147,619.279541
118,2,size-2,kueue,3,0.366951,332.801075
119,2,size-2,kueue,2,0.643695,410.016496
120,6,size-6,kueue,37,0.22937,745.760187
121,5,size-5,kueue,5,0.470434,497.752391
122,5,size-5,kueue,3,0.089022,1017.868257
123,6,size-6,kueue,39,0.269354,108.964696
124,4,size-4,kueue,34,0.241455,892.516268
125,4,size-4,kueue,34,0.634629,571.745548
126,3,size-3,kueue,3,0.215581,669.058841
127,2,size-2,kueue,26,0.118423,925.852238
128,3,size-3,kueue,2,0.544165,114.225795
129,6,size-6,kueue,13,0.27497,537.75667
130,3,size-3,kueue,3,0.786692,823.239987
131,3,size-3,kueue,2,0.315318,577.554629
132,4,size-4,kueue,28,0.334534,361.789226
133,3,size-3,kueue,4,0.407313,751.048618
134,2,size-2,kueue,2,0.217357,664.755017
135,2,size-2,kueue,3,0.196758,819.318696
136,3,size-3,kueue,1,0.404324,20.386391
137,2,size-2,kueue,1,0.911823,189.751453
138,5,size-5,kueue,4,0.386092,783.985026
139,5,size-5,kueue,4,0.391565,158.885314
140,4,size-4,kueue,34,0.217625,488.326022
141,4,size-4,kueue,19,0.087459,1012.231904
142,6,size-6,kueue,34,0.31533,445.619168
143,2,size-2,kueue,27,0.327083,780.376245
144,3,size-3,kueue,3,0.261745,452.622785
145,4,size-4,kueue,36,0.967952,234.892172
146,6,size-6,kueue,39,0.381879,280.950521
147,4,size-4,kueue,28,0.365582,814.561787
148,6,size-6,kueue,34,0.316026,858.643991
149,3,size-3,kueue,3,0.391982,241.271833
150,3,size-3,kueue,2,0.592592,367.306026
151,2,size-2,kueue,2,0.443521,68.732719
152,6,size-6,kueue,2,0.126909,932.258394
153,5,size-5,kueue,3,0.166188,896.57746
154,5,size-5,kueue,3,0.295688,705.454469
155,2,size-2,kueue,1,0.242137,20.886991
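
The file above holds 156 rows: 56 for fluence and 100 for kueue. A minimal sketch of how it might be summarized per scheduler is shown here, using only the Python standard library. The `summarize` function is a hypothetical helper, not the repository's `plot-schedulers.py`, and `SAMPLE` contains just four rows copied from the file so the block runs on its own.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Four rows copied from scheduler-times.csv so the example is self-contained.
SAMPLE = """\
,size,experiment,scheduler,log_time,external_recorded_time,submit_to_completion_time
0,2,size-2,fluence,2,0.589017,510.32831
1,6,size-6,fluence,3,0.710151,1817.592267
56,2,size-2,kueue,3,0.146829,506.477925
58,5,size-5,kueue,3,0.189722,26.629374
"""


def summarize(handle):
    """Return {scheduler: (job count, mean submit_to_completion_time)}."""
    times = defaultdict(list)
    for row in csv.DictReader(handle):
        times[row["scheduler"]].append(float(row["submit_to_completion_time"]))
    return {name: (len(values), mean(values)) for name, values in times.items()}


summary = summarize(io.StringIO(SAMPLE))
for name, (count, average) in summary.items():
    print(f"{name}: n={count} mean_submit_to_completion={average:.2f}")
```

To run it on the full data, pass `open("img/scheduler-times.csv", newline="")` to `summarize` instead of the in-memory sample.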
