experiment runs are ready for this weekend.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Showing 15 changed files with 1,342 additions and 0 deletions.
@@ -0,0 +1,86 @@
# OSU Experiments

These are the official CRD experiments (we are ready!). We will generate data first,
save all metadata and logs, and then parse them afterward. Each YAML in [crd](crd)
represents a subset of experiments to run. Since networking
might be influenced by having more than one job on the cluster, we run them
one at a time.

*running this weekend*

## OSU Benchmarks

Note that the configuration we will be using (130 nodes) costs ~$12/hour (rounded up).

```bash
GOOGLE_PROJECT=myproject

# Add two nodes for jobset and metrics operator
gcloud container clusters create osu-cluster \
    --project=${GOOGLE_PROJECT} \
    --threads-per-core=1 \
    --placement-type=COMPACT \
    --num-nodes=130 \
    --machine-type=c2d-standard-2 \
    --enable-gvnic
```
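
The node count and the hourly figure above can be turned into a rough budget. This is a minimal sketch, not part of the experiment scripts: the function names are hypothetical, the 128 comes from the largest experiment size in the run list, and the number of hours you pass in is your own estimate.

```python
MAX_PODS = 128       # largest experiment size in the run list
EXTRA_NODES = 2      # two extra nodes for JobSet and the metrics operator
HOURLY_COST = 12.0   # ~$12/hour for the whole cluster (rounded up)


def cluster_nodes(max_pods=MAX_PODS, extra=EXTRA_NODES):
    """Nodes to request: one per pod at the largest size, plus the extras."""
    return max_pods + extra


def estimated_cost(hours, hourly_cost=HOURLY_COST):
    """Rough upper bound in dollars for keeping the cluster up for `hours`."""
    return hours * hourly_cost
```

For example, `cluster_nodes()` gives the 130 passed to `--num-nodes`, and `estimated_cost(hours)` scales the rounded-up hourly rate by however long you expect the cluster to stay up.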

Install JobSet:

```bash
VERSION=v0.2.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
```

Install the metrics operator. Here we keep the exact version and digest.

```bash
kubectl apply -f ./operator/metrics-operator.yaml
```

Save some metadata about the nodes:

```bash
kubectl get nodes -o json > nodes.json
```
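
Once `nodes.json` exists, a quick sanity check can confirm the cluster looks as expected before any runs start. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the nodes carry the standard `node.kubernetes.io/instance-type` label.

```python
import json
from collections import Counter


def summarize_nodes(doc):
    """Count nodes by instance type from `kubectl get nodes -o json` output."""
    items = doc.get("items", [])
    types = Counter(
        item.get("metadata", {})
        .get("labels", {})
        .get("node.kubernetes.io/instance-type", "unknown")
        for item in items
    )
    return {"count": len(items), "instance_types": dict(types)}


def load_nodes(path="nodes.json"):
    """Load the saved node metadata from disk."""
    with open(path) as fd:
        return json.load(fd)
```

Calling `summarize_nodes(load_nodes())` should report 130 nodes of type `c2d-standard-2` if the cluster was created with the command above.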

Install the Metrics Operator SDK. Version 0.0.19 added support for custom (raw) log parsing.

```bash
pip install metricsoperator==0.0.19
```

Now we can automate with the script. Note that each invocation targets one CRD file in the [crd](crd) directory, so you can
run each combination of size and iterations (both vary) separately.

```bash
# Run the experiments - pull to all 128 pods first
# Size 128: we don't attempt the 1x runs (too long)
python run-experiment.py --out ./results --input ./crd/metrics-20x-128.yaml --iter 20 --sleep 60

# Size 64: we split into 20x and 1x (the 1x group holds the longer runs)
python run-experiment.py --out ./results --input ./crd/metrics-20x-64.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-64.yaml --iter 1 --sleep 5

# Size 32 is the same...
python run-experiment.py --out ./results --input ./crd/metrics-20x-32.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-32.yaml --iter 1 --sleep 5

# Size 16 moves ibarrier into the 20x group
python run-experiment.py --out ./results --input ./crd/metrics-20x-16.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-16.yaml --iter 1 --sleep 5

# Size 8 is the same...
python run-experiment.py --out ./results --input ./crd/metrics-20x-8.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-8.yaml --iter 1 --sleep 5

# Size 4 runs everything at 20 iterations
python run-experiment.py --out ./results --input ./crd/metrics-20x-4.yaml --iter 20 --sleep 5
```
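
Since the commands above follow one pattern, they can also be generated from a small table. This is a sketch of one way to do that, not the project's own tooling: `RUNS` and `build_commands` are hypothetical names, and only the flags shown above (`--out`, `--input`, `--iter`, `--sleep`) are used.

```python
import shlex

# (pods, iterations, sleep seconds), transcribed from the commands above.
RUNS = [
    (128, 20, 60),
    (64, 20, 5), (64, 1, 5),
    (32, 20, 5), (32, 1, 5),
    (16, 20, 5), (16, 1, 5),
    (8, 20, 5), (8, 1, 5),
    (4, 20, 5),
]


def build_commands(runs=RUNS, out="./results", crd_dir="./crd"):
    """Return one argv list per run, in the order they should execute."""
    commands = []
    for pods, iters, sleep in runs:
        commands.append([
            "python", "run-experiment.py",
            "--out", out,
            "--input", f"{crd_dir}/metrics-{iters}x-{pods}.yaml",
            "--iter", str(iters),
            "--sleep", str(sleep),
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

To actually launch the runs one at a time, each list could be passed to `subprocess.run(cmd, check=True)` in sequence, which keeps only one job on the cluster at any moment.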

When you are done, clean up!

```bash
gcloud container clusters delete osu-cluster
```

Next time we will run the above, adjusted to add our custom runs!
@@ -0,0 +1,22 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 16
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
@@ -0,0 +1,23 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 32
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_ibarrier
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
@@ -0,0 +1,23 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 64
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_ibarrier
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
@@ -0,0 +1,22 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 8
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
google/kubecon/osu-benchmarks/run4/crd/metrics-20x-128.yaml (34 additions & 0 deletions)
@@ -0,0 +1,34 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 128
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
      options:
        # Wrap each one in time
        timed: "true"
google/kubecon/osu-benchmarks/run4/crd/metrics-20x-16.yaml (36 additions & 0 deletions)
@@ -0,0 +1,36 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 16
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_ibarrier
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
      options:
        # Wrap each one in time
        timed: "true"
google/kubecon/osu-benchmarks/run4/crd/metrics-20x-32.yaml (35 additions & 0 deletions)
@@ -0,0 +1,35 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 32
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
      options:
        # Wrap each one in time
        timed: "true"
@@ -0,0 +1,40 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 4
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_ibarrier
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
google/kubecon/osu-benchmarks/run4/crd/metrics-20x-64.yaml (35 additions & 0 deletions)
@@ -0,0 +1,35 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 64
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
      options:
        # Wrap each one in time
        timed: "true"
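
The manifests above split the OSU commands for each size between a 20-iteration group and a single-iteration group. A small check can confirm that, for a given size, the two groups do not overlap and together cover the full command list used at size 4. This is a sketch under assumptions: the command lists are transcribed by hand from the manifests in this diff, the set names are hypothetical, and treating the unnamed short manifests (pods 64, 32, 16) as the 1x groups is an inference from the README, not something the diff states.

```python
# Commands that appear in every 20x manifest shown (sizes 128, 64, 32, 16).
SHORT = {
    "osu_get_acc_latency", "osu_acc_latency", "osu_fop_latency",
    "osu_get_latency", "osu_put_latency", "osu_latency", "osu_bibw",
    "osu_bw", "osu_put_bw", "osu_latency_mp", "osu_put_bibw", "osu_init",
    "osu_get_bw", "osu_cas_latency", "osu_latency_mt", "osu_hello",
}
# Commands that appear in every short (assumed 1x) manifest.
LONG = {"osu_mbw_mr", "osu_multi_lat", "osu_allgather", "osu_allreduce"}
# Full list, as run at size 4.
ALL = SHORT | LONG | {"osu_ibarrier", "osu_barrier"}

# size -> (20x commands, assumed 1x commands)
GROUPS = {
    64: (SHORT | {"osu_barrier"}, LONG | {"osu_ibarrier"}),
    32: (SHORT | {"osu_barrier"}, LONG | {"osu_ibarrier"}),
    16: (SHORT | {"osu_barrier", "osu_ibarrier"}, LONG),
}


def is_partition(twenty_x, one_x, full=ALL):
    """True if the two groups are disjoint and together equal the full set."""
    return not (twenty_x & one_x) and (twenty_x | one_x) == full
```

Running `all(is_partition(*groups) for groups in GROUPS.values())` is a cheap way to catch a command that was dropped or listed twice when editing the YAML by hand.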