experiment runs are ready for this weekend.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Sep 15, 2023
1 parent adff039 commit d064507
Showing 15 changed files with 1,342 additions and 0 deletions.
1 change: 1 addition & 0 deletions google/kubecon/osu-benchmarks/README.md
@@ -7,3 +7,4 @@ We will be using the Metrics Operator.
- [run1](run1): Focus on OSU to get sense of timings for single runs and plan experiments.
- [run2](run2): Get times for 128 size cluster (but actually we need 130!)
- [run3](run3): A small test run to set up the automation bit.
- [run4](run4): Full automation for planned experiments up to size 128.
86 changes: 86 additions & 0 deletions google/kubecon/osu-benchmarks/run4/README.md
@@ -0,0 +1,86 @@
# OSU Experiments

These are the official CRD experiments (we are ready!). We will generate data first,
save all metadata and logs, and then parse them afterward. Each YAML in [crd](crd)
represents a subset of experiments to run. Since networking
might be influenced by having more than one job on the cluster, we run them
one at a time.

*running this weekend*

## OSU Benchmarks

Note that the configuration we will be using (130 nodes) costs ~$12/hour (rounded up).

```bash
GOOGLE_PROJECT=myproject

# Add two nodes for jobset and metrics operator
gcloud container clusters create osu-cluster \
--threads-per-core=1 \
--placement-type=COMPACT \
--num-nodes=130 \
--machine-type=c2d-standard-2 \
--enable-gvnic
```

Install JobSet

```bash
VERSION=v0.2.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
```

Install the metrics operator. Here we keep the exact version and digest.

```bash
kubectl apply -f ./operator/metrics-operator.yaml
```

Save some metadata about the nodes:

```bash
kubectl get nodes -o json > nodes.json
```
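
As a quick sanity check on the saved metadata, a minimal sketch like the following could count nodes per machine type. It assumes the standard `kubectl get nodes -o json` layout (a top-level `items` list) and the well-known `node.kubernetes.io/instance-type` label; the `summarize_nodes` helper is hypothetical and not part of this repository.

```python
import json
from collections import Counter


def summarize_nodes(nodes):
    """Count nodes per instance type from a `kubectl get nodes -o json` document."""
    counts = Counter()
    for item in nodes.get("items", []):
        labels = item.get("metadata", {}).get("labels", {})
        # Well-known Kubernetes label; fall back to "unknown" if it is missing
        counts[labels.get("node.kubernetes.io/instance-type", "unknown")] += 1
    return dict(counts)


def summarize_file(path="nodes.json"):
    """Load the saved node metadata and summarize it."""
    with open(path) as fd:
        return summarize_nodes(json.load(fd))
```

Calling `summarize_file()` next to the saved `nodes.json` should make it easy to confirm that the expected number of nodes came up before starting the runs.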

Install the Metrics Operator SDK. Version 0.0.19 adds support for custom (raw) log parsing.

```bash
pip install metricsoperator==0.0.19
```

Now we can automate with the script. Note that each run targets a single CRD file in [crd](crd), so you can
run each combination of size and number of iterations (which varies) separately.

```bash
# Run the test experiments - pull to all 128 pods first
# For size 128 we don't attempt the 1x runs (too long)
python run-experiment.py --out ./results --input ./crd/metrics-20x-128.yaml --iter 20 --sleep 60

# Size 64 we split into 20x and 1x for larger runs
python run-experiment.py --out ./results --input ./crd/metrics-20x-64.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-64.yaml --iter 1 --sleep 5

# Size 32 is the same...
python run-experiment.py --out ./results --input ./crd/metrics-20x-32.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-32.yaml --iter 1 --sleep 5

# Size 16 flips ibarrier into the 20x group
python run-experiment.py --out ./results --input ./crd/metrics-20x-16.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-16.yaml --iter 1 --sleep 5

# Size 8 is the same...
python run-experiment.py --out ./results --input ./crd/metrics-20x-8.yaml --iter 20 --sleep 5
python run-experiment.py --out ./results --input ./crd/metrics-1x-8.yaml --iter 1 --sleep 5

# Size 4 runs everything at 20 iterations
python run-experiment.py --out ./results --input ./crd/metrics-20x-4.yaml --iter 20 --sleep 5
```
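
The sequence above could also be driven from one script so that each experiment finishes before the next one starts. This is a minimal sketch, assuming the `metrics-<iters>x-<size>.yaml` naming and the `--out`, `--input`, `--iter`, and `--sleep` flags shown above; the `PLAN` table and the `build_commands` and `run_all` helpers are hypothetical and not part of this repository.

```python
import subprocess

# (size, iterations, sleep seconds), mirroring the commands listed above
PLAN = [
    (128, 20, 60),
    (64, 20, 5), (64, 1, 5),
    (32, 20, 5), (32, 1, 5),
    (16, 20, 5), (16, 1, 5),
    (8, 20, 5), (8, 1, 5),
    (4, 20, 5),
]


def build_commands(plan, out="./results", crd_dir="./crd"):
    """Return one run-experiment.py argv list per planned experiment."""
    commands = []
    for size, iters, sleep in plan:
        crd = f"{crd_dir}/metrics-{iters}x-{size}.yaml"
        commands.append([
            "python", "run-experiment.py",
            "--out", out, "--input", crd,
            "--iter", str(iters), "--sleep", str(sleep),
        ])
    return commands


def run_all(plan):
    """Run each experiment to completion before starting the next."""
    for argv in build_commands(plan):
        # check=True stops the sequence if one run fails
        subprocess.run(argv, check=True)
```

Running the experiments strictly in order keeps only one job on the cluster at a time, which matches the concern about networking interference noted earlier.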

When you are done, clean up!

```bash
gcloud container clusters delete osu-cluster
```

Next time we will run the above, adjusted to add our custom runs!
22 changes: 22 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-1x-16.yaml
@@ -0,0 +1,22 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 16
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
23 changes: 23 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-1x-32.yaml
@@ -0,0 +1,23 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 32
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_ibarrier
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
23 changes: 23 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-1x-64.yaml
@@ -0,0 +1,23 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 64
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_ibarrier
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
22 changes: 22 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-1x-8.yaml
@@ -0,0 +1,22 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 8
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
34 changes: 34 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-20x-128.yaml
@@ -0,0 +1,34 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 128
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
      options:
        # Wrap each one in time
        timed: "true"
36 changes: 36 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-20x-16.yaml
@@ -0,0 +1,36 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 16
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_ibarrier
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
      options:
        # Wrap each one in time
        timed: "true"
35 changes: 35 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-20x-32.yaml
@@ -0,0 +1,35 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 32
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
      options:
        # Wrap each one in time
        timed: "true"
40 changes: 40 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-20x-4.yaml
@@ -0,0 +1,40 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 4
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_ibarrier
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
          - osu_mbw_mr
          - osu_multi_lat
          - osu_allgather
          - osu_allreduce
      options:
        # Wrap each one in time
        timed: "true"
35 changes: 35 additions & 0 deletions google/kubecon/osu-benchmarks/run4/crd/metrics-20x-64.yaml
@@ -0,0 +1,35 @@
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  pods: 64
  metrics:
    - name: network-osu-benchmark
      # Custom list of commands to run
      # See https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#network-osu-benchmark
      listOptions:
        commands:
          - osu_get_acc_latency
          - osu_acc_latency
          - osu_fop_latency
          - osu_get_latency
          - osu_put_latency
          - osu_latency
          - osu_bibw
          - osu_bw
          - osu_put_bw
          - osu_latency_mp
          - osu_put_bibw
          - osu_init
          - osu_get_bw
          - osu_cas_latency
          - osu_latency_mt
          - osu_hello
          - osu_barrier
      options:
        # Wrap each one in time
        timed: "true"