expose more osu benchmarks (#36)
* expose more osu benchmarks

Also fixing a bug that resources were not added to metrics
Removing top level jobset resources - overhead is something else
ensuring python examples also cleanup with m.delete()
adding example generation of manifests for osu benchmarks
start of template for pod affinity / topology constraint (not enabled yet)

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Aug 11, 2023
1 parent 5d6dace commit 3baf5a6
Showing 42 changed files with 1,441 additions and 366 deletions.
3 changes: 1 addition & 2 deletions README.md
@@ -16,8 +16,7 @@ To learn more:
- make flux operator command generator
- For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful)
- For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application
- Python function to save entire spec to yaml (for MetricSet and JobSet)?
- Netmark / OSU need resources set to ensure 1 pod/node
- Look into pod affinity/anti-affinity vs. topology constraint (which do we want)?
- Add assertions checking for python tests
- Plotting examples needed for
- io-sysstat
2 changes: 2 additions & 0 deletions api/v1alpha1/metric_types.go
@@ -255,7 +255,9 @@ type Metric struct {

// Get pod labels for a metric set
func (m *MetricSet) GetPodLabels() map[string]string {

	podLabels := map[string]string{}
	// This is for autoscaling, although haven't used yet
	podLabels["cluster-name"] = m.Name
	// This is for the headless service
	podLabels["metricset-name"] = m.Name
54 changes: 21 additions & 33 deletions docs/getting_started/custom-resource-definition.md
@@ -93,7 +93,7 @@ An application is allowed to have one or more existing volumes. An existing volu

#### resources

-Resource lists for an application container go under [Overhead](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/). Known keys include "memory" and "cpu" (should be provided in some string format that can be parsed) and all others are considered some kind of quantity request.
+You can define resources for an application or a metric container. Known keys include "memory" and "cpu" (should be provided in some string format that can be parsed) and all others are considered some kind of quantity request.

```yaml
application:
@@ -112,6 +112,26 @@ metrics:
cpu: 4
```

If you wanted to, for example, request a GPU, that might look like:

```yaml
resources:
  limits:
    gpu-vendor.example/example-gpu: 1
```

Or for a particular type of networking fabric:

```yaml
resources:
  limits:
    vpc.amazonaws.com/efa: 1
```

Both limits and requests accept a string or an integer value, and you'll get an error if you
provide something else. If you need something else, [let us know](https://github.com/converged-computing/metrics-operator/issues).
If you are requesting a GPU, [this documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is helpful.
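
The string-or-integer rule described above can be illustrated with a small Python sketch. This is not the operator's own validation (which lives in the Go code); the function name `validate_resources` is hypothetical, and the sketch only assumes a `resources` mapping with optional `limits` and `requests` sections.

```python
def validate_resources(resources):
    """Check that every limits/requests value is a string or an integer.

    Hypothetical helper for illustration only; it mirrors the documented
    rule, not the operator's actual implementation.
    """
    for section in ("limits", "requests"):
        for key, value in resources.get(section, {}).items():
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (str, int)):
                raise TypeError(
                    f"{section}.{key} must be a string or an integer, "
                    f"got {type(value).__name__}"
                )
    return resources


# Values taken from the examples in this section
validate_resources({"limits": {"memory": "500M", "cpu": 4, "vpc.amazonaws.com/efa": 1}})
```

Checking a manifest like this before submitting it can surface a bad value (for example a float such as `1.5`) locally rather than as an error from the cluster.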

### storage

When you want to measure some storage performance, you'll want to add a "storage" section to your MetricSet. This will typically just be a reference to some existing storage (see [existing volumes](#existing-volumes)) that we want to measure, and can also be done for some number of completions and metrics for storage.
@@ -206,38 +226,6 @@ Presence of absence of an option type depends on the metric. Metrics are free to
options as they see fit.


-## resources
-
-Resources for an entire spec are given to the Pod template of the Job. They can include limits and requests. Known keys include "memory" and "cpu" (should be provided in some
-string format that can be parsed) and all others are considered some kind of quantity request.
-
-```yaml
-resources:
-  limits:
-    memory: 500M
-    cpu: 4
-```
-
-If you wanted to, for example, request a GPU, that might look like:
-
-```yaml
-resources:
-  limits:
-    gpu-vendor.example/example-gpu: 1
-```
-
-Or for a particulat type of networking fabric:
-
-```yaml
-resources:
-  limits:
-    vpc.amazonaws.com/efa: 1
-```
-
-Both limits and resources are flexible to accept a string or an integer value, and you'll get an error if you
-provide something else. If you need something else, [let us know](https://github.com/converged-computing/metrics-operator/issues).
-If you are requesting GPU, [this documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is helpful.
-
## Existing Volumes

An existing volume can be provided to support an application (multiple) or one can be provided for assessing its performance (single).
74 changes: 74 additions & 0 deletions docs/getting_started/metrics.md
@@ -118,6 +118,80 @@ Variables to customize include:
|-----|-------------|------------|------|---------|
| commands | Custom list of osu-benchmark one-sided commands to run | listOptions->commands | array | unset uses default set |

By default, we run a subset of commands:

- osu_get_acc_latency
- osu_acc_latency
- osu_fop_latency
- osu_get_latency
- osu_put_latency
- osu_allreduce
- osu_latency
- osu_bibw
- osu_bw

However, all of the following are available for MPI:

<details>

<summary>Commands available for OSU Benchmarks</summary>

```console
.
|-- collective
|   |-- osu_allgather
|   |-- osu_allgatherv
|   |-- osu_allreduce
|   |-- osu_alltoall
|   |-- osu_alltoallv
|   |-- osu_barrier
|   |-- osu_bcast
|   |-- osu_gather
|   |-- osu_gatherv
|   |-- osu_iallgather
|   |-- osu_iallgatherv
|   |-- osu_iallreduce
|   |-- osu_ialltoall
|   |-- osu_ialltoallv
|   |-- osu_ialltoallw
|   |-- osu_ibarrier
|   |-- osu_ibcast
|   |-- osu_igather
|   |-- osu_igatherv
|   |-- osu_ireduce
|   |-- osu_iscatter
|   |-- osu_iscatterv
|   |-- osu_reduce
|   |-- osu_reduce_scatter
|   |-- osu_scatter
|   `-- osu_scatterv
|-- one-sided
|   |-- osu_acc_latency
|   |-- osu_cas_latency
|   |-- osu_fop_latency
|   |-- osu_get_acc_latency
|   |-- osu_get_bw
|   |-- osu_get_latency
|   |-- osu_put_bibw
|   |-- osu_put_bw
|   `-- osu_put_latency
|-- pt2pt
|   |-- osu_bibw
|   |-- osu_bw
|   |-- osu_latency
|   |-- osu_latency_mp
|   |-- osu_latency_mt
|   |-- osu_mbw_mr
|   `-- osu_multi_lat
`-- startup
    |-- osu_hello
    `-- osu_init
```

</details>

Note that not all of these have been tested on our setups, so
if you have any questions please [let us know](https://github.com/converged-computing/metrics-operator/issues).
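
As a rough illustration of choosing a custom command list, the Python sketch below checks requested names against the tree above and builds the `listOptions -> commands` structure named in the options table. The helper names (`category_of`, `osu_commands_option`) are hypothetical, and the exact placement of `listOptions` inside a MetricSet manifest is an assumption based on that table, not something this sketch verifies.

```python
# Command names copied from the tree listing above, grouped by directory.
AVAILABLE = {
    "collective": (
        "osu_allgather osu_allgatherv osu_allreduce osu_alltoall osu_alltoallv "
        "osu_barrier osu_bcast osu_gather osu_gatherv osu_iallgather "
        "osu_iallgatherv osu_iallreduce osu_ialltoall osu_ialltoallv "
        "osu_ialltoallw osu_ibarrier osu_ibcast osu_igather osu_igatherv "
        "osu_ireduce osu_iscatter osu_iscatterv osu_reduce osu_reduce_scatter "
        "osu_scatter osu_scatterv"
    ).split(),
    "one-sided": (
        "osu_acc_latency osu_cas_latency osu_fop_latency osu_get_acc_latency "
        "osu_get_bw osu_get_latency osu_put_bibw osu_put_bw osu_put_latency"
    ).split(),
    "pt2pt": (
        "osu_bibw osu_bw osu_latency osu_latency_mp osu_latency_mt "
        "osu_mbw_mr osu_multi_lat"
    ).split(),
    "startup": ["osu_hello", "osu_init"],
}

# The default subset listed above
DEFAULT_COMMANDS = [
    "osu_get_acc_latency", "osu_acc_latency", "osu_fop_latency",
    "osu_get_latency", "osu_put_latency", "osu_allreduce",
    "osu_latency", "osu_bibw", "osu_bw",
]


def category_of(command):
    """Return the directory a command lives in, or raise ValueError."""
    for category, names in AVAILABLE.items():
        if command in names:
            return category
    raise ValueError(f"unknown OSU benchmark command: {command}")


def osu_commands_option(commands):
    """Build a listOptions mapping (assumed layout) after validating names."""
    for command in commands:
        category_of(command)
    return {"listOptions": {"commands": list(commands)}}


print(osu_commands_option(["osu_hello", "osu_barrier"]))
```

Validating names up front is a convenience only; as noted above, a command being listed does not mean it has been tested on a given setup.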

#### app-lammps

3 changes: 3 additions & 0 deletions examples/python/io-fio/run-metric.py
@@ -51,6 +51,9 @@ def main():
    print(json.dumps(output, indent=4))
    utils.write_json(output, args.out)

    # Ensure we cleanup!
    m.delete()


if __name__ == "__main__":
    main()
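
The cleanup call added above runs only if everything before it succeeds. One possible variation, sketched here under stated assumptions, wraps the work in `try`/`finally` so that `delete()` also runs when parsing or writing raises. `StandInMetricSet` is a hypothetical stand-in, not the real client class; the only thing assumed about the real object is that it has a `delete()` method, as the diff shows.

```python
from contextlib import contextmanager


class StandInMetricSet:
    """Hypothetical stand-in for the object named `m` in the examples."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


@contextmanager
def cleanup_after(m):
    """Yield m and always call m.delete() afterwards, even on error."""
    try:
        yield m
    finally:
        m.delete()


m = StandInMetricSet()
try:
    with cleanup_after(m):
        raise RuntimeError("simulated failure while collecting output")
except RuntimeError:
    pass

print(m.deleted)  # True: delete() ran despite the exception
```

Whether this is worth the extra indirection depends on how often the example scripts fail midway; for short demos the explicit trailing `m.delete()` may be enough.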
3 changes: 3 additions & 0 deletions examples/python/io-host-volume/run-metric.py
@@ -53,5 +53,8 @@ def main():
    print(json.dumps(output, indent=4))
    utils.write_json(output, args.out)

    # Ensure we cleanup!
    m.delete()

if __name__ == "__main__":
    main()
3 changes: 3 additions & 0 deletions examples/python/network-netmark/run-metric.py
@@ -58,6 +58,9 @@ def main():
    utils.write_json(output, args.out)
    plot_results(output)

    # Ensure we cleanup!
    m.delete()

def plot_results(output):
    """
    Plot results to a histogram and matrix heatmap
6 changes: 5 additions & 1 deletion examples/python/network-osu-benchmark/README.md
@@ -15,4 +15,8 @@ and then wait for the pod to complete and parse the output in the log.
![img/OSU-MPI_Accumulate-latency-Test-v5.8.png](img/OSU-MPI_Accumulate-latency-Test-v5.8.png)
![img/OSU-MPI_Get_accumulate-latency-Test-v5.8.png](img/OSU-MPI_Get_accumulate-latency-Test-v5.8.png)
![img/OSU-MPI_Get-latency-Test-v5.8.png](img/OSU-MPI_Get-latency-Test-v5.8.png)
-![img/OSU-MPI_Put-Latency-Test-v5.8.png](img/OSU-MPI_Put-Latency-Test-v5.8.png)
+![img/OSU-MPI_Put-Latency-Test-v5.8.png](img/OSU-MPI_Put-Latency-Test-v5.8.png)
+![img/OSU-MPI-Allreduce-Latency-Test-v5.8.png](img/OSU-MPI-Allreduce-Latency-Test-v5.8.png)
+![img/OSU-MPI-Bandwidth-Test-v5.8.png](img/OSU-MPI-Bandwidth-Test-v5.8.png)
+![img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.png](img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.png)
+![img/OSU-MPI-Latency-Test-v5.8.png](img/OSU-MPI-Latency-Test-v5.8.png)
@@ -0,0 +1,20 @@
,Size,Avg Latency(us)
0,4.0,0.7
1,8.0,0.68
2,16.0,0.69
3,32.0,0.69
4,64.0,0.73
5,128.0,0.69
6,256.0,0.72
7,512.0,1.17
8,1024.0,1.25
9,2048.0,1.48
10,4096.0,3.94
11,8192.0,4.58
12,16384.0,6.33
13,32768.0,10.02
14,65536.0,16.23
15,131072.0,30.2
16,262144.0,51.89
17,524288.0,95.58
18,1048576.0,192.54
@@ -0,0 +1,24 @@
,Size,Bandwidth (MB/s)
0,1.0,3.59
1,2.0,7.36
2,4.0,13.96
3,8.0,29.1
4,16.0,59.55
5,32.0,115.48
6,64.0,186.9
7,128.0,354.48
8,256.0,707.87
9,512.0,1550.94
10,1024.0,3069.66
11,2048.0,5302.94
12,4096.0,3824.76
13,8192.0,7266.95
14,16384.0,10702.79
15,32768.0,12976.45
16,65536.0,14435.93
17,131072.0,15825.85
18,262144.0,18433.81
19,524288.0,19042.76
20,1048576.0,16864.11
21,2097152.0,17910.99
22,4194304.0,5601.69
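
Result files of this shape (an unnamed index column, `Size`, then one measurement column) can be read with Python's standard `csv` module. The sketch below is only an illustration: it embeds the last six rows of the table above rather than reading a file, and the helper name `load_rows` is hypothetical.

```python
import csv
import io

# Last six rows of the bandwidth table above, embedded for a self-contained demo
SAMPLE = """\
,Size,Bandwidth (MB/s)
17,131072.0,15825.85
18,262144.0,18433.81
19,524288.0,19042.76
20,1048576.0,16864.11
21,2097152.0,17910.99
22,4194304.0,5601.69
"""


def load_rows(text):
    """Return (size_in_bytes, value) pairs from CSV text of this shape."""
    reader = csv.DictReader(io.StringIO(text))
    # The measurement column is the last one; its name varies per benchmark
    value_column = reader.fieldnames[-1]
    return [
        (int(float(row["Size"])), float(row[value_column]))
        for row in reader
    ]


rows = load_rows(SAMPLE)
peak_size, peak_value = max(rows, key=lambda pair: pair[1])
print(peak_size, peak_value)  # 524288 19042.76
```

Picking the last column by position keeps the same helper usable for the latency tables, whose measurement header differs.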
@@ -0,0 +1,24 @@
,Size,Bandwidth (MB/s)
0,1.0,2.52
1,2.0,6.75
2,4.0,16.03
3,8.0,35.92
4,16.0,95.98
5,32.0,215.15
6,64.0,300.67
7,128.0,526.46
8,256.0,1129.38
9,512.0,2299.61
10,1024.0,4375.96
11,2048.0,7687.79
12,4096.0,6381.66
13,8192.0,10914.41
14,16384.0,13150.18
15,32768.0,21241.68
16,65536.0,26159.63
17,131072.0,30981.19
18,262144.0,25362.5
19,524288.0,23341.25
20,1048576.0,18255.77
21,2097152.0,7265.99
22,4194304.0,5972.25
@@ -0,0 +1,25 @@
,Size,Latency (us)
0,0.0,0.45
1,1.0,0.41
2,2.0,0.31
3,4.0,0.27
4,8.0,0.23
5,16.0,0.23
6,32.0,0.22
7,64.0,0.23
8,128.0,0.28
9,256.0,0.28
10,512.0,0.38
11,1024.0,0.44
12,2048.0,0.62
13,4096.0,1.32
14,8192.0,1.49
15,16384.0,1.98
16,32768.0,2.67
17,65536.0,4.1
18,131072.0,7.04
19,262144.0,16.6
20,524288.0,28.72
21,1048576.0,60.14
22,2097152.0,191.67
23,4194304.0,656.73
@@ -1,24 +1,24 @@
 ,Size,Latency (us)
-0,1.0,0.56
-1,2.0,0.41
-2,4.0,0.36
-3,8.0,0.3
+0,1.0,0.5
+1,2.0,0.38
+2,4.0,0.33
+3,8.0,0.28
 4,16.0,0.27
-5,32.0,0.25
-6,64.0,0.26
-7,128.0,0.29
-8,256.0,0.39
-9,512.0,0.57
-10,1024.0,0.78
-11,2048.0,1.49
-12,4096.0,2.36
-13,8192.0,4.68
-14,16384.0,9.47
-15,32768.0,18.38
-16,65536.0,35.33
-17,131072.0,68.67
-18,262144.0,138.39
-19,524288.0,271.58
-20,1048576.0,542.47
-21,2097152.0,1085.58
-22,4194304.0,2288.26
+5,32.0,0.44
+6,64.0,0.31
+7,128.0,0.27
+8,256.0,0.33
+9,512.0,0.43
+10,1024.0,0.66
+11,2048.0,1.17
+12,4096.0,2.52
+13,8192.0,4.45
+14,16384.0,8.31
+15,32768.0,17.97
+16,65536.0,39.96
+17,131072.0,67.01
+18,262144.0,138.98
+19,524288.0,259.99
+20,1048576.0,524.74
+21,2097152.0,1094.15
+22,4194304.0,2229.0
@@ -1,2 +1,2 @@
,Size,Latency (us)
-0,8.0,0.5
+0,8.0,0.29