expose more osu benchmarks (#36)

* expose more osu benchmarks

Also:

- fixing a bug that resources were not added to metrics
- removing top level jobset resources (overhead is something else)
- ensuring python examples also cleanup with m.delete()
- adding example generation of manifests for osu benchmarks
- start of template for pod affinity / topology constraint (not enabled yet)

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
converged-computing · Aug 11, 2023 · 3baf5a6
1 parent 5d6dace
commit 3baf5a6

Showing 42 changed files with 1,441 additions and 366 deletions.
diff --git a/README.md b/README.md
@@ -16,8 +16,7 @@ To learn more:
 - make flux operator command generator
 - For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful)
 - For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application
-- Python function to save entire spec to yaml (for MetricSet and JobSet)?
-- Netmark / OSU need resources set to ensure 1 pod/node
+- Look into pod affinity/anti-affinity vs. topology constraint (which do we want)?
 - Add assertions checking for python tests
 - Plotting examples needed for
   - io-sysstat

diff --git a/api/v1alpha1/metric_types.go b/api/v1alpha1/metric_types.go
@@ -255,7 +255,9 @@ type Metric struct {
 
 // Get pod labels for a metric set
 func (m *MetricSet) GetPodLabels() map[string]string {
+
 	podLabels := map[string]string{}
+	// This is for autoscaling, although it is not used yet
 	podLabels["cluster-name"] = m.Name
 	// This is for the headless service
 	podLabels["metricset-name"] = m.Name

diff --git a/docs/getting_started/custom-resource-definition.md b/docs/getting_started/custom-resource-definition.md
@@ -93,7 +93,7 @@ An application is allowed to have one or more existing volumes. An existing volu
 
 #### resources
 
-Resource lists for an application container go under [Overhead](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/). Known keys include "memory" and "cpu" (should be provided in some string format that can be parsed) and all others are considered some kind of quantity request.
+You can define resources for an application or a metric container. Known keys include "memory" and "cpu" (should be provided in some string format that can be parsed) and all others are considered some kind of quantity request.
 
 ```yaml
 application:
@@ -112,6 +112,26 @@ metrics:
       cpu: 4
 ```
 
+If you wanted to, for example, request a GPU, that might look like:
+
+```yaml
+resources:
+  limits:
+    gpu-vendor.example/example-gpu: 1
+```
+
+Or for a particular type of networking fabric:
+
+```yaml
+resources:
+  limits:
+    vpc.amazonaws.com/efa: 1
+```
+
+Both limits and requests accept either a string or an integer value, and you'll get an error if you
+provide something else. If you need another type, [let us know](https://github.com/converged-computing/metrics-operator/issues).
+If you are requesting GPU, [this documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is helpful.
+
 ### storage
 
 When you want to measure some storage performance, you'll want to add a "storage" section to your MetricSet. This will typically just be a reference to some existing storage (see [existing volumes](#existing-volumes)) that we want to measure, and can also be done for some number of completions and metrics for storage.
@@ -206,38 +226,6 @@ Presence of absence of an option type depends on the metric. Metrics are free to
 options as they see fit.
 
 
-## resources
-
-Resources for an entire spec are given to the Pod template of the Job. They can include limits and requests. Known keys include "memory" and "cpu" (should be provided in some
-string format that can be parsed) and all others are considered some kind of quantity request.
-
-```yaml
-resources:
-  limits:
-    memory: 500M
-    cpu: 4
-```
-
-If you wanted to, for example, request a GPU, that might look like:
-
-```yaml
-resources:
-  limits:
-    gpu-vendor.example/example-gpu: 1
-```
-
-Or for a particulat type of networking fabric:
-
-```yaml
-resources:
-  limits:
-    vpc.amazonaws.com/efa: 1
-```
-
-Both limits and resources are flexible to accept a string or an integer value, and you'll get an error if you
-provide something else. If you need something else, [let us know](https://github.com/converged-computing/metrics-operator/issues).
-If you are requesting GPU, [this documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is helpful.
-
 ## Existing Volumes
 
 An existing volume can be provided to support an application (multiple) or one can be provided for assessing its performance (single).
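The resources text added in this file says that limit and request values may be a string or an integer, and that anything else is an error. A minimal Python sketch of that rule follows. `validate_resources` is a hypothetical helper written only for illustration; it is not the operator's own validation, and the key names come from the YAML examples in this diff.

```python
def validate_resources(resources):
    """Return a list of problems found in a resources mapping.

    Hypothetical helper: it mirrors the documented rule that each value
    under "limits" or "requests" must be a string or an integer.
    """
    problems = []
    for section in ("limits", "requests"):
        for key, value in resources.get(section, {}).items():
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (str, int)):
                problems.append(
                    f"{section}.{key}: expected str or int, got {type(value).__name__}"
                )
    return problems


# Values taken from the documentation examples in this diff
ok = {"limits": {"memory": "500M", "cpu": 4, "vpc.amazonaws.com/efa": 1}}
bad = {"limits": {"cpu": 4.5}}

print(validate_resources(ok))
print(validate_resources(bad))
```

The first call reports no problems, and the second reports the float value under `limits.cpu`.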

diff --git a/docs/getting_started/metrics.md b/docs/getting_started/metrics.md
@@ -118,6 +118,80 @@ Variables to customize include:
 |-----|-------------|------------|------|---------|
 | commands | Custom list of osu-benchmark one-sided commands to run | listOptions->commands | array | unset uses default set |
 
+By default, we run a subset of commands:
+
+- osu_get_acc_latency
+- osu_acc_latency
+- osu_fop_latency
+- osu_get_latency
+- osu_put_latency
+- osu_allreduce
+- osu_latency
+- osu_bibw
+- osu_bw
+
+However, all of the following are available for MPI:
+
+<details>
+
+<summary>Commands available for OSU Benchmarks</summary>
+
+```console
+.
+|-- collective
+|   |-- osu_allgather
+|   |-- osu_allgatherv
+|   |-- osu_allreduce
+|   |-- osu_alltoall
+|   |-- osu_alltoallv
+|   |-- osu_barrier
+|   |-- osu_bcast
+|   |-- osu_gather
+|   |-- osu_gatherv
+|   |-- osu_iallgather
+|   |-- osu_iallgatherv
+|   |-- osu_iallreduce
+|   |-- osu_ialltoall
+|   |-- osu_ialltoallv
+|   |-- osu_ialltoallw
+|   |-- osu_ibarrier
+|   |-- osu_ibcast
+|   |-- osu_igather
+|   |-- osu_igatherv
+|   |-- osu_ireduce
+|   |-- osu_iscatter
+|   |-- osu_iscatterv
+|   |-- osu_reduce
+|   |-- osu_reduce_scatter
+|   |-- osu_scatter
+|   `-- osu_scatterv
+|-- one-sided
+|   |-- osu_acc_latency
+|   |-- osu_cas_latency
+|   |-- osu_fop_latency
+|   |-- osu_get_acc_latency
+|   |-- osu_get_bw
+|   |-- osu_get_latency
+|   |-- osu_put_bibw
+|   |-- osu_put_bw
+|   `-- osu_put_latency
+|-- pt2pt
+|   |-- osu_bibw
+|   |-- osu_bw
+|   |-- osu_latency
+|   |-- osu_latency_mp
+|   |-- osu_latency_mt
+|   |-- osu_mbw_mr
+|   `-- osu_multi_lat
+`-- startup
+    |-- osu_hello
+    `-- osu_init
+```
+
+</details>
+
+Note that not all of these have been tested on our setups, so
+if you have any questions please [let us know](https://github.com/converged-computing/metrics-operator/issues).
 
 #### app-lammps
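The `commands` option documented in this hunk takes an array under `listOptions`. As a rough sketch, the Python below checks a custom selection against the listing added to the docs and builds the corresponding metric entry as a plain dict. The function `osu_metric` is hypothetical, the metric name `network-osu-benchmark` is assumed from the example directory name, and the diff does not show whether the operator expects bare benchmark names or full command lines.

```python
# Benchmark names copied from the "Commands available for OSU Benchmarks" listing
AVAILABLE = {
    "collective": """
        osu_allgather osu_allgatherv osu_allreduce osu_alltoall osu_alltoallv
        osu_barrier osu_bcast osu_gather osu_gatherv osu_iallgather
        osu_iallgatherv osu_iallreduce osu_ialltoall osu_ialltoallv
        osu_ialltoallw osu_ibarrier osu_ibcast osu_igather osu_igatherv
        osu_ireduce osu_iscatter osu_iscatterv osu_reduce osu_reduce_scatter
        osu_scatter osu_scatterv
    """.split(),
    "one-sided": """
        osu_acc_latency osu_cas_latency osu_fop_latency osu_get_acc_latency
        osu_get_bw osu_get_latency osu_put_bibw osu_put_bw osu_put_latency
    """.split(),
    "pt2pt": """
        osu_bibw osu_bw osu_latency osu_latency_mp osu_latency_mt
        osu_mbw_mr osu_multi_lat
    """.split(),
    "startup": ["osu_hello", "osu_init"],
}

# The default subset listed in the docs
DEFAULTS = [
    "osu_get_acc_latency", "osu_acc_latency", "osu_fop_latency",
    "osu_get_latency", "osu_put_latency", "osu_allreduce",
    "osu_latency", "osu_bibw", "osu_bw",
]


def osu_metric(commands):
    """Build a metric entry (hypothetical helper), rejecting unlisted names."""
    known = {name for names in AVAILABLE.values() for name in names}
    unknown = [c for c in commands if c not in known]
    if unknown:
        raise ValueError(f"unknown OSU benchmark(s): {', '.join(unknown)}")
    # Assumed layout: the table gives the option path as listOptions->commands
    return {"name": "network-osu-benchmark", "listOptions": {"commands": list(commands)}}


print(osu_metric(["osu_barrier", "osu_hello"]))
```

Catching a misspelled benchmark name before the manifest is applied avoids waiting on a pod that would fail at launch.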
 

diff --git a/examples/python/io-fio/run-metric.py b/examples/python/io-fio/run-metric.py
@@ -51,6 +51,9 @@ def main():
         print(json.dumps(output, indent=4))
         utils.write_json(output, args.out)
 
+    # Ensure we cleanup!
+    m.delete()
+
 
 if __name__ == "__main__":
     main()
diff --git a/examples/python/io-host-volume/run-metric.py b/examples/python/io-host-volume/run-metric.py
@@ -53,5 +53,8 @@ def main():
         print(json.dumps(output, indent=4))
         utils.write_json(output, args.out)
 
+    # Ensure we cleanup!
+    m.delete()
+
 if __name__ == "__main__":
     main()
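These example scripts now call `m.delete()` at the end of `main()`. If an earlier step raises, that line is never reached, so one possible hardening is to wrap the work in `try`/`finally`. The sketch below shows the pattern with a stand-in class; `FakeMetricSet` and its `collect()` method are hypothetical, and only `delete()` appears in the examples themselves.

```python
class FakeMetricSet:
    """Hypothetical stand-in for the object the examples call `m`."""

    def __init__(self):
        self.deleted = False

    def collect(self):
        # Simulate a failure partway through a run
        raise RuntimeError("simulated failure while collecting output")

    def delete(self):
        self.deleted = True


def run(m):
    try:
        return m.collect()
    finally:
        # Runs whether collect() returned normally or raised
        m.delete()


m = FakeMetricSet()
try:
    run(m)
except RuntimeError as err:
    print(f"collection failed: {err}")

print(m.deleted)
```

Here the cleanup still happens even though collection failed, so no resources are left behind in the cluster.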
diff --git a/examples/python/network-netmark/run-metric.py b/examples/python/network-netmark/run-metric.py
@@ -58,6 +58,9 @@ def main():
         utils.write_json(output, args.out)
         plot_results(output)
 
+    # Ensure we cleanup!
+    m.delete()
+
 def plot_results(output):
     """
     Plot results to a histogram and matrix heatmap

diff --git a/examples/python/network-osu-benchmark/README.md b/examples/python/network-osu-benchmark/README.md
@@ -15,4 +15,8 @@ and then wait for the pod to complete and parse the output in the log.
 ![img/OSU-MPI_Accumulate-latency-Test-v5.8.png](img/OSU-MPI_Accumulate-latency-Test-v5.8.png)
 ![img/OSU-MPI_Get_accumulate-latency-Test-v5.8.png](img/OSU-MPI_Get_accumulate-latency-Test-v5.8.png)
 ![img/OSU-MPI_Get-latency-Test-v5.8.png](img/OSU-MPI_Get-latency-Test-v5.8.png)
-![img/OSU-MPI_Put-Latency-Test-v5.8.png](img/OSU-MPI_Put-Latency-Test-v5.8.png)
+![img/OSU-MPI_Put-Latency-Test-v5.8.png](img/OSU-MPI_Put-Latency-Test-v5.8.png)
+![img/OSU-MPI-Allreduce-Latency-Test-v5.8.png](img/OSU-MPI-Allreduce-Latency-Test-v5.8.png)
+![img/OSU-MPI-Bandwidth-Test-v5.8.png](img/OSU-MPI-Bandwidth-Test-v5.8.png)
+![img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.png](img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.png)
+![img/OSU-MPI-Latency-Test-v5.8.png](img/OSU-MPI-Latency-Test-v5.8.png)
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI-Allreduce-Latency-Test-v5.8.csv b/examples/python/network-osu-benchmark/img/OSU-MPI-Allreduce-Latency-Test-v5.8.csv
@@ -0,0 +1,20 @@
+,Size,Avg Latency(us)
+0,4.0,0.7
+1,8.0,0.68
+2,16.0,0.69
+3,32.0,0.69
+4,64.0,0.73
+5,128.0,0.69
+6,256.0,0.72
+7,512.0,1.17
+8,1024.0,1.25
+9,2048.0,1.48
+10,4096.0,3.94
+11,8192.0,4.58
+12,16384.0,6.33
+13,32768.0,10.02
+14,65536.0,16.23
+15,131072.0,30.2
+16,262144.0,51.89
+17,524288.0,95.58
+18,1048576.0,192.54
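The CSV files added alongside the plots share one layout: an unnamed index column, a `Size` column, and one measurement column. A minimal sketch for reading that layout with the standard library is below. `read_results` is a hypothetical helper, the sample rows are the first three from the Allreduce file above (without the diff's `+` prefix), and the diff does not show how the files were written.

```python
import csv
import io

# First rows of OSU-MPI-Allreduce-Latency-Test-v5.8.csv as shown in this diff
SAMPLE = """,Size,Avg Latency(us)
0,4.0,0.7
1,8.0,0.68
2,16.0,0.69
"""


def read_results(handle):
    """Return (size column name, value column name, [(size, value), ...])."""
    reader = csv.reader(handle)
    header = next(reader)
    # header[0] is empty: it is the unnamed index column
    size_name, value_name = header[1], header[2]
    rows = [(float(size), float(value)) for _, size, value in reader]
    return size_name, value_name, rows


size_name, value_name, rows = read_results(io.StringIO(SAMPLE))
print(value_name)
print(rows)
```

The same function should work for the bandwidth and latency files, since only the name of the third column differs.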
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI-Allreduce-Latency-Test-v5.8.png b/examples/python/network-osu-benchmark/img/OSU-MPI-Allreduce-Latency-Test-v5.8.png
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI-Bandwidth-Test-v5.8.csv b/examples/python/network-osu-benchmark/img/OSU-MPI-Bandwidth-Test-v5.8.csv
@@ -0,0 +1,24 @@
+,Size,Bandwidth (MB/s)
+0,1.0,3.59
+1,2.0,7.36
+2,4.0,13.96
+3,8.0,29.1
+4,16.0,59.55
+5,32.0,115.48
+6,64.0,186.9
+7,128.0,354.48
+8,256.0,707.87
+9,512.0,1550.94
+10,1024.0,3069.66
+11,2048.0,5302.94
+12,4096.0,3824.76
+13,8192.0,7266.95
+14,16384.0,10702.79
+15,32768.0,12976.45
+16,65536.0,14435.93
+17,131072.0,15825.85
+18,262144.0,18433.81
+19,524288.0,19042.76
+20,1048576.0,16864.11
+21,2097152.0,17910.99
+22,4194304.0,5601.69
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI-Bandwidth-Test-v5.8.png b/examples/python/network-osu-benchmark/img/OSU-MPI-Bandwidth-Test-v5.8.png
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.csv b/examples/python/network-osu-benchmark/img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.csv
@@ -0,0 +1,24 @@
+,Size,Bandwidth (MB/s)
+0,1.0,2.52
+1,2.0,6.75
+2,4.0,16.03
+3,8.0,35.92
+4,16.0,95.98
+5,32.0,215.15
+6,64.0,300.67
+7,128.0,526.46
+8,256.0,1129.38
+9,512.0,2299.61
+10,1024.0,4375.96
+11,2048.0,7687.79
+12,4096.0,6381.66
+13,8192.0,10914.41
+14,16384.0,13150.18
+15,32768.0,21241.68
+16,65536.0,26159.63
+17,131072.0,30981.19
+18,262144.0,25362.5
+19,524288.0,23341.25
+20,1048576.0,18255.77
+21,2097152.0,7265.99
+22,4194304.0,5972.25
diff --git a/...python/network-osu-benchmark/img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.png b/...python/network-osu-benchmark/img/OSU-MPI-Bi-Directional-Bandwidth-Test-v5.8.png
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI-Latency-Test-v5.8.csv b/examples/python/network-osu-benchmark/img/OSU-MPI-Latency-Test-v5.8.csv
@@ -0,0 +1,25 @@
+,Size,Latency (us)
+0,0.0,0.45
+1,1.0,0.41
+2,2.0,0.31
+3,4.0,0.27
+4,8.0,0.23
+5,16.0,0.23
+6,32.0,0.22
+7,64.0,0.23
+8,128.0,0.28
+9,256.0,0.28
+10,512.0,0.38
+11,1024.0,0.44
+12,2048.0,0.62
+13,4096.0,1.32
+14,8192.0,1.49
+15,16384.0,1.98
+16,32768.0,2.67
+17,65536.0,4.1
+18,131072.0,7.04
+19,262144.0,16.6
+20,524288.0,28.72
+21,1048576.0,60.14
+22,2097152.0,191.67
+23,4194304.0,656.73
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI-Latency-Test-v5.8.png b/examples/python/network-osu-benchmark/img/OSU-MPI-Latency-Test-v5.8.png
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI_Accumulate-latency-Test-v5.8.csv b/examples/python/network-osu-benchmark/img/OSU-MPI_Accumulate-latency-Test-v5.8.csv
@@ -1,24 +1,24 @@
 ,Size,Latency (us)
-0,1.0,0.56
-1,2.0,0.41
-2,4.0,0.36
-3,8.0,0.3
+0,1.0,0.5
+1,2.0,0.38
+2,4.0,0.33
+3,8.0,0.28
 4,16.0,0.27
-5,32.0,0.25
-6,64.0,0.26
-7,128.0,0.29
-8,256.0,0.39
-9,512.0,0.57
-10,1024.0,0.78
-11,2048.0,1.49
-12,4096.0,2.36
-13,8192.0,4.68
-14,16384.0,9.47
-15,32768.0,18.38
-16,65536.0,35.33
-17,131072.0,68.67
-18,262144.0,138.39
-19,524288.0,271.58
-20,1048576.0,542.47
-21,2097152.0,1085.58
-22,4194304.0,2288.26
+5,32.0,0.44
+6,64.0,0.31
+7,128.0,0.27
+8,256.0,0.33
+9,512.0,0.43
+10,1024.0,0.66
+11,2048.0,1.17
+12,4096.0,2.52
+13,8192.0,4.45
+14,16384.0,8.31
+15,32768.0,17.97
+16,65536.0,39.96
+17,131072.0,67.01
+18,262144.0,138.98
+19,524288.0,259.99
+20,1048576.0,524.74
+21,2097152.0,1094.15
+22,4194304.0,2229.0
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI_Accumulate-latency-Test-v5.8.png b/examples/python/network-osu-benchmark/img/OSU-MPI_Accumulate-latency-Test-v5.8.png
diff --git a/examples/python/network-osu-benchmark/img/OSU-MPI_Fetch_and_op-latency-Test-v5.8.csv b/examples/python/network-osu-benchmark/img/OSU-MPI_Fetch_and_op-latency-Test-v5.8.csv
@@ -1,2 +1,2 @@
 ,Size,Latency (us)
-0,8.0,0.5
+0,8.0,0.29