
Benchmarking CUDA streams and graphs


Summary

We want to implement a benchmark that studies the performance of co-executing kernels using either CUDA streams or CUDA graphs. The benchmark should categorize (when and why) and quantify (by how much) co-execution incurs overhead or yields a performance benefit.

We will measure the throughput (the total execution time to complete the benchmark) across runs in which the following parameters are varied:

  • Number of benchmark iterations (epochs) (E)
  • Number of CUDA streams / fan-out of the CUDA graph (S)
  • Number of kernels launched per stream / per branch of the graph's fork (K)
  • Total problem size (A)
  • Block size (B)
  • Multithreaded launching (boolean) (M)

The total amount of work is W = E * A. The ideal speedup is then P = S.

K should not affect the results if there is no measurable overhead in breaking one kernel call into several.

What to implement:

The benchmark receives E, S, A, K, B and M as command-line parameters and outputs the throughput.
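
For example, a hypothetical invocation (the executable name and flag spelling are illustrative assumptions, not fixed by this page) might be:

```sh
./co_execution_bench -E 100 -S 4 -K 8 -A 1048576 -B 128 -M 0
```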

Kernels (sketched below):

  • empty
  • axpy
  • newton (with a fixed number of iterations)
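
A minimal sketch of what these three kernels might look like (the names, signatures, and the specific Newton update are assumptions, not from this page):

```c++
// Hypothetical kernel sketches for the benchmark.

// empty: does no work; measures pure launch overhead.
__global__ void empty_kernel() {}

// axpy: y = a*x + y over an n-element chunk.
__global__ void axpy_kernel(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// newton: a fixed number of Newton iterations per element; here, as an
// example, solving z*z - c = 0 via z <- z - (z*z - c) / (2*z).
__global__ void newton_kernel(int n, int iters, double c, double* z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double zi = z[i];
        for (int it = 0; it < iters; ++it)
            zi -= (zi * zi - c) / (2.0 * zi);
        z[i] = zi;
    }
}
```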

CUDA streams: The benchmark splits the work on an array (or arrays) of size A into S*K pieces. The work is performed on S streams, each of which launches K sequential kernels, each kernel working on a problem of size A/(S*K); the GPU is then synchronized. If M is true, we create S CPU threads, associate each GPU stream with one CPU thread, and launch the K kernels of each stream from its own thread. This is repeated E times, as sketched below.
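
A sketch of the single-threaded streams driver (M false), reusing the hypothetical axpy_kernel above; error checking is omitted and the helper name is an assumption:

```c++
#include <cuda_runtime.h>
#include <vector>

// Hypothetical streams driver: S streams, K kernels per stream, E epochs.
void run_streams(int E, int S, int K, int A, int B, const double* x, double* y) {
    int chunk = A / (S * K);              // each kernel works on A/(S*K) elements
    std::vector<cudaStream_t> streams(S);
    for (auto& st : streams) cudaStreamCreate(&st);

    for (int e = 0; e < E; ++e) {
        for (int s = 0; s < S; ++s) {
            for (int k = 0; k < K; ++k) { // K sequential kernels per stream
                int offset = (s * K + k) * chunk;
                int grid = (chunk + B - 1) / B;  // B is the block size
                axpy_kernel<<<grid, B, 0, streams[s]>>>(chunk, 2.0,
                                                        x + offset, y + offset);
            }
        }
        cudaDeviceSynchronize();          // one GPU-wide synchronization per epoch
    }
    for (auto& st : streams) cudaStreamDestroy(st);
}
```

For M true, the body of the s loop would instead run on S std::thread workers, one per stream.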

CUDA graphs: The benchmark splits the work on an array (or arrays) of size A into S*K pieces. The graph forks into S branches, each of which contains K sequential kernel nodes, each kernel working on a problem of size A/(S*K); the GPU is then synchronized. This is repeated E times.
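
One way to build such a graph with the explicit graph API (a sketch under the same assumptions as above; stream capture via cudaStreamBeginCapture would be an alternative):

```c++
#include <cuda_runtime.h>

// Hypothetical graph builder: S independent branches, each a chain of K
// kernel nodes; error checking omitted.
cudaGraph_t build_graph(int S, int K, int A, int B, const double* x, double* y) {
    int chunk = A / (S * K);
    double alpha = 2.0;
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    for (int s = 0; s < S; ++s) {
        cudaGraphNode_t prev{};           // first node of a branch has no dependency
        bool have_prev = false;
        for (int k = 0; k < K; ++k) {
            int n = chunk;
            const double* xc = x + (s * K + k) * chunk;
            double* yc = y + (s * K + k) * chunk;
            void* args[] = {&n, &alpha, &xc, &yc};

            cudaKernelNodeParams p = {};
            p.func = (void*)axpy_kernel;
            p.gridDim = dim3((chunk + B - 1) / B);
            p.blockDim = dim3(B);
            p.kernelParams = args;        // parameter values are copied at node creation

            cudaGraphNode_t node;
            cudaGraphAddKernelNode(&node, graph, have_prev ? &prev : nullptr,
                                   have_prev ? 1 : 0, &p);
            prev = node;                  // chain: node k+1 depends on node k
            have_prev = true;
        }
    }
    return graph;
}
```

The graph is instantiated once with cudaGraphInstantiate and then launched E times with cudaGraphLaunch, synchronizing after each launch.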

How to profile performance:

Profile with either nvprof or nsys. nvprof visualizes the overlap in execution more clearly, but it is not supported on the A100.
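
For example (the executable and report names are illustrative):

```sh
# Timeline with Nsight Systems (also works on the A100):
nsys profile -o coexec_report ./co_execution_bench -E 100 -S 4 -K 8 -A 1048576 -B 128 -M 0

# Timeline with nvprof (pre-A100 GPUs only):
nvprof ./co_execution_bench -E 100 -S 4 -K 8 -A 1048576 -B 128 -M 0
```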