
Benchmarking CUDA streams and graphs


Summary

We want to implement a benchmark that studies the performance of co-executing kernels using either CUDA streams or CUDA graphs. The benchmark should categorize (when and why) and quantify (by how much) co-execution incurs overhead or yields a performance benefit.

We will measure the throughput (the total execution time to complete the benchmark) across runs in which the following parameters are varied:

  • Number of benchmark iterations (epochs) (E)
  • Number of CUDA streams / fan-out of the CUDA graph (S)
  • Number of kernels launched per stream / per branch of the graph's fork (K)
  • Total problem size (A)
  • Block size (B)
  • Multithreaded launching (boolean) (M)

The total amount of work is W = E * A. The ideal speedup is then P = S.

K should not affect the results if there is no measurable overhead in breaking one kernel call into several.

What to implement:

The benchmark receives E, S, A, K, B and M as command-line parameters and outputs the throughput.
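
For example, a hypothetical invocation (the executable name and flag spelling are illustrative assumptions, not fixed by this page) might be:

```sh
./co_execution_bench -E 100 -S 4 -K 8 -A 1048576 -B 128 -M 0
```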

Kernels (sketched below):

  • empty
  • axpy
  • newton (with a fixed number of iterations)
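
A minimal sketch of what these three kernels might look like (the names, signatures, and the specific Newton update are assumptions, not from this page):

```c++
// Hypothetical kernel sketches for the benchmark.

// empty: does no work; measures pure launch overhead.
__global__ void empty_kernel() {}

// axpy: y = a*x + y over an n-element chunk.
__global__ void axpy_kernel(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// newton: a fixed number of Newton iterations per element; here, as an
// example, solving z*z - c = 0 via z <- z - (z*z - c) / (2*z).
__global__ void newton_kernel(int n, int iters, double c, double* z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double zi = z[i];
        for (int it = 0; it < iters; ++it)
            zi -= (zi * zi - c) / (2.0 * zi);
        z[i] = zi;
    }
}
```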

CUDA streams: The benchmark splits the work on an array (or arrays) of size A into S*K pieces. The work is performed on S streams, each of which launches K sequential kernels, each kernel working on a problem of size A/(S*K); the GPU is then synchronized. If M is true, we create S CPU threads, associate each GPU stream with one CPU thread, and launch the K kernels of each stream from its own thread. This is repeated E times, as sketched below.
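
A sketch of the single-threaded streams driver (M false), reusing the hypothetical axpy_kernel above; error checking is omitted and the helper name is an assumption:

```c++
#include <cuda_runtime.h>
#include <vector>

// Hypothetical streams driver: S streams, K kernels per stream, E epochs.
void run_streams(int E, int S, int K, int A, int B, const double* x, double* y) {
    int chunk = A / (S * K);              // each kernel works on A/(S*K) elements
    std::vector<cudaStream_t> streams(S);
    for (auto& st : streams) cudaStreamCreate(&st);

    for (int e = 0; e < E; ++e) {
        for (int s = 0; s < S; ++s) {
            for (int k = 0; k < K; ++k) { // K sequential kernels per stream
                int offset = (s * K + k) * chunk;
                int grid = (chunk + B - 1) / B;  // B is the block size
                axpy_kernel<<<grid, B, 0, streams[s]>>>(chunk, 2.0,
                                                        x + offset, y + offset);
            }
        }
        cudaDeviceSynchronize();          // one GPU-wide synchronization per epoch
    }
    for (auto& st : streams) cudaStreamDestroy(st);
}
```

For M true, the body of the s loop would instead run on S std::thread workers, one per stream.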

CUDA graphs: The benchmark splits the work on an array (or arrays) of size A into S*K pieces. The graph forks into S branches, each of which contains K sequential kernel nodes, each kernel working on a problem of size A/(S*K); the GPU is then synchronized. This is repeated E times.
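
One way to build such a graph with the explicit graph API (a sketch under the same assumptions as above; stream capture via cudaStreamBeginCapture would be an alternative):

```c++
#include <cuda_runtime.h>

// Hypothetical graph builder: S independent branches, each a chain of K
// kernel nodes; error checking omitted.
cudaGraph_t build_graph(int S, int K, int A, int B, const double* x, double* y) {
    int chunk = A / (S * K);
    double alpha = 2.0;
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    for (int s = 0; s < S; ++s) {
        cudaGraphNode_t prev{};           // first node of a branch has no dependency
        bool have_prev = false;
        for (int k = 0; k < K; ++k) {
            int n = chunk;
            const double* xc = x + (s * K + k) * chunk;
            double* yc = y + (s * K + k) * chunk;
            void* args[] = {&n, &alpha, &xc, &yc};

            cudaKernelNodeParams p = {};
            p.func = (void*)axpy_kernel;
            p.gridDim = dim3((chunk + B - 1) / B);
            p.blockDim = dim3(B);
            p.kernelParams = args;        // parameter values are copied at node creation

            cudaGraphNode_t node;
            cudaGraphAddKernelNode(&node, graph, have_prev ? &prev : nullptr,
                                   have_prev ? 1 : 0, &p);
            prev = node;                  // chain: node k+1 depends on node k
            have_prev = true;
        }
    }
    return graph;
}
```

The graph is instantiated once with cudaGraphInstantiate and then launched E times with cudaGraphLaunch, synchronizing after each launch.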

How to profile performance:

Profile with either nvprof or nsys. nvprof visualizes the overlap in execution more clearly, but it is not supported on the A100.
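
For example (the executable and report names are illustrative):

```sh
# Timeline with Nsight Systems (also works on the A100):
nsys profile -o coexec_report ./co_execution_bench -E 100 -S 4 -K 8 -A 1048576 -B 128 -M 0

# Timeline with nvprof (pre-A100 GPUs only):
nvprof ./co_execution_bench -E 100 -S 4 -K 8 -A 1048576 -B 128 -M 0
```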