## CUPTI Counter / FLOPs Analysis

### About

In this demo we use the PyTorch Profiler to capture GPU performance counters for CUDA kernels and derive per-kernel FLOPs from them. The Instructions section below shows how to collect the counters with PyTorch Profiler.

### Motivation and context

Performance counter measurements provide insight into how to speed up GPU kernels, and they support roofline analysis and other low-level optimizations. The PyTorch Profiler includes a lightweight API to program and measure detailed performance counters from the GPU. This mode uses the [CUPTI Range Profiler API](https://docs.nvidia.com/cupti/r_main.html#r_profiler) and supports an extensive list of performance metrics.

The annotated trace contains:
* Performance measurement events, which are logged under the `cuda_profiler_range` category.
* Counter values, which are logged in the args section of the above events.
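
As a minimal sketch of what this looks like when reading a trace by hand, the snippet below picks out the `cuda_profiler_range` events and reads the counters from their `args`. It assumes the trace follows the Chrome trace-event JSON layout (a top-level `traceEvents` list whose entries carry `cat`, `name`, and `args`). The sample events and the helper name `counter_events` are made up for illustration and are not part of any library.

```python
import json

# Hypothetical, hand-written sample in Chrome trace-event layout.
# Real traces are much larger and the counter values will differ.
SAMPLE_TRACE = json.dumps({
    "traceEvents": [
        {"cat": "kernel", "name": "some_kernel", "args": {}},
        {
            "cat": "cuda_profiler_range",
            "name": "some_kernel",
            "args": {"dram__bytes_read.sum": 1024, "dram__bytes_write.sum": 512},
        },
    ]
})


def counter_events(trace_json: str) -> list:
    """Return the events logged under the `cuda_profiler_range` category."""
    trace = json.loads(trace_json)
    return [
        event
        for event in trace.get("traceEvents", [])
        if event.get("cat") == "cuda_profiler_range"
    ]


events = counter_events(SAMPLE_TRACE)
for event in events:
    # Counter values live in the `args` section of each event.
    print(event["name"], event["args"])
```

For a real trace file, the same filter could be applied to the text read from disk instead of `SAMPLE_TRACE`.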


### Instructions

#### Collecting the trace with CUPTI Profiler Counters
To collect performance metrics, pass the list of metric names through the experimental config option of the PyTorch Profiler.

```
import torch

# `trace_handler`, `train_batch`, and `modeldef` are defined elsewhere by the caller.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.CPU],
    record_shapes=True,
    on_trace_ready=trace_handler,
    experimental_config=torch.profiler._ExperimentalConfig(
        # CUPTI metrics to collect.
        profiler_metrics=[
            "kineto__tensor_core_insts",
            "dram__bytes_read.sum",
            "dram__bytes_write.sum",
        ],
        # Measure the counters for each kernel.
        profiler_measure_per_kernel=True,
    ),
) as prof:
    res = train_batch(modeldef)
    prof.step()
```
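
The two DRAM counters requested above are the memory-traffic half of a roofline analysis. Combined with a FLOP count (for example, one computed as in the Trace Analysis section), they give a kernel's arithmetic intensity. The sketch below shows that arithmetic only. All numbers, including the peak compute and bandwidth figures, are hypothetical placeholders and not measurements from any GPU, and the function names are illustrative.

```python
def arithmetic_intensity(flops: float, bytes_read: float, bytes_written: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    total_bytes = bytes_read + bytes_written
    if total_bytes == 0:
        return float("inf")  # no DRAM traffic recorded
    return flops / total_bytes


def attainable_flops_per_s(
    intensity: float, peak_flops_per_s: float, peak_bytes_per_s: float
) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


# Hypothetical per-kernel values, chosen only to make the arithmetic easy to follow.
flops = 2.0e9          # FLOP count for the kernel
bytes_read = 3.0e8     # stand-in for dram__bytes_read.sum
bytes_written = 1.0e8  # stand-in for dram__bytes_write.sum

intensity = arithmetic_intensity(flops, bytes_read, bytes_written)
bound = attainable_flops_per_s(
    intensity,
    peak_flops_per_s=1.0e13,  # hypothetical peak compute throughput
    peak_bytes_per_s=1.0e12,  # hypothetical peak DRAM bandwidth
)
print(f"intensity = {intensity} FLOPs/byte, attainable = {bound:.3e} FLOP/s")
```

With these placeholder numbers the memory roof is the lower of the two, so the kernel would be classified as memory bound.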

To collect the trace used in this example, we ran [PARAM Benchmarks](https://github.com/facebookresearch/param/tree/main/train/compute/python). PARAM provides a repository of communication and computation micro-benchmarks for AI training and inference. For this example, we ran AlexNet, a simple convolutional neural network model, as the benchmark and collected the trace.

Run the benchmark with the following command:

```
# Inside dir "param/train/compute"
$ python -m python.pytorch.run_benchmark -c python/examples/pytorch/configs/alex_net.json -p -i 1 -d cuda --cupti-profiler --cupti-profiler-measure-per-kernel
```

#### Trace Analysis

To run this demo notebook on your laptop:
1. Clone the repo: `git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git`
1. [Optional and recommended] Set up a venv or conda environment. See README for details.
1. Set the `trace_dir` parameter in the next cell to the location of the folder containing your collected PyTorch Profiler trace.


In [None]:
from hta.trace_analysis import TraceAnalysis
from hta.analyzers.cupti_counter_analysis import CUDA_SASS_INSTRUCTION_COUNTER_FLOPS

trace_prefix = "/path/to/HolisticTraceAnalysis"  # ENTER PATH TO HTA HERE
trace_dir = f"{trace_prefix}/tests/data/cupti_profiler/"
analyzer = TraceAnalysis(trace_dir=trace_dir)

In [8]:
analyzer.get_cupti_counter_data_with_operators?

In [9]:
gpu_kernels = analyzer.get_cupti_counter_data_with_operators(ranks=[0])[0]

In [10]:
gpu_kernels.head()[["name", "op_stack", "top_level_op"]
                   + list(CUDA_SASS_INSTRUCTION_COUNTER_FLOPS.keys())]

| | name | op_stack | top_level_op | smsp__sass_thread_inst_executed_op_ffma_pred_on.sum | smsp__sass_thread_inst_executed_op_fmul_pred_on.sum | smsp__sass_thread_inst_executed_op_fadd_pred_on.sum | smsp__sass_thread_inst_executed_op_hfma_pred_on.sum | smsp__sass_thread_inst_executed_op_hmul_pred_on.sum | smsp__sass_thread_inst_executed_op_hadd_pred_on.sum | smsp__sass_thread_inst_executed_op_dfma_pred_on.sum | smsp__sass_thread_inst_executed_op_dmul_pred_on.sum | smsp__sass_thread_inst_executed_op_dadd_pred_on.sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `void at::native::(anonymous namespace)::distri...` | `[cudaLaunchKernel, aten::uniform_, aten::rand]` | `aten::rand` | 38731776 | 0 | 0 | 4866048 | 0 | 0 | 0 | 0 | 0 |
| 1 | `__missing__` | `[cudaLaunchKernel, cudaFuncSetAttribute, cudaF...` | `aten::conv2d` | 9119334400 | 24780800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | `void at::native::elementwise_kernel<128, 2, at...` | `[cudaLaunchKernel, aten::add_, cudaFuncSetAttr...` | `aten::conv2d` | 24780800 | 0 | 0 | 49561600 | 0 | 0 | 0 | 0 | 0 |
| 3 | `void at::native::vectorized_elementwise_kernel...` | `[cudaLaunchKernel, aten::clamp_min_, aten::rel...` | `aten::relu_` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | `void at::native::(anonymous namespace)::max_po...` | `[cudaLaunchKernel, aten::max_pool2d_with_indic...` | `aten::max_pool2d` | 0 | 0 | 0 | 5971968 | 0 | 0 | 0 | 0 | 0 |


In [11]:
gpu_kernels["flops"] = 0
for counter, flops in CUDA_SASS_INSTRUCTION_COUNTER_FLOPS.items():
    gpu_kernels["flops"] += gpu_kernels[counter] * flops

In [12]:
gpu_kernels[["name", "bottom_level_op", "top_level_op", "flops"]].head()

| | name | bottom_level_op | top_level_op | flops |
|---|---|---|---|---|
| 0 | `void at::native::(anonymous namespace)::distri...` | `aten::uniform_` | `aten::rand` | 87195648 |
| 1 | `__missing__` | `aten::convolution` | `aten::conv2d` | 18263449600 |
| 2 | `void at::native::elementwise_kernel<128, 2, at...` | `aten::add_` | `aten::conv2d` | 148684800 |
| 3 | `void at::native::vectorized_elementwise_kernel...` | `aten::clamp_min_` | `aten::relu_` | 0 |
| 4 | `void at::native::(anonymous namespace)::max_po...` | `aten::max_pool2d_with_indices` | `aten::max_pool2d` | 11943936 |
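
The `flops` values above are consistent with counting each fused multiply-add instruction (`ffma`, `hfma`, `dfma`) as 2 FLOPs and every other counted instruction as 1. The sketch below reproduces the five rows by hand from the counter values printed earlier. The weight table is inferred from that output rather than copied from `CUDA_SASS_INSTRUCTION_COUNTER_FLOPS`, so treat it as an assumption; the names `ASSUMED_WEIGHTS`, `counters`, and `kernel_flops` are illustrative.

```python
OPS = ["ffma", "fmul", "fadd", "hfma", "hmul", "hadd", "dfma", "dmul", "dadd"]


def counter_name(op: str) -> str:
    """Full counter name for a short SASS op name such as 'ffma'."""
    return f"smsp__sass_thread_inst_executed_op_{op}_pred_on.sum"


# Assumed weights: a fused multiply-add is 2 FLOPs, anything else is 1.
ASSUMED_WEIGHTS = {counter_name(op): (2 if op.endswith("fma") else 1) for op in OPS}


def counters(**nonzero: int) -> dict:
    """Build a full counter row, defaulting unspecified counters to 0."""
    row = {counter_name(op): 0 for op in OPS}
    for op, value in nonzero.items():
        row[counter_name(op)] = value
    return row


def kernel_flops(row: dict) -> int:
    """Weighted sum of instruction counts, mirroring the loop in the notebook."""
    return sum(ASSUMED_WEIGHTS[name] * value for name, value in row.items())


# Counter values copied from the first five kernels shown above.
rows = [
    counters(ffma=38731776, hfma=4866048),
    counters(ffma=9119334400, fmul=24780800),
    counters(ffma=24780800, hfma=49561600),
    counters(),
    counters(hfma=5971968),
]

flops_per_kernel = [kernel_flops(row) for row in rows]
print(flops_per_kernel)
# [87195648, 18263449600, 148684800, 0, 11943936]
```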
