## CUPTI Counter / FLOPs Analysis

### About

This demo uses the PyTorch Profiler to capture GPU performance counters for CUDA kernels, then analyzes them with HTA to estimate per-kernel FLOPs. The Instructions section below explains how to collect the counters.

### Motivation and context

Performance counter measurements give insight into how to speed up GPU kernels, and they support roofline analysis and other low-level optimizations. The PyTorch Profiler includes a lightweight API to program and measure detailed performance counters from the GPU. This mode uses the [CUPTI Range Profiler API](https://docs.nvidia.com/cupti/r_main.html#r_profiler) and supports an extensive list of performance metrics.

The annotated trace contains:
* Performance measurement events, which are logged under the `cuda_profiler_range` category.
* Counter values, which are logged in the args section of the above events.
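
As a rough illustration of where these values live, the sketch below reads a trace file with the standard `json` module and pulls out the `cuda_profiler_range` events together with their `args`. It assumes the trace follows the Chrome trace event layout (a top-level `traceEvents` list whose entries carry `cat`, `name`, and `args`). The helper name `load_counter_events` and the sample trace are hypothetical and are not part of HTA.

```python
import json
import os
import tempfile


def load_counter_events(trace_path):
    """Return (kernel name, counters) pairs for cuda_profiler_range events."""
    with open(trace_path) as f:
        trace = json.load(f)
    # Assumption: events are stored in a top-level "traceEvents" list.
    events = trace.get("traceEvents", [])
    return [
        (ev.get("name"), ev.get("args", {}))
        for ev in events
        if ev.get("cat") == "cuda_profiler_range"
    ]


# Tiny synthetic trace, made up for illustration only.
sample = {
    "traceEvents": [
        {"cat": "cpu_op", "name": "aten::conv2d", "args": {}},
        {
            "cat": "cuda_profiler_range",
            "name": "example_kernel",
            "args": {"dram__bytes_read.sum": 1024, "dram__bytes_write.sum": 512},
        },
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trace.json")
    with open(path, "w") as f:
        json.dump(sample, f)
    counter_events = load_counter_events(path)

for name, counters in counter_events:
    print(name, counters)
```

In practice HTA does this parsing for you; the sketch is only meant to show which part of each event holds the counters.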


### Instructions

#### Collecting the trace with CUPTI Profiler Counters
To collect performance metrics, pass the list of metric names through the `experimental_config` option of the PyTorch Profiler:

```
import torch

# `trace_handler`, `train_batch`, and `modeldef` must be defined by the caller.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.CPU],
    record_shapes=True,
    on_trace_ready=trace_handler,
    experimental_config=torch.profiler._ExperimentalConfig(
        # Counters to measure for each profiled range.
        profiler_metrics=[
            "kineto__tensor_core_insts",
            "dram__bytes_read.sum",
            "dram__bytes_write.sum",
        ],
        profiler_measure_per_kernel=True,
    ),
) as prof:
    res = train_batch(modeldef)
    prof.step()
```
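
The snippet above refers to a `trace_handler` callback without defining it. A minimal sketch of one possible handler is shown below. It relies on the profiler passing itself to the `on_trace_ready` callback and on the `export_chrome_trace` method and `step_num` attribute of `torch.profiler.profile`. The factory name `make_trace_handler`, the output directory, and the file naming scheme are assumptions for illustration, not what PARAM or HTA require.

```python
import os


def make_trace_handler(output_dir):
    """Build an on_trace_ready callback that writes one trace file per step."""
    os.makedirs(output_dir, exist_ok=True)

    def trace_handler(prof):
        # The profiler object is handed to the callback; export_chrome_trace
        # writes the collected events to a JSON trace file.
        path = os.path.join(output_dir, f"trace_step_{prof.step_num}.json")
        prof.export_chrome_trace(path)
        return path

    return trace_handler


# Hypothetical usage with the profiler call shown above:
#   on_trace_ready=make_trace_handler("./traces")
```

Writing all traces into a single directory is convenient because that directory can later be passed to HTA as `trace_dir`.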

To collect the trace used in this example, we ran [PARAM Benchmarks](https://github.com/facebookresearch/param/tree/main/train/compute/python), a repository of communication and computation micro-benchmarks for AI training and inference. We used AlexNet, a simple convolutional neural network, as the benchmark model.

Run using the following commands:

```
# Inside dir "param/train/compute"
$ python -m python.pytorch.run_benchmark -c python/examples/pytorch/configs/alex_net.json -p -i 1 -d cuda --cupti-profiler --cupti-profiler-measure-per-kernel
```

#### Trace Analysis

To run this demo notebook on your laptop:
1. Clone the repo: `git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git`
1. [Optional and recommended] Set up a venv or conda environment. See the README for details.
1. Set the `trace_dir` parameter in the next cell to the location of the folder containing your collected PyTorch Profiler trace.


In [1]:
from hta.trace_analysis import TraceAnalysis
from hta.analyzers.cupti_counter_analysis import CUDA_SASS_INSTRUCTION_COUNTER_FLOPS

#trace_prefix = # ENTER PATH TO HTA HERE
trace_dir = "~/Work/hta/debug_cupti_yang/"
analyzer = TraceAnalysis(trace_dir=trace_dir)

2023-09-07 00:43:28,711 - hta - trace.py:L404 - INFO - /Users/bcoutinho/Work/hta/debug_cupti_yang
2023-09-07 00:43:28,749 - hta - trace_file.py:L94 - INFO - Rank to trace file map:
{0: '/Users/bcoutinho/Work/hta/debug_cupti_yang/libkineto_activities_3660455.json'}
2023-09-07 00:43:28,751 - hta - trace.py:L550 - INFO - ranks=[0]
2023-09-07 00:43:28,764 - hta - trace.py:L132 - INFO - Parsed /Users/bcoutinho/Work/hta/debug_cupti_yang/libkineto_activities_3660455.json time = 0.01 seconds 


In [2]:
analyzer.get_cupti_counter_data_with_operators?

Signature:
analyzer.get_cupti_counter_data_with_operators(
    ranks: Optional[List[int]] = None,
) -> List[pandas.core.frame.DataFrame]
Docstring:
Performance counters provide insights on how to speed up GPU
kernels. The PyTorch Profiler has a lightweight API [CUPTI Range
Profiler API](https://docs.nvidia.com/cupti/r_main.html#r_profiler)
that enables users to monitor performance counters from the device.

When the CUPTI Profiler mode is enabled then PyTorch will emit the
performance counters and annotates them in the trace.
    * The events are logged under the `cuda_profiler_range` category.
    * Counter values are logged in the args section of the above events.

In [5]:
gpu_kernels = analyzer.get_cupti_counter_data_with_operators(ranks=[0])[0]

In [8]:
gpu_kernels.loc[0]

index                                                                                               635
cat                                                                                 cuda_profiler_range
name                                                                   ampere_fp16_sgemm_fp16_32x128_tn
pid                                                                                                   7
tid                                                                                                   0
ts                                                                                                38473
dur                                                                                             3619088
Trace iteration                                                                                      -1
dram__bytes_read.sum                                                                                  0
memory_bw_gbps                                                  

In [6]:
gpu_kernels.head()[["name", "op_stack", "top_level_op"]\
                   + list(CUDA_SASS_INSTRUCTION_COUNTER_FLOPS.keys())]

KeyError: "['smsp__sass_thread_inst_executed_op_ffma_pred_on.sum', 'smsp__sass_thread_inst_executed_op_fmul_pred_on.sum', 'smsp__sass_thread_inst_executed_op_fadd_pred_on.sum', 'smsp__sass_thread_inst_executed_op_hfma_pred_on.sum', 'smsp__sass_thread_inst_executed_op_hmul_pred_on.sum', 'smsp__sass_thread_inst_executed_op_hadd_pred_on.sum', 'smsp__sass_thread_inst_executed_op_dfma_pred_on.sum', 'smsp__sass_thread_inst_executed_op_dmul_pred_on.sum', 'smsp__sass_thread_inst_executed_op_dadd_pred_on.sum'] not in index"

The `KeyError` above is raised when the trace does not contain the SASS instruction counters listed in `CUDA_SASS_INSTRUCTION_COUNTER_FLOPS`. To compute FLOPs, those counters must be included in `profiler_metrics` when the trace is collected.

In [11]:
gpu_kernels["flops"] = 0
for counter, flops in CUDA_SASS_INSTRUCTION_COUNTER_FLOPS.items():
    gpu_kernels["flops"] += gpu_kernels[counter] * flops
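
The loop above is a weighted sum: each counter column is multiplied by the number of floating-point operations that one instruction of that type contributes. The plain-Python sketch below shows the same arithmetic for a single kernel. The weights in `EXAMPLE_FLOPS_PER_INST` (2 for a fused multiply-add, 1 for a multiply or an add) and the counter values are assumptions for illustration; the real mapping is whatever `CUDA_SASS_INSTRUCTION_COUNTER_FLOPS` defines.

```python
# Assumed weights for illustration only; not copied from HTA.
EXAMPLE_FLOPS_PER_INST = {
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 2,  # multiply + add
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 1,
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 1,
}


def kernel_flops(counters, flops_per_inst):
    """Weighted sum of instruction counts; missing counters count as zero."""
    return sum(
        counters.get(counter, 0) * weight
        for counter, weight in flops_per_inst.items()
    )


# Made-up counter values for one kernel.
example_counters = {
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 1000,
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 200,
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 300,
}

total_flops = kernel_flops(example_counters, EXAMPLE_FLOPS_PER_INST)
print(total_flops)  # 1000 * 2 + 200 * 1 + 300 * 1 = 2500
```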

In [12]:
gpu_kernels[["name", "bottom_level_op", "top_level_op", "flops"]].head()

|   | name | bottom_level_op | top_level_op | flops |
|---|---|---|---|---:|
| 0 | `void at::native::(anonymous namespace)::distri...` | `aten::uniform_` | `aten::rand` | 87195648 |
| 1 | `__missing__` | `aten::convolution` | `aten::conv2d` | 18263449600 |
| 2 | `void at::native::elementwise_kernel<128, 2, at...` | `aten::add_` | `aten::conv2d` | 148684800 |
| 3 | `void at::native::vectorized_elementwise_kernel...` | `aten::clamp_min_` | `aten::relu_` | 0 |
| 4 | `void at::native::(anonymous namespace)::max_po...` | `aten::max_pool2d_with_indices` | `aten::max_pool2d` | 11943936 |
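
Once a kernel has both a `flops` value and DRAM traffic counters, one quantity commonly used in roofline analysis is arithmetic intensity: FLOPs per byte moved to or from DRAM. The sketch below is a minimal example, assuming `dram__bytes_read.sum` and `dram__bytes_write.sum` were collected for the same kernel. The function name and the input numbers are hypothetical.

```python
def arithmetic_intensity(flops, dram_bytes_read, dram_bytes_written):
    """FLOPs per DRAM byte, or None when no DRAM traffic was recorded."""
    total_bytes = dram_bytes_read + dram_bytes_written
    if total_bytes == 0:
        # Some kernels report zero DRAM bytes; avoid dividing by zero.
        return None
    return flops / total_bytes


# Made-up values for one kernel.
intensity = arithmetic_intensity(
    flops=2_000_000, dram_bytes_read=300_000, dram_bytes_written=200_000
)
print(intensity)  # 2,000,000 FLOPs / 500,000 bytes = 4.0
```

Returning `None` for kernels with no recorded DRAM traffic keeps them easy to filter out before plotting or ranking.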
