# L_001: How to profile CUDA kernels in PyTorch

## At a glance

## S1 - C++ & Python Kernels

Key Terms:

- PyTorch `Profiler`
- Binding - interfacing or connecting code written in different programming languages.
- Numba - a Python library that serves as an interface to craft CUDA kernels using Python syntax.
  
  - Note: unless we have very specific needs that Triton doesn't cover, Triton's built-in optimizations will usually yield better results.

Numba lets us write CUDA kernels in Python syntax, with slightly different terminology for kernel creation than C++ kernels use. Alternatively, we can bind C++ code directly in our Python script with `load_inline` from `torch.utils.cpp_extension`, which writes the generated and compiled code to a build folder. This abstracts away the complexity of makefiles and CUDA compiler flags.

## S2 - Triton

Key Terms:

- Triton: a DSL (domain-specific language) for generating PTX
- New interpreter mode: set `TRITON_INTERPRET=1` (older Triton releases used `@triton.jit(interpret=True)`)
- **TIP**: Write a PyTorch program that uses `torch.compile` and run it with `TORCH_LOGS="output_code" python compile_square.py` to get the generated Triton kernel
- Most important optimization: fusion
  


## S3 - CUDA Profilers

Key Terms:

- CUDA profilers
  - `ncu` (doesn't work on most cloud vendors, as they won't give us that profiling information)
    - Produces a report for the Nsight Compute visual profiler: `ncu --set full -o output $(which python) train.py`
  - `ncu` gives us actionable hints:
    - Tail effect and achieved occupancy: often governed by things like padding, which we can control.
    - Long scoreboard stalls: addressed through memory coalescing and shared memory use, which we can't control directly (Triton does it for us).
  

# S1 - C++ & Python Kernels

## PyTorch Profiler

In [2]:
!python pytorch_square.py

tensor([1., 4., 9.])
tensor([1., 4., 9.])
tensor([1., 4., 9.])
Profiling torch.square
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                           aten::square         0.37%       6.484us        88.49%       1.544ms       1.544ms       0.000us         0.00%     262.431us     262.431us             1  
                                              aten::pow        86.66%       1.512ms        88.11%       1.537ms       1.5
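The contents of `pytorch_square.py` are not shown here. As a rough sketch of how a table like the one above can be produced, the function below profiles `torch.square` with `torch.profiler`. The function name `profile_square`, the tensor size, and the CPU fallback are assumptions made for illustration, not the original script.

```python
def profile_square(n: int = 1024) -> str:
    """Profile torch.square on an n x n tensor and return the summary table."""
    # Imported lazily so this file can be loaded on a machine without PyTorch.
    import torch
    from torch.profiler import ProfilerActivity, profile

    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    x = torch.randn(n, n, device=device)

    # Warm up so one-time initialization is not attributed to the operation.
    torch.square(x)
    if use_cuda:
        torch.cuda.synchronize()

    with profile(activities=activities) as prof:
        torch.square(x)
        if use_cuda:
            # Kernel launches are asynchronous; wait for the GPU work to finish
            # inside the profiled region.
            torch.cuda.synchronize()

    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
```

Calling `print(profile_square())` on a GPU machine should list `aten::square` and the `aten::pow` call it dispatches to, similar to the output above.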

In [4]:
!python pt_profiler.py

In [36]:
!tensorboard --logdir=./log --verbosity=1

2025-02-18 12:32:42.281996: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-18 12:32:42.293486: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1739881962.307101 1591562 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739881962.311068 1591562 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-18 12:32:42.324384: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

We should always keep our server clean by avoiding unnecessary data and removing unused configuration, such as exposed ports. If we forwarded a port, we should check once the task is finished whether it is still active, either by listing the SSH processes or by inspecting the port itself:
`ps aux | grep ssh` or `lsof -i:6006`
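The same check can be scripted. Here is a minimal sketch using only the standard library, assuming the forwarded port is 6006 on localhost as in the `lsof` command above; the helper name `port_in_use` is illustrative.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "still open" if port_in_use(6006) else "closed"
    print(f"Port 6006 is {state}")
```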

## Binding

We are going to use a Torch utility, `load_inline` from `torch.utils.cpp_extension`, to integrate C++ code into our Python script. This type of mixed-language programming is known as binding, and it lets us use C++ performance from Python.

A temporary directory (tmp) will be created containing the following files:

- .ninja_log: A log file generated by Ninja during the build process.
- build.ninja: The build file that defines the rules and dependencies needed to compile and link our C++ code into a shared library (my_module.so).
- .ninja_deps: A file that stores dependency information for Ninja.
- main.cpp: The C++ source file that defines a simple function and uses Pybind11 macros to create a Python module.
- main.o: The object file generated from compiling main.cpp.
- my_module.so: The final shared library that is ready to be imported into Python.
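A script that generates these files could look roughly like the sketch below. It is a guess at the shape of `hello_load_inline.py`, not its actual contents: the names `CPP_SOURCE` and `build_hello_module` are hypothetical, while the module name `my_module` and the `./tmp` directory are taken from the build output in this section.

```python
import os

# C++ source for the extension. load_inline prepends
# `#include <torch/extension.h>` and, because `functions` is given,
# generates the pybind11 bindings for the listed functions.
CPP_SOURCE = """
std::string hello_world() {
  return "Hello World!";
}
"""


def build_hello_module(build_dir: str = "./tmp"):
    """Compile CPP_SOURCE into an extension module and return it."""
    # Imported lazily: building needs PyTorch, a C++ compiler, and Ninja.
    from torch.utils.cpp_extension import load_inline

    # Create the build directory up front; load_inline writes its files there.
    os.makedirs(build_dir, exist_ok=True)
    return load_inline(
        name="my_module",
        cpp_sources=[CPP_SOURCE],
        functions=["hello_world"],
        build_directory=build_dir,
        verbose=True,
    )
```

With a working toolchain, `print(build_hello_module().hello_world())` should print the greeting returned by the C++ function.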

In [3]:
!python hello_load_inline.py

Emitting ninja build file ./tmp/build.ninja...
Building extension module my_module...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module my_module...
Hello World!


In [2]:
!ls tmp

build.ninja  main.cpp  main.o  my_module.so


In [None]:
!TORCH_CUDA_ARCH_LIST=9.0 python load_inline.py                             

tensor([[ 1.,  4.,  9.],
        [16., 25., 36.]], device='cuda:0')


```Python
from torch.utils.cpp_extension import load_inline

# cpp_source and cuda_source are strings holding the C++ and CUDA code.
square_matrix_extension = load_inline(
    name='square_matrix_extension',  # Name of the Python extension module that will expose the C++/CUDA functions
    cpp_sources=cpp_source,          # C++ source code defining the interface and wrapper function for the CUDA kernel
    cuda_sources=cuda_source,        # CUDA source code implementing the actual CUDA kernel
    functions=['square_matrix'],     # List of function symbols to be registered and exposed to Python
    with_cuda=True,                  # Build with CUDA support
    extra_cuda_cflags=["-O2"],       # Extra flags passed to nvcc
    build_directory='./load_inline_cuda',  # Where generated sources and build artifacts are written
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)
```
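The `cpp_source` and `cuda_source` strings passed to `load_inline` are not shown. A plausible sketch follows, assuming a contiguous 2-D `float32` CUDA tensor as input. The kernel name `square_matrix_kernel` matches the `ncu` output in this section, but the kernel body and the 16x16 block size are assumptions.

```python
# CUDA side: the kernel plus a host wrapper that launches it.
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard threads that fall outside the matrix.
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

// Assumes `matrix` is a contiguous 2-D float32 tensor on the GPU.
torch::Tensor square_matrix(torch::Tensor matrix) {
    const auto height = matrix.size(0);
    const auto width = matrix.size(1);
    auto result = torch::empty_like(matrix);

    // Ceil-divide so the grid covers the whole matrix.
    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);
    return result;
}
"""

# C++ side: only the declaration, so the generated bindings can reference
# the wrapper defined in the CUDA source.
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"
```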

> NOTE: `load_inline` does not interact well with `ncu` by default. Profiling requires the `--target-processes` flag: `ncu --target-processes all python load_inline.py`

In [9]:
!ncu --target-processes all python load_inline.py

==PROF== Connected to process 1573040 (/home/alex/miniforge3/envs/triton/bin/python3.10)
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
==PROF== Profiling "square_matrix_kernel" - 0: 0%....50%....100% - 10 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 1: 0%....50%....100% - 10 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 2: 0%....50%....100% - 10 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 3: 0%....50%....100% - 10 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 4: 0%....50%....100% - 10 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 5: 0%....50%....100% - 10 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 6: 0%....50%....100% - 10 passes
==PROF== Profiling "DeviceReduceSingleTileKernel" - 7: 0%....50%....100% - 10 passes
==PROF== Profiling "DeviceCompactInitKernel" - 8: 0%....50%....100% - 10 passes
==PROF== Profiling "DeviceSelectSweepKernel" - 9: 0%....50%....100% - 10 passes
==

## Numba

In [1]:
!python numba_square.py

[[1. 2. 3.]
 [4. 5. 6.]]
[[ 1.  4.  9.]
 [16. 25. 36.]]
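The source of `numba_square.py` is not included. The sketch below shows how such a kernel is commonly written in Numba, with all names (`blocks_per_grid`, `square_on_gpu`, `square_matrix_kernel`) chosen for illustration. It also shows the terminology difference from C++: the launch configuration goes in square brackets as `kernel[blocks, threads](...)`, and `cuda.grid(2)` replaces the manual `blockIdx * blockDim + threadIdx` arithmetic.

```python
import math


def blocks_per_grid(shape, threads_per_block):
    """Ceil-divide each dimension so the grid covers the whole array."""
    return tuple(math.ceil(n / t) for n, t in zip(shape, threads_per_block))


def square_on_gpu(matrix):
    """Square a 2-D NumPy float array element-wise with a Numba CUDA kernel."""
    # Imported lazily: requires Numba and a CUDA-capable GPU.
    from numba import cuda

    @cuda.jit
    def square_matrix_kernel(src, dst):
        # Absolute position of this thread in the 2-D grid.
        row, col = cuda.grid(2)
        if row < src.shape[0] and col < src.shape[1]:
            dst[row, col] = src[row, col] ** 2

    d_src = cuda.to_device(matrix)          # copy input to the GPU
    d_dst = cuda.device_array_like(matrix)  # uninitialized output on the GPU

    threads = (16, 16)
    blocks = blocks_per_grid(matrix.shape, threads)
    square_matrix_kernel[blocks, threads](d_src, d_dst)

    return d_dst.copy_to_host()
```

On a GPU machine, `square_on_gpu(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))` should reproduce the squared matrix printed above.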


# S2 - Triton

In [40]:
!code triton_square.py

In [None]:
!code square_kernel.ptx

In [4]:
!TORCH_LOGS=output_code python ax_pytorch2triton.py # TORCH_LOGS=help python -c "import torch"

V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] Output code: 
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] # AOT ID: ['0_inference']
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] import torch
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] import math
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] import random
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] import os
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_code] import tempfile
V0218 13:13:36.929000 13100 site-packages/torch/_inductor/graph.py:2045] [0/0] [__output_cod
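The script behind this log is not shown. The sketch below shows the kind of program that produces such output when launched with `TORCH_LOGS=output_code`. The function names and the extra `+ 1.0` are assumptions, added so that there are two pointwise operations for the compiler to fuse into a single kernel.

```python
def run_compiled_square(n: int = 1024):
    """Compile a small pointwise function and run it once to trigger codegen."""
    # Imported lazily; requires PyTorch 2.x (and Triton for GPU code generation).
    import torch

    def square_plus_one(x):
        # Two pointwise operations that the compiler can fuse into one kernel.
        return torch.square(x) + 1.0

    compiled = torch.compile(square_plus_one)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(n, n, device=device)

    # The first call triggers compilation, which is when the code is logged.
    return compiled(x)
```

Calling `run_compiled_square()` from a script started as `TORCH_LOGS=output_code python <script>.py` should log the generated code, which on a GPU is typically a Triton kernel.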

# S3 - CUDA Profilers

In [None]:
!ncu --set full -o ax_train $(which python) ax_train.py

==PROF== Connected to process 13212 (/home/alex/miniforge3/envs/gpum/bin/python3.10)

==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 37 passes
==PROF== Profiling "square_kernel" - 1: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 2: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 3: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 4: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 5: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 6: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 7: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 8: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 9: 0%....50%....100% - 37 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 10: 0%....50%.

Once we have the profiling results, we can open them in the Nsight Compute app to inspect them visually.

- Check the performance opportunities it reports:
  - Tail effect and achieved occupancy: often governed by things like padding, which we can control.
  - Long scoreboard stalls: addressed through memory coalescing and shared memory use, which we can't control directly (Triton does it for us).
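The tail effect comes from the last, partially filled wave of thread blocks. When a launch has more blocks than the GPU can hold at once, the blocks run in waves, and a small final wave leaves part of the GPU idle. The arithmetic can be sketched as follows; the helper names are hypothetical, and the numbers in the usage note are illustrative rather than measurements from this run.

```python
def wave_stats(num_blocks: int, num_sms: int, blocks_per_sm: int):
    """Return (full_waves, tail_blocks, tail_utilization) for a kernel launch."""
    per_wave = num_sms * blocks_per_sm  # blocks the GPU can hold at once
    full_waves, tail_blocks = divmod(num_blocks, per_wave)
    # Fraction of the last wave that is occupied; 0.0 means no partial wave.
    tail_utilization = tail_blocks / per_wave
    return full_waves, tail_blocks, tail_utilization


def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the next multiple, as done when padding a dimension."""
    return ((n + multiple - 1) // multiple) * multiple
```

For example, `wave_stats(1000, 108, 2)` gives 4 full waves plus a tail of 136 blocks, which fills about 63% of the last wave. Padding a dimension, for instance `pad_to_multiple(1000, 64)`, which returns 1024, changes how many blocks are launched and therefore how full that last wave is.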