
# PyTorch Profiler With TensorBoard
This tutorial demonstrates how to use the TensorBoard plugin with PyTorch Profiler
to detect performance bottlenecks in a model.

## Introduction
PyTorch 1.8 includes an updated profiler API that can record
CPU-side operations as well as CUDA kernel launches on the GPU side.
The profiler can visualize this information
in the TensorBoard plugin and provide an analysis of performance bottlenecks.
In this tutorial, we use a simple Mobilenet_V2 model to demonstrate how to
use the TensorBoard plugin to analyze model performance.

This tutorial is based on the PyTorch example [PyTorch Profiler With TensorBoard](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html) (accessed on February 13, 2024).


## Setup
To install ``torch`` and ``torchvision`` use the following command:

```
pip install torch torchvision
```


## Steps

1. Prepare the data and model
2. Use profiler to record execution events
3. Run the profiler
4. Use TensorBoard to view results and analyze model performance
5. Improve performance with the help of profiler
6. Analyze performance with other advanced features

### 1. Prepare the data and model

First, import all necessary libraries:




In [2]:
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

Then prepare the input data. For this tutorial, we use the CIFAR10 dataset.
Transform it to the desired format and use ``DataLoader`` to load each batch.



In [3]:
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor()])
# T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) is omitted here; normalization is done on the GPU in the training step.
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2, pin_memory=True)

Files already downloaded and verified


Next, create the Mobilenet_V2 model, loss function, and optimizer objects.
To run on the GPU, move the model and loss to the GPU device.



In [4]:
device = torch.device("cuda:0")
model = torchvision.models.mobilenet_v2(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

MobileNetV2(
  (features): Sequential(
    (0): Conv2dNormActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=

Define the training step for each batch of input data.



In [5]:
def train(data):
    # Copy the batch to the GPU. With pinned memory in the DataLoader,
    # non_blocking=True lets the copy overlap with other work.
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True)
    # Normalize on the GPU. T.ToTensor() has already scaled pixels to [0, 1],
    # so only the (x - 0.5) / 0.5 step is applied here (no division by 255).
    inputs = (inputs.to(torch.float32) - 0.5) / 0.5
    # Run the forward pass and the loss in mixed precision.
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    # Note - torch.cuda.amp.GradScaler() may be required
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
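
The note in the training step says that ``torch.cuda.amp.GradScaler()`` may be required with float16 autocast. The following is a minimal sketch of how gradient scaling could be wired into such a step. It is not part of the original notebook: the small ``Linear`` model, the random batch, and the name ``train_scaled`` are hypothetical stand-ins, and the CPU fallback (with scaling and autocast disabled) is an assumption made only so the sketch runs without a GPU.

```python
import torch

# Assumption: fall back to the CPU when no GPU is present; scaling is only enabled on CUDA.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

# Hypothetical stand-in model, used only to keep the example small.
model = torch.nn.Linear(16, 4).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)


def train_scaled(inputs, labels):
    inputs, labels = inputs.to(device), labels.to(device)
    with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_cuda):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    optimizer.zero_grad()
    # Scale the loss so small float16 gradients do not underflow to zero.
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update if they contain inf/NaN.
    scaler.step(optimizer)
    # Adjust the scale factor for the next iteration.
    scaler.update()
    return loss.item()


loss_value = train_scaled(torch.randn(8, 16), torch.randint(0, 4, (8,)))
print(loss_value)
```

When the scaler is disabled, ``scale``, ``step``, and ``update`` reduce to a plain ``backward()`` and ``optimizer.step()``, so the same function can be used with or without mixed precision.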

### 2. Use profiler to record execution events

The profiler is enabled through a context manager and accepts several parameters.
Some of the most useful are:

- ``schedule`` - a callable that takes the step (int) as its single parameter
  and returns the profiler action to perform at each step.

  In this example with ``wait=1, warmup=1, active=3, repeat=1``,
  the profiler skips the first step/iteration,
  starts warming up on the second,
  and records the following three iterations,
  after which the trace becomes available and ``on_trace_ready`` (when set) is called.
  With ``repeat=1``, this cycle runs once. Each cycle is called a "span" in the TensorBoard plugin.

  During ``wait`` steps, the profiler is disabled.
  During ``warmup`` steps, the profiler starts tracing but the results are discarded.
  Warmup is needed because the overhead at the start of profiling is high
  and can easily skew the profiling result.
  During ``active`` steps, the profiler records events.
- ``on_trace_ready`` - a callable that is called at the end of each cycle.
  In this example we use ``torch.profiler.tensorboard_trace_handler`` to generate result files for TensorBoard.
  After profiling, the result files are saved in the ``./log/mobilenet_v2`` directory.
  Specify this directory as the ``logdir`` parameter to analyze the profile in TensorBoard.
- ``record_shapes`` - whether to record the shapes of the operator inputs.
- ``profile_memory`` - whether to track tensor memory allocation/deallocation. Note: on PyTorch versions
  before 1.10, if profiling takes a long time, disable this option or upgrade to a newer version.
- ``with_stack`` - whether to record source information (file and line number) for the ops.
  If TensorBoard is launched in VS Code ([reference](https://code.visualstudio.com/docs/datascience/pytorch-support#_tensorboard-integration)),
  clicking a stack frame navigates to the specific line of code.
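
The ``wait``/``warmup``/``active``/``repeat`` cycle described above can be modelled in a few lines of plain Python. This is a simplified sketch for illustration, not the code that ``torch.profiler.schedule`` runs: the function name ``profiler_phase`` and its phase labels are hypothetical, and the real schedule returns ``ProfilerAction`` values rather than strings.

```python
def profiler_phase(step, wait=1, warmup=1, active=3, repeat=1):
    """Return the phase that a wait/warmup/active schedule assigns to a zero-based step."""
    cycle_len = wait + warmup + active
    # Once `repeat` cycles have finished, the profiler stays idle.
    # repeat=0 means the cycles continue until profiling ends.
    if repeat > 0 and step >= repeat * cycle_len:
        return "idle"
    pos = step % cycle_len
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    # The last active step of a cycle is where the trace is handed to on_trace_ready.
    return "active (trace saved)" if pos == cycle_len - 1 else "active"


phases = [profiler_phase(step) for step in range(6)]
for step, phase in enumerate(phases):
    print(step, phase)
```

With the values used in this tutorial, one cycle covers ``1 + 1 + 3 = 5`` steps, which is why the training loop stops once ``step`` reaches 5.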



In [6]:
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/mobilenet_v2'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # Need to call this at each step to notify profiler of steps' boundary.
        if step >= 1 + 1 + 3:
            break
        train(batch_data)

STAGE:2024-02-14 02:23:43 1804:1804 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-02-14 02:23:43 1804:1804 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-02-14 02:23:43 1804:1804 ActivityProfilerController.cpp:322] Completed Stage: Post Processing


Alternatively, the profiler can be started and stopped explicitly, without a context manager.



In [None]:
# prof = torch.profiler.profile(
#         schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
#         on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/mobilenet_v2'),
#         record_shapes=True,
#         with_stack=True)
# prof.start()
# for step, batch_data in enumerate(train_loader):
#     prof.step()
#     if step >= 1 + 1 + 3:
#         break
#     train(batch_data)
# prof.stop()
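
The recorded events can also be summarized in the console, which is handy as a quick check before opening TensorBoard. The sketch below is an illustration rather than part of the original notebook: it profiles a small hypothetical CPU-only workload (``layer`` and ``x`` are stand-ins for the model and batch) so that it runs without a GPU, and it omits the schedule and trace handler used above.

```python
import torch
import torch.profiler

# Hypothetical stand-in workload on the CPU so the sketch runs without a GPU.
layer = torch.nn.Linear(256, 256)
x = torch.randn(64, 256)

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU],
        record_shapes=True,
) as prof:
    for _ in range(3):
        layer(x)

# Aggregate events by operator name and show the five most expensive ones.
summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(summary)
```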

### 3. Run the profiler

Run the above code. The profiling results are saved in the ``./log/mobilenet_v2`` directory.
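
To check that the run produced output, the log directory can be inspected with the standard library alone. The sketch below rests on two assumptions that are not stated in this tutorial: that the trace handler writes files whose names end in ``.pt.trace.json`` (optionally gzip-compressed), and that each file follows the Chrome trace event layout with a top-level ``traceEvents`` list. The helper names ``find_traces`` and ``top_events`` are hypothetical.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path


def find_traces(log_dir="./log/mobilenet_v2"):
    """Return trace files in log_dir, sorted by file name (empty if the directory is missing)."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    return sorted(root.glob("*.pt.trace.json*"))


def top_events(trace_path, limit=5):
    """Sum the duration of complete events ('ph' == 'X') by name, largest first."""
    path = Path(trace_path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":
            totals[event.get("name", "?")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


for trace_file in find_traces():
    print(trace_file.name)
    for name, total_us in top_events(trace_file):
        print(f"  {total_us:12.1f} us  {name}")
```

If nothing is printed, the directory is empty or missing, which usually means the profiler never reached the end of an active cycle.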



### 4. Use TensorBoard to view results and analyze model performance

<div class="alert alert-info"><h4>Note</h4><p>TensorBoard Plugin support has been deprecated, so some of these functions may not
    work as previously. Please take a look at the replacement, <a href="https://github.com/pytorch/kineto/tree/main#holistic-trace-analysis">HTA</a>.</p></div>

Install the PyTorch Profiler TensorBoard plugin.

```
pip install torch_tb_profiler
```


Launch TensorBoard.

```
tensorboard --logdir=./log
```


Open the TensorBoard profile URL in Google Chrome or Microsoft Edge.

```
http://localhost:6006/#pytorch_profiler
```


You should see the Profiler plugin page, which offers the following views.

- Overview
The overview shows a high-level summary of model performance.

The "GPU Summary" panel shows the GPU configuration, GPU usage, and Tensor Core usage.
In this example, the GPU utilization is low.
These metrics are described in detail [here](https://github.com/pytorch/kineto/blob/main/tb_plugin/docs/gpu_utilization.md).

The "Step Time Breakdown" shows how the time spent in each step is distributed over different categories of execution.
In this example, you can see that the ``DataLoader`` overhead is significant.

The "Performance Recommendation" panel at the bottom uses the profiling data
to automatically highlight likely bottlenecks
and gives you actionable optimization suggestions.

You can switch views using the "Views" dropdown list on the left.

- Operator view
The operator view displays the performance of every PyTorch operator
that is executed either on the host or device.

- Kernel view
The GPU kernel view shows the time each kernel spends on the GPU.

Tensor Cores Used:
Whether this kernel uses Tensor Cores.

Mean Blocks per SM:
Blocks per SM = number of blocks of this kernel / number of SMs on this GPU.
If this number is less than 1, the GPU multiprocessors are not fully utilized.
"Mean Blocks per SM" is the weighted average over all runs of this kernel name, using each run's duration as the weight.

Mean Est. Achieved Occupancy:
Est. Achieved Occupancy is defined in this column's tooltip.
In most cases, such as memory-bandwidth-bound kernels, higher is better.
"Mean Est. Achieved Occupancy" is the weighted average over all runs of this kernel name,
using each run's duration as the weight.
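
The two "Mean" columns use the same duration-weighted average. The sketch below works through the arithmetic for "Mean Blocks per SM" with made-up numbers: the SM count, block counts, and durations are hypothetical, and the function names are illustrative rather than taken from the plugin.

```python
def blocks_per_sm(num_blocks, num_sms):
    """Blocks per SM for a single kernel run."""
    return num_blocks / num_sms


def duration_weighted_mean(values, durations_us):
    """Average the values, weighting each run by how long it took."""
    total = sum(durations_us)
    return sum(v * d for v, d in zip(values, durations_us)) / total


# Hypothetical GPU with 80 SMs and three runs of the same kernel:
# (blocks launched, duration in microseconds)
num_sms = 80
runs = [(40, 10.0), (160, 30.0), (80, 60.0)]

per_run = [blocks_per_sm(blocks, num_sms) for blocks, _ in runs]
mean_blocks_per_sm = duration_weighted_mean(per_run, [dur for _, dur in runs])
print(per_run, mean_blocks_per_sm)
```

Here the per-run values are 0.5, 2.0, and 1.0, and the weighted mean is (0.5 × 10 + 2.0 × 30 + 1.0 × 60) / 100 = 1.25. The short first run, which leaves half the SMs idle, contributes little because it accounts for only 10% of the kernel's total time.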

- Trace view
The trace view shows a timeline of the profiled operators and GPU kernels.
Select an item in the timeline to see its details.

You can move the graph and zoom in/out using the toolbar on the right.
The keyboard can also be used to zoom and move around inside the timeline.
The ‘w’ and ‘s’ keys zoom in and out, centered on the mouse position,
and the ‘a’ and ‘d’ keys move the timeline left and right.
You can press these keys repeatedly until you see a readable representation.




### 5. Improve performance with the help of profiler

At the bottom of the "Overview" page, the suggestion in "Performance Recommendation" hints that the bottleneck is ``DataLoader``.
The PyTorch ``DataLoader`` uses a single process by default.
You can enable multi-process data loading by setting the ``num_workers`` parameter.
See [here](https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading) for more details.

## Learn More

Take a look at the following documents to continue your learning,
and feel free to open an issue [here](https://github.com/pytorch/kineto/issues).

-  [PyTorch TensorBoard Profiler GitHub](https://github.com/pytorch/kineto/tree/master/tb_plugin)
-  [torch.profiler API](https://pytorch.org/docs/master/profiler.html)
-  [HTA](https://github.com/pytorch/kineto/tree/main#holistic-trace-analysis)

