<a href="https://colab.research.google.com/github/chihina/pytorch-tutorial/blob/master/7_1_profiliong_module.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial 7-1: Profiling your PyTorch Module**  
https://pytorch.org/tutorials/beginner/profiler.html

# Overview
In this tutorial, we use the PyTorch profiler API.  
It lets us measure the time and memory cost of various PyTorch operations.  

In [1]:
import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler

# Performance debugging using Profiler

The profiler is useful for identifying performance bottlenecks in our models.  

We wrap the code for each sub-task in a separate labelled context manager using `profiler.record_function("label")`. Each label then appears as its own row in the profiler output.

In [2]:
class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)
            hi_idx = torch.from_numpy(hi_idx).cuda()

        return out, hi_idx
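
The module above needs a CUDA device. As a minimal sketch of the same labelling idea that also runs on a CPU-only machine, the block below wraps two small operations in `record_function` and checks that the labels show up among the aggregated events. The label names `MATMUL BLOCK` and `SUM BLOCK` and the tensor size are illustrative choices, not part of the tutorial.

```python
import torch
import torch.autograd.profiler as profiler

x = torch.rand(64, 64)  # small CPU tensor, size chosen only for illustration

with profiler.profile() as prof:
    # Each labelled block is reported as its own event
    with profiler.record_function("MATMUL BLOCK"):
        y = x @ x
    with profiler.record_function("SUM BLOCK"):
        s = y.sum()

# Collect the names of all aggregated events, labels included
names = {evt.key for evt in prof.key_averages()}
print("MATMUL BLOCK" in names, "SUM BLOCK" in names)
```

The labels sit alongside the underlying `aten::` operators in the results, which makes it easier to attribute cost to a specific part of `forward`.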

# Profile the forward pass
Before we run the profiler, we warm up CUDA with one un-profiled forward pass, so that one-time initialization costs do not distort the measurements.  

We wrap the forward pass of our module in the `profiler.profile` context manager. The `with_stack=True` parameter appends the file and line number of each operation to the trace, and `profile_memory=True` records memory usage.

In [3]:
model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

# Print profiler results

`prof.key_averages()` aggregates the results by operator name, and optionally by input shapes and/or stack trace events.

In [4]:
print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  Source Location                                                              
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
                 MASK INDICES        81.17%        4.043s        99.93%        4.978s        4.978s          -4 b    -953.67 Mb       2.79 Gb      -1.00 Kb             1  /usr/local/lib/python3.6/dist-packages/torch/autograd/profiler.py(503): __e  
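
The cell above groups by stack trace. As a hedged sketch of the other optional grouping, by input shape, the block below profiles a small CPU layer with two different batch sizes. It assumes `record_shapes=True` is passed to the profiler so that shapes are available; the layer and tensor sizes are illustrative only.

```python
import torch
import torch.autograd.profiler as profiler

linear = torch.nn.Linear(16, 4)  # tiny CPU layer, for illustration only

# record_shapes=True is needed for shape-based grouping
with profiler.profile(record_shapes=True) as prof:
    linear(torch.rand(8, 16))
    linear(torch.rand(32, 16))

by_name = prof.key_averages()                             # one row per operator name
by_shape = prof.key_averages(group_by_input_shape=True)   # rows split by input shapes

table = by_shape.table(sort_by="self_cpu_time_total", row_limit=5)
print(table)
print(len(by_name), len(by_shape))
```

Because the two calls use different input shapes, the shape-grouped view can list the same operator on separate rows, which helps when one shape is much more expensive than another.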
   

# Improve memory performance

Let’s tackle the memory consumption first. The profiler reports that the `.to()` operation at line 12 consumes 953.67 Mb.  
This operation copies `mask` to the CPU. `mask` is initialized with the `torch.double` datatype, which uses 8 bytes per element. Can we reduce the memory footprint by casting it to `torch.float` (4 bytes per element) instead?
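
The reported figure can be checked with a little arithmetic. The sketch below assumes the profiler's "Mb" means 2^20 bytes, which is consistent with the numbers shown; `tensor_mib` is a hypothetical helper written for this illustration.

```python
def tensor_mib(shape, bytes_per_element):
    """Size of a dense tensor in units of 2**20 bytes (hypothetical helper)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**20

shape = (500, 500, 500)
double_mib = tensor_mib(shape, 8)  # torch.double: 8 bytes per element
float_mib = tensor_mib(shape, 4)   # torch.float: 4 bytes per element

print(f"double: {double_mib:.2f}")  # double: 953.67
print(f"float:  {float_mib:.2f}")   # float:  476.84
```

So a `torch.float` mask of the same shape should need half the memory for the CPU copy.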

In [5]:
model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  Source Location                                                              
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
                 MASK INDICES        78.97%        1.991s        99.96%        2.520s        2.520s          -4 b    -476.84 Mb       1.60 Gb      -1.00 Kb             1  /usr/local/lib/python3.6/dist-packages/torch/autograd/profiler.py(503): __e  
   

# Improve time performance
It turns out that copying a tensor between CUDA and the CPU is pretty expensive!  

The `aten::copy_` operator at `forward` (line 12) copies `mask` to the CPU so that it can use the NumPy `argwhere` function.  

The `aten::copy_` operator at `forward` (line 13) copies the resulting array back to CUDA as a tensor. We can eliminate both copies by using the torch function `nonzero()` instead.
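
As a small CPU-only sanity check, the sketch below compares the NumPy route with `nonzero()` on a tiny tensor. The shape, seed, and threshold of 0.5 are illustrative assumptions, not values from the tutorial.

```python
import numpy as np
import torch

torch.manual_seed(0)
mask = torch.rand(4, 5, 6)  # tiny stand-in for the (500, 500, 500) mask
threshold = 0.5             # illustrative value

# NumPy route: one row of (i, j, k) indices per selected element
idx_np = np.argwhere(mask.numpy() > threshold)

# torch route, same row layout as np.argwhere
idx_rows = torch.nonzero(mask > threshold)

# torch route with as_tuple=True: one 1-D index tensor per dimension
idx_tuple = (mask > threshold).nonzero(as_tuple=True)

same_rows = np.array_equal(idx_np, idx_rows.numpy())
same_tuple = torch.equal(torch.stack(idx_tuple, dim=1), idx_rows)
print(same_rows, same_tuple)
```

Note that `nonzero(as_tuple=True)` returns a tuple of per-dimension index tensors rather than a single `(N, 3)` tensor, so code that consumes `hi_idx` may need a matching adjustment.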



In [7]:
class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean()
            hi_idx = (mask > threshold).nonzero(as_tuple=True)

        return out, hi_idx


model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  Source Location                                                              
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
          aten::nonzero        78.78%       6.770ms        81.07%       6.967ms       6.967ms           0 b           0 b           0 b           0 b             1  <ipython-input-7-c7eb287ee4a0>(12): forward                                  
                           

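# Conclusion
We can use the profiler to find and reduce the memory and time costs of a module.  
In this example, casting `mask` from `torch.double` to `torch.float` halved the CPU memory used by the copy (953.67 Mb to 476.84 Mb), and replacing the NumPy round trip with `nonzero()` removed the copies between CUDA and the CPU.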