#### Fast Transformer Inference with Better Transformer
  
PyTorch 1.12 introduced "Better Transformer" (BT), a production-ready fastpath that accelerates deployment of Transformer models on both CPU and GPU. The fastpath works transparently for models built either directly on PyTorch core nn.Module classes or with torchtext.

In [1]:
import torch
import torch.nn as nn

print( f"torch version: {torch.__version__}" )

DEVICE = torch.device( "cuda" ) if torch.cuda.is_available() else torch.device( "cpu" )

# is_available must be called; printing the bare function only shows its repr
print( f"torch cuda available: {torch.cuda.is_available()}" )

import torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor

# XLM-R large encoder with a two-class classification head
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
classifier_head = RobertaClassificationHead( num_classes = 2, input_dim = 1024 )
model = xlmr_large.get_model( head = classifier_head )
transform = xlmr_large.transform()

torch version: 2.2.0+cu118
torch cuda available: True


Downloading: "https://download.pytorch.org/models/text/xlmr.large.encoder.pt" to /home/xamanek/.cache/torch/hub/checkpoints/xlmr.large.encoder.pt
100%|██████████| 2.08G/2.08G [03:20<00:00, 11.1MB/s]
100%|██████████| 5.07M/5.07M [00:00<00:00, 7.34MB/s]
Downloading: "https://download.pytorch.org/models/text/xlmr.vocab.pt" to /home/xamanek/.cache/torch/hub/checkpoints/xlmr.vocab.pt
100%|██████████| 4.85M/4.85M [00:00<00:00, 8.34MB/s]


#### Dataset setup  
Set up two types of inputs: a small input batch, and a big input batch with sparsity (its sequences differ widely in length, so the padded batch consists mostly of padding tokens).

In [2]:
small_input_batch = [
               "Hello world",
               "How are you!"
]

big_input_batch = [
               "Hello world",
               "How are you!",
               """`Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.`

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite."""
]
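To make "sparsity" concrete, the following minimal sketch measures how much of a padded batch is padding. It does not use the notebook's tokenizer: the sequence lengths (4, 6 and 200) and the token id 5 are made-up stand-ins, and `padding_fraction` is a hypothetical helper, not a PyTorch or torchtext function. Only the padding id 1 is taken from the notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PADDING_VALUE = 1  # same padding id that the notebook passes to to_tensor()

def padding_fraction(padded_ids, padding_value=PADDING_VALUE):
    """Return the fraction of positions in a padded batch that hold padding."""
    return (padded_ids == padding_value).float().mean().item()

# Hypothetical token-id sequences of lengths 4, 6 and 200; the id 5 is arbitrary.
sequences = [torch.full((n,), 5, dtype=torch.long) for n in (4, 6, 200)]

# Pad every sequence to the length of the longest one, as a dense batch requires.
padded = pad_sequence(sequences, batch_first=True, padding_value=PADDING_VALUE)

print(padded.shape)
print(f"padding fraction: {padding_fraction(padded):.2f}")
```

With one long and two short sequences, most positions of the dense batch carry no information, which is the situation the big input batch is meant to reproduce.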

Select one of the input batches, preprocess it, and test the model:

In [4]:
input_batch = big_input_batch

model_input = to_tensor( transform( input_batch ), padding_value = 1 )
output = model( model_input )
output.shape

torch.Size([3, 2])

Set the benchmark iteration count

In [5]:
ITERATIONS = 10
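The profiler tables further down are detailed but long. As a complement, a plain wall-clock measurement over the same number of iterations gives a single latency figure. The sketch below is an assumption about how one might do this, not part of the original notebook: `time_inference` is a hypothetical helper, and the small linear model only stands in for `model` and `model_input`. Note that it always runs under `torch.no_grad()`.

```python
import time
import torch
import torch.nn as nn

def time_inference(model, model_input, iterations=10, warmup=2):
    """Return mean seconds per forward pass, measured with a wall clock."""
    on_cuda = model_input.is_cuda
    with torch.no_grad():
        for _ in range(warmup):       # warm-up runs are not timed
            model(model_input)
        if on_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before starting the clock
        start = time.perf_counter()
        for _ in range(iterations):
            model(model_input)
        if on_cuda:
            torch.cuda.synchronize()  # wait for the timed GPU work to finish
        elapsed = time.perf_counter() - start
    return elapsed / iterations

# Stand-in model and input so the sketch runs on its own.
toy_model = nn.Linear(16, 2).eval()
toy_input = torch.randn(3, 16)

seconds = time_inference(toy_model, toy_input, iterations=10)
print(f"mean latency: {seconds * 1e3:.3f} ms")
```

The explicit synchronization matters on a GPU because CUDA kernels are launched asynchronously; without it the clock would mostly measure launch overhead.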

#### Execution  
  
Run and benchmark inference on CPU with and without the BT fastpath (native multi-head attention (MHA) only):

In [6]:
# Slow path: the model is still in training mode and gradients are enabled
print( "slow path: " )
print( "===========" )
with torch.autograd.profiler.profile( use_cuda = False ) as prof: 
  for i in range( ITERATIONS ):
    output = model( model_input )
print( prof )

# Switch to inference mode; together with torch.no_grad() this enables the fastpath
model.eval()

print( "fast path: " )
print( "===========" )
with torch.autograd.profiler.profile( use_cuda = False ) as prof:
  with torch.no_grad():
    for i in range( ITERATIONS ):
      output = model( model_input )
print( prof )

slow path: 


STAGE:2024-02-15 10:04:25 10075:10075 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-15 10:04:38 10075:10075 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-15 10:04:38 10075:10075 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                    aten::eq         0.00%      28.000us         0.00%      28.000us      28.000us             1  
                             aten::embedding         0.00%     228.000us         0.01%     814.000us     814.000us             1  
                               aten::reshape         0.00%       2.000us         0.00%       5.000us       5.000us             1  
                                  aten::view         0.00%       3.000us         0.00%       3.000us       3.000us             1  
                          aten::index_select         0.00%     564.000us         0.

STAGE:2024-02-15 10:04:39 10075:10075 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)
STAGE:2024-02-15 10:04:46 10075:10075 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-15 10:04:46 10075:10075 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::eq         0.00%      30.000us         0.00%      30.000us      30.000us             1  
                            aten::embedding         0.00%      10.000us         0.00%     256.000us     256.000us             1  
                              aten::reshape         0.00%       2.000us         0.00%       4.000us       4.000us             1  
                                 aten::view         0.00%       2.000us         0.00%       2.000us       2.000us             1  
                         aten::index_select         0.00%     232.000us         0.00%     
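Printing the raw profile lists every recorded event, which makes the two runs hard to compare by eye. One option is to aggregate the events by operator name and show only the most expensive ones, using the profiler's `key_averages()` and `table()` methods. The sketch below uses a small stand-in model so that it runs on its own; with the notebook's objects, `model` and `model_input` would take the place of `toy_model` and `toy_input`.

```python
import torch
import torch.nn as nn

# Stand-in model and input; the notebook's `model` and `model_input` would be used instead.
toy_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2)).eval()
toy_input = torch.randn(8, 32)

with torch.autograd.profiler.profile() as prof:
    with torch.no_grad():
        for _ in range(10):
            toy_model(toy_input)

# Aggregate events by operator name and keep the ten with the highest self CPU time.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(summary)
```

The resulting table has the same columns as the output above, but one row per operator, so the slow-path and fast-path runs can be compared line by line.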

Check the BT sparsity setting:

In [7]:
model.encoder.transformer.layers.enable_nested_tensor

True

Disable BT fastpath sparsity support:

In [8]:
model.encoder.transformer.layers.enable_nested_tensor = False
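For a plain `nn.TransformerEncoder`, the same setting is exposed as the documented `enable_nested_tensor` constructor argument. The sketch below builds two small encoders with identical weights, one with the option on and one with it off, and runs both on a padded batch. The dimensions, the padding mask and the helper `make_encoder` are illustrative assumptions. Whether reassigning the attribute after construction has the same effect as the constructor argument may depend on the PyTorch version, so that is worth verifying for the version in use.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_encoder(enable_nested_tensor):
    """Build a small two-layer encoder (hypothetical sizes, for illustration only)."""
    layer = nn.TransformerEncoderLayer(
        d_model=32, nhead=4, dim_feedforward=64, batch_first=True
    )
    return nn.TransformerEncoder(
        layer, num_layers=2, enable_nested_tensor=enable_nested_tensor
    )

dense = make_encoder(enable_nested_tensor=False).eval()
nested = make_encoder(enable_nested_tensor=True).eval()
nested.load_state_dict(dense.state_dict())  # give both encoders the same weights

src = torch.randn(2, 6, 32)
# True marks padding: the second sequence has only two real tokens.
padding_mask = torch.tensor([
    [False, False, False, False, False, False],
    [False, False, True, True, True, True],
])

with torch.no_grad():
    out_dense = dense(src, src_key_padding_mask=padding_mask)
    out_nested = nested(src, src_key_padding_mask=padding_mask)

# Compare only the positions that hold real tokens.
real = ~padding_mask
max_diff = (out_dense[real] - out_nested[real]).abs().max().item()
print(out_dense.shape, out_nested.shape)
print(f"max difference on real tokens: {max_diff:.2e}")
```

Outputs at padded positions are not compared, because the two paths are free to fill them differently.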

Run the model on DEVICE and collect profile information for native MHA execution on DEVICE:  
  
1. The first run uses traditional ("slow path") execution.  
  
2. The second run enables BT fastpath execution by putting the model in inference mode with model.eval()  
and disabling gradient collection with torch.no_grad().  
  
When executing on a GPU, you should see a significant speedup, in particular for the small  
input batch setting.  
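The two conditions named above, eval mode and disabled gradient collection, can be checked directly before a benchmark run. The helper below is a hypothetical convenience function, not a PyTorch API, and it covers only these two conditions; PyTorch applies further checks before it actually takes the fastpath.

```python
import torch
import torch.nn as nn

def inference_mode_ready(model):
    """True when the model is in eval mode and autograd is currently disabled."""
    return (not model.training) and (not torch.is_grad_enabled())

# Small stand-in encoder; sizes are arbitrary.
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

states = []
states.append(inference_mode_ready(encoder))      # training mode, gradients enabled
encoder.eval()
states.append(inference_mode_ready(encoder))      # eval mode, gradients still enabled
with torch.no_grad():
    states.append(inference_mode_ready(encoder))  # eval mode, gradients disabled

print(states)  # [False, False, True]
```

Only the last state corresponds to the "fast path" runs in this notebook, which is why each of them combines model.eval() with torch.no_grad().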

In [9]:
model.to(DEVICE)
model_input = model_input.to(DEVICE)

print( "slow path: " )
print( "===========" )
with torch.autograd.profiler.profile( use_cuda = True ) as prof: 
  for i in range( ITERATIONS ):
    output = model( model_input )
print( prof )

model.eval()

print( "fast path: " )
print( "===========" )
with torch.autograd.profiler.profile( use_cuda = True ) as prof:
  with torch.no_grad():
    for i in range( ITERATIONS ):
      output = model( model_input )
print( prof )

slow path: 


STAGE:2024-02-15 10:08:50 10075:10075 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-15 10:08:51 10075:10075 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-15 10:08:51 10075:10075 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        cudaEventRecord         0.00%       9.000us         0.00%       9.000us       9.000us       0.000us         0.00%       0.000us       0.000us             1  
                                               aten::eq         0.14%       1.418ms         0.82%       8.392ms       8.392ms       8.337ms         0.77%       8.337ms       8.337ms             1  
         

STAGE:2024-02-15 10:08:54 10075:10075 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-15 10:08:55 10075:10075 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-15 10:08:55 10075:10075 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        cudaEventRecord         0.00%       9.000us         0.00%       9.000us       9.000us       0.000us         0.00%       0.000us       0.000us             1  
                                               aten::eq         0.01%      44.000us         0.01%      59.000us      59.000us       3.000us         0.00%       3.000us       3.000us             1  
         

Run and benchmark inference on the (configurable) DEVICE with and without the BT fastpath (native MHA with sparsity support)  
  
Enable sparsity support:

In [10]:
model.encoder.transformer.layers.enable_nested_tensor = True

Run the model on DEVICE and collect profile information for native MHA execution with sparsity support on DEVICE:  
  
1. The first run uses traditional ("slow path") execution.  
  
2. The second run enables BT fastpath execution by putting the model in inference mode with model.eval()  
and disabling gradient collection with torch.no_grad().  
  
When executing on a GPU, you should see a significant speedup, in particular for the large input batch setting,  
which includes sparsity.
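Sparsity support relies on nested tensors, which store variable-length sequences without padding. The following sketch uses the `torch.nested` API (still marked as a prototype in PyTorch, so it may emit a warning) to show how many elements padding adds for two sequences of made-up lengths 2 and 5.

```python
import torch

# Two hypothetical sequences of 2 and 5 tokens with 4 features each.
short_seq = torch.randn(2, 4)
long_seq = torch.randn(5, 4)

# A nested tensor keeps each sequence at its own length.
nt = torch.nested.nested_tensor([short_seq, long_seq])

# Converting back to a dense tensor pads the short sequence with zeros.
padded = torch.nested.to_padded_tensor(nt, 0.0)

stored = short_seq.numel() + long_seq.numel()
print(nt.is_nested, padded.shape)
print(f"elements stored: {stored}, elements after padding: {padded.numel()}")
```

The gap between the two element counts grows with the spread in sequence lengths, which is why the big input batch is expected to benefit most.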

In [11]:
model.to( DEVICE )
model_input = model_input.to( DEVICE )

print( "slow path: " )
print( "===========" )
with torch.autograd.profiler.profile( use_cuda = True ) as prof: 
  for i in range( ITERATIONS ):
    output = model( model_input )
print( prof )

model.eval()

print( "fast path: " )
print( "===========" )
with torch.autograd.profiler.profile( use_cuda = True ) as prof:
  with torch.no_grad():
    for i in range( ITERATIONS ):
      output = model( model_input )
print( prof )

slow path: 


STAGE:2024-02-15 10:11:04 10075:10075 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-15 10:11:05 10075:10075 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-15 10:11:05 10075:10075 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        cudaEventRecord         0.89%       8.387ms         0.89%       8.387ms       8.387ms       0.000us         0.00%       0.000us       0.000us             1  
                                               aten::eq         0.01%     135.000us         0.02%     150.000us     150.000us      81.000us         0.01%      81.000us      81.000us             1  
         

STAGE:2024-02-15 10:11:08 10075:10075 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-15 10:11:09 10075:10075 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-15 10:11:09 10075:10075 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        cudaEventRecord         0.00%       8.000us         0.00%       8.000us       8.000us       0.000us         0.00%       0.000us       0.000us             1  
                                               aten::eq         0.01%      43.000us         0.01%      57.000us      57.000us       4.000us         0.00%       4.000us       4.000us             1  
         