# Fast Transformer Inference with Better Transformer

This notebook shows how to use Better Transformer for production inference with torchtext. Better Transformer is a production-ready fastpath that accelerates deployment of Transformer models with high performance on CPU and GPU. The fastpath feature works transparently for models based either directly on PyTorch core or on torchtext.

Better Transformer offers two types of acceleration:
- A native multihead attention (MHA) implementation for CPU and GPU that improves overall execution efficiency.
- Exploiting sparsity in NLP inference. Because input lengths vary, a batch may contain a large number of padding tokens whose processing can be skipped, delivering significant speedups.

source: https://pytorch.org/tutorials/beginner/bettertransformer_tutorial.html
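
To make the sparsity point concrete, the following minimal sketch pads a batch of variable-length token sequences to a common length and measures how much of the padded batch is padding. The token IDs and the helpers `pad_batch` and `padding_fraction` are illustrative assumptions, not part of torchtext; only the padding value of 1 follows this notebook.

```python
from typing import List

PAD = 1  # padding token ID, matching padding_value=1 used later in this notebook


def pad_batch(batch: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad] * (max_len - len(seq)) for seq in batch]


def padding_fraction(padded: List[List[int]], pad: int = PAD) -> float:
    """Fraction of positions in the padded batch that hold the padding token."""
    total = sum(len(seq) for seq in padded)
    n_pad = sum(tok == pad for seq in padded for tok in seq)
    return n_pad / total


# Hypothetical token IDs: two short sequences and one long one
batch = [[0, 11, 2], [0, 12, 13, 2], [0] + [20] * 98 + [2]]
padded = pad_batch(batch)
frac = padding_fraction(padded)
print(f"padded shape: {len(padded)} x {len(padded[0])}, padding fraction: {frac:.2f}")
```

In this made-up batch, 193 of the 300 positions are padding, which is the kind of work the sparsity fastpath is meant to skip.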

# Setup

We download the XLM-R model from the predefined torchtext models by following the instructions in torchtext.models.

In [1]:
import torch
import torch.nn as nn

print(f"torch version:{torch.__version__}")

torch version:2.3.1


In [2]:
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"torch cuda available: {torch.cuda.is_available()}")

torch cuda available: False


In [3]:
import torch, torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor



In [4]:
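# Pre-trained XLM-R large encoder bundle from torchtext
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
# Two-class classification head; input_dim matches the encoder's 1024-dim output
classifier_head = RobertaClassificationHead(num_classes=2, input_dim=1024)
model = xlmr_large.get_model(head=classifier_head)
# Text transform that matches the encoder
transform = xlmr_large.transform()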

### Dataset Setup

We set up two types of inputs:
- a small input batch
- a large input batch with sparsity

In [5]:
small_input_batch = [
               "Hello world",
               "How are you!"
]
big_input_batch = [
               "Hello world",
               "How are you!",
               """`Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.`

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite."""
]

Next, we select either the small or the large input batch, preprocess the inputs, and test the model:

In [6]:
input_batch = big_input_batch

# Convert the transformed batch to a tensor; 1 is the model's padding index
model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape

torch.Size([3, 2])
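
As a rough illustration of what the preprocessing step produces, the sketch below pads hypothetical token ID lists with `torch.nn.utils.rnn.pad_sequence` and derives a padding mask. It assumes that `to_tensor` pads each sequence to the longest one in the batch using `padding_value`; the token IDs are made up and the code does not call torchtext.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 1  # same padding value as in the cell above

# Hypothetical token ID sequences of different lengths
token_ids = [[0, 35378, 2], [0, 11249, 621, 398, 2]]

# Pad every sequence to the longest one; result has shape (batch, max_len)
padded = pad_sequence(
    [torch.tensor(seq, dtype=torch.long) for seq in token_ids],
    batch_first=True,
    padding_value=PAD,
)

# True where a position holds padding, False for real tokens
padding_mask = padded.eq(PAD)

print(padded.shape)
print(padding_mask)
```

A comparison of this kind is likely what shows up as `aten::eq` at the top of the profiles in this notebook, although that link is an inference from the output rather than something the notebook states.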

Finally, we set the benchmark iteration count:

In [7]:
INTERATION = 10
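
Alongside the profiler, a plain wall-clock measurement over the same iteration count can serve as a sanity check. The `time_per_iteration` helper below is a hypothetical utility, not a PyTorch API, and it is shown with a stand-in workload so that it runs on its own.

```python
import time
from typing import Callable


def time_per_iteration(fn: Callable[[], object], iterations: int) -> float:
    """Return the mean wall-clock seconds per call of fn over `iterations` calls."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


# Stand-in workload; in this notebook it would be `lambda: model(model_input)`
calls = []
mean_s = time_per_iteration(lambda: calls.append(sum(range(10_000))), iterations=10)
print(f"{len(calls)} calls, {mean_s * 1e3:.3f} ms per call")
```

When timing on a GPU, `torch.cuda.synchronize()` should be called before reading the clock, because CUDA kernels are launched asynchronously.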

## Execution

We run the model on CPU and collect profile information:
- The first run uses traditional ("slow path") execution.
- The second run enables BT fastpath execution by putting the model in inference mode with model.eval() and disabling gradient collection with torch.no_grad().

We can see an improvement (whose magnitude depends on the CPU model) when the model executes on CPU. Notice that the fastpath profile shows most of the execution time in the native TransformerEncoderLayer implementation.
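
The two switches described above can be observed on a small stand-in model. The sketch below uses a tiny `nn.TransformerEncoder` with hypothetical sizes instead of the XLM-R model, and it shows only the state that `model.eval()` and `torch.no_grad()` set up; whether the fastpath is actually taken also depends on further conditions checked inside PyTorch.

```python
import torch
import torch.nn as nn

# Tiny stand-in encoder (hypothetical sizes), not the XLM-R model from this notebook
layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, dim_feedforward=32, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.rand(2, 5, 16)  # (batch, sequence, features)

# Slow path conditions: training mode (the default) and gradients enabled
print(encoder.training, torch.is_grad_enabled())
out_train = encoder(x)

# Fastpath preconditions: inference mode and no gradient tracking
encoder.eval()
with torch.no_grad():
    print(encoder.training, torch.is_grad_enabled())
    out_eval = encoder(x)

# The first output is attached to the autograd graph, the second is not
print(out_train.requires_grad, out_eval.requires_grad)
```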

### Slow path

In [8]:
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    for i in range(INTERATION):
        output = model(model_input)
print(prof)

# Switch to inference mode so that the next run can take the fastpath
model.eval()

STAGE:2024-08-06 06:22:28 26961:5551113 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-08-06 06:22:54 26961:5551113 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-06 06:22:54 26961:5551113 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                    aten::eq         0.00%      25.000us         0.00%      25.000us      25.000us             1  
                             aten::embedding         0.00%      23.000us         0.01%       1.334ms       1.334ms             1  
                               aten::reshape         0.00%       1.000us         0.00%       3.000us       3.000us             1  
                                  aten::view         0.00%       2.000us         0.00%       2.000us       2.000us             1  
                          aten::index_select         0.01%       1.300ms         0.

RobertaModel(
  (encoder): RobertaEncoder(
    (transformer): TransformerEncoder(
      (token_embedding): Embedding(250002, 1024, padding_idx=1)
      (layers): TransformerEncoder(
        (layers): ModuleList(
          (0-23): 24 x TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
            )
            (linear1): Linear(in_features=1024, out_features=4096, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=4096, out_features=1024, bias=True)
            (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (positional_embedding): PositionalEmbedding(
        (embedding): Embe

### Fast path

In [9]:
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    with torch.no_grad():
        for i in range(INTERATION):
            output = model(model_input)
print(prof)

STAGE:2024-08-06 06:23:04 26961:5551113 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)
STAGE:2024-08-06 06:23:10 26961:5551113 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-06 06:23:10 26961:5551113 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::eq         0.00%      79.000us         0.00%      79.000us      79.000us             1  
                            aten::embedding         0.00%     222.000us         0.03%       1.494ms       1.494ms             1  
                              aten::reshape         0.00%       2.000us         0.00%     226.000us     226.000us             1  
                                 aten::view         0.00%     224.000us         0.00%     224.000us     224.000us             1  
                         aten::index_select         0.02%       1.015ms         0.02%     
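
Printing the profiler object lists individual events, which makes the slow and fast runs hard to compare at a glance. An aggregated table, sorted by self CPU time, is often easier to read. The sketch below applies `key_averages().table(...)` to a small stand-in workload (a hypothetical matrix multiply) so that it runs without the XLM-R model; in this notebook the same call could be made on `prof` after each run.

```python
import torch

# Small stand-in workload so the example runs quickly without the XLM-R model
a = torch.rand(64, 64)

with torch.autograd.profiler.profile() as prof:
    with torch.no_grad():
        for _ in range(10):
            b = torch.mm(a, a).relu()

# Aggregate events by operator name and list the most expensive ones first
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```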

### Run and benchmark inference on DEVICE with and without BT fastpath (native MHA only)

We check the BT sparsity setting:

In [10]:
model.encoder.transformer.layers.enable_nested_tensor

True

We need to disable it for this part, which measures native MHA only:

In [11]:
model.encoder.transformer.layers.enable_nested_tensor = False
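
For a model built directly on `torch.nn.TransformerEncoder`, the same setting can also be chosen when the encoder is constructed, through the `enable_nested_tensor` argument. The sketch below uses a tiny stand-in layer with hypothetical sizes; it only shows where the flag lives and does not measure anything.

```python
import torch.nn as nn

# Tiny stand-in layer with hypothetical sizes
layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, dim_feedforward=32, batch_first=True)

# Sparsity (nested tensor) support requested at construction time
with_sparsity = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=True)
# Native MHA only
without_sparsity = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=False)

print(with_sparsity.enable_nested_tensor, without_sparsity.enable_nested_tensor)
```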

We run the model on DEVICE and collect profile information for native MHA execution:
- First, we use traditional ("slow path") execution.
- Second, we enable BT fastpath execution by putting the model in inference mode with model.eval() and disabling gradient collection with torch.no_grad().

In [12]:
model.to(DEVICE)
model_input = model_input.to(DEVICE)

#### Slow path

In [13]:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(INTERATION):
        output = model(model_input)
print(prof)

# Make sure the model is in inference mode for the fast path run below
model.eval()

  warn("CUDA is not available, disabling CUDA profiling")
STAGE:2024-08-06 06:23:23 26961:5551113 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-08-06 06:23:36 26961:5551113 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-06 06:23:36 26961:5551113 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             aten::eq         0.00%      18.000us         0.00%      18.000us      18.000us             1  
                                      aten::embedding         0.00%      27.000us         0.00%     492.000us     492.000us             1  
                                        aten::reshape         0.00%       1.000us         0.00%       3.000us       3.000us             1  
                                           aten::view         0.00%       2.000us         0.00%       2.000us       2.000us             1  
                    

RobertaModel(
  (encoder): RobertaEncoder(
    (transformer): TransformerEncoder(
      (token_embedding): Embedding(250002, 1024, padding_idx=1)
      (layers): TransformerEncoder(
        (layers): ModuleList(
          (0-23): 24 x TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
            )
            (linear1): Linear(in_features=1024, out_features=4096, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=4096, out_features=1024, bias=True)
            (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (positional_embedding): PositionalEmbedding(
        (embedding): Embe

#### Fast path

In [14]:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    with torch.no_grad():
        for i in range(INTERATION):
            output = model(model_input)
print(prof)

STAGE:2024-08-06 06:23:38 26961:5551113 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-08-06 06:23:44 26961:5551113 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-06 06:23:44 26961:5551113 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::eq         0.00%      84.000us         0.00%      84.000us      84.000us             1  
                            aten::embedding         0.00%      48.000us         0.02%       1.361ms       1.361ms             1  
                              aten::reshape         0.00%       2.000us         0.00%       5.000us       5.000us             1  
                                 aten::view         0.00%       3.000us         0.00%       3.000us       3.000us             1  
                         aten::index_select         0.02%       1.258ms         0.02%     

### Run and benchmark inference on DEVICE with and without BT fastpath (native MHA + sparsity)

We enable sparsity support:

In [15]:
model.encoder.transformer.layers.enable_nested_tensor = True

#### Slow path

In [16]:
model.to(DEVICE)
model_input = model_input.to(DEVICE)

In [17]:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(INTERATION):
        output = model(model_input)
print(prof)

# Make sure the model is in inference mode for the fast path run below
model.eval()

STAGE:2024-08-06 06:23:56 26961:5551113 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-08-06 06:24:07 26961:5551113 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-06 06:24:07 26961:5551113 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             aten::eq         0.00%      20.000us         0.00%      20.000us      20.000us             1  
                                      aten::embedding         0.00%      18.000us         0.00%     236.000us     236.000us             1  
                                        aten::reshape         0.00%       2.000us         0.00%       3.000us       3.000us             1  
                                           aten::view         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                    

RobertaModel(
  (encoder): RobertaEncoder(
    (transformer): TransformerEncoder(
      (token_embedding): Embedding(250002, 1024, padding_idx=1)
      (layers): TransformerEncoder(
        (layers): ModuleList(
          (0-23): 24 x TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
            )
            (linear1): Linear(in_features=1024, out_features=4096, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=4096, out_features=1024, bias=True)
            (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (positional_embedding): PositionalEmbedding(
        (embedding): Embe

#### Fast path

In [18]:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    with torch.no_grad():
        for i in range(INTERATION):
            output = model(model_input)
print(prof)

STAGE:2024-08-06 06:24:09 26961:5551113 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-08-06 06:24:15 26961:5551113 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-08-06 06:24:15 26961:5551113 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::eq         0.00%      22.000us         0.00%      22.000us      22.000us             1  
                            aten::embedding         0.00%      19.000us         0.01%     586.000us     586.000us             1  
                              aten::reshape         0.00%       2.000us         0.00%       4.000us       4.000us             1  
                                 aten::view         0.00%       2.000us         0.00%       2.000us       2.000us             1  
                         aten::index_select         0.01%     552.000us         0.01%     
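
Because the profiler tables above are cut off, a coarse comparison can still be read from the STAGE log lines, which record when warm-up and collection finished for each run. The sketch below copies those timestamps and computes the slow-to-fast ratio of the collection interval. This is only a rough estimate: the timestamps have one-second resolution, the interval includes profiler overhead, and in this run DEVICE resolved to the CPU. The run labels are informal names chosen here.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (end of warm-up, end of collection) timestamps copied from the STAGE log lines above
runs = {
    "CPU, first section": {
        "slow": ("2024-08-06 06:22:28", "2024-08-06 06:22:54"),
        "fast": ("2024-08-06 06:23:04", "2024-08-06 06:23:10"),
    },
    "DEVICE, native MHA only": {
        "slow": ("2024-08-06 06:23:23", "2024-08-06 06:23:36"),
        "fast": ("2024-08-06 06:23:38", "2024-08-06 06:23:44"),
    },
    "DEVICE, sparsity enabled": {
        "slow": ("2024-08-06 06:23:56", "2024-08-06 06:24:07"),
        "fast": ("2024-08-06 06:24:09", "2024-08-06 06:24:15"),
    },
}


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two timestamps in FMT."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


ratios = {}
for name, paths in runs.items():
    slow_s = seconds_between(*paths["slow"])
    fast_s = seconds_between(*paths["fast"])
    ratios[name] = slow_s / fast_s
    print(f"{name}: slow {slow_s:.0f} s, fast {fast_s:.0f} s, ratio {ratios[name]:.1f}x")
```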