# Fast Transformer Inference with Better Transformer

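This tutorial introduces Better Transformer (BT), released as part of PyTorch 1.12. We show how to use Better Transformer for production inference together with torchtext. Better Transformer is a production-ready fast path that accelerates the deployment of Transformer models with high performance on CPU and GPU. The fast path works transparently for models built directly on PyTorch core `nn.Module` classes and for models that use torchtext.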

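Models that can be accelerated by Better Transformer fast path execution are those using the following PyTorch core `torch.nn.Module` classes: `TransformerEncoder`, `TransformerEncoderLayer`, and `MultiheadAttention`. In addition, torchtext has been updated to use the core library modules so that it benefits from fast path acceleration. (Fast path execution may be enabled for additional modules in the future.)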

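Better Transformer offers two types of acceleration:

- A native multi-head attention (MHA) implementation for CPU and GPU that improves overall execution efficiency.
- Exploiting sparsity in NLP inference. Because input sequences vary in length, a padded batch may contain a large number of padding tokens; skipping their processing can yield significant speedups.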
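
To get a feel for how much padding a batch with variable-length sequences can contain, the following minimal sketch pads a batch of token-id lists and reports the share of padding positions. The helper names (`pad_batch`, `padding_fraction`) and the token ids are hypothetical; `pad_id=1` simply mirrors the `padding_value=1` used when this tutorial builds its model input.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad every token-id list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def padding_fraction(sequences):
    """Share of positions in the padded batch that hold padding tokens."""
    max_len = max(len(seq) for seq in sequences)
    total = max_len * len(sequences)
    real = sum(len(seq) for seq in sequences)
    return (total - real) / total


# Hypothetical token ids: two short sequences and one long one (lengths 3, 4, 100).
batch = [[0, 35378, 2], [0, 11249, 621, 2], [0] + [100] * 98 + [2]]
padded = pad_batch(batch)
print(len(padded[0]), round(padding_fraction(batch), 3))  # 100 0.643
```

In this made-up batch almost two thirds of the positions are padding, which is the kind of work the sparsity optimization is designed to skip.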

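Fast path execution is subject to some conditions. Most importantly, the model must be executed in inference mode and must operate on input tensors that do not collect gradient information (for example, by running under `torch.no_grad()`).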
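
A minimal sketch of the two conditions named above, using a small randomly initialized `nn.TransformerEncoder` rather than a pretrained model. The layer sizes, `batch_first=True`, and the variable names are assumptions made for illustration; whether the fast path is actually selected also depends on further conditions and on the PyTorch version.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, randomly initialized encoder; the sizes are arbitrary illustration values.
layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Batch of 2 sequences padded to length 5; True marks a padding position.
x = torch.randn(2, 5, 64)
padding_mask = torch.tensor([
    [False, False, False, False, False],
    [False, False, True, True, True],
])

encoder.eval()             # condition 1: inference mode (dropout disabled)
with torch.no_grad():      # condition 2: no gradient information is collected
    out = encoder(x, src_key_padding_mask=padding_mask)

print(out.shape, out.requires_grad)
```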

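## Better Transformer Features in This Tutorial

- **Load pretrained models (created before PyTorch 1.12, without Better Transformer)**
- **Run and benchmark inference on CPU with and without the BT fast path (native MHA only)**
- **Run and benchmark inference on a (configurable) device with and without the BT fast path (native MHA only)**
- **Enable sparsity support**
- **Run and benchmark inference on a (configurable) device with and without the BT fast path (native MHA + sparsity)**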

## Additional Information

**Additional information about Better Transformer can be found in the PyTorch.Org blog post [A Better Transformer for Fast Transformer Inference](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference//).**

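## 1. **Setup**

### 1.1 Load pretrained models

**We download the XLM-R model from the predefined torchtext models by following the instructions in [torchtext.models](https://pytorch.org/text/main/models.html). We also set the DEVICE on which the accelerator tests run. (Enable GPU execution for your environment as appropriate.)**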

In [1]:
import torch
import torch.nn as nn

print(f"torch version: {torch.__version__}")

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"torch cuda available: {torch.cuda.is_available()}")

import torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor

# Pretrained XLM-R large encoder bundle with a 2-class classification head
xlmr_large = torchtext.models.XLMR_LARGE_ENCODER
classifier_head = RobertaClassificationHead(num_classes=2, input_dim=1024)
model = xlmr_large.get_model(head=classifier_head)
# Text-to-token-id transform that matches the pretrained model
transform = xlmr_large.transform()

torch version: 2.1.2+cu121
torch cuda available: True


Downloading: "https://download.pytorch.org/models/text/xlmr.large.encoder.pt" to /home/richardliu/.cache/torch/hub/checkpoints/xlmr.large.encoder.pt
100%|██████████| 2.08G/2.08G [12:57<00:00, 2.87MB/s] 
100%|██████████| 5.07M/5.07M [00:03<00:00, 1.55MB/s]
Downloading: "https://download.pytorch.org/models/text/xlmr.vocab.pt" to /home/richardliu/.cache/torch/hub/checkpoints/xlmr.vocab.pt
100%|██████████| 4.85M/4.85M [00:03<00:00, 1.67MB/s]


### 1.2 Dataset setup

**We set up two types of inputs: a small input batch and a big input batch with sparsity.**

In [2]:
small_input_batch = [
               "Hello world",
               "How are you!"
]
big_input_batch = [
               "Hello world",
               "How are you!",
               """`Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.`

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite."""
]

Next, we select either the small or the big input batch, preprocess the inputs, and test the model:

In [3]:
input_batch=big_input_batch

model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape

torch.Size([3, 2])

Finally, we set the benchmark iteration count:

In [4]:
ITERATIONS=10
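
The runs in this tutorial use the PyTorch profiler, which reports per-operator times. When a single end-to-end number is enough, a plain wall-clock loop is a common alternative. The following is a minimal sketch under that assumption; `benchmark`, `warmup`, and `sync` are hypothetical names that do not appear in the tutorial, and the workload is a stand-in for `model(model_input)`.

```python
import time


def benchmark(fn, iterations=10, warmup=1, sync=None):
    """Return the mean wall-clock seconds per call of fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    if sync is not None:
        sync()  # e.g. torch.cuda.synchronize, so queued GPU work is counted
    return (time.perf_counter() - start) / iterations


# Stand-in workload; with the tutorial's objects this could be
# benchmark(lambda: model(model_input), iterations=ITERATIONS).
mean_seconds = benchmark(lambda: sum(range(10_000)), iterations=10)
print(f"{mean_seconds * 1e6:.1f} us per call")
```

On a GPU, kernels are launched asynchronously, so passing a synchronization callable as `sync` keeps the measured interval from ending before the device has finished.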

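## 2. Execution

### 2.1 **Run and benchmark inference on CPU with and without the BT fast path (native MHA only)**

**We run the model on CPU and collect profile information:**

- **The first run uses traditional ("slow path") execution.**
- **The second run enables BT fast path execution by putting the model in inference mode with `model.eval()` and disabling gradient collection with `torch.no_grad()`.**

**When the model executes on CPU, you should see an improvement whose size depends on the CPU model. Note that the fast path profile shows most of the execution time in the native TransformerEncoderLayer implementation `aten::_transformer_encoder_layer_fwd`.**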

In [5]:
print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=False) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

slow path:


STAGE:2024-01-23 14:03:16 766285:766285 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-23 14:03:22 766285:766285 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-23 14:03:22 766285:766285 ActivityProfilerController.cpp:322] Completed Stage: Post Processing


--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                    aten::eq         0.00%      21.000us         0.00%      21.000us      21.000us             1  
                             aten::embedding         0.00%      17.000us         0.00%     191.000us     191.000us             1  
                               aten::reshape         0.00%       2.000us         0.00%       4.000us       4.000us             1  
                                  aten::view         0.00%       2.000us         0.00%       2.000us       2.000us             1  
                          aten::index_select         0.00%     161.000us         0.

STAGE:2024-01-23 14:03:23 766285:766285 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)
STAGE:2024-01-23 14:03:25 766285:766285 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-23 14:03:25 766285:766285 ActivityProfilerController.cpp:322] Completed Stage: Post Processing


-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::eq         0.00%      23.000us         0.00%      23.000us      23.000us             1  
                            aten::embedding         0.00%       8.000us         0.01%     180.000us     180.000us             1  
                              aten::reshape         0.00%       2.000us         0.00%       4.000us       4.000us             1  
                                 aten::view         0.00%       2.000us         0.00%       2.000us       2.000us             1  
                         aten::index_select         0.01%     160.000us         0.01%     
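
Printing the profiler object lists every recorded event, which gets long quickly. A more compact view aggregates events by operator name and keeps only the most expensive rows. The sketch below shows this on a tiny stand-in model; `tiny_model`, `tiny_input`, and the row limit are illustrative assumptions, and the same calls could be applied to the `prof` objects collected in this tutorial.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; the tutorial profiles the XLM-R model instead.
tiny_model = nn.Linear(16, 4).eval()
tiny_input = torch.randn(8, 16)

with torch.autograd.profiler.profile() as prof:
    with torch.no_grad():
        for _ in range(3):
            tiny_model(tiny_input)

# Aggregate events by operator name and keep the five rows with the
# largest self CPU time.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```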

### **2.2 Run and benchmark inference on a (configurable) DEVICE with and without the BT fast path (native MHA only)**

**We check the BT sparsity setting:**

In [6]:
model.encoder.transformer.layers.enable_nested_tensor

True

**We disable BT sparsity:**

In [7]:
model.encoder.transformer.layers.enable_nested_tensor=False
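
For reference, `enable_nested_tensor` is also a constructor argument of the core `nn.TransformerEncoder`. As a minimal sketch, the code below builds two small randomly initialized encoders with identical weights, one with the flag off and one with it on, and compares their outputs on the non-padding positions. All sizes and names (`make_encoder`, `dense`, `sparse`) are illustrative assumptions, and the encoders are not the XLM-R model used in this tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def make_encoder(enable_nested_tensor):
    """Build a small encoder in inference mode; sizes are arbitrary."""
    layer = nn.TransformerEncoderLayer(
        d_model=32, nhead=4, dim_feedforward=64, batch_first=True
    )
    encoder = nn.TransformerEncoder(
        layer, num_layers=2, enable_nested_tensor=enable_nested_tensor
    )
    return encoder.eval()


dense = make_encoder(False)
sparse = make_encoder(True)
sparse.load_state_dict(dense.state_dict())  # give both encoders the same weights

x = torch.randn(2, 6, 32)
padding_mask = torch.tensor([
    [False] * 6,                               # full-length sequence
    [False, False, True, True, True, True],    # two real tokens, four padding
])

with torch.no_grad():
    out_dense = dense(x, src_key_padding_mask=padding_mask)
    out_sparse = sparse(x, src_key_padding_mask=padding_mask)

real = ~padding_mask  # positions that hold real tokens
max_diff = (out_dense[real] - out_sparse[real]).abs().max().item()
print(f"max difference on real tokens: {max_diff:.2e}")
```

Only the real-token positions are compared because the values produced at padding positions may differ between the two settings and are normally ignored downstream.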

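We run the model on DEVICE and collect profile information for native MHA execution on DEVICE:

- The first run uses traditional ("slow path") execution.
- The second run enables BT fast path execution by putting the model in inference mode with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

When executing on a GPU, you should see a significant speedup, in particular for the small input batch setting: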

In [8]:
model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

slow path:


STAGE:2024-01-23 14:07:33 766285:766285 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-23 14:07:33 766285:766285 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-23 14:07:33 766285:766285 ActivityProfilerController.cpp:322] Completed Stage: Post Processing


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::eq         0.02%      52.000us         1.95%       6.417ms       6.417ms       6.355ms         1.53%       6.355ms       6.355ms             1  
                                       cudaLaunchKernel         1.93%       6.365ms         1.93%       6.365ms       6.365ms       0.000us         0.00%       0.000us       0.000us             1  
         

STAGE:2024-01-23 14:07:34 766285:766285 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-23 14:07:35 766285:766285 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-23 14:07:35 766285:766285 ActivityProfilerController.cpp:322] Completed Stage: Post Processing


-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         aten::eq         0.01%      36.000us         0.01%      50.000us      50.000us       3.000us         0.00%       3.000us       3.000us             1  
                                 cudaLaunchKernel         0.00%      14.000us         0.00%      14.000us      14.000us       0.000us         0.00%       0.000us       0.000us             1  
                                  aten:

### **2.3 Run and benchmark inference on a (configurable) DEVICE with and without the BT fast path (native MHA + sparsity)**

**We enable sparsity support:**

In [9]:
model.encoder.transformer.layers.enable_nested_tensor = True

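We run the model on DEVICE and collect profile information for native MHA execution with sparsity support on DEVICE:

- The first run uses traditional ("slow path") execution.
- The second run enables BT fast path execution by putting the model in inference mode with `model.eval()` and disabling gradient collection with `torch.no_grad()`.

When executing on a GPU, you should see a significant speedup, in particular for the big input batch setting, which includes sparsity: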

In [11]:
model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  for i in range(ITERATIONS):
    output = model(model_input)
print(prof)

model.eval()

print("fast path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  with torch.no_grad():
    for i in range(ITERATIONS):
      output = model(model_input)
print(prof)

slow path:


STAGE:2024-01-23 14:10:37 766285:766285 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-23 14:10:38 766285:766285 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-23 14:10:38 766285:766285 ActivityProfilerController.cpp:322] Completed Stage: Post Processing


-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         aten::eq         0.02%      38.000us         0.03%      50.000us      50.000us       4.000us         0.00%       4.000us       4.000us             1  
                                 cudaLaunchKernel         0.01%      12.000us         0.01%      12.000us      12.000us       0.000us         0.00%       0.000us       0.000us             1  
                                  aten:

STAGE:2024-01-23 14:10:39 766285:766285 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-23 14:10:39 766285:766285 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-23 14:10:39 766285:766285 ActivityProfilerController.cpp:322] Completed Stage: Post Processing


-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         aten::eq         0.01%      39.000us         0.01%      51.000us      51.000us       3.000us         0.00%       3.000us       3.000us             1  
                                 cudaLaunchKernel         0.00%      12.000us         0.00%      12.000us      12.000us       0.000us         0.00%       0.000us       0.000us             1  
                                  aten:

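## Summary

**In this tutorial, we introduced fast transformer inference in torchtext using the PyTorch core Better Transformer support, which provides fast path execution for Transformer encoder models. We demonstrated the use of Better Transformer with models trained before BT fast path execution was available. We demonstrated and benchmarked the BT fast path execution modes: native MHA execution and BT sparsity acceleration.**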