[Fast Transformer Inference with Better Transformer](https://tutorials.pytorch.kr/beginner/bettertransformer_tutorial.html)

## Features of Better Transformer

An inference example using torchtext and Better Transformer (BT).

It uses `TransformerEncoder`, `TransformerEncoderLayer`, and `MultiheadAttention`.

BT:

1. Runs multi-head attention efficiently (__MHA__).
2. Exploits __sparsity__ in NLP inference: inputs vary in length, so batches contain many padding tokens, and skipping them speeds up inference.
3. Must be run in inference mode.
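
The three points above can be sketched on a toy encoder. This is a minimal illustration, not the tutorial's XLM-R model: the layer sizes, the random input, and the padding mask are all made-up values. It puts a small `nn.TransformerEncoder` into eval mode, disables gradients, and passes a `src_key_padding_mask` that marks which positions are padding. Whether the fastpath is actually taken depends on the PyTorch version and the layer configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only (assumptions, not from the tutorial)
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=True)

# Batch of 2 sequences padded to length 5; True marks a padding position
x = torch.rand(2, 5, 16)
padding_mask = torch.tensor([
    [False, False, False, False, False],  # full-length sequence
    [False, False, True, True, True],     # 2 real tokens, 3 padding tokens
])

encoder.eval()          # inference behavior: dropout disabled
with torch.no_grad():   # no gradient tracking
    out = encoder(x, src_key_padding_mask=padding_mask)

print(out.shape)
```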

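Learning goals

1. Load a pretrained model
2. MHA: compare results on CPU with and without the BT fastpath
3. MHA: compare results on DEVICE with and without the BT fastpath
4. Enable sparsity support
5. MHA + sparsity: compare results on DEVICE with and without the BT fastpath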


## Setup

In [4]:
"""
Load a pretrained model

Download the XLM-R model
"""

import torch
import torch.nn as nn

print(f"torch version: {torch.__version__}")

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"torch cuda available: {torch.cuda.is_available()}")

import torchtext
from torchtext.models import RobertaClassificationHead
from torchtext.functional import to_tensor

# The original tutorial uses LARGE; switched to BASE because of RAM limits
xlmr_base = torchtext.models.XLMR_BASE_ENCODER

# input_dim lowered from 1024 (LARGE) to 768 (BASE)
classifier_head = RobertaClassificationHead(num_classes=2, input_dim=768)
model = xlmr_base.get_model(head=classifier_head)
transform = xlmr_base.transform()

torch version: 2.0.1+cu118
torch cuda available: True


In [5]:
# Set up the datasets
#   1. Small input batch
#   2. Large input batch (sparsity)

small_input_batch = [
               "Hello world",
               "How are you!"
]
big_input_batch = [
               "Hello world",
               "How are you!",
               """`Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.`

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite."""
]

In [6]:
# Preprocess the inputs and test the model

input_batch = big_input_batch

model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape

torch.Size([3, 2])
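
`to_tensor(..., padding_value=1)` pads the tokenized sequences to a common length so they can be stacked into one tensor. The idea can be sketched in plain Python. `pad_batch` and the token IDs below are hypothetical and only mimic the padding step, not torchtext's actual implementation.

```python
from typing import List


def pad_batch(sequences: List[List[int]], padding_value: int) -> List[List[int]]:
    """Right-pad every sequence with padding_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Made-up token IDs for three inputs of different lengths
token_ids = [[0, 11, 2], [0, 12, 13, 2], [0, 21, 22, 23, 24, 25, 2]]
padded = pad_batch(token_ids, padding_value=1)
for row in padded:
    print(row)
```

The short inputs end up mostly padding, which is what the sparsity support exploits.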

In [7]:
# Set the number of benchmark iterations
ITERATIONS=10
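
Alongside the profiler, a plain wall-clock average over the same number of iterations can give a quick comparison. A minimal sketch, assuming a hypothetical helper `average_seconds` and a stand-in workload instead of the real model:

```python
import time
from typing import Callable


def average_seconds(fn: Callable[[], object], iterations: int) -> float:
    """Call fn `iterations` times and return the mean wall-clock time per call."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


# Stand-in workload; with the real model this would be lambda: model(model_input)
avg = average_seconds(lambda: sum(range(10_000)), iterations=10)
print(f"average per call: {avg * 1e3:.3f} ms")
```

On a GPU, CUDA kernels run asynchronously, so `torch.cuda.synchronize()` should be called before reading the clock for the numbers to be meaningful.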

## Running

In [10]:
"""1. Run on CPU (compare with and without the BT fastpath)

Run the model on CPU and compare with and without the fastpath:
1. Run the slow path
2. Run the BT fastpath (call model.eval() and disable gradient collection)
"""

print("slow path")
print("=========")

# Use the profiler to measure the cost of each operation in the code
# (time by default; memory usage requires profile_memory=True)
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    for i in range(ITERATIONS):
        output = model(model_input)
print(prof)


model.eval()
print("fast path:")
print("=========")

with torch.autograd.profiler.profile(use_cuda=False) as prof:
    with torch.no_grad():
        for i in range(ITERATIONS):
            output = model(model_input)
print(prof)

slow path
--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                    aten::eq         0.00%      42.000us         0.00%      42.000us      42.000us             1  
                             aten::embedding         0.00%      56.000us         0.00%     475.000us     475.000us             1  
                               aten::reshape         0.00%      32.000us         0.00%      40.000us      40.000us             1  
                        aten::_reshape_alias         0.00%       8.000us         0.00%       8.000us       8.000us             1  
                          aten::index_select         0.00%     354.000us 
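
`print(prof)` lists every recorded event, which gets long quickly. The profiler can also aggregate events by operator name. A minimal sketch on a toy `nn.Linear` model (a stand-in, not the XLM-R model), assuming only a summary view is wanted:

```python
import torch
import torch.nn as nn

toy_model = nn.Linear(8, 4)   # stand-in model for illustration
toy_input = torch.rand(2, 8)

toy_model.eval()
with torch.autograd.profiler.profile() as prof:
    with torch.no_grad():
        for _ in range(10):
            toy_model(toy_input)

# Group events by operator and show the 5 most expensive by self CPU time
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(summary)
```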

In [12]:
# Run the model on DEVICE

# Check the BT sparsity setting
model.encoder.transformer.layers.enable_nested_tensor

True

In [13]:
# Turn off BT sparsity
model.encoder.transformer.layers.enable_nested_tensor = False
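
Python accepts assignment to any attribute name on ordinary objects, `nn.Module` instances included, so a misspelled flag name creates a new attribute and leaves the real flag untouched. One way to guard against that is to assign only to attributes that already exist. A minimal sketch, where `set_existing_attr` and the stand-in `Layers` class are hypothetical:

```python
class Layers:
    """Stand-in for an object exposing an enable_nested_tensor flag."""

    def __init__(self) -> None:
        self.enable_nested_tensor = True


def set_existing_attr(obj: object, name: str, value: object) -> None:
    """Set obj.<name> only if the attribute already exists; otherwise raise."""
    if not hasattr(obj, name):
        raise AttributeError(f"{type(obj).__name__} has no attribute {name!r}")
    setattr(obj, name, value)


layers = Layers()
set_existing_attr(layers, "enable_nested_tensor", False)
print(layers.enable_nested_tensor)

try:
    set_existing_attr(layers, "enable_neste_tensor", False)  # misspelled on purpose
except AttributeError as err:
    print(f"caught: {err}")
```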

In [14]:
# Compare performance when using the GPU
model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(ITERATIONS):
        output = model(model_input)
print(prof)

model.eval()

print("fast path")
print("=========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    with torch.no_grad():
        for i in range(ITERATIONS):
            output = model(model_input)
print(prof)

slow path:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::eq         0.27%       8.077ms         0.28%       8.116ms       8.116ms       8.140ms         0.26%       8.140ms       8.140ms             1  
                                       cudaLaunchKernel         0.00%      39.000us         0.00%      39.000us      39.000us       0.000us         0.00%       0.000us       0.000us             1 

In [15]:
# Enable sparsity together with the BT fastpath and check the results
model.encoder.transformer.layers.enable_nested_tensor = True

In [16]:
# Compare performance when using the GPU
model.to(DEVICE)
model_input = model_input.to(DEVICE)

print("slow path:")
print("==========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(ITERATIONS):
        output = model(model_input)
print(prof)

model.eval()

print("fast path")
print("=========")
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    with torch.no_grad():
        for i in range(ITERATIONS):
            output = model(model_input)
print(prof)

slow path:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::eq         0.03%      95.000us         0.04%     133.000us     133.000us     154.000us         0.03%     154.000us     154.000us             1  
                                       cudaLaunchKernel         0.01%      38.000us         0.01%      38.000us      38.000us       0.000us         0.00%       0.000us       0.000us             1 

### BT sparsity acceleration

A technique that speeds up the model by skipping the padding tokens in the input sequences during computation.
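
How much there is to skip depends on how uneven the sequence lengths are. A rough sketch of that arithmetic, using made-up token counts (two short sentences and one long passage) rather than real tokenizer output; `padding_fraction` is a hypothetical helper:

```python
from typing import List


def padding_fraction(lengths: List[int]) -> float:
    """Fraction of positions in the padded batch that hold padding tokens."""
    padded_positions = max(lengths) * len(lengths)
    real_tokens = sum(lengths)
    return 1 - real_tokens / padded_positions


# Hypothetical token counts: two short sentences and one long passage
lengths = [4, 6, 250]
print(f"padding fraction: {padding_fraction(lengths):.1%}")
```

For these lengths roughly 65% of the positions in the padded batch are padding, so most of the dense computation would be spent on tokens that carry no information.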