# Transformer

구조
- RNN, LSTM을 사용하지 않고 어텐션 기반으로 동작하는 인코더-디코더 구조의 모델
- 인코더
  - 입력 시퀀스를 받아들이며, 각 입력 단어에 대한 정보를 추출
  - 멀티 헤드 어텐션과 피드포워드 네트워크로 구성됨
- 디코더
  - 인코드의 출력과 디코더의 이전 타임스텝 출력을 사용하여 다음 출력을 예측
  - 멀티 헤드 어텐션, 피드포워드 네트워크, 셀프 어텐션으로 구성됨

핵심 개념
- 어텐션 메커니즘
  - 셀프 어텐션: 입력의 각 위치가 다른 모든 위치의 정보를 고려하는 구조
  - 멀티 헤드 어텐션: 여러 개의 독립적인 어텐션을 병렬로 수행하여 서로 다른 부분에 집중할 수 있도록 함
- 포지셔널 인코딩: 트랜스포머는 입력의 순서를 학습할 수 없기 때문에 입력에 위치 정보를 추가해주어야 함
- 병렬 처리 가능: 데이터를 순차적으로 입력하지 않고 동시에 병렬로 처리할 수 있기 때문에 빠르게 학습할 수 있음


## 셀프 어텐션 (Self Attention)

Query, Key, Value 벡터로 입력을 변환

Query와 Key의 내역을 통해 각 입력의 중요도를 계산한 후, 소프트맥스를 적용하여 확률을 출력

각 Value에 확률(어텐션 가중치)를 곱하여 Weighted Sum을 계산


## 멀티 헤드 어텐션


여러 개의 셀프 어텐션 블록을 병렬로 배치한 후 그 결과를 concat하는 방식

## Transformer 구현하기

모듈 임포트

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

데이터셋 클래스 준비

In [2]:
# 데이터셋 샘플 생성 (임의의 데이터셋 사용)
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=50):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # 토큰화 및 패딩
        inputs = self.tokenizer(text, return_tensors='pt', max_length=self.max_len, padding='max_length', truncation=True)
        input_ids = inputs['input_ids'].squeeze(0)
        attention_mask = inputs['attention_mask'].squeeze(0)

        # Attention mask를 boolean 타입으로 변환
        attention_mask = attention_mask.bool()

        return input_ids, attention_mask, torch.tensor(label)

Transformer 모델 정의

In [3]:
# 간단한 Transformer 모델 정의
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_size, num_classes, num_heads, num_layers, max_len):
        super(TransformerModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.pos_encoder = nn.Embedding(max_len, embed_size)
        self.transformer = nn.Transformer(d_model=embed_size, nhead=num_heads, num_encoder_layers=num_layers)
        self.fc = nn.Linear(embed_size, num_classes)

    def forward(self, input_ids, attention_mask):
        seq_len = input_ids.size(1)
        pos = torch.arange(0, seq_len).unsqueeze(0).to(input_ids.device)
        x = self.embedding(input_ids) + self.pos_encoder(pos)
        x = x.transpose(0, 1)  # Transformer expects input in (sequence, batch, feature) format
        x = self.transformer(x, x, src_key_padding_mask=attention_mask)
        x = x.mean(dim=0)  # Sequence-level representation
        return self.fc(x)

하이퍼 파라미터 정의

In [4]:
# 하이퍼파라미터
vocab_size = 30522  # 임의의 토큰 개수 (BERT tokenizer 사용 시의 vocab_size)
embed_size = 128
num_classes = 2  # 이진 분류 예시
num_heads = 8
num_layers = 2
max_len = 50

데이터셋 준비

In [5]:
texts = [
    "I absolutely love this new phone",  # 1
    "This restaurant was terrible",      # 2
    "I am very satisfied with the purchase",  # 3
    "The movie was a complete waste of time", # 4
    "What a beautiful day",                    # 5
    "I hate rainy weather",                    # 6
    "The staff was extremely friendly",        # 7
    "I am disappointed with the product quality", # 8
    "Everything about this experience was fantastic", # 9
    "The food was cold and tasteless",               # 10
    "This new software update is amazing",            # 11
    "I regret ever buying this device",               # 12
    "The wait time was acceptable",                   # 13
    "It was a nightmare trying to find parking",      # 14
    "I love the atmosphere of this place",            # 15
    "The instructions were unclear and confusing",    # 16
    "The quality is top-notch",                       # 17
    "My order arrived damaged",                       # 18
    "I had a wonderful time at the event",            # 19
    "This is the worst day of my life",               # 20
    "I am thrilled with the results",                 # 21
    "They didn't provide any assistance",             # 22
    "The design is sleek and modern",                 # 23
    "The packaging was broken",                       # 24
    "I feel so relaxed after using it",               # 25
    "It's absolutely horrible",                       # 26
    "I enjoy every moment here",                      # 27
    "The shipping was delayed for weeks",             # 28
    "The performance is outstanding",                 # 29
    "I can't stand the noise in here",                # 30
    "Great value for the price",                      # 31
    "They charged me extra for no reason",            # 32
    "I'm impressed by the customer service",          # 33
    "I feel cheated by their false claims",           # 34
    "The color looks stunning",                       # 35
    "The product stopped working after a day",        # 36
    "I highly recommend this brand",                  # 37
    "They never respond to inquiries",                # 38
    "The taste is absolutely delicious",              # 39
    "The instructions were missing",                  # 40
    "I will definitely buy this again",               # 41
    "I want a refund",                                # 42
    "The interface is user-friendly",                 # 43
    "It gave me a terrible headache",                 # 44
    "The seats were incredibly comfortable",          # 45
    "They refused to help me",                        # 46
    "Everything went smoothly",                       # 47
    "It was a big disappointment",                    # 48
    "I feel great after using this service",          # 49
    "The website kept crashing",                      # 50
    "They really went above and beyond",              # 51
    "I am never going back there",                    # 52
    "The flavor is just perfect",                     # 53
    "The staff ignored me the whole time",            # 54
    "It looks even better in person",                 # 55
    "This place is so dirty",                         # 56
    "My experience here was phenomenal",              # 57
    "They lost my reservation",                       # 58
    "I appreciate the prompt response",               # 59
    "The screen flickers constantly",                 # 60
    "It is the best gift I've ever received",         # 61
    "Their attitude was condescending",               # 62
    "I love how easy it is to set up",                # 63
    "I can't believe how bad this turned out",        # 64
    "The packaging was so cute",                      # 65
    "It didn't match the description at all",         # 66
    "I'm very happy with my purchase",                # 67
    "The customer support was terrible",              # 68
    "This is a game-changer",                         # 69
    "The product feels cheap and flimsy",             # 70
    "I can't wait to use it again",                   # 71
    "They messed up my entire order",                 # 72
    "So glad I found this item",                      # 73
    "The lines were way too long",                    # 74
    "They handled my request efficiently",            # 75
    "The sound quality is awful",                     # 76
    "I'm amazed by how well it works",                # 77
    "It was a waste of money",                        # 78
    "Truly the best experience I've had",             # 79
    "I am completely unsatisfied",                    # 80
    "They delivered faster than expected",            # 81
    "The paint started peeling off immediately",      # 82
    "The customer service agent was polite",          # 83
    "I am furious about the lack of communication",   # 84
    "I love the texture of this product",             # 85
    "It arrived broken and unusable",                 # 86
    "The staff made me feel welcomed",                # 87
    "It's not worth the hype",                        # 88
    "It was an unforgettable experience",             # 89
    "I regret choosing this place",                   # 90
    "I adore the packaging design",                   # 91
    "They never apologized for the inconvenience",    # 92
    "The user guide was extremely helpful",           # 93
    "I was stuck with a defective item",              # 94
    "I feel so satisfied with the outcome",           # 95
    "The seats were filthy and uncomfortable",        # 96
    "They resolved my issue quickly",                 # 97
    "The website is confusing and slow",              # 98
    "It's such a relief to find something this good", # 99
    "I absolutely hate how it turned out"             # 100
]

labels = [
    1, # 1
    0, # 2
    1, # 3
    0, # 4
    1, # 5
    0, # 6
    1, # 7
    0, # 8
    1, # 9
    0, # 10
    1, # 11
    0, # 12
    1, # 13
    0, # 14
    1, # 15
    0, # 16
    1, # 17
    0, # 18
    1, # 19
    0, # 20
    1, # 21
    0, # 22
    1, # 23
    0, # 24
    1, # 25
    0, # 26
    1, # 27
    0, # 28
    1, # 29
    0, # 30
    1, # 31
    0, # 32
    1, # 33
    0, # 34
    1, # 35
    0, # 36
    1, # 37
    0, # 38
    1, # 39
    0, # 40
    1, # 41
    0, # 42
    1, # 43
    0, # 44
    1, # 45
    0, # 46
    1, # 47
    0, # 48
    1, # 49
    0, # 50
    1, # 51
    0, # 52
    1, # 53
    0, # 54
    1, # 55
    0, # 56
    1, # 57
    0, # 58
    1, # 59
    0, # 60
    1, # 61
    0, # 62
    1, # 63
    0, # 64
    1, # 65
    0, # 66
    1, # 67
    0, # 68
    1, # 69
    0, # 70
    1, # 71
    0, # 72
    1, # 73
    0, # 74
    1, # 75
    0, # 76
    1, # 77
    0, # 78
    1, # 79
    0, # 80
    1, # 81
    0, # 82
    1, # 83
    0, # 84
    1, # 85
    0, # 86
    1, # 87
    0, # 88
    1, # 89
    0, # 90
    1, # 91
    0, # 92
    1, # 93
    0, # 94
    1, # 95
    0, # 96
    1, # 97
    0, # 98
    1, # 99
    0  # 100
]

데이터셋 전처리

In [6]:
!pip install sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/897.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m897.5/897.5 kB[0m [31m38.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sacremoses
Successfully installed sacremoses-0.1.1


In [7]:
# 데이터셋 분할
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2)

# tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'bert-base-uncased')
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

train_dataset = TextDataset(train_texts, train_labels, tokenizer, max_len=max_len)
val_dataset = TextDataset(val_texts, val_labels, tokenizer, max_len=max_len)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

모델 초기화 및 학습 설정

In [8]:
# 모델 초기화 및 학습 설정
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TransformerModel(vocab_size, embed_size, num_classes, num_heads, num_layers, max_len).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)



훈련

In [9]:
# 학습 루프
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for input_ids, attention_mask, labels in train_loader:
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}")

Epoch 1, Loss: 1.553677213191986
Epoch 2, Loss: 0.7094502449035645
Epoch 3, Loss: 0.7139646410942078
Epoch 4, Loss: 0.7164085507392883
Epoch 5, Loss: 0.70516916513443
Epoch 6, Loss: 0.7138293862342835
Epoch 7, Loss: 0.6909373521804809
Epoch 8, Loss: 0.7063598990440368
Epoch 9, Loss: 0.6917636394500732
Epoch 10, Loss: 0.6916982412338257


평가

In [10]:
# 평가
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for input_ids, attention_mask, labels in val_loader:
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        outputs = model(input_ids, attention_mask)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Validation Accuracy: {correct / total * 100:.2f}%")

Validation Accuracy: 30.00%


추론

In [11]:
def predict(text, model, tokenizer, max_len=50):
    model.eval()
    with torch.no_grad():
        # 입력 문자열을 토크나이즈하고 attention mask 생성
        inputs = tokenizer(text, return_tensors='pt', max_length=max_len, padding='max_length', truncation=True)
        input_ids = inputs['input_ids'].squeeze(0).unsqueeze(0).to(device)  # 배치 차원 추가
        attention_mask = inputs['attention_mask'].squeeze(0).unsqueeze(0).to(device)  # 배치 차원 추가

        # Attention mask를 boolean 타입으로 변환
        attention_mask = attention_mask.bool()

        # 모델을 통해 추론
        outputs = model(input_ids, attention_mask)
        _, predicted = torch.max(outputs, 1)

    return predicted.item()

In [12]:
# 문자열 입력을 통해 추론
input_text = "I love it"
predicted_class = predict(input_text, model, tokenizer, max_len)
print(f"입력: {input_text}, 예측된 클래스: {predicted_class}")

입력: I love it, 예측된 클래스: 0
