# Transformers

In the previous chapter, we covered RNNs, the modeling architecture in vogue in NLP until the Transformer architecture gained prominence.

Transformers are the workhorse of modern NLP. The original architecture, first proposed in 2017, has taken the (deep learning) world by storm. Since then, NLP literature has been inundated with all sorts of new architectures that are broadly classified into either Sesame Street characters or words that end with "-former".

In this chapter, we'll look at that very architecture, the transformer, in detail. We'll analyze the core innovations and explore a hot new category of neural network layers: the attention mechanism.

## Building a Transformer from Scratch

In Chapters 2 and 3, we explored how to use transformers in practice and how to leverage pretrained transformers to solve complex NLP problems. Now we're going to take a deep dive into the architecture itself and learn how transformers work from first principles.

What does "first principles" mean? Well, for starters, it means we're not allowed to use the Hugging Face Transformers library. We've raved about it plenty in this book already, so it's about time we take a break from that and see how things actually work under the hood. For this chapter, we're going to be using raw PyTorch instead.

PyTorch, being a fully fledged deep learning library that most researchers use, naturally has an implementation of the extremely popular transformer architecture, just like the Hugging Face library does. This version, though, exposed as an nn.Module, is much more DIY and is meant to be used with the other familiar PyTorch tools like dataloaders, optimizers, etc.

As we've mentioned before, one of the best ways to see what any deep learning related class/function does is by looking at the type signature and the dimensionality of the inputs and outputs. So let's do that:

In [None]:
import torch

In [None]:
model = torch.nn.Transformer()
model.encoder.layers[0]

In [None]:
TransformerEncoderLayer(
    (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (linear1): Linear(in_features=512, out_features=2048, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=2048, out_features=512, bias=True)
    (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
)    

At first glance, there doesn't seem to be too much to take away from this. It's a fairly standard PyTorch nn.Module with the standard forward() function defined for us. In principle, we could just plug it into our training pipeline and carry on. In fact, this is essentially what we did in Chapter 2. But let's try to understand the exact components of this module.

Of particular interest is the MultiheadAttention layer. Most of the other layers, like Dropout, Linear, and LayerNorm, are things you'd expect to see in nontransformer models as well. This particular implementation of the transformer by PyTorch (with no additional configuration parameters) exactly matches the specification of the architecture in the original paper (shown in Figure 7-1), which, coincidentally, is titled "Attention Is All You Need."
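To make the "dimensionality of the inputs and outputs" concrete, here is a minimal sketch that pushes a random batch through a single encoder layer built with the same sizes as the printed module. The sequence length and batch size are arbitrary values chosen for illustration, and the (sequence length, batch size, d_model) layout assumes the layer's batch_first option is left at its default.

```python
import torch

# One encoder layer with the sizes shown in the printout above:
# d_model=512, and the defaults dim_feedforward=2048 and dropout=0.1.
layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
layer.eval()  # switch off dropout so the forward pass is deterministic

# Assumed layout (batch_first left at its default): (seq_len, batch, d_model).
seq_len, batch_size, d_model = 10, 4, 512
x = torch.rand(seq_len, batch_size, d_model)

with torch.no_grad():
    out = layer(x)

print(x.shape)    # torch.Size([10, 4, 512])
print(out.shape)  # torch.Size([10, 4, 512])
```

The layer maps a sequence of vectors to another sequence of the same shape, which is what lets these layers be stacked.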

In short, it's safe to say that the most important component of this Transformer class is the MultiheadAttention layer. So it makes sense to take some time to understand what that is and how it works.

## Attention Mechanisms

An attention mechanism is a layer in a deep neural network. Its job, while still open to interpretation, is to learn long-range, "global" features. An attention mechanism acts as what we like to call an "information router" that decides what components of the input sequence of embedding vectors contribute to a single output vector. This idea will become more clear as we actually work through the details.

We're just as excited to talk about attention as the other couple thousand people that attended NeurIPS within the last year, but before we do, we should mention that an important theme to pay attention to is the computational complexity of the operations involved. Think about how many dot products/matrix multiplications you see and the size of the tensors involved.

### Dot Product Attention

OK, strictly speaking, we don't think we've seen this type of attention actually being applied in real networks. Scaled dot product attention is usually just talked about as a component of the next thing we'll discuss: Multi-Head Self-Attention.

The most important question you need to ask in the world of exotic attention mechanisms is this: how, exactly, do you measure the similarity between things? This core idea, shrouded in a veil of linear algebra and bucket-loads of GPUs, is what drives the fundamental behavior of neural nets in NLP today.

And the scaled dot product uses probably one of the simplest and most intuitive methods of measuring similarity: the dot product.

You should be familiar with this, but let's do a quick recap. The dot product is an operation that takes two vectors, multiplies them element-wise, and then adds up the results. This measures similarity because if the two vectors that we're "dot-producting" have similar components, the product of their elements will be large, and vice versa (in the sense that vectors with dissimilar components will have a small dot product).

But the real question is, what exactly are we taking the dot product of? To answer this question, let's focus on how these attention mechanisms are implemented in transformers (see Figure 7-1).
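As a quick illustration of the recap above, the following sketch computes a few dot products by hand. The vectors are made up for the example; note that the result depends on the magnitudes of the vectors as well as on how well they line up.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 1.9, 3.2])     # components close to those of a
c = np.array([-1.0, -2.0, -3.0])  # components opposite to those of a
d = np.array([3.0, 0.0, -1.0])    # perpendicular to a


def dot(u, v):
    """Multiply element-wise, then add up the results."""
    return float(np.sum(u * v))


print(dot(a, b))  # large and positive (about 14.5)
print(dot(a, c))  # -14.0
print(dot(a, d))  # 0.0
```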

![image](images/transformer_layer.png)

A typical transformer takes in sequences of word vectors as input, and at each layer, transforms (and no, we don't think that's how they got their name) them into another sequence of vectors, which we call the hidden representation/state.

So at each hidden layer in the network, we have sequences of vectors that we want to "attend" over. See Figure 7-2.

![image](images/dotproduct_1-768x412.png)

Now, here's the important bit, so pay attention (pun very much intended).

What we're going to do is transform each one of these hidden state vectors into three separate, completely independent vectors: the query, the key, and the value.

We do this transformation via a simple matrix multiply, and the dimensions of these vectors are up to us. The only restriction is that the query and key vectors need to have the same dimensions (since we're going to take the dot product between them).
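The query/key/value transformation described above can be sketched in a few lines. The sizes and the names W_q, W_k, and W_v are assumptions made for this illustration, and the projection matrices are random here, whereas a real model would learn them during training.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 5, 16  # 5 hidden-state vectors of size 16 (illustrative)
d_k, d_v = 8, 12          # query/key size and value size, chosen freely

H = rng.standard_normal((seq_len, d_model))  # one row per hidden state

# Hypothetical projection matrices (learned parameters in a real model).
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

Q = H @ W_q  # queries: (seq_len, d_k)
K = H @ W_k  # keys:    (seq_len, d_k)
V = H @ W_v  # values:  (seq_len, d_v)

# Queries and keys share d_k, so every query can be dotted with every key.
scores = Q @ K.T  # (seq_len, seq_len)

print(Q.shape, K.shape, V.shape, scores.shape)  # (5, 8) (5, 8) (5, 12) (5, 5)
```

The value dimension is deliberately different from the query/key dimension here to show that only the query and key sizes have to agree.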

### Scaled Dot Product Attention

In [None]:
import numpy as np

In [None]:
small_dots = [
    np.dot(np.random.randn(10),
           np.random.randn(10))
    for i in range(100)]
np.mean(np.absolute(small_dots))

In [None]:
large_dots = [np.dot(np.random.randn(10000),
                     np.random.randn(10000))
              for i in range(100)]
np.mean(np.absolute(large_dots))
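The two cells above suggest why the "scaled" part matters: for random vectors with unit-variance components, the typical size of a dot product grows roughly with the square root of the dimension. The formulation in "Attention Is All You Need" compensates by dividing the query-key scores by the square root of the key dimension before the softmax. Below is a minimal NumPy sketch of that formulation; the function names and tensor sizes are our own choices for illustration, and it ignores masking and batching.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D inputs (no masking, no batching)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 32))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # (5, 32)
print(weights.shape)  # (5, 5)
```

Each output row is a weighted average of the value vectors, with the weights determined by how strongly that row's query matches each key.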

### Multi-Head Self-Attention

### Adaptive Attention Span

### Persistent Memory/All-Attention

### Product-Key Memory

## Transformers for Computer Vision

# Fine-tuning BERT for sentiment analysis

## Contents
1. Check the training data
2. Check the helper functions
3. Check the SentimentClassifier
4. Check the Analyser
5. Check the training process
6. Tests

## Training Data

In [None]:
!pip3 install torch
!pip3 install transformers
from typing import List, Tuple
from transformers import BertTokenizer, BertModel
import torch
from torch.nn import functional as F


DATA: List[Tuple[str, int]] = [
    # positive sentence - label = 1
    ("난 너를 좋아해", 1),  # "I like you"
    # negative sentence - label = 0
    ("난 너를 싫어해", 0)  # "I hate you"
]

TESTS = [
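    "나는 자연어처리가 좋아",  # "I like natural language processing"
    "나는 자연어처리가 싫어",  # "I dislike natural language processing"
    "나는 너가 좋다",  # "I like you"
    "너는 참 좋다",  # "You are really nice"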
]

## Helper Functions

In [None]:
# Check whether CUDA is available and return the matching device (cuda if available, otherwise cpu).
def load_device() -> torch.device:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return device


# Build the input tensor X
def build_X(sents: List[str], tokenizer: BertTokenizer, device: torch.device) -> torch.Tensor:
    """
    return X (N, 3, L).  N, 0, L -> input_ids / N, 1, L -> token_type_ids / N, 2, L -> attention_mask
    """
    encodings = tokenizer(text=sents,
                          add_special_tokens=True,
                          return_tensors='pt',
                          truncation=True,
                          padding=True)
    return torch.stack([
        encodings['input_ids'],
        encodings['token_type_ids'],
        encodings['attention_mask']
    ], dim=1).to(device)

# Build the label tensor y
def build_y(labels: List[int], device: torch.device) -> torch.Tensor:
    return torch.FloatTensor(labels).unsqueeze(-1).to(device)
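To see the (N, 3, L) layout that build_X produces without downloading a tokenizer, here is a small sketch that stacks hand-written tensors the same way. The token ids are invented for illustration and do not correspond to any real vocabulary.

```python
import torch

# Stand-in for the tokenizer output: N=2 sentences padded to L=4 tokens.
# These ids are made up; a real BertTokenizer would produce vocabulary ids.
encodings = {
    "input_ids": torch.tensor([[2, 10, 11, 3], [2, 12, 3, 0]]),
    "token_type_ids": torch.zeros(2, 4, dtype=torch.long),
    "attention_mask": torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]]),
}

# Stacking along dim=1 puts the three tensors side by side per sentence.
X = torch.stack([
    encodings["input_ids"],
    encodings["token_type_ids"],
    encodings["attention_mask"],
], dim=1)
print(X.shape)  # torch.Size([2, 3, 4])

# Slicing along dim=1 recovers each component as an (N, L) tensor.
input_ids = X[:, 0]
attention_mask = X[:, 2]

# Labels: (N,) -> (N, 1), matching what build_y returns.
y = torch.FloatTensor([1, 0]).unsqueeze(-1)
print(y.shape)  # torch.Size([2, 1])
```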

## Sentiment Classifier

In [None]:
class SentimentClassifier(torch.nn.Module):
    def __init__(self, bert: BertModel):
        super().__init__()
        self.bert = bert
        self.hidden_size = bert.config.hidden_size
        # TODO 1
        self.W_hy = torch.nn.Linear(..., ...)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        """
        :param X: (N, 3, L)
        :return: H_all (N, L, H)
        """
        input_ids = X[:, 0]
        token_type_ids = X[:, 1]
        attention_mask = X[:, 2]
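        # Pass by keyword: BertModel.forward takes attention_mask before token_type_ids,
        # so passing these positionally in this order would swap the two.
        H_all = self.bert(input_ids=input_ids,
                          token_type_ids=token_type_ids,
                          attention_mask=attention_mask)[0]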
        return H_all

    def predict(self, X: torch.Tensor) -> torch.Tensor:
        """
        :param X: (N, 3, L)
        :return: y_hat (N, 1), the predicted probability that each sentence is positive
        """
        # TODO 2
        H_all = self.forward(X)
        H_cls = ...
        y_hat = ...  # (N, H) * (H, 1) -> (N, 1)
        return torch.sigmoid(y_hat)

    def training_step(self, X: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # TODO 3
        y_hat = self.predict(X)
        loss = ...
        return loss.sum()
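The three TODOs above are left as exercises. For readers who want to check their reasoning, here is one possible way to fill them in. This is not necessarily the intended solution. To keep the sketch runnable without downloading a model, it uses TinyEncoder, a hypothetical stand-in for BertModel that only mimics the output shape, and SketchClassifier, a renamed copy of the class. The choices of the first ([CLS]) position and of binary cross-entropy are assumptions that fit the shapes and the sigmoid output used in the original.

```python
import torch
from torch.nn import functional as F


class TinyEncoder(torch.nn.Module):
    """Hypothetical stand-in for BertModel: returns a tuple whose first item is (N, L, H)."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 8):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        return (self.embed(input_ids),)


class SketchClassifier(torch.nn.Module):
    def __init__(self, encoder: torch.nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder
        self.W_hy = torch.nn.Linear(hidden_size, 1)  # TODO 1: map H -> 1

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.encoder(input_ids=X[:, 0],
                            token_type_ids=X[:, 1],
                            attention_mask=X[:, 2])[0]  # (N, L, H)

    def predict(self, X: torch.Tensor) -> torch.Tensor:
        H_all = self.forward(X)   # (N, L, H)
        H_cls = H_all[:, 0]       # TODO 2: first ([CLS]) position -> (N, H)
        y_hat = self.W_hy(H_cls)  # (N, H) * (H, 1) -> (N, 1)
        return torch.sigmoid(y_hat)

    def training_step(self, X: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        y_hat = self.predict(X)
        # TODO 3: per-example binary cross-entropy, summed afterwards.
        loss = F.binary_cross_entropy(y_hat, y, reduction="none")
        return loss.sum()


torch.manual_seed(0)
X = torch.randint(0, 100, (2, 3, 5))  # fake (N=2, 3, L=5) batch of ids
y = torch.tensor([[1.0], [0.0]])      # labels shaped (N, 1)

clf = SketchClassifier(TinyEncoder(), hidden_size=8)
probs = clf.predict(X)
loss = clf.training_step(X, y)
print(probs.shape)  # torch.Size([2, 1])
print(loss.shape)   # torch.Size([]) (a scalar)
```

With the real model, hidden_size would come from bert.config.hidden_size, as in the original __init__.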

## Analyser

In [None]:

class Analyser:
    """
    BERT-based sentiment analyser.
    """
    def __init__(self, classifier: SentimentClassifier, tokenizer: BertTokenizer, device: torch.device):
        self.classifier = classifier
        self.tokenizer = tokenizer
        self.device = device

    def __call__(self, text: str) -> float:
        X = build_X(sents=[text], tokenizer=self.tokenizer, device=self.device)
        y_hat = self.classifier.predict(X)
        return y_hat.item()

## Training

In [None]:
# load the pretrained BERT model
tokenizer = BertTokenizer.from_pretrained('beomi/kcbert-base')
bert = BertModel.from_pretrained('beomi/kcbert-base')

# --- have a look at the config --- #
print(bert.config)
print(bert.config.hidden_size)

In [None]:
# --- hyper parameters --- #
EPOCHS = 20
LR = 0.0001


device = load_device()
print(device)

# --- build the dataset --- # 
sents = [sent for sent, _ in DATA]
labels = [label for _, label in DATA]
X = build_X(sents, tokenizer, device)
y = build_y(labels, device)

# --- instantiate the classifier --- #
classifier = SentimentClassifier(bert)
classifier.to(device)  # move the model to the GPU as well
optimizer = torch.optim.Adam(classifier.parameters(), lr=LR)  # choose the optimization algorithm

# --- start training --- #
for epoch in range(EPOCHS):
    loss = classifier.training_step(X, y)
    loss.backward()  # backpropagate the error
    optimizer.step()  # gradient descent step
    optimizer.zero_grad()  # reset gradients so they don't accumulate
    print(f"epoch:{epoch}, loss:{loss.item()} ")

## Test

In [None]:
classifier.eval()
analyser = Analyser(classifier, tokenizer, device)

for sent in TESTS:
    print(sent, "->", analyser(sent))

## Conclusion