# Paper wrap-up - [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

## Problem

### LSTM, GRU don't scale well

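- LSTMs and GRUs, the usual choices for sequence modeling, process tokens one after another, and this recurrence limits parallelization.
- Convolution-based methods were tried in order to tame the computation and memory cost that grows with sequence length, but they do not handle long-distance dependencies well.
  - For example, the number of operations needed to relate two arbitrary positions grows with the distance between them:
  - ConvS2S: O(n)
  - ByteNet: O(log n)
- In short, a method is needed that handles long-distance dependencies properly while keeping the cost that comes with sequence length to a minimum.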
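
The growth rates above can be made concrete with a small sketch. It counts how many sequential steps or stacked layers are needed before two positions `distance` apart can interact. The setup is a simplification assumed here for illustration, not something taken from the notes: a fixed kernel width `k`, dilation that grows by a factor of `k` per layer, and hypothetical function names.

```python
def recurrent_steps(distance: int) -> int:
    """An RNN passes information along one position per time step."""
    return distance


def stacked_conv_layers(distance: int, k: int = 3) -> int:
    """Contiguous kernels of width k widen the receptive field by k - 1 per layer."""
    return -(-distance // (k - 1))  # ceiling division


def dilated_conv_layers(distance: int, k: int = 3) -> int:
    """With dilation k**layer, the receptive field covers k**layers positions."""
    layers, field = 0, 1
    while field < distance + 1:
        field *= k
        layers += 1
    return layers


def self_attention_layers(distance: int) -> int:
    """Every position attends to every other position within a single layer."""
    return 1


for d in (8, 64, 512):
    print(d, recurrent_steps(d), stacked_conv_layers(d),
          dilated_conv_layers(d), self_attention_layers(d))
```

The recurrent and stacked-convolution counts grow linearly with the distance, the dilated count grows logarithmically, and the self-attention count stays constant.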

### Transformers

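- The Transformer relates any two positions with O(1) operations, regardless of the distance between them (generation is the exception, since output tokens are still produced one at a time).
- Self-attention had already shown strong results on reading comprehension, semantic representation learning, and similar tasks.
- The Transformer is the first sequence transduction model to rely purely on self-attention, without any RNN or CNN.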

### Model Architecture

Encoder-Decoder Architecture

#### Encoder

- 6 identical layers
- each layer
  - multi-head self-attention (w/ residual connection)
  - position-wise feedforward (w/ residual connection)

```python
layer_1 = LayerNorm(MultiHeadSelfAttention(x) + x)
layer_2 = LayerNorm(PositionWiseFeedforward(layer_1) + layer_1)
```

#### Decoder

- 6 identical layers
- each layer
  - masked multi-head self-attention (w/ residual connection)
  - multi-head attention over the encoder output (w/ residual connection)
  - position-wise feedforward (w/ residual connection)

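```python
layer_1 = LayerNorm(MaskedMultiHeadSelfAttention(x) + x)
# queries come from the decoder, keys and values from the encoder output
layer_2 = LayerNorm(MultiHeadAttention(q=layer_1, k=encoder_output, v=encoder_output) + layer_1)
layer_3 = LayerNorm(PositionWiseFeedforward(layer_2) + layer_2)
```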
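
In the paper, the decoder's self-attention is masked so that a position cannot attend to later positions, which keeps generation autoregressive. The following is a minimal, dependency-free sketch of such a mask. The helper names `causal_mask` and `masked_softmax` are hypothetical, and the convention that `True` means "may attend" is an assumption chosen for illustration.

```python
import math


def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax over one row of scores; disallowed entries receive weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# the query at position 1 may only see key positions 0 and 1
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], mask[1])
print(weights)
```

Setting blocked scores to negative infinity before the softmax, rather than multiplying them by zero, is what guarantees that the remaining weights still sum to 1.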
  

## Scaled dot-product attention

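- Scaled dot-product attention is the core attention algorithm.
- dot-product
  - attention scores are computed with a matrix multiplication (MatMul) of queries and keys
- scaled
  - dk is the dimension of the query and key vectors; with a single head this equals the word-embedding dimension d_model
  - the attention scores are scaled by 1/sqrt(dk)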


```python
ScaledDotProduct(Q,K,V) = Softmax(Q@K.T / sqrt(dk))@V
```
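
To see why the 1/sqrt(dk) factor helps, the sketch below draws random queries and keys with unit-variance components and compares the spread of raw and scaled dot products. It uses only the standard library. The function name, sample size, and seed are arbitrary choices made for illustration.

```python
import math
import random
import statistics


def dot_product_variance(dk: int, scaled: bool, samples: int = 2000, seed: int = 0) -> float:
    """Empirical variance of q.k for q and k with independent N(0, 1) components."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(dk)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dk)]
        dot = sum(a * b for a, b in zip(q, k))
        values.append(dot / math.sqrt(dk) if scaled else dot)
    return statistics.pvariance(values)


for dk in (4, 64):
    raw = dot_product_variance(dk, scaled=False)
    scaled = dot_product_variance(dk, scaled=True)
    print(f"dk={dk}: raw variance ~ {raw:.1f}, scaled variance ~ {scaled:.2f}")
```

The raw variance grows roughly in proportion to dk, while the scaled variance stays near 1. Keeping the scores in that range stops the softmax from saturating, where its gradients become very small.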


```python
import math
from typing import Optional

import torch


class ScaledDotProductAttention(torch.nn.Module):

    def __init__(self, d_model, device=None, dtype: torch.dtype=torch.float, *args, **kwargs) -> None:
        # device and dtype are accepted for API symmetry; this module has no parameters
        super().__init__(*args, **kwargs)
        self.scale = math.sqrt(d_model)  # sqrt(dk)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # q: (B, N_q, d), k and v: (B, N_k, d)
        # scores: (B, N_q, N_k)
        scores = q @ k.transpose(-2, -1) / self.scale
        if mask is not None:
            # mask is nonzero where attention is allowed; blocked positions are set
            # to -inf so that softmax gives them zero weight
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        return attention_weights @ v
```



## Multi-Head attention

- The d_model-dimensional Query, Key, and Value vectors (the word embeddings) are split into several sub-vectors, one per head. Each sub-vector is fed to its own scaled dot-product attention, and the results are combined into a single attention block.

```python
MultiHeadAttention(Q,K,V,n_heads) = linear_o(concat(*[
    ScaledDotProduct(qi,ki,vi)
    for qi,ki,vi in zip(linear_q(Q).split(n_heads), linear_k(K).split(n_heads), linear_v(V).split(n_heads))
]))
```
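
The split-and-concatenate bookkeeping can be sketched without any tensor library. The example below assumes that `d_model` is divisible by `n_heads` and uses hypothetical helper names. It only tracks how one projected vector is cut into per-head slices and put back together, and leaves out the attention computation itself.

```python
def split_heads(vector: list[float], n_heads: int) -> list[list[float]]:
    """Cut a d_model-sized vector into n_heads contiguous slices of equal depth."""
    if len(vector) % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    depth = len(vector) // n_heads
    return [vector[i * depth:(i + 1) * depth] for i in range(n_heads)]


def merge_heads(heads: list[list[float]]) -> list[float]:
    """Concatenate per-head outputs back into one d_model-sized vector."""
    return [value for head in heads for value in head]


d_model, n_heads = 512, 8
projected = [float(i) for i in range(d_model)]  # stand-in for one projected token
heads = split_heads(projected, n_heads)
print(len(heads), len(heads[0]))  # 8 heads, each of depth 64
```

Because each head works on `d_model / n_heads` dimensions, the total cost of all heads stays close to that of a single head over the full dimension.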



```python
import torch


class MultiHeadAttention(torch.nn.Module):

    def __init__(self, d_model, n_head, device=None, dtype: torch.dtype=torch.float, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        if d_model % n_head != 0:
            raise ValueError(f'd_model ({d_model}) must be divisible by n_head ({n_head})')
        self.n_head = n_head
        self.depth = d_model // n_head  # per-head dimension (dk)

        self.q_linear = torch.nn.Linear(d_model, d_model, device=device, dtype=dtype)
        self.k_linear = torch.nn.Linear(d_model, d_model, device=device, dtype=dtype)
        self.v_linear = torch.nn.Linear(d_model, d_model, device=device, dtype=dtype)

        # one scaled dot-product attention per head
        self.attns = torch.nn.ModuleList([ScaledDotProductAttention(d_model=self.depth, device=device, dtype=dtype) for _ in range(n_head)])
        self.output_linear = torch.nn.Linear(d_model, d_model, device=device, dtype=dtype)

    def forward(self, input: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # self-attention: queries, keys, and values are all projections of `input`
        if input.dim() == 2:
            input = input.unsqueeze(0)  # add a batch dimension
        if input.dim() != 3:
            raise ValueError(f'unsupported tensor shape: {input.shape}, should be form of (B,N,d)')

        b, n, _ = input.shape
        # project, then split the last dimension into (n_head, depth)
        q = self.q_linear(input).view(b, n, self.n_head, self.depth)
        k = self.k_linear(input).view(b, n, self.n_head, self.depth)
        v = self.v_linear(input).view(b, n, self.n_head, self.depth)
        # run each head on its own (B, N, depth) slice, then concatenate the features
        heads = [attn(q[:, :, i], k[:, :, i], v[:, :, i], mask) for i, attn in enumerate(self.attns)]
        return self.output_linear(torch.cat(heads, dim=-1))
```

```python
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# note: float16 is aimed at GPUs; on CPU, older PyTorch builds may not support half-precision matmul
test = torch.rand((1, 1024, 512), device=device, dtype=torch.float16)

attention = MultiHeadAttention(512, 8, device=device, dtype=torch.float16)
attention = torch.compile(attention)
```





```python
out = attention(test)
out.shape
```

torch.Size([1, 1024, 512])

## Position-wise feedforward

- Linear - ReLU - Linear, applied to each position independently
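
As given in the paper, the position-wise feed-forward network applies the same two linear maps to every position, with a ReLU in between:

```latex
\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
```

In the paper's base model, d_model = 512 and the inner dimension is d_ff = 2048, so W_1 has shape 512 x 2048 and W_2 has shape 2048 x 512.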


## Transformer Layer

- The Transformer layer in the GPT paper resembles the encoder of the Transformer in Attention Is All You Need. It consists of two sublayers, a multi-head attention and a position-wise feedforward, each with a residual connection, and layer norm is applied to the output of each sublayer.
- The first GPT paper stacks 12 of these transformer blocks to build the model.
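
For a sense of scale, the arithmetic below counts the learnable parameters of such a stack. The widths are assumptions taken from this notebook's own test configuration (d_model = 512 and a feed-forward inner size of 4 * d_model), not from the GPT paper. Embeddings are left out, and the function name is hypothetical.

```python
def block_parameters(d_model: int, ff_multiplier: int = 4) -> int:
    """Parameters of one block: 4 attention projections, 2 FFN linears, 2 LayerNorms."""
    d_ff = ff_multiplier * d_model
    attention = 4 * (d_model * d_model + d_model)  # q, k, v, output: weights + biases
    feedforward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for each LayerNorm
    return attention + feedforward + layer_norms


per_block = block_parameters(512)
print(per_block)       # 3152384
print(12 * per_block)  # 37828608
```

The feed-forward sublayer holds about two thirds of each block's parameters, which follows from its inner size of four times d_model.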


```python
import torch


class PositionWiseFeedforward(torch.nn.Module):

    def __init__(self, d_model: int, device, dtype: torch.dtype=torch.float, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        # expand to 4 * d_model, apply GELU (as in GPT; the original Transformer uses ReLU), project back
        self.pwff = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model, device=device, dtype=dtype),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model, device=device, dtype=dtype),
        )

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return self.pwff(input)


class Transformer(torch.nn.Module):

    def __init__(self, n_head, d_model, device, dtype: torch.dtype=torch.float, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.mha = MultiHeadAttention(d_model=d_model, n_head=n_head, device=device, dtype=dtype)
        self.mha_lnorm = torch.nn.LayerNorm(d_model, device=device, dtype=dtype)
        self.pw_ff = PositionWiseFeedforward(d_model=d_model, device=device, dtype=dtype)
        self.out_lnorm = torch.nn.LayerNorm(d_model, device=device, dtype=dtype)

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # each sublayer computes LayerNorm(x + Sublayer(x))
        mha_output = self.mha_lnorm(input + self.mha(input))
        return self.out_lnorm(mha_output + self.pw_ff(mha_output))
```


```python
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
transformer = Transformer(8, 512, device=device, dtype=torch.float16)
output = transformer(test)
output.shape
```

torch.Size([1, 1024, 512])