Full roadmap (paper → code, 1:1 mapping)

We break the Transformer paper's architecture into classes, one component at a time.
```text
Input
 └─ Embedding
 └─ Positional Encoding
 └─ Encoder Block × N
     ├─ Multi-Head Self Attention
     ├─ Add & Norm
     ├─ Feed Forward
     └─ Add & Norm
 └─ (Decoder: same building blocks + causal mask and encoder-decoder attention)
```

## 1️⃣ Scaled Dot-Product Attention (starting from the paper's formula)

The core formula from the paper:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

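### Reading the formula (important)

- $Q \in \mathbb{R}^{T \times d_k}$
- $K \in \mathbb{R}^{T \times d_k}$
- $V \in \mathbb{R}^{T \times d_v}$
- $QK^T \in \mathbb{R}^{T \times T}$: the similarity between every pair of tokens
- $\sqrt{d_k}$: keeps the variance of the scores from blowing up as $d_k$ grows (a key point in the paper); without it the softmax saturates and its gradients become very small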
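
The effect of the $\sqrt{d_k}$ scaling can be checked numerically. The following is a minimal sketch, assuming the components of `q` and `k` are independent samples with mean 0 and variance 1 (the assumption the paper uses to motivate the scaling). The value of `d_k` and the sample count `n` are arbitrary choices for illustration.

```python
import math
import torch

torch.manual_seed(0)

d_k = 64
n = 10_000  # number of sampled (q, k) pairs; arbitrary

# Assumption: components of q and k are i.i.d. with mean 0 and variance 1.
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

raw = (q * k).sum(dim=-1)      # unscaled dot products: variance grows like d_k
scaled = raw / math.sqrt(d_k)  # scaled dot products: variance stays near 1

print(raw.var().item(), scaled.var().item())
```

Under these assumptions the first number should land near `d_k` and the second near 1, which is why the scores are divided by $\sqrt{d_k}$ before the softmax.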

## 2️⃣ ScaledDotProductAttention class (from-scratch implementation)

In [None]:
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, Q, K, V, mask=None):
        """
        Q, K, V: (batch, head, seq_len, d_k)
        """
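        scores = torch.matmul(Q, K.transpose(-2, -1)) # QK^T; matmul batches over (batch, head), torch.mm only accepts 2-D tensors
        scores = scores / math.sqrt(Q.size(-1)) # / sqrt(d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9) # positions where mask == 0 get -1e9, so softmax gives them ~0 weight

        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)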
        return out, attn

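```py
mask = torch.tensor([[1, 0, 1],
                     [0, 1, 0]])
scores = torch.zeros(2, 3)  # placeholder scores (not in the original), only so the snippet runs

# A comparison produces a Boolean tensor
mask == 0
# tensor([[False,  True, False],
#         [ True, False,  True]])

scores.masked_fill(mask == 0, -1e9)
# fills the positions where mask is 0 with -1e9
```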
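In a decoder, the same `masked_fill` pattern is typically combined with a lower-triangular (causal) mask, so each position attends only to itself and earlier positions. Below is a minimal sketch written as a standalone function rather than the class above. `causal_attention` is a hypothetical helper name, and the tensor sizes are arbitrary.

```python
import math
import torch

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: (batch, head, seq_len, d_k)
    """
    seq_len = Q.size(-2)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    # 1 on and below the diagonal, 0 above; broadcasts over (batch, head)
    mask = torch.tril(torch.ones(seq_len, seq_len))
    scores = scores.masked_fill(mask == 0, -1e9)
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, V), attn

torch.manual_seed(0)
Q = K = V = torch.randn(2, 4, 5, 8)
out, attn = causal_attention(Q, K, V)
print(out.shape)
print(attn[0, 0])  # weights above the diagonal are (numerically) zero
```

Each row of `attn` still sums to 1, because the softmax renormalizes over the positions that were not masked.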

## 3️⃣ Multi-Head Attention (paper definition)

Formulas from the paper:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$

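### Key points

- In the paper each head has its own $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$; stacking the $h$ heads side by side gives $W^Q, W^K, W^V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ (with $d_k = d_{\text{model}} / h$)
- The per-head projections are therefore computed with **one large matrix**
- The heads are separated afterwards with reshape & transpose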
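
The equivalence between one large projection and separate per-head projections can be checked directly. The sketch below uses arbitrary sizes. It slices out the part of an `nn.Linear` weight that belongs to one head and compares the result with the same head obtained through reshape & transpose.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model, num_heads = 2, 5, 16, 4
d_k = d_model // num_heads

W_Q = nn.Linear(d_model, d_model)
x = torch.randn(batch, seq_len, d_model)

# One big projection, then split into heads: (batch, head, seq_len, d_k)
Q = W_Q(x).view(batch, seq_len, num_heads, d_k).transpose(1, 2)

# Projection for head i alone, using only that head's slice of the parameters.
# nn.Linear stores weight as (out_features, in_features),
# so head i owns rows i*d_k : (i+1)*d_k.
i = 1
W_i = W_Q.weight[i * d_k:(i + 1) * d_k]  # (d_k, d_model)
b_i = W_Q.bias[i * d_k:(i + 1) * d_k]    # (d_k,)
Q_i = x @ W_i.T + b_i                    # (batch, seq_len, d_k)

print(torch.allclose(Q[:, i], Q_i, atol=1e-5))
```

The two results agree up to floating-point tolerance, so the single `nn.Linear(d_model, d_model)` is just the $h$ per-head matrices stored together.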

## 4️⃣ MultiHeadAttention class (really important)

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0 # assert raises AssertionError if the condition is false; d_model must split evenly across heads

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # W_Q, W_K, W_V are not per-head matrices; each one projects for all heads at once
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)

        self.attention = ScaledDotProductAttention()
        self.W_O = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0) # Q: (batch_size, seq_len, d_model)

        # 1. Linear projection
        Q = self.W_Q(Q)
        K = self.W_K(K)
        V = self.W_V(V)

        # 2. Split heads
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2) # -1 infers seq_len; result: (batch, head, seq_len, d_k)
        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # 3. Attention
        out, attn = self.attention(Q, K, V, mask)

        # 4. Concat heads
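        out = out.transpose(1, 2).contiguous() # .contiguous(): copies the data into a contiguous memory layout
        out = out.view(batch_size, -1, self.d_model) # view() needs a compatible (here: contiguous) layout
        # reshape() would also work: it makes the copy internally when a plain view is not possible
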
        # 5. Final linear
        out = self.W_O(out)
        return out
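
The comments about `.contiguous()`, `view()`, and `reshape()` in step 4 can be reproduced in isolation. A small sketch with arbitrary shapes:

```python
import torch

x = torch.arange(24).view(2, 3, 4)  # freshly created, contiguous
y = x.transpose(1, 2)               # (2, 4, 3): same storage, different strides

print(y.is_contiguous())

try:
    y.view(2, 12)                   # this layout cannot be expressed without a copy
    view_failed = False
except RuntimeError:
    view_failed = True

a = y.contiguous().view(2, 12)      # copy into contiguous memory, then view
b = y.reshape(2, 12)                # reshape makes the copy itself when needed

print(view_failed, torch.equal(a, b))
```

Calling `view()` directly on the transposed tensor raises a `RuntimeError`, while `.contiguous().view(...)` and `.reshape(...)` give the same result.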

## 5️⃣ Position-wise Feed Forward Network

Formula from the paper:
$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

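- Applied independently at every position (time step), with the same weights
- A Conv1D with kernel_size=1 also processes each position independently, so with the same weights it is mathematically equivalent to a position-wise Linear layer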
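
The Conv1D equivalence can be verified by copying the weights of an `nn.Linear` into an `nn.Conv1d` with `kernel_size=1` and comparing outputs. This is a sketch with arbitrary sizes. The only differences are the extra kernel dimension on the weight and the `(batch, channels, seq_len)` layout that `nn.Conv1d` expects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model, d_ff = 2, 5, 16, 32

linear = nn.Linear(d_model, d_ff)
conv = nn.Conv1d(d_model, d_ff, kernel_size=1)

# Copy the Linear parameters into the conv: (d_ff, d_model) -> (d_ff, d_model, 1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(batch, seq_len, d_model)
out_linear = linear(x)                              # (batch, seq_len, d_ff)
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # channels-first in, then back

print(torch.allclose(out_linear, out_conv, atol=1e-5))
```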

In [3]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
    
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

## 6️⃣ Add & Norm (Residual + LayerNorm)

In [4]:
class AddNorm(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        return self.norm(x + sublayer_out)
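
`nn.LayerNorm(d_model)` normalizes each token vector over its last dimension, independently of the other tokens and of the batch. Below is a quick check with arbitrary sizes. `sublayer_out` is a random stand-in for an attention or FFN output, and at initialization the learnable scale is 1 and the shift is 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model) * 3 + 7     # non-zero mean, non-unit variance
sublayer_out = torch.randn(2, 5, d_model)  # stand-in for a sublayer output

y = norm(x + sublayer_out)                 # residual connection, then LayerNorm

print(y.mean(dim=-1))                      # close to 0 for every token
print(y.var(dim=-1, unbiased=False))       # close to 1 for every token
```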

## 7️⃣ Encoder Block (one block from the paper)

In [5]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.addnorm1 = AddNorm(d_model)

        self.ffn = FeedForward(d_model, d_ff)
        self.addnorm2 = AddNorm(d_model)

    def forward(self, x, mask=None):
        attn_out = self.attn(x, x, x, mask)
        x = self.addnorm1(x, attn_out)

        ffn_out = self.ffn(x)
        x = self.addnorm2(x, ffn_out)
        return x

## 8️⃣ Full Encoder

In [6]:
class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, d_ff):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x

In [10]:
encoder = Encoder(num_layers=3, d_model=16, num_heads=4, d_ff=2)

In [12]:
encoder

Encoder(
  (layers): ModuleList(
    (0-2): 3 x EncoderBlock(
      (attn): MultiHeadAttention(
        (W_Q): Linear(in_features=16, out_features=16, bias=True)
        (W_K): Linear(in_features=16, out_features=16, bias=True)
        (W_V): Linear(in_features=16, out_features=16, bias=True)
        (attention): ScaledDotProductAttention()
        (W_O): Linear(in_features=16, out_features=16, bias=True)
      )
      (addnorm1): AddNorm(
        (norm): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
      )
      (ffn): FeedForward(
        (fc1): Linear(in_features=16, out_features=2, bias=True)
        (fc2): Linear(in_features=2, out_features=16, bias=True)
      )
      (addnorm2): AddNorm(
        (norm): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
)