# CH07-04: Transformer Encoder

**Duration**: 90 minutes  
**Difficulty**: ⭐⭐⭐⭐  
**Prerequisites**: CH07-01, CH07-02, CH07-03  

---

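## 📚 Learning Objectives

1. ✅ Understand the complete architecture of the Transformer Encoder
2. ✅ Implement a stack of multiple Encoder layers
3. ✅ Use an Encoder for a text classification task
4. ✅ Understand the architectural design of BERT
5. ✅ Know where Encoder-only models apply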

---

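## 📖 Table of Contents

1. [Encoder Layer Architecture Review](#1-encoder-layer)
2. [Stacking Multiple Encoder Layers](#2-multi-layer-encoder)
3. [Full Transformer Encoder Implementation](#3-full-encoder)
4. [Text Classification in Practice](#4-text-classification)
5. [BERT Architecture Explained](#5-bert-architecture)
6. [Summary and Exercises](#6-summary)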

---

## 1. Encoder Layer Architecture Review {#1-encoder-layer}

### 1.1 Components of a Single Encoder Layer

Recall from CH07-03 that one Encoder Layer consists of:

```
Input
  ↓
Multi-Head Self-Attention
  ↓
Add & Norm (Residual Connection + Layer Normalization)
  ↓
Feed-Forward Network (2-layer MLP)
  ↓
Add & Norm
  ↓
Output
```

### 1.2 Mathematical Definitions of the Key Components

**Multi-Head Attention**:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

**Feed-Forward Network**:
$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

**Layer Normalization**:
$$
\text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$
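
For reference, the symbols used above can be spelled out as follows. These are the standard definitions from the original Transformer paper; the per-head projection matrices $W_i^Q, W_i^K, W_i^V$ do not appear in the formulas above and are introduced here as the usual convention. The mean $\mu$ and variance $\sigma^2$ are taken over the $d_{\text{model}}$ features of a single token.

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V), \qquad
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

$$
\mu = \frac{1}{d_{\text{model}}}\sum_{j=1}^{d_{\text{model}}} x_j, \qquad
\sigma^2 = \frac{1}{d_{\text{model}}}\sum_{j=1}^{d_{\text{model}}} (x_j - \mu)^2
$$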

In [None]:
# Load required packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional, Tuple
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei', 'Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

print("✅ Packages loaded")

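## 2. Stacking Multiple Encoder Layers {#2-multi-layer-encoder}

### 2.1 Why Stack Multiple Layers?

**Single-layer Encoder**:
- Captures only shallow semantic relationships
- Cannot learn complex hierarchical features

**Multi-layer Encoder (N=6 in the original paper)**:
- Lower layers: tend to capture syntactic structure (part of speech, syntax)
- Middle layers: tend to capture semantic relations (synonyms, hypernyms and hyponyms)
- Upper layers: tend to capture abstract concepts (sentiment, topic)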

### 2.2 Stacking Architecture

```
Input Embeddings + Positional Encoding
           ↓
    Encoder Layer 1
           ↓
    Encoder Layer 2
           ↓
         ...
           ↓
    Encoder Layer N
           ↓
   Final Representations
```

In [None]:
class LayerNormalization:
    """Layer Normalization implementation"""
    
    def __init__(self, d_model: int, eps: float = 1e-6):
        self.gamma = np.ones(d_model)   # scale parameter
        self.beta = np.zeros(d_model)   # shift parameter
        self.eps = eps
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Args:
            x: shape (batch_size, seq_len, d_model) or (seq_len, d_model)
        
        Returns:
            Normalized output with same shape as input
        """
        # Compute mean and variance along last dimension (d_model)
        mean = np.mean(x, axis=-1, keepdims=True)
        variance = np.var(x, axis=-1, keepdims=True)
        
        # Normalize
        x_norm = (x - mean) / np.sqrt(variance + self.eps)
        
        # Scale and shift
        return self.gamma * x_norm + self.beta


# Test Layer Normalization
d_model = 512
x = np.random.randn(10, d_model)  # 10 tokens, 512 dimensions

layer_norm = LayerNormalization(d_model)
x_normalized = layer_norm.forward(x)

print(f"Raw input  - Mean: {x.mean():.4f}, Std: {x.std():.4f}")
print(f"Normalized - Mean: {x_normalized.mean():.4f}, Std: {x_normalized.std():.4f}")
print(f"Per-token mean close to 0: {np.allclose(x_normalized.mean(axis=-1), 0, atol=1e-5)}")
print(f"Per-token std close to 1: {np.allclose(x_normalized.std(axis=-1), 1, atol=1e-3)}")

In [None]:
class FeedForwardNetwork:
    """Position-wise Feed-Forward Network"""
    
    def __init__(self, d_model: int, d_ff: int):
        """
        Args:
            d_model: Model dimension (e.g., 512)
            d_ff: Hidden layer dimension (typically 4 * d_model = 2048)
        """
        # Scaled Gaussian initialization (std = 1/sqrt(fan_in)), a simplified variant of Xavier
        self.W1 = np.random.randn(d_model, d_ff) / np.sqrt(d_model)
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) / np.sqrt(d_ff)
        self.b2 = np.zeros(d_model)
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        FFN(x) = max(0, xW1 + b1)W2 + b2
        
        Args:
            x: shape (seq_len, d_model)
        
        Returns:
            output: shape (seq_len, d_model)
        """
        # First linear transformation + ReLU
        hidden = np.maximum(0, np.dot(x, self.W1) + self.b1)
        
        # Second linear transformation
        output = np.dot(hidden, self.W2) + self.b2
        
        return output


# Test the Feed-Forward Network
d_model = 512
d_ff = 2048
seq_len = 10

x = np.random.randn(seq_len, d_model)
ffn = FeedForwardNetwork(d_model, d_ff)
output = ffn.forward(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Weight parameters (biases excluded): W1={d_model * d_ff:,}, W2={d_ff * d_model:,}, total={2 * d_model * d_ff:,}")
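
The term "position-wise" means the same two-layer MLP is applied to every token independently, with no mixing across positions. The sketch below illustrates this with small, arbitrarily chosen sizes. The standalone `ffn` helper and the dimensions are assumptions made for this example, not part of the class above. It also counts the parameters with the biases included.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5  # small illustrative sizes (assumed)

# Same scaled-Gaussian initialization style as FeedForwardNetwork
W1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied along the last axis."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((seq_len, d_model))

# Apply the FFN to the whole sequence at once ...
out_all = ffn(x)
# ... and to each token separately, then stack the results.
out_each = np.stack([ffn(x[t]) for t in range(seq_len)])

position_wise = np.allclose(out_all, out_each)
print(position_wise)

# Parameter count with biases included
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)
```

Because the two results agree, any interaction between tokens in an Encoder layer has to come from the attention sublayer, not from the FFN.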

## 3. Full Transformer Encoder Implementation {#3-full-encoder}

### 3.1 Complete Single Encoder Layer

In [None]:
def scaled_dot_product_attention(
    Q: np.ndarray, 
    K: np.ndarray, 
    V: np.ndarray, 
    mask: Optional[np.ndarray] = None
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Scaled Dot-Product Attention (from CH07-03)
    
    Args:
        Q: Query matrix (seq_len, d_k)
        K: Key matrix (seq_len, d_k)
        V: Value matrix (seq_len, d_v)
        mask: Optional mask (seq_len, seq_len)
    
    Returns:
        output: Attention output (seq_len, d_v)
        attention_weights: Attention weights (seq_len, seq_len)
    """
    d_k = K.shape[-1]
    
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Apply mask if provided
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    
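    # Apply softmax (subtract the row-wise max first for numerical stability)
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)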
    
    # Compute weighted sum of values
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights


class MultiHeadAttention:
    """Multi-Head Attention implementation"""
    
    def __init__(self, d_model: int, num_heads: int):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Weight matrices for Q, K, V for all heads (combined)
        self.W_q = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        self.W_k = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        self.W_v = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        
        # Output projection
        self.W_o = np.random.randn(d_model, d_model) / np.sqrt(d_model)
    
    def split_heads(self, x: np.ndarray) -> np.ndarray:
        """
        Split the last dimension into (num_heads, d_k)
        
        Args:
            x: shape (seq_len, d_model)
        
        Returns:
            shape (num_heads, seq_len, d_k)
        """
        seq_len = x.shape[0]
        # Reshape to (seq_len, num_heads, d_k)
        x = x.reshape(seq_len, self.num_heads, self.d_k)
        # Transpose to (num_heads, seq_len, d_k)
        return x.transpose(1, 0, 2)
    
    def forward(
        self, 
        x: np.ndarray, 
        mask: Optional[np.ndarray] = None
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Args:
            x: Input (seq_len, d_model)
            mask: Optional mask (seq_len, seq_len)
        
        Returns:
            output: (seq_len, d_model)
            attention_weights: (num_heads, seq_len, seq_len)
        """
        seq_len = x.shape[0]
        
        # Linear projections
        Q = np.dot(x, self.W_q)  # (seq_len, d_model)
        K = np.dot(x, self.W_k)
        V = np.dot(x, self.W_v)
        
        # Split into multiple heads
        Q = self.split_heads(Q)  # (num_heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)
        
        # Apply attention for each head
        outputs = []
        attention_weights_all = []
        
        for i in range(self.num_heads):
            output, attn_weights = scaled_dot_product_attention(
                Q[i], K[i], V[i], mask
            )
            outputs.append(output)
            attention_weights_all.append(attn_weights)
        
        # Concatenate all heads
        # outputs: list of (seq_len, d_k) -> (seq_len, d_model)
        concat_output = np.concatenate(outputs, axis=-1)
        
        # Final linear projection
        final_output = np.dot(concat_output, self.W_o)
        
        # Stack attention weights
        attention_weights = np.stack(attention_weights_all, axis=0)
        
        return final_output, attention_weights


# Test Multi-Head Attention
d_model = 512
num_heads = 8
seq_len = 10

x = np.random.randn(seq_len, d_model)
mha = MultiHeadAttention(d_model, num_heads)
output, attn_weights = mha.forward(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"Per-head dimension d_k: {mha.d_k}")
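
The `mask` argument is accepted by the attention code but never constructed in this section. The sketch below shows one possible way to build a padding mask with the `(seq_len, seq_len)` shape and the "0 means blocked" convention that `scaled_dot_product_attention` expects. `PAD_ID`, `make_padding_mask`, and `masked_attention` are hypothetical names introduced only for this example.

```python
import numpy as np

PAD_ID = 0  # hypothetical padding token ID, chosen for this example

def make_padding_mask(token_ids: np.ndarray, pad_id: int = PAD_ID) -> np.ndarray:
    """Return a (seq_len, seq_len) mask: 1 where the key is a real token, 0 where it is padding."""
    seq_len = token_ids.shape[0]
    key_is_real = (token_ids != pad_id).astype(int)   # (seq_len,)
    return np.tile(key_is_real, (seq_len, 1))         # every query row shares the same key mask

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(mask == 0, -1e9, scores)        # block padded keys
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
token_ids = np.array([5, 17, 42, PAD_ID, PAD_ID])     # 3 real tokens, 2 padding
x = rng.standard_normal((len(token_ids), 16))

mask = make_padding_mask(token_ids)
_, weights = masked_attention(x, x, x, mask)

print(mask[0])               # [1 1 1 0 0]
print(weights[:, 3:].max())  # largest weight assigned to a padded key
print(weights.sum(axis=-1))  # each row still sums to 1
```

Rows that belong to padded query positions still produce an output under this scheme, so downstream code (pooling or a loss) would typically ignore those rows.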

In [None]:
class TransformerEncoderLayer:
    """Complete Transformer Encoder Layer"""
    
    def __init__(
        self, 
        d_model: int, 
        num_heads: int, 
        d_ff: int, 
        dropout: float = 0.1
    ):
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForwardNetwork(d_model, d_ff)
        
        self.layernorm1 = LayerNormalization(d_model)
        self.layernorm2 = LayerNormalization(d_model)
        
        self.dropout = dropout  # Stored only; this NumPy forward pass does not apply dropout
    
    def forward(
        self, 
        x: np.ndarray, 
        mask: Optional[np.ndarray] = None
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Args:
            x: Input (seq_len, d_model)
            mask: Optional padding mask (seq_len, seq_len)
        
        Returns:
            output: (seq_len, d_model)
            attention_weights: (num_heads, seq_len, seq_len)
        """
        # Multi-Head Self-Attention + Residual + LayerNorm
        attn_output, attn_weights = self.mha.forward(x, mask)
        x = self.layernorm1.forward(x + attn_output)  # Residual connection
        
        # Feed-Forward Network + Residual + LayerNorm
        ffn_output = self.ffn.forward(x)
        x = self.layernorm2.forward(x + ffn_output)
        
        return x, attn_weights


# Test a single Encoder Layer
d_model = 512
num_heads = 8
d_ff = 2048
seq_len = 10

x = np.random.randn(seq_len, d_model)
encoder_layer = TransformerEncoderLayer(d_model, num_heads, d_ff)
output, attn_weights = encoder_layer.forward(x)

print("✅ Single Transformer Encoder Layer test passed")
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")

### 3.2 Stacking Multiple Encoder Layers

In [None]:
class TransformerEncoder:
    """Complete Transformer Encoder with multiple layers"""
    
    def __init__(
        self,
        num_layers: int,
        d_model: int,
        num_heads: int,
        d_ff: int,
        vocab_size: int,
        max_seq_len: int,
        dropout: float = 0.1
    ):
        self.num_layers = num_layers
        self.d_model = d_model
        
        # Embedding layers
        self.token_embedding = np.random.randn(vocab_size, d_model) / np.sqrt(d_model)
        self.positional_encoding = self.get_positional_encoding(max_seq_len, d_model)
        
        # Stack of encoder layers
        self.encoder_layers = [
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ]
    
    @staticmethod
    def get_positional_encoding(max_seq_len: int, d_model: int) -> np.ndarray:
        """Generate positional encoding"""
        pos_encoding = np.zeros((max_seq_len, d_model))
        
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pos_encoding[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
                if i + 1 < d_model:
                    pos_encoding[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
        
        return pos_encoding
    
    def forward(
        self,
        token_ids: np.ndarray,
        mask: Optional[np.ndarray] = None
    ) -> Tuple[np.ndarray, list]:
        """
        Args:
            token_ids: Token IDs (seq_len,)
            mask: Optional padding mask (seq_len, seq_len)
        
        Returns:
            output: Final encoder output (seq_len, d_model)
            all_attention_weights: List of attention weights from each layer
        """
        seq_len = token_ids.shape[0]
        
        # Token embedding + Positional encoding
        x = self.token_embedding[token_ids]  # (seq_len, d_model)
        x = x + self.positional_encoding[:seq_len, :]  # Add positional encoding
        
        # Pass through all encoder layers
        all_attention_weights = []
        
        for encoder_layer in self.encoder_layers:
            x, attn_weights = encoder_layer.forward(x, mask)
            all_attention_weights.append(attn_weights)
        
        return x, all_attention_weights


# Test the full Transformer Encoder
num_layers = 6
d_model = 512
num_heads = 8
d_ff = 2048
vocab_size = 10000
max_seq_len = 100
seq_len = 10

# Generate random token IDs
token_ids = np.random.randint(0, vocab_size, size=seq_len)

# Build the Transformer Encoder
encoder = TransformerEncoder(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    d_ff=d_ff,
    vocab_size=vocab_size,
    max_seq_len=max_seq_len
)

# Forward pass
output, all_attn_weights = encoder.forward(token_ids)

print("✅ Full Transformer Encoder test passed")
print(f"Input token IDs: {token_ids.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of Encoder layers: {len(all_attn_weights)}")
print(f"Attention weights shape per layer: {all_attn_weights[0].shape}")

# Count parameters (the sinusoidal positional encoding is fixed and adds none)
embedding_params = vocab_size * d_model
layer_params = (
    4 * d_model * d_model +   # Q, K, V, O projections (no biases in this implementation)
    2 * d_model * d_ff +      # FFN weights W1, W2
    d_ff + d_model +          # FFN biases b1, b2
    4 * d_model               # LayerNorm gamma, beta (x2)
)
total_params = embedding_params + num_layers * layer_params

print(f"\n📊 Parameter counts:")
print(f"  Embedding: {embedding_params:,}")
print(f"  Per Encoder layer: {layer_params:,}")
print(f"  Total: {total_params:,}")
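
The loop in `get_positional_encoding` computes one angle per position and dimension pair, so it can also be written in vectorized form. The sketch below places a loop version that mirrors the method above next to a vectorized one and checks that they agree. Both function names are chosen for this example only.

```python
import numpy as np

def positional_encoding_loop(max_seq_len: int, d_model: int) -> np.ndarray:
    """Loop version, mirroring get_positional_encoding."""
    pe = np.zeros((max_seq_len, d_model))
    for pos in range(max_seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos, i] = np.sin(angle)
            if i + 1 < d_model:
                pe[pos, i + 1] = np.cos(angle)
    return pe

def positional_encoding_vectorized(max_seq_len: int, d_model: int) -> np.ndarray:
    """Vectorized version: dimensions 2k and 2k+1 share the same frequency."""
    pos = np.arange(max_seq_len)[:, np.newaxis]   # (max_seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]         # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])         # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])         # odd dimensions
    return pe

pe_loop = positional_encoding_loop(100, 512)
pe_vec = positional_encoding_vectorized(100, 512)
same = np.allclose(pe_loop, pe_vec)
print(same)
```

The vectorized form avoids the Python double loop, which matters once `max_seq_len` and `d_model` grow.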

## 4. Text Classification in Practice {#4-text-classification}

### 4.1 Task Description

Use a Transformer Encoder for sentiment classification (Positive/Negative).

**Architecture**:
```
Input Tokens
     ↓
Transformer Encoder (N layers)
     ↓
[CLS] Token Representation (or pooled output)
     ↓
Classification Head (Linear + Softmax)
     ↓
Positive / Negative
```

### 4.2 Implementation with Keras

In [None]:
try:
    from tensorflow import keras
    from tensorflow.keras import layers
    import tensorflow as tf
    
    print(f"✅ TensorFlow {tf.__version__} loaded successfully")
    
    class TransformerBlock(layers.Layer):
        """Transformer Encoder Block using Keras"""
        
        def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
            super().__init__()
            self.att = layers.MultiHeadAttention(
                num_heads=num_heads, 
                key_dim=d_model // num_heads
            )
            self.ffn = keras.Sequential([
                layers.Dense(d_ff, activation='relu'),
                layers.Dense(d_model)
            ])
            
            self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
            self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
            
            self.dropout1 = layers.Dropout(dropout)
            self.dropout2 = layers.Dropout(dropout)
        
        def call(self, inputs, training=False):
            # Multi-Head Self-Attention
            attn_output = self.att(inputs, inputs)
            attn_output = self.dropout1(attn_output, training=training)
            out1 = self.layernorm1(inputs + attn_output)
            
            # Feed-Forward Network
            ffn_output = self.ffn(out1)
            ffn_output = self.dropout2(ffn_output, training=training)
            out2 = self.layernorm2(out1 + ffn_output)
            
            return out2
    
    
    class PositionalEncoding(layers.Layer):
        """Positional Encoding Layer"""
        
        def __init__(self, max_seq_len, d_model):
            super().__init__()
            self.pos_encoding = self.get_positional_encoding(max_seq_len, d_model)
        
        def get_positional_encoding(self, max_seq_len, d_model):
            pos = np.arange(max_seq_len)[:, np.newaxis]
            i = np.arange(d_model)[np.newaxis, :]
            
            angle_rates = 1 / np.power(10000, (2 * (i // 2)) / d_model)
            angle_rads = pos * angle_rates
            
            # Apply sin to even indices
            angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
            # Apply cos to odd indices
            angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
            
            return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)
        
        def call(self, inputs):
            seq_len = tf.shape(inputs)[1]
            return inputs + self.pos_encoding[:, :seq_len, :]
    
    
    def build_transformer_classifier(
        vocab_size=10000,
        max_seq_len=100,
        d_model=128,
        num_heads=8,
        d_ff=512,
        num_layers=2,
        num_classes=2,
        dropout=0.1
    ):
        """Build Transformer-based text classifier"""
        
        # Input
        inputs = layers.Input(shape=(max_seq_len,), dtype=tf.int32)
        
        # Embedding
        x = layers.Embedding(vocab_size, d_model)(inputs)
        x = x * tf.math.sqrt(tf.cast(d_model, tf.float32))  # Scale embeddings
        
        # Positional Encoding
        x = PositionalEncoding(max_seq_len, d_model)(x)
        
        # Encoder layers
        for _ in range(num_layers):
            x = TransformerBlock(d_model, num_heads, d_ff, dropout)(x)
        
        # Global average pooling (alternative to [CLS] token)
        x = layers.GlobalAveragePooling1D()(x)
        
        # Classification head
        x = layers.Dropout(dropout)(x)
        outputs = layers.Dense(num_classes, activation='softmax')(x)
        
        model = keras.Model(inputs=inputs, outputs=outputs)
        return model
    
    
    # Build the model
    model = build_transformer_classifier(
        vocab_size=10000,
        max_seq_len=100,
        d_model=128,
        num_heads=8,
        d_ff=512,
        num_layers=2,
        num_classes=2
    )
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    print("\n✅ Transformer classifier built successfully")
    model.summary()
    
except ImportError:
    print("⚠️ TensorFlow is not installed; skipping the Keras example")
    print("   Install with: pip install tensorflow")

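### 4.3 Testing the Classifier

Use a small amount of synthetic data to check that the model runs end to end. Because the labels are random, the accuracy is expected to stay near chance level.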

In [None]:
try:
    # Generate synthetic data
    num_samples = 100
    max_seq_len = 100
    vocab_size = 10000
    
    X_train = np.random.randint(0, vocab_size, size=(num_samples, max_seq_len))
    y_train = np.random.randint(0, 2, size=num_samples)
    
    print("Training data shapes:")
    print(f"  X_train: {X_train.shape}")
    print(f"  y_train: {y_train.shape}")
    
    # Train the model (1 epoch only, as a smoke test)
    print("\nStarting training...")
    history = model.fit(
        X_train, y_train,
        batch_size=16,
        epochs=1,
        verbose=1
    )
    
    # Test prediction
    test_input = X_train[:5]
    predictions = model.predict(test_input)
    
    print("\n✅ Prediction test:")
    for i in range(5):
        pred_class = np.argmax(predictions[i])
        confidence = predictions[i][pred_class]
        print(f"  Sample {i+1}: class {pred_class}, confidence {confidence:.2%}")
    
except NameError:
    print("⚠️ Model has not been built; skipping the test")

## 5. BERT Architecture Explained {#5-bert-architecture}

### 5.1 BERT = Encoder-Only Transformer

**BERT (Bidirectional Encoder Representations from Transformers)**:
- Uses only the Encoder part of the Transformer
- 12 layers (BERT-Base) or 24 layers (BERT-Large)
- Uses special tokens: `[CLS]`, `[SEP]`, `[MASK]`

### 5.2 BERT vs. the Original Transformer Encoder

| Property | Original Transformer Encoder | BERT |
|------|-------------------------|------|
| Layers | 6 | 12 (Base) / 24 (Large) |
| Hidden size | 512 | 768 (Base) / 1024 (Large) |
| Attention heads | 8 | 12 (Base) / 16 (Large) |
| Parameters | ~65M (full encoder-decoder base model) | 110M (Base) / 340M (Large) |
| Training task | Sequence-to-sequence | MLM + NSP |
| Typical use | Machine translation | Text understanding |
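
The 110M figure for BERT-Base can be roughly reproduced by hand. The sketch below is a back-of-the-envelope count that assumes the configuration of the publicly released `bert-base-uncased` checkpoint: a 30,522-entry WordPiece vocabulary, 512 learned positions, a feed-forward size of 3072, biases on every linear layer, and a pooler layer on top of `[CLS]`. These values are assumptions of this example and are not stated elsewhere in this section.

```python
# Assumed configuration of the bert-base-uncased checkpoint
vocab_size = 30522      # WordPiece vocabulary size
max_position = 512      # learned position embeddings
type_vocab_size = 2     # segment A / segment B
d_model = 768
d_ff = 3072             # 4 * d_model
num_layers = 12

# Embeddings: token + position + segment tables, plus one LayerNorm (gamma, beta)
embedding_params = (vocab_size + max_position + type_vocab_size) * d_model + 2 * d_model

# One encoder layer, biases included
attention_params = 4 * (d_model * d_model + d_model)               # Q, K, V, O projections
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers
layernorm_params = 2 * (2 * d_model)                               # two LayerNorms
layer_params = attention_params + ffn_params + layernorm_params

# Pooler: a dense layer applied to the [CLS] vector
pooler_params = d_model * d_model + d_model

total_params = embedding_params + num_layers * layer_params + pooler_params
print(f"{total_params:,}")  # 109,482,240
```

Under these assumptions the total is about 109.5M, which rounds to the 110M quoted in the table. The embedding tables alone account for roughly a fifth of it.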

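### 5.3 BERT's Two Pre-training Tasks

**1. Masked Language Model (MLM)**:
```
Original:  The cat sat on the mat
Masked:    The [MASK] sat on the [MASK]
Predict:   cat, mat
```

BERT selects about 15% of the tokens for prediction; the example above masks more than that purely for illustration.

**2. Next Sentence Prediction (NSP)**:
```
Sentence A: I love NLP.
Sentence B: It is very interesting.
Label: IsNext (1) or NotNext (0)
```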

### 5.4 BERT's Input Representation

```
Input = Token Embeddings + Segment Embeddings + Position Embeddings
```

**Example**:
```
Token:     [CLS]  I      love   NLP    [SEP]  It     is     great  [SEP]
Segment:   A      A      A      A      A      B      B      B      B
Position:  0      1      2      3      4      5      6      7      8
```
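
The three-way sum can be sketched with plain lookup tables. The example below builds the token, segment, and position IDs for a sentence pair and adds the three embeddings element-wise. The toy vocabulary, the tiny hidden size, the maximum length of 16, and the `build_inputs` helper are all assumptions made for illustration; real BERT uses a WordPiece tokenizer and learned tables.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # tiny hidden size for illustration (BERT-Base uses 768)

# Hypothetical toy vocabulary; real BERT uses a WordPiece vocabulary
vocab = {'[CLS]': 0, '[SEP]': 1, 'I': 2, 'love': 3, 'NLP': 4, 'It': 5, 'is': 6, 'great': 7}

def build_inputs(sentence_a, sentence_b):
    """Return tokens, segment IDs, and position IDs for a sentence pair."""
    tokens = ['[CLS]'] + sentence_a + ['[SEP]'] + sentence_b + ['[SEP]']
    # Segment A covers [CLS], sentence A, and the first [SEP]; segment B covers the rest
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_inputs(['I', 'love', 'NLP'], ['It', 'is', 'great'])
token_ids = [vocab[t] for t in tokens]

# Three lookup tables (randomly initialized here, learned in a real model)
token_table = rng.standard_normal((len(vocab), d_model))
segment_table = rng.standard_normal((2, d_model))
position_table = rng.standard_normal((16, d_model))  # 16 = assumed maximum length

# Element-wise sum of the three embeddings gives the encoder input
input_embeddings = token_table[token_ids] + segment_table[segment_ids] + position_table[position_ids]

print(segment_ids)             # [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(input_embeddings.shape)  # (9, 8)
```

In the actual model, the summed embeddings also pass through Layer Normalization and dropout before reaching the first Encoder layer.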

In [None]:
# Visualize the BERT input representation
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

tokens = ['[CLS]', 'I', 'love', 'NLP', '[SEP]', 'It', 'is', 'great', '[SEP]']
segment_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]
position_ids = list(range(len(tokens)))

# Token Embeddings
token_embeds = np.random.randn(len(tokens), 8)
sns.heatmap(token_embeds.T, annot=False, cmap='RdBu_r', 
            xticklabels=tokens, yticklabels=False, ax=axes[0], cbar=True)
axes[0].set_title('Token Embeddings', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Tokens')

# Segment Embeddings
segment_embeds = np.array([segment_ids] * 8)
sns.heatmap(segment_embeds, annot=False, cmap='coolwarm', 
            xticklabels=tokens, yticklabels=False, ax=axes[1], cbar=True)
axes[1].set_title('Segment Embeddings', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Tokens')

# Position Embeddings
pos_embeds = np.array([position_ids] * 8)
sns.heatmap(pos_embeds, annot=False, cmap='viridis', 
            xticklabels=tokens, yticklabels=False, ax=axes[2], cbar=True)
axes[2].set_title('Position Embeddings', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Tokens')

plt.tight_layout()
plt.show()

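print("\nNotes:")
print("• Token Embeddings: a semantic vector for each token")
print("• Segment Embeddings: distinguish the two sentences (sentence A=0, sentence B=1; the plot shows the raw IDs)")
print("• Position Embeddings: mark each token's position in the sequence (the plot shows the raw IDs)")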

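### 5.5 Fine-tuning BERT for Downstream Tasks

**Common applications**:

1. **Text classification**: use the output of the `[CLS]` token
2. **Named entity recognition (NER)**: use the output of every token
3. **Question answering**: predict the start and end positions of the answer
4. **Text similarity**: compare the `[CLS]` vectors of two sentences

**Fine-tuning architecture**:
```
BERT Encoder (pre-trained)
      ↓
[CLS] representation
      ↓
Task-specific layer (classifier/regressor)
      ↓
    Output
```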
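
The task-specific layer in the diagram is often just one linear layer followed by a softmax. The sketch below shows that head in NumPy. The random `encoder_output` stands in for the output of a pre-trained encoder, and the sizes and weight scale are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_classes = 9, 768, 2  # assumed sizes (BERT-Base hidden size, binary task)

# Stand-in for the encoder output; in practice this comes from the pre-trained model
encoder_output = rng.standard_normal((seq_len, d_model))

# Task-specific head: one linear layer, randomly initialized and trained during fine-tuning
W = rng.standard_normal((d_model, num_classes)) * 0.02
b = np.zeros(num_classes)

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

cls_vector = encoder_output[0]       # [CLS] is the first token
probs = softmax(cls_vector @ W + b)  # class probabilities

print(probs.shape)  # (2,)
```

For token-level tasks such as NER, the same kind of head would be applied to every row of `encoder_output` rather than only to row 0.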

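## 6. Summary and Exercises {#6-summary}

### 6.1 Key Takeaways

✅ **Architecture**:
- Transformer Encoder = a stack of Encoder Layers
- Each layer contains: Multi-Head Attention + FFN + Residual + LayerNorm

✅ **Implementation skills**:
- Implement LayerNormalization and FeedForwardNetwork from scratch
- Assemble a complete TransformerEncoderLayer
- Stack multiple Encoder layers into a full model

✅ **Applications**:
- Text classification: use Global Pooling or the [CLS] token
- BERT: the representative Encoder-only architecture

✅ **BERT highlights**:
- Pre-trained with MLM + NSP
- Three embeddings: Token + Segment + Position
- Can be fine-tuned for a wide range of NLP tasks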

### 6.2 Key Parameter Comparison

| Model | Layers | d_model | num_heads | Parameters |
|------|------|---------|-----------|--------|
| Transformer-Base | 6 | 512 | 8 | ~65M (encoder + decoder) |
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
| GPT-2 | 12-48 | 768-1600 | 12-25 | 117M-1.5B |

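### 6.3 Hands-on Exercises

#### Exercise 1: Visualize Layer-wise Features

**Task**: Extract and visualize the attention weights of every Encoder layer, and observe the patterns learned at different layers.

**Hint**:
```python
# Use all_attn_weights[layer_idx][head_idx] (the list returned by encoder.forward)
# Plot a heatmap to inspect the attention distribution
```

#### Exercise 2: Implement the [CLS] Token Mechanism

**Task**: Modify TransformerEncoder to prepend a `[CLS]` token to the input sequence, and classify using only the `[CLS]` output instead of Global Pooling.

**Hint**:
```python
# Insert CLS_TOKEN_ID at the front of token_ids
# Take output[0, :] (the first token) as the final representation
```

#### Exercise 3: Implement a Masked Language Model

**Task**: Implement a simple MLM training setup: randomly mask 15% of the tokens and train the model to predict the masked words.

**Hint**:
```python
def create_masked_lm_data(tokens, mask_prob=0.15):
    # Randomly select tokens to mask
    # Return masked_tokens, original_tokens, mask_positions
    pass
```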

### 6.4 Further Reading

1. **Papers**:
   - [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - the original Transformer
   - [BERT: Pre-training of Deep Bidirectional Transformers](https://arxiv.org/abs/1810.04805)

2. **Implementation resources**:
   - [Hugging Face Transformers](https://huggingface.co/docs/transformers)
   - [Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

3. **Visualization tools**:
   - [BertViz](https://github.com/jessevig/bertviz) - BERT attention visualization
   - [Tensor2Tensor Visualization](https://github.com/tensorflow/tensor2tensor)

---

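## 🎯 Next Section Preview

**CH07-05: Transformer Decoder**
- How the Decoder architecture differs from the Encoder
- Implementing Masked Self-Attention
- The Cross-Attention mechanism
- The complete Encoder-Decoder architecture
- Machine translation in practice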

---

**Completed on**: `____ / ____ / ____` (year / month / day)  
**Notes**: ___________________________________