# CH07-05: Transformer Decoder

**Duration**: 90 minutes  
**Difficulty**: ⭐⭐⭐⭐⭐  
**Prerequisites**: CH07-01, CH07-02, CH07-03, CH07-04  

---

## 📚 Learning Objectives

1. ✅ Understand how the Decoder differs from the Encoder
2. ✅ Implement Masked Self-Attention (Look-Ahead Mask)
3. ✅ Implement Cross-Attention (Encoder-Decoder Attention)
4. ✅ Assemble the complete Encoder-Decoder architecture
5. ✅ Understand autoregressive generation

---

## 📖 Table of Contents

1. [Decoder Architecture Overview](#1-decoder-overview)
2. [Implementing Masked Self-Attention](#2-masked-attention)
3. [Implementing Cross-Attention](#3-cross-attention)
4. [Implementing the Complete Decoder Layer](#4-decoder-layer)
5. [Combining Encoder and Decoder](#5-encoder-decoder)
6. [Autoregressive Generation and Inference](#6-generation)
7. [Summary and Exercises](#7-summary)

---

## 1. Decoder Architecture Overview {#1-decoder-overview}

### 1.1 Key Differences Between Decoder and Encoder

| Property | Encoder | Decoder |
|------|---------|----------|
| Self-Attention | Bidirectional | Unidirectional |
| Masking | Padding Mask | Look-Ahead Mask + Padding Mask |
| Cross-Attention | ❌ None | ✅ Yes (attends to the Encoder output) |
| Number of sub-layers | 2 (MHA + FFN) | 3 (Masked MHA + Cross MHA + FFN) |
| Typical use | Text understanding | Text generation |

### 1.2 Decoder Layer Architecture

```
Input (Target Sequence)
         ↓
Masked Multi-Head Self-Attention  (cannot see future tokens)
         ↓
    Add & Norm
         ↓
Cross-Attention (attends to the Encoder output)
         ↓
    Add & Norm
         ↓
Feed-Forward Network
         ↓
    Add & Norm
         ↓
       Output
```

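### 1.3 Why Is Masked Self-Attention Needed?

**Problem**: In a generation task, the model must not see future tokens. During training the whole target sequence is fed in at once, so without a mask each position could simply read off the token it is supposed to predict.

**Example (machine translation)**:
```
Source (Encoder):  I  love  NLP
Target (Decoder):  我  愛   自然語言處理
```

When predicting 「愛」:
- ✅ Visible: 「我」 (an already generated token)
- ❌ Hidden: 「自然語言處理」 (a future token)

The **Look-Ahead Mask** ensures that each position can attend only to itself and to earlier tokens.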

In [None]:
# Load required packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional, Tuple
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
# Prefer CJK-capable fonts so the Chinese token labels in later plots render correctly
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei', 'Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

print("✅ Packages loaded")

## 2. Implementing Masked Self-Attention {#2-masked-attention}

### 2.1 Mathematical Definition of the Look-Ahead Mask

For a sequence of length $n$, the Look-Ahead Mask is an $n \times n$ lower-triangular matrix, where $i$ indexes the query position and $j$ the key position:

$$
\text{Mask}_{ij} = \begin{cases}
1 & \text{if } i \geq j \\
0 & \text{if } i < j
\end{cases}
$$

**Example (sequence length = 5)**:
```
     t0  t1  t2  t3  t4
t0 [  1   0   0   0   0 ]  ← position 0 sees only itself
t1 [  1   1   0   0   0 ]  ← position 1 sees 0, 1
t2 [  1   1   1   0   0 ]  ← position 2 sees 0, 1, 2
t3 [  1   1   1   1   0 ]  ← position 3 sees 0, 1, 2, 3
t4 [  1   1   1   1   1 ]  ← position 4 sees everything
```

In [None]:
def create_look_ahead_mask(seq_len: int) -> np.ndarray:
    """
    Create look-ahead mask for decoder self-attention
    
    Args:
        seq_len: Sequence length
    
    Returns:
        mask: Lower triangular matrix (seq_len, seq_len)
              1 = can attend, 0 = cannot attend
    """
    mask = np.tril(np.ones((seq_len, seq_len)))
    return mask


# Visualize the Look-Ahead Mask
seq_len = 8
look_ahead_mask = create_look_ahead_mask(seq_len)

plt.figure(figsize=(8, 6))
sns.heatmap(
    look_ahead_mask, 
    annot=True, 
    fmt='.0f', 
    cmap='YlGnBu',
    cbar_kws={'label': 'Can Attend'},
    xticklabels=[f't{i}' for i in range(seq_len)],
    yticklabels=[f't{i}' for i in range(seq_len)]
)
plt.title('Look-Ahead Mask (Decoder Self-Attention)', fontsize=14, fontweight='bold')
plt.xlabel('Key Position (position being attended to)')
plt.ylabel('Query Position (current position)')
plt.tight_layout()
plt.show()

print("Notes:")
print("• 1 (dark blue): can attend")
print("• 0 (light yellow): cannot attend (masked out)")
print("• The lower-triangular matrix ensures each position sees only itself and earlier tokens")
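
The comparison table in Section 1.1 lists the Decoder's masking as "Look-Ahead Mask + Padding Mask", but only the look-ahead part is built above. A minimal sketch of one way to combine the two follows. It assumes a hypothetical padding token ID of `0`, and the helper name `create_decoder_self_attention_mask` is introduced here for illustration only; neither comes from the course code.

```python
import numpy as np

def create_decoder_self_attention_mask(token_ids: np.ndarray, pad_id: int = 0) -> np.ndarray:
    """Combine a look-ahead mask with a key-side padding mask.

    Returns a (seq_len, seq_len) matrix: 1 = can attend, 0 = cannot attend.
    `pad_id` is a hypothetical padding token ID chosen for this sketch.
    """
    seq_len = token_ids.shape[0]
    look_ahead = np.tril(np.ones((seq_len, seq_len)))
    key_is_real = (token_ids != pad_id).astype(np.float64)  # shape (seq_len,)
    # Broadcasting over rows zeroes out every column that belongs to a padded key.
    return look_ahead * key_is_real[np.newaxis, :]

# Three real tokens followed by two padding tokens
padded_target_ids = np.array([5, 8, 3, 0, 0])
combined_mask = create_decoder_self_attention_mask(padded_target_ids)
print(combined_mask)
```

Only the key columns are masked in this sketch, so the rows of padded query positions still contain at least one `1`; that keeps the softmax well defined, and the outputs at padded positions would normally be ignored by the loss.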

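### 2.2 Implementing Masked Self-Attention

Compared with the Encoder's Self-Attention, the only difference is the added Look-Ahead Mask, which is applied to the attention scores before the softmax.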

In [None]:
def scaled_dot_product_attention(
    Q: np.ndarray, 
    K: np.ndarray, 
    V: np.ndarray, 
    mask: Optional[np.ndarray] = None
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Scaled Dot-Product Attention with optional masking
    
    Args:
        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)
        mask: Optional mask (seq_len_q, seq_len_k), 1=attend, 0=mask
    
    Returns:
        output: Attention output (seq_len_q, d_v)
        attention_weights: Attention weights (seq_len_q, seq_len_k)
    """
    d_k = K.shape[-1]
    
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Apply mask: set masked positions to large negative value
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    
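    # Apply softmax (subtract the row-wise max first for numerical stability)
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)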
    
    # Compute weighted sum of values
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights


# Test Masked Self-Attention
seq_len = 5
d_k = 64

# Randomly generate one matrix and reuse it as Q, K, and V
# (a simplification: a real layer applies separate learned projections)
Q = K = V = np.random.randn(seq_len, d_k)

# Build the Look-Ahead Mask
mask = create_look_ahead_mask(seq_len)

# Compute attention (with mask vs. without mask)
output_masked, attn_masked = scaled_dot_product_attention(Q, K, V, mask)
output_unmasked, attn_unmasked = scaled_dot_product_attention(Q, K, V, None)

# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.heatmap(attn_unmasked, annot=True, fmt='.2f', cmap='YlOrRd', 
            xticklabels=[f't{i}' for i in range(seq_len)],
            yticklabels=[f't{i}' for i in range(seq_len)],
            ax=axes[0], cbar_kws={'label': 'Attention Weight'})
axes[0].set_title('Encoder Self-Attention (no mask)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Key Position')
axes[0].set_ylabel('Query Position')

sns.heatmap(attn_masked, annot=True, fmt='.2f', cmap='YlOrRd', 
            xticklabels=[f't{i}' for i in range(seq_len)],
            yticklabels=[f't{i}' for i in range(seq_len)],
            ax=axes[1], cbar_kws={'label': 'Attention Weight'})
axes[1].set_title('Decoder Masked Self-Attention (with mask)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Key Position')
axes[1].set_ylabel('Query Position')

plt.tight_layout()
plt.show()

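print("\nObservations:")
print("• Left: the Encoder can see every position (bidirectional attention)")
print("• Right: the Decoder can see only the current and earlier positions (unidirectional attention)")
print("• Attention weights in the upper-right triangle are 0 (masked out)")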
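
One way to convince yourself that the mask really blocks information from the future is to change a later token and check that the outputs at earlier positions stay the same. The sketch below does this with a small self-contained attention function; the helper name `masked_attention` and the sizes are chosen for illustration and are not part of the course code.

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with a 1/0 mask and a stable softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(mask == 0, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, dim = 5, 16
x = rng.standard_normal((n_tokens, dim))
causal_mask = np.tril(np.ones((n_tokens, n_tokens)))

out_original, weights_original = masked_attention(x, x, x, causal_mask)

# Perturb only the last token, then recompute.
x_perturbed = x.copy()
x_perturbed[-1] += 10.0
out_perturbed, _ = masked_attention(x_perturbed, x_perturbed, x_perturbed, causal_mask)

earlier_unchanged = np.allclose(out_original[:-1], out_perturbed[:-1])
last_changed = not np.allclose(out_original[-1], out_perturbed[-1])
print("Earlier positions unchanged:", earlier_unchanged)
print("Last position changed:", last_changed)
```

Replacing `causal_mask` with an all-ones matrix makes the first check fail, which is the leakage the Look-Ahead Mask is meant to prevent.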

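## 3. Implementing Cross-Attention {#3-cross-attention}

### 3.1 What Cross-Attention Does

**Definition**: The Decoder computes attention over the Encoder's output.

**Mathematical form**:
$$
\text{CrossAttention}(Q_{\text{decoder}}, K_{\text{encoder}}, V_{\text{encoder}}) = \text{softmax}\left(\frac{Q_{\text{decoder}} K_{\text{encoder}}^{\top}}{\sqrt{d_k}}\right) V_{\text{encoder}}
$$

**Key difference**:
- **Self-Attention**: $Q, K, V$ all come from the same sequence
- **Cross-Attention**: $Q$ comes from the Decoder, while $K, V$ come from the Encoder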

### 3.2 Cross-Attention in Machine Translation

```
Encoder input:   I     love   NLP    (Source)
                 ↓      ↓      ↓
Encoder output: [e1]   [e2]   [e3]   ← used as K, V
                  ↑      ↑      ↑
                  └──────┴──────┘
                         ↓
Decoder query:  [我]   [愛]   [?]    ← used as Q
                         ↓
Cross-Attention scores each target token against the source tokens
```

**Example**: when generating 「愛」
- Query: the representation vector of 「愛」
- Keys: the representation vectors of ["I", "love", "NLP"]
- Attention scores: in a well-trained model, 「愛」 attends strongly to "love"
In [None]:
# Simulated Cross-Attention example
def cross_attention_demo():
    """
    Demo: Visualize cross-attention between encoder and decoder
    """
    # Sequence lengths
    encoder_seq_len = 4  # Source: "I love NLP ."
    decoder_seq_len = 5  # Target: "我 愛 自然 語言 處理"
    d_k = 64
    
    # Simulated Encoder output (used as K, V)
    encoder_output = np.random.randn(encoder_seq_len, d_k)
    
    # Simulated current Decoder state (used as Q)
    decoder_query = np.random.randn(decoder_seq_len, d_k)
    
    # Cross-Attention: Q from decoder, K,V from encoder
    # Note: K and V have the encoder's sequence length
    K = V = encoder_output
    Q = decoder_query
    
    # Compute cross-attention scores, then a numerically stable softmax
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    
    # Visualize
    source_tokens = ['I', 'love', 'NLP', '.']
    target_tokens = ['我', '愛', '自然', '語言', '處理']
    
    plt.figure(figsize=(10, 7))
    sns.heatmap(
        attention_weights, 
        annot=True, 
        fmt='.3f', 
        cmap='RdYlGn',
        xticklabels=source_tokens,
        yticklabels=target_tokens,
        cbar_kws={'label': 'Attention Weight'}
    )
    plt.title('Cross-Attention: Decoder → Encoder', fontsize=14, fontweight='bold')
    plt.xlabel('Source (Encoder Keys/Values)', fontsize=12)
    plt.ylabel('Target (Decoder Queries)', fontsize=12)
    plt.tight_layout()
    plt.show()
    
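    print("Notes:")
    print("• Each row: one Decoder token's attention distribution over all Encoder tokens")
    print("• Example: the '愛' row shows its attention over ['I', 'love', 'NLP', '.']")
    print("• In a trained model '愛' would attend strongly to 'love'; the vectors here are random, so this plot shows no real alignment")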

cross_attention_demo()

### 3.3 Cross-Attention Implementation

In [None]:
def cross_attention(
    Q_decoder: np.ndarray,
    encoder_output: np.ndarray,
    mask: Optional[np.ndarray] = None
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Cross-Attention between decoder and encoder
    
    Args:
        Q_decoder: Decoder queries (decoder_seq_len, d_k)
        encoder_output: Encoder output as K,V (encoder_seq_len, d_k)
        mask: Optional padding mask for encoder (decoder_seq_len, encoder_seq_len)
    
    Returns:
        output: Cross-attention output (decoder_seq_len, d_k)
        attention_weights: (decoder_seq_len, encoder_seq_len)
    """
    # Encoder output serves as both K and V
    K = V = encoder_output
    Q = Q_decoder
    
    d_k = K.shape[-1]
    
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Apply padding mask if provided
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    
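    # Softmax (subtract the row-wise max first for numerical stability)
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)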
    
    # Weighted sum
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights


# Test Cross-Attention
encoder_seq_len = 6
decoder_seq_len = 4
d_k = 64

encoder_output = np.random.randn(encoder_seq_len, d_k)
decoder_query = np.random.randn(decoder_seq_len, d_k)

output, attn_weights = cross_attention(decoder_query, encoder_output)

print("✅ Cross-Attention test passed")
print(f"Decoder Query shape: {decoder_query.shape}")
print(f"Encoder Output shape: {encoder_output.shape}")
print(f"Cross-Attention output shape: {output.shape}")
print(f"Attention Weights shape: {attn_weights.shape}")
print("\nAttention weights for each decoder token should sum to 1.0:")
print(f"  {np.sum(attn_weights, axis=1)}")
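
The `mask` argument of `cross_attention` is described as a padding mask of shape `(decoder_seq_len, encoder_seq_len)`, but the test above never passes one. The sketch below shows one plausible way to build such a mask from source token IDs. The padding ID `0`, the helper name `create_cross_attention_mask`, and the `demo_*` variables are assumptions made for this illustration.

```python
import numpy as np

def create_cross_attention_mask(source_ids: np.ndarray, decoder_seq_len: int,
                                pad_id: int = 0) -> np.ndarray:
    """Return a (decoder_seq_len, encoder_seq_len) mask: 1 = real source token, 0 = padding."""
    source_is_real = (source_ids != pad_id).astype(np.float64)  # (encoder_seq_len,)
    # Every decoder position shares the same view of which source tokens are real.
    return np.tile(source_is_real, (decoder_seq_len, 1))

rng = np.random.default_rng(42)
demo_d_k = 8
demo_source_ids = np.array([7, 4, 9, 0, 0, 0])  # last three entries are padding
demo_encoder_output = rng.standard_normal((len(demo_source_ids), demo_d_k))
demo_decoder_query = rng.standard_normal((4, demo_d_k))

cross_mask = create_cross_attention_mask(demo_source_ids, decoder_seq_len=4)

# Same computation as cross_attention above, written out so the block runs on its own.
demo_scores = demo_decoder_query @ demo_encoder_output.T / np.sqrt(demo_d_k)
demo_scores = np.where(cross_mask == 0, -1e9, demo_scores)
demo_scores = demo_scores - demo_scores.max(axis=-1, keepdims=True)
cross_weights = np.exp(demo_scores)
cross_weights = cross_weights / cross_weights.sum(axis=-1, keepdims=True)
print(np.round(cross_weights, 3))
```

Inside the notebook the same mask could be passed directly, as in `cross_attention(demo_decoder_query, demo_encoder_output, mask=cross_mask)`; the padded source columns then receive zero attention weight.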

## 4. Implementing the Complete Decoder Layer {#4-decoder-layer}

### 4.1 The Three Sub-Layers of a Decoder Layer

```
Input
  ↓
① Masked Multi-Head Self-Attention
  ↓
Add & Norm
  ↓
② Cross-Attention (with the Encoder output)
  ↓
Add & Norm
  ↓
③ Feed-Forward Network
  ↓
Add & Norm
  ↓
Output
```

In [None]:
# Helper components (copied from CH07-04)

class LayerNormalization:
    """Layer Normalization"""
    
    def __init__(self, d_model: int, eps: float = 1e-6):
        self.gamma = np.ones(d_model)
        self.beta = np.zeros(d_model)
        self.eps = eps
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        mean = np.mean(x, axis=-1, keepdims=True)
        variance = np.var(x, axis=-1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(variance + self.eps)
        return self.gamma * x_norm + self.beta


class FeedForwardNetwork:
    """Position-wise Feed-Forward Network"""
    
    def __init__(self, d_model: int, d_ff: int):
        self.W1 = np.random.randn(d_model, d_ff) / np.sqrt(d_model)
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) / np.sqrt(d_ff)
        self.b2 = np.zeros(d_model)
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        hidden = np.maximum(0, np.dot(x, self.W1) + self.b1)
        output = np.dot(hidden, self.W2) + self.b2
        return output


class MultiHeadAttention:
    """Multi-Head Attention (usable for both Self-Attention and Cross-Attention)"""
    
    def __init__(self, d_model: int, num_heads: int):
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        self.W_k = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        self.W_v = np.random.randn(d_model, d_model) / np.sqrt(d_model)
        self.W_o = np.random.randn(d_model, d_model) / np.sqrt(d_model)
    
    def split_heads(self, x: np.ndarray) -> np.ndarray:
        seq_len = x.shape[0]
        x = x.reshape(seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 0, 2)
    
    def forward(
        self, 
        Q_input: np.ndarray,
        K_input: np.ndarray,
        V_input: np.ndarray,
        mask: Optional[np.ndarray] = None
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Args:
            Q_input: Query input (seq_len_q, d_model)
            K_input: Key input (seq_len_k, d_model)
            V_input: Value input (seq_len_v, d_model)
            mask: Optional mask (seq_len_q, seq_len_k)
        """
        Q = np.dot(Q_input, self.W_q)
        K = np.dot(K_input, self.W_k)
        V = np.dot(V_input, self.W_v)
        
        Q = self.split_heads(Q)
        K = self.split_heads(K)
        V = self.split_heads(V)
        
        outputs = []
        attention_weights_all = []
        
        for i in range(self.num_heads):
            output, attn_weights = scaled_dot_product_attention(
                Q[i], K[i], V[i], mask
            )
            outputs.append(output)
            attention_weights_all.append(attn_weights)
        
        concat_output = np.concatenate(outputs, axis=-1)
        final_output = np.dot(concat_output, self.W_o)
        attention_weights = np.stack(attention_weights_all, axis=0)
        
        return final_output, attention_weights

print("✅ Helper components loaded")

In [None]:
class TransformerDecoderLayer:
    """Complete Transformer Decoder Layer"""
    
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        # ① Masked Multi-Head Self-Attention
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        
        # ② Cross-Attention
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        
        # ③ Feed-Forward Network
        self.ffn = FeedForwardNetwork(d_model, d_ff)
        
        # Layer Normalization (one per sub-layer, 3 in total)
        self.layernorm1 = LayerNormalization(d_model)
        self.layernorm2 = LayerNormalization(d_model)
        self.layernorm3 = LayerNormalization(d_model)
        
        # Stored for reference only: dropout is not applied in this NumPy forward pass
        self.dropout = dropout
    
    def forward(
        self,
        x: np.ndarray,
        encoder_output: np.ndarray,
        look_ahead_mask: Optional[np.ndarray] = None,
        padding_mask: Optional[np.ndarray] = None
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """
        Args:
            x: Decoder input (decoder_seq_len, d_model)
            encoder_output: Encoder output (encoder_seq_len, d_model)
            look_ahead_mask: Mask for self-attention (decoder_seq_len, decoder_seq_len)
            padding_mask: Mask for cross-attention (decoder_seq_len, encoder_seq_len)
        
        Returns:
            output: (decoder_seq_len, d_model)
            self_attn_weights: (num_heads, decoder_seq_len, decoder_seq_len)
            cross_attn_weights: (num_heads, decoder_seq_len, encoder_seq_len)
        """
        # ① Masked Multi-Head Self-Attention
        self_attn_output, self_attn_weights = self.self_attn.forward(
            x, x, x, look_ahead_mask
        )
        x = self.layernorm1.forward(x + self_attn_output)
        
        # ② Cross-Attention
        cross_attn_output, cross_attn_weights = self.cross_attn.forward(
            Q_input=x,  # Query from decoder
            K_input=encoder_output,  # Key from encoder
            V_input=encoder_output,  # Value from encoder
            mask=padding_mask
        )
        x = self.layernorm2.forward(x + cross_attn_output)
        
        # ③ Feed-Forward Network
        ffn_output = self.ffn.forward(x)
        x = self.layernorm3.forward(x + ffn_output)
        
        return x, self_attn_weights, cross_attn_weights


# Test the Decoder Layer
d_model = 512
num_heads = 8
d_ff = 2048
encoder_seq_len = 10
decoder_seq_len = 8

# Simulated inputs
decoder_input = np.random.randn(decoder_seq_len, d_model)
encoder_output = np.random.randn(encoder_seq_len, d_model)
look_ahead_mask = create_look_ahead_mask(decoder_seq_len)

# Build and test the Decoder Layer
decoder_layer = TransformerDecoderLayer(d_model, num_heads, d_ff)
output, self_attn, cross_attn = decoder_layer.forward(
    decoder_input, encoder_output, look_ahead_mask
)

print("✅ Transformer Decoder Layer test passed")
print(f"Decoder input: {decoder_input.shape}")
print(f"Encoder output: {encoder_output.shape}")
print(f"Decoder output: {output.shape}")
print(f"Self-Attention weights: {self_attn.shape}")
print(f"Cross-Attention weights: {cross_attn.shape}")

## 5. Combining Encoder and Decoder {#5-encoder-decoder}

### 5.1 The Complete Transformer Architecture

```
Source Input (Encoder)
         ↓
    Embedding + PE
         ↓
    Encoder Layer 1
         ↓
         ...
         ↓
    Encoder Layer N
         ↓
    Encoder Output ──────────────┐
                                 │
Target Input (Decoder)           │
         ↓                       │
    Embedding + PE               │
         ↓                       │
    Decoder Layer 1 ←────────────┤ (Cross-Attention)
         ↓                       │
         ...                     │
         ↓                       │
    Decoder Layer N ←────────────┘
         ↓
    Linear + Softmax
         ↓
    Output Probabilities
```

In [None]:
class Transformer:
    """Complete Transformer with Encoder and Decoder"""
    
    def __init__(
        self,
        num_encoder_layers: int,
        num_decoder_layers: int,
        d_model: int,
        num_heads: int,
        d_ff: int,
        source_vocab_size: int,
        target_vocab_size: int,
        max_seq_len: int,
        dropout: float = 0.1
    ):
        self.d_model = d_model
        
        # Embeddings
        self.source_embedding = np.random.randn(source_vocab_size, d_model) / np.sqrt(d_model)
        self.target_embedding = np.random.randn(target_vocab_size, d_model) / np.sqrt(d_model)
        self.positional_encoding = self.get_positional_encoding(max_seq_len, d_model)
        
        # Encoder stack
        self.encoder_layers = [
            self._create_encoder_layer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_encoder_layers)
        ]
        
        # Decoder stack
        self.decoder_layers = [
            TransformerDecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_decoder_layers)
        ]
        
        # Output projection
        self.output_projection = np.random.randn(d_model, target_vocab_size) / np.sqrt(d_model)
    
    @staticmethod
    def get_positional_encoding(max_seq_len: int, d_model: int) -> np.ndarray:
        pos_encoding = np.zeros((max_seq_len, d_model))
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pos_encoding[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
                if i + 1 < d_model:
                    pos_encoding[pos, i + 1] = np.cos(pos / (10000 ** (i / d_model)))
        return pos_encoding
    
    def _create_encoder_layer(self, d_model, num_heads, d_ff, dropout):
        """Simplified Encoder Layer (no Cross-Attention)"""
        class EncoderLayer:
            def __init__(self):
                self.self_attn = MultiHeadAttention(d_model, num_heads)
                self.ffn = FeedForwardNetwork(d_model, d_ff)
                self.layernorm1 = LayerNormalization(d_model)
                self.layernorm2 = LayerNormalization(d_model)
            
            def forward(self, x, mask=None):
                attn_output, _ = self.self_attn.forward(x, x, x, mask)
                x = self.layernorm1.forward(x + attn_output)
                ffn_output = self.ffn.forward(x)
                x = self.layernorm2.forward(x + ffn_output)
                return x
        
        return EncoderLayer()
    
    def encode(self, source_ids: np.ndarray) -> np.ndarray:
        """Encode source sequence"""
        seq_len = source_ids.shape[0]
        
        # Embedding + Positional Encoding
        x = self.source_embedding[source_ids]
        x = x + self.positional_encoding[:seq_len, :]
        
        # Pass through encoder layers
        for encoder_layer in self.encoder_layers:
            x = encoder_layer.forward(x)
        
        return x
    
    def decode(
        self,
        target_ids: np.ndarray,
        encoder_output: np.ndarray,
        look_ahead_mask: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Decode target sequence"""
        seq_len = target_ids.shape[0]
        
        # Embedding + Positional Encoding
        x = self.target_embedding[target_ids]
        x = x + self.positional_encoding[:seq_len, :]
        
        # Pass through decoder layers
        for decoder_layer in self.decoder_layers:
            x, _, _ = decoder_layer.forward(x, encoder_output, look_ahead_mask)
        
        return x
    
    def forward(
        self,
        source_ids: np.ndarray,
        target_ids: np.ndarray
    ) -> np.ndarray:
        """
        Complete forward pass
        
        Args:
            source_ids: Source token IDs (source_seq_len,)
            target_ids: Target token IDs (target_seq_len,)
        
        Returns:
            logits: Output logits (target_seq_len, target_vocab_size)
        """
        # Encode
        encoder_output = self.encode(source_ids)
        
        # Create look-ahead mask for decoder
        target_seq_len = target_ids.shape[0]
        look_ahead_mask = create_look_ahead_mask(target_seq_len)
        
        # Decode
        decoder_output = self.decode(target_ids, encoder_output, look_ahead_mask)
        
        # Project to vocabulary
        logits = np.dot(decoder_output, self.output_projection)
        
        return logits


# Test the complete Transformer
transformer = Transformer(
    num_encoder_layers=6,
    num_decoder_layers=6,
    d_model=512,
    num_heads=8,
    d_ff=2048,
    source_vocab_size=10000,
    target_vocab_size=8000,
    max_seq_len=100
)

# Simulated inputs
source_ids = np.random.randint(0, 10000, size=12)
target_ids = np.random.randint(0, 8000, size=10)

# Forward pass
logits = transformer.forward(source_ids, target_ids)

print("✅ Complete Transformer test passed")
print(f"Source sequence length: {len(source_ids)}")
print(f"Target sequence length: {len(target_ids)}")
print(f"Output logits shape: {logits.shape}")
print(f"  → each target position has unnormalized scores (logits) over {logits.shape[1]} vocabulary entries")

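## 6. Autoregressive Generation and Inference {#6-generation}

### 6.1 Training vs. Inference

**Training (Teacher Forcing)**:
- Input: the complete target sequence (all tokens are known)
- Goal: predict the token at every position in a single pass
- Efficiency: high (positions are computed in parallel)

**Inference (Autoregressive Generation)**:
- Input: the sequence generated so far (starting from `<BOS>`)
- Goal: predict one next token per step
- Efficiency: low (tokens must be generated sequentially)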
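
To make teacher forcing concrete, the sketch below shows the usual "shift by one" setup: the Decoder input is the target without its last token, and the labels are the target without its first token, so position `t` is trained to predict token `t + 1`. The token IDs, vocabulary size, and random logits are made up for illustration. In the notebook the logits would come from a call such as `transformer.forward(source_ids, decoder_input)`.

```python
import numpy as np

def cross_entropy_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean negative log-likelihood of `labels` under softmax(logits)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

BOS_ID, EOS_ID = 1, 2            # hypothetical special-token IDs
full_target = np.array([BOS_ID, 17, 42, 99, EOS_ID])

decoder_input = full_target[:-1]  # [BOS, 17, 42, 99]  fed to the Decoder
labels = full_target[1:]          # [17, 42, 99, EOS]  what each position should predict

# Stand-in for model output: random logits of shape (target_len, vocab_size)
rng = np.random.default_rng(0)
toy_vocab_size = 128
dummy_logits = rng.standard_normal((len(decoder_input), toy_vocab_size))

loss = cross_entropy_loss(dummy_logits, labels)
print("decoder input:", decoder_input)
print("labels:       ", labels)
print("loss:", round(loss, 3))
```

Because the Look-Ahead Mask hides later positions, all of these predictions can be computed in one parallel pass without any position seeing its own label.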

### 6.2 The Autoregressive Generation Process

```
Step 1: Input = [<BOS>]
        Output = "我"

Step 2: Input = [<BOS>, "我"]
        Output = "愛"

Step 3: Input = [<BOS>, "我", "愛"]
        Output = "NLP"

Step 4: Input = [<BOS>, "我", "愛", "NLP"]
        Output = <EOS>  → 停止生成
```

In [None]:
def greedy_decode(
    transformer: Transformer,
    source_ids: np.ndarray,
    max_len: int,
    bos_token_id: int,
    eos_token_id: int
) -> np.ndarray:
    """
    Greedy decoding for sequence generation
    
    Args:
        transformer: Trained Transformer model
        source_ids: Source sequence (source_seq_len,)
        max_len: Maximum generation length
        bos_token_id: Beginning-of-sequence token ID
        eos_token_id: End-of-sequence token ID
    
    Returns:
        generated_ids: Generated sequence (variable length)
    """
    # Encode source once
    encoder_output = transformer.encode(source_ids)
    
    # Initialize with <BOS>
    generated_ids = [bos_token_id]
    
    for _ in range(max_len):
        # Current target sequence
        target_ids = np.array(generated_ids)
        
        # Create look-ahead mask
        look_ahead_mask = create_look_ahead_mask(len(target_ids))
        
        # Decode
        decoder_output = transformer.decode(target_ids, encoder_output, look_ahead_mask)
        
        # Project to vocabulary
        logits = np.dot(decoder_output, transformer.output_projection)
        
        # Get the last token's logits and apply a numerically stable softmax
        last_logits = logits[-1, :]
        exp_logits = np.exp(last_logits - np.max(last_logits))
        probs = exp_logits / np.sum(exp_logits)
        
        # Greedy: select the token with highest probability (cast to a plain int)
        next_token_id = int(np.argmax(probs))
        
        # Append to sequence
        generated_ids.append(next_token_id)
        
        # Stop if <EOS> generated
        if next_token_id == eos_token_id:
            break
    
    return np.array(generated_ids)


# Simulated generation test (the model is untrained, so the generated IDs are arbitrary)
source_ids = np.random.randint(0, 10000, size=8)
BOS_TOKEN = 1
EOS_TOKEN = 2
MAX_LEN = 15

print("Starting autoregressive generation...")
generated = greedy_decode(transformer, source_ids, MAX_LEN, BOS_TOKEN, EOS_TOKEN)

print("\n✅ Generation finished")
print(f"Source length: {len(source_ids)}")
print(f"Generated sequence length: {len(generated)}")
print(f"Generated token IDs: {generated}")
print("\nNotes:")
print(f"  • The first token is BOS ({BOS_TOKEN})")
print(f"  • Generation stops when EOS ({EOS_TOKEN}) is produced")
print("  • Each step predicts only the next token (autoregressive)")

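### 6.3 Decoding Strategies

| Strategy | Description | Pros | Cons |
|------|------|------|------|
| **Greedy** | Pick the highest-probability token at each step | Fast, deterministic | Can get stuck in a locally optimal sequence |
| **Beam Search** | Maintain the top-k candidate sequences | Better quality | Slower, needs more memory |
| **Sampling** | Sample randomly from the probability distribution | High diversity | Output may be incoherent |
| **Top-k Sampling** | Sample only from the top-k tokens | Balances quality and diversity | k must be tuned |
| **Nucleus (Top-p)** | Sample from the smallest set of tokens whose cumulative probability reaches p | Candidate set size adapts to the distribution | p must be tuned |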
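
The two truncated sampling rows of the table can be sketched as filters applied to the next-token distribution before sampling. This is a minimal illustration on a made-up 5-token distribution; the helper names `top_k_filter` and `top_p_filter` are introduced here and are not part of the course code.

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most probable tokens and renormalize."""
    keep = np.argsort(probs)[-k:]              # indices of the k largest entries
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]            # most probable first
    cumulative = np.cumsum(probs[order])
    num_kept = min(int(np.searchsorted(cumulative, p)) + 1, len(probs))
    keep = order[:num_kept]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Toy next-token distribution over a 5-token vocabulary (made up for illustration)
toy_probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])

top_k_probs = top_k_filter(toy_probs, k=2)     # only tokens 0 and 1 survive
top_p_probs = top_p_filter(toy_probs, p=0.8)   # tokens 0, 1, 2 (cumulative 0.9 >= 0.8)

rng = np.random.default_rng(0)
sampled_token = int(rng.choice(len(toy_probs), p=top_p_probs))
print(np.round(top_k_probs, 3), np.round(top_p_probs, 3), sampled_token)
```

With top-k the number of candidates is fixed, whereas with top-p it shrinks when the distribution is peaked and grows when it is flat, which is the adaptive behavior noted in the table.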

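**Beam Search illustration** (each number is read as the conditional probability of that token given its prefix; a sequence's score is the product along its path):
```
Beam Size = 2

Step 1:        [我 (0.6)]     [你 (0.3)]
               /     \         /     \
Step 2:    [愛]    [喜歡]   [愛]   [討厭]
           (0.4)   (0.3)   (0.2)   (0.15)

Keep top-2:  [我, 愛]  [我, 喜歡]
```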
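
The pruning step in the diagram can be reproduced with a few lines of arithmetic. The sketch below assumes each number in the diagram is a conditional probability, scores every two-token candidate by summing log-probabilities, and keeps the best `beam_size` candidates. It covers only this single expansion step, not a full beam search.

```python
import math

# Probabilities read off the diagram, interpreted as conditionals (an assumption)
first_step = {"我": 0.6, "你": 0.3}
second_step = {
    "我": {"愛": 0.4, "喜歡": 0.3},
    "你": {"愛": 0.2, "討厭": 0.15},
}
beam_size = 2

candidates = []
for first_token, p_first in first_step.items():
    for second_token, p_second in second_step[first_token].items():
        # Summing log-probabilities equals multiplying probabilities,
        # but avoids underflow on long sequences.
        log_score = math.log(p_first) + math.log(p_second)
        candidates.append(((first_token, second_token), log_score))

candidates.sort(key=lambda item: item[1], reverse=True)
beams = candidates[:beam_size]

for tokens, log_score in beams:
    print(tokens, round(math.exp(log_score), 3))
```

Under this reading the four candidates score 0.24, 0.18, 0.06, and 0.045, so the two sequences starting with 「我」 are kept, matching the "Keep top-2" line in the diagram.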

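## 7. Summary and Exercises {#7-summary}

### 7.1 Key Takeaways

✅ **The three Decoder components**:
1. **Masked Self-Attention**: each position sees only itself and earlier tokens
2. **Cross-Attention**: attends to the Encoder output to pull in source-sequence information
3. **Feed-Forward Network**: position-wise non-linear transformation

✅ **Key concepts**:
- **Look-Ahead Mask**: a lower-triangular matrix that blocks attention to future positions
- **Cross-Attention**: Q comes from the Decoder, K/V come from the Encoder
- **Autoregressive Generation**: generate step by step, predicting one next token each time

✅ **Encoder vs Decoder**:

| Property | Encoder | Decoder |
|------|---------|----------|
| Attention direction | Bidirectional | Unidirectional (masked) |
| Cross-Attention | ❌ | ✅ (in Encoder-Decoder models) |
| Use | Understanding tasks | Generation tasks |
| Representative model | BERT | GPT (Decoder-only: keeps masked self-attention, has no cross-attention) |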

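### 7.2 Transformer Architecture Families

**Encoder-Decoder architecture** (the original Transformer):
- Uses: machine translation, text summarization
- Examples: T5, BART, mBART

**Encoder-Only architecture**:
- Uses: text classification, NER, question answering
- Examples: BERT, RoBERTa, ALBERT

**Decoder-Only architecture**:
- Uses: text generation, dialogue systems
- Examples: GPT, GPT-2, GPT-3, LLaMA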

### 7.3 Hands-On Exercises

#### Exercise 1: Visualize Cross-Attention

**Task**: Using a real translation example, visualize the Cross-Attention weights and observe which target tokens attend to which source tokens.

**Hint**:
```python
source = "I love natural language processing"
target = "我 愛 自然 語言 處理"
# Draw a heatmap showing the alignment
```

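#### Exercise 2: Implement Beam Search

**Task**: Implement the Beam Search decoding strategy and compare its output quality with Greedy Decoding.

**Hint**:
```python
def beam_search_decode(model, source, beam_size=5):
    # Maintain the top-k candidate sequences
    # At each step, expand every candidate and keep the k with the highest total probability
    pass
```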

#### Exercise 3: Implement Temperature Sampling

**Task**: Add a temperature parameter to generation to control randomness.

**Hint**:
```python
# Temperature scaling
logits = logits / temperature
probs = softmax(logits)
next_token = np.random.choice(vocab_size, p=probs)
```

**Effect of temperature**:
- `T = 0.5`: more deterministic, favors high-probability tokens
- `T = 1.0`: standard softmax
- `T = 2.0`: more random, flatter distribution
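
The three temperature settings above can be checked numerically. The sketch below applies them to a small made-up logit vector and reports the resulting probabilities together with their entropy, which is higher for flatter distributions. The helper names and logit values are chosen for illustration.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

def entropy(probs: np.ndarray) -> float:
    """Shannon entropy in nats."""
    return float(-(probs * np.log(probs)).sum())

example_logits = np.array([2.0, 1.0, 0.5, 0.0])  # made-up scores for 4 tokens
temperatures = (0.5, 1.0, 2.0)
distributions = [softmax_with_temperature(example_logits, T) for T in temperatures]

for T, dist in zip(temperatures, distributions):
    print(f"T={T}: probs={np.round(dist, 3)}, entropy={entropy(dist):.3f}")
```

As `T` grows, the top token's probability falls and the entropy rises, which is the "more random, flatter distribution" behavior listed above.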

### 7.4 Further Reading

1. **Papers**:
   - [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - read the Decoder section closely
   - [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751) - Nucleus Sampling

2. **Implementation resources**:
   - [Annotated Transformer - Decoder](http://nlp.seas.harvard.edu/annotated-transformer/)
   - [Hugging Face - Generation Strategies](https://huggingface.co/docs/transformers/generation_strategies)

3. **Visualization tools**:
   - [Seq2Seq Visualization](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)

---

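## 🎯 Next Section Preview

**CH07-06: Encoder vs Decoder vs Encoder-Decoder Comparison**
- Strengths and weaknesses of the three architectures
- Choosing the best architecture for each task
- BERT vs GPT vs T5 in practice
- Practical advice for model selection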

---

**Completed on**: `____ (year) ____ (month) ____ (day)`  
**Reflections**: ___________________________________