# Low-Level Implementation 03: Building RNN and LSTM from Scratch

**Course**: iSpan Python NLP Cookbooks v2
**Chapter**: Low-Level Implementation Series
**Version**: v1.0
**Last updated**: 2025-10-17

---

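## 📚 Learning Objectives

1. Implement a recurrent neural network (RNN) from scratch
2. Understand the backpropagation through time (BPTT) algorithm
3. Understand why gradients vanish
4. Implement the LSTM gating mechanism from scratch
5. Build sequence generation and sentiment analysis tasks
6. Compare against the Keras implementation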

---

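## 1. RNN Fundamentals

### 1.1 The Recurrent Structure of an RNN

**Mathematical form**:
```
ht = tanh(Wh·ht-1 + Wx·xt + bh)
yt = Why·ht + by

where:
- xt: input at time step t
- ht: hidden state (memory) at time step t
- yt: output at time step t
- Wh: state transition matrix (hidden_size × hidden_size)
- Wx: input weight matrix (hidden_size × input_size)
- Why: output weight matrix (output_size × hidden_size)
- bh, by: hidden and output bias vectors
```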
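To make the shapes in the recurrence above concrete, here is a minimal sketch of a single RNN step applied three times. The sizes, the random seed, the 0.1 weight scale, and the helper name `rnn_step` are illustrative assumptions, not part of the notebook's class.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2  # illustrative sizes (assumed)

# Small random weights; the 0.1 scale is arbitrary for this sketch
Wh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
Wx = rng.standard_normal((hidden_size, input_size)) * 0.1
Why = rng.standard_normal((output_size, hidden_size)) * 0.1
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

def rnn_step(x_t, h_prev):
    """One application of the recurrence: returns (h_t, y_t)."""
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t + bh)
    y_t = Why @ h_t + by
    return h_t, y_t

h = np.zeros((hidden_size, 1))  # initial hidden state
for t in range(3):
    x_t = rng.standard_normal((input_size, 1))
    h, y = rnn_step(x_t, h)
    print(f"t={t}: h shape {h.shape}, y shape {y.shape}")
```

The same weights are reused at every step; only the hidden state `h` carries information forward in time.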

### 1.2 Unrolled View over Time

```
t=0      t=1      t=2
x0       x1       x2
 ↓        ↓        ↓
h0  →   h1  →   h2
 ↓        ↓        ↓
y0       y1       y2
```

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Set the random seed
np.random.seed(42)

# Configure plotting defaults
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Environment ready")

---

## 2. Implementing the Activation Functions

In [None]:
def tanh(x):
    """Tanh activation function"""
    return np.tanh(x)

def tanh_derivative(x):
    """Tanh derivative: 1 - tanh²(x)"""
    return 1 - np.tanh(x) ** 2

def sigmoid(x):
    """Sigmoid activation function (input clipped to avoid overflow in exp)"""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    """Sigmoid derivative: σ(x) * (1 - σ(x))"""
    s = sigmoid(x)
    return s * (1 - s)

def softmax(x):
    """Softmax (for multi-class classification), numerically stabilized"""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

print("✅ Activation functions defined")
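As a sanity check on the derivative formulas above, the following sketch compares them against central finite differences. The helper `numerical_derivative`, the step size `eps`, and the test grid are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    # Same definition as in the cell above, repeated so the sketch is self-contained
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def numerical_derivative(f, x, eps=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.linspace(-3, 3, 13)

# Analytic formulas: tanh'(x) = 1 - tanh²(x), σ'(x) = σ(x)(1 - σ(x))
tanh_err = np.max(np.abs(numerical_derivative(np.tanh, x) - (1 - np.tanh(x) ** 2)))
s = sigmoid(x)
sigmoid_err = np.max(np.abs(numerical_derivative(sigmoid, x) - s * (1 - s)))

print(f"max tanh' error:    {tanh_err:.2e}")
print(f"max sigmoid' error: {sigmoid_err:.2e}")
```

Both derivatives can be written in terms of the activation's own output, which is why a backward pass can reuse cached activations (for example `1 - h ** 2`) instead of re-evaluating the function.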

---

## 3. Full RNN Implementation

### 3.1 RNN Forward Pass and BPTT

In [None]:
class VanillaRNN:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        """
        Parameters:
        -----------
        input_size : int
            Input feature dimension
        hidden_size : int
            Hidden state dimension
        output_size : int
            Output dimension
        learning_rate : float
            Learning rate
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.lr = learning_rate

        # Weight initialization scaled by sqrt(2 / fan_in). This is He-style scaling;
        # Glorot/Xavier would use sqrt(2 / (fan_in + fan_out)).
        scale_h = np.sqrt(2.0 / hidden_size)
        scale_x = np.sqrt(2.0 / input_size)

        self.Wh = np.random.randn(hidden_size, hidden_size) * scale_h
        self.Wx = np.random.randn(hidden_size, input_size) * scale_x
        self.Why = np.random.randn(output_size, hidden_size) * scale_h

        self.bh = np.zeros((hidden_size, 1))
        self.by = np.zeros((output_size, 1))

        print(f"✅ RNN initialized")
        print(f"   Input size: {input_size}")
        print(f"   Hidden size: {hidden_size}")
        print(f"   Output size: {output_size}")
    
    def forward(self, inputs, h_prev=None):
        """
        Forward pass (processes the whole sequence)

        Parameters:
        -----------
        inputs : list of arrays
            Input sequence [(input_size, 1), ...]
        h_prev : array
            Initial hidden state (hidden_size, 1)

        Returns:
        --------
        outputs : list
            Output sequence
        hidden_states : list
            Hidden states; index 0 is the initial state, index t+1 is the state after input t
        cache : dict
            Cache (used by the backward pass)
        """
        # Initialize the hidden state
        if h_prev is None:
            h = np.zeros((self.hidden_size, 1))
        else:
            h = h_prev

        hidden_states = [h]
        outputs = []

        # Process the sequence step by step
        for x in inputs:
            # ht = tanh(Wh·ht-1 + Wx·xt + bh)
            h = np.tanh(np.dot(self.Wh, h) + np.dot(self.Wx, x) + self.bh)

            # yt = Why·ht + by
            y = np.dot(self.Why, h) + self.by

            hidden_states.append(h)
            outputs.append(y)

        cache = {
            'inputs': inputs,
            'hidden_states': hidden_states,
            'outputs': outputs
        }

        return outputs, hidden_states, cache
    
    def backward(self, cache, targets):
        """
        Backward pass (BPTT, Backpropagation Through Time)

        Parameters:
        -----------
        cache : dict
            Cache from the forward pass
        targets : list
            Target sequence
        """
        inputs = cache['inputs']
        hidden_states = cache['hidden_states']
        outputs = cache['outputs']

        T = len(inputs)

        # Initialize gradients
        dWh = np.zeros_like(self.Wh)
        dWx = np.zeros_like(self.Wx)
        dWhy = np.zeros_like(self.Why)
        dbh = np.zeros_like(self.bh)
        dby = np.zeros_like(self.by)

        dh_next = np.zeros((self.hidden_size, 1))

        # Walk backward through the time steps
        for t in reversed(range(T)):
            # Output-layer gradient of 0.5 * squared error; the constant 2/T factor
            # of the MSE reported by train_step is absorbed into the learning rate
            dy = outputs[t] - targets[t]  # (output_size, 1)

            dWhy += np.dot(dy, hidden_states[t+1].T)
            dby += dy

            # Hidden-state gradient: from the output at t plus from time step t+1
            dh = np.dot(self.Why.T, dy) + dh_next  # (hidden_size, 1)

            # Through tanh: hidden_states[t+1] is ht, so tanh' = 1 - ht²
            dh_raw = (1 - hidden_states[t+1] ** 2) * dh

            # Weight gradients
            dWh += np.dot(dh_raw, hidden_states[t].T)
            dWx += np.dot(dh_raw, inputs[t].T)
            dbh += dh_raw

            # Pass the gradient to the previous time step
            dh_next = np.dot(self.Wh.T, dh_raw)

        # Gradient clipping (elementwise, to limit exploding gradients)
        for grad in [dWh, dWx, dWhy, dbh, dby]:
            np.clip(grad, -5, 5, out=grad)

        # Update weights (plain gradient descent)
        self.Wh -= self.lr * dWh
        self.Wx -= self.lr * dWx
        self.Why -= self.lr * dWhy
        self.bh -= self.lr * dbh
        self.by -= self.lr * dby
    
    def train_step(self, inputs, targets, h_prev=None):
        """
        Run one training step on a single sequence
        """
        # Forward pass
        outputs, hidden_states, cache = self.forward(inputs, h_prev)

        # Compute the loss (squared error averaged over time steps)
        loss = 0
        for y, target in zip(outputs, targets):
            loss += np.sum((y - target) ** 2)
        loss /= len(targets)

        # Backward pass (also updates the weights)
        self.backward(cache, targets)

        return loss, hidden_states[-1]

print("✅ RNN class defined")
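One way to build confidence in a BPTT implementation is a numerical gradient check. The sketch below re-implements the same recurrence in standalone functions (the names `loss_and_states`, `bptt`, and `numerical_grad` are hypothetical, as are the tiny sizes and the seed) and compares analytic gradients with central finite differences. It uses the loss 0.5 × sum of squared errors, for which `dy = y - target` is the exact output gradient, and it omits clipping and weight updates.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, O, T = 3, 2, 1, 4  # tiny hidden/input/output sizes and sequence length (assumed)
params = {
    "Wh": rng.standard_normal((H, H)) * 0.5,
    "Wx": rng.standard_normal((H, D)) * 0.5,
    "Why": rng.standard_normal((O, H)) * 0.5,
    "bh": np.zeros((H, 1)),
    "by": np.zeros((O, 1)),
}
xs = [rng.standard_normal((D, 1)) for _ in range(T)]
ts = [rng.standard_normal((O, 1)) for _ in range(T)]

def loss_and_states(p):
    """Forward pass; returns 0.5 * sum of squared errors, hidden states, outputs."""
    h = np.zeros((H, 1))
    hs, ys, loss = [h], [], 0.0
    for x, t in zip(xs, ts):
        h = np.tanh(p["Wh"] @ h + p["Wx"] @ x + p["bh"])
        y = p["Why"] @ h + p["by"]
        hs.append(h)
        ys.append(y)
        loss += 0.5 * np.sum((y - t) ** 2)
    return loss, hs, ys

def bptt(p):
    """Analytic gradients via backpropagation through time (no clipping)."""
    _, hs, ys = loss_and_states(p)
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dh_next = np.zeros((H, 1))
    for t in reversed(range(T)):
        dy = ys[t] - ts[t]
        grads["Why"] += dy @ hs[t + 1].T
        grads["by"] += dy
        dh = p["Why"].T @ dy + dh_next
        dh_raw = (1 - hs[t + 1] ** 2) * dh  # through tanh
        grads["Wh"] += dh_raw @ hs[t].T
        grads["Wx"] += dh_raw @ xs[t].T
        grads["bh"] += dh_raw
        dh_next = p["Wh"].T @ dh_raw
    return grads

def numerical_grad(p, name, eps=1e-5):
    """Central finite differences, one parameter entry at a time."""
    g = np.zeros_like(p[name])
    for idx in np.ndindex(*p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        loss_plus, _, _ = loss_and_states(p)
        p[name][idx] = old - eps
        loss_minus, _, _ = loss_and_states(p)
        p[name][idx] = old  # restore the original value
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

analytic = bptt(params)
max_diff = max(np.max(np.abs(analytic[k] - numerical_grad(params, k))) for k in params)
print(f"max |analytic - numerical| = {max_diff:.2e}")
```

A small difference suggests the backward recursion matches the forward pass. The check is only practical for tiny models, since it needs two forward passes per parameter entry.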

---

## 4. Testing the RNN: Cumulative Sum Task

**Task**: learn to compute the running (cumulative) sum of a sequence

**Example**:
```
Input:  [1, 2, 3]
Output: [1, 3, 6]  (cumulative sum)
```

In [None]:
# Generate training data
def generate_cumsum_data(n_samples=100, seq_length=10):
    """
    Generate cumulative-sum data
    """
    X = []
    Y = []

    for _ in range(n_samples):
        # Random sequence of integers in [0, 5)
        seq = np.random.randint(0, 5, size=seq_length)
        # Cumulative sum
        cumsum = np.cumsum(seq)

        X.append(seq)
        Y.append(cumsum)

    return np.array(X), np.array(Y)

# Generate the data
X_data, Y_data = generate_cumsum_data(n_samples=200, seq_length=10)

print("Sample data:")
print(f"Input: {X_data[0]}")
print(f"Output: {Y_data[0]}")

# Create the RNN
rnn = VanillaRNN(input_size=1, hidden_size=20, output_size=1, learning_rate=0.001)

# Train
print("\nStarting training...")
losses = []

for epoch in range(100):
    epoch_loss = 0

    for i in range(len(X_data)):
        # Prepare the sequence as a list of column vectors
        inputs = [np.array([[x]]) for x in X_data[i]]  # [(1,1), (1,1), ...]
        targets = [np.array([[y]]) for y in Y_data[i]]

        # One training step
        loss, _ = rnn.train_step(inputs, targets)
        epoch_loss += loss

    avg_loss = epoch_loss / len(X_data)
    losses.append(avg_loss)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1:3d}/100 - Loss: {avg_loss:.4f}")

# Plot the learning curve
plt.figure(figsize=(10, 5))
plt.plot(losses, linewidth=2)
plt.title('RNN Training Loss (Cumulative Sum Task)', fontsize=14)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()

print("\n✅ RNN training complete")

In [None]:
# Test prediction
test_input = [2, 3, 1, 4, 2]
test_target = np.cumsum(test_input)

# Convert to the list-of-column-vectors format
inputs = [np.array([[x]]) for x in test_input]

# Predict
outputs, _, _ = rnn.forward(inputs)
predictions = [y[0, 0] for y in outputs]

print("\nTest results:")
print("="*50)
print(f"Input sequence: {test_input}")
print(f"True cumulative sum: {test_target.tolist()}")
print(f"Predicted cumulative sum: {[round(p, 1) for p in predictions]}")
print(f"\nMean absolute error: {np.mean(np.abs(predictions - test_target)):.4f}")

---

## 5. Full LSTM Implementation

### 5.1 The Four Components of an LSTM

```
1. Forget gate:      ft = σ(Wf·[ht-1, xt] + bf)
2. Input gate:       it = σ(Wi·[ht-1, xt] + bi)
3. Candidate memory: C̃t = tanh(Wc·[ht-1, xt] + bc)
4. Output gate:      ot = σ(Wo·[ht-1, xt] + bo)

Cell state update: Ct = ft ⊙ Ct-1 + it ⊙ C̃t
Hidden state:      ht = ot ⊙ tanh(Ct)
```

In [None]:
class LSTMCell:
    def __init__(self, input_size, hidden_size):
        """
        Initialize the LSTM cell
        """
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Weight initialization scaled by sqrt(2 / fan_in) (He-style scaling)
        concat_size = hidden_size + input_size
        scale = np.sqrt(2.0 / concat_size)

        # One weight matrix per gate, each acting on the concatenation [h_prev, x]
        self.Wf = np.random.randn(hidden_size, concat_size) * scale  # forget gate
        self.Wi = np.random.randn(hidden_size, concat_size) * scale  # input gate
        self.Wc = np.random.randn(hidden_size, concat_size) * scale  # candidate memory
        self.Wo = np.random.randn(hidden_size, concat_size) * scale  # output gate

        self.bf = np.zeros((hidden_size, 1))
        self.bi = np.zeros((hidden_size, 1))
        self.bc = np.zeros((hidden_size, 1))
        self.bo = np.zeros((hidden_size, 1))
    
    def forward(self, x, h_prev, C_prev):
        """
        LSTM cell forward pass

        Parameters:
        -----------
        x : array (input_size, 1)
            Current input
        h_prev : array (hidden_size, 1)
            Previous hidden state
        C_prev : array (hidden_size, 1)
            Previous cell state

        Returns:
        --------
        h_next : array
            Current hidden state
        C_next : array
            Current cell state
        cache : dict
            Cache for the backward pass
        """
        # Concatenate [h_prev, x]
        concat = np.vstack([h_prev, x])  # (hidden_size + input_size, 1)

        # 1. Forget gate
        ft = sigmoid(np.dot(self.Wf, concat) + self.bf)

        # 2. Input gate
        it = sigmoid(np.dot(self.Wi, concat) + self.bi)

        # 3. Candidate memory
        Ct_tilde = np.tanh(np.dot(self.Wc, concat) + self.bc)

        # 4. Update the cell state
        Ct = ft * C_prev + it * Ct_tilde

        # 5. Output gate
        ot = sigmoid(np.dot(self.Wo, concat) + self.bo)

        # 6. Hidden state
        ht = ot * np.tanh(Ct)

        # Cache everything the backward pass needs
        cache = {
            'x': x,
            'h_prev': h_prev,
            'C_prev': C_prev,
            'concat': concat,
            'ft': ft,
            'it': it,
            'Ct_tilde': Ct_tilde,
            'Ct': Ct,
            'ot': ot,
            'ht': ht
        }

        return ht, Ct, cache

print("✅ LSTM cell defined")

### 5.2 The Full LSTM Network

In [None]:
class LSTM:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.lr = learning_rate
        
        # LSTM Cell
        self.cell = LSTMCell(input_size, hidden_size)
        
        # Output-layer weights
        scale = np.sqrt(2.0 / hidden_size)
        self.Why = np.random.randn(output_size, hidden_size) * scale
        self.by = np.zeros((output_size, 1))

        print(f"✅ LSTM initialized")
        print(f"   Input size: {input_size}")
        print(f"   Hidden size: {hidden_size}")
        print(f"   Output size: {output_size}")
    
    def forward(self, inputs, h_prev=None, C_prev=None):
        """
        Process the whole sequence
        """
        # Initialize each state independently, so passing only one of them still works
        h = np.zeros((self.hidden_size, 1)) if h_prev is None else h_prev
        C = np.zeros((self.hidden_size, 1)) if C_prev is None else C_prev

        hidden_states = []
        cell_states = []
        outputs = []
        caches = []

        # Process the sequence step by step
        for x in inputs:
            h, C, cache = self.cell.forward(x, h, C)

            # Output
            y = np.dot(self.Why, h) + self.by

            hidden_states.append(h)
            cell_states.append(C)
            outputs.append(y)
            caches.append(cache)

        return outputs, hidden_states, cell_states, caches
    
    def backward(self, caches, targets, outputs):
        """
        LSTM backward pass (BPTT)
        """
        T = len(caches)

        # Initialize gradients
        dWhy = np.zeros_like(self.Why)
        dby = np.zeros_like(self.by)

        dWf = np.zeros_like(self.cell.Wf)
        dWi = np.zeros_like(self.cell.Wi)
        dWc = np.zeros_like(self.cell.Wc)
        dWo = np.zeros_like(self.cell.Wo)

        dbf = np.zeros_like(self.cell.bf)
        dbi = np.zeros_like(self.cell.bi)
        dbc = np.zeros_like(self.cell.bc)
        dbo = np.zeros_like(self.cell.bo)

        dh_next = np.zeros((self.hidden_size, 1))
        dC_next = np.zeros((self.hidden_size, 1))

        # Walk backward through the time steps
        for t in reversed(range(T)):
            cache = caches[t]

            # Output-layer gradient
            dy = outputs[t] - targets[t]
            dWhy += np.dot(dy, cache['ht'].T)
            dby += dy

            # Hidden-state gradient
            dh = np.dot(self.Why.T, dy) + dh_next

            # Cell-state gradient: contribution through ht = ot * tanh(Ct),
            # plus the gradient arriving from the next time step
            dC_from_h = dh * cache['ot'] * (1 - np.tanh(cache['Ct']) ** 2)
            dC = dC_next + dC_from_h

            # Gate gradients with respect to the pre-activations
            # (sigmoid' = g * (1 - g), tanh' = 1 - g²)
            dft = dC * cache['C_prev'] * cache['ft'] * (1 - cache['ft'])
            dit = dC * cache['Ct_tilde'] * cache['it'] * (1 - cache['it'])
            dCt_tilde = dC * cache['it'] * (1 - cache['Ct_tilde'] ** 2)
            dot = dh * np.tanh(cache['Ct']) * cache['ot'] * (1 - cache['ot'])

            # Accumulate weight gradients
            dWf += np.dot(dft, cache['concat'].T)
            dWi += np.dot(dit, cache['concat'].T)
            dWc += np.dot(dCt_tilde, cache['concat'].T)
            dWo += np.dot(dot, cache['concat'].T)

            dbf += dft
            dbi += dit
            dbc += dCt_tilde
            dbo += dot

            # Pass gradients to the previous time step; the first hidden_size
            # columns of each weight matrix act on h_prev
            dh_next = np.dot(self.cell.Wf[:, :self.hidden_size].T, dft)
            dh_next += np.dot(self.cell.Wi[:, :self.hidden_size].T, dit)
            dh_next += np.dot(self.cell.Wc[:, :self.hidden_size].T, dCt_tilde)
            dh_next += np.dot(self.cell.Wo[:, :self.hidden_size].T, dot)

            dC_next = dC * cache['ft']

        # Gradient clipping (elementwise)
        for grad in [dWf, dWi, dWc, dWo, dWhy, dbf, dbi, dbc, dbo, dby]:
            np.clip(grad, -5, 5, out=grad)

        # Update weights
        self.cell.Wf -= self.lr * dWf
        self.cell.Wi -= self.lr * dWi
        self.cell.Wc -= self.lr * dWc
        self.cell.Wo -= self.lr * dWo

        self.cell.bf -= self.lr * dbf
        self.cell.bi -= self.lr * dbi
        self.cell.bc -= self.lr * dbc
        self.cell.bo -= self.lr * dbo

        self.Why -= self.lr * dWhy
        self.by -= self.lr * dby
    
    def train_step(self, inputs, targets):
        """
        Run one training step on a single sequence
        """
        # Forward pass
        outputs, hidden_states, cell_states, caches = self.forward(inputs)

        # Compute the loss (squared error averaged over time steps)
        loss = 0
        for y, target in zip(outputs, targets):
            loss += np.sum((y - target) ** 2)
        loss /= len(targets)

        # Backward pass (also updates the weights)
        self.backward(caches, targets, outputs)

        return loss

print("✅ LSTM class defined")
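The `backward` methods above clip every gradient entry to [-5, 5] independently, which can change the direction of the update. A commonly used alternative, which is not what this notebook implements, is to rescale all gradients by their global norm. The sketch below contrasts the two on a toy gradient; the function name `clip_by_global_norm` and the example values are assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Toy gradients with joint L2 norm 50
grads = [np.array([[30.0, -40.0]]), np.array([[0.0], [0.0]])]

clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

# Elementwise clipping, as used in the notebook's backward passes
elementwise = [np.clip(g, -5, 5) for g in grads]

print("norm before:", norm_before)
print("global-norm clipped:", clipped[0], "norm after:", norm_after)
print("elementwise clipped:", elementwise[0])
```

Global-norm clipping keeps the 3:4 ratio between the two entries, while elementwise clipping flattens both to magnitude 5. Either choice bounds the step size; which one trains better here is not something this sketch establishes.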

### 5.3 Training the LSTM (Cumulative Sum Task)

In [None]:
# Create the LSTM
lstm = LSTM(input_size=1, hidden_size=32, output_size=1, learning_rate=0.001)

# Generate data (longer sequences to exercise the LSTM's long-term memory)
X_data_long, Y_data_long = generate_cumsum_data(n_samples=200, seq_length=20)

# Train
print("Starting LSTM training...")
lstm_losses = []

for epoch in range(50):
    epoch_loss = 0

    for i in range(len(X_data_long)):
        inputs = [np.array([[x]]) for x in X_data_long[i]]
        targets = [np.array([[y]]) for y in Y_data_long[i]]

        loss = lstm.train_step(inputs, targets)
        epoch_loss += loss

    avg_loss = epoch_loss / len(X_data_long)
    lstm_losses.append(avg_loss)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1:3d}/50 - Loss: {avg_loss:.4f}")

# Plot the learning curve
plt.figure(figsize=(10, 5))
plt.plot(lstm_losses, linewidth=2, color='green')
plt.title('LSTM Training Loss (Longer Sequences)', fontsize=14)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()

print("\n✅ LSTM training complete")

In [None]:
# Test the LSTM
test_input_long = [1, 2, 3, 4, 5, 2, 1, 3, 2, 4]
test_target_long = np.cumsum(test_input_long)

inputs = [np.array([[x]]) for x in test_input_long]
outputs, _, _, _ = lstm.forward(inputs)
predictions = [y[0, 0] for y in outputs]

print("\nLSTM test results:")
print("="*50)
print(f"Input sequence: {test_input_long}")
print(f"True cumulative sum: {test_target_long.tolist()}")
print(f"Predicted cumulative sum: {[round(p, 1) for p in predictions]}")
print(f"\nMean absolute error: {np.mean(np.abs(predictions - test_target_long)):.4f}")

---

## 6. RNN vs LSTM Performance Comparison

### 6.1 Vanishing-Gradient Experiment

In [None]:
# Experiment: compare RNN and LSTM across sequence lengths.
# Note: each model is trained for 20 steps on a single sequence and evaluated
# on that same sequence, so this measures fitting ability, not generalization.

seq_lengths = [5, 10, 20, 30, 40, 50]
rnn_errors = []
lstm_errors = []

for seq_len in seq_lengths:
    print(f"\nSequence length: {seq_len}")

    # Generate one sequence and its cumulative sum
    test_seq = np.random.randint(0, 5, size=seq_len)
    test_target = np.cumsum(test_seq)

    # RNN
    rnn_temp = VanillaRNN(1, 20, 1, 0.001)
    inputs = [np.array([[x]]) for x in test_seq]
    targets = [np.array([[y]]) for y in test_target]

    # Brief training
    for _ in range(20):
        rnn_temp.train_step(inputs, targets)

    rnn_outputs, _, _ = rnn_temp.forward(inputs)
    rnn_preds = [y[0, 0] for y in rnn_outputs]
    rnn_error = np.mean(np.abs(rnn_preds - test_target))
    rnn_errors.append(rnn_error)

    # LSTM
    lstm_temp = LSTM(1, 32, 1, 0.001)

    for _ in range(20):
        lstm_temp.train_step(inputs, targets)

    lstm_outputs, _, _, _ = lstm_temp.forward(inputs)
    lstm_preds = [y[0, 0] for y in lstm_outputs]
    lstm_error = np.mean(np.abs(lstm_preds - test_target))
    lstm_errors.append(lstm_error)

    print(f"  RNN error: {rnn_error:.4f}")
    print(f"  LSTM error: {lstm_error:.4f}")

# Visualize the comparison
plt.figure(figsize=(12, 6))
plt.plot(seq_lengths, rnn_errors, 'o-', linewidth=2, label='RNN', markersize=8)
plt.plot(seq_lengths, lstm_errors, 's-', linewidth=2, label='LSTM', markersize=8)
plt.xlabel('Sequence Length', fontsize=12)
plt.ylabel('Average Error', fontsize=12)
plt.title('RNN vs LSTM: Performance on Long Sequences', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print("\n📊 What to look for:")
print("- Expected trend: RNN error rises faster as sequences get longer (vanishing gradients)")
print("- Expected trend: LSTM error grows more slowly (gating mitigates vanishing gradients)")
print("- Caveat: targets also grow with length and each model gets only 20 updates, so check the plot before drawing conclusions")

---

## 7. Comparison with Keras

### 7.1 Using the Keras LSTM

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

# Prepare the data in Keras format
X_keras = X_data_long[:, :, np.newaxis]  # (samples, timesteps, features)
Y_keras = Y_data_long[:, :, np.newaxis]

X_train_k, X_test_k, Y_train_k, Y_test_k = train_test_split(
    X_keras, Y_keras, test_size=0.2, random_state=42
)

# Build the Keras LSTM model
keras_lstm = keras.Sequential([
    layers.LSTM(32, return_sequences=True, input_shape=(None, 1)),
    layers.Dense(1)
])

keras_lstm.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Train
history = keras_lstm.fit(
    X_train_k, Y_train_k,
    epochs=50,
    batch_size=32,
    validation_data=(X_test_k, Y_test_k),
    verbose=0
)

# Evaluate
test_loss, test_mae = keras_lstm.evaluate(X_test_k, Y_test_k, verbose=0)

print(f"Keras LSTM test MAE: {test_mae:.4f}")

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Train Loss', linewidth=2)
plt.plot(history.history['val_loss'], label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Keras LSTM Training', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(history.history['mae'], label='Train MAE', linewidth=2)
plt.plot(history.history['val_mae'], label='Val MAE', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Mean Absolute Error')
plt.title('Keras LSTM Metrics', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 8. Application: Character-Level Text Generation

### 8.1 Preparing the Data

In [None]:
# A small text corpus
text = """
hello world
deep learning is fun
natural language processing
machine learning algorithms
neural networks are powerful
""".strip()

# Build the character vocabulary
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)

print(f"Text length: {len(text)}")
print(f"Vocabulary size: {vocab_size}")
print(f"Character set: {chars}")

# Create training sequences: each window of seq_length characters predicts the next one
seq_length = 10
sequences = []
next_chars = []

for i in range(len(text) - seq_length):
    sequences.append(text[i:i+seq_length])
    next_chars.append(text[i+seq_length])

print(f"\nNumber of training sequences: {len(sequences)}")
print(f"Example:")
print(f"  Input: '{sequences[0]}'")
print(f"  Target: '{next_chars[0]}'")

### 8.2 Training a Character-Level LSTM with Keras

In [None]:
# Vectorize as one-hot arrays (float32, so the loss receives numeric targets)
X_char = np.zeros((len(sequences), seq_length, vocab_size), dtype=np.float32)
y_char = np.zeros((len(sequences), vocab_size), dtype=np.float32)

for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X_char[i, t, char_to_idx[char]] = 1
    y_char[i, char_to_idx[next_chars[i]]] = 1

print(f"Training data shape: {X_char.shape}")  # (n_sequences, seq_length, vocab_size)
print(f"Target data shape: {y_char.shape}")   # (n_sequences, vocab_size)

# Build the model
char_model = keras.Sequential([
    layers.LSTM(128, input_shape=(seq_length, vocab_size)),
    layers.Dense(vocab_size, activation='softmax')
])

char_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

char_model.summary()

# Train
history = char_model.fit(
    X_char, y_char,
    batch_size=32,
    epochs=100,
    validation_split=0.2,
    verbose=0
)

print(f"\nFinal training accuracy: {history.history['accuracy'][-1]:.4f}")
print(f"Final validation accuracy: {history.history['val_accuracy'][-1]:.4f}")

### 8.3 Text Generation

In [None]:
def generate_text(model, seed_text, length=50, temperature=1.0):
    """
    Generate text

    Parameters:
    -----------
    model : Keras model
        Trained model
    seed_text : str
        Seed text (at least seq_length characters)
    length : int
        Number of characters to generate
    temperature : float
        Temperature (controls randomness)
        - temperature < 1: more deterministic (conservative)
        - temperature = 1: the model's original probabilities
        - temperature > 1: more random (more varied)
    """
    generated = seed_text

    for _ in range(length):
        # Prepare the one-hot input from the last seq_length characters
        x = np.zeros((1, seq_length, vocab_size))
        for t, char in enumerate(seed_text[-seq_length:]):
            if char in char_to_idx:
                x[0, t, char_to_idx[char]] = 1

        # Predict; cast to float64 so the renormalized probabilities sum to 1
        # within the tolerance np.random.choice requires
        preds = np.asarray(model.predict(x, verbose=0)[0], dtype=np.float64)

        # Temperature sampling: rescale log-probabilities, then renormalize
        preds = np.exp(np.log(preds + 1e-8) / temperature)
        preds = preds / np.sum(preds)

        # Sample the next character
        next_idx = np.random.choice(len(preds), p=preds)
        next_char = idx_to_char[next_idx]

        # Append to the generated text and slide the input window
        generated += next_char
        seed_text += next_char

    return generated

# Try text generation
seed = "deep learn"

print("\nText generation test:")
print("="*60)

for temp in [0.2, 0.5, 1.0, 1.2]:
    generated = generate_text(char_model, seed, length=40, temperature=temp)
    print(f"\nTemperature={temp}: ")
    print(f"  {generated}")
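To see what the temperature parameter does independently of any trained model, the sketch below applies the same log, divide, and renormalize steps to a fixed three-way distribution. The helper name `apply_temperature` and the example probabilities are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector: divide log-probabilities by temperature, renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-8) / temperature
    logits -= np.max(logits)  # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / np.sum(scaled)

base = np.array([0.6, 0.3, 0.1])  # hypothetical next-character probabilities
for temp in [0.2, 0.5, 1.0, 2.0]:
    print(f"temperature={temp}: {np.round(apply_temperature(base, temp), 3)}")
```

Lower temperatures concentrate probability on the most likely character, and higher temperatures flatten the distribution toward uniform, which matches the behavior described in the `generate_text` docstring.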

---

## 9. Application: Sentiment Analysis (Simplified)

### 9.1 Preparing the Sentiment Data

In [None]:
# A tiny sentiment analysis dataset
sentiment_texts = [
    "i love this movie",
    "this is great",
    "wonderful experience",
    "best film ever",
    "amazing story",
    "i hate this",
    "terrible movie",
    "waste of time",
    "very bad",
    "disappointing film"
] * 20  # repeated to enlarge the dataset (adds no new examples)

sentiment_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] * 20  # 1 = positive, 0 = negative

# Shuffle the data
indices = np.random.permutation(len(sentiment_texts))
sentiment_texts = [sentiment_texts[i] for i in indices]
sentiment_labels = [sentiment_labels[i] for i in indices]

print(f"Sentiment dataset size: {len(sentiment_texts)}")
print(f"Positive samples: {sum(sentiment_labels)}")
print(f"Negative samples: {len(sentiment_labels) - sum(sentiment_labels)}")

### 9.2 Training the Sentiment Model

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentiment_texts)
vocab_size_sent = len(tokenizer.word_index) + 1

# Convert to integer sequences (named sent_sequences so the character-level
# `sequences` list from Section 8 is not overwritten)
sent_sequences = tokenizer.texts_to_sequences(sentiment_texts)
X_sent = pad_sequences(sent_sequences, maxlen=10, padding='post')
y_sent = np.array(sentiment_labels)

# Split into training and test sets. Because the 10 sentences are repeated,
# the test set contains sentences seen in training, so accuracy is optimistic.
X_train_sent, X_test_sent, y_train_sent, y_test_sent = train_test_split(
    X_sent, y_sent, test_size=0.2, random_state=42
)

# Build the model
sentiment_model = keras.Sequential([
    layers.Embedding(vocab_size_sent, 32, input_length=10),
    layers.LSTM(16),
    layers.Dense(1, activation='sigmoid')
])

sentiment_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("Sentiment model:")
sentiment_model.summary()

# Train
history_sent = sentiment_model.fit(
    X_train_sent, y_train_sent,
    epochs=50,
    batch_size=16,
    validation_data=(X_test_sent, y_test_sent),
    verbose=0
)

# Evaluate
test_loss, test_acc = sentiment_model.evaluate(X_test_sent, y_test_sent, verbose=0)
print(f"\nTest accuracy: {test_acc:.4f}")

In [None]:
# Predict the sentiment of new reviews
def predict_sentiment(text):
    """Return the model's estimated probability that `text` is positive."""
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=10, padding='post')
    pred = sentiment_model.predict(padded, verbose=0)[0, 0]
    return pred

# Test
test_reviews = [
    "this is amazing",
    "i hate this movie",
    "wonderful story",
    "terrible experience",
    "best movie ever"
]

print("\nSentiment predictions:")
print("="*60)
for review in test_reviews:
    sentiment_score = predict_sentiment(review)
    sentiment = "Positive 😊" if sentiment_score > 0.5 else "Negative 😞"
    print(f"'{review:25s}' → {sentiment} (P(positive): {sentiment_score:.2%})")

---

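## 10. Visualizing Gradient Flow

### 10.1 Demonstrating Vanishing Gradients

In [None]:
# Simulate gradient propagation
def simulate_gradient_flow(activation, num_layers=20):
    """
    Toy model of gradient propagation through many layers (or time steps).

    Each step multiplies the gradient by an assumed weight eigenvalue and an
    assumed typical activation derivative; the constants are illustrative.
    """
    # Assumed largest eigenvalue of the weight matrix and typical derivative
    if activation == 'tanh':
        weight_eigenvalue = 0.9
        activation_derivative = 0.25  # assumed typical value; tanh' lies in (0, 1]
    elif activation == 'sigmoid':
        weight_eigenvalue = 0.9
        activation_derivative = 0.25  # upper bound of sigmoid' (reached at x = 0)
    elif activation == 'relu':
        weight_eigenvalue = 1.0
        activation_derivative = 0.5  # ReLU' is 1 for active units; assume half are active
    else:
        raise ValueError(f"Unknown activation: {activation}")

    gradient = 1.0
    gradients = [gradient]

    for layer in range(num_layers):
        gradient *= weight_eigenvalue * activation_derivative
        gradients.append(gradient)

    return gradients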

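# Compare activation functions (tanh and sigmoid use the same assumed
# constants, so their curves coincide in the plots)
gradients_tanh = simulate_gradient_flow('tanh', 50)
gradients_sigmoid = simulate_gradient_flow('sigmoid', 50)
gradients_relu = simulate_gradient_flow('relu', 50)

# Visualize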
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.plot(gradients_tanh, label='Tanh', linewidth=2)
plt.plot(gradients_sigmoid, label='Sigmoid', linewidth=2)
plt.plot(gradients_relu, label='ReLU', linewidth=2)
plt.xlabel('Layer', fontsize=12)
plt.ylabel('Gradient Magnitude', fontsize=12)
plt.title('Gradient Flow (Linear Scale)', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.semilogy(gradients_tanh, label='Tanh', linewidth=2)
plt.semilogy(gradients_sigmoid, label='Sigmoid', linewidth=2)
plt.semilogy(gradients_relu, label='ReLU', linewidth=2)
plt.xlabel('Layer', fontsize=12)
plt.ylabel('Gradient Magnitude (log scale)', fontsize=12)
plt.title('Gradient Flow (Log Scale)', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

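print("\n📊 Observations:")
print(f"  Tanh/Sigmoid gradient after 50 layers: {gradients_tanh[-1]:.2e} (effectively 0!)")
print(f"  ReLU gradient after 50 layers: {gradients_relu[-1]:.2e} (decays more slowly in this toy model)")
print("\nConclusion: Tanh/Sigmoid suffer severe vanishing gradients in deep networks or long sequences")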
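The toy simulation above multiplies assumed scalar constants. As a complementary sketch, the code below tracks the actual Jacobian ∂ht/∂h0 of a randomly initialized tanh RNN by multiplying the per-step Jacobians diag(1 - ht²)·Wh. The hidden size, weight scale, random inputs, and seed are all assumptions, and a trained network may behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 20, 50  # assumed sizes for the illustration

# Random recurrent and input weights (untrained)
Wh = rng.standard_normal((hidden_size, hidden_size)) * np.sqrt(1.0 / hidden_size)
Wx = rng.standard_normal((hidden_size, 1))

h = np.zeros((hidden_size, 1))
jac = np.eye(hidden_size)  # running product: d h_t / d h_0
norms = []

for t in range(steps):
    x = rng.standard_normal((1, 1))
    h = np.tanh(Wh @ h + Wx @ x)
    # Per-step Jacobian d h_t / d h_{t-1} = diag(1 - h_t²) @ Wh;
    # broadcasting the (hidden_size, 1) column scales each row of Wh
    step_jac = (1 - h ** 2) * Wh
    jac = step_jac @ jac
    norms.append(np.linalg.norm(jac, 2))  # spectral norm

for t in [0, 9, 24, 49]:
    print(f"steps={t + 1:2d}: ||d h_t / d h_0|| = {norms[t]:.3e}")
```

In this setup the norm typically shrinks by many orders of magnitude over 50 steps, which is the same mechanism BPTT experiences when it sends an error signal back to early time steps.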

---

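## 11. Summary

### ✅ Key Points

**RNN**:
1. **Recurrent structure**: ht = tanh(Wh·ht-1 + Wx·xt + b)
2. **BPTT**: backpropagation through time
3. **Vanishing gradients**: repeated multiplication by tanh derivatives and Wh → exponential decay
4. **Use case**: short sequences (roughly < 20 steps, as a rule of thumb)

**LSTM**:
1. **Three gates**: forget gate (ft), input gate (it), output gate (ot)
2. **Cell state**: Ct = ft⊙Ct-1 + it⊙C̃t (additive update!)
3. **Gradient flow**: the additive path lets gradients pass through the cell state, which mitigates vanishing gradients
4. **Use case**: longer sequences (roughly up to 100 steps, as a rule of thumb)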
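To illustrate the additive cell-state path numerically: along the direct path from Ct back to Ct-1 the gradient is scaled elementwise by the forget gate ft, with no weight matrix or tanh derivative in between. The sketch below compares that path with the per-step factor used for tanh in the Section 10 simulation (0.9 × 0.25). The constant forget-gate value of 0.95 is an assumption; in a trained LSTM it varies by unit and time step.

```python
import numpy as np

steps = 50
forget_gate = 0.95       # assumed constant forget-gate activation
rnn_factor = 0.9 * 0.25  # per-step factor from the toy tanh simulation in Section 10

k = np.arange(steps + 1)
cell_path = forget_gate ** k  # gradient scale along the direct cell-state path
rnn_path = rnn_factor ** k    # gradient scale in the toy tanh RNN model

for t in [10, 25, 50]:
    print(f"after {t:2d} steps: cell-state path {cell_path[t]:.3e}, tanh RNN {rnn_path[t]:.3e}")
```

The cell-state path still decays when ft < 1, but far more slowly, which is why the summary describes the effect as mitigation rather than elimination.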

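**GRU**:
1. **Simplified LSTM**: 2 gates (reset gate, update gate)
2. **About 25% fewer parameters than LSTM** (3 weight blocks instead of 4): usually faster to train
3. **Comparable performance**: on par with LSTM on many tasks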

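### 📊 Performance Comparison

| Model | Parameters | Training speed | Long-sequence performance | Use case |
|------|--------|---------|-----------|----------|
| **RNN** | Small | Fast | Poor (vanishing gradients) | Short sequences |
| **LSTM** | Large (4x RNN) | Slow | Good | Long sequences, complex tasks |
| **GRU** | Medium (3x RNN) | Medium | Good | Balanced choice |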
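The 4x and 3x ratios in the table follow from counting weight blocks in the recurrent layer. Here is a minimal sketch, assuming each block has one weight matrix over the concatenated [h, x] plus one bias vector, as in the `LSTMCell` above. The function name `recurrent_param_count` is hypothetical, and library implementations may add extra bias terms.

```python
def recurrent_param_count(input_size, hidden_size, n_blocks):
    """Parameters in the recurrent layer: n_blocks x (weights over [h, x] + bias)."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block

# Sizes matching the notebook's LSTM (input_size=1, hidden_size=32)
counts = {
    name: recurrent_param_count(input_size=1, hidden_size=32, n_blocks=blocks)
    for name, blocks in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]
}

for name, n in counts.items():
    print(f"{name:4s}: {n} parameters ({n / counts['RNN']:.0f}x RNN)")
```

The output layer (`Why`, `by`) is excluded from these counts because it is the same size for all three models.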

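### 💡 Practical Advice

The sequence-length thresholds below are rough rules of thumb, not hard limits.

1. **Prefer LSTM/GRU**: avoid the vanilla RNN unless you have a specific reason to use it
2. **Sequence length < 50**: GRU and LSTM perform similarly, so pick GRU (faster)
3. **Sequence length > 50**: LSTM tends to be slightly better
4. **Sequence length > 200**: consider a Transformer (better long-range dependencies)

### 🚀 From LSTM to Transformer

**Limitations of LSTM**:
- Still processes the sequence step by step (cannot be parallelized across time)
- Gradients can still degrade on very long sequences (> 1000 steps)
- Long training times

**What the Transformer changes**:
- Built entirely on attention (no recurrence)
- Parallel across time steps during training (often much faster; the 10-100x figure is a rough estimate that depends on hardware and sequence length)
- Long-range dependencies: any two positions are connected directly, so gradients do not have to pass through many recurrent steps

**Conclusion**: LSTM is a classic approach to sequence modeling, and understanding it helps explain the design motivation behind the Transformer.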

---

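## 12. Exercises

### Exercise 1: Implement a Bidirectional RNN

Hints:
- Train a forward RNN and a backward RNN together
- Combine the hidden states from both directions
- Apply it to a part-of-speech tagging task

### Exercise 2: Implement a GRU

Hints:
- Only 2 gates are needed (reset gate, update gate)
- There is no separate cell state
- Compare the training speed of GRU and LSTM

### Exercise 3: Advanced Sequence Generation

Tasks:
- Use a larger text corpus (for example, the works of Shakespeare)
- Train a character-level LSTM
- Experiment with different temperature values
- Analyze the quality of the generated text

### Exercise 4: End-to-End IMDB Sentiment Analysis

Tasks:
- Load the IMDB dataset
- Train a classifier with the from-scratch LSTM
- Compare its performance with the Keras LSTM
- Analyze the misclassified examples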

---

## 13. Further Reading

### Key Papers
1. **Original LSTM paper**: Hochreiter & Schmidhuber (1997). *Long Short-Term Memory*.
2. **GRU**: Cho et al. (2014). *Learning Phrase Representations using RNN Encoder-Decoder*.
3. **BPTT**: Werbos (1990). *Backpropagation Through Time: What It Does and How to Do It*.

### Visual Explanations
- **Understanding LSTM**: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- **The Unreasonable Effectiveness of RNN**: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

### Implementation Tutorials
- **CS231n RNN Tutorial**: http://cs231n.stanford.edu/
- **TensorFlow RNN Guide**: https://www.tensorflow.org/guide/keras/rnn

---

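**Course**: iSpan Python NLP Cookbooks v2
**Instructor**: Claude AI
**Last updated**: 2025-10-17

**🎉 Congratulations on completing the Low-Level Implementation Series! You have now implemented NaiveBayes, MLP, RNN, and LSTM from scratch!**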