# CH08-08: Model Fine-Tuning

---

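## 📚 Learning Objectives

1. Understand the concepts of **transfer learning** and **model fine-tuning**
2. Learn to use the **Hugging Face Trainer API**
3. Learn **hyperparameter tuning** techniques
4. Implement a complete **fine-tuning workflow** (AG News news classification)
5. Learn best practices for model **evaluation and deployment**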

---

## 🎯 What Is Model Fine-Tuning?

### Transfer Learning

```
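Pre-trained Model
    ↓
    Trained on large-scale general-purpose data (e.g., Wikipedia, BookCorpus)
    Learns general knowledge of the language
    ↓
Fine-Tuning
    ↓
    Further trained on task-specific data (e.g., news classification, sentiment analysis)
    Adapts to a specific domain or task
    ↓
Task-Specific Model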
```

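### Why Fine-Tune?

| Approach | Pros | Cons | When to use |
|------|------|------|----------|
| **Training from scratch** | Fully customizable | Requires large amounts of data (millions of examples)<br>Long training time (weeks)<br>Requires powerful GPU resources | Large-scale data available<br>Highly specialized domains |
| **Using a pre-trained model as-is** | Zero training cost<br>Immediately usable | May not suit the specific task<br>Limited accuracy | General-purpose tasks<br>Rapid prototyping |
| **Fine-tuning** ✅ | Low data requirements (thousands of examples can suffice)<br>Fast training (minutes to hours)<br>High accuracy | Depends on a pre-trained model | **Most practical scenarios** |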

---

## 🔧 Environment Setup

### Install Required Packages

In [None]:
# Install required packages
# !pip install transformers datasets evaluate accelerate -U
# !pip install scikit-learn numpy pandas matplotlib seaborn

In [None]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
import evaluate
from sklearn.metrics import classification_report, confusion_matrix

# Set random seed for reproducibility
import random
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")

---

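## 📊 Data Preparation: AG News News Classification

### About the Dataset

**AG News** is a classic text classification dataset:
- **Classes**: 4 (World, Sports, Business, Sci/Tech)
- **Training set**: 120,000 news articles
- **Test set**: 7,600 news articles
- **Source**: AG's news corpus

### Loading the Data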

In [None]:
# Load AG News dataset from Hugging Face
print("📥 Loading AG News dataset...")
dataset = load_dataset("ag_news")

print(f"\n✅ Dataset loaded successfully!")
print(f"\nDataset structure:")
print(dataset)

# Check sample
print(f"\n📌 Sample from training set:")
print(dataset['train'][0])

In [None]:
# Create label mapping
label_names = ['World', 'Sports', 'Business', 'Sci/Tech']
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}

print("📋 Label Mapping:")
for idx, name in id2label.items():
    print(f"  {idx}: {name}")

In [None]:
# Explore dataset statistics
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

print(f"\n📊 Dataset Statistics:")
print(f"  Train samples: {len(train_df):,}")
print(f"  Test samples: {len(test_df):,}")
print(f"  Total samples: {len(train_df) + len(test_df):,}")

# Class distribution
print(f"\n📈 Class Distribution (Training):")
class_counts = train_df['label'].value_counts().sort_index()
for idx, count in class_counts.items():
    print(f"  {id2label[idx]}: {count:,} ({count/len(train_df)*100:.1f}%)")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set
train_counts = train_df['label'].map(id2label).value_counts()
axes[0].bar(train_counts.index, train_counts.values, color='steelblue')
axes[0].set_title('Training Set - Class Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Category')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Test set
test_counts = test_df['label'].map(id2label).value_counts()
axes[1].bar(test_counts.index, test_counts.values, color='coral')
axes[1].set_title('Test Set - Class Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Category')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### Data Preprocessing and Tokenization

In [None]:
# For demonstration, use a smaller subset to speed up training
# Remove this in production for full dataset training
SAMPLE_SIZE = 10000  # Use 10k samples for quick training
USE_FULL_DATASET = False  # Set to True for production

if not USE_FULL_DATASET:
    print(f"⚠️  Using subset of {SAMPLE_SIZE:,} samples for quick training")
    dataset['train'] = dataset['train'].shuffle(seed=SEED).select(range(SAMPLE_SIZE))
    dataset['test'] = dataset['test'].shuffle(seed=SEED).select(range(SAMPLE_SIZE // 10))
    print(f"✅ Train: {len(dataset['train']):,} | Test: {len(dataset['test']):,}")
else:
    print(f"✅ Using full dataset")

In [None]:
# Load tokenizer
MODEL_NAME = "distilbert-base-uncased"  # Fast and efficient for classification
print(f"📦 Loading tokenizer: {MODEL_NAME}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"✅ Tokenizer loaded: {tokenizer.__class__.__name__}")

In [None]:
# Define tokenization function
def tokenize_function(examples):
    """
    Tokenize text data
    Args:
        examples: batch of text samples
    Returns:
        tokenized inputs with padding and truncation
    """
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128  # Adjust based on your data
    )

# Apply tokenization to dataset
print("🔄 Tokenizing dataset...")
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text']  # Remove original text column
)

print("✅ Tokenization completed!")
print(f"\nTokenized dataset structure:")
print(tokenized_datasets)

In [None]:
# Inspect tokenized sample
print("\n📌 Tokenized sample:")
sample = tokenized_datasets['train'][0]
print(f"Input IDs length: {len(sample['input_ids'])}")
print(f"Attention mask length: {len(sample['attention_mask'])}")
print(f"Label: {sample['label']} ({id2label[sample['label']]})")

# Decode back to text
decoded_text = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)
print(f"\nDecoded text (first 100 chars):\n{decoded_text[:100]}...")
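
To make `padding='max_length'` and `truncation=True` concrete, the following is a minimal pure-Python sketch of what those two options do to a single sequence. The helper `pad_or_truncate`, the token IDs, and `pad_id=0` are illustrative assumptions; the real tokenizer also manages special tokens during truncation, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # padding: fill up to max_length
    mask += [0] * padding                # 0 marks a padding position
    return ids, mask


# A short sequence gets padded, a long one gets truncated (IDs are made up)
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to `max_length=128` keeps tensor shapes uniform at the cost of some wasted computation on short texts; the attention mask is what lets the model ignore the padded positions.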

---

## 🤖 Model Setup

### Load the Pre-trained Model

In [None]:
# Load pre-trained model for sequence classification
print(f"📦 Loading model: {MODEL_NAME}")

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id
)

print(f"✅ Model loaded successfully!")
print(f"\nModel architecture: {model.__class__.__name__}")
print(f"Number of parameters: {model.num_parameters():,}")

In [None]:
# Check model structure
print("\n🔍 Model Structure:")
print(model)

---

## 📈 Evaluation Metrics

### Define the Evaluation Function

In [None]:
# Load metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics during training
    Args:
        eval_pred: predictions and labels
    Returns:
        dict of metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')
    
    return {
        'accuracy': accuracy['accuracy'],
        'f1': f1['f1']
    }

print("✅ Metrics configured: Accuracy, F1-score (weighted)")
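
For readers unsure what `average='weighted'` means, here is a small pure-Python sketch: compute F1 per class, then average the scores weighted by how many true examples each class has. The function `weighted_f1` and the toy labels are hypothetical and only meant to illustrate the arithmetic, not to replace the `evaluate` metric used above.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true   # weight each class by its support
    return total / len(y_true)


# Toy labels for three classes (made up for illustration)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because AG News is class-balanced, weighted F1 and macro F1 should be nearly identical on it; the distinction matters more on imbalanced data.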

---

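## 🎯 Training Configuration: TrainingArguments

### Hyperparameter Reference

| Parameter | Description | Suggested value |
|------|------|--------|
| `output_dir` | Directory where the model and checkpoints are saved | `./results` |
| `num_train_epochs` | Number of training epochs | 3-5 |
| `per_device_train_batch_size` | Training batch size per device (GPU) | 8-32 |
| `per_device_eval_batch_size` | Evaluation batch size per device (GPU) | 16-64 |
| `learning_rate` | Learning rate | 2e-5 ~ 5e-5 |
| `weight_decay` | Weight decay (L2-style regularization) | 0.01 |
| `warmup_steps` | Number of learning-rate warmup steps | 500 |
| `logging_steps` | Logging interval (in steps) | 100 |
| `evaluation_strategy` | When to run evaluation (named `eval_strategy` in newer Transformers releases) | `epoch` / `steps` |
| `save_strategy` | When to save checkpoints | `epoch` / `steps` |
| `load_best_model_at_end` | Load the best checkpoint when training ends | `True` |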
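
A fixed `warmup_steps` value means very different things depending on how long the run is. The sketch below estimates the number of optimizer steps and the share of them spent in warmup. It assumes a single device and no gradient accumulation, and `training_steps` is a hypothetical helper, not part of the Trainer API.

```python
import math


def training_steps(num_samples, batch_size, epochs):
    """Approximate optimizer steps (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


warmup_steps = 500

# 10,000-sample subset vs. the full 120,000-sample training set
subset_total = training_steps(num_samples=10_000, batch_size=16, epochs=3)
full_total = training_steps(num_samples=120_000, batch_size=16, epochs=3)

print(subset_total, f"{warmup_steps / subset_total:.0%}")
print(full_total, f"{warmup_steps / full_total:.0%}")
```

Under these assumptions, 500 warmup steps cover roughly a quarter of a 10,000-sample run but only about 2% of a full-dataset run, so it may be worth scaling the warmup with the dataset size.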

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results/ag_news_finetuned",
    
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    
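    # Evaluation
    # Newer transformers releases name this argument `eval_strategy`;
    # if you get a TypeError, rename it here and in the print statement below
    evaluation_strategy="epoch",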
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    
    # Logging
    logging_dir="./logs",
    logging_steps=100,
    
    # Other settings
    seed=SEED,
    push_to_hub=False,  # Set to True if you want to push to Hugging Face Hub
)

print("✅ Training arguments configured")
print(f"\n📋 Key settings:")
print(f"  - Epochs: {training_args.num_train_epochs}")
print(f"  - Batch size (train): {training_args.per_device_train_batch_size}")
print(f"  - Learning rate: {training_args.learning_rate}")
print(f"  - Evaluation strategy: {training_args.evaluation_strategy}")

---

## 🚀 Training

### Initialize the Trainer

In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
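    train_dataset=tokenized_datasets['train'],
    # NOTE: the test split doubles as the validation set here, so early stopping and
    # best-model selection both see it; hold out a separate validation split
    # if you need an unbiased final test score
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,  # newer transformers releases name this argument `processing_class`
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]  # Stop after 2 consecutive evaluations without improvement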
)

print("✅ Trainer initialized with early stopping (patience=2)")
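
The patience rule behind early stopping is easy to state in plain Python. The sketch below is an illustration of the idea rather than the callback's actual source: it tracks the best score so far and stops once `patience` consecutive evaluations fail to beat it. The function name and the F1 values are hypothetical.

```python
def early_stopping_epoch(scores, patience=2, greater_is_better=True):
    """Return the 1-based evaluation index at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, score in enumerate(scores, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            bad_evals = 0          # reset the counter on any improvement
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i           # patience exhausted: stop here
    return None                    # never triggered


# Hypothetical per-epoch F1 scores
f1_per_epoch = [0.88, 0.91, 0.90, 0.905, 0.93]
stop_at = early_stopping_epoch(f1_per_epoch)
print(stop_at)
```

Under this rule, with `num_train_epochs=3` and `patience=2` the earliest possible stop is after the third evaluation, which is already the end of training, so early stopping only starts to matter when more epochs are configured.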

In [None]:
# Start training
print("\n🚀 Starting training...\n")
print("=" * 60)

train_result = trainer.train()

print("\n" + "=" * 60)
print("✅ Training completed!\n")

# Print training summary
print("📊 Training Summary:")
print(f"  Total training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"  Training samples/second: {train_result.metrics['train_samples_per_second']:.2f}")
print(f"  Final training loss: {train_result.metrics['train_loss']:.4f}")

---

## 📊 Model Evaluation

### Evaluation on the Test Set

In [None]:
# Evaluate on test set
print("\n🔍 Evaluating model on test set...\n")

eval_results = trainer.evaluate()

print("\n📈 Evaluation Results:")
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

### Detailed Classification Report and Confusion Matrix

In [None]:
# Get predictions
predictions = trainer.predict(tokenized_datasets['test'])
pred_labels = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids

# Classification report
print("\n📋 Classification Report:\n")
print(classification_report(
    true_labels,
    pred_labels,
    target_names=label_names,
    digits=4
))

In [None]:
# Confusion matrix
cm = confusion_matrix(true_labels, pred_labels)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=label_names,
    yticklabels=label_names
)
plt.title('Confusion Matrix - AG News Classification', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()

---

## 💾 Saving and Loading the Model

### Save the Fine-Tuned Model

In [None]:
# Save model and tokenizer
model_save_path = "./models/ag_news_classifier"

print(f"💾 Saving model to {model_save_path}...")
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"✅ Model and tokenizer saved successfully!")

### Load the Saved Model

In [None]:
# Load saved model
from transformers import pipeline

print(f"📂 Loading model from {model_save_path}...")
classifier = pipeline(
    "text-classification",
    model=model_save_path,
    tokenizer=model_save_path
)

print("✅ Model loaded successfully!")

---

## 🎮 Practical Example

### Classify Custom News Text

In [None]:
# Test with custom news articles
test_articles = [
    "Apple releases new iPhone with advanced AI features and improved camera.",
    "The Lakers defeated the Warriors 112-108 in last night's NBA game.",
    "Global stock markets surge as inflation data shows signs of cooling.",
    "Scientists discover potential breakthrough in quantum computing technology."
]

print("🎯 Testing on custom news articles:\n")
print("=" * 70)

for i, article in enumerate(test_articles, 1):
    result = classifier(article)[0]
    
    print(f"\n📰 Article {i}:")
    print(f"   Text: {article}")
    print(f"   Predicted: {result['label']} (Confidence: {result['score']:.2%})")

print("\n" + "=" * 70)
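
The `score` reported for each article is a probability derived from the model's raw outputs (logits). For single-label classification this is a softmax over the four class logits. The sketch below shows that conversion with made-up logits; it illustrates the math and is not the pipeline's internal code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                           # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_names = ['World', 'Sports', 'Business', 'Sci/Tech']
logits = [-1.2, 4.5, 0.3, -0.8]               # hypothetical model outputs
probs = softmax(logits)

# Pick the class with the highest probability
best = max(range(len(probs)), key=probs.__getitem__)
print({'label': label_names[best], 'score': round(probs[best], 4)})
```

A high softmax score indicates the model's relative preference among the four classes; it is not a calibrated guarantee of correctness, especially for text unlike the training data.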

---

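## 🔧 Advanced Hyperparameter Tuning Techniques

### 1. Learning Rate Scheduling

```python
# Linear warmup + linear decay.
# Trainer uses a linear schedule by default; this shows the equivalent setup
# for a custom training loop.
import torch
from transformers import get_linear_schedule_with_warmup

# Illustrative values; in a custom loop use len(train_dataloader) for steps_per_epoch
steps_per_epoch = 625  # 10,000 samples / batch size 16
num_epochs = 3
num_training_steps = steps_per_epoch * num_epochs

# `model` is the classifier loaded earlier in this notebook
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps
)
```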
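
The shape of a linear warmup plus linear decay schedule can be written as a small function: the multiplier rises linearly from 0 to 1 during warmup, then falls linearly back to 0 by the final step. This is a sketch of the formula under that description; `linear_warmup_decay` is a hypothetical name and the step counts are illustrative.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1
        return step / max(1, warmup_steps)
    # Decay: ramp linearly from 1 down to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
schedule = {
    step: base_lr * linear_warmup_decay(step, warmup_steps=500, total_steps=1875)
    for step in (0, 250, 500, 1000, 1875)
}
for step, lr in schedule.items():
    print(step, f"{lr:.2e}")
```

The learning rate peaks at exactly the base value when warmup ends, so a longer warmup shortens the time the model spends training near the peak rate.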

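### 2. Gradient Accumulation

When GPU memory is too small for the batch size you want, accumulate gradients over several smaller batches before each optimizer step:

```python
training_args = TrainingArguments(
    output_dir="./results/ag_news_finetuned",  # keep the other arguments as configured earlier
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # Effective batch size = 8 * 4 = 32
)
```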
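
Why does accumulation give the same update as a larger batch? For a mean-reduced loss and equally sized micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient. The sketch below checks this on a toy one-parameter regression in pure Python; the model, data, and batch sizes are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# Gradient from one batch of 8 examples
full_grad = mse_grad(w, xs, ys)

# Same 8 examples as 4 micro-batches of 2, accumulated before one update
micro_batch_size, accum_steps = 2, 4
accumulated = 0.0
for i in range(accum_steps):
    chunk = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
    # Scale each micro-batch gradient by 1/accum_steps so the sum is an average
    accumulated += mse_grad(w, xs[chunk], ys[chunk]) / accum_steps

print(abs(full_grad - accumulated) < 1e-9)
```

The trade-off is time rather than accuracy: each optimizer step now needs several forward and backward passes, so training takes longer per step while peak memory stays at the micro-batch level.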

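### 3. Mixed Precision Training

Speeds up training and reduces memory usage:

```python
training_args = TrainingArguments(
    output_dir="./results/ag_news_finetuned",  # keep the other arguments as configured earlier
    fp16=True,  # Enable mixed precision; requires a compatible CUDA GPU
)
```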

### 4. Hyperparameter Search

Automatically search for the best hyperparameters:

In [None]:
# Hyperparameter search example (commented out - computationally expensive)
"""
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(label_names),
        id2label=id2label,
        label2id=label2id
    )

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Search space
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
    }

# Run hyperparameter search
best_run = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=hp_space,
    n_trials=10
)

print(f"Best hyperparameters: {best_run.hyperparameters}")
"""

print("💡 Hyperparameter search example provided (commented out)")
print("   Requires: pip install optuna")

---

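## 📚 Summary and Best Practices

### ✅ What You Learned

1. **Transfer learning and fine-tuning concepts**: why fine-tuning usually beats training from scratch
2. **The complete fine-tuning workflow**:
   - Data loading and preprocessing
   - Tokenization
   - Model configuration
   - Training and evaluation
   - Model saving and deployment
3. **Using the Trainer API**: a powerful tool that simplifies the training loop
4. **Hyperparameter tuning**: learning rate, batch size, early stopping, and more
5. **Model evaluation**: accuracy, F1-score, confusion matrix

### 🎯 Fine-Tuning Best Practices

| Best practice | Description | Why |
|---------|------|------|
| **Use a pre-trained model** | Start from BERT, DistilBERT, RoBERTa, etc. | Saves training time and resources |
| **Small learning rate** | 2e-5 ~ 5e-5 | Avoids destroying the pre-trained weights |
| **Few epochs** | 2-5 epochs | Avoids overfitting |
| **Early stopping** | Monitor validation performance | Stops automatically to prevent overfitting |
| **Data augmentation** | Synonym replacement, back-translation | Improves generalization |
| **Class balance** | Handle imbalanced data | Avoids bias toward the majority class |
| **Save checkpoints** | Save at regular intervals | Avoids losing progress if training is interrupted |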
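
AG News is balanced, so the class-balance row does not come into play in this notebook. For an imbalanced dataset, one common starting point is inverse-frequency class weights, computed as `n_samples / (n_classes * class_count)`. The sketch below computes them in pure Python; the helper name and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in sorted(counts.items())}


# Hypothetical imbalanced labels: 70% / 20% / 10%
labels = [0] * 70 + [1] * 20 + [2] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive larger weights, which can then be passed to a weighted loss function so that mistakes on minority classes cost more during training.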

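### 🚀 Next Steps

1. **Try different pre-trained models**:
   - `bert-base-uncased`
   - `roberta-base`
   - `xlnet-base-cased`

2. **Apply it to your own data**:
   - Prepare labeled data
   - Adjust the tokenization strategy
   - Experiment with different hyperparameters

3. **Deploy the model**:
   - Build an API with FastAPI
   - Containerize with Docker
   - Deploy to the cloud (AWS, GCP, Azure)

4. **Advanced optimization**:
   - Model quantization
   - Knowledge distillation
   - ONNX conversion for faster inference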

---

## 🔗 References

- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/)
- [Trainer API Guide](https://huggingface.co/docs/transformers/main_classes/trainer)
- [AG News Dataset](https://huggingface.co/datasets/ag_news)
- [Fine-tuning Best Practices](https://huggingface.co/docs/transformers/training)

---

**Next section**: `09_專案實戰_客戶意見分析儀.ipynb` - A complete end-to-end hands-on project 🎯