# Hands-On Project: Text Classification System (BERT Fine-Tuning)

**Project type**: Deep learning - BERT model fine-tuning
**Difficulty**: ⭐⭐⭐⭐ Advanced
**Estimated time**: 4-5 hours
**Tech stack**: BERT, Transformers, Trainer API, Hugging Face Datasets

---

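## 📚 Learning Objectives

After completing this project, you will be able to:

1. ✅ Walk through the complete BERT fine-tuning workflow
2. ✅ Train a model with the Trainer API
3. ✅ Implement multi-class text classification
4. ✅ Evaluate and optimize model performance
5. ✅ Deploy the model to a production environment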

---

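## 🎯 Project Scenario

### Business Requirements

**Scenario**: A news outlet needs to sort incoming articles into the correct category automatically.

**Requirements**:
- Classify into 4 categories (World, Sports, Business, Sci/Tech)
- Accuracy > 90%
- Inference latency < 100 ms
- Handle articles of 1,000+ words

**Dataset**: AG News (120,000 training samples, 7,600 test samples)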
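
The 100 ms latency target is easiest to verify with a small timing harness. The following is a minimal sketch and not part of the original notebook: `measure_latency_ms` and `fake_predict` are hypothetical names, and `fake_predict` only stands in for the real classifier call.

```python
import statistics
import time


def measure_latency_ms(predict, inputs, warmup=2, runs=20):
    """Time predict(x) per call; return median and worst-case latency in ms."""
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls are not timed
    timings = []
    for i in range(runs):
        x = inputs[i % len(inputs)]
        start = time.perf_counter()
        predict(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}


def fake_predict(text):
    """Hypothetical stand-in for the classifier; replace with the real call."""
    return {"label": "World", "score": 1.0}


stats = measure_latency_ms(fake_predict, ["some headline", "another headline"])
print(f"median={stats['median_ms']:.3f} ms, max={stats['max_ms']:.3f} ms")
```

Once the pipeline from Part 6 exists, passing `classifier` in place of `fake_predict` would give numbers to compare against the requirement.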

---

## Part 1: Environment Setup and Data Loading

In [None]:
# Install required packages
# !pip install transformers datasets accelerate evaluate -q

import transformers
import datasets
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print(f"✅ Transformers: {transformers.__version__}")
print(f"✅ Datasets: {datasets.__version__}")
print(f"✅ PyTorch: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")

### Loading the AG News Dataset

In [None]:
from datasets import load_dataset

# Load AG News dataset
print("📥 Loading AG News dataset...")

dataset = load_dataset("ag_news")

print(f"✅ Dataset loaded!")
print(f"\n📊 Dataset structure:")
print(dataset)

# Label mapping
label_names = ['World', 'Sports', 'Business', 'Sci/Tech']
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}

print(f"\n🏷️ Label mapping:")
for idx, label in id2label.items():
    print(f"   {idx}: {label}")

### Exploratory Data Analysis (EDA)

In [None]:
# Convert to pandas for EDA
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

# Add label names
train_df['label_name'] = train_df['label'].map(id2label)
test_df['label_name'] = test_df['label'].map(id2label)

print("📋 Training Data Overview:")
print(train_df.head())

print(f"\n📊 Label Distribution:")
print(train_df['label_name'].value_counts())

# Text length statistics
train_df['text_length'] = train_df['text'].str.len()
print(f"\n📏 Text Length Statistics:")
print(train_df['text_length'].describe())

In [None]:
# Visualize dataset
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Label distribution
label_counts = train_df['label_name'].value_counts()
axes[0].bar(label_counts.index, label_counts.values, color='steelblue')
axes[0].set_title('Label Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Category')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Text length distribution
axes[1].hist(train_df['text_length'], bins=50, color='coral', edgecolor='black')
axes[1].set_title('Text Length Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Text Length (characters)')
axes[1].set_ylabel('Frequency')
axes[1].axvline(train_df['text_length'].mean(), color='red', linestyle='--', label='Mean')
axes[1].legend()

plt.tight_layout()
plt.show()

---

## Part 2: Model and Tokenizer Preparation

### Loading the BERT Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model selection
model_name = "distilbert-base-uncased"  # Lighter and faster than BERT

print(f"📦 Loading model and tokenizer: {model_name}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,  # 4 categories
    id2label=id2label,
    label2id=label2id
)

print(f"✅ Model loaded successfully!")
print(f"   Parameters: {model.num_parameters():,}")
print(f"   Vocabulary size: {tokenizer.vocab_size:,}")

### Tokenization

In [None]:
def tokenize_function(examples):
    """
    Tokenize text for BERT model
    """
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128  # Limit sequence length
    )

# Apply tokenization to dataset
print("🔄 Tokenizing dataset...")

tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text']  # Remove original text
)

print("✅ Tokenization completed!")
print(f"\nTokenized dataset:")
print(tokenized_datasets)

In [None]:
# Inspect tokenized example
sample = tokenized_datasets['train'][0]

print("🔍 Tokenized Example:")
print(f"Input IDs (first 20): {sample['input_ids'][:20]}")
print(f"Attention Mask (first 20): {sample['attention_mask'][:20]}")
print(f"Label: {sample['label']} ({id2label[sample['label']]})")

# Decode back to text
decoded = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)
print(f"\nDecoded text: {decoded[:100]}...")
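
Because the tokenizer above truncates at `max_length=128`, anything past that point in a long article is dropped. One possible workaround, sketched here under assumptions, is to split the article into overlapping windows, score each window, and average the per-class scores. `split_into_chunks` and `average_scores` are hypothetical helpers, and word counts are used only as a rough proxy for token counts.

```python
def split_into_chunks(text, chunk_words=100, overlap=20):
    """Split text into overlapping word windows (a rough proxy for token windows)."""
    words = text.split()
    if len(words) <= chunk_words:
        return [" ".join(words)]
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def average_scores(per_chunk_scores):
    """Average per-class scores across chunks; return (best_label, mean_scores)."""
    labels = per_chunk_scores[0].keys()
    n = len(per_chunk_scores)
    mean = {label: sum(s[label] for s in per_chunk_scores) / n for label in labels}
    return max(mean, key=mean.get), mean


article = " ".join(f"word{i}" for i in range(250))
chunks = split_into_chunks(article)
print(len(chunks))  # 3 windows: words 0-99, 80-179, 160-249

# Toy per-chunk scores for illustration only; not real model output.
best, mean = average_scores([{"World": 0.8, "Sports": 0.2}, {"World": 0.4, "Sports": 0.6}])
print(best)
```

Averaging is only one aggregation choice; taking the maximum score per class or weighting the first window more heavily are alternatives worth comparing on held-out data.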

---

## Part 3: Model Training

### Defining Evaluation Metrics

In [None]:
import evaluate

# Load evaluation metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')
    precision = precision_metric.compute(predictions=predictions, references=labels, average='weighted')
    recall = recall_metric.compute(predictions=predictions, references=labels, average='weighted')
    
    return {
        'accuracy': accuracy['accuracy'],
        'f1': f1['f1'],
        'precision': precision['precision'],
        'recall': recall['recall']
    }

print("✅ Metrics computation function ready")
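
To make `average='weighted'` concrete, the sketch below computes weighted and macro F1 by hand on toy labels. It is a simplified illustration, not the `evaluate` library's implementation: the helper names are hypothetical, and classes are taken from `y_true` only.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, c) * n / total for c, n in support.items())


def macro_f1(y_true, y_pred):
    """Per-class F1 averaged with equal weight per class."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Toy, deliberately imbalanced labels; not real model output.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
print(f"weighted={weighted_f1(y_true, y_pred):.4f} macro={macro_f1(y_true, y_pred):.4f}")
```

The two averages diverge only when class sizes differ. The AG News splits are balanced across the four categories, so weighted and macro scores should be nearly identical there.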

### Configuring Training Arguments

In [None]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Training arguments
training_args = TrainingArguments(
    output_dir="./ag_news_classifier",
    
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    
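    # Evaluation and saving
    # Note: newer transformers releases rename `evaluation_strategy` to
    # `eval_strategy`; switch names if this argument is rejected.
    evaluation_strategy="epoch",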
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    
    # Logging
    logging_dir="./logs",
    logging_steps=500,
    logging_first_step=True,
    
    # Performance
    fp16=torch.cuda.is_available(),  # Mixed precision (if GPU available)
    dataloader_num_workers=2,
    
    # Reproducibility
    seed=42,
    
    # Report to
    report_to="none"  # Disable W&B/TensorBoard for this demo
)

print("✅ Training arguments configured")
print(f"\n📋 Key settings:")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
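
As a quick sanity check on these settings, the arithmetic below estimates how many optimizer steps the run takes. It is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

train_samples = 120_000   # AG News training split size stated in the scenario
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
logging_steps = 500       # logging_steps

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
log_lines = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_lines)  # 7500 22500 45
```

With multiple GPUs or gradient accumulation the effective batch size grows, so the step counts shrink accordingly.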

### Initializing the Trainer

In [None]:
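# Create Trainer
# Note: the AG News test split doubles as the evaluation set here, so
# checkpoint selection and early stopping both see test data. Hold out part
# of the training split for validation if you need an unbiased test score.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,  # newer transformers releases call this `processing_class`
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)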

print("✅ Trainer initialized")
print(f"   Training samples: {len(tokenized_datasets['train']):,}")
print(f"   Evaluation samples: {len(tokenized_datasets['test']):,}")

### Starting Training

In [None]:
# Train the model
print("🚀 Starting training...\n")

train_result = trainer.train()

print("\n✅ Training completed!")
print(f"\n📊 Training metrics:")
print(f"   Final loss: {train_result.training_loss:.4f}")
print(f"   Training time: {train_result.metrics['train_runtime']:.2f}s")
print(f"   Samples/second: {train_result.metrics['train_samples_per_second']:.2f}")

---

## Part 4: Model Evaluation

### Evaluating on the Test Set

In [None]:
# Evaluate on test set
print("📊 Evaluating model on test set...\n")

eval_results = trainer.evaluate()

print("✅ Evaluation completed!")
print(f"\n📈 Test Set Performance:")
print(f"   Accuracy: {eval_results['eval_accuracy']:.4f} ({eval_results['eval_accuracy']*100:.2f}%)")
print(f"   F1-score: {eval_results['eval_f1']:.4f}")
print(f"   Precision: {eval_results['eval_precision']:.4f}")
print(f"   Recall: {eval_results['eval_recall']:.4f}")

### Confusion Matrix Analysis

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Get predictions
predictions = trainer.predict(tokenized_datasets['test'])
y_pred = np.argmax(predictions.predictions, axis=1)
y_true = predictions.label_ids

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=label_names,
    yticklabels=label_names,
    cbar_kws={'label': 'Count'}
)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Classification report
print("\n📋 Detailed Classification Report:\n")
print(classification_report(y_true, y_pred, target_names=label_names))

### Error Analysis

In [None]:
# Find misclassified examples
test_df_results = test_df.copy()
test_df_results['predicted_label'] = y_pred
test_df_results['predicted_label_name'] = test_df_results['predicted_label'].map(id2label)

# Filter errors
errors = test_df_results[test_df_results['label'] != test_df_results['predicted_label']]

print(f"🔍 Error Analysis")
print(f"   Total errors: {len(errors)} / {len(test_df_results)}")
print(f"   Error rate: {len(errors)/len(test_df_results)*100:.2f}%")

print(f"\n❌ Sample misclassifications:\n")
for idx, row in errors.head(5).iterrows():
    print(f"Text: {row['text'][:100]}...")
    print(f"True label: {row['label_name']}")
    print(f"Predicted: {row['predicted_label_name']}")
    print("-" * 80)
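
Beyond reading individual misclassifications, it can help to count which label pairs are confused most often. The following is a minimal sketch on toy labels; `top_confusions` is a hypothetical helper, and in the notebook it would receive `y_true`, `y_pred`, and `id2label`.

```python
from collections import Counter


def top_confusions(y_true, y_pred, id2label, k=3):
    """Return the k most frequent (true, predicted) label pairs among errors."""
    pairs = Counter(
        (id2label[t], id2label[p]) for t, p in zip(y_true, y_pred) if t != p
    )
    return pairs.most_common(k)


# Toy labels for illustration only; not real model output.
demo_id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
demo_true = [2, 2, 3, 3, 0, 1, 2]
demo_pred = [3, 3, 2, 3, 0, 1, 0]

for (true_name, pred_name), count in top_confusions(demo_true, demo_pred, demo_id2label):
    print(f"{true_name} -> {pred_name}: {count}")
```

This is the same information as the off-diagonal cells of the confusion matrix, sorted so the largest error sources come first.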

---

## Part 5: Saving and Loading the Model

In [None]:
# Save the fine-tuned model
model_save_path = "./ag_news_bert_classifier"

print(f"💾 Saving model to {model_save_path}...")

trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

print("✅ Model and tokenizer saved!")

# Verify saved files
import os
saved_files = os.listdir(model_save_path)
print(f"\n📁 Saved files:")
for file in saved_files:
    print(f"   - {file}")

In [None]:
# Load saved model
print("📥 Loading saved model...")

loaded_model = AutoModelForSequenceClassification.from_pretrained(model_save_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(model_save_path)

print("✅ Model loaded successfully!")

---

## Part 6: Practical Application

### Building a Classification Pipeline

In [None]:
from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model=loaded_model,
    tokenizer=loaded_tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

print("✅ Classification pipeline created!")

In [None]:
# Test with custom news articles
test_articles = [
    "Apple announces new iPhone with advanced AI features at tech conference in California.",
    "Stock market reaches all-time high as investors show confidence in economic recovery.",
    "Manchester United defeats Real Madrid 3-1 in Champions League semifinal match.",
    "Scientists discover new exoplanet that could potentially support life.",
    "The Federal Reserve announced interest rate cuts to stimulate economic growth."
]

print("🧪 Testing classifier with custom articles:\n")
print("=" * 80)

for i, article in enumerate(test_articles, 1):
    result = classifier(article)[0]
    
    print(f"\nArticle {i}:")
    print(f"Text: {article}")
    print(f"Category: {result['label']}")
    print(f"Confidence: {result['score']:.2%}")
    print("-" * 80)

### Batch Classification

In [None]:
# Batch classification
batch_results = classifier(test_articles)

# Create results DataFrame
results_df = pd.DataFrame({
    'article': [a[:80] + '...' for a in test_articles],
    'category': [r['label'] for r in batch_results],
    'confidence': [r['score'] for r in batch_results]
})

print("📊 Batch Classification Results:\n")
print(results_df.to_string(index=False))

---

## Part 7: Model Optimization

### Confidence Analysis

In [None]:
# Analyze prediction confidence
test_predictions = classifier(test_df['text'].head(100).tolist())
confidences = [p['score'] for p in test_predictions]

# Plot confidence distribution
plt.figure(figsize=(12, 5))
plt.hist(confidences, bins=30, color='steelblue', edgecolor='black', alpha=0.7)
plt.axvline(np.mean(confidences), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(confidences):.3f}')
plt.title('Prediction Confidence Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Confidence Score')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"📊 Confidence Statistics:")
print(f"   Mean: {np.mean(confidences):.4f}")
print(f"   Median: {np.median(confidences):.4f}")
print(f"   Min: {np.min(confidences):.4f}")
print(f"   Max: {np.max(confidences):.4f}")
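
A confidence distribution becomes actionable once it is tied to a decision threshold: predictions below the threshold can be routed to human review. The sketch below shows the coverage versus accuracy trade-off on toy values. `coverage_and_accuracy` is a hypothetical helper, and it needs a per-prediction correctness flag that the cell above does not compute.

```python
def coverage_and_accuracy(confidences, correct, threshold):
    """Share of predictions kept at a threshold, and accuracy among those kept."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Toy values for illustration only; not real model output.
demo_conf = [0.99, 0.95, 0.90, 0.70, 0.55, 0.51]
demo_correct = [True, True, True, False, True, False]

for th in (0.5, 0.8):
    cov, acc = coverage_and_accuracy(demo_conf, demo_correct, th)
    print(f"threshold={th:.1f} coverage={cov:.2f} accuracy={acc:.2f}")
```

Raising the threshold typically trades coverage for accuracy; the right operating point depends on how costly a wrong category is compared with a manual review.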

### Model Quantization (Reducing Size)

In [None]:
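# Dynamic quantization
import copy
import torch.quantization

# Quantize model
print("⚡ Quantizing model...")

# Dynamic quantization runs on CPU only, and the pipeline above may have moved
# loaded_model to the GPU, so quantize a CPU copy and leave the original as-is.
cpu_model = copy.deepcopy(loaded_model).to("cpu")

quantized_model = torch.quantization.quantize_dynamic(
    cpu_model,
    {torch.nn.Linear},
    dtype=torch.qint8
)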

print("✅ Quantization completed!")

# Compare model sizes
def get_model_size(model):
    """Calculate model size in MB"""
    torch.save(model.state_dict(), "temp_model.p")
    size = os.path.getsize("temp_model.p") / 1e6
    os.remove("temp_model.p")
    return size

original_size = get_model_size(loaded_model)
quantized_size = get_model_size(quantized_model)

print(f"\n📦 Model Size Comparison:")
print(f"   Original (FP32): {original_size:.2f} MB")
print(f"   Quantized (INT8): {quantized_size:.2f} MB")
print(f"   Reduction: {(1 - quantized_size/original_size)*100:.1f}%")
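
To give an intuition for where the size reduction comes from, the sketch below quantizes a handful of weights to INT8 and back. It uses a simplified symmetric scheme with one scale per tensor and is not PyTorch's exact implementation; `quantize_int8` and `dequantize` are hypothetical helpers.

```python
def quantize_int8(weights):
    """Map floats to int8 values using a single symmetric scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.02, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.4f} max_error={max_error:.6f}")
```

Each quantized weight takes 1 byte instead of 4, at the cost of a rounding error of at most half the scale. Since only `torch.nn.Linear` layers are quantized above, the embedding tables stay in FP32 and the overall file shrinks by less than 75%.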

---

## Part 8: Production Deployment

### Building a FastAPI Service

In [None]:
%%writefile news_classifier_api.py
# news_classifier_api.py

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import pipeline
from typing import List
import uvicorn

# Initialize FastAPI
app = FastAPI(
    title="News Classification API",
    description="Classify news articles into categories",
    version="1.0.0"
)

# Global classifier
classifier = None

@app.on_event("startup")
async def load_model():
    """Load model on startup"""
    global classifier
    print("Loading classifier...")
    classifier = pipeline(
        "text-classification",
        model="./ag_news_bert_classifier",
        device=-1  # CPU
    )
    print("✅ Classifier loaded")

# Request/Response models
class ArticleInput(BaseModel):
    text: str = Field(..., min_length=10, max_length=5000)

class BatchArticleInput(BaseModel):
    articles: List[str]

class ClassificationResponse(BaseModel):
    category: str
    confidence: float

# Endpoints
@app.post("/classify", response_model=ClassificationResponse)
async def classify_article(input_data: ArticleInput):
    """Classify single article"""
    try:
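        # Truncate to the model's maximum input length so long articles do not raise an error
        result = classifier(input_data.text, truncation=True)[0]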
        return ClassificationResponse(
            category=result['label'],
            confidence=round(result['score'], 4)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/classify_batch")
async def classify_batch(input_data: BatchArticleInput):
    """Classify multiple articles"""
    try:
        # Truncate each article to the model's maximum input length
        results = classifier(input_data.articles, truncation=True)
        return {
            "results": [
                {
                    "category": r['label'],
                    "confidence": round(r['score'], 4)
                }
                for r in results
            ]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "model_loaded": classifier is not None
    }

@app.get("/categories")
async def get_categories():
    return {
        "categories": ["World", "Sports", "Business", "Sci/Tech"]
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

### Testing the API

In [None]:
%%writefile test_api.py
# test_api.py

import requests
import json

API_URL = "http://localhost:8000"

# Test single classification
response = requests.post(
    f"{API_URL}/classify",
    json={
        "text": "Apple releases new MacBook Pro with M3 chip and improved performance."
    }
)

print("Single Classification:")
print(json.dumps(response.json(), indent=2))

# Test batch classification
response = requests.post(
    f"{API_URL}/classify_batch",
    json={
        "articles": [
            "The Lakers won the championship game 110-98.",
            "Scientists discover new treatment for cancer.",
            "Stock markets soar as economy shows signs of recovery."
        ]
    }
)

print("\nBatch Classification:")
print(json.dumps(response.json(), indent=2))
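
The `/classify_batch` endpoint accepts a list of any size, so a client sending many articles may want to cap the request size itself. The following is a minimal client-side sketch; `batched` is a hypothetical helper and the batch size of 3 is an arbitrary illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


articles = [f"article {i}" for i in range(7)]
batches = list(batched(articles, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Each batch could then be posted to `/classify_batch` in turn, which keeps individual requests small and bounds the memory used per inference call.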

---

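## Part 9: Summary and Extensions

### ✅ What This Project Covered

1. **End-to-end fine-tuning workflow**
   - Data loading and preprocessing
   - BERT model fine-tuning
   - Evaluation and optimization
   - Saving and loading the model

2. **Performance analysis**
   - Multi-metric evaluation (Accuracy, F1, Precision, Recall)
   - Confusion matrix analysis
   - Error case analysis
   - Confidence distribution analysis

3. **Production deployment**
   - Model quantization
   - FastAPI service
   - Batch classification support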

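### 🚀 Advanced Extensions

#### Feature Extensions
- [ ] Add more categories
- [ ] Multilingual support
- [ ] Hierarchical classification (coarse → fine)
- [ ] Multi-label classification

#### Performance Optimization
- [ ] Use larger models (BERT-large, RoBERTa)
- [ ] Knowledge distillation
- [ ] ONNX conversion for faster inference
- [ ] Model pruning

#### Application Scenarios
- [ ] Automatic email routing
- [ ] Customer-support ticket classification
- [ ] Social media content moderation
- [ ] Document management systems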

### 📚 Further Reading

- [BERT paper](https://arxiv.org/abs/1810.04805)
- [DistilBERT paper](https://arxiv.org/abs/1910.01108)
- [Hugging Face Fine-Tuning Guide](https://huggingface.co/docs/transformers/training)
- [Model Optimization Techniques](https://huggingface.co/docs/transformers/perf_train_gpu_one)

---

**Project version**: v1.0
**Created**: 2025-10-17
**Author**: iSpan NLP Team
**License**: MIT License