# CH08-10: Advanced Techniques & Optimization

---

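## 📚 Learning Objectives

1. Master **model quantization** techniques
2. Learn **inference acceleration** strategies
3. Understand **model compression** methods (knowledge distillation)
4. Implement **batch inference** optimization
5. Learn **production deployment** best practices
6. Use **ONNX** to speed up inference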

---

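## 🎯 Why Optimize?

### Challenges in Production

| Challenge | Description | Optimization goal |
|------|------|----------|
| **Slow inference** | BERT-family models have many parameters (110M+) | Reduce latency, increase throughput |
| **High memory usage** | A single model can need 400MB+ of RAM | Lower memory consumption |
| **High cost** | GPU inference costs far more than CPU inference | Make CPU inference efficient |
| **Poor scalability** | A single GPU has limited capacity | Handle more concurrent requests |

### Comparison of Optimization Techniques

| Technique | Typical speedup | Model size reduction | Accuracy loss | Difficulty |
|------|---------|-------------|----------|------|
| **Quantization (INT8)** | 2-4x (CPU) | Up to 4x for the quantized layers | Small (~1%) | ⭐ Low |
| **Knowledge distillation** | 2-10x | Depends on the student model | Small (2-5%) | ⭐⭐⭐ High |
| **Pruning** | 1.5-3x | 2-3x | Moderate | ⭐⭐ Medium |
| **ONNX conversion** | 1.5-2x | - | None | ⭐⭐ Medium |
| **Batch inference** | Throughput grows with batch size until the hardware saturates | - | None | ⭐ Low |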

---

## 🔧 Environment Setup

In [None]:
# Install optimization libraries
# !pip install transformers torch optimum[onnxruntime]
# !pip install onnx onnxruntime
# !pip install psutil py-cpuinfo

In [None]:
# Import libraries
import os
import time
import psutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    pipeline
)
from pathlib import Path

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

---

## 📏 Baseline Benchmark

### Building a Benchmark Tool

In [None]:
class ModelBenchmark:
    """
    Benchmark tool for model performance evaluation
    """
    def __init__(self):
        self.results = []
    
    @staticmethod
    def get_model_size_mb(model):
        """
        Calculate the serialized model size in MB.

        Serializing the state_dict also counts the packed INT8 weights of
        dynamically quantized Linear layers, which model.parameters() omits.
        """
        import io  # local import keeps this method self-contained

        buffer = io.BytesIO()
        torch.save(model.state_dict(), buffer)
        return buffer.getbuffer().nbytes / 1024 / 1024
    
    @staticmethod
    def measure_inference_time(model, tokenizer, texts, warmup=5, iterations=50):
        """
        Measure average inference time
        Args:
            model: model to benchmark
            tokenizer: tokenizer
            texts: list of input texts
            warmup: number of warmup runs
            iterations: number of measurement iterations
        Returns:
            dict with timing statistics
        """
        # Warmup
        for _ in range(warmup):
            inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            with torch.no_grad():
                _ = model(**inputs)
        
        # Measure
        times = []
        for _ in range(iterations):
            inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            
            start = time.perf_counter()
            with torch.no_grad():
                _ = model(**inputs)
            end = time.perf_counter()
            
            times.append((end - start) * 1000)  # Convert to ms
        
        return {
            'mean_ms': np.mean(times),
            'std_ms': np.std(times),
            'min_ms': np.min(times),
            'max_ms': np.max(times),
            'p50_ms': np.percentile(times, 50),
            'p95_ms': np.percentile(times, 95),
            'p99_ms': np.percentile(times, 99)
        }
    
    @staticmethod
    def measure_memory_usage():
        """
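        Measure the resident memory (RSS) of the current process in MB.
        This covers everything loaded so far, not only the model under test.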
        """
        process = psutil.Process()
        mem_info = process.memory_info()
        return mem_info.rss / 1024 / 1024  # MB
    
    def add_result(self, name, model, tokenizer, test_texts):
        """
        Add benchmark result
        """
        print(f"\n🔄 Benchmarking: {name}...")
        
        model_size = self.get_model_size_mb(model)
        timing = self.measure_inference_time(model, tokenizer, test_texts)
        memory = self.measure_memory_usage()
        
        result = {
            'name': name,
            'model_size_mb': model_size,
            'memory_mb': memory,
            **timing
        }
        
        self.results.append(result)
        
        print(f"   Model size: {model_size:.2f} MB")
        print(f"   Avg latency: {timing['mean_ms']:.2f} ms")
        print(f"   P95 latency: {timing['p95_ms']:.2f} ms")
        
        return result
    
    def get_summary_df(self):
        """
        Get summary DataFrame
        """
        return pd.DataFrame(self.results)

# Initialize benchmark
benchmark = ModelBenchmark()
print("✅ Benchmark tool initialized")

### Loading and Benchmarking the Baseline Model

In [None]:
# Load baseline model
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

print(f"📦 Loading baseline model: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()  # Set to evaluation mode

print("✅ Baseline model loaded")

# Prepare test texts
test_texts = [
    "This product is amazing!",
    "I'm very disappointed with the quality.",
    "It's okay, nothing special."
]

# Benchmark baseline model
baseline_result = benchmark.add_result("Baseline (FP32)", model, tokenizer, test_texts)

---

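## ⚡ Optimization Technique 1: Dynamic Quantization

### What Is Quantization?

Quantization converts model weights from **FP32** (32-bit float) to **INT8** (8-bit integer):

```
FP32:  4 bytes per parameter
INT8:  1 byte per parameter

Compression: 4x for the quantized weights
Speedup: 2-4x (CPU)
Accuracy loss: ~1% (small)
```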
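
The FP32-to-INT8 mapping can be sketched in plain Python. This is a minimal illustration of affine (scale and zero-point) quantization, not PyTorch's internal implementation; the helper names `quantize_int8` and `dequantize` are hypothetical and the weights are made-up values.

```python
def quantize_int8(values):
    """Affine quantization of a list of floats to signed 8-bit integers."""
    qmin, qmax = -128, 127
    # The representable range must include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map INT8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # illustrative values
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("INT8 values:", q)
print("Max round-trip error:", max_error)
```

Each weight now fits in one byte instead of four, at the cost of a round-trip error bounded by the quantization step `scale`.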

### Implementing Dynamic Quantization

In [None]:
# Dynamic quantization
print("🔄 Applying dynamic quantization...\n")

# Reload model for quantization
model_quantized = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model_quantized.eval()

# Apply dynamic quantization
model_quantized = torch.quantization.quantize_dynamic(
    model_quantized,
    {torch.nn.Linear},  # Quantize Linear layers
    dtype=torch.qint8
)

print("✅ Quantization applied!")

# Benchmark quantized model
quantized_result = benchmark.add_result("Quantized (INT8)", model_quantized, tokenizer, test_texts)

In [None]:
# Compare baseline vs quantized
print("\n" + "="*70)
print("📊 BASELINE vs QUANTIZED COMPARISON")
print("="*70)

print(f"\n📏 Model Size:")
print(f"   Baseline:  {baseline_result['model_size_mb']:.2f} MB")
print(f"   Quantized: {quantized_result['model_size_mb']:.2f} MB")
print(f"   Reduction: {(1 - quantized_result['model_size_mb']/baseline_result['model_size_mb'])*100:.1f}%")

print(f"\n⚡ Inference Speed:")
print(f"   Baseline:  {baseline_result['mean_ms']:.2f} ms")
print(f"   Quantized: {quantized_result['mean_ms']:.2f} ms")
print(f"   Speedup:   {baseline_result['mean_ms']/quantized_result['mean_ms']:.2f}x")

print("\n" + "="*70)
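
Dynamic quantization as applied here converts only the `torch.nn.Linear` weights, so embeddings and other layers stay in FP32 and the whole-model reduction is usually below 4x. A rough sketch of that arithmetic follows; the parameter counts are illustrative assumptions rather than measurements, `estimate_quantized_size_mb` is a hypothetical helper, and the estimate ignores per-layer scale and zero-point overhead.

```python
def estimate_quantized_size_mb(total_params, linear_params,
                               fp32_bytes=4, int8_bytes=1):
    """Estimate model size when only Linear-layer weights are stored as INT8."""
    other_params = total_params - linear_params
    size_bytes = linear_params * int8_bytes + other_params * fp32_bytes
    return size_bytes / 1024 / 1024


# Illustrative parameter counts (assumptions, not measured from the model above)
TOTAL_PARAMS = 67_000_000
LINEAR_PARAMS = 43_000_000

fp32_mb = estimate_quantized_size_mb(TOTAL_PARAMS, 0)
mixed_mb = estimate_quantized_size_mb(TOTAL_PARAMS, LINEAR_PARAMS)
reduction_pct = (1 - mixed_mb / fp32_mb) * 100

print(f"FP32: {fp32_mb:.1f} MB")
print(f"Linear-only INT8: {mixed_mb:.1f} MB ({reduction_pct:.1f}% smaller)")
```

Under these assumptions roughly half of the size is saved, because the FP32 embedding table dominates what remains.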

### Verifying the Quantized Model's Accuracy

In [None]:
# Test accuracy: baseline vs quantized
print("🧪 Testing prediction accuracy...\n")

test_samples = [
    "Excellent product, highly recommend!",
    "Terrible experience, very disappointed.",
    "It's okay, could be better.",
    "Absolutely fantastic! Best purchase ever!",
    "Worst product I've ever bought."
]

# Baseline and quantized predictions on the same inputs
inputs = tokenizer(test_samples, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    baseline_outputs = model(**inputs)
    quantized_outputs = model_quantized(**inputs)

baseline_preds = torch.argmax(baseline_outputs.logits, dim=1)
quantized_preds = torch.argmax(quantized_outputs.logits, dim=1)

# Compare
print("Prediction Comparison:")
print("-" * 70)
for i, text in enumerate(test_samples):
    match = "✅" if baseline_preds[i] == quantized_preds[i] else "❌"
    print(f"{match} {text[:50]}")
    print(f"   Baseline:  {baseline_preds[i].item()}")
    print(f"   Quantized: {quantized_preds[i].item()}")

accuracy = (baseline_preds == quantized_preds).sum().item() / len(test_samples)
print(f"\n✅ Prediction match rate: {accuracy*100:.1f}%")
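
Matching labels on five sentences says little about how far the outputs drift. A small sketch that also reports the largest change in predicted probability is shown below; `agreement_report` is a hypothetical helper and the logits are made-up values, not outputs of the models above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def agreement_report(reference_logits, candidate_logits):
    """Compare two models on the same inputs (one row of logits per input)."""
    matches = 0
    max_prob_diff = 0.0
    for ref, cand in zip(reference_logits, candidate_logits):
        # Same predicted class?
        if ref.index(max(ref)) == cand.index(max(cand)):
            matches += 1
        # Largest per-class probability shift for this input
        p_ref, p_cand = softmax(ref), softmax(cand)
        diff = max(abs(a - b) for a, b in zip(p_ref, p_cand))
        max_prob_diff = max(max_prob_diff, diff)
    return {
        "match_rate": matches / len(reference_logits),
        "max_prob_diff": max_prob_diff,
    }


# Hypothetical logits for three inputs (illustration only)
reference = [[2.0, -1.0], [-0.5, 1.5], [0.1, 0.0]]
candidate = [[1.9, -0.9], [-0.4, 1.4], [-0.1, 0.1]]
report = agreement_report(reference, candidate)
print(report)
```

With the tensors from the cell above, the call would be `agreement_report(baseline_outputs.logits.tolist(), quantized_outputs.logits.tolist())`. For a real accuracy figure, evaluate both models against a labeled validation set.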

---

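## 🚀 Optimization Technique 2: ONNX Conversion

### What Is ONNX?

**ONNX (Open Neural Network Exchange)** is a cross-platform format for deep learning models:

- 🎯 **Inference-oriented**: a computation graph optimized for inference
- ⚡ **Efficient execution**: ONNX Runtime is heavily optimized
- 🌐 **Cross-platform**: supports a range of hardware (CPU, GPU, edge)
- 📦 **Deployment-friendly**: easy to integrate into production environments

### Implementing ONNX Conversion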

In [None]:
# Export to ONNX
try:
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from optimum.onnxruntime.configuration import AutoQuantizationConfig
    
    print("📦 Converting model to ONNX format...\n")
    
    # Save directory
    onnx_save_path = "./onnx_model"
    
    # Convert to ONNX
    onnx_model = ORTModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        export=True
    )
    
    # Save ONNX model
    onnx_model.save_pretrained(onnx_save_path)
    tokenizer.save_pretrained(onnx_save_path)
    
    print(f"✅ ONNX model saved to: {onnx_save_path}")
    
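    # Note: the ORTModel accepts the same tokenized inputs (onnx_model(**inputs)),
    # but it wraps an ONNX Runtime session rather than PyTorch parameters, so
    # ModelBenchmark.get_model_size_mb() does not apply to it. Time its forward
    # pass separately and compare the .onnx file size on disk instead.
    
    print("\n⚡ ONNX model exported and ready!")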
    
except ImportError:
    print("⚠️  optimum[onnxruntime] not installed")
    print("   Install with: pip install optimum[onnxruntime]")
    onnx_model = None
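
Since the ONNX model is not benchmarked by `ModelBenchmark` here, one option is a model-agnostic timer that accepts any zero-argument callable. The sketch below uses only the standard library; `time_callable` is a hypothetical helper, and the lambda being timed is a stand-in workload rather than a real model call.

```python
import statistics
import time


def time_callable(fn, warmup=5, iterations=50):
    """Time a zero-argument callable and return latency statistics in ms."""
    for _ in range(warmup):
        fn()  # warm caches before measuring

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)

    samples.sort()
    # Nearest-rank style index for the 95th percentile
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.mean(samples),
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Stand-in workload; replace the lambda with a real forward pass
stats = time_callable(lambda: sum(range(1000)), warmup=1, iterations=20)
print(stats)
```

Assuming the export above succeeded, the ONNX model could be timed with `time_callable(lambda: onnx_model(**inputs))`, where `inputs` comes from `tokenizer(test_texts, return_tensors="pt", padding=True, truncation=True)`.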

---

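## 📦 Optimization Technique 3: Batch Inference

### Why Is Batching Faster?

Each forward pass carries a fixed overhead, so processing several inputs together spreads that cost across them (the timings below are illustrative):

```
One at a time:  [Text1] → Model → [Result1]  (10ms)
                [Text2] → Model → [Result2]  (10ms)
                [Text3] → Model → [Result3]  (10ms)
                Total: 30ms

Batched:        [Text1, Text2, Text3] → Model → [Result1, Result2, Result3]
                Total: 15ms (2x faster)
```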
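
The diagram's numbers are consistent with a simple cost model: a fixed overhead per forward pass plus a cost per item. The sketch below assumes that linear model, which real hardware only approximates (padding, caches and parallelism all bend the curve); the constants are chosen to reproduce the 10 ms and 15 ms figures and are not measurements.

```python
def batch_latency_ms(batch_size, fixed_overhead_ms, per_item_ms):
    """Latency of one forward pass under a fixed-plus-per-item cost model."""
    return fixed_overhead_ms + per_item_ms * batch_size


def throughput_per_sec(batch_size, fixed_overhead_ms, per_item_ms):
    """Items processed per second when running batches back to back."""
    latency = batch_latency_ms(batch_size, fixed_overhead_ms, per_item_ms)
    return batch_size / latency * 1000


# Assumed constants: 1 item -> 10 ms, 3 items -> 15 ms, as in the diagram
FIXED_OVERHEAD_MS = 7.5
PER_ITEM_MS = 2.5

for n in (1, 3, 16, 64):
    latency = batch_latency_ms(n, FIXED_OVERHEAD_MS, PER_ITEM_MS)
    rate = throughput_per_sec(n, FIXED_OVERHEAD_MS, PER_ITEM_MS)
    print(f"batch={n:2d}  latency={latency:6.1f} ms  throughput={rate:6.1f} items/sec")
```

Under this model throughput rises with batch size but flattens toward `1000 / PER_ITEM_MS` items per second, while the latency of each batch keeps growing, which is why larger batches are not always better.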

### Implementing Batch Inference

In [None]:
# Compare single vs batch inference
test_samples_large = [
    f"This is test review number {i} for benchmarking."
    for i in range(100)
]

# Single inference
print("🔄 Testing single inference...")
start = time.perf_counter()
for text in test_samples_large:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        _ = model(**inputs)
end = time.perf_counter()
single_time = (end - start) * 1000
print(f"   Single inference: {single_time:.2f} ms")

# Batch inference (batch_size=10)
print("\n🔄 Testing batch inference...")
batch_size = 10
start = time.perf_counter()
for i in range(0, len(test_samples_large), batch_size):
    batch = test_samples_large[i:i+batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        _ = model(**inputs)
end = time.perf_counter()
batch_time = (end - start) * 1000
print(f"   Batch inference (batch_size={batch_size}): {batch_time:.2f} ms")

print(f"\n⚡ Speedup: {single_time/batch_time:.2f}x")

In [None]:
# Find optimal batch size
print("🔍 Finding optimal batch size...\n")

batch_sizes = [1, 4, 8, 16, 32, 64]
batch_results = []

for bs in batch_sizes:
    start = time.perf_counter()
    
    for i in range(0, len(test_samples_large), bs):
        batch = test_samples_large[i:i+bs]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            _ = model(**inputs)
    
    elapsed = (time.perf_counter() - start) * 1000
    throughput = len(test_samples_large) / elapsed * 1000  # samples/sec
    
    batch_results.append({
        'batch_size': bs,
        'time_ms': elapsed,
        'throughput': throughput
    })
    
    print(f"Batch size {bs:2d}: {elapsed:7.2f} ms | {throughput:6.1f} samples/sec")

# Pick the batch size with the highest throughput
batch_df = pd.DataFrame(batch_results)
optimal_idx = batch_df['throughput'].idxmax()
optimal_bs = batch_df.loc[optimal_idx, 'batch_size']

print(f"\n✅ Optimal batch size: {optimal_bs}")
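
The highest-throughput batch size is not always usable in a service, because each request has to wait for its whole batch. A sketch of choosing the best batch size under a per-batch latency budget follows; `pick_batch_size` is a hypothetical helper, and the rows are made-up numbers in the same shape as `batch_results`.

```python
import math


def pick_batch_size(results, n_samples, max_batch_latency_ms):
    """Return the highest-throughput batch size whose per-batch latency fits the budget."""
    best = None
    for row in results:
        # Total time divided by the number of batches gives per-batch latency
        n_batches = math.ceil(n_samples / row["batch_size"])
        batch_latency = row["time_ms"] / n_batches
        if batch_latency > max_batch_latency_ms:
            continue
        if best is None or row["throughput"] > best["throughput"]:
            best = row
    return None if best is None else best["batch_size"]


# Made-up measurements for 100 samples (illustration only)
example_results = [
    {"batch_size": 1, "time_ms": 1000.0, "throughput": 100.0},
    {"batch_size": 8, "time_ms": 400.0, "throughput": 250.0},
    {"batch_size": 32, "time_ms": 250.0, "throughput": 400.0},
]

choice = pick_batch_size(example_results, n_samples=100, max_batch_latency_ms=40.0)
print("Batch size within a 40 ms budget:", choice)
```

With the notebook's own measurements, the call would be `pick_batch_size(batch_results, len(test_samples_large), max_batch_latency_ms=...)` with a budget taken from the service's latency target.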

In [None]:
# Plot batch size vs throughput
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Throughput
axes[0].plot(batch_df['batch_size'], batch_df['throughput'], marker='o', linewidth=2, markersize=8)
axes[0].axvline(optimal_bs, color='red', linestyle='--', label=f'Optimal: {optimal_bs}')
axes[0].set_title('Throughput vs Batch Size', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Batch Size')
axes[0].set_ylabel('Throughput (samples/sec)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Total time
axes[1].plot(batch_df['batch_size'], batch_df['time_ms'], marker='s', linewidth=2, markersize=8, color='coral')
axes[1].set_title('Total Inference Time vs Batch Size', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Batch Size')
axes[1].set_ylabel('Time (ms)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

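## 💡 Optimization Technique 4: Knowledge Distillation

### Concept

```
Teacher model                   Student model
BERT-base (110M params)      →  DistilBERT (66M params)
RoBERTa-base (125M params)   →  DistilRoBERTa (82M params)

Method: train the small model to match the large model's output distribution
Result: retains 95%+ of the accuracy with a model roughly 35-40% smaller
```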
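
"Matching the output distribution" is commonly implemented as a KL divergence between temperature-softened teacher and student probabilities. The sketch below shows that soft-target term in plain Python; the function names and logits are illustrative assumptions, and a full training setup typically combines this term with the usual cross-entropy on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0
    )
    return kl * temperature ** 2


teacher = [3.0, 1.0, 0.2]        # made-up teacher logits
near_student = [2.8, 1.1, 0.3]   # close to the teacher
far_student = [0.2, 1.0, 3.0]    # disagrees with the teacher

loss_same = distillation_loss(teacher, teacher)
loss_near = distillation_loss(near_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_same, loss_near, loss_far)
```

The loss is zero when the student reproduces the teacher exactly and grows as the two distributions diverge, which is the signal the student is trained to minimize.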

### Using a Pre-Distilled Model

In [None]:
# Compare BERT vs DistilBERT
print("📦 Loading BERT-base and DistilBERT for comparison...\n")

# BERT-base
bert_model = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-SST-2"
)
bert_tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")

# DistilBERT (already loaded)
distilbert_model = model
distilbert_tokenizer = tokenizer

# Compare sizes
bert_size = benchmark.get_model_size_mb(bert_model)
distilbert_size = benchmark.get_model_size_mb(distilbert_model)

print(f"📏 Model Size Comparison:")
print(f"   BERT-base:    {bert_size:.2f} MB")
print(f"   DistilBERT:   {distilbert_size:.2f} MB")
print(f"   Reduction:    {(1 - distilbert_size/bert_size)*100:.1f}%")

# Benchmark both
bert_result = benchmark.add_result("BERT-base", bert_model, bert_tokenizer, test_texts)
distilbert_result = benchmark.add_result("DistilBERT", distilbert_model, distilbert_tokenizer, test_texts)

print(f"\n⚡ Speed Comparison:")
print(f"   BERT-base:    {bert_result['mean_ms']:.2f} ms")
print(f"   DistilBERT:   {distilbert_result['mean_ms']:.2f} ms")
print(f"   Speedup:      {bert_result['mean_ms']/distilbert_result['mean_ms']:.2f}x")

---

## 📊 Overall Performance Comparison

### Visualizing All Optimization Results

In [None]:
# Get summary DataFrame
summary_df = benchmark.get_summary_df()

print("\n📊 COMPREHENSIVE BENCHMARK SUMMARY\n")
print(summary_df[['name', 'model_size_mb', 'mean_ms', 'p95_ms']].to_string(index=False))

In [None]:
# Visualization: Model size and latency comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Model size comparison
axes[0].barh(summary_df['name'], summary_df['model_size_mb'], color='steelblue')
axes[0].set_title('Model Size Comparison', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Size (MB)')
axes[0].grid(axis='x', alpha=0.3)

# Latency comparison
axes[1].barh(summary_df['name'], summary_df['mean_ms'], color='coral')
axes[1].set_title('Average Latency Comparison', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Latency (ms)')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Speedup analysis (relative to baseline)
baseline_latency = summary_df[summary_df['name'] == 'Baseline (FP32)']['mean_ms'].values[0]
baseline_size = summary_df[summary_df['name'] == 'Baseline (FP32)']['model_size_mb'].values[0]

summary_df['speedup'] = baseline_latency / summary_df['mean_ms']
summary_df['size_reduction'] = (1 - summary_df['model_size_mb'] / baseline_size) * 100

print("\n⚡ Optimization Impact (vs Baseline):\n")
print(summary_df[['name', 'speedup', 'size_reduction']].to_string(index=False))

---

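## 🚀 Production Deployment Best Practices

### 1. Choosing the Right Optimization Strategy

#### Decision Tree

```
Deployment environment?
├─ Cloud server (GPU available)
│   └─ FP16 mixed precision + batch inference
│
├─ Cloud server (CPU only)
│   └─ INT8 quantization + ONNX + batch inference
│
├─ Edge device (mobile/IoT)
│   └─ DistilBERT + INT8 quantization + TensorFlow Lite
│
└─ API service (high concurrency)
    └─ ONNX + batch inference + load balancing
```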

### 2. FastAPI Deployment Example

In [None]:
# FastAPI deployment example (save as app.py)
fastapi_code = '''
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline
import torch

# Initialize app
app = FastAPI(title="Sentiment Analysis API")

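# Load the model from a local directory written with save_pretrained().
# A model produced by torch quantize_dynamic cannot be restored directly
# through from_pretrained(); quantize after loading, or serve the ONNX export.
model_path = "./models/quantized_model"
classifier = pipeline(
    "sentiment-analysis",
    model=model_path,
    device=-1  # CPU
)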

# Request model
class TextInput(BaseModel):
    text: str

class BatchTextInput(BaseModel):
    texts: list[str]

# Single prediction endpoint
@app.post("/predict")
def predict(input_data: TextInput):
    try:
        result = classifier(input_data.text)[0]
        return {
            "sentiment": result["label"],
            "confidence": round(result["score"], 4)
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Batch prediction endpoint
@app.post("/predict_batch")
def predict_batch(input_data: BatchTextInput):
    try:
        results = classifier(input_data.texts)
        return {
            "predictions": [
                {
                    "sentiment": r["label"],
                    "confidence": round(r["score"], 4)
                }
                for r in results
            ]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Health check
@app.get("/health")
def health():
    return {"status": "healthy"}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
'''

# Save the FastAPI code as app.py, the filename the Dockerfile copies and runs
with open("app.py", "w") as f:
    f.write(fastapi_code)

print("✅ FastAPI deployment code saved to: app.py")
print("\nTo run:")
print("   1. pip install fastapi uvicorn")
print("   2. uvicorn app:app --reload")
print("   3. Visit: http://localhost:8000/docs")
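
Once the server is running, the `/predict` endpoint can be called from any HTTP client. The sketch below builds the request with the standard library only; `build_predict_request` is a hypothetical helper, and the base URL assumes the local uvicorn command shown above.

```python
import json
import urllib.request


def build_predict_request(base_url, text):
    """Build a POST request for the /predict endpoint with a JSON body."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("http://localhost:8000", "This product is amazing!")
print(request.get_method(), request.full_url)

# With the server running, send the request like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```

The batch endpoint follows the same pattern, with the path `/predict_batch` and a body of the form `{"texts": [...]}`.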

### 3. Containerizing with Docker

In [None]:
# Dockerfile example
dockerfile = '''
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app.py .
COPY models/ ./models/

# Expose port
EXPOSE 8000

# Run application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
'''

# requirements.txt
requirements = '''
fastapi==0.104.1
uvicorn[standard]==0.24.0
transformers==4.35.0
torch==2.1.0
pydantic==2.4.2
'''

# Save files
with open("Dockerfile", "w") as f:
    f.write(dockerfile)

with open("requirements.txt", "w") as f:
    f.write(requirements)

print("✅ Docker configuration files created")
print("\nTo build and run:")
print("   docker build -t sentiment-api .")
print("   docker run -p 8000:8000 sentiment-api")

---

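## 📚 Summary and Best Practices

### ✅ Key Takeaways

1. **Quantization**:
   - INT8 stores quantized weights in 75% less space; dynamic quantization converts only the Linear layers, so the whole-model reduction is smaller
   - 2-4x faster inference (CPU)
   - Accuracy loss is typically around 1%
   - **Best for**: CPU deployment

2. **Distillation**:
   - Use pre-distilled models such as DistilBERT
   - Retains 95%+ of the accuracy
   - 40% smaller and 60% faster
   - **Best for**: resource-constrained environments

3. **Batching**:
   - Raises throughput until the hardware saturates
   - Find the best batch size (often 16-32)
   - **Best for**: high-throughput workloads

4. **ONNX conversion**:
   - Cross-platform deployment
   - ONNX Runtime is heavily optimized
   - **Best for**: production deployment

### 🎯 Choosing an Optimization Strategy

| Scenario | Recommended strategy | Expected effect |
|------|----------|----------|
| **Cloud API (CPU)** | INT8 quantization + ONNX + batching | 3-5x speedup |
| **Edge device** | DistilBERT + INT8 | Model about 70% smaller |
| **High-concurrency service** | Batching + load balancing | 10x+ throughput |
| **Low-latency requirement** | GPU + FP16 + batching | < 10ms latency |

### 💡 Deployment Checklist

- [ ] **Model optimization**: choose a suitable quantization or distillation strategy
- [ ] **Batching**: implement batch inference logic
- [ ] **Error handling**: handle malformed input and model failures
- [ ] **Monitoring**: record latency, throughput, and error rate
- [ ] **Versioning**: model version control and a rollback mechanism
- [ ] **Load testing**: stress tests and performance baselines
- [ ] **Documentation**: API docs and usage examples
- [ ] **CI/CD**: automated testing and deployment pipeline

### 🚀 Next Steps

1. **Advanced quantization**: QAT (Quantization-Aware Training)
2. **Model pruning**: removing unimportant weights
3. **TensorRT**: deep optimization for NVIDIA GPUs
4. **TensorFlow Lite**: mobile deployment
5. **Multi-model serving**: TorchServe, TensorFlow Serving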

---

## 🔗 Resources

- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
- [ONNX Runtime](https://onnxruntime.ai/)
- [Hugging Face Optimum](https://huggingface.co/docs/optimum/)
- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [Model Optimization Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one)

---

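**Congratulations on completing the CH08 Hugging Face hands-on series! 🎉**

You have now covered:
- ✅ The Hugging Face ecosystem
- ✅ The Pipeline API and a range of NLP tasks
- ✅ The complete model fine-tuning workflow
- ✅ An end-to-end project
- ✅ Production optimization and deployment

**Next chapter**: CH09 Course Summary and Career Planning 🎓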