# CH08-05: Zero-Shot Classification

**Course**: iSpan Python NLP Cookbooks v2
**Chapter**: CH08 Hugging Face Libraries in Practice
**Version**: v1.0
**Last updated**: 2025-10-17

---

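## 📚 Learning Objectives

1. Understand the principles of zero-shot learning
2. Use the Hugging Face zero-shot classification pipeline
3. Apply it to multi-label classification tasks
4. Classify with custom labels
5. Compare it with traditional supervised learning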

---

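## 1. Principles of Zero-Shot Learning

### 1.1 What Is Zero-Shot Classification?

**Definition**: The model classifies text into categories it was never explicitly trained on.

**Core idea**:
```
Traditional classification:
  training data → model → predicts a fixed set of classes

Zero-shot classification:
  pretrained model + natural-language class descriptions → predicts arbitrary classes
```

**Example**:
```
Text: "Apple released the new iPhone today."
Candidate labels: ["technology", "sports", "politics"]

Model scores (illustrative):
- technology: 0.85
- sports: 0.10
- politics: 0.05
```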
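
The NLI-based approach used later in this section can be pictured as turning each candidate label into a hypothesis sentence and asking whether the text entails it. The sketch below only builds those premise/hypothesis pairs; `build_nli_pairs` is a hypothetical helper name, and the template string is the default used by the Hugging Face zero-shot pipeline.

```python
from typing import List, Tuple


def build_nli_pairs(text: str, labels: List[str],
                    template: str = "This example is {}.") -> List[Tuple[str, str]]:
    """Pair the input text (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in labels]


pairs = build_nli_pairs(
    "Apple released the new iPhone today.",
    ["technology", "sports", "politics"],
)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")
```

Each pair costs one forward pass through the NLI model, so the work grows with the number of candidate labels.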

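### 1.2 Strengths and Limitations

**Strengths**:
- ✅ No labeled data required
- ✅ Adapts quickly to new categories
- ✅ Labels can be changed freely at inference time

**Limitations**:
- ❌ Accuracy is usually lower than a model fine-tuned on the task
- ❌ Results depend on how the labels are worded
- ❌ Slower inference: the NLI model runs once per candidate label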

In [None]:
# Install the required packages
# !pip install transformers torch pandas matplotlib seaborn scikit-learn -q

from transformers import pipeline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ Environment ready")

---

## 2. Basic Usage

### 2.1 Creating a Zero-Shot Classifier

In [None]:
# Load the zero-shot classification pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=-1  # CPU
)

# Test text
text = "Apple CEO Tim Cook announced the new iPhone at the event in California."

# Candidate labels
candidate_labels = ["technology", "sports", "politics", "entertainment", "business"]

# Predict
result = classifier(text, candidate_labels)

print(f"Text: {text}\n")
print("Classification result:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label:15s}: {score:.4f}")
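
For a single input string, the pipeline returns a dict with the keys `sequence`, `labels`, and `scores`, with labels ordered from highest to lowest score. The sketch below shows one way to pull out the top few labels. `top_k` is a hypothetical helper, and `mock_result` holds made-up numbers that only imitate the output shape, so the example runs without downloading a model.

```python
from typing import Dict, List, Tuple


def top_k(result: Dict, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs from a pipeline result."""
    ranked = sorted(zip(result["labels"], result["scores"]),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores for illustration only, not real model output
mock_result = {
    "sequence": "Apple CEO Tim Cook announced the new iPhone at the event in California.",
    "labels": ["technology", "business", "entertainment", "politics", "sports"],
    "scores": [0.62, 0.30, 0.04, 0.03, 0.01],
}

for label, score in top_k(mock_result, k=2):
    print(f"{label}: {score:.2f}")
```

With the real classifier, `top_k(result, k=2)` would be called on the `result` computed above.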

### 2.2 Visualizing the Results

In [None]:
# Plot the score distribution
plt.figure(figsize=(10, 6))
plt.barh(result['labels'], result['scores'], color='skyblue')
plt.xlabel('Probability', fontsize=12)
plt.title('Zero-Shot Classification Results', fontsize=14)
plt.xlim(0, 1)
for i, score in enumerate(result['scores']):
    plt.text(score + 0.02, i, f"{score:.3f}", va='center')
plt.tight_layout()
plt.show()

---

## 3. Multi-Label Classification

### 3.1 Multi-Label vs Multi-Class

In [None]:
text = "This movie is a thrilling action-packed adventure with great special effects."

labels = ["action", "comedy", "thriller", "romance", "sci-fi"]

# Multi-class (single label, mutually exclusive)
result_single = classifier(
    text, labels,
    multi_label=False
)

# Multi-label (labels can co-occur)
result_multi = classifier(
    text, labels,
    multi_label=True
)

# Compare the results
df_comparison = pd.DataFrame({
    'Label': labels,
    'Multi-Class': [result_single['scores'][result_single['labels'].index(l)] for l in labels],
    'Multi-Label': [result_multi['scores'][result_multi['labels'].index(l)] for l in labels]
})

print("Multi-class vs. multi-label comparison:")
print(df_comparison.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].barh(labels, df_comparison['Multi-Class'], color='coral')
axes[0].set_title('Multi-Class (mutually exclusive)', fontsize=14)
axes[0].set_xlabel('Probability')

axes[1].barh(labels, df_comparison['Multi-Label'], color='lightgreen')
axes[1].set_title('Multi-Label (independent)', fontsize=14)
axes[1].set_xlabel('Probability')

plt.tight_layout()
plt.show()
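
The two modes differ in how raw NLI logits become scores. As far as the pipeline's behavior goes, `multi_label=False` applies a softmax over the entailment logits of all labels, so the scores sum to 1, while `multi_label=True` applies a softmax over the contradiction and entailment logits of each label separately, so every label gets its own probability. The sketch below reproduces that arithmetic on made-up logits; the function names and numbers are hypothetical.

```python
import math


def single_label_scores(entail_logits):
    """Softmax across labels: the scores compete and sum to 1."""
    m = max(entail_logits)
    exps = [math.exp(z - m) for z in entail_logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_scores(contra_logits, entail_logits):
    """Per-label softmax over [contradiction, entailment]; keep entailment."""
    scores = []
    for c, t in zip(contra_logits, entail_logits):
        m = max(c, t)
        ec, et = math.exp(c - m), math.exp(t - m)
        scores.append(et / (ec + et))
    return scores


# Made-up logits for three labels (illustration only)
entail = [2.0, 1.5, -1.0]
contra = [-1.0, -0.5, 2.0]

single = single_label_scores(entail)
multi = multi_label_scores(contra, entail)
print("single-label:", [round(s, 3) for s in single])
print("multi-label: ", [round(s, 3) for s in multi])
```

In the multi-label output, two labels can both score high at once, which is why a threshold is a natural next step.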

### 3.2 Filtering with a Threshold

In [None]:
# Keep only labels whose score exceeds the threshold
threshold = 0.5

filtered_labels = [
    (label, score) 
    for label, score in zip(result_multi['labels'], result_multi['scores'])
    if score > threshold
]

print(f"Text: {text}\n")
print(f"Threshold: {threshold}")
print("\nLabels above the threshold:")
for label, score in filtered_labels:
    print(f"  - {label}: {score:.4f}")

if not filtered_labels:
    print("  (no label exceeds the threshold)")
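
When no label clears the threshold, an application may still want a best guess. The sketch below wraps the filter in a reusable function with an optional fallback to the single top label. `select_labels` and its `fallback_to_top` flag are hypothetical names, and `mock_multi` contains made-up scores in the pipeline's output shape.

```python
def select_labels(result, threshold=0.5, fallback_to_top=True):
    """Keep labels scoring above the threshold; optionally fall back to the best one."""
    pairs = list(zip(result["labels"], result["scores"]))
    kept = [(label, score) for label, score in pairs if score > threshold]
    if not kept and fallback_to_top and pairs:
        kept = [max(pairs, key=lambda pair: pair[1])]
    return kept


# Made-up multi-label scores for illustration only
mock_multi = {
    "labels": ["action", "thriller", "sci-fi", "comedy", "romance"],
    "scores": [0.97, 0.91, 0.40, 0.05, 0.02],
}

print(select_labels(mock_multi, threshold=0.5))
print(select_labels(mock_multi, threshold=0.99))
```

Whether a fallback is appropriate depends on the use case; for tagging, returning nothing may be the better answer.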

---

## 4. Practical Applications

### 4.1 News Classification

In [None]:
# News texts
news_articles = [
    "Tesla stock surges after record quarterly earnings report.",
    "Scientists discover new exoplanet that could support life.",
    "The Lakers won the championship in a thrilling final game.",
    "New health study reveals benefits of Mediterranean diet.",
    "President announces new climate change policy."
]

# News categories
news_categories = [
    "business", "science", "sports", "health", "politics"
]

# Classify each article in turn
print("News classification results:\n")
print("="*80)

for i, article in enumerate(news_articles, 1):
    result = classifier(article, news_categories, multi_label=False)
    top_label = result['labels'][0]
    top_score = result['scores'][0]
    
    print(f"{i}. {article}")
    print(f"   Category: {top_label.upper()} (score: {top_score:.2%})\n")
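
The pipeline also accepts a list of texts and then returns one result dict per text, which avoids the explicit loop. The sketch below wraps that call in a hypothetical helper, `classify_batch`. To keep the example runnable without a model download, it is demonstrated with `fake_classifier`, a made-up stand-in that only imitates the output shape.

```python
from typing import Callable, List, Tuple


def classify_batch(classifier: Callable, texts: List[str],
                   labels: List[str]) -> List[Tuple[str, str, float]]:
    """Classify all texts in one call and return (text, top label, top score)."""
    results = classifier(texts, labels, multi_label=False)
    if isinstance(results, dict):  # a single string input yields a single dict
        results = [results]
    return [(r["sequence"], r["labels"][0], r["scores"][0]) for r in results]


def fake_classifier(texts, labels, multi_label=False):
    """Hypothetical stand-in: mimics the output shape, not the model."""
    if isinstance(texts, str):
        texts = [texts]
    outputs = []
    for text in texts:
        # Labels mentioned verbatim in the text are ranked first.
        ranked = sorted(labels, key=lambda lab: lab not in text.lower())
        weights = [1.0 / (i + 1) for i in range(len(ranked))]
        total = sum(weights)
        outputs.append({
            "sequence": text,
            "labels": ranked,
            "scores": [w / total for w in weights],
        })
    return outputs


rows = classify_batch(
    fake_classifier,
    ["Latest sports results", "A politics roundup"],
    ["politics", "sports"],
)
for text, label, score in rows:
    print(f"{label:10s} {score:.2f}  {text}")
```

With the real pipeline from this section, the call would be `classify_batch(classifier, news_articles, news_categories)`.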

### 4.2 Fine-Grained Emotion Classification

In [None]:
# Fine-grained emotions
text = "I'm absolutely thrilled and excited about this amazing opportunity!"

emotions = [
    "joy", "sadness", "anger", "fear", 
    "surprise", "disgust", "excitement"
]

result = classifier(text, emotions, multi_label=True)

# Draw a radar chart

angles = np.linspace(0, 2 * np.pi, len(emotions), endpoint=False).tolist()
scores = [result['scores'][result['labels'].index(e)] for e in emotions]

# Close the polygon by repeating the first point
angles += angles[:1]
scores += scores[:1]

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(projection='polar'))
ax.plot(angles, scores, 'o-', linewidth=2, color='blue')
ax.fill(angles, scores, alpha=0.25, color='blue')
ax.set_xticks(angles[:-1])
ax.set_xticklabels(emotions)
ax.set_ylim(0, 1)
ax.set_title('Emotion Distribution (Zero-Shot)', fontsize=16, pad=20)
ax.grid(True)

plt.tight_layout()
plt.show()

print(f"Text: {text}\n")
print("Emotion scores:")
for emotion, score in zip(result['labels'], result['scores']):
    print(f"  {emotion:12s}: {'█' * int(score * 20)} {score:.3f}")

### 4.3 Topic Tagging System

In [None]:
# Automatic tagging for a blog post
article = """
In this tutorial, we will learn how to build a machine learning model 
using Python and scikit-learn. We'll cover data preprocessing, 
model training, and evaluation techniques. This guide is perfect 
for beginners who want to get started with artificial intelligence.
"""

# Candidate tags
possible_tags = [
    "machine learning", "programming", "data science",
    "artificial intelligence", "tutorial", "python",
    "beginner guide", "web development", "database"
]

# Multi-label classification
result = classifier(article, possible_tags, multi_label=True)

# Keep the five highest-scoring tags (results are sorted by score)
top_tags = list(zip(result['labels'], result['scores']))[:5]

print("Article:")
print(article.strip())
print("\nSuggested tags:")
for tag, score in top_tags:
    print(f"  #{tag.replace(' ', '_'):25s} ({score:.2%})")

---

## 5. Comparison with Traditional Methods

### 5.1 Preparing Comparison Data

In [None]:
# Toy test set
test_data = [
    ("Apple stock rises after earnings", "business"),
    ("Lakers win NBA championship", "sports"),
    ("New vaccine shows promising results", "health"),
    ("Election results announced today", "politics"),
    ("New movie breaks box office records", "entertainment")
]

texts = [t[0] for t in test_data]
true_labels = [t[1] for t in test_data]
all_labels = ["business", "sports", "health", "politics", "entertainment"]

# Zero-shot predictions
zero_shot_preds = []
for text in texts:
    result = classifier(text, all_labels)
    zero_shot_preds.append(result['labels'][0])

# Compute accuracy (five examples is only a smoke test, not a reliable estimate)
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(true_labels, zero_shot_preds)

print("Zero-shot classification results:")
print("="*60)
for i, (text, true, pred) in enumerate(zip(texts, true_labels, zero_shot_preds), 1):
    match = "✓" if true == pred else "✗"
    print(f"{i}. {match} True: {true:15s} | Pred: {pred:15s}")
    print(f"   Text: {text}\n")

print(f"Accuracy: {accuracy:.2%}")
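
Overall accuracy hides which categories the classifier confuses. A minimal sketch of a per-label breakdown follows, written in plain Python so it runs without extra packages. `per_label_recall` is a hypothetical helper, and the `y_true` / `y_pred` lists are made-up values, not output from the model above.

```python
from collections import defaultdict


def per_label_recall(true_labels, pred_labels):
    """For each true label, the fraction of its examples predicted correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for true, pred in zip(true_labels, pred_labels):
        total[true] += 1
        correct[true] += int(true == pred)
    return {label: correct[label] / total[label] for label in total}


# Made-up labels and predictions for illustration only
y_true = ["business", "sports", "health", "sports"]
y_pred = ["business", "sports", "politics", "business"]

report = per_label_recall(y_true, y_pred)
for label, recall in report.items():
    print(f"{label:10s}: {recall:.0%}")
```

On a real evaluation set, the same breakdown would be computed from `true_labels` and `zero_shot_preds`.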

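### 5.2 When to Use Which

**Zero-shot is a good fit for**:
1. ✅ Rapid prototyping
2. ✅ Dynamic categories (such as user-defined tags)
3. ✅ Cold start (no labeled data yet)
4. ✅ Exploratory analysis

**Traditional supervised learning is a good fit for**:
1. ✅ A fixed set of categories
2. ✅ Plenty of labeled data
3. ✅ High accuracy requirements
4. ✅ Production deployment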

---

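## 6. Advanced Techniques

### 6.1 Optimizing the Hypothesis Template

In [None]:
# The zero-shot pipeline wraps each label in a hypothesis template.
# Pipeline default: "This example is {}."

# The pipeline accepts a custom template via its hypothesis_template argument.
# The lower-level API below makes the underlying NLI step explicit.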
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

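def custom_zero_shot(text, labels, hypothesis_template="This text is about {}."):
    """
    Zero-shot classification with a custom hypothesis template.

    Returns (label, entailment probability) pairs sorted by score, descending.
    """
    results = []
    
    for label in labels:
        # Build the hypothesis for this label
        hypothesis = hypothesis_template.format(label)
        
        # Run NLI with the text as the premise
        inputs = tokenizer(text, hypothesis, return_tensors="pt", truncation=True)
        
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
            # facebook/bart-large-mnli label order: [contradiction, neutral, entailment]
            # (check model.config.id2label before reusing this with another checkpoint)
            # Softmax over all three classes, so scores differ from the pipeline's
            probs = torch.softmax(logits, dim=-1)
            entailment_score = probs[0][2].item()  # entailment probability
        
        results.append((label, entailment_score))
    
    # Sort by score, highest first
    results.sort(key=lambda x: x[1], reverse=True)
    return results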

# Try different templates
text = "I love programming in Python!"
labels = ["technology", "cooking", "sports"]

print("Comparison of hypothesis templates:\n")

templates = [
    "This example is {}.",
    "This text is about {}.",
    "The topic of this text is {}.",
]

for template in templates:
    results = custom_zero_shot(text, labels, template)
    print(f"Template: '{template}'")
    for label, score in results:
        print(f"  {label:12s}: {score:.4f}")
    print()
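
For everyday use, the same comparison can be run through the pipeline itself, since its call accepts a `hypothesis_template` keyword argument. The sketch below is one way to organize that. `compare_templates` is a hypothetical helper, and `stub_classifier` is a made-up stand-in that only imitates the call signature and output shape, so the example runs without loading a model.

```python
def compare_templates(classifier, text, labels, templates):
    """Return {template: (top label, top score)} for each hypothesis template."""
    summary = {}
    for template in templates:
        result = classifier(text, labels, hypothesis_template=template)
        summary[template] = (result["labels"][0], result["scores"][0])
    return summary


def stub_classifier(text, labels, hypothesis_template="This example is {}.",
                    multi_label=False):
    """Hypothetical stand-in that only mimics the pipeline's call signature."""
    # Formatting the template checks that it has a usable "{}" placeholder.
    _ = [hypothesis_template.format(label) for label in labels]
    weights = [1.0 / (i + 1) for i in range(len(labels))]
    total = sum(weights)
    return {
        "sequence": text,
        "labels": list(labels),
        "scores": [w / total for w in weights],
    }


summary = compare_templates(
    stub_classifier,
    "I love programming in Python!",
    ["technology", "cooking", "sports"],
    ["This example is {}.", "The topic of this text is {}."],
)
for template, (label, score) in summary.items():
    print(f"{template!r}: {label} ({score:.3f})")
```

Passing the real `classifier` from Section 2 in place of the stub would show how the top label and its score shift with the wording of the template.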

---

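## 7. Exercises

### Exercise 1: Build an Emotion Classifier

Use zero-shot classification to build a classifier for six emotions.

In [None]:
# TODO: implement six-way emotion classification
# Emotions: joy, sadness, anger, fear, surprise, disgust
# Test it on several example texts

### Exercise 2: Dynamic Label System

Implement a classification system that lets users define their own labels.

In [None]:
# TODO: implement an interactive labeling system
# Hints:
# 1. Accept a text from the user
# 2. Accept a user-defined list of labels
# 3. Return the classification result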

---

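## 8. Summary

### ✅ Key Takeaways

1. **How zero-shot works**: it builds on an NLI (natural language inference) model
2. **Multi-label**: set `multi_label=True` to score each label independently
3. **Hypothesis template**: a key factor in classification quality
4. **Use cases**: rapid prototyping, dynamic categories, cold start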

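### 📊 Performance Comparison

Rough, task-dependent figures for orientation only:

| Method | Accuracy | Training time | Labeling cost | Flexibility |
|------|--------|---------|---------|--------|
| Zero-shot | 70-85% | 0 | $0 | Very high |
| Fine-tuning | 90-95% | ~1h | $$$ | Low |
| Training from scratch | 85-92% | ~10h | $$$$ | Low |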

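### 📚 Further Reading

- [Zero-shot text classification paper (Yin et al., 2019)](https://arxiv.org/abs/1909.00161)
- [BART-MNLI model](https://huggingface.co/facebook/bart-large-mnli)
- [Introduction to the NLI task (SNLI)](https://nlp.stanford.edu/projects/snli/)

### 🚀 Next Section

**CH08-06: Text Summarization**
- Extractive vs. abstractive summarization
- Generating summaries with BART/T5
- Evaluating summary quality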

---

**Course**: iSpan Python NLP Cookbooks v2
**Instructor**: Claude AI
**Last updated**: 2025-10-17