# CH08-06: Text Summarization

**Course**: iSpan Python NLP Cookbooks v2
**Chapter**: CH08 Hugging Face Libraries in Practice
**Version**: v1.0
**Last updated**: 2025-10-17

---

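## 📚 Learning Objectives

1. Understand the difference between extractive and abstractive summarization
2. Generate summaries with BART, T5, and Pegasus
3. Tune summarization parameters
4. Implement multi-document summarization
5. Evaluate summary quality with ROUGE metrics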

---

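## 1. Text Summarization Basics

### 1.1 Types of Summarization

**Extractive summarization**:
- Selects important sentences from the source text
- Preserves the original wording
- Produces no new content

**Abstractive summarization**:
- Regenerates the content after interpreting the source text
- May introduce words that do not appear in the source
- Reads closer to a human-written summary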

```
Source:
"The Transformer architecture has revolutionized NLP.
It introduced self-attention mechanisms that allow models
to process sequences in parallel."

Extractive: "The Transformer architecture has revolutionized NLP."

Abstractive: "Transformers changed NLP with self-attention."
```
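
To make the extractive side concrete, the sketch below scores each sentence by the average frequency of its words across the whole text and returns the best-scoring sentence verbatim. The helper name `pick_top_sentence`, the regex-based sentence splitter, and the frequency heuristic are illustrative assumptions, not part of the course material or of any library.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase word tokens; a deliberately naive tokenizer for illustration."""
    return re.findall(r"[a-z']+", sentence.lower())

def pick_top_sentence(text):
    """Return the sentence whose words are, on average, most frequent in the text."""
    # Naive split on ., ! or ? followed by whitespace (assumes clean English prose)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # max() keeps the earliest sentence when scores tie
    return max(sentences, key=score)

text = (
    "The Transformer architecture has revolutionized NLP. "
    "It introduced self-attention mechanisms that allow models "
    "to process sequences in parallel."
)
best = pick_top_sentence(text)
print(best)
```

Whatever the scoring rule, the output is always a sentence copied from the input, which is the defining property of extractive summarization. An abstractive model is free to produce wording that never appears in the source.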

In [None]:
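# Install dependencies (the later cells also use pandas, scikit-learn, networkx, and nltk)
# !pip install transformers torch rouge-score pandas scikit-learn networkx nltk matplotlib -q

from transformers import pipeline
import numpy as np
import matplotlib.pyplot as plt

print("✅ Environment ready")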

---

## 2. Using Pretrained Models

### 2.1 BART

In [None]:
# Load the BART summarization model (device=-1 runs on CPU)
summarizer_bart = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=-1
)

# Sample article
article = """
The Transformer architecture, introduced in the paper "Attention Is All You Need" 
by Vaswani et al. in 2017, has revolutionized natural language processing. 
Unlike previous architectures that relied on recurrent or convolutional layers, 
Transformers use self-attention mechanisms to process input sequences in parallel. 
This parallel processing capability makes Transformers significantly faster to train 
than RNNs. The architecture consists of an encoder and a decoder, each composed of 
multiple layers of self-attention and feed-forward networks. The self-attention 
mechanism allows the model to weigh the importance of different words in a sentence 
when encoding each word. This has proven to be extremely effective for a wide range 
of NLP tasks, from translation to text generation.
"""

# Generate a summary (max_length and min_length are counted in tokens, not words)
summary = summarizer_bart(
    article,
    max_length=60,
    min_length=30,
    do_sample=False
)

print(f"Original ({len(article.split())} words):")
print(article.strip())
print(f"\nSummary ({len(summary[0]['summary_text'].split())} words):")
print(summary[0]['summary_text'])
print(f"\nCompression ratio: {len(summary[0]['summary_text'].split())/len(article.split()):.1%}")

### 2.2 T5

In [None]:
# T5 (Text-to-Text Transfer Transformer)
summarizer_t5 = pipeline(
    "summarization",
    model="t5-small",
    device=-1
)

# Summarize the same article
summary_t5 = summarizer_t5(
    article,
    max_length=60,
    min_length=30
)

print("T5 summary:")
print(summary_t5[0]['summary_text'])

### 2.3 Model Comparison

In [None]:
models = [
    ("BART-large-CNN", "facebook/bart-large-cnn"),
    ("T5-small", "t5-small"),
    ("Pegasus-CNN", "google/pegasus-cnn_dailymail")
]

print("Summaries from different models:\n")
print("="*80)

for name, model_name in models:
    try:
        summarizer = pipeline("summarization", model=model_name, device=-1)
        result = summarizer(article, max_length=50, min_length=25)
        
        print(f"\n{name}:")
        print(result[0]['summary_text'])
    except Exception as e:
        # Covers both download/load errors and inference errors
        print(f"\n{name}: failed ({str(e)[:50]}...)")

---

## 3. Parameter Tuning

### 3.1 Length Control

In [None]:
# Try different length settings (lengths are in tokens)
length_configs = [
    {"max_length": 30, "min_length": 20, "name": "Short summary"},
    {"max_length": 60, "min_length": 40, "name": "Medium summary"},
    {"max_length": 100, "min_length": 70, "name": "Long summary"}
]

print("Summaries at different lengths:\n")

for config in length_configs:
    summary = summarizer_bart(
        article,
        max_length=config['max_length'],
        min_length=config['min_length'],
        do_sample=False
    )
    
    text = summary[0]['summary_text']
    print(f"{config['name']} ({len(text.split())} words):")
    print(text)
    print()
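
Hard-coded lengths work for a single article, but `max_length` and `min_length` are token counts, and the pipeline warns when `max_length` exceeds the input length. One option is to derive both bounds from the input size. The helper below is a minimal sketch under that assumption. The name `length_bounds` and its parameters `ratio`, `min_fraction`, and `floor` are hypothetical, not part of the Transformers API.

```python
def length_bounds(n_input_tokens, ratio=0.25, min_fraction=0.5, floor=10):
    """Derive (min_length, max_length) in tokens from the input length.

    ratio:        target summary size as a fraction of the input (assumed default)
    min_fraction: min_length as a fraction of max_length (assumed default)
    floor:        smallest max_length to allow for non-trivial inputs
    """
    max_length = max(floor, int(n_input_tokens * ratio))
    # Never ask for a summary longer than the input itself
    max_length = min(max_length, max(n_input_tokens, 1))
    min_length = max(1, int(max_length * min_fraction))
    return min_length, max_length

print(length_bounds(200))   # typical article-sized input
print(length_bounds(8))     # very short input: bounds are clamped to the input size
```

In the notebook, the token count could be obtained with something like `len(summarizer_bart.tokenizer(article)["input_ids"])`, and the resulting pair passed as `min_length` and `max_length`.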

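### 3.2 Decoding Strategies

In [None]:
# Beam search vs. sampling
# Note: the default generation config of facebook/bart-large-cnn uses beam search
# (num_beams=4 at the time of writing), so the sampling configs set num_beams=1
# explicitly to get plain, non-beam sampling.
configs = [
    {"do_sample": False, "num_beams": 4, "name": "Beam Search (4)"},
    {"do_sample": True, "num_beams": 1, "top_k": 50, "top_p": 0.95, "name": "Top-K & Top-P Sampling"},
    {"do_sample": True, "num_beams": 1, "temperature": 0.8, "name": "Temperature Sampling"}
]

print("Comparison of decoding strategies:\n")

for config in configs:
    # Separate the display name without mutating the dict,
    # so the cell can be re-run safely
    gen_kwargs = {k: v for k, v in config.items() if k != "name"}
    
    summary = summarizer_bart(
        article,
        max_length=50,
        min_length=30,
        **gen_kwargs
    )
    
    print(f"{config['name']}:")
    print(summary[0]['summary_text'])
    print()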
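
The `temperature`, `top_k`, and `top_p` arguments all reshape the next-token distribution before a token is drawn. The sketch below applies each of them to a toy four-word distribution in pure Python. The vocabulary, the logits, and the function names are made up for illustration. The real implementation in Transformers operates on logit tensors and differs in detail.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

vocab = ["attention", "models", "parallel", "banana"]  # toy vocabulary (hypothetical)
logits = [2.0, 1.5, 1.0, -1.0]                         # toy next-token scores (hypothetical)

probs = apply_temperature(logits, temperature=0.8)
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.9)

print({vocab[i]: round(q, 3) for i, q in top2.items()})
print({vocab[i]: round(q, 3) for i, q in nucleus.items()})
```

Sampling then draws from the filtered distribution, for example with `random.choices`. Beam search draws nothing at random: it keeps the `num_beams` highest-scoring partial sequences at every step.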

---

## 4. Practical Applications

### 4.1 News Summarization

In [None]:
news_article = """
Apple Inc. announced record-breaking quarterly earnings today, with revenue 
reaching $120 billion, surpassing analyst expectations. The company's CEO 
Tim Cook attributed the success to strong iPhone sales and growth in the 
services sector. Apple's stock price rose 5% in after-hours trading following 
the announcement. The company also revealed plans to invest $10 billion in 
new product development over the next year, focusing on augmented reality 
and artificial intelligence technologies. Analysts predict continued growth 
for Apple in the coming quarters, citing strong consumer demand and a robust 
product pipeline.
"""

summary = summarizer_bart(
    news_article,
    max_length=50,
    min_length=25,
    do_sample=False
)

print("📰 News summarization\n")
print("Original:")
print(news_article.strip())
print("\nSummary:")
print(summary[0]['summary_text'])

### 4.2 Batch Processing

In [None]:
# Summarize several articles in one batch
articles = [
    "Tesla announced a new electric vehicle model today...",
    "Scientists discovered a new exoplanet in the habitable zone...",
    "The stock market reached new highs amid positive economic data..."
]

# Run the batch
summaries = summarizer_bart(
    articles,
    max_length=30,
    min_length=15,
    batch_size=3
)

print("Batch summarization results:\n")
# Loop names must not shadow `article`, which later cells still use
for i, (text, result) in enumerate(zip(articles, summaries), 1):
    print(f"{i}. Original: {text}")
    print(f"   Summary: {result['summary_text']}\n")

### 4.3 Multi-Document Summarization

In [None]:
# One unified summary for several related documents
docs = [
    "Apple released the new iPhone 15 with advanced AI features.",
    "The iPhone 15 includes a new A17 chip and improved camera system.",
    "Apple's new phone has been well-received by tech reviewers."
]

# Concatenate the documents
combined_text = " ".join(docs)

# Generate the summary
summary = summarizer_bart(
    combined_text,
    max_length=40,
    min_length=20
)

print("Multi-document summary:\n")
print("Document 1:", docs[0])
print("Document 2:", docs[1])
print("Document 3:", docs[2])
print(f"\nUnified summary: {summary[0]['summary_text']}")
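
Plain concatenation works for a few short documents, but longer collections can exceed the model's input limit (1024 tokens for BART-large-CNN). A common workaround is a two-pass scheme: summarize each chunk, then summarize the concatenated partial summaries. The sketch below shows that control flow with a stand-in summarizer so it runs without downloading a model. The names `chunk_by_words`, `map_reduce_summarize`, and `first_words` are hypothetical, and the word budget is only a rough proxy for the token limit.

```python
def chunk_by_words(text, max_words):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def map_reduce_summarize(docs, summarize_fn, max_words=400):
    """Summarize every chunk, then summarize the concatenated partial summaries."""
    partial = []
    for doc in docs:
        for chunk in chunk_by_words(doc, max_words):
            partial.append(summarize_fn(chunk))
    combined = " ".join(partial)
    # A second pass is only needed when there is more than one partial summary
    return summarize_fn(combined) if len(partial) > 1 else combined

def first_words(text, n=8):
    """Stand-in summarizer: keeps the first n words (NOT a real summarization model)."""
    return " ".join(text.split()[:n])

docs = [
    "Apple released the new iPhone 15 with advanced AI features.",
    "The iPhone 15 includes a new A17 chip and improved camera system.",
    "Apple's new phone has been well-received by tech reviewers.",
]
result = map_reduce_summarize(docs, first_words, max_words=400)
print(result)
```

To use a real model, `summarize_fn` could wrap the pipeline from section 2.1, for example `lambda t: summarizer_bart(t, max_length=40, min_length=20, do_sample=False)[0]['summary_text']`.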

---

## 5. Evaluating Summary Quality

### 5.1 ROUGE Metrics

In [None]:
from rouge_score import rouge_scorer

# Reference summary (human-written)
reference = "Transformers revolutionized NLP with self-attention mechanisms for parallel processing."

# Generated summary
generated = summarizer_bart(article, max_length=30, min_length=15)[0]['summary_text']

# Compute ROUGE; score() takes (target, prediction) in that order
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, generated)

print("ROUGE results:\n")
print(f"Reference: {reference}")
print(f"Generated: {generated}\n")

for metric, score in scores.items():
    print(f"{metric}:")
    print(f"  Precision: {score.precision:.4f}")
    print(f"  Recall:    {score.recall:.4f}")
    print(f"  F1:        {score.fmeasure:.4f}\n")

**ROUGE metrics explained**:
- **ROUGE-1**: unigram (single-word) overlap
- **ROUGE-2**: bigram (two-word sequence) overlap
- **ROUGE-L**: longest common subsequence (LCS)
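
The three definitions above can be written out directly. The following is a minimal pure-Python sketch of ROUGE-N and ROUGE-L for illustration. It lowercases and splits on whitespace only, so its numbers can differ from the `rouge-score` package, which applies its own tokenization and optional stemming. The function names and the example sentences are assumptions made for this sketch.

```python
from collections import Counter

def _prf(overlap, n_candidate, n_reference):
    """Precision, recall, and F1 from an overlap count."""
    precision = overlap / max(n_candidate, 1)
    recall = overlap / max(n_reference, 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def rouge_n(reference, candidate, n=1):
    """ROUGE-N from clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())   # Counter & Counter keeps the minimum counts
    return _prf(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate):
    """ROUGE-L from the LCS length of the two token sequences."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return _prf(lcs_length(ref, cand), len(cand), len(ref))

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n(reference, candidate, n=1)
r2 = rouge_n(reference, candidate, n=2)
rl = rouge_l(reference, candidate)
print("ROUGE-1:", r1)
print("ROUGE-2:", r2)
print("ROUGE-L:", rl)
```

In this example five of the six unigrams match, three of the five bigrams match, and the longest common subsequence is "the cat on the mat" (five tokens), so ROUGE-2 comes out lower than ROUGE-1 and ROUGE-L.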

### 5.2 Length Analysis

In [None]:
# Analyze how summary length affects ROUGE scores
import pandas as pd

max_lengths = [20, 30, 40, 50, 60]
results = []

for max_len in max_lengths:
    summary = summarizer_bart(
        article, 
        max_length=max_len, 
        min_length=max_len-10
    )[0]['summary_text']
    
    scores = scorer.score(reference, summary)
    
    results.append({
        'Max Length': max_len,                  # in tokens
        'Actual Length': len(summary.split()),  # in words
        'ROUGE-1': scores['rouge1'].fmeasure,
        'ROUGE-2': scores['rouge2'].fmeasure,
        'ROUGE-L': scores['rougeL'].fmeasure
    })

df = pd.DataFrame(results)
print("Length vs. ROUGE scores:\n")
print(df.to_string(index=False))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(df['Max Length'], df['ROUGE-1'], marker='o', label='ROUGE-1')
plt.plot(df['Max Length'], df['ROUGE-2'], marker='s', label='ROUGE-2')
plt.plot(df['Max Length'], df['ROUGE-L'], marker='^', label='ROUGE-L')
plt.xlabel('Max Summary Length', fontsize=12)
plt.ylabel('ROUGE F1 Score', fontsize=12)
plt.title('Summary Length vs ROUGE Scores', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 6. Advanced Techniques

### 6.1 Extractive Summarization (TextRank)

In [None]:
# Minimal TextRank implementation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import nltk

# nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead

def extractive_summarize(text, num_sentences=3):
    # Split into sentences
    sentences = nltk.sent_tokenize(text)
    
    if len(sentences) <= num_sentences:
        return text
    
    # TF-IDF vectorization (one row per sentence)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    
    # Pairwise cosine similarity between sentences
    similarity_matrix = cosine_similarity(tfidf_matrix)
    
    # PageRank over the sentence-similarity graph
    nx_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(nx_graph)
    
    # Take the indices of the top-k sentences, then restore document order
    top_idx = sorted(sorted(scores, key=scores.get, reverse=True)[:num_sentences])
    
    return ' '.join(sentences[i] for i in top_idx)

# Test
extractive = extractive_summarize(article, num_sentences=2)
abstractive = summarizer_bart(article, max_length=50, min_length=30)[0]['summary_text']

print("Extractive vs. abstractive:\n")
print("Extractive summary:")
print(extractive)
print("\nAbstractive summary:")
print(abstractive)
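
The ranking step in TextRank is ordinary PageRank on a weighted sentence graph. For readers who want to see what happens inside that call, here is a minimal power-iteration sketch in pure Python. The function below, its `damping` and `iterations` defaults, and the 3x3 similarity matrix are illustrative assumptions. It is not the NetworkX implementation, which additionally handles convergence checks and dangling nodes.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph given as a square matrix."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]   # total outgoing weight per node
    scores = [1.0 / n] * n                    # start from a uniform distribution
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight j -> i
            incoming = sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n) if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical sentence-similarity matrix for three sentences (diagonal set to 0)
similarity = [
    [0.0, 0.6, 0.1],
    [0.6, 0.0, 0.5],
    [0.1, 0.5, 0.0],
]
scores = pagerank(similarity)
best = max(range(len(scores)), key=scores.__getitem__)
print([round(s, 3) for s in scores], "-> most central sentence index:", best)
```

The sentence that is most similar to all the others accumulates the highest score, which is why TextRank tends to select sentences that restate the document's dominant theme.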

### 6.2 Customizing the Summary Style

In [None]:
# Steer the output with a task prefix (requires a model trained with such prefixes)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# T5 expects a task prefix
input_text = "summarize: " + article

inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    **inputs,  # passes input_ids together with attention_mask
    max_length=50,
    min_length=25,
    num_beams=4,
    early_stopping=True
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("T5 summary:")
print(summary)

---

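## 7. Exercises

### Exercise 1: Meeting Minutes Summarization

Turn a meeting transcript into a concise summary of action items.

In [None]:
# TODO: implement meeting-minutes summarization
meeting_transcript = """
[long meeting transcript]
"""

# Extract:
# 1. Key decisions
# 2. Action items
# 3. Owners

### Exercise 2: Multi-Document Topic Summarization

Extract a summary of the shared topic from several related news articles.

In [None]:
# TODO: implement multi-document topic summarization
# Hints:
# 1. Identify the shared topic
# 2. Merge related information
# 3. Generate a unified summary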

---

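## 8. Section Summary

### ✅ Key Takeaways

1. **Summary types**: extractive (sentence selection) vs. abstractive (rewriting)
2. **Model choice**: BART (news), T5 (general purpose), Pegasus (summarization-specific)
3. **Parameter tuning**: max_length, min_length, num_beams, do_sample
4. **Evaluation metrics**: ROUGE-1/2/L (compared against a reference summary)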

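### 📊 Model Performance Comparison

| Model | Parameters | ROUGE-1 (CNN/DailyMail) | Speed | Use case |
|------|--------|---------|------|----------|
| BART-large | 406M | ~44 | Medium | News, long documents |
| T5-base | 220M | ~42 | Fast | General-purpose text |
| Pegasus | 568M | ~44 | Slow | Specialized summarization |

Scores are approximate ROUGE-1 F1 values on CNN/DailyMail as reported in the respective papers; the corresponding ROUGE-L values are a few points lower (roughly 39 to 41).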

### 📚 Further Reading

- [BART paper](https://arxiv.org/abs/1910.13461)
- [T5 paper](https://arxiv.org/abs/1910.10683)
- [ROUGE evaluation metric](https://aclanthology.org/W04-1013/)

### 🚀 Next Section

**CH08-07: Text Generation**
- Text generation with GPT-2/GPT-3
- Generation strategies (Top-K, Top-P, Beam Search)
- Controlling the generated content

---

**Course**: iSpan Python NLP Cookbooks v2
**Instructor**: Claude AI
**Last updated**: 2025-10-17