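# CH08-04: Named Entity Recognition (NER)

**Course**: iSpan Python NLP Cookbooks v2
**Chapter**: CH08 Hugging Face Libraries in Practice
**Version**: v1.0
**Last updated**: 2025-10-17

---

## 📚 Learning Objectives

1. Understand what the NER task is and where it is applied
2. Learn the training workflow for token classification
3. Work with the standard CoNLL-2003 dataset
4. Learn entity annotation and the BIO tagging scheme
5. Apply NER to Chinese text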

---

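## 1. NER Task Overview

### 1.1 What Is Named Entity Recognition?

**Definition**: identify named entities in text (people, places, organizations, and so on) and assign each one a type.

**Common entity types**:
- **PER** (Person): names of people
- **LOC** (Location): place names
- **ORG** (Organization): organization names
- **MISC** (Miscellaneous): other entities (products, events, etc.)

**Example**:
```
Input: "Apple CEO Tim Cook announced new products in San Francisco."

Output:
- Apple → ORG
- Tim Cook → PER
- San Francisco → LOC
```

### 1.2 The BIO Tagging Scheme

**B-I-O tags**:
- **B** (Begin): the first token of an entity
- **I** (Inside): a subsequent token of the same entity
- **O** (Outside): not part of any entity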

```
Tim    Cook   announced   new   products   in   San   Francisco
B-PER  I-PER  O           O     O          O    B-LOC I-LOC
```

In [None]:
# Install the required packages
# !pip install transformers datasets seqeval torch -q

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ Environment ready")

---

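## 2. Loading the CoNLL-2003 Dataset

### 2.1 About the Dataset

**CoNLL-2003**: a standard benchmark dataset for NER
- Language: English
- Entity types: PER, LOC, ORG, MISC
- Training set: 14,041 sentences
- Validation set: 3,250 sentences
- Test set: 3,453 sentences

In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("conll2003")

print("Dataset structure:")
print(dataset)

print("\nTraining set size:", len(dataset['train']))
print("Validation set size:", len(dataset['validation']))
print("Test set size:", len(dataset['test']))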

### 2.2 Exploring the Data Format

In [None]:
# Look at the first example
example = dataset['train'][0]

print("Fields:")
print(example.keys())

print("\nTokens:")
print(example['tokens'])

print("\nNER tags (integer ids):")
print(example['ner_tags'])

# Get the tag names that the integer ids refer to
label_names = dataset['train'].features['ner_tags'].feature.names
print("\nLabel mapping:")
for i, name in enumerate(label_names):
    print(f"{i}: {name}")

In [None]:
# Display an annotated example
def display_ner_example(example, label_names):
    tokens = example['tokens']
    ner_tags = example['ner_tags']

    print("\n" + "="*80)
    print("NER annotation example")
    print("="*80)

    for token, tag_id in zip(tokens, ner_tags):
        tag = label_names[tag_id]

        # Color by tag prefix using ANSI escape codes
        if tag.startswith('B-'):
            color = '\033[92m'  # green
        elif tag.startswith('I-'):
            color = '\033[94m'  # blue
        else:
            color = '\033[0m'   # default

        print(f"{color}{token:15s} → {tag}\033[0m")

    print("="*80)

# Show the first 3 examples
for i in range(3):
    display_ner_example(dataset['train'][i], label_names)

### 2.3 Dataset Statistics

In [None]:
from collections import Counter

# Count how often each tag occurs in the training set
all_tags = []
for example in dataset['train']:
    all_tags.extend(example['ner_tags'])

tag_counts = Counter(all_tags)

# Convert tag ids to tag names, sorted by frequency
tag_dist = {label_names[tag_id]: count for tag_id, count in tag_counts.items()}
tag_dist = dict(sorted(tag_dist.items(), key=lambda x: x[1], reverse=True))

print("Tag distribution:")
for tag, count in tag_dist.items():
    print(f"{tag:10s}: {count:6d} ({count/len(all_tags)*100:.1f}%)")

# Plot the distribution
plt.figure(figsize=(12, 6))
plt.bar(tag_dist.keys(), tag_dist.values(), color='skyblue')
plt.title('NER Tag Distribution (CoNLL-2003 Training Set)', fontsize=14)
plt.xlabel('Tag')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

---

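## 3. Data Preprocessing

### 3.1 The Tokenization Alignment Problem

**Challenge**: WordPiece/BPE tokenizers split words into subwords, while the dataset's labels are given per word, so the labels have to be realigned to the subword tokens.

There are two common ways to label the extra subwords: propagate the entity as `I-` tags (shown here), or mask them with `-100` so that the loss ignores them. The alignment function in this section uses the second option. The split shown is illustrative; the actual pieces depend on the tokenizer's vocabulary.

```
Word:     "Washington" → B-LOC
Subwords: ["Wash", "##ing", "##ton"] → [B-LOC, I-LOC, I-LOC]
```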
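
The two labeling options can be sketched in plain Python, without loading a tokenizer. In this sketch, `align_labels` and `to_inside` are hypothetical helper names, the `word_ids` list is hand-written to mimic what a fast tokenizer's `word_ids()` returns (`None` for special tokens, otherwise the index of the source word), and the `LABELS` ordering is assumed to match the CoNLL-2003 label list.

```python
# Assumed label ordering for CoNLL-2003 (check it against the dataset's own list).
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def to_inside(label_id):
    """Map a B-XXX label id to the matching I-XXX id; leave other ids unchanged."""
    name = LABELS[label_id]
    if name.startswith("B-"):
        return LABELS.index("I-" + name[2:])
    return label_id


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand word-level label ids to subword level."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                    # special token: ignored by the loss
        elif word_id != previous:
            aligned.append(word_labels[word_id])    # first subword keeps the word's label
        elif label_all_subwords:
            aligned.append(to_inside(word_labels[word_id]))  # later subwords become I-XXX
        else:
            aligned.append(-100)                    # later subwords are masked
        previous = word_id
    return aligned


# Hand-written word_ids for: [CLS] Wash ##ing ##ton said [SEP]
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [LABELS.index("B-LOC"), LABELS.index("O")]

print(align_labels(word_ids, word_labels))
# [-100, 5, -100, -100, 0, -100]
print(align_labels(word_ids, word_labels, label_all_subwords=True))
# [-100, 5, 6, 6, 0, -100]
```

Masking keeps evaluation at word level, since each word contributes exactly one labeled position. Propagating `I-` tags gives the model more supervised positions, at the cost of weighting long words more heavily.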

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
model_name = "bert-base-cased"  # NER usually benefits from case information
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Try out the alignment on a small example
test_tokens = ["Apple", "CEO", "Tim", "Cook"]
test_labels = [3, 0, 1, 2]  # B-ORG, O, B-PER, I-PER

# Tokenize
tokenized = tokenizer(
    test_tokens,
    is_split_into_words=True,  # important: the input is already split into words
    return_offsets_mapping=True
)

print("Original tokens:", test_tokens)
print("Original labels:", test_labels)
print("\nSubword tokens:", tokenizer.convert_ids_to_tokens(tokenized['input_ids']))
print("Word IDs:", tokenized.word_ids())

In [None]:
# Tokenize a batch and align the word-level labels to subword tokens
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        is_split_into_words=True,
        padding='max_length',
        max_length=128
    )

    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None

        for word_idx in word_ids:
            # Special tokens (CLS, SEP, PAD) get -100 so the loss ignores them
            if word_idx is None:
                label_ids.append(-100)
            # The first subword of a word keeps the word's label
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # Later subwords of the same word are masked with -100.
            # Alternative: label them too, but convert B-XXX to I-XXX first,
            # otherwise one word would look like several separate entities.
            else:
                label_ids.append(-100)

            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs

# Test the alignment function on a one-example batch
test_example = {'tokens': [test_tokens], 'ner_tags': [test_labels]}
aligned = tokenize_and_align_labels(test_example)

print("Aligned labels:")
print(aligned['labels'][0][:15])  # show the first 15 positions

### 3.2 Processing the Dataset

In [None]:
# Tokenize and align labels for every split
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset['train'].column_names
)

# Return PyTorch tensors
tokenized_datasets.set_format(type='torch')

print("✅ Preprocessing complete")
print(f"Training set: {len(tokenized_datasets['train'])}")
print(f"Validation set: {len(tokenized_datasets['validation'])}")
print(f"Test set: {len(tokenized_datasets['test'])}")

---

## 4. Model Training

### 4.1 Loading the Model

In [None]:
from transformers import AutoModelForTokenClassification

# Load a pretrained encoder with a fresh token-classification head
num_labels = len(label_names)

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label={i: label for i, label in enumerate(label_names)},
    label2id={label: i for i, label in enumerate(label_names)}
)

print(f"Model: {model_name}")
print(f"Number of labels: {num_labels}")
print(f"Parameters: {model.num_parameters():,}")

### 4.2 Defining the Evaluation Metrics

In [None]:
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)

    # Drop positions labeled -100 (special tokens and masked subwords)
    true_labels = [
        [label_names[l] for l in label if l != -100]
        for label in labels
    ]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # seqeval scores whole entities, not individual tokens
    return {
        'precision': precision_score(true_labels, true_predictions),
        'recall': recall_score(true_labels, true_predictions),
        'f1': f1_score(true_labels, true_predictions)
    }

print("✅ Evaluation function defined")
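
To show what entity-level scoring means, the following is a simplified pure-Python sketch of the idea behind these metrics. It is not seqeval's implementation: `extract_entities` and `entity_prf` are hypothetical names, and malformed sequences (for example an `I-` tag with no preceding `B-`) are simply ignored here, which may differ from how seqeval treats them.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_prf(true_seqs, pred_seqs):
    """Entity-level precision, recall, and F1: a prediction counts only on an exact span and type match."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold = extract_entities(true_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)
        n_true += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_seqs = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_seqs = [["B-PER", "O", "O", "B-LOC"]]  # the PER entity is cut short
print(entity_prf(true_seqs, pred_seqs))
# (0.5, 0.5, 0.5)
```

In this example three of the four tokens are tagged correctly, yet only one of the two entities is recovered exactly, so the entity-level F1 is 0.5. This is why token accuracy tends to overstate NER quality.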

### 4.3 Training Configuration

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForTokenClassification

# Data collator: pads inputs and labels within each batch.
# The preprocessing above already pads everything to max_length=128,
# so here it mainly stacks the examples into batches.
data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer,
    padding=True
)

# Training arguments
training_args = TrainingArguments(
    output_dir='./ner_results',
    evaluation_strategy='epoch',  # newer transformers releases name this `eval_strategy`
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    save_strategy='epoch',        # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model='f1'
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,  # newer transformers releases name this `processing_class`
    compute_metrics=compute_metrics
)

print("✅ Trainer created")

### 4.4 Running the Training

In [None]:
# Train the model
print("🚀 Starting training...\n")
train_result = trainer.train()

print("\n✅ Training complete!")
print(f"Training time: {train_result.metrics['train_runtime']:.2f}s")
print(f"Training loss: {train_result.metrics['train_loss']:.4f}")

### 4.5 Evaluating the Model

In [None]:
# Evaluate on the validation set
eval_results = trainer.evaluate()

print("Validation results:")
print("="*50)
for metric, value in eval_results.items():
    print(f"{metric:20s}: {value:.4f}")

# Evaluate on the test set
test_results = trainer.evaluate(tokenized_datasets['test'])

print("\nTest results:")
print("="*50)
for metric, value in test_results.items():
    print(f"{metric:20s}: {value:.4f}")

---

## 5. Inference and Applications

### 5.1 Predicting with a Pipeline

In [None]:
from transformers import pipeline

# Create an NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    # Merge subword predictions into whole entities.
    # Because only the first subword of each word was labeled during training,
    # "first" may give cleaner word boundaries than "simple".
    aggregation_strategy="simple"
)

# Test sentences
test_texts = [
    "Apple CEO Tim Cook announced new products in San Francisco.",
    "Microsoft was founded by Bill Gates in Seattle.",
    "The Eiffel Tower is located in Paris, France."
]

print("NER predictions:\n")
for text in test_texts:
    results = ner_pipeline(text)

    print(f"Text: {text}")
    print("Entities:")
    for entity in results:
        print(f"  - {entity['word']:20s} → {entity['entity_group']:5s} (score: {entity['score']:.2f})")
    print()

### 5.2 Visualizing Entity Annotations

In [None]:
from IPython.display import HTML

def visualize_ner(text, entities):
    """
    Render NER results as highlighted HTML.
    """
    # Color for each entity type
    colors = {
        'PER': '#8ef',
        'LOC': '#faa',
        'ORG': '#afa',
        'MISC': '#fea'
    }

    html = '<p style="font-size: 16px; line-height: 2.5;">'

    last_end = 0
    for entity in entities:
        # Text between the previous entity and this one
        html += text[last_end:entity['start']]

        # The highlighted entity (note the space separating the two attributes)
        color = colors.get(entity['entity_group'], '#ddd')
        html += f'<mark style="background-color: {color}; padding: 2px 4px; border-radius: 3px;" '
        html += f'title="{entity["entity_group"]} ({entity["score"]:.2f})">'
        html += entity['word']
        html += '</mark>'

        last_end = entity['end']

    # Remaining text after the last entity
    html += text[last_end:]
    html += '</p>'

    # Legend
    legend = '<div style="margin-top: 20px;">'
    for entity_type, color in colors.items():
        legend += f'<span style="background-color: {color}; padding: 2px 8px; margin-right: 10px; border-radius: 3px;">{entity_type}</span>'
    legend += '</div>'

    return HTML(html + legend)

# Visualize the first example
text = test_texts[0]
entities = ner_pipeline(text)
visualize_ner(text, entities)

### 5.3 Batch Processing

In [None]:
# Batch prediction: passing a list returns one list of entities per text
batch_texts = [
    "Google was founded in California.",
    "Barack Obama was born in Hawaii.",
    "The Amazon River flows through Brazil.",
    "Tesla CEO Elon Musk announced new plans."
]

batch_results = ner_pipeline(batch_texts)

# Count entities by type
entity_stats = {'PER': 0, 'LOC': 0, 'ORG': 0, 'MISC': 0}

for text, entities in zip(batch_texts, batch_results):
    for entity in entities:
        entity_type = entity['entity_group']
        if entity_type in entity_stats:
            entity_stats[entity_type] += 1

print("Entity counts:")
for entity_type, count in entity_stats.items():
    print(f"{entity_type}: {count}")

---

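## 6. Advanced: Chinese NER

### 6.1 Challenges in Chinese NER

In [None]:
# Use a pretrained Chinese NER model
chinese_ner = pipeline(
    "ner",
    model="ckiplab/bert-base-chinese-ner",
    aggregation_strategy="simple"
)

# This model's label set may differ from CoNLL-2003's PER/LOC/ORG/MISC;
# inspect chinese_ner.model.config.id2label to see the labels it actually uses.

# Chinese test sentences
chinese_texts = [
    "蘋果公司的執行長提姆·庫克在舊金山發表新產品。",
    "阿里巴巴創辦人馬雲出生於杭州。"
]

print("Chinese NER results:\n")
for text in chinese_texts:
    results = chinese_ner(text)
    print(f"Text: {text}")
    print("Entities:")
    for entity in results:
        print(f"  - {entity['word']:15s} → {entity['entity_group']}")
    print()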
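
One challenge specific to Chinese is that text has no whitespace word boundaries, so BERT-style Chinese models typically tag one character at a time. The sketch below shows how span annotations could be converted into character-level BIO tags under that assumption. The helper name `spans_to_char_bio` and the hand-written annotation spans are illustrative only and do not come from any dataset.

```python
def spans_to_char_bio(text, spans):
    """Convert (start, end, type) character spans (end exclusive) to one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


text = "馬雲出生於杭州。"
spans = [(0, 2, "PER"), (5, 7, "LOC")]  # hand-written annotations for this sketch

for char, tag in zip(text, spans_to_char_bio(text, spans)):
    print(char, tag)
```

Because every character is its own token, entity boundaries and word segmentation are learned jointly, which is one reason boundary errors are common in Chinese NER.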

---

## 7. Error Analysis

### 7.1 Common Error Types

In [None]:
# Collect the model's predictions on the test set
predictions = trainer.predict(tokenized_datasets['test'])
pred_labels = np.argmax(predictions.predictions, axis=2)
true_labels = predictions.label_ids

# Find every position where the prediction differs from the gold label
errors = []
for i, (pred_seq, true_seq) in enumerate(zip(pred_labels, true_labels)):
    for j, (pred, true) in enumerate(zip(pred_seq, true_seq)):
        if true != -100 and pred != true:
            errors.append({
                'example_id': i,
                'position': j,
                'true_label': label_names[true],
                'pred_label': label_names[pred]
            })

# Count errors by (true label, predicted label) pair
from collections import Counter

error_types = Counter([(e['true_label'], e['pred_label']) for e in errors])

print("\nMost common error types (top 10):")
print("="*60)
for (true_label, pred_label), count in error_types.most_common(10):
    print(f"{true_label:10s} → {pred_label:10s}: {count:4d} times")
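
The raw label pairs can be grouped into a few broader categories to make the analysis easier to read. The following sketch assumes error records shaped like the `errors` list built in the cell above; the function name `categorize_error`, the four category names, and the sample records are choices made for this illustration.

```python
from collections import Counter


def categorize_error(true_label, pred_label):
    """Assign a token-level NER error to one of four coarse categories."""
    true_type = true_label.split("-", 1)[-1]   # "B-PER" -> "PER", "O" -> "O"
    pred_type = pred_label.split("-", 1)[-1]
    if true_label == "O":
        return "spurious"   # an entity was predicted where there is none
    if pred_label == "O":
        return "missed"     # a real entity token was predicted as O
    if true_type != pred_type:
        return "type"       # an entity was found but given the wrong type
    return "boundary"       # right type, but B-/I- disagree


# Made-up sample records in the same shape as the error list
sample_errors = [
    {"true_label": "B-ORG", "pred_label": "B-LOC"},
    {"true_label": "I-PER", "pred_label": "O"},
    {"true_label": "O", "pred_label": "B-MISC"},
    {"true_label": "I-LOC", "pred_label": "B-LOC"},
]

category_counts = Counter(
    categorize_error(e["true_label"], e["pred_label"]) for e in sample_errors
)
print(category_counts)
```

A large share of `type` errors usually points to ambiguous names (for example, a city name used as a sports team), while many `boundary` errors suggest trouble with multi-word entities.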

---

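## 8. Exercises

### Exercise 1: Custom Entity Types

Add new entity types (such as product names, dates, and monetary amounts).

In [None]:
# TODO: build a custom NER dataset
# Hints:
# 1. Prepare annotated data (an annotation tool such as Doccano can help)
# 2. Define the new label set
# 3. Train the model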
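
As a possible starting point for hint 2, a BIO label set can be generated from a list of entity types. This is only a sketch: `build_label_maps` is a hypothetical helper, and the type names `PRODUCT`, `DATE`, and `MONEY` are placeholders to be replaced with whatever the annotation project defines.

```python
def build_label_maps(entity_types):
    """Build a BIO label list plus label2id / id2label mappings."""
    labels = ["O"]
    for etype in entity_types:
        labels += [f"B-{etype}", f"I-{etype}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return labels, label2id, id2label


labels, label2id, id2label = build_label_maps(["PRODUCT", "DATE", "MONEY"])
print(labels)
# ['O', 'B-PRODUCT', 'I-PRODUCT', 'B-DATE', 'I-DATE', 'B-MONEY', 'I-MONEY']
```

The resulting `id2label` and `label2id` dictionaries have the same shape as the ones passed to `AutoModelForTokenClassification.from_pretrained` earlier in this section, with `num_labels=len(labels)`.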

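### Exercise 2: NER + Relation Extraction

Combine NER with relation extraction to build a knowledge graph.

In [None]:
# TODO: implement relation extraction
# Example: from "Tim Cook is the CEO of Apple", extract (Tim Cook, CEO_OF, Apple)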

---

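## 9. Summary

### ✅ Key Takeaways

1. **The NER task**:
   - Token classification
   - The BIO tagging scheme
   - Entity types: PER, LOC, ORG, MISC

2. **Data preprocessing**:
   - Aligning labels with the tokenizer's output
   - Handling subword labels
   - Masking special tokens (-100)

3. **Model training**:
   - AutoModelForTokenClassification
   - seqeval evaluation metrics
   - DataCollatorForTokenClassification

4. **Practical use**:
   - Fast inference with a pipeline
   - Entity visualization
   - Batch processing

### 📊 Model Performance

Approximate figures; actual results vary with the random seed, hardware, and library versions.

| Metric | Training set | Validation set | Test set |
|--------|--------------|----------------|----------|
| Precision | ~98% | ~95% | ~94% |
| Recall | ~98% | ~94% | ~93% |
| F1 Score | ~98% | ~94% | ~93% |

### 📚 Further Reading

- [CoNLL-2003 dataset](https://huggingface.co/datasets/conll2003)
- [Token Classification Guide](https://huggingface.co/docs/transformers/tasks/token_classification)
- [seqeval documentation](https://github.com/chakki-works/seqeval)

### 🚀 Coming Up Next

**Low-Level Implementation 02: Building an MLP (Multilayer Perceptron) from Scratch**
- Implementing forward propagation
- Deriving backpropagation
- Weight initialization strategies

---

**Course**: iSpan Python NLP Cookbooks v2
**Instructor**: Claude AI
**Last updated**: 2025-10-17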