# Hands-On Project: News Auto-Tagging System

**Project type**: Zero-shot classification + NER for intelligent tag generation
**Difficulty**: ⭐⭐⭐⭐ Advanced
**Estimated time**: 3-4 hours
**Tech stack**: Zero-Shot Classification, NER, BART, spaCy

---

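## 📚 Learning Objectives

After completing this project, you will be able to:

1. ✅ Apply zero-shot classification (no training data required)
2. ✅ Use NER to extract entity tags automatically
3. ✅ Combine several NLP techniques to generate tags
4. ✅ Build an intelligent content management system
5. ✅ Implement a production-grade tag recommendation engine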

---

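## 🎯 Project Scenario

### Business Requirements

**Scenario**: A news outlet publishes hundreds of articles every day and needs automatically generated tags for:
- Content classification and archiving
- SEO
- Recommendation systems
- Search engine indexing

**Challenges**:
- ❌ Manual tagging is slow (5-10 minutes per article)
- ❌ Tags are inconsistent
- ❌ New topics are hard to tag promptly

**Solution**:
- ✅ Zero-shot classification: no training needed, and new tags can be added at any time
- ✅ NER: extracts people, places, and organizations automatically
- ✅ Keyword extraction: TF-IDF + RAKE (this notebook implements RAKE)
- ✅ Multi-strategy fusion: combines the results to recommend the best tags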

---

## Part 1: Environment Setup

In [None]:
# Install required packages
# !pip install transformers torch spacy rake-nltk pandas matplotlib seaborn scikit-learn fastapi uvicorn -q
# !python -m spacy download en_core_web_sm

import transformers
import spacy
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

print(f"✅ Transformers: {transformers.__version__}")
print(f"✅ spaCy: {spacy.__version__}")
print(f"✅ PyTorch: {torch.__version__}")

## Part 2: Preparing Sample News Data

In [None]:
# Sample news articles
news_articles = [
    {
        'id': 'N001',
        'title': 'Apple Unveils New iPhone 15 with Advanced AI Features',
        'content': '''Apple Inc. announced its latest iPhone 15 at a press conference in 
        Cupertino, California on September 12, 2024. CEO Tim Cook presented the new device 
        featuring advanced AI capabilities powered by the A17 Bionic chip. The phone includes 
        improved camera technology and enhanced battery life. Analysts expect strong sales 
        in the holiday season.'''
    },
    {
        'id': 'N002',
        'title': 'Lakers Win NBA Championship After Dramatic Game 7',
        'content': '''The Los Angeles Lakers defeated the Boston Celtics 110-108 in a 
        thrilling Game 7 to win the NBA Championship. LeBron James scored 35 points, while 
        Anthony Davis added 28 points and 12 rebounds. This marks the Lakers' 18th 
        championship title. The game was held at Crypto.com Arena in Los Angeles.'''
    },
    {
        'id': 'N003',
        'title': 'Federal Reserve Raises Interest Rates to Combat Inflation',
        'content': '''The Federal Reserve announced a 0.25% interest rate increase on 
        Wednesday, bringing the federal funds rate to 5.5%. Fed Chair Jerome Powell stated 
        that the decision aims to curb persistent inflation. Economists predict this could 
        slow economic growth but is necessary to maintain price stability. Stock markets 
        reacted negatively to the news.'''
    },
    {
        'id': 'N004',
        'title': 'Scientists Discover Potential Cancer Treatment Using CRISPR',
        'content': '''Researchers at Stanford University have made a breakthrough in cancer 
        treatment using CRISPR gene-editing technology. The study, published in Nature Medicine, 
        shows promising results in targeting and eliminating cancer cells without harming 
        healthy tissue. Clinical trials are expected to begin next year. Dr. Jennifer Doudna, 
        Nobel laureate and CRISPR pioneer, called it a significant advance.'''
    },
    {
        'id': 'N005',
        'title': 'Climate Summit in Paris Reaches Historic Agreement',
        'content': '''World leaders gathered in Paris for the Global Climate Summit reached 
        a landmark agreement to reduce carbon emissions by 50% by 2030. UN Secretary-General 
        António Guterres praised the accord as a critical step. Over 190 countries committed 
        to the new targets. Environmental activists welcomed the decision but called for 
        faster action.'''
    }
]

# Create DataFrame
df = pd.DataFrame(news_articles)
print(f"✅ Loaded {len(df)} news articles\n")
print(df[['id', 'title']])

## Part 3: Strategy 1 - Zero-Shot Classification

### 3.1 Predefined Tag Library

In [None]:
# Define tag categories
TAG_CATEGORIES = {
    'topic': [
        'Technology', 'Sports', 'Business', 'Science', 'Politics',
        'Entertainment', 'Health', 'Environment', 'Education'
    ],
    'industry': [
        'Tech Industry', 'Finance', 'Healthcare', 'Energy',
        'Retail', 'Manufacturing', 'Transportation'
    ],
    'sentiment': [
        'Positive', 'Negative', 'Neutral'
    ],
    'urgency': [
        'Breaking News', 'Regular News', 'Feature Story'
    ]
}

print("📋 Available Tag Categories:")
for category, tags in TAG_CATEGORIES.items():
    print(f"\n{category.upper()}:")
    print(f"   {', '.join(tags)}")

### 3.2 Zero-Shot Classification Implementation

In [None]:
from transformers import pipeline

# Load zero-shot classification model
print("📦 Loading zero-shot classifier...")

zero_shot_classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0 if torch.cuda.is_available() else -1
)

print("✅ Classifier loaded!")
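
The zero-shot pipeline reuses a natural language inference (NLI) model: the article is the premise, and each candidate label is slotted into a hypothesis sentence (the default template in `transformers` is `"This example is {}."`). The sketch below illustrates how entailment logits could be turned into scores in both modes. The logits are invented for illustration and the helper names are hypothetical; this is not the library's code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def multi_label_scores(logits_by_label):
    """Score every label independently: softmax over (contradiction, entailment)."""
    return {
        label: softmax([contradiction, entailment])[1]
        for label, (contradiction, entailment) in logits_by_label.items()
    }


def single_label_scores(logits_by_label):
    """Make labels compete: softmax over the entailment logits of all labels."""
    labels = list(logits_by_label)
    probs = softmax([logits_by_label[label][1] for label in labels])
    return dict(zip(labels, probs))


# Invented (contradiction, entailment) logits, for illustration only
fake_logits = {
    "Technology": (-2.0, 3.0),
    "Business": (0.0, 1.0),
    "Sports": (3.0, -3.0),
}

print(build_hypotheses(list(fake_logits)))
print(multi_label_scores(fake_logits))   # scores need not sum to 1
print(single_label_scores(fake_logits))  # scores sum to 1
```

With `multi_label=True`, as used in this notebook, an article can score high on several labels at once, which is why a per-label threshold makes sense.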

In [None]:
def classify_article(text, candidate_labels, top_k=3, threshold=0.5):
    """
    Classify article using zero-shot classification
    
    Args:
        text: Article text
        candidate_labels: List of possible labels
        top_k: Number of top labels to return
        threshold: Minimum confidence threshold
    
    Returns:
        List of (label, score) tuples
    """
    result = zero_shot_classifier(
        text,
        candidate_labels,
        multi_label=True  # Allow multiple labels
    )
    
    # Filter by threshold and get top-k
    tags = []
    for label, score in zip(result['labels'], result['scores']):
        if score >= threshold:
            tags.append((label, score))
            if len(tags) >= top_k:
                break
    
    return tags


# Test on first article
article = df.iloc[0]
text = article['title'] + ' ' + article['content']

print(f"📰 Article: {article['title']}\n")

# Classify by topic
topic_tags = classify_article(text, TAG_CATEGORIES['topic'], top_k=3, threshold=0.5)

print("🏷️ Topic Tags:")
for tag, score in topic_tags:
    print(f"   {tag}: {score:.2%}")
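
To see how `top_k` and `threshold` interact without loading the model, the filtering step can be tried on a mocked result. The dictionary below mirrors the `labels` / `scores` shape returned by the pipeline, but its values are invented, and `filter_tags` is a hypothetical stand-in for the filtering part of `classify_article`.

```python
def filter_tags(result, top_k=3, threshold=0.5):
    """Keep labels scoring at least `threshold`, best first, at most `top_k` of them."""
    pairs = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(label, score) for label, score in pairs if score >= threshold][:top_k]


# Invented scores, for illustration only
mock_result = {
    "labels": ["Technology", "Business", "Science", "Sports"],
    "scores": [0.97, 0.62, 0.41, 0.03],
}

print(filter_tags(mock_result, top_k=3, threshold=0.5))
print(filter_tags(mock_result, top_k=1, threshold=0.5))
print(filter_tags(mock_result, top_k=3, threshold=0.99))
```

The threshold removes weak labels first and `top_k` only caps what is left, so an article can end up with fewer than `top_k` tags, or with none at all.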

### 3.3 Batch Tag Generation

In [None]:
# Generate tags for all articles
print("🔄 Generating tags for all articles...\n")

all_tags = []

for idx, row in df.iterrows():
    text = row['title'] + ' ' + row['content']
    
    # Generate topic tags
    topic_tags = classify_article(
        text,
        TAG_CATEGORIES['topic'],
        top_k=2,
        threshold=0.5
    )
    
    # Generate industry tags
    industry_tags = classify_article(
        text,
        TAG_CATEGORIES['industry'],
        top_k=1,
        threshold=0.6
    )
    
    all_tags.append({
        'article_id': row['id'],
        'title': row['title'],
        'topic_tags': [tag for tag, score in topic_tags],
        'industry_tags': [tag for tag, score in industry_tags],
        'confidence': np.mean([score for tag, score in topic_tags]) if topic_tags else 0
    })
    
    print(f"✅ {row['id']}: {row['title'][:50]}...")
    print(f"   Topics: {', '.join([tag for tag, _ in topic_tags])}")
    print(f"   Industry: {', '.join([tag for tag, _ in industry_tags])}\n")

tags_df = pd.DataFrame(all_tags)
print("\n✅ Tagging completed!")

## Part 4: Strategy 2 - NER Entity Extraction

### 4.1 Extracting Named Entities as Tags

In [None]:
# Load NER pipeline
print("📦 Loading NER model...")

ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1
)

print("✅ NER model loaded!")
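
The `aggregation_strategy="simple"` setting makes the pipeline return whole entities such as "Tim Cook" instead of one prediction per token. The sketch below shows the general idea on hand-written BIO tags: consecutive tokens of the same entity type are merged and their scores averaged. It is a simplified illustration with invented tokens and scores, and `group_bio` is a hypothetical helper, not the library's implementation.

```python
def group_bio(tokens):
    """Merge (word, BIO tag, score) triples into entity groups."""
    groups = []
    current = None
    for word, tag, score in tokens:
        if tag == "O":
            # A non-entity token closes any open group
            if current:
                groups.append(current)
                current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        starts_new = (
            prefix == "B"
            or current is None
            or current["entity_group"] != entity_type
        )
        if starts_new:
            if current:
                groups.append(current)
            current = {"entity_group": entity_type, "words": [word], "scores": [score]}
        else:
            current["words"].append(word)
            current["scores"].append(score)
    if current:
        groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]


# Invented token-level predictions, for illustration only
sample_tokens = [
    ("Tim", "B-PER", 0.99),
    ("Cook", "I-PER", 0.98),
    ("presented", "O", 0.99),
    ("Apple", "B-ORG", 0.97),
    ("Inc", "I-ORG", 0.95),
]

for entity in group_bio(sample_tokens):
    print(entity)
```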

In [None]:
def extract_entity_tags(text, entity_types=('PER', 'ORG', 'LOC'), min_confidence=0.9):
    """
    Extract named entities as tags
    
    Args:
        text: Article text
        entity_types: Entity types to extract
        min_confidence: Minimum confidence threshold
    
    Returns:
        Dict with entities by type
    """
    entities = ner_pipeline(text)
    
    entity_tags = {et: [] for et in entity_types}
    
    for entity in entities:
        if entity['score'] >= min_confidence and entity['entity_group'] in entity_types:
            entity_tags[entity['entity_group']].append({
                'text': entity['word'],
                'score': entity['score']
            })
    
    # Remove duplicates
    for entity_type in entity_tags:
        seen = set()
        unique_entities = []
        for ent in entity_tags[entity_type]:
            if ent['text'] not in seen:
                seen.add(ent['text'])
                unique_entities.append(ent)
        entity_tags[entity_type] = unique_entities
    
    return entity_tags


# Test NER on first article
article = df.iloc[0]
text = article['title'] + ' ' + article['content']

entity_tags = extract_entity_tags(text)

print(f"📰 Article: {article['title']}\n")
print("🏷️ Extracted Entity Tags:\n")

for entity_type, entities in entity_tags.items():
    if entities:
        type_names = {'PER': 'People', 'ORG': 'Organizations', 'LOC': 'Locations'}
        print(f"{entity_type} ({type_names.get(entity_type, entity_type)}):")
        for ent in entities:
            print(f"   - {ent['text']} (confidence: {ent['score']:.2%})")
        print()
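
The duplicate check in `extract_entity_tags` compares exact strings, so "Apple" and "apple" would count as two different tags. One possible refinement, sketched here on invented data, is to compare a case-folded key and keep the highest-scoring mention. `dedupe_entities` is a hypothetical helper, not part of the notebook's pipeline.

```python
def dedupe_entities(entities):
    """Collapse mentions that differ only by case or padding, keeping the best score."""
    best = {}
    for ent in entities:
        key = ent["text"].strip().casefold()
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    # dict preserves insertion order, so entities stay in order of first appearance
    return list(best.values())


# Invented mentions, for illustration only
mentions = [
    {"text": "Apple", "score": 0.95},
    {"text": "apple", "score": 0.99},
    {"text": "Tim Cook", "score": 0.98},
]

print(dedupe_entities(mentions))
```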

### 4.2 Extracting Entity Tags for All Articles

In [None]:
# Extract entities for all articles
print("🔄 Extracting entities from all articles...\n")

for idx, row in df.iterrows():
    text = row['title'] + ' ' + row['content']
    entities = extract_entity_tags(text)
    
    # Flatten entities
    all_entity_names = []
    for entity_type, ents in entities.items():
        all_entity_names.extend([e['text'] for e in ents])
    
    # Add to tags DataFrame
    tags_df.loc[tags_df['article_id'] == row['id'], 'entity_tags'] = ', '.join(all_entity_names)
    
    print(f"✅ {row['id']}: {', '.join(all_entity_names[:5])}")

print("\n✅ Entity extraction completed!")

## Part 5: Strategy 3 - Keyword Extraction (RAKE)

### 5.1 Using the RAKE Algorithm

In [None]:
from rake_nltk import Rake
import nltk

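# Download NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
# Newer NLTK releases load the sentence tokenizer from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab', quiet=True)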

# Initialize RAKE
rake = Rake()

def extract_keywords_rake(text, top_n=5):
    """
    Extract keywords using RAKE algorithm
    
    Args:
        text: Article text
        top_n: Number of top keywords
    
    Returns:
        List of (keyword, score) tuples
    """
    rake.extract_keywords_from_text(text)
    keywords = rake.get_ranked_phrases_with_scores()
    
    # Return top N
    return keywords[:top_n]


# Test RAKE
article = df.iloc[0]
text = article['title'] + ' ' + article['content']

keywords = extract_keywords_rake(text, top_n=5)

print(f"📰 Article: {article['title']}\n")
print("🔑 RAKE Keywords:\n")
for keyword, score in keywords:
    print(f"   {keyword} (score: {score:.2f})")
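
For intuition about the scores printed above, here is a minimal from-scratch sketch of the classic RAKE idea: stopwords and punctuation split the text into candidate phrases, each word is scored by degree divided by frequency, and a phrase scores the sum of its words. The tiny stopword list and the function names are assumptions for illustration; `rake_nltk` handles tokenization and stopwords in its own way, so its numbers will differ.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not NLTK's list)
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "with", "is", "its", "at", "for", "by"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into phrases (lists of words) at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(current)
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Return (phrase, score) pairs sorted by descending score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, counting the word itself
    word_score = {word: degree[word] / freq[word] for word in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


sample_text = "Apple unveils the new iPhone with advanced AI features. The new iPhone is fast."
for phrase, score in rake_scores(sample_text):
    print(f"{phrase}: {score:.1f}")
```

Longer phrases tend to win because each of their words gets a high degree, which is why RAKE output often needs a length cap or a final filter before it is used as tags.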

## Part 6: Integrating the Complete Tagging System

### 6.1 Multi-Strategy Tag Generator

In [None]:
class NewsAutoTagger:
    """
    Intelligent news auto-tagging system.
    Combines zero-shot classification, NER, and keyword extraction.
    """
    def __init__(self):
        print("🚀 Initializing News Auto-Tagger...")
        
        # Load models
        self.zero_shot = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli",
            device=-1
        )
        
        self.ner = pipeline(
            "ner",
            model="dslim/bert-base-NER",
            aggregation_strategy="simple",
            device=-1
        )
        
        self.rake = Rake()
        
        print("✅ Auto-Tagger ready!")
    
    def generate_tags(self, article, tag_categories=None):
        """
        Generate comprehensive tags for an article
        
        Args:
            article: Dict with 'title' and 'content'
            tag_categories: Dict of tag categories
        
        Returns:
            Dict with different types of tags
        """
        if tag_categories is None:
            tag_categories = TAG_CATEGORIES
        
        text = article['title'] + ' ' + article['content']
        
        tags = {
            'article_id': article.get('id', 'unknown'),
            'title': article['title']
        }
        
        # 1. Topic classification
        topic_result = self.zero_shot(
            text,
            tag_categories['topic'],
            multi_label=True
        )
        tags['topics'] = [
            {'tag': label, 'score': score}
            for label, score in zip(topic_result['labels'][:3], topic_result['scores'][:3])
            if score > 0.5
        ]
        
        # 2. Entity extraction
        entities = self.ner(text[:512])  # Keep NER input short: first 512 characters (not tokens)
        entity_tags = []
        seen = set()
        for ent in entities:
            if ent['score'] > 0.9 and ent['word'] not in seen:
                entity_tags.append({
                    'tag': ent['word'],
                    'type': ent['entity_group'],
                    'score': ent['score']
                })
                seen.add(ent['word'])
        tags['entities'] = entity_tags[:5]
        
        # 3. Keyword extraction
        self.rake.extract_keywords_from_text(text)
        keywords = self.rake.get_ranked_phrases()[:5]
        tags['keywords'] = keywords
        
        # 4. Generate final tag list
        final_tags = []
        
        # Add top topics
        final_tags.extend([t['tag'] for t in tags['topics'][:2]])
        
        # Add top entities
        final_tags.extend([e['tag'] for e in tags['entities'][:3]])
        
        # Add top keyword
        if keywords:
            final_tags.append(keywords[0])
        
        tags['final_tags'] = list(dict.fromkeys(final_tags))  # Remove duplicates, keep priority order
        
        return tags
    
    def batch_generate_tags(self, articles):
        """
        Generate tags for multiple articles
        """
        results = []
        for article in articles:
            tags = self.generate_tags(article)
            results.append(tags)
        return results


# Initialize tagger
auto_tagger = NewsAutoTagger()
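
The fusion step in `generate_tags` fills the final list in priority order: up to two topics, up to three entities, then the top keyword. A minimal standalone sketch of that step follows, assuming the three inputs are already plain lists of strings. `merge_tags` and its parameters are hypothetical names for illustration, and the case-insensitive duplicate check is an added assumption rather than the class's behavior.

```python
def merge_tags(topics, entities, keywords, max_topics=2, max_entities=3, max_keywords=1):
    """Merge tag lists in priority order, dropping case-insensitive duplicates."""
    ranked = topics[:max_topics] + entities[:max_entities] + keywords[:max_keywords]
    seen = set()
    merged = []
    for tag in ranked:
        key = tag.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


# Invented inputs, for illustration only
print(merge_tags(
    topics=["Technology", "Business"],
    entities=["Apple", "Tim Cook", "Cupertino", "California"],
    keywords=["apple", "iphone 15"],
))
```

Keeping the priority order means the first tags in the list are always the ones the system is most confident about, which matters when a downstream consumer shows only the first few.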

### 6.2 Testing the Auto-Tagging System

In [None]:
# Test on all articles
print("🏷️ Generating comprehensive tags...\n")
print("=" * 80)

tagging_results = []

for idx, row in df.iterrows():
    article = row.to_dict()
    tags = auto_tagger.generate_tags(article)
    tagging_results.append(tags)
    
    print(f"\n📰 {tags['article_id']}: {tags['title']}")
    print(f"\n🏷️ Recommended Tags:")
    for tag in tags['final_tags']:
        print(f"   • {tag}")
    
    print(f"\n📊 Details:")
    print(f"   Topics: {[t['tag'] for t in tags['topics']]}")
    print(f"   Entities: {[e['tag'] for e in tags['entities']]}")
    print(f"   Keywords: {tags['keywords'][:3]}")
    print("-" * 80)

print("\n✅ All articles tagged!")

## Part 7: Tag Analysis and Visualization

### 7.1 Tag Statistics

In [None]:
# Analyze tag distribution
all_final_tags = []
for result in tagging_results:
    all_final_tags.extend(result['final_tags'])

tag_freq = Counter(all_final_tags)

print("📊 Tag Frequency Distribution:\n")
for tag, count in tag_freq.most_common(15):
    print(f"   {tag:30} : {count:2d} articles")

In [None]:
# Visualize top tags
top_tags = tag_freq.most_common(10)
tag_names, counts = zip(*top_tags)  # separate name avoids shadowing the earlier `tags` dict

plt.figure(figsize=(12, 6))
plt.barh(tag_names, counts, color='steelblue')
plt.xlabel('Frequency')
plt.title('Top 10 Most Frequent Tags', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### 7.2 Tag Co-occurrence Analysis

In [None]:
# Tag co-occurrence matrix
from sklearn.preprocessing import MultiLabelBinarizer

# Prepare tag lists
tag_lists = [result['final_tags'] for result in tagging_results]

# Binary encoding
mlb = MultiLabelBinarizer()
tag_matrix = mlb.fit_transform(tag_lists)

# Calculate co-occurrence
co_occurrence = tag_matrix.T @ tag_matrix

# Visualize (top 10 tags)
top_tag_names = [tag for tag, _ in tag_freq.most_common(10)]
top_tag_indices = [list(mlb.classes_).index(tag) for tag in top_tag_names if tag in mlb.classes_]

co_occurrence_subset = co_occurrence[np.ix_(top_tag_indices, top_tag_indices)]

plt.figure(figsize=(12, 10))
sns.heatmap(
    co_occurrence_subset,
    annot=True,
    fmt='d',
    cmap='YlOrRd',
    xticklabels=top_tag_names,
    yticklabels=top_tag_names,
    cbar_kws={'label': 'Co-occurrence Count'}
)
plt.title('Tag Co-occurrence Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
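
In the matrix product above, each diagonal cell counts the articles that carry a tag, and each off-diagonal cell counts the articles that carry both tags. The same pair counts can be checked by hand on a toy example using only the standard library. The tag lists below are invented, and `cooccurrence_counts` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tag_lists):
    """Count, for every unordered tag pair, how many articles contain both tags."""
    pair_counts = Counter()
    for tags in tag_lists:
        # sorted(set(...)) gives each pair one canonical order and ignores repeats
        for first, second in combinations(sorted(set(tags)), 2):
            pair_counts[(first, second)] += 1
    return pair_counts


# Invented tag lists, for illustration only
sample_tag_lists = [
    ["Technology", "Apple", "Tim Cook"],
    ["Technology", "Business", "Apple"],
    ["Sports", "Lakers"],
]

for pair, count in cooccurrence_counts(sample_tag_lists).most_common():
    print(pair, count)
```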

## Part 8: Production Deployment

### 8.1 FastAPI Tagging Service

In [None]:
%%writefile news_tagging_api.py
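# news_tagging_api.py

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import List, Dict
import uvicorn

# Assumption: the NewsAutoTagger class (Part 6) and TAG_CATEGORIES (Part 3) have been
# saved to a module next to this file. "news_tagger" is a hypothetical module name.
# Without this import the script raises NameError, because %%writefile creates a
# standalone file that cannot see notebook variables.
from news_tagger import NewsAutoTagger, TAG_CATEGORIES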

app = FastAPI(
    title="News Auto-Tagging API",
    description="Automatic tag generation for news articles",
    version="1.0.0"
)

# Global auto-tagger
auto_tagger = None

@app.on_event("startup")
async def load_models():
    global auto_tagger
    print("Loading models...")
    auto_tagger = NewsAutoTagger()
    print("✅ Models loaded")

class ArticleInput(BaseModel):
    title: str = Field(..., min_length=5, max_length=500)
    content: str = Field(..., min_length=50, max_length=10000)
    id: str = Field(default="AUTO")

class TagResponse(BaseModel):
    article_id: str
    recommended_tags: List[str]
    topics: List[Dict]
    entities: List[Dict]
    keywords: List[str]

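@app.post("/tag", response_model=TagResponse)
async def tag_article(article: ArticleInput):
    """Generate tags for an article"""
    # Reject requests that arrive before the startup hook has loaded the models
    if auto_tagger is None:
        raise HTTPException(status_code=503, detail="Models are not loaded yet")
    try:
        # .dict() works on Pydantic v1; on Pydantic v2 it is deprecated in favor of model_dump()
        tags = auto_tagger.generate_tags(article.dict())
        
        return TagResponse(
            article_id=tags['article_id'],
            recommended_tags=tags['final_tags'],
            topics=tags['topics'],
            entities=tags['entities'],
            keywords=tags['keywords']
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))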

@app.get("/available_tags")
async def get_available_tags():
    """Get all available tag categories"""
    return TAG_CATEGORIES

@app.get("/health")
async def health_check():
    return {"status": "healthy", "tagger_loaded": auto_tagger is not None}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
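
Once the service is running, a client can call the `/tag` endpoint with a JSON body that matches `ArticleInput`. The sketch below uses only the standard library and assumes the API is reachable at `http://localhost:8000`; the URL constant and both helper names are hypothetical. Building the request is separated from sending it so the payload can be inspected without a live server.

```python
import json
import urllib.request

# Assumption: the service from news_tagging_api.py runs locally on port 8000
API_URL = "http://localhost:8000/tag"


def build_tag_request(title, content, article_id="AUTO", url=API_URL):
    """Build (but do not send) a POST request matching the ArticleInput schema."""
    payload = {"title": title, "content": content, "id": article_id}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_tags(title, content, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_tag_request(title, content)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the payload locally; call request_tags(...) only when the server is up
demo_request = build_tag_request(
    "Apple Unveils New iPhone 15 with Advanced AI Features",
    "Apple Inc. announced its latest iPhone 15 at a press conference in Cupertino, California.",
)
print(demo_request.get_method(), demo_request.full_url)
print(demo_request.data.decode("utf-8"))
```

Note that `ArticleInput` requires a title of at least 5 characters and content of at least 50, so shorter inputs are rejected by FastAPI's validation before the tagger runs.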

## Part 9: Tag Quality Evaluation

### 9.1 Tag Consistency Analysis

In [None]:
# Calculate tagging statistics
tag_counts_per_article = [len(result['final_tags']) for result in tagging_results]

print("📊 Tagging Statistics:\n")
print(f"   Average tags per article: {np.mean(tag_counts_per_article):.2f}")
print(f"   Min tags: {np.min(tag_counts_per_article)}")
print(f"   Max tags: {np.max(tag_counts_per_article)}")
print(f"   Median tags: {np.median(tag_counts_per_article)}")

# Unique tags
all_unique_tags = set()
for result in tagging_results:
    all_unique_tags.update(result['final_tags'])

print(f"\n   Total unique tags: {len(all_unique_tags)}")
print(f"   Tag vocabulary size: {len(tag_freq)}")
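
The statistics above describe how many tags are produced, not whether they are correct. If editor-assigned reference tags were available (an assumption, since this notebook has none), per-article precision, recall, and F1 could be computed as sketched below. `tag_prf` is a hypothetical helper and the example tags are invented.

```python
def tag_prf(predicted, gold):
    """Return (precision, recall, f1) for one article, ignoring case."""
    predicted_set = {tag.casefold() for tag in predicted}
    gold_set = {tag.casefold() for tag in gold}
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented predicted and reference tags, for illustration only
precision, recall, f1 = tag_prf(
    predicted=["Technology", "Apple", "Tim Cook", "iphone 15"],
    gold=["Technology", "Apple", "Smartphones"],
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```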

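## Part 10: Summary and Extensions

### ✅ What This Project Covered

1. **Multi-strategy tag generation**
   - Zero-shot classification (topic and industry)
   - NER entity extraction (people, places, organizations)
   - Keyword extraction (RAKE algorithm)

2. **Intelligent fusion**
   - Merging results from multiple models
   - Confidence filtering
   - Deduplication and ranking

3. **Visual analysis**
   - Tag frequency distribution
   - Tag co-occurrence matrix
   - Quality statistics

4. **Production deployment**
   - FastAPI service
   - RESTful API design
   - Batch tag generation

### 🚀 Advanced Extensions

#### Technical Improvements
- [ ] Use BERT for keyword extraction (KeyBERT)
- [ ] Integrate an LLM to generate descriptive tags
- [ ] Implement a tag hierarchy (parent and child tags)
- [ ] Personalized tag recommendations (based on user reading history)

#### Feature Extensions
- [ ] Chinese-language news tagging
- [ ] Automatic SEO meta tag generation
- [ ] Tag trend analysis (tracking popular tags)
- [ ] Tag similarity computation (tag embeddings)

#### Application Scenarios
- [ ] CMS (content management system) integration
- [ ] Automatic hashtags for social media
- [ ] E-commerce product tag generation
- [ ] Keyword extraction for academic papers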

---

**Project version**: v1.0
**Created**: 2025-10-17
**Author**: iSpan NLP Team
**License**: MIT License