# Project: Spam Classifier (SMS Spam Classification)

**Course**: iSpan Python NLP Cookbooks v2
**Chapter**: CH04 Machine Learning and Natural Language Processing
**Project type**: Complete hands-on project
**Version**: v1.0
**Last updated**: 2025-10-17

---

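## 📚 Project Goals

Build a practical SMS spam classification system that can:
1. Automatically distinguish spam from legitimate messages (ham)
2. Reach at least 95% accuracy (always predicting ham already scores about 87% on this dataset, so precision and recall on the spam class matter as well)
3. Provide interpretable evidence for each classification
4. Be deployed in real-world applications

## 🎯 Learning Focus

- The complete machine learning project workflow
- Text preprocessing and feature engineering
- Applying the Naive Bayes classifier
- Model evaluation and optimization
- Error analysis and improvement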

---

## 1. Environment Setup and Data Loading

In [None]:
# Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc
)

# Display options
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 100)
sns.set_style('whitegrid')

# Set the random seed
np.random.seed(42)

print("✅ Environment ready")

### 1.1 Loading the SMS Spam Collection Dataset

**Data source**: [UCI ML Repository - SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)

**About the dataset**:
- 5,574 English SMS messages
- Binary classification: spam vs ham (legitimate messages)
- Imbalanced dataset (spam makes up about 13%)
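
The imbalance matters for how scores are read. As a minimal sketch, assuming the commonly cited class counts for the UCI dataset (747 spam and 4,827 ham; treat the exact figures as an assumption), a classifier that always answers `ham` already reaches a high accuracy while catching no spam at all:

```python
# Majority-class baseline: always predict "ham".
# The class counts are assumed figures for the UCI dataset, used for illustration.
n_spam, n_ham = 747, 4827
n_total = n_spam + n_ham

baseline_accuracy = n_ham / n_total   # every ham is right, every spam is wrong
baseline_spam_recall = 0 / n_spam     # no spam message is ever flagged

print(f"Spam share:        {n_spam / n_total:.1%}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print(f"Baseline recall:   {baseline_spam_recall:.1%}")
```

Any model worth keeping should beat this baseline on accuracy and, more importantly, on spam precision and recall.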

In [None]:
# scikit-learn does not ship an SMS Spam dataset, so the try block raises on
# purpose and the except branch loads the data from a local file instead.
try:
    raise ImportError("use local data")
except ImportError:
    import os
    
    # Candidate data paths
    data_paths = [
        '../../../datasets/sms_spam/SMSSpamCollection.csv',
        '../../../datasets/sms_spam/SMSSpamCollection.txt',
    ]
    
    df = None
    for path in data_paths:
        if os.path.exists(path):
            if path.endswith('.csv'):
                df = pd.read_csv(path, encoding='latin-1')
            else:
                df = pd.read_csv(path, sep='\t', names=['label', 'message'], encoding='latin-1')
            print(f"✅ Loaded data from {path}")
            break
    
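    # Fall back to simulated data when no local file exists (download the real data for actual work)
    if df is None:
        print("⚠️  Real dataset not found; using simulated data for demonstration")
        print("   The simulated set repeats a handful of messages, so its scores are not meaningful")
        print("   Download: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection")
        
        # Simulated data (expanded by repetition)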
        spam_messages = [
            "FREE for 1st week! No1 Nokia tone 4 ur mobile every week just txt NOKIA to 8007 Get txting and tell ur mates. zed POBox 36504 W45WQ norm150p/tone 16+",
            "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.",
            "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030",
            "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",
            "SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info",
            "URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18",
            "XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL",
            "England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOx 84NW10 6ØZ QUIT?txtStop",
            "Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged",
            "Congratulations ur awarded 500 of CD vouchers or 125gift guaranteed & Free entry 2 100 wkly draw txt MUSIC to 87066 TnCs www.Ldew.com1win150ppmx3age16"
        ] * 70  # 700 spam messages
        
        ham_messages = [
            "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",
            "Ok lar... Joking wif u oni...",
            "U dun say so early hor... U c already then say...",
            "Nah I don't think he goes to usf, he lives around here though",
            "Even my brother is not like to speak with me. They treat me like aids patent.",
            "I HAVE A DATE ON SUNDAY WITH WILL!!",
            "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune",
            "Oh k...i'm watching here:)",
            "Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.",
            "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today."
        ] * 460  # 4600 ham messages
        
        df = pd.DataFrame({
            'label': ['spam'] * len(spam_messages) + ['ham'] * len(ham_messages),
            'message': spam_messages + ham_messages
        })
        
        # Shuffle the rows
        df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Show basic information
print(f"\nDataset size: {len(df)} messages")
print(f"Columns: {df.columns.tolist()}")
df.head(10)

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Basic statistics
print("Dataset overview:")
print("="*60)
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nMissing values:")
print(df.isnull().sum())
print("\nClass distribution:")
print(df['label'].value_counts())
print(f"\nSpam share: {(df['label'] == 'spam').mean():.2%}")
print(f"Ham share: {(df['label'] == 'ham').mean():.2%}")

In [None]:
# Visualize the class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Class counts
label_counts = df['label'].value_counts()
axes[0].bar(label_counts.index, label_counts.values, color=['#4ecdc4', '#ff6b6b'])
axes[0].set_title('Message Distribution', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Label')

# Add count labels above the bars
for i, v in enumerate(label_counts.values):
    axes[0].text(i, v + 50, str(v), ha='center', va='bottom', fontweight='bold')

# Class proportions (pie chart)
colors = ['#4ecdc4', '#ff6b6b']
axes[1].pie(label_counts.values, labels=label_counts.index, autopct='%1.1f%%',
            colors=colors, startangle=90, textprops={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title('Message Proportion', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("⚠️  Note: this is an imbalanced dataset")
print(f"   Spam share is {(df['label'] == 'spam').mean():.1%}, so choose evaluation metrics with care")

### 2.1 Text Length Analysis

In [None]:
# Compute text length
df['length'] = df['message'].apply(len)
df['word_count'] = df['message'].apply(lambda x: len(x.split()))

# Summary statistics
print("Text length statistics (by class):")
print("="*60)
print(df.groupby('label')[['length', 'word_count']].describe().round(2))

In [None]:
# Visualize the text length distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Character length distribution - Histogram
df[df['label'] == 'ham']['length'].hist(bins=50, alpha=0.6, label='Ham', 
                                          color='#4ecdc4', ax=axes[0, 0])
df[df['label'] == 'spam']['length'].hist(bins=50, alpha=0.6, label='Spam', 
                                           color='#ff6b6b', ax=axes[0, 0])
axes[0, 0].set_title('Character Length Distribution', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Length')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()

# Word count distribution - Histogram
df[df['label'] == 'ham']['word_count'].hist(bins=50, alpha=0.6, label='Ham', 
                                              color='#4ecdc4', ax=axes[0, 1])
df[df['label'] == 'spam']['word_count'].hist(bins=50, alpha=0.6, label='Spam', 
                                               color='#ff6b6b', ax=axes[0, 1])
axes[0, 1].set_title('Word Count Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Word Count')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()

# Box plot - character length
df.boxplot(column='length', by='label', ax=axes[1, 0], patch_artist=True,
           boxprops=dict(facecolor='#4ecdc4', alpha=0.6))
axes[1, 0].set_title('Character Length by Label', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Label')
axes[1, 0].set_ylabel('Length')
plt.sca(axes[1, 0])
plt.xticks([1, 2], ['ham', 'spam'])

# Box plot - word count
df.boxplot(column='word_count', by='label', ax=axes[1, 1], patch_artist=True,
           boxprops=dict(facecolor='#ff6b6b', alpha=0.6))
axes[1, 1].set_title('Word Count by Label', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Label')
axes[1, 1].set_ylabel('Word Count')
plt.sca(axes[1, 1])
plt.xticks([1, 2], ['ham', 'spam'])

plt.tight_layout()
plt.show()

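print("\nObservations:")
print("- Spam messages tend to be longer (more promotional text)")
print("- Ham messages tend to be shorter (everyday conversation)")
print("- Text length may therefore be a useful feature")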

### 2.2 Sample Messages

In [None]:
# Show a random sample of spam messages
print("🔴 Spam examples:")
print("="*80)
spam_samples = df[df['label'] == 'spam'].sample(5, random_state=42)
for idx, (_, row) in enumerate(spam_samples.iterrows(), 1):
    print(f"{idx}. {row['message'][:150]}...")
    print()

# Show a random sample of ham messages
print("\n🟢 Ham examples:")
print("="*80)
ham_samples = df[df['label'] == 'ham'].sample(5, random_state=42)
for idx, (_, row) in enumerate(ham_samples.iterrows(), 1):
    print(f"{idx}. {row['message'][:150]}...")
    print()

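## 3. Text Preprocessing

**Preprocessing steps**:
1. Lowercase the text
2. Remove punctuation and special characters
3. Remove stopwords (optional)
4. Stemming/lemmatization (optional; not implemented in this notebook)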

In [None]:
import re
import string

def preprocess_text(text, remove_stopwords=False):
    """
    Clean a raw SMS message.
    
    Parameters:
    -----------
    text : str
        Raw text
    remove_stopwords : bool
        Whether to remove English stopwords (requires nltk)
    
    Returns:
    --------
    cleaned_text : str
        Preprocessed text
    """
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # 4. Remove punctuation and digits (spaces are kept)
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    text = re.sub(r'\d+', '', text)
    
    # 5. Collapse repeated whitespace
    text = ' '.join(text.split())
    
    # 6. Remove stopwords (optional)
    if remove_stopwords:
        try:
            from nltk.corpus import stopwords
            import nltk
            nltk.download('stopwords', quiet=True)
            stop_words = set(stopwords.words('english'))
            text = ' '.join([word for word in text.split() if word not in stop_words])
        except Exception:
            pass  # nltk or its stopword corpus is unavailable: skip this step
    
    return text

# Try the preprocessing function
test_text = "FREE! Win £1000 cash!! Call NOW on 0800-123-4567 or visit www.spam.com"
print("Original text:")
print(test_text)
print("\nAfter preprocessing:")
print(preprocess_text(test_text))
print("\nAfter preprocessing (stopwords removed):")
print(preprocess_text(test_text, remove_stopwords=True))

In [None]:
# Apply preprocessing to the whole dataset
print("Preprocessing the dataset...")
df['cleaned_message'] = df['message'].apply(lambda x: preprocess_text(x, remove_stopwords=False))
print("✅ Preprocessing done")

# Before/after comparison
print("\nBefore vs after preprocessing:")
print("="*80)
for i in range(3):
    sample = df.sample(1, random_state=i).iloc[0]
    print(f"\nExample {i+1} ({sample['label'].upper()}):")
    print(f"Original: {sample['message'][:100]}")
    print(f"Cleaned:  {sample['cleaned_message'][:100]}")

## 4. Train/Test Split

Split the data into a training set (80%) and a test set (20%).
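
With an imbalanced dataset, a stratified split keeps the spam share roughly equal in both sets. The following is a minimal sketch on made-up labels with a 13% spam share; the names `toy_texts` and `toy_labels` are illustrative and not part of the notebook:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a 13% spam share (illustrative data only).
toy_labels = ["spam"] * 130 + ["ham"] * 870
toy_texts = [f"message {i}" for i in range(len(toy_labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Compare the spam share on both sides of the split.
train_share = Counter(y_tr)["spam"] / len(y_tr)
test_share = Counter(y_te)["spam"] / len(y_te)
print(f"train spam share: {train_share:.1%}, test spam share: {test_share:.1%}")
```

Without `stratify`, a random split of a small dataset can leave the test set with noticeably more or less spam than the training set, which makes the reported metrics harder to interpret.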

In [None]:
# Prepare features and labels
X = df['cleaned_message']
y = df['label']

# Split the data (stratify keeps the class ratio the same in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Split result:")
print("="*60)
print(f"Training set size: {len(X_train)} ({len(X_train)/len(df):.0%})")
print(f"Test set size: {len(X_test)} ({len(X_test)/len(df):.0%})")
print("\nTraining set class distribution:")
print(y_train.value_counts())
print("\nTest set class distribution:")
print(y_test.value_counts())

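## 5. Feature Engineering

Convert the text into numeric features and compare two approaches:
1. **Bag of Words (BoW)**: raw term counts
2. **TF-IDF**: term counts reweighted so that words appearing in many messages count less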
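
To make the difference concrete, here is a small worked sketch in plain Python. It assumes the smoothed IDF that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and leaves out the L2 normalization that the vectorizer applies afterwards. The three messages are made up:

```python
import math
from collections import Counter

docs = [
    "free prize call now",
    "call me now",
    "see you at lunch",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def smoothed_idf(word):
    # Smoothed IDF, matching TfidfVectorizer's default smooth_idf=True.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

first_counts = Counter(tokenized[0])  # Bag of Words: raw counts for message 1
first_tfidf = {w: c * smoothed_idf(w) for w, c in first_counts.items()}

for word in tokenized[0]:
    print(f"{word:6s} count={first_counts[word]} tf-idf={first_tfidf[word]:.3f}")
```

Every word in the first message has the same BoW count, but TF-IDF gives "free" and "prize" more weight than "call" and "now", because the latter also appear in another message.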

In [None]:
# Approach 1: Bag of Words
print("Building Bag of Words features...")
bow_vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 2))
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

print(f"✅ BoW feature matrix shape: {X_train_bow.shape}")
print(f"   Vocabulary size: {len(bow_vectorizer.vocabulary_)}")
print(f"   Sparsity: {(1 - X_train_bow.nnz / (X_train_bow.shape[0] * X_train_bow.shape[1])):.2%}")

In [None]:
# Approach 2: TF-IDF
print("Building TF-IDF features...")
tfidf_vectorizer = TfidfVectorizer(max_features=3000, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"✅ TF-IDF feature matrix shape: {X_train_tfidf.shape}")
print(f"   Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print(f"   Sparsity: {(1 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])):.2%}")

### 5.1 Inspecting the Highest-Weighted Features

In [None]:
# Mean TF-IDF score of each term within spam and within ham
feature_names = tfidf_vectorizer.get_feature_names_out()

# Boolean row masks as NumPy arrays (safer than a pandas Series for sparse-matrix indexing)
spam_indices = (y_train == 'spam').to_numpy()
ham_indices = (y_train == 'ham').to_numpy()

spam_tfidf_mean = np.array(X_train_tfidf[spam_indices].mean(axis=0)).flatten()
ham_tfidf_mean = np.array(X_train_tfidf[ham_indices].mean(axis=0)).flatten()

# Terms with the highest mean score in each class
top_n = 15

spam_top_indices = spam_tfidf_mean.argsort()[-top_n:][::-1]
ham_top_indices = ham_tfidf_mean.argsort()[-top_n:][::-1]

print("Top 15 SPAM keywords (by mean TF-IDF score):")
print("="*60)
for idx in spam_top_indices:
    print(f"{feature_names[idx]:20s} → {spam_tfidf_mean[idx]:.4f}")

print("\nTop 15 HAM keywords (by mean TF-IDF score):")
print("="*60)
for idx in ham_top_indices:
    print(f"{feature_names[idx]:20s} → {ham_tfidf_mean[idx]:.4f}")

## 6. Model Training

Train two Multinomial Naive Bayes models:
1. One on Bag of Words features
2. One on TF-IDF features
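
Both models use `alpha=1.0`, which is Laplace (add-one) smoothing: each word probability is estimated as `(count + alpha) / (total count in class + alpha * vocabulary size)`, so a word never seen in a class still gets a small non-zero probability. A minimal sketch with made-up counts (the vocabulary and numbers are purely illustrative):

```python
import math

alpha = 1.0
vocab = ["free", "call", "lunch"]

# Toy per-class word counts (made up for illustration).
counts = {
    "spam": {"free": 8, "call": 4, "lunch": 0},
    "ham":  {"free": 1, "call": 3, "lunch": 6},
}

def word_log_prob(word, label):
    # Smoothed estimate: (count + alpha) / (class total + alpha * vocabulary size)
    total = sum(counts[label].values())
    return math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))

for word in vocab:
    log_odds = word_log_prob(word, "spam") - word_log_prob(word, "ham")
    print(f"{word:6s} log-odds(spam vs ham) = {log_odds:+.3f}")
```

Without smoothing, "lunch" would have probability zero under spam, and a single occurrence would rule out the spam class no matter what else the message contains.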

In [None]:
# Model 1: Bag of Words + Naive Bayes
print("Training model 1: BoW + Naive Bayes")
nb_bow = MultinomialNB(alpha=1.0)
nb_bow.fit(X_train_bow, y_train)
print("✅ Model 1 trained")

In [None]:
# Model 2: TF-IDF + Naive Bayes
print("Training model 2: TF-IDF + Naive Bayes")
nb_tfidf = MultinomialNB(alpha=1.0)
nb_tfidf.fit(X_train_tfidf, y_train)
print("✅ Model 2 trained")

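## 7. Model Evaluation

Evaluate the models with several metrics:
- **Accuracy**: share of all messages classified correctly
- **Precision**: share of messages predicted as spam that really are spam
- **Recall**: share of actual spam that the model catches
- **F1 Score**: harmonic mean of Precision and Recall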
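
These four metrics all derive from the confusion-matrix counts. The sketch below uses hypothetical counts for a 1,115-message test set (they are not results from this notebook) to show how accuracy can look strong while recall lags:

```python
# Hypothetical confusion-matrix counts with spam as the positive class.
tp, fp, fn, tn = 130, 5, 19, 961

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

In a spam filter, false positives (ham sent to the spam folder) are usually the costlier mistake, so precision deserves particular attention.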

In [None]:
# Predict
y_pred_bow = nb_bow.predict(X_test_bow)
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf)

# Compute the metrics (spam is the positive class)
def evaluate_model(y_true, y_pred, model_name):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, pos_label='spam')
    recall = recall_score(y_true, y_pred, pos_label='spam')
    f1 = f1_score(y_true, y_pred, pos_label='spam')
    
    print(f"{model_name} evaluation results:")
    print("="*60)
    print(f"Accuracy:  {accuracy:.4f} ({accuracy:.2%})")
    print(f"Precision: {precision:.4f} ({precision:.2%})")
    print(f"Recall:    {recall:.4f} ({recall:.2%})")
    print(f"F1 Score:  {f1:.4f} ({f1:.2%})")
    print()
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# Evaluate both models
metrics_bow = evaluate_model(y_test, y_pred_bow, "Model 1 (BoW + NB)")
metrics_tfidf = evaluate_model(y_test, y_pred_tfidf, "Model 2 (TF-IDF + NB)")

In [None]:
# Visual comparison
metrics_df = pd.DataFrame({
    'BoW + NB': [metrics_bow['accuracy'], metrics_bow['precision'], 
                 metrics_bow['recall'], metrics_bow['f1']],
    'TF-IDF + NB': [metrics_tfidf['accuracy'], metrics_tfidf['precision'], 
                    metrics_tfidf['recall'], metrics_tfidf['f1']]
}, index=['Accuracy', 'Precision', 'Recall', 'F1 Score'])

# Plot
ax = metrics_df.plot(kind='bar', figsize=(12, 6), color=['#4ecdc4', '#ff6b6b'], alpha=0.8)
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_ylabel('Score')
# Zoom in on the score range; the lower limit follows the data so bars below 0.9 are not clipped
ax.set_ylim([max(0.0, metrics_df.values.min() - 0.05), 1.0])
ax.legend(loc='lower right')
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)

# Add value labels
for container in ax.containers:
    ax.bar_label(container, fmt='%.3f', padding=3)

plt.tight_layout()
plt.show()

# Pick the best model (higher F1 on the spam class)
if metrics_tfidf['f1'] > metrics_bow['f1']:
    best_model = nb_tfidf
    best_vectorizer = tfidf_vectorizer
    best_name = "TF-IDF + Naive Bayes"
    X_test_best = X_test_tfidf
    y_pred_best = y_pred_tfidf
else:
    best_model = nb_bow
    best_vectorizer = bow_vectorizer
    best_name = "BoW + Naive Bayes"
    X_test_best = X_test_bow
    y_pred_best = y_pred_bow

print(f"\n🏆 Best model: {best_name}")

### 7.1 Confusion Matrix

In [None]:
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred_best, labels=['ham', 'spam'])

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Ham', 'Spam'],
            yticklabels=['Ham', 'Spam'],
            cbar_kws={'label': 'Count'})
plt.title(f'Confusion Matrix - {best_name}', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Read off the confusion matrix (rows: true label, columns: predicted label)
tn, fp, fn, tp = cm.ravel()
print("\nConfusion matrix breakdown:")
print("="*60)
print(f"True Negatives (TN):  {tn:4d} - Ham correctly identified")
print(f"False Positives (FP): {fp:4d} - Ham misclassified as Spam (Type I Error)")
print(f"False Negatives (FN): {fn:4d} - Spam missed as Ham (Type II Error)")
print(f"True Positives (TP):  {tp:4d} - Spam correctly identified")
print(f"\nTotal test samples: {tn + fp + fn + tp}")

### 7.2 Detailed Classification Report

In [None]:
# Generate the detailed report
print(f"Classification report - {best_name}")
print("="*70)
print(classification_report(y_test, y_pred_best, target_names=['Ham', 'Spam']))

## 8. Error Analysis

Examine the cases the model gets wrong to understand its weaknesses.

In [None]:
# False Positives: Ham misclassified as Spam
fp_indices = (y_test == 'ham') & (y_pred_best == 'spam')
false_positives = df.loc[y_test[fp_indices].index]

print(f"🔴 False Positives (Ham misclassified as Spam): {len(false_positives)} messages")
print("="*80)
if len(false_positives) > 0:
    for idx, (_, row) in enumerate(false_positives.head(5).iterrows(), 1):
        print(f"{idx}. {row['message'][:150]}")
        print(f"   Cleaned: {row['cleaned_message'][:100]}")
        print()
else:
    print("No False Positives!")

In [None]:
# False Negatives: Spam missed as Ham
fn_indices = (y_test == 'spam') & (y_pred_best == 'ham')
false_negatives = df.loc[y_test[fn_indices].index]

print(f"🟡 False Negatives (Spam missed as Ham): {len(false_negatives)} messages")
print("="*80)
if len(false_negatives) > 0:
    for idx, (_, row) in enumerate(false_negatives.head(5).iterrows(), 1):
        print(f"{idx}. {row['message'][:150]}")
        print(f"   Cleaned: {row['cleaned_message'][:100]}")
        print()
else:
    print("No False Negatives!")

## 9. Practical Use: Predicting New Messages

In [None]:
def predict_message(message, model, vectorizer, return_proba=True):
    """
    Predict whether a single SMS message is spam.
    
    Parameters:
    -----------
    message : str
        Message to classify
    model : sklearn model
        Trained model
    vectorizer : sklearn vectorizer
        Fitted feature transformer
    return_proba : bool
        Whether to also return the spam probability
    
    Returns:
    --------
    prediction : str
        Predicted label (spam/ham)
    probability : float (optional)
        Predicted probability that the message is spam
    """
    # Preprocess
    cleaned = preprocess_text(message)
    
    # Convert to features
    features = vectorizer.transform([cleaned])
    
    # Predict
    prediction = model.predict(features)[0]
    
    if return_proba:
        proba = model.predict_proba(features)[0]
        # Pick out the probability of the spam class
        spam_idx = list(model.classes_).index('spam')
        spam_proba = proba[spam_idx]
        return prediction, spam_proba
    else:
        return prediction

# Try the function
test_messages = [
    "Congratulations! You've won a FREE iPhone! Click here to claim now!",
    "Hey, are you free for lunch tomorrow?",
    "URGENT: Your account has been compromised. Call 0800-123-456 immediately!",
    "Meeting rescheduled to 3pm. See you then.",
    "Get 50% OFF on all products! Limited time offer. Shop now!",
]

print("Predictions:")
print("="*80)
for i, msg in enumerate(test_messages, 1):
    pred, proba = predict_message(msg, best_model, best_vectorizer)
    emoji = "🔴" if pred == 'spam' else "🟢"
    print(f"{i}. {msg}")
    # proba is always the spam probability, so it is low when the prediction is HAM
    print(f"   {emoji} Prediction: {pred.upper()} (spam probability: {proba:.2%})")
    print()

### 9.1 Interactive Prediction (Optional)

Uncomment the code below to type in your own messages and classify them.

In [None]:
# # Interactive input
# while True:
#     user_message = input("\nEnter a message (type 'quit' to stop): ")
#     if user_message.lower() == 'quit':
#         break
#     
#     pred, proba = predict_message(user_message, best_model, best_vectorizer)
#     emoji = "🔴" if pred == 'spam' else "🟢"
#     print(f"{emoji} Prediction: {pred.upper()} (spam probability: {proba:.2%})")

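## 10. Model Interpretation

Understand why the model makes the predictions it does.

In [None]:
# Feature importance from the per-class log probabilities
feature_names = best_vectorizer.get_feature_names_out()

# Log probability of each feature under each class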
spam_idx = list(best_model.classes_).index('spam')
ham_idx = list(best_model.classes_).index('ham')

spam_log_probs = best_model.feature_log_prob_[spam_idx]
ham_log_probs = best_model.feature_log_prob_[ham_idx]

# Log odds ratio: positive values favor spam, negative values favor ham
log_odds = spam_log_probs - ham_log_probs

# Strongest spam indicators
top_spam_indices = log_odds.argsort()[-20:][::-1]
# Strongest ham indicators
top_ham_indices = log_odds.argsort()[:20]

print("🔴 Top 20 SPAM indicator terms:")
print("="*60)
for idx in top_spam_indices:
    print(f"{feature_names[idx]:20s} → log odds = {log_odds[idx]:.4f}")

print("\n🟢 Top 20 HAM indicator terms:")
print("="*60)
for idx in top_ham_indices:
    print(f"{feature_names[idx]:20s} → log odds = {log_odds[idx]:.4f}")

In [None]:
# Visualize feature importance
top_n = 15
top_spam_idx = log_odds.argsort()[-top_n:][::-1]
top_ham_idx = log_odds.argsort()[:top_n]

fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Spam features
spam_words = [feature_names[i] for i in top_spam_idx]
spam_scores = [log_odds[i] for i in top_spam_idx]
axes[0].barh(spam_words, spam_scores, color='#ff6b6b')
axes[0].set_xlabel('Log Odds Ratio', fontsize=12)
axes[0].set_title('Top 15 SPAM Indicators', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()

# Ham features
ham_words = [feature_names[i] for i in top_ham_idx]
ham_scores = [log_odds[i] for i in top_ham_idx]
axes[1].barh(ham_words, ham_scores, color='#4ecdc4')
axes[1].set_xlabel('Log Odds Ratio', fontsize=12)
axes[1].set_title('Top 15 HAM Indicators', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

## 11. Cross-Validation

Use 5-fold cross-validation to assess how stable the model is.
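
Two details are easy to get wrong here. The vectorizer should sit inside the `Pipeline` so that it is refit on each training fold, and with string labels such as `'ham'`/`'spam'` the F1 scorer has to be told which label is the positive class. A minimal sketch on a made-up corpus, using `make_scorer` with `pos_label='spam'` (the toy data and variable names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, repeated so that every fold contains both classes.
toy_texts = ["free prize call now", "win cash prize", "see you at lunch",
             "call me later", "are you home yet"] * 10
toy_labels = ["spam", "spam", "ham", "ham", "ham"] * 10

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),         # refit on each training fold
    ("classifier", MultinomialNB(alpha=1.0)),
])

# Treat "spam" as the positive class when computing F1.
spam_f1 = make_scorer(f1_score, pos_label="spam")
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels,
                             cv=folds, scoring=spam_f1)
print(toy_scores.round(3))
```

Because the toy corpus consists of repeated messages, its scores say nothing about real performance; the point is only the wiring of the pipeline and the scorer.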

In [None]:
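from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Build the full pipeline (cross_val_score clones it, so the vectorizer is refit on every fold)
pipeline = Pipeline([
    ('vectorizer', best_vectorizer),
    ('classifier', MultinomialNB(alpha=1.0))
])

# Use the full dataset
X_full = df['cleaned_message']
y_full = df['label']

# The labels are strings, so the scorer must know that 'spam' is the positive class;
# scoring='f1' assumes pos_label=1 and fails on 'ham'/'spam' labels
spam_f1_scorer = make_scorer(f1_score, pos_label='spam')

# 5-fold cross-validation
print("Running 5-fold cross-validation...")
cv_scores = cross_val_score(pipeline, X_full, y_full, cv=5, scoring=spam_f1_scorer,
                            n_jobs=-1)

print("\nCross-validation results (F1 Score):")
print("="*60)
for i, score in enumerate(cv_scores, 1):
    print(f"Fold {i}: {score:.4f} ({score:.2%})")
print(f"\nMean F1 Score: {cv_scores.mean():.4f} ({cv_scores.mean():.2%})")
print(f"Standard deviation: {cv_scores.std():.4f}")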

# Visualize
plt.figure(figsize=(10, 6))
plt.plot(range(1, 6), cv_scores, marker='o', linestyle='-', linewidth=2, 
         markersize=10, color='#4ecdc4')
plt.axhline(y=cv_scores.mean(), color='#ff6b6b', linestyle='--', 
            linewidth=2, label=f'Mean: {cv_scores.mean():.4f}')
plt.xlabel('Fold', fontsize=12)
plt.ylabel('F1 Score', fontsize=12)
plt.title('5-Fold Cross-Validation Results', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nA small standard deviation across folds indicates stable performance")

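## 12. Project Summary

### ✅ Work Completed

1. **Data analysis**:
   - Loaded and analyzed the SMS Spam Collection dataset
   - Explored the class distribution and text characteristics
   - Identified keywords typical of spam and ham

2. **Text preprocessing**:
   - Lowercasing
   - Removal of URLs, punctuation, and digits
   - (Optional) stopword removal

3. **Feature engineering**:
   - Bag of Words (BoW)
   - TF-IDF
   - N-gram features (unigram + bigram)

4. **Model training and evaluation**:
   - Trained Multinomial Naive Bayes
   - Targeted 95%+ accuracy (check the figures that section 7 reports for your run)
   - Used cross-validation to check stability

5. **Error analysis**:
   - Analyzed False Positives and False Negatives
   - Identified the model's strengths and weaknesses

6. **Practical use**:
   - Built a prediction function
   - Handles new, previously unseen messages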

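### 📊 Key Findings

- **TF-IDF vs BoW**: TF-IDF weights words by how informative they are and often does well, but it does not always beat BoW with Multinomial Naive Bayes; section 7 keeps whichever has the higher F1
- **Spam has distinctive vocabulary**: promotional words such as "free", "win", "call", and "prize"
- **Model stability**: a small standard deviation across cross-validation folds suggests consistent generalization
- **Imbalanced data**: spam is only about 13% of the messages, so the choice of evaluation metric matters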

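### 🚀 Possible Improvements

1. **Feature engineering**:
   - Add text length as a feature
   - Measure the share of uppercase letters
   - Count special characters

2. **Model tuning**:
   - Tune the alpha parameter (Laplace smoothing)
   - Try other classifiers (SVM, Logistic Regression)
   - Ensemble methods

3. **Handling imbalance**:
   - Oversampling (SMOTE)
   - Adjusting class weights
   - Techniques designed for imbalanced data

4. **Deployment considerations**:
   - Model compression (reducing feature dimensionality)
   - Inference speed optimization
   - A monitoring system that tracks live performance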
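
The feature-engineering ideas in item 1 can be sketched in a few lines. These statistics have to be computed on the raw message, because `preprocess_text` lowercases the text and strips digits and punctuation. `handcrafted_features` is a hypothetical helper, not part of the notebook:

```python
import string

def handcrafted_features(message):
    """Simple surface statistics computed on the raw (unpreprocessed) text."""
    letters = [ch for ch in message if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "length": len(message),
        "upper_ratio": upper_ratio,
        "digit_count": sum(ch.isdigit() for ch in message),
        "special_count": sum(ch in string.punctuation for ch in message),
    }

print(handcrafted_features("WINNER!! Call 09061701461 now"))
print(handcrafted_features("ok see you at lunch"))
```

One way to use such features would be to stack them next to the TF-IDF matrix (for example with `scipy.sparse.hstack`) before training; whether that helps is something to verify on the real data.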

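### 💡 Lessons Learned

1. **Data exploration matters**: EDA helps us understand the characteristics of the data
2. **Preprocessing has a large effect**: appropriate text cleaning can noticeably change performance
3. **Evaluate broadly**: look beyond Accuracy to Precision and Recall
4. **Error analysis pays off**: understanding the model's weaknesses is what makes targeted improvement possible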
---

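## 13. Exercises

### Exercise 1: Feature Engineering

Try adding the following extra features:
- Text length
- Share of uppercase letters
- Number of digits
- Number of special characters

Check whether they improve model performance.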

### Exercise 2: Other Classifiers

Try the following classifiers and compare their performance:
- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest

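### Exercise 3: Chinese SMS Messages

Collect Chinese spam SMS data and apply the same workflow:
- Segment words with jieba
- Load a Chinese stopword list
- Train a Chinese SMS spam classifier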

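### Exercise 4: Model Deployment

Deploy the trained model as a REST API:
- Build a web service with Flask or FastAPI
- Expose a `/predict` endpoint that accepts a message
- Return the prediction and the spam probability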

---

**Course**: iSpan Python NLP Cookbooks v2
**Project**: Spam Classifier
**Last updated**: 2025-10-17

**Congratulations on completing this end-to-end hands-on project! 🎉**