# CH08-09: Hands-On Project - Customer Feedback Analyzer

---

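## 📚 Project Goals

Build a **complete customer feedback analysis system** that can:

1. 🎯 **Sentiment analysis**: automatically classify customer reviews as positive, negative, or neutral
2. 🏷️ **Topic classification**: identify the product category or issue type a review is about
3. 🔍 **Keyword extraction**: surface the issues customers care about most
4. 📊 **Visualization dashboard**: present analysis results and trends
5. 🚀 **Deployment**: expose an API that business systems can call

### Business Value

- **Automated processing**: thousands of reviews analyzed automatically every day
- **Real-time feedback**: quickly detect customer dissatisfaction and product problems
- **Data-driven**: quantify customer satisfaction objectively
- **Cost savings**: aims to cut manual review time by as much as 90%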

---

## 🎯 Project Architecture

```
Raw Reviews
    ↓
Preprocessing
    ├── Noise removal
    ├── Text cleanup
    └── Tokenization
    ↓
NLP Pipeline
    ├── Sentiment analysis
    ├── Topic classification
    └── Keyword extraction
    ↓
Aggregation
    ├── Statistical analysis
    ├── Trend analysis
    └── Anomaly detection
    ↓
Visualization
    ├── Sentiment distribution charts
    ├── Topic heatmap
    └── Word cloud
```
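
The preprocessing stage in the diagram is not implemented later in this notebook, since the synthetic reviews are already clean. As a minimal sketch of what noise removal and text cleanup could look like for scraped reviews, assuming the noise consists of HTML markup, URLs, and stray whitespace (the function name `clean_text` is hypothetical):

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")       # HTML tags such as <br> or <p class="x">
URL_RE = re.compile(r"https?://\S+")  # bare http/https links
SPACE_RE = re.compile(r"\s+")         # runs of spaces, tabs, and newlines

def clean_text(raw: str) -> str:
    """Return a review with entities decoded and markup, links, extra spaces removed."""
    text = html.unescape(raw)         # e.g. &amp; becomes &
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

print(clean_text("Great &amp; fast!<br>See https://example.com/x  now"))
```

Real feedback sources may need additional rules (emoji handling, signatures, quoted replies), so treat this as a starting point rather than a complete cleaner.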

---

## 🔧 Environment Setup

In [None]:
# Install required packages
# !pip install transformers datasets torch
# !pip install pandas numpy matplotlib seaborn plotly
# !pip install wordcloud scikit-learn nltk

In [None]:
# Import libraries
import os
import re
import json
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud

from transformers import pipeline
from collections import Counter
from datetime import datetime, timedelta

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully!")

---

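## 📊 Data Preparation

### 1. Generate Synthetic Customer Reviews

In a real application, the data would come from sources such as:
- E-commerce platforms (Amazon, PChome)
- Google Reviews
- Customer service systems
- Social media (Facebook, Twitter)

Here we generate synthetic data for demonstration: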

In [None]:
# Generate synthetic customer reviews
np.random.seed(42)

# Sample reviews with different sentiments and topics
positive_reviews = [
    "Excellent product! Fast shipping and great quality. Highly recommend!",
    "Love this item! Exactly as described. Will buy again.",
    "Amazing customer service. They resolved my issue quickly.",
    "Best purchase I've made this year. Super satisfied!",
    "Great value for money. The product exceeded my expectations.",
    "Fast delivery and product is in perfect condition. Very happy!",
    "Outstanding quality! This is exactly what I was looking for.",
    "Fantastic experience from order to delivery. Five stars!",
]

negative_reviews = [
    "Terrible quality. Product broke after two days. Very disappointed.",
    "Shipping took forever. Item arrived damaged. Not happy.",
    "Poor customer service. They didn't respond to my complaints.",
    "Complete waste of money. Product doesn't work as advertised.",
    "Do not buy! Cheap materials and terrible build quality.",
    "Worst purchase ever. Requesting a full refund immediately.",
    "Product is defective. Customer support was unhelpful.",
    "Very disappointed with the quality. Not worth the price.",
]

neutral_reviews = [
    "Product is okay. Nothing special but does the job.",
    "Average quality. Shipping was standard. No complaints.",
    "It's fine. Meets basic expectations but nothing more.",
    "Decent product for the price. Not amazing, not terrible.",
    "Standard item. Delivery was on time. No issues.",
    "Acceptable quality. Could be better but works as intended.",
]

# Topics (assigned at random further down, so they are independent of the review text)
topics = ['Product Quality', 'Shipping', 'Customer Service', 'Pricing', 'Features']

# Generate dataset
num_reviews = 300
reviews_data = []

for i in range(num_reviews):
    # Random sentiment distribution: 50% positive, 30% negative, 20% neutral
    sentiment_choice = np.random.choice(['positive', 'negative', 'neutral'], p=[0.5, 0.3, 0.2])
    
    if sentiment_choice == 'positive':
        review_text = np.random.choice(positive_reviews)
    elif sentiment_choice == 'negative':
        review_text = np.random.choice(negative_reviews)
    else:
        review_text = np.random.choice(neutral_reviews)
    
    # Generate random date (last 30 days)
    review_date = datetime.now() - timedelta(days=np.random.randint(0, 30))
    
    # Random rating (1-5 stars)
    if sentiment_choice == 'positive':
        rating = np.random.choice([4, 5], p=[0.3, 0.7])
    elif sentiment_choice == 'negative':
        rating = np.random.choice([1, 2], p=[0.6, 0.4])
    else:
        rating = 3
    
    reviews_data.append({
        'id': f'REV{i+1:04d}',
        'date': review_date.strftime('%Y-%m-%d'),
        'rating': rating,
        'review': review_text,
        'topic': np.random.choice(topics)
    })

# Create DataFrame
df = pd.DataFrame(reviews_data)

print(f"✅ Generated {len(df):,} synthetic customer reviews")
print(f"\n📊 Dataset shape: {df.shape}")
print(f"\n🔍 Sample data:")
df.head()

In [None]:
# Dataset overview
print("📋 Dataset Information:\n")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\n📈 Rating Distribution:")
print(df['rating'].value_counts().sort_index())

### 2. Exploratory Data Analysis (EDA)

In [None]:
# Visualize rating distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Rating counts
rating_counts = df['rating'].value_counts().sort_index()
axes[0].bar(rating_counts.index, rating_counts.values, color='steelblue')
axes[0].set_title('Rating Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Rating (Stars)')
axes[0].set_ylabel('Count')
axes[0].set_xticks([1, 2, 3, 4, 5])

# Topic distribution
topic_counts = df['topic'].value_counts()
axes[1].barh(topic_counts.index, topic_counts.values, color='coral')
axes[1].set_title('Topic Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Count')
axes[1].set_ylabel('Topic')

plt.tight_layout()
plt.show()

In [None]:
# Time series: reviews over time
df['date'] = pd.to_datetime(df['date'])
daily_reviews = df.groupby('date').size().reset_index(name='count')

plt.figure(figsize=(14, 5))
plt.plot(daily_reviews['date'], daily_reviews['count'], marker='o', linewidth=2, color='steelblue')
plt.title('Daily Review Volume', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

---

## 🤖 Loading the NLP Model

### Load the Sentiment Analysis Model

In [None]:
# Load sentiment analysis pipeline
import torch

print("📦 Loading sentiment analysis model...")

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1  # GPU 0 if available, otherwise CPU
)

print("✅ Sentiment analyzer loaded!")

# Test the model
test_text = "This product is amazing! I love it!"
result = sentiment_analyzer(test_text)[0]
print(f"\n🧪 Test prediction:")
print(f"   Text: {test_text}")
print(f"   Sentiment: {result['label']} (Score: {result['score']:.4f})")
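
The checkpoint loaded above was fine-tuned on SST-2, a binary task, so it only ever returns `POSITIVE` or `NEGATIVE`, while the project goals also list a neutral class. One crude workaround, sketched below, is to treat low-confidence predictions as neutral. The helper name `to_three_way` and the 0.75 cutoff are assumptions for illustration. Binary sentiment models are often very confident even on bland text, so the threshold would need tuning on labeled data, and a model trained with three classes may be a better fit.

```python
def to_three_way(result: dict, min_confidence: float = 0.75) -> str:
    """Map a binary pipeline result to POSITIVE / NEGATIVE / NEUTRAL.

    `result` has the shape returned by the sentiment pipeline:
    {"label": "POSITIVE" or "NEGATIVE", "score": float between 0 and 1}.
    """
    if result["score"] < min_confidence:
        return "NEUTRAL"  # the model is not sure either way
    return result["label"]

# Hand-written results in the pipeline's output shape (not real model output)
examples = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.62},
]
print([to_three_way(r) for r in examples])  # ['POSITIVE', 'NEUTRAL']
```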

---

## 🔍 Sentiment Analysis

### Batch-Process All Reviews

In [None]:
# Analyze sentiment for all reviews
print("🔄 Analyzing sentiment for all reviews...\n")

# Process in batches for efficiency
batch_size = 32
sentiments = []
sentiment_scores = []

for i in range(0, len(df), batch_size):
    batch = df['review'].iloc[i:i+batch_size].tolist()
    results = sentiment_analyzer(batch)
    
    for result in results:
        sentiments.append(result['label'])
        sentiment_scores.append(result['score'])
    
    # Progress indicator
    progress = min((i + batch_size) / len(df) * 100, 100)
    print(f"\rProgress: {progress:.1f}% ({min(i+batch_size, len(df))}/{len(df)})", end='')

print("\n\n✅ Sentiment analysis completed!")

# Add results to DataFrame
df['predicted_sentiment'] = sentiments
df['sentiment_confidence'] = sentiment_scores

print(f"\n📊 Sentiment Distribution:")
print(df['predicted_sentiment'].value_counts())
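
The loop above slices the review column into chunks of `batch_size` and feeds each chunk to the model. The same pattern can be factored into a small helper that works with any prediction function. In this sketch `predict_fn` stands for a callable such as `sentiment_analyzer`, and `fake_predict` is a stand-in so the example runs without downloading a model; both helper names are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence

def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(texts: Sequence[str],
                       predict_fn: Callable[[List[str]], List[dict]],
                       batch_size: int = 32) -> List[dict]:
    """Run `predict_fn` batch by batch and return one result per input text."""
    results: List[dict] = []
    for batch in iter_batches(texts, batch_size):
        results.extend(predict_fn(list(batch)))
    return results

def fake_predict(batch: List[str]) -> List[dict]:
    """Stand-in for the real model: labels everything POSITIVE."""
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]

texts = [f"review {i}" for i in range(70)]
print([len(b) for b in iter_batches(texts, 32)])    # [32, 32, 6]
print(len(predict_in_batches(texts, fake_predict)))  # 70
```

The Hugging Face pipeline also accepts a `batch_size` argument when it is called on a list, which may be simpler than manual slicing when no progress display is needed.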

In [None]:
# Show sample predictions
print("\n🔍 Sample Predictions:\n")
sample_df = df[['review', 'predicted_sentiment', 'sentiment_confidence']].head(10)
for idx, row in sample_df.iterrows():
    print(f"Review: {row['review'][:60]}...")
    print(f"Sentiment: {row['predicted_sentiment']} (Confidence: {row['sentiment_confidence']:.2%})")
    print("-" * 80)

### Visualize Sentiment Results

In [None]:
# Visualize sentiment distribution
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

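# 1. Sentiment pie chart
# sort_index() puts the labels in alphabetical order (NEGATIVE, POSITIVE),
# the same order pd.crosstab uses for its columns in panels 3 and 4
sentiment_counts = df['predicted_sentiment'].value_counts().sort_index()
colors = ['#e74c3c', '#2ecc71']  # Red for NEGATIVE, green for POSITIVE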
axes[0, 0].pie(
    sentiment_counts.values,
    labels=sentiment_counts.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=colors
)
axes[0, 0].set_title('Overall Sentiment Distribution', fontsize=14, fontweight='bold')

# 2. Confidence distribution
axes[0, 1].hist(df['sentiment_confidence'], bins=30, color='steelblue', edgecolor='black')
axes[0, 1].set_title('Sentiment Confidence Distribution', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Confidence Score')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].axvline(df['sentiment_confidence'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 1].legend()

# 3. Sentiment by topic
sentiment_by_topic = pd.crosstab(df['topic'], df['predicted_sentiment'])
sentiment_by_topic.plot(kind='bar', ax=axes[1, 0], color=colors, width=0.7)
axes[1, 0].set_title('Sentiment by Topic', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Topic')
axes[1, 0].set_ylabel('Count')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].legend(title='Sentiment')

# 4. Rating vs Sentiment
rating_sentiment = pd.crosstab(df['rating'], df['predicted_sentiment'], normalize='index') * 100
rating_sentiment.plot(kind='bar', ax=axes[1, 1], color=colors, width=0.7)
axes[1, 1].set_title('Sentiment Distribution by Rating', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Rating (Stars)')
axes[1, 1].set_ylabel('Percentage (%)')
axes[1, 1].tick_params(axis='x', rotation=0)
axes[1, 1].legend(title='Sentiment')

plt.tight_layout()
plt.show()
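
The fourth panel compares star ratings with predicted sentiment. The same comparison can be condensed into a single agreement rate by treating the ratings as weak labels. The sketch below assumes 4 to 5 stars mean positive and 1 to 2 stars mean negative, which mirrors how the synthetic data was generated, and skips 3-star reviews because the model has no neutral class. The function names are hypothetical.

```python
def rating_to_label(rating):
    """Map a star rating to the sentiment it implies, or None for 3 stars."""
    if rating >= 4:
        return "POSITIVE"
    if rating <= 2:
        return "NEGATIVE"
    return None

def agreement_rate(ratings, predicted):
    """Fraction of non-neutral reviews where the model matches the rating."""
    pairs = [(rating_to_label(r), p) for r, p in zip(ratings, predicted)]
    scored = [(expected, p) for expected, p in pairs if expected is not None]
    if not scored:
        return float("nan")  # nothing to compare
    return sum(expected == p for expected, p in scored) / len(scored)

ratings = [5, 4, 1, 2, 3, 5]
predicted = ["POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]
print(agreement_rate(ratings, predicted))  # 0.8 (4 of the 5 scored reviews agree)
```

With the notebook's DataFrame this could be called as `agreement_rate(df['rating'], df['predicted_sentiment'])`.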

---

## 🏷️ Keyword Extraction

### Extract High-Frequency Words

In [None]:
# Text preprocessing for keyword extraction
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK data (run once)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data that newer NLTK releases look for
nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))

def extract_keywords(text, top_n=10):
    """
    Extract top keywords from text
    Args:
        text: input text
        top_n: number of top keywords to return
    Returns:
        list of (keyword, frequency) tuples
    """
    # Lowercase and remove special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    
    # Tokenize
    words = word_tokenize(text)
    
    # Remove stopwords and short words
    keywords = [w for w in words if w not in stop_words and len(w) > 3]
    
    # Count frequencies
    word_freq = Counter(keywords)
    
    return word_freq.most_common(top_n)

print("✅ Keyword extraction function ready")

In [None]:
# Extract keywords from positive and negative reviews separately
positive_text = ' '.join(df[df['predicted_sentiment'] == 'POSITIVE']['review'])
negative_text = ' '.join(df[df['predicted_sentiment'] == 'NEGATIVE']['review'])

positive_keywords = extract_keywords(positive_text, top_n=15)
negative_keywords = extract_keywords(negative_text, top_n=15)

print("✅ Top Keywords in Positive Reviews:")
for word, freq in positive_keywords:
    print(f"   {word}: {freq}")

print("\n❌ Top Keywords in Negative Reviews:")
for word, freq in negative_keywords:
    print(f"   {word}: {freq}")
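
Raw frequency tends to put generic words such as "product" and "quality" near the top of both lists. As a sketch of one way to surface words that are characteristic of one group, the function below ranks words by the ratio of their add-one smoothed relative frequency in a target group to that in a background group. The function name and the choice of smoothing are assumptions for illustration, and the counts in the demo are made up.

```python
from collections import Counter

def distinctive_keywords(target: Counter, background: Counter, top_n: int = 5):
    """Rank target words by smoothed relative-frequency ratio against background."""
    vocab = set(target) | set(background)
    # Add-one smoothing: every vocabulary word gets one extra count per group
    target_total = sum(target.values()) + len(vocab)
    background_total = sum(background.values()) + len(vocab)
    scores = {}
    for word in target:
        p_target = (target[word] + 1) / target_total
        p_background = (background[word] + 1) / background_total
        scores[word] = p_target / p_background
    # Highest ratio first; ties broken alphabetically for stable output
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, round(score, 2)) for word, score in ranked[:top_n]]

negative = Counter({"product": 10, "terrible": 6, "refund": 4})
positive = Counter({"product": 12, "excellent": 7, "love": 5})
print(distinctive_keywords(negative, positive, top_n=2))
```

In the demo, "product" is the most frequent negative word but scores below 1 because it is just as common in positive reviews, so "terrible" and "refund" rank first.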

### Word Cloud Visualization

In [None]:
# Generate word clouds
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Positive word cloud
positive_wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    colormap='Greens',
    stopwords=stop_words
).generate(positive_text)

axes[0].imshow(positive_wordcloud, interpolation='bilinear')
axes[0].set_title('Positive Reviews - Word Cloud', fontsize=14, fontweight='bold')
axes[0].axis('off')

# Negative word cloud
negative_wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    colormap='Reds',
    stopwords=stop_words
).generate(negative_text)

axes[1].imshow(negative_wordcloud, interpolation='bilinear')
axes[1].set_title('Negative Reviews - Word Cloud', fontsize=14, fontweight='bold')
axes[1].axis('off')

plt.tight_layout()
plt.show()

---

## 📊 Comprehensive Analysis Report

### Generate a Business Insights Report

In [None]:
# Generate comprehensive analysis report
def generate_analysis_report(df):
    """
    Generate business insights report
    """
    report = {}
    
    # Overall metrics
    total_reviews = len(df)
    avg_rating = df['rating'].mean()
    positive_pct = (df['predicted_sentiment'] == 'POSITIVE').sum() / total_reviews * 100
    negative_pct = (df['predicted_sentiment'] == 'NEGATIVE').sum() / total_reviews * 100
    avg_confidence = df['sentiment_confidence'].mean()
    
    report['overview'] = {
        'total_reviews': total_reviews,
        'avg_rating': round(avg_rating, 2),
        'positive_percentage': round(positive_pct, 1),
        'negative_percentage': round(negative_pct, 1),
        'avg_confidence': round(avg_confidence, 3)
    }
    
    # Topic-level insights
    topic_analysis = df.groupby('topic').agg({
        'rating': 'mean',
        'predicted_sentiment': lambda x: (x == 'POSITIVE').sum() / len(x) * 100
    }).round(2)
    topic_analysis.columns = ['avg_rating', 'positive_pct']
    report['topic_insights'] = topic_analysis.to_dict('index')
    
    # Identify problem areas (topics with low ratings)
    problem_topics = topic_analysis[topic_analysis['avg_rating'] < 3.5].index.tolist()
    report['problem_areas'] = problem_topics
    
    # Identify strengths (topics with high ratings)
    strength_topics = topic_analysis[topic_analysis['avg_rating'] >= 4.5].index.tolist()
    report['strengths'] = strength_topics
    
    return report

# Generate report
report = generate_analysis_report(df)

# Print formatted report
print("=" * 70)
print("📊 CUSTOMER FEEDBACK ANALYSIS REPORT")
print("=" * 70)

print("\n📌 OVERVIEW")
print(f"   Total Reviews Analyzed: {report['overview']['total_reviews']:,}")
print(f"   Average Rating: {report['overview']['avg_rating']} / 5.0")
print(f"   Positive Sentiment: {report['overview']['positive_percentage']}%")
print(f"   Negative Sentiment: {report['overview']['negative_percentage']}%")
print(f"   Average Confidence: {report['overview']['avg_confidence']:.1%}")

print("\n📈 TOPIC-LEVEL INSIGHTS")
for topic, metrics in report['topic_insights'].items():
    print(f"   {topic}:")
    print(f"      - Avg Rating: {metrics['avg_rating']}")
    print(f"      - Positive %: {metrics['positive_pct']}%")

if report['problem_areas']:
    print("\n⚠️  PROBLEM AREAS (Require Attention)")
    for topic in report['problem_areas']:
        print(f"   - {topic}")
else:
    print("\n✅ No major problem areas identified!")

if report['strengths']:
    print("\n💪 STRENGTHS (Performing Well)")
    for topic in report['strengths']:
        print(f"   - {topic}")

print("\n" + "=" * 70)

---

## 📈 Time Trend Analysis

### Sentiment Over Time

In [None]:
# Sentiment trend over time
df_sorted = df.sort_values('date')
daily_sentiment = df_sorted.groupby(['date', 'predicted_sentiment']).size().unstack(fill_value=0)

# Plot sentiment trend
plt.figure(figsize=(14, 6))
plt.plot(daily_sentiment.index, daily_sentiment['POSITIVE'], marker='o', label='Positive', linewidth=2, color='green')
plt.plot(daily_sentiment.index, daily_sentiment['NEGATIVE'], marker='o', label='Negative', linewidth=2, color='red')
plt.fill_between(daily_sentiment.index, daily_sentiment['POSITIVE'], alpha=0.3, color='green')
plt.fill_between(daily_sentiment.index, daily_sentiment['NEGATIVE'], alpha=0.3, color='red')

plt.title('Sentiment Trend Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
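
The chart above plots raw counts, which rise and fall with overall review volume. The architecture also lists anomaly detection, which the notebook does not implement. A minimal sketch, assuming a fixed-threshold rule is acceptable: compute the daily share of negative reviews and flag days above a cutoff. The column names follow the notebook's DataFrame, while the function names, the 0.5 threshold, and the `min_reviews` guard are illustrative assumptions.

```python
import pandas as pd

def daily_negative_share(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-day review count and fraction of reviews predicted NEGATIVE."""
    is_negative = frame["predicted_sentiment"].eq("NEGATIVE")
    grouped = is_negative.groupby(frame["date"])
    return pd.DataFrame({
        "reviews": grouped.size(),
        "negative_share": grouped.mean(),  # mean of booleans = share of True
    })

def flag_spikes(daily: pd.DataFrame, threshold: float = 0.5, min_reviews: int = 3):
    """Dates with enough reviews whose negative share exceeds the threshold."""
    mask = (daily["reviews"] >= min_reviews) & (daily["negative_share"] > threshold)
    return daily[mask].index.tolist()

# Tiny hand-made frame: day 1 is 25% negative, day 2 is 75% negative
demo = pd.DataFrame({
    "date": ["2024-01-01"] * 4 + ["2024-01-02"] * 4,
    "predicted_sentiment": ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE",
                            "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"],
})
daily = daily_negative_share(demo)
print(daily)
print(flag_spikes(daily))
```

A fixed threshold is the simplest possible rule. Comparing each day against a rolling baseline would adapt better to products whose normal negative share is high or low.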

---

## 🚀 Deployment Preparation

### 1. Build a Reusable Analysis API

In [None]:
# Create reusable analysis function
class CustomerFeedbackAnalyzer:
    """
    Customer Feedback Analyzer - reusable wrapper around the sentiment pipeline
    """
    def __init__(self):
        print("🔄 Initializing Customer Feedback Analyzer...")
        self.sentiment_analyzer = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=-1  # CPU for production
        )
        print("✅ Analyzer ready!")
    
    def analyze_sentiment(self, text):
        """
        Analyze sentiment of a single review
        Args:
            text: review text
        Returns:
            dict with sentiment and confidence
        """
        result = self.sentiment_analyzer(text)[0]
        return {
            'sentiment': result['label'],
            'confidence': round(result['score'], 4)
        }
    
    def batch_analyze(self, reviews):
        """
        Batch analyze multiple reviews
        Args:
            reviews: list of review texts
        Returns:
            list of analysis results
        """
        results = self.sentiment_analyzer(reviews)
        return [
            {
                'sentiment': r['label'],
                'confidence': round(r['score'], 4)
            }
            for r in results
        ]
    
    def get_summary_stats(self, results):
        """
        Calculate summary statistics
        Args:
            results: list of analysis results
        Returns:
            dict with summary statistics
        """
        total = len(results)
        positive = sum(1 for r in results if r['sentiment'] == 'POSITIVE')
        negative = total - positive  # the model is binary, so the rest is NEGATIVE
        
        # Guard against ZeroDivisionError when called with an empty list
        positive_pct = round(positive / total * 100, 2) if total else 0.0
        negative_pct = round(negative / total * 100, 2) if total else 0.0
        
        return {
            'total_reviews': total,
            'positive_count': positive,
            'negative_count': negative,
            'positive_percentage': positive_pct,
            'negative_percentage': negative_pct
        }

# Initialize analyzer
analyzer = CustomerFeedbackAnalyzer()

In [None]:
# Test the analyzer
test_reviews = [
    "This product is absolutely fantastic!",
    "Terrible experience, will never buy again.",
    "Good quality but shipping was slow."
]

print("🧪 Testing analyzer with sample reviews:\n")
results = analyzer.batch_analyze(test_reviews)

for review, result in zip(test_reviews, results):
    print(f"Review: {review}")
    print(f"Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.2%})")
    print("-" * 70)

# Get summary
summary = analyzer.get_summary_stats(results)
print("\n📊 Summary Statistics:")
print(json.dumps(summary, indent=2))
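
The synthetic dataset repeats the same 22 sentences many times, and real feedback also tends to contain duplicates (short reviews such as "Great product!"). As a sketch of one way to avoid redundant inference, the helper below runs the model once per distinct text and maps the results back to the original order. Here `predict_fn` stands for a callable such as `analyzer.batch_analyze`; `analyze_unique` and `counting_predict` are hypothetical names, and the stub exists only so the example runs without a model.

```python
def analyze_unique(texts, predict_fn):
    """Call predict_fn once per distinct text; return one result per input."""
    if not texts:
        return []
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    by_text = dict(zip(unique_texts, predict_fn(unique_texts)))
    return [by_text[text] for text in texts]

calls = []  # records how many texts each model call received

def counting_predict(batch):
    """Stand-in for the real analyzer that also tracks batch sizes."""
    calls.append(len(batch))
    return [{"sentiment": "POSITIVE", "confidence": 1.0} for _ in batch]

reviews = ["Great!", "Awful.", "Great!", "Great!"]
results = analyze_unique(reviews, counting_predict)
print(len(results), calls)  # 4 [2]
```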

### 2. Export the Analysis Results

In [None]:
# Export results to CSV
output_path = "./customer_feedback_analysis_results.csv"
df.to_csv(output_path, index=False)
print(f"✅ Results exported to: {output_path}")

# Export summary report to JSON
report_path = "./analysis_report.json"
with open(report_path, 'w', encoding='utf-8') as f:  # explicit encoding for portability
    json.dump(report, f, indent=2)
print(f"✅ Report exported to: {report_path}")

---

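## 📚 Summary and Extensions

### ✅ What You Learned

1. **End-to-end project workflow**:
   - Data collection and preparation
   - Applying NLP models
   - Analyzing and visualizing results
   - Generating business insights
   - Preparing for deployment

2. **Practical skills**:
   - Using the Hugging Face Pipeline in a production-style workflow
   - Batch processing large volumes of text
   - Keyword extraction and word cloud generation
   - Multi-dimensional data visualization
   - Designing a reusable analysis class

3. **Business value**:
   - Automated customer feedback analysis
   - Fast identification of problem areas
   - Support for data-driven decisions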

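### 🚀 Directions for Extension

1. **Integrate more NLP features**:
   - Named entity recognition (extract product names and brands)
   - Topic modeling (LDA, BERTopic)
   - Zero-shot classification (automatically categorize issue types)

2. **Build a Web API**:
   ```python
   # FastAPI example
   from fastapi import FastAPI
   from pydantic import BaseModel
   
   app = FastAPI()
   analyzer = CustomerFeedbackAnalyzer()  # class defined earlier in this notebook
   
   class ReviewRequest(BaseModel):  # hypothetical request model
       text: str  # review text sent in the JSON request body
   
   @app.post("/analyze")
   def analyze_review(request: ReviewRequest):
       return analyzer.analyze_sentiment(request.text)
   ```

3. **Real-time monitoring dashboard**:
   - Build an interactive dashboard with Streamlit or Dash
   - Use Plotly for dynamic charts
   - Set up alerts (notify when negative reviews exceed a threshold)

4. **Multilingual support**:
   - Use a multilingual model (XLM-RoBERTa)
   - Integrate a translation API
   - Support Chinese, Japanese, and other Asian languages

5. **Advanced analysis**:
   - Sentiment intensity (1 to 5 star granularity)
   - Emotion recognition (joy, anger, disappointment, and so on)
   - Anomaly detection (alerts on sudden bursts of negative reviews)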
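
The `topic` column in this notebook is assigned at random, so it says nothing about what a review is actually about. Zero-shot classification, listed above, is one way to derive topics from the text instead. The sketch below wraps any classifier callable that returns the result shape of the Hugging Face `zero-shot-classification` pipeline: a dict whose `labels` and `scores` lists are sorted by descending score. The names `classify_topic` and `stub_classifier`, the `min_score` cutoff, and the "Other" fallback are assumptions for illustration. The checkpoint named in the comment is a commonly used choice for this task, not something the notebook prescribes.

```python
TOPICS = ["Product Quality", "Shipping", "Customer Service", "Pricing", "Features"]

def classify_topic(text, classifier, labels=TOPICS, min_score=0.4):
    """Return the best-scoring topic, or "Other" when the top score is low."""
    result = classifier(text, candidate_labels=list(labels))
    best_label, best_score = result["labels"][0], result["scores"][0]
    return best_label if best_score >= min_score else "Other"

# With transformers installed, a real classifier could be created like this
# (it downloads a large model on first use):
#   from transformers import pipeline
#   classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def stub_classifier(text, candidate_labels):
    """Stand-in that mimics the pipeline's output shape so the example runs offline."""
    hit = "Shipping" if "shipping" in text.lower() else candidate_labels[0]
    rest = [label for label in candidate_labels if label != hit]
    scores = [0.9] + [0.1 / len(rest)] * len(rest)
    return {"sequence": text, "labels": [hit] + rest, "scores": scores}

print(classify_topic("Shipping took forever.", stub_classifier))  # Shipping
```

Passing the classifier in as an argument keeps the helper easy to test and lets the model be swapped without touching the calling code.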

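### 💼 Business Use Cases

| Industry | Use Case | Value |
|------|----------|------|
| **E-commerce** | Product review analysis | Quickly identify problem products and improve customer experience |
| **Food service** | Google/Yelp review monitoring | Respond to negative reviews promptly and protect brand reputation |
| **Travel** | Hotel and attraction review analysis | Improve service quality and competitiveness |
| **Finance** | Customer service conversation analysis | Assess service quality and train support staff |
| **SaaS** | User feedback analysis | Prioritize product iterations |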

---

## 🔗 Resources

- [Hugging Face Transformers](https://huggingface.co/docs/transformers/)
- [DistilBERT Model Card](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
- [WordCloud Documentation](https://github.com/amueller/word_cloud)
- [Plotly Python](https://plotly.com/python/)

---

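**Next section**: `10_進階技巧與優化.ipynb` (Advanced Techniques and Optimization) - model quantization, inference acceleration, deployment optimization ⚡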