# 9. Sentiment Analysis with Different Models

**Estimated Time**: ~2 hours

**Prerequisites**: Notebooks 1-8 (especially Zero-Shot Classification and Text Similarity)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand** why different models produce different sentiment results for the same text
2. **Compare** binary sentiment (positive/negative) vs. multi-class sentiment (5-star ratings)
3. **Analyze** how training data influences model behavior and biases
4. **Read** model cards to understand model capabilities and limitations
5. **Build** a multi-model sentiment dashboard that aggregates predictions

## Setup

Run this cell first. If you completed previous notebooks, you already have the core packages ready.

In [None]:
# Core imports
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
import numpy as np

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")

---

# Part 1: Conceptual Foundation

## What is Sentiment Analysis?

**In plain English**: Sentiment analysis determines the emotional tone or opinion expressed in a piece of text: is it positive, negative, or somewhere in between?

**Technical definition**: Sentiment analysis is a text classification task that assigns emotional polarity or intensity scores to text, using models trained on labeled datasets of opinions and reviews.

### Why Do Different Models Give Different Results?

```
THE OBSERVATION:
┌────────────────────────────────────────────────────────────────┐
│  Input: "The movie was okay, I guess."                         │
│                                                                │
│  Model A (Movie Reviews):   Negative (70%)                     │
│  Model B (Product Reviews): Neutral  (65%)                     │
│  Model C (Social Media):    Positive (55%)                     │
│                                                                │
│  Same text, completely different results! Why?                 │
└────────────────────────────────────────────────────────────────┘

THE REASON: TRAINING DATA SHAPES MODEL BEHAVIOR
┌────────────────────────────────────────────────────────────────┐
│  Each model learned from different examples:                   │
│                                                                │
│  Model A: Movie reviews (1-10 stars → binary)                  │
│     "okay" in movie context often means disappointing          │
│                                                                │
│  Model B: Product reviews (1-5 stars → 5 classes)              │
│     "okay" maps to middle rating (3 stars)                     │
│                                                                │
│  Model C: Tweets (casual language)                             │
│     "okay" can be understated positivity                       │
└────────────────────────────────────────────────────────────────┘
```
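
The effect can be reproduced in miniature without any neural network. The sketch below is a toy illustration, not how the models in this notebook work: it builds two word-count scorers from two tiny made-up corpora and shows that the word "okay" picks up a different polarity depending on how it was used in each corpus. The example sentences and the helpers `train_word_scores` and `score_text` are invented for this illustration.

```python
from collections import defaultdict

def train_word_scores(examples):
    """Learn a per-word polarity: (positive count - negative count) / total count."""
    pos = defaultdict(int)
    neg = defaultdict(int)
    for text, label in examples:
        for word in text.lower().split():
            if label == "pos":
                pos[word] += 1
            else:
                neg[word] += 1
    vocab = set(pos) | set(neg)
    return {w: (pos[w] - neg[w]) / (pos[w] + neg[w]) for w in vocab}

def score_text(text, word_scores):
    """Average the learned polarity of the words seen during training."""
    known = [word_scores[w] for w in text.lower().split() if w in word_scores]
    return sum(known) / len(known) if known else 0.0

# Made-up "movie review" corpus: "okay" appears in disappointed reviews
movie_corpus = [
    ("just okay and forgettable", "neg"),
    ("okay at best", "neg"),
    ("a brilliant film", "pos"),
]
# Made-up "tweet" corpus: "okay" appears as mild approval
tweet_corpus = [
    ("okay this is fun", "pos"),
    ("okay cool", "pos"),
    ("ugh terrible day", "neg"),
]

movie_scores = train_word_scores(movie_corpus)
tweet_scores = train_word_scores(tweet_corpus)

print("movie-trained:", score_text("okay", movie_scores))
print("tweet-trained:", score_text("okay", tweet_scores))
```

Real sentiment models are far more sophisticated, but the same principle applies: a word's learned polarity reflects how it was labeled in the training data.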

### Types of Sentiment Classification

```
BINARY SENTIMENT:
┌─────────────────────────────────────────────────────────────┐
│  Two classes: POSITIVE or NEGATIVE                          │
│                                                             │
│  "I loved this!"     → POSITIVE (98%)                       │
│  "Terrible waste."   → NEGATIVE (95%)                       │
│  "It was okay."      → ??? (Model must choose one)          │
│                                                             │
│  Pros: Simple, clear-cut                                    │
│  Cons: Loses nuance, forced choice on neutral text          │
└─────────────────────────────────────────────────────────────┘

MULTI-CLASS SENTIMENT (3 classes):
┌─────────────────────────────────────────────────────────────┐
│  Three classes: POSITIVE, NEUTRAL, NEGATIVE                 │
│                                                             │
│  "I loved this!"     → POSITIVE (95%)                       │
│  "Terrible waste."   → NEGATIVE (92%)                       │
│  "It was okay."      → NEUTRAL  (78%)                       │
│                                                             │
│  Pros: Captures "in-between" sentiment                      │
│  Cons: Still coarse-grained                                 │
└─────────────────────────────────────────────────────────────┘

FINE-GRAINED SENTIMENT (5 classes):
┌─────────────────────────────────────────────────────────────┐
│  Five classes: ★☆☆☆☆  ★★☆☆☆  ★★★☆☆  ★★★★☆  ★★★★★            │
│                                                             │
│  "Best thing ever!"  → ★★★★★ (5 stars, 89%)                 │
│  "Pretty good!"      → ★★★★☆ (4 stars, 72%)                 │
│  "It was okay."      → ★★★☆☆ (3 stars, 65%)                 │
│  "Not great."        → ★★☆☆☆ (2 stars, 58%)                 │
│  "Awful!"            → ★☆☆☆☆ (1 star,  91%)                 │
│                                                             │
│  Pros: Most nuanced, matches review systems                 │
│  Cons: Harder to train, boundaries between classes blur     │
└─────────────────────────────────────────────────────────────┘
```
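
The three schemes are related: a finer-grained label can be collapsed into a coarser one, but not the other way around. A minimal sketch, assuming the common (but not universal) convention that 3 stars counts as neutral; the helper names and the `neutral_goes_to` parameter are made up for this example.

```python
def stars_to_three_class(stars):
    """Collapse a 1-5 star rating into NEGATIVE / NEUTRAL / POSITIVE."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars <= 2:
        return "NEGATIVE"
    if stars == 3:
        return "NEUTRAL"
    return "POSITIVE"

def stars_to_binary(stars, neutral_goes_to="NEGATIVE"):
    """Collapse a 1-5 star rating into two classes.

    A 3-star rating has no natural home in a binary scheme, so the caller
    has to decide where it goes. That forced choice is the loss of nuance
    described in the diagram above.
    """
    three = stars_to_three_class(stars)
    return neutral_goes_to if three == "NEUTRAL" else three

for s in range(1, 6):
    print(s, stars_to_three_class(s), stars_to_binary(s))
```

Sending 3 stars to NEGATIVE by default is an arbitrary choice made for the sketch; a binary model makes the equivalent decision implicitly, based on its training data.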

### Connection to Previous Notebooks

| Notebook | What You Learned | Relevance to Sentiment |
|----------|------------------|------------------------|
| 6 (Zero-Shot) | NLI-based classification | Sentiment as hypothesis testing |
| 8 (Embeddings) | Comparing model outputs | Comparing sentiment models |
| **9 (This notebook)** | **Model differences** | **Why models disagree** |

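In Notebook 6, you saw how zero-shot classification reframes each candidate label as an NLI hypothesis. The sentiment models in this notebook take a more direct route: each one is fine-tuned with a classification head that outputs sentiment labels directly, so its label set and its behavior are fixed by the data it was trained on.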

### Understanding Model Cards

**Model cards** are documentation that accompanies ML models, describing:

```
MODEL CARD CONTENTS:
┌────────────────────────────────────────────────────────────────┐
│  1. MODEL DETAILS                                              │
│     - Architecture (BERT, RoBERTa, DistilBERT...)              │
│     - Base model it was fine-tuned from                        │
│                                                                │
│  2. TRAINING DATA                                              │
│     - What dataset was used (SST-2, IMDB, Amazon reviews...)   │
│     - How many examples                                        │
│     - What domain (movies, products, social media...)          │
│                                                                │
│  3. INTENDED USE                                               │
│     - What the model is good for                               │
│     - What it should NOT be used for                           │
│                                                                │
│  4. PERFORMANCE METRICS                                        │
│     - Accuracy on test sets                                    │
│     - Known limitations                                        │
│                                                                │
│  5. BIASES & LIMITATIONS                                       │
│     - Known biases in training data                            │
│     - Edge cases where model struggles                         │
└────────────────────────────────────────────────────────────────┘
```

**Always check the model card before using a model in production!**

### Key Terminology

| Term | Definition |
|------|------------|
| **Sentiment** | The emotional tone or opinion in text |
| **Polarity** | The direction of sentiment (positive/negative) |
| **Binary classification** | Two-class prediction (pos/neg) |
| **Multi-class** | Three or more classes (pos/neu/neg, 5-star) |
| **Fine-grained** | Sentiment with many levels (5+ classes) |
| **Model card** | Documentation describing model training and use |
| **Domain** | The subject area of training data (movies, products, etc.) |
| **Ensemble** | Combining predictions from multiple models |

### Check Your Understanding

Before moving on, try to answer these questions (answers at the end):

1. Why might the same text get different sentiment scores from different models?
   - A) Models use different random seeds
   - B) Models are trained on different data from different domains
   - C) One model is broken

2. What is the main disadvantage of binary sentiment classification?
   - A) It's too slow
   - B) It forces neutral text into positive or negative
   - C) It requires too much data

3. What should you check in a model card before using a sentiment model?
   - A) The model's favorite color
   - B) The training data domain and known limitations
   - C) How many downloads it has

4. What is an "ensemble" approach to sentiment analysis?
   - A) Using one very large model
   - B) Combining predictions from multiple models
   - C) Training a model on ensemble music reviews

---

# Part 2: Basic Implementation

## Loading Different Sentiment Models

Let's load several sentiment models with different characteristics:

In [None]:
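# Model 1: DistilBERT fine-tuned on SST-2 (movie reviews, binary).
# This is the checkpoint the pipeline falls back to when no model is given;
# naming it explicitly avoids the "no model supplied" warning and pins the results.
print("Loading Model 1: Default sentiment (binary, movie reviews)...")
sentiment_default = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
print("  Loaded: distilbert-base-uncased-finetuned-sst-2-english")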

# Model 2: RoBERTa trained on tweets (3-class: positive/neutral/negative)
print("\nLoading Model 2: Twitter RoBERTa (3-class, social media)...")
sentiment_twitter = pipeline(
    "sentiment-analysis", 
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)
print("  Loaded: cardiffnlp/twitter-roberta-base-sentiment-latest")

# Model 3: Fine-grained sentiment (5-star ratings)
print("\nLoading Model 3: 5-star sentiment (product reviews)...")
sentiment_5star = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment"
)
print("  Loaded: nlptown/bert-base-multilingual-uncased-sentiment")

print("\nAll models loaded!")

### Model Comparison Table

| Model | Training Data | Classes | Best For |
|-------|---------------|---------|----------|
| Default (DistilBERT-SST2) | Stanford Sentiment Treebank, SST-2 (movie reviews) | 2 (pos/neg) | Movie reviews |
| Twitter-RoBERTa | Pretrained on ~124M tweets, fine-tuned on TweetEval sentiment | 3 (pos/neu/neg) | Social media |
| BERT-Multilingual | Product reviews | 5 (1-5 stars) | Product/service reviews |

In [None]:
# Test with a simple sentence
test_text = "I absolutely loved this product, it exceeded my expectations!"

print(f"Test text: \"{test_text}\"")
print("="*70)

# Get predictions from each model
result_default = sentiment_default(test_text)[0]
result_twitter = sentiment_twitter(test_text)[0]
result_5star = sentiment_5star(test_text)[0]

print(f"\nModel 1 (Binary/Movies):   {result_default['label']:15s} ({result_default['score']:.1%})")
print(f"Model 2 (Twitter 3-class): {result_twitter['label']:15s} ({result_twitter['score']:.1%})")
print(f"Model 3 (5-star):          {result_5star['label']:15s} ({result_5star['score']:.1%})")

### Understanding Different Label Formats

Each model uses different label conventions:

In [None]:
def explain_label(model_name, label):
    """Explain what different model labels mean."""
    explanations = {
        'default': {
            'POSITIVE': 'Positive sentiment detected',
            'NEGATIVE': 'Negative sentiment detected',
        },
        'twitter': {
            'positive': 'Positive sentiment',
            'neutral': 'Neutral/no strong sentiment',
            'negative': 'Negative sentiment',
        },
        '5star': {
            '1 star': 'Very negative (1/5)',
            '2 stars': 'Negative (2/5)',
            '3 stars': 'Neutral/mixed (3/5)',
            '4 stars': 'Positive (4/5)',
            '5 stars': 'Very positive (5/5)',
        }
    }
    return explanations.get(model_name, {}).get(label, 'Unknown label')


print("Label Format Reference:")
print("="*60)

print("\nModel 1 (Binary):")
for label in ['POSITIVE', 'NEGATIVE']:
    print(f"  {label:15s} → {explain_label('default', label)}")

print("\nModel 2 (Twitter 3-class):")
for label in ['positive', 'neutral', 'negative']:
    print(f"  {label:15s} → {explain_label('twitter', label)}")

print("\nModel 3 (5-star):")
for label in ['1 star', '2 stars', '3 stars', '4 stars', '5 stars']:
    print(f"  {label:15s} → {explain_label('5star', label)}")
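
Because the three label formats differ, comparing top predictions across models requires a shared vocabulary first. A minimal sketch of one way to do that, covering only the label strings listed above; `label_to_polarity` is a hypothetical helper, and treating 3 stars as neutral is an assumption. Checkpoints that expose generic names such as `LABEL_0` would need their own mapping.

```python
import re

def label_to_polarity(label):
    """Map a label in any of the three formats to 'negative', 'neutral', or 'positive'."""
    text = label.strip().lower()
    if text in ("positive", "neutral", "negative"):
        # Covers 'POSITIVE'/'NEGATIVE' and the lowercase 3-class labels
        return text
    match = re.fullmatch(r"([1-5]) stars?", text)
    if match:
        stars = int(match.group(1))
        if stars <= 2:
            return "negative"
        return "neutral" if stars == 3 else "positive"
    raise ValueError(f"Unrecognized label: {label!r}")

for label in ["POSITIVE", "neutral", "1 star", "3 stars", "5 stars"]:
    print(f"{label:10s} -> {label_to_polarity(label)}")
```

Raising on unknown labels, rather than guessing, makes it obvious when a new model with a different label scheme is added.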

### Comparing All Three Models on Sample Texts

In [None]:
def compare_models(text, models_dict):
    """
    Compare sentiment predictions from multiple models.
    
    Args:
        text: Input text to analyze
        models_dict: Dict of {name: pipeline}
        
    Returns:
        Dict of results
    """
    results = {'text': text}
    
    for name, model in models_dict.items():
        prediction = model(text)[0]
        results[name] = {
            'label': prediction['label'],
            'score': prediction['score']
        }
    
    return results


def format_comparison(result):
    """Format comparison result for display."""
    lines = []
    lines.append(f"\nText: \"{result['text']}\"")
    lines.append("-" * 60)
    
    for name, pred in result.items():
        if name == 'text':
            continue
        bar = '*' * int(pred['score'] * 20)
        lines.append(f"  {name:20s}: {pred['label']:15s} {pred['score']:.1%} {bar}")
    
    return '\n'.join(lines)


# Set up models dict
models = {
    'Binary (Movies)': sentiment_default,
    'Twitter (3-class)': sentiment_twitter,
    '5-Star (Products)': sentiment_5star,
}

# Test texts covering various sentiments
test_texts = [
    "This is absolutely amazing! Best purchase ever!",
    "Terrible experience, I want my money back.",
    "It was okay, nothing special.",
    "Pretty good, but could be better.",
    "Not bad, I guess.",
]

print("Model Comparison Results:")
print("="*70)

for text in test_texts:
    result = compare_models(text, models)
    print(format_comparison(result))

---

## Exercise 1: Model Behavior Exploration (Guided)

**Difficulty**: Basic | **Time**: 10-15 minutes

**Your task**: Explore how the models handle different types of text.

### Step 1: Test with ambiguous sentences

In [None]:
# Ambiguous texts that models might interpret differently
ambiguous_texts = [
    "I didn't hate it.",  # Negative phrasing, neutral/positive meaning
    "It could have been worse.",  # Backhanded compliment
    "Interesting.",  # Single word, context-dependent
    "Well, that happened.",  # Sarcasm-like
    "I've seen better, I've seen worse.",  # Mixed
]

print("Testing Ambiguous Texts:")
print("="*70)

for text in ambiguous_texts:
    result = compare_models(text, models)
    print(format_comparison(result))

### Step 2: Notice the patterns

**Questions to consider:**
- Which model is most likely to say "neutral"?
- Which model seems most confident (highest scores)?
- Do any models struggle with negation ("didn't hate")?

In [None]:
# Let's analyze the pattern more systematically
def analyze_model_tendencies(texts, models_dict):
    """Analyze how models tend to classify texts."""
    stats = {name: {'labels': [], 'scores': []} for name in models_dict.keys()}
    
    for text in texts:
        for name, model in models_dict.items():
            pred = model(text)[0]
            stats[name]['labels'].append(pred['label'])
            stats[name]['scores'].append(pred['score'])
    
    print("Model Tendency Analysis:")
    print("="*60)
    
    for name, data in stats.items():
        labels = data['labels']
        scores = data['scores']
        
        # Count label distribution
        label_counts = {}
        for label in labels:
            label_counts[label] = label_counts.get(label, 0) + 1
        
        print(f"\n{name}:")
        print(f"  Average confidence: {np.mean(scores):.1%}")
        print(f"  Label distribution:")
        for label, count in sorted(label_counts.items(), key=lambda x: -x[1]):
            pct = count / len(labels) * 100
            print(f"    {label:15s}: {count:2d} ({pct:.0f}%)")

# Combine all texts for analysis
all_test_texts = test_texts + ambiguous_texts
analyze_model_tendencies(all_test_texts, models)
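
One caveat when reading the average confidence figures: the models do not share a baseline. A binary model's top score can never fall below 50%, while a 5-class model's top score can be as low as 20%, so raw confidences are not directly comparable. The sketch below shows one possible correction, rescaling a score so that 0 means "no better than a uniform guess" and 1 means "certain". `chance_adjusted_confidence` is a hypothetical helper, not part of the `transformers` API, and the 60% input is a made-up example.

```python
def chance_adjusted_confidence(top_score, num_classes):
    """Rescale a top-class probability so 0 = uniform guess and 1 = certain."""
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    chance = 1.0 / num_classes
    return (top_score - chance) / (1.0 - chance)

# Made-up example: the same raw top score of 60% from each kind of model
for name, k in [("Binary", 2), ("Twitter (3-class)", 3), ("5-Star", 5)]:
    adjusted = chance_adjusted_confidence(0.60, k)
    print(f"{name:18s} raw 60% -> adjusted {adjusted:.0%}")
```

This is only a rough adjustment; it does not account for how well calibrated each model's probabilities are.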

### Step 3: Try your own test cases

In [None]:
# YOUR CODE HERE
# Add your own test texts and compare models

my_test_texts = [
    "Your text here",
    # Add more...
]

# Uncomment to run:
# for text in my_test_texts:
#     result = compare_models(text, models)
#     print(format_comparison(result))

---

# Part 3: Intermediate Exploration

## Getting All Class Probabilities

So far we've only seen the top prediction. Let's see all class probabilities:

In [None]:
# Create pipelines that return all scores
sentiment_default_full = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # Same checkpoint as Model 1
    top_k=None  # Return all classes
)

sentiment_twitter_full = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    top_k=None
)

sentiment_5star_full = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    top_k=None
)

print("Full probability pipelines loaded!")

In [None]:
def show_all_probabilities(text):
    """Show full probability distribution from each model."""
    print(f"\nText: \"{text}\"")
    print("="*70)
    
    # Binary model
    print("\nBinary Model (distilbert-sst2):")
    probs = sentiment_default_full(text)[0]
    for p in sorted(probs, key=lambda x: -x['score']):
        bar = '*' * int(p['score'] * 40)
        print(f"  {p['label']:15s}: {p['score']:.1%} {bar}")
    
    # Twitter 3-class
    print("\nTwitter 3-class Model:")
    probs = sentiment_twitter_full(text)[0]
    for p in sorted(probs, key=lambda x: -x['score']):
        bar = '*' * int(p['score'] * 40)
        print(f"  {p['label']:15s}: {p['score']:.1%} {bar}")
    
    # 5-star
    print("\n5-Star Model:")
    probs = sentiment_5star_full(text)[0]
    for p in sorted(probs, key=lambda x: -x['score']):
        bar = '*' * int(p['score'] * 40)
        print(f"  {p['label']:15s}: {p['score']:.1%} {bar}")


# Test with a neutral-ish text
show_all_probabilities("It was okay, nothing special.")

In [None]:
# Compare probability distributions for different sentiment levels
spectrum_texts = [
    "This is the worst thing I've ever experienced. Absolutely terrible.",
    "Not great, pretty disappointing actually.",
    "It's fine, I guess. Nothing special.",
    "Pretty good! I enjoyed it.",
    "Absolutely incredible! Best ever! Highly recommend!",
]

print("Sentiment Spectrum Analysis:")
for text in spectrum_texts:
    show_all_probabilities(text)
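
With the full distribution available, a model's uncertainty can be summarized in a single number. A minimal sketch using normalized Shannon entropy, which is one common choice among several; `normalized_entropy` is a hypothetical helper and the two distributions are made up, written in the same list-of-dicts shape the pipelines above return.

```python
import math

def normalized_entropy(predictions):
    """Entropy of the class probabilities, scaled so 0 = certain and 1 = uniform."""
    if len(predictions) < 2:
        return 0.0
    entropy = -sum(
        p["score"] * math.log(p["score"])
        for p in predictions
        if p["score"] > 0  # Skip zeros: log(0) is undefined and they contribute nothing
    )
    # Dividing by log(number of classes) makes 2-, 3- and 5-class models comparable
    return entropy / math.log(len(predictions))

# Made-up distributions for illustration
confident = [{"label": "POSITIVE", "score": 0.98}, {"label": "NEGATIVE", "score": 0.02}]
torn = [{"label": "POSITIVE", "score": 0.55}, {"label": "NEGATIVE", "score": 0.45}]

print(f"confident: {normalized_entropy(confident):.2f}")
print(f"torn:      {normalized_entropy(torn):.2f}")
```

In the notebook, the same function could be applied to `sentiment_5star_full(text)[0]` to see how spread out the star probabilities are for a given text.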

### Normalizing to a Common Scale

To compare models fairly, let's convert all outputs to a common -1 to +1 scale:

In [None]:
def normalize_sentiment(model_type, predictions):
    """
    Convert model predictions to a -1 to +1 scale.
    
    -1 = Most negative
     0 = Neutral
    +1 = Most positive
    """
    probs = {p['label']: p['score'] for p in predictions}
    
    if model_type == 'binary':
        # Binary: POSITIVE and NEGATIVE
        pos = probs.get('POSITIVE', 0)
        neg = probs.get('NEGATIVE', 0)
        return pos - neg  # Range: -1 to +1
    
    elif model_type == 'twitter':
        # Twitter: positive, neutral, negative
        pos = probs.get('positive', 0)
        neu = probs.get('neutral', 0)
        neg = probs.get('negative', 0)
        return pos - neg  # Neutral is not used directly, but a high neutral probability shrinks both terms toward 0
    
    elif model_type == '5star':
        # 5-star: convert to weighted average
        weights = {
            '1 star': -1.0,
            '2 stars': -0.5,
            '3 stars': 0.0,
            '4 stars': 0.5,
            '5 stars': 1.0,
        }
        score = sum(probs.get(label, 0) * weight for label, weight in weights.items())
        return score
    
    return 0


def get_normalized_scores(text):
    """Get normalized sentiment scores from all models."""
    results = {}
    
    results['Binary'] = normalize_sentiment(
        'binary', 
        sentiment_default_full(text)[0]
    )
    results['Twitter'] = normalize_sentiment(
        'twitter', 
        sentiment_twitter_full(text)[0]
    )
    results['5-Star'] = normalize_sentiment(
        '5star', 
        sentiment_5star_full(text)[0]
    )
    
    return results


# Compare normalized scores
print("Normalized Sentiment Scores (-1 to +1):")
print("="*70)

for text in spectrum_texts:
    scores = get_normalized_scores(text)
    avg = np.mean(list(scores.values()))
    
    print(f"\n\"{text[:50]}...\"" if len(text) > 50 else f"\n\"{text}\"")
    print(f"  Binary:  {scores['Binary']:+.3f}")
    print(f"  Twitter: {scores['Twitter']:+.3f}")
    print(f"  5-Star:  {scores['5-Star']:+.3f}")
    print(f"  Average: {avg:+.3f}")
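
The 5-star branch is easier to reason about once you notice that its weights are simply (stars - 3) / 2, so the normalized score equals (expected stars - 3) / 2. The sketch below checks that identity on a made-up probability distribution; the numbers are illustrative, not model output.

```python
# Made-up probability distribution over star ratings (sums to 1)
probs = {1: 0.05, 2: 0.10, 3: 0.50, 4: 0.25, 5: 0.10}

# The weights used for the 5-star model in normalize_sentiment
weights = {1: -1.0, 2: -0.5, 3: 0.0, 4: 0.5, 5: 1.0}

weighted_score = sum(probs[s] * weights[s] for s in probs)
expected_stars = sum(probs[s] * s for s in probs)

print(f"Weighted score:           {weighted_score:+.3f}")
print(f"Expected stars:           {expected_stars:.2f}")
print(f"(expected stars - 3) / 2: {(expected_stars - 3) / 2:+.3f}")
```

Both routes give the same number, which means a 5-star score of 0 corresponds to an expected rating of exactly 3 stars, and each half point on the normalized scale is one star.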

### Visualizing Model Agreement

In [None]:
def visualize_sentiment_comparison(text):
    """Create a visual comparison of model sentiments."""
    scores = get_normalized_scores(text)
    
    print(f"\n\"{text}\"")
    print()
    
    # Scale: -1 to +1, mapped to 0 to 40 characters
    def score_to_position(score):
        return int((score + 1) * 20)  # 0 to 40
    
    # Draw scale, indented to line up with the 41-character bars below
    pad = ' ' * 13  # 2 spaces + 10-character name column + 1 space
    print(pad + "Negative".ljust(16) + "Neutral".center(9) + "Positive".rjust(16))
    print(pad + "◄" + "─" * 19 + "┼" + "─" * 19 + "►")
    print(pad + "-1".ljust(20) + "0" + "+1".rjust(20))
    print()
    
    for name, score in scores.items():
        pos = score_to_position(score)
        line = [' '] * 41
        line[20] = '│'  # Center marker
        line[pos] = '●'
        
        print(f"  {name:10s} {''.join(line)} ({score:+.2f})")
    
    # Show agreement metric
    values = list(scores.values())
    spread = max(values) - min(values)
    agreement = 1 - (spread / 2)  # 0 to 1 scale
    
    print()
    print(f"  Model Agreement: {agreement:.0%}" + 
          (" (High)" if agreement > 0.7 else " (Low)" if agreement < 0.4 else " (Medium)"))


# Test visualization
test_vis_texts = [
    "I love this so much!",
    "It's fine.",
    "Worst experience ever.",
    "I didn't hate it.",
]

print("Visual Sentiment Comparison:")
print("="*70)

for text in test_vis_texts:
    visualize_sentiment_comparison(text)
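
The agreement figure printed above is 1 - spread / 2, where spread is the gap between the highest and lowest normalized score. A short worked example on made-up scores shows how it behaves at the extremes; `agreement_from_scores` is a standalone restatement of that formula written for this illustration.

```python
def agreement_from_scores(scores):
    """Agreement = 1 - spread / 2, with scores on the -1..+1 scale."""
    values = list(scores.values())
    spread = max(values) - min(values)  # Between 0 and 2
    return 1 - spread / 2

# Made-up normalized scores, chosen to hit the extremes and the middle
examples = {
    "identical": {"Binary": 0.8, "Twitter": 0.8, "5-Star": 0.8},
    "one outlier": {"Binary": 0.8, "Twitter": 0.7, "5-Star": -0.2},
    "opposite": {"Binary": 1.0, "Twitter": 0.0, "5-Star": -1.0},
}

for name, scores in examples.items():
    print(f"{name:12s} agreement = {agreement_from_scores(scores):.0%}")
```

Because only the two most extreme scores enter the formula, a single outlier lowers agreement just as much as a genuine split would.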

---

## Exercise 2: Find Edge Cases (Semi-guided)

**Difficulty**: Intermediate | **Time**: 15-20 minutes

**Your task**: Find texts where models strongly disagree.

**Hints**:
- Try negation ("I don't think it's bad")
- Try sarcasm ("Oh great, another delay")
- Try domain-specific language
- Try mixed sentiment ("Good product, terrible service")

In [None]:
def find_disagreements(texts, threshold=0.5):
    """
    Find texts where models disagree significantly.
    
    Args:
        texts: List of texts to test
        threshold: Minimum spread to count as disagreement
        
    Returns:
        List of (text, scores, spread) for disagreements
    """
    disagreements = []
    
    for text in texts:
        scores = get_normalized_scores(text)
        values = list(scores.values())
        spread = max(values) - min(values)
        
        if spread >= threshold:
            disagreements.append((text, scores, spread))
    
    # Sort by spread (highest first)
    disagreements.sort(key=lambda x: x[2], reverse=True)
    
    return disagreements


# Test with challenging texts
challenging_texts = [
    # Negation
    "I don't think it's bad.",
    "Not the worst I've seen.",
    "I can't say I disliked it.",
    
    # Sarcasm/Irony
    "Oh great, another meeting.",
    "Wow, what a surprise. Not.",
    "Yeah, that's exactly what I wanted.",
    
    # Mixed sentiment
    "Great product, terrible customer service.",
    "Loved the food, hated the wait.",
    "Beautiful design but falls apart quickly.",
    
    # Domain-specific
    "The plot twist was sick!",  # Slang positive
    "This code is wicked complex.",  # Slang
    "That's a bad movie in the best way.",  # "So bad it's good"
]

print("Searching for Model Disagreements:")
print("="*70)

disagreements = find_disagreements(challenging_texts, threshold=0.4)

if disagreements:
    print(f"\nFound {len(disagreements)} texts with significant disagreement:\n")
    for text, scores, spread in disagreements:
        visualize_sentiment_comparison(text)
        print(f"  Spread: {spread:.2f}")
        print()
else:
    print("No significant disagreements found. Try more challenging texts!")
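
Spread alone does not distinguish "both positive, but to different degrees" from "one positive, one negative", and the second case is usually the more interesting edge case. A minimal sketch of a complementary check; `polarity_conflict` and its `dead_zone` parameter (the band around 0 treated as neutral) are assumptions made for this example, not part of the exercise.

```python
def polarity_conflict(scores, dead_zone=0.1):
    """Return True if one model is clearly positive and another clearly negative."""
    has_positive = any(s > dead_zone for s in scores.values())
    has_negative = any(s < -dead_zone for s in scores.values())
    return has_positive and has_negative

# Made-up normalized scores: same spread (0.5), different situations
same_side = {"Binary": 0.9, "Twitter": 0.4, "5-Star": 0.6}
opposite_sides = {"Binary": 0.25, "Twitter": -0.25, "5-Star": 0.0}

print("same side:     ", polarity_conflict(same_side))
print("opposite sides:", polarity_conflict(opposite_sides))
```

In the notebook, the scores dict would come from `get_normalized_scores(text)`, so the check can run alongside `find_disagreements`.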

In [None]:
# YOUR CODE HERE
# Try to find texts with maximum disagreement

my_challenging_texts = [
    "Your challenging text here",
    # Add more...
]

# Uncomment to test:
# for text in my_challenging_texts:
#     visualize_sentiment_comparison(text)

---

# Part 4: Advanced Topics

## Building a Sentiment Ensemble

Combine multiple models for more robust predictions:

In [None]:
class SentimentEnsemble:
    """
    Combines multiple sentiment models for more robust predictions.
    """
    
    def __init__(self):
        """Initialize ensemble with multiple models."""
        self.models = {}
        self.model_types = {}
        
    def add_model(self, name, model, model_type):
        """
        Add a model to the ensemble.
        
        Args:
            name: Display name for the model
            model: Pipeline with top_k=None
            model_type: 'binary', 'twitter', or '5star' for normalization
        """
        self.models[name] = model
        self.model_types[name] = model_type
    
    def predict(self, text, strategy='average'):
        """
        Get ensemble prediction.
        
        Args:
            text: Input text
            strategy: 'average', 'majority', or 'weighted'
            
        Returns:
            dict with ensemble and individual predictions
        """
        individual_scores = {}
        
        for name, model in self.models.items():
            preds = model(text)[0]
            normalized = normalize_sentiment(self.model_types[name], preds)
            individual_scores[name] = normalized
        
        # Calculate ensemble score
        scores_list = list(individual_scores.values())
        
        if strategy == 'average':
            ensemble_score = np.mean(scores_list)
        elif strategy == 'majority':
            # Count positive vs negative
            positives = sum(1 for s in scores_list if s > 0)
            negatives = sum(1 for s in scores_list if s < 0)
            if positives > negatives:
                ensemble_score = abs(np.mean([s for s in scores_list if s > 0]))
            elif negatives > positives:
                ensemble_score = -abs(np.mean([s for s in scores_list if s < 0]))
            else:
                ensemble_score = 0
        else:  # 'weighted' is not implemented yet; fall back to the plain average
            ensemble_score = np.mean(scores_list)
        
        # Determine label
        if ensemble_score > 0.3:
            label = 'POSITIVE'
        elif ensemble_score < -0.3:
            label = 'NEGATIVE'
        else:
            label = 'NEUTRAL'
        
        # Calculate agreement
        spread = max(scores_list) - min(scores_list)
        agreement = 1 - (spread / 2)
        
        return {
            'text': text,
            'ensemble_score': ensemble_score,
            'ensemble_label': label,
            'agreement': agreement,
            'individual_scores': individual_scores,
            'strategy': strategy,
        }
    
    def format_prediction(self, result):
        """Format prediction for display."""
        width = 66  # Inner width of the box

        def row(content):
            # Pad or truncate so every row lines up with the border
            return "│" + content[:width].ljust(width) + "│"

        lines = []
        lines.append("┌" + "─" * width + "┐")
        lines.append(row(f" Text: {result['text'][:58]}"))
        lines.append("├" + "─" * width + "┤")
        
        # Individual scores
        lines.append(row(" Individual Models:"))
        for name, score in result['individual_scores'].items():
            bar_pos = int((score + 1) * 15)  # 0 to 30
            bar = [' '] * 31
            bar[15] = '│'
            bar[bar_pos] = '●'
            lines.append(row(f"   {name:18s} {''.join(bar)} {score:+.2f}"))
        
        lines.append("├" + "─" * width + "┤")
        
        # Ensemble result
        score = result['ensemble_score']
        label = result['ensemble_label']
        agreement = result['agreement']
        
        lines.append(row(f" ENSEMBLE RESULT: {label:10s}  Score: {score:+.2f}"))
        lines.append(row(f" Model Agreement: {agreement:.0%}"))
        
        lines.append("└" + "─" * width + "┘")
        
        return '\n'.join(lines)


# Create ensemble
ensemble = SentimentEnsemble()
ensemble.add_model('Binary (SST2)', sentiment_default_full, 'binary')
ensemble.add_model('Twitter', sentiment_twitter_full, 'twitter')
ensemble.add_model('5-Star', sentiment_5star_full, '5star')

print("Sentiment Ensemble created with 3 models!")

In [None]:
# Test the ensemble
test_ensemble_texts = [
    "Absolutely loved it! Highly recommend!",
    "Terrible waste of money.",
    "It was okay, nothing special.",
    "I didn't hate it.",
    "Great product but awful customer service.",
]

print("Ensemble Predictions:")
print("="*70)

for text in test_ensemble_texts:
    result = ensemble.predict(text)
    print(ensemble.format_prediction(result))
    print()
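
The `predict` docstring lists a `'weighted'` strategy, but the class currently treats it the same as `'average'`. One way it could be filled in is sketched below: each model's normalized score is weighted by its own magnitude, so models that lean strongly one way count for more than models sitting near neutral. This is an illustrative assumption about what "weighted by confidence" might mean, not the notebook's definition, and `confidence_weighted_score` is a hypothetical helper.

```python
def confidence_weighted_score(individual_scores):
    """Average normalized scores, weighting each model by the size of its score."""
    scores = list(individual_scores.values())
    total_weight = sum(abs(s) for s in scores)
    if total_weight == 0:
        return 0.0  # Every model is exactly neutral
    return sum(s * abs(s) for s in scores) / total_weight

# Made-up normalized scores: one strong opinion, two weak ones
example = {"Binary (SST2)": 0.9, "Twitter": 0.1, "5-Star": -0.1}

plain_average = sum(example.values()) / len(example)
print(f"Plain average:       {plain_average:+.2f}")
print(f"Confidence-weighted: {confidence_weighted_score(example):+.2f}")
```

The input has the same shape as `result['individual_scores']`. Note the trade-off: this weighting amplifies whichever model is most extreme, which helps when that model is right and hurts when it is overconfident.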

### When to Use Ensembles

| Scenario | Ensemble? | Why |
|----------|-----------|-----|
| High-stakes decisions | Yes | Reduces single-model errors |
| Mixed domains | Yes | Different models for different text types |
| Speed-critical | No | Multiple models = slower |
| Simple classification | No | One good model is usually enough |
| Detecting edge cases | Yes | Disagreement signals uncertainty |
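
The last row of the table can be turned into a simple routing rule: accept the ensemble label when the models broadly agree, and set the text aside when they do not. A minimal sketch, assuming a result dict shaped like the output of `SentimentEnsemble.predict`; `route_prediction`, the `min_agreement` threshold of 0.6, and the `NEEDS_REVIEW` label are made up for this example.

```python
def route_prediction(result, min_agreement=0.6):
    """Accept the ensemble label only when the models broadly agree."""
    if result["agreement"] >= min_agreement:
        return result["ensemble_label"]
    return "NEEDS_REVIEW"

# Hand-built results in the same shape as SentimentEnsemble.predict output
clear_case = {"ensemble_label": "POSITIVE", "agreement": 0.92}
edge_case = {"ensemble_label": "NEUTRAL", "agreement": 0.35}

print(route_prediction(clear_case))
print(route_prediction(edge_case))
```

A sensible threshold depends on the application; it is worth tuning against a handful of texts whose correct label is known.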

## Domain-Specific Considerations

Different domains have different sentiment patterns:

In [None]:
# Domain-specific test cases
domain_texts = {
    'Movie Reviews': [
        "The cinematography was stunning but the plot was thin.",
        "Oscar-worthy performance!",
        "A total flop at the box office.",
    ],
    'Product Reviews': [
        "Works as advertised. Good value.",
        "Broke after 2 weeks. Cheap quality.",
        "Exactly what I needed!",
    ],
    'Social Media': [
        "lol this is so random 😂",
        "ugh monday again 🙄",
        "this slaps! 🔥",
    ],
    'Financial News': [
        "Stock plunged 15% after earnings miss.",
        "Company announces record quarterly profits.",
        "Market remained flat amid uncertainty.",
    ],
}

print("Domain-Specific Analysis:")
print("="*70)

for domain, texts in domain_texts.items():
    print(f"\n{'='*70}")
    print(f"DOMAIN: {domain}")
    print("="*70)
    
    for text in texts:
        result = ensemble.predict(text)
        print(f"\n\"{text}\"")
        print(f"  Ensemble: {result['ensemble_label']:10s} ({result['ensemble_score']:+.2f})")
        print(f"  Agreement: {result['agreement']:.0%}")
        
        # Show which model is most confident
        scores = result['individual_scores']
        most_confident = max(scores.keys(), key=lambda k: abs(scores[k]))
        print(f"  Most confident: {most_confident}")

---

## Exercise 3: Model Selection Advisor (Independent)

**Difficulty**: Advanced | **Time**: 15-20 minutes

**Your task**: Build a function that recommends which sentiment model to use based on the input text characteristics.

**Requirements**:
1. Detect domain indicators (e.g., movie terms, product terms, emojis)
2. Recommend the most appropriate model
3. Explain the recommendation

In [None]:
# YOUR CODE HERE

def recommend_model(text):
    """
    Recommend the best sentiment model for a given text.
    
    Args:
        text: Input text to analyze
        
    Returns:
        dict with recommendation and explanation
    """
    text_lower = text.lower()
    
    # Domain indicators
    movie_indicators = ['movie', 'film', 'actor', 'actress', 'director', 'oscar', 
                        'cinema', 'scene', 'plot', 'character', 'sequel']
    product_indicators = ['product', 'quality', 'price', 'shipping', 'delivery',
                         'bought', 'purchase', 'worth', 'money', 'item']
    social_indicators = ['lol', 'lmao', 'omg', 'bruh', 'ngl', 'tbh', 'idk']
    
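    # Check for emojis (simplified: common emoji/symbol code-point ranges only).
    # A plain non-ASCII test would also match accented letters and curly quotes.
    has_emoji = any(
        0x1F300 <= ord(char) <= 0x1FAFF or 0x2600 <= ord(char) <= 0x27BF
        for char in text
    )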
    
    # Count indicator matches
    movie_score = sum(1 for ind in movie_indicators if ind in text_lower)
    product_score = sum(1 for ind in product_indicators if ind in text_lower)
    social_score = sum(1 for ind in social_indicators if ind in text_lower)
    
    if has_emoji:
        social_score += 2  # Emojis suggest social media
    
    # Short informal text suggests social media
    if len(text) < 50 and not any([movie_score, product_score]):
        social_score += 1
    
    # Make recommendation
    scores = {
        'Binary (Movies)': movie_score + 1,  # +1 as it's the default
        'Twitter (3-class)': social_score,
        '5-Star (Products)': product_score,
    }
    
    recommended = max(scores.keys(), key=lambda k: scores[k])
    
    # Generate explanation
    explanations = {
        'Binary (Movies)': 'Best for general/movie-related text with clear sentiment.',
        'Twitter (3-class)': 'Best for informal text, social media, or text with emojis.',
        '5-Star (Products)': 'Best for product/service reviews with nuanced ratings.',
    }
    
    return {
        'text': text,
        'recommended_model': recommended,
        'explanation': explanations[recommended],
        'confidence_scores': scores,
    }


# Test the advisor
advisor_texts = [
    "The movie's cinematography was breathtaking!",
    "Great product, fast shipping, would buy again!",
    "lol this is so funny 😂😂",
    "Not bad for the price.",
    "The sequel was even better than the original film!",
]

print("Model Recommendation Advisor:")
print("="*70)

for text in advisor_texts:
    rec = recommend_model(text)
    print(f"\nText: \"{text}\"")
    print(f"  Recommended: {rec['recommended_model']}")
    print(f"  Reason: {rec['explanation']}")

---

# Part 5: Mini-Project

## Project: Multi-Model Sentiment Dashboard

**Scenario**: You're building a content moderation system that analyzes user reviews and social media posts. You need to understand sentiment from multiple perspectives and flag cases where models strongly disagree.

**Your goal**: Build a `SentimentDashboard` class that:
1. Analyzes text with multiple models
2. Provides a summary with agreement metrics
3. Flags uncertain or controversial predictions
4. Generates actionable insights

In [None]:
# MINI-PROJECT: Multi-Model Sentiment Dashboard
# ============================================

class SentimentDashboard:
    """
    A comprehensive sentiment analysis dashboard using multiple models.
    """
    
    def __init__(self, agreement_threshold=0.6, uncertainty_threshold=0.3):
        """
        Initialize the dashboard.
        
        Args:
            agreement_threshold: Below this, flag as low agreement
            uncertainty_threshold: Score magnitude below this is uncertain
        """
        self.agreement_threshold = agreement_threshold
        self.uncertainty_threshold = uncertainty_threshold
        
        # Initialize models
        self.models = {
            'binary': sentiment_default_full,
            'twitter': sentiment_twitter_full,
            '5star': sentiment_5star_full,
        }
        
        # Track analysis history
        self.history = []
    
    def analyze(self, text):
        """
        Perform comprehensive sentiment analysis.
        
        Returns:
            dict with detailed analysis results
        """
        results = {'text': text, 'models': {}}
        normalized_scores = []
        
        # Get predictions from each model
        # Loop variable is named `model` to avoid shadowing the imported `pipeline` function
        for model_type, model in self.models.items():
            preds = model(text)[0]
            top_pred = max(preds, key=lambda x: x['score'])
            normalized = normalize_sentiment(model_type, preds)
            
            results['models'][model_type] = {
                'label': top_pred['label'],
                'confidence': top_pred['score'],
                'normalized_score': normalized,
                'all_probs': {p['label']: p['score'] for p in preds}
            }
            normalized_scores.append(normalized)
        
        # Calculate aggregate metrics
        avg_score = np.mean(normalized_scores)
        spread = max(normalized_scores) - min(normalized_scores)
        agreement = 1 - (spread / 2)
        
        # Determine consensus label
        if avg_score > self.uncertainty_threshold:
            consensus = 'POSITIVE'
        elif avg_score < -self.uncertainty_threshold:
            consensus = 'NEGATIVE'
        else:
            consensus = 'NEUTRAL'
        
        # Generate flags
        flags = []
        if agreement < self.agreement_threshold:
            flags.append('LOW_AGREEMENT')
        if abs(avg_score) < self.uncertainty_threshold:
            flags.append('UNCERTAIN')
        if any(m['confidence'] < 0.5 for m in results['models'].values()):
            flags.append('LOW_CONFIDENCE')
        
        results['summary'] = {
            'consensus_label': consensus,
            'average_score': avg_score,
            'agreement': agreement,
            'spread': spread,
            'flags': flags,
            'needs_review': len(flags) > 0,
        }
        
        # Store in history
        self.history.append(results)
        
        return results
    
    def format_analysis(self, result):
        """Format analysis for display."""
        lines = []
        
        # Header
        lines.append("╔" + "═"*68 + "╗")
        lines.append("║" + " SENTIMENT ANALYSIS DASHBOARD ".center(68) + "║")
        lines.append("╠" + "═"*68 + "╣")
        
        # Text (split into rows of at most 62 characters to match the field width)
        text_display = result['text'][:62]
        lines.append(f"║ Text: {text_display:62s} ║")
        if len(result['text']) > 62:
            lines.append(f"║       {result['text'][62:124]:62s} ║")
        lines.append("╠" + "═"*68 + "╣")
        
        # Individual model results
        lines.append("║" + " Individual Model Predictions: ".ljust(68) + "║")
        lines.append("║" + "-"*68 + "║")
        
        model_names = {'binary': 'Binary (SST2)', 'twitter': 'Twitter', '5star': '5-Star'}
        for model_type, data in result['models'].items():
            name = model_names[model_type]
            label = data['label']
            conf = data['confidence']
            norm = data['normalized_score']
            lines.append(f"║  {name:18s} │ {label:15s} │ Conf: {conf:.0%} │ Score: {norm:+.2f} ║")
        
        lines.append("╠" + "═"*68 + "╣")
        
        # Summary
        summary = result['summary']
        lines.append("║" + " SUMMARY ".center(68, '─') + "║")
        lines.append(f"║  Consensus: {summary['consensus_label']:15s}  │  Score: {summary['average_score']:+.2f}" + " "*21 + "║")
        lines.append(f"║  Agreement: {summary['agreement']:.0%}" + " "*(56 - len(f"{summary['agreement']:.0%}")) + "║")
        
        # Flags
        if summary['flags']:
            flags_str = ", ".join(summary['flags'])
            lines.append("╠" + "═"*68 + "╣")
            lines.append(f"║  ⚠️  FLAGS: {flags_str:55s} ║")
            
            # Recommendations based on flags
            if 'LOW_AGREEMENT' in summary['flags']:
                lines.append("║       → Models disagree significantly - manual review suggested" + " "*3 + "║")
            if 'UNCERTAIN' in summary['flags']:
                lines.append("║       → Sentiment is ambiguous or neutral" + " "*24 + "║")
            if 'LOW_CONFIDENCE' in summary['flags']:
                lines.append("║       → At least one model has low confidence" + " "*20 + "║")
        else:
            lines.append("╠" + "═"*68 + "╣")
            lines.append("║  ✓  No flags - High confidence prediction" + " "*26 + "║")
        
        lines.append("╚" + "═"*68 + "╝")
        
        return '\n'.join(lines)
    
    def batch_analyze(self, texts):
        """
        Analyze multiple texts and return summary statistics.
        """
        results = [self.analyze(text) for text in texts]
        
        # Calculate batch statistics
        n = len(results)
        needs_review = sum(1 for r in results if r['summary']['needs_review'])
        
        sentiment_counts = {'POSITIVE': 0, 'NEGATIVE': 0, 'NEUTRAL': 0}
        for r in results:
            sentiment_counts[r['summary']['consensus_label']] += 1
        
        avg_agreement = np.mean([r['summary']['agreement'] for r in results])
        
        return {
            'total_analyzed': n,
            'needs_review': needs_review,
            'sentiment_distribution': sentiment_counts,
            'average_agreement': avg_agreement,
            'results': results,
        }
    
    def format_batch_summary(self, batch_result):
        """Format batch analysis summary."""
        lines = []
        
        lines.append("╔" + "═"*50 + "╗")
        lines.append("║" + " BATCH ANALYSIS SUMMARY ".center(50) + "║")
        lines.append("╠" + "═"*50 + "╣")
        
        lines.append(f"║  Total texts analyzed: {batch_result['total_analyzed']:25d} ║")
        lines.append(f"║  Flagged for review:   {batch_result['needs_review']:25d} ║")
        lines.append(f"║  Average agreement:    {batch_result['average_agreement']:24.0%} ║")
        
        lines.append("╠" + "═"*50 + "╣")
        lines.append("║" + " Sentiment Distribution: ".ljust(50) + "║")
        
        dist = batch_result['sentiment_distribution']
        total = sum(dist.values())
        for sentiment, count in dist.items():
            pct = count / total * 100 if total > 0 else 0
            bar = '█' * int(pct / 5)
            lines.append(f"║  {sentiment:10s}: {bar:20s} {count:3d} ({pct:.0f}%) ║")
        
        lines.append("╚" + "═"*50 + "╝")
        
        return '\n'.join(lines)


# Create dashboard
dashboard = SentimentDashboard()
print("Sentiment Dashboard initialized!")

In [None]:
# Test individual analysis
test_dashboard_texts = [
    "Absolutely love this product! Best purchase I've made all year!",
    "Complete waste of money. Don't buy this.",
    "It's okay I guess. Nothing special but does the job.",
    "The service was great but the product broke after a week.",
]

print("Individual Analysis Results:")
print("="*70)

for text in test_dashboard_texts:
    result = dashboard.analyze(text)
    print(dashboard.format_analysis(result))
    print()

In [None]:
# Test batch analysis with simulated reviews
sample_reviews = [
    "Five stars! Amazing quality and fast shipping!",
    "Good product, would recommend to others.",
    "Exactly as described. Happy with purchase.",
    "Arrived late and was damaged. Very disappointed.",
    "Meh. It's fine.",
    "DO NOT BUY! Total scam!",
    "Pretty good for the price.",
    "Not what I expected but still useful.",
    "Terrible customer service!",
    "I love it! My whole family uses it now.",
    "Could be better but it works.",
    "Waste of money, broke immediately.",
]

# Run batch analysis
batch_result = dashboard.batch_analyze(sample_reviews)

print("\nBatch Analysis:")
print("="*70)
print(dashboard.format_batch_summary(batch_result))

# Show items flagged for review
print("\n\nItems Flagged for Review:")
print("="*70)

flagged = [r for r in batch_result['results'] if r['summary']['needs_review']]
for result in flagged:
    print(f"\n\"{result['text']}\"")
    print(f"  Flags: {', '.join(result['summary']['flags'])}")

In [None]:
# Interactive analysis - try your own texts
# Uncomment and modify:

# my_text = "Your text here"
# result = dashboard.analyze(my_text)
# print(dashboard.format_analysis(result))

### Extension Ideas

If you want to extend this project further:

1. **Export to CSV/JSON**: Save analysis results for external reporting
2. **Trend analysis**: Track sentiment changes over time
3. **Category breakdown**: Analyze sentiment by product category
4. **Language detection**: Route to appropriate language models
5. **Aspect-based sentiment**: Detect sentiment for different aspects (quality, price, service)
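
The first idea, exporting results, might look like the sketch below. It assumes result dicts shaped like those returned by `SentimentDashboard.analyze()` (a `text` plus a `summary` holding `consensus_label`, `average_score`, `agreement`, and `flags`). The helper names `flatten_result`, `to_json`, and `to_csv`, the `FIELDNAMES` list, and the sample record are hypothetical.

```python
import csv
import io
import json

# Column order for the flat export (an illustrative choice)
FIELDNAMES = ["text", "consensus_label", "average_score", "agreement", "flags"]


def flatten_result(result):
    """Reduce one analysis result to a flat, serializable row."""
    summary = result["summary"]
    return {
        "text": result["text"],
        "consensus_label": summary["consensus_label"],
        # float() converts NumPy scalars to plain Python floats
        "average_score": round(float(summary["average_score"]), 4),
        "agreement": round(float(summary["agreement"]), 4),
        "flags": ";".join(summary["flags"]),
    }


def to_json(results):
    """Serialize results as a JSON array string."""
    return json.dumps([flatten_result(r) for r in results], indent=2)


def to_csv(results):
    """Serialize results as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(flatten_result(r) for r in results)
    return buffer.getvalue()


# Hypothetical record with made-up numbers, for illustration only
sample_results = [{
    "text": "Meh. It's fine.",
    "summary": {
        "consensus_label": "NEUTRAL",
        "average_score": 0.05,
        "agreement": 0.72,
        "flags": ["UNCERTAIN"],
    },
}]

print(to_json(sample_results))
print(to_csv(sample_results))
```

With the real dashboard, `dashboard.history` or `batch_result['results']` could be passed in place of `sample_results`; writing the returned strings to a file is left to the caller.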

---

# Part 6: Wrap-Up

## Key Takeaways

1. **Different models give different results** because they're trained on different data from different domains

2. **Training data shapes model behavior**:
   - Movie review models see "okay" as disappointment
   - Product review models have nuanced star ratings
   - Social media models understand informal language

3. **Always check model cards** to understand:
   - What data the model was trained on
   - What domains it works well for
   - Known limitations and biases

4. **Ensemble methods** combine multiple models for more robust predictions

5. **Model agreement** is a useful signal:
   - High agreement = more reliable prediction (though models can share the same blind spots)
   - Low agreement = text may be ambiguous or an edge case

## Common Mistakes to Avoid

| Mistake | Why It's a Problem |
|---------|-------------------|
| Using one model for all domains | Results may be unreliable for some text types |
| Ignoring neutral sentiment | Binary models force neutral into pos/neg |
| Not checking model cards | May use model on inappropriate domain |
| Treating confidence as accuracy | High confidence ≠ correct prediction |
| Ignoring sarcasm and negation | Models often struggle with these |
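
The "ignoring neutral sentiment" mistake can be softened with a neutral band. The sketch below assumes a binary model that reports a probability for the positive class; it rescales that probability to a signed score and treats anything close to zero as neutral. The function name `binary_to_three_way` and the 0.3 band are illustrative assumptions (0.3 matches the dashboard's default `uncertainty_threshold`). Because confidence is not the same as accuracy, the band is a heuristic, not a calibrated cut-off.

```python
def binary_to_three_way(p_positive, neutral_band=0.3):
    """Map P(positive) from a binary model to a three-way label.

    The probability is rescaled to a signed score in [-1, 1]
    (0.5 maps to 0.0); scores within +/- neutral_band count as NEUTRAL.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be between 0 and 1")
    score = 2 * p_positive - 1
    if score > neutral_band:
        return "POSITIVE", score
    if score < -neutral_band:
        return "NEGATIVE", score
    return "NEUTRAL", score


# Hypothetical probabilities, made up for illustration only
for p in (0.97, 0.58, 0.08):
    label, score = binary_to_three_way(p)
    print(f"P(positive)={p:.2f} -> {label} ({score:+.2f})")
```

A wider band sends more borderline text to the neutral bucket; the right width depends on the model and domain and is best tuned on labeled examples.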

## What's Next?

In **Notebook 10: Pipeline Internals (Capstone)**, you'll learn:
- What happens inside a pipeline (tokenization → inference → post-processing)
- How to implement pipeline components manually
- How to customize and extend pipelines

This capstone notebook will tie together everything you've learned across all notebooks!

---

## Solutions

### Check Your Understanding (Quiz Answers)

1. **B) Models are trained on different data from different domains** - Training data determines how a model interprets language
2. **B) It forces neutral text into positive or negative** - Binary classification has no neutral option
3. **B) The training data domain and known limitations** - These help you know if the model is appropriate
4. **B) Combining predictions from multiple models** - Ensembles aggregate multiple opinions for robustness

### Exercise 2: Key Insights

In [None]:
# Key patterns that cause model disagreement:

disagreement_patterns = {
    'Negation': {
        'examples': ["I don't hate it", "Not bad", "Can't complain"],
        'why': 'Models may focus on negative words ("hate", "bad") and miss the negation',
    },
    'Sarcasm': {
        'examples': ["Oh great, just what I needed", "Wow, what a surprise"],
        'why': 'Literal interpretation conflicts with intended meaning',
    },
    'Mixed Sentiment': {
        'examples': ["Great product, terrible service", "Love the look, hate the price"],
        'why': 'Different aspects have different sentiments',
    },
    'Domain-Specific Language': {
        'examples': ["This code is sick!", "That movie was a bomb"],
        'why': 'Slang meanings differ from literal meanings',
    },
    'Hedged Statements': {
        'examples': ["I suppose it's okay", "Could have been worse"],
        'why': 'Weak positive/negative signals, models disagree on interpretation',
    },
}

print("Patterns That Cause Model Disagreement:")
print("="*60)

for pattern, info in disagreement_patterns.items():
    print(f"\n{pattern}:")
    print(f"  Examples: {', '.join(info['examples'])}")
    print(f"  Why: {info['why']}")

---

## Additional Resources

- [Hugging Face Model Cards](https://huggingface.co/docs/hub/model-cards) - Understanding model documentation
- [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/) - Original SST dataset
- [SemEval Sentiment Tasks](https://semeval.github.io/) - Academic sentiment benchmarks
- [TweetEval Benchmark](https://github.com/cardiffnlp/tweeteval) - Social media NLP benchmarks
- [Ensemble Methods in NLP](https://arxiv.org/abs/2004.00790) - Academic overview of ensemble approaches