# Session 6: Model Evaluation and Metrics
## How Do You Know If a Model Is Good?

**Session Length:** 2 hours

**Today's Mission:** Learn to evaluate AI models critically -- understand confidence scores, compare models head-to-head, and discover why accuracy alone can be misleading.

### Session Outline
| Time | Activity |
|------|----------|
| 0:00-0:05 | Review: What did parameter experiments reveal? |
| 0:05-0:30 | Part 1: Confidence Scores -- Can You Trust Them? |
| 0:30-1:05 | Part 2: Model Comparison -- Same Input, Different Models |
| 1:05-1:40 | Part 3: When Numbers Lie |
| 1:40-2:00 | On Your Own: Extended model comparison |

### Key Vocabulary
| Term | Definition |
|------|-----------|
| Confidence Score | A number (0-1) showing how sure the model is |
| Model Comparison | Testing the same input on different models to see which one performs better |
| Accuracy | The fraction of predictions the model gets right |
| False Positive | When the model says YES but the answer is NO |
| Evaluation | Systematically testing how well a model performs |

---

## Review: What Did Parameter Experiments Reveal? (0:00-0:05)

Last session we learned that models have **hyperparameters** -- dials you control that change how the model behaves. Temperature controls how random (and so how "creative") the output is, top-p limits word choices, max_length controls summary size, and even the labels you give a zero-shot classifier change its results.

Today we ask the next logical question: **how do you know if a model is actually good?**

A model can generate text, classify sentiment, summarize articles. But is it doing a good job? How would you even measure that? And what happens when the numbers say "great" but the model is actually failing?

Let's find out.

---

## Setup

Run this cell to install the libraries we need.

In [None]:
!pip install transformers==4.47.1 gradio -q


### Important: Restart Your Runtime

After installing packages, you need to restart the runtime so Python can find them.

**Go to: Runtime > Restart runtime**

After restarting, come back here and continue running the cells below. You do NOT need to re-run the install cell -- the packages are already installed. Just start from the next code cell.

---

In [None]:
from transformers import pipeline
print("Ready!")

---

## Part 1: Confidence Scores -- Can You Trust Them? (0:05-0:30)

Every time a classification model makes a prediction, it also gives you a **confidence score** -- a number between 0 and 1 that represents how sure the model is. A score of 0.95 means "I'm 95% confident in this answer."
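Where does that number come from? The sketch below shows one common recipe, not a look inside this particular pipeline. It assumes the model produces one raw score (called a logit) per label and that a softmax turns those scores into values between 0 and 1 that add up to 1. The `logits` values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the largest score keeps exp() from overflowing
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the labels NEGATIVE and POSITIVE
labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.2, 2.3]

probs = softmax(logits)

# The prediction is the label with the highest probability
best = max(range(len(labels)), key=lambda i: probs[i])

for label, p in zip(labels, probs):
    print(f"{label}: {p:.1%}")
print(f"Prediction: {labels[best]} (confidence: {probs[best]:.1%})")
```

The confidence is simply the largest of these values. Nothing in the calculation checks the answer against reality.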

But here is the critical question: **does confident mean correct?**

### Sentiment Analysis: The Basics

In [None]:
sentiment = pipeline("sentiment-analysis")
print("Sentiment model loaded!")

#### Easy Cases: High Confidence, Correct Answer

In [None]:
easy_texts = [
    "I absolutely love this movie, it was incredible!",
    "This is the worst product I have ever purchased.",
    "The concert was fantastic and everyone had a great time."
]

print("EASY CASES")
print("=" * 50)

for text in easy_texts:
    result = sentiment(text)[0]
    print(f"\nText: {text}")
    print(f"  Prediction: {result['label']} (confidence: {result['score']:.1%})")

No surprises here. Clear positive text gets classified as positive with high confidence. Clear negative text gets classified as negative. The model is sure, and it is right.

But what happens when the text is not so clear?

#### Ambiguous Cases: Lower Confidence

In [None]:
ambiguous_texts = [
    "It was fine.",
    "I guess it could have been worse.",
    "The food was okay but the service was slow.",
    "Not bad, not great, just average."
]

print("AMBIGUOUS CASES")
print("=" * 50)

for text in ambiguous_texts:
    result = sentiment(text)[0]
    print(f"\nText: {text}")
    print(f"  Prediction: {result['label']} (confidence: {result['score']:.1%})")

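Compare these scores with the easy cases. Where the confidence drops, the model is less sure -- and that is actually appropriate. When the text genuinely could go either way, a lower confidence score is honest. If a score stays high on a sentence that could go either way, make a note of it: the model is more certain than the text deserves.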

#### The Dangerous Case: High Confidence, Wrong Answer

> **INSTRUCTOR NOTE:** Before running the next cell, ask students: "What do you think will happen with sarcasm? The model has never been taught about sarcasm -- it just looks at word patterns."

In [None]:
tricky_texts = [
    "Oh wonderful, another meeting. Exactly how I wanted to spend my Saturday.",
    "What a great idea to release the product without testing it first.",
    "I just love sitting in traffic for two hours. Best part of my day.",
    "Sure, because that worked so well last time."
]

print("TRICKY CASES: SARCASM")
print("=" * 50)

for text in tricky_texts:
    result = sentiment(text)[0]
    print(f"\nText: {text}")
    print(f"  Prediction: {result['label']} (confidence: {result['score']:.1%})")

This is the dangerous scenario. The model sees words like "wonderful," "great," "love," and "best," and for sentences like these it will often predict POSITIVE with high confidence. But a human reader immediately recognizes the sarcasm -- the actual sentiment is negative.

**A confidence score tells you how sure the model is. It does NOT tell you whether the model is correct.** A model can be 99% confident and 100% wrong. This is one of the most important lessons in AI evaluation.
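One way to make this concrete is to count predictions that are both confident and wrong. This is a minimal sketch with made-up records, not real model output; the 0.90 cutoff for "high confidence" is also an arbitrary choice for illustration.

```python
# Hypothetical records: (predicted label, confidence, label a human assigned)
records = [
    ("POSITIVE", 0.99, "POSITIVE"),
    ("NEGATIVE", 0.98, "NEGATIVE"),
    ("POSITIVE", 0.97, "NEGATIVE"),   # sarcasm: confident and wrong
    ("POSITIVE", 0.62, "POSITIVE"),   # unsure but right
    ("NEGATIVE", 0.55, "POSITIVE"),   # unsure and wrong
]

THRESHOLD = 0.90  # what we decide to call "high confidence"

# Keep only the predictions the model was very sure about
confident = [r for r in records if r[1] >= THRESHOLD]

# Of those, find the ones where the prediction did not match the human label
confident_wrong = [r for r in confident if r[0] != r[2]]

print(f"High-confidence predictions: {len(confident)}")
print(f"High-confidence AND wrong:   {len(confident_wrong)}")
```

To run this on real output, replace `records` with predictions from the cells above and your own judgment of the correct label.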

### Zero-Shot on Genuinely Ambiguous Text

In [None]:
classifier = pipeline("zero-shot-classification")

text = "I spent three hours working on this project"
categories = ["productive", "frustrated", "neutral", "dedicated"]

result = classifier(text, categories)

print("AMBIGUOUS TEXT: ZERO-SHOT CLASSIFICATION")
print("=" * 50)
print(f"Text: {text}\n")

for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.1%}")

> **INSTRUCTOR NOTE:** Ask students: "Is 'I spent three hours working on this project' positive or negative? Could it be both? What confidence score would you trust -- 90%? 70%? 51%?"

### Student Test: Find a Confidence Fail

> **INSTRUCTOR NOTE:** Ask students to suggest ambiguous or sarcastic text. Type it below and see if the model gets fooled.

In [None]:
student_ambiguous = "REPLACE WITH SOMETHING AMBIGUOUS"

result = sentiment(student_ambiguous)[0]
print(f"Text: {student_ambiguous}")
print(f"Prediction: {result['label']} (confidence: {result['score']:.1%})")

# Also try zero-shot
zs_result = classifier(student_ambiguous, ["positive", "negative", "sarcastic", "neutral"])
print(f"\nZero-shot breakdown:")
for label, score in zip(zs_result['labels'], zs_result['scores']):
    print(f"  {label}: {score:.1%}")

> **ASK AI ABOUT THIS**
>
> Copy the sarcasm examples and their predictions into Claude or ChatGPT and ask:
>
> *"A sentiment model gave these predictions and confidence scores for sarcastic sentences. Why would a model be so confident about text it gets wrong? What is it actually paying attention to?"*
>
> This is how real programmers learn -- by asking questions about code they encounter.

---

## Part 2: Model Comparison -- Same Input, Different Models (0:30-1:05)

Here is a fact that surprises most people: **different models give different answers for the same input.** Not slightly different -- sometimes completely opposite answers.

Why? Because each model was trained on different data. A model trained on movie reviews has a different idea of "positive" than a model trained on tweets. A model trained on product reviews might not understand political text at all.

Let's see this in action by running the same sentences through three different sentiment models.

### Loading Three Models

In [None]:
# Model A: Default distilbert (trained on movie reviews - SST-2)
model_a = pipeline("sentiment-analysis")

# Model B: Twitter-specific model (trained on tweets)
model_b = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

# Model C: Star-rating model (trained on product reviews, 1-5 stars)
model_c = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

print("All three models loaded!")
print()
print("Model A: distilbert-base (movie reviews)")
print("Model B: twitter-roberta (tweets)")
print("Model C: bert-multilingual (product reviews, 1-5 stars)")

### The Comparison Test

We will run the same sentences through all three models and see where they agree and disagree.

> **INSTRUCTOR NOTE:** Let students suggest 2-3 test sentences. Replace the "REPLACE WITH STUDENT SUGGESTION" entries with their ideas before running the cell.

In [None]:
test_sentences = [
    "I absolutely love this product!",
    "This is the worst experience I have ever had.",
    "It was okay, nothing special.",
    "Oh great, another update that breaks everything.",
    "The service was slow but the food was amazing.",
    "I didn't hate it.",
    "REPLACE WITH STUDENT SUGGESTION",
    "REPLACE WITH STUDENT SUGGESTION",
]

print(f"{'Text':<45} {'Default':<15} {'Twitter':<15} {'Stars':<15}")
print("=" * 90)

for text in test_sentences:
    if "REPLACE" in text:
        continue
    a = model_a(text)[0]
    b = model_b(text)[0]
    c = model_c(text)[0]
    a_str = f"{a['label'][:3]} {a['score']:.0%}"
    b_str = f"{b['label'][:3]} {b['score']:.0%}"
    c_str = f"{c['label'][:8]} {c['score']:.0%}"
    print(f"{text[:43]:<45} {a_str:<15} {b_str:<15} {c_str:<15}")

### Discussion Questions

Look at the comparison table and think about these:

1. **Where did all three models agree?** These are probably "easy" cases where the sentiment is obvious.

2. **Where did they disagree?** These are the interesting cases. Why might the Twitter model see something differently than the movie review model?

3. **The Twitter model was trained on tweets. The default model was trained on movie reviews.** How might that explain their different responses to sarcasm or casual language?

4. **The star-rating model gives 1-5 stars instead of positive/negative.** When is a 3-star rating useful information that binary positive/negative misses?
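Because the three models use different label sets, it can help to translate every label onto one shared scale before comparing them. The helper below is a sketch under assumptions: it expects labels that look like `POSITIVE`, `neutral`, or `3 stars`, and treating 3 stars as neutral is a choice you could reasonably make differently. `to_common_label` is a hypothetical name, not part of any library.

```python
def to_common_label(raw_label):
    """Map a model-specific label onto negative / neutral / positive."""
    label = raw_label.strip().lower()

    # Star ratings such as "1 star" or "4 stars"
    if label.endswith(("star", "stars")):
        stars = int(label.split()[0])
        if stars <= 2:
            return "negative"
        if stars == 3:
            return "neutral"   # assumption: the middle rating counts as neutral
        return "positive"

    # Labels that are already words, in any capitalization
    if label in ("positive", "negative", "neutral"):
        return label

    raise ValueError(f"Unrecognized label: {raw_label!r}")

for raw in ["POSITIVE", "negative", "neutral", "1 star", "3 stars", "5 stars"]:
    print(f"{raw:<10} -> {to_common_label(raw)}")
```

Check the labels your models actually print before relying on this; if a model uses a format the helper does not recognize, it raises an error instead of guessing.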

### Detailed Breakdown

Let's look at one sentence in detail across all three models.

In [None]:
detail_text = "The service was slow but the food was amazing."

print(f"Text: {detail_text}")
print()

print("Model A (Default - Movie Reviews):")
a = model_a(detail_text)[0]
print(f"  {a['label']} ({a['score']:.1%})")

print("\nModel B (Twitter):")
b = model_b(detail_text)[0]
print(f"  {b['label']} ({b['score']:.1%})")

print("\nModel C (Product Reviews - Stars):")
c = model_c(detail_text)[0]
print(f"  {c['label']} ({c['score']:.1%})")

print()
print("This sentence has MIXED sentiment -- one bad thing and one good thing.")
print("Notice how each model handles the contradiction differently.")

> **INSTRUCTOR NOTE:** Go to huggingface.co/models and filter by "text-classification". Show students how many sentiment models exist (hundreds). Click on cardiffnlp/twitter-roberta-base-sentiment-latest -- show the model card, training data section, and any benchmark numbers. Point out that the model card tells you what data the model was trained on, which helps explain its behavior.

### Student Challenge: Find the Biggest Disagreement

> **INSTRUCTOR NOTE:** Ask students to come up with sentences where they think the models will disagree the most. Test their predictions.

In [None]:
student_challenge = "REPLACE WITH STUDENT SUGGESTION"

print(f"Text: {student_challenge}\n")

a = model_a(student_challenge)[0]
b = model_b(student_challenge)[0]
c = model_c(student_challenge)[0]

print(f"Default:  {a['label']} ({a['score']:.1%})")
print(f"Twitter:  {b['label']} ({b['score']:.1%})")
print(f"Stars:    {c['label']} ({c['score']:.1%})")
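To decide which sentence produced the "biggest disagreement," you need a rule for what counts as disagreeing. Here is one simple possibility, sketched with made-up predictions that have already been translated to negative / neutral / positive. `describe_agreement` is a hypothetical helper for illustration.

```python
from collections import Counter

def describe_agreement(labels):
    """Summarize how a group of models voted on one sentence."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count == len(labels):
        return f"all agree ({top_label})"
    if top_count == 1:
        return "all differ"
    return f"majority says {top_label}"

# Hypothetical predictions from three models for three sentences
predictions = {
    "Sentence 1": ["positive", "positive", "positive"],
    "Sentence 2": ["positive", "negative", "positive"],
    "Sentence 3": ["positive", "neutral", "negative"],
}

summaries = {s: describe_agreement(votes) for s, votes in predictions.items()}
for sentence, summary in summaries.items():
    print(f"{sentence}: {summary}")
```

A sentence where all three models differ is a strong candidate for the biggest disagreement.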

> **ASK AI ABOUT THIS**
>
> Copy the comparison table output into Claude or ChatGPT and ask:
>
> *"These three sentiment models gave different answers for the same texts. One was trained on movie reviews, one on tweets, and one on product reviews. Explain why training data affects predictions, using these results as examples."*
>
> This is how real programmers learn -- by asking questions about code they encounter.

---

## Part 3: When Numbers Lie (1:05-1:40)

We have seen that confidence scores can be misleading and that different models disagree. Now let's talk about the most common trap in AI evaluation: **accuracy that hides failure.**

### The Spam Detector Thought Experiment

Imagine you build a spam detector for email. You test it on 1,000 emails:
- 950 are real emails (not spam)
- 50 are spam

Your model predicts "NOT SPAM" for every single email. Every one. It never flags anything as spam.

**Quiz: What is this model's accuracy?**

In [None]:
total_emails = 1000
real_emails = 950
spam_emails = 50

# The model predicts NOT SPAM for everything
correct_predictions = real_emails  # It gets all 950 real emails right
wrong_predictions = spam_emails    # It misses all 50 spam emails

accuracy = correct_predictions / total_emails

print("THE LAZY SPAM DETECTOR")
print("=" * 50)
print(f"Total emails tested: {total_emails}")
print(f"Real emails:         {real_emails}")
print(f"Spam emails:         {spam_emails}")
print()
print(f"Model predicts: NOT SPAM for everything")
print(f"Correct predictions: {correct_predictions}")
print(f"Wrong predictions:   {wrong_predictions}")
print(f"Accuracy:            {accuracy:.1%}")
print()
print("95% accuracy! Ship it?")

**95% accuracy.** That sounds great -- until you realize the model is completely useless. It never catches a single spam email. Its entire job is to find spam, and it finds zero.

This is why **accuracy alone hides problems.** A model can be "accurate" while failing completely at its actual job. The 95% accuracy comes from a trick: since most emails are real, the model can just predict "not spam" every time and be right 95% of the time.

### What Went Wrong: The Imbalanced Data Problem

When one category is much more common than another (950 real vs. 50 spam), the model can cheat by always guessing the common category. This is called the **class imbalance problem**, and it is one of the most common traps in machine learning.
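A quick way to spot this trap is to work out what accuracy you would get by always guessing the most common label. The sketch below does that for the 950/50 split from the thought experiment. `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline(true_labels):
    """Accuracy you would get by always guessing the most common label."""
    label, count = Counter(true_labels).most_common(1)[0]
    return label, count / len(true_labels)

# The email mix from the thought experiment: 950 real, 50 spam
true_labels = ["not spam"] * 950 + ["spam"] * 50

baseline_label, baseline_accuracy = majority_baseline(true_labels)
print(f"Always guessing '{baseline_label}' scores {baseline_accuracy:.1%}")
```

Any model you build should be compared against this number. A model that only matches the baseline has not learned anything useful.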

### A Better Way to Evaluate

Instead of just asking "how often is the model right?", we should ask:
- **Of the emails the model flagged as spam, how many actually were spam?** (Precision)
- **Of all the actual spam emails, how many did the model catch?** (Recall)

Let's see this with a simple simulation.

In [None]:
print("TWO SPAM DETECTORS COMPARED")
print("=" * 50)

# Detector A: predicts NOT SPAM for everything
print("\nDetector A: Predicts NOT SPAM for everything")
print(f"  Accuracy: 95.0%")
print(f"  Spam caught: 0 out of 50 (0%)")
print(f"  Verdict: USELESS -- never catches spam")

# Detector B: catches some spam, makes some mistakes
# Errors: 10 missed spam + 70 false alarms = 80 wrong, so 920 of 1,000 are right
print("\nDetector B: Actually tries to detect spam")
print(f"  Accuracy: 92.0%")
print(f"  Spam caught: 40 out of 50 (80%)")
print(f"  False alarms: 70 real emails flagged as spam")
print(f"  Verdict: USEFUL -- catches most spam, at the cost of false alarms")

print("\n" + "=" * 50)
print("\nDetector A has HIGHER accuracy (95% vs 92%)")
print("But Detector B is clearly the better spam detector.")
print("\nThis is why you cannot evaluate a model on accuracy alone.")
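The two questions from "A Better Way to Evaluate" can be turned into arithmetic. This is a minimal sketch: `precision_recall` is a hypothetical helper, and the second detector's counts are invented for illustration. Only the always-NOT-SPAM counts come from the thought experiment.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall). A value is None when it is undefined (0/0)."""
    flagged = tp + fp        # everything the model called spam
    actual_spam = tp + fn    # everything that really was spam
    precision = tp / flagged if flagged else None
    recall = tp / actual_spam if actual_spam else None
    return precision, recall

def show(name, tp, fp, fn):
    precision, recall = precision_recall(tp, fp, fn)
    p_text = "undefined (nothing flagged)" if precision is None else f"{precision:.0%}"
    r_text = "undefined (no spam present)" if recall is None else f"{recall:.0%}"
    print(f"{name}: precision = {p_text}, recall = {r_text}")

# The always-NOT-SPAM detector: flags nothing, misses all 50 spam emails
lazy = precision_recall(tp=0, fp=0, fn=50)
show("Always NOT SPAM", tp=0, fp=0, fn=50)

# A made-up detector: catches 30 spam, raises 10 false alarms, misses 20 spam
example = precision_recall(tp=30, fp=10, fn=20)
show("Made-up detector", tp=30, fp=10, fn=20)
```

Notice that the always-NOT-SPAM detector has a recall of 0%, and its precision cannot even be computed because it never flags anything. Accuracy hid both facts.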

### False Positives and False Negatives

When a model makes a mistake, there are two kinds of errors:

| Error Type | What Happened | Spam Example |
|-----------|---------------|--------------|
| **False Positive** | Model said YES, answer was NO | Flagged a real email as spam |
| **False Negative** | Model said NO, answer was YES | Let a spam email through to inbox |
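Counting these errors takes only a few lines. The sketch below tallies all four outcomes (the two error types plus the two kinds of correct answer) from a list of made-up `(actual, predicted)` pairs. `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(pairs, positive="spam"):
    """Count true/false positives and negatives from (actual, predicted) pairs."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in pairs:
        if predicted == positive:
            # Model said YES: right if it really was spam, otherwise a false alarm
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Model said NO: a miss if it really was spam, otherwise correct
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Made-up results for six emails: (what it really was, what the model said)
emails = [
    ("spam", "spam"),
    ("spam", "not spam"),
    ("not spam", "not spam"),
    ("not spam", "spam"),
    ("not spam", "not spam"),
    ("spam", "spam"),
]

counts = confusion_counts(emails)
for name, value in counts.items():
    print(f"{name}: {value}")
```

TP and TN are the correct answers; FP and FN are the two kinds of mistake from the table.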

Which error is worse depends on the task:
- **Medical diagnosis:** A false negative (missing a disease) is much worse than a false positive (unnecessary test).
- **Spam filter:** A false positive (blocking a real email) might be worse than a false negative (letting some spam through).
- **Self-driving car:** A false negative (not seeing a pedestrian) is catastrophic.

There is no universal answer to "which error is worse." It depends entirely on what the model is being used for. This is a **design decision**, not a math problem.
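One way to see why this is a design decision is to put a price on each kind of error and watch the "better" model change. Everything in this sketch is invented for illustration: the two filters, their error counts, and the cost values.

```python
def total_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Add up the cost of every mistake a model made."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Made-up error counts for two spam filters
filter_x = {"fp": 5, "fn": 20}    # cautious: rarely blocks real mail, misses more spam
filter_y = {"fp": 30, "fn": 5}    # aggressive: catches more spam, blocks more real mail

# Two different opinions about which mistake hurts more
scenarios = {
    "Blocking real mail is the costly mistake": {"cost_fp": 10, "cost_fn": 1},
    "Missing spam is the costly mistake": {"cost_fp": 1, "cost_fn": 10},
}

winners = {}
for name, costs in scenarios.items():
    x = total_cost(filter_x["fp"], filter_x["fn"], **costs)
    y = total_cost(filter_y["fp"], filter_y["fn"], **costs)
    winners[name] = "Filter X" if x < y else "Filter Y"
    print(f"{name}: X costs {x}, Y costs {y} -> pick {winners[name]}")
```

The filters did not change between the two scenarios. Only the costs did, and those come from people deciding what matters.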

### Seeing It With Real Models

Let's create a small evaluation experiment with our sentiment model. We will test it on sentences where we know the right answer and see where it fails.

In [None]:
# Test cases where WE know the correct sentiment
test_cases = [
    ("I love this!", "POSITIVE"),
    ("Terrible product.", "NEGATIVE"),
    ("Best day ever!", "POSITIVE"),
    ("Total waste of money.", "NEGATIVE"),
    ("It was okay.", "POSITIVE"),    # Arguable -- could go either way
    ("Not my favorite.", "NEGATIVE"),
    ("Could be better.", "NEGATIVE"),
    ("I didn't hate it.", "POSITIVE"),  # Tricky -- weak positive
    ("What a disaster.", "NEGATIVE"),
    ("Pretty good actually.", "POSITIVE"),
]

print("SENTIMENT MODEL EVALUATION")
print("=" * 60)
print(f"{'Text':<30} {'Expected':<12} {'Predicted':<12} {'Match?':<8}")
print("-" * 60)

correct = 0
total = len(test_cases)

for text, expected in test_cases:
    result = sentiment(text)[0]
    predicted = result['label']
    match = predicted == expected
    correct += int(match)
    marker = "YES" if match else "NO <--"
    print(f"{text:<30} {expected:<12} {predicted:<12} {marker}")

print("-" * 60)
print(f"Accuracy: {correct}/{total} = {correct/total:.0%}")
print()
if correct < total:
    print(f"The model got {total - correct} wrong. Look at the misses --")
    print("are they cases where the 'right' answer is genuinely debatable?")

> **ASK AI ABOUT THIS**
>
> Ask Claude or ChatGPT:
>
> *"A spam detection model that always predicts NOT SPAM has 95% accuracy because 95% of emails are real. Explain why this is misleading. What metrics would be better, and why?"*
>
> This is how real programmers learn -- by asking questions about code they encounter.

### Preview: What Comes Next

This is why careful data scientists do not report accuracy alone. They look at:
- Where does the model fail?
- What kinds of mistakes does it make?
- Are the mistakes acceptable for this use case?

Next session, we will go deeper into what happens when the data the model sees in the real world looks different from what it trained on. A model trained on movie reviews might fail on tweets. A model trained on formal English might fail on slang. Understanding these failure modes is what separates someone who uses AI from someone who understands AI.

---

## On Your Own (1:40-2:00)

### Experiment 1: Build Your Own Test Suite

Add your own test sentences to the comparison. Try to find inputs where all three models give different answers.

In [None]:
my_test_sentences = [
    "REPLACE WITH YOUR SENTENCE 1",
    "REPLACE WITH YOUR SENTENCE 2",
    "REPLACE WITH YOUR SENTENCE 3",
    "REPLACE WITH YOUR SENTENCE 4",
    "REPLACE WITH YOUR SENTENCE 5",
]

print(f"{'Text':<45} {'Default':<15} {'Twitter':<15} {'Stars':<15}")
print("=" * 90)

for text in my_test_sentences:
    if "REPLACE" in text:
        continue
    a = model_a(text)[0]
    b = model_b(text)[0]
    c = model_c(text)[0]
    a_str = f"{a['label'][:3]} {a['score']:.0%}"
    b_str = f"{b['label'][:3]} {b['score']:.0%}"
    c_str = f"{c['label'][:8]} {c['score']:.0%}"
    print(f"{text[:43]:<45} {a_str:<15} {b_str:<15} {c_str:<15}")

### Experiment 2: Which Model Would You Pick?

Imagine you are building a sentiment analysis tool for one of these use cases. Which of the three models would you choose, and why?

| Use Case | Model Choice | Why? |
|----------|-------------|------|
| Analyzing customer reviews for a restaurant | | |
| Monitoring Twitter for brand mentions | | |
| Scanning movie reviews for a recommendation app | | |
| Checking student feedback on a class | | |

There is no single right answer. The best model depends on the job.

### Experiment 3: Create Your Own Evaluation

Pick a topic you care about and write 10 test sentences where YOU know the correct sentiment. Run them through the model and calculate accuracy.

In [None]:
# Write your own test cases: (text, expected_label)
# Expected label should be "POSITIVE" or "NEGATIVE"
my_evaluation = [
    ("REPLACE WITH YOUR TEXT", "POSITIVE"),
    ("REPLACE WITH YOUR TEXT", "NEGATIVE"),
    # Add more...
]

print("MY CUSTOM EVALUATION")
print("=" * 50)

correct = 0
total = 0

for text, expected in my_evaluation:
    if "REPLACE" in text:
        continue
    result = sentiment(text)[0]
    match = result['label'] == expected
    correct += int(match)
    total += 1
    marker = "YES" if match else "NO <--"
    print(f"{text[:40]:<42} Expected: {expected:<10} Got: {result['label']:<10} {marker}")

if total > 0:
    print(f"\nAccuracy: {correct}/{total} = {correct/total:.0%}")

### Bonus: Model Comparison App

In Session 3 you saw Gradio turn one model into a web app. Now let's do something more powerful -- compare all three sentiment models side by side in one interface.

This time the app has **one input and three outputs**.


In [None]:
import gradio as gr

def compare_three_models(text):
    a = model_a(text)[0]
    b = model_b(text)[0]
    c = model_c(text)[0]
    result_a = f"{a['label']} ({a['score']:.1%})"
    result_b = f"{b['label']} ({b['score']:.1%})"
    result_c = f"{c['label']} ({c['score']:.1%})"
    return result_a, result_b, result_c

demo = gr.Interface(
    fn=compare_three_models,
    inputs=gr.Textbox(label="Type any text", lines=2,
                     placeholder="Try sarcasm, mixed sentiment, or slang..."),
    outputs=[
        gr.Textbox(label="Model A (Movie Reviews)"),
        gr.Textbox(label="Model B (Twitter)"),
        gr.Textbox(label="Model C (Product Reviews)"),
    ],
    title="Sentiment Model Showdown",
    description="Same text, three different models. See who disagrees!",
    allow_flagging="never",  # newer Gradio releases call this flagging_mode
)

demo.launch(share=True)


> **INSTRUCTOR NOTE:** This builds on Session 3's single-model Gradio demo. Students now see that Gradio can have multiple outputs. The shareable link lets them challenge friends to find inputs where the models disagree. Stop the demo by clicking the stop button or restarting the runtime when done.

---

### Checklist: Before You Leave

- [ ] Understood that confidence scores show certainty, not correctness
- [ ] Tested sarcasm and ambiguous text to find model blind spots
- [ ] Compared three different sentiment models on the same inputs
- [ ] Discussed why training data explains model disagreements
- [ ] Understood the spam detector example (high accuracy, useless model)
- [ ] Learned the difference between false positives and false negatives
- [ ] Built your own evaluation test suite
- [ ] Decided which model you would pick for a specific task
- [ ] Saved your work (File > Save a copy in Drive)

---

## Looking Ahead

Next session, we will explore what happens when the real world does not match the training data. Models trained on one kind of text can fail spectacularly on another. Understanding these failure modes -- and knowing how to spot them -- is what separates someone who uses AI from someone who truly understands it.

See you next session.

---

*Youth Horizons AI Researcher Program - Level 2*