# Session 7: Overfitting and Generalization
## When Models Leave Home

**Session Length:** 2 hours

**Today's Mission:** Discover that AI models are shaped by their training data -- and when the real world looks different from that data, models can fail in surprising ways. Learn to spot domain shift, understand overfitting, and match the right model to the right data.

### Session Outline
| Time | Activity |
|------|----------|
| 0:00-0:05 | Review: What did model comparison reveal? |
| 0:05-0:35 | Part 1: Domain Shift -- When Models Leave Home |
| 0:35-0:55 | Part 2: Memorization vs. Learning |
| 0:55-1:40 | Part 3: Matching Models to Data |
| 1:40-2:00 | On Your Own: Test models on your own text domains |

### Key Vocabulary
| Term | Definition |
|------|-----------|
| Domain Shift | When the data a model sees in use differs from the data it was trained on |
| Overfitting | When a model fits its training data so closely that it does poorly on new data |
| Generalization | When a model works well on data it hasn't seen before |
| Training Domain | The type of data a model learned from |
| Memorization | Learning the quirks of specific training examples instead of general patterns |

---

## Review: What Did Model Comparison Reveal? (0:00-0:05)

Last session we learned that **different models give different answers for the same input** -- and that a model can be 95% accurate while being completely useless. We learned about confidence scores, false positives, and false negatives.

Today we ask: **why do models fail?** Not just "they get it wrong sometimes" -- but what specifically causes a model to break down?

The answer is surprisingly simple: **models fail when the real world doesn't look like their training data.**

---

## Setup

Run this cell to install the libraries we need.

In [None]:
!pip install transformers==4.47.1 -q

### Important: Restart Your Runtime

After installing packages, you need to restart the runtime so Python can find them.

**Go to: Runtime > Restart runtime**

After restarting, come back here and continue running the cells below. You do NOT need to re-run the install cell -- the packages are already installed. Just start from the next code cell.

---

In [None]:
from transformers import pipeline
print("Ready!")

---

## Part 1: Domain Shift -- When Models Leave Home (0:05-0:35)

Every model was trained on a specific kind of data. Movie reviews. Tweets. News articles. Wikipedia. When you give it data that looks **different** from what it trained on, it can fail -- even if the task is exactly the same. This is called **domain shift**.

Think of it like this: if you learned to drive in a flat desert, you might be great at desert roads. But put you on a mountain switchback in the rain, and suddenly your "driving skill" doesn't transfer. The skill is the same -- driving -- but the **domain** changed.
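
To see the idea in miniature before loading a real model, here is a toy sketch. It is not how the real sentiment model works: it is a made-up "model" that only knows a handful of cue words we pretend it picked up from movie reviews. The word lists and the `toy_sentiment` function are invented for this illustration.

```python
# Toy illustration only. These word lists are made up; pretend they were
# learned from movie reviews.
POSITIVE_WORDS = {"phenomenal", "masterpiece", "hooked", "great"}
NEGATIVE_WORDS = {"terrible", "wooden", "boring", "disappointing"}

def toy_sentiment(text):
    """Count known cue words. Returns UNKNOWN on a tie, which includes
    the case where no cue word is recognized at all."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    if pos == neg:
        return "UNKNOWN"
    return "POSITIVE" if pos > neg else "NEGATIVE"

home = "The acting was phenomenal and the plot kept me hooked."
away = "ngl that movie slapped fr fr"

print("Home territory:", toy_sentiment(home))
print("New territory: ", toy_sentiment(away))
```

The task never changed (is this sentence positive or negative?), but the second sentence uses none of the words the toy model knows. One important difference: this toy can say UNKNOWN, while the real model always picks a label and reports a confidence, even when it is far from home.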

### Load the Default Sentiment Model

The default sentiment model was trained on **movie reviews** (the SST-2 dataset). It knows what "positive" and "negative" look like in movie review language.

In [None]:
sentiment = pipeline("sentiment-analysis")
print("Default sentiment model loaded!")
print("Training data: SST-2 (movie reviews)")

### Easy Case: Movie Review Language

In [None]:
movie_reviews = [
    "The acting was phenomenal and the plot kept me hooked.",
    "Terrible screenplay with wooden performances throughout.",
    "A masterpiece of modern cinema that will stand the test of time."
]

print("MOVIE REVIEWS (the model's home territory)")
print("=" * 55)

for text in movie_reviews:
    result = sentiment(text)[0]
    print(f"\nText: {text}")
    print(f"  Prediction: {result['label']} (confidence: {result['score']:.1%})")

No surprises. The model is confident and correct on movie reviews because that is exactly what it was trained on. This is the model's **home territory**.

Now let's take it somewhere unfamiliar.

### Domain Shift: Tweets and Slang

> **INSTRUCTOR NOTE:** Before running the next cell, ask students: "This model was trained on formal movie reviews. What do you think will happen when we give it tweets and slang? Will it still work?"

In [None]:
tweets_and_slang = [
    "ngl that movie slapped fr fr",
    "mid tbh, wouldn't recommend",
    "bestie this show is giving everything",
    "that test was bussin no cap",
    "lowkey obsessed with this album rn"
]

print("TWEETS AND SLANG (leaving home territory)")
print("=" * 55)

for text in tweets_and_slang:
    result = sentiment(text)[0]
    print(f"\nText: {text}")
    print(f"  Prediction: {result['label']} (confidence: {result['score']:.1%})")

Notice what happened. The model might have gotten some right, but look at the **confidence scores** and check whether the predictions actually match what a human would say. Words like "slapped," "bussin," and "giving" have positive meanings in slang, but the model may not know that -- because slang was not in its training data.
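
One way to make "check whether the predictions match what a human would say" concrete is to write down your own label for each sentence and count the matches. The sketch below assumes predictions shaped like `sentiment(text)[0]` (a dictionary with `label` and `score`). The function name `agreement_rate` and the example values are hypothetical, not real model output.

```python
def agreement_rate(predictions, human_labels):
    """Fraction of predictions whose label matches the human label."""
    if not predictions or len(predictions) != len(human_labels):
        raise ValueError("Need at least one prediction and one human label per prediction")
    matches = sum(
        pred["label"].upper() == human.upper()
        for pred, human in zip(predictions, human_labels)
    )
    return matches / len(predictions)

# HYPOTHETICAL values shaped like sentiment(text)[0]. These are NOT real results.
example_predictions = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.87},
]
# What a human reader decided for the same three sentences.
example_human_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

rate = agreement_rate(example_predictions, example_human_labels)
print(f"Model agreed with the human on {rate:.0%} of sentences")
```

To try it on real output, you could build the list with `[sentiment(t)[0] for t in tweets_and_slang]` and write your own human labels in the same order.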

### Domain Shift: Formal and Technical Text

In [None]:
formal_texts = [
    "The patient presents with acute symptoms requiring immediate intervention.",
    "Quarterly earnings exceeded analyst expectations by a significant margin.",
    "The defendant's counsel filed a motion to dismiss on procedural grounds.",
    "The compound exhibited promising results in preliminary trials."
]

print("FORMAL / TECHNICAL TEXT")
print("=" * 55)

for text in formal_texts:
    result = sentiment(text)[0]
    print(f"\nText: {text}")
    print(f"  Prediction: {result['label']} (confidence: {result['score']:.1%})")

Medical, legal, financial, and scientific text all have their own vocabulary and conventions. A movie review model has no training on this kind of language. It is **guessing** based on whatever patterns it can find -- and those guesses may be wrong even when the confidence is high.
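
The most dangerous failures are the ones where the model is wrong and sure of itself. Here is a minimal sketch for finding them, assuming you have added your own `human` label to each result. The helper `confident_mistakes`, the 0.90 threshold, and the rows are all illustrative assumptions, not real model output.

```python
def confident_mistakes(rows, threshold=0.90):
    """Return the rows where the model disagrees with the human label
    while reporting a confidence at or above the threshold."""
    return [
        row for row in rows
        if row["label"] != row["human"] and row["score"] >= threshold
    ]

# HYPOTHETICAL rows: "label" and "score" stand in for model output,
# "human" is the label a person would give. None of these are real results.
rows = [
    {"text": "example A", "label": "NEGATIVE", "score": 0.97, "human": "POSITIVE"},
    {"text": "example B", "label": "POSITIVE", "score": 0.99, "human": "POSITIVE"},
    {"text": "example C", "label": "NEGATIVE", "score": 0.62, "human": "POSITIVE"},
]

for row in confident_mistakes(rows):
    print(f"Confident but wrong: {row['text']} ({row['label']}, {row['score']:.0%})")
```

Example C is also wrong, but the low score at least hints that the model is unsure. Example A gives no such warning.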

### The Sarcasm Problem

In [None]:
sarcastic_texts = [
    "What a FANTASTIC use of my Saturday.",
    "Oh wonderful, another group project. Just what I needed.",
    "Sure, because that worked so well last time.",
    "I just love sitting in traffic for two hours. Best part of my day."
]

print("SARCASM (the model's worst enemy)")
print("=" * 55)

for text in sarcastic_texts:
    result = sentiment(text)[0]
    print(f"\nText: {text}")
    print(f"  Prediction: {result['label']} (confidence: {result['score']:.1%})")

Sarcasm is almost impossible for this model. Why? Because sarcasm uses **positive words to express negative feelings**. The model sees "fantastic," "wonderful," "love," and "best" and confidently predicts POSITIVE. But a human immediately recognizes the tone.

This is not a bug in the model. It is a **limitation of how it was trained**. The training data (movie reviews) does not contain much sarcasm, so the model never learned to handle it.

### Student Test: Your Own Domain Shift

> **INSTRUCTOR NOTE:** Ask students to suggest slang or informal text from their own lives. Type it in the cell below.

In [None]:
student_text = "REPLACE WITH STUDENT SUGGESTION"

result = sentiment(student_text)[0]
print(f"Text: {student_text}")
print(f"Prediction: {result['label']} (confidence: {result['score']:.1%})")
print()
print("Does this match what a human would say? If not, why might the model be wrong?")

> **ASK AI ABOUT THIS**
>
> Paste a slang sentence and its sentiment result into Claude or ChatGPT and ask:
>
> *"This model said NEGATIVE for a sentence that's actually positive. The model was trained on movie reviews. Can you explain why it might have gotten this wrong?"*
>
> This is how real programmers learn -- by asking questions about code they encounter.

---

## Part 2: Memorization vs. Learning (0:35-0:55)

If you memorized every answer on a practice test, you would ace the practice test -- but bomb the real test. Models do the same thing. They can memorize quirks of their training data instead of learning general patterns. This is called **overfitting**.

A model that **generalizes** well has learned the underlying patterns. A model that has **overfit** has memorized specific examples.
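
The practice-test idea can be shown with two toy "models" built in plain Python. This is a simplified sketch with a tiny made-up dataset, not how real neural networks are trained. The memorizer stores whole training sentences, while the pattern learner counts which words show up with which label.

```python
from collections import Counter

# Tiny made-up dataset for illustration.
train_data = [
    ("a wonderful heartfelt film", "POSITIVE"),
    ("a dull and lifeless film", "NEGATIVE"),
    ("wonderful acting all around", "POSITIVE"),
    ("dull plot and weak acting", "NEGATIVE"),
]
new_data = [
    ("what a wonderful story", "POSITIVE"),
    ("the ending felt dull", "NEGATIVE"),
]

# Memorizer: a lookup table of exact training sentences.
memory = dict(train_data)

def memorizer(text):
    # For a sentence it has never seen, it can only fall back to a fixed guess.
    return memory.get(text, "POSITIVE")

# Pattern learner: count how often each word appears with each label.
word_votes = {"POSITIVE": Counter(), "NEGATIVE": Counter()}
for text, label in train_data:
    word_votes[label].update(text.split())

def pattern_learner(text):
    pos = sum(word_votes["POSITIVE"][w] for w in text.split())
    neg = sum(word_votes["NEGATIVE"][w] for w in text.split())
    return "POSITIVE" if pos >= neg else "NEGATIVE"

def accuracy(model, data):
    """Fraction of (text, label) pairs the model labels correctly."""
    return sum(model(text) == label for text, label in data) / len(data)

for name, model in [("Memorizer", memorizer), ("Pattern learner", pattern_learner)]:
    print(f"{name:<16} practice test: {accuracy(model, train_data):.0%}   "
          f"real test: {accuracy(model, new_data):.0%}")
```

Both toys ace the practice test. Only the one that learned word patterns carries that score over to sentences it has never seen.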

### Two Models, Two Training Domains

Let's load a second sentiment model -- one trained specifically on **tweets**. Then we will compare both models on different kinds of text.

In [None]:
# Model A: Default (trained on movie reviews - SST-2)
model_reviews = pipeline("sentiment-analysis")

# Model B: Twitter-specific (trained on tweets)
model_twitter = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

print("Two models loaded!")
print("Model A: distilbert-base (movie reviews)")
print("Model B: twitter-roberta (tweets)")

> **INSTRUCTOR NOTE:** Before running the comparison, ask: "Which model do you think will handle tweets better? Which will handle formal text better? Why?"

### Head-to-Head Comparison

In [None]:
# Tweets -- Twitter model's home territory
tweets = [
    "ngl that movie slapped fr fr",
    "mid tbh, wouldn't recommend",
    "this show is giving everything omg",
    "ratio + L + didn't ask",
    "W take honestly"
]

# Formal text -- Reviews model's home territory
formal = [
    "The acting was phenomenal and the plot kept me hooked.",
    "A disappointing sequel that fails to capture the original's charm.",
    "The quarterly report indicates strong revenue growth across all sectors.",
]

print("TWEETS (Twitter model's home territory)")
print("=" * 65)
print(f"{'Text':<40} {'Reviews Model':<18} {'Twitter Model':<18}")
print("-" * 65)

for text in tweets:
    a = model_reviews(text)[0]
    b = model_twitter(text)[0]
    a_str = f"{a['label'][:3]} {a['score']:.0%}"
    b_str = f"{b['label'][:3]} {b['score']:.0%}"
    print(f"{text[:38]:<40} {a_str:<18} {b_str:<18}")

print()
print("FORMAL TEXT (Reviews model's home territory)")
print("=" * 65)
print(f"{'Text':<40} {'Reviews Model':<18} {'Twitter Model':<18}")
print("-" * 65)

for text in formal:
    a = model_reviews(text)[0]
    b = model_twitter(text)[0]
    a_str = f"{a['label'][:3]} {a['score']:.0%}"
    b_str = f"{b['label'][:3]} {b['score']:.0%}"
    print(f"{text[:38]:<40} {a_str:<18} {b_str:<18}")

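Look for the pattern in your output. The Twitter model should usually handle tweets better because it was **trained on tweets**, and the reviews model should usually handle formal text better because it was **trained on reviews**. Also notice that the Twitter model can answer `neutral` (shown as `neu` in the table), a label the reviews model does not have. Neither model is "better" -- they are each better at their own domain.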

This is the difference between memorization and learning:
- A model that **memorized** movie review patterns will ace movie reviews but fail on tweets.
- A model that truly **learned** sentiment would work on any text. No current model does this perfectly.
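
When comparing the two models side by side, keep in mind that they do not spell their labels the same way. The sketch below assumes the reviews model returns `POSITIVE` / `NEGATIVE` and the Twitter model returns lowercase `positive` / `neutral` / `negative`. Check the model cards to confirm this for the versions you load. The helper names and the example predictions are hypothetical.

```python
def normalize_label(label):
    """Map a label string to POS / NEG / NEU so two models can be compared."""
    short = label.strip().upper()[:3]
    if short not in {"POS", "NEG", "NEU"}:
        raise ValueError(f"Unexpected label: {label!r}")
    return short

def models_agree(pred_a, pred_b):
    """True when both predictions carry the same normalized label."""
    return normalize_label(pred_a["label"]) == normalize_label(pred_b["label"])

# HYPOTHETICAL predictions shaped like model(text)[0]. These are NOT real results.
reviews_pred = {"label": "POSITIVE", "score": 0.99}
twitter_pred = {"label": "positive", "score": 0.85}
twitter_neutral = {"label": "neutral", "score": 0.70}

print(models_agree(reviews_pred, twitter_pred))     # same sentiment, different capitalization
print(models_agree(reviews_pred, twitter_neutral))  # the reviews model has no neutral option
```

A disagreement that involves `NEU` is not automatically a mistake by either model. A two-label model is forced to pick a side on every sentence, even a neutral one.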

### Why This Matters

> **INSTRUCTOR NOTE:** Open the model cards for both models on Hugging Face. Show the "Training Data" sections. Ask: "Given what each was trained on, why does the Twitter model handle slang better?"
>
> - Default model: [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
> - Twitter model: [cardiffnlp/twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)

---

## Part 3: Matching Models to Data (0:55-1:40)

The skill is not finding the "best model." The skill is **matching the right model to the right data.**

Here are five real scenarios. For each one, think about which model would be a better fit -- and then we will test your prediction.

### The Scenarios

In [None]:
scenarios = {
    "Customer support tickets (formal)": [
        "The product arrived damaged and I need a replacement immediately.",
        "Your software crashes every time I try to export to PDF.",
        "Thank you for the quick resolution to my billing issue."
    ],
    "TikTok comments (informal, slang)": [
        "this is so fire i can't even",
        "not the tutorial i needed but the tutorial i deserved lol",
        "bestie you ate this up"
    ],
    "News article excerpts (journalistic)": [
        "The legislation passed with bipartisan support after months of negotiation.",
        "Critics argue the policy will disproportionately affect low-income families.",
        "Markets rallied on the announcement, with stocks closing at record highs."
    ],
    "Friend group chat (very informal)": [
        "bruh that was so extra lmaooo",
        "im literally dead rn 💀",
        "ugh monday again somebody save me"
    ],
    "Job application cover letters (formal)": [
        "I am writing to express my enthusiastic interest in the position.",
        "My experience in project management has prepared me well for this role.",
        "I am confident I can contribute meaningfully to your team."
    ]
}

print("MODEL MATCHING EXERCISE")
print("=" * 70)
print()
print(f"{'Scenario':<45} {'Reviews Model':<15} {'Twitter Model':<15}")
print("-" * 70)

for scenario, texts in scenarios.items():
    print(f"\n{scenario}")
    for text in texts:
        a = model_reviews(text)[0]
        b = model_twitter(text)[0]
        a_str = f"{a['label'][:3]} {a['score']:.0%}"
        b_str = f"{b['label'][:3]} {b['score']:.0%}"
        print(f"  {text[:41]:<43} {a_str:<15} {b_str:<15}")

### Discussion

Look at the results and think about these questions:

1. **Which model did better on formal text?** Why?
2. **Which model did better on slang and informal text?** Why?
3. **Were there any cases where BOTH models failed?** What kind of text was it?
4. **If you were building an app to analyze TikTok comments, which model would you pick?**
5. **If you were building an app to analyze customer feedback emails, which would you pick?**

The answer to "which model is better?" is always: **"Better at what?"**
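
If you want to turn "Better at what?" into numbers, one option is to mark each prediction as right or wrong by hand and then tally the marks per scenario. This is only a sketch: the functions `scoreboard` and `pick_model` and the records below are hypothetical placeholders, not results from the real models.

```python
from collections import defaultdict

def scoreboard(records):
    """records: list of (scenario, model_name, is_correct) tuples.
    Returns {scenario: {model_name: fraction_correct}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for scenario, model_name, is_correct in records:
        tally = counts[scenario][model_name]
        tally[0] += int(is_correct)
        tally[1] += 1
    return {
        scenario: {name: correct / total for name, (correct, total) in models.items()}
        for scenario, models in counts.items()
    }

def pick_model(board, scenario):
    """Model with the highest fraction for this scenario (ties go to the first seen)."""
    scores = board[scenario]
    return max(scores, key=scores.get)

# HYPOTHETICAL hand-marked records. Replace with your own judgments.
records = [
    ("TikTok comments", "reviews", False),
    ("TikTok comments", "twitter", True),
    ("TikTok comments", "reviews", True),
    ("TikTok comments", "twitter", True),
    ("Cover letters", "reviews", True),
    ("Cover letters", "twitter", False),
]

board = scoreboard(records)
for scenario in board:
    print(f"{scenario}: pick the {pick_model(board, scenario)} model")
```

With only three sentences per scenario, treat any winner as a hint rather than proof. More examples give a more trustworthy answer.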

### Student Challenge: Pick Your Domain

> **INSTRUCTOR NOTE:** Have each student pick a domain they care about -- their text messages, Discord messages, school emails, social media comments. Type in 3-4 examples and see which model handles their domain better.

In [None]:
student_domain = "My Domain: REPLACE WITH DESCRIPTION"

student_examples = [
    "REPLACE WITH EXAMPLE 1",
    "REPLACE WITH EXAMPLE 2",
    "REPLACE WITH EXAMPLE 3",
]

print(f"{student_domain}")
print("=" * 65)
print(f"{'Text':<40} {'Reviews Model':<15} {'Twitter Model':<15}")
print("-" * 65)

for text in student_examples:
    if "REPLACE" in text:
        continue
    a = model_reviews(text)[0]
    b = model_twitter(text)[0]
    a_str = f"{a['label'][:3]} {a['score']:.0%}"
    b_str = f"{b['label'][:3]} {b['score']:.0%}"
    print(f"{text[:38]:<40} {a_str:<15} {b_str:<15}")

> **ASK AI ABOUT THIS**
>
> After the model matching exercise, paste your results into Claude or ChatGPT and ask:
>
> *"I tested two sentiment models on different types of text. One was trained on movie reviews, the other on tweets. Here are the results: [paste results]. Which model would you recommend for analyzing [your use case], and why?"*
>
> This is how real programmers learn -- by asking questions about code they encounter.

---

## On Your Own (1:40-2:00)

### Experiment 1: Find Your Own Domain Shift

Find text from your own life -- group chats, school emails, social media posts, game chat -- and run it through both models. Document which model handles your data best.

In [None]:
my_texts = [
    "REPLACE WITH TEXT FROM YOUR LIFE 1",
    "REPLACE WITH TEXT FROM YOUR LIFE 2",
    "REPLACE WITH TEXT FROM YOUR LIFE 3",
    "REPLACE WITH TEXT FROM YOUR LIFE 4",
    "REPLACE WITH TEXT FROM YOUR LIFE 5",
]

print("MY DOMAIN SHIFT EXPERIMENT")
print("=" * 65)
print(f"{'Text':<40} {'Reviews Model':<15} {'Twitter Model':<15}")
print("-" * 65)

for text in my_texts:
    if "REPLACE" in text:
        continue
    a = model_reviews(text)[0]
    b = model_twitter(text)[0]
    a_str = f"{a['label'][:3]} {a['score']:.0%}"
    b_str = f"{b['label'][:3]} {b['score']:.0%}"
    print(f"{text[:38]:<40} {a_str:<15} {b_str:<15}")

### Experiment 2: The Sarcasm Challenge

Write 5 sarcastic sentences and run them through both models. Can either model detect sarcasm reliably?

In [None]:
sarcasm_test = [
    "REPLACE WITH SARCASTIC SENTENCE 1",
    "REPLACE WITH SARCASTIC SENTENCE 2",
    "REPLACE WITH SARCASTIC SENTENCE 3",
    "REPLACE WITH SARCASTIC SENTENCE 4",
    "REPLACE WITH SARCASTIC SENTENCE 5",
]

print("SARCASM CHALLENGE")
print("=" * 65)
print(f"{'Text':<40} {'Reviews Model':<15} {'Twitter Model':<15}")
print("-" * 65)

for text in sarcasm_test:
    if "REPLACE" in text:
        continue
    a = model_reviews(text)[0]
    b = model_twitter(text)[0]
    a_str = f"{a['label'][:3]} {a['score']:.0%}"
    b_str = f"{b['label'][:3]} {b['score']:.0%}"
    print(f"{text[:38]:<40} {a_str:<15} {b_str:<15}")

### Experiment 3: Model Recommendation Table

Based on everything you tested today, fill in this table:

| Data Type | Better Model | Why? |
|-----------|-------------|------|
| Movie reviews | | |
| Tweets and social media | | |
| Customer emails | | |
| Sarcastic text | | |
| News articles | | |
| Your own domain: _______ | | |

---

### Checklist: Before You Leave

- [ ] Tested the default model on text outside its training domain
- [ ] Saw how slang, sarcasm, and formal text cause different failures
- [ ] Compared two models trained on different data
- [ ] Understood that neither model is "better" -- they fit different domains
- [ ] Tested models on 5 real-world scenarios
- [ ] Found text from your own life and identified the best model for it
- [ ] Saved your work (File > Save a copy in Drive)

---

## Looking Ahead

Next session, we tackle one of the most important topics in AI: **bias**. Models learn from data, data comes from the real world, and the real world has biases. We will find those biases in real models, explore what happens when models are confident but wrong, and discuss who gets hurt when AI systems are unfair.

See you next session.

---

*Youth Horizons AI Researcher Program - Level 2*