# Session 3: Break It on Purpose

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/level-2-course-material/blob/main/session-03/notebook.ipynb)

In [None]:
!pip install -q transformers torch
from transformers import pipeline
import re
print("Setup complete!")

## What We Built Tonight

We looked for inputs that **break** the model, then wrote a `clean_text()` function to repair the input before the model sees it.

The pattern: **MESSY INPUT** -> **clean_text()** -> **CLEANER INPUT** -> **MODEL** -> **OUTPUT**

The model didn't get smarter -- we just gave it better input to work with.
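
Here is one way to write that pattern down as a single helper. It is only a sketch: `classify_with_cleaning` is a made-up name, and the `cleaner` and `classifier` arguments stand in for the `clean_text()` function and the zero-shot pipeline used in this notebook.

```python
from typing import Callable, Dict, List


def classify_with_cleaning(
    text: str,
    labels: List[str],
    cleaner: Callable[[str], str],
    classifier: Callable[..., Dict],
) -> Dict:
    """Clean the text first, then hand the cleaner version to the model."""
    cleaned = cleaner(text)                                # MESSY INPUT -> CLEANER INPUT
    result = classifier(cleaned, candidate_labels=labels)  # CLEANER INPUT -> MODEL
    # OUTPUT: keep the cleaned text next to the scores so both can be inspected
    return {"cleaned": cleaned, "labels": result["labels"], "scores": result["scores"]}


# Usage once clean_text and classifier exist in the notebook:
# result = classify_with_cleaning(messy_text, ["silly", "serious"], clean_text, classifier)
```

Passing the cleaner in as an argument makes it easy to swap in a different cleaning function later and compare.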

## Load the Model

In [None]:
classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3")
print("Model loaded!")

## Before Cleaning: Try Messy Input

Run the model on messy text and see what happens.

In [None]:
# Messy input -- ALL CAPS, repeated characters, and an abbreviation ("Dr.")
messy_text = "OMG THIS IS SOOOOO AMAZING I CANT EVEN BELIEVE IT Dr. Smith said the results were incredible"

result = classifier(
    messy_text,
    candidate_labels=["silly and ridiculous", "serious and important", "ordinary and boring"]
)

print("MESSY input results:")
for label, score in zip(result["labels"], result["scores"]):
    print(f"  {label}: {score:.0%}")

## The clean_text() Function

This function cleans up messy input **before** the model sees it. Run this cell to define it.

In [None]:
def clean_text(text):
    # 1. Strip leading/trailing whitespace
    text = text.strip()

    # 2. Collapse multiple spaces into one
    text = re.sub(r' {2,}', ' ', text)

    # 3. Limit repeated characters ("sooooo" -> "soo"); this matches any character, including digits and punctuation
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # 4. Expand common abbreviations
    abbreviations = {
        "Rep.": "Representative",
        "Dr.": "Doctor",
        "Mr.": "Mister",
        "Gov.": "Governor",
        "St.": "Saint",
    }
    for abbr, full in abbreviations.items():
        text = text.replace(abbr, full)

    # 5. Replace emoji with a space (covers the most common Unicode emoji blocks, not all of them)
    text = re.sub(
        r'[\U0001F600-\U0001F64F'
        r'\U0001F300-\U0001F5FF'
        r'\U0001F680-\U0001F6FF'
        r'\U00002600-\U000026FF]+',
        ' ', text
    )

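    # 6. Normalize ALL CAPS to Title Case (only when more than 3 words are all caps)
    words = text.split()
    caps_count = sum(1 for w in words if w.isupper() and len(w) > 1)
    if caps_count > 3:
        # capitalize() per word avoids str.title() turning "CAN'T" into "Can'T"
        text = " ".join(w.capitalize() for w in words)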

    # Clean up any extra spaces from cleaning
    text = re.sub(r' {2,}', ' ', text).strip()

    return text

print("clean_text() is ready!")
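
A good way to find what breaks a cleaning step is to try it on its own. The sketch below copies only the repeated-character regex from step 3 into a small helper (`squeeze_repeats` is an illustrative name, not part of the notebook) and feeds it a few inputs, including ones where shrinking repeats is probably not what you want.

```python
import re


def squeeze_repeats(text):
    # Same pattern as step 3: (.) captures one character, \1{2,} matches
    # two or more further copies of it, and the run is replaced by exactly two.
    return re.sub(r'(.)\1{2,}', r'\1\1', text)


for sample in ["sooooo good", "call 1000 times!!!", "www.example.com"]:
    print(f"{sample!r} -> {squeeze_repeats(sample)!r}")
```

The first input is cleaned as intended, but the number and the web address are changed too, because the pattern does not know the difference between a stretched word and a legitimate run of characters.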

## After Cleaning: Compare

Run the same messy text through `clean_text()` first, then through the model.

In [None]:
messy_text = "OMG THIS IS SOOOOO AMAZING I CANT EVEN BELIEVE IT Dr. Smith said the results were incredible"

cleaned = clean_text(messy_text)
print(f"Before: {messy_text}")
print(f"After:  {cleaned}")
print()

result = classifier(
    cleaned,
    candidate_labels=["silly and ridiculous", "serious and important", "ordinary and boring"]
)

print("CLEANED input results:")
for label, score in zip(result["labels"], result["scores"]):
    print(f"  {label}: {score:.0%}")
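
To see the before and after scores side by side, you could line them up by label. The zero-shot pipeline returns its labels sorted by score, so the order can differ between the two runs and matching by position would be misleading. This is a minimal sketch; `compare_scores` is a hypothetical helper, and the numbers in the demo are made up for illustration, not real model output.

```python
def compare_scores(before, after):
    """Return {label: (score_before, score_after)}, matched by label name."""
    before_scores = dict(zip(before["labels"], before["scores"]))
    after_scores = dict(zip(after["labels"], after["scores"]))
    return {label: (before_scores[label], after_scores[label]) for label in before_scores}


# Made-up results in the same shape the pipeline returns (NOT real scores)
fake_before = {"labels": ["silly", "serious"], "scores": [0.7, 0.3]}
fake_after = {"labels": ["serious", "silly"], "scores": [0.6, 0.4]}

for label, (b, a) in compare_scores(fake_before, fake_after).items():
    print(f"  {label}: {b:.0%} -> {a:.0%} ({a - b:+.0%})")
```

In the notebook you would save the first result under a different name (for example `messy_result`) before running the cleaned version, then pass both to the helper.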

## Experiments

### Experiment 1: What breaks the model?

**Try this:** Feed the model the worst text you can think of. Try ALL CAPS, emoji floods, repeated characters, very short input.

In [None]:
# Try this: type something messy and see the before/after
my_messy_text = "soooooooo boooooored nothing ever happens in this townnnnnn"

cleaned = clean_text(my_messy_text)
print(f"Before: {my_messy_text}")
print(f"After:  {cleaned}")
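
If you want to try many messy inputs at once, a small loop can report which ones the cleaner actually changed. This is a sketch under assumptions: `report_changes` is a made-up helper, and the demo uses `str.strip` as a stand-in cleaner so the cell runs on its own. In the notebook you would pass `clean_text` instead.

```python
def report_changes(texts, cleaner):
    """Run each text through cleaner and record whether anything changed."""
    rows = []
    for text in texts:
        cleaned = cleaner(text)
        rows.append({"before": text, "after": cleaned, "changed": cleaned != text})
    return rows


# Stand-in cleaner (str.strip) just for the demo; swap in clean_text
samples = ["   lots of space   ", "already tidy", "ok"]
for row in report_changes(samples, str.strip):
    status = "CHANGED" if row["changed"] else "same"
    print(f"[{status}] {row['before']!r} -> {row['after']!r}")
```

Inputs marked `same` are the interesting ones: either they were already clean, or they contain a kind of mess the cleaner does not handle.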

### Experiment 2: What can't cleaning fix?

**Try this:** Test sarcasm, double negatives, or mixed languages. Does cleaning help at all?

In [None]:
# Try this: sarcasm -- can cleaning help?
sarcastic = "Oh great, another Monday. I just LOVE waking up early."

cleaned = clean_text(sarcastic)
print(f"Before: {sarcastic}")
print(f"After:  {cleaned}")
print()

result = classifier(
    cleaned,
    candidate_labels=["happy", "sarcastic", "angry"]
)
for label, score in zip(result["labels"], result["scores"]):
    print(f"  {label}: {score:.0%}")

### Experiment 3: Add your own cleaning step

**Try this:** Edit the function below to add a new cleaning step. Ideas: strip the `#` from hashtags, expand contractions, remove URLs.

In [None]:
def my_clean_text(text):
    # Start with the same steps as before
    text = text.strip()
    text = re.sub(r' {2,}', ' ', text)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # YOUR NEW STEP HERE: try adding something!
    # Example: strip the "#" from hashtags but keep the word ("#blessed" -> "blessed")
    # text = re.sub(r'#(\w+)', r'\1', text)

    return text

# Test it
test = "#blessed soooo happy right now #goodvibes"
print(f"Before: {test}")
print(f"After:  {my_clean_text(test)}")
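
Two of the other ideas for this experiment, removing URLs and expanding contractions, could be sketched like this. Everything here is an assumption for illustration: the helper names, the URL pattern, and the tiny contraction list are not part of the original notebook, and a real list would be much longer.

```python
import re

# Small hand-picked mapping (assumption); extend it with whatever you need
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# Whole-word, case-insensitive match for any key in CONTRACTIONS
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def remove_urls(text):
    # Drop anything starting with http://, https://, or www. up to the next whitespace
    return re.sub(r'(https?://|www\.)\S+', '', text)


def expand_contractions(text):
    # Look up the lowercase form; the replacement is always lowercase
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


sample = "I can't stop reading https://example.com/post it's great"
cleaned = expand_contractions(remove_urls(sample))
cleaned = re.sub(r' {2,}', ' ', cleaned).strip()  # tidy the gap the URL left behind
print(cleaned)
```

Note that this only matches the straight apostrophe (`'`). Text copied from phones or word processors often uses a curly one, which would be another edge case to handle.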

## Challenge

Find an input that `clean_text()` makes better **and** an input where cleaning makes no difference. What does that tell you about the limits of data cleaning?

---

**GitHub skill:** Upload this notebook to your `my-ai-portfolio` repo:
1. Go to your repo on github.com
2. Click **Add file** -> **Upload files**
3. Drag your `.ipynb` file and click **Commit changes**

## Vocabulary

| Term | Meaning |
|------|--------|
| **Data cleaning** | Preprocessing text to remove noise before the model sees it |
| **Preprocessing** | Any transformation applied to data before the model processes it |
| **Feature engineering** | Transforming raw input into a form the model can work with more easily |
| **Noise** | Stuff in the input that confuses the model (emoji, extra spaces, weird formatting) |
| **Edge case** | An unusual input that causes unexpected behavior |