# Session 5: Model Training and Parameters
## The Dials You Control

**Session Length:** 2 hours

**Today's Mission:** Discover that AI models have settings YOU can change -- called hyperparameters -- and see how those settings dramatically affect what the model produces.

### Session Outline
| Time | Activity |
|------|----------|
| 0:00-0:05 | Review: What supervised learning examples did you find? |
| 0:05-0:35 | Part 1: Hyperparameters -- The Dials You Control |
| 0:35-1:00 | Part 2: Parameters in Other Models |
| 1:00-1:40 | Part 3: Systematic Parameter Sweep |
| 1:40-2:00 | On Your Own: Extended parameter experiments |

### Key Vocabulary
| Term | Definition |
|------|-----------|
| Parameters | The millions or billions of numbers a model learned during training (fixed once training ends) |
| Hyperparameters | Settings YOU control that affect how the model behaves |
| Temperature | Controls how creative vs. predictable text generation is |
| Top-p | Limits the model to the smallest group of likely next words whose probabilities add up to p |
| Beam Search | A method where the model considers multiple options before committing |

---

## Review: What Supervised Learning Examples Did You Find? (0:00-0:05)

Last session we saw how machines learn from labeled examples -- training data goes in, patterns come out, and the model makes predictions on data it has never seen.

Today we flip the question around. Instead of asking "how does the model learn?", we ask: **"what can I control about how the model behaves?"**

The answer: more than you might think.

---

## Setup

Run this cell to install the libraries we need.

In [None]:
!pip install transformers==4.47.1 -q

### Important: Restart Your Runtime

After installing packages, you need to restart the runtime so Python can find them.

**Go to: Runtime > Restart runtime**

After restarting, come back here and continue running the cells below. You do NOT need to re-run the install cell -- the packages are already installed. Just start from the next code cell.

---

In [None]:
from transformers import pipeline
print("Ready!")

---

## Part 1: Hyperparameters -- The Dials You Control (0:05-0:35)

Here is the key idea for today:

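- The model has **millions to billions of learned parameters** (`distilgpt2`, which we use today, has about 82 million) -- numbers that store the patterns it picked up during training. They are fixed once training ends, so nothing we do today changes them.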
- But you CAN control settings that affect how the model **uses** those parameters. These settings are called **hyperparameters**.
- **Temperature** is a hyperparameter. It controls how creative vs. predictable the model is.

Think of it like a radio. The radio station (the model's learned knowledge) is fixed. But you control the volume, the bass, the treble. Those are the dials. Today we learn what the dials do.

> **FIND A MODEL: Text Generation**
>
> The model we'll use today (`distilgpt2`) is a small text generator. But there are many text-generation models on the Hub, each with different strengths.
>
> 1. Go to [huggingface.co/models?pipeline_tag=text-generation&sort=downloads](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
> 2. Browse the top results. Notice the range -- tiny models like `distilgpt2`, medium models, and huge ones that need GPUs.
> 3. Pick a model that says **"cpu"** or does NOT list a minimum GPU requirement. **Read its model card**: What was it trained on? What language? Any known limitations?
> 4. Copy the model ID -- you can try it in the swap slot below.
>
> **Important:** Many text-generation models require a GPU. Stick with small models (under 500M parameters) to run on free Colab CPU.

### Load a Text Generator

We will use `distilgpt2`, a small but real text generation model. It predicts what comes next in a sentence, one word at a time.

In [None]:
# ── SWAP SLOT: Text Generation Model ──
# Default: distilgpt2 (small, CPU-friendly, trained on web text)
# To use a model you found on the Hub, replace the ID below with yours.
# Example: "distilgpt2"  (82M params, CPU ok)
# Example: "sshleifer/tiny-gpt2"  (even smaller, fast)
my_model = "distilgpt2"

# Load whichever model ID is set above, so a custom model is never
# overwritten by the default further down the cell
generator = pipeline("text-generation", model=my_model)
print(f"Generator loaded: {my_model}")

### Temperature Experiment

Temperature controls how "adventurous" the model is when picking the next word.

- **Low temperature (0.3):** The model almost always picks the most probable next word. Safe, predictable, sometimes boring.
- **Medium temperature (0.7-1.0):** A balance between predictable and surprising.
- **High temperature (1.3):** The model picks less likely words more often. Creative, surprising, sometimes nonsensical.
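
Why does one number have this effect? Here is a minimal sketch, using made-up scores rather than anything taken from `distilgpt2`. The model gives every candidate word a raw score, and temperature divides those scores before they are turned into probabilities. The names `toy_words` and `toy_logits` and all the numbers are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by temperature."""
    scaled = [score / temperature for score in logits]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next words and raw scores (not real model output)
toy_words = ["a", "the", "nothing", "pizza"]
toy_logits = [3.0, 2.5, 1.0, 0.2]

for temp in [0.3, 1.0, 1.3]:
    probs = softmax_with_temperature(toy_logits, temp)
    row = ", ".join(f"{w}: {p:.1%}" for w, p in zip(toy_words, probs))
    print(f"temperature={temp}: {row}")
```

In this toy example the top word holds most of the probability at 0.3, while at 1.3 the less likely words get a much bigger share.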

Let's see this in action.

> **INSTRUCTOR NOTE:** Before running the cell, ask students: "If the prompt is 'The robot opened the door and saw,' what do you think it will say at temperature 0.3 vs. 1.3? More random? Weirder?" Let them guess, then run it.

In [None]:
prompt = "The robot opened the door and saw"

print("TEMPERATURE EXPERIMENT")
print("=" * 50)
print(f"Prompt: {prompt}\n")

for temp in [0.3, 0.7, 1.0, 1.3]:
    result = generator(
        prompt,
        max_length=50,
        do_sample=True,
        temperature=temp,
        num_return_sequences=1
    )
    creativity = "Conservative" if temp < 0.5 else "Balanced" if temp < 0.9 else "Creative" if temp < 1.1 else "Wild"
    print(f"\nTemp {temp} ({creativity}):")
    print(f"  {result[0]['generated_text']}")

Notice the pattern. Low-temperature text is safe and predictable, and with a small model like `distilgpt2` it may even start repeating itself. High-temperature text takes risks -- sometimes surprising in a good way, sometimes gibberish. This is the fundamental tradeoff in generative AI: **predictability vs. creativity**.

### Top-P Experiment

Top-p (also called "nucleus sampling") is a different kind of dial. Instead of adjusting how adventurous the model is, it limits **which words the model is even allowed to consider**.

- **Top-p = 0.5:** Only consider words that make up the top 50% of probability. Very restricted vocabulary.
- **Top-p = 0.95:** Consider words that make up the top 95% of probability. Almost everything is on the table.
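
To make the cutoff concrete, here is a rough sketch of the top-p idea on an invented probability table. It ranks the words from most to least likely and keeps adding them until their combined probability reaches `top_p`. The function name `nucleus` and the numbers in `toy_probs` are hypothetical.

```python
def nucleus(probs, top_p):
    """Return the smallest set of top-ranked words whose probabilities reach top_p."""
    ranked = sorted(probs.items(), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append(word)
        cumulative += p
        if cumulative >= top_p:
            break  # we have enough probability; ignore everything else
    return kept

# Hypothetical next-word probabilities (not real model output)
toy_probs = {"a": 0.45, "the": 0.30, "nothing": 0.15, "pizza": 0.07, "xylophone": 0.03}

for top_p in [0.5, 0.8, 0.95]:
    print(f"top_p={top_p}: model may choose from {nucleus(toy_probs, top_p)}")
```

The model then samples only from the words that survive the cut, which is why a low top-p makes the output more restricted.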

In [None]:
prompt = "Once upon a time in a magical forest,"

print("TOP-P EXPERIMENT")
print("=" * 50)
print(f"Prompt: {prompt}\n")

for top_p in [0.5, 0.8, 0.95]:
    result = generator(
        prompt,
        max_length=60,
        do_sample=True,
        top_p=top_p,
        num_return_sequences=1
    )
    print(f"\nTop-P {top_p}:")
    print(f"  {result[0]['generated_text']}")

### Try Your Own Prompt

Let's test temperature with a prompt from the class.

> **INSTRUCTOR NOTE:** Ask students to suggest a prompt. Type it into the cell below, then run the cell to see how temperature changes the output.

In [None]:
student_prompt = "REPLACE WITH STUDENT SUGGESTION"

print(f"Prompt: {student_prompt}\n")
for temp in [0.3, 0.7, 1.0, 1.3]:
    result = generator(
        student_prompt,
        max_length=50,
        do_sample=True,
        temperature=temp,
        num_return_sequences=1
    )
    creativity = "Conservative" if temp < 0.5 else "Balanced" if temp < 0.9 else "Creative" if temp < 1.1 else "Wild"
    print(f"Temp {temp} ({creativity}):")
    print(f"  {result[0]['generated_text']}\n")

> **READ THE MODEL CARD**
>
> We've been using `distilgpt2` -- but what was it actually trained on? Check the source:
>
> Go to [huggingface.co/distilgpt2](https://huggingface.co/distilgpt2)
>
> - What dataset was it trained on? (Hint: look for "OpenWebTextCorpus")
> - How many parameters does it have?
> - What does the "Limitations and bias" section say?
> - How does it compare to the full GPT-2 model?
>
> Model cards often document the choices the model's creators made during training, such as the dataset and training settings. Those choices are locked in; the generation settings you're experimenting with today are the dials left for you.

---

## Part 2: Parameters in Other Models (0:35-1:00)

Temperature and top-p are hyperparameters for text generation. But **every model has its own dials**. Let's explore a few different model types and see what you can control in each one.

### Summarization: Controlling Length and Quality

> **INSTRUCTOR NOTE:** For each experiment in Part 2, ask students to predict the result before running the cell. "What do you think will happen if we set max_length to 20 vs 100?"

In [None]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
print("Summarizer loaded!")

#### max_length: How Long Should the Summary Be?

The `max_length` hyperparameter tells the model the maximum number of tokens (roughly words) it can use in the summary. Watch how the same article gets compressed differently.

In [None]:
text = """Artificial intelligence has made remarkable progress in recent years. \
Large language models can now write essays, translate between languages, and even \
generate code. Computer vision systems can identify objects in photos with superhuman \
accuracy. Self-driving cars use AI to navigate roads, and recommendation systems \
powered by AI decide what shows you see on Netflix and what posts appear in your \
social media feeds. Despite these advances, AI still struggles with common sense \
reasoning, understanding context the way humans do, and explaining its own decisions. \
Researchers are working on making AI systems more transparent and trustworthy."""

print("SUMMARIZATION: max_length EXPERIMENT")
print("=" * 50)
print(f"Original text: {len(text.split())} words\n")

for length in [20, 40, 60, 100]:
    summary = summarizer(text, max_length=length, min_length=10)
    print(f"max_length={length}:")
    print(f"  {summary[0]['summary_text']}\n")

Notice how the model has to make harder decisions when the length is shorter. At max_length=20, it can only keep the most essential idea. At max_length=100, it can include nuance and detail. This is a real design decision -- do you want a tweet-length summary or a paragraph?

#### num_beams: Greedy vs. Beam Search

When the model generates a summary, it picks words one at a time. With **greedy search** (num_beams=1), it always picks the single most likely next word. With **beam search** (num_beams=4), it considers 4 different paths at each step and picks the best overall sequence.

Think of it like navigating a maze. Greedy search always turns toward the exit at every intersection. Beam search explores a few paths at once and picks the one that actually gets you out fastest.
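
Here is a small sketch of the difference, assuming a made-up two-step "model" stored in a dictionary (`toy_model` and its probabilities are invented for illustration). The same function does greedy search when `num_beams=1` and beam search when `num_beams` is larger.

```python
# Hypothetical model: maps the words so far to next-word probabilities
toy_model = {
    (): {"AI": 0.6, "Robots": 0.4},
    ("AI",): {"is": 0.5, "can": 0.5},
    ("Robots",): {"help": 0.9, "fail": 0.1},
}

def beam_search(model, num_beams, steps):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [((), 1.0)]  # (words so far, probability of that sequence)
    for _ in range(steps):
        candidates = []
        for prefix, prob in beams:
            for word, p in model[prefix].items():
                candidates.append((prefix + (word,), prob * p))
        # Rank every extension and keep only the best few
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

for beams in [1, 2]:
    words, prob = beam_search(toy_model, num_beams=beams, steps=2)
    print(f"num_beams={beams}: {' '.join(words)} (probability {prob:.2f})")
```

Greedy search grabs "AI" because it looks best at step one, then gets stuck with a weaker overall sentence. Beam search keeps "Robots" alive long enough to discover the stronger path. With real models beam search can still miss the best sequence; it only checks more options than greedy does.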

In [None]:
print("SUMMARIZATION: BEAM SEARCH EXPERIMENT")
print("=" * 50)

for beams in [1, 4]:
    summary = summarizer(text, max_length=50, min_length=15, num_beams=beams)
    label = "Greedy (first guess)" if beams == 1 else "Beam search (considers options)"
    print(f"\n{label} (num_beams={beams}):")
    print(f"  {summary[0]['summary_text']}")

> **ASK AI ABOUT THIS**
>
> Copy the summarization code into Claude or ChatGPT and ask:
>
> *"What is beam search and why does it sometimes produce better summaries? Explain it like I'm in high school."*
>
> This is how real programmers learn -- by asking questions about code they encounter.

### Tokenization Visualization

Before a model can reason over text, it breaks text into **tokens**.

- Word-level idea: split by words
- Subword-level idea: split rare words into pieces
- Character-level idea: split into letters/symbols

Most modern transformer models use subword tokenization.

> **INSTRUCTOR NOTE:** Use the `unhappiness` example below to show how one word can become multiple tokens.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

token_examples = [
    "I love cats",
    "What is unhappiness?",
    "AI models read tokens, not raw sentences."
]

print("TOKENIZATION EXAMPLES")
print("=" * 60)
# Use a separate loop variable so the article stored in `text` is not overwritten
for example in token_examples:
    tokens = tokenizer.tokenize(example)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"Text: {example}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs (first 10): {token_ids[:10]}")
    print()

### Zero-Shot Classification: Labels as Parameters

Here is a model where the "hyperparameter" is not a number -- it is the **set of categories** you give it. The same text can be classified completely differently depending on what labels you offer.

In [None]:
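# Name the model explicitly (this is the pipeline's default) so you know which model card to read
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")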
print("Classifier loaded!")

In [None]:
text = "The new iPhone has an amazing camera but the battery life is disappointing"

print("ZERO-SHOT: CHANGING LABELS")
print("=" * 50)
print(f"Text: {text}\n")

label_sets = [
    ["positive", "negative"],
    ["tech review", "personal opinion", "news"],
    ["praise", "criticism", "mixed"]
]

for labels in label_sets:
    result = classifier(text, labels)
    print(f"Labels: {labels}")
    for lbl, score in zip(result['labels'], result['scores']):
        print(f"  {lbl}: {score:.1%}")
    print()

Same text, completely different analysis -- just by changing the labels. The labels you choose are a design decision, and they dramatically affect what insights you get.
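
One reason the percentages shift so much: by default the zero-shot pipeline treats the labels as competing choices, so the scores it reports add up to 100% across whatever labels you offer. Below is a rough sketch of that effect with invented scores. `toy_scores` is hypothetical, not real model output, and `share_of_labels` is a name made up for this example.

```python
import math

def share_of_labels(scores, labels):
    """Split 100% among only the labels offered, using a softmax over their scores."""
    exps = {label: math.exp(scores[label]) for label in labels}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Hypothetical "how well does this label fit?" scores for one review
toy_scores = {"positive": 1.0, "negative": 1.0, "mixed": 3.0}

for labels in [["positive", "negative"], ["positive", "negative", "mixed"]]:
    shares = share_of_labels(toy_scores, labels)
    print(", ".join(f"{label}: {share:.1%}" for label, share in shares.items()))
```

With only two labels on offer, the review looks like a coin flip. Add a label that fits better and most of the probability moves to it, even though the text never changed.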

### Question Answering: Context as a Parameter

In question-answering, the model extracts answers from a passage you provide. The passage itself is a kind of parameter -- change the passage, change the answer.

In [None]:
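# Name the model explicitly (this is the pipeline's default) so you know which model card to read
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")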

question = "What is the main challenge?"

contexts = [
    "Climate change is the main challenge facing humanity. Rising temperatures threaten ecosystems worldwide.",
    "The main challenge in education is keeping students engaged. Traditional lectures often fail to hold attention.",
    "For software developers, the main challenge is managing complexity. As systems grow, bugs multiply."
]

print("QA: SAME QUESTION, DIFFERENT CONTEXTS")
print("=" * 50)
print(f"Question: {question}\n")

for i, context in enumerate(contexts, 1):
    result = qa(question=question, context=context)
    print(f"Context {i}: {context[:60]}...")
    print(f"  Answer: {result['answer']} (confidence: {result['score']:.1%})\n")

> **INSTRUCTOR NOTE:** Go to huggingface.co and navigate to a model card (e.g., distilgpt2). Show the "Usage" section. Point out that model creators often suggest parameter ranges -- what temperature works well, what max_length to use, etc. This is where professionals look for guidance.

---

## Part 3: Systematic Parameter Sweep (1:00-1:40)

So far we have been testing a few values by hand. But real researchers do this **systematically** -- they sweep through many values and record the results in a table. This lets you see patterns instead of guessing.

### Temperature Sweep

Let's run the same prompt through 7 different temperature values and compare the results side by side.

In [None]:
prompt = "The future of artificial intelligence is"

print(f"Prompt: {prompt}")
print()
print(f"{'Temp':<6} {'Output':<70}")
print("=" * 76)

for temp in [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3]:
    result = generator(prompt, max_length=40, do_sample=True, temperature=temp)
    # Collapse newlines and extra spaces so each result stays on one table row
    output = " ".join(result[0]['generated_text'][len(prompt):].split())[:65]
    print(f"{temp:<6} {output}")

### Observation Table

For each temperature in the sweep, rate the output on these dimensions:

| Temp | Coherent? (1-5) | Surprising? (1-5) | Grammatical? (1-5) | Useful? (1-5) |
|------|-----|------|------|------|
| 0.1 | | | | |
| 0.3 | | | | |
| 0.5 | | | | |
| 0.7 | | | | |
| 0.9 | | | | |
| 1.1 | | | | |
| 1.3 | | | | |
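
If you want a number to sit next to your own ratings, one simple option is to measure how repetitive an output is. This is only a sketch: `distinct_word_ratio` is a helper invented for this example, and the two sample strings are made up, not real model output.

```python
def distinct_word_ratio(text):
    """Fraction of words that are unique: 1.0 means no repeats, near 0 means very repetitive."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Made-up example outputs for illustration
samples = {
    "repetitive": "the robot is the robot is the robot is",
    "varied": "the robot painted a quiet blue harbor yesterday",
}

for name, text in samples.items():
    print(f"{name}: {distinct_word_ratio(text):.2f}")
```

A score like this does not tell you whether the text is good. It is just one more column you could add to the observation table.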

### The Big Takeaway

There is no "right" temperature. It depends on what you want:
- Writing a factual summary? **Low temperature** (0.2-0.4).
- Brainstorming creative ideas? **High temperature** (0.9-1.2).
- Chatbot conversation? **Medium temperature** (0.6-0.8).

This is a **design choice**, not a technical one. You are the designer.

### Student Sweep

Now let's do the same sweep with a prompt from the class.

> **INSTRUCTOR NOTE:** Ask students for a prompt. Type it below. Then run the sweep to see how temperature affects their specific prompt.

In [None]:
student_sweep_prompt = "REPLACE WITH STUDENT SUGGESTION"

print(f"Prompt: {student_sweep_prompt}")
print()
print(f"{'Temp':<6} {'Output':<70}")
print("=" * 76)

for temp in [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3]:
    result = generator(student_sweep_prompt, max_length=40, do_sample=True, temperature=temp)
    # Collapse newlines and extra spaces so each result stays on one table row
    output = " ".join(result[0]['generated_text'][len(student_sweep_prompt):].split())[:65]
    print(f"{temp:<6} {output}")

> **ASK AI ABOUT THIS**
>
> Copy your completed observation table into Claude or ChatGPT and ask:
>
> *"I ran a temperature sweep on a text generation model. Here are my ratings for each temperature. What temperature would you recommend for [describe your use case], and why?"*
>
> This is how real researchers choose settings -- by looking at their results and asking questions about them.

---

## On Your Own (1:40-2:00)

### Experiment 1: Your Own Temperature Sweep

Pick a prompt that interests you -- a story opening, a question, a description. Run it through the full temperature range.

In [None]:
my_prompt = "REPLACE WITH YOUR OWN PROMPT"

print(f"Prompt: {my_prompt}")
print()
print(f"{'Temp':<6} {'Output':<70}")
print("=" * 76)

for temp in [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3]:
    result = generator(my_prompt, max_length=40, do_sample=True, temperature=temp)
    # Collapse newlines and extra spaces so each result stays on one table row
    output = " ".join(result[0]['generated_text'][len(my_prompt):].split())[:65]
    print(f"{temp:<6} {output}")

### Experiment 2: Temperature + Top-P Together

What happens when you combine the two dials? Try different combinations.
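
Before you run the real experiment, here is a toy sketch of how the two dials can interact. It assumes temperature reshapes the probabilities first and the top-p cut is applied second; that ordering, the function names, and all the numbers are assumptions made for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Divide raw scores by temperature, then convert them to probabilities."""
    scaled = [score / temperature for score in logits]
    biggest = max(scaled)
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def keep_top_p(words, probs, top_p):
    """Keep the most likely words until their combined probability reaches top_p."""
    ranked = sorted(zip(words, probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append(word)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

# Hypothetical candidate words and raw scores (not real model output)
toy_words = ["a", "the", "nothing", "pizza", "xylophone"]
toy_logits = [3.0, 2.5, 1.0, 0.2, -1.0]

for temp in [0.3, 1.2]:
    probs = apply_temperature(toy_logits, temp)
    for top_p in [0.5, 0.9]:
        print(f"temp={temp}, top_p={top_p}: {keep_top_p(toy_words, probs, top_p)}")
```

In this toy setup, a higher temperature spreads the probability out, so the same top-p value lets more words through.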

In [None]:
my_prompt_2 = "REPLACE WITH YOUR OWN PROMPT"

print(f"Prompt: {my_prompt_2}")
print()
print(f"{'Temp':<6} {'Top-P':<6} {'Output':<60}")
print("=" * 72)

for temp in [0.3, 0.7, 1.2]:
    for top_p in [0.5, 0.9]:
        result = generator(
            my_prompt_2,
            max_length=40,
            do_sample=True,
            temperature=temp,
            top_p=top_p
        )
        # Collapse newlines and extra spaces so each result stays on one table row
        output = " ".join(result[0]['generated_text'][len(my_prompt_2):].split())[:55]
        print(f"{temp:<6} {top_p:<6} {output}")

### Experiment 3: Summarization Length vs. Quality

Try summarizing a paragraph you care about (a news article, a textbook passage, something you wrote) at different max_length values. When does the summary lose important information?

In [None]:
my_text = "REPLACE WITH A PARAGRAPH YOU WANT TO SUMMARIZE"

print("YOUR SUMMARIZATION EXPERIMENT")
print("=" * 50)
print(f"Original: {len(my_text.split())} words\n")

for length in [15, 30, 50, 80]:
    summary = summarizer(my_text, max_length=length, min_length=5)
    print(f"max_length={length}:")
    print(f"  {summary[0]['summary_text']}\n")

---

### Checklist: Before You Leave

- [ ] Understood the difference between parameters (learned) and hyperparameters (set by you)
- [ ] Ran temperature experiments and saw the creativity-predictability tradeoff
- [ ] Ran top-p experiments
- [ ] Explored hyperparameters in summarization (max_length, num_beams)
- [ ] Saw how zero-shot labels change classification results
- [ ] Completed a systematic parameter sweep
- [ ] Filled in the observation table
- [ ] Browsed text-generation models on the Hub and read a model card
- [ ] Tried your own prompt experiments
- [ ] Saved your work (File > Save a copy in Drive)

---

## Looking Ahead

Next session, we will learn how to **evaluate** models -- not just "does it work?" but "how well does it work, and how do we measure that?" We will compare multiple models head-to-head, learn about confidence scores, and discover why a model can be 95% accurate and still completely useless.

See you next session.

---

*Youth Horizons AI Researcher Program - Level 2*