# Week 6: Going Deeper - Loading Models with Code
## Understanding What's Under the Hood

**Today's Goals:**
1. Understand the difference between `pipeline` (easy) and manual loading (more control)
2. Learn about tokenizers - how models "read" text
3. Load and configure models manually
4. Understand model outputs in detail

---

## Part 1: Pipeline vs Manual Loading

Last week we used `pipeline()` - it's like using a microwave:
- Put food in, press button, get hot food
- Easy, but limited control

Today we'll learn to "cook from scratch":
- More steps, but you control everything
- Understand what's really happening

**When to use which:**
- **Pipeline**: Quick experiments, standard tasks
- **Manual**: Custom behavior, learning how it works, fine-tuning

## Setup

In [None]:
!pip install transformers -q
!pip install torch -q
print("Ready!")

---
## Part 2: What is a Tokenizer?

AI models don't read words - they read **numbers**.

A **tokenizer** converts:
- Text → Numbers (for the model to read)
- Numbers → Text (for us to read the output)

Let's see it in action:

In [None]:
from transformers import AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize some text
text = "Hello, I am learning about AI!"
tokens = tokenizer(text)

print(f"Original text: {text}")
print(f"\nTokenized (numbers): {tokens['input_ids']}")

In [None]:
# See what each number represents
print("What each token means:")
for token_id in tokens['input_ids']:
    word = tokenizer.decode([token_id])
    print(f"  {token_id} → '{word}'")

### Special Tokens:

Notice the special tokens:
- `[CLS]` (101) - Start of sequence
- `[SEP]` (102) - End of sequence
- `[PAD]` (0) - Padding, added so that texts in a batch have equal lengths (not shown above, because we tokenized only one text)

These help the model understand where sentences begin and end.
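
To make the special tokens less mysterious, here is a minimal sketch of roughly what a tokenizer does when it wraps and pads a sequence. The function name `add_special_tokens` and the word IDs are made up for illustration. The special token IDs come from the list above, and the `attention_mask` mirrors the field of the same name that the real tokenizer returns.

```python
# Special token IDs, taken from the list above.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def add_special_tokens(word_ids, max_length):
    """Hypothetical helper: wrap word IDs in [CLS] ... [SEP], then pad to max_length."""
    ids = [CLS_ID] + word_ids + [SEP_ID]
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(ids)
    # 1 = real token the model should read, 0 = padding it should ignore
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [PAD_ID] * n_pad, attention_mask

# Made-up word IDs standing in for two texts of different lengths.
short_text = [7592, 2088]
long_text = [1045, 2293, 4083, 2055]

for word_ids in (short_text, long_text):
    input_ids, attention_mask = add_special_tokens(word_ids, max_length=6)
    print(input_ids, attention_mask)
```

The shorter text gets two `[PAD]` tokens so both lists end up the same length, and the zeros in the mask mark exactly those padded positions.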

In [None]:
# Try different words - see how they're tokenized
# (this model is "uncased", so it lowercases text before tokenizing)
words = ["hello", "Hello", "HELLO", "artificial", "AI", "🤖"]

print("How different words get tokenized:")
for word in words:
    tokens = tokenizer(word)['input_ids']
    # Remove special tokens for clarity
    tokens = tokens[1:-1]
    print(f"  '{word}' → {tokens}")

### Key Insight:

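Some words get split into multiple tokens! This is called **subword tokenization**.
- A word like "artificial" might become ["art", "##ificial"] (the `##` marks a piece that continues the previous one); whether a word splits depends on the tokenizer's vocabulary
- This helps the model handle words it hasn't seen before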
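
Here is a small sketch of how that splitting can work. It uses a greedy "longest match first" rule, in the style of the WordPiece tokenizers used by BERT-family models. The tiny vocabulary and the function name `split_into_subwords` are made up for illustration; a real tokenizer has tens of thousands of entries and more careful rules.

```python
# Hypothetical mini-vocabulary, just for this demo.
VOCAB = {"art", "##ificial", "learn", "##ing", "hello", "[UNK]"}

def split_into_subwords(word, vocab=VOCAB):
    """Greedy longest-match-first splitting of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest possible piece first, then shrink it from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # pieces after the first get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is "unknown"
        pieces.append(match)
        start = end
    return pieces

for word in ["hello", "artificial", "learning", "robot"]:
    print(word, "->", split_into_subwords(word))
```

With this toy vocabulary, "hello" stays whole, "artificial" and "learning" each split into two pieces, and "robot" falls back to `[UNK]` because none of its pieces are in the vocabulary.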

---
## Part 3: Loading a Model Manually

Now let's load the actual model:

In [None]:
from transformers import AutoModelForSequenceClassification
import torch

# Load a sentiment analysis model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print(f"Model loaded: {model_name}")
print(f"Model type: {type(model).__name__}")

### Using the Model Step by Step:

In [None]:
# Step 1: Prepare the text
text = "I love learning about AI!"

# Step 2: Tokenize (convert to numbers)
inputs = tokenizer(text, return_tensors="pt")  # pt = PyTorch tensors
print("Step 2 - Tokenized input:")
print(f"  input_ids: {inputs['input_ids']}")
print(f"  attention_mask: {inputs['attention_mask']}")

In [None]:
# Step 3: Run through the model
with torch.no_grad():  # We're not training, just predicting
    outputs = model(**inputs)

print("Step 3 - Raw model output:")
print(f"  logits: {outputs.logits}")

In [None]:
# Step 4: Convert to probabilities
probabilities = torch.softmax(outputs.logits, dim=1)
print("Step 4 - Probabilities:")
print(f"  NEGATIVE: {probabilities[0][0]:.2%}")
print(f"  POSITIVE: {probabilities[0][1]:.2%}")

# Step 5: Get the prediction
prediction = torch.argmax(probabilities, dim=1).item()  # index of the highest probability
labels = ["NEGATIVE", "POSITIVE"]  # same order as model.config.id2label
print(f"\nFinal prediction: {labels[prediction]}")

### The Full Process:

```
Text → Tokenizer → Numbers → Model → Logits → Softmax → Probabilities → Prediction
```

**Pipeline does all this automatically!** But now you understand what's happening.
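
The "Logits → Softmax → Probabilities → Prediction" part of that chain is just a little math. Here is a sketch of it in plain Python so you can see it has no magic in it. The logits are made-up numbers, not real model output, and the `softmax` function is a hand-written stand-in for `torch.softmax`.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that add up to 1."""
    highest = max(logits)
    # Subtracting the highest score keeps the exponentials from getting huge.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.0, 3.0]  # made-up scores for NEGATIVE and POSITIVE
labels = ["NEGATIVE", "POSITIVE"]

probs = softmax(logits)
prediction = labels[probs.index(max(probs))]  # label with the highest probability

for label, p in zip(labels, probs):
    print(f"{label}: {p:.2%}")
print("Prediction:", prediction)
```

The bigger logit always gets the bigger probability, so softmax never changes which label wins. It only turns the scores into numbers we can read as confidence.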

---
## Part 4: Let's Make a Function

Let's wrap this in a reusable function:

In [None]:
def predict_sentiment(text, show_details=False):
    """Predict sentiment of text with optional details."""
    
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get probabilities
    probs = torch.softmax(outputs.logits, dim=1)[0]
    
    # Get prediction
    pred_idx = torch.argmax(probs).item()
    labels = ["NEGATIVE", "POSITIVE"]
    
    if show_details:
        print(f"Text: {text}")
        print(f"Tokens: {len(inputs['input_ids'][0])} tokens")
        print(f"NEGATIVE prob: {probs[0]:.2%}")
        print(f"POSITIVE prob: {probs[1]:.2%}")
        print(f"Prediction: {labels[pred_idx]}")
        print()
    
    return labels[pred_idx], probs[pred_idx].item()

# Test it
predict_sentiment("This is amazing!", show_details=True)
predict_sentiment("I hate waiting in line.", show_details=True)

---
## Part 5: The Auto Classes

Hugging Face has "Auto" classes that pick the right model type automatically:

| Auto Class | Use For |
|------------|--------|
| `AutoTokenizer` | Any tokenizer |
| `AutoModel` | Base model (hidden states only, no task-specific head) |
| `AutoModelForSequenceClassification` | Text classification |
| `AutoModelForQuestionAnswering` | Q&A tasks |
| `AutoModelForCausalLM` | Text generation |
| `AutoModelForSeq2SeqLM` | Translation, summarization |

In [None]:
# Example: Load a Q&A model manually
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

qa_model_name = "distilbert-base-cased-distilled-squad"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)

print(f"Loaded Q&A model: {qa_model_name}")

In [None]:
# Use the Q&A model
context = "The Eiffel Tower is located in Paris, France. It was built in 1889."
question = "Where is the Eiffel Tower?"

# Tokenize both question and context
inputs = qa_tokenizer(question, context, return_tensors="pt")

# Get answer
with torch.no_grad():
    outputs = qa_model(**inputs)

# Find the answer span
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1

# Decode the answer
answer_tokens = inputs['input_ids'][0][answer_start:answer_end]
answer = qa_tokenizer.decode(answer_tokens)

print(f"Question: {question}")
print(f"Answer: {answer}")
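
The "find the answer span" step can be sketched without a model. A Q&A model gives every token a "start" score and an "end" score, and we take the highest of each. The tokens, the scores, and the function name `pick_answer_span` below are all made up for illustration.

```python
def pick_answer_span(tokens, start_logits, end_logits):
    """Hypothetical helper: pick the tokens between the best start and best end."""
    start = start_logits.index(max(start_logits))
    end = end_logits.index(max(end_logits))
    if end < start:
        return ""  # the scores disagree, so there is no valid span
    return " ".join(tokens[start:end + 1])  # +1 so the end token is included

# Made-up tokens and scores: question first, then context, separated by [SEP].
tokens = ["[CLS]", "where", "?", "[SEP]", "the", "tower", "is", "in", "paris", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 0.2, 0.3, 0.1, 0.5, 4.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 0.1, 0.2, 0.1, 0.3, 4.5, 0.2]

print(pick_answer_span(tokens, start_logits, end_logits))
```

This is the same idea as the `+ 1` on `answer_end` in the cell above: Python slices stop just before the end index, so we add one to keep the last answer token.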

---
## Part 6: Comparing Pipeline vs Manual

Let's compare the two approaches side by side:

In [None]:
from transformers import pipeline

# Pipeline approach (easy)
print("=== PIPELINE APPROACH ===")
classifier = pipeline("sentiment-analysis", model=model_name)  # same model as the manual approach
result = classifier("I love this!")
print(f"Result: {result}")
print()

In [None]:
# Manual approach (more control)
print("=== MANUAL APPROACH ===")
label, confidence = predict_sentiment("I love this!", show_details=True)

### When to Use Each:

**Use Pipeline when:**
- Quick prototyping
- Standard tasks
- You just need the answer

**Use Manual when:**
- You need custom processing
- You want to understand the internals
- You're preparing for fine-tuning
- You need specific outputs (embeddings, attention, etc.)

---
## Challenge: Build Your Own Classifier Function

Create a function that:
1. Takes a list of texts
2. Returns predictions for all of them
3. Shows a summary at the end

**Hint:** Ask AI to help you build on the `predict_sentiment` function!

In [None]:
# Your challenge code here!
# Try asking AI: "Modify this function to handle a list of texts and show statistics"

def analyze_many_texts(texts):
    """Analyze multiple texts and show summary."""
    # Your code here!
    pass

# Test it
test_texts = [
    "I love this product!",
    "This is terrible.",
    "It's okay, nothing special.",
    "Best purchase ever!",
    "Waste of money."
]

# analyze_many_texts(test_texts)

---
## Key Takeaways

### The Loading Pattern:
```python
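# Swap in the Auto class for your task, e.g. AutoModelForQuestionAnswering
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # or another model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)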
```

### The Prediction Pattern:
```python
import torch

# 1. Tokenize
inputs = tokenizer(text, return_tensors="pt")

# 2. Predict
with torch.no_grad():
    outputs = model(**inputs)

# 3. Process outputs
probabilities = torch.softmax(outputs.logits, dim=1)
```

---
## Checklist: What You Learned Today

- [ ] What a tokenizer does (text ↔ numbers)
- [ ] How to load models manually with Auto classes
- [ ] The full prediction flow: tokenize → model → softmax → prediction
- [ ] When to use pipeline vs manual loading
- [ ] How to wrap model code in reusable functions

---

## Looking Ahead: Next Week

Next week we'll explore **image models**:
- Image classification (what's in this picture?)
- How vision models "see" images
- Loading and using image AI

**Homework (optional):**
- Complete the challenge above
- Try loading a different model type manually
- Save your work to GitHub!

---

*Youth Horizons AI Researcher Program - Level 2*