# Instruction Tuning (SFT)

In this notebook, you'll perform supervised finetuning (SFT) on a base language model and observe the behavioral transformation from text completer to instruction follower.

**What you'll do:**
- Load a base model and observe that it does NOT follow instructions: it completes text
- Prepare an instruction-following dataset in chat format (system/user/assistant turns)
- Implement the SFT training loop with loss masking on prompt tokens
- Compare before/after model responses on held-out prompts
- Experiment with dataset quality: noisy vs. clean data

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones because they reveal gaps in your mental model.

In [None]:
# Setup: self-contained for Google Colab
!pip install -q transformers datasets accelerate peft

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import matplotlib.pyplot as plt
import json
import copy

# Reproducibility
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')

print('\nSetup complete.')

---

## Exercise 1: Base Model Behavior - Text Completion, Not Instruction Following (Guided)

The lesson showed the core insight: a base model is a text completer. It predicts the next token. When you give it an instruction, it does not *follow* it; it *continues* it as if it were part of a document.

We'll load GPT-2 (a base model) and send it several instruction-style prompts.

**Before running, predict:** When you send GPT-2 the prompt `"Write a haiku about machine learning."`, what will it generate? Will it produce a haiku, or something else? What about `"What is the capital of France?"`?

In [None]:
# Load GPT-2, a base model (not instruction-tuned)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# GPT-2 has no pad token by default, so reuse the EOS token
tokenizer.pad_token = tokenizer.eos_token

print(f"Model: {model_name}")
print(f"Parameters: {sum(p.numel() for p in base_model.parameters()) / 1e6:.1f}M")
print(f"Vocabulary size: {tokenizer.vocab_size}")

In [None]:
def generate_response(model, tokenizer, prompt, max_new_tokens=80):
    """Generate text from a prompt. Returns only the NEW tokens (not the prompt)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_length = inputs["input_ids"].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Greedy for reproducibility
            temperature=1.0,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the generated part (not the prompt)
    generated_ids = outputs[0][prompt_length:]
    return tokenizer.decode(generated_ids, skip_special_tokens=True)


# Test prompts: things you'd ask an instruction-following model
test_prompts = [
    "Write a haiku about machine learning.",
    "What is the capital of France?",
    "Explain why the sky is blue in one sentence.",
    "List three benefits of exercise.",
]

print("=" * 60)
print("BASE MODEL (GPT-2) RESPONSES")
print("=" * 60)

for prompt in test_prompts:
    response = generate_response(base_model, tokenizer, prompt)
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response[:200]}")
    print("-" * 40)

**What you just observed:** GPT-2 does not follow instructions. It treats each prompt as a document fragment and continues it. For example, "Write a haiku" can turn into part of an article *about* haiku writing, and "What is the capital of France?" can turn into one quiz question in a textbook. The exact continuations you see may differ, but none of them is a direct answer.

The model **has knowledge** (it knows Paris is the capital of France) but it does not have the **behavior** of answering questions directly. It is a text completer, not an instruction follower.

SFT will change this behavior using the exact same model architecture and loss function.

---

## Exercise 2: Prepare an Instruction Dataset in Chat Format (Guided)

SFT data is instruction-response pairs. But the model needs to know where the instruction ends and the response begins. That's what **chat templates** and **special tokens** do: they are structural delimiters the model learns during SFT.

We'll load a real instruction dataset (Alpaca format) and convert each example into chat-template format with special tokens.

**Before running, predict:** An Alpaca-format example has fields `instruction`, `input`, and `output`. When we convert it to chat format with `<|im_start|>` and `<|im_end|>` tokens, what will the resulting string look like? How many special tokens will there be per example?

In [None]:
# Load a small instruction dataset
# tatsu-lab/alpaca is the classic 52K instruction dataset from Stanford
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Dataset size: {len(dataset)} examples")
print(f"Fields: {list(dataset[0].keys())}")

# Look at a few raw examples
print("\n" + "=" * 60)
print("RAW ALPACA EXAMPLES")
print("=" * 60)
for i in range(3):
    ex = dataset[i]
    print(f"\n--- Example {i} ---")
    print(f"instruction: {ex['instruction'][:100]}")
    print(f"input:       {ex['input'][:100] if ex['input'] else '(none)'}")
    print(f"output:      {ex['output'][:100]}")

In [None]:
# Convert Alpaca format to ChatML format with special tokens
#
# ChatML uses:
#   <|im_start|>role\ncontent<|im_end|>
#
# The model learns that after <|im_start|>assistant\n it should generate a response.

CHAT_TEMPLATE = """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>"""


def format_example(example):
    """Convert an Alpaca example to ChatML format."""
    instruction = example["instruction"]
    if example["input"]:
        instruction = f"{instruction}\n\n{example['input']}"
    return CHAT_TEMPLATE.format(
        instruction=instruction,
        response=example["output"],
    )


# Format a few examples and inspect
print("FORMATTED EXAMPLE (ChatML):")
print("=" * 60)
formatted = format_example(dataset[0])
print(formatted)
print("=" * 60)

# Count special tokens in the formatted string
n_im_start = formatted.count("<|im_start|>")
n_im_end = formatted.count("<|im_end|>")
print(f"\nSpecial tokens per example: {n_im_start} <|im_start|>, {n_im_end} <|im_end|>")
print("Roles: system, user, assistant (three turns)")
print(f"\nThe model will learn: after '<|im_start|>assistant\\n', generate a response.")

**What you just built:** A function that converts raw instruction/response pairs into the ChatML template format. The special tokens `<|im_start|>` and `<|im_end|>` are structural delimiters: the model will learn during SFT that they mark role boundaries.

These special tokens have **no pretrained meaning**. They did not exist in GPT-2's pretraining data. They will acquire meaning entirely from the SFT training data, where they consistently appear as boundaries between roles.
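
To make the "structural delimiter" idea concrete, here is a minimal sketch that splits a ChatML string back into (role, content) turns with a regular expression. The helper name `parse_chatml` is hypothetical and not part of the notebook or of any library, and it assumes well-formed input in exactly the template used above.

```python
import re

# Hypothetical helper (not part of the notebook): split a ChatML string
# into (role, content) pairs. Assumes well-formed ChatML as produced by
# the template above.
TURN_PATTERN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(text):
    """Return a list of (role, content) tuples, one per turn."""
    return [(m.group(1), m.group(2)) for m in TURN_PATTERN.finditer(text)]


example = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nName a primary color.<|im_end|>\n"
    "<|im_start|>assistant\nRed.<|im_end|>"
)

turns = parse_chatml(example)
for role, content in turns:
    print(f"{role!r}: {content!r}")
```

The delimiters alone are enough to recover who said what, which is the same boundary information the model has to pick up from the training data.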

---

## Exercise 3: Implement the SFT Training Loop (Supported)

Now the core exercise: implement SFT. Remember from the lesson: the training loop is the **same heartbeat** as pretraining and classification finetuning:

1. Forward pass
2. Compute loss (cross-entropy on next-token prediction)
3. Zero gradients
4. Backward
5. Step

The **one new mechanical concept** is **loss masking**: compute loss only on response tokens (not prompt tokens). Prompt tokens get label `-100`, which `CrossEntropyLoss` ignores by default (`ignore_index=-100`).

You'll fill in the TODOs for:
- Adding special tokens to the tokenizer and resizing the model embeddings
- Implementing the loss masking logic
- Writing the training loop

<details>
<summary>💡 Solution</summary>

The key insights:

1. **Adding special tokens** requires both extending the tokenizer vocabulary AND resizing the model's embedding matrix. The new token embeddings are freshly initialized (the exact scheme depends on the `transformers` version) and will learn their meaning during SFT.

2. **Loss masking** sets labels to -100 for all prompt tokens. The boundary is where `<|im_start|>assistant\n` ends: everything up to and including that marker is prompt, everything after it is response. PyTorch's `CrossEntropyLoss` ignores -100 indices by default.

3. **The training loop** is identical to what you've written before. Forward, loss, zero_grad, backward, step. The only difference is the data going in.

```python
# Adding special tokens:
special_tokens = {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))

# Loss masking: find where the assistant response starts
assistant_marker = "<|im_start|>assistant\n"
marker_ids = tokenizer.encode(assistant_marker, add_special_tokens=False)
# Find marker position in token_ids, set labels[:marker_end] = -100

# Training loop:
for step, batch in enumerate(dataloader):
    input_ids = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)
    outputs = model(input_ids=input_ids, labels=labels)
    loss = outputs.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Common mistake: forgetting to resize embeddings after adding special tokens. The model will crash because token IDs exceed the embedding matrix dimensions.

</details>
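
Why the resize step matters can be seen with a toy embedding table. This is a plain-Python stand-in, not the `transformers` implementation: the vocabulary is a dict, the embedding matrix is a list of rows, and `add_tokens` and `resize_embeddings` are hypothetical helpers written for this illustration.

```python
import random

rng = random.Random(0)
EMBED_DIM = 4

# Toy vocabulary and embedding table: one row per token ID.
toy_vocab = {"hello": 0, "world": 1, "<|endoftext|>": 2}
toy_embeddings = [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in toy_vocab]


def add_tokens(vocab, new_tokens):
    """Assign the next free IDs to tokens not already in the vocabulary."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            added += 1
    return added


def resize_embeddings(embeddings, new_size, dim=EMBED_DIM):
    """Append freshly initialized rows until the table has new_size rows."""
    while len(embeddings) < new_size:
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])


n_added = add_tokens(toy_vocab, ["<|im_start|>", "<|im_end|>"])
new_id = toy_vocab["<|im_start|>"]

# Before resizing, the new ID points past the end of the table.
try:
    toy_embeddings[new_id]
    lookup_failed_before_resize = False
except IndexError:
    lookup_failed_before_resize = True

resize_embeddings(toy_embeddings, len(toy_vocab))
print(f"Added {n_added} tokens; table now has {len(toy_embeddings)} rows")
print(f"Lookup failed before resize: {lookup_failed_before_resize}")
```

The failed lookup before the resize is the toy version of the crash described in the solution: the tokenizer hands out IDs that the embedding matrix does not yet have rows for.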

In [None]:
# --- Step 1: Prepare the model and tokenizer for SFT ---

# Fresh copy of GPT-2 for SFT (keep the original base_model for comparison)
sft_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
sft_tokenizer = AutoTokenizer.from_pretrained(model_name)
sft_tokenizer.pad_token = sft_tokenizer.eos_token

# TODO: Add the ChatML special tokens to the tokenizer.
# The tokens are: "<|im_start|>" and "<|im_end|>"
# Use sft_tokenizer.add_special_tokens() with the key "additional_special_tokens"
special_tokens = {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
num_added = sft_tokenizer.add_special_tokens(special_tokens)
print(f"Added {num_added} special tokens to vocabulary")

# TODO: Resize the model's embedding matrix to accommodate the new tokens.
# Use sft_model.resize_token_embeddings(len(sft_tokenizer))
sft_model.resize_token_embeddings(len(sft_tokenizer))
print(f"New vocabulary size: {len(sft_tokenizer)}")

# Verify the new tokens have IDs
im_start_id = sft_tokenizer.convert_tokens_to_ids("<|im_start|>")
im_end_id = sft_tokenizer.convert_tokens_to_ids("<|im_end|>")
print(f"<|im_start|> token ID: {im_start_id}")
print(f"<|im_end|> token ID: {im_end_id}")

In [None]:
# --- Step 2: Tokenize with loss masking ---

# Loss masking: we compute loss ONLY on the response tokens.
# The prompt (system + user turns) gets label = -100.
# The response (assistant turn) gets the actual next-token targets.

MAX_LENGTH = 256  # Keep sequences short for training speed


def tokenize_with_labels(formatted_text, tokenizer, max_length=MAX_LENGTH):
    """Tokenize a ChatML-formatted example and create labels with loss masking.

    Returns:
        input_ids: token IDs for the full sequence
        labels: same as input_ids but with -100 for prompt tokens
    """
    # Tokenize the full formatted text
    encoding = tokenizer(
        formatted_text,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    input_ids = encoding["input_ids"][0]

    # For next-token prediction, labels are the input shifted by 1.
    # HuggingFace models handle this shift internally, so labels = input_ids.
    labels = input_ids.clone()

    # TODO: Find where the assistant response starts and mask everything before it.
    #
    # Strategy: find "<|im_start|>assistant\n" in the formatted text,
    # tokenize just the prompt portion to get its length in tokens,
    # then set labels[:prompt_length] = -100.
    #
    # Hint: Use formatted_text.find() to locate the assistant marker,
    # then tokenize formatted_text[:marker_end] to count prompt tokens.

    assistant_marker = "<|im_start|>assistant\n"
    marker_pos = formatted_text.find(assistant_marker)
    if marker_pos == -1:
        raise ValueError("Assistant marker not found in formatted text")
    prompt_end = marker_pos + len(assistant_marker)

    # Tokenize just the prompt to find where to mask
    prompt_tokens = tokenizer(
        formatted_text[:prompt_end],
        return_tensors="pt",
    )["input_ids"][0]
    prompt_length = len(prompt_tokens)

    # Mask the prompt tokens; these do NOT contribute to the loss
    labels[:prompt_length] = -100

    return input_ids, labels


# Test it on one example
test_formatted = format_example(dataset[0])
test_ids, test_labels = tokenize_with_labels(test_formatted, sft_tokenizer)

# Show the masking
n_total = len(test_ids)
n_masked = (test_labels == -100).sum().item()
n_active = n_total - n_masked

print(f"Total tokens: {n_total}")
print(f"Masked (prompt, label=-100): {n_masked}")
print(f"Active (response, loss computed): {n_active}")
print(f"\nFirst few labels: {test_labels[:15].tolist()}")
print(f"Last few labels:  {test_labels[-15:].tolist()}")
print(f"\n-100 = masked (prompt tokens). Other values = target token IDs (response tokens).")
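
An alternative to re-tokenizing the prompt prefix is to search the token-ID sequence for the IDs of the assistant marker, as the solution notes hint. The sketch below works on plain lists of integers. The function names and the toy IDs are invented for illustration and do not correspond to real GPT-2 token IDs.

```python
IGNORE_INDEX = -100


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    n, m = len(sequence), len(pattern)
    for start in range(n - m + 1):
        if sequence[start:start + m] == pattern:
            return start
    return -1


def mask_prompt(token_ids, marker_ids):
    """Copy token_ids into labels, masking everything up to and including the marker."""
    start = find_subsequence(token_ids, marker_ids)
    if start == -1:
        # Marker missing (for example, cut off by truncation): mask everything.
        return [IGNORE_INDEX] * len(token_ids)
    marker_end = start + len(marker_ids)
    return [IGNORE_INDEX] * marker_end + list(token_ids[marker_end:])


# Toy IDs: 900 stands in for <|im_start|>, 901 for <|im_end|>,
# and [900, 7, 8] for the tokenized assistant marker.
toy_ids = [900, 1, 2, 901, 900, 3, 4, 901, 900, 7, 8, 50, 51, 901]
toy_marker = [900, 7, 8]

toy_labels = mask_prompt(toy_ids, toy_marker)
print(toy_labels)
```

This version handles a single assistant turn only. Multi-turn conversations would need every assistant turn located and unmasked, which this sketch does not attempt.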

In [None]:
# --- Step 3: Create the training dataset ---

# Use a small subset for fast training (SFT is data-efficient!)
NUM_TRAIN = 500  # 500 examples: enough to see behavioral change


class InstructionDataset(Dataset):
    def __init__(self, raw_dataset, tokenizer, num_examples, max_length=MAX_LENGTH):
        self.examples = []
        skipped = 0

        for i in range(min(num_examples, len(raw_dataset))):
            formatted = format_example(raw_dataset[i])
            input_ids, labels = tokenize_with_labels(formatted, tokenizer, max_length)

            # Skip examples with fewer than 5 unmasked response tokens
            # (very short responses, or responses lost to truncation)
            if (labels != -100).sum() < 5:
                skipped += 1
                continue

            self.examples.append({"input_ids": input_ids, "labels": labels})

        print(f"Created dataset: {len(self.examples)} examples (skipped {skipped} with <5 response tokens)")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def collate_fn(batch):
    """Pad examples to the same length within a batch."""
    max_len = max(len(ex["input_ids"]) for ex in batch)

    input_ids_padded = []
    labels_padded = []
    attention_masks = []

    for ex in batch:
        pad_len = max_len - len(ex["input_ids"])
        pad_id = sft_tokenizer.pad_token_id

        input_ids_padded.append(
            torch.cat([ex["input_ids"], torch.full((pad_len,), pad_id)])
        )
        labels_padded.append(
            torch.cat([ex["labels"], torch.full((pad_len,), -100)])
        )
        attention_masks.append(
            torch.cat([torch.ones(len(ex["input_ids"])), torch.zeros(pad_len)])
        )

    return {
        "input_ids": torch.stack(input_ids_padded).long(),
        "labels": torch.stack(labels_padded).long(),
        "attention_mask": torch.stack(attention_masks).long(),
    }


train_dataset = InstructionDataset(dataset, sft_tokenizer, NUM_TRAIN)
train_dataloader = DataLoader(
    train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn
)

In [None]:
# --- Step 4: The SFT Training Loop ---

# Same heartbeat as pretraining and classification finetuning:
#   forward -> loss -> zero_grad -> backward -> step
#
# The ONLY difference: the data is instruction-response pairs with loss masking.

NUM_EPOCHS = 2
LEARNING_RATE = 5e-5

optimizer = torch.optim.AdamW(sft_model.parameters(), lr=LEARNING_RATE)
sft_model.train()

losses = []
step_count = 0

print(f"Training for {NUM_EPOCHS} epochs on {len(train_dataset)} examples...")
print(f"Batch size: 4, Steps per epoch: ~{len(train_dataloader)}")
print()

for epoch in range(NUM_EPOCHS):
    epoch_loss = 0.0
    n_batches = 0

    for batch in train_dataloader:
        # Move batch to device
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # TODO: Complete the training loop.
        # 1. Forward pass: outputs = sft_model(input_ids=..., labels=..., attention_mask=...)
        # 2. Get loss: loss = outputs.loss
        # 3. Zero gradients
        # 4. Backward
        # 5. Step

        outputs = sft_model(
            input_ids=input_ids,
            labels=labels,
            attention_mask=attention_mask,
        )
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        n_batches += 1
        step_count += 1
        losses.append(loss.item())

        if step_count % 25 == 0:
            print(f"  Step {step_count:4d} | Loss: {loss.item():.4f}")

    avg_loss = epoch_loss / n_batches
    print(f"\nEpoch {epoch + 1}/{NUM_EPOCHS} | Avg loss: {avg_loss:.4f}")
    print()

print(f"Training complete! {step_count} total steps.")

In [None]:
# Plot the training loss
plt.figure(figsize=(10, 4))
plt.plot(losses, linewidth=1.0, color='#34d399', alpha=0.4, label='Per-step loss')

# Smoothed version: trailing moving average over the last `window` steps
window = 20
if len(losses) > window:
    smoothed = []
    for i in range(len(losses)):
        recent = losses[max(0, i - window + 1):i + 1]
        smoothed.append(sum(recent) / len(recent))
    plt.plot(smoothed, linewidth=2, color='#34d399', label=f'Smoothed ({window}-step)')

plt.xlabel('Training Step')
plt.ylabel('Cross-Entropy Loss')
plt.title('SFT Training Loss')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

print(f"Loss went from {losses[0]:.4f} to {losses[-1]:.4f}")
print("The model is learning to predict response tokens given instruction prompts.")

**What you just implemented:** The complete SFT pipeline: the same training loop heartbeat (forward, loss, zero_grad, backward, step), but with instruction-response pairs and loss masking. No new architecture. No new loss function. The only change is the **data**.

---

## Exercise 4: Compare Before and After (Supported)

Now the payoff: compare the original base model's responses to the SFT model's responses on the same prompts.

The model needs prompts formatted with the chat template it was trained on. If we use the wrong template (or no template), the model will not recognize the structural boundary, as the lesson's "Wrong Template" checkpoint explained.

<details>
<summary>💡 Solution</summary>

The key insight: the SFT model was trained on ChatML-formatted data, so at inference time we must format the prompt the same way. The model learned that `<|im_start|>assistant\n` means "start generating a response here."

```python
def format_inference_prompt(user_message):
    return (
        "<|im_start|>system\n"
        "You are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

We stop at `<|im_start|>assistant\n` and the model generates the rest. Without this template, the SFT model would not know where to start its response.

</details>
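
A quick sanity check for template consistency: the inference prompt should be an exact prefix of the training-format string for the same instruction. The sketch below restates the notebook's template and `format_inference_prompt` so it runs on its own. `is_consistent_prefix` is a hypothetical helper name.

```python
CHAT_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "{instruction}<|im_end|>\n"
    "<|im_start|>assistant\n"
    "{response}<|im_end|>"
)


def format_inference_prompt(user_message):
    """Same layout as the training template, stopping at the assistant turn."""
    return (
        "<|im_start|>system\n"
        "You are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def is_consistent_prefix(instruction, response):
    """True if the inference prompt is an exact prefix of the training string."""
    training_text = CHAT_TEMPLATE.format(instruction=instruction, response=response)
    return training_text.startswith(format_inference_prompt(instruction))


question = "What is the capital of France?"
ok = is_consistent_prefix(question, "Paris.")

# A prompt that replaces the newline after "assistant" with a space no longer
# matches the boundary seen during training.
bad_prompt = format_inference_prompt(question).rstrip("\n") + " "
training_text = CHAT_TEMPLATE.format(instruction=question, response="Paris.")
mismatch = not training_text.startswith(bad_prompt)

print(f"Consistent: {ok}, mismatch detected for altered prompt: {mismatch}")
```

Even a one-character difference in whitespace changes the token sequence at the boundary, which is why the inference prompt is built from the same pieces as the training template.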

In [None]:
# TODO: Write a function that formats a user message into a ChatML prompt
# for inference. The prompt should include the system message and user turn,
# ending with "<|im_start|>assistant\n" so the model knows to generate a response.

def format_inference_prompt(user_message):
    """Format a user message into ChatML for inference.
    Includes system + user turns, ending at the assistant's turn start."""
    return (
        "<|im_start|>system\n"
        "You are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Evaluation prompts: a mix of tasks the model may or may not have seen
eval_prompts = [
    "Write a haiku about machine learning.",
    "What is the capital of France?",
    "Explain why the sky is blue in one sentence.",
    "List three benefits of exercise.",
    "What is the difference between a list and a tuple in Python?",
]

sft_model.eval()

print("=" * 70)
print("BEFORE vs AFTER SFT")
print("=" * 70)

for prompt_text in eval_prompts:
    # Base model: raw prompt (no template; the base model was never trained on templates)
    base_response = generate_response(base_model, tokenizer, prompt_text, max_new_tokens=100)

    # SFT model: formatted with chat template
    chat_prompt = format_inference_prompt(prompt_text)
    sft_response = generate_response(sft_model, sft_tokenizer, chat_prompt, max_new_tokens=100)

    print(f"\nPrompt: {prompt_text}")
    print(f"\n  BASE MODEL (text completer):")
    print(f"  {base_response[:200]}")
    print(f"\n  SFT MODEL (instruction follower):")
    print(f"  {sft_response[:200]}")
    print("-" * 70)

**What you should observe:** The SFT model's responses should show a shift toward instruction-following behavior. With only 500 examples and 2 epochs on tiny GPT-2, don't expect ChatGPT quality, but you should see a clear difference in **format**. The base model continues text. The SFT model attempts to answer.

This is the lesson's central insight in action: **SFT teaches format, not knowledge.** The knowledge was already in the base model. SFT changed how the model expresses it.
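
One practical detail: the SFT data ends every response with `<|im_end|>`, but `generate_response` as written keeps generating up to `max_new_tokens` unless the model happens to emit GPT-2's own end-of-text token, so text produced after the first `<|im_end|>` may leak into the printed response. A minimal post-processing sketch on a plain list of token IDs is shown below. The function name and the toy IDs are assumptions for illustration. Passing the `<|im_end|>` ID as `eos_token_id` to `model.generate` is another option worth trying.

```python
def truncate_at_stop(token_ids, stop_id):
    """Return the tokens before the first stop_id (all tokens if it never appears)."""
    if stop_id in token_ids:
        return token_ids[:token_ids.index(stop_id)]
    return token_ids


# Toy IDs for illustration only: 901 stands in for <|im_end|>,
# 900 for <|im_start|>.
IM_END_ID = 901
generated = [11, 12, 13, 901, 900, 44, 45]

response_ids = truncate_at_stop(generated, IM_END_ID)
print(response_ids)
```

In the notebook, this could be applied to `generated_ids.tolist()` inside `generate_response` before decoding, with `stop_id` set to `im_end_id`.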

---

## Exercise 5: Data Quality Experiment - Noisy vs Clean (Independent)

The lesson mentioned that data quality matters more than quantity for SFT: LIMA showed that 1,000 carefully curated examples can match datasets 50x larger.

**Your task:** Train two SFT models from the same base:
1. One on **clean** instruction-response pairs (well-formed, correct responses)
2. One on **noisy** data (same instructions but with corrupted/garbled responses)

Compare their outputs on the same evaluation prompts. Does the noise in the training data show up in the model's behavior?

**Specification:**
- Create a `corrupt_response()` function that degrades response quality (shuffle words, add random characters, truncate, etc.)
- Build two datasets of 200 examples each: one clean, one noisy
- Train two models (same hyperparameters) for 1 epoch each
- Compare responses on 3-5 evaluation prompts

<details>
<summary>💡 Solution</summary>

The reasoning: if SFT teaches format and not knowledge, then noisy SFT teaches noisy format. The model will still attempt to respond to instructions (it learned the instruction-response pattern), but the quality of its responses will reflect the quality of the training data.

This is why the LIMA paper's result makes sense: format is a relatively simple pattern, so a small number of high-quality examples is enough. But low-quality examples teach low-quality format.

```python
import random

def corrupt_response(text):
    """Degrade a response: shuffle words, add noise, truncate."""
    words = text.split()
    # Shuffle word order
    random.shuffle(words)
    # Truncate to random length
    keep = max(3, len(words) // 2)
    words = words[:keep]
    # Add random characters
    noisy_words = []
    for w in words:
        if random.random() < 0.3:
            w = w + "xxx"
        noisy_words.append(w)
    return " ".join(noisy_words)

# Build noisy dataset
noisy_examples = []
for i in range(200):
    ex = dataset[i]
    noisy_ex = {
        "instruction": ex["instruction"],
        "input": ex["input"],
        "output": corrupt_response(ex["output"]),
    }
    noisy_examples.append(noisy_ex)

# Train both models with the same loop, then compare on eval prompts.
```

Common alternative: Instead of shuffling, you could replace responses with random text entirely. This tests whether the model even learns the format at all vs learning bad format.

</details>
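
To compare the two models with a number rather than by eye, one option is to measure how much of the injected noise shows up in generated text. The sketch below assumes the `"xxx"` suffix from the solution's `corrupt_response()` as the noise signature. `noise_rate` is a hypothetical helper, and the sample strings are made up, not real model outputs.

```python
def noise_rate(text, marker="xxx"):
    """Fraction of whitespace-separated words that end with the noise marker."""
    words = text.split()
    if not words:
        return 0.0
    return sum(word.endswith(marker) for word in words) / len(words)


# Made-up strings standing in for model outputs (not real generations).
clean_sample = "Exercise improves mood and strengthens the heart."
noisy_sample = "moodxxx heart the improvesxxx Exercise strengthens"

clean_rate = noise_rate(clean_sample)
noisy_rate = noise_rate(noisy_sample)
print(f"clean: {clean_rate:.2f}, noisy: {noisy_rate:.2f}")
```

Averaging `noise_rate` over each model's responses to the same evaluation prompts gives a rough indication of whether the noisy model reproduces the corruption pattern. Word-order shuffling and truncation are not captured by this measure, so it complements reading the outputs rather than replacing it.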

In [None]:
# --- Your data quality experiment ---
# Implement the experiment described above.
# Create corrupt_response(), build clean and noisy datasets,
# train two models, and compare outputs.

import random
random.seed(42)

# Your code here...


---

## Key Takeaways

1. **SFT teaches format, not knowledge.** The base model already has vast knowledge from pretraining. SFT on instruction-response pairs teaches it to express that knowledge in an instruction-following format, which is a much simpler pattern to learn.

2. **No new architecture, no new loss function.** The training loop is the same heartbeat: forward, cross-entropy loss, zero_grad, backward, step. The only change is the data: formatted instruction-response pairs instead of web text.

3. **Loss masking focuses training on responses.** Prompt tokens get label `-100` so the model learns to *generate* responses, not to predict instruction tokens it already has. This is the one genuinely new mechanical concept.

4. **Chat templates are functional structure, not cosmetic formatting.** Special tokens like `<|im_start|>` and `<|im_end|>` are structural delimiters the model learns to recognize. Using the wrong template at inference time breaks the model's ability to find the response boundary.

5. **Data quality matters more than quantity.** A small number of clean, well-formed instruction-response pairs teaches better format than a large number of noisy ones. Format is a simple pattern: it needs clarity, not volume.