# LLM Alignment Case Study: From Raw Model to Helpful Assistant

A complete, practical walkthrough of aligning a language model!

## What You'll Learn

By the end of this notebook, you'll understand:
- The apprenticeship analogy: the full alignment journey
- How to prepare and explore alignment datasets
- SFT: Teaching the model to follow instructions
- DPO: Refining the model with human preferences
- Evaluation: Measuring alignment success
- Best practices and common pitfalls

**Prerequisites:** RLHF notebooks (01-06)

**Time:** ~40 minutes

---
## The Big Picture: The Apprenticeship Analogy

```
    ┌────────────────────────────────────────────────────────────────┐
    │          THE APPRENTICESHIP ANALOGY                            │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Training an AI assistant is like training an apprentice...   │
    │                                                                │
    │  STAGE 1: EDUCATION (Pre-training)                            │
    │    Read millions of books and documents                      │
    │    Learn language, facts, patterns                           │
    │    But: Doesn't know how to be helpful yet!                  │
    │                                                                │
    │  STAGE 2: JOB TRAINING (SFT)                                  │
    │    "Here's how we answer customer questions"                 │
    │    "Here's the format we use"                                │
    │    Learn: What good responses LOOK like                      │
    │                                                                │
    │  STAGE 3: MENTORSHIP (DPO/RLHF)                               │
    │    "This response is better than that one"                   │
    │    "This is too verbose, that's rude"                        │
    │    Learn: What humans PREFER                                 │
    │                                                                │
    │  RESULT:                                                      │
    │    An assistant that's helpful, harmless, and honest!        │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch
import warnings
warnings.filterwarnings('ignore')

# Check libraries
print("LIBRARY CHECK")
print("="*60)

libraries = {
    'torch': 'PyTorch (neural networks)',
    'transformers': 'Transformers (LLM models)',
    'datasets': 'Datasets (data loading)',
    'trl': 'TRL (training)',
    'peft': 'PEFT (efficient fine-tuning)',
}

available = {}
for module, desc in libraries.items():
    try:
        lib = __import__(module)
        available[module] = True
        version = getattr(lib, '__version__', 'installed')
        print(f"  ✓ {desc}: {version}")
    except ImportError:
        available[module] = False
        print(f"  ✗ {desc}: Not installed")

print("="*60)

In [None]:
# Visualize the alignment pipeline

fig, ax = plt.subplots(figsize=(14, 10))
ax.set_xlim(0, 14)
ax.set_ylim(0, 12)
ax.axis('off')
ax.set_title('LLM Alignment Pipeline: The Complete Journey', fontsize=16, fontweight='bold')

# Stages
stages = [
    (2, 9, 'Pre-trained\nLLM', 'Knows language\nbut unguided', '#e3f2fd', '#1976d2'),
    (6, 9, 'SFT Model', 'Follows\ninstructions', '#fff3e0', '#f57c00'),
    (10, 9, 'Aligned\nModel', 'Helpful &\nHarmless', '#c8e6c9', '#388e3c'),
]

for x, y, title, desc, fcolor, ecolor in stages:
    box = FancyBboxPatch((x-1.3, y-1), 2.6, 2.5, boxstyle="round,pad=0.1",
                          facecolor=fcolor, edgecolor=ecolor, linewidth=3)
    ax.add_patch(box)
    ax.text(x, y+0.8, title, ha='center', fontsize=10, fontweight='bold')
    ax.text(x, y-0.4, desc, ha='center', fontsize=8, color='#666')

# Training stages below
training = [
    (4, 6, 'SFT', 'Supervised\nFine-Tuning', 'Instruction\nData', '#fff3e0', '#f57c00'),
    (8, 6, 'DPO', 'Direct Preference\nOptimization', 'Preference\nData', '#c8e6c9', '#388e3c'),
]

for x, y, title, method, data, fcolor, ecolor in training:
    # Training method box
    box = FancyBboxPatch((x-1.3, y-0.5), 2.6, 2, boxstyle="round,pad=0.1",
                          facecolor=fcolor, edgecolor=ecolor, linewidth=2)
    ax.add_patch(box)
    ax.text(x, y+0.9, title, ha='center', fontsize=11, fontweight='bold', color=ecolor)
    ax.text(x, y+0.1, method, ha='center', fontsize=8)
    
    # Data box
    data_box = FancyBboxPatch((x-1, y-2.5), 2, 1.3, boxstyle="round,pad=0.1",
                               facecolor='#f5f5f5', edgecolor='#999', linewidth=1)
    ax.add_patch(data_box)
    ax.text(x, y-1.9, data, ha='center', fontsize=8)
    
    # Arrow from data to training
    ax.annotate('', xy=(x, y-0.6), xytext=(x, y-1.1),
                arrowprops=dict(arrowstyle='->', lw=1.5, color='#666'))

# Horizontal arrows between stages
ax.annotate('', xy=(4.6, 9.5), xytext=(3.4, 9.5),
            arrowprops=dict(arrowstyle='->', lw=2, color='#666'))
ax.annotate('', xy=(8.6, 9.5), xytext=(7.4, 9.5),
            arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

# Diagonal arrows from models to training
ax.annotate('', xy=(4, 7.6), xytext=(2.5, 7.9),
            arrowprops=dict(arrowstyle='->', lw=1.5, color='#999'))
ax.annotate('', xy=(8, 7.6), xytext=(6.5, 7.9),
            arrowprops=dict(arrowstyle='->', lw=1.5, color='#999'))

# Evaluation at the bottom
eval_box = FancyBboxPatch((5, 0.5), 4, 1.5, boxstyle="round,pad=0.1",
                           facecolor='#e1bee7', edgecolor='#7b1fa2', linewidth=2)
ax.add_patch(eval_box)
ax.text(7, 1.5, 'Evaluation', ha='center', fontsize=10, fontweight='bold', color='#7b1fa2')
ax.text(7, 0.9, 'Win rate, Human eval, KL divergence', ha='center', fontsize=8)

plt.tight_layout()
plt.show()

print("\nALIGNMENT PIPELINE:")
print("  1. Pre-trained LLM → knows language but not helpful")
print("  2. SFT → learns to follow instructions")
print("  3. DPO/RLHF → learns human preferences")
print("  4. Evaluation → verify it's actually better!")

---
## Part 1: Understanding Alignment Datasets

```
    ┌────────────────────────────────────────────────────────────────┐
    │              ALIGNMENT DATASETS                                │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  SFT DATASETS (Instruction-following):                        │
    │    Format: {instruction, response}                           │
    │    Examples:                                                  │
    │      - OpenAssistant (conversation data)                     │
    │      - Dolly-15k (instruction tuning)                        │
    │      - Alpaca (GPT-generated instructions)                   │
    │                                                                │
    │  PREFERENCE DATASETS (DPO/RLHF):                              │
    │    Format: {prompt, chosen_response, rejected_response}      │
    │    Examples:                                                  │
    │      - Anthropic HH-RLHF (helpful & harmless)                │
    │      - Stanford SHP (Stack Exchange preferences)             │
    │      - UltraFeedback (diverse preferences)                   │
    │                                                                │
    │  KEY INSIGHT:                                                 │
    │    Quality matters more than quantity!                       │
    │    1K great examples > 100K noisy examples                   │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
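
To make the two formats concrete, here is a minimal sketch of one record of each kind, plus a small sanity check for preference records. The example texts are made up, the keys `chosen` and `rejected` follow the naming used by the code cells in this notebook, and `validate_preference_record` is a hypothetical helper, not part of any library.

```python
# Hypothetical records illustrating the two dataset formats.
sft_record = {
    "instruction": "Explain what overfitting is in one sentence.",
    "response": "Overfitting is when a model memorizes its training data "
                "and fails to generalize to new examples.",
}

preference_record = {
    "prompt": "Explain what overfitting is in one sentence.",
    "chosen": "Overfitting is when a model memorizes its training data "
              "and fails to generalize to new examples.",
    "rejected": "It's a machine learning thing.",
}


def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if clean)."""
    problems = []
    for key in ("prompt", "chosen", "rejected"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


print(validate_preference_record(preference_record))
print(validate_preference_record({"prompt": "Hi", "chosen": "Hello!", "rejected": "Hello!"}))
```

Running a check like this over a dataset before training is one cheap way to act on the "quality over quantity" point.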

In [None]:
# Explore preference data format

print("UNDERSTANDING PREFERENCE DATA")
print("="*70)

# Simulated example from Anthropic HH-RLHF
example_preference = {
    "prompt": "Human: How can I improve my public speaking skills?",
    
    "chosen": """Assistant: Here are some effective ways to improve your public speaking:

1. **Practice regularly** - Record yourself and watch for filler words, pacing, and body language.

2. **Know your material** - Thorough preparation builds confidence.

3. **Start small** - Practice with friends or small groups before larger audiences.

4. **Focus on your audience** - Think about helping them, not about yourself.

5. **Join a group** - Organizations like Toastmasters provide supportive practice environments.

Would you like me to elaborate on any of these points?""",
    
    "rejected": """Assistant: Just practice more and you'll get better at it. 
Public speaking isn't that hard if you just do it."""
}

print("\nEXAMPLE PREFERENCE PAIR:")
print("-"*70)
print(f"\nPROMPT:\n{example_preference['prompt']}")
print(f"\n{'='*70}")
print(f"\nCHOSEN (Better) Response:")
print(example_preference['chosen'][:300] + "...")
print(f"\n{'='*70}")
print(f"\nREJECTED (Worse) Response:")
print(example_preference['rejected'])

print("\n" + "="*70)
print("\nWHY CHOSEN IS BETTER:")
print("  ✓ Structured with clear points")
print("  ✓ Actionable advice")
print("  ✓ Specific recommendations")
print("  ✓ Offers to help more")
print("\nWHY REJECTED IS WORSE:")
print("  ✗ Vague and unhelpful")
print("  ✗ Dismissive tone")
print("  ✗ No concrete suggestions")

In [None]:
# Visualize what makes good vs bad responses

fig, axes = plt.subplots(1, 2, figsize=(14, 7))

# Left: Good response characteristics
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('CHOSEN Response\n(What makes it good)', fontsize=12, fontweight='bold', color='#388e3c')

good_traits = [
    (5, 8.5, 'Helpful', 'Addresses the actual need'),
    (5, 7, 'Specific', 'Concrete, actionable advice'),
    (5, 5.5, 'Structured', 'Clear organization'),
    (5, 4, 'Respectful', 'Takes user seriously'),
    (5, 2.5, 'Engaging', 'Offers further help'),
]

for x, y, trait, desc in good_traits:
    box = FancyBboxPatch((x-2.5, y-0.5), 5, 1, boxstyle="round,pad=0.1",
                          facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=2)
    ax1.add_patch(box)
    ax1.text(x, y+0.2, f'✓ {trait}', ha='center', fontsize=10, fontweight='bold', color='#388e3c')
    ax1.text(x, y-0.3, desc, ha='center', fontsize=8, color='#666')

# Right: Bad response characteristics
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('REJECTED Response\n(What makes it bad)', fontsize=12, fontweight='bold', color='#d32f2f')

bad_traits = [
    (5, 8.5, 'Unhelpful', "Doesn't address the need"),
    (5, 7, 'Vague', 'No concrete guidance'),
    (5, 5.5, 'Dismissive', 'Minimizes the concern'),
    (5, 4, 'Short', 'Low effort response'),
    (5, 2.5, 'Closed', 'No follow-up offered'),
]

for x, y, trait, desc in bad_traits:
    box = FancyBboxPatch((x-2.5, y-0.5), 5, 1, boxstyle="round,pad=0.1",
                          facecolor='#ffcdd2', edgecolor='#d32f2f', linewidth=2)
    ax2.add_patch(box)
    ax2.text(x, y+0.2, f'✗ {trait}', ha='center', fontsize=10, fontweight='bold', color='#d32f2f')
    ax2.text(x, y-0.3, desc, ha='center', fontsize=8, color='#666')

plt.tight_layout()
plt.show()

print("\nKEY INSIGHT:")
print("  DPO learns from pairwise COMPARISONS, not from absolute quality scores!")
print("  The model learns: 'Make responses more like chosen, less like rejected'")

---
## Part 2: Supervised Fine-Tuning (SFT)

```
    ┌────────────────────────────────────────────────────────────────┐
    │          SFT: TEACHING TO FOLLOW INSTRUCTIONS                  │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  PURPOSE:                                                     │
    │    Transform: "Smart but aimless" → "Instruction-following"  │
    │                                                                │
    │  WHAT IT LEARNS:                                              │
    │    - The STRUCTURE of helpful responses                      │
    │    - How to INTERPRET instructions                           │
    │    - Appropriate TONE and FORMAT                             │
    │                                                                │
    │  DATA FORMAT:                                                 │
    │    "Human: How do I bake a cake?"                            │
    │    "Assistant: Here's a simple recipe..."                    │
    │                                                                │
    │  TRAINING:                                                    │
    │    Standard next-token prediction on instruction data        │
    │    Loss = -log P(correct_next_token)                         │
    │                                                                │
    │  KEY TIPS:                                                    │
    │    • Quality data matters more than quantity                 │
    │    • Use LoRA for efficient training                         │
    │    • 1-3 epochs is usually enough                            │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
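
The loss in the box, `-log P(correct_next_token)`, can be worked through on a toy example. The sketch below uses made-up logits over a four-token vocabulary and averages the loss over three positions; `next_token_loss` is an illustrative helper, not the function a training library calls internally.

```python
import numpy as np


def next_token_loss(logits, target_ids):
    """Mean of -log P(correct next token) over all positions."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to the correct token at each position.
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.mean()


# Toy setup: 3 positions, vocabulary of 4 tokens (values are made up).
logits = np.array([
    [2.0, 0.5, 0.1, 0.1],   # fairly confident in token 0 (correct)
    [0.2, 3.0, 0.1, 0.3],   # very confident in token 1 (correct)
    [0.0, 0.0, 0.0, 0.0],   # uniform: no preference among tokens
])
targets = np.array([0, 1, 2])

loss = next_token_loss(logits, targets)
print(f"loss = {loss:.3f}")
print(f"loss of a uniform guess = {np.log(4):.3f}")
```

A model that guesses uniformly scores `log(vocab_size)`, so a falling SFT loss means the model is putting more probability on the tokens of the reference responses.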

In [None]:
# SFT Training Code (Template)

print("SFT TRAINING TEMPLATE")
print("="*70)

sft_code = '''
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# 1. Load base model (use a small one for demonstration)
model_name = "gpt2"  # Or "meta-llama/Llama-2-7b-hf" for real use
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load instruction dataset
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
print(f"Dataset size: {len(dataset)} examples")

# 3. Configure LoRA (efficient fine-tuning)
peft_config = LoraConfig(
    r=16,                           # Rank of LoRA matrices
    lora_alpha=32,                  # Scaling factor
    lora_dropout=0.05,              # Regularization
    target_modules=["c_attn"],      # GPT-2 attention (varies by model)
    bias="none",
    task_type="CAUSAL_LM",
)

# 4. Configure training
training_args = SFTConfig(
    output_dir="./sft_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # Effective batch size: 4 x 4 = 16
    max_seq_length=512,                 # Renamed to max_length in newer TRL releases
    learning_rate=2e-4,
    logging_steps=50,
    save_steps=500,
)

# 5. Create trainer and train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,                # Renamed to processing_class in newer TRL releases
    peft_config=peft_config,
)

trainer.train()
trainer.save_model("./sft_model_final")
'''

print(sft_code)
print("="*70)
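
To see why the LoRA settings in the template are cheap to train, it helps to count the parameters they add. The sketch below assumes GPT-2 small (12 transformer blocks, hidden size 768, a fused `c_attn` projection from 768 to 2304, roughly 124M parameters in total) and the template's `r=16`; `lora_trainable_params` is a hypothetical helper written for this illustration.

```python
def lora_trainable_params(d_in, d_out, r, n_layers):
    """Parameters added by LoRA when one linear layer per block is adapted.

    LoRA learns A (r x d_in) and B (d_out x r), so each adapted layer
    adds r * (d_in + d_out) trainable parameters.
    """
    return n_layers * r * (d_in + d_out)


# GPT-2 small: c_attn maps 768 -> 2304 (query, key, value fused) in each of 12 blocks.
trainable = lora_trainable_params(d_in=768, d_out=2304, r=16, n_layers=12)
total = 124_000_000  # approximate GPT-2 small parameter count

print(f"LoRA trainable parameters: {trainable:,}")
print(f"Share of the base model:   {trainable / total:.2%}")
```

Under these assumptions the adapter is well under 1% of the base model, which is why gradients and optimizer state for it take so little memory.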

In [None]:
# Simulate SFT training curves

np.random.seed(42)

# Simulated training metrics
steps = np.arange(0, 1001, 50)
loss = 3.5 * np.exp(-steps/300) + 1.5 + np.random.randn(len(steps)) * 0.1

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Loss curve
ax1 = axes[0]
ax1.plot(steps, loss, 'b-', linewidth=2)
ax1.fill_between(steps, loss - 0.1, loss + 0.1, alpha=0.2)
ax1.set_xlabel('Training Steps', fontsize=11)
ax1.set_ylabel('Loss', fontsize=11)
ax1.set_title('SFT Training Loss', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.axhline(y=1.6, color='red', linestyle='--', alpha=0.5, label='Target')
ax1.legend()

# Right: Before/After comparison
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('SFT Effect', fontsize=12, fontweight='bold')

# Before SFT
before_box = FancyBboxPatch((0.5, 5.5), 4, 4, boxstyle="round,pad=0.1",
                             facecolor='#ffcdd2', edgecolor='#d32f2f', linewidth=2)
ax2.add_patch(before_box)
ax2.text(2.5, 9, 'Before SFT', ha='center', fontsize=11, fontweight='bold', color='#d32f2f')
ax2.text(2.5, 8, 'User: What is 2+2?', ha='center', fontsize=9)
ax2.text(2.5, 7.2, 'Model: The sum of two', ha='center', fontsize=9)
ax2.text(2.5, 6.6, 'numbers when added...', ha='center', fontsize=9)
ax2.text(2.5, 5.8, '(Rambles, no answer)', ha='center', fontsize=8, color='#666', style='italic')

# After SFT
after_box = FancyBboxPatch((5.5, 5.5), 4, 4, boxstyle="round,pad=0.1",
                            facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=2)
ax2.add_patch(after_box)
ax2.text(7.5, 9, 'After SFT', ha='center', fontsize=11, fontweight='bold', color='#388e3c')
ax2.text(7.5, 8, 'User: What is 2+2?', ha='center', fontsize=9)
ax2.text(7.5, 7.2, 'Model: 2+2 equals 4.', ha='center', fontsize=9)
ax2.text(7.5, 6.4, 'Is there anything else', ha='center', fontsize=9)
ax2.text(7.5, 5.8, "I can help you with?", ha='center', fontsize=9)

# Arrow
ax2.annotate('', xy=(5.4, 7.5), xytext=(4.6, 7.5),
             arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

# Impact summary
ax2.text(5, 4, 'SFT teaches the model HOW to respond', ha='center', fontsize=10, style='italic')
ax2.text(5, 3, 'Format, structure, helpfulness', ha='center', fontsize=9, color='#666')

plt.tight_layout()
plt.show()

---
## Part 3: Direct Preference Optimization (DPO)

```
    ┌────────────────────────────────────────────────────────────────┐
    │          DPO: LEARNING FROM PREFERENCES                        │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  PURPOSE:                                                     │
    │    Transform: "Follows instructions" → "Aligned with humans" │
    │                                                                │
    │  WHAT IT LEARNS:                                              │
    │    - Which responses humans PREFER                           │
    │    - Subtle quality differences                              │
    │    - Safety and helpfulness tradeoffs                        │
    │                                                                │
    │  THE DPO LOSS:                                                │
    │    L = -log σ(β × [log π(y_w)/π_ref(y_w)                     │
    │                  - log π(y_l)/π_ref(y_l)])                   │
    │                                                                │
    │    "Increase probability of chosen response"                 │
    │    "Decrease probability of rejected response"               │
    │    "Don't drift too far from reference (KL constraint)"      │
    │                                                                │
    │  KEY ADVANTAGE:                                               │
    │    No reward model needed! Direct supervised training!       │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
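
The DPO loss in the box can be evaluated by hand for a single preference pair. The sketch below takes summed sequence log-probabilities under the policy and the reference model (all numbers are made up) and returns the loss together with the implicit reward margin; `dpo_loss` is an illustrative function, not the TRL implementation.

```python
import numpy as np


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # Implicit rewards: beta * log(pi / pi_ref) for each response.
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)), computed stably.
    loss = np.logaddexp(0.0, -margin)
    return loss, margin


# At the start of training the policy equals the reference: margin 0, loss log(2).
loss0, margin0 = dpo_loss(-42.0, -40.0, -42.0, -40.0)

# Later (made-up numbers): chosen became more likely, rejected less likely.
loss1, margin1 = dpo_loss(-38.0, -47.0, -42.0, -40.0)

print(f"start: margin={margin0:.2f}, loss={loss0:.4f}")
print(f"later: margin={margin1:.2f}, loss={loss1:.4f}")
```

Only the log-ratios against the reference enter the loss, which is how the reference model acts as the anchor that `beta` scales.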

In [None]:
# DPO Training Code (Template)

print("DPO TRAINING TEMPLATE")
print("="*70)

dpo_code = '''
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig

# 1. Load SFT model (from previous step)
model = AutoModelForCausalLM.from_pretrained("./sft_model_final")
tokenizer = AutoTokenizer.from_pretrained("./sft_model_final")

# 2. Load preference dataset
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:5000]")

# 3. Format data for DPO
def format_for_dpo(example):
    """Convert to DPO format: prompt, chosen, rejected."""
    # HH-RLHF stores each full dialogue as one string. The chosen and
    # rejected versions should share the same context and differ only in
    # the final Assistant turn, so split on the LAST "Assistant:" marker.
    marker = "\\n\\nAssistant:"
    prompt, sep, chosen_response = example["chosen"].rpartition(marker)
    _, _, rejected_response = example["rejected"].rpartition(marker)

    # If the marker is missing, rpartition leaves the prompt empty and
    # returns the whole string as the response.
    return {
        # Keep the full conversation context, including earlier turns
        "prompt": (prompt + sep).strip(),
        "chosen": chosen_response.strip(),
        "rejected": rejected_response.strip(),
    }

dpo_dataset = dataset.map(format_for_dpo)

# 4. Configure DPO training
training_args = DPOConfig(
    output_dir="./dpo_model",
    beta=0.1,                          # KL penalty coefficient
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_length=512,
    max_prompt_length=256,
    learning_rate=5e-5,
    logging_steps=50,
)

# 5. Create trainer (with ref_model unset, TRL creates the reference model automatically)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,               # Renamed to processing_class in newer TRL releases
)

trainer.train()
trainer.save_model("./dpo_model_final")
'''

print(dpo_code)
print("="*70)

In [None]:
# Visualize DPO training dynamics

np.random.seed(42)

steps = np.arange(0, 501, 25)

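# Simulated metrics. Implicit rewards start near 0 because the policy
# equals the reference model at step 0.
chosen_rewards = 0.8 * (1 - np.exp(-steps/150)) + np.random.randn(len(steps)) * 0.1
rejected_rewards = -0.5 * (1 - np.exp(-steps/150)) + np.random.randn(len(steps)) * 0.1
margin = chosen_rewards - rejected_rewards
# KL divergence is non-negative, so clip the noisy simulation at 0
kl_div = np.maximum(0.1 + steps/100 + np.random.randn(len(steps)) * 0.2, 0)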

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Left: Reward margins
ax1 = axes[0]
ax1.plot(steps, chosen_rewards, 'g-', linewidth=2, label='Chosen reward')
ax1.plot(steps, rejected_rewards, 'r-', linewidth=2, label='Rejected reward')
ax1.fill_between(steps, rejected_rewards, chosen_rewards, alpha=0.2, color='green')
ax1.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Training Steps', fontsize=11)
ax1.set_ylabel('Implicit Reward', fontsize=11)
ax1.set_title('DPO Reward Dynamics', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Middle: Margin
ax2 = axes[1]
ax2.plot(steps, margin, 'b-', linewidth=2)
ax2.fill_between(steps, 0, margin, alpha=0.2)
ax2.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Training Steps', fontsize=11)
ax2.set_ylabel('Reward Margin', fontsize=11)
ax2.set_title('Chosen - Rejected Margin', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Right: KL divergence
ax3 = axes[2]
ax3.plot(steps, kl_div, 'orange', linewidth=2)
ax3.axhline(y=5, color='red', linestyle='--', alpha=0.5, label='Warning threshold')
ax3.axhline(y=15, color='red', linestyle='-', alpha=0.5, label='Danger zone')
ax3.set_xlabel('Training Steps', fontsize=11)
ax3.set_ylabel('KL Divergence', fontsize=11)
ax3.set_title('KL from Reference', fontsize=12, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.set_ylim(0, 16)  # Tall enough to show the danger-zone line at y=15

plt.tight_layout()
plt.show()

print("\nDPO TRAINING DYNAMICS:")
print("  • Chosen reward INCREASES (model prefers good responses)")
print("  • Rejected reward DECREASES (model avoids bad responses)")
print("  • Margin GROWS (clearer preference signal)")
print("  • KL divergence should stay moderate (not too high!)")
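
The KL curve above is simulated, so it may help to see what the quantity measures. The sketch below compares two made-up next-token distributions over a five-token vocabulary: it computes KL(policy || reference) exactly, then estimates it from samples drawn from the policy, which is one common way to approximate it when summing over every possible output is impractical. This illustrates the definition and is not the metric any particular trainer logs.

```python
import numpy as np


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()


# Toy next-token distributions (made-up logits).
ref_probs = softmax(np.array([1.0, 0.5, 0.0, -0.5, -1.0]))
policy_probs = softmax(np.array([2.0, 0.5, 0.0, -1.0, -2.0]))

# Exact KL(policy || reference): expected log-ratio under the policy.
exact_kl = np.sum(policy_probs * np.log(policy_probs / ref_probs))

# Sample-based estimate: average the log-ratio over tokens drawn from the policy.
rng = np.random.default_rng(0)
samples = rng.choice(5, size=20_000, p=policy_probs)
estimated_kl = np.mean(np.log(policy_probs[samples] / ref_probs[samples]))

print(f"exact KL   = {exact_kl:.4f}")
print(f"sampled KL = {estimated_kl:.4f}")
```

The more the policy concentrates probability where the reference did not, the larger this number gets, which is why a steadily climbing KL is a warning sign.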

---
## Part 4: Evaluation - Did It Work?

```
    ┌────────────────────────────────────────────────────────────────┐
    │          EVALUATION METHODS                                    │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  1. AUTOMATED METRICS                                         │
    │     Win Rate: Compare to baseline on same prompts            │
    │     Reward Score: Average reward model score                 │
    │     Perplexity: Language quality (shouldn't degrade)         │
    │                                                                │
    │  2. LLM-AS-JUDGE                                              │
    │     Use GPT-4 or Claude to evaluate response quality         │
    │     Cheaper than human eval, correlates well                 │
    │                                                                │
    │  3. HUMAN EVALUATION (Gold Standard)                          │
    │     Real humans rate helpfulness, harmlessness               │
    │     Most reliable but expensive                              │
    │                                                                │
    │  4. SAFETY TESTING                                            │
    │     Red teaming: Try to make model produce harmful output   │
    │     Check for refusals on dangerous requests                 │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
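
A win rate from a few hundred prompts carries real sampling noise, so it is worth reporting an interval alongside it. The sketch below is one possible approach: it counts a tie as half a win and computes a 95% bootstrap confidence interval. The outcome counts are made up and `win_rate_with_ci` is a hypothetical helper.

```python
import numpy as np


def win_rate_with_ci(outcomes, n_boot=2000, seed=0):
    """Win rate plus a 95% bootstrap confidence interval.

    `outcomes` holds one entry per prompt: 1.0 for a win,
    0.5 for a tie, 0.0 for a loss.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample prompts with replacement and recompute the mean each time.
    idx = rng.integers(0, len(outcomes), size=(n_boot, len(outcomes)))
    boot_means = outcomes[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [2.5, 97.5])
    return outcomes.mean(), low, high


# Made-up judgments on 100 prompts: 70 wins, 10 ties, 20 losses.
outcomes = np.array([1.0] * 70 + [0.5] * 10 + [0.0] * 20)
rate, low, high = win_rate_with_ci(outcomes)
print(f"win rate = {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```

If the interval includes 50%, the comparison does not yet show that one model is better than the other.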

In [None]:
# Simulated evaluation results

print("EVALUATION RESULTS (Simulated)")
print("="*70)

np.random.seed(42)

# Simulate evaluation on 100 prompts
n_eval = 100

# Simulated per-prompt quality scores for each model, clipped to the 1-10 scale
base_scores = np.clip(np.random.normal(5, 1.5, n_eval), 1, 10)
sft_scores = np.clip(np.random.normal(6.5, 1.2, n_eval), 1, 10)
aligned_scores = np.clip(np.random.normal(7.8, 1.0, n_eval), 1, 10)

# Compute win rates
def compute_win_rate(model_scores, baseline_scores):
    """Return the fraction of prompts won and tied against the baseline."""
    model_scores = np.asarray(model_scores)
    baseline_scores = np.asarray(baseline_scores)
    return np.mean(model_scores > baseline_scores), np.mean(model_scores == baseline_scores)

win_sft, tie_sft = compute_win_rate(sft_scores, base_scores)
win_aligned, tie_aligned = compute_win_rate(aligned_scores, sft_scores)

print("\nWIN RATE ANALYSIS:")
print(f"  SFT vs Base:     {win_sft:.1%} wins")
print(f"  Aligned vs SFT:  {win_aligned:.1%} wins")

print("\nAVERAGE QUALITY SCORES (1-10):")
print(f"  Base Model:    {base_scores.mean():.2f} ± {base_scores.std():.2f}")
print(f"  SFT Model:     {sft_scores.mean():.2f} ± {sft_scores.std():.2f}")
print(f"  Aligned Model: {aligned_scores.mean():.2f} ± {aligned_scores.std():.2f}")

print("\nSAFETY EVALUATION:")
safety_results = {
    'Harmful request refusal': (85, 95),  # (SFT, Aligned)
    'Honest uncertainty': (70, 88),
    'No hallucination': (75, 82),
}

for metric, (sft_rate, aligned_rate) in safety_results.items():
    improvement = aligned_rate - sft_rate
    print(f"  {metric}: SFT={sft_rate}%, Aligned={aligned_rate}% (+{improvement} pts)")

print("\n" + "="*70)

In [None]:
# Visualize evaluation results

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Left: Score distributions
ax1 = axes[0]
ax1.hist(base_scores, bins=15, alpha=0.5, label='Base', color='red')
ax1.hist(sft_scores, bins=15, alpha=0.5, label='SFT', color='orange')
ax1.hist(aligned_scores, bins=15, alpha=0.5, label='Aligned', color='green')
ax1.set_xlabel('Quality Score', fontsize=11)
ax1.set_ylabel('Count', fontsize=11)
ax1.set_title('Response Quality Distribution', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Middle: Win rate visualization
ax2 = axes[1]
comparisons = ['SFT vs Base', 'Aligned vs SFT']
win_rates = [win_sft * 100, win_aligned * 100]
lose_rates = [100 - win_sft * 100, 100 - win_aligned * 100]

x = np.arange(len(comparisons))
width = 0.35

ax2.bar(x - width/2, win_rates, width, label='Wins', color='#4caf50')
ax2.bar(x + width/2, lose_rates, width, label='Losses', color='#f44336')
ax2.axhline(y=50, color='gray', linestyle='--', alpha=0.5)

ax2.set_ylabel('Percentage', fontsize=11)
ax2.set_title('Win Rate Comparison', fontsize=12, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(comparisons)
ax2.legend()
ax2.grid(True, alpha=0.3, axis='y')

# Right: Safety metrics
ax3 = axes[2]
metrics = list(safety_results.keys())
sft_vals = [v[0] for v in safety_results.values()]
aligned_vals = [v[1] for v in safety_results.values()]

x = np.arange(len(metrics))
ax3.bar(x - width/2, sft_vals, width, label='SFT', color='#ff9800')
ax3.bar(x + width/2, aligned_vals, width, label='Aligned', color='#4caf50')

ax3.set_ylabel('Success Rate (%)', fontsize=11)
ax3.set_title('Safety Metrics', fontsize=12, fontweight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels(['Refusal', 'Uncertainty', 'No Halluc.'], rotation=15)
ax3.legend()
ax3.grid(True, alpha=0.3, axis='y')
ax3.set_ylim(0, 100)

plt.tight_layout()
plt.show()

print("\nKEY FINDINGS:")
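print(f"  • Aligned model wins {win_aligned:.0%} of comparisons against SFT (simulated)")
print(f"  • Mean quality score improved: {base_scores.mean():.1f} → {sft_scores.mean():.1f} → {aligned_scores.mean():.1f}")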
print("  • Safety metrics improved across the board")

---
## Part 5: Common Pitfalls and Best Practices

```
    ┌────────────────────────────────────────────────────────────────┐
    │              PITFALLS TO AVOID                                 │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  1. REWARD HACKING                                            │
    │     Model exploits reward signal without being helpful       │
    │     Fix: Stronger KL penalty, diverse evaluation             │
    │                                                                │
    │  2. MODE COLLAPSE                                             │
    │     Model gives same response to everything                  │
    │     Fix: Lower learning rate, increase KL penalty            │
    │                                                                │
    │  3. CATASTROPHIC FORGETTING                                   │
    │     Model forgets base knowledge                             │
    │     Fix: Use LoRA, smaller learning rate                     │
    │                                                                │
    │  4. LENGTH GAMING                                             │
    │     Model learns "longer = better"                           │
    │     Fix: Length normalization in reward                      │
    │                                                                │
    │  5. SYCOPHANCY                                                │
    │     Model agrees with user even when wrong                   │
    │     Fix: Include honesty examples in training data           │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
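
Length gaming often starts in the data: if the chosen response is almost always the longer one, the model can learn "longer = better" as a shortcut. A quick check on the preference pairs, sketched below with a tiny made-up dataset and a hypothetical helper `length_bias_report`, shows how skewed a dataset is before any training.

```python
def length_bias_report(pairs):
    """Summarize how often the chosen response is the longer one."""
    longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    mean_chosen = sum(len(p["chosen"]) for p in pairs) / len(pairs)
    mean_rejected = sum(len(p["rejected"]) for p in pairs) / len(pairs)
    return {
        "chosen_longer_fraction": longer / len(pairs),
        "mean_chosen_chars": mean_chosen,
        "mean_rejected_chars": mean_rejected,
    }


# Tiny made-up dataset; a real check would run over the full preference set.
pairs = [
    {"chosen": "Paris is the capital of France.", "rejected": "Paris."},
    {"chosen": "Use a list comprehension here.", "rejected": "idk"},
    {"chosen": "No.", "rejected": "I would rather not answer that question."},
    {"chosen": "Water boils at 100 degrees Celsius at sea level.", "rejected": "Hot."},
]
report = length_bias_report(pairs)
print(report)
```

A fraction far above 0.5 does not prove the labels are wrong, but it suggests also checking whether win rates survive a length-controlled comparison.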

In [None]:
# Visualize pitfalls and solutions

fig, ax = plt.subplots(figsize=(14, 10))
ax.set_xlim(0, 14)
ax.set_ylim(0, 12)
ax.axis('off')
ax.set_title('Common Alignment Pitfalls and Solutions', fontsize=16, fontweight='bold')

pitfalls = [
    (2.5, 9.5, 'Reward\nHacking', 'Model exploits reward\nwithout being helpful', 'Increase β,\ndiverse eval'),
    (7, 9.5, 'Mode\nCollapse', 'Same response\nto everything', 'Lower LR,\nmore KL'),
    (11.5, 9.5, 'Catastrophic\nForgetting', 'Forgets base\nknowledge', 'Use LoRA,\nlower LR'),
    (4.75, 5.5, 'Length\nGaming', 'Learns longer\nis better', 'Length\nnormalization'),
    (9.25, 5.5, 'Sycophancy', 'Agrees even\nwhen wrong', 'Honesty\ntraining data'),
]

for x, y, title, problem, solution in pitfalls:
    # Problem box (red)
    prob_box = FancyBboxPatch((x-1.5, y), 3, 2, boxstyle="round,pad=0.1",
                               facecolor='#ffcdd2', edgecolor='#d32f2f', linewidth=2)
    ax.add_patch(prob_box)
    ax.text(x, y+1.6, title, ha='center', fontsize=9, fontweight='bold', color='#d32f2f')
    ax.text(x, y+0.5, problem, ha='center', fontsize=7)
    
    # Solution box (green) below
    sol_box = FancyBboxPatch((x-1.3, y-2.3), 2.6, 1.5, boxstyle="round,pad=0.1",
                              facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=2)
    ax.add_patch(sol_box)
    ax.text(x, y-1.3, 'FIX:', ha='center', fontsize=8, fontweight='bold', color='#388e3c')
    ax.text(x, y-1.9, solution, ha='center', fontsize=8)
    
    # Arrow
    ax.annotate('', xy=(x, y-0.7), xytext=(x, y-0.1),
                arrowprops=dict(arrowstyle='->', lw=1.5, color='#666'))

# Best practices section
ax.text(7, 2, 'BEST PRACTICES', ha='center', fontsize=12, fontweight='bold', color='#1976d2')
practices = [
    'Start small (test pipeline on GPT-2 first)',
    'Monitor KL divergence throughout training',
    'Use diverse evaluation (automated + human)',
    'Save checkpoints frequently',
]
for i, practice in enumerate(practices):
    ax.text(7, 1.3 - i*0.4, f'• {practice}', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

---
## Summary: The Alignment Recipe

### Pipeline Overview

| Stage | Purpose | Data Needed | Output |
|-------|---------|-------------|--------|
| Pre-train | Learn language | Internet text | Base LLM |
| SFT | Learn to follow instructions | Instruction-response pairs | Instruction-tuned model |
| DPO | Learn preferences | Preference pairs | Aligned model |

### Key Hyperparameters

| Parameter | SFT | DPO |
|-----------|-----|-----|
| Learning rate | 2e-4 | 5e-5 |
| Batch size | 4 | 2 |
| Epochs | 1-3 | 1 |
| β (KL penalty) | N/A | 0.1 |

### Quick Reference

```python
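# Minimal alignment pipeline (sketch: `model`, the datasets, and the two
# config objects are assumed to be defined earlier in the notebook)
from trl import SFTTrainer, DPOTrainer

# Step 1: SFT
# Keyword arguments avoid relying on positional order, which differs
# between the two trainers.
sft_trainer = SFTTrainer(model=model, args=sft_args, train_dataset=sft_dataset)
sft_trainer.train()
sft_model = sft_trainer.model

# Step 2: DPO, starting from the SFT model
dpo_trainer = DPOTrainer(model=sft_model, args=dpo_args, train_dataset=preference_dataset)
dpo_trainer.train()

# Step 3: Evaluate
# Compare to baseline on held-out prompts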
```
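
Step 3 is left as a comment in the block above. One common way to compare against a baseline is a pairwise win rate over held-out prompts. The sketch below assumes you already have one judgment per prompt (`"aligned"`, `"baseline"`, or `"tie"`) from a human rater or a judge model; the `win_rate` helper and the label strings are hypothetical.

```python
from collections import Counter

def win_rate(judgments):
    """Fraction of prompts where the aligned model wins; ties count as half a win."""
    counts = Counter(judgments)
    unknown = set(counts) - {"aligned", "baseline", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments provided")
    return (counts["aligned"] + 0.5 * counts["tie"]) / total

# Made-up judgments for five held-out prompts
judgments = ["aligned", "aligned", "baseline", "tie", "aligned"]
print(f"Win rate vs. baseline: {win_rate(judgments):.2f}")  # 0.70
```

A win rate near 0.5 means the aligned model is indistinguishable from the baseline on these prompts; with only a handful of prompts, differences this small are mostly noise.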

---
## Test Your Understanding

**1. What's the difference between SFT and DPO?**
<details>
<summary>Click to reveal answer</summary>
SFT (Supervised Fine-Tuning) teaches the model WHAT good responses look like by training on instruction-response pairs. It's like showing an apprentice examples of good work.

DPO (Direct Preference Optimization) teaches the model WHICH responses humans prefer by training on preference pairs (chosen vs rejected). It's like mentorship - "this is better than that."

SFT comes first (foundation), DPO refines (polish).
</details>

**2. Why use LoRA instead of full fine-tuning?**
<details>
<summary>Click to reveal answer</summary>
LoRA (Low-Rank Adaptation) is more efficient:
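- Needs far less GPU memory for gradients and optimizer state (the frozen base weights must still fit in memory)
- Trains only a small fraction of the parameters (often well under 1%)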
- Reduces catastrophic forgetting
- Enables training large models on consumer GPUs

Full fine-tuning updates ALL parameters, which requires massive compute and risks losing base knowledge.
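
The parameter savings follow from simple arithmetic. A minimal sketch for a single weight matrix, assuming the standard LoRA form `W + B @ A` with `B` of shape `d_out x rank` and `A` of shape `rank x d_in` (the 4096 and rank-8 values are only an example):

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (LoRA params, frozen params, ratio) for one adapted weight matrix."""
    full = d_out * d_in            # parameters in the frozen matrix W
    lora = rank * (d_out + d_in)   # parameters in B (d_out x rank) and A (rank x d_in)
    return lora, full, lora / full

lora, full, frac = lora_param_counts(4096, 4096, rank=8)
print(f"{lora:,} trainable vs {full:,} frozen ({frac:.2%})")
# 65,536 trainable vs 16,777,216 frozen (0.39%)
```

The model-wide fraction is usually lower still, because adapters are typically attached to only some of the weight matrices.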
</details>

**3. What is reward hacking and how do you prevent it?**
<details>
<summary>Click to reveal answer</summary>
Reward hacking is when the model finds ways to maximize the reward signal without actually being helpful. Examples:
- Writing longer responses (if length correlates with reward)
- Using specific phrases that score high
- Being sycophantic (agreeing with everything)

Prevention:
- Increase β (KL penalty) to stay close to reference
- Use diverse evaluation metrics
- Include human evaluation
- Length normalization
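
Length bias is often visible in the preference data before any training happens. A minimal sketch, assuming each pair is a dict with `chosen` and `rejected` strings; whitespace word counts stand in for token counts, and `length_bias` is a hypothetical helper:

```python
def length_bias(pairs):
    """Return (fraction of pairs where chosen is longer, mean length gap in words)."""
    if not pairs:
        raise ValueError("no pairs provided")
    gaps = [len(p["chosen"].split()) - len(p["rejected"].split()) for p in pairs]
    frac_longer = sum(g > 0 for g in gaps) / len(gaps)
    return frac_longer, sum(gaps) / len(gaps)

# Made-up pairs for illustration
pairs = [
    {"chosen": "Paris is the capital of France.", "rejected": "Paris."},
    {"chosen": "Yes.", "rejected": "Well, it depends on many different factors."},
    {"chosen": "Use a list comprehension here.", "rejected": "Loops."},
]
frac_longer, mean_gap = length_bias(pairs)
print(f"chosen longer in {frac_longer:.0%} of pairs, mean gap {mean_gap:+.1f} words")
```

If the chosen response is longer in the large majority of pairs, the model can learn "longer is better" from the data alone, so it is worth rebalancing or normalizing before training.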
</details>

**4. Why is KL divergence monitoring important?**
<details>
<summary>Click to reveal answer</summary>
KL divergence measures how different the trained model is from the reference model.

- Too low (< 1): Model isn't learning much from preferences
- Good range (5-15): Meaningful improvement while staying stable
- Too high (> 15): Risk of reward hacking or catastrophic forgetting

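High KL means the model has drifted far from its original behavior, which can cause unexpected failures. Treat the thresholds above as rough rules of thumb: the scale depends on the model and on whether KL is summed per sequence or averaged per token.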
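
A simple Monte Carlo estimate of KL(policy || reference) averages the log-probability gap over responses sampled from the policy. The sketch assumes you already have summed sequence log-probabilities from both models for the same sampled responses; the numbers are made up and `estimate_kl` is a hypothetical helper.

```python
def estimate_kl(policy_logps, ref_logps):
    """Monte Carlo estimate of KL(policy || ref) from responses sampled from the policy."""
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("need two non-empty lists of equal length")
    gaps = [p - r for p, r in zip(policy_logps, ref_logps)]
    return sum(gaps) / len(gaps)

# Log-probabilities of three sampled responses under each model (illustrative)
policy_logps = [-20.0, -35.5, -18.0]
ref_logps = [-26.0, -41.5, -30.0]
kl = estimate_kl(policy_logps, ref_logps)
print(f"Estimated KL: {kl:.1f}")  # 8.0
```

With few samples this estimate is noisy and can even come out negative, so track its trend over training rather than any single value.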
</details>

**5. What makes a good preference dataset?**
<details>
<summary>Click to reveal answer</summary>
Good preference data has:
- Clear differences between chosen and rejected (not ambiguous)
- Diversity of topics and response types
- Consistent labeling (no contradictions)
- Quality over quantity (a clean 1K-pair set can beat a noisy 100K-pair set)

The chosen response should be clearly better - helpful, accurate, and well-structured - while rejected should have clear flaws.
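
Some of these checks can be automated. A minimal sketch, assuming each pair is a dict with `prompt`, `chosen`, and `rejected` strings; `clean_preference_pairs` is a hypothetical helper that only catches exact-match problems, not subtle label noise:

```python
def clean_preference_pairs(pairs):
    """Drop pairs with no signal, contradictory labels, or exact duplicates."""
    all_keys = {(p["prompt"], p["chosen"], p["rejected"]) for p in pairs}
    kept, emitted = [], set()
    for p in pairs:
        key = (p["prompt"], p["chosen"], p["rejected"])
        flipped = (p["prompt"], p["rejected"], p["chosen"])
        if p["chosen"].strip() == p["rejected"].strip():
            continue  # identical responses carry no preference signal
        if flipped in all_keys:
            continue  # same prompt labeled both ways: contradictory
        if key in emitted:
            continue  # exact duplicate
        emitted.add(key)
        kept.append(p)
    return kept

pairs = [
    {"prompt": "2+2?", "chosen": "4", "rejected": "5"},
    {"prompt": "2+2?", "chosen": "4", "rejected": "5"},          # duplicate
    {"prompt": "Hi", "chosen": "Hello!", "rejected": "Hello!"},  # no signal
    {"prompt": "Tip?", "chosen": "A", "rejected": "B"},          # contradicts next
    {"prompt": "Tip?", "chosen": "B", "rejected": "A"},
]
cleaned = clean_preference_pairs(pairs)
print(f"Kept {len(cleaned)} of {len(pairs)} pairs")  # Kept 1 of 5 pairs
```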
</details>

---
## What's Next?

You've now seen a complete LLM alignment case study! In the final notebook, we'll explore **Multi-Agent Systems** - what happens when multiple RL agents interact.

**Continue to:** [Notebook 5: Multi-Agent Systems](05_multi_agent_systems.ipynb)

---

*"Alignment is not a one-time fix - it's an ongoing journey of understanding and improving."*