# Direct Preference Optimization

In this notebook, you'll implement DPO from scratch and train a small model with it.

**What you'll do:**
- Compute the DPO loss by hand for preference pairs, verifying the formula with real numbers
- Implement the DPO loss function in PyTorch (~10 lines) and inspect its gradients
- Train GPT-2 small on preference pairs using your DPO loss function
- Extract the implicit reward model from the trained policy and verify it generalizes

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones — they reveal gaps in your mental model.

**Important:** Exercises are cumulative. Exercise 2 verifies against Exercise 1's numbers. Exercise 3 uses Exercise 2's loss function. Exercise 4 uses Exercise 3's trained model.

In [None]:
# Setup — self-contained for Google Colab
# transformers and torch are pre-installed in Colab

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
from copy import deepcopy
from transformers import AutoTokenizer, AutoModelForCausalLM

# Reproducible results
torch.manual_seed(42)
np.random.seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print("Setup complete.")

In [None]:
# Shared helper: compute log-probability of a response given a prompt
#
# This is the fundamental quantity in DPO. The loss operates on
# log P(response | prompt) under both the policy and reference models.
# A full response's log-probability is the sum of per-token log-probs.

def get_response_log_prob(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    """Compute log P(response | prompt) under a model.

    Returns a scalar tensor (sum of per-token log-probs).
    Keeps gradients when the model's parameters require grad and the call
    is not wrapped in torch.no_grad().
    """
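    # NOTE: this assumes the tokens of `prompt` are a prefix of the tokens of
    # `prompt + response`. BPE can merge across the boundary (for example when
    # the prompt ends in a space), which can misalign `response_start`.
    prompt_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    full_ids = tokenizer.encode(prompt + response, return_tensors="pt").to(model.device)
    response_start = prompt_ids.shape[1]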

    # If the response is empty or only the prompt, return 0
    if full_ids.shape[1] <= response_start:
        return torch.tensor(0.0, device=model.device)

    outputs = model(full_ids)
    logits = outputs.logits

    # For each response position, get the log-prob of the actual next token
    # logits[0, t] predicts token t+1, so we shift by 1
    log_probs_all = F.log_softmax(logits[0], dim=-1)
    response_token_ids = full_ids[0, response_start:]
    token_log_probs = log_probs_all[response_start - 1 : -1]
    token_log_probs = token_log_probs.gather(1, response_token_ids.unsqueeze(1)).squeeze(1)

    return token_log_probs.sum()

---

## Exercise 1: Verify the DPO Loss by Hand (Guided)

The DPO loss for a single preference pair is:

$$\mathcal{L}_\text{DPO} = -\log \sigma\!\left(\beta \left(\log \frac{\pi(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right)$$

where $\sigma$ is the sigmoid function and $\beta$ controls how much the policy can deviate from the reference.

In this exercise, you have pre-computed log-probabilities for 5 preference pairs. You will:
1. Compute the DPO loss for each pair step by step
2. Compute the total loss (average across pairs)
3. Vary $\beta$ and observe how it controls conservatism

**Before running, predict:**
- For a pair where the policy already strongly prefers the preferred response (large positive log-ratio difference), will the loss be large or small?
- For a pair where the policy disagrees with the human (negative log-ratio difference), will the loss be large or small?
- As $\beta$ increases, does the loss become more or less sensitive to the log-ratio difference?

In [None]:
# --- Pre-computed log-probabilities for 5 preference pairs ---
#
# These are log P(response | prompt) under the policy and reference models.
# Negative numbers (log-probs of full sequences are always negative).
# The policy has started to learn preferences but hasn't converged.

preference_pairs = [
    {
        "prompt": "Explain quantum computing to a 10-year-old.",
        "preferred": "Age-appropriate analogy response",
        "dispreferred": "Jargon-heavy technical response",
        "policy_logp_w": -45.2,   # log pi(y_w|x)
        "policy_logp_l": -42.8,   # log pi(y_l|x)
        "ref_logp_w": -48.1,      # log pi_ref(y_w|x)
        "ref_logp_l": -43.0,      # log pi_ref(y_l|x)
    },
    {
        "prompt": "Is it safe to eat raw cookie dough?",
        "preferred": "Nuanced safety answer with caveats",
        "dispreferred": "Overconfident 'yes it's fine' answer",
        "policy_logp_w": -38.5,
        "policy_logp_l": -35.2,
        "ref_logp_w": -39.0,
        "ref_logp_l": -36.1,
    },
    {
        "prompt": "Write a haiku about machine learning.",
        "preferred": "Creative, follows haiku structure",
        "dispreferred": "Generic, doesn't follow structure",
        "policy_logp_w": -28.3,
        "policy_logp_l": -31.7,
        "ref_logp_w": -30.1,
        "ref_logp_l": -30.5,
    },
    {
        "prompt": "What causes rain?",
        "preferred": "Accurate scientific explanation",
        "dispreferred": "Vague, slightly wrong explanation",
        "policy_logp_w": -52.0,
        "policy_logp_l": -55.1,
        "ref_logp_w": -53.2,
        "ref_logp_l": -54.0,
    },
    {
        "prompt": "Summarize the plot of Romeo and Juliet.",
        "preferred": "Concise, accurate summary",
        "dispreferred": "Rambling, inaccurate summary",
        "policy_logp_w": -61.3,
        "policy_logp_l": -60.8,
        "ref_logp_w": -62.0,
        "ref_logp_l": -61.5,
    },
]

print("Pre-computed log-probabilities for 5 preference pairs:")
print(f"{'Pair':<6} {'Prompt':<45} {'policy_w':>10} {'policy_l':>10} {'ref_w':>10} {'ref_l':>10}")
print("-" * 95)
for i, pair in enumerate(preference_pairs):
    prompt_short = pair['prompt'][:42] + '...' if len(pair['prompt']) > 42 else pair['prompt']
    print(f"{i+1:<6} {prompt_short:<45} {pair['policy_logp_w']:>10.1f} {pair['policy_logp_l']:>10.1f} {pair['ref_logp_w']:>10.1f} {pair['ref_logp_l']:>10.1f}")

In [None]:
# --- Step-by-step DPO loss computation ---
#
# For each pair, we compute:
# 1. Log-ratio for preferred:    log(pi/pi_ref) for y_w
# 2. Log-ratio for dispreferred: log(pi/pi_ref) for y_l
# 3. Difference: log_ratio_w - log_ratio_l
# 4. Scale by beta: beta * difference
# 5. Apply -log(sigmoid(...)): the DPO loss for this pair

import math

def sigmoid(x):
    """The logistic sigmoid function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


beta = 0.1  # A typical value — controls conservatism

print(f"Computing DPO loss step-by-step with beta = {beta}")
print("=" * 90)

losses = []

for i, pair in enumerate(preference_pairs):
    # Step 1: Log-ratios (how much has the policy shifted from the reference?)
    log_ratio_w = pair['policy_logp_w'] - pair['ref_logp_w']
    log_ratio_l = pair['policy_logp_l'] - pair['ref_logp_l']

    # Step 2: Difference in log-ratios
    diff = log_ratio_w - log_ratio_l

    # Step 3: Scale by beta
    scaled = beta * diff

    # Step 4: Apply -log(sigmoid(...))
    sig = sigmoid(scaled)
    loss = -math.log(sig)
    losses.append(loss)

    print(f"\nPair {i+1}: {pair['prompt']}")
    print(f"  Preferred:    {pair['preferred']}")
    print(f"  Dispreferred: {pair['dispreferred']}")
    print(f"  Log-ratio (preferred):    {pair['policy_logp_w']:.1f} - ({pair['ref_logp_w']:.1f}) = {log_ratio_w:+.2f}")
    print(f"  Log-ratio (dispreferred): {pair['policy_logp_l']:.1f} - ({pair['ref_logp_l']:.1f}) = {log_ratio_l:+.2f}")
    print(f"  Difference:  {log_ratio_w:.2f} - {log_ratio_l:.2f} = {diff:+.2f}")
    print(f"  Scaled:      {beta} * {diff:.2f} = {scaled:+.4f}")
    print(f"  sigma({scaled:.4f}) = {sig:.4f}")
    print(f"  Loss = -log({sig:.4f}) = {loss:.4f}")

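    # Interpretation (a small tolerance absorbs float rounding when diff is ~0)
    if diff > 1.0:
        print("  --> Model AGREES with preference (positive diff). Low-ish loss.")
    elif diff > 1e-9:
        print("  --> Model SLIGHTLY agrees. Moderate loss, still learning.")
    elif diff >= -1e-9:
        print("  --> Model is INDIFFERENT (diff ~ 0). Loss = log(2), same as at initialization.")
    else:
        print("  --> Model DISAGREES with preference (negative diff). High loss, strong gradient.")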

avg_loss = sum(losses) / len(losses)
print(f"\n{'=' * 90}")
print(f"Average DPO loss across all {len(losses)} pairs: {avg_loss:.4f}")

In [None]:
# --- Vary beta and observe its effect ---
#
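# beta scales the log-ratio difference before the sigmoid, so it sets how
# sharply the loss responds to a given shift away from the reference.
# Higher beta = more conservative: a small shift toward the preferred response
# already lowers the loss, and a shift the wrong way is penalized harder.
# Lower beta = more aggressive: the policy must move further from the
# reference to change the loss by the same amount.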

beta_values = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]

print(f"DPO loss at different beta values:")
print(f"{'Beta':<8}" + "".join(f"{'Pair ' + str(i+1):>10}" for i in range(5)) + f"{'Average':>10}")
print("-" * 68)

avg_losses_by_beta = []

for beta_val in beta_values:
    pair_losses = []
    for pair in preference_pairs:
        log_ratio_w = pair['policy_logp_w'] - pair['ref_logp_w']
        log_ratio_l = pair['policy_logp_l'] - pair['ref_logp_l']
        diff = log_ratio_w - log_ratio_l
        scaled = beta_val * diff
        loss = -math.log(sigmoid(scaled))
        pair_losses.append(loss)

    avg = sum(pair_losses) / len(pair_losses)
    avg_losses_by_beta.append(avg)
    vals = "".join(f"{l:>10.4f}" for l in pair_losses)
    print(f"{beta_val:<8.2f}{vals}{avg:>10.4f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Average loss vs beta
ax = axes[0]
ax.plot(beta_values, avg_losses_by_beta, 'o-', color='#a78bfa', linewidth=2, markersize=8)
ax.set_xlabel('Beta', fontsize=11)
ax.set_ylabel('Average DPO Loss', fontsize=11)
ax.set_title('Average Loss vs Beta', fontsize=12)
ax.grid(alpha=0.2)

# Right: Per-pair loss at different betas
ax2 = axes[1]
colors = ['#6366f1', '#f59e0b', '#22c55e', '#06b6d4', '#ef4444']
for pair_idx in range(5):
    pair_losses_across_beta = []
    for beta_val in beta_values:
        pair = preference_pairs[pair_idx]
        log_ratio_w = pair['policy_logp_w'] - pair['ref_logp_w']
        log_ratio_l = pair['policy_logp_l'] - pair['ref_logp_l']
        diff = log_ratio_w - log_ratio_l
        scaled = beta_val * diff
        loss = -math.log(sigmoid(scaled))
        pair_losses_across_beta.append(loss)
    ax2.plot(beta_values, pair_losses_across_beta, 'o-', color=colors[pair_idx],
             label=f'Pair {pair_idx+1}', linewidth=2, markersize=6)

ax2.set_xlabel('Beta', fontsize=11)
ax2.set_ylabel('DPO Loss', fontsize=11)
ax2.set_title('Per-Pair Loss vs Beta', fontsize=12)
ax2.legend(fontsize=9)
ax2.grid(alpha=0.2)

plt.tight_layout()
plt.show()

print("\nKey observations:")
print("- Beta sets the slope, not a uniform scale. Raising beta LOWERS the loss when the")
print("  log-ratio difference is positive (Pairs 1, 3, 4) and RAISES it when negative (Pair 2).")
print("- Higher beta = more conservative: a small shift from the reference already changes the")
print("  loss a lot, so the policy does not need to move far. Lower beta = more aggressive.")
print("- Pair 2 (disagrees) and Pair 5 (indifferent, diff ~ 0, loss = log 2 at every beta)")
print("  have the highest losses. DPO focuses learning on the cases where the model is most wrong.")

**What you just computed:** The DPO loss for each preference pair, step by step. The loss is small when the model already agrees with the preference (the log-ratio for the preferred response is higher than for the dispreferred) and large when the model disagrees.

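The beta parameter controls conservatism, but not by uniformly amplifying the loss. Beta scales the log-ratio difference before it passes through the sigmoid, so a higher beta lowers the loss for pairs the policy already gets right and raises it for pairs it gets wrong. Because a small shift away from the reference already moves the loss substantially, the loss saturates after a smaller deviation and the policy stays closer to the reference. With a lower beta the policy must move further to achieve the same loss reduction, which permits more aggressive optimization.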

Notice that the loss operates on **log-ratios** (policy minus reference), not absolute log-probabilities. This is the implicit KL penalty — DPO measures how much the policy has *changed*, not the absolute quality of the response.
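To make the role of the log-ratios and of beta concrete, here is a minimal sketch in plain Python. The helper `pair_loss` is an illustrative assumption, not part of the notebook's exercises; it reuses Pair 1's numbers from above. It checks two things: adding the same constant to both the policy and reference log-probs of a response leaves the loss unchanged, and raising beta lowers the loss for a positive log-ratio difference while raising it for a negative one.

```python
import math

def pair_loss(policy_w, policy_l, ref_w, ref_l, beta):
    """DPO loss for one pair, written as log(1 + exp(-z)) = -log(sigmoid(z))."""
    z = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return math.log1p(math.exp(-z))

# 1) Only log-ratios matter: shift the policy AND reference log-probs of y_w
#    by the same amount and the loss does not move.
base = pair_loss(-45.2, -42.8, -48.1, -43.0, beta=0.1)
shifted = pair_loss(-45.2 - 10.0, -42.8, -48.1 - 10.0, -43.0, beta=0.1)
print(f"base = {base:.6f}, shifted = {shifted:.6f}")

# 2) Beta sets the slope: compare a pair the policy gets right (diff = +2)
#    with one it gets wrong (diff = -2).
for beta in (0.05, 0.1, 0.5, 1.0):
    agree = math.log1p(math.exp(-beta * 2.0))
    disagree = math.log1p(math.exp(beta * 2.0))
    print(f"beta={beta:<4}  loss(diff=+2)={agree:.4f}  loss(diff=-2)={disagree:.4f}")
```

Both cases meet at log(2) when the difference is zero, which is why beta changes how sharply the loss responds to a given shift instead of moving every loss in one direction.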

---

## Exercise 2: Implement the DPO Loss Function (Supported)

Now implement the DPO loss as a PyTorch function. The implementation is short (~5-10 lines of core logic) but each line maps to a step in the derivation you just computed by hand.

After implementing, you'll:
1. Verify your implementation against Exercise 1's hand calculations
2. Use autograd to compute the gradient and confirm it pushes in the right direction

<details>
<summary>Hint</summary>

The steps are exactly what you did by hand:
1. Compute log-ratios: `policy_logp - ref_logp` for preferred and dispreferred
2. Compute the difference: `log_ratio_w - log_ratio_l`
3. Scale by beta: `beta * difference`
4. Apply negative log-sigmoid: `-F.logsigmoid(scaled)` (use `F.logsigmoid` — it is numerically stable)
5. Average over the batch: `.mean()`

`F.logsigmoid(x)` computes `log(sigmoid(x))` in a numerically stable way.

</details>

In [None]:
def dpo_loss(
    policy_logps_w: torch.Tensor,   # log P(y_w|x) under policy, shape [batch]
    policy_logps_l: torch.Tensor,   # log P(y_l|x) under policy, shape [batch]
    ref_logps_w: torch.Tensor,      # log P(y_w|x) under reference, shape [batch]
    ref_logps_l: torch.Tensor,      # log P(y_l|x) under reference, shape [batch]
    beta: float = 0.1,
) -> torch.Tensor:
    """Compute the DPO loss.

    Returns a scalar tensor: the mean loss across the batch.

    The DPO loss is:
      -log sigma(beta * (log(pi/pi_ref)(y_w) - log(pi/pi_ref)(y_l)))
    averaged over the batch.
    """
    # TODO: Compute log-ratios for preferred and dispreferred responses
    # Each is: policy_logp - ref_logp
    # YOUR CODE HERE (2 lines)
    log_ratio_w = None  # REPLACE THIS
    log_ratio_l = None  # REPLACE THIS

    # TODO: Compute the scaled difference and apply -log(sigmoid(...))
    # Use F.logsigmoid() for numerical stability, then negate and take the mean.
    # YOUR CODE HERE (2 lines)
    logits = None  # REPLACE THIS
    loss = None    # REPLACE THIS

    return loss

In [None]:
# --- Verify against Exercise 1's hand calculations ---

# Convert the hand-computed data to tensors
policy_w = torch.tensor([p['policy_logp_w'] for p in preference_pairs])
policy_l = torch.tensor([p['policy_logp_l'] for p in preference_pairs])
ref_w = torch.tensor([p['ref_logp_w'] for p in preference_pairs])
ref_l = torch.tensor([p['ref_logp_l'] for p in preference_pairs])

# Compute with your implementation
computed_loss = dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1)

# Compare to the hand-computed average from Exercise 1
hand_computed_avg = avg_loss  # From the earlier cell

print(f"Hand-computed average loss (Exercise 1): {hand_computed_avg:.4f}")
print(f"PyTorch implementation loss:             {computed_loss.item():.4f}")
print(f"Difference:                              {abs(computed_loss.item() - hand_computed_avg):.6f}")

if abs(computed_loss.item() - hand_computed_avg) < 0.001:
    print("\nMatch! Your implementation agrees with the hand calculation.")
else:
    print("\nMismatch. Check your implementation — it should produce the same result.")

In [None]:
# --- Inspect the gradient direction via autograd ---
#
# The DPO gradient should push the policy to:
#   - INCREASE log-prob of preferred responses (NEGATIVE gradient on policy_logps_w,
#     because gradient descent subtracts the gradient)
#   - DECREASE log-prob of dispreferred responses (POSITIVE gradient on policy_logps_l)
#
# The magnitude should be proportional to how much the model disagrees:
# large gradient when the model is wrong, small when it already agrees.

# Create tensors that require gradients (simulating differentiable policy log-probs)
policy_w_grad = torch.tensor([p['policy_logp_w'] for p in preference_pairs], requires_grad=True)
policy_l_grad = torch.tensor([p['policy_logp_l'] for p in preference_pairs], requires_grad=True)
ref_w_fixed = torch.tensor([p['ref_logp_w'] for p in preference_pairs])  # Reference is frozen
ref_l_fixed = torch.tensor([p['ref_logp_l'] for p in preference_pairs])  # Reference is frozen

# Forward pass
loss = dpo_loss(policy_w_grad, policy_l_grad, ref_w_fixed, ref_l_fixed, beta=0.1)

# Backward pass
loss.backward()

print("Gradients of the DPO loss with respect to policy log-probs:")
print(f"{'Pair':<6} {'grad(policy_w)':>15} {'grad(policy_l)':>15} {'Direction':>25}")
print("-" * 65)

for i in range(5):
    gw = policy_w_grad.grad[i].item()
    gl = policy_l_grad.grad[i].item()
    # Negative gradient on policy_w means: decreasing policy_w INCREASES loss,
    # so the optimizer will INCREASE policy_w (gradient descent = subtract gradient).
    direction = ""
    if gw < 0 and gl > 0:
        direction = "Increase w, decrease l"
    elif gw < 0 and gl < 0:
        direction = "Increase w, increase l"
    elif gw > 0 and gl > 0:
        direction = "Decrease w, decrease l"
    else:
        direction = "Decrease w, increase l"
    print(f"{i+1:<6} {gw:>15.6f} {gl:>15.6f} {direction:>25}")

print("\nKey observations:")
print("- grad(policy_w) is NEGATIVE for all pairs: the optimizer will INCREASE preferred log-probs.")
print("- grad(policy_l) is POSITIVE for all pairs: the optimizer will DECREASE dispreferred log-probs.")
print("- Gradient MAGNITUDE is largest for Pair 2 (disagrees) and Pair 5 (indifferent), smallest where the model already agrees.")
print("  DPO focuses learning on the hard cases, just as we saw in Exercise 1.")

<details>
<summary>Solution</summary>

The DPO loss implementation maps directly to the derivation steps:

```python
def dpo_loss(
    policy_logps_w: torch.Tensor,
    policy_logps_l: torch.Tensor,
    ref_logps_w: torch.Tensor,
    ref_logps_l: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Log-ratios: how much has the policy shifted from the reference?
    log_ratio_w = policy_logps_w - ref_logps_w
    log_ratio_l = policy_logps_l - ref_logps_l

    # DPO loss: -log sigma(beta * (log_ratio_w - log_ratio_l))
    logits = beta * (log_ratio_w - log_ratio_l)
    loss = -F.logsigmoid(logits).mean()

    return loss
```

**Why each line:**
- `log_ratio_w` and `log_ratio_l`: These are the log-ratios from the derivation. They measure how much the policy has shifted from the reference for each response. This is the implicit KL penalty — large shifts produce large ratios.
- `logits`: The difference in log-ratios, scaled by beta. When this is large and positive, the model agrees with the preference. When negative, it disagrees.
- `-F.logsigmoid(logits).mean()`: The negative log-sigmoid produces high loss when logits are negative (model disagrees) and low loss when logits are positive (model agrees). `.mean()` averages over the batch.

**Why `F.logsigmoid` instead of `torch.log(torch.sigmoid(...))`?** `F.logsigmoid` is numerically stable for large negative inputs where `sigmoid(x)` would underflow to 0 and `log(0)` would produce `-inf`.
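The same failure can be reproduced without PyTorch. The sketch below is an illustration in plain Python, where the naive form raises `OverflowError` for very negative inputs instead of returning `-inf`. `stable_log_sigmoid` is a hypothetical helper that uses one standard stable rewrite of the identity; it is not a description of PyTorch's internals.

```python
import math

def naive_log_sigmoid(x):
    # Direct translation of log(sigmoid(x)); exp(-x) overflows for very negative x.
    return math.log(1.0 / (1.0 + math.exp(-x)))

def stable_log_sigmoid(x):
    # log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|)); the exponent is never positive.
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

for x in (-1000.0, -5.0, 0.0, 5.0):
    try:
        naive = f"{naive_log_sigmoid(x):.6f}"
    except OverflowError:
        naive = "OverflowError"
    print(f"x={x:>8}  naive={naive:>14}  stable={stable_log_sigmoid(x):.6f}")
```

For moderate inputs the two forms agree; only the stable form survives the extreme case.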

**Common mistake:** Forgetting to negate the log-sigmoid. `F.logsigmoid(x)` returns `log(sigmoid(x))`, which is always negative. The DPO loss is `-log(sigmoid(x))`, which is always positive. Without the negation, you would be *rewarding* disagreement with preferences.

</details>

**What you just verified:** Your DPO loss implementation reproduces the hand calculation from Exercise 1 to within floating-point tolerance. The gradient inspection confirms the loss pushes in the right direction: increase preferred log-probs, decrease dispreferred log-probs, with stronger gradients for pairs where the model does not yet agree.

The entire loss function is ~5 lines of core logic. Each line maps to a step in the derivation from the lesson: compute log-ratios (the implicit KL), take their difference (comparing preferred vs dispreferred), scale by beta (control conservatism), apply negative log-sigmoid (convert to a loss). The complexity is in the derivation that justifies these lines, not in the code itself.
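The gradient directions seen in the autograd cell also follow from a short calculation. Writing $z = \beta\,(\text{log\_ratio}_w - \text{log\_ratio}_l)$, the per-pair loss $-\log\sigma(z)$ has derivative $-\beta\,\sigma(-z)$ with respect to the preferred log-prob and $+\beta\,\sigma(-z)$ with respect to the dispreferred one. The sketch below checks this against finite differences in plain Python; the helper names `pair_loss` and `analytic_grads` are illustrative and not part of the notebook.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    z = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -math.log(sigmoid(z))

def analytic_grads(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Closed-form d(loss)/d(policy_w) and d(loss)/d(policy_l) for one pair."""
    z = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    weight = beta * sigmoid(-z)  # grows as z becomes more negative (policy disagrees)
    return -weight, +weight

# Pair 2 from Exercise 1 (the policy disagrees with the preference).
args = (-38.5, -35.2, -39.0, -36.1)
eps = 1e-6
numeric_w = (pair_loss(args[0] + eps, *args[1:]) - pair_loss(args[0] - eps, *args[1:])) / (2 * eps)
numeric_l = (pair_loss(args[0], args[1] + eps, *args[2:]) - pair_loss(args[0], args[1] - eps, *args[2:])) / (2 * eps)
grad_w, grad_l = analytic_grads(*args)

print(f"d/d policy_w: analytic {grad_w:+.6f}, finite-difference {numeric_w:+.6f}")
print(f"d/d policy_l: analytic {grad_l:+.6f}, finite-difference {numeric_l:+.6f}")
```

These are per-pair gradients. Because `dpo_loss` averages over the batch, the values autograd reports for a batch of N pairs are these divided by N.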

---

## Exercise 3: Train a Small Model with DPO (Supported)

Now use your DPO loss function to train GPT-2 small on preference pairs. The training loop follows the familiar pattern — forward, loss, backward, step — with the DPO loss replacing cross-entropy.

You will:
1. Load GPT-2 as the policy model (trainable) and a frozen copy as the reference
2. Create preference pairs with clear quality differences
3. Run a DPO training loop using your loss function from Exercise 2
4. Compare the model's outputs before and after training

<details>
<summary>Hint</summary>

The training loop structure is:
```python
for epoch in range(num_epochs):
    for pair in preference_pairs:
        # Compute log-probs under policy (with gradient)
        # Compute log-probs under reference (no gradient, use torch.no_grad())
        # Compute DPO loss using your function
        # loss.backward()
        # optimizer.step()
        # optimizer.zero_grad()
```

The reference model is frozen — wrap its forward pass in `torch.no_grad()`. The policy model needs gradients — do NOT use `torch.no_grad()` for it.

</details>

In [None]:
# --- Load GPT-2 as policy and reference ---

print("Loading GPT-2 as policy model...")
policy_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

print("Creating frozen reference model (copy of initial policy)...")
ref_model = deepcopy(policy_model)
ref_model.eval()
# Freeze the reference model — no gradients, no updates
for param in ref_model.parameters():
    param.requires_grad = False

print(f"Policy model parameters: {sum(p.numel() for p in policy_model.parameters()):,}")
print(f"Reference model parameters: {sum(p.numel() for p in ref_model.parameters()):,} (frozen)")
print("Models loaded.")

In [None]:
# --- Preference pairs for training ---
#
# Each pair: a prompt, a preferred response (y_w), and a dispreferred response (y_l).
# We use clear quality differences so the effect is visible even with a small model
# and few training steps.

training_pairs = [
    {
        "prompt": "What is the capital of France? ",
        "preferred": "The capital of France is Paris, a city known for its rich history, art, and culture.",
        "dispreferred": "France is a country in Europe. It has many cities. Some cities are big.",
    },
    {
        "prompt": "Explain why the sky is blue. ",
        "preferred": "The sky appears blue because molecules in the atmosphere scatter shorter wavelengths of sunlight more than longer ones, a phenomenon called Rayleigh scattering.",
        "dispreferred": "The sky is blue because it just is. That's the color of the sky. It's always been blue.",
    },
    {
        "prompt": "How does a bicycle stay upright? ",
        "preferred": "A bicycle stays upright through a combination of the rider's balance adjustments, gyroscopic effects of the spinning wheels, and the geometry of the front fork which provides a self-correcting steering effect.",
        "dispreferred": "Bicycles stay up because of balance. You have to balance on them. If you don't balance, you fall.",
    },
    {
        "prompt": "What is machine learning? ",
        "preferred": "Machine learning is a branch of artificial intelligence where systems learn patterns from data rather than being explicitly programmed with rules. The model improves its performance on a task as it sees more examples.",
        "dispreferred": "Machine learning is when computers learn things. They use data and algorithms. It's very complex and difficult to understand.",
    },
    {
        "prompt": "Why do leaves change color in autumn? ",
        "preferred": "Leaves change color in autumn because shorter days trigger trees to stop producing chlorophyll. As the green chlorophyll breaks down, other pigments like carotenoids (yellow, orange) and anthocyanins (red, purple) become visible.",
        "dispreferred": "Leaves change color because of the seasons. When it gets cold, the leaves turn different colors and then they fall off the trees.",
    },
]

print(f"Training data: {len(training_pairs)} preference pairs")
for i, pair in enumerate(training_pairs):
    print(f"\nPair {i+1}: {pair['prompt'].strip()}")
    print(f"  Preferred:    {pair['preferred'][:80]}...")
    print(f"  Dispreferred: {pair['dispreferred'][:80]}...")

In [None]:
# --- Measure log-probs BEFORE training (baseline) ---
#
# Before training, the policy and reference are identical.
# All log-ratios should be 0, and the loss should be -log(sigma(0)) = log(2) ~ 0.693.

print("Log-probabilities BEFORE training (policy = reference):")
print(f"{'Pair':<6} {'policy_w':>12} {'policy_l':>12} {'ref_w':>12} {'ref_l':>12} {'log_ratio_diff':>16}")
print("-" * 75)

policy_model.eval()
with torch.no_grad():
    for i, pair in enumerate(training_pairs):
        pw = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['preferred']).item()
        pl = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        rw = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['preferred']).item()
        rl = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        diff = (pw - rw) - (pl - rl)
        print(f"{i+1:<6} {pw:>12.2f} {pl:>12.2f} {rw:>12.2f} {rl:>12.2f} {diff:>16.4f}")

print(f"\nLog-ratio differences are ~0 because policy = reference (no training yet).")
print(f"Expected loss: -log(sigma(0)) = log(2) = {math.log(2):.4f}")

In [None]:
# --- DPO Training Loop ---
#
# This is the core of the exercise. The training loop looks like supervised
# learning — the complexity is in the loss function, not the loop.

beta = 0.1
learning_rate = 1e-5
num_epochs = 3

optimizer = torch.optim.Adam(policy_model.parameters(), lr=learning_rate)
loss_history = []

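# Note: train() enables GPT-2's dropout, so policy log-probs are slightly noisy
# and the very first loss is close to, but not exactly, log(2).
policy_model.train()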

print(f"Training with beta={beta}, lr={learning_rate}, epochs={num_epochs}")
print(f"{'Epoch':<8} {'Step':<8} {'Loss':>10} {'Prompt':>40}")
print("-" * 70)

for epoch in range(num_epochs):
    epoch_losses = []

    for step, pair in enumerate(training_pairs):
        optimizer.zero_grad()

        # TODO: Compute log-probs under the POLICY model (with gradient)
        # Use get_response_log_prob() for both preferred and dispreferred.
        # Do NOT wrap in torch.no_grad() — we need gradients!
        # YOUR CODE HERE (2 lines)
        policy_logp_w = None  # REPLACE THIS
        policy_logp_l = None  # REPLACE THIS

        # TODO: Compute log-probs under the REFERENCE model (no gradient)
        # Use torch.no_grad() — the reference model is frozen.
        # YOUR CODE HERE (3 lines: with block + 2 calls)
        ref_logp_w = None  # REPLACE THIS
        ref_logp_l = None  # REPLACE THIS

        # TODO: Compute the DPO loss using your function from Exercise 2
        # Note: the log-probs are scalars, so unsqueeze to make them [1]-shaped tensors
        # YOUR CODE HERE (1 line)
        loss = None  # REPLACE THIS

        # Backward pass and optimizer step
        loss.backward()
        optimizer.step()

        loss_val = loss.item()
        epoch_losses.append(loss_val)
        loss_history.append(loss_val)

        prompt_short = pair['prompt'].strip()[:37] + '...' if len(pair['prompt'].strip()) > 37 else pair['prompt'].strip()
        print(f"{epoch+1:<8} {step+1:<8} {loss_val:>10.4f} {prompt_short:>40}")

    avg_epoch_loss = sum(epoch_losses) / len(epoch_losses)
    print(f"  Epoch {epoch+1} average loss: {avg_epoch_loss:.4f}")

print(f"\nTraining complete. Final average loss: {sum(loss_history[-len(training_pairs):]) / len(training_pairs):.4f}")

In [None]:
# --- Visualize training loss ---

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(loss_history, 'o-', color='#a78bfa', linewidth=1.5, markersize=4, alpha=0.8)

# Add epoch boundaries
for epoch_end in range(len(training_pairs), len(loss_history), len(training_pairs)):
    ax.axvline(x=epoch_end - 0.5, color='white', linestyle='--', alpha=0.2)

# Add the initial loss reference line: log(2)
ax.axhline(y=math.log(2), color='#ef4444', linestyle='--', alpha=0.5, label=f'Initial loss = log(2) = {math.log(2):.3f}')

ax.set_xlabel('Training Step', fontsize=11)
ax.set_ylabel('DPO Loss', fontsize=11)
ax.set_title('DPO Training Loss', fontsize=12)
ax.legend(fontsize=9)
ax.grid(alpha=0.2)
plt.tight_layout()
plt.show()

print("The loss starts near log(2) ~ 0.693 (policy = reference, so neither response is favored).")
print("It decreases as the policy learns to prefer the preferred responses.")

In [None]:
# --- Compare log-probs AFTER training ---
#
# The log-ratio differences should now be POSITIVE for all pairs:
# the policy has shifted to assign relatively more probability to
# preferred responses compared to the reference.

print("Log-probabilities AFTER training:")
print(f"{'Pair':<6} {'policy_w':>12} {'policy_l':>12} {'ref_w':>12} {'ref_l':>12} {'log_ratio_diff':>16}")
print("-" * 75)

policy_model.eval()
with torch.no_grad():
    for i, pair in enumerate(training_pairs):
        pw = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['preferred']).item()
        pl = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        rw = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['preferred']).item()
        rl = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        diff = (pw - rw) - (pl - rl)
        print(f"{i+1:<6} {pw:>12.2f} {pl:>12.2f} {rw:>12.2f} {rl:>12.2f} {diff:>16.4f}")

print("\nLog-ratio differences should now be POSITIVE: the policy prefers the preferred responses.")
print(f"Compare to before training, where all differences were ~0.")

<details>
<summary>Solution</summary>

The filled-in TODO lines of the training loop look like this:

```python
# Policy log-probs (with gradient)
policy_logp_w = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['preferred'])
policy_logp_l = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['dispreferred'])

# Reference log-probs (no gradient — frozen model)
with torch.no_grad():
    ref_logp_w = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['preferred'])
    ref_logp_l = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['dispreferred'])

# DPO loss (unsqueeze scalars to [1]-shaped tensors for the batch dimension)
loss = dpo_loss(policy_logp_w.unsqueeze(0), policy_logp_l.unsqueeze(0),
                ref_logp_w.unsqueeze(0), ref_logp_l.unsqueeze(0), beta=beta)
```

**Why the reference model uses `torch.no_grad()`:** The reference model is frozen — it represents what the model knew before alignment. We never update it. Wrapping its forward pass in `torch.no_grad()` saves memory (no gradient computation graph) and ensures no gradients accidentally flow through it.

**Why `unsqueeze(0)`:** The `dpo_loss` function expects batch tensors (shape `[batch_size]`), but `get_response_log_prob` returns a scalar. `unsqueeze(0)` adds a batch dimension: scalar `x` becomes tensor `[x]` with shape `[1]`.

**Key observation about the loop:** It looks like standard supervised training — forward, loss, backward, step. The only difference is the loss function. This is the "DPO partially restores the familiar training loop shape" insight from the RLHF lesson. The complexity lives in the loss, not the loop.

</details>

**What you just observed:** DPO training looks like supervised learning. The training loop is forward, loss, backward, step — the same pattern you have used many times. The only difference is the loss function: instead of cross-entropy on a single target, DPO's loss operates on pairs of responses through the reference model.

The loss starts at `log(2)` because the policy and reference are initially identical (all log-ratios are 0, sigmoid(0) = 0.5, -log(0.5) = log(2)). As training proceeds, the loss decreases as the policy learns to assign relatively more probability to preferred responses.

The reference model never changes. It is a frozen snapshot of the initial policy. The DPO loss measures how much the policy has shifted *relative to this anchor* — this is the implicit KL penalty that prevents probability mass collapse.

---

## Exercise 4: Explore the Implicit Reward (Independent)

The derivation in the lesson showed that any policy paired with a reference implicitly defines a reward function:

$$r(y, x) = \beta \log \frac{\pi(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)$$

The $\beta \log Z(x)$ term is the same for all responses to the same prompt, so for *comparing* responses to the same prompt, you can drop it. The implicit reward (up to a per-prompt constant) is just:

$$r(y, x) - \beta \log Z(x) = \beta \log \frac{\pi(y \mid x)}{\pi_\text{ref}(y \mid x)}$$
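
The cancellation can be written out explicitly. Writing $y_w$ and $y_l$ for the preferred and dispreferred responses to the same prompt $x$ (notation assumed here to mirror the `_w` / `_l` suffixes in the code), subtracting the two rewards gives:

$$r(y_w, x) - r(y_l, x) = \beta \log \frac{\pi(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}$$

The $\beta \log Z(x)$ term appears once with each sign and drops out, which is why you never need to compute $Z(x)$ to compare responses to the same prompt. The right-hand side is the quantity inside the sigmoid of the DPO loss. Comparing rewards across *different* prompts is another matter: there the constants differ and do not cancel.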

Using the trained model from Exercise 3, you will:
1. Compute the implicit reward for the preferred and dispreferred responses in the training data
2. Verify that preferred responses have higher implicit reward
3. Compute implicit reward for NEW responses that were NOT in the training data
4. Check whether the implicit reward generalizes beyond the training set

This exercise is independent — you write all the code yourself.

<details>
<summary>Hint</summary>

The implicit reward for a response is:
```python
reward = beta * (policy_log_prob - ref_log_prob)
```

Use `get_response_log_prob()` with both the policy model and the reference model, then compute the difference. Wrap everything in `torch.no_grad()` — you are evaluating, not training.

For new responses, write responses of varying quality to the same prompts used in training. One should be clearly good (informative, accurate), one mediocre, and one poor. The implicit reward should rank them sensibly.

</details>

In [None]:
# --- Part 1: Compute implicit reward for training data ---
#
# For each training pair, compute the implicit reward for both the
# preferred and dispreferred response. Verify preferred > dispreferred.
#
# TODO: Write the code to compute and display implicit rewards for all
# training pairs. Use get_response_log_prob() with policy_model and ref_model.
# The implicit reward is: beta * (policy_log_prob - ref_log_prob)
#
# YOUR CODE HERE


In [None]:
# --- Part 2: Implicit reward for NEW responses (not in training data) ---
#
# Test whether the implicit reward generalizes. Write new responses of
# varying quality to prompts from the training data (or new prompts).
# Compute their implicit rewards and check the ranking.
#
# TODO: Create at least 3 new responses per prompt (good, mediocre, bad).
# Compute implicit rewards and display them ranked.
#
# YOUR CODE HERE


In [None]:
# --- Part 3: Visualize implicit rewards ---
#
# TODO: Create a bar chart showing implicit rewards for all responses
# (training preferred, training dispreferred, new responses).
# Color-code by quality: preferred=green, dispreferred=red, new=blue.
#
# YOUR CODE HERE


<details>
<summary>Solution</summary>

**Part 1: Implicit reward for training data**

The key insight is that the implicit reward is just `beta * log_ratio`. Preferred responses should have higher implicit reward.

```python
beta = 0.1

print("Implicit rewards for training data:")
print(f"{'Pair':<6} {'Reward (preferred)':>20} {'Reward (dispreferred)':>22} {'Preferred > Dispreferred?':>28}")
print("-" * 80)

policy_model.eval()
with torch.no_grad():
    for i, pair in enumerate(training_pairs):
        # Implicit reward = beta * (log pi(y|x) - log pi_ref(y|x))
        pw = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['preferred']).item()
        rw = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['preferred']).item()
        reward_w = beta * (pw - rw)

        pl = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        rl = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        reward_l = beta * (pl - rl)

        correct = reward_w > reward_l
        print(f"{i+1:<6} {reward_w:>20.4f} {reward_l:>22.4f} {str(correct):>28}")
```

**Part 2: New responses**

```python
new_test_cases = [
    {
        "prompt": "What is the capital of France? ",
        "responses": [
            ("Paris is the capital and largest city of France, situated along the Seine River.", "good"),
            ("The capital is Paris, I think.", "mediocre"),
            ("France has a capital city. Many people live there.", "bad"),
        ],
    },
    {
        "prompt": "Explain why the sky is blue. ",
        "responses": [
            ("Sunlight contains all colors. Earth's atmosphere scatters blue light most because its shorter wavelength interacts more with air molecules.", "good"),
            ("Something about light scattering in the atmosphere makes it blue.", "mediocre"),
            ("Nobody really knows for sure why the sky looks blue.", "bad"),
        ],
    },
]

print("\nImplicit rewards for NEW responses (not in training data):")
with torch.no_grad():
    for case in new_test_cases:
        print(f"\nPrompt: {case['prompt'].strip()}")
        for response, quality in case['responses']:
            p_logp = get_response_log_prob(policy_model, tokenizer, case['prompt'], response).item()
            r_logp = get_response_log_prob(ref_model, tokenizer, case['prompt'], response).item()
            reward = beta * (p_logp - r_logp)
            print(f"  [{quality:>8}] reward={reward:+.4f}  '{response[:60]}...'")
```

**Part 3: Visualization**

```python
fig, ax = plt.subplots(figsize=(12, 5))

labels = []
rewards = []
bar_colors = []

with torch.no_grad():
    # Training data
    for i, pair in enumerate(training_pairs[:3]):  # First 3 for readability
        pw = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['preferred']).item()
        rw = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['preferred']).item()
        reward_w = beta * (pw - rw)
        labels.append(f"P{i+1} preferred")
        rewards.append(reward_w)
        bar_colors.append('#22c55e')

        pl = get_response_log_prob(policy_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        rl = get_response_log_prob(ref_model, tokenizer, pair['prompt'], pair['dispreferred']).item()
        reward_l = beta * (pl - rl)
        labels.append(f"P{i+1} dispreferred")
        rewards.append(reward_l)
        bar_colors.append('#ef4444')

    # New responses
    for case in new_test_cases[:1]:  # First prompt for readability
        for response, quality in case['responses']:
            p_logp = get_response_log_prob(policy_model, tokenizer, case['prompt'], response).item()
            r_logp = get_response_log_prob(ref_model, tokenizer, case['prompt'], response).item()
            reward = beta * (p_logp - r_logp)
            labels.append(f"New ({quality})")
            rewards.append(reward)
            bar_colors.append('#6366f1')

ax.barh(range(len(rewards)), rewards, color=bar_colors, alpha=0.8)
ax.set_yticks(range(len(rewards)))
ax.set_yticklabels(labels, fontsize=9)
ax.set_xlabel('Implicit Reward', fontsize=11)
ax.set_title('Implicit Reward: Training Data vs New Responses', fontsize=12)
ax.axvline(x=0, color='white', linestyle='--', alpha=0.3)
ax.grid(alpha=0.2, axis='x')
plt.tight_layout()
plt.show()
```

**Why this works:** The implicit reward extracts a reward function from the policy without ever training a separate reward model. After DPO training, the policy has learned to assign relatively higher probability to preferred responses (compared to the reference). The log-ratio captures this shift. Crucially, the implicit reward is defined for any response, including ones the model was not trained on. It can generalize because training updates shared weights, which shifts the policy's distribution beyond the specific training examples.

**What if the ranking is imperfect for new responses?** With only 5 training pairs and a small model, the implicit reward's generalization is limited. With more training data and a larger model, the ranking quality improves substantially. The key insight is structural: the implicit reward *exists* and *generalizes to some degree* — it is not just memorization of the training pairs.
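
One way to put a number on "imperfect" is to count how many response pairs the implicit reward orders the same way as your quality labels. This is a minimal sketch: `pairwise_ranking_accuracy` and `QUALITY_RANK` are hypothetical helpers, not part of the notebook, and the rewards in the usage lines are placeholders rather than measured output.

```python
from itertools import combinations

# Assumed ordering of the quality labels used in new_test_cases.
QUALITY_RANK = {"bad": 0, "mediocre": 1, "good": 2}


def pairwise_ranking_accuracy(scored):
    """Fraction of response pairs whose reward order matches the label order.

    `scored` is a list of (reward, quality) tuples for ONE prompt. Rewards
    from different prompts are not comparable, because the dropped
    per-prompt constant differs between prompts.
    """
    agree, total = 0, 0
    for (r_a, q_a), (r_b, q_b) in combinations(scored, 2):
        rank_gap = QUALITY_RANK[q_a] - QUALITY_RANK[q_b]
        if rank_gap == 0:
            continue  # same label: no expected order to check
        total += 1
        # Same sign means the reward agrees with the label; ties count as misses.
        if (r_a - r_b) * rank_gap > 0:
            agree += 1
    return agree / total if total else float("nan")


# Placeholder rewards (NOT real results): "mediocre" outranks "good" here.
example = [(0.12, "good"), (0.15, "mediocre"), (-0.05, "bad")]
accuracy = pairwise_ranking_accuracy(example)
print(f"pairwise accuracy: {accuracy:.2f}")  # 0.67 for these placeholder values
```

To use it on your own results, collect `(reward, quality)` tuples inside the Part 2 loop, one list per prompt.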

</details>

**What you just explored:** After DPO training, the policy implicitly defines a reward function through its log-ratio with the reference. Preferred responses have higher implicit reward than dispreferred ones — this confirms the training worked. More importantly, the implicit reward applies to new responses that were never in the training data.

This is the deepest insight of DPO: the reward model was always inside the policy. DPO does not eliminate the reward model — it *absorbs* it into the policy. The log-ratio between policy and reference IS the reward function. Any policy paired with a reference defines a reward, and you can extract it at any time by computing how much the policy has shifted from its starting point.

---

## Key Takeaways

1. **The DPO loss is simple to compute but each step maps to the derivation.** Log-ratios (implicit KL), difference (comparing preferred vs dispreferred), scale by beta (control conservatism), negative log-sigmoid (convert to loss). Five lines of code, each justified by the math.

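2. **Beta controls conservatism.** Beta scales the log-ratio difference before the sigmoid, so a higher beta makes the loss more sensitive to the same shift away from the reference: a small shift is enough to drive the loss down, and the policy stays closer to the reference. A lower beta requires larger shifts for the same loss reduction, allowing more aggressive deviation. The choice of beta is the practical knob for the KL-reward tradeoff.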

3. **DPO's gradient focuses on the hard cases.** The gradient is strongest when the model disagrees with human preferences and near zero when it already agrees. Learning effort concentrates where it matters most — you confirmed this with autograd.

4. **DPO training looks like supervised learning.** The training loop is forward, loss, backward, step — the familiar pattern. The complexity is in the loss function, not the loop. The reference model is a frozen snapshot that never updates.

5. **The implicit reward generalizes beyond the training data.** After training, the log-ratio between policy and reference defines a reward function that applies to ANY response, not just the ones in the training set. The reward model was always inside the policy — DPO makes this explicit.
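
Takeaways 2 and 3 can be sanity-checked with plain arithmetic, no model required. In this sketch, `margin` is a name introduced here for the log-ratio difference (preferred log-ratio minus dispreferred log-ratio), the helper functions are hypothetical, and the margins and betas are arbitrary illustrative values.

```python
import math


def dpo_loss_for_margin(margin, beta):
    """-log(sigmoid(beta * margin)), computed stably as softplus(-beta * margin)."""
    z = beta * margin
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)


def grad_weight(margin, beta):
    """Magnitude of d(loss)/d(margin), which equals beta * sigmoid(-beta * margin)."""
    return beta / (1.0 + math.exp(beta * margin))


print(f"{'margin':>7} {'beta':>5} {'loss':>8} {'|grad|':>8}")
for margin in (-5.0, 0.0, 5.0):      # policy disagrees / is neutral / agrees
    for beta in (0.1, 0.5):
        loss = dpo_loss_for_margin(margin, beta)
        weight = grad_weight(margin, beta)
        print(f"{margin:>7.1f} {beta:>5.1f} {loss:>8.4f} {weight:>8.4f}")
```

For a fixed beta, the gradient weight is largest at negative margins, where the policy ranks the pair the wrong way, and shrinks as the margin grows. Every row with `margin = 0` shows a loss of `log(2)`, whatever the beta.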