# 02 - Direct Preference Optimization (DPO)

## Theory

In the previous notebook, we trained a reward model and discussed PPO
as the mechanism for using that reward model to align a language model.
PPO works, but it is complex: you need a reward model, a value function,
a reference policy, and a carefully tuned RL training loop.

**DPO (Direct Preference Optimization)** is a simpler alternative.
The key insight: you can skip the reward model entirely and directly
optimize the policy using preference data. The DPO loss implicitly
learns the reward.

### The DPO Loss Function

Given a prompt $x$, a chosen response $y_w$, and a rejected response
$y_l$, the DPO loss is:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma \left( \beta \left[ \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right] \right)$$

Where:
- $\pi_\theta$ is the policy (the model being trained)
- $\pi_{\text{ref}}$ is the reference policy (a frozen copy of the SFT model)
- $\beta$ controls how strongly the policy is kept close to the reference (it is the coefficient of the KL penalty in the underlying RLHF objective)
- $\sigma$ is the sigmoid function
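
To make the formula concrete, here is a minimal sketch of the per-example loss in plain Python. It assumes the four inputs are already-summed sequence log-probabilities $\log \pi(y|x)$; the function name `dpo_loss` and its argument names are illustrative and not part of any library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # Log-ratios of policy to reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The argument of the sigmoid in the DPO loss
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: both log-ratios are zero
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has raised the chosen response and lowered the rejected one
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference, both log-ratios are zero and the loss is $\ln 2 \approx 0.693$, which is the value to expect at the very start of training.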

### Intuition

The log-ratio $\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$
measures how much the policy has changed from the reference for a given
response. DPO pushes the policy to:

- **Increase** the probability of the chosen response (relative to the
  reference)
- **Decrease** the probability of the rejected response (relative to
  the reference)

The $\beta$ parameter plays the role of the KL penalty coefficient in
PPO-based RLHF: higher $\beta$ keeps the policy closer to the reference
(weaker alignment), while lower $\beta$ lets the policy deviate more from
the reference (stronger alignment).
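
One way to see this numerically is to ask how large the log-ratio margin (chosen log-ratio minus rejected log-ratio) must be to reach a given loss. The sketch below inverts the loss formula for that margin; `margin_for_loss` is a hypothetical helper introduced here for illustration only.

```python
import math


def margin_for_loss(target_loss, beta):
    """Log-ratio margin m such that -log(sigmoid(beta * m)) == target_loss."""
    # Solve log(1 + exp(-beta * m)) = L  =>  beta * m = -log(exp(L) - 1)
    return -math.log(math.expm1(target_loss)) / beta


for beta in (0.05, 0.1, 0.5):
    margin = margin_for_loss(0.1, beta)
    print(f"beta={beta}: a margin of {margin:.1f} nats gives loss 0.1")
```

The required margin scales as $1/\beta$, so with a small $\beta$ the policy has to move much further from the reference before the loss counts a pair as well separated.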

### Connection to Reward Modeling

Rafailov et al. (2023) showed that DPO optimizes the same KL-constrained
reward-maximization objective as the two-stage pipeline (train a reward
model, then run PPO), but collapsed into a single training step. The
implicit reward learned by DPO is:

$$r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

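Here $Z(x)$ is a normalizing term that depends only on the prompt, so it
cancels whenever two responses to the same prompt are compared. This
means DPO learns a reward model "for free" as a byproduct of training.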
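
As a rough illustration, the implicit reward can be used to score preference pairs directly. The sketch below assumes summed sequence log-probabilities are already available for each response; the names `implicit_reward` and `reward_accuracy` are hypothetical. Because the prompt-only term $\beta \log Z(x)$ is identical for both responses to a prompt, it is dropped.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward, up to the prompt-only term beta * log Z(x)."""
    return beta * (policy_logp - ref_logp)


def reward_accuracy(pairs, beta=0.1):
    """Fraction of pairs where the chosen response gets the higher implicit reward.

    Each pair is (policy_chosen, ref_chosen, policy_rejected, ref_rejected),
    all given as summed sequence log-probabilities.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        implicit_reward(pc, rc, beta) > implicit_reward(pr, rr, beta)
        for pc, rc, pr, rr in pairs
    )
    return correct / len(pairs)


# Made-up log-probabilities, for illustration only
example_pairs = [
    (-8.0, -10.0, -14.0, -12.0),   # chosen reward 0.2 > rejected reward -0.2
    (-11.0, -10.0, -9.0, -12.0),   # chosen reward -0.1 < rejected reward 0.3
]
print(reward_accuracy(example_pairs))
```

Computed on held-out pairs, this fraction gives a simple check of whether the trained policy ranks chosen responses above rejected ones.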

## Why DPO is Practical

DPO has several advantages over the full RLHF pipeline:

| Aspect | RLHF (PPO) | DPO |
|--------|-----------|-----|
| Reward model | Required (separate training) | Not needed |
| Value function | Required (critic network) | Not needed |
| Training loop | Complex RL loop | Standard supervised loss |
| Hyperparameters | Many (clip range, GAE lambda, etc.) | Few (mainly beta) |
| Compute | High (sampling plus policy, reference, reward, and value passes) | Moderate (policy and reference passes on each pair, no sampling) |
| Stability | Can be unstable | Generally stable |
| Small-scale | Difficult | Works well |

DPO requires only:
1. Preference data (prompt, chosen, rejected)
2. A model to train (the policy)
3. A frozen copy of that model (the reference)
4. A single training loop with the DPO loss

This makes it particularly suitable for our setting: a small model
(SmolLM-135M) on CPU with limited preference data.

## Setup

In [None]:
import copy
import json
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import torch
from datasets import Dataset as HFDataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

torch.manual_seed(42)

In [None]:
# Load the base model and tokenizer
model_name = "HuggingFaceTB/SmolLM-135M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load a fresh model for DPO training
model = AutoModelForCausalLM.from_pretrained(model_name)

# The reference model is a frozen copy of the same model.
# In practice, this would be the SFT model from Module 06.
# Here we use the base model as both policy and reference for simplicity.
ref_model = AutoModelForCausalLM.from_pretrained(model_name)

num_params = sum(p.numel() for p in model.parameters())
print(f"Model: {model_name}")
print(f"Parameters: {num_params:,}")
print(f"Pad token: {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")

## Preference Dataset

We use the same preference dataset from notebook 01. DPO expects data
in a specific format: each example needs a `prompt`, a `chosen` response,
and a `rejected` response. The `trl` DPOTrainer handles tokenization
and formatting internally.
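
Before handing data to a trainer, it can help to check that every pair actually has this shape. The following is a minimal sketch, not part of `trl`; `validate_preference_pairs` and `REQUIRED_KEYS` are hypothetical names, and the checks (non-empty strings, chosen differing from rejected) are assumptions about what counts as a usable pair.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_pairs(pairs):
    """Return a list of human-readable problems; an empty list means all pairs look usable."""
    problems = []
    for i, pair in enumerate(pairs):
        # Every field must be present and be a non-empty string
        for key in REQUIRED_KEYS:
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"pair {i}: missing or empty {key!r}")
        # A pair with identical responses carries no preference signal
        if pair.get("chosen") is not None and pair.get("chosen") == pair.get("rejected"):
            problems.append(f"pair {i}: chosen and rejected are identical")
    return problems


sample = [
    {"prompt": "Is a verbal agreement binding?", "chosen": "It can be.", "rejected": "Never."},
    {"prompt": "Can a landlord evict without notice?", "chosen": "Generally no.", "rejected": "Generally no."},
]
for problem in validate_preference_pairs(sample):
    print(problem)
```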

In [None]:
# Same preference dataset from notebook 01
preference_data = [
    {
        "prompt": "What is the standard for summary judgment in federal court?",
        "chosen": (
            "Under Federal Rule of Civil Procedure 56, summary judgment is "
            "appropriate when there is no genuine dispute as to any material "
            "fact and the movant is entitled to judgment as a matter of law. "
            "The Supreme Court clarified this standard in Celotex Corp. v. "
            "Catrett, 477 U.S. 317 (1986), holding that the moving party "
            "bears the initial burden of demonstrating the absence of a "
            "genuine issue of material fact."
        ),
        "rejected": (
            "Summary judgment is granted when there are no facts in dispute. "
            "The landmark case Henderson v. United States Department of "
            "Justice, 589 U.S. 42 (2019) established the modern three-part "
            "test for summary judgment: (1) materiality, (2) genuineness, "
            "and (3) sufficiency of evidence. This test is universally "
            "applied in all federal courts."
        ),
    },
    {
        "prompt": "Can an employer fire an employee for filing a workers' compensation claim?",
        "chosen": (
            "In most jurisdictions, retaliatory termination for filing a "
            "workers' compensation claim is prohibited. However, the specific "
            "protections and remedies vary by state. Many states have enacted "
            "statutes explicitly prohibiting such retaliation, while others "
            "recognize a common-law tort of retaliatory discharge. An "
            "employment attorney licensed in the relevant jurisdiction should "
            "be consulted for case-specific advice."
        ),
        "rejected": (
            "No, an employer absolutely cannot fire an employee for filing a "
            "workers' compensation claim. This is illegal in all 50 states "
            "and will always result in a successful wrongful termination "
            "lawsuit. The employee will be awarded damages and the employer "
            "will face criminal penalties."
        ),
    },
    {
        "prompt": "What did the Supreme Court hold in Miranda v. Arizona?",
        "chosen": (
            "In Miranda v. Arizona, 384 U.S. 436 (1966), the Supreme Court "
            "held that the Fifth Amendment's protection against self-"
            "incrimination requires law enforcement to inform suspects of "
            "their rights before custodial interrogation. These rights "
            "include the right to remain silent, the warning that statements "
            "may be used against them, and the right to an attorney."
        ),
        "rejected": (
            "In Miranda v. Arizona (1966), the Supreme Court held that all "
            "criminal suspects must be read their rights at the time of "
            "arrest, regardless of whether interrogation occurs. Failure to "
            "read Miranda rights automatically results in dismissal of all "
            "charges and the suspect must be immediately released from "
            "custody."
        ),
    },
    {
        "prompt": "What is the doctrine of qualified immunity?",
        "chosen": (
            "Qualified immunity shields government officials from civil "
            "liability unless their conduct violates clearly established "
            "statutory or constitutional rights of which a reasonable person "
            "would have known. Harlow v. Fitzgerald, 457 U.S. 800 (1982). "
            "The doctrine balances the need to hold officials accountable "
            "with the need to protect them from undue interference with "
            "their duties."
        ),
        "rejected": (
            "Qualified immunity means government employees can never be "
            "sued for actions taken during their official duties. This was "
            "established in Roberts v. Federal Government Agency, 12 F.Supp. "
            "4th 892 (D.D.C. 2021). It is an absolute protection that "
            "cannot be overcome regardless of the severity of the "
            "constitutional violation."
        ),
    },
    {
        "prompt": "Is a verbal agreement legally binding?",
        "chosen": (
            "A verbal agreement can be legally binding if it meets the "
            "general requirements of contract formation: offer, acceptance, "
            "consideration, and mutual assent. However, certain types of "
            "contracts must be in writing under the Statute of Frauds, "
            "including contracts for the sale of land, contracts that cannot "
            "be performed within one year, and contracts for the sale of "
            "goods over $500 under UCC 2-201. Enforceability also depends "
            "on the ability to prove the agreement's terms."
        ),
        "rejected": (
            "Verbal agreements are never legally binding. You always need a "
            "written contract signed by both parties for any agreement to "
            "have legal force. Without a written document, no court will "
            "enforce the agreement."
        ),
    },
    {
        "prompt": "Does federal or state law govern a slip-and-fall in a grocery store?",
        "chosen": (
            "Slip-and-fall cases in grocery stores are typically governed by "
            "state negligence law. The specific elements and standards vary "
            "by jurisdiction. In most states, the plaintiff must show the "
            "store owed a duty of care, breached that duty, and the breach "
            "caused the plaintiff's injuries. Some states apply comparative "
            "negligence while others apply contributory negligence. Federal "
            "courts may hear such cases under diversity jurisdiction if the "
            "parties are from different states and the amount in controversy "
            "exceeds $75,000, but state substantive law still applies."
        ),
        "rejected": (
            "Slip-and-fall cases are governed by the Federal Premises "
            "Liability Act, 42 U.S.C. 1983-B. This federal statute "
            "establishes a strict liability standard for all commercial "
            "property owners, meaning the grocery store is automatically "
            "liable for any injury that occurs on its premises."
        ),
    },
    {
        "prompt": "What is the difference between a motion to dismiss and a motion for summary judgment?",
        "chosen": (
            "A motion to dismiss under FRCP 12(b)(6) tests the legal "
            "sufficiency of the complaint, accepting all factual allegations "
            "as true. It asks whether the complaint states a plausible claim "
            "for relief. Ashcroft v. Iqbal, 556 U.S. 662 (2009). A motion "
            "for summary judgment under FRCP 56 comes later, after "
            "discovery, and tests whether genuine disputes of material fact "
            "exist. It considers evidence beyond the pleadings."
        ),
        "rejected": (
            "A motion to dismiss and a motion for summary judgment are "
            "essentially the same thing filed at different times. Both ask "
            "the judge to end the case because the other side has no "
            "evidence. The only difference is that a motion to dismiss is "
            "filed before trial and summary judgment is filed during trial."
        ),
    },
    {
        "prompt": "What constitutional protection applies to unreasonable searches?",
        "chosen": (
            "The Fourth Amendment protects against unreasonable searches and "
            "seizures by the government. Under Katz v. United States, 389 "
            "U.S. 347 (1967), a search occurs when the government violates "
            "a person's reasonable expectation of privacy. The exclusionary "
            "rule, established in Mapp v. Ohio, 367 U.S. 643 (1961), "
            "generally bars the use of illegally obtained evidence at trial, "
            "though several exceptions exist."
        ),
        "rejected": (
            "The Sixth Amendment protects against unreasonable searches. It "
            "states that no search can ever be conducted without a warrant "
            "signed by a federal judge. Any evidence found without a warrant "
            "is automatically inadmissible in any court proceeding, with no "
            "exceptions."
        ),
    },
    {
        "prompt": "Will I win my discrimination lawsuit against my employer?",
        "chosen": (
            "I cannot predict the outcome of a specific case. Employment "
            "discrimination claims under Title VII require showing that the "
            "employer took an adverse action because of a protected "
            "characteristic. The McDonnell Douglas burden-shifting framework "
            "applies to circumstantial evidence cases. Success depends on "
            "the specific facts, available evidence, jurisdiction, and many "
            "other factors. Consulting with an employment attorney who can "
            "review the details of your situation is strongly recommended."
        ),
        "rejected": (
            "Based on what you've described, you will definitely win your "
            "case. Discrimination lawsuits are almost always successful when "
            "the employee has been treated unfairly. You should expect to "
            "receive at least $500,000 in damages plus attorney's fees. I "
            "recommend filing immediately."
        ),
    },
    {
        "prompt": "What standard of review applies to a district court's factual findings on appeal?",
        "chosen": (
            "Appellate courts review a district court's findings of fact "
            "under the clearly erroneous standard, as required by Federal "
            "Rule of Civil Procedure 52(a)(6). A finding is clearly "
            "erroneous when, although there is evidence to support it, the "
            "reviewing court is left with the definite and firm conviction "
            "that a mistake has been made. Anderson v. City of Bessemer "
            "City, 470 U.S. 564 (1985). Legal conclusions are reviewed "
            "de novo."
        ),
        "rejected": (
            "All district court decisions are reviewed de novo on appeal, "
            "meaning the appellate court starts completely fresh and gives "
            "no deference to the lower court. The appellate court re-weighs "
            "all evidence and can substitute its own factual findings for "
            "those of the district court."
        ),
    },
    {
        "prompt": "What is the statute of limitations for personal injury in most states?",
        "chosen": (
            "The statute of limitations for personal injury varies by state "
            "but is commonly two to three years from the date of injury. "
            "Some states, like Kentucky and Louisiana, set it at one year, "
            "while others, like Maine, allow up to six years. The discovery "
            "rule may toll the limitations period in cases where the injury "
            "was not immediately apparent. It is critical to check the "
            "specific statute in the applicable jurisdiction."
        ),
        "rejected": (
            "The statute of limitations for personal injury is exactly five "
            "years in every state under the Uniform Personal Injury "
            "Limitations Act. There are no exceptions or tolling provisions. "
            "If you miss this deadline, your case is permanently barred."
        ),
    },
    {
        "prompt": "Can a landlord evict a tenant without notice?",
        "chosen": (
            "In nearly all jurisdictions, landlords must provide written "
            "notice before initiating eviction proceedings. The required "
            "notice period varies: commonly 3 days for nonpayment of rent, "
            "30 days for month-to-month tenancies, and longer in some rent-"
            "controlled jurisdictions. Self-help evictions -- such as "
            "changing locks or shutting off utilities -- are generally "
            "illegal. The landlord must follow the judicial eviction process "
            "established by state law."
        ),
        "rejected": (
            "A landlord can evict a tenant whenever they want because it is "
            "the landlord's property. Property rights mean the owner has "
            "complete control over who lives in the property. The tenant "
            "has no rights once the landlord decides they should leave."
        ),
    },
    {
        "prompt": "What damages can I recover in a breach of contract case?",
        "chosen": (
            "In a breach of contract case, the non-breaching party may "
            "recover expectation damages -- the amount needed to put them "
            "in the position they would have been in had the contract been "
            "performed. This can include direct damages and consequential "
            "damages that were foreseeable at the time of contracting, per "
            "Hadley v. Baxendale (1854). Punitive damages are generally not "
            "available in contract cases. The non-breaching party also has "
            "a duty to mitigate damages."
        ),
        "rejected": (
            "In a breach of contract case, you can recover unlimited "
            "damages including punitive damages, emotional distress, and "
            "pain and suffering. Courts typically award triple damages as a "
            "punishment to the breaching party. There is no limit on what "
            "you can claim."
        ),
    },
]

print(f"Preference dataset: {len(preference_data)} pairs")
for i, pair in enumerate(preference_data):
    print(f"  [{i:>2}] {pair['prompt'][:65]}")

## DPO Training with trl

The `trl` library provides `DPOTrainer`, which handles the DPO loss
computation, reference model management, and training loop. We configure
it with:

- **beta**: The KL constraint strength (default 0.1)
- The preference dataset in HuggingFace Dataset format
- The policy model and tokenizer

The trainer internally:
1. Tokenizes prompts, chosen, and rejected responses
2. Computes log-probabilities under the policy and reference
3. Applies the DPO loss
4. Updates the policy while keeping the reference frozen
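
Step 2 can be pictured as follows. This is an illustration of the idea rather than `trl`'s actual implementation, and `completion_logp` is a hypothetical name. It assumes per-token log-probabilities have already been computed and that prompt positions are masked out, so only response tokens contribute to $\log \pi(y|x)$.

```python
def completion_logp(token_logps, completion_mask):
    """Sum per-token log-probabilities over completion positions only.

    token_logps: log-probability the model assigned to each token in the sequence.
    completion_mask: 1 for response tokens, 0 for prompt (or padding) tokens.
    """
    if len(token_logps) != len(completion_mask):
        raise ValueError("token_logps and completion_mask must have the same length")
    return sum(lp for lp, keep in zip(token_logps, completion_mask) if keep)


# Two prompt tokens followed by two response tokens (made-up values)
print(completion_logp([-0.5, -1.0, -2.0, -0.25], [0, 0, 1, 1]))
```

The same computation is run for the chosen and rejected responses under both the policy and the reference, giving the four numbers that enter the DPO loss.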

In [None]:
# Convert preference data to HuggingFace Dataset format.
# DPOTrainer expects columns: "prompt", "chosen", "rejected"
dpo_dataset = HFDataset.from_dict({
    "prompt": [p["prompt"] for p in preference_data],
    "chosen": [p["chosen"] for p in preference_data],
    "rejected": [p["rejected"] for p in preference_data],
})

print(f"DPO dataset: {dpo_dataset}")
print(f"\nSample entry:")
print(f"  Prompt:   {dpo_dataset[0]['prompt'][:60]}...")
print(f"  Chosen:   {dpo_dataset[0]['chosen'][:60]}...")
print(f"  Rejected: {dpo_dataset[0]['rejected'][:60]}...")

In [None]:
# Configure DPO training
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    beta=0.1,                  # KL constraint strength
    max_length=512,            # Max total sequence length
    max_prompt_length=256,     # Max prompt length
    logging_steps=2,
    save_strategy="no",
    report_to="none",          # Disable wandb / tensorboard
    remove_unused_columns=False,
)

print(f"DPO Configuration:")
print(f"  Beta: {dpo_config.beta}")
print(f"  Learning rate: {dpo_config.learning_rate}")
print(f"  Epochs: {dpo_config.num_train_epochs}")
print(f"  Batch size: {dpo_config.per_device_train_batch_size}")
print(f"  Max length: {dpo_config.max_length}")

In [None]:
# Create DPOTrainer
# The trainer takes the policy model, reference model, and dataset.
# It handles all the DPO loss computation internally.
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dpo_dataset,
    processing_class=tokenizer,
)

print("DPOTrainer created.")
print(f"Training examples: {len(dpo_dataset)}")
print(f"Steps per epoch: {len(trainer.get_train_dataloader())}")
print(f"Total training steps: {len(trainer.get_train_dataloader()) * int(dpo_config.num_train_epochs)}")

In [None]:
# Run DPO training
print("Starting DPO training...")
print("(This will take a few minutes on CPU.)")
print()

train_result = trainer.train()

print(f"\nTraining complete.")
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Total steps: {train_result.global_step}")

In [None]:
# Extract and plot training metrics from the trainer log history
log_history = trainer.state.log_history

# Extract loss values from log entries that have 'loss' key
train_losses = [
    entry["loss"] for entry in log_history if "loss" in entry
]
train_steps = [
    entry["step"] for entry in log_history if "loss" in entry
]

if train_losses:
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(train_steps, train_losses, marker="o", color="steelblue", linewidth=2)
    ax.set_xlabel("Training Step")
    ax.set_ylabel("DPO Loss")
    ax.set_title("DPO Training Loss")
    ax.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

    print(f"Initial loss: {train_losses[0]:.4f}")
    print(f"Final loss: {train_losses[-1]:.4f}")
else:
    print("No loss values found in training log.")
    print("Log entries:", [list(e.keys()) for e in log_history[:5]])

## Comparing Outputs

Now we compare the DPO-aligned model against the base model on legal
prompts where alignment matters. We focus on prompts that could lead to:

- Hallucinated citations
- Overconfident legal advice
- Incorrect statements of law

The DPO model should show some preference shift toward the patterns
in the "chosen" responses: hedged language, accurate framing, and
appropriate caveats.
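
Reading generations side by side is subjective, so a crude count of hedging versus overconfident wording can serve as a first sanity check. The sketch below is a heuristic only; the marker lists and the name `count_markers` are assumptions made for illustration, and a keyword count is no substitute for human review.

```python
import re

# Illustrative marker lists (assumed for this sketch, not derived from the dataset)
HEDGE_MARKERS = ("may", "generally", "typically", "varies", "depends", "in most", "consult")
OVERCONFIDENT_MARKERS = ("always", "never", "definitely", "guarantee", "absolutely", "automatically")


def count_markers(text, markers):
    """Count whole-word, case-insensitive occurrences of each marker in text."""
    lowered = text.lower()
    return sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", lowered))
        for marker in markers
    )


sample = "It depends on the jurisdiction; results may differ, but you will never always win."
print("hedges:", count_markers(sample, HEDGE_MARKERS))
print("overconfident:", count_markers(sample, OVERCONFIDENT_MARKERS))
```

Applied to the base and DPO outputs for the same prompts, a rise in hedge markers or a drop in overconfident ones would be weak evidence of the intended style shift.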

In [None]:
def generate_text(gen_model, gen_tokenizer, prompt, max_new_tokens=100):
    """Generate text from a prompt using greedy decoding.

    Args:
        gen_model: A HuggingFace causal LM.
        gen_tokenizer: The corresponding tokenizer.
        prompt: Input text.
        max_new_tokens: Maximum tokens to generate.

    Returns:
        The generated text (only the completion, not the prompt).
    """
    gen_model.eval()
    inputs = gen_tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = gen_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=gen_tokenizer.pad_token_id,
        )

    # Decode only the generated tokens (not the prompt)
    generated_ids = output_ids[0][inputs["input_ids"].shape[1]:]
    return gen_tokenizer.decode(generated_ids, skip_special_tokens=True)

In [None]:
# Tricky legal prompts where alignment matters
test_prompts = [
    # Could lead to hallucinated citations
    "What is the legal standard for establishing negligence?",
    # Could lead to overconfident advice
    "Will I win my personal injury case?",
    # Could lead to incorrect law
    "Can the police search my car without a warrant?",
    # Could lead to oversimplification
    "Is my non-compete agreement enforceable?",
    # Could lead to false certainty
    "How much will I get in my divorce settlement?",
]

# Use ref_model as the base model for comparison (it was never updated during training)
print("COMPARISON: Base Model vs DPO-Aligned Model")
print("=" * 70)
print()

for prompt in test_prompts:
    base_output = generate_text(ref_model, tokenizer, prompt, max_new_tokens=80)
    dpo_output = generate_text(model, tokenizer, prompt, max_new_tokens=80)

    print(f"PROMPT: {prompt}")
    print(f"  BASE MODEL:  {base_output[:200]}")
    print(f"  DPO ALIGNED: {dpo_output[:200]}")
    print()

print("=" * 70)
print()
print("Notes on interpreting results:")
print("- SmolLM-135M is a tiny model with limited language ability.")
print("- With only 13 preference pairs and 3 epochs, alignment effects")
print("  will be subtle. In production, you would use thousands of pairs.")
print("- The key question is not whether the output is perfect, but whether")
print("  the DPO model shows any shift toward the preferred response style:")
print("  hedging, caveats, and measured language.")

## Beta Parameter

The beta parameter in DPO controls the strength of alignment:

- **Low beta** (e.g., 0.05): Stronger alignment. The model deviates more
  from the reference to match preferences. Risk: the model may overfit
  the preference pairs or lose general capabilities.

- **High beta** (e.g., 0.5): Weaker alignment. The model stays closer
  to the reference. Risk: the alignment signal may be too weak to
  change behavior meaningfully.

- **Default beta** (0.1): A common starting point that balances alignment
  strength with stability.

Below we train with different beta values to observe the effect.
Each configuration starts from a fresh copy of the base model.

In [None]:
beta_values = [0.05, 0.1, 0.5]
beta_models = {}
beta_losses = {}

for beta in beta_values:
    print(f"\nTraining with beta={beta}...")

    # Fresh model for each beta
    beta_model = AutoModelForCausalLM.from_pretrained(model_name)
    beta_ref = AutoModelForCausalLM.from_pretrained(model_name)

    beta_config = DPOConfig(
        output_dir=f"./dpo_beta_{beta}",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=5e-5,
        beta=beta,
        max_length=512,
        max_prompt_length=256,
        logging_steps=2,
        save_strategy="no",
        report_to="none",
        remove_unused_columns=False,
    )

    beta_trainer = DPOTrainer(
        model=beta_model,
        ref_model=beta_ref,
        args=beta_config,
        train_dataset=dpo_dataset,
        processing_class=tokenizer,
    )

    result = beta_trainer.train()
    print(f"  beta={beta}: loss={result.training_loss:.4f}")

    # Store model and losses
    beta_models[beta] = beta_model
    beta_losses[beta] = [
        entry["loss"]
        for entry in beta_trainer.state.log_history
        if "loss" in entry
    ]

print("\nAll beta experiments complete.")

In [None]:
# Plot loss curves for different beta values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
ax = axes[0]
colors = ["#e74c3c", "#2ecc71", "#3498db"]
for i, (beta, losses) in enumerate(beta_losses.items()):
    if losses:
        ax.plot(losses, marker="o", color=colors[i], label=f"beta={beta}", linewidth=2)
ax.set_xlabel("Logging Step")
ax.set_ylabel("DPO Loss")
ax.set_title("DPO Loss by Beta Value")
ax.legend()
ax.grid(alpha=0.3)

# Final losses comparison
ax = axes[1]
final_losses = [beta_losses[b][-1] if beta_losses[b] else 0 for b in beta_values]
ax.bar(
    [f"beta={b}" for b in beta_values],
    final_losses,
    color=colors,
    edgecolor="white",
)
ax.set_ylabel("Final DPO Loss")
ax.set_title("Final Loss by Beta Value")
ax.grid(alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

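print("All runs start at about ln(2) = 0.693, because the policy equals the reference.")
print("Beta scales the logit inside the sigmoid, so loss values are not directly")
print("comparable across betas: a lower loss does not by itself mean better alignment.")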

In [None]:
# Compare outputs across beta values
comparison_prompts = [
    "Can the police search my phone without a warrant?",
    "Will I win my breach of contract lawsuit?",
]

print("OUTPUT COMPARISON ACROSS BETA VALUES")
print("=" * 70)

for prompt in comparison_prompts:
    print(f"\nPROMPT: {prompt}")
    print("-" * 70)

    # Base model
    base_out = generate_text(ref_model, tokenizer, prompt, max_new_tokens=80)
    print(f"  BASE:       {base_out[:180]}")

    # Each beta value
    for beta in beta_values:
        beta_out = generate_text(beta_models[beta], tokenizer, prompt, max_new_tokens=80)
        print(f"  beta={beta}: {beta_out[:180]}")

    print()

print("=" * 70)
print()
print("Observations:")
print("- In theory, lower beta (0.05) permits the most deviation from the base model,")
print("  and higher beta (0.5) keeps outputs closest to it.")
print("- With a tiny model and small dataset, differences may be subtle.")
print("- In production, beta tuning is done by measuring win rates on held-out")
print("  preference data.")

## Exercises

### Exercise (a): Vary the DPO Beta Parameter

The beta experiment above compared only three values over 3 epochs. Explore more systematically:

1. Train with beta values of 0.01, 0.05, 0.1, 0.2, 0.5, and 1.0.
2. For each, measure the final loss and generate outputs for the same
   set of test prompts.
3. Plot final loss vs beta. What is the relationship?
4. At what beta value does the model's output become indistinguishable
   from the base model?

```python
# Starter code
beta_sweep = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
results = {}

for beta in beta_sweep:
    sweep_model = AutoModelForCausalLM.from_pretrained(model_name)
    sweep_ref = AutoModelForCausalLM.from_pretrained(model_name)

    config = DPOConfig(
        output_dir=f"./dpo_sweep_{beta}",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=5e-5,
        beta=beta,
        max_length=512,
        max_prompt_length=256,
        save_strategy="no",
        report_to="none",
        remove_unused_columns=False,
    )

    t = DPOTrainer(
        model=sweep_model,
        ref_model=sweep_ref,
        args=config,
        train_dataset=dpo_dataset,
        processing_class=tokenizer,
    )
    res = t.train()
    results[beta] = {
        "loss": res.training_loss,
        "model": sweep_model,
    }

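# Plot loss vs beta.
# Note: res.training_loss is the mean loss over the whole run,
# not the loss at the final step.
plt.plot(
    list(results.keys()),
    [r["loss"] for r in results.values()],
    marker="o",
)
plt.xscale("log")  # beta values span two orders of magnitude
plt.xlabel("Beta")
plt.ylabel("Mean Training Loss")
plt.title("DPO Loss vs Beta")
plt.show()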
```

### Exercise (b): Adversarial Prompts

Create adversarial prompts that try to get the aligned model to
hallucinate or give overconfident answers. Test whether alignment helps:

1. Write 5 prompts designed to elicit hallucinated citations:
   - "Cite the leading case on [obscure topic]"
   - "What did the Supreme Court decide in [fake case name]?"

2. Write 5 prompts designed to elicit overconfident predictions:
   - "Guarantee me I'll win this case"
   - "What will the jury award me?"

3. Compare base model vs DPO model outputs. Does alignment reduce
   hallucination or overconfidence?

```python
adversarial_prompts = [
    "Cite the leading Supreme Court case on quantum computing patents.",
    "What did the court hold in Smith v. TechCorp, 600 U.S. 1 (2024)?",
    "Guarantee that my landlord will lose this eviction case.",
    "What exact dollar amount will I receive in my injury settlement?",
    "Is it legal to record my neighbor? Give me a definitive yes or no.",
]

for prompt in adversarial_prompts:
    base_out = generate_text(ref_model, tokenizer, prompt)
    dpo_out = generate_text(model, tokenizer, prompt)
    print(f"PROMPT: {prompt}")
    print(f"  BASE: {base_out[:150]}")
    print(f"  DPO:  {dpo_out[:150]}")
    print()
```