# Day 18: Direct Preference Optimization (DPO) Implementation

In this notebook, we'll implement Direct Preference Optimization (DPO), a more efficient alternative to RLHF for aligning language models with human preferences. We'll focus on:

1. Understanding the DPO algorithm
2. Preparing preference data
3. Implementing the DPO loss function
4. Training a model with DPO
5. Comparing the DPO-trained model with its reference model, and DPO with RLHF

## Overview

DPO simplifies the RLHF pipeline by eliminating the need for a separate reward model and the complex RL optimization step. Instead, it directly optimizes a policy to align with human preferences using a simple classification-like objective.

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
from tqdm import tqdm
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 1. Understanding the DPO Algorithm

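DPO is based on the insight that the optimal policy for the KL-constrained RLHF objective can be written in closed form in terms of the reference policy (SFT model) and the reward function, where $x$ is the prompt and $y$ is a response:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

Here $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ is a normalizing constant that depends only on the prompt. DPO rearranges this to express the reward function in terms of the optimal policy and reference policy:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Under the Bradley-Terry preference model, $p(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)$, only reward differences matter, so the intractable $\beta \log Z(x)$ term cancels. Substituting the expression for $r$ and maximizing the likelihood of the observed preferences gives a loss function that directly optimizes the policy to match human preferences: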

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$

Where:
- $(x, y_w, y_l)$ is a preference pair with prompt $x$, preferred response $y_w$, and less preferred response $y_l$
- $\pi_\theta$ is the policy being trained
- $\pi_{\text{ref}}$ is the reference policy (SFT model)
- $\beta$ is a hyperparameter controlling how strongly the policy is kept close to the reference policy (a larger $\beta$ corresponds to a stronger implicit KL penalty)
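
To make the objective concrete, here is a minimal numeric sketch of the loss for a single preference pair. It assumes the four summed sequence log-probabilities are already known; the function name `dpo_pair_loss` and the example values are illustrative, not taken from a trained model.

```python
import math

def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities (floats)."""
    chosen_log_ratio = policy_chosen - ref_chosen
    rejected_log_ratio = policy_rejected - ref_rejected
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is 0 and the loss is log(2)
loss_at_init = dpo_pair_loss(-40.0, -25.0, -40.0, -25.0)

# Policy has raised the chosen response by 5 nats and lowered the rejected one by 5 nats
loss_after = dpo_pair_loss(-35.0, -30.0, -40.0, -25.0)

print(round(loss_at_init, 4), round(loss_after, 4))  # 0.6931 0.3133
```

Because both models start out identical in this notebook, every pair begins at a loss of $\log 2 \approx 0.693$; the loss falls only as the policy's log-ratio for the chosen response pulls ahead of the rejected one.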

## 2. Loading Models

We'll start by loading a pre-trained language model to serve as our reference model (which would typically be an SFT model in a real pipeline).

In [None]:
# Load a small pre-trained model
model_name = "gpt2"  # Using a small model for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_name)
reference_model = AutoModelForCausalLM.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    reference_model.config.pad_token_id = reference_model.config.eos_token_id

# Create a copy of the reference model to be optimized with DPO
policy_model = AutoModelForCausalLM.from_pretrained(model_name)
policy_model.config.pad_token_id = reference_model.config.pad_token_id

# Move models to device
reference_model = reference_model.to(device)
policy_model = policy_model.to(device)

print(f"Models loaded: {model_name}")
print(f"Number of parameters: {reference_model.num_parameters():,}")
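
The reference model must stay fixed while the policy is trained. The cell above gets two independent models by loading the checkpoint twice; an alternative is to deep-copy the policy and freeze the copy. The sketch below shows that pattern on a tiny `nn.Linear` stand-in rather than GPT-2, and `make_frozen_reference` is a hypothetical helper introduced only for illustration.

```python
import copy
import torch
import torch.nn as nn

def make_frozen_reference(policy):
    """Deep-copy a module, put the copy in eval mode, and disable its gradients."""
    reference = copy.deepcopy(policy)
    reference.eval()
    for param in reference.parameters():
        param.requires_grad_(False)
    return reference

policy = nn.Linear(4, 2)  # stand-in for a language model
reference = make_frozen_reference(policy)

# Simulate a training update on the policy; the reference must not change
with torch.no_grad():
    policy.weight.add_(1.0)

trainable_ref_params = sum(p.requires_grad for p in reference.parameters())
print(trainable_ref_params, torch.equal(policy.weight, reference.weight))
```

Freezing the reference also keeps the optimizer and autograd from doing unnecessary work on parameters that are never updated.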

## 3. Preparing Preference Data

DPO requires preference data in the form of (prompt, chosen_response, rejected_response) triples. Let's create a synthetic dataset for demonstration purposes.

In [None]:
# Create a synthetic preference dataset
preference_data = [
    {
        "prompt": "Explain the concept of machine learning.",
        "chosen": "Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve their performance on a task without being explicitly programmed. It works by identifying patterns in data and using those patterns to make predictions or decisions.",
        "rejected": "Machine learning is when computers do stuff with data."
    },
    {
        "prompt": "Write a poem about the ocean.",
        "chosen": "Vast blue expanse beneath the sky,\nWaves dance and crash as time goes by.\nSecrets deep in waters cold,\nOcean stories, forever told.\nSunlight sparkles on the foam,\nEndless waters, the sailor's home.",
        "rejected": "Ocean big. Ocean blue. Fish swim there. The end."
    },
    {
        "prompt": "Summarize the theory of relativity.",
        "chosen": "Einstein's theory of relativity consists of two parts: special relativity and general relativity. Special relativity states that the laws of physics are the same for all non-accelerating observers and that the speed of light is constant regardless of the observer's motion. General relativity extends this to include gravity, describing it as a curvature of spacetime caused by mass and energy.",
        "rejected": "Einstein said E=mc² and stuff moves weird when it's fast or heavy."
    },
    {
        "prompt": "Provide three tips for effective time management.",
        "chosen": "1. Prioritize tasks using methods like the Eisenhower Matrix, which categorizes tasks by urgency and importance.\n2. Break large projects into smaller, manageable tasks with specific deadlines.\n3. Use the Pomodoro Technique: work in focused 25-minute intervals followed by short breaks to maintain productivity and prevent burnout.",
        "rejected": "1. Don't waste time.\n2. Do important stuff first.\n3. Use a calendar I guess."
    },
    {
        "prompt": "Explain how photosynthesis works.",
        "chosen": "Photosynthesis is the process by which plants, algae, and some bacteria convert sunlight, water, and carbon dioxide into glucose (sugar) and oxygen. The process occurs in chloroplasts, specifically in the chlorophyll-containing thylakoids. It consists of light-dependent reactions, which capture energy from sunlight to generate ATP and NADPH, and the Calvin cycle, which uses this energy to fix carbon dioxide into glucose.",
        "rejected": "Plants use sunlight to make food. They take in CO2 and release oxygen. It happens in the green parts."
    }
]

# Add more examples
additional_examples = [
    {
        "prompt": "Describe the water cycle.",
        "chosen": "The water cycle, or hydrologic cycle, is the continuous movement of water on, above, and below Earth's surface. It begins with evaporation, where water from oceans, lakes, and rivers turns into water vapor due to solar energy. This vapor rises, cools, and condenses into clouds (condensation). When the clouds become saturated, precipitation occurs as rain, snow, or hail. The water then either infiltrates the ground, becoming groundwater, or flows as surface runoff back to bodies of water, completing the cycle.",
        "rejected": "Water evaporates, forms clouds, then rains down. Then it happens again."
    },
    {
        "prompt": "Explain the concept of supply and demand.",
        "chosen": "Supply and demand is a fundamental economic principle that describes how the price of a good or service is determined in a free market. Supply represents how much of a product producers are willing to offer at different prices, while demand represents how much consumers are willing to purchase at those prices. When supply exceeds demand, prices tend to fall; when demand exceeds supply, prices tend to rise. The point where supply and demand curves intersect is called the equilibrium, representing the optimal price and quantity for the market.",
        "rejected": "Supply is what sellers have. Demand is what buyers want. When there's more demand, prices go up. When there's more supply, prices go down."
    },
    {
        "prompt": "How do airplanes fly?",
        "chosen": "Airplanes fly due to the principles of aerodynamics. The wings of an aircraft are shaped with a curved top and flatter bottom (airfoil), creating a pressure difference when air flows around them. As air moves faster over the curved top surface, it creates lower pressure compared to the higher pressure under the wing, generating lift according to Bernoulli's principle. Additionally, the angle of the wings (angle of attack) deflects air downward, creating an upward force according to Newton's third law. These forces, combined with thrust from engines to overcome drag, allow the airplane to overcome gravity and achieve flight.",
        "rejected": "Airplanes fly because their wings push air down and the engines push them forward. The shape of the wings helps them stay up."
    },
    {
        "prompt": "What causes climate change?",
        "chosen": "Climate change is primarily caused by the enhanced greenhouse effect due to human activities. When we burn fossil fuels like coal, oil, and natural gas, we release greenhouse gases, particularly carbon dioxide (CO2), into the atmosphere. These gases trap heat from the sun that would otherwise escape into space, causing global temperatures to rise. Deforestation reduces the Earth's capacity to absorb CO2, exacerbating the problem. Other contributing factors include industrial processes, agriculture (especially livestock production and rice farming, which release methane), and certain land-use changes. The scientific consensus is that these anthropogenic factors are the dominant cause of observed warming since the mid-20th century.",
        "rejected": "Climate change happens because of pollution and greenhouse gases. People burn too much fossil fuels and cut down too many trees. This makes the Earth get warmer."
    },
    {
        "prompt": "Explain the concept of artificial intelligence.",
        "chosen": "Artificial Intelligence (AI) refers to the simulation of human intelligence in machines programmed to think and learn like humans. It encompasses various techniques including machine learning, where algorithms improve through experience; deep learning, which uses neural networks with many layers; natural language processing, enabling computers to understand and generate human language; computer vision for image and video analysis; and reinforcement learning, where agents learn optimal behaviors through trial and error. AI systems can perform tasks that typically require human intelligence such as visual perception, speech recognition, decision-making, and language translation. The field ranges from narrow AI designed for specific tasks to the theoretical goal of artificial general intelligence that could potentially perform any intellectual task a human can do.",
        "rejected": "AI is when computers act smart and do things humans can do. They use algorithms and data to make decisions and solve problems. Some people think AI might take over the world someday."
    }
]

preference_data.extend(additional_examples)

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(preference_data)
print(f"Dataset size: {len(df)} examples")
df.head(3)
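
Before training on preference triples, it can help to sanity-check them. The following is a small sketch, assuming records are plain dictionaries with the same three keys used above; `validate_preference_records` and the `sample` records are hypothetical and only illustrate the checks.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_records(records):
    """Return (index, problem) tuples; an empty list means every record looks usable."""
    problems = []
    for i, rec in enumerate(records):
        # A field counts as missing if it is absent, not a string, or only whitespace
        missing = [k for k in REQUIRED_KEYS
                   if not (isinstance(rec.get(k), str) and rec[k].strip())]
        if missing:
            problems.append((i, f"missing or empty: {', '.join(missing)}"))
        elif rec["chosen"].strip() == rec["rejected"].strip():
            # Identical responses carry no preference signal
            problems.append((i, "chosen and rejected are identical"))
    return problems

sample = [
    {"prompt": "Describe the water cycle.", "chosen": "A detailed answer.", "rejected": "A vague answer."},
    {"prompt": "How do airplanes fly?", "chosen": "Same text.", "rejected": "Same text."},
    {"prompt": "What causes climate change?", "chosen": "", "rejected": "Pollution."},
]
print(validate_preference_records(sample))
```

Note that in the synthetic dataset above every chosen response is longer than its rejected counterpart, so a model could lower the loss partly by favoring length; real preference data usually needs more varied pairs.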

## 4. Implementing the DPO Loss Function

Now we'll implement the DPO loss function, which is the core of the algorithm.

In [None]:
class DPOTrainer:
    """Trainer for Direct Preference Optimization."""

    def __init__(self, policy_model, reference_model, tokenizer, beta=0.1):
        self.policy_model = policy_model
        self.reference_model = reference_model
        self.tokenizer = tokenizer
        self.beta = beta  # Larger beta keeps the policy closer to the reference

        # The reference model is never updated: freeze it and disable dropout
        self.reference_model.eval()
        for param in self.reference_model.parameters():
            param.requires_grad = False

    def get_log_probs(self, model, prompt, response):
        """Return the summed log-probability of `response` given `prompt` as a scalar tensor."""
        # Tokenize the prompt alone to find where the response starts
        prompt_tokens = self.tokenizer(prompt, return_tensors="pt").to(device)
        prompt_len = prompt_tokens.input_ids.size(1)

        # Tokenize the full sequence (prompt + response).
        # This assumes the prompt tokens form a prefix of the full tokenization.
        full_tokens = self.tokenizer(prompt + response, return_tensors="pt").to(device)
        full_ids = full_tokens.input_ids

        # Get the response part only
        response_ids = full_ids[:, prompt_len:]

        # Forward pass; the caller decides whether gradients are tracked
        logits = model(full_ids).logits

        # Logits at position t predict token t+1, so shift by one
        response_logits = logits[:, prompt_len-1:-1, :]

        # Compute log probabilities
        log_probs = F.log_softmax(response_logits, dim=-1)

        # Gather the log probs for the actual response tokens
        token_log_probs = torch.gather(log_probs, 2, response_ids.unsqueeze(-1)).squeeze(-1)

        # Return a tensor (not a Python float) so gradients can flow to the policy
        return token_log_probs.sum()

    def compute_dpo_loss(self, prompt, chosen, rejected):
        """Compute the DPO loss for a single preference pair."""
        # Get log probs from policy model (gradients tracked unless the caller disables them)
        policy_chosen_log_prob = self.get_log_probs(self.policy_model, prompt, chosen)
        policy_rejected_log_prob = self.get_log_probs(self.policy_model, prompt, rejected)

        # Get log probs from reference model; these are constants, so no gradients
        with torch.no_grad():
            ref_chosen_log_prob = self.get_log_probs(self.reference_model, prompt, chosen)
            ref_rejected_log_prob = self.get_log_probs(self.reference_model, prompt, rejected)

        # Compute log ratios
        chosen_log_ratio = policy_chosen_log_prob - ref_chosen_log_prob
        rejected_log_ratio = policy_rejected_log_prob - ref_rejected_log_prob

        # Compute DPO loss; logsigmoid is more stable than log(sigmoid(x))
        loss = -F.logsigmoid(self.beta * (chosen_log_ratio - rejected_log_ratio))

        # The loss stays a tensor for backprop; the ratios are returned as floats for logging
        return loss, chosen_log_ratio.item(), rejected_log_ratio.item()

    def train_step(self, prompt, chosen, rejected, optimizer):
        """Perform one training step."""
        # Compute loss
        loss, chosen_log_ratio, rejected_log_ratio = self.compute_dpo_loss(prompt, chosen, rejected)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        return loss.item(), chosen_log_ratio, rejected_log_ratio
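
The class above processes one preference pair at a time. A batched variant of the loss might look like the sketch below. It assumes the per-sequence log-probabilities have already been computed and stacked into 1-D tensors; `batched_dpo_loss` is a hypothetical helper, not part of the `DPOTrainer` class above, and the tensor values are made up for illustration.

```python
import torch
import torch.nn.functional as F

def batched_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Mean DPO loss and preference accuracy over a batch of summed log-probs."""
    # Implicit rewards: beta times the policy/reference log-ratio
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # logsigmoid is numerically safer than log(sigmoid(x)) for large |x|
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Fraction of pairs where the implicit reward ranks the chosen response higher
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return losses.mean(), accuracy

# Two illustrative pairs: the first is ranked correctly, the second is not
policy_chosen = torch.tensor([-35.0, -50.0], requires_grad=True)
policy_rejected = torch.tensor([-30.0, -20.0], requires_grad=True)
ref_chosen = torch.tensor([-40.0, -48.0])
ref_rejected = torch.tensor([-25.0, -22.0])

loss, acc = batched_dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(loss.item(), acc.item())
print(policy_chosen.grad, policy_rejected.grad)
```

The gradient signs show what the objective does: the loss decreases when the chosen log-probabilities go up (negative gradient) and when the rejected ones go down (positive gradient).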

## 5. Training with DPO

Now let's train our policy model using DPO.

In [None]:
# Initialize DPO trainer
dpo_trainer = DPOTrainer(
    policy_model=policy_model,
    reference_model=reference_model,
    tokenizer=tokenizer,
    beta=0.1
)

# Set up optimizer
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-5)

# Split data into train and validation sets
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index)

print(f"Training examples: {len(train_df)}")
print(f"Validation examples: {len(val_df)}")

# Training loop
num_epochs = 3
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")
    
    # Training
    policy_model.train()
    epoch_loss = 0
    
    for i, row in tqdm(train_df.iterrows(), total=len(train_df), desc="Training"):
        prompt = row["prompt"]
        chosen = row["chosen"]
        rejected = row["rejected"]
        
        loss, chosen_ratio, rejected_ratio = dpo_trainer.train_step(prompt, chosen, rejected, optimizer)
        epoch_loss += loss
    
    avg_train_loss = epoch_loss / len(train_df)
    train_losses.append(avg_train_loss)
    
    # Validation
    policy_model.eval()
    val_loss = 0
    
    for i, row in tqdm(val_df.iterrows(), total=len(val_df), desc="Validation"):
        prompt = row["prompt"]
        chosen = row["chosen"]
        rejected = row["rejected"]
        
        with torch.no_grad():
            loss, _, _ = dpo_trainer.compute_dpo_loss(prompt, chosen, rejected)
            val_loss += loss.item()
    
    avg_val_loss = val_loss / len(val_df)
    val_losses.append(avg_val_loss)
    
    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

# Plot training results
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Validation Loss')
plt.title('DPO Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
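
The loss curve is not the only useful signal. `train_step` also returns the chosen and rejected log-ratios, which the loop above does not use. A minimal sketch of two metrics built from them follows; the helper names and the example values are hypothetical.

```python
def preference_accuracy(chosen_log_ratios, rejected_log_ratios):
    """Fraction of pairs where the chosen log-ratio exceeds the rejected one."""
    pairs = list(zip(chosen_log_ratios, rejected_log_ratios))
    if not pairs:
        return 0.0
    return sum(c > r for c, r in pairs) / len(pairs)

def mean_reward_margin(chosen_log_ratios, rejected_log_ratios, beta=0.1):
    """Average implicit reward gap, beta * (chosen_log_ratio - rejected_log_ratio)."""
    pairs = list(zip(chosen_log_ratios, rejected_log_ratios))
    if not pairs:
        return 0.0
    return beta * sum(c - r for c, r in pairs) / len(pairs)

# Hypothetical per-pair values of the kind collected over one epoch
epoch_chosen = [0.8, -0.2, 1.5, 0.1]
epoch_rejected = [-0.4, 0.3, -1.0, 0.1]

print(preference_accuracy(epoch_chosen, epoch_rejected))  # 0.5
print(mean_reward_margin(epoch_chosen, epoch_rejected))
```

An accuracy that rises on training pairs but not on validation pairs would suggest the model is memorizing the (very small) training set rather than learning the preference.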

## 6. Evaluating the DPO Model

Let's compare the outputs of the reference model and the DPO-optimized model.

In [None]:
def generate_response(model, prompt, max_new_tokens=100):
    """Generate a response from the model given a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = inputs["input_ids"].size(1)
    
    with torch.no_grad():
        output = model.generate(
            **inputs,  # Passes input_ids and attention_mask
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id
        )
    
    # Slice off the prompt by token count; slicing the decoded string by
    # len(prompt) can misalign if decoding does not reproduce the prompt exactly
    response = tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True).strip()
    
    return response

# Test prompts
test_prompts = [
    "Explain the concept of neural networks.",
    "Write a short poem about technology.",
    "What are three benefits of regular exercise?"
]

# Compare reference and DPO models
for prompt in test_prompts:
    print(f"Prompt: {prompt}\n")
    
    # Generate from reference model
    ref_response = generate_response(reference_model, prompt)
    print(f"Reference Model Response:\n{ref_response}\n")
    
    # Generate from DPO model
    dpo_response = generate_response(policy_model, prompt)
    print(f"DPO Model Response:\n{dpo_response}\n")
    
    print("-" * 80)

## 7. Comparing DPO and RLHF

Let's analyze the differences between DPO and RLHF in terms of implementation complexity and performance.

### Implementation Complexity

**RLHF Pipeline:**
1. Train an SFT model
2. Generate response pairs
3. Collect human preferences
4. Train a reward model
5. Optimize policy with PPO

**DPO Pipeline:**
1. Train an SFT model
2. Collect human preferences
3. Optimize policy with DPO

DPO eliminates the need for a separate reward model and the complex PPO optimization, making it significantly simpler to implement.

### Performance Analysis

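The original DPO paper reports results comparable to or better than PPO-based RLHF on sentiment control, summarization, and single-turn dialogue, at lower computational cost, since no reward model is trained and no responses are sampled during training. The key advantages of DPO include: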

1. **Simplicity**: Fewer components and hyperparameters to tune
2. **Stability**: More stable training without the complexities of RL
3. **Efficiency**: Lower computational requirements
4. **Performance**: Comparable or better alignment with human preferences

However, RLHF may still be preferred in some cases:

1. When fine-grained control over the KL divergence is needed
2. When incorporating additional reward components beyond human preferences
3. When working with more complex preference structures
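
On the first point: in DPO, distance from the reference policy is governed only indirectly, through $\beta$. A rough way to see this is to ask how large the log-ratio gap (chosen log-ratio minus rejected log-ratio) must be before a pair's loss drops to a given value. The sketch below does that; the function names are hypothetical, and the gap is only a proxy for drift from the reference, not the KL divergence itself.

```python
import math

def dpo_loss_from_gap(gap, beta):
    """DPO loss for a log-ratio gap = chosen_log_ratio - rejected_log_ratio."""
    margin = beta * gap
    # Stable form of -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def gap_needed(target_loss, beta):
    """Log-ratio gap at which the per-pair loss equals target_loss."""
    # Solve log(1 + exp(-beta * gap)) = target_loss for gap
    return -math.log(math.expm1(target_loss)) / beta

for beta in (0.01, 0.1, 0.5, 1.0):
    gap = gap_needed(0.1, beta)
    print(f"beta={beta:<5} gap needed for loss 0.1: {gap:8.2f} nats")
```

The required gap scales as $1/\beta$: with a small $\beta$ the policy has to move much further from the reference to fit the same preference, while a large $\beta$ lets the loss saturate after a small shift.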

## 8. Conclusion

In this notebook, we've implemented Direct Preference Optimization (DPO), a more efficient alternative to RLHF for aligning language models with human preferences. We've seen how DPO simplifies the alignment pipeline by eliminating the need for a separate reward model and the complex RL optimization step.

Key takeaways:

1. DPO directly optimizes a policy to align with human preferences using a simple classification-like objective.
2. The DPO loss is derived from the same KL-constrained reward-maximization objective as RLHF and is trained on the same kind of preference data, but it avoids fitting an explicit reward model.
3. DPO is typically cheaper and easier to implement than PPO-based RLHF, and has been reported to reach comparable or better results on several tasks.
4. The choice between DPO and RLHF depends on the specific requirements of the alignment task.

As the field of language model alignment continues to evolve, we're likely to see further refinements and variations of DPO, as well as hybrid approaches that combine the strengths of different alignment techniques.