# Direct Preference Optimization (DPO) Training on Google Colab

This notebook trains a DPO model on the Anthropic HH dataset using free Colab GPUs.

**Steps:**
1. Check the GPU runtime
2. Upload or clone your code
3. Install dependencies and run sanity checks
4. Train the SFT baseline
5. Train the DPO model
6. Evaluate and download results

**Runtime:** Make sure to use a **GPU runtime** (Runtime → Change runtime type → GPU)

## 1. Setup Environment

In [None]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  No GPU detected. Go to Runtime → Change runtime type → GPU")

## 2. Upload Your Code

**Option A: Upload ZIP file**
- Compress your `dpo/` folder into `dpo.zip`
- Upload using the cell below

**Option B: Clone from GitHub** (if you've pushed to GitHub)
- Uncomment and run the git clone cell

In [None]:
# Option A: Upload ZIP file
from google.colab import files
import zipfile
import os

print("Upload your dpo.zip file...")
uploaded = files.upload()

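# Extract so the project ends up at /content/dpo
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            # If the archive already has a top-level dpo/ folder, extract to
            # /content so the project does not land in /content/dpo/dpo
            names = [n for n in zip_ref.namelist() if not n.startswith('__MACOSX/')]
            has_top_level = all(n.startswith('dpo/') for n in names)
            zip_ref.extractall('/content' if has_top_level else '/content/dpo')
        print(f"✓ Extracted {filename}")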

# Change to project directory
%cd /content/dpo

# Verify we're in the right place
!ls -la

In [None]:
# Option B: Clone from GitHub
# Uncomment the lines below and replace YOUR_USERNAME with your GitHub username
# !git clone https://github.com/YOUR_USERNAME/dpo.git /content/dpo
# %cd /content/dpo

# Verify we're in the right place
# !ls -la

## 3. Install Dependencies

In [None]:
# Install requirements
!pip install -q -r requirements.txt

# Add project to Python path
import sys
sys.path.insert(0, '/content/dpo')

# Verify installation
import transformers
import datasets
print(f"✓ transformers {transformers.__version__}")
print(f"✓ datasets {datasets.__version__}")
print("✓ Python path configured")
print("\n✓ All dependencies installed!")

## 4. Run Sanity Checks (Optional)

In [None]:
# Ensure we're in the right directory and Python path is set
import sys
import os

os.chdir('/content/dpo')
if '/content/dpo' not in sys.path:
    sys.path.insert(0, '/content/dpo')

# Run sanity checks to verify everything works
!python tests/test_sanity.py

## 5. Configure Training

Adjust these settings based on your Colab GPU:
- **T4 (free)**: Use configs as-is or reduce batch size to 2
- **V100/A100 (Pro)**: Can increase batch sizes

In [None]:
# Training configuration
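USE_DEBUG_MODE = True  # Set to False for full training
# Optional sample cap outside debug mode (debug mode already uses a small subset).
# Note: this value is not passed to the training commands below. Set it to None
# for full training, or wire it into your config if your scripts support a limit.
NUM_TRAIN_SAMPLES = 1000 if not USE_DEBUG_MODE else None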

# Paths
SFT_OUTPUT = "/content/outputs/sft"
DPO_OUTPUT = "/content/outputs/dpo"

print(f"Debug mode: {USE_DEBUG_MODE}")
print(f"SFT output: {SFT_OUTPUT}")
print(f"DPO output: {DPO_OUTPUT}")

## 6. Train SFT Baseline

First, we train a supervised fine-tuned model on the chosen responses.
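
In the Anthropic HH dataset, each record has a `chosen` and a `rejected` field, and each field holds a full dialogue that ends with the final assistant turn. The sketch below shows one way the chosen text could be split into a prompt and a target response for SFT, assuming the usual `\n\nHuman:` / `\n\nAssistant:` turn markers. The helper name `split_prompt_and_response` is hypothetical, and the project's own data loader may handle this differently.

```python
from typing import Tuple

ASSISTANT_MARKER = "\n\nAssistant:"


def split_prompt_and_response(dialogue: str) -> Tuple[str, str]:
    """Split an HH-style dialogue at its last assistant turn.

    Returns (prompt, response). The prompt keeps the trailing assistant
    marker so that the model learns to continue from it.
    """
    idx = dialogue.rfind(ASSISTANT_MARKER)
    if idx == -1:
        raise ValueError("dialogue has no assistant turn")
    cut = idx + len(ASSISTANT_MARKER)
    return dialogue[:cut], dialogue[cut:]


# Illustrative record in the HH format (made up, not taken from the dataset).
example = "\n\nHuman: What is 2 + 2?\n\nAssistant: 2 + 2 equals 4."
prompt, response = split_prompt_and_response(example)
print(repr(prompt))
print(repr(response))
```

Splitting at the last marker keeps any earlier turns of a multi-turn dialogue inside the prompt, so only the final reply is treated as the training target.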

In [None]:
# Ensure correct directory
import os
os.chdir('/content/dpo')

# Train SFT model
config = "configs/debug.yaml" if USE_DEBUG_MODE else "configs/sft.yaml"
debug_flag = "--debug" if USE_DEBUG_MODE else ""

!python scripts/train_sft.py \
    --config {config} \
    --output_dir {SFT_OUTPUT} \
    {debug_flag}

## 7. Train DPO Model

Now we train DPO using the SFT model as the starting point.
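
For orientation, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python for readability. It assumes you already have the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (typically the SFT model). The function name, the argument names, and the default `beta=0.1` are illustrative assumptions and may not match what `scripts/train_dpo.py` does; a real implementation would work on batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed response log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# At initialization the policy equals the reference, so the loss is log(2).
print(f"{dpo_loss(-10.0, -12.0, -10.0, -12.0):.4f}")  # 0.6931

# The loss drops once the policy favors the chosen response more than the reference does.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2))  # True
```

A loss that starts near 0.693 and then decreases is therefore a reasonable first sanity signal when watching the training logs.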

In [None]:
# Ensure correct directory
import os
os.chdir('/content/dpo')

# Train DPO model
config = "configs/debug.yaml" if USE_DEBUG_MODE else "configs/dpo.yaml"
debug_flag = "--debug" if USE_DEBUG_MODE else ""
sft_path = f"{SFT_OUTPUT}/final"

!python scripts/train_dpo.py \
    --config {config} \
    --sft_model_path {sft_path} \
    --output_dir {DPO_OUTPUT} \
    {debug_flag}

## 8. Evaluate Models

Compare the base model, SFT, and DPO models.
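
One metric often used for this kind of comparison is preference accuracy: the fraction of held-out pairs where a model assigns a higher log-probability to the chosen response than to the rejected one. The sketch below is an assumption about what such a metric could look like, not a description of what `scripts/evaluate.py` reports. The function name and the numbers are made up for illustration.

```python
from typing import Sequence


def preference_accuracy(chosen_logps: Sequence[float],
                        rejected_logps: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response scores above the rejected one."""
    if len(chosen_logps) != len(rejected_logps):
        raise ValueError("both sequences must have the same length")
    if len(chosen_logps) == 0:
        raise ValueError("need at least one pair")
    wins = sum(c > r for c, r in zip(chosen_logps, rejected_logps))
    return wins / len(chosen_logps)


# Placeholder log-probs for four pairs (not real results).
chosen = [-12.0, -8.5, -20.0, -15.0]
rejected = [-14.0, -8.0, -25.0, -15.5]
print(preference_accuracy(chosen, rejected))  # 0.75
```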

In [None]:
# Ensure correct directory
import os
os.chdir('/content/dpo')

# Evaluate all models
!python scripts/evaluate.py \
    --reference_model gpt2 \
    --sft_model {SFT_OUTPUT}/final \
    --dpo_model {DPO_OUTPUT}/final \
    --output_file /content/outputs/results.json \
    --num_samples 500 \
    --num_generation_samples 5

## 9. View Results

In [None]:
# Display evaluation results
import json
import pandas as pd

with open('/content/outputs/results.json', 'r') as f:
    results = json.load(f)

# Convert to DataFrame for nice display
df = pd.DataFrame(results).T
print("\n" + "="*80)
print("EVALUATION RESULTS")
print("="*80)
print(df.to_string())

# Display sample generations
print("\n" + "="*80)
print("SAMPLE GENERATIONS")
print("="*80)

with open('/content/outputs/generation_samples.json', 'r') as f:
    samples = json.load(f)

for i, sample in enumerate(samples[:3]):
    print(f"\nExample {i+1}:")
    print(f"Prompt: {sample['prompt'][:100]}...")
    print(f"\nReference: {sample.get('reference', 'N/A')[:200]}...")
    print(f"\nSFT: {sample.get('sft', 'N/A')[:200]}...")
    print(f"\nDPO: {sample.get('dpo', 'N/A')[:200]}...")
    print("-" * 80)

## 10. Download Trained Models

Download your trained models to use locally or share.

In [None]:
# Zip the outputs
!zip -r /content/dpo_models.zip /content/outputs/

# Download
from google.colab import files
files.download('/content/dpo_models.zip')

print("✓ Models packaged and downloaded!")

## 11. Monitor Training (Optional)

To enable Weights & Biases logging, uncomment the cell below and update your config:

In [None]:
# Setup Weights & Biases (optional)
# !pip install -q wandb
# import wandb
# wandb.login()

# Then modify your config to enable wandb:
# logging:
#   use_wandb: true
#   wandb_project: "dpo-colab"

## Tips for Colab Training

### GPU Runtime
- **Free tier**: typically a T4 GPU (16 GB), with sessions limited to roughly 12 hours
- **Colab Pro**: Better GPUs (V100/A100), longer sessions

### Avoid Disconnects
```javascript
// Run this in the browser console to keep the session alive
function KeepAlive() {
    document.querySelector("colab-connect-button").click();
}
setInterval(KeepAlive, 60000);
```

### Save Checkpoints to Google Drive
```python
from google.colab import drive
drive.mount('/content/drive')
# Then set output_dir to /content/drive/MyDrive/dpo_outputs
```
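
If you would rather keep training on local disk and copy results to Drive after each stage, a small helper can do the copy. This is a sketch using only the standard library. The name `backup_outputs` is hypothetical, and the paths in the usage comment assume the `/content/outputs` layout used in this notebook.

```python
import shutil
from pathlib import Path


def backup_outputs(src: str, dst: str) -> Path:
    """Copy the directory tree at src into dst, overwriting existing files."""
    dst_path = Path(dst)
    # dirs_exist_ok=True (Python 3.8+) lets repeated backups reuse the same target.
    shutil.copytree(src, dst_path, dirs_exist_ok=True)
    return dst_path


# In Colab, after drive.mount('/content/drive'):
# backup_outputs("/content/outputs", "/content/drive/MyDrive/dpo_outputs")
```

Running it after the SFT stage and again after the DPO stage means a disconnect costs at most one stage of work.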

### Reduce Memory Usage
If you hit OOM errors:
- Reduce `per_device_batch_size` to 2 or 1
- Increase `gradient_accumulation_steps` to maintain effective batch size
- Reduce `max_length` to 256 or 128
- Keep gradient checkpointing enabled (it is on by default)
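
The effective batch size is the per-device batch size times the number of accumulation steps (times the number of devices). The sketch below computes how many accumulation steps are needed to keep a target effective batch size after shrinking the per-device batch. The function name is hypothetical and the target of 8 is only an example value, not taken from the project's configs.

```python
import math


def accumulation_steps_for(effective_batch_size: int,
                           per_device_batch_size: int,
                           num_devices: int = 1) -> int:
    """Smallest accumulation step count that reaches the target effective batch size."""
    if min(effective_batch_size, per_device_batch_size, num_devices) < 1:
        raise ValueError("all arguments must be positive")
    samples_per_step = per_device_batch_size * num_devices
    return math.ceil(effective_batch_size / samples_per_step)


# Example: a target effective batch of 8 when only 2 samples fit per step.
print(accumulation_steps_for(effective_batch_size=8, per_device_batch_size=2))  # 4
```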

### Speed Up Training
- Use a smaller model: `model_name_or_path: "gpt2"` (124M) instead of larger variants
- Reduce the dataset size: add the `--debug` flag or limit `num_samples`
- Use fewer epochs: set `num_epochs: 1`

### Expected Training Times (T4 GPU)
- **Debug mode** (~100 samples): 5-10 minutes per model
- **Small training** (~1000 samples): 30-60 minutes per model
- **Full training** (~160k samples): 8-12 hours for SFT + DPO
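
To check whether a planned run fits within a session, you can extrapolate from the step time you observe in the first few logged steps. This is a rough sketch under stated assumptions: the function name is hypothetical, and the 2.0 seconds per step below is a placeholder to be replaced with your own measurement, not a benchmark.

```python
import math


def estimate_hours(num_samples: int, num_epochs: int,
                   effective_batch_size: int, seconds_per_step: float) -> float:
    """Estimate wall-clock hours from the optimizer step count and a measured step time."""
    steps_per_epoch = math.ceil(num_samples / effective_batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps * seconds_per_step / 3600


# 1000 samples, 1 epoch, effective batch 8, placeholder 2.0 s per step.
hours = estimate_hours(1000, 1, 8, seconds_per_step=2.0)
print(f"{hours:.2f} hours")  # 0.07 hours
```

The estimate ignores evaluation passes, checkpoint saving, and model loading, so treat it as a lower bound.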

### Troubleshooting

**"No module named 'src'"**
- Make sure you're in the `/content/dpo` directory
- Run: `%cd /content/dpo`

**"CUDA out of memory"**
- Reduce batch size: `--batch_size 2`
- Use debug config: `--config configs/debug.yaml`

**"Session crashed"**
- The run likely exceeded the free tier's session-time or memory limits
- Reduce the dataset size or use debug mode
- Consider Colab Pro for longer sessions