# Direct Preference Optimization: Instruction Tuning an LLM Using Preference Data
* Author: mark@datarobot.com
* Date: 2026-02-10

## Summary

This notebook outlines how to use preference data to instruction-tune a model. It takes a dataset of prompts, preferred responses, and rejected responses and uses it to update a model with Direct Preference Optimization, all in a single session without leaving the DataRobot platform.

1. Download the preference data from DataRobot Registry
2. Train a model using Direct Preference Optimization (DPO)
3. Upload the new model weights to the DataRobot custom model workshop, ready to register and then deploy with governance and monitoring.

DPO reformulates the Reinforcement Learning from Human Feedback (RLHF) objective as a classification problem on preference pairs:

* Takes pairs of responses (preferred vs. rejected) for the same prompt
* Directly optimizes the language model to increase the likelihood of preferred responses relative to rejected ones
* Uses a simple binary cross-entropy loss function
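To make the classification view concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes the sequence log-probabilities under the policy and under a frozen reference model are already available; the function name dpo_loss and the example numbers are illustrative, and beta=0.1 is an assumed value rather than a setting taken from this notebook.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit reward of each response: how much more likely the policy
    # makes it than the frozen reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp

    # Logit of the "chosen beats rejected" classifier.
    logit = beta * (chosen_reward - rejected_reward)

    # Binary cross-entropy with label 1, i.e. -log(sigmoid(logit)),
    # computed as a numerically stable softplus(-logit).
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# Hypothetical log-probabilities, for illustration only.
untrained = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)   # policy favors the chosen response
print(untrained, improved)
```

When the policy and reference agree, the logit is zero and the loss equals log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.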



## Setup

### Import libraries

This accelerator uses the following libraries:

datasets (Hugging Face)
A library for easily accessing, processing, and sharing datasets. load_dataset fetches datasets from Hugging Face Hub or local files.

trl (Transformer Reinforcement Learning)
A library for training language models with reinforcement learning techniques. DPOConfig and DPOTrainer implement Direct Preference Optimization - a method to align LLMs with human preferences without explicit reward modeling.

transformers (Hugging Face)
The core library for working with pre-trained transformer models. AutoModelForCausalLM loads text generation models, and AutoTokenizer loads the corresponding tokenizer to convert text to/from tokens.

datarobot
Enterprise MLOps platform SDK for building, deploying, and managing machine learning models. Provides programmatic access to DataRobot's AutoML and deployment features.

torch (PyTorch)
Deep learning framework providing tensor computation and automatic differentiation. The backbone for training and running neural networks.

os
Python standard library for interacting with the operating system - file paths, environment variables, directory operations, etc.

accelerate (Hugging Face)
Simplifies running PyTorch code across different hardware setups (multi-GPU, TPU, mixed precision). notebook_launcher specifically enables launching distributed training directly from Jupyter notebooks.

In [None]:
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
import datarobot as dr
import requests
import torch
import os

from accelerate import notebook_launcher


### Bind variables

In [None]:
# DataRobot connection settings
# These variables can also be fetched from a secret store or config files

# Uncomment to specify credentials manually
# DATAROBOT_ENDPOINT = "https://app.eu.datarobot.com/api/v2"
# The URL varies with your hosting option; the example above is for the DataRobot EU Managed AI Cloud

# DATAROBOT_API_TOKEN = "<INSERT YOUR DataRobot API Token>"
# The API token can be found by clicking the avatar icon and then </> Developer Tools

# Dataset settings
# DATASET_ID = "<DATASET_ID>"
DATASET_NAME = "preference_training_dataset.jsonl"
DATASET_FORMAT = "json"

# Model settings
MODEL_ID = "Qwen/Qwen2-0.5B-Instruct"
TMP_DIR = "/tmp/qwen2-0.5b-dpo"
OUTPUT_DIR = "/home/notebooks/storage/qwen2-0.5b-dpo-final"
LOGGING_DIR = "/tmp/qwen2-0.5b-dpo-logs"
SHARD_WRAP_CLASS = "Qwen2DecoderLayer"

# Training settings
NUM_PROCESSES = 4  # Number of GPUs to use

# Custom model workshop settings
RUNTIME_ID = "662d6a54ef58f64c5a07d122"
CUSTOM_MODEL_NAME = "DPO_Trained_Model"
MAX_WAIT = 6000
 

### Connect to DataRobot

You can read more about different options for [connecting to DataRobot from the client](https://docs.datarobot.com/en/docs/api/api-quickstart/api-qs.html).

In [None]:
dr.Client()

# If running outside of the DataRobot platform, pass the credentials explicitly instead of relying on environment variables
# dr.Client(token=DATAROBOT_API_TOKEN, endpoint=DATAROBOT_ENDPOINT)

## Download Data

This section supports downloading a previously created dataset from the DataRobot Registry. This example uses a public dataset hosted on S3 instead. To download from the DataRobot Registry (for example, a dataset generated from a deployment), uncomment the corresponding function call.

In [None]:
def download_registry_file(dataset_id, local_path):
    """Download a dataset from the DataRobot registry.
    
    Args:
        dataset_id: The ID of the dataset in DataRobot registry
        local_path: The local path where the file will be saved
    
    Returns:
        The path to the downloaded file
    
    Raises:
        Exception: If the download fails
    """
    try:
        # Retrieve the dataset object from the registry
        dataset = dr.Dataset.get(dataset_id)

        # Download the file
        print(f"Downloading {dataset.name}...")
        dataset.get_file(local_path)
        print(f"File saved to: {local_path}")
        return local_path
    except Exception as e:
        print(f"Error downloading dataset: {e}")
        raise

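def download_s3_example_file(local_path):
    """Download the public example preference dataset from S3.

    Args:
        local_path: The local path where the file will be saved

    Returns:
        The path to the downloaded file

    Raises:
        requests.exceptions.RequestException: If the download fails
    """
    url = 'https://s3.us-east-1.amazonaws.com/datarobot_public_datasets/ai_accelerators/preference_training.jsonl'
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Check if the download was successful

        # Stream the response to disk in chunks
        with open(local_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"File successfully downloaded and saved to {local_path}")
        return local_path

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        raise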

In [None]:
# Uncomment to download the preference dataset from the registry
# download_registry_file(DATASET_ID, DATASET_NAME)

# Download from S3
download_s3_example_file(DATASET_NAME)


### Validate Dataset

Ensure the dataset has the required columns for DPO training.
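As an illustration of the row shape this check expects, the sketch below builds one preference record and serializes it as a single JSON Lines row. The record content is hypothetical and is not taken from the example dataset; the field names follow the prompt, chosen, rejected format used by the training script.

```python
import json

# Hypothetical record, for illustration only; not taken from the example dataset.
record = {
    "prompt": "Summarize: The meeting moved from Tuesday to Thursday.",
    "chosen": "The meeting is now on Thursday instead of Tuesday.",
    "rejected": "There is a meeting.",
}

# JSON Lines stores one JSON object per line.
line = json.dumps(record)
parsed = json.loads(line)

# Same column check the validation performs: chosen and rejected must be present.
required_cols = {"chosen", "rejected"}
missing = required_cols - set(parsed)
print(missing)
```

An empty set of missing columns means the record would pass the column check.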

In [None]:
def validate_preference_dataset(file_path):
    """Validate that the dataset has required columns for DPO training.
    
    Args:
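        file_path: Path to the dataset file (JSON Lines in this example)
    
    Returns:
        The loaded dataset if valid
    
    Raises:
        ValueError: If required columns are missing
    """
    # Use the same loader format as the training script (DATASET_FORMAT is "json")
    dataset = load_dataset(DATASET_FORMAT, data_files=file_path)["train"]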
    
    required_cols = {"chosen", "rejected"}
    actual_cols = set(dataset.column_names)
    
    if not required_cols.issubset(actual_cols):
        missing = required_cols - actual_cols
        raise ValueError(
            f"Dataset is missing required columns: {missing}. "
            f"Found columns: {actual_cols}"
        )
    
    print(f"Dataset validated successfully!")
    print(f"Number of examples: {len(dataset)}")
    print(f"Columns: {dataset.column_names}")
    
    return dataset

In [None]:
# Validate the downloaded dataset
dataset = validate_preference_dataset(DATASET_NAME)

# Preview a sample
print("\nSample from dataset:")
print(dataset[0])

## Fine-Tuning with DPO

This section uses trl and Hugging Face accelerate to perform Direct Preference Optimization (https://arxiv.org/abs/2305.18290).

This example is designed to run on four A10 GPUs.

Key features:

* Distributed Training: Designed to work with Accelerate's notebook_launcher for multi-GPU training
* Memory Optimization: Implements FSDP (Fully Sharded Data Parallel) with parameter offloading to handle large models efficiently
* Mixed Precision: Uses bfloat16 for faster training and reduced memory footprint
* Gradient Checkpointing: Trades computation for memory to enable training larger models
* Monitoring: Integrated TensorBoard logging for tracking training metrics

In a terminal session, TensorBoard can be launched on host 0.0.0.0; if the appropriate ports are exposed, it can then be viewed to track learning progress.
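As a rough sketch of that TensorBoard invocation, the helper below assembles the command line as an argument list. The function name build_tensorboard_command is hypothetical, the log directory is the LOGGING_DIR value from the settings cell, and port 6006 is TensorBoard's default, assumed here to be the port that is exposed.

```python
import shlex


def build_tensorboard_command(logging_dir, host="0.0.0.0", port=6006):
    """Return the TensorBoard CLI invocation as an argument list."""
    return [
        "tensorboard",
        "--logdir", logging_dir,  # directory where the trainer writes event files
        "--host", host,           # 0.0.0.0 listens on all network interfaces
        "--port", str(port),      # 6006 is TensorBoard's default port
    ]


command = build_tensorboard_command("/tmp/qwen2-0.5b-dpo-logs")
print(shlex.join(command))
```

The printed line can be pasted into a terminal session, or the list can be passed to subprocess.Popen to run TensorBoard in the background.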

This leverages the Hugging Face accelerate framework.

In [None]:
# Create output directories if they don't exist
os.makedirs(TMP_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(LOGGING_DIR, exist_ok=True)

In [None]:
train_function = f"""
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

import torch

def dpo_train():
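    # 1. Load Dataset (Format: prompt, chosen, rejected)
    # split="train" returns a Dataset rather than a DatasetDict, which is what the trainer expects
    dataset = load_dataset("{DATASET_FORMAT}", data_files="{DATASET_NAME}", split="train")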

    # 2. Load Model & Tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        "{MODEL_ID}", 
        torch_dtype=torch.bfloat16,
        trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained("{MODEL_ID}")
    tokenizer.pad_token = tokenizer.eos_token

    # 3. DPO Configuration
    training_args = DPOConfig(
        output_dir="{TMP_DIR}",
        per_device_train_batch_size=4, # Increase this if VRAM allows
        gradient_accumulation_steps=4,
        learning_rate=5e-7,
        lr_scheduler_type="cosine",
        logging_steps=1,
        max_steps=50,
        bf16=True,
        fsdp="full_shard auto_wrap",
        fsdp_config={{
            "transformer_layer_cls_to_wrap": "{SHARD_WRAP_CLASS}",
            "fsdp_state_dict_type": "FULL_STATE_DICT",
            "fsdp_offload_params": True,               # Offload parameters to CPU to reduce GPU memory use
        }},
        gradient_checkpointing=True,
        remove_unused_columns=False,
        logging_dir="{LOGGING_DIR}",          # Where TensorBoard events will be saved
        report_to=["tensorboard"],       # Enables TensorBoard logging
        logging_first_step=True,
        # Turn this OFF to avoid the ValueError during the run
        save_only_model=False,
        save_strategy="no",
    
    )

    # 4. Initialize Trainer
    trainer = DPOTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        processing_class=tokenizer,
    )

    # 5. Train the model
    print("Starting DPO training...")
    trainer.train()
    print("Training completed!")

    # 6. Save the model
    save_trained_model(trainer, tokenizer, "{OUTPUT_DIR}")
    """

In [None]:
save_function = '''
def save_trained_model(trainer, tokenizer, output_dir):
    """Save the trained model handling FSDP distributed training.
    
    Args:
        trainer: The DPOTrainer instance
        tokenizer: The tokenizer to save
        output_dir: Directory to save the model
    """
    # Wait for all processes to catch up
    trainer.accelerator.wait_for_everyone()

    if trainer.is_fsdp_enabled:
        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
        from torch.distributed.fsdp import StateDictType, FullStateDictConfig

        # Set FSDP to gather weights for a single file
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

        # Configure FSDP to output a full state dict (not shards)
        save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
        
        with FSDP.state_dict_type(trainer.model, StateDictType.FULL_STATE_DICT, save_policy):
            cpu_state_dict = trainer.model.state_dict()

        # Save only on Rank 0
        if trainer.accelerator.is_main_process:
            # Save the model with the consolidated state dict
            trainer.model.save_pretrained(
                output_dir,
                state_dict=cpu_state_dict,
                safe_serialization=True
            )
            # Save the tokenizer
            tokenizer.save_pretrained(output_dir)
            print(f"Model and tokenizer saved to {output_dir}")
    else:
        # If not using FSDP, standard save
        if trainer.accelerator.is_main_process:
            trainer.save_model(output_dir)
            tokenizer.save_pretrained(output_dir)
            print(f"Model and tokenizer saved to {output_dir}")

    # Final sync
    trainer.accelerator.wait_for_everyone()

def main():
    dpo_train()

if __name__ == "__main__":
    main()
'''
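The batch settings in the training script combine with the number of processes to determine how many preference pairs each optimizer step sees. The arithmetic below is a small sketch using the values from this notebook (per-device batch size 4, gradient accumulation 4, four processes, 50 steps); the helper name effective_batch_size is hypothetical.

```python
def effective_batch_size(per_device_batch_size, grad_accum_steps, num_processes):
    """Preference pairs consumed per optimizer step across all processes."""
    return per_device_batch_size * grad_accum_steps * num_processes


# Values from the DPOConfig in the training script and NUM_PROCESSES.
pairs_per_step = effective_batch_size(
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_processes=4,
)
total_pairs = pairs_per_step * 50  # max_steps=50
print(pairs_per_step, total_pairs)
```

With these settings each optimizer step covers 64 pairs, so the 50-step run processes 3,200 pairs in total. Changing NUM_PROCESSES or the per-device batch size changes the effective batch size, which may call for revisiting the learning rate.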

### Launch Training

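Launch the distributed training across multiple GPUs using Hugging Face accelerate. The cell in this section writes the combined training and save code to train_dpo.py so that it can be run as a standalone script.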

In [None]:

# Launch the training job
print(f"Launching DPO training on {NUM_PROCESSES} GPUs...")
print(f"Model: {MODEL_ID}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Logging directory: {LOGGING_DIR}")

train_filepath = "train_dpo.py"

try:
    with open(train_filepath, "w") as f:
        f.write(train_function + "\n\n" + save_function)
    print(f"Successfully wrote string to {train_filepath}")
except IOError as e:
    print(f"An error occurred: {e}")
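The cell above only writes train_dpo.py; the step that starts the run is not shown. One possible way to start it, sketched here as an assumption rather than as this notebook's own method, is the accelerate command-line launcher with one process per GPU. The helper names build_launch_command and launch_training are hypothetical.

```python
import shlex
import subprocess


def build_launch_command(script_path, num_processes):
    """Return the accelerate CLI invocation for a multi-process run."""
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),  # one process per GPU
        script_path,
    ]


def launch_training(script_path, num_processes):
    """Run the training script and raise CalledProcessError if it fails."""
    subprocess.run(build_launch_command(script_path, num_processes), check=True)


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(build_launch_command("train_dpo.py", 4)))
```

Calling launch_training("train_dpo.py", NUM_PROCESSES) would block until training finishes, so the model verification step that follows only runs against a completed save.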


### Verify Saved Model

Verify that the model was saved correctly by checking the output directory.

In [None]:
def verify_saved_model(output_dir):
    """Verify that the model was saved correctly.
    
    Args:
        output_dir: Directory where the model was saved
    
    Returns:
        True if verification passes, False otherwise
    """
    required_files = [
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
    ]
    
    # Check for model weights (either safetensors or pytorch format)
    model_files = [
        "model.safetensors",
        "pytorch_model.bin",
        "model.safetensors.index.json",
        "pytorch_model.bin.index.json"
    ]
    
    print(f"Checking saved model in {output_dir}...")
    print(f"\nFiles found:")
    
    found_files = os.listdir(output_dir)
    for f in found_files:
        file_path = os.path.join(output_dir, f)
        size = os.path.getsize(file_path)
        print(f"  - {f} ({size / 1024 / 1024:.2f} MB)")
    
    # Check required files
    missing_required = [f for f in required_files if f not in found_files]
    if missing_required:
        print(f"\nWarning: Missing required files: {missing_required}")
        return False
    
    # Check for at least one model file
    has_model_file = any(f in found_files for f in model_files) or \
                     any(f.startswith("model-") and f.endswith(".safetensors") for f in found_files)
    
    if not has_model_file:
        print(f"\nWarning: No model weights file found")
        return False
    
    print(f"\nModel verification passed!")
    return True

In [None]:
# Verify the saved model
verify_saved_model(OUTPUT_DIR)

### Upload to DataRobot Workshop

For use with the vLLM image, an engine configuration file must tell the server to load the saved model files.

Note: tokenizer JSON generated by newer versions of transformers is incompatible with serving images that use older versions. This can be fixed by renaming the "extra_special_tokens" field to "additional_special_tokens" in tokenizer_config.json; to do so, uncomment and run the following command.


In [None]:
# The tokenizer files are saved in OUTPUT_DIR, so the command targets that folder
# !sed -i 's/"extra_special_tokens"/"additional_special_tokens"/g' {OUTPUT_DIR}/tokenizer_config.json
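The same rename can also be done in Python, which avoids depending on sed being available. This is a minimal sketch: the function name rename_special_tokens_field is hypothetical, and unlike the sed command it renames only the top-level key rather than every occurrence of the string in the file.

```python
import json
import os


def rename_special_tokens_field(model_dir):
    """Rename "extra_special_tokens" to "additional_special_tokens" in place.

    Returns True if the file was changed, False if the field was absent.
    """
    config_path = os.path.join(model_dir, "tokenizer_config.json")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    if "extra_special_tokens" not in config:
        return False

    # Rebuild the dict so the renamed key keeps its original position.
    renamed = {
        ("additional_special_tokens" if key == "extra_special_tokens" else key): value
        for key, value in config.items()
    }
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(renamed, f, indent=2, ensure_ascii=False)
    return True
```

Calling rename_special_tokens_field(OUTPUT_DIR) after training applies the fix to the saved tokenizer; running it a second time is a no-op and returns False.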

In [None]:
engine_json = f"""
{{
  "args": [
    "--model", "/opt/code", "--served-model-name", "{MODEL_ID}"
  ]
}}
"""

engine_json_path = os.path.join(OUTPUT_DIR, "engine_config.json")

try:
    with open(engine_json_path, "w") as f:
        f.write(engine_json)
    print(f"Successfully wrote string to {engine_json_path}")
except IOError as e:
    print(f"An error occurred: {e}")
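As an alternative to the hand-written f-string, the engine configuration can be built as a dictionary and serialized with json.dumps, which handles quoting and guarantees well-formed JSON. This is a sketch only; the function name build_engine_config is hypothetical, and the argument values mirror the ones used above.

```python
import json


def build_engine_config(served_model_name, model_path="/opt/code"):
    """Return the engine configuration as a JSON string."""
    config = {
        "args": [
            "--model", model_path,                     # where the model files are mounted
            "--served-model-name", served_model_name,  # name exposed by the server
        ]
    }
    return json.dumps(config, indent=2)


engine_config = build_engine_config("Qwen/Qwen2-0.5B-Instruct")
print(engine_config)
```

The returned string can be written to engine_config.json in place of the f-string version.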

In [None]:
def upload_to_custom_workshop(model_name, local_folder_path, runtime_id, max_wait=MAX_WAIT):
    """Upload a model to DataRobot Custom Model Workshop.
    
    Args:
        model_name: Name for the custom model in DataRobot
        local_folder_path: Path to the folder containing model files
        runtime_id: The runtime environment ID to use
        max_wait: Maximum wait time in seconds for API calls
    
    Returns:
        The created CustomModelVersion object
    
    Raises:
        Exception: If upload fails
    """
    try:
        print(f"Creating custom model: {model_name}")
        
        # 1. Create the Custom Model shell
        custom_model = dr.CustomInferenceModel.create(
            name=model_name,
            target_type=dr.TARGET_TYPE.TEXT_GENERATION,
            target_name="promptText"
        )
        print(f"Custom model created with ID: {custom_model.id}")

        # 2. Upload files and create a version
        print(f"Uploading model files from {local_folder_path}...")
        print("This may take several minutes depending on model size.")
        
        model_version = dr.CustomModelVersion.create_clean(
            custom_model_id=custom_model.id,
            base_environment_id=runtime_id,
            folder_path=local_folder_path,
            max_wait=max_wait
        )

        print(f"\nUpload successful!")
        print(f"Custom Model ID: {custom_model.id}")
        print(f"Model Version ID: {model_version.id}")
        print(f"\nNext steps:")
        print(f"1. Go to DataRobot Custom Model Workshop")
        print(f"2. Find model: {model_name}")
        print(f"3. Register and deploy the model")
        
        return model_version
        
    except Exception as e:
        print(f"Error uploading to Custom Model Workshop: {e}")
        raise

In [None]:
# Upload the fine-tuned model to DataRobot
model_version = upload_to_custom_workshop(
    model_name=CUSTOM_MODEL_NAME,
    local_folder_path=OUTPUT_DIR,
    runtime_id=RUNTIME_ID
)

## Appendix: Utility Functions

Additional helper functions for common tasks.

In [None]:
def list_available_runtimes():
    """List available runtime environments in DataRobot.
    
    Useful for finding the correct RUNTIME_ID for your model.
    """
    try:
        environments = dr.ExecutionEnvironment.list()
        print("Available Runtime Environments:")
        print("-" * 80)
        for env in environments:
            print(f"ID: {env.id}")
            print(f"Name: {env.name}")
            print(f"Description: {env.description}")
            print("-" * 80)
        return environments
    except Exception as e:
        print(f"Error listing environments: {e}")
        return None

In [None]:
def list_custom_models():
    """List existing custom models in DataRobot.
    
    Useful for checking existing models before creating new ones.
    """
    try:
        models = dr.CustomInferenceModel.list()
        print("Existing Custom Models:")
        print("-" * 80)
        for model in models:
            print(f"ID: {model.id}")
            print(f"Name: {model.name}")
            print(f"Target Type: {model.target_type}")
            print(f"Created: {model.created_at}")
            print(f"Updated: {model.updated_at}")
            print("-" * 80)
        return models
    except Exception as e:
        print(f"Error listing models: {e}")
        return None

In [None]:
def delete_custom_model(model_id):
    """Delete a custom model from DataRobot.
    
    Args:
        model_id: The ID of the custom model to delete
    
    Warning: This action cannot be undone!
    """
    try:
        model = dr.CustomInferenceModel.get(model_id)
        model_name = model.name
        
        # Confirm deletion
        confirm = input(f"Are you sure you want to delete '{model_name}'? (yes/no): ")
        if confirm.lower() == 'yes':
            model.delete()
            print(f"Model '{model_name}' deleted successfully.")
        else:
            print("Deletion cancelled.")
    except Exception as e:
        print(f"Error deleting model: {e}")