# OpenVLA DPO Training Demo - Professional Version

This notebook provides a modular implementation of Direct Preference Optimization (DPO) training for OpenVLA models. Each section can be run independently for debugging and testing purposes.

## Overview
1. **Environment Setup** - Import libraries and configure paths
2. **Configuration** - Set training parameters and model configs
3. **Model Loading** - Load policy and reference models
4. **Data Loading** - Set up datasets and data loaders
5. **DPO Training** - Main training loop
6. **Testing & Debugging** - Utilities for debugging each component


## 1. Environment Setup and Imports

Import all necessary libraries and set up the Python path for accessing local modules.
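
The setup cell builds the import path from `os.getcwd()`, so the result depends on the directory the kernel was started in. A minimal sketch of a more defensive variant is shown here; `add_repo_root` is a hypothetical helper name, not something defined in this repository. It resolves the path to an absolute one and avoids appending the same entry every time the cell is re-run.

```python
import sys
from pathlib import Path


def add_repo_root(start: Path, levels_up: int = 2) -> Path:
    """Resolve the directory `levels_up` above `start` and add it to sys.path once.

    Hypothetical helper for illustration only.
    """
    root = start.resolve()
    for _ in range(levels_up):
        root = root.parent  # Path("/").parent is "/", so this never raises
    root_str = str(root)
    if root_str not in sys.path:  # avoid growing sys.path on every re-run
        sys.path.append(root_str)
    return root


repo_root = add_repo_root(Path.cwd())
print(f"Added to Python path: {repo_root}")
```

The printed path is then absolute rather than ending in `/../..`, which makes it easier to spot a wrong working directory.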


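# TO DO
## 1. Change the mask used when computing log-probs so that the loss excludes the separator tokens between actions.
## 2. Assign a different weight to each action unit in a stream, according to spatial distance.
## 3. Support both offline and online collection of loser trajectories.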
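
Items 1 and 2 can be sketched together in plain Python. This is only an illustration under stated assumptions: the separator token id (`SEP_TOKEN_ID = 0`), the function name `masked_weighted_logprob`, and the example weights are all hypothetical, and how weights would be derived from spatial distance is left open. Weights are given per token here; a per-action-unit weight would simply be repeated across that unit's tokens.

```python
from typing import Optional, Sequence

SEP_TOKEN_ID = 0  # hypothetical id of the separator token between action units


def masked_weighted_logprob(
    token_ids: Sequence[int],
    token_logps: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    sep_token_id: int = SEP_TOKEN_ID,
) -> float:
    """Sum per-token log-probs, skipping separator tokens and applying weights."""
    if weights is None:
        weights = [1.0] * len(token_ids)
    if not (len(token_ids) == len(token_logps) == len(weights)):
        raise ValueError("token_ids, token_logps and weights must have equal length")
    total = 0.0
    for tok, logp, w in zip(token_ids, token_logps, weights):
        if tok == sep_token_id:
            continue  # separator tokens contribute nothing to the loss
        total += w * logp
    return total


ids = [11, 12, 0, 13, 14, 0]
logps = [-0.5, -1.0, -3.0, -0.25, -0.25, -3.0]
print(masked_weighted_logprob(ids, logps))  # -2.0
print(masked_weighted_logprob(ids, logps, weights=[1, 1, 1, 0.5, 0.5, 1]))  # -1.75
```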

In [1]:
%load_ext autoreload
%autoreload 2           

In [2]:
#!/usr/bin/env python3
"""
DPO Training Demo - Environment Setup
"""

import os
import sys
import argparse
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add the parent directories to Python path for imports
current_dir = os.getcwd()
parent_dir = os.path.join(current_dir, "..", "..")
sys.path.append(parent_dir)
print(f"Added to Python path: {parent_dir}")

# Core imports
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from trl.trainer.dpo_trainer import DataCollatorForPreference
import numpy as np
from tqdm import tqdm
from experiments.robot.libero.libero_utils import (
    get_libero_dummy_action,
    get_libero_env,
    get_libero_image,
    quat2axisangle,
    save_rollout_video_CoA,
)

# Local imports
try:
    from src.config import GenerateConfig
    from src.model_utils import setup_vla_model_with_lora, setup_model_and_config, setup_logging_and_environment
    from src.training_utils_prog import train_dpo, compute_log_probs, dpo_loss, grouped_dpo_loss
    from src.data_process import TrajectoryDataset
    print("✓ Successfully imported local modules")
except ImportError as e:
    print(f"✗ Failed to import local modules: {e}")
    print("Please ensure you're running from the correct directory")

# External imports  
try:
    from experiments.robot.robot_utils import get_model
    print("✓ Successfully imported external modules")
except ImportError as e:
    print(f"✗ Failed to import external modules: {e}")

# Check GPU availability
print(f"\nGPU Information:")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"    Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.1f} GB")

print("\n" + "="*50)
print("Environment setup completed!")
print("="*50)


Added to Python path: /mnt/sda/home/zijianwang/openvla/vla-scripts/DPO/../..


2025-08-30 20:02:44.233356: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-08-30 20:02:44.233389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-08-30 20:02:44.234889: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-08-30 20:02:44.243708: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


✓ Successfully imported local modules
✓ Successfully imported external modules

GPU Information:
CUDA available: True
GPU count: 4
  GPU 0: NVIDIA GeForce RTX 4090
    Memory: 25.4 GB
  GPU 1: NVIDIA GeForce RTX 4090
    Memory: 25.4 GB
  GPU 2: NVIDIA GeForce RTX 4090
    Memory: 25.4 GB
  GPU 3: NVIDIA RTX A6000
    Memory: 51.0 GB

Environment setup completed!


## 2. Configuration Setup

Configure all training parameters. You can modify these parameters easily for different experiments.
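
One way to keep experiment variants consistent is to derive them from a single base configuration instead of repeating every argument. The sketch below assumes a dataclass-style config; `DemoConfig` is a hypothetical stand-in with a handful of the fields used in this notebook, and whether `GenerateConfig` itself supports `dataclasses.replace` is not confirmed here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for GenerateConfig with a few of its fields."""
    device: str = "cuda:0"
    max_steps: int = 800
    batch_size: int = 1
    learning_rate: float = 1e-4
    dpo_beta: float = 0.5
    stream_length: int = 5


policy_cfg = DemoConfig(device="cuda:1")
# The reference config differs only in its device
ref_cfg = replace(policy_cfg, device="cuda:2")
# A small sweep over the DPO beta parameter
sweep = [replace(policy_cfg, dpo_beta=b) for b in (0.1, 0.5)]
print(ref_cfg.device, [c.dpo_beta for c in sweep])
```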


In [20]:
"""
Configuration Setup - Modify parameters here for different experiments
"""

# ====== TRAINING PARAMETERS ======
DEVICE_POLICY = "cuda:1"  # Device for policy model
DEVICE_REF = "cuda:2"     # Device for reference model
MAX_STEPS = 800           # Maximum training steps (reduced for demo)
BATCH_SIZE = 1            # Training batch size
LEARNING_RATE = 0.0001    # Learning rate
DPO_BETA = 0.5           # DPO beta parameter
STREAM_LENGTH = 5       # Stream length for trajectory processing

# ====== WANDB CONFIGURATION ======
USE_WANDB = False         # Set to True to enable Weights & Biases logging
WANDB_PROJECT = "openvla_CoA_DPO_demo"
WANDB_ENTITY = "15652388600"
RUN_ID_NOTE = "notebook_demo"

# ====== PATH CONFIGURATION ======
ROOT_DIR = "/mnt/sda/home/zijianwang"

# Optional: Override default paths (leave empty to use defaults)
PRETRAINED_CHECKPOINT = f"{ROOT_DIR}/openvla/FT_res/openvla-7b-finetuned-libero-10+libero_10_no_noops+b4+lr-0.0005+lora-r48+dropout-0.0--image_aug--2025-07-18_19-26-25"
LORA_PATH = f"{ROOT_DIR}/openvla/adapter_tmp_dir/openvla-7b-finetuned-libero-10+libero_10_no_noops+b4+lr-0.0005+lora-r48+dropout-0.0--image_aug--2025-07-18_19-26-25"
BASE_VLA_PATH = f"{ROOT_DIR}/HF_CACHE/openvla-7b-finetuned-libero-10"
WINNER_TRAJECTORY_PATH = f"{ROOT_DIR}/openvla/vla-scripts/DPO/winner_trajectory"
ADAPTER_TMP_DIR = f"{ROOT_DIR}/openvla/DPO_adapter_tmp_dir"
TASK_NUM = 1            # Set to a specific task index, or None for all tasks

# Create configuration objects
print("Creating configuration...")

# Policy model configuration
model_cfg = GenerateConfig(
    root_dir=ROOT_DIR,
    device=DEVICE_POLICY,
    max_steps=MAX_STEPS,
    batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    dpo_beta=DPO_BETA,
    stream_length=STREAM_LENGTH,
    use_wandb=USE_WANDB,
    wandb_project=WANDB_PROJECT,
    wandb_entity=WANDB_ENTITY,
    run_id_note=RUN_ID_NOTE,
    grad_accumulation_steps=1,
    pretrained_checkpoint=PRETRAINED_CHECKPOINT,
    lora_path=LORA_PATH,
    base_vla_path=BASE_VLA_PATH,
    winner_trajectory_path=WINNER_TRAJECTORY_PATH,
    adapter_tmp_dir=ADAPTER_TMP_DIR,
    task_num=TASK_NUM
)

# Reference model configuration
ref_config = GenerateConfig(
    root_dir=ROOT_DIR,
    device=DEVICE_REF,
    pretrained_checkpoint=PRETRAINED_CHECKPOINT,
    lora_path=LORA_PATH,
    base_vla_path=BASE_VLA_PATH,
    winner_trajectory_path=WINNER_TRAJECTORY_PATH,
    adapter_tmp_dir=ADAPTER_TMP_DIR
)

print("\n" + "="*50)
print("CONFIGURATION SUMMARY")
print("="*50)
print(f"Policy Device: {model_cfg.device}")
print(f"Reference Device: {ref_config.device}")
print(f"Max Steps: {model_cfg.max_steps}")
print(f"Batch Size: {model_cfg.batch_size}")
print(f"Learning Rate: {model_cfg.learning_rate}")
print(f"DPO Beta: {model_cfg.dpo_beta}")
print(f"Stream Length: {model_cfg.stream_length}")
print(f"Use WandB: {model_cfg.use_wandb}")
print(f"Task Number: {model_cfg.task_num if model_cfg.task_num is not None else 'All tasks'}")  # task 0 is a valid index
print("\nPath Configuration:")
print(f"Root Dir: {model_cfg.root_dir}")
print(f"Pretrained Checkpoint: {model_cfg.pretrained_checkpoint}")
print(f"LoRA Path: {model_cfg.lora_path}")
print(f"Winner Trajectory Path: {model_cfg.winner_trajectory_path}")
print(f"Adapter Tmp Dir: {model_cfg.adapter_tmp_dir}")
print("="*50)


Creating configuration...

CONFIGURATION SUMMARY
Policy Device: cuda:1
Reference Device: cuda:2
Max Steps: 800
Batch Size: 1
Learning Rate: 0.0001
DPO Beta: 0.5
Stream Length: 5
Use WandB: False
Task Number: 1

Path Configuration:
Root Dir: /mnt/sda/home/zijianwang
Pretrained Checkpoint: /mnt/sda/home/zijianwang/openvla/FT_res/openvla-7b-finetuned-libero-10+libero_10_no_noops+b4+lr-0.0005+lora-r48+dropout-0.0--image_aug--2025-07-18_19-26-25
LoRA Path: /mnt/sda/home/zijianwang/openvla/adapter_tmp_dir/openvla-7b-finetuned-libero-10+libero_10_no_noops+b4+lr-0.0005+lora-r48+dropout-0.0--image_aug--2025-07-18_19-26-25
Winner Trajectory Path: /mnt/sda/home/zijianwang/openvla/vla-scripts/DPO/winner_trajectory
Adapter Tmp Dir: /mnt/sda/home/zijianwang/openvla/DPO_adapter_tmp_dir


## 3. Model Loading

Load the policy model (with LoRA) and reference model. This section handles device placement and model initialization.


In [5]:
"""
Model Loading Section
"""

print("Starting model loading...")
print("This may take several minutes depending on model size and device speed.")
print("\n" + "-"*30)

# Load policy model with LoRA
print("[1/2] Loading policy model (with LoRA)...")
print(f"Target device: {model_cfg.device}")

try:
    policy_model = setup_vla_model_with_lora(model_cfg)
    print(f"✓ Policy model loaded successfully")
    print(f"Model device: {next(policy_model.parameters()).device}")
    print(f"Model dtype: {next(policy_model.parameters()).dtype}")
    
    # Count parameters
    total_params = sum(p.numel() for p in policy_model.parameters())
    trainable_params = sum(p.numel() for p in policy_model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"Trainable ratio: {100 * trainable_params / total_params:.2f}%")
    
except Exception as e:
    print(f"✗ Failed to load policy model: {e}")
    raise

print("\n" + "-"*30)

# Load reference model
print("[2/2] Loading reference model...")
print(f"Target device: {ref_config.device}")

try:
    ref_model = setup_vla_model_with_lora(ref_config)
    print(f"✓ Reference model loaded successfully")
    print(f"Model device: {next(ref_model.parameters()).device}")
    print(f"Model dtype: {next(ref_model.parameters()).dtype}")
    
    # Set reference model to eval mode and freeze parameters
    ref_model.eval()
    for param in ref_model.parameters():
        param.requires_grad = False
    print("✓ Reference model set to eval mode and frozen")
    
except Exception as e:
    print(f"✗ Failed to load reference model: {e}")
    raise

print("\n" + "="*50)
print("MODEL LOADING SUMMARY")
print("="*50)
print(f"Policy Model Device: {next(policy_model.parameters()).device}")
print(f"Reference Model Device: {next(ref_model.parameters()).device}")
# Note: the two lines below count trainable parameter tensors, not individual weights
print(f"Policy Model Trainable: {sum(p.requires_grad for p in policy_model.parameters())} params")
print(f"Reference Model Trainable: {sum(p.requires_grad for p in ref_model.parameters())} params")
print("Models loaded successfully!")
print("="*50)

# Optional: Clear cache to free up memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU cache cleared.")


Starting model loading...
This may take several minutes depending on model size and device speed.

------------------------------
[1/2] Loading policy model (with LoRA)...
Target device: cuda:1
[*] Instantiating Pretrained VLA model
[*] Loading in BF16 with Flash-Attention Enabled


Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 10.71it/s]


✓ Policy model loaded successfully
Model device: cuda:1
Model dtype: torch.bfloat16
Total parameters: 7,707,479,616
Trainable parameters: 166,242,432
Trainable ratio: 2.16%

------------------------------
[2/2] Loading reference model...
Target device: cuda:2
[*] Instantiating Pretrained VLA model
[*] Loading in BF16 with Flash-Attention Enabled


Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 10.66it/s]


✓ Reference model loaded successfully
Model device: cuda:2
Model dtype: torch.bfloat16
✓ Reference model set to eval mode and frozen

MODEL LOADING SUMMARY
Policy Model Device: cuda:1
Reference Model Device: cuda:2
Policy Model Trainable: 878 params
Reference Model Trainable: 0 params
Models loaded successfully!
GPU cache cleared.
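
As a rough check on device placement, the figures printed above can be turned into a weight-memory estimate. This arithmetic covers model weights only and ignores activations, gradients, and optimizer state, so it is a lower bound rather than a measurement.

```python
TOTAL_PARAMS = 7_707_479_616    # "Total parameters" from the output above
TRAINABLE_PARAMS = 166_242_432  # "Trainable parameters" from the output above
BYTES_PER_BF16 = 2              # both models report torch.bfloat16

weights_gb = TOTAL_PARAMS * BYTES_PER_BF16 / 1e9
trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS
print(f"bf16 weights: {weights_gb:.1f} GB")       # bf16 weights: 15.4 GB
print(f"trainable ratio: {trainable_pct:.2f}%")   # trainable ratio: 2.16%
```

Two copies of the weights come to about 30.8 GB, which exceeds the 25.4 GB reported for each RTX 4090. This is consistent with placing the policy and reference models on separate devices.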


In [6]:
processor, log_file, task_suite, num_tasks_in_suite, resize_size = setup_logging_and_environment(model_cfg, policy_model)

Logging to local log file: ./experiments/logs/DPO-libero_10-openvla-2025_08_30-20_03_07--notebook_demo.txt
[info] using task orders [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Task suite: libero_10


In [7]:
task = task_suite.get_task(model_cfg.task_num)
env, task_description = get_libero_env(task, model_cfg.model_family, resolution=256)



In [8]:
print(task_description)

put both the cream cheese box and the butter in the basket


## 4. Data Loading

Set up the trajectory dataset and the data loader used for training.


In [22]:
def setup_data_loader(
    cfg, processor, model, env, task_suite, resize_size,
    human_prompt_template="What action should the robot take to {lang}?",
):
    """Set up the training data loader."""
    print("[*] Setting up dataset and data loader...")
    
    # Create dataset instance
    dataset = TrajectoryDataset(
        cfg, 
        cfg.winner_trajectory_path, 
        cfg.task_suite_name, 
        processor, 
        env, 
        task_suite,
        device=cfg.device, 
        model=model, 
        img_size=resize_size,
        stream_length=cfg.stream_length,
        task_num=cfg.task_num,
        if_fixed_stream_length=True,
        human_prompt_template=human_prompt_template
    )
    
    # Create data collator
    data_collator = DataCollatorForPreference(pad_token_id=processor.tokenizer.pad_token_id)
    
    # Create data loader
    train_dataloader = DataLoader(
        dataset,
        batch_size=cfg.batch_size,
        shuffle=True,
        collate_fn=data_collator
    )
    
    print(f"Dataset created with {len(dataset)} trajectory pairs")
    return train_dataloader

# human_prompt_template = "What action should the robot take to {lang}?"
human_prompt_template = "What sequence of actions should the robot take to {lang}?"

train_dataloader = setup_data_loader(model_cfg, processor, policy_model, env, task_suite, resize_size, human_prompt_template)

[*] Setting up dataset and data loader...
Found 100 success trajectories
Task distribution: [('7', 10), ('2', 10), ('1', 10), ('6', 10), ('5', 10), ('3', 10), ('9', 10), ('8', 10), ('0', 10), ('4', 10)]
Dataset created with 10 trajectory pairs


In [24]:
print("[*] Verifying data loader setup...")
test_batch = next(iter(train_dataloader))
print(f"Batch keys: {test_batch.keys()}")
print(f"Chosen input shape: {test_batch['chosen_input_ids'].shape}")
print(f"Pixel values shape: {test_batch['pixel_values'].shape}")

[*] Verifying data loader setup...
[info] using task orders [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The if_fixed_stream_length is True, the stream_length is **5**
**Prompt**: In: What sequence of actions should the robot take to put both the cream cheese box and the butter in the basket??
Out:
Batch keys: dict_keys(['prompt_input_ids', 'prompt_attention_mask', 'chosen_input_ids', 'chosen_attention_mask', 'rejected_input_ids', 'rejected_attention_mask', 'pixel_values', 'distance', 'start_idx'])
Chosen input shape: torch.Size([1, 40])
Pixel values shape: torch.Size([1, 6, 224, 224])
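
The prompt printed above ends in `??`. The cause lies in code not shown in this notebook, so the following is only a sketch of how a prompt could be assembled so that exactly one question mark survives. `build_prompt` is a hypothetical helper; the real prompt is assembled inside `TrajectoryDataset`, and the `In: ... Out:` wrapper is copied from the printed output.

```python
def build_prompt(template: str, task_description: str) -> str:
    """Fill `{lang}` and wrap the question in the "In: ... Out:" format.

    Hypothetical helper for illustration only.
    """
    lang = task_description.strip().rstrip("?.")
    # Normalize to exactly one trailing question mark
    question = template.format(lang=lang).rstrip("?") + "?"
    return f"In: {question}\nOut:"


template = "What sequence of actions should the robot take to {lang}?"
prompt = build_prompt(
    template, "put both the cream cheese box and the butter in the basket"
)
print(prompt)
```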


## 5. DPO Training

Run the main DPO training loop.


In [8]:
model_cfg.use_wandb = True
print(model_cfg.use_wandb)
try:
    final_adapter_dir = train_dpo(
        model=policy_model, 
        ref_model=ref_model, 
        train_dataloader=train_dataloader, 
        cfg=model_cfg, 
        if_not_demo=model_cfg.use_wandb
    )
    
    print(f"[*] Training completed successfully!")
    print(f"[*] Final adapter saved to: {final_adapter_dir}")
    
except KeyboardInterrupt:
    print("\n[*] Training interrupted by user")
    
except Exception as e:
    print(f"[*] Training failed with error: {e}")
    raise

True


[34m[1mwandb[0m: Currently logged in as: [33m15652388600[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Policy model device: cuda:0
Reference model device: cuda:1


  0%|          | 0/100 [00:00<?, ?it/s]

************Begin to train************
[info] using task orders [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The if_fixed_stream_length is True, the stream_length is **20**
-----This is the 0th batch----
Current action stream length: 20
     Policy chosen prediction differs from true at token position: 0
Group_accuracy: tensor([0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1.], device='cuda:0')
Batch 0: First incorrect group at position 0
Batch 0: Using DPO loss from group 0: 0.6914 + SFT loss: 11.4396


  1%|          | 1/100 [00:25<41:53, 25.39s/it]

Saved adapter to /mnt/sda/home/zijianwang/openvla/DPO_adapter_tmp_dir/openvla-7b+libero_10_no_noops+task1+b1+lr-0.0001+lora-r48+dropout-0.0--2025-08-28_17-35-44--notebook_demo/ckpt-0, batch_idx: 0
[info] using task orders [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The if_fixed_stream_length is True, the stream_length is **20**
-----This is the 1th batch----
Current action stream length: 20
     Policy chosen prediction differs from true at token position: 0


  2%|▏         | 2/100 [00:56<46:37, 28.55s/it]

Group_accuracy: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1.], device='cuda:0')
Batch 1: All groups correct (accuracy=1), using only SFT loss: 12.3955
[info] using task orders [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The if_fixed_stream_length is True, the stream_length is **20**
-----This is the 2th batch----
Current action stream length: 20
     Policy chosen prediction differs from true at token position: 0
Group_accuracy: tensor([0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1.], device='cuda:0')
Batch 2: First incorrect group at position 0
Batch 2: Using DPO loss from group 0: 6.0000 + SFT loss: 12.5192


  3%|▎         | 3/100 [01:20<43:06, 26.67s/it]

[info] using task orders [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The if_fixed_stream_length is True, the stream_length is **20**
-----This is the 3th batch----
Current action stream length: 20
     Policy chosen prediction differs from true at token position: 0
Group_accuracy: tensor([0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1.], device='cuda:0')
Batch 3: First incorrect group at position 0
Batch 3: Using DPO loss from group 0: 4.5000 + SFT loss: 7.8637


  4%|▍         | 4/100 [01:48<43:31, 27.20s/it]

[info] using task orders [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The if_fixed_stream_length is True, the stream_length is **20**


                                               


[*] Training interrupted by user
Error in callback <bound method _WandbInit._post_run_cell_hook of <wandb.sdk.wandb_init._WandbInit object at 0x7fc5082ab550>> (for post_run_cell), with arguments args (<ExecutionResult object at 7fc5082aa8c0, execution_count=8 error_before_exec=None error_in_exec=None info=<ExecutionInfo object at 7fc5082a9120, raw_cell="model_cfg.use_wandb = True
print(model_cfg.use_wan.." store_history=True silent=False shell_futures=True cell_id=vscode-notebook-cell://ssh-remote%2Bxuchang-lab0/mnt/sda/home/zijianwang/openvla/vla-scripts/DPO/dpo_demo_pro.ipynb#X20sdnNjb2RlLXJlbW90ZQ%3D%3D> result=None>,),kwargs {}:


BrokenPipeError: [Errno 32] Broken pipe
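
The first DPO loss in the log above (0.6914) is close to ln 2 ≈ 0.6931. That is the value the standard DPO objective takes when policy and reference assign the same log-probabilities, as would be expected if both start from the same adapter. The small gap may come from bf16 arithmetic, though that is an assumption. The sketch below implements the standard per-pair DPO loss in plain Python; it may differ from `dpo_loss` and `grouped_dpo_loss` in `src.training_utils_prog`, which are not shown here. The name `dpo_loss_scalar` is hypothetical.

```python
import math


def dpo_loss_scalar(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.5):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    loss = -log(sigmoid(beta * margin)), where
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected).
    """
    x = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is exactly ln 2
print(f"{dpo_loss_scalar(-10.0, -10.0, -10.0, -10.0):.4f}")  # 0.6931
# Policy prefers the chosen sequence more than the reference does: loss drops
print(dpo_loss_scalar(-8.0, -12.0, -10.0, -10.0) < math.log(2))  # True
```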

## 6. Testing & Debugging

Scratch cells for debugging individual components.


In [None]:
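# Scratch check: indices of the groups whose prediction was incorrect (accuracy == 0)
group_accuracy = torch.tensor([1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1.])
torch.where(group_accuracy == 0.0)

# Scratch check: np.allclose on two identical action vectors
a = np.allclose([0.056, 0.028, 0.446], [0.056, 0.028, 0.446], atol=1e-6)
print(a)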
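
The training log lines "First incorrect group at position 0" and "All groups correct (accuracy=1), using only SFT loss" suggest a selection rule: take the first group with accuracy 0, or fall back to the SFT loss alone when there is none. The plain-Python sketch below illustrates that rule on the scratch values from the cell above. `first_incorrect_group` is a hypothetical name, and the actual logic in `train_dpo` may differ.

```python
from typing import Optional, Sequence


def first_incorrect_group(group_accuracy: Sequence[float]) -> Optional[int]:
    """Return the index of the first group with accuracy 0, or None if all are correct."""
    for idx, acc in enumerate(group_accuracy):
        if acc == 0.0:
            return idx
    return None


scratch = [1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1.]
print(first_incorrect_group(scratch))     # 2
print(first_incorrect_group([1.0] * 20))  # None
```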