
🎬 Are Video Reasoning Models Ready to Go Outside?

Yangfan He¹, Changgyu Boo², Jaehong Yoon¹

¹Nanyang Technological University   ²Korea University

arXiv · License: MIT · Python 3.11+ · Qwen2.5-VL · vLLM · DeepSpeed · TRL


ROVA is a training framework that improves the robustness of vision-language models for video reasoning under real-world disturbances such as weather, occlusion, and camera motion. It introduces a robustness-aware consistency reward computed under spatio-temporal corruptions, together with a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. We also introduce PVRBench, a new benchmark for evaluating accuracy and reasoning quality under realistic perturbations.

Paper

Features · Architecture · Quick Start · Training · Evaluation · Results

[Figure: ROVA framework overview]

🔥 Highlights

  • 🧠 T-GRPO (Temporal Group Relative Policy Optimization) – a novel RL algorithm that jointly optimizes over temporally-perturbed video inputs, enabling robust reasoning under frame-level corruptions
  • 🌧️ Multi-Domain Visual Corruption Engine – a realistic augmentation suite covering photometric effects (dusk, night, overexposure, shadows), weather simulation (rain, snow, hail, storm), spatial occlusion, and camera shake
  • 🔄 KL-Consistency Reward – a dual-branch alignment signal that penalizes reasoning divergence between clean and corrupted video streams
  • 🧩 Memory-Aware Training – an intelligent sample-difficulty tracking system that identifies and re-examines challenging examples across training
  • ⚡ Full-Stack Pipeline – end-to-end support from CoT annotation → SFT warm-up → GRPO/T-GRPO reinforcement learning → multi-benchmark evaluation

✨ Features

| Capability | Description |
| --- | --- |
| Model Support | Qwen2.5-VL (7B) with Flash Attention 2 |
| Training Paradigm | SFT → GRPO / T-GRPO with DeepSpeed ZeRO-2/3 |
| Acceleration | vLLM-accelerated RL rollout generation |
| Data Modality | Image-video mixed training |
| Answer Types | Multiple choice · Numerical · OCR · Free-form · Regression |
| Corruption Types | Photometric · Weather · Occlusion · Camera shake · Temporal drop |
| Reward Functions | Accuracy · Format · KL-Consistency · Length control |
| Memory System | Difficulty-aware sample tracking with auto-recheck |

๐Ÿ—๏ธ Architecture

                 ┌──────────────────────────┐
                 │       Input Video        │
                 └────────────┬─────────────┘
                              │
               ┌──────────────┴──────────────┐
               ▼                             ▼
      ┌──────────────┐             ┌──────────────────┐
      │  Clean Path  │             │  Corrupted Path  │
      │              │             │  ┌────────────┐  │
      │              │             │  │Photometric │  │
      │              │             │  │Weather     │  │
      │              │             │  │Occlusion   │  │
      │              │             │  │Shake       │  │
      │              │             │  │Frame Drop  │  │
      │              │             │  └────────────┘  │
      └──────┬───────┘             └────────┬─────────┘
             │                              │
             ▼                              ▼
      ┌────────────┐                ┌────────────┐
      │ Qwen2.5-VL │                │ Qwen2.5-VL │
      │  (Policy)  │                │  (Policy)  │
      └──────┬─────┘                └───────┬────┘
             │                              │
             │    ┌───────────────────┐     │
             └───►│  T-GRPO Trainer   │◄────┘
                  │                   │
                  │  ┌─────────────┐  │
                  │  │ R_accuracy  │  │
                  │  │ R_format    │  │
                  │  │ R_kl_cons.  │  │
                  │  │ R_length    │  │
                  │  └─────────────┘  │
                  │                   │
                  │  Memory Manager   │
                  └───────────────────┘
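
In code terms, the dual branch amounts to rolling out the same policy on a clean and a corrupted view of each video and scoring both rollout groups with the reward stack. A simplified sketch of that loop follows; the names policy, corrupt, and reward_fns are illustrative placeholders, not the repo's actual API:

def t_grpo_rollout(policy, video, question, corrupt, reward_fns, num_generations=4):
    """One illustrative T-GRPO rollout over a clean/corrupted video pair."""
    corrupted = corrupt(video)  # photometric / weather / occlusion / shake / drop
    rollouts = []
    for view in (video, corrupted):
        for _ in range(num_generations):
            completion = policy.generate(view, question)
            # Sum the reward stack: accuracy, format, KL-consistency, length
            reward = sum(fn(completion, view, question) for fn in reward_fns)
            rollouts.append((completion, reward))
    # Group-relative advantages over these rollouts push the policy toward
    # reasoning that holds up on both views.
    return rollouts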

🚀 Quick Start

Prerequisites

  • Python 3.11+, CUDA-compatible GPUs (4× recommended)
  • Conda package manager

Installation

# 1. Create environment
conda create -n rova python=3.11
conda activate rova

# 2. Install core dependencies
bash setup.sh

# 3. Install Qwen video extraction with decord acceleration
cd src/qwen-vl-utils
pip install -e .[decord]
cd ../..

Install Transformers (Pinned Version)

Qwen2.5-VL support in Transformers is updated frequently, which can cause version-related inconsistencies. Use the bundled version:

unzip transformers-main.zip
cd transformers-main
pip install .

Version Requirements

| Package | Version | Note |
| --- | --- | --- |
| vLLM | 0.7.2 | Required for RL acceleration |
| TRL | 0.16.0 | GRPO trainer compatibility |
| Flash Attention | latest | pip install flash-attn --no-build-isolation |

Prepare Data

# Place downloaded dataset into the data directory
# then unzip:
python ./src/unzip.py

๐Ÿ“ Dataset should be placed in src/r1-v/data/ before unzipping.


🎓 Training Pipeline

ROVA follows a two-stage training paradigm:

Stage 1 – Supervised Fine-Tuning (SFT)

Warm up the model on chain-of-thought (CoT) annotated data for one epoch:

bash ./src/scripts/run_sft_video.sh

💡 To generate CoT annotations on your own data, use src/generate_cot_vllm.py

Stage 2 – Reinforcement Learning (GRPO / T-GRPO)

Fine-tune the SFT checkpoint with Group Relative Policy Optimization:

# Standard GRPO with temporal corruption
bash ./src/scripts/run_grpo_video.sh

# With vLLM acceleration (recommended)
bash ./src/scripts/run_grpo_vllm_qwen25vl.sh

# With Memory-aware difficulty tracking
bash ./src/scripts/run_grpo_video_memory.sh

Key Training Configurations

| Parameter | Value | Description |
| --- | --- | --- |
| --temporal | true/false | Toggle T-GRPO (temporal corruption) vs. standard GRPO |
| --len_control | true/false | Enable/disable the length-control reward |
| --num_generations | 4–8 | Group size G for GRPO (larger = lower variance, more memory) |
| --beta | 0.04 | KL penalty coefficient |
| --max_pixels | 524288 | Max pixel budget per frame |
| --max_frames | 16 | Max video frames during training |
| --per_device_train_batch_size | 1 | Must remain 1 (following the R1-V convention) |
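
For context, GRPO computes a group-relative advantage by normalizing each completion's reward within its group of G generations (standard formulation, with ε a small stabilizing constant), which is why a larger --num_generations lowers gradient variance at the cost of memory:

A_i = (r_i − mean(r_1, …, r_G)) / (std(r_1, …, r_G) + ε),   i = 1, …, G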

Corruption Configuration

All corruption types are configurable via command-line arguments:

# Corruption probabilities (sum ≤ 1.0, remainder = no augmentation)
--photometric_prob 0.25    # Lighting effects: dusk / night / overexposure / shadows
--weather_prob 0.25        # Weather: rain / snow / hail / storm
--occlusion_prob 0.25      # Random block occlusion
--shake_prob 0.25          # Camera shake simulation
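
Since the four probabilities partition the unit interval, with the leftover mass mapped to "no augmentation", the per-sample selection can be sketched as follows (the function and argument names are illustrative, not the repo's API):

import random

def sample_corruption(photometric_prob=0.25, weather_prob=0.25,
                      occlusion_prob=0.25, shake_prob=0.25):
    """Pick at most one corruption type per video; leftover mass means clean."""
    u = random.random()
    cum = 0.0
    for name, p in [("photometric", photometric_prob),
                    ("weather", weather_prob),
                    ("occlusion", occlusion_prob),
                    ("shake", shake_prob)]:
        cum += p
        if u < cum:
            return name
    return None  # no augmentation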

๐Ÿ” Inference & Evaluation

Resolution Scaling

During inference, increase resolution for better performance:

| Setting | Training | Inference |
| --- | --- | --- |
| Max Frame Resolution | 128 × 28 × 28 | 256 × 28 × 28 |
| Max Frames | 16 | 16 / 32 / 64 |

Configure these in src/qwen-vl-utils

Decoding Configuration

Following the official Qwen2.5-VL demo settings:

top_p = 0.001
temperature = 0.01

โš ๏ธ Setting a large top_p may cause messy output during inference.

Run Evaluation on All Benchmarks

# Place evaluation files in src/r1-v/Evaluation/
# Download benchmark videos and place as specified in the provided JSON files

bash ./src/eval_bench.sh

Single Example Inference

python ./src/inference_example.py

The inference script supports all answer types through a unified prompt template:

# Supported problem types
problem_type = 'multiple choice'  # → outputs single letter (A, B, C, D)
problem_type = 'numerical'        # → outputs number (42 or 3.14)
problem_type = 'OCR'              # → outputs transcribed text
problem_type = 'free-form'        # → outputs free text answer
problem_type = 'regression'       # → outputs numerical prediction
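
The exact template lives in the inference script; as a hedged sketch, a type-conditioned prompt plus <answer> extraction could look like this (TYPE_HINTS and the regex below are illustrative, not the repo's exact strings):

import re

# Hypothetical per-type instructions appended to the question
TYPE_HINTS = {
    "multiple choice": "Answer with the option letter only (e.g., A).",
    "numerical": "Answer with a single number.",
    "OCR": "Answer with the transcribed text.",
    "free-form": "Answer with a short free-form response.",
    "regression": "Answer with a numerical prediction.",
}

def build_prompt(question: str, problem_type: str) -> str:
    hint = TYPE_HINTS[problem_type]
    return (f"{question}\n{hint}\n"
            "Respond in the format <think>...</think><answer>...</answer>.")

def extract_answer(output: str) -> str | None:
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return m.group(1).strip() if m else None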

📊 Main Results

[Figure: main results across benchmarks]

📦 Project Structure

robust-video-reason-main/
├── setup.sh                          # Environment setup script
├── assets/                           # Figures and visualizations
│   ├── fig2_overview.png
│   ├── dataset_demo.jpg
│   ├── main_results.jpg
│   └── fig_reward.pdf
├── src/
│   ├── inference_example.py          # Single-example inference
│   ├── eval_bench.py                 # Multi-benchmark evaluation
│   ├── eval_bench.sh                 # Evaluation launcher
│   ├── generate_cot_vllm.py          # CoT annotation generation
│   ├── unzip.py                      # Dataset extraction utility
│   ├── qwen-vl-utils/                # Qwen2.5-VL vision processing
│   ├── scripts/
│   │   ├── run_sft_video.sh          # Stage 1: SFT training
│   │   ├── run_grpo_video.sh         # Stage 2: GRPO training
│   │   ├── run_grpo_video_memory.sh  # Stage 2: GRPO + Memory
│   │   └── run_grpo_vllm_qwen25vl.sh # Stage 2: GRPO + vLLM
│   └── r1-v/
│       ├── configs/                  # DeepSpeed & training configs
│       ├── local_scripts/            # Data preparation & training
│       └── src/open_r1/
│           ├── grpo.py               # Core GRPO training logic
│           ├── grpo_baseline.py      # Baseline GRPO implementation
│           ├── grpo_memory.py        # Memory-aware GRPO
│           ├── sft_video.py          # SFT training script
│           ├── video_mask.py         # Multi-domain corruption engine
│           ├── video_mask_drop.py    # Token/pixel/frame masking
│           ├── memory_manager.py     # Difficulty-aware sample tracker
│           ├── memory_trainer.py     # Memory-integrated trainer
│           └── trainer/
│               ├── grpo_trainer.py           # Custom GRPO trainer
│               ├── grpo_trainer_baseline.py  # Baseline trainer
│               ├── grpo_trainer_v2.py        # Trainer v2
│               └── vllm_grpo_trainer_modified.py # vLLM-accelerated trainer

🎨 Visual Corruption Gallery

The corruption engine simulates diverse real-world visual disturbances:

| Corruption Type | Variants | Key Parameters |
| --- | --- | --- |
| 🌅 Photometric | Dusk · Night · Overexposure · Shadows | lighting_intensity (0–1) |
| 🌧️ Weather | Light Rain · Heavy Rain · Storm · Snow · Hail | particle_density, particle_size, speed |
| 🟫 Occlusion | Random block masking | mask_ratio, block_mean, block_std |
| 📷 Camera Shake | Translation + Zoom + Rotation | shake_intensity, zoom_range, smoothness |
| ⏭️ Temporal Drop | Random drop · Segment drop · Keep-K | frame_mask_ratio, segment_len |

🧮 Reward Functions

ROVA employs a multi-objective reward system:

| Reward | Formula | Purpose |
| --- | --- | --- |
| Accuracy | Task-specific scoring (exact match / ROUGE / WER / relative error) | Correctness signal |
| Format | Regex match for <think>...</think><answer>...</answer> | Structural compliance |
| KL-Consistency | exp(−α · KL(P_clean ‖ P_corrupt)) | Robustness alignment |
| Length Control | Penalizes overly long/short reasoning chains | Output quality |
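
A minimal sketch of the KL-consistency term under the formula above, assuming per-token next-token distributions from the clean and corrupted branches (the tensor shapes and default α are illustrative):

import torch
import torch.nn.functional as F

def kl_consistency_reward(clean_logits: torch.Tensor,
                          corrupt_logits: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """exp(-alpha * KL(P_clean || P_corrupt)), averaged over tokens.

    clean_logits, corrupt_logits: [seq_len, vocab_size]
    """
    p_clean = F.softmax(clean_logits, dim=-1)
    log_p_corrupt = F.log_softmax(corrupt_logits, dim=-1)
    # F.kl_div(input=log Q, target=P) computes KL(P || Q)
    kl = F.kl_div(log_p_corrupt, p_clean, reduction="batchmean")
    return torch.exp(-alpha * kl)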

🧠 Memory-Aware Training

The MemoryManager module tracks samples the model finds difficult:

┌──────────────┐    fail     ┌──────────────────┐
│  Training    │────────────►│  Memory Buffer   │
│  Sample      │             │  (max_size=100)  │
└──────────────┘             └────────┬─────────┘
                                      │
                              periodic recheck
                                      │
                              ┌────────▼─────────┐
                              │  Re-evaluate     │
                              │ ┌──pass──► Remove│
                              │ └──fail──► Keep  │
                              └──────────────────┘

Enable via:

--enable_sufficiency_check true \
--max_memory_size 100 \
--memory_file /path/to/memory.json
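
A minimal sketch of the buffer behavior described in the diagram above (the class and method names are illustrative, not necessarily those in memory_manager.py):

import json

class MemoryBuffer:
    """Track failed samples; drop them once the model passes a recheck."""

    def __init__(self, max_size: int = 100, memory_file: str = "memory.json"):
        self.max_size = max_size
        self.memory_file = memory_file
        self.samples: list[dict] = []

    def add_failure(self, sample: dict) -> None:
        # Buffer is capped at max_size (e.g., --max_memory_size 100)
        if len(self.samples) < self.max_size:
            self.samples.append(sample)

    def recheck(self, evaluate) -> None:
        """Re-evaluate buffered samples; keep only those still failing."""
        self.samples = [s for s in self.samples if not evaluate(s)]

    def save(self) -> None:
        with open(self.memory_file, "w") as f:
            json.dump(self.samples, f)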

📄 Citation

If you find this work useful, please consider citing:

@article{he2026rova,
  title={Are Video Reasoning Models Ready to Go Outside?},
  author={He, Yangfan and Boo, Changgyu and Yoon, Jaehong},
  journal={arXiv preprint arXiv:2603.10652},
  year={2026}
}

๐Ÿ™ Acknowledgements

We sincerely appreciate the contributions of the open-source community, in particular the following projects:

  • Qwen2.5-VL – Base vision-language model
  • R1-V – Foundational RL training framework
  • vLLM – High-throughput inference engine
  • TRL – Transformer reinforcement learning library
  • DeepSpeed – Distributed training optimization

MIT License · Copyright © 2026 Yangfan He, Changgyu Boo, Jaehong Yoon

Made with โค๏ธ for robust video understanding
