Yangfan He¹, Changgyu Boo², Jaehong Yoon¹
¹Nanyang Technological University  ²Korea University
ROVA is a novel training framework that improves the robustness of vision-language models for video reasoning under real-world disturbances such as weather, occlusion, and camera motion. It models a robustness-aware consistency reward under spatio-temporal corruptions and introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. We also introduce PVRBench, a new benchmark for evaluating accuracy and reasoning quality under realistic perturbations.
Features · Architecture · Quick Start · Training · Evaluation · Results
- 🧠 T-GRPO (Temporal Group Relative Policy Optimization) – a novel RL algorithm that jointly optimizes over temporally perturbed video inputs, enabling robust reasoning under frame-level corruptions
- 🌧️ Multi-Domain Visual Corruption Engine – a realistic augmentation suite covering photometric effects (dusk, night, overexposure, shadows), weather simulation (rain, snow, hail, storm), spatial occlusion, and camera shake
- 📊 KL-Consistency Reward – a dual-branch alignment signal that penalizes reasoning divergence between clean and corrupted video streams
- 🧩 Memory-Aware Training – a sample-difficulty tracking system that identifies and re-examines challenging examples across training
- ⚡ Full-Stack Pipeline – end-to-end support from CoT annotation → SFT warm-up → GRPO/T-GRPO reinforcement learning → multi-benchmark evaluation
| Capability | Description |
|---|---|
| Model Support | Qwen2.5-VL (7B) with Flash Attention 2 |
| Training Paradigm | SFT → GRPO / T-GRPO with DeepSpeed ZeRO-2/3 |
| Acceleration | vLLM-accelerated RL rollout generation |
| Data Modality | Image-Video mixed training |
| Answer Types | Multiple choice · Numerical · OCR · Free-form · Regression |
| Corruption Types | Photometric · Weather · Occlusion · Camera shake · Temporal drop |
| Reward Functions | Accuracy · Format · KL-Consistency · Length control |
| Memory System | Difficulty-aware sample tracking with auto-recheck |
```
                ┌───────────────────┐
                │    Input Video    │
                └─────────┬─────────┘
                          │
           ┌──────────────┴──────────────┐
           ▼                             ▼
   ┌───────────────┐          ┌────────────────────┐
   │  Clean Path   │          │   Corrupted Path   │
   │               │          │   ┌────────────┐   │
   │               │          │   │Photometric │   │
   │               │          │   │Weather     │   │
   │               │          │   │Occlusion   │   │
   │               │          │   │Shake       │   │
   │               │          │   │Frame Drop  │   │
   │               │          │   └────────────┘   │
   └───────┬───────┘          └─────────┬──────────┘
           │                            │
           ▼                            ▼
   ┌──────────────┐            ┌──────────────┐
   │  Qwen2.5-VL  │            │  Qwen2.5-VL  │
   │   (Policy)   │            │   (Policy)   │
   └───────┬──────┘            └───────┬──────┘
           │                           │
           │   ┌───────────────────┐   │
           └──►│  T-GRPO Trainer   │◄──┘
               │                   │
               │  ┌─────────────┐  │
               │  │ R_accuracy  │  │
               │  │ R_format    │  │
               │  │ R_kl_cons.  │  │
               │  │ R_length    │  │
               │  └─────────────┘  │
               │                   │
               │  Memory Manager   │
               └───────────────────┘
```
- Python 3.11+, CUDA-compatible GPUs (4× recommended)
- Conda package manager
```bash
# 1. Create environment
conda create -n rova python=3.11
conda activate rova

# 2. Install core dependencies
bash setup.sh

# 3. Install Qwen video extraction with decord acceleration
cd src/qwen-vl-utils
pip install -e .[decord]
cd ../..
```

Qwen2.5-VL undergoes frequent updates in Transformers, which may cause version-related inconsistencies. Use the bundled version:

```bash
unzip transformers-main.zip
cd transformers-main
pip install .
```

| Package | Version | Note |
|---|---|---|
| vLLM | 0.7.2 | Required for RL acceleration |
| TRL | 0.16.0 | GRPO trainer compatibility |
| Flash Attention | latest | `pip install flash-attn --no-build-isolation` |
```bash
# Place the downloaded dataset into the data directory, then unzip:
python ./src/unzip.py
```

📌 Dataset should be placed in `src/r1-v/data/` before unzipping.
ROVA follows a two-stage training paradigm:
Warm up the model on chain-of-thought (CoT) annotated data for one epoch:
```bash
bash ./src/scripts/run_sft_video.sh
```

💡 To generate CoT annotations on your own data, use `src/generate_cot_vllm.py`.
Fine-tune the SFT checkpoint with Group Relative Policy Optimization:
```bash
# Standard GRPO with temporal corruption
bash ./src/scripts/run_grpo_video.sh

# With vLLM acceleration (recommended)
bash ./src/scripts/run_grpo_vllm_qwen25vl.sh

# With memory-aware difficulty tracking
bash ./src/scripts/run_grpo_video_memory.sh
```

| Parameter | Value | Description |
|---|---|---|
| `--temporal` | `true/false` | Toggle T-GRPO (temporal corruption) vs. standard GRPO |
| `--len_control` | `true/false` | Enable/disable the length-control reward |
| `--num_generations` | 4–8 | Group size G for GRPO (↑ = lower variance, ↑ memory) |
| `--beta` | 0.04 | KL penalty coefficient |
| `--max_pixels` | 524288 | Max pixel budget per frame |
| `--max_frames` | 16 | Max video frames during training |
| `--per_device_train_batch_size` | 1 | Must remain 1 (following the R1-V convention) |
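For intuition on `--num_generations`: GRPO samples a group of G responses per prompt and normalizes their rewards within the group, so a larger G gives a lower-variance baseline at the cost of memory. A minimal sketch of that normalization (the function name is illustrative, not the repo's code):

```python
from math import sqrt

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize scalar rewards within one GRPO group of G responses
    sampled for the same prompt (G = --num_generations). Responses
    scoring above the group mean get a positive advantage, below it
    a negative one."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (sqrt(var) + eps) for r in rewards]

# Two correct and two wrong answers in a group of G=4:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean, the advantages always sum to zero within a group.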
All corruption types are configurable via command-line arguments:
```bash
# Corruption probabilities (sum ≤ 1.0; remainder = no augmentation)
--photometric_prob 0.25   # Lighting effects: dusk / night / overexposure / shadows
--weather_prob 0.25       # Weather: rain / snow / hail / storm
--occlusion_prob 0.25     # Random block occlusion
--shake_prob 0.25         # Camera shake simulation
```

During inference, increase the resolution for better performance:

| Setting | Training | Inference |
|---|---|---|
| Max Frame Resolution | 128 × 28 × 28 | 256 × 28 × 28 |
| Max Frames | 16 | 16 / 32 / 64 |
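The corruption probability flags compose as a categorical draw: each training clip receives one corruption type with its configured probability, and the leftover mass (1 minus the sum) leaves the clip clean. A minimal illustrative sampler (the function and dict names are hypothetical, not the repo's API):

```python
import random

# Mirrors the CLI defaults; probabilities must sum to <= 1.0.
CORRUPTION_PROBS = {
    "photometric": 0.25,
    "weather": 0.25,
    "occlusion": 0.25,
    "shake": 0.25,
}

def sample_corruption(probs=CORRUPTION_PROBS, rng=random):
    """Draw one corruption type; None means 'no augmentation'
    (the remainder after summing the configured probabilities)."""
    assert sum(probs.values()) <= 1.0 + 1e-9
    u = rng.random()
    acc = 0.0
    for name, p in probs.items():
        acc += p
        if u < acc:
            return name
    return None  # remainder: keep the clip clean
```

Lowering all four probabilities so they sum to, say, 0.6 would leave 40% of clips unaugmented.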
Configure these in `src/qwen-vl-utils`.
Following the official Qwen2.5-VL demo settings:
```python
top_p = 0.001
temperature = 0.01
```

⚠️ Setting a large `top_p` may cause messy output during inference.
```bash
# Place evaluation files in src/r1-v/Evaluation/
# Download benchmark videos and place them as specified in the provided JSON files
bash ./src/eval_bench.sh
```

For single-example inference:

```bash
python ./src/inference_example.py
```

The inference script supports all answer types through a unified prompt template:
```python
# Supported problem types
problem_type = 'multiple choice'  # → outputs a single letter (A, B, C, D)
problem_type = 'numerical'        # → outputs a number (42 or 3.14)
problem_type = 'OCR'              # → outputs transcribed text
problem_type = 'free-form'        # → outputs a free-text answer
problem_type = 'regression'       # → outputs a numerical prediction
```

```
robust-video-reason-main/
├── setup.sh                          # Environment setup script
├── assets/                           # Figures and visualizations
│   ├── fig2_overview.png
│   ├── dataset_demo.jpg
│   ├── main_results.jpg
│   └── fig_reward.pdf
└── src/
    ├── inference_example.py          # Single-example inference
    ├── eval_bench.py                 # Multi-benchmark evaluation
    ├── eval_bench.sh                 # Evaluation launcher
    ├── generate_cot_vllm.py          # CoT annotation generation
    ├── unzip.py                      # Dataset extraction utility
    ├── qwen-vl-utils/                # Qwen2.5-VL vision processing
    ├── scripts/
    │   ├── run_sft_video.sh          # Stage 1: SFT training
    │   ├── run_grpo_video.sh         # Stage 2: GRPO training
    │   ├── run_grpo_video_memory.sh  # Stage 2: GRPO + Memory
    │   └── run_grpo_vllm_qwen25vl.sh # Stage 2: GRPO + vLLM
    └── r1-v/
        ├── configs/                  # DeepSpeed & training configs
        ├── local_scripts/            # Data preparation & training
        └── src/open_r1/
            ├── grpo.py               # Core GRPO training logic
            ├── grpo_baseline.py      # Baseline GRPO implementation
            ├── grpo_memory.py        # Memory-aware GRPO
            ├── sft_video.py          # SFT training script
            ├── video_mask.py         # Multi-domain corruption engine
            ├── video_mask_drop.py    # Token/pixel/frame masking
            ├── memory_manager.py     # Difficulty-aware sample tracker
            ├── memory_trainer.py     # Memory-integrated trainer
            └── trainer/
                ├── grpo_trainer.py               # Custom GRPO trainer
                ├── grpo_trainer_baseline.py      # Baseline trainer
                ├── grpo_trainer_v2.py            # Trainer v2
                └── vllm_grpo_trainer_modified.py # vLLM-accelerated trainer
```
The corruption engine simulates diverse real-world visual disturbances:
| Corruption Type | Variants | Key Parameters |
|---|---|---|
| 🌆 Photometric | Dusk · Night · Overexposure · Shadows | `lighting_intensity` (0–1) |
| 🌧️ Weather | Light Rain · Heavy Rain · Storm · Snow · Hail | `particle_density`, `particle_size`, `speed` |
| 🚫 Occlusion | Random block masking | `mask_ratio`, `block_mean`, `block_std` |
| 📷 Camera Shake | Translation + Zoom + Rotation | `shake_intensity`, `zoom_range`, `smoothness` |
| ⏭️ Temporal Drop | Random drop · Segment drop · Keep-K | `frame_mask_ratio`, `segment_len` |
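As an illustration of the occlusion row, here is a toy block-occlusion pass over a frame tensor. The parameter names mirror the table, but the logic is a simplified stand-in, not the actual `video_mask.py` implementation:

```python
import numpy as np

def occlude_frames(frames, mask_ratio=0.2, block_mean=40, block_std=10, rng=None):
    """Toy random block occlusion: paint zero-valued square blocks onto
    each frame until roughly `mask_ratio` of its pixels are covered.
    `frames` has shape (T, H, W, C); block sides are drawn from a
    normal distribution N(block_mean, block_std)."""
    rng = rng or np.random.default_rng(0)
    frames = frames.copy()  # leave the clean branch untouched
    T, H, W, _ = frames.shape
    target = int(mask_ratio * H * W)
    for t in range(T):
        covered = 0
        while covered < target:
            side = max(1, int(rng.normal(block_mean, block_std)))
            y = rng.integers(0, max(1, H - side + 1))
            x = rng.integers(0, max(1, W - side + 1))
            frames[t, y:y + side, x:x + side] = 0
            covered += side * side  # overlaps counted loosely; fine for a toy
    return frames
```

The real engine additionally varies block intensity and position over time; this sketch only shows the masking mechanics.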
ROVA employs a multi-objective reward system:
| Reward | Formula | Purpose |
|---|---|---|
| Accuracy | Task-specific scoring (exact match / ROUGE / WER / relative error) | Correctness signal |
| Format | Regex match for `<think>...</think><answer>...</answer>` | Structural compliance |
| KL-Consistency | `exp(-α · KL(P_clean ‖ P_corrupt))` | Robustness alignment |
| Length Control | Penalizes overly long/short reasoning chains | Output quality |
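The format and KL-consistency rewards can be sketched as follows. The `alpha` scale and the exact regex anchoring are assumptions for illustration; the actual scoring lives in the GRPO trainer:

```python
import math
import re

# Structural compliance: the response must be <think>...</think><answer>...</answer>.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(text):
    return 1.0 if FORMAT_RE.match(text.strip()) else 0.0

def kl_consistency_reward(p_clean, p_corrupt, alpha=1.0, eps=1e-8):
    """exp(-alpha * KL(P_clean || P_corrupt)) over two probability
    distributions from the clean and corrupted branches: identical
    distributions give reward 1.0, and divergence decays it toward 0."""
    kl = sum(p * math.log((p + eps) / (q + eps))
             for p, q in zip(p_clean, p_corrupt))
    return math.exp(-alpha * max(kl, 0.0))
```

A well-formed answer whose corrupted-branch distribution matches the clean branch thus earns the maximum 1.0 on both signals.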
The MemoryManager module tracks samples the model finds difficult:
```
┌──────────────┐    fail     ┌──────────────────┐
│   Training   │────────────►│  Memory Buffer   │
│    Sample    │             │  (max_size=100)  │
└──────────────┘             └────────┬─────────┘
                                      │
                              periodic recheck
                                      │
                             ┌────────▼─────────┐
                             │   Re-evaluate    │
                             │  ├─pass─► Remove │
                             │  └─fail─► Keep   │
                             └──────────────────┘
```
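A minimal sketch of the buffer logic above; the class interface is illustrative, and the actual implementation is in `memory_manager.py`:

```python
class MemoryBufferSketch:
    """Difficulty-aware sample tracking in miniature: failed samples
    enter a bounded buffer, and a periodic recheck removes the ones
    the model has since learned to solve."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self.buffer = {}  # sample_id -> consecutive failure count

    def record(self, sample_id, passed):
        """Update the buffer after a training step on `sample_id`."""
        if passed:
            self.buffer.pop(sample_id, None)
        elif sample_id in self.buffer or len(self.buffer) < self.max_size:
            self.buffer[sample_id] = self.buffer.get(sample_id, 0) + 1

    def recheck(self, evaluate):
        """Re-run `evaluate(sample_id) -> bool` on every buffered sample
        and drop the ones that now pass."""
        for sid in list(self.buffer):
            if evaluate(sid):
                del self.buffer[sid]
```

The bounded size keeps the recheck cheap; once the buffer is full, new failures are only admitted as old ones graduate out.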
Enable via:
```bash
--enable_sufficiency_check true \
--max_memory_size 100 \
--memory_file /path/to/memory.json
```

If you find this work useful, please consider citing:
```bibtex
@article{he2026rova,
  title={Are Video Reasoning Models Ready to Go Outside?},
  author={He, Yangfan and Boo, Changgyu and Yoon, Jaehong},
  journal={arXiv preprint arXiv:2603.10652},
  year={2026}
}
```

We sincerely appreciate the contributions of the open-source community, in particular the following projects:
- Qwen2.5-VL – Base vision-language model
- R1-V – Foundational RL training framework
- vLLM – High-throughput inference engine
- TRL – Transformer reinforcement learning library
- DeepSpeed – Distributed training optimization
MIT License · Copyright © 2026 Yangfan He, Changgyu Boo, Jaehong Yoon
Made with ❤️ for robust video understanding
