Official implementation of MARC (Memory-Augmented RL Token Compression), accepted at ICLR 2026.
- [2026/02/02] Preliminary code release including training and inference scripts
- [2026/01/22] Our paper is accepted at ICLR 2026!
Note: Training data and VMR code will be released in the future.
MARC is a novel framework for efficient video understanding that combines:
- Visual Memory Retriever (VMR): Segments videos into event-level fragments and retrieves query-relevant clips (a conceptual sketch follows this list)
- Compression Group Relative Policy Optimization (C-GRPO): An RL-based distillation strategy that compresses video tokens while preserving reasoning ability
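Since the VMR code has not been released yet (see the note above), the following is only a conceptual sketch of query-relevant clip retrieval as a similarity search over embeddings of event-level fragments; all names and tensors are placeholders, not the actual VMR implementation.

```python
# Conceptual sketch of query-relevant clip retrieval (NOT the released VMR code):
# score each event-level clip embedding against the query embedding, keep the top-k.
import torch
import torch.nn.functional as F

def retrieve_clips(clip_embs: torch.Tensor, query_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """clip_embs: (num_clips, dim) embeddings of event-level fragments;
    query_emb: (dim,) embedding of the question. Returns indices of the top-k clips."""
    sims = F.cosine_similarity(clip_embs, query_emb.unsqueeze(0), dim=-1)
    return sims.topk(min(k, clip_embs.shape[0])).indices

# Example with random placeholder embeddings: 10 clips, 512-dim
clips = torch.randn(10, 512)
query = torch.randn(512)
print(retrieve_clips(clips, query))
```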
- 95% reduction in visual tokens (64 frames → 1 frame equivalent)
- 72% reduction in GPU memory usage
- 23.9% reduction in generation latency
- Nearly identical performance to 64-frame baseline (42.20 vs 42.21 mean accuracy)
```bash
git clone https://github.com/Gimlettt/MARC
cd MARC

# Create and activate conda environment
conda create -n marc python=3.11
conda activate marc

# Install base dependencies
bash setup.sh

# Install additional required packages
pip install wandb==0.18.3
pip install tensorboardx
pip install qwen_vl_utils torchvision
pip install flash-attn --no-build-isolation
pip install nltk
pip install rouge_score
pip install deepspeed
```

After installing transformers, you need to replace two files in your transformers installation with the modified versions that enable compression:
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/modeling_qwen2_5_vl.py` with `qwen2_5_vl/modeling_qwen2_5_vl(compress).py`
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/processing_qwen2_5_vl.py` with `qwen2_5_vl/processing_qwen2_5_vl(compress).py`
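To apply both replacements in one step, a small helper along these lines (a sketch, not part of the released scripts; it assumes you run it from the repository root) locates the transformers installation the same way as the command shown below:

```python
# Hypothetical helper: overwrite the stock Qwen2.5-VL files with the
# compression-enabled versions shipped in this repository.
import os
import shutil

import transformers

tf_root = os.path.dirname(transformers.__file__)
replacements = [
    ("qwen2_5_vl/modeling_qwen2_5_vl(compress).py",
     "models/qwen2_5_vl/modeling_qwen2_5_vl.py"),
    ("qwen2_5_vl/processing_qwen2_5_vl(compress).py",
     "models/qwen2_5_vl/processing_qwen2_5_vl.py"),
]
for src, dst in replacements:
    shutil.copy(src, os.path.join(tf_root, dst))
    print(f"replaced {dst}")
```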
You can find your transformers installation path by running:

```bash
python -c "import transformers; import os; print(os.path.dirname(transformers.__file__))"
```

For a complete inference example, see `inference_script/inference_example.py`.
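For orientation, a minimal video-inference sketch with the (patched) transformers Qwen2.5-VL classes and `qwen_vl_utils` might look as follows; the model ID, video path, and generation settings are placeholders, and `inference_script/inference_example.py` remains the authoritative reference:

```python
# Minimal sketch of video question answering with Qwen2.5-VL (illustrative values).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # placeholder; point this at your checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "path/to/video.mp4"},
    {"type": "text", "text": "What happens in this video?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```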
To evaluate on benchmarks, use:

```bash
bash inference_script/eval_bench.sh
```

To train with Compression Group Relative Policy Optimization:

```bash
bash training_script/run_grpo_video.sh
```

Training script: `training_script/grpo.py`
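At the core of GRPO-style training is a group-relative advantage: each sampled completion's reward is standardized against the other completions for the same prompt. The sketch below shows only that generic step, not the compression-specific objectives of C-GRPO; see `training_script/grpo.py` for the actual logic.

```python
# Generic group-relative advantage used by GRPO-style methods (illustrative only).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one group of sampled completions per prompt.
    Each completion's advantage is its reward standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.8, 0.8, 0.2]])
print(group_relative_advantages(rewards))
```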
For comparison with standard SFT:

```bash
bash training_script/run_sft_video.sh
```

Training script: `training_script/sft_video.py`
| Method | VSI | VideoMMMU | MMVU | MVBench | TempCompass | VideoMME | Mean |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B (64f) | 32.93 | 35.33 | 48.64 | 44.77 | 38.05 | 53.55 | 42.21 |
| MARC-3B (1f) | 27.55 | 33.11 | 51.99 | 45.82 | 55.34 | 39.44 | 42.20 |
- Visual Tokens: 2589.93 → 122.69 (95% reduction)
- GPU Memory: 41.6 GB → 11.5 GB (72% reduction)
- Generation Latency: 0.46s → 0.35s (23.9% reduction)
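These percentages follow directly from the raw measurements; a quick check:

```python
# Sanity-check the reported reductions against the raw before/after numbers.
for name, before, after in [("visual tokens", 2589.93, 122.69),
                            ("GPU memory (GB)", 41.6, 11.5),
                            ("generation latency (s)", 0.46, 0.35)]:
    print(f"{name}: {(before - after) / before:.1%} reduction")
```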
We use a subset of the Video-R1-260K dataset:
- 5K samples for C-GRPO training
- Includes both video and image data
- Covers multiple domains: Knowledge, Math, Chart, Spatial, OCR, General reasoning
- See training data distribution in the paper
Note: Training data will be released in the future.
If you find MARC useful for your research, please cite:
```bibtex
@article{wu2025marc,
  title={MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding},
  author={Wu, Peiran and Yu, Zhuorui and Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  journal={arXiv preprint arXiv:2510.07915},
  year={2025}
}
```

This project builds upon:
- Video-R1 for the base training framework
- Qwen2.5-VL for the base vision-language model
- TRL for the GRPO implementation
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
For questions and feedback, please open an issue on GitHub.