Official implementation of MARC (Memory-Augmented RL Token Compression), accepted at ICLR 2026.
- [2026/02/02] Preliminary code release including training and inference scripts
- [2026/01/22] Our paper is accepted at ICLR 2026!
Note: Training data and VMR code will be released in the future.
MARC is a novel framework for efficient video understanding that combines:
- Visual Memory Retriever (VMR): Segments videos into event-level fragments and retrieves query-relevant clips (a conceptual sketch follows this list)
- Compression Group Relative Policy Optimization (C-GRPO): An RL-based distillation strategy that compresses video tokens while preserving reasoning ability
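Since the VMR code has not been released yet (see the note above), the following is only a conceptual sketch of query-relevant clip retrieval as a similarity search over embeddings of event-level fragments; all names and tensors are placeholders, not the actual VMR implementation.

```python
# Conceptual sketch of query-relevant clip retrieval (NOT the released VMR code):
# score each event-level clip embedding against the query embedding, keep the top-k.
import torch
import torch.nn.functional as F

def retrieve_clips(clip_embs: torch.Tensor, query_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """clip_embs: (num_clips, dim) embeddings of event-level fragments;
    query_emb: (dim,) embedding of the question. Returns indices of the top-k clips."""
    sims = F.cosine_similarity(clip_embs, query_emb.unsqueeze(0), dim=-1)
    return sims.topk(min(k, clip_embs.shape[0])).indices

# Example with random placeholder embeddings: 10 clips, 512-dim
clips = torch.randn(10, 512)
query = torch.randn(512)
print(retrieve_clips(clips, query))
```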
- 95% reduction in visual tokens (64 frames → 1 frame equivalent)
- 72% reduction in GPU memory usage
- 23.9% reduction in generation latency
- Nearly identical performance to 64-frame baseline (42.20 vs 42.21 mean accuracy)
```bash
git clone https://github.com/Gimlettt/MARC
cd MARC

# Create and activate conda environment
conda create -n marc python=3.11
conda activate marc

# Install base dependencies
bash setup.sh

# Install additional required packages
pip install wandb==0.18.3
pip install tensorboardx
pip install qwen_vl_utils torchvision
pip install flash-attn --no-build-isolation
pip install nltk
pip install rouge_score
pip install deepspeed
```

After installing transformers, you need to replace two files in your transformers installation with the modified versions that enable compression:
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/modeling_qwen2_5_vl.py` with `qwen2_5_vl/modeling_qwen2_5_vl(compress).py`
- Replace `<TRANSFORMERS_PATH>/models/qwen2_5_vl/processing_qwen2_5_vl.py` with `qwen2_5_vl/processing_qwen2_5_vl(compress).py`
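To apply both replacements in one step, a small helper along these lines (a sketch, not part of the released scripts; it assumes you run it from the repository root) locates the transformers installation the same way as the command shown below:

```python
# Hypothetical helper: overwrite the stock Qwen2.5-VL files with the
# compression-enabled versions shipped in this repository.
import os
import shutil

import transformers

tf_root = os.path.dirname(transformers.__file__)
replacements = [
    ("qwen2_5_vl/modeling_qwen2_5_vl(compress).py",
     "models/qwen2_5_vl/modeling_qwen2_5_vl.py"),
    ("qwen2_5_vl/processing_qwen2_5_vl(compress).py",
     "models/qwen2_5_vl/processing_qwen2_5_vl.py"),
]
for src, dst in replacements:
    shutil.copy(src, os.path.join(tf_root, dst))
    print(f"replaced {dst}")
```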
You can find your transformers installation path by running:

```bash
python -c "import transformers; import os; print(os.path.dirname(transformers.__file__))"
```

For a complete inference example, see `inference_script/inference_example.py`.
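For orientation, a minimal video-inference sketch with the (patched) transformers Qwen2.5-VL classes and `qwen_vl_utils` might look as follows; the model ID, video path, and generation settings are placeholders, and `inference_script/inference_example.py` remains the authoritative reference:

```python
# Minimal sketch of video question answering with Qwen2.5-VL (illustrative values).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # placeholder; point this at your checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "path/to/video.mp4"},
    {"type": "text", "text": "What happens in this video?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```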
To evaluate on benchmarks, use:

```bash
bash inference_script/eval_bench.sh
```

To train with Compression Group Relative Policy Optimization:

```bash
bash training_script/run_grpo_video.sh
```

Training script: `training_script/grpo.py`
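At the core of GRPO-style training is a group-relative advantage: each sampled completion's reward is standardized against the other completions for the same prompt. The sketch below shows only that generic step, not the compression-specific objectives of C-GRPO; see `training_script/grpo.py` for the actual logic.

```python
# Generic group-relative advantage used by GRPO-style methods (illustrative only).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one group of sampled completions per prompt.
    Each completion's advantage is its reward standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.8, 0.8, 0.2]])
print(group_relative_advantages(rewards))
```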
For comparison with standard SFT:

```bash
bash training_script/run_sft_video.sh
```

Training script: `training_script/sft_video.py`
| Method | VSI | VideoMMMU | MMVU | MVBench | TempCompass | VideoMME | Mean |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B (64f) | 32.93 | 35.33 | 48.64 | 44.77 | 38.05 | 53.55 | 42.21 |
| MARC-3B (1f) | 27.55 | 33.11 | 51.99 | 45.82 | 55.34 | 39.44 | 42.20 |
- Visual Tokens: 2589.93 → 122.69 (95% reduction)
- GPU Memory: 41.6 GB → 11.5 GB (72% reduction)
- Generation Latency: 0.46s → 0.35s (23.9% reduction)
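These percentages follow directly from the raw measurements; a quick check:

```python
# Sanity-check the reported reductions against the raw before/after numbers.
for name, before, after in [("visual tokens", 2589.93, 122.69),
                            ("GPU memory (GB)", 41.6, 11.5),
                            ("generation latency (s)", 0.46, 0.35)]:
    print(f"{name}: {(before - after) / before:.1%} reduction")
```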
We use a subset of the Video-R1-260K dataset:
- 5K samples for C-GRPO training
- Includes both video and image data
- Covers multiple domains: Knowledge, Math, Chart, Spatial, OCR, General reasoning
- See training data distribution in the paper
Note: Training data will be released in the future.
If you find MARC useful for your research, please cite:
```bibtex
@article{wu2025marc,
  title={MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding},
  author={Wu, Peiran and Yu, Zhuorui and Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  journal={arXiv preprint arXiv:2510.07915},
  year={2025}
}
```

This project builds upon:
- Video-R1 for the base training framework
- Qwen2.5-VL for the base vision-language model
- TRL for the GRPO implementation
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
For questions and feedback, please open an issue on GitHub.