Inference pipeline for MedGRPO — Qwen2.5-VL-7B fine-tuned for medical video understanding via SFT + GRPO.
📄 Paper: arXiv:2512.06581
🌐 Project Page: uii-ai.github.io/MedGRPO
🤗 Model: UII-AI/Qwen2.5-VL-7B-MedGRPO
🤗 Dataset: UII-AI/MedVidBench
🎮 Demo: UII-AI/MedGRPO-Demo
📊 Leaderboard: UII-AI/MedVidBench-Leaderboard
```
MedGRPO-Inference/
├── inference/
│   ├── vllm_infer.py               # vLLM batch inference engine
│   └── vision_process_medical.py   # Medical video frame processing (RC box support)
├── utils/
│   ├── split_data_balanced.py      # Split data across GPUs (balanced by task)
│   └── merge_results_manual.py     # Merge per-GPU results
├── results/                        # Inference outputs (gitignored)
├── run_inference.sh                # Inference launcher
└── requirements.txt
```
```bash
# Install dependencies
pip install -r requirements.txt

# Download the model from Hugging Face
huggingface-cli download UII-AI/Qwen2.5-VL-7B-MedGRPO --local-dir models/Qwen2.5-VL-7B-MedGRPO

# Download the MedVidBench test set
huggingface-cli download UII-AI/MedVidBench cleaned_test_data_11_04.json --repo-type dataset --local-dir .

# Run inference
bash run_inference.sh
# Output: results/results.json
```

The inference script expects JSON files in SFT format:
```json
[
  {
    "conversations": [
      {"from": "human", "value": "<video>\nQuestion text?"}
    ],
    "video": ["frame_0001.jpg", "frame_0002.jpg", ...],
    "metadata": {
      "fps": "1.0",
      "video_id": "...",
      "input_video_start_frame": "0",
      "input_video_end_frame": "100"
    },
    "qa_type": "tal",
    "data_source": "AVOS",
    "struc_info": {...},
    "is_RC": false,
    "RC_info": {}
  }
]
```

Note: Ground truth answers are not required. The script handles both training data (with answers) and test data (questions only).
Supported tasks (`qa_type`): `tal`, `stg`, `next_action`, `dense_captioning`, `video_summary`, `region_caption`, `skill_assessment`, `cvs_assessment`
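Before launching a long run, it can help to sanity-check samples against the schema above. Below is a minimal sketch; `check_sample` is a hypothetical helper, not part of this repo:

```python
# Minimal sanity check for SFT-format samples (hypothetical helper,
# not part of this repo).
SUPPORTED_QA_TYPES = {
    "tal", "stg", "next_action", "dense_captioning",
    "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
}

def check_sample(sample: dict) -> list[str]:
    """Return a list of schema problems (empty list means the sample looks OK)."""
    problems = []
    convs = sample.get("conversations", [])
    # Exactly the fields the inference script reads: a human turn with a
    # <video> placeholder, a frame list, and a supported qa_type.
    if not any(c.get("from") == "human" and "<video>" in c.get("value", "") for c in convs):
        problems.append("missing human turn with <video> placeholder")
    if not sample.get("video"):
        problems.append("missing frame list under 'video'")
    if sample.get("qa_type") not in SUPPORTED_QA_TYPES:
        problems.append(f"unknown qa_type: {sample.get('qa_type')!r}")
    if sample.get("is_RC") and not sample.get("RC_info"):
        problems.append("is_RC is true but RC_info is empty")
    return problems

sample = {
    "conversations": [{"from": "human", "value": "<video>\nWhen does cutting happen?"}],
    "video": ["frame_0001.jpg"],
    "metadata": {"fps": "1.0"},
    "qa_type": "tal",
    "is_RC": False,
}
print(check_sample(sample))  # -> []
```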
```bash
CUDA_VISIBLE_DEVICES=0 python3 inference/vllm_infer.py \
    --model_path ./models/Qwen2.5-VL-7B-MedGRPO \
    --data_path test_data.json \
    --output_path results.json \
    --batch_size 4 \
    --gpu_memory_utilization 0.85 \
    --max_new_tokens 256
```

For multi-GPU inference:

```bash
# Split data across GPUs (balanced by task type)
python3 utils/split_data_balanced.py test_data.json 4

# Run on each GPU in parallel
CUDA_VISIBLE_DEVICES=0 python3 inference/vllm_infer.py --data_path test_data_gpu0.json --output_path results/results_gpu0.json ... &
CUDA_VISIBLE_DEVICES=1 python3 inference/vllm_infer.py --data_path test_data_gpu1.json --output_path results/results_gpu1.json ... &
wait

# Merge results
python3 utils/merge_results_manual.py results/results.json 4
```

Or simply use the launcher script, which handles all of the above:

```bash
bash run_inference.sh
```

Inference saves results as a dict keyed by sequential index:
```json
{
  "0": {
    "metadata": {"fps": "1.0", "video_id": "...", ...},
    "qa_type": "tal",
    "struc_info": {...},
    "question": "When does cutting happen?",
    "answer": "<model prediction>",
    "data_source": "AVOS"
  },
  "1": {...}
}
```

The leaderboard expects a list with a `prediction` field. Convert first:

```bash
python3 utils/convert_to_submission.py results/results.json submission.json
```

Output `submission.json`:
```json
[
  {"id": "video_001", "qa_type": "tal", "prediction": "The action starts at 5.2s and ends at 12.7s."},
  {"id": "video_002", "qa_type": "stg", "prediction": "..."},
  ...
]
```

Then upload `submission.json` to the MedVidBench Leaderboard.
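The conversion amounts to flattening the index-keyed results dict into a list. A rough sketch of what `utils/convert_to_submission.py` does — the field mapping is an assumption (in particular, taking `id` from `metadata["video_id"]`), so defer to the actual script:

```python
import json

def to_submission(results: dict) -> list[dict]:
    """Flatten the index-keyed results dict into the leaderboard list format.

    Assumption: the submission `id` comes from metadata["video_id"]; the real
    utils/convert_to_submission.py may map fields differently.
    """
    return [
        {
            "id": entry["metadata"]["video_id"],
            "qa_type": entry["qa_type"],
            "prediction": entry["answer"],
        }
        # Sort numerically on the string keys "0", "1", ... to keep order.
        for _, entry in sorted(results.items(), key=lambda kv: int(kv[0]))
    ]

results = {
    "0": {"metadata": {"video_id": "video_001"}, "qa_type": "tal",
          "answer": "The action starts at 5.2s and ends at 12.7s."},
}
print(json.dumps(to_submission(results), indent=2))
```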
- `vision_process_medical.py` handles per-sample FPS and draws bounding boxes for region-caption tasks (`is_RC=True`).
- Greedy decoding (`temperature=0.0`) is used by default.
- The model must be in Hugging Face format.
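For context on the greedy-decoding default: with `temperature=0.0` the sampling distribution collapses to a point mass, so each decoding step simply takes the argmax token. A toy, framework-free illustration (not the repo's code):

```python
def greedy_pick(logits: dict[str, float]) -> str:
    """One greedy decoding step: at temperature 0 the softmax collapses
    to a point mass, so the next token is just the argmax of the logits."""
    return max(logits, key=logits.get)

# Hypothetical next-token logits at one decoding step.
step_logits = {"starts": 3.1, "ends": 2.7, "at": 0.4}
print(greedy_pick(step_logits))  # -> "starts"
```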
```bibtex
@inproceedings{su2024medgrpo,
  title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```