Inference pipeline for MedGRPO — Qwen2.5-VL-7B fine-tuned for medical video understanding via SFT + GRPO.
📄 Paper: arXiv:2512.06581
🌐 Project Page: uii-ai.github.io/MedGRPO
🤗 Model: UII-AI/Qwen2.5-VL-7B-MedGRPO
🤗 Dataset: UII-AI/MedVidBench
🎮 Demo: UII-AI/MedGRPO-Demo
📊 Leaderboard: UII-AI/MedVidBench-Leaderboard
```
MedGRPO-Inference/
├── inference/
│   ├── vllm_infer.py               # vLLM batch inference engine
│   └── vision_process_medical.py   # Medical video frame processing (RC box support)
├── utils/
│   ├── split_data_balanced.py      # Split data across GPUs (balanced by task)
│   └── merge_results_manual.py     # Merge per-GPU results
├── results/                        # Inference outputs (gitignored)
├── run_inference.sh                # Inference launcher
└── requirements.txt
```
```bash
# Install dependencies
pip install -r requirements.txt

# Download the model from Hugging Face
huggingface-cli download UII-AI/Qwen2.5-VL-7B-MedGRPO --local-dir models/Qwen2.5-VL-7B-MedGRPO

# Download the MedVidBench test set
huggingface-cli download UII-AI/MedVidBench cleaned_test_data_11_04.json --repo-type dataset --local-dir .

# Run inference
bash run_inference.sh
# Output: results/results.json
```

The inference script expects JSON files in SFT format:
```json
[
  {
    "conversations": [
      {"from": "human", "value": "<video>\nQuestion text?"}
    ],
    "video": ["frame_0001.jpg", "frame_0002.jpg", ...],
    "metadata": {
      "fps": "1.0",
      "video_id": "...",
      "input_video_start_frame": "0",
      "input_video_end_frame": "100"
    },
    "qa_type": "tal",
    "data_source": "AVOS",
    "struc_info": {...},
    "is_RC": false,
    "RC_info": {}
  }
]
```

Note: Ground truth answers are not required. The script handles both training data (with answers) and test data (questions only).
Supported tasks (`qa_type`): `tal`, `stg`, `next_action`, `dense_captioning`, `video_summary`, `region_caption`, `skill_assessment`, `cvs_assessment`
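Before launching a long run, it can help to sanity-check samples against the schema above. Below is a minimal sketch; `check_sample` is a hypothetical helper, not part of this repo:

```python
# Minimal sanity check for SFT-format samples (hypothetical helper,
# not part of this repo).
SUPPORTED_QA_TYPES = {
    "tal", "stg", "next_action", "dense_captioning",
    "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
}

def check_sample(sample: dict) -> list[str]:
    """Return a list of schema problems (empty list means the sample looks OK)."""
    problems = []
    convs = sample.get("conversations", [])
    # Exactly the fields the inference script reads: a human turn with a
    # <video> placeholder, a frame list, and a supported qa_type.
    if not any(c.get("from") == "human" and "<video>" in c.get("value", "") for c in convs):
        problems.append("missing human turn with <video> placeholder")
    if not sample.get("video"):
        problems.append("missing frame list under 'video'")
    if sample.get("qa_type") not in SUPPORTED_QA_TYPES:
        problems.append(f"unknown qa_type: {sample.get('qa_type')!r}")
    if sample.get("is_RC") and not sample.get("RC_info"):
        problems.append("is_RC is true but RC_info is empty")
    return problems

sample = {
    "conversations": [{"from": "human", "value": "<video>\nWhen does cutting happen?"}],
    "video": ["frame_0001.jpg"],
    "metadata": {"fps": "1.0"},
    "qa_type": "tal",
    "is_RC": False,
}
print(check_sample(sample))  # -> []
```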
```bash
CUDA_VISIBLE_DEVICES=0 python3 inference/vllm_infer.py \
    --model_path ./models/Qwen2.5-VL-7B-MedGRPO \
    --data_path test_data.json \
    --output_path results.json \
    --batch_size 4 \
    --gpu_memory_utilization 0.85 \
    --max_new_tokens 256
```

For multi-GPU inference:

```bash
# Split data across GPUs (balanced by task type)
python3 utils/split_data_balanced.py test_data.json 4

# Run on each GPU in parallel
CUDA_VISIBLE_DEVICES=0 python3 inference/vllm_infer.py --data_path test_data_gpu0.json --output_path results/results_gpu0.json ... &
CUDA_VISIBLE_DEVICES=1 python3 inference/vllm_infer.py --data_path test_data_gpu1.json --output_path results/results_gpu1.json ... &
wait

# Merge results
python3 utils/merge_results_manual.py results/results.json 4
```

Or simply use the launcher script, which handles all of the above:

```bash
bash run_inference.sh
```

Inference saves results as a dict keyed by sequential index:
```json
{
  "0": {
    "metadata": {"fps": "1.0", "video_id": "...", ...},
    "qa_type": "tal",
    "struc_info": {...},
    "question": "When does cutting happen?",
    "answer": "<model prediction>",
    "data_source": "AVOS"
  },
  "1": {...}
}
```

The leaderboard expects a list with a `prediction` field. Convert first:

```bash
python3 utils/convert_to_submission.py results/results.json submission.json
```

Output `submission.json`:
```json
[
  {"id": "video_001", "qa_type": "tal", "prediction": "The action starts at 5.2s and ends at 12.7s."},
  {"id": "video_002", "qa_type": "stg", "prediction": "..."},
  ...
]
```

Then upload `submission.json` to the MedVidBench Leaderboard.
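The conversion amounts to flattening the index-keyed results dict into a list. A rough sketch of what `utils/convert_to_submission.py` does — the field mapping is an assumption (in particular, taking `id` from `metadata["video_id"]`), so defer to the actual script:

```python
import json

def to_submission(results: dict) -> list[dict]:
    """Flatten the index-keyed results dict into the leaderboard list format.

    Assumption: the submission `id` comes from metadata["video_id"]; the real
    utils/convert_to_submission.py may map fields differently.
    """
    return [
        {
            "id": entry["metadata"]["video_id"],
            "qa_type": entry["qa_type"],
            "prediction": entry["answer"],
        }
        # Sort numerically on the string keys "0", "1", ... to keep order.
        for _, entry in sorted(results.items(), key=lambda kv: int(kv[0]))
    ]

results = {
    "0": {"metadata": {"video_id": "video_001"}, "qa_type": "tal",
          "answer": "The action starts at 5.2s and ends at 12.7s."},
}
print(json.dumps(to_submission(results), indent=2))
```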
- `vision_process_medical.py` handles per-sample FPS and draws bounding boxes for region-caption tasks (`is_RC=True`).
- Greedy decoding (`temperature=0.0`) is used by default.
- The model must be in Hugging Face format.
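For context on the greedy-decoding default: with `temperature=0.0` the sampling distribution collapses to a point mass, so each decoding step simply takes the argmax token. A toy, framework-free illustration (not the repo's code):

```python
def greedy_pick(logits: dict[str, float]) -> str:
    """One greedy decoding step: at temperature 0 the softmax collapses
    to a point mass, so the next token is just the argmax of the logits."""
    return max(logits, key=logits.get)

# Hypothetical next-token logits at one decoding step.
step_logits = {"starts": 3.1, "ends": 2.7, "at": 0.4}
print(greedy_pick(step_logits))  # -> "starts"
```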
```bibtex
@inproceedings{su2024medgrpo,
  title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```