- Authors: Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai📧, Xuhong Zhang, Jianwei Yin
- Institutes: Zhejiang University; Shanghai AI Laboratory
- Resources: [📖Paper] [🤗Huggingface]
IBISAgent is a novel agentic Multimodal Large Language Model (MLLM) framework designed to address the limitations of existing medical MLLMs in fine-grained pixel-level understanding. Unlike previous approaches that rely on implicit segmentation tokens and single-pass reasoning, IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process.
By treating segmentation tools (e.g., MedSAM2) as plug-and-play modules controllable through natural language, IBISAgent iteratively generates interleaved reasoning (Thinking) and text-based click actions (Action) to progressively refine segmentation masks. This approach mimics the interactive behavior of human experts, allowing for self-correction and high-quality mask generation without requiring architectural modifications to the MLLM.
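The iterative Thinking/Action loop described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the released implementation: the `CLICK(x, y, pos|neg)` action format and the `mllm_step`/`segment` callables are hypothetical stand-ins for the MLLM and the plug-and-play segmentation tool (e.g., MedSAM2).

```python
import re

def run_agent(mllm_step, segment, image, prompt, max_turns=20):
    """Iteratively query the MLLM for a Thinking/Action step, parse the
    text-based click it emits, and refine the mask with the segmentation
    tool until the model stops issuing actions (hypothetical protocol)."""
    clicks, mask = [], None
    for _ in range(max_turns):
        # The MLLM sees the image, the user prompt, and the current mask,
        # and emits interleaved reasoning plus a click action as plain text.
        reply = mllm_step(image, prompt, mask)
        m = re.search(r"CLICK\((\d+),\s*(\d+),\s*(pos|neg)\)", reply)
        if m is None:
            # No further action: the model considers the mask final.
            break
        x, y, label = int(m.group(1)), int(m.group(2)), m.group(3)
        clicks.append((x, y, label == "pos"))
        mask = segment(image, clicks)  # plug-and-play tool call (e.g., MedSAM2)
    return mask, clicks
```

Because the tool is driven purely through natural-language actions, the MLLM itself needs no architectural changes and can self-correct by placing negative clicks on over-segmented regions.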
- 🔥 Agentic Reasoning Framework. We reformulate medical image segmentation as a multi-step Markov Decision Process (MDP), enabling the model to "think" and "act" iteratively to solve complex visual grounding tasks.
- 🔥 No Implicit Tokens. IBISAgent eliminates the need for special `<SEG>` tokens and external pixel decoders, preserving the LLM's inherent text generation capabilities and ensuring better generalization.
- 🔥 Two-Stage Training Strategy.
- Cold-Start SFT: Initialized with high-quality trajectory data synthesized from automatic click simulation and self-reflection error correction.
- Agentic Reinforcement Learning: Further optimized using GRPO with novel fine-grained rewards (Region-based Click Placement, Progressive Improvement), enabling the model to discover advanced segmentation strategies beyond imitation.
- 🔥 SOTA Performance. IBISAgent significantly outperforms existing medical MLLMs on both in-domain and out-of-domain benchmarks, demonstrating superior robustness and pixel-level reasoning ability.
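The two fine-grained reward ideas above can be illustrated with a toy implementation. This is a hedged sketch of how such rewards are commonly computed, not the paper's exact formulas: the function names and the Dice-based formulation are assumptions.

```python
def dice(pred, gt):
    """Dice coefficient between two binary masks (flat lists of 0/1 pixels)."""
    inter = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0

def click_placement_reward(click_xy, gt_mask, width):
    """Region-based click placement (toy version): reward a positive click
    that lands inside the ground-truth region, penalize one outside it."""
    x, y = click_xy
    return 1.0 if gt_mask[y * width + x] else -1.0

def progressive_reward(prev_mask, new_mask, gt_mask):
    """Progressive improvement (toy version): the gain in Dice score that
    the latest click produced over the previous mask."""
    return dice(new_mask, gt_mask) - dice(prev_mask, gt_mask)
```

Dense, step-level signals like these give GRPO something to optimize at every turn of the trajectory, rather than only at the final mask.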
Please refer to our Huggingface repository for the pre-trained model weights.
- Create a new conda environment and install the required packages.
```bash
conda create -n ibisagent python=3.12
conda activate ibisagent
pip install -r infer/requirements.txt
```
- Download our RL-trained model weights to `infer/models/mllm` from here.
```bash
huggingface-cli download manglu3935/IBIS \
    --include "qwen2_5vl-7b-RL/*" \
    --local-dir infer/models/mllm \
    --local-dir-use-symlinks False
```
- Download MedSAM2 model weights to `infer/models/sam2` from here.
```bash
huggingface-cli download wanglab/MedSAM2 MedSAM2_latest.pt \
    --local-dir infer/models/sam2 \
    --local-dir-use-symlinks False
```
- Run the multi-turn inference script.
```bash
python infer/multi_turn.py \
    --image "infer/test_img.png" \
    --prompt "Can you find a liver in this image?" \
    --mllm_path "infer/models/mllm" \
    --sam2_cfg "infer/models/sam2/medsam2_cfg.yaml" \
    --sam2_ckpt "infer/models/sam2/MedSAM2_latest.pt"
```
Parameters:
| Parameter | Description | Default | Required |
|---|---|---|---|
| `--image` | Path to the input medical image | None | Yes |
| `--prompt` | User text prompt (e.g., 'Is there a colon tumor?') | None | Yes |
| `--mllm_path` | Path to the MLLM model | `infer/models/mllm` | No |
| `--sam2_cfg` | Path to the MedSAM2 config file | `infer/models/sam2/medsam2_cfg.yaml` | No |
| `--sam2_ckpt` | Path to the MedSAM2 checkpoint | `infer/models/sam2/MedSAM2_latest.pt` | No |
| `--max_turns` | Maximum number of iterations | 20 | No |
| `--use_history` | Whether to enable chat history (1 for True, 0 for False) | 0 | No |
| `--output_dir` | Directory to save results | `./outputs` | No |
- [2026/02/28] 🚀 Code and dataset release preparation.
- [2026/02/21] 🎉 IBISAgent is accepted to CVPR 2026!
- Release training scripts (SFT & RL)
- Release inference code
- Release pre-trained model weights
- Release Cold-Start and RL datasets
If you find our work helpful for your research, please consider giving one star ⭐️ and citing:
@inproceedings{jiang2026ibisagent,
title={IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation},
author={Jiang, Yankai and Li, Qiaoru and Xu, Binlu and Sun, Haoran and Ding, Chao and Dong, Junting and Cai, Yuxiang and Zhang, Xuhong and Yin, Jianwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}

