- Authors: Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai📧, Xuhong Zhang, Jianwei Yin
- Institutes: Zhejiang University; Shanghai AI Laboratory
- Resources: [📖Paper] [🤗Huggingface]
IBISAgent is a novel agentic Multimodal Large Language Model (MLLM) framework designed to address the limitations of existing medical MLLMs in fine-grained pixel-level understanding. Unlike previous approaches that rely on implicit segmentation tokens and single-pass reasoning, IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process.
By treating segmentation tools (e.g., MedSAM2) as plug-and-play modules controllable through natural language, IBISAgent iteratively generates interleaved reasoning (Thinking) and text-based click actions (Action) to progressively refine segmentation masks. This approach mimics the interactive behavior of human experts, allowing for self-correction and high-quality mask generation without requiring architectural modifications to the MLLM.
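The iterative Thinking/Action loop described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the released implementation: the `CLICK(x, y, pos|neg)` action format and the `mllm_step`/`segment` callables are hypothetical stand-ins for the MLLM and the plug-and-play segmentation tool (e.g., MedSAM2).

```python
import re

def run_agent(mllm_step, segment, image, prompt, max_turns=20):
    """Iteratively query the MLLM for a Thinking/Action step, parse the
    text-based click it emits, and refine the mask with the segmentation
    tool until the model stops issuing actions (hypothetical protocol)."""
    clicks, mask = [], None
    for _ in range(max_turns):
        # The MLLM sees the image, the user prompt, and the current mask,
        # and emits interleaved reasoning plus a click action as plain text.
        reply = mllm_step(image, prompt, mask)
        m = re.search(r"CLICK\((\d+),\s*(\d+),\s*(pos|neg)\)", reply)
        if m is None:
            # No further action: the model considers the mask final.
            break
        x, y, label = int(m.group(1)), int(m.group(2)), m.group(3)
        clicks.append((x, y, label == "pos"))
        mask = segment(image, clicks)  # plug-and-play tool call (e.g., MedSAM2)
    return mask, clicks
```

Because the tool is driven purely through natural-language actions, the MLLM itself needs no architectural changes and can self-correct by placing negative clicks on over-segmented regions.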
- 🔥 Agentic Reasoning Framework. We reformulate medical image segmentation as a multi-step Markov Decision Process (MDP), enabling the model to "think" and "act" iteratively to solve complex visual grounding tasks.
- 🔥 No Implicit Tokens. IBISAgent eliminates the need for special `<SEG>` tokens and external pixel decoders, preserving the LLM's inherent text generation capabilities and ensuring better generalization.
- 🔥 Two-Stage Training Strategy.
- Cold-Start SFT: Initialized with high-quality trajectory data synthesized from automatic click simulation and self-reflection error correction.
- Agentic Reinforcement Learning: Further optimized using GRPO with novel fine-grained rewards (Region-based Click Placement, Progressive Improvement), enabling the model to discover advanced segmentation strategies beyond imitation.
- 🔥 SOTA Performance. IBISAgent significantly outperforms existing medical MLLMs on both in-domain and out-of-domain benchmarks, demonstrating superior robustness and pixel-level reasoning ability.
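The two fine-grained reward ideas above can be illustrated with a toy implementation. This is a hedged sketch of how such rewards are commonly computed, not the paper's exact formulas: the function names and the Dice-based formulation are assumptions.

```python
def dice(pred, gt):
    """Dice coefficient between two binary masks (flat lists of 0/1 pixels)."""
    inter = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0

def click_placement_reward(click_xy, gt_mask, width):
    """Region-based click placement (toy version): reward a positive click
    that lands inside the ground-truth region, penalize one outside it."""
    x, y = click_xy
    return 1.0 if gt_mask[y * width + x] else -1.0

def progressive_reward(prev_mask, new_mask, gt_mask):
    """Progressive improvement (toy version): the gain in Dice score that
    the latest click produced over the previous mask."""
    return dice(new_mask, gt_mask) - dice(prev_mask, gt_mask)
```

Dense, step-level signals like these give GRPO something to optimize at every turn of the trajectory, rather than only at the final mask.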
Please refer to our Huggingface repository for the pre-trained model weights.
- Create a new conda environment and install the required packages.
```bash
conda create -n ibisagent python=3.12
conda activate ibisagent
pip install -r infer/requirements.txt
```
- Download our RL-trained model weights to `infer/models/mllm` from here.
```bash
huggingface-cli download manglu3935/IBIS \
    --include "qwen2_5vl-7b-RL/*" \
    --local-dir infer/models/mllm \
    --local-dir-use-symlinks False
```
- Download MedSAM2 model weights to `infer/models/sam2` from here.
```bash
huggingface-cli download wanglab/MedSAM2 MedSAM2_latest.pt \
    --local-dir infer/models/sam2 \
    --local-dir-use-symlinks False
```
- Run the multi-turn inference script.
```bash
python infer/multi_turn.py \
    --image "infer/test_img.png" \
    --prompt "Can you find a liver in this image?" \
    --mllm_path "infer/models/mllm" \
    --sam2_cfg "infer/models/sam2/medsam2_cfg.yaml" \
    --sam2_ckpt "infer/models/sam2/MedSAM2_latest.pt"
```
Parameters:
| Parameter | Description | Default | Required |
|---|---|---|---|
| `--image` | Path to the input medical image | None | Yes |
| `--prompt` | User text prompt (e.g., 'Is there a colon tumor?') | None | Yes |
| `--mllm_path` | Path to the MLLM model | `infer/models/mllm` | No |
| `--sam2_cfg` | Path to the MedSAM2 config file | `infer/models/sam2/medsam2_cfg.yaml` | No |
| `--sam2_ckpt` | Path to the MedSAM2 checkpoint | `infer/models/sam2/MedSAM2_latest.pt` | No |
| `--max_turns` | Maximum number of iterations | 20 | No |
| `--use_history` | Whether to enable chat history (1 for True, 0 for False) | 0 | No |
| `--output_dir` | Directory to save results | `./outputs` | No |
- [2026/02/28] 🚀 Code and dataset release preparation.
- [2026/02/21] 🎉 IBISAgent is accepted to CVPR 2026!
- Release training scripts (SFT & RL)
- Release inference code
- Release pre-trained model weights
- Release Cold-Start and RL datasets
If you find our work helpful for your research, please consider giving one star ⭐️ and citing:
@inproceedings{jiang2026ibisagent,
title={IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation},
author={Jiang, Yankai and Li, Qiaoru and Xu, Binlu and Sun, Haoran and Ding, Chao and Dong, Junting and Cai, Yuxiang and Zhang, Xuhong and Yin, Jianwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}

