
[CVPR 2026] IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

  • Authors: Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai📧, Xuhong Zhang, Jianwei Yin
  • Institutes: Zhejiang University; Shanghai AI Laboratory
  • Resources: [📖Paper] [🤗Huggingface]

📖 Introduction

IBISAgent is a novel agentic Multimodal Large Language Model (MLLM) framework designed to address the limitations of existing medical MLLMs in fine-grained pixel-level understanding. Unlike previous approaches that rely on implicit segmentation tokens and single-pass reasoning, IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process.

By treating segmentation tools (e.g., MedSAM2) as plug-and-play modules controllable through natural language, IBISAgent iteratively generates interleaved reasoning (Thinking) and text-based click actions (Action) to progressively refine segmentation masks. This approach mimics the interactive behavior of human experts, allowing for self-correction and high-quality mask generation without requiring architectural modifications to the MLLM.
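The iterative Thinking/Action loop described above can be sketched as pseudocode. This is an illustrative sketch only: the `mllm.step` and `segmenter.refine` interfaces and the action dictionary are hypothetical stand-ins, not the repository's actual API.

```python
# Minimal sketch of IBISAgent's think-act loop (hypothetical interfaces;
# the real MLLM and MedSAM2 wrappers live in the repo's infer/ code).

def run_agent(mllm, segmenter, image, prompt, max_turns=20):
    """Alternate natural-language reasoning with click actions until done."""
    mask, history = None, []
    for _turn in range(max_turns):
        # The MLLM emits a reasoning trace plus a text-based action,
        # e.g. {"type": "click", "point": (x, y), "label": "positive"}.
        thought, action = mllm.step(image, prompt, mask, history)
        history.append((thought, action))
        if action["type"] == "stop":  # model judges the current mask final
            break
        # Clicks drive the plug-and-play segmentation tool (e.g. MedSAM2),
        # refining the mask without modifying the MLLM itself.
        mask = segmenter.refine(image, action["point"], action["label"], mask)
    return mask, history
```

Because the MLLM only emits text, any promptable segmenter that accepts point clicks can be swapped in without retraining the language model.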

IBISAgent Introduction

💡 Highlights

  • 🔥 Agentic Reasoning Framework. We reformulate medical image segmentation as a multi-step Markov Decision Process (MDP), enabling the model to "think" and "act" iteratively to solve complex visual grounding tasks.
  • 🔥 No Implicit Tokens. IBISAgent eliminates the need for special <SEG> tokens and external pixel decoders, preserving the LLM's inherent text generation capabilities and ensuring better generalization.
  • 🔥 Two-Stage Training Strategy.
    • Cold-Start SFT: Initialized with high-quality trajectory data synthesized from automatic click simulation and self-reflection error correction.
    • Agentic Reinforcement Learning: Further optimized using GRPO with novel fine-grained rewards (Region-based Click Placement, Progressive Improvement), enabling the model to discover advanced segmentation strategies beyond imitation.
  • 🔥 SOTA Performance. IBISAgent significantly outperforms existing medical MLLMs on both in-domain and out-of-domain benchmarks, demonstrating superior robustness and pixel-level reasoning ability.
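The two fine-grained rewards can be illustrated with a small sketch. These are plausible readings of the reward names, not the paper's exact formulations: a Progressive Improvement reward that is positive only when a new mask improves Dice over the previous one, and a Region-based Click Placement reward that checks whether a click lands inside the target region.

```python
import numpy as np

def progressive_improvement_reward(prev_mask, new_mask, gt_mask):
    """Positive only if the new mask raises Dice over the previous one
    (illustrative reading of the Progressive Improvement reward)."""
    def dice(a, b):
        inter = np.logical_and(a, b).sum()
        return 2.0 * inter / (a.sum() + b.sum() + 1e-8)
    return dice(new_mask, gt_mask) - dice(prev_mask, gt_mask)

def click_placement_reward(point, gt_mask):
    """+1 if a positive click lands inside the ground-truth region, -1
    otherwise (illustrative, not the paper's exact rule)."""
    x, y = point
    return 1.0 if gt_mask[y, x] else -1.0
```

Dense per-step rewards like these give GRPO a learning signal at every turn of the trajectory, rather than only at the final mask.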

IBISAgent Performance

Model Weights

Please refer to our Huggingface repository for the pre-trained model weights.

🤖 Inference

  1. Create a new conda environment and install the required packages.

     conda create -n ibisagent python=3.12
     conda activate ibisagent
     pip install -r infer/requirements.txt

  2. Download our RL-trained model weights to infer/models/mllm from here.

     huggingface-cli download manglu3935/IBIS \
         --include "qwen2_5vl-7b-RL/*" \
         --local-dir infer/models/mllm \
         --local-dir-use-symlinks False

  3. Download the MedSAM2 model weights to infer/models/sam2 from here.

     huggingface-cli download wanglab/MedSAM2 MedSAM2_latest.pt \
         --local-dir infer/models/sam2 \
         --local-dir-use-symlinks False

  4. Run the multi-turn inference script.

     python infer/multi_turn.py \
         --image "infer/test_img.png" \
         --prompt "Can you find a liver in this image?" \
         --mllm_path "infer/models/mllm" \
         --sam2_cfg "infer/models/sam2/medsam2_cfg.yaml" \
         --sam2_ckpt "infer/models/sam2/MedSAM2_latest.pt"

Parameters:

| Parameter | Description | Default | Required |
|---|---|---|---|
| --image | Path to the input medical image | None | Yes |
| --prompt | User text prompt (e.g., 'Is there a colon tumor?') | None | Yes |
| --mllm_path | Path to the MLLM model | infer/models/mllm | No |
| --sam2_cfg | Path to the MedSAM2 config file | infer/models/sam2/medsam2_cfg.yaml | No |
| --sam2_ckpt | Path to the MedSAM2 checkpoint | infer/models/sam2/MedSAM2_latest.pt | No |
| --max_turns | Maximum number of iterations | 20 | No |
| --use_history | Whether to enable chat history (1 for True, 0 for False) | 0 | No |
| --output_dir | Directory to save results | ./outputs | No |
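As a quick sanity check of the flags above, here is an argparse setup that mirrors the documented interface. It is a hypothetical mirror of multi_turn.py's CLI (the repository's actual parser may differ); the defaults are copied from the table.

```python
import argparse

# Illustrative argparse setup mirroring the documented multi_turn.py flags
# (hypothetical; defaults taken from the parameter table above).
def build_parser():
    p = argparse.ArgumentParser(description="IBISAgent multi-turn inference")
    p.add_argument("--image", required=True, help="Path to the input medical image")
    p.add_argument("--prompt", required=True, help="User text prompt")
    p.add_argument("--mllm_path", default="infer/models/mllm")
    p.add_argument("--sam2_cfg", default="infer/models/sam2/medsam2_cfg.yaml")
    p.add_argument("--sam2_ckpt", default="infer/models/sam2/MedSAM2_latest.pt")
    p.add_argument("--max_turns", type=int, default=20)
    p.add_argument("--use_history", type=int, choices=[0, 1], default=0)
    p.add_argument("--output_dir", default="./outputs")
    return p
```

Only --image and --prompt are required; everything else falls back to the default paths created by the download steps above.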

📜 News

  • [2026/02/28] 🚀 We are preparing the code and dataset release.
  • [2026/02/21] 🎉 IBISAgent is accepted to CVPR 2026!

👨‍💻 Todo

  • Release training scripts (SFT & RL)
  • Release inference code
  • Release pre-trained model weights
  • Release Cold-Start and RL datasets

✒️ Citation

If you find our work helpful for your research, please consider giving us a star ⭐️ and citing:

@inproceedings{jiang2026ibisagent,
  title={IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation},
  author={Jiang, Yankai and Li, Qiaoru and Xu, Binlu and Sun, Haoran and Ding, Chao and Dong, Junting and Cai, Yuxiang and Zhang, Xuhong and Yin, Jianwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

❤️ Acknowledgments

  • VERL: The reinforcement learning framework we built upon.
  • MedSAM2: The segmentation tool used in our agent.
