Panoptic Captioning: An Equivalence Bridge for Image and Text

NeurIPS 2025

Paper PDF arXiv Project Page

Visual AI Lab, HKU

Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han

📢 Updates

  • [2025/12/19] 🔥Released PancapChain-13B checkpoint.
  • [2025/12/12] 🔥Released the evaluation metric.
  • [2025/12/03] 🔥Released training and inference code.
  • [2025/09/18] 🎉The paper was accepted by NeurIPS'25.

🌈 Overview

TL;DR

  • A new image captioning task that seeks the minimum text equivalent of an image

  • Panoptic captioning aims to generate a comprehensive textual description of an image, encapsulating all entities, their locations and attributes, the relationships among entities, and the global image state.
  • Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) show limited performance on panoptic captioning.
  • To address this task, we propose an effective data engine, contribute a new benchmark, and develop a novel decoupling method.

Contributions

  • New task with new metric
  • New data engine and new benchmark
  • New model that beats Qwen2.5-VL-72B, InternVL-2.5-78B, and Gemini-2.0-Pro with only 13B parameters

💪 Environment

Please refer to README_env.md for environment configuration.

📚 Data Preparation

Our SA-Pancap benchmark is built on SA-1B, so you need to download the required images from SA-1B. The images we use come from the first 64 subsets of SA-1B.

  • Download the first 64 subsets of the dataset, i.e., sa_000000 ~ sa_000063. In total, these subsets contain roughly 734,243 images.
  • After downloading all of them, organize the data under your DATA_ROOT as follows (a download/extraction sketch is given after the tree):
├── sam
│   ├── sa_000000
│   ├── sa_000001
│   ├── ...
│   └── sa_000063
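
The snippet below is a minimal organization sketch, not part of the official pipeline: it assumes each subset was already downloaded as a tarball named sa_000000.tar ~ sa_000063.tar via the official SA-1B download links, and it simply extracts each one into the layout shown above. The DATA_ROOT value is a placeholder.

# Minimal sketch (assumption: the sa_XXXXXX.tar files are in the current directory)
DATA_ROOT=/path/to/DATA_ROOT        # placeholder; use your own root
mkdir -p "$DATA_ROOT/sam"
for i in $(seq -f "%06g" 0 63); do
    mkdir -p "$DATA_ROOT/sam/sa_$i"
    tar -xf "sa_$i.tar" -C "$DATA_ROOT/sam/sa_$i"
done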

🚀 Model: PancapChain

PancapChain is a simple yet effective method for panoptic captioning that follows a decoupled learning pipeline.

🚝 Training

We initialize from the pretrained ASMv2 model, so users should first download the stage2-trained checkpoint from ASMv2. Then run the training code with the following script. Modify the DATA_ROOT and SAVE_CKPT paths before running it.

bash scripts/pancapchain_train.sh
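
For illustration, the two paths might look as follows; the values are placeholders, and their exact roles are assumptions on our part — set them inside scripts/pancapchain_train.sh rather than on the command line:

# Placeholders — edit these inside scripts/pancapchain_train.sh
# DATA_ROOT=/path/to/DATA_ROOT      # root containing the sam/ folders from data preparation
# SAVE_CKPT=/path/to/save/ckpt      # assumed: output directory for trained checkpoints
bash scripts/pancapchain_train.sh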

After training finishes, you can use the following script to merge the LoRA weights. Modify the MODEL_NAME path before running it.

bash scripts_pancap/eval/merge_lora.sh
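
As a sketch (the path is a placeholder and its role is our assumption; set it inside the script):

# Placeholder — edit inside scripts_pancap/eval/merge_lora.sh
# MODEL_NAME=/path/to/save/ckpt     # assumed: the LoRA-trained checkpoint produced by training
bash scripts_pancap/eval/merge_lora.sh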

🚝 Inference

You can use the following script to do inference on the validation set. You should modify the paths of DATA_ROOT and MODEL_NAME before running the code.

bash scripts_pancap/eval/inference_pancapchain_val.sh

You can use the following script to do inference on the test set. You should modify the paths of DATA_ROOT and MODEL_NAME before running the code.

bash scripts_pancap/eval/inference_pancapchain_test.sh
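
Both inference scripts take the same two paths; as a condensed sketch (placeholder values, roles assumed — set them inside the scripts themselves):

# Placeholders — edit inside the inference scripts before running
# DATA_ROOT=/path/to/DATA_ROOT              # same root used during data preparation
# MODEL_NAME=/path/to/merged/ckpt           # assumed: the merged checkpoint from merge_lora.sh
bash scripts_pancap/eval/inference_pancapchain_val.sh     # validation set
bash scripts_pancap/eval/inference_pancapchain_test.sh    # test set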

We have released the trained PancapChain-13B checkpoint on Hugging Face. You can download the checkpoint and try it out locally.
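
If you prefer the command line, a hedged download sketch using huggingface-cli is shown below; the repository id is a placeholder, so substitute the one linked from this README:

# Placeholder repo id — replace with the actual Hugging Face repository
huggingface-cli download <org>/PancapChain-13B --local-dir ./checkpoints/PancapChain-13B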

🛸 Metric: PancapScore

PancapScore is a new metric to comprehensively evaluate the quality of generated panoptic captions. To accelerate computation, we leverage parallel processing by spawning multiple worker processes using Python's built-in multiprocessing module.

🔦 Evaluation

You can use the following script to evaluate the performance on the validation set. You should modify the paths of DATA_ROOT, SAVE_CKPT, and CACHE_PTH before running the code.

bash scripts_llmjudge/eval_sapancap_val.sh

You can use the following script to evaluate the performance on the test set. You should modify the paths of DATA_ROOT, SAVE_CKPT, and CACHE_PTH before running the code.

bash scripts_llmjudge/eval_sapancap_test.sh
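
As with inference, the three paths are placeholders set inside the evaluation scripts; their roles below are assumptions (in particular, CACHE_PTH is presumably a cache for intermediate evaluation outputs):

# Placeholders — edit inside the evaluation scripts before running
# DATA_ROOT=/path/to/DATA_ROOT            # root containing the sam/ folders
# SAVE_CKPT=/path/to/merged/ckpt          # assumed: checkpoint whose captions are evaluated
# CACHE_PTH=/path/to/cache                # assumed: cache for intermediate evaluation outputs
bash scripts_llmjudge/eval_sapancap_val.sh      # validation set
bash scripts_llmjudge/eval_sapancap_test.sh     # test set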

📌 Citation

For any questions, please contact Kun-Yu Lin. If you find this work useful, please star this repo and cite our work as follows:

@inproceedings{lin2025pancap,
    title={Panoptic Captioning: An Equivalence Bridge for Image and Text},
    author={Lin, Kun-Yu and Wang, Hongjun and Ren, Weining and Han, Kai},
    booktitle={The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
    year={2025}
}

🌟 Acknowledgements

Thanks to these great repositories: LLaVA and All-Seeing, and many other inspiring works in the community.
