Panoptic Captioning: An Equivalence Bridge for Image and Text

NeurIPS 2025

Paper PDF arXiv Project Page

Visual AI Lab, HKU

Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han

📢 Updates

  • [2025/12/19] 🔥Released PancapChain-13B checkpoint.
  • [2025/12/12] 🔥Released the evaluation metric.
  • [2025/12/03] 🔥Released training and inference code.
  • [2025/09/18] 🎉The paper was accepted by NeurIPS'25.

🌈 Overview

TL;DR

  • A new image captioning task that seeks the minimum text equivalent of an image

  • Panoptic captioning aims to generate a comprehensive textual description of an image, encapsulating all entities, their locations and attributes, the relationships among entities, and the global image state.
  • Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) show limited performance on panoptic captioning.
  • To address this task, we propose an effective data engine, contribute a new benchmark, and develop a novel decoupling method.

Contributions

  • New task with new metric
  • New data engine and new benchmark
  • New model that beats Qwen2.5-VL-72B, InternVL-2.5-78B, and Gemini-2.0-Pro with only 13B parameters

💪 Environment

Please refer to README_env.md for environment configuration.

📚 Data Preparation

Our SA-Pancap benchmark is built on SA-1B, so you need to download the required images from SA-1B. The images we use come from the first 64 subsets of SA-1B.

  • Download the first 64 subsets of the dataset, i.e., sa_000000 ~ sa_000063. In total, these subsets contain roughly 734,243 images.
  • After downloading all of them, organize the data under your DATA_ROOT as follows (a download/extraction sketch is given after the tree):
├── sam
│   ├── sa_000000
│   ├── sa_000001
│   ├── ...
│   └── sa_000063
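
The snippet below is a minimal organization sketch, not part of the official pipeline: it assumes each subset was already downloaded as a tarball named sa_000000.tar ~ sa_000063.tar via the official SA-1B download links, and it simply extracts each one into the layout shown above. The DATA_ROOT value is a placeholder.

# Minimal sketch (assumption: the sa_XXXXXX.tar files are in the current directory)
DATA_ROOT=/path/to/DATA_ROOT        # placeholder; use your own root
mkdir -p "$DATA_ROOT/sam"
for i in $(seq -f "%06g" 0 63); do
    mkdir -p "$DATA_ROOT/sam/sa_$i"
    tar -xf "sa_$i.tar" -C "$DATA_ROOT/sam/sa_$i"
done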

🚀 Model: PancapChain

PancapChain is a simple yet effective method for panoptic captioning that follows a decoupled learning pipeline.

🚝 Training

We initialize from the pretrained ASMv2 model, so users should first download the stage2-trained checkpoint from ASMv2. Then run the training code with the following script. Modify the DATA_ROOT and SAVE_CKPT paths before running it.

bash scripts/pancapchain_train.sh
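
For illustration, the two paths might look as follows; the values are placeholders, and their exact roles are assumptions on our part — set them inside scripts/pancapchain_train.sh rather than on the command line:

# Placeholders — edit these inside scripts/pancapchain_train.sh
# DATA_ROOT=/path/to/DATA_ROOT      # root containing the sam/ folders from data preparation
# SAVE_CKPT=/path/to/save/ckpt      # assumed: output directory for trained checkpoints
bash scripts/pancapchain_train.sh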

After training finishes, you can use the following script to merge the LoRA weights. Modify the MODEL_NAME path before running it.

bash scripts_pancap/eval/merge_lora.sh
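
As a sketch (the path is a placeholder and its role is our assumption; set it inside the script):

# Placeholder — edit inside scripts_pancap/eval/merge_lora.sh
# MODEL_NAME=/path/to/save/ckpt     # assumed: the LoRA-trained checkpoint produced by training
bash scripts_pancap/eval/merge_lora.sh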

🚝 Inference

You can use the following script to do inference on the validation set. You should modify the paths of DATA_ROOT and MODEL_NAME before running the code.

bash scripts_pancap/eval/inference_pancapchain_val.sh

You can use the following script to do inference on the test set. You should modify the paths of DATA_ROOT and MODEL_NAME before running the code.

bash scripts_pancap/eval/inference_pancapchain_test.sh
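
Both inference scripts take the same two paths; as a condensed sketch (placeholder values, roles assumed — set them inside the scripts themselves):

# Placeholders — edit inside the inference scripts before running
# DATA_ROOT=/path/to/DATA_ROOT              # same root used during data preparation
# MODEL_NAME=/path/to/merged/ckpt           # assumed: the merged checkpoint from merge_lora.sh
bash scripts_pancap/eval/inference_pancapchain_val.sh     # validation set
bash scripts_pancap/eval/inference_pancapchain_test.sh    # test set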

We have released the trained PancapChain-13B checkpoint on Hugging Face. You can download the checkpoint and try it out locally.
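
If you prefer the command line, a hedged download sketch using huggingface-cli is shown below; the repository id is a placeholder, so substitute the one linked from this README:

# Placeholder repo id — replace with the actual Hugging Face repository
huggingface-cli download <org>/PancapChain-13B --local-dir ./checkpoints/PancapChain-13B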

🛸 Metric: PancapScore

PancapScore is a new metric to comprehensively evaluate the quality of generated panoptic captions. To accelerate computation, we leverage parallel processing by spawning multiple worker processes using Python's built-in multiprocessing module.

🔦 Evaluation

You can use the following script to evaluate the performance on the validation set. You should modify the paths of DATA_ROOT, SAVE_CKPT, and CACHE_PTH before running the code.

bash scripts_llmjudge/eval_sapancap_val.sh

You can use the following script to evaluate the performance on the test set. You should modify the paths of DATA_ROOT, SAVE_CKPT, and CACHE_PTH before running the code.

bash scripts_llmjudge/eval_sapancap_test.sh
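
As with inference, the three paths are placeholders set inside the evaluation scripts; their roles below are assumptions (in particular, CACHE_PTH is presumably a cache for intermediate evaluation outputs):

# Placeholders — edit inside the evaluation scripts before running
# DATA_ROOT=/path/to/DATA_ROOT            # root containing the sam/ folders
# SAVE_CKPT=/path/to/merged/ckpt          # assumed: checkpoint whose captions are evaluated
# CACHE_PTH=/path/to/cache                # assumed: cache for intermediate evaluation outputs
bash scripts_llmjudge/eval_sapancap_val.sh      # validation set
bash scripts_llmjudge/eval_sapancap_test.sh     # test set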

📌 Citation

For any questions, please contact Kun-Yu Lin. If you find this work useful, please star this repo and cite our work as follows:

@inproceedings{lin2025pancap,
    title={Panoptic Captioning: An Equivalence Bridge for Image and Text},
    author={Lin, Kun-Yu and Wang, Hongjun and Ren, Weining and Han, Kai},
    booktitle={The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
    year={2025}
}

🌟 Acknowledgements

Thanks to these great repositories: LLaVA and All-Seeing, and many other inspiring works in the community.
