- [2025/12/19] 🔥Released PancapChain-13B checkpoint.
- [2025/12/12] 🔥Released the evaluation metric.
- [2025/12/03] 🔥Released training and inference code.
- [2025/09/18] 🎉The paper was accepted by NeurIPS'25.
- A new image captioning task that seeks the minimum text equivalent of an image
- Panoptic captioning aims to generate a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state.
- Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning.
- To address this task, we propose an effective data engine, contribute a new benchmark, and develop a novel decoupling method.
- New task with new metric
- New data engine and new benchmark
- New model that beats Qwen2.5-VL-72B, InternVL-2.5-78B, and Gemini-2.0-Pro with only 13B parameters
Please refer to README_env.md for environment configuration.
Our SA-Pancap benchmark is based on SA-1B, so you should download the required images from SA-1B. Our adopted images come from the first 64 subsets of SA-1B.
- Download the first 64 subsets of the dataset, i.e., sa_000000 ~ sa_000063. In total, this part consists of roughly 734,243 images.
- After downloading all of them, organize the data in a specific DATA_ROOT as follows:
├── sam
│   ├── sa_000000
│   ├── sa_000001
│   ├── ...
│   └── sa_000063
- The paths of training, validation and test images are summarized in sapancap_train_data_list.json, sapancap_val_data_list.json and sapancap_test_data_list.json.
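As a quick sanity check after organizing the data, a minimal sketch like the one below can confirm that the images referenced by a data list are present under DATA_ROOT. It assumes each entry in the JSON file stores an image path relative to DATA_ROOT (e.g., under sam/sa_000000); adjust the parsing to match the actual format of the released list files.

```python
import json
import os

DATA_ROOT = "/path/to/DATA_ROOT"  # set to your local data root

# Assumption: the data list is a JSON array of image paths relative to
# DATA_ROOT (e.g., "sam/sa_000012/sa_123456.jpg"); adapt this if the released
# files use a different structure (e.g., a list of dicts with an "image" key).
with open("sapancap_val_data_list.json") as f:
    rel_paths = json.load(f)

missing = [p for p in rel_paths if not os.path.exists(os.path.join(DATA_ROOT, p))]
print(f"{len(rel_paths)} images listed, {len(missing)} missing under {DATA_ROOT}")
```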
PancapChain is a simple yet effective method to improve panoptic captioning, following a decoupled learning pipeline.
We use the pretrained ASMv2 model as initialization, so users should first download the stage2-trained checkpoint from ASMv2. Then, use the following script to run the training code. You should modify the paths of DATA_ROOT and SAVE_CKPT before running the code.
bash scripts/pancapchain_train.sh
After finishing training, you can use the following script to merge the LoRA weights. You should modify the path of MODEL_NAME before running the code.
bash scripts_pancap/eval/merge_lora.sh
You can use the following script to do inference on the validation set. You should modify the paths of DATA_ROOT and MODEL_NAME before running the code.
bash scripts_pancap/eval/inference_pancapchain_val.sh
You can use the following script to do inference on the test set. You should modify the paths of DATA_ROOT and MODEL_NAME before running the code.
bash scripts_pancap/eval/inference_pancapchain_test.sh
We have released the trained PancapChain-13B checkpoint on Hugging Face. You can download the checkpoint and try it out locally.
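For example, the checkpoint can be fetched with the huggingface_hub library roughly as follows; the repository id below is a placeholder, so replace it with the actual repository linked above.

```python
from huggingface_hub import snapshot_download

# Download the released checkpoint into a local directory.
# NOTE: "your-org/PancapChain-13B" is a placeholder repo id; use the actual
# Hugging Face repository referenced in this README.
local_dir = snapshot_download(
    repo_id="your-org/PancapChain-13B",
    local_dir="./checkpoints/PancapChain-13B",
)
print(f"Checkpoint downloaded to {local_dir}")
```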
PancapScore is a new metric to comprehensively evaluate the quality of generated panoptic captions. To accelerate computation, we leverage parallel processing by spawning multiple worker processes using Python's built-in multiprocessing module.
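The parallel evaluation pattern is roughly the sketch below: split the (prediction, reference) caption pairs into chunks and score them across worker processes with multiprocessing.Pool. The score_one_pair function is a stand-in for the actual PancapScore judging logic implemented in the released evaluation scripts.

```python
from multiprocessing import Pool

def score_one_pair(pair):
    """Stand-in for the real PancapScore judging of one (prediction, reference) pair."""
    prediction, reference = pair
    # ... the released scripts call the LLM judge / matching logic here ...
    return 0.0  # placeholder score

def evaluate(pairs, num_workers=8):
    # Distribute caption pairs across worker processes and average the scores.
    with Pool(processes=num_workers) as pool:
        scores = pool.map(score_one_pair, pairs)
    return sum(scores) / max(len(scores), 1)

if __name__ == "__main__":
    demo_pairs = [("a generated caption", "a reference caption")]
    print(evaluate(demo_pairs, num_workers=2))
```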
You can use the following script to evaluate the performance on the validation set. You should modify the paths of DATA_ROOT, SAVE_CKPT, and CACHE_PTH before running the code.
bash scripts_llmjudge/eval_sapancap_val.sh
You can use the following script to evaluate the performance on the test set. You should modify the paths of DATA_ROOT, SAVE_CKPT, and CACHE_PTH before running the code.
bash scripts_llmjudge/eval_sapancap_test.sh
For any questions, please contact Kun-Yu Lin. If you find this work useful, please star this repo and cite our work as follows:
@inproceedings{lin2025pancap,
  title={Panoptic Captioning: An Equivalence Bridge for Image and Text},
  author={Lin, Kun-Yu and Wang, Hongjun and Ren, Weining and Han, Kai},
  booktitle={The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}
Thanks to these great repositories: LLaVA and All-Seeing, and many other inspiring works in the community.
