apple/ml-sobench

SO-Bench

SO-Bench is a benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to generate schema-compliant structured outputs grounded in visual inputs. The benchmark targets realistic agentic settings where model outputs must not only be semantically correct, but also strictly conform to predefined JSON schemas. SO-Bench spans four visual domains—UI screens, natural images, documents, and charts—and is constructed from over 6.5K diverse schemas paired with 1.8K high-quality image–schema instances, all human-verified for accuracy. We hope this benchmark serves as a catalyst for advancing research and training methods that improve visual structured output capabilities in MLLMs.
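To make "schema-compliant" concrete, here is a minimal, illustrative sketch of checking a model's output against a JSON schema. The schema and field names below are hypothetical, not taken from the benchmark; the actual evaluation logic lives in eval.py.

```python
import json

# Hypothetical example schema in the spirit of SO-Bench: the model must
# describe a UI element as a JSON object with these fields.
schema = {
    "type": "object",
    "required": ["button_label", "is_enabled"],
    "properties": {
        "button_label": {"type": "string"},
        "is_enabled": {"type": "boolean"},
    },
}

TYPE_MAP = {"string": str, "boolean": bool, "object": dict}

def is_schema_compliant(output: str, schema: dict) -> bool:
    """Minimal structural check: output must parse as JSON, contain all
    required keys, and match the declared property types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, TYPE_MAP.get(schema.get("type", "object"), dict)):
        return False
    for key in schema.get("required", []):
        if key not in data:
            return False
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], TYPE_MAP[spec["type"]]):
            return False
    return True
```

A full JSON Schema validator (e.g. the `jsonschema` package) handles far more keywords; this sketch only shows what strict conformance means at the structural level.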

Getting Started

Step 1: Download and prepare the images

Run the following command to download and prepare the datasets:

python download.py

All data will be saved to ml-sobench/data/{dataset_name}. In total, we download 10 datasets covering the full SO-Bench evaluation suite.

Step 2: Generate the evaluation file

Unzip data/labels.zip, which contains the ground-truth evaluation annotations. Then run:

python convert_to_eval_jsonl.py \
  --input_dir=data/labels \
  --output_file=data/so_bench_eval.jsonl \
  --num_threads=16

This step converts the annotations into a unified JSONL format that can be directly used for OpenAI-style model inference and evaluation.
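As a rough sketch, the resulting JSONL file can be read one record per line; the exact field names in each record are defined by convert_to_eval_jsonl.py, so treat the keys used downstream as assumptions.

```python
import json

def load_eval_records(path):
    """Read a JSONL file (one JSON object per non-empty line) into a list
    of dicts, e.g. data/so_bench_eval.jsonl."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```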

Note: We filter out 20 label entries that were used in the original paper because their images are corrupted or missing. As a result, the evaluation results produced by this codebase may differ slightly from the numbers reported in the paper.

Step 3: Run inference and report evaluation results

The evaluation pipeline is implemented in eval.py. For a complete, runnable example of the inference and evaluation workflow, please refer to demo.ipynb.
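For intuition, a scoring step on parsed outputs might look like the sketch below. This is an illustrative exact-match metric, not the benchmark's official scoring, which is implemented in eval.py.

```python
import json

def exact_match(prediction: str, ground_truth: dict) -> bool:
    """Count a prediction as correct only if it parses as JSON and
    equals the ground-truth annotation exactly."""
    try:
        pred = json.loads(prediction)
    except json.JSONDecodeError:
        return False
    return pred == ground_truth

def accuracy(predictions, ground_truths):
    """Fraction of predictions that exactly match their ground truth."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths) if ground_truths else 0.0
```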

License

This software and accompanying data and models have been released under the following licenses:

Citation

@misc{feng2025so,
      title={SO-Bench: A Structural Output Evaluation of Multimodal LLMs},
      author={Feng, Di and Ma, Kaixin and Nan, Feng and Chen, Haofeng and Zhai, Bohan and Griffiths, David and Gao, Mingfei and Gan, Zhe and Verma, Eshan and Yang, Yinfei and Chen, Zhifeng and Dehghan, Afshin},
      year={2025},
      eprint={2511.21750},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21750},
}
