Leyi Wu1,3,*, Yifan Zhao1,*, Jinjie Zhang1,*, Suzeyu Chen1,3,*, Wosong Chen1,3, Zhifei Chen1, Tianshuo Xu1, Qingchun He1, Hongxin Hu1, Haojian Huang1,3, Yangkai Wei3, Wenqian Li3, Yinchuan Li3, Ying-Cong Chen1,2,†
1HKUST(GZ), 2HKUST, 3Knowin
run.py # CLI entry point
eval/ # dataset loading, prompting, adapters, parsing, metrics, results
configs/ # release-ready YAML templates
requirements.txt # default environment
requirements-molmo.txt # Molmo-only environment, use separately
docs/ # dataset and evaluation notes
Generated outputs are written under outputs/ by default.
The dataset is intended to be released separately on Hugging Face Datasets, with Zenodo recommended for DOI archival. Download or symlink the dataset so this path exists:
data/RoboStressBench-Dataset/manifest.jsonl
Example after downloading from Hugging Face:
mkdir -p data
huggingface-cli download RoboStressBench/RoboStressBench-Dataset --repo-type dataset --local-dir data/RoboStressBench-Dataset
The evaluation code reads manifest.jsonl and per-sample records/*.json. The optional metadata.jsonl file is provided for easier dataset browsing and external scripts.
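To confirm the download landed where the evaluation code expects it, a short sanity check can help. The sketch below is not part of the repository; it only assumes the layout described above (a manifest.jsonl file and a records/ directory of JSON files) and makes no assumption about the fields inside each record. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
    return records


def summarize_dataset(root):
    """Count manifest entries and per-sample record files under the dataset root."""
    root = Path(root)
    manifest = load_jsonl(root / "manifest.jsonl")
    record_files = sorted((root / "records").glob("*.json"))
    return {"manifest_entries": len(manifest), "record_files": len(record_files)}


if __name__ == "__main__":
    root = Path("data/RoboStressBench-Dataset")
    if (root / "manifest.jsonl").is_file():
        print(summarize_dataset(root))
    else:
        print(f"manifest.jsonl not found under {root}")
```

If the script reports that the manifest is missing, check the --local-dir argument or the symlink target before running an evaluation.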
Default environment for API, Qwen-style, and InternVL-style evaluation:
conda create -n robostressbench python=3.10
conda activate robostressbench
pip install -r requirements.txt
Molmo requires a separate environment because its supported transformers version can conflict with newer checkpoints:
conda create -n robostressbench-molmo python=3.10
conda activate robostressbench-molmo
pip install -r requirements-molmo.txt
Do not install both requirement files into the same environment unless you have verified the model-specific transformers constraints yourself.
Edit configs/model_registry.yaml before local evaluation. Replace checkpoints/... paths with your actual model checkpoint directories.
API evaluation uses environment variables for credentials:
export OPENAI_API_KEY=...
export GEMINI_API_KEY=...
API example:
python run.py --config configs/api.yaml
Local checkpoint example:
python run.py --config configs/local.yaml --gpu-ids 0
The release defaults are conservative (batch_size: 1, max_workers: 1) so the first run is less likely to hit memory limits. After a smoke test passes, increase batch_size, max_workers, or --gpu-ids to match your hardware.
Molmo example, from the Molmo-specific environment:
python run.py --config configs/molmo.yaml --gpu-ids 0,1,2,3
See docs/evaluation.md for smoke-test commands and output details.
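Before launching any of the runs above, a small preflight check can catch the two most common setup problems: a missing dataset manifest and unset API credentials. This is a minimal sketch, not a feature of run.py. The preflight function is hypothetical, and which environment variables a given run needs depends on the API config you use, so the list passed in is an assumption.

```python
import os
from pathlib import Path


def preflight(dataset_root, required_env=()):
    """Return a list of human-readable setup problems (empty if none found)."""
    problems = []
    manifest = Path(dataset_root) / "manifest.jsonl"
    if not manifest.is_file():
        problems.append(f"missing dataset manifest: {manifest}")
    for name in required_env:
        # An empty string is treated the same as an unset variable.
        if not os.environ.get(name):
            problems.append(f"environment variable not set: {name}")
    return problems


if __name__ == "__main__":
    # Assumption: this run uses the OpenAI credential; adjust to your config.
    issues = preflight("data/RoboStressBench-Dataset", ["OPENAI_API_KEY"])
    for issue in issues:
        print(issue)
    if not issues:
        print("preflight passed")
```

For local checkpoint runs, pass an empty list of environment variables so that only the dataset path is checked.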
If you would like to share your evaluation results, please format them according to the instructions on the Join Evaluation page and submit them there. The public website will be updated every two weeks.
- Bounding-box ground truth uses xyxy coordinates in permille space [0, 1000].
- Placement grounding expects the model to output one point (x, y) in permille space; the prediction is correct if it lands on the ground-truth mask.
- Multiple-choice prompts are expanded with lettered options at runtime.
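The conventions above can be illustrated with a short sketch. It is not the repository's implementation: the function names are hypothetical, the rounding rule (truncate, then clamp to the last pixel) is an assumption, the mask is represented as a plain nested list of booleans, and the "A. option" prompt layout is only one plausible way to letter the choices.

```python
import string


def permille_to_pixel(value, size):
    """Map a permille coordinate in [0, 1000] to a pixel index in [0, size - 1]."""
    if not 0 <= value <= 1000:
        raise ValueError(f"permille coordinate out of range: {value}")
    # Assumption: truncate, then clamp so that 1000 maps to the last pixel.
    return min(int(value / 1000 * size), size - 1)


def bbox_permille_to_pixels(box, width, height):
    """Scale an xyxy box from permille space to pixel units (floats)."""
    x1, y1, x2, y2 = box
    return (x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height)


def point_hits_mask(point, mask):
    """Return True if a permille (x, y) point lands on a truthy mask cell."""
    height, width = len(mask), len(mask[0])
    x, y = point
    col = permille_to_pixel(x, width)
    row = permille_to_pixel(y, height)
    return bool(mask[row][col])


def format_options(question, options):
    """Append lettered options (A, B, C, ...) to a multiple-choice question."""
    letters = string.ascii_uppercase
    if len(options) > len(letters):
        raise ValueError("too many options to letter")
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


if __name__ == "__main__":
    # 4x4 mask whose bottom-right quadrant is the valid placement region.
    mask = [[r >= 2 and c >= 2 for c in range(4)] for r in range(4)]
    print(point_hits_mask((750, 750), mask))
    print(bbox_permille_to_pixels((0, 0, 500, 1000), 640, 480))
    print(format_options("Which object is closest?", ["cup", "plate"]))
```

Working in permille space keeps annotations independent of image resolution; the pixel conversion is only needed at the moment a prediction is compared against a mask or drawn on an image.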
@misc{wu2026robostressbenchbenchmarkingvlmrobustness,
title={RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes},
author={Leyi Wu and Yifan Zhao and Jinjie Zhang and Suzeyu Chen and Wosong Chen and Zhifei Chen and Tianshuo Xu and Qingchun He and Hongxin Hu and Haojian Huang and Yangkai Wei and Wenqian Li and Yinchuan Li and Ying-Cong Chen},
year={2026},
eprint={2606.00828},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.00828},
}
The evaluation framework code in this repository is released under the Apache License 2.0. The RoboStressBench dataset is released separately and remains subject to its dataset-specific terms and the licenses/terms of its source data.