RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Leyi Wu¹,³,*, Yifan Zhao¹,*, Jinjie Zhang¹,*, Suzeyu Chen¹,³,*, Wosong Chen¹,³, Zhifei Chen¹, Tianshuo Xu¹, Qingchun He¹, Hongxin Hu¹, Haojian Huang¹,³, Yangkai Wei³, Wenqian Li³, Yinchuan Li³, Ying-Cong Chen¹,²,†

¹HKUST(GZ), ²HKUST, ³Knowin

Paper | Project Page | Dataset | Join Evaluation

Repository Contents

run.py                  # CLI entry point
eval/                   # dataset loading, prompting, adapters, parsing, metrics, results
configs/                # release-ready YAML templates
requirements.txt        # default environment
requirements-molmo.txt  # Molmo-only environment, use separately
docs/                   # dataset and evaluation notes

Generated outputs are written under outputs/ by default.

Dataset

The dataset is intended to be released separately on Hugging Face Datasets, with Zenodo recommended for DOI archival. Download or symlink the dataset so this path exists:

data/RoboStressBench-Dataset/manifest.jsonl

Example after downloading from Hugging Face:

mkdir -p data
huggingface-cli download RoboStressBench/RoboStressBench-Dataset --repo-type dataset --local-dir data/RoboStressBench-Dataset

The evaluation code reads manifest.jsonl and per-sample records/*.json. The optional metadata.jsonl file is provided for easier dataset browsing and external scripts.
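To sanity-check a download before running an evaluation, the layout described above can be inspected with a few lines of Python. This is a minimal sketch, not the loader in eval/: the helper names read_jsonl and count_record_files are hypothetical, and the sketch assumes only that manifest.jsonl holds one JSON object per line and that per-sample files sit in a records/ directory next to it. It makes no assumption about the fields inside each record.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed object per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records


def count_record_files(dataset_root):
    """Count the per-sample JSON files under <dataset_root>/records/."""
    return len(list((Path(dataset_root) / "records").glob("*.json")))


if __name__ == "__main__":
    root = Path("data/RoboStressBench-Dataset")
    manifest = root / "manifest.jsonl"
    if manifest.exists():
        print(f"manifest entries: {len(read_jsonl(manifest))}")
        print(f"record files:     {count_record_files(root)}")
    else:
        print(f"missing {manifest}; download or symlink the dataset first")
```

If the manifest is missing or the counts look wrong, re-check the --local-dir path used in the download command.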

Installation

Default environment for API, Qwen-style, and InternVL-style evaluation:

conda create -n robostressbench python=3.10
conda activate robostressbench
pip install -r requirements.txt

Molmo requires a separate environment because its supported transformers version can conflict with newer checkpoints:

conda create -n robostressbench-molmo python=3.10
conda activate robostressbench-molmo
pip install -r requirements-molmo.txt

Do not install both requirement files into the same environment unless you have verified the model-specific transformers constraints yourself.

Configuration

Edit configs/model_registry.yaml before local evaluation. Replace checkpoints/... paths with your actual model checkpoint directories.

API evaluation uses environment variables for credentials:

export OPENAI_API_KEY=...
export GEMINI_API_KEY=...
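A quick pre-flight check can catch an unset credential before a long run starts. The sketch below is an illustration only: missing_credentials is a hypothetical helper, not part of run.py, and which of the two variables is actually required presumably depends on the provider selected in your config.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_credentials(names=REQUIRED_KEYS, environ=None):
    """Return the variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("unset credentials: " + ", ".join(missing))
    else:
        print("all listed credentials are set")
```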

Running

API example:

python run.py --config configs/api.yaml

Local checkpoint example:

python run.py --config configs/local.yaml --gpu-ids 0

The release defaults are conservative (batch_size: 1, max_workers: 1) so the first run is less likely to hit memory limits. After a smoke test passes, increase batch_size, max_workers, or --gpu-ids to match your hardware.

Molmo example, from the Molmo-specific environment:

python run.py --config configs/molmo.yaml --gpu-ids 0,1,2,3

See docs/evaluation.md for smoke-test commands and output details.

Submitting Results

If you would like to share your evaluation results, please format them according to the instructions on the Join Evaluation page and submit them there. The public website will be updated every two weeks.

Notes

Bounding-box ground truth uses xyxy coordinates in permille space [0, 1000].
Placement grounding expects the model to output one point (x, y) in permille space; the prediction is correct if it lands on the ground-truth mask.
Multiple-choice prompts are expanded with lettered options at runtime.
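The three conventions above can be illustrated with a short Python sketch. It is not the code in eval/: the function names are hypothetical, the mask is assumed to be a row-major grid of booleans, and the rounding rule for mapping a permille point to a mask cell as well as the "A. option" line format are assumptions made for illustration.

```python
import string

PERMILLE = 1000


def permille_box_to_pixels(box, width, height):
    """Scale an xyxy box from permille space [0, 1000] to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        x1 * width / PERMILLE,
        y1 * height / PERMILLE,
        x2 * width / PERMILLE,
        y2 * height / PERMILLE,
    )


def point_hits_mask(point, mask):
    """Return True if a permille point (x, y) lands on a True mask cell.

    mask is a list of rows, each a list of booleans (assumed layout).
    """
    x, y = point
    if not (0 <= x <= PERMILLE and 0 <= y <= PERMILLE):
        return False  # points outside permille space never count as hits
    height, width = len(mask), len(mask[0])
    # Floor to a cell index; clamp so that 1000 maps to the last cell.
    col = min(int(x * width / PERMILLE), width - 1)
    row = min(int(y * height / PERMILLE), height - 1)
    return bool(mask[row][col])


def expand_options(question, options):
    """Append lettered options (A, B, C, ...) to a question, one per line."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(permille_box_to_pixels((100, 200, 500, 1000), 640, 480))
    print(point_hits_mask((750, 250), [[False, True], [False, False]]))
    print(expand_options("Which object is closest?", ["cup", "plate"]))
```

Keeping coordinates in permille space until the last step means the same prediction can be scored against images of any resolution.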

Citation

@misc{wu2026robostressbenchbenchmarkingvlmrobustness,
      title={RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes}, 
      author={Leyi Wu and Yifan Zhao and Jinjie Zhang and Suzeyu Chen and Wosong Chen and Zhifei Chen and Tianshuo Xu and Qingchun He and Hongxin Hu and Haojian Huang and Yangkai Wei and Wenqian Li and Yinchuan Li and Ying-Cong Chen},
      year={2026},
      eprint={2606.00828},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.00828}, 
}

License

The evaluation framework code in this repository is released under the Apache License 2.0. The RoboStressBench dataset is released separately and remains subject to its dataset-specific terms and the licenses/terms of its source data.
