YUEVII/RoboStressBench

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Leyi Wu1,3,*, Yifan Zhao1,*, Jinjie Zhang1,*, Suzeyu Chen1,3,*, Wosong Chen1,3, Zhifei Chen1, Tianshuo Xu1, Qingchun He1, Hongxin Hu1, Haojian Huang1,3, Yangkai Wei3, Wenqian Li3, Yinchuan Li3, Ying-Cong Chen1,2,†

1HKUST(GZ), 2HKUST, 3Knowin

Paper     Project Page     Dataset     Join Evaluation

Repository Contents

run.py                  # CLI entry point
eval/                   # dataset loading, prompting, adapters, parsing, metrics, results
configs/                # release-ready YAML templates
requirements.txt        # default environment
requirements-molmo.txt  # Molmo-only environment, use separately
docs/                   # dataset and evaluation notes

Generated outputs are written under outputs/ by default.

Dataset

The dataset is intended to be released separately on Hugging Face Datasets, with Zenodo recommended for DOI archival. Download or symlink the dataset so this path exists:

data/RoboStressBench-Dataset/manifest.jsonl

Example after downloading from Hugging Face:

mkdir -p data
huggingface-cli download RoboStressBench/RoboStressBench-Dataset --repo-type dataset --local-dir data/RoboStressBench-Dataset

The evaluation code reads manifest.jsonl and per-sample records/*.json. The optional metadata.jsonl file is provided for easier dataset browsing and external scripts.
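Before a long run, it can help to confirm that the manifest is where the evaluation code expects it and that every line parses. The sketch below is not part of the repository: `load_manifest` is a hypothetical helper, and it assumes only that `manifest.jsonl` is standard JSON Lines (one JSON object per line). It makes no assumption about the fields inside each entry.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines.

    Hypothetical helper for sanity checks; the repository's own loader
    lives under eval/ and may behave differently.
    """
    entries = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
    return entries


if __name__ == "__main__":
    manifest = Path("data/RoboStressBench-Dataset/manifest.jsonl")
    if not manifest.is_file():
        raise SystemExit(f"missing {manifest}; download or symlink the dataset first")
    print(f"{len(load_manifest(manifest))} manifest entries")
```

Running this from the repository root should print the entry count if the download or symlink step succeeded.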

Installation

Default environment for API, Qwen-style, and InternVL-style evaluation:

conda create -n robostressbench python=3.10
conda activate robostressbench
pip install -r requirements.txt

Molmo requires a separate environment because the transformers version it supports can conflict with the version that newer checkpoints require:

conda create -n robostressbench-molmo python=3.10
conda activate robostressbench-molmo
pip install -r requirements-molmo.txt

Do not install both requirement files into the same environment unless you have verified the model-specific transformers constraints yourself.

Configuration

Edit configs/model_registry.yaml before local evaluation. Replace checkpoints/... paths with your actual model checkpoint directories.

API evaluation uses environment variables for credentials:

export OPENAI_API_KEY=...
export GEMINI_API_KEY=...
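A missing credential usually surfaces only after the first API call fails, so a quick pre-flight check can save a wasted start. The sketch below is illustrative and not part of the repository: `missing_keys` is a hypothetical helper, and it assumes that only the variables shown above are needed. A given API config may need only one of them.

```python
import os

# Variable names taken from the export lines above; adjust to the provider you use.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_keys(names=REQUIRED_KEYS, env=None):
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("unset credentials:", ", ".join(missing))
    else:
        print("all listed credentials are set")
```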

Running

API example:

python run.py --config configs/api.yaml

Local checkpoint example:

python run.py --config configs/local.yaml --gpu-ids 0

The release defaults are conservative (batch_size: 1, max_workers: 1) so the first run is less likely to hit memory limits. After a smoke test passes, increase batch_size, max_workers, or --gpu-ids to match your hardware.

Molmo example, from the Molmo-specific environment:

python run.py --config configs/molmo.yaml --gpu-ids 0,1,2,3

See docs/evaluation.md for smoke-test commands and output details.

Submitting Results

If you would like to share your evaluation results, please format them according to the instructions on the Join Evaluation page and submit them there. The public website will be updated every two weeks.

Notes

  • Bounding-box ground truth uses xyxy coordinates in permille space [0, 1000].
  • Placement grounding expects the model to output one point (x, y) in permille space; the prediction is correct if it lands on the ground-truth mask.
  • Multiple-choice prompts are expanded with lettered options at runtime.
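The conventions above can be illustrated with a few small helpers. This is a minimal sketch under stated assumptions, not the repository's implementation. All three function names are hypothetical. It assumes x scales with image width and y with image height, and that a point maps to a mask pixel by flooring and clamping to the last row or column. The "A. option" line format is also assumed. The actual parsing and metric code is under eval/.

```python
def permille_box_to_pixels(box, width, height):
    """Convert an xyxy box in permille space [0, 1000] to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        x1 * width / 1000,
        y1 * height / 1000,
        x2 * width / 1000,
        y2 * height / 1000,
    )


def point_hits_mask(point, mask):
    """Check whether a permille point (x, y) lands on a binary mask.

    mask is a list of rows (height x width) of 0/1 values. Points outside
    [0, 1000] count as misses (an assumption of this sketch).
    """
    x, y = point
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return False
    height, width = len(mask), len(mask[0])
    col = min(int(x * width / 1000), width - 1)   # clamp x == 1000 to last column
    row = min(int(y * height / 1000), height - 1)  # clamp y == 1000 to last row
    return bool(mask[row][col])


def letter_options(question, options):
    """Append lettered options (A, B, C, ...) to a question, one per line."""
    lines = [question]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(permille_box_to_pixels((100, 200, 500, 1000), width=640, height=480))
    print(point_hits_mask((750, 750), [[0, 0], [0, 1]]))
    print(letter_options("Which object is graspable?", ["cup", "wall"]))
```

Because permille coordinates do not depend on resolution, the same prediction can be scored against a mask of any size without rescaling the model output.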

Citation

@misc{wu2026robostressbenchbenchmarkingvlmrobustness,
      title={RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes}, 
      author={Leyi Wu and Yifan Zhao and Jinjie Zhang and Suzeyu Chen and Wosong Chen and Zhifei Chen and Tianshuo Xu and Qingchun He and Hongxin Hu and Haojian Huang and Yangkai Wei and Wenqian Li and Yinchuan Li and Ying-Cong Chen},
      year={2026},
      eprint={2606.00828},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.00828}, 
}

License

The evaluation framework code in this repository is released under the Apache License 2.0. The RoboStressBench dataset is released separately and remains subject to its dataset-specific terms and the licenses/terms of its source data.

About

Official implementation of "RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes"
