RoboSemanticBench (RSB) is an embodied benchmark for diagnosing semantic grounding in action prediction for vision-language-action (VLA) models.
In each episode, the robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must pick the physical block corresponding to the correct answer. The manipulation primitive is intentionally simple, while the semantic decision is non-trivial. This makes RSB useful for separating low-level grasping ability from whether a VLA policy actually uses instruction semantics to select the correct target.
This repository is built on top of RoboTwin 2.0 and keeps the same simulation, expert-trajectory, data-collection, and policy-evaluation workflow.
- Six embodied semantic-answering suites covering controlled arithmetic, GSM8K-style word problems, and general QA.
- Both 4-choice and 10-choice variants, where the 10-choice suites reduce chance accuracy from 25% to 10%. The 10-choice suites use labels A-I and K, skipping J to avoid visual ambiguity with I.
- A fixed tabletop pick-and-place primitive: solve the question, bind the correct option to a visible block, then place that block in the answer zone.
- Diagnostic metrics that report both task success and grasp success, exposing cases where a policy can grasp blocks but selects the wrong semantic target.
- Built-in scripts for sharded data collection, shard merging, trajectory reuse, and simulation evaluation.
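The labeling rule for the 10-choice suites can be illustrated with a short sketch. The helper names `choice_labels` and `chance_accuracy` are hypothetical and are not taken from the RSB code; the sketch also assumes that the 4-choice suites use the first four letters, which the description above does not state explicitly.

```python
import string


def choice_labels(num_choices: int, skipped: str = "J") -> list[str]:
    """Return the first `num_choices` uppercase labels, omitting `skipped`.

    Assumption: labels are consecutive uppercase letters with J left out,
    matching the "A-I and K" scheme described for the 10-choice suites.
    """
    letters = [c for c in string.ascii_uppercase if c not in skipped]
    if not 1 <= num_choices <= len(letters):
        raise ValueError(f"num_choices must be in 1..{len(letters)}")
    return letters[:num_choices]


def chance_accuracy(num_choices: int) -> float:
    """Accuracy expected from picking one of `num_choices` blocks uniformly."""
    return 1.0 / num_choices


if __name__ == "__main__":
    print(choice_labels(10))
    print(chance_accuracy(4), chance_accuracy(10))
```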
| Suite | Task name | Choices | Semantic source | Train config | Eval config |
|---|---|---|---|---|---|
| RSB-Math-4 | rsb_math | 4 | Procedural arithmetic | rsb_math_train_500 | rsb_math_train_500 |
| RSB-Math-10 | rsb_math_10blocks | 10 | Procedural arithmetic | rsb_math_10blocks_train_500 | rsb_math_10blocks_train_500 |
| RSB-HardMath-4 | rsb_hardmath | 4 | GSM8K | rsb_hardmath_train_7473 | rsb_hardmath_train_700 |
| RSB-HardMath-10 | rsb_hardmath_10blocks | 10 | GSM8K | rsb_hardmath_10blocks_train_7473 | rsb_hardmath_10blocks_train_700 |
| RSB-General-4 | rsb_general | 4 | MMLU-style QA | rsb_general_train_10k | rsb_general_test_500 |
| RSB-General-10 | rsb_general_10blocks | 10 | MMLU-style QA | rsb_general_10blocks_train_10k | rsb_general_10blocks_test_500 |
Training-set sizes follow the paper setup:
| Subset | Source | Choices | Train questions |
|---|---|---|---|
| RSB-Math-4 | procedural arithmetic | 4 | 500 |
| RSB-Math-10 | procedural arithmetic | 10 | 500 |
| RSB-HardMath-4 | GSM8K | 4 | 7,473 |
| RSB-HardMath-10 | GSM8K | 10 | 7,473 |
| RSB-General-4 | MMLU-style QA | 4 | 10,000 |
| RSB-General-10 | MMLU-style QA | 10 | 10,000 |
For evaluation, the paper protocol uses 500 simulation episodes per suite with held-out semantic questions.
envs/rsb_*.py # RSB task environments
description/task_instruction/rsb_*.json # instruction templates
task_config/rsb_*.yml # collection/evaluation configs
gsm8k/ # GSM8K preprocessing utilities
mmluqa2/ # MMLU-style QA preprocessing utilities
script/collect_data.py # trajectory collection entry
script/merge_collect_data_shards.py # shard merge utility
script/build_* # dataset construction/reuse utilities
script/eval_policy.py # simulation evaluation entry
Follow the original RoboTwin installation instructions first, including simulator assets and the Python environment. RSB uses the same simulator stack, SAPIEN runtime, cameras, and Aloha-AgileX embodiment configuration.
After installing RoboTwin dependencies, install or update paths from the repository root:
cd RoboSemanticBench
bash script/_install.sh
If assets are not available yet, use the RoboTwin asset download path or the helper script:
bash script/_download_assets.sh
RSB task code expects the semantic QA files below when collecting or evaluating the HardMath and General suites:
gsm8k/data/train.json
gsm8k/data/test.json
mmluqa2/data/train.json
mmluqa2/data/test.json
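Before starting a long collection or evaluation run, it can help to confirm that these files are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_qa_files` is hypothetical, and it assumes the paths above are relative to the repository root.

```python
from pathlib import Path

# Paths copied from the list above; assumed to be relative to the repo root.
REQUIRED_QA_FILES = (
    "gsm8k/data/train.json",
    "gsm8k/data/test.json",
    "mmluqa2/data/train.json",
    "mmluqa2/data/test.json",
)


def missing_qa_files(repo_root: str = ".") -> list[str]:
    """Return the required QA files that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_QA_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_qa_files(".")
    if missing:
        print("Missing semantic QA files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All semantic QA files are present.")
```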
The repository includes preprocessing utilities:
python gsm8k/convert_to_json.py
python gsm8k/generate_distractors.py
python mmluqa2/generate_dataset.py
Check the license terms of upstream datasets before redistributing generated JSON files.
If you do not want to run the simulation data-collection pipeline yourself, we provide pre-collected training data for part of the benchmark on Hugging Face:
| Suite | Task name | Format | Download |
|---|---|---|---|
| RSB-Math-4 | rsb_math | RoboTwin format | HuggingFace |
| RSB-Math-10 | rsb_math_10blocks | RoboTwin format | HuggingFace |
More suites will be released gradually. We also plan to release LeRobot-format versions to make the datasets easier to use for policy training.
Basic collection format:
bash collect_data.sh <task_name> <task_config> <gpu_id> [extra_args]
Examples:
# RSB-Math-4
bash collect_data.sh rsb_math rsb_math_train_500 0 --skip_instructions
# RSB-HardMath-10
bash collect_data.sh rsb_hardmath_10blocks rsb_hardmath_10blocks_train_7473 0 --skip_instructions
# RSB-General-4, 2500 fresh trajectories
bash collect_data.sh rsb_general rsb_general_train_2500 0 --skip_instructions
Collected data is saved under:
data/<task_name>/<task_config>/
Each setting contains simulator trajectories, HDF5 files, videos, instructions, scene_info.json, and seed metadata.
For long-running datasets, collect shards in parallel and merge them:
bash collect_data.sh rsb_general rsb_general_train_2500 0 \
--episode_start 0 \
--episode_end 625 \
--output_setting rsb_general_train_2500_shard_0_625 \
--skip_instructions
python script/merge_collect_data_shards.py \
rsb_general \
rsb_general_train_2500 \
rsb_general_train_2500_shard_0_625 \
rsb_general_train_2500_shard_625_1250 \
rsb_general_train_2500_shard_1250_1875 \
rsb_general_train_2500_shard_1875_2500
Convenience scripts are provided for common sharded workflows:
bash script/collect_rsb_hardmath_10blocks_4gpu.sh
bash script/collect_rsb_general_2500_to_10k_4gpu.sh
bash script/collect_rsb_general_10blocks_2500_to_10k_4gpu.sh
bash script/collect_rsb_general_test_500_4gpu.sh
bash script/collect_rsb_general_10blocks_test_500_4gpu.sh
RSB-General can collect 2,500 fresh trajectories and expand them to 10,000 semantic samples by reusing the robot motion while rewriting the QA fields:
python script/build_general_answer_reuse_trajectories.py \
--task_name rsb_general \
--source_setting rsb_general_train_2500 \
--target_setting rsb_general_train_10k \
--overwrite
The 10-choice version is analogous:
python script/build_general_answer_reuse_trajectories.py \
--task_name rsb_general_10blocks \
--source_setting rsb_general_10blocks_train_2500 \
--target_setting rsb_general_10blocks_train_10k \
--overwrite
RSB uses RoboTwin's policy interface. A policy must expose:
get_model(args)
eval(task_env, model, observation)
reset_model(model)
Example:
python script/eval_policy.py --config policy/Your_Policy/deploy_policy.yml \
--overrides \
--task_name rsb_general_10blocks \
--task_config rsb_general_10blocks_test_500 \
--ckpt_setting your_checkpoint_name \
--seed 0 \
--policy_name Your_Policy
Evaluation results are written to:
eval_result/<task_name>/<policy_name>/<task_config>/<ckpt_setting>/<timestamp>/
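For orientation, a policy module exposing the three hooks named above could look roughly like the sketch below. It is a placeholder rather than the RSB or RoboTwin implementation: the `ConstantPolicy` class, the `action_dim` key, its default of 14, and the `take_action` method on the environment are all assumptions made for illustration, so check them against your RoboTwin checkout before relying on them.

```python
class ConstantPolicy:
    """Placeholder model that always emits the same action vector."""

    def __init__(self, action_dim: int):
        self.action_dim = action_dim
        self.steps = 0  # number of actions produced since the last reset

    def predict(self, observation) -> list[float]:
        # A real policy would use the observation and the instruction here.
        self.steps += 1
        return [0.0] * self.action_dim

    def reset(self) -> None:
        self.steps = 0


def get_model(args):
    """Build the model. `args` is assumed to be dict-like; `action_dim`
    is a hypothetical key, not a documented RoboTwin setting."""
    return ConstantPolicy(action_dim=int(args.get("action_dim", 14)))


def eval(task_env, model, observation):
    """Run one control step. The name shadows Python's builtin `eval`
    inside this module because the interface requires it."""
    action = model.predict(observation)
    task_env.take_action(action)  # assumed environment method


def reset_model(model):
    """Clear per-episode state before the next evaluation episode."""
    model.reset()
```

Keeping per-episode state inside the model object and clearing it in `reset_model` avoids leaking history from one question into the next episode.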
RSB reports three diagnostic metrics:
- Task Success Rate (TSR): the episode succeeds only when the robot selects the semantically correct answer block and places it in the answer zone.
- Grasp Success Rate (GSR): the robot successfully grasps any candidate answer block, regardless of whether it is correct.
- Normalized Semantic Grounding (nSG): semantic target selection conditioned on successful grasping:
nSG = ((TSR / GSR) - (1 / N)) / (1 - (1 / N))
where N is the number of choices. nSG = 0 corresponds to random target selection among grasped candidates; positive values indicate better-than-random semantic grounding; negative values indicate worse-than-random selection.
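The formula above translates directly into a small helper. This is a sketch, not the repository's metric code: the function name is hypothetical, and raising an error when GSR is zero is an assumption, since the text does not say how that case is reported.

```python
def normalized_semantic_grounding(tsr: float, gsr: float, num_choices: int) -> float:
    """Compute nSG = ((TSR / GSR) - 1/N) / (1 - 1/N).

    `tsr` and `gsr` are rates in [0, 1]; `num_choices` is N.
    Assumption: nSG is treated as undefined (an error) when GSR is zero.
    """
    if num_choices < 2:
        raise ValueError("num_choices must be at least 2")
    if gsr <= 0.0:
        raise ValueError("nSG is undefined when GSR is zero")
    chance = 1.0 / num_choices
    return (tsr / gsr - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the scale.
    print(normalized_semantic_grounding(tsr=0.20, gsr=0.80, num_choices=4))
    print(normalized_semantic_grounding(tsr=0.55, gsr=1.00, num_choices=10))
```

With these illustrative inputs, a 4-choice policy with TSR = 0.20 and GSR = 0.80 has TSR / GSR = 0.25, which equals chance, so nSG is 0. A 10-choice policy with TSR = 0.55 and GSR = 1.00 gives (0.55 - 0.1) / 0.9 = 0.5.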
GSR intentionally does not count empty gripper closure. A grasp is counted only when the gripper contacts a candidate block, at least one gripper is closed, and the contacted block is either lifted from the tabletop or remains in stable gripper contact for eight consecutive simulation steps.
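One way to read this criterion is as a check over per-step records for a single candidate block. The sketch below is an interpretation under stated assumptions, not the benchmark's implementation: the `StepRecord` fields and the function name are hypothetical, and "stable gripper contact" is modeled as contact with a closed gripper persisting without interruption.

```python
from dataclasses import dataclass
from typing import Iterable

STABLE_CONTACT_STEPS = 8  # "eight consecutive simulation steps" from the text


@dataclass(frozen=True)
class StepRecord:
    """Hypothetical per-step flags for one candidate block."""
    block_in_contact: bool  # gripper touches the candidate block
    gripper_closed: bool    # at least one gripper is closed
    block_lifted: bool      # block is off the tabletop


def is_grasp_success(steps: Iterable[StepRecord]) -> bool:
    """Return True if the records satisfy the grasp criterion as read above."""
    run = 0  # length of the current uninterrupted holding streak
    for step in steps:
        holding = step.block_in_contact and step.gripper_closed
        if not holding:
            run = 0  # an empty closure or lost contact breaks the streak
            continue
        if step.block_lifted:
            return True
        run += 1
        if run >= STABLE_CONTACT_STEPS:
            return True
    return False
```

Under this reading, closing the gripper on nothing never counts, because every qualifying step requires contact with the block.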
RSB is designed to expose the gap between motor execution and semantic target selection:
- High GSR and high TSR: the policy can both manipulate and ground the semantic answer.
- High GSR and low TSR: the policy can grasp candidate blocks, but does not reliably select the correct semantic target.
- Low GSR and low TSR: the policy has low-level control or grasping failures.
The paper finds that many evaluated VLA policies achieve substantial grasp success while remaining near random or below random on semantic target selection after normalizing for grasp success.
If you use RoboSemanticBench, please cite the paper:
@misc{RoboSemanticBench,
title={RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models},
author={Bin Yu and Yao Zhang and Haishan Liu and Shijie Lian and Yuliang Wei and Xiaopeng Lin and Zhaolong Shen and Changti Wu and Ruina Hu and Bailing Wang and Cong Huang and Kai Chen},
year={2026},
eprint={2606.02277},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2606.02277},
}
RoboSemanticBench is built on RoboTwin and uses its simulator, embodiment configuration, expert-trajectory pipeline, and policy evaluation interface.