🤖 RoboSemanticBench

RoboSemanticBench (RSB) is an embodied benchmark for diagnosing semantic grounding in action prediction for vision-language-action (VLA) models.

In each episode, the robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must pick the physical block corresponding to the correct answer. The manipulation primitive is intentionally simple, while the semantic decision is non-trivial. This makes RSB useful for separating low-level grasping ability from whether a VLA policy actually uses instruction semantics to select the correct target.

This repository is built on top of RoboTwin 2.0 and keeps the same simulation, expert-trajectory, data-collection, and policy-evaluation workflow.

✨ Highlights

🧠 Six embodied semantic-answering suites covering controlled arithmetic, GSM8K-style word problems, and general QA.
🔢 Both 4-choice and 10-choice variants, where the 10-choice suites reduce chance accuracy from 25% to 10%. The 10-choice suites use labels A-I and K, skipping J to avoid visual ambiguity with I.
🦾 A fixed tabletop pick-and-place primitive: solve the question, bind the correct option to a visible block, then place that block in the answer zone.
📈 Diagnostic metrics that report both task success and grasp success, exposing cases where a policy can grasp blocks but selects the wrong semantic target.
🛠️ Built-in scripts for sharded data collection, shard merging, trajectory reuse, and simulation evaluation.
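
To make the label scheme and chance levels above concrete, here is a minimal sketch. The helper names are hypothetical and not part of the repository, and it assumes the 4-choice suites use labels A-D, which the text above does not state explicitly.

```python
import string


def choice_labels(num_choices: int) -> list[str]:
    """Return block labels in alphabetical order, skipping 'J'.

    Skipping 'J' follows the 10-choice description (A-I and K).
    Using A-D for the 4-choice suites is an assumption of this sketch.
    """
    letters = [c for c in string.ascii_uppercase if c != "J"]
    return letters[:num_choices]


def chance_accuracy(num_choices: int) -> float:
    """Accuracy of uniformly random selection among the candidate blocks."""
    return 1.0 / num_choices


if __name__ == "__main__":
    for n in (4, 10):
        print(n, choice_labels(n), f"chance={chance_accuracy(n):.0%}")
```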

🧩 Benchmark Suites

| Suite | Task name | Choices | Semantic source | Train config | Eval config |
| --- | --- | --- | --- | --- | --- |
| RSB-Math-4 | `rsb_math` | 4 | Procedural arithmetic | `rsb_math_train_500` | `rsb_math_train_500` |
| RSB-Math-10 | `rsb_math_10blocks` | 10 | Procedural arithmetic | `rsb_math_10blocks_train_500` | `rsb_math_10blocks_train_500` |
| RSB-HardMath-4 | `rsb_hardmath` | 4 | GSM8K | `rsb_hardmath_train_7473` | `rsb_hardmath_train_700` |
| RSB-HardMath-10 | `rsb_hardmath_10blocks` | 10 | GSM8K | `rsb_hardmath_10blocks_train_7473` | `rsb_hardmath_10blocks_train_700` |
| RSB-General-4 | `rsb_general` | 4 | MMLU-style QA | `rsb_general_train_10k` | `rsb_general_test_500` |
| RSB-General-10 | `rsb_general_10blocks` | 10 | MMLU-style QA | `rsb_general_10blocks_train_10k` | `rsb_general_10blocks_test_500` |

Training-set sizes follow the paper setup:

| Subset | Source | Choices | Train questions |
| --- | --- | --- | --- |
| RSB-Math-4 | Procedural arithmetic | 4 | 500 |
| RSB-Math-10 | Procedural arithmetic | 10 | 500 |
| RSB-HardMath-4 | GSM8K | 4 | 7,473 |
| RSB-HardMath-10 | GSM8K | 10 | 7,473 |
| RSB-General-4 | MMLU-style QA | 4 | 10,000 |
| RSB-General-10 | MMLU-style QA | 10 | 10,000 |

For evaluation, the paper protocol uses 500 simulation episodes per suite with held-out semantic questions.

📁 Repository Layout

envs/rsb_*.py                           # RSB task environments
description/task_instruction/rsb_*.json  # instruction templates
task_config/rsb_*.yml                    # collection/evaluation configs
gsm8k/                                   # GSM8K preprocessing utilities
mmluqa2/                                 # MMLU-style QA preprocessing utilities
script/collect_data.py                   # trajectory collection entry
script/merge_collect_data_shards.py      # shard merge utility
script/build_*                           # dataset construction/reuse utilities
script/eval_policy.py                    # simulation evaluation entry

⚙️ Installation

Follow the original RoboTwin installation instructions first, including simulator assets and the Python environment. RSB uses the same simulator stack, SAPIEN runtime, cameras, and Aloha-AgileX embodiment configuration.

After installing the RoboTwin dependencies, run the install script from the repository root to finish installation and update paths:

cd RoboSemanticBench
bash script/_install.sh

If the assets are not available yet, download them through RoboTwin's asset download procedure or with the helper script:

bash script/_download_assets.sh

📚 Preparing Semantic Sources

The RSB task code expects the semantic QA files below when collecting or evaluating the HardMath and General suites:

gsm8k/data/train.json
gsm8k/data/test.json
mmluqa2/data/train.json
mmluqa2/data/test.json

The repository includes preprocessing utilities:

python gsm8k/convert_to_json.py
python gsm8k/generate_distractors.py
python mmluqa2/generate_dataset.py

Check the license terms of upstream datasets before redistributing generated JSON files.
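
Before launching a long HardMath or General collection run, it may help to confirm that the four JSON files listed above are in place. The following is a minimal sketch; the function name is hypothetical and not part of the repository, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path

# Paths taken from the list above, relative to the repository root.
REQUIRED_FILES = [
    "gsm8k/data/train.json",
    "gsm8k/data/test.json",
    "mmluqa2/data/train.json",
    "mmluqa2/data/test.json",
]


def missing_semantic_files(repo_root: str = ".") -> list[str]:
    """Return the required semantic QA files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_semantic_files()
    if missing:
        print("Missing semantic source files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All semantic source files are present.")
```

If any file is reported missing, the preprocessing utilities above are the place to start.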

📦 Pre-collected Datasets

If you do not want to run the simulation data-collection pipeline yourself, we provide pre-collected training data for part of the benchmark on Hugging Face:

| Suite | Task name | Format | Download |
| --- | --- | --- | --- |
| RSB-Math-4 | `rsb_math` | RoboTwin format | HuggingFace |
| RSB-Math-10 | `rsb_math_10blocks` | RoboTwin format | HuggingFace |
More suites will be released gradually. We also plan to release LeRobot-format versions to make the datasets easier to use for policy training.

🎬 Data Collection

Basic collection format:

bash collect_data.sh <task_name> <task_config> <gpu_id> [extra_args]

Examples:

# RSB-Math-4
bash collect_data.sh rsb_math rsb_math_train_500 0 --skip_instructions

# RSB-HardMath-10
bash collect_data.sh rsb_hardmath_10blocks rsb_hardmath_10blocks_train_7473 0 --skip_instructions

# RSB-General-4, 2500 fresh trajectories
bash collect_data.sh rsb_general rsb_general_train_2500 0 --skip_instructions

Collected data is saved under:

data/<task_name>/<task_config>/

Each setting contains simulator trajectories, HDF5 files, videos, instructions, scene_info.json, and seed metadata.

⚡ Sharded Collection

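For datasets that take a long time to collect, collect shards in parallel and then merge them. The first command below collects one of four shards (episodes 0 to 625); the remaining shards are launched the same way with their own episode ranges and output settings. The second command merges all four shards: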

bash collect_data.sh rsb_general rsb_general_train_2500 0 \
  --episode_start 0 \
  --episode_end 625 \
  --output_setting rsb_general_train_2500_shard_0_625 \
  --skip_instructions

python script/merge_collect_data_shards.py \
  rsb_general \
  rsb_general_train_2500 \
  rsb_general_train_2500_shard_0_625 \
  rsb_general_train_2500_shard_625_1250 \
  rsb_general_train_2500_shard_1250_1875 \
  rsb_general_train_2500_shard_1875_2500
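
The shard boundaries and output-setting names used above can be generated rather than typed by hand. The sketch below is illustrative only: the helper names are hypothetical, it assumes `--episode_end` is exclusive (as the shard names `0_625`, `625_1250`, and so on suggest), and it assumes one GPU per shard with IDs starting at 0.

```python
def shard_ranges(total_episodes: int, num_shards: int) -> list[tuple[int, int]]:
    """Split [0, total_episodes) into contiguous half-open (start, end) ranges."""
    base, extra = divmod(total_episodes, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional episode each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def shard_setting(task_config: str, start: int, end: int) -> str:
    """Build an output-setting name following the naming pattern shown above."""
    return f"{task_config}_shard_{start}_{end}"


if __name__ == "__main__":
    task_name = "rsb_general"
    task_config = "rsb_general_train_2500"
    # Assumption: shard i runs on GPU i.
    for gpu_id, (start, end) in enumerate(shard_ranges(2500, 4)):
        print(
            f"bash collect_data.sh {task_name} {task_config} {gpu_id} "
            f"--episode_start {start} --episode_end {end} "
            f"--output_setting {shard_setting(task_config, start, end)} "
            f"--skip_instructions"
        )
```

Running it prints one collection command per shard; the generated setting names match the arguments passed to the merge script above.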

Convenience scripts are provided for common sharded workflows:

bash script/collect_rsb_hardmath_10blocks_4gpu.sh
bash script/collect_rsb_general_2500_to_10k_4gpu.sh
bash script/collect_rsb_general_10blocks_2500_to_10k_4gpu.sh
bash script/collect_rsb_general_test_500_4gpu.sh
bash script/collect_rsb_general_10blocks_test_500_4gpu.sh

🔁 Reusing General Trajectories

For RSB-General, you can collect 2,500 fresh trajectories and expand them to 10,000 semantic samples by reusing the robot motion while rewriting the QA fields:

python script/build_general_answer_reuse_trajectories.py \
  --task_name rsb_general \
  --source_setting rsb_general_train_2500 \
  --target_setting rsb_general_train_10k \
  --overwrite

The 10-choice version is analogous:

python script/build_general_answer_reuse_trajectories.py \
  --task_name rsb_general_10blocks \
  --source_setting rsb_general_10blocks_train_2500 \
  --target_setting rsb_general_10blocks_train_10k \
  --overwrite

🧪 Evaluation

RSB uses RoboTwin's policy interface. A policy must expose:

get_model(args)
eval(task_env, model, observation)
reset_model(model)

Example:

python script/eval_policy.py --config policy/Your_Policy/deploy_policy.yml \
  --overrides \
  --task_name rsb_general_10blocks \
  --task_config rsb_general_10blocks_test_500 \
  --ckpt_setting your_checkpoint_name \
  --seed 0 \
  --policy_name Your_Policy

Evaluation results are written to:

eval_result/<task_name>/<policy_name>/<task_config>/<ckpt_setting>/<timestamp>/
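
As a rough illustration of the three-function policy interface listed earlier in this section, the sketch below wires a random-action stand-in into `get_model`, `eval`, and `reset_model`. Everything beyond those three function names is an assumption made for illustration: `RandomPolicy`, `DummyEnv`, the `take_action` method, the dict-style `args`, and the 14-dimensional action are hypothetical, so check the RoboTwin policy documentation for the actual environment API.

```python
import random


class RandomPolicy:
    """Hypothetical stand-in for a trained VLA model."""

    def __init__(self, action_dim: int, seed: int):
        self.action_dim = action_dim
        self.seed = seed
        self.rng = random.Random(seed)

    def act(self, observation) -> list[float]:
        # A real policy would condition on images and the instruction here.
        return [self.rng.uniform(-1.0, 1.0) for _ in range(self.action_dim)]


class DummyEnv:
    """Hypothetical environment stub that only records the actions it receives."""

    def __init__(self):
        self.actions = []

    def take_action(self, action):  # assumed method name
        self.actions.append(action)


def get_model(args):
    """Build the model from deployment arguments (a dict in this sketch)."""
    return RandomPolicy(action_dim=args.get("action_dim", 14), seed=args.get("seed", 0))


def eval(task_env, model, observation):  # name required by the interface; shadows the builtin
    """Run one control step: predict an action and send it to the environment."""
    action = model.act(observation)
    task_env.take_action(action)


def reset_model(model):
    """Clear per-episode state; here that means re-seeding the random generator."""
    model.rng = random.Random(model.seed)
```

A real `deploy_policy.py` would replace `RandomPolicy` with checkpoint loading and inference, while keeping the same three entry points.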

📊 Metrics

RSB reports three diagnostic metrics:

🎯 Task Success Rate (TSR): the episode succeeds only when the robot selects the semantically correct answer block and places it in the answer zone.
✋ Grasp Success Rate (GSR): the robot successfully grasps any candidate answer block, regardless of whether it is correct.
🧭 Normalized Semantic Grounding (nSG): semantic target selection conditioned on successful grasping:

nSG = ((TSR / GSR) - (1 / N)) / (1 - (1 / N))

where N is the number of choices. nSG = 0 corresponds to random target selection among grasped candidates; positive values indicate better-than-random semantic grounding; negative values indicate worse-than-random selection.
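
The formula above can be written as a small helper. This is a sketch rather than the repository's implementation: the function name is hypothetical, and returning `None` when GSR is zero is a choice made here because the ratio TSR / GSR is undefined in that case.

```python
from typing import Optional


def normalized_semantic_grounding(tsr: float, gsr: float, num_choices: int) -> Optional[float]:
    """Compute nSG = ((TSR / GSR) - 1/N) / (1 - 1/N).

    tsr and gsr are rates in [0, 1]; num_choices is N.
    Returns None when gsr is 0 (assumption of this sketch).
    """
    if gsr <= 0:
        return None
    chance = 1.0 / num_choices
    return ((tsr / gsr) - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Selection at chance level among grasped blocks (4 choices).
    print(f"{normalized_semantic_grounding(0.25, 1.0, 4):.3f}")  # 0.000
    # Better than chance: 75% of grasps hit the correct block.
    print(f"{normalized_semantic_grounding(0.60, 0.80, 4):.3f}")  # 0.667
    # Worse than chance: only 12.5% of grasps hit the correct block.
    print(f"{normalized_semantic_grounding(0.10, 0.80, 4):.3f}")  # -0.167
```

The second case shows why normalization matters: a raw TSR of 60% understates grounding when one in five episodes fails at the grasp stage.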

GSR intentionally does not count empty gripper closure. A grasp is counted only when the gripper contacts a candidate block, at least one gripper is closed, and the contacted block is either lifted from the tabletop or remains in stable gripper contact for eight consecutive simulation steps.
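
The grasp-counting rule above can be sketched as a check over per-step records. This is one reading of the rule, not the repository's code: the `StepRecord` fields and function name are hypothetical, and the sketch assumes the eight stable-contact steps must be consecutive steps in which the gripper is both closed and touching the block.

```python
from dataclasses import dataclass

STABLE_CONTACT_STEPS = 8  # consecutive steps required when the block is not lifted


@dataclass
class StepRecord:
    """Hypothetical per-step signals extracted from the simulator."""
    block_contact: bool   # a gripper touches a candidate block
    gripper_closed: bool  # at least one gripper is closed
    block_lifted: bool    # the contacted block is off the tabletop


def is_counted_grasp(steps: list[StepRecord]) -> bool:
    """Return True if the episode contains a grasp under the rule described above."""
    streak = 0
    for step in steps:
        if step.block_contact and step.gripper_closed:
            if step.block_lifted:
                return True  # contact + closed + lifted
            streak += 1
            if streak >= STABLE_CONTACT_STEPS:
                return True  # contact + closed, held for eight consecutive steps
        else:
            streak = 0  # contact broken or gripper opened: restart the count
    return False
```

Under this reading, closing an empty gripper never counts, because `block_contact` stays false for every step.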

🔍 Interpreting Results

RSB is designed to expose the gap between motor execution and semantic target selection:

✅ High GSR and high TSR: the policy can both manipulate and ground the semantic answer.
⚠️ High GSR and low TSR: the policy can grasp candidate blocks, but does not reliably select the correct semantic target.
❌ Low GSR and low TSR: the policy has low-level control or grasping failures.

The paper finds that many evaluated VLA policies achieve substantial grasp success while remaining near random or below random on semantic target selection after normalizing for grasp success.

📝 Citation

If you use RoboSemanticBench, please cite the paper:

@misc{RoboSemanticBench,
      title={RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models}, 
      author={Bin Yu and Yao Zhang and Haishan Liu and Shijie Lian and Yuliang Wei and Xiaopeng Lin and Zhaolong Shen and Changti Wu and Ruina Hu and Bailing Wang and Cong Huang and Kai Chen},
      year={2026},
      eprint={2606.02277},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.02277}, 
}

🙏 Acknowledgements

RoboSemanticBench is built on RoboTwin and uses its simulator, embodiment configuration, expert-trajectory pipeline, and policy evaluation interface.
