ZGC-EmbodyAI/RoboSemanticBench

RoboSemanticBench

RoboSemanticBench (RSB) is an embodied benchmark for diagnosing semantic grounding in action prediction for vision-language-action (VLA) models.

In each episode, the robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must pick the physical block corresponding to the correct answer. The manipulation primitive is intentionally simple, while the semantic decision is non-trivial. This makes RSB useful for separating low-level grasping ability from whether a VLA policy actually uses instruction semantics to select the correct target.

This repository is built on top of RoboTwin 2.0 and keeps the same simulation, expert-trajectory, data-collection, and policy-evaluation workflow.

Highlights

  • 🧠 Six embodied semantic-answering suites covering controlled arithmetic, GSM8K-style word problems, and general QA.
  • πŸ”’ Both 4-choice and 10-choice variants, where the 10-choice suites reduce chance accuracy from 25% to 10%. The 10-choice suites use labels A-I and K, skipping J to avoid visual ambiguity with I.
  • 🦾 A fixed tabletop pick-and-place primitive: solve the question, bind the correct option to a visible block, then place that block in the answer zone.
  • πŸ“ˆ Diagnostic metrics that report both task success and grasp success, exposing cases where a policy can grasp blocks but selects the wrong semantic target.
  • πŸ› οΈ Built-in scripts for sharded data collection, shard merging, trajectory reuse, and simulation evaluation.
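
As a small illustration of the labeling rule and chance levels described above, the sketch below generates option labels while skipping J. The function names are hypothetical and not part of the repository, and the A-D labels for the 4-choice suites are an assumption (the README only specifies the 10-choice labels).

```python
import string


def option_labels(num_choices: int) -> list[str]:
    """Return block labels, skipping 'J' as described for the 10-choice suites.

    Assumption: 4-choice suites use the first four letters (A-D).
    """
    letters = [c for c in string.ascii_uppercase if c != "J"]
    return letters[:num_choices]


def chance_accuracy(num_choices: int) -> float:
    """Accuracy of uniformly random target selection among the candidates."""
    return 1.0 / num_choices


if __name__ == "__main__":
    for n in (4, 10):
        print(n, option_labels(n), f"chance={chance_accuracy(n):.0%}")
```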

🧩 Benchmark Suites

| Suite | Task name | Choices | Semantic source | Train config | Eval config |
| --- | --- | --- | --- | --- | --- |
| RSB-Math-4 | rsb_math | 4 | Procedural arithmetic | rsb_math_train_500 | rsb_math_train_500 |
| RSB-Math-10 | rsb_math_10blocks | 10 | Procedural arithmetic | rsb_math_10blocks_train_500 | rsb_math_10blocks_train_500 |
| RSB-HardMath-4 | rsb_hardmath | 4 | GSM8K | rsb_hardmath_train_7473 | rsb_hardmath_train_700 |
| RSB-HardMath-10 | rsb_hardmath_10blocks | 10 | GSM8K | rsb_hardmath_10blocks_train_7473 | rsb_hardmath_10blocks_train_700 |
| RSB-General-4 | rsb_general | 4 | MMLU-style QA | rsb_general_train_10k | rsb_general_test_500 |
| RSB-General-10 | rsb_general_10blocks | 10 | MMLU-style QA | rsb_general_10blocks_train_10k | rsb_general_10blocks_test_500 |

Training-set sizes follow the paper setup:

| Subset | Source | Choices | Train questions |
| --- | --- | --- | --- |
| RSB-Math-4 | Procedural arithmetic | 4 | 500 |
| RSB-Math-10 | Procedural arithmetic | 10 | 500 |
| RSB-HardMath-4 | GSM8K | 4 | 7,473 |
| RSB-HardMath-10 | GSM8K | 10 | 7,473 |
| RSB-General-4 | MMLU-style QA | 4 | 10,000 |
| RSB-General-10 | MMLU-style QA | 10 | 10,000 |

For evaluation, the paper protocol uses 500 simulation episodes per suite with held-out semantic questions.

Repository Layout

envs/rsb_*.py                           # RSB task environments
description/task_instruction/rsb_*.json  # instruction templates
task_config/rsb_*.yml                    # collection/evaluation configs
gsm8k/                                   # GSM8K preprocessing utilities
mmluqa2/                                 # MMLU-style QA preprocessing utilities
script/collect_data.py                   # trajectory collection entry
script/merge_collect_data_shards.py      # shard merge utility
script/build_*                           # dataset construction/reuse utilities
script/eval_policy.py                    # simulation evaluation entry

Installation

Follow the original RoboTwin installation instructions first, including simulator assets and the Python environment. RSB uses the same simulator stack, SAPIEN runtime, cameras, and Aloha-AgileX embodiment configuration.

After installing the RoboTwin dependencies, run the install script from the repository root to install or update paths:

cd RoboSemanticBench
bash script/_install.sh

If assets are not available yet, use the RoboTwin asset download path or the helper script:

bash script/_download_assets.sh

Preparing Semantic Sources

RSB task code expects the semantic QA files below when collecting or evaluating HardMath and General suites:

gsm8k/data/train.json
gsm8k/data/test.json
mmluqa2/data/train.json
mmluqa2/data/test.json

The repository includes preprocessing utilities:

python gsm8k/convert_to_json.py
python gsm8k/generate_distractors.py
python mmluqa2/generate_dataset.py

Check the license terms of upstream datasets before redistributing generated JSON files.
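
Before launching a HardMath or General collection run, it can help to confirm that the four expected JSON files are present. The following is a minimal sketch, assuming it is run from the repository root; the helper name `missing_semantic_sources` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File paths exactly as listed in this README.
EXPECTED_FILES = [
    "gsm8k/data/train.json",
    "gsm8k/data/test.json",
    "mmluqa2/data/train.json",
    "mmluqa2/data/test.json",
]


def missing_semantic_sources(repo_root: str = ".") -> list[str]:
    """Return the expected semantic QA files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_semantic_sources()
    if missing:
        print("Missing semantic source files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All semantic source files are present.")
```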

Pre-collected Datasets

If you do not want to run the simulation data-collection pipeline yourself, we provide pre-collected training data for part of the benchmark on Hugging Face:

| Suite | Task name | Format | Download |
| --- | --- | --- | --- |
| RSB-Math-4 | rsb_math | RoboTwin format | HuggingFace |
| RSB-Math-10 | rsb_math_10blocks | RoboTwin format | HuggingFace |

More suites will be released gradually. We also plan to release LeRobot-format versions to make the datasets easier to use for policy training.

🎬 Data Collection

Basic collection format:

bash collect_data.sh <task_name> <task_config> <gpu_id> [extra_args]

Examples:

# RSB-Math-4
bash collect_data.sh rsb_math rsb_math_train_500 0 --skip_instructions

# RSB-HardMath-10
bash collect_data.sh rsb_hardmath_10blocks rsb_hardmath_10blocks_train_7473 0 --skip_instructions

# RSB-General-4, 2500 fresh trajectories
bash collect_data.sh rsb_general rsb_general_train_2500 0 --skip_instructions

Collected data is saved under:

data/<task_name>/<task_config>/

Each setting contains simulator trajectories, HDF5 files, videos, instructions, scene_info.json, and seed metadata.

⚑ Sharded Collection

For long-running datasets, collect shards in parallel and merge them:

bash collect_data.sh rsb_general rsb_general_train_2500 0 \
  --episode_start 0 \
  --episode_end 625 \
  --output_setting rsb_general_train_2500_shard_0_625 \
  --skip_instructions

python script/merge_collect_data_shards.py \
  rsb_general \
  rsb_general_train_2500 \
  rsb_general_train_2500_shard_0_625 \
  rsb_general_train_2500_shard_625_1250 \
  rsb_general_train_2500_shard_1250_1875 \
  rsb_general_train_2500_shard_1875_2500
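
The shard names in the example above follow a `<task_config>_shard_<start>_<end>` pattern. The sketch below generates the per-shard ranges and collection commands for an arbitrary shard count. It assumes `--episode_end` is exclusive (suggested by the 0-625, 625-1250 boundaries above, but not confirmed here) and assigns one GPU per shard; the helper names are hypothetical.

```python
def shard_ranges(total_episodes: int, num_shards: int) -> list[tuple[int, int]]:
    """Split [0, total_episodes) into contiguous half-open (start, end) ranges."""
    base, extra = divmod(total_episodes, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        # Spread any remainder over the first `extra` shards.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def shard_setting(task_config: str, start: int, end: int) -> str:
    """Name a shard's output setting using the pattern shown in this README."""
    return f"{task_config}_shard_{start}_{end}"


if __name__ == "__main__":
    task_name, task_config = "rsb_general", "rsb_general_train_2500"
    for gpu_id, (start, end) in enumerate(shard_ranges(2500, 4)):
        setting = shard_setting(task_config, start, end)
        print(
            f"bash collect_data.sh {task_name} {task_config} {gpu_id} "
            f"--episode_start {start} --episode_end {end} "
            f"--output_setting {setting} --skip_instructions"
        )
```

The printed settings can then be passed, in order, to `script/merge_collect_data_shards.py` as in the merge command above.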

Convenience scripts are provided for common sharded workflows:

bash script/collect_rsb_hardmath_10blocks_4gpu.sh
bash script/collect_rsb_general_2500_to_10k_4gpu.sh
bash script/collect_rsb_general_10blocks_2500_to_10k_4gpu.sh
bash script/collect_rsb_general_test_500_4gpu.sh
bash script/collect_rsb_general_10blocks_test_500_4gpu.sh

Reusing General Trajectories

For RSB-General, you can collect 2,500 fresh trajectories and expand them to 10,000 semantic samples by reusing the robot motion and rewriting the QA fields:

python script/build_general_answer_reuse_trajectories.py \
  --task_name rsb_general \
  --source_setting rsb_general_train_2500 \
  --target_setting rsb_general_train_10k \
  --overwrite

The 10-choice version is analogous:

python script/build_general_answer_reuse_trajectories.py \
  --task_name rsb_general_10blocks \
  --source_setting rsb_general_10blocks_train_2500 \
  --target_setting rsb_general_10blocks_train_10k \
  --overwrite

Evaluation

RSB uses RoboTwin's policy interface. A policy must expose:

get_model(args)
eval(task_env, model, observation)
reset_model(model)

Example:

python script/eval_policy.py --config policy/Your_Policy/deploy_policy.yml \
  --overrides \
  --task_name rsb_general_10blocks \
  --task_config rsb_general_10blocks_test_500 \
  --ckpt_setting your_checkpoint_name \
  --seed 0 \
  --policy_name Your_Policy

Evaluation results are written to:

eval_result/<task_name>/<policy_name>/<task_config>/<ckpt_setting>/<timestamp>/

Metrics

RSB reports three diagnostic metrics:

  • 🎯 Task Success Rate (TSR): the episode succeeds only when the robot selects the semantically correct answer block and places it in the answer zone.
  • βœ‹ Grasp Success Rate (GSR): the robot successfully grasps any candidate answer block, regardless of whether it is correct.
  • 🧭 Normalized Semantic Grounding (nSG): semantic target selection conditioned on successful grasping:
nSG = ((TSR / GSR) - (1 / N)) / (1 - (1 / N))

where N is the number of choices. nSG = 0 corresponds to random target selection among grasped candidates; positive values indicate better-than-random semantic grounding; negative values indicate worse-than-random selection.
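
The nSG formula can be computed directly from the two rates. The sketch below is a straightforward transcription; the function name is hypothetical, and returning `None` when GSR is zero is an assumption, since the README does not say how that case is handled.

```python
from typing import Optional


def normalized_semantic_grounding(tsr: float, gsr: float, num_choices: int) -> Optional[float]:
    """Compute nSG = ((TSR / GSR) - 1/N) / (1 - 1/N).

    Returns None when GSR is 0, because TSR / GSR is undefined there
    (an assumption of this sketch, not a documented convention).
    """
    if gsr <= 0:
        return None
    chance = 1.0 / num_choices
    return ((tsr / gsr) - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Illustrative inputs only, not results from the paper.
    for tsr, gsr, n in [(0.20, 0.80, 4), (0.80, 0.80, 4), (0.05, 0.80, 10)]:
        print(f"TSR={tsr}, GSR={gsr}, N={n} -> nSG={normalized_semantic_grounding(tsr, gsr, n):.3f}")
```

With these illustrative inputs, a policy that succeeds on a quarter of its grasped 4-choice episodes scores about 0 (chance), one that succeeds on every grasped episode scores 1, and one that succeeds on fewer than 1 in 10 grasped 10-choice episodes scores below 0.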

GSR intentionally does not count empty gripper closure. A grasp is counted only when the gripper contacts a candidate block, at least one gripper is closed, and the contacted block is either lifted from the tabletop or remains in stable gripper contact for eight consecutive simulation steps.
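
The grasp-counting rule above can be read as a per-step check over an episode. The following sketch encodes that reading; the `StepState` fields are hypothetical per-step signals, not the simulator's actual API, and the interpretation that contact and closure must hold on the same step is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable

STABLE_CONTACT_STEPS = 8  # "eight consecutive simulation steps" from the text above


@dataclass
class StepState:
    touching_block: bool  # a gripper contacts a candidate block
    gripper_closed: bool  # at least one gripper is closed
    block_lifted: bool    # the contacted block is off the tabletop


def grasp_counted(steps: Iterable[StepState]) -> bool:
    """Return True if the episode contains a grasp under the rule described above."""
    consecutive = 0
    for step in steps:
        holding = step.touching_block and step.gripper_closed
        if not holding:
            consecutive = 0  # empty closure or lost contact resets the streak
            continue
        if step.block_lifted:
            return True
        consecutive += 1
        if consecutive >= STABLE_CONTACT_STEPS:
            return True
    return False


if __name__ == "__main__":
    empty_closure = [StepState(False, True, False)] * 20
    stable_hold = [StepState(True, True, False)] * 8
    print(grasp_counted(empty_closure), grasp_counted(stable_hold))
```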

Interpreting Results

RSB is designed to expose the gap between motor execution and semantic target selection:

  • High GSR and high TSR: the policy can both manipulate and ground the semantic answer.
  • High GSR and low TSR: the policy can grasp candidate blocks, but does not reliably select the correct semantic target.
  • Low GSR and low TSR: the policy has low-level control or grasping failures.

The paper finds that many evaluated VLA policies achieve substantial grasp success while remaining near random or below random on semantic target selection after normalizing for grasp success.

Citation

If you use RoboSemanticBench, please cite the paper:

@misc{RoboSemanticBench,
      title={RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models}, 
      author={Bin Yu and Yao Zhang and Haishan Liu and Shijie Lian and Yuliang Wei and Xiaopeng Lin and Zhaolong Shen and Changti Wu and Ruina Hu and Bailing Wang and Cong Huang and Kai Chen},
      year={2026},
      eprint={2606.02277},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.02277}, 
}

Acknowledgements

RoboSemanticBench is built on RoboTwin and uses its simulator, embodiment configuration, expert-trajectory pipeline, and policy evaluation interface.
