ESI-Bench: Towards Embodied Spatial Intelligence
that Closes the Perception-Action Loop


Yining Hong*1, Jiageng Liu*2, Han Yin1, Manling Li3, Leonidas Guibas1, Fei-Fei Li1, Jiajun Wu1, Yejin Choi1

1Stanford University, 2UCLA, 3Northwestern University


ESI-Bench Teaser

Overview

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen: occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone.

ESI-Bench moves beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories, built on OmniGibson and grounded in Spelke's core knowledge systems. Agents must decide which abilities to deploy (perception, locomotion, and manipulation) and how to sequence them to actively accumulate task-relevant evidence.

Key Findings

  • Active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instruction.
  • Passive multi-view input adds noise rather than signal, despite consuming far more images.
  • Most failures stem from action blindness: poor action choices lead to poor observations, which drive cascading errors.
  • Explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, but an imperfect reconstruction does more harm than good, performing worse than 2D baselines.
  • Models exhibit a metacognitive gap: unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality.

Repository Structure

esi-bench/
├── dataset/
│   └── json_clean/                    # Task question JSONs
│       ├── Action Sequencing/
│       ├── Cognitive Mapping/
│       ├── Enumerative Perception/
│       ├── Metric Comparison/
│       ├── Perceptual Grounding/
│       ├── Physical Dynamics/
│       ├── Physical Structure/
│       ├── Spatial Relations/
│       ├── Specular Reflection/
│       └── Temporal Understanding/
├── src/
│   ├── active_explore/                # Active exploration runner
│   │   ├── main.py
│   │   └── tasks/                     # Per-task modules
│   └── dataset_generation/            # Dataset construction scripts
│       └── (see Dataset Generation section below)
├── outputs/                           # Results and step images (git-ignored)
└── README.md
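
To get a quick feel for the layout above, the question files can be tallied per category. This is a minimal sketch, not part of the repository: `count_questions` is a hypothetical helper, and it assumes that question files are `.json` files nested somewhere below each category directory, while any JSON sitting directly in `json_clean` is a summary file and is skipped.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def count_questions(json_root: str = "dataset/json_clean") -> Dict[str, int]:
    """Count question JSONs per top-level category directory (hypothetical helper)."""
    root = Path(json_root)
    counts: Counter = Counter()
    for path in root.rglob("*.json"):
        rel = path.relative_to(root)
        # Assumption: files directly under json_root are summary JSONs, not questions.
        if len(rel.parts) > 1:
            counts[rel.parts[0]] += 1
    return dict(counts)


if __name__ == "__main__":
    for category, n in sorted(count_questions().items()):
        print(f"{category}: {n}")
```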

Active Exploration

The active exploration module loads an OmniGibson scene, captures step images, calls a GPT or Gemini model, and writes an answer.json.
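
The capture, query, and write cycle described above can be pictured as a short loop. The following is a toy sketch and not the repository's implementation: `explore`, `StepResult`, `capture`, and `query_model` are hypothetical names, and reading `--threshold` as a confidence cutoff bounded by `--min-steps` and `--max-steps` is an assumption that this README does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    """One model response (hypothetical structure, for illustration only)."""
    answer: Optional[str]   # the model's current answer, if any
    confidence: float       # assumed self-reported confidence in [0, 1]


def explore(
    capture: Callable[[int], str],
    query_model: Callable[[List[str]], StepResult],
    max_steps: int = 30,
    min_steps: int = 1,
    threshold: float = 0.9,
) -> dict:
    """Accumulate observations until the model is confident or steps run out."""
    images: List[str] = []
    result = StepResult(answer=None, confidence=0.0)
    for step in range(1, max_steps + 1):
        images.append(capture(step))      # act, then record the new observation
        result = query_model(images)      # reason over all evidence gathered so far
        # Assumed stopping rule: enough steps taken and confidence high enough.
        if step >= min_steps and result.confidence >= threshold:
            break
    return {"answer": result.answer, "steps": len(images)}
```

In a real run, `capture` would render a step image from the simulator and `query_model` would call the chosen provider; here both are plain callables so the control flow can be tested without either.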

Environment Setup

Use the existing behavior conda environment:

source ~/miniconda3/etc/profile.d/conda.sh
conda activate behavior

Set one API key depending on the provider:

export OPENAI_API_KEY=...
export GEMINI_API_KEY=...

OmniGibson and BEHAVIOR-1K assets are expected to be available from the conda environment and local machine setup.
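
A missing key is easier to diagnose before a scene is loaded than after. The sketch below checks for the key up front; `require_api_key` is a hypothetical helper, and the pairing of the `gpt` and `gemini` provider names (taken from the run commands in this README) with the two environment variables is inferred from the variable names rather than stated by the source.

```python
import os
from typing import Mapping, Optional

# Assumed pairing of --provider values with environment variables.
PROVIDER_ENV_VARS = {"gpt": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def require_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key for `provider`, or raise a readable error."""
    env = os.environ if env is None else env
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    var = PROVIDER_ENV_VARS[provider]
    key = env.get(var, "")
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running")
    return key
```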

Special OmniGibson Setting

Remove walls from the OmniGibson-generated maps before running ESI-Bench. In your local OmniGibson source tree, edit ./asset_pipeline/b1k_pipeline/usd_conversion/make_maps.py so that NEEDED_STRUCTURE_CATEGORIES includes only the floor categories:

WALL_CATEGORIES = ["walls", "rail_fence"]
FLOOR_CATEGORIES = ["floors", "driveway", "lawn"]
DOOR_CATEGORIES = ["door", "sliding_door", "garage_door", "gate"]
IGNORE_CATEGORIES = ["carpet"]
# NEEDED_STRUCTURE_CATEGORIES = FLOOR_CATEGORIES + WALL_CATEGORIES
NEEDED_STRUCTURE_CATEGORIES = FLOOR_CATEGORIES

See issue #1.

Running the Explorer

Run from the repository root:

python src/main.py \
  --task counting \
  --metadata "dataset/json_clean/Enumerative Perception/Spatial Segmentation/Merom_0_int/living_room_0/q_000.json" \
  --provider gemini \
  --model gemini-3.1-pro-preview \
  --max-steps 30 \
  --min-steps 1 \
  --threshold 0.9 \
  --results-root outputs/results \
  --step-image-root outputs/steps \
  --overwrite

For GPT:

python src/main.py \
  --task cognitivemap \
  --metadata "dataset/json_clean/Cognitive Mapping.json" \
  --question-index 0 \
  --provider gpt \
  --model gpt-5 \
  --max-steps 30 \
  --min-steps 1 \
  --threshold 0.9 \
  --results-root outputs/results \
  --step-image-root outputs/steps \
  --overwrite
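
When launching many runs, the same command can be assembled in Python. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, it uses only the flags shown in the two commands above, and it keeps the script path exactly as those commands write it. Passing the arguments as a list avoids shell quoting problems with metadata paths that contain spaces.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(task: str, metadata: str, provider: str, model: str,
                  question_index: Optional[int] = None, max_steps: int = 30,
                  min_steps: int = 1, threshold: float = 0.9) -> List[str]:
    """Assemble the explorer command line as an argument list (hypothetical helper)."""
    cmd = [
        sys.executable, "src/main.py",       # script path as written in the commands above
        "--task", task,
        "--metadata", metadata,              # may contain spaces; no quoting needed in a list
        "--provider", provider,
        "--model", model,
        "--max-steps", str(max_steps),
        "--min-steps", str(min_steps),
        "--threshold", str(threshold),
        "--results-root", "outputs/results",
        "--step-image-root", "outputs/steps",
        "--overwrite",
    ]
    if question_index is not None:
        cmd += ["--question-index", str(question_index)]
    return cmd


if __name__ == "__main__":
    # Run from the repository root, inside the behavior environment.
    command = build_command("cognitivemap", "dataset/json_clean/Cognitive Mapping.json",
                            "gpt", "gpt-5", question_index=0)
    subprocess.run(command, check=True)
```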

--metadata accepts either a single canonical question JSON under dataset/json_clean or a category-level summary JSON, such as dataset/json_clean/Cognitive Mapping.json, that lists question files under json_paths. With a summary JSON, use --question-index to select one entry from that list.

See docs/run_tasks.md for the mapping from each subtask to its --task name, summary JSON, and an example --question-index.
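
The two accepted forms of --metadata can be told apart with a small check. The sketch below is illustrative only: `resolve_question` is a hypothetical helper, and it assumes that a summary file is a JSON object whose `json_paths` key holds a list of path strings, which is the most direct reading of the description above but is not verified against the repository.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_question(metadata: str, question_index: Optional[int] = None) -> Path:
    """Return the path of the single question JSON that should be run."""
    path = Path(metadata)
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: a summary file is a JSON object with a "json_paths" list.
    paths = data.get("json_paths") if isinstance(data, dict) else None
    if paths is None:
        return path                      # already a single question JSON
    if question_index is None:
        raise ValueError("summary JSON given: a question index is required")
    if not 0 <= question_index < len(paths):
        raise IndexError(f"question index {question_index} is out of range")
    return Path(paths[question_index])
```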

Task Names

--task names are the module names under src/active_explore/tasks:

action, angle_confusion, cognitivemap, counting, deformable, distance,
line, mirror, multiagent, occlusion, pour, size, slope, stacking,
storage, touching, transparent, triangle, unobserved_changes

The input JSON directories follow the ESI-Bench table categories:

Action Sequencing, Cognitive Mapping, Enumerative Perception,
Metric Comparison, Perceptual Grounding, Physical Dynamics,
Physical Structure, Spatial Relations, Specular Reflection,
Temporal Understanding
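
Because --task names mirror the modules under src/active_explore/tasks, the valid names can be read off the directory instead of being hard-coded. A minimal sketch, assuming each task is either a `.py` file or a package directory (the README does not say which); `list_task_names` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def list_task_names(tasks_dir: str = "src/active_explore/tasks") -> List[str]:
    """List candidate --task names: one per module or package in `tasks_dir`."""
    names = set()
    for entry in Path(tasks_dir).iterdir():
        if entry.name.startswith(("_", ".")):
            continue                         # skip __init__.py, __pycache__, dotfiles
        if entry.is_file() and entry.suffix == ".py":
            names.add(entry.stem)            # module file, e.g. counting.py
        elif entry.is_dir():
            names.add(entry.name)            # package directory
    return sorted(names)


if __name__ == "__main__":
    print(", ".join(list_task_names()))
```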

Output Format

The runner writes:

  • answer.json under --results-root
  • step_*.png under --step-image-root

Note: Smoke tests were run with --max-steps 1, which verifies environment loading, rendering, model calls, and result writing. This is not a full accuracy evaluation for tasks that need multi-step physical interaction.
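
After a batch of runs, the outputs can be gathered with a recursive search. This sketch makes no assumption about how the runner nests its subdirectories or about the fields inside answer.json; it relies only on the two file name patterns listed above. `collect_outputs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List


def collect_outputs(results_root: str = "outputs/results",
                    step_image_root: str = "outputs/steps") -> Dict[str, List[str]]:
    """Find every answer.json and step image below the two output roots."""
    answers = sorted(str(p) for p in Path(results_root).rglob("answer.json"))
    # Note: plain string sorting places step_10 before step_2.
    steps = sorted(str(p) for p in Path(step_image_root).rglob("step_*.png"))
    return {"answers": answers, "step_images": steps}


if __name__ == "__main__":
    found = collect_outputs()
    print(f"{len(found['answers'])} answers, {len(found['step_images'])} step images")
```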


Dataset Generation

Dataset construction scripts for all task categories live under src/dataset_generation/. Each task folder contains a Python script and a corresponding bash runner. To generate data, activate the behavior environment and run the bash script for the task you want:

source ~/miniconda3/etc/profile.d/conda.sh
conda activate behavior

# Example: generate occlusion data
bash src/dataset_generation/task_hallucination/batch_occlusion_yining.sh

# Example: generate slope/stacking data
bash src/dataset_generation/task_physics/batch_slope.sh
bash src/dataset_generation/task_physics/batch_stack.sh

Set your API key before running any script that calls a model:

export OPENAI_API_KEY=...
export GEMINI_API_KEY=...

The task folders and their scripts are:

| Folder | Scripts |
| --- | --- |
| task_action_sequencing | batch_action |
| task_capacity | batch_pour, batch_storage, batch_storage_multi, batch_water |
| task_cognitive_map | batch_cognitivemap_connect, batch_cognitivemap_merge, batch_cognitivemap_plan, batch_cognitivemap_region |
| task_comparison | batch_distance, batch_size, batch_size_robot |
| task_confusing_relation | batch_equilateral, batch_isosceles, batch_randomtriangle, batch_line, batch_line_positive, batch_touching, batch_touching_false, batch_touching_real |
| task_counting | batch_counting_merge |
| task_deformable | batch_deformable |
| task_hallucination | batch_angle_confusion, batch_angle_confusion_yining, batch_dependency, batch_occlusion, batch_occlusion_yining, batch_transparent, batch_transparent_false |
| task_mirror | batch_mirror_correspondence, batch_mirror_distance, batch_mirror_merge, batch_mirror_object_reality |
| task_multi_agent | batch_multi_agent |
| task_physics | batch_slope, batch_stack |
| task_unobserved_changes | batch_unobserved_changes |
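
The folder and script names in the listing above map onto runner paths in a regular way. The sketch below builds such a path; `runner_path` and `TASK_SCRIPTS` are hypothetical, only a few rows of the listing are copied in, and the `<script>.sh` naming is confirmed by the examples earlier in this section only for the scripts shown there, so treat it as an assumption for the rest.

```python
from pathlib import Path

GENERATION_ROOT = Path("src/dataset_generation")

# A few rows copied from the listing above; extend as needed.
TASK_SCRIPTS = {
    "task_physics": ["batch_slope", "batch_stack"],
    "task_counting": ["batch_counting_merge"],
    "task_deformable": ["batch_deformable"],
}


def runner_path(folder: str, script: str) -> Path:
    """Return the bash runner path for a known folder/script pair."""
    if script not in TASK_SCRIPTS.get(folder, []):
        raise KeyError(f"{script!r} is not listed under {folder!r}")
    # Assumed naming: each script name plus a .sh suffix inside its task folder.
    return GENERATION_ROOT / folder / f"{script}.sh"


if __name__ == "__main__":
    print(runner_path("task_physics", "batch_slope").as_posix())
```

The resulting path can then be passed to bash from the repository root, as in the examples above.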

Citation

If you find ESI-Bench useful in your research, please cite:

@inproceedings{hong2026esibench,
  title     = {{ESI-Bench}: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop},
  author    = {Hong, Yining and Liu, Jiageng and Yin, Han and Li, Manling and Guibas, Leonidas and Li, Fei-Fei and Wu, Jiajun and Choi, Yejin},
  year      = {2026}
}

We also build on BEHAVIOR-1K and OmniGibson. Please cite them as well:

@inproceedings{li2023behavior1k,
  title     = {{BEHAVIOR-1K}: A Benchmark for Embodied {AI} with 1,000 Everyday Activities and Realistic Simulation},
  author    = {Li, Chengshu and Zhang, Ruohan and Wong, Josiah and Gokmen, Cem and Srivastava, Sanjana and Mart{\'i}n-Mart{\'i}n, Roberto and Wang, Chen and Levine, Gabrael and Lingelbach, Michael and Sun, Jiankai and Anvari, Mona and Hwang, Minjune and Sharma, Manasi and Aydin, Arman and Bansal, Dhruva and Hunter, Samuel and Kim, Kyu-Young and Lou, Alan and Matthews, Caleb R and Villa-Renteria, Ivan and Tang, Jerry Huayang and Tang, Claire and Xia, Fei and Savarese, Silvio and Gweon, Hyowon and Liu, Karen and Wu, Jiajun and Fei-Fei, Li},
  booktitle = {Proceedings of The 6th Conference on Robot Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {205},
  pages     = {80--93},
  publisher = {PMLR},
  year      = {2023}
}

@inproceedings{li2022omnigibson,
  title     = {{OmniGibson}: A Platform for Accelerating Embodied {AI} Research Built upon {NVIDIA}'s Omniverse Engine},
  author    = {Li, Chengshu and Gokmen, Cem and Lingelbach, Michael and Srivastava, Sanjana and Mart{\'i}n-Mart{\'i}n, Roberto and Ber, Daniel and Shen, William and Hirose, Noriaki and Zhang, Ruohan and Liu, Karen and Gweon, Hyowon and Savarese, Silvio and Fei-Fei, Li and Wu, Jiajun},
  booktitle = {Proceedings of The 6th Conference on Robot Learning},
  year      = {2022}
}

License

This project is licensed under the MIT License. See LICENSE for details.


Built on OmniGibson · Stanford University · UCLA · Northwestern University
