Code and dataset for the CVPR 2026 paper "SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition"
SpatialBench is a benchmark suite designed to evaluate the video spatial understanding capabilities of Multimodal Large Language Models (MLLMs). This project uses an OpenAI-compatible API interface to send video frames and related spatial reasoning questions to models, automatically evaluating their response accuracy.
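For reference, here is a rough Python sketch of the kind of Chat Completion payload this approach implies, with sampled video frames attached as base64 data URLs. The frame count, prompt wording, and field layout are assumptions for illustration; the authoritative request construction lives in benchmark_vision_base64.py.

question = "How many red cups are in the video?"
frames_b64 = ["<base64-encoded JPEG>", "<base64-encoded JPEG>"]  # placeholders

# One user message carrying the question text plus the sampled frames.
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": question}]
        + [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
            for b64 in frames_b64
        ],
    }
]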
- Multi-dimensional Evaluation: Covers 5 major categories and 15 sub-categories of spatial tasks, including Observation & Measurement, Topology & Composition, Symbolic Visual Reasoning, Spatial Causality, and Spatial Planning.
- Flexible API Support: Supports any Vision-Language Model compatible with the OpenAI Chat Completion API (e.g., GPT-4o, Qwen2.5-VL, GLM-4V, etc.).
- Multiple Testing Modes:
  - Standard Evaluation: Standard QA evaluation using the full dataset.
  - Deep Guide Mode: Uses video examples for In-Context Learning (via QA_fewshot.txt).
  - Multi-turn Conversation: Maintains context to test model performance in continuous interactions.
- Automated Evaluation: Provides dedicated scripts to calculate detailed classification accuracy and weighted overall scores.
Before starting, ensure you have the following installed:
- Python 3.8+
- Git (Required for downloading the dataset)
  - Windows: Download Git for Windows. During installation, make sure to select "Git LFS (Large File Support)".
  - Linux (Ubuntu/Debian):
    sudo apt-get install git git-lfs
  - macOS:
    brew install git git-lfs
Because GitHub limits large file storage, the videos live on Hugging Face, so download them before moving on.
First make sure Git LFS is installed:
git lfs install
Then clone the SpatialBench repository from Hugging Face:
git clone https://huggingface.co/datasets/XPR2004/SpatialBench
After cloning, make sure the directory layout looks like this:
SpatialBench/
├── dataset/
│   ├── video_1.mp4
│   ├── video_2.mp4
│   └── ...
├── benchmark_vision_base64.py
└── ...
After pulling the assets, install the libraries required to run the scripts:
pip install openai opencv-python numpy tqdm httpx
Finish the setup by configuring the API-related environment variables.
Linux / macOS:
export OPENAI_API_KEY="sk-your-api-key"
export OPENAI_API_BASE="https://api.openai-proxy.org/v1" # Replace with your API Base URL
Windows (PowerShell):
$env:OPENAI_API_KEY="sk-your-api-key"
$env:OPENAI_API_BASE="https://api.openai-proxy.org/v1"
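As a quick sanity check that the variables are set, you can build an OpenAI-compatible client from them in Python. This is only a connectivity sketch; how benchmark_vision_base64.py actually consumes these variables is defined in the script itself.

import os
from openai import OpenAI

# Build a client against the configured endpoint (OPENAI_API_BASE is optional
# here and falls back to the official endpoint).
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1"),
)

# List models as a lightweight connectivity test (not every proxy supports this route).
print([m.id for m in client.models.list().data])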
The repository includes the benchmark question files (JSON/Text format). Note: The corresponding video files must be downloaded separately (see Setup step 1).
- QA.txt: The standard benchmark dataset containing spatial reasoning questions.
- QA_fewshot.txt: A dataset variant designed for "Deep Guide" mode, where problems are paired with video examples for few-shot learning.
- test_sample.txt: A small sample dataset for quick testing and debugging.
The main script is benchmark_vision_base64.py. It reads the input file (defaults to QA.txt), processes videos, calls the API, and saves the results.
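Conceptually, "processes videos" here means sampling a handful of frames and base64-encoding them before the API call. The helper below is a minimal sketch of that step using the opencv-python dependency; the function name, frame count, sampling strategy, and JPEG settings are assumptions and may differ from the actual script.

import base64
import cv2

def sample_frames_b64(video_path, num_frames=8, jpeg_quality=85):
    # Evenly sample frames from a video and return them as base64-encoded JPEGs.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
        if ok:
            frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    cap.release()
    return frames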
Standard Benchmark (Default):
# Uses QA.txt by default
python benchmark_vision_base64.py -m "Qwen2.5-VL-72B-Instruct"
Run Deep Guide Mode (Few-Shot):
This mode is automatically activated when using the QA_fewshot.txt file.
python benchmark_vision_base64.py QA_fewshot.txt -m "gpt-4o"
Quick Test: Run on a small sample to verify your setup.
python benchmark_vision_base64.py test_sample.txt
Common Arguments:
- -w <int>: Set the number of concurrent worker threads (default is 4).
- -m <str>: Specify the model name.
- --keep-context: Enable multi-turn conversation mode (default is independent questions).
- --resume: Resume from interruption, skipping completed questions.
- --rerun-incorrect <file.json>: Rerun only the incorrect questions from a specific result file.
- --with-reasoning: Force the model to output its reasoning process (Chain of Thought).
After testing, results are saved in a JSON file within the *_results directory (e.g., QA_results/). Use evaluate_benchmark_results.py to generate a statistical report.
Usage:
# Evaluate a specific results directory
python evaluate_benchmark_results.py QA_results
This script generates evaluation_summary.json, containing:
- Overall Accuracy
- Weighted Overall Score
- Scores by Major Category
- Scores by Sub Category
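If you want to post-process the report programmatically, the summary is plain JSON and can be loaded directly. The path below assumes results were written to QA_results/; adjust it to your results directory, and check the file itself for the exact key names.

import json

with open("QA_results/evaluation_summary.json", "r", encoding="utf-8") as f:
    summary = json.load(f)

# Print every metric in the report without assuming specific key names.
for key, value in summary.items():
    print(f"{key}: {value}")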
The input files (e.g., QA.txt) are in JSON format, containing a list of objects. Each object must contain a sample field.
Example Structure:
[
  {
    "sample": {
      "problem_id": 1001,
      "path": "dataset/video_01.mp4",
      "problem_type": "object_counting",
      "problem": "How many red cups are in the video?",
      "options": ["1", "2", "3", "4"],
      "solution": "<answer>2</answer>",
      "scene_type": "indoor"
    }
  }
]
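A minimal loader for this format, useful for checking that every entry carries a sample field and that the referenced videos exist locally (illustrative only; field names follow the example above):

import json
import os

with open("QA.txt", "r", encoding="utf-8") as f:
    entries = json.load(f)

for entry in entries:
    sample = entry["sample"]        # every object must contain "sample"
    video_path = sample["path"]     # e.g., dataset/video_01.mp4
    if not os.path.exists(video_path):
        print(f"Missing video for problem {sample['problem_id']}: {video_path}")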
The repository is organized as follows:
SpatialBench/
├── benchmark_vision_base64.py # Main benchmark script
├── evaluate_benchmark_results.py # Evaluation and statistics script
├── QA.txt # Standard dataset
├── QA_fewshot.txt # Dataset for Deep Guide/Few-shot mode
├── dataset/ # Directory for test videos
└── README.md # Project documentation
The evaluation script calculates scores based on the following logic:
- Multiple Choice: Matches the model's output option (A/B/C/D). Correct = 1 point, Incorrect = 0 points.
- Regression (e.g., Distance Estimation): Uses the Mean Relative Accuracy (MRA) metric. Scores range from 0 to 1 based on the relative error between the predicted value and the ground truth (see the sketch after this list).
- Weighted Overall Score: Calculates the final score by weighting different task categories based on their difficulty and importance.
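The exact MRA implementation lives in evaluate_benchmark_results.py; the sketch below follows the common definition used by prior video spatial benchmarks (averaging a pass/fail check over a sweep of relative-error thresholds) and should be read as an assumption, not the authoritative scoring code.

def mean_relative_accuracy(pred: float, gt: float,
                           thresholds=(0.50, 0.55, 0.60, 0.65, 0.70,
                                       0.75, 0.80, 0.85, 0.90, 0.95)) -> float:
    # Average, over confidence thresholds, of whether the relative error
    # |pred - gt| / |gt| stays below 1 - threshold.
    if gt == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - gt) / abs(gt)
    return sum(rel_err < (1 - t) for t in thresholds) / len(thresholds)

# Example: pred=2.2 vs gt=2.0 gives a relative error of 0.1, which passes the
# thresholds up to 0.85 (8 of 10), so the MRA score is 0.8.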
If you find our project helpful, please consider starring the repo and citing our paper:
@misc{xu2025spatialbenchbenchmarkingmultimodallarge,
  title={SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition},
  author={Peiran Xu and Sudong Wang and Yao Zhu and Jianing Li and Yunjian Zhang},
  year={2025},
  eprint={2511.21471},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2511.21471},
}
