Code and dataset for the CVPR 2026 paper "SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition"
SpatialBench is a benchmark suite designed to evaluate the video spatial understanding capabilities of Multimodal Large Language Models (MLLMs). This project uses an OpenAI-compatible API interface to send video frames and related spatial reasoning questions to models, automatically evaluating their response accuracy.
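For reference, here is a rough Python sketch of the kind of Chat Completion payload this approach implies, with sampled video frames attached as base64 data URLs. The frame count, prompt wording, and field layout are assumptions for illustration; the authoritative request construction lives in benchmark_vision_base64.py.

question = "How many red cups are in the video?"
frames_b64 = ["<base64-encoded JPEG>", "<base64-encoded JPEG>"]  # placeholders

# One user message carrying the question text plus the sampled frames.
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": question}]
        + [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
            for b64 in frames_b64
        ],
    }
]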
- Multi-dimensional Evaluation: Covers 5 major categories and 15 sub-categories of spatial tasks, including Observation & Measurement, Topology & Composition, Symbolic Visual Reasoning, Spatial Causality, and Spatial Planning.
- Flexible API Support: Supports any Vision-Language Model compatible with the OpenAI Chat Completion API (e.g., GPT-4o, Qwen2.5-VL, GLM-4V, etc.).
- Multiple Testing Modes:
  - Standard Evaluation: Standard QA evaluation using the full dataset.
  - Deep Guide Mode: Uses video examples for In-Context Learning (via QA_fewshot.txt).
  - Multi-turn Conversation: Maintains context to test model performance in continuous interactions.
- Automated Evaluation: Provides dedicated scripts to calculate detailed classification accuracy and weighted overall scores.
Before starting, ensure you have the following installed:
- Python 3.8+
- Git (Required for downloading the dataset)
  - Windows: Download Git for Windows. During installation, make sure to select "Git LFS (Large File Support)".
  - Linux (Ubuntu/Debian):
    sudo apt-get install git git-lfs
  - macOS:
    brew install git git-lfs
Because GitHub limits large file storage, the videos live on Hugging Face, so download them before moving on.
First make sure Git LFS is installed:
git lfs install
Then clone the SpatialBench repository from Hugging Face:
git clone https://huggingface.co/datasets/XPR2004/SpatialBench
After cloning, make sure the directory layout looks like this:
SpatialBench/
├── dataset/
│   ├── video_1.mp4
│   ├── video_2.mp4
│   └── ...
├── benchmark_vision_base64.py
└── ...
After pulling the assets, install the libraries required to run the scripts:
pip install openai opencv-python numpy tqdm httpx
Finish the setup by configuring the API-related environment variables.
Linux / macOS:
export OPENAI_API_KEY="sk-your-api-key"
export OPENAI_API_BASE="https://api.openai-proxy.org/v1" # Replace with your API Base URL
Windows (PowerShell):
$env:OPENAI_API_KEY="sk-your-api-key"
$env:OPENAI_API_BASE="https://api.openai-proxy.org/v1"
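As a quick sanity check that the variables are set, you can build an OpenAI-compatible client from them in Python. This is only a connectivity sketch; how benchmark_vision_base64.py actually consumes these variables is defined in the script itself.

import os
from openai import OpenAI

# Build a client against the configured endpoint (OPENAI_API_BASE is optional
# here and falls back to the official endpoint).
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1"),
)

# List models as a lightweight connectivity test (not every proxy supports this route).
print([m.id for m in client.models.list().data])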
The repository includes the benchmark question files (JSON/Text format). Note: The corresponding video files must be downloaded separately (see Setup step 1).
- QA.txt: The standard benchmark dataset containing spatial reasoning questions.
- QA_fewshot.txt: A dataset variant designed for "Deep Guide" mode, where problems are paired with video examples for few-shot learning.
- test_sample.txt: A small sample dataset for quick testing and debugging.
The main script is benchmark_vision_base64.py. It reads the input file (defaults to QA.txt), processes videos, calls the API, and saves the results.
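Conceptually, "processes videos" here means sampling a handful of frames and base64-encoding them before the API call. The helper below is a minimal sketch of that step using the opencv-python dependency; the function name, frame count, sampling strategy, and JPEG settings are assumptions and may differ from the actual script.

import base64
import cv2

def sample_frames_b64(video_path, num_frames=8, jpeg_quality=85):
    # Evenly sample frames from a video and return them as base64-encoded JPEGs.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
        if ok:
            frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    cap.release()
    return frames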
Standard Benchmark (Default):
# Uses QA.txt by default
python benchmark_vision_base64.py -m "Qwen2.5-VL-72B-Instruct"
Run Deep Guide Mode (Few-Shot):
This mode is automatically activated when using the QA_fewshot.txt file.
python benchmark_vision_base64.py QA_fewshot.txt -m "gpt-4o"
Quick Test: Run on a small sample to verify your setup.
python benchmark_vision_base64.py test_sample.txt
Common Arguments:
- -w <int>: Set the number of concurrent worker threads (default is 4).
- -m <str>: Specify the model name.
- --keep-context: Enable multi-turn conversation mode (default is independent questions).
- --resume: Resume from interruption, skipping completed questions.
- --rerun-incorrect <file.json>: Rerun only the incorrect questions from a specific result file.
- --with-reasoning: Force the model to output its reasoning process (Chain of Thought).
After testing, results are saved in a JSON file within the *_results directory (e.g., QA_results/). Use evaluate_benchmark_results.py to generate a statistical report.
Usage:
# Evaluate a specific results directory
python evaluate_benchmark_results.py QA_results
This script generates evaluation_summary.json, containing:
- Overall Accuracy
- Weighted Overall Score
- Scores by Major Category
- Scores by Sub Category
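If you want to post-process the report programmatically, the summary is plain JSON and can be loaded directly. The path below assumes results were written to QA_results/; adjust it to your results directory, and check the file itself for the exact key names.

import json

with open("QA_results/evaluation_summary.json", "r", encoding="utf-8") as f:
    summary = json.load(f)

# Print every metric in the report without assuming specific key names.
for key, value in summary.items():
    print(f"{key}: {value}")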
The input files (e.g., QA.txt) are in JSON format, containing a list of objects. Each object must contain a sample field.
Example Structure:
[
  {
    "sample": {
      "problem_id": 1001,
      "path": "dataset/video_01.mp4",
      "problem_type": "object_counting",
      "problem": "How many red cups are in the video?",
      "options": ["1", "2", "3", "4"],
      "solution": "<answer>2</answer>",
      "scene_type": "indoor"
    }
  }
]
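A minimal loader for this format, useful for checking that every entry carries a sample field and that the referenced videos exist locally (illustrative only; field names follow the example above):

import json
import os

with open("QA.txt", "r", encoding="utf-8") as f:
    entries = json.load(f)

for entry in entries:
    sample = entry["sample"]        # every object must contain "sample"
    video_path = sample["path"]     # e.g., dataset/video_01.mp4
    if not os.path.exists(video_path):
        print(f"Missing video for problem {sample['problem_id']}: {video_path}")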
The repository is organized as follows:
SpatialBench/
├── benchmark_vision_base64.py # Main benchmark script
├── evaluate_benchmark_results.py # Evaluation and statistics script
├── QA.txt # Standard dataset
├── QA_fewshot.txt # Dataset for Deep Guide/Few-shot mode
├── dataset/ # Directory for test videos
└── README.md # Project documentation
The evaluation script calculates scores based on the following logic:
- Multiple Choice: Matches the model's output option (A/B/C/D). Correct = 1 point, Incorrect = 0 points.
- Regression (e.g., Distance Estimation): Uses the Mean Relative Accuracy (MRA) metric. Scores range from 0 to 1 based on the relative error between the predicted value and the ground truth (see the sketch after this list).
- Weighted Overall Score: Calculates the final score by weighting different task categories based on their difficulty and importance.
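The exact MRA implementation lives in evaluate_benchmark_results.py; the sketch below follows the common definition used by prior video spatial benchmarks (averaging a pass/fail check over a sweep of relative-error thresholds) and should be read as an assumption, not the authoritative scoring code.

def mean_relative_accuracy(pred: float, gt: float,
                           thresholds=(0.50, 0.55, 0.60, 0.65, 0.70,
                                       0.75, 0.80, 0.85, 0.90, 0.95)) -> float:
    # Average, over confidence thresholds, of whether the relative error
    # |pred - gt| / |gt| stays below 1 - threshold.
    if gt == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - gt) / abs(gt)
    return sum(rel_err < (1 - t) for t in thresholds) / len(thresholds)

# Example: pred=2.2 vs gt=2.0 gives a relative error of 0.1, which passes the
# thresholds up to 0.85 (8 of 10), so the MRA score is 0.8.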
If you find our project helpful, please consider starring the repo and citing our paper:
@misc{xu2025spatialbenchbenchmarkingmultimodallarge,
  title={SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition},
  author={Peiran Xu and Sudong Wang and Yao Zhu and Jianing Li and Yunjian Zhang},
  year={2025},
  eprint={2511.21471},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2511.21471},
}
