XModBench is a comprehensive benchmark designed to evaluate the cross-modal capabilities and consistency of omni-language models. It systematically assesses model performance across three modalities (text, vision, and audio) and five task dimensions, revealing critical gaps in current state-of-the-art models.
- 🎯 Multi-Modal Evaluation: Comprehensive testing across text, vision, and audio modalities
- 🧩 5 Task Dimensions: Perception, Spatial, Temporal, Linguistic, and Knowledge tasks
- 📊 13 SOTA Models Evaluated: Including Gemini 2.5 Pro, Qwen2.5-Omni, EchoInk-R1, and more
- 🔄 Consistency Analysis: Measures performance stability across different modal configurations
- 👥 Human Performance Baseline: Establishes human-level benchmarks for comparison
```bash
# Clone the repository
git clone https://github.com/XingruiWang/XModBench.git
cd XModBench

# Install dependencies
pip install -r requirements.txt
```

```text
XModBench/
├── data/
│   ├── text/
│   │   ├── perception/
│   │   ├── spatial/
│   │   ├── temporal/
│   │   ├── linguistic/
│   │   └── knowledge/
│   ├── vision/
│   │   └── [same task categories]
│   └── audio/
│       └── [same task categories]
├── models/
│   └── evaluation_scripts/
├── results/
│   └── model_performances/
└── analysis/
    └── visualization/
```
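The data layout above implies that every modality directory holds the same five task-category subfolders. The script below is a minimal sketch (it assumes only the directory tree shown above and says nothing about the file format inside each folder; `list_available_tasks` is a hypothetical helper, not part of the repository) for enumerating which modality/task combinations are present locally:

```python
from pathlib import Path

DATA_ROOT = Path("data")  # relative to the repository root
MODALITIES = ["text", "vision", "audio"]
TASKS = ["perception", "spatial", "temporal", "linguistic", "knowledge"]

def list_available_tasks(root: Path = DATA_ROOT):
    """Return (modality, task) pairs whose directory exists under data/."""
    available = []
    for modality in MODALITIES:
        for task in TASKS:
            if (root / modality / task).is_dir():
                available.append((modality, task))
    return available

if __name__ == "__main__":
    for modality, task in list_available_tasks():
        print(f"{modality:>7} / {task}")
```

The SLURM batch script below shows how evaluation runs are actually launched.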
```bash
#!/bin/bash
#SBATCH --job-name=VLM_eval
#SBATCH --output=log/job_%j.out
#SBATCH --error=log/job_%j.log
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4

echo "Running on host: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

module load conda
# conda activate vlm
conda activate omni

export audioBench='/home/xwang378/scratch/2025/AudioBench'

# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_audio_vision \
#     --sample 1000

# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_vision_audio \
#     --sample 1000

# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_vision_text \
#     --sample 1000

# python $audioBench/scripts/run.py \
#     --model gemini \
#     --task_name perception/vggss_audio_text \
#     --sample 1000

# Qwen2.5-Omni
# python $audioBench/scripts/run.py \
#     --model qwen2.5_omni \
#     --task_name perception/vggss_audio_text \
#     --sample 1000

python $audioBench/scripts/run.py \
    --model qwen2.5_omni \
    --task_name perception/vggss_vision_text \
    --sample 1000
```
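The commented-out blocks above run the same VGGSS perception task under four different modality pairings, one invocation per pairing. If you would rather drive that sweep from Python than edit the batch script, the following sketch loops one model over all four pairings; it reuses exactly the `--model`, `--task_name`, and `--sample` flags and the `$audioBench` path shown above, while the wrapper itself (`run_sweep`) is a hypothetical convenience, not part of the repository:

```python
import os
import subprocess

# Same location as in the batch script above; override via the environment if needed.
AUDIO_BENCH = os.environ.get("audioBench", "/home/xwang378/scratch/2025/AudioBench")
RUN_SCRIPT = os.path.join(AUDIO_BENCH, "scripts", "run.py")

# The four VGGSS perception pairings used in the commented examples above.
TASKS = [
    "perception/vggss_audio_vision",
    "perception/vggss_vision_audio",
    "perception/vggss_vision_text",
    "perception/vggss_audio_text",
]

def run_sweep(model: str = "qwen2.5_omni", sample: int = 1000) -> None:
    """Run one model over every VGGSS modality pairing, sequentially."""
    for task in TASKS:
        cmd = [
            "python", RUN_SCRIPT,
            "--model", model,
            "--task_name", task,
            "--sample", str(sample),
        ]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_sweep()
```

Sequential `subprocess.run` calls keep the behavior identical to invoking the script by hand, which makes a failure easy to attribute to a single pairing.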
| Model | Perception | Spatial | Temporal | Linguistic | Knowledge | Average |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 75.9% | 50.1% | 60.8% | 76.8% | 89.3% | 70.6% |
| Human Performance | 91.0% | 89.7% | 88.9% | 93.9% | 93.9% | 91.5% |
- Strong Performance: Perception and linguistic tasks (~75% for the best models)
- Weak Performance: Spatial reasoning (50.1%) and temporal reasoning (60.8%)
- Performance Drop: 15-25 point decrease on spatial/temporal tasks relative to perception tasks
- Audio vs. Text: 20-49 point performance drop
- Audio vs. Vision: 33-point average gap
- Vision vs. Text: ~15-point disparity
- Consistency: The best models show a 10-12 point standard deviation across modal configurations (see the sketch below)
- Vision↔Text: 9-17 point gaps between directions
- Audio↔Text: 6-8 point asymmetries
- Root Cause: Training data imbalance that favors image-to-text over the inverse direction
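The consistency and asymmetry numbers above all measure how much a model's accuracy moves when only the modal configuration changes, not the underlying question. The snippet below is an illustrative sketch of that idea (the scores are invented, the direction keys are hypothetical, and this is not claimed to be the paper's exact metric): it computes a standard deviation across configurations and a directional gap for each modality pair.

```python
from statistics import mean, stdev

# Hypothetical per-configuration accuracies (%) for one model on one task;
# keys are "source_target" modality directions, values are illustrative only.
scores = {
    "vision_text": 75.0,
    "text_vision": 66.0,
    "audio_text": 58.0,
    "text_audio": 52.0,
    "audio_vision": 55.0,
    "vision_audio": 57.0,
}

# Consistency: spread of accuracy across modal configurations (lower is better).
consistency_std = stdev(scores.values())

# Directional asymmetry: gap between the two directions of the same modality pair.
def directional_gap(a_to_b: str, b_to_a: str) -> float:
    return abs(scores[a_to_b] - scores[b_to_a])

print(f"mean accuracy     : {mean(scores.values()):.1f}")
print(f"consistency (std) : {consistency_std:.1f}")
print(f"vision<->text gap : {directional_gap('vision_text', 'text_vision'):.1f}")
print(f"audio<->text gap  : {directional_gap('audio_text', 'text_audio'):.1f}")
```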
If you use XModBench in your research, please cite our paper:
```bibtex
@article{wang2025xmodbench,
  title={XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models},
  author={Wang, Xingrui and others},
  journal={arXiv preprint arXiv:2510.15148},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
We thank all contributors and the research community for their valuable feedback and suggestions.
- Project Lead: Xingrui Wang
- Email: xwang378@jhu.edu
- Website: https://xingruiwang.github.io/projects/XModBench/
- Release Hugging Face data
- Release data processing code
- Release data evaluation code
Note: XModBench is actively maintained and regularly updated with new models and evaluation metrics. For the latest updates, please check our releases page.
