VCBench is a streaming counting benchmark that repositions counting as a minimal probe for diagnosing spatial-temporal state maintenance capability in video-language models. By querying models at multiple timepoints during video playback, VCBench observes how model predictions evolve rather than checking isolated answers.
Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, Si Liu

Institute of Artificial Intelligence, Beihang University
VCBench decomposes state maintenance into 8 fine-grained subcategories across two dimensions:
Object Counting — tracking entities in the scene
| Subcategory | Description |
|---|---|
| O1-Snap | How many objects are visible at this moment? |
| O1-Delta | How many objects appeared in the past N seconds (change relative to window start)? |
| O2-Unique | How many different individuals have appeared so far? |
| O2-Gain | How many new individuals appeared in the past N seconds (identity tracking + time window)? |
Event Counting — tracking actions and activities
| Subcategory | Description |
|---|---|
| E1-Action | How many times has an atomic action occurred so far? |
| E1-Transit | How many scene transitions have occurred so far? |
| E2-Episode | How many activity segments have occurred so far? |
| E2-Periodic | How many complete cycles of a periodic action so far? |
- 406 videos from diverse sources (YouTube, ARKitScenes, ScanNet, ScanNet++, Ego4D, RoomTour3D, CODa, OmniWorld, simulation)
- 1,000 counting questions with 4,576 streaming query points
- 10,071 frame-by-frame annotated event occurrence and object state change moments
```bash
huggingface-cli download buaaplay/VCBench --repo-type dataset --local-dir data/videos
```

The `chunkedVideos/` directory contains 4,576 video clips (one per query point), each truncated to the query timestamp.
Three complementary metrics diagnose different aspects of state maintenance:
- GPA (Gaussian Precision Accuracy): numerical prediction precision, scored with a Gaussian kernel of σ = 0.05 × max(g, 1) around the ground truth g, so predictions with more than 15% relative deviation (3σ) are heavily penalized.
- MoC (Monotonicity Consistency): whether cumulative count predictions are monotonically non-decreasing across query points. Applies to the O2, E1, and E2 subcategories.
- UDA (Update Direction Accuracy): whether the model correctly perceives the direction of state change (increase, decrease, or no change) between adjacent query points.
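As a concrete illustration of the GPA definition, here is a minimal sketch of per-prediction Gaussian scoring. The kernel form `exp(-(p - g)² / (2σ²))` is an assumption (the text above specifies only σ = 0.05 × max(g, 1)); the actual `eval/compute_metrics.py` may differ in normalization or aggregation:

```python
import math

def gpa_score(pred: float, gt: float) -> float:
    """Gaussian score around the ground truth gt.

    Assumed kernel: exp(-(pred - gt)^2 / (2 * sigma^2)),
    with sigma = 0.05 * max(gt, 1) as specified by the benchmark.
    """
    sigma = 0.05 * max(gt, 1)
    return math.exp(-((pred - gt) ** 2) / (2 * sigma ** 2))

# An exact prediction scores 1.0; a 15% deviation is 3 sigma away
# and scores exp(-4.5), i.e. close to zero.
```

Under this form, a prediction of 11.5 against a ground truth of 10 (15% off) scores about 0.01, which matches the "penalize >15% relative deviation" behavior described above.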
Overall GPA / MoC / UDA scores (scaled to 0–100); full per-subcategory results are in the paper.
| Model | Frames | GPA | MoC | UDA |
|---|---|---|---|---|
| Human | — | 96.1 | 100.0 | 99.3 |
| GPT-4-Turbo (blind) | — | 18.7 | 95.7 | 4.3 |
| Gemini-3-Flash | 64 | 37.0 | 73.7 | 73.8 |
| Doubao-Seed-1.8 | 64 | 36.2 | 77.2 | 76.8 |
| Kimi-K2.5 | 64 | 26.4 | 66.8 | 73.4 |
| Qwen3-VL-30B | 64 | 24.0 | 76.9 | 70.7 |
| Qwen3-VL-8B | 64 | 31.0 | 84.3 | 55.1 |
| Qwen2.5-VL-7B | 64 | 19.1 | 68.2 | 40.8 |
| InternVL3.5-8B | 64 | 24.2 | 80.2 | 52.1 |
| Molmo2-8B | 64 | 8.5 | 66.2 | 34.6 |
| StreamingVLM (Qwen2.5-VL-7B) | 1fps | 19.1 | 68.1 | 50.3 |
| Dispider | 1fps | 7.7 | 34.8 | 16.8 |
```bash
pip install -r requirements.txt

# Compute metrics on provided results
python eval/compute_metrics.py results/vcbench_gemini3flash_unified.jsonl data/vcbench_eval.jsonl
```

To evaluate your own model:

- Run your model on each query point. For offline models, use the video clip truncated to `query_time` (from `chunkedVideos/`).
- Format your results as a JSONL file, one line per question:

  ```json
  {"id": "0000", "query_times": [663.0, 665.3, 829.7], "predictions": [2, 2, 1], "gts": [2, 4, 0]}
  ```

  Use `eval/unify_results.py` to convert raw per-query-point results to this format.
- Compute metrics:

  ```bash
  python eval/compute_metrics.py your_results_unified.jsonl data/vcbench_eval.jsonl
  ```
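To show how the unified JSONL format maps onto the streaming metrics, here is a minimal sketch of per-question MoC and UDA checks over adjacent query points. This follows the metric definitions given earlier but is not the repository's implementation; `eval/compute_metrics.py` may aggregate differently:

```python
import json

def moc(predictions: list) -> float:
    """Fraction of adjacent query-point pairs where the prediction does not decrease.

    Note: MoC is only meaningful for cumulative subcategories (O2, E1, E2).
    """
    pairs = list(zip(predictions, predictions[1:]))
    return sum(b >= a for a, b in pairs) / len(pairs)

def uda(predictions: list, gts: list) -> float:
    """Fraction of adjacent pairs where the predicted change direction matches the ground truth."""
    sign = lambda x: (x > 0) - (x < 0)
    pairs = list(zip(predictions, predictions[1:], gts, gts[1:]))
    return sum(sign(p2 - p1) == sign(g2 - g1) for p1, p2, g1, g2 in pairs) / len(pairs)

# One line of the unified format (the example question from above).
line = '{"id": "0000", "query_times": [663.0, 665.3, 829.7], "predictions": [2, 2, 1], "gts": [2, 4, 0]}'
q = json.loads(line)
print(moc(q["predictions"]), uda(q["predictions"], q["gts"]))
```

For this example the model misses the 2→4 increase (wrong direction) but correctly perceives the later decrease, so both fractions land at 0.5.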
```
data/
  vcbench_eval.jsonl        # 4,576 query points (one per line, flat format)
  vcbench_data.jsonl        # 1,000 questions (grouped, with all query points)
eval/
  compute_metrics.py        # GPA / MoC / UDA computation
  unify_results.py          # convert raw results to unified format
  run_gemini.py             # Gemini-3-Flash evaluation script
  run_gpt4turbo_blind.py    # GPT-4-Turbo blind baseline script
results/
  *_unified.jsonl           # per-model results in unified format
```
- Paper link (coming soon)
- Evaluation scripts for open-source models (Qwen-VL, InternVL, etc.)
- Leaderboard
```bibtex
@article{vcbench2025,
  title={VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos},
  author={Liu, Pengyiang and Shi, Zhongyue and Hao, Hongye and Fu, Qi and Bi, Xueting and Zhang, Siwei and Hu, Xiaoyang and Wang, Zitian and Huang, Linjiang and Liu, Si},
  year={2026}
}
```

This dataset and code are released under CC BY 4.0.