VCBench is a streaming counting benchmark that repositions counting as a minimal probe for diagnosing spatial-temporal state maintenance capability in video-language models. By querying models at multiple timepoints during video playback, VCBench observes how model predictions evolve rather than checking isolated answers.
Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, Si Liu

Institute of Artificial Intelligence, Beihang University
VCBench decomposes state maintenance into 8 fine-grained subcategories across two dimensions:
Object Counting — tracking entities in the scene
| Subcategory | Description |
|---|---|
| O1-Snap | How many objects are visible at this moment? |
| O1-Delta | How many objects appeared in the past N seconds (change relative to window start)? |
| O2-Unique | How many different individuals have appeared so far? |
| O2-Gain | How many new individuals appeared in the past N seconds (identity tracking + time window)? |
Event Counting — tracking actions and activities
| Subcategory | Description |
|---|---|
| E1-Action | How many times has an atomic action occurred so far? |
| E1-Transit | How many scene transitions have occurred so far? |
| E2-Episode | How many activity segments have occurred so far? |
| E2-Periodic | How many complete cycles of a periodic action so far? |
- 406 videos from diverse sources (YouTube, ARKitScenes, ScanNet, ScanNet++, Ego4D, RoomTour3D, CODa, OmniWorld, simulation)
- 1,000 counting questions with 4,576 streaming query points
- 10,071 frame-by-frame annotated event occurrence and object state change moments
```bash
huggingface-cli download buaaplay/VCBench --repo-type dataset --local-dir data/videos
```

The `chunkedVideos/` directory contains 4,576 video clips (one per query point), each truncated to the query timestamp.
Three complementary metrics diagnose different aspects of state maintenance:
- GPA (Gaussian Precision Accuracy): numerical prediction precision, scored with a Gaussian kernel of σ = 0.05 × max(g, 1) around the ground truth g, so predictions with more than 15% relative deviation (3σ) are heavily penalized.
- MoC (Monotonicity Consistency): whether cumulative count predictions are monotonically non-decreasing across query points. Applies to the O2, E1, and E2 subcategories.
- UDA (Update Direction Accuracy): whether the model correctly perceives the direction of state change (increase, decrease, or no change) between adjacent query points.
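As a concrete illustration of the GPA definition, here is a minimal sketch of per-prediction Gaussian scoring. The kernel form `exp(-(p - g)² / (2σ²))` is an assumption (the text above specifies only σ = 0.05 × max(g, 1)); the actual `eval/compute_metrics.py` may differ in normalization or aggregation:

```python
import math

def gpa_score(pred: float, gt: float) -> float:
    """Gaussian score around the ground truth gt.

    Assumed kernel: exp(-(pred - gt)^2 / (2 * sigma^2)),
    with sigma = 0.05 * max(gt, 1) as specified by the benchmark.
    """
    sigma = 0.05 * max(gt, 1)
    return math.exp(-((pred - gt) ** 2) / (2 * sigma ** 2))

# An exact prediction scores 1.0; a 15% deviation is 3 sigma away
# and scores exp(-4.5), i.e. close to zero.
```

Under this form, a prediction of 11.5 against a ground truth of 10 (15% off) scores about 0.01, which matches the "penalize >15% relative deviation" behavior described above.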
Overall GPA / MoC / UDA scores (scaled to 0–100); full per-subcategory results are in the paper.
| Model | Frames | GPA | MoC | UDA |
|---|---|---|---|---|
| Human | — | 96.1 | 100.0 | 99.3 |
| GPT-4-Turbo (blind) | — | 18.7 | 95.7 | 4.3 |
| Gemini-3-Flash | 64 | 37.0 | 73.7 | 73.8 |
| Doubao-Seed-1.8 | 64 | 36.2 | 77.2 | 76.8 |
| Kimi-K2.5 | 64 | 26.4 | 66.8 | 73.4 |
| Qwen3-VL-30B | 64 | 24.0 | 76.9 | 70.7 |
| Qwen3-VL-8B | 64 | 31.0 | 84.3 | 55.1 |
| Qwen2.5-VL-7B | 64 | 19.1 | 68.2 | 40.8 |
| InternVL3.5-8B | 64 | 24.2 | 80.2 | 52.1 |
| Molmo2-8B | 64 | 8.5 | 66.2 | 34.6 |
| StreamingVLM (Qwen2.5-VL-7B) | 1fps | 19.1 | 68.1 | 50.3 |
| Dispider | 1fps | 7.7 | 34.8 | 16.8 |
```bash
pip install -r requirements.txt

# Compute metrics on provided results
python eval/compute_metrics.py results/vcbench_gemini3flash_unified.jsonl data/vcbench_eval.jsonl
```

To evaluate your own model:

- Run your model on each query point. For offline models, use the video clip truncated to `query_time` (from `chunkedVideos/`).
- Format your results as a JSONL file, one line per question:

  ```json
  {"id": "0000", "query_times": [663.0, 665.3, 829.7], "predictions": [2, 2, 1], "gts": [2, 4, 0]}
  ```

  Use `eval/unify_results.py` to convert raw per-query-point results to this format.
- Compute metrics:

  ```bash
  python eval/compute_metrics.py your_results_unified.jsonl data/vcbench_eval.jsonl
  ```
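To show how the unified JSONL format maps onto the streaming metrics, here is a minimal sketch of per-question MoC and UDA checks over adjacent query points. This follows the metric definitions given earlier but is not the repository's implementation; `eval/compute_metrics.py` may aggregate differently:

```python
import json

def moc(predictions: list) -> float:
    """Fraction of adjacent query-point pairs where the prediction does not decrease.

    Note: MoC is only meaningful for cumulative subcategories (O2, E1, E2).
    """
    pairs = list(zip(predictions, predictions[1:]))
    return sum(b >= a for a, b in pairs) / len(pairs)

def uda(predictions: list, gts: list) -> float:
    """Fraction of adjacent pairs where the predicted change direction matches the ground truth."""
    sign = lambda x: (x > 0) - (x < 0)
    pairs = list(zip(predictions, predictions[1:], gts, gts[1:]))
    return sum(sign(p2 - p1) == sign(g2 - g1) for p1, p2, g1, g2 in pairs) / len(pairs)

# One line of the unified format (the example question from above).
line = '{"id": "0000", "query_times": [663.0, 665.3, 829.7], "predictions": [2, 2, 1], "gts": [2, 4, 0]}'
q = json.loads(line)
print(moc(q["predictions"]), uda(q["predictions"], q["gts"]))
```

For this example the model misses the 2→4 increase (wrong direction) but correctly perceives the later decrease, so both fractions land at 0.5.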
```
data/
  vcbench_eval.jsonl        # 4,576 query points (one per line, flat format)
  vcbench_data.jsonl        # 1,000 questions (grouped, with all query points)
eval/
  compute_metrics.py        # GPA / MoC / UDA computation
  unify_results.py          # convert raw results to unified format
  run_gemini.py             # Gemini-3-Flash evaluation script
  run_gpt4turbo_blind.py    # GPT-4-Turbo blind baseline script
results/
  *_unified.jsonl           # per-model results in unified format
```
- Paper link (coming soon)
- Evaluation scripts for open-source models (Qwen-VL, InternVL, etc.)
- Leaderboard
```bibtex
@article{vcbench2025,
  title={VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos},
  author={Liu, Pengyiang and Shi, Zhongyue and Hao, Hongye and Fu, Qi and Bi, Xueting and Zhang, Siwei and Hu, Xiaoyang and Wang, Zitian and Huang, Linjiang and Liu, Si},
  year={2026}
}
```

This dataset and code are released under CC BY 4.0.