VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos


VCBench is a streaming counting benchmark that repositions counting as a minimal probe for diagnosing spatial-temporal state maintenance capability in video-language models. By querying models at multiple timepoints during video playback, VCBench observes how model predictions evolve rather than checking isolated answers.

Pengyiang Liu, Zhongyue Shi, Hongye Hao, Qi Fu, Xueting Bi, Siwei Zhang, Xiaoyang Hu, Zitian Wang, Linjiang Huang, Si Liu
Institute of Artificial Intelligence, Beihang University


Task Taxonomy

VCBench decomposes state maintenance into 8 fine-grained subcategories across two dimensions:

Object Counting — tracking entities in the scene

| Subcategory | Description |
|---|---|
| O1-Snap | How many objects are visible at this moment? |
| O1-Delta | How many objects appeared in the past N seconds (change relative to window start)? |
| O2-Unique | How many different individuals have appeared so far? |
| O2-Gain | How many new individuals appeared in the past N seconds (identity tracking + time window)? |

Event Counting — tracking actions and activities

| Subcategory | Description |
|---|---|
| E1-Action | How many times has an atomic action occurred so far? |
| E1-Transit | How many scene transitions have occurred so far? |
| E2-Episode | How many activity segments have occurred so far? |
| E2-Periodic | How many complete cycles of a periodic action have occurred so far? |

Dataset

  • 406 videos from diverse sources (YouTube, ARKitScenes, ScanNet, ScanNet++, Ego4D, RoomTour3D, CODa, OmniWorld, simulation)
  • 1,000 counting questions with 4,576 streaming query points
  • 10,071 frame-by-frame annotated event occurrence and object state change moments

Download

huggingface-cli download buaaplay/VCBench --repo-type dataset --local-dir data/videos

The chunkedVideos/ directory contains 4,576 video clips (one per query point), each truncated to the query timestamp.


Evaluation Metrics

Three complementary metrics diagnose different aspects of state maintenance:

  • GPA (Gaussian Precision Accuracy): numerical precision of each prediction, scored with a Gaussian tolerance of σ = 0.05 × max(g, 1) around the ground truth g, so predictions with more than ~15% relative deviation are heavily penalized.
  • MoC (Monotonicity Consistency): whether cumulative count predictions are monotonically non-decreasing. Applies to O2, E1, E2 subcategories.
  • UDA (Update Direction Accuracy): whether the model correctly perceives the direction of state change between adjacent query points.
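Under one plausible reading of the definitions above, the three metrics can be sketched as follows for a single question. The Gaussian kernel for GPA and the pairwise-fraction scoring for MoC/UDA are assumptions for illustration, not the official implementation (see eval/compute_metrics.py for the authoritative version):

```python
import math

def gpa(pred, gt, rel=0.05):
    # Gaussian Precision Accuracy for one query point (assumed kernel):
    # exp(-(p - g)^2 / (2 * sigma^2)) with sigma = rel * max(g, 1).
    sigma = rel * max(gt, 1)
    return math.exp(-((pred - gt) ** 2) / (2 * sigma ** 2))

def moc(preds):
    # Monotonicity Consistency (assumed): fraction of adjacent
    # prediction pairs that are non-decreasing. Applies to cumulative
    # subcategories (O2, E1, E2).
    if len(preds) < 2:
        return 1.0
    pairs = list(zip(preds, preds[1:]))
    return sum(b >= a for a, b in pairs) / len(pairs)

def _sign(x):
    return (x > 0) - (x < 0)

def uda(preds, gts):
    # Update Direction Accuracy (assumed): fraction of adjacent query
    # points where the predicted direction of change (up/down/flat)
    # matches the ground-truth direction.
    pairs = [(_sign(p2 - p1), _sign(g2 - g1))
             for (p1, p2), (g1, g2) in zip(zip(preds, preds[1:]),
                                           zip(gts, gts[1:]))]
    if not pairs:
        return 1.0
    return sum(dp == dg for dp, dg in pairs) / len(pairs)
```

For example, `uda([2, 2, 1], [2, 4, 0])` scores 0.5 under this sketch: the model misses the first increase but catches the subsequent decrease.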

Main Results

Overall GPA / MoC / UDA (scaled to 0–100); Fr. is the frame sampling setting. Full per-subcategory results are in the paper.

| Model | Fr. | GPA | MoC | UDA |
|---|---|---|---|---|
| Human | – | 96.1 | 100.0 | 99.3 |
| GPT-4-Turbo (blind) | – | 18.7 | 95.7 | 4.3 |
| Gemini-3-Flash | 64 | 37.0 | 73.7 | 73.8 |
| Doubao-Seed-1.8 | 64 | 36.2 | 77.2 | 76.8 |
| Kimi-K2.5 | 64 | 26.4 | 66.8 | 73.4 |
| Qwen3-VL-30B | 64 | 24.0 | 76.9 | 70.7 |
| Qwen3-VL-8B | 64 | 31.0 | 84.3 | 55.1 |
| Qwen2.5-VL-7B | 64 | 19.1 | 68.2 | 40.8 |
| InternVL3.5-8B | 64 | 24.2 | 80.2 | 52.1 |
| Molmo2-8B | 64 | 8.5 | 66.2 | 34.6 |
| StreamingVLM (Qwen2.5-VL-7B) | 1fps | 19.1 | 68.1 | 50.3 |
| Dispider | 1fps | 7.7 | 34.8 | 16.8 |

Quick Start

pip install -r requirements.txt

# Compute metrics on provided results
python eval/compute_metrics.py results/vcbench_gemini3flash_unified.jsonl data/vcbench_eval.jsonl

Evaluate Your Own Model

  1. Run your model on each query point. For offline models, use the video clip truncated to query_time (from chunkedVideos/).

  2. Format your results as a JSONL file, one line per question:

    {"id": "0000", "query_times": [663.0, 665.3, 829.7], "predictions": [2, 2, 1], "gts": [2, 4, 0]}

    Use eval/unify_results.py to convert raw per-query-point results to this format.

  3. Compute metrics:

    python eval/compute_metrics.py your_results_unified.jsonl data/vcbench_eval.jsonl
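Step 2 above can be sketched as follows: group raw per-query-point records by question id, sort them by query time (streaming order matters for MoC/UDA), and emit one unified JSONL line per question. The raw field names here are hypothetical; eval/unify_results.py is the canonical converter.

```python
import json
from collections import defaultdict

# Hypothetical raw records, one per query point (field names assumed).
raw = [
    {"id": "0000", "query_time": 663.0, "prediction": 2, "gt": 2},
    {"id": "0000", "query_time": 829.7, "prediction": 1, "gt": 0},
    {"id": "0000", "query_time": 665.3, "prediction": 2, "gt": 4},
]

grouped = defaultdict(list)
for r in raw:
    grouped[r["id"]].append(r)

lines = []
for qid, recs in grouped.items():
    recs.sort(key=lambda r: r["query_time"])  # restore streaming order
    lines.append(json.dumps({
        "id": qid,
        "query_times": [r["query_time"] for r in recs],
        "predictions": [r["prediction"] for r in recs],
        "gts": [r["gt"] for r in recs],
    }))

print("\n".join(lines))  # write this to your_results_unified.jsonl
```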

Repository Structure

data/
  vcbench_eval.jsonl     # 4,576 query points (one per line, flat format)
  vcbench_data.jsonl     # 1,000 questions (grouped, with all query points)
eval/
  compute_metrics.py     # GPA / MoC / UDA computation
  unify_results.py       # convert raw results to unified format
  run_gemini.py          # Gemini-3-Flash evaluation script
  run_gpt4turbo_blind.py # GPT-4-Turbo blind baseline script
results/
  *_unified.jsonl        # per-model results in unified format

TODO / Coming Soon

  • Paper link (coming soon)
  • Evaluation scripts for open-source models (Qwen-VL, InternVL, etc.)
  • Leaderboard

Citation

@article{vcbench2025,
  title={VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos},
  author={Liu, Pengyiang and Shi, Zhongyue and Hao, Hongye and Fu, Qi and Bi, Xueting and Zhang, Siwei and Hu, Xiaoyang and Wang, Zitian and Huang, Linjiang and Liu, Si},
  year={2026}
}

License

This dataset and code are released under CC BY 4.0.
