StreamingCoT is the first dataset explicitly designed for temporally evolving reasoning in streaming Video Question Answering (VideoQA) and multimodal Chain-of-Thought (CoT) tasks. Addressing critical limitations in current VideoQA benchmarks, StreamingCoT features:
- Dynamic temporal understanding: Captures evolving answers in video streams
- Explicit reasoning chains: Provides annotated multimodal reasoning paths
- Temporal dependency modeling: Tracks semantic evolution across video timelines
- Spatiotemporal grounding: Links reasoning steps to visual evidence
This dataset establishes a new foundation for research in streaming video understanding, complex temporal reasoning, and multimodal inference.
- 5,745 high-quality short videos (≤60 seconds)
- Global representation through stratified geographic sampling
- Rigorous multimodal filtering:
  - Social validation (>5,000 interactions)
  - Lexical density constraints
  - HD resolution (≥720p)
  - Motion dynamics analysis
  - Aesthetic scoring (≥7/10)
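Applied programmatically, the screening above might look like the following sketch. The metadata field names are hypothetical, and the lexical-density and motion-dynamics checks are omitted for brevity:

```python
# Hypothetical screening sketch: thresholds follow the criteria listed
# above, but the metadata schema itself is an assumption.

def passes_screening(video: dict) -> bool:
    """Apply the multimodal filtering criteria to one video's metadata."""
    return (
        video["duration_sec"] <= 60          # short videos only (<=60 s)
        and video["interactions"] > 5000     # social validation
        and video["height"] >= 720           # HD resolution (>=720p)
        and video["aesthetic_score"] >= 7.0  # aesthetic score out of 10
    )
```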
- Per-second dense captions aligned with visual content
- Adaptive temporal segmentation via Dynamic Semantic Fusion (DSF)
- Context-aware narration generation with inter-segment coherence
- Expert-validated semantic completeness and temporal alignment
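The Dynamic Semantic Fusion procedure itself is not specified here. As a rough sketch of adaptive temporal segmentation, one can greedily merge consecutive per-second captions whenever adjacent captions are sufficiently similar, using surface string similarity as a simple stand-in for the semantic comparison a real implementation would use:

```python
from difflib import SequenceMatcher

def segment_captions(captions: list, threshold: float = 0.6) -> list:
    """Greedily merge consecutive per-second captions into segments.

    Sketch only: string similarity substitutes for semantic embeddings.
    Returns a list of (start_sec, end_sec) tuples, end inclusive.
    """
    segments = [[0, 0]]
    for i in range(1, len(captions)):
        sim = SequenceMatcher(None, captions[i - 1], captions[i]).ratio()
        if sim >= threshold:
            segments[-1][1] = i      # same scene: extend current segment
        else:
            segments.append([i, i])  # semantic shift: start a new segment
    return [tuple(s) for s in segments]
```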
- 6 specialized question types:
  - Cumulative counting
  - Periodic pattern recognition
  - Sequential step recognition
  - State duration measurement
  - Object state recognition
  - Clue-revealing responses
- Distractor-aware option design targeting temporal misperceptions
- Human-verified temporal consistency and answer validity
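For concreteness, a question item under this design might be structured as follows. This is a purely hypothetical schema (the released field names may differ); cumulative counting is shown because it is the clearest case of an answer that evolves with the stream:

```python
# Hypothetical item schema, illustrative only.
qa_item = {
    "video_id": "VIDEO_ID",
    "question_type": "cumulative_counting",
    "question": "How many times has the ball been thrown so far?",
    "timestamps": [10, 25, 40],   # seconds at which the answer changes
    "answers": ["2", "4", "5"],   # answer valid from each timestamp on
    "options": ["2", "3", "4", "5"],
    "distractors": ["3"],         # targets a temporal misperception
}
```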
- Spatiotemporally grounded reasoning chains:
  - Temporally-aware CoT initialization
  - Key object extraction and spatial grounding
  - Multimodal reasoning fusion
- Iterative human validation protocol ensuring:
  - Spatiotemporal consistency
  - Temporal causality
  - Evidence completeness
  - Answer derivation soundness
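Two of the validation criteria above lend themselves to automatic pre-checks before human review. The sketch below assumes a hypothetical chain format (a list of steps, each with a timestamp and an evidence list) and checks only temporal causality and evidence completeness:

```python
# Sketch of machine-checkable validation over a hypothetical chain format.
def validate_chain(steps: list) -> list:
    """Return error labels for one reasoning chain (empty if it passes)."""
    errors = []
    times = [s["timestamp"] for s in steps]
    if times != sorted(times):
        errors.append("temporal_causality")     # steps must follow video order
    if any(not s.get("evidence") for s in steps):
        errors.append("evidence_completeness")  # every step needs grounding
    return errors
```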
```
StreamingCoT/
├── bbox/                       # Per-second bounding box annotations
│   └── VIDEO_ID/               # Directory per video (YouTube ID)
│       ├── sec_0_idx_48.json   # Bounding boxes at second 0 (frame 48)
│       ├── sec_1_idx_17.json   # Second 1 annotations
│       └── ...
├── final_cot/                  # Verified reasoning chains
│   ├── VIDEO_ID.jsonl          # Final CoT in JSON Lines format
│   └── ...
├── initial_cot/                # Preliminary reasoning chains
│   ├── VIDEO_ID.jsonl          # Initial CoT annotations
│   └── ...
└── key_frames/                 # Temporally significant frames
    └── VIDEO_ID/               # Directory per video
        └── metadata.json       # Key frame positions and features
```
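Given this layout, the annotations can be loaded with a few lines of Python. The helpers below assume only the `sec_<second>_idx_<frame>.json` naming and the JSON Lines format shown above; the contents of each file are left opaque:

```python
import json
import re
from pathlib import Path

# Filename pattern from the directory tree above.
NAME_RE = re.compile(r"sec_(\d+)_idx_(\d+)\.json")

def load_bboxes(root: str, video_id: str) -> dict:
    """Map each annotated second to (frame_index, parsed annotation)."""
    out = {}
    for path in sorted(Path(root, "bbox", video_id).glob("sec_*_idx_*.json")):
        m = NAME_RE.fullmatch(path.name)
        if m:
            sec, frame = int(m.group(1)), int(m.group(2))
            out[sec] = (frame, json.loads(path.read_text()))
    return out

def load_cot(root: str, video_id: str, stage: str = "final_cot") -> list:
    """Read one reasoning chain per line from a JSON Lines file."""
    path = Path(root, stage, f"{video_id}.jsonl")
    return [json.loads(line) for line in path.open() if line.strip()]
```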
Our hierarchical annotation framework:
1. Video Collection & Filtering: YouTube API → Geographic balancing → Multimodal quality screening
2. Hierarchical Captioning: Per-second captioning → Dynamic segmentation → Context-aware narration
3. Dynamic QA Generation: Question typing → Distractor design → Temporal realignment
4. Multimodal CoT Synthesis: Keyframe selection → Object grounding → Reasoning fusion
5. Iterative Validation: Expert verification → Error taxonomy → Corrective regeneration
StreamingCoT enables research in:
- Temporal reasoning in video understanding
- Multimodal chain-of-thought development
- Streaming video question answering
- Spatiotemporally grounded inference
- Dynamic distractor analysis
- Video-based logical deduction systems
The StreamingCoT dataset and construction toolkit are available at:
https://anonymous.4open.science/
StreamingCoT is released for non-commercial research purposes. All videos are sourced from YouTube and remain subject to original content creators' rights. Users must comply with YouTube's Terms of Service.
```bibtex
@article{streamingcot2024,
  title={StreamingCoT: Advancing Temporal Reasoning in VideoQA through Dynamic Multimodal Chain-of-Thought},
  author={Anonymous},
  journal={Submitted to Preprint},
  year={2024},
  note={Dataset available at \url{https://anonymous.4open.science/}}
}
```

For dataset inquiries, please open an issue on our repository or contact the maintainers through the anonymous submission portal.