# [NeurIPS 2025] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
🌟 StreamBridge is a simple yet powerful framework that enables offline Video-LLMs to perform effectively in streaming scenarios. It features:
- A memory buffer with round-decayed compression for long-context, multi-turn interactions (sketched below).
- A decoupled and lightweight activation model that enables proactive, timely responses without affecting the base model’s reasoning capabilities.
- A newly built dataset, Stream-IT, tailored for streaming video understanding with interleaved video-text sequences and diverse instructions.
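To give a rough feel for the round-decayed memory buffer, here is a minimal, illustrative sketch: older conversation rounds keep progressively fewer visual tokens as new rounds arrive. The class, decay schedule, and uniform subsampling below are simplifications for illustration, not the implementation in this repo; see the paper for the actual formulation.

```python
from collections import deque


class RoundDecayedMemory:
    """Illustrative only: tokens from older rounds are compressed harder."""

    def __init__(self, decay: float = 0.5, min_tokens: int = 1):
        self.rounds = deque()         # one token list per past round
        self.decay = decay            # fraction of tokens kept per step of aging
        self.min_tokens = min_tokens  # never drop an old round entirely

    def add_round(self, tokens: list):
        # Compress every existing round one step further, then append the new one.
        for i, old in enumerate(self.rounds):
            keep = max(self.min_tokens, int(len(old) * self.decay))
            stride = max(1, len(old) // keep)
            self.rounds[i] = old[::stride][:keep]  # uniform subsampling as a stand-in for merging
        self.rounds.append(list(tokens))

    def context(self) -> list:
        # Flattened context: oldest (most compressed) rounds first.
        return [tok for rnd in self.rounds for tok in rnd]


mem = RoundDecayedMemory()
for r in range(4):
    mem.add_round([f"round{r}_tok{i}" for i in range(8)])
print([len(rnd) for rnd in mem.rounds])  # [1, 2, 4, 8]: older rounds shrink
```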
> [!IMPORTANT]
> For copyright reasons, we can’t release model weights trained on YouTube or other videos that may contain IP-protected content. However, we’re open-sourcing the model implementation and the synthetic data used for training.
## Installation

- Clone this repository and navigate to the folder:

```bash
git clone https://github.com/apple/ml-streambridge
cd ml-streambridge
```

- Install the package:

```bash
conda create -n ml-streambridge python=3.10.14
conda activate ml-streambridge
pip install -e .
pip install flash-attn==2.3.3 --no-build-isolation
```

- Download checkpoints: TBD due to video copyright reasons.
- Organize as:

```
├── /your/path/to/checkpoints
│   ├── llava-onevision-qwen2-0.5b-ov-hf-seperated
│   ├── activation_0.5_ratio_anet_coin_yc2_s2s_fa_mhego_hacs_cha_et_llava-ov_epoch_5.pth
│   ├── LLaVA-OV-7B-du2e2hjxik
│   ├── Oryx-1.5-7B-jfsvkb3hn8
│   └── Qwen2-VL-7B-jh6p673iyp
```
## Run a Demo

- Update `your_weight_path` in `demo.py` to match the weight directory above.
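For reference, the edit looks something like this (the exact variable layout inside `demo.py` may differ; the path below is a placeholder):

```python
# demo.py: point this at the checkpoint directory organized above.
your_weight_path = "/your/path/to/checkpoints"
```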
- Run the demo:

```bash
python demo.py  # the activation threshold controls the response frequency
```

- You should see output like:

```
18 seconds: Pour the cooked noodles.
32 seconds: Cut the lemon.
44 seconds: Cut the olives in half.
55 seconds: Chop the parsley.
68 seconds: Squeeze the lemon juice into the measuring cup.
78 seconds: Pound the chicken.
...
```
## Evaluation

- Download the raw videos for OVO-Bench from [🤗HF] and VideoMME from [🤗HF], then reorganize the folders as follows:

```
├── /your/path/to/ovo_bench
│   ├── videos
│   ├── ovo_bench.json
│   └── ...
├── /your/path/to/videomme
│   ├── videos
│   ├── videomme.json
│   └── ...
```
- We provide OVO-Bench's `ovo_bench.json` and VideoMME's `videomme.json` in `./assets`.
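Optionally, you can sanity-check the layout before running evaluation with a snippet like the one below. The `video` field name is an assumption for illustration; inspect the json files in `./assets` for the actual schema.

```python
import json
import os

VIDEO_ROOT = "/your/path/to/ovo_bench/videos"  # placeholder path

with open("./assets/ovo_bench.json") as f:
    anns = json.load(f)

# "video" is a hypothetical field name; check the actual json schema.
missing = [a for a in anns if not os.path.exists(os.path.join(VIDEO_ROOT, a["video"]))]
print(f"{len(missing)} of {len(anns)} annotated videos are missing")
```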
- Run the evaluation script:
  - Set `ANNO_PATH` and `VIDEO_PATH` in `scripts/eval.sh` to the OVO-Bench and VideoMME paths you prepared above, and then run:

    ```bash
    bash scripts/eval.sh
    ```

  - Evaluate different models by modifying `MODEL` and `CKPT` in the script.
  - By default, 8 A100-80G GPUs are used; you can adjust `NUM_GPUS` and `MAX_IMG_TOKEN` to reduce memory usage.
- Report the results:

```bash
python eval/metric_report.py
```

- You should reproduce the results below (see our paper for more details):
| Model Name | OVO-Bench-Real-Time (OCR/ACR/ATR/STU/FPD/OJR/AVG.) | VideoMME (w/o subs) |
|---|---|---|
| Qwen2-VL-StreamBridge | 85.24/67.89/75.00/52.25/70.30/72.28/70.49 | 63.0 |
| Oryx-1.5-StreamBridge | 81.21/70.64/70.69/49.44/74.26/68.48/69.12 | 64.2 |
| LLaVA-OV-StreamBridge | 74.50/78.90/72.41/52.81/78.22/68.68/70.89 | 61.0 |
## StreamingQA-120K

- The raw 1.28 million videos of StreamingQA-120K are sourced from [🤗WebVid], [🤗InternVid], and [🤗Panda]. You can also download them from their official repos: [WebVid-10M], [InternVid-10M], [Panda-70M].
- We concatenate videos with higher similarities from these three datasets and annotate QA pairs for them. We provide the similarity-ordered json file; you can dynamically control the grouping size via `GROUP_LEN`:
```python
import json


def load_json(path):
    with open(path) as f:
        return json.load(f)


GROUP_LEN = 10  # number of similar videos concatenated per group

anns = load_json("/your/path/to/qa_groups.json")

# The annotations are similarity-ordered, so chunking consecutive indices
# groups similar videos together.
indices = list(range(len(anns)))
groups = [indices[i : i + GROUP_LEN] for i in range(0, len(indices), GROUP_LEN)]

grouped_anns = []
for group in groups:
    if len(group) != GROUP_LEN:
        continue  # drop the trailing group if it has fewer than GROUP_LEN items
    grouped_anns.append(
        {
            "video_ids": [anns[i]["video_id"] for i in group],
            "video_files": [anns[i]["video_file"] for i in group],
            "captions": [anns[i]["caption"] for i in group],
            "questions": [anns[i]["question"] for i in group],
            "answers": [anns[i]["answer"] for i in group],
            "options": [anns[i]["options"] for i in group],
            "types": [anns[i]["type"] for i in group],
        }
    )

print(grouped_anns[0])
```

## License

This software and accompanying data and models have been released under the following licenses:
- Code: Apple Sample Code License (ASCL)
- Data: CC-BY-NC-ND Deed
## Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation 📝.
```bibtex
@article{wang2025streambridge,
  title={StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant},
  author={Wang, Haibo and Feng, Bo and Lai, Zhengfeng and Xu, Mingze and Li, Shiyu and Ge, Weifeng and Dehghan, Afshin and Cao, Meng and Huang, Ping},
  journal={arXiv preprint arXiv:2505.05467},
  year={2025}
}
```
