1+1 > 2 : Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning

Shida Gao · Feng Xue · Xiangfeng Wang · Anlong Ming✉
Teng Long · Yihua Shao · Haozhe Wang · Zhaowen Lin · Wei Wang · Nicu Sebe

*Equal contribution. ✉Corresponding author.

Project Page Paper PDF

This work presents DeViL, a detector-empowered MLLM that mitigates the error accumulation and exposure bias seen in previous methods. Compared with prior approaches, it delivers faster inference, fewer parameters, and higher accuracy.

📰 News

  • 2025-12-09 Our paper is now publicly available on arXiv.

πŸ“ Abstract

Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treat bounding boxes as text tokens and generate them autoregressively. However, such autoregressive spatial decoding leads to very long output sequences, causing spatial errors to accumulate over time and the localization results to progressively drift across a video. To address this, we present a Detector-Empowered Video LLM, short for DeViL, which couples a Video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within the OVD, which drives the OVD to generate temporally consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DeViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA.
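To make the RST coupling concrete, the sketch below illustrates the idea in a few lines of numpy: the MLLM emits a hidden state for the reference-semantic token, a projection maps it into the detector's text-embedding space, and the detector scores region proposals against it by cosine similarity. This is a minimal illustration only; the function names, the projection layer, and the dimensions are hypothetical and do not correspond to the actual DeViL implementation.

```python
import numpy as np

def project_rst(rst_hidden, w_proj):
    """Map the MLLM's RST hidden state into the detector's
    text-embedding space (w_proj stands in for a learned projection)."""
    return w_proj @ rst_hidden

def score_regions(region_feats, text_embed):
    """Open-vocabulary-style matching: cosine similarity between each
    region feature and the RST-derived embedding that replaces the
    detector's usual text embedding."""
    t = text_embed / np.linalg.norm(text_embed)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    return r @ t

rng = np.random.default_rng(0)
d_llm, d_det, n_regions = 8, 4, 5
rst_hidden = rng.normal(size=d_llm)           # hidden state of the RST token
w_proj = rng.normal(size=(d_det, d_llm))      # hypothetical projection layer
region_feats = rng.normal(size=(n_regions, d_det))  # per-frame region proposals

text_embed = project_rst(rst_hidden, w_proj)
scores = score_regions(region_feats, text_embed)
best = int(np.argmax(scores))  # region best matching the user query
```

Because the RST replaces the text embedding rather than acting as a mere prompt, gradients from the detection loss flow back through `w_proj` into the MLLM, which is what enables the end-to-end learning described above.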

🔎 Framework

(Framework overview figure of the DeViL model.)

Citation

If you use our work or the implementation in this repo, or find them helpful, please consider citing it as follows.

@article{gao20251+,
  title={1+1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning},
  author={Gao, Shida and Xue, Feng and Wang, Xiangfeng and Ming, Anlong and Long, Teng and Shao, Yihua and Wang, Haozhe and Lin, Zhaowen and Wang, Wei and Sebe, Nicu},
  journal={arXiv preprint arXiv:2512.06673},
  year={2025}
}

Acknowledgements

We sincerely thank the following projects for their contributions to this work:
