VideoSeeker is a novel agentic instance-level video understanding paradigm via native tool invocation with visual prompts.
- [2026/05/14] 🔥 We have released VideoSeeker, a novel agentic instance-level video understanding paradigm via visual prompts.
git clone https://github.com/gaotiexinqu/VideoSeeker
conda create -n llamafactory python=3.12
conda activate llamafactory
cd VideoSeeker/LLaMA-Factory/LLaMA-Factory
pip install -e .
conda create -n verl python=3.12
conda activate verl
cd VideoSeeker/verl/verl
bash scripts/install.sh
We support multi-benchmark parallel inference and evaluation on various video understanding benchmarks.
Configure your model and data paths in benchmarks.json:
{
"name": "V2P-Bench",
"root": "/path/to/V2P-Bench",
"frames_root": "$ROOT/frames",
"videos_root": "$ROOT/videos",
"dataset_info_path": "$ROOT/dataset_info_1148.json",
"media_root": "$ROOT/videos",
"tools": "view_visual_prompt",
"mode": "tool"
}
Key configuration options:
- root: Base path for the dataset
- tools: Tool type (view_visual_prompt or crop_video)
- mode: Inference mode (direct, reasoning, or tool)
- $ROOT is automatically replaced with the root value
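To make the `$ROOT` substitution and the allowed option values concrete, here is a minimal Python sketch. It is not the repository's actual loader: the function name `resolve_entry` and the validation behavior are assumptions for illustration, and it works on a single benchmark entry, since the overall layout of benchmarks.json is not shown here.

```python
import json

# Allowed values taken from the option list above.
ALLOWED_TOOLS = {"view_visual_prompt", "crop_video"}
ALLOWED_MODES = {"direct", "reasoning", "tool"}


def resolve_entry(entry: dict) -> dict:
    """Return a copy of one benchmark entry with $ROOT expanded.

    Hypothetical helper: the real inference scripts may resolve
    paths and validate options differently.
    """
    root = entry["root"]
    # Expand $ROOT in every string field; leave other types untouched.
    resolved = {
        key: value.replace("$ROOT", root) if isinstance(value, str) else value
        for key, value in entry.items()
    }
    if resolved["tools"] not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tools value: {resolved['tools']!r}")
    if resolved["mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown mode value: {resolved['mode']!r}")
    return resolved


# The example entry from above, parsed from JSON text.
EXAMPLE = json.loads("""
{
  "name": "V2P-Bench",
  "root": "/path/to/V2P-Bench",
  "frames_root": "$ROOT/frames",
  "videos_root": "$ROOT/videos",
  "dataset_info_path": "$ROOT/dataset_info_1148.json",
  "media_root": "$ROOT/videos",
  "tools": "view_visual_prompt",
  "mode": "tool"
}
""")

resolved = resolve_entry(EXAMPLE)
print(resolved["frames_root"])  # /path/to/V2P-Bench/frames
```

Running a check like this before launching inference can catch a mistyped mode or a missing root early, before any model is loaded.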
# Set your checkpoint path in run_multi_inference.sh
CKPT_PATH="/path/to/your/model"
# Run multi-benchmark inference
bash eval/inference/run_multi_inference.sh
# Calculate metrics for all benchmarks
bash eval/calu_metrics/start_all_eval.sh
# Run LLM-as-judge evaluation for LongVT benchmarks
bash eval/calu_metrics/longvt/start_judge.sh
@article{zhao2026videoseeker,
title={VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation},
author={Yiming Zhao and Yu Zeng and Wenxuan Huang and Zhen Fang and Qing Miao and Qisheng Su and Jiawei Zhao and Jiayin Cai and Lin Chen and Zehui Chen and Yukun Qi and Yao Hu and Xiaolong Jiang and Feng Zhao},
journal={arXiv preprint arXiv:2605.16079},
year={2026}
}


