VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

VideoSeeker is a novel agentic paradigm for instance-level video understanding, built on native tool invocation with visual prompts.

🔥 News

[2026/05/14] 🔥 We have released VideoSeeker, a novel agentic instance-level video understanding paradigm via visual prompts.

Teaser

Data Pipeline

Performance

🚀 Quickstart

🔧 Environment Setup

SFT

git clone https://github.com/gaotiexinqu/VideoSeeker

conda create -n llamafactory python=3.12
conda activate llamafactory
cd VideoSeeker/LLaMA-Factory/LLaMA-Factory
pip install -e .

RL

conda create -n verl python=3.12
conda activate verl
cd VideoSeeker/verl/verl
bash scripts/install.sh

🛠️ Prepare Dataset

SFT

RL

Eval

⚡ Start Training

SFT

RL

📊 Evaluation

We support parallel inference and evaluation across multiple video understanding benchmarks.

1. Inference

Configure the data paths for each benchmark in benchmarks.json (the model checkpoint path is set separately in run_multi_inference.sh):

{
  "name": "V2P-Bench",
  "root": "/path/to/V2P-Bench",
  "frames_root": "$ROOT/frames",
  "videos_root": "$ROOT/videos",
  "dataset_info_path": "$ROOT/dataset_info_1148.json",
  "media_root": "$ROOT/videos",
  "tools": "view_visual_prompt",
  "mode": "tool"
}

Key configuration options:

root: Base path for the dataset
tools: Tool type (view_visual_prompt or crop_video)
mode: Inference mode (direct, reasoning, or tool)
$ROOT will be automatically replaced with the root value
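
The `$ROOT` substitution and the allowed option values listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual loader: the function name `expand_benchmark` and the validation step are hypothetical, and only the keys and values of the example entry come from this README.

```python
import json

# Allowed values as listed in this README; validating them here is an assumption.
ALLOWED_TOOLS = {"view_visual_prompt", "crop_video"}
ALLOWED_MODES = {"direct", "reasoning", "tool"}


def expand_benchmark(entry: dict) -> dict:
    """Return a copy of one benchmark entry with $ROOT replaced by entry["root"]."""
    root = entry["root"]
    # Substitute the placeholder in every string field; leave other types untouched.
    expanded = {
        key: value.replace("$ROOT", root) if isinstance(value, str) else value
        for key, value in entry.items()
    }
    if expanded["tools"] not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tools value: {expanded['tools']!r}")
    if expanded["mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown mode value: {expanded['mode']!r}")
    return expanded


# The example entry from this README, parsed from JSON text.
entry = json.loads("""
{
  "name": "V2P-Bench",
  "root": "/path/to/V2P-Bench",
  "frames_root": "$ROOT/frames",
  "videos_root": "$ROOT/videos",
  "dataset_info_path": "$ROOT/dataset_info_1148.json",
  "media_root": "$ROOT/videos",
  "tools": "view_visual_prompt",
  "mode": "tool"
}
""")

resolved = expand_benchmark(entry)
print(resolved["frames_root"])  # /path/to/V2P-Bench/frames
```

Running a check like this before launching inference can surface a mistyped `tools` or `mode` value early, rather than partway through a multi-benchmark run.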

# Set your checkpoint path in run_multi_inference.sh
CKPT_PATH="/path/to/your/model"

# Run multi-benchmark inference
bash eval/inference/run_multi_inference.sh

2. Evaluation

# Calculate metrics for all benchmarks
bash eval/calu_metrics/start_all_eval.sh

# Run LLM-as-judge evaluation for LongVT benchmarks
bash eval/calu_metrics/longvt/start_judge.sh

📜 Citation

@article{zhao2026videoseeker,
  title={VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation},
  author={Yiming Zhao and Yu Zeng and Wenxuan Huang and Zhen Fang and Qing Miao and Qisheng Su and Jiawei Zhao and Jiayin Cai and Lin Chen and Zehui Chen and Yukun Qi and Yao Hu and Xiaolong Jiang and Feng Zhao},
  journal={arXiv preprint arXiv:2605.16079},
  year={2026}
}
