🛠️ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

📚 Table of Contents

Overview
Key Features
Experimental Results
Quickstart
Acknowledgments

Overview

ToolScope is a training-free, modular framework for tool-augmented multimodal reasoning. It unifies global planning with local multimodal perception to solve complex visual question answering (VQA) tasks, mitigating visual context degradation and enabling adaptive tool use across diverse datasets. The system supports multiple LMM backends via vLLM and integrates specialized tools for retrieval, dynamic perception, and code execution.

ToolScope structures inference into three coordinated, modular phases:

Global Navigator
Performs high-level task decomposition and tool selection (from Search/Code/Perceive), providing strategic guidance to avoid redundant tool calls.
Agentic Executor
Executes iterative, tool-augmented reasoning. Dynamically invokes tools to resolve subproblems, with a dedicated Perceive tool for re-attending to visual details.
Response Synthesizer
Condenses reasoning trajectories, filters noise, and formats final answers for evaluation (concise text/numerical values/options).

Key Features

Visual Context Preservation: Dedicated Perceive tool mitigates visual context degradation in long-horizon reasoning.
- Modular Toolkit: Integrates Search (external knowledge), Code (numerical/logical computation), and Perceive (dynamic visual grounding).
- Training-Free & Plug-and-Play: No task-specific fine-tuning; works with off-the-shelf multimodal LLMs.
Multi-Backend Compatibility: Supports Qwen2.5-VL, InternVL3, MiMo-VL series; maintains performance gains across model sizes.
Broad Dataset Support: Compatible with VQA 2.0, ScienceQA, MAT-Search, and MathVista (covers general/ scientific/retrieval/mathematical VQA tasks).

Experimental Results

ToolScope delivers consistent and significant performance gains across 4 diverse VQA benchmarks (VQA 2.0, ScienceQA, MAT-Search, MathVista) and 3 major LMM backends (Qwen2.5-VL, InternVL3, MiMo-VL), outperforming all prompt-based (Direct Prompting/CoT/PoT) and tool-augmented (Naïve MRAG/Cantor) baselines.

Key highlights:

Average Performance Gain: Up to +6.69% (MiMo-VL-7B-RL), with consistent improvements across all backends (+4.81% to +6.33%).
Task-Specific Breakthroughs: Peak gain of +9.12% on MAT-Search (retrieval-heavy tasks), +4.67% on ScienceQA (scientific reasoning), and stable gains on general VQA (VQA 2.0) and mathematical reasoning (MathVista).
Cross-Model Robustness: Maintains superior performance across model sizes (from 7B to 78B) and architectures, with no task-specific fine-tuning.

Quickstart

Run end-to-end inference with your preferred model and dataset. The model_path can point to a local directory or Hugging Face Hub ID (supported by vLLM).

python -m src.run \
  --data_dir /path/to/the/dataset  # Path to your dataset directory  
  --dataset_name ScienceQA       # Supported: VQAv2/ScienceQA/MAT/MathVista  
  --split testvqa                # Dataset split (e.g., train/val/test/testmini)  
  --model_path OpenGVLab/InternVL3-8B  # LMM backend (local path or HuggingFace ID)  
  --retriever_cache_path bm25.pkl  # textual retriever cache
  --text_corpus_path /path/to/text/corpus # path to your text corpus
  --mm_retriever_path ViT-B/16  # multimodal retriever

Acknowledgments

ToolScope builds on open-source projects and datasets:

Inference engine: vLLM and Hugging Face Transformers
Retrieval & Perception: CLIP (cross-modal) and bm25s (sparse retrieval)
Datasets: VQA 2.0, ScienceQA, MAT-Search, MathVista

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assets		assets
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🛠️ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

📚 Table of Contents

Overview

Key Features

Experimental Results

Quickstart

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🛠️ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

📚 Table of Contents

Overview

Key Features

Experimental Results

Quickstart

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages