Eshika Khandelwal1, Jingjing Pan3, Mingfang Zhang3, Quan Kong3, Lorenzo Garattoni4, and Hilde Kuehne1,2
1 Tübingen AI Center, University of Tübingen, 2 MIT-IBM Watson AI Lab, 3 Woven by Toyota, Inc., Tokyo, Japan, 4 Toyota Motor Europe, Brussels, Belgium.
Code release for FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs.
FindIt evaluates the promptable bounding-box localization ability of generalist MLLMs across four task families:
- Object detection (single- and multi-label) — Pascal VOC, OpenImages V7, iGround.
- Referring expression detection — RefCOCO, RefCOCO+, RefCOCO-g, RefL4, D3, PhraseCut, Flickr30k Entities, Synthetic Visual Genome.
- Instance detection (visual support image instead of a text query) — HR-InsDet (easy and hard) and RoboTools.
- Video detection (multi-frame extension) — iGround as multi-frame object detection, RoboTools as multi-frame instance detection.
For every (model, dataset) cell, the benchmark sweeps two further axes (Sections 3.2 and 3.3 of the paper):
- Bounding-box representation: xyxy, xywh, yxyx, yxhw, cxcywh, all, all-labelled, plus unconstrained, which omits any format specification from the prompt.
- Output format: plain text vs. JSON. JSON variants additionally sweep the dictionary key (bbox, bbox_2d, coordinates, bounding_box, and class_name for multi-label queries). All paper results use the per-box structure (one JSON object per detected instance). A nested variant that collects all boxes under a single list key is also available via --json-structure nested but did not work well in practice.
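To make the coordinate orderings concrete, here is a minimal sketch that converts each single representation to corner format. The field orders listed in the comments are an assumption inferred from the representation names, not taken from the FindIt source, and `to_xyxy` is a hypothetical helper. The `all`, `all-labelled`, and `unconstrained` settings describe prompt variants rather than a coordinate layout, so they are not covered.

```python
from typing import Callable, Dict, Tuple

Box = Tuple[float, float, float, float]

# Assumed field orders (inferred from the names, not from the FindIt code):
#   xyxy   = (x_min, y_min, x_max, y_max)
#   xywh   = (x_min, y_min, width, height)
#   yxyx   = (y_min, x_min, y_max, x_max)
#   yxhw   = (y_min, x_min, height, width)
#   cxcywh = (x_center, y_center, width, height)
TO_XYXY: Dict[str, Callable[[Box], Box]] = {
    "xyxy": lambda b: (b[0], b[1], b[2], b[3]),
    "xywh": lambda b: (b[0], b[1], b[0] + b[2], b[1] + b[3]),
    "yxyx": lambda b: (b[1], b[0], b[3], b[2]),
    "yxhw": lambda b: (b[1], b[0], b[1] + b[3], b[0] + b[2]),
    "cxcywh": lambda b: (
        b[0] - b[2] / 2,
        b[1] - b[3] / 2,
        b[0] + b[2] / 2,
        b[1] + b[3] / 2,
    ),
}


def to_xyxy(box: Box, bbox_repr: str) -> Box:
    """Convert a box in the given representation to (x_min, y_min, x_max, y_max)."""
    if bbox_repr not in TO_XYXY:
        raise ValueError(f"unsupported bbox representation: {bbox_repr}")
    return TO_XYXY[bbox_repr](box)
```

Converting everything to one canonical layout before scoring keeps the IoU code independent of the prompted representation.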
The evaluation reports each model at the (bbox_repr, output_format, json_key, json_structure) combination that maximizes its average F1@0.5,
chosen via the multi-stage format search described in Section 4 of the paper.
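The selection rule can be sketched as an argmax over per-combination scores. This is only an illustration: the function name `best_combo`, the input layout, and the choice to average F1@0.5 uniformly over datasets are assumptions, and the staged search in the paper prunes the grid rather than scoring every combination.

```python
from statistics import mean
from typing import Dict, Tuple

# (bbox_repr, output_format, json_key, json_structure)
Combo = Tuple[str, str, str, str]


def best_combo(f1_by_combo: Dict[Combo, Dict[str, float]]) -> Combo:
    """Return the combination with the highest mean F1@0.5 across datasets.

    f1_by_combo maps each combination to {dataset_name: F1@0.5}.
    """
    if not f1_by_combo:
        raise ValueError("no combinations to choose from")
    return max(f1_by_combo, key=lambda c: mean(list(f1_by_combo[c].values())))


# Hypothetical scores, for illustration only.
scores = {
    ("xyxy", "json", "bbox_2d", "per_box"): {"pascal": 0.70, "refcoco_test": 0.80},
    ("xywh", "json", "bbox_2d", "per_box"): {"pascal": 0.60, "refcoco_test": 0.65},
}
chosen = best_combo(scores)
```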
Predictions are matched to ground truth by Hungarian assignment with cost
1 - IoU (single-label) or 1 - IoU - 1[label matches] (multi-label, so
label agreement dominates IoU). On the matched pairs we report:
- F1@0.5: a prediction counts as a true positive when its IoU with the matched ground-truth box is >= 0.5; multi-label queries additionally require the predicted and ground-truth labels to agree.
- mIoU: mean IoU averaged over matched pairs.
- Format Adherence (FA): fraction of responses that parse as the prompted format. Non-parseable responses are scored as an empty prediction list, which lowers precision/recall.
No post-processing is applied to model outputs; we score what the model emits directly.
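The matching and scoring described above can be illustrated for a single query with the following standard-library sketch. It is not the code in evaluate.py: for brevity it finds the minimum-cost assignment by exhaustive search over permutations, which gives the same optimum as Hungarian assignment but only scales to a handful of boxes (a real implementation would typically call `scipy.optimize.linear_sum_assignment`). The function names and the per-query precision/recall computation are assumptions; how the repository aggregates across queries is not shown here.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Det = Tuple[Box, str]                    # (box, label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(preds: Sequence[Det], gts: Sequence[Det]) -> List[Tuple[int, int]]:
    """One-to-one assignment minimizing 1 - IoU - 1[label matches].

    With a single shared label the indicator is constant, so this reduces
    to the single-label cost 1 - IoU. Returns (pred_idx, gt_idx) pairs.
    """
    if not preds or not gts:
        return []
    swap = len(preds) > len(gts)
    small, large = (gts, preds) if swap else (preds, gts)

    def cost(s: int, l: int) -> float:
        (box_s, lab_s), (box_l, lab_l) = small[s], large[l]
        return 1.0 - iou(box_s, box_l) - (1.0 if lab_s == lab_l else 0.0)

    # Exhaustive search: each element of the smaller side picks a distinct partner.
    best = min(
        permutations(range(len(large)), len(small)),
        key=lambda perm: sum(cost(s, l) for s, l in enumerate(perm)),
    )
    return [(l, s) if swap else (s, l) for s, l in enumerate(best)]


def score(preds: Sequence[Det], gts: Sequence[Det], thr: float = 0.5) -> Dict[str, float]:
    """Per-query precision, recall, F1 at the IoU threshold, and mean IoU."""
    pairs = match(preds, gts)
    ious = [iou(preds[p][0], gts[g][0]) for p, g in pairs]
    tp = sum(
        1 for (p, g), v in zip(pairs, ious)
        if v >= thr and preds[p][1] == gts[g][1]
    )
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "miou": miou}
```

Because a label match subtracts a full unit from the cost while IoU contributes at most one, a same-label pairing is preferred over a different-label pairing regardless of overlap, which is the sense in which label agreement dominates IoU.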
configs/ # YAML configs (edit paths before use)
config.yaml # output dir, model identifiers
datasets.yaml # per-dataset image-folder paths
src/findit/ # Python package
axes.py # task-name builder, dataset registry, rescale modes
promptloader.py # prompts for every (task, fmt, bbox_repr, json_key)
dataset.py # parquet -> in-memory loaders, one per dataset
prepare_datasets.py # build the on-disk Arrow datasets lmms-eval reads
generate_tasks.py # emit per-variant lmms-eval task YAMLs
evaluate.py # bbox extraction, Hungarian matching, metrics
doc_utils.py # lmms-eval hooks (doc_to_visual / process_results)
task_templates/ # shared lmms-eval template
plugin/ # custom model wrappers
models/
gemma4.py # Gemma 4 with positional <image N> binding
glm4v_nothink.py # GLM-4.6V with thinking disabled
qwen3_5_nothink.py # Qwen3.5-VL with thinking disabled
openrouter.py # OpenRouter-routed proprietary models
_interleave.py # shared helper for <image N> placeholders
scripts/
freeze_subsets.py # documents how subsets/*.parquet were produced
subsets/ # 1,000-query parquets (one per dataset variant)
data/README.md # where to obtain the source images
LICENSE
pyproject.toml
pip install -e .
This installs findit and all of its dependencies, including lmms-eval.
The subsets/*.parquet files contain the 1,000-query subsets used for every
result in the paper, but not the source images. Edit configs/datasets.yaml
so that each images: / scenes: / support: / frames: value points to your
local copy. See data/README.md for download URLs.
After paths are set, build the on-disk Arrow datasets that lmms-eval consumes:
python -m findit.prepare_datasets
# or only a subset:
python -m findit.prepare_datasets --only pascal,refcoco_test,hr_insdet_easy
Built datasets land under src/findit/built_datasets/ (gitignored).
RefCOCO/+/g and RefL4 are pulled directly from HuggingFace at build time, so no local image folder is required for those.
findit.generate_tasks emits one lmms-eval task YAML per variant of the
benchmark grid (task family x bbox representation x output format x JSON
key x JSON structure x frame count). Example: object detection across three
datasets (emits 6 task YAMLs — 3 datasets × 2 output formats):
python -m findit.generate_tasks \
--model qwen3_vl \
--dataset pascal,openimages,iground \
--fmt text,json \
--bbox-repr xyxy \
--json-key bbox_2d \
--json-structure per_box \
--out src/findit/tasks \
--group-name FindIt_objdet_xyxy
Then run lmms-eval with the emitted group:
lmms-eval \
--include_path src/findit/tasks \
--tasks FindIt_objdet_xyxy \
--model qwen3_vl \
--output_path ./Outputs/qwen3_vl_objdet_xyxy
To reproduce the multi-stage format search from Section 4 (50 queries for open-source models, 20 for proprietary models):
# Stage 1: sweep bbox representations on Pascal at the default JSON key.
python -m findit.generate_tasks \
--model qwen3_vl --dataset pascal \
--fmt text,json --bbox-repr all --json-key bbox_2d \
--group-name FindIt_stage1
lmms-eval \
--include_path src/findit/tasks \
--tasks FindIt_stage1 \
--model qwen3_vl \
--output_path ./Outputs/qwen3_vl_stage1 \
--limit 50
# Stage 2: pin the winning bbox representation from Stage 1 (e.g. xyxy) and sweep JSON keys.
python -m findit.generate_tasks \
--model qwen3_vl --dataset pascal --append \
--fmt json --bbox-repr xyxy --json-key all \
--group-name FindIt_stage2
lmms-eval \
--include_path src/findit/tasks \
--tasks FindIt_stage2 \
--model qwen3_vl \
--output_path ./Outputs/qwen3_vl_stage2 \
--limit 50
The paper evaluates six open-source models — Qwen2.5-VL, Qwen3-VL, Qwen3.5-VL
(both with and without reasoning), InternVL3, Gemma 4, GLM-4.6V — and
three proprietary models routed through OpenRouter — GPT-5.4, Claude
Sonnet 4.5, Gemini 2.5 Flash. Identifiers and rescale modes are declared in
src/findit/axes.py::MODEL_RESCALE. Custom lmms-eval model wrappers live
under src/findit/plugin/models/:
| Wrapper | Purpose |
|---|---|
| gemma4.Gemma4 | Reuses lmms-eval's Gemma3 loop with the Gemma4 HF class; patches apply_chat_template so <image N> placeholders bind positionally. |
| glm4v_nothink.GLM4VNoThink | GLM-4.6V with enable_thinking=False. |
| qwen3_5_nothink.Qwen3_5NoThink | Qwen3.5-VL run without the <think>...</think> block. |
| openrouter.OpenRouterNoThink | OpenRouter-routed proprietary models (Claude / GPT / Gemini); disables reasoning, JPEG-encodes images >= 2048 px on the longest side, and pre-resizes scenes >= 4096 px to keep payloads under provider size caps. |
Coordinate-space assumptions per model are recorded in MODEL_RESCALE and
matched in evaluate._get_scale:
- standard: model returns coordinates in a 1000x1000 normalized grid.
- no_rescale: model returns pixel coordinates of the input image.
- unit_rescale: model returns coordinates in [0, 1].
- smart_resize: Qwen2.5-VL returns pixel coordinates of the smart-resized image (max_pixels=12.8M, factor=28); we invert the resize to recover original pixel space before scoring.
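As a rough illustration of these modes, the sketch below maps a corner-format box into original pixel space. It is not the logic of evaluate._get_scale: the function name `to_pixels` and the `resized_size` argument are hypothetical, the standard mode is assumed to map 1000 to the full image extent, and for smart_resize the resized dimensions are taken as an input rather than recomputed from max_pixels and factor.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_pixels(
    box: Box,
    mode: str,
    image_size: Tuple[int, int],                     # (width, height) of the original image
    resized_size: Optional[Tuple[int, int]] = None,  # (width, height) seen by the model
) -> Box:
    """Map model-space coordinates to pixel coordinates of the original image."""
    width, height = image_size
    if mode == "standard":
        # Assumed: a 0..1000 grid stretched over the full image.
        sx, sy = width / 1000.0, height / 1000.0
    elif mode == "unit_rescale":
        sx, sy = float(width), float(height)
    elif mode == "no_rescale":
        sx, sy = 1.0, 1.0
    elif mode == "smart_resize":
        if resized_size is None:
            raise ValueError("smart_resize needs the resized (width, height)")
        resized_w, resized_h = resized_size
        # Invert the resize: scale from the resized image back to the original.
        sx, sy = width / resized_w, height / resized_h
    else:
        raise ValueError(f"unknown rescale mode: {mode}")
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

For example, `to_pixels((0.25, 0.5, 0.75, 1.0), "unit_rescale", (640, 480))` scales the x values by 640 and the y values by 480.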
GLM-4.6V wraps every box in <|begin_of_box|>...<|end_of_box|> and repeats
the block; evaluate.extract_*_bboxes reads only the first block, matching
the paper's protocol (Section 4, "Output parsing").
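The first-block rule can be sketched with a non-greedy regular expression. This is an illustrative stand-in, not the repository's evaluate.extract_*_bboxes: `first_box_block` is a hypothetical helper, and returning the response unchanged when no markers are present is an assumption.

```python
import re

# Non-greedy match so that only the first begin/end pair is captured.
_BOX_BLOCK = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)


def first_box_block(response: str) -> str:
    """Return the content of the first box block, ignoring any repeats."""
    found = _BOX_BLOCK.search(response)
    if found is None:
        # Assumed fallback: no markers, so hand the raw response to the parser.
        return response
    return found.group(1).strip()


reply = (
    "<|begin_of_box|>[[10, 20, 30, 40]]<|end_of_box|>"
    "<|begin_of_box|>[[10, 20, 30, 40]]<|end_of_box|>"
)
payload = first_box_block(reply)
```

Reading only the first block avoids counting the repeated boxes as extra predictions, which would otherwise lower precision.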
lmms-eval writes its run logs and per-sample predictions to whatever path
you pass via --output_path; the included config defaults to
./Outputs/. The five aggregated metrics
(format_adherence, mean_iou, precision_at_05, recall_at_05,
f1_at_05) are produced by the aggregators in
findit.doc_utils and reported by lmms-eval as the run summary.
If you find this repository useful, please consider citing our work:
@article{khandelwal2026findit,
title = {FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal {LLMs}},
author = {Khandelwal, Eshika and Pan, Jingjing and Zhang, Mingfang
and Kong, Quan and Garattoni, Lorenzo and Kuehne, Hilde},
journal = {arXiv},
year = {2026}
}

If you run into any issues setting up or running the benchmark, feel free to open a GitHub issue or reach out via email.
CC BY-NC-SA 4.0. See LICENSE.