FindIt

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

Eshika Khandelwal¹, Jingjing Pan³, Mingfang Zhang³, Quan Kong³, Lorenzo Garattoni⁴, and Hilde Kuehne¹,²

¹ Tübingen AI Center, University of Tübingen, ² MIT-IBM Watson AI Lab, ³ Woven by Toyota, Inc., Tokyo, Japan, ⁴ Toyota Motor Europe, Brussels, Belgium.

Code release for FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs.

FindIt evaluates the promptable bounding-box localization ability of generalist MLLMs across four task families:

Object detection (single- and multi-label) — Pascal VOC, OpenImages V7, iGround.
Referring expression detection — RefCOCO, RefCOCO+, RefCOCO-g, RefL4, D3, PhraseCut, Flickr30k Entities, Synthetic Visual Genome.
Instance detection (visual support image instead of a text query) — HR-InsDet (easy and hard) and RoboTools.
Video detection (multi-frame extension) — iGround as multi-frame object detection, RoboTools as multi-frame instance detection.

For every (model, dataset) cell, the benchmark sweeps two further axes (Sections 3.2 and 3.3 of the paper):

Bounding-box representation — xyxy, xywh, yxyx, yxhw, cxcywh, all, all-labelled, plus unconstrained which omits any format specification from the prompt.
Output format — plain text vs JSON. JSON variants additionally sweep the dictionary key (bbox, bbox_2d, coordinates, bounding_box, and class_name for multi-label queries). All paper results use the per-box structure (one JSON object per detected instance). A nested variant that collects all boxes under a single list key is also available via --json-structure nested but did not work well in practice.
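
As a rough illustration of the representation axis, the sketch below converts the five coordinate orderings to xyxy. It assumes the conventional reading of each name (xywh and yxhw as top-left corner plus size, cxcywh as center plus size); the helper to_xyxy is hypothetical and not part of the findit package, and the all, all-labelled, and unconstrained prompt variants are not covered.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_xyxy(values: Sequence[float], repr_name: str) -> Box:
    """Convert four numbers in the named ordering to (x_min, y_min, x_max, y_max).

    Assumed semantics (not confirmed by the repository):
      xyxy   -> x_min, y_min, x_max, y_max
      xywh   -> x_min, y_min, width, height
      yxyx   -> y_min, x_min, y_max, x_max
      yxhw   -> y_min, x_min, height, width
      cxcywh -> center_x, center_y, width, height
    """
    a, b, c, d = values
    if repr_name == "xyxy":
        return (a, b, c, d)
    if repr_name == "xywh":
        return (a, b, a + c, b + d)
    if repr_name == "yxyx":
        return (b, a, d, c)
    if repr_name == "yxhw":
        return (b, a, b + d, a + c)
    if repr_name == "cxcywh":
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"unsupported bbox representation: {repr_name}")
```

Under these assumptions, (10, 20, 30, 40) in xywh and (25, 40, 30, 40) in cxcywh describe the same box, (10, 20, 40, 60) in xyxy.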

The evaluation reports each model at the (bbox_repr, output_format, json_key, json_structure) combination that maximizes its average F1@0.5, chosen via the multi-stage format search described in Section 4 of the paper.
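
The final selection criterion can be sketched as an argmax over average F1@0.5. This covers only the last step, not the staged search itself; best_config and the shape of its input dictionary are hypothetical and chosen for illustration.

```python
from statistics import mean
from typing import Dict, Tuple

# (bbox_repr, output_format, json_key, json_structure)
Config = Tuple[str, str, str, str]


def best_config(f1_by_config: Dict[Config, Dict[str, float]]) -> Config:
    """Return the configuration whose F1@0.5, averaged over datasets, is highest.

    f1_by_config maps each configuration to {dataset_name: f1_at_05}.
    Ties resolve to the first configuration in insertion order.
    """
    if not f1_by_config:
        raise ValueError("no configurations to choose from")
    return max(f1_by_config, key=lambda cfg: mean(f1_by_config[cfg].values()))
```

Any numbers fed to this function in a quick check should be treated as placeholders; the paper's actual scores are not reproduced here.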

Metrics

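Predictions are matched to ground truth by Hungarian assignment with cost 1 - IoU (single-label) or 1 - IoU - 1[label matches] (multi-label). In the multi-label case a label-matching pair costs between -1 and 0 while a mismatching pair costs between 0 and 1, so label agreement dominates IoU. On the matched pairs we report: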

F1@0.5 — a prediction counts as true positive when its IoU with the matched ground-truth box is >= 0.5; multi-label queries additionally require the predicted and ground-truth labels to agree.
mIoU — mean IoU averaged over matched pairs.
Format Adherence (FA) — fraction of responses that parse as the prompted format. Non-parseable responses are scored as an empty prediction list, which lowers precision/recall.
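
The single-label matching and scoring steps can be sketched as follows. This is not the code in evaluate.py: to stay dependency-free it finds the minimum-cost assignment by brute force over permutations, which gives the same optimum as the Hungarian algorithm on small inputs but does not scale (scipy.optimize.linear_sum_assignment would be the practical choice). Boxes are assumed to be xyxy, and the per-query aggregation shown here is an assumption.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two xyxy boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(preds: Sequence[Box], gts: Sequence[Box]) -> List[Tuple[int, int]]:
    """Return (pred_idx, gt_idx) pairs minimizing the summed cost 1 - IoU."""
    if not preds or not gts:
        return []
    best_pairs: List[Tuple[int, int]] = []
    best_cost = float("inf")
    if len(preds) <= len(gts):
        candidates = (list(zip(range(len(preds)), perm))
                      for perm in permutations(range(len(gts)), len(preds)))
    else:
        candidates = (list(zip(perm, range(len(gts))))
                      for perm in permutations(range(len(preds)), len(gts)))
    for pairs in candidates:
        cost = sum(1.0 - iou(preds[p], gts[g]) for p, g in pairs)
        if cost < best_cost:
            best_cost, best_pairs = cost, pairs
    return best_pairs


def score(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5) -> Dict[str, float]:
    """Precision, recall, F1 at the IoU threshold, and mIoU over matched pairs."""
    ious = [iou(preds[p], gts[g]) for p, g in match(preds, gts)]
    tp = sum(1 for v in ious if v >= thr)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "miou": sum(ious) / len(ious) if ious else 0.0,
    }
```

With three predictions against two ground-truth boxes, where the matched pairs have IoU 1.0 and 0.5, both matches count as true positives, giving precision 2/3, recall 1, F1 of about 0.8, and mIoU 0.75. A multi-label version would subtract 1 from the cost whenever the labels agree and require label agreement for a true positive.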

No post-processing is applied to model outputs; we score what the model emits directly.
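
A minimal sketch of how Format Adherence could be computed for the per-box JSON format, assuming a response adheres only if it is a JSON list of objects that each carry four numbers under the prompted key. The strictness of this check and the names parse_json_boxes and format_adherence are assumptions; the repository's extraction logic in evaluate.py may accept or reject different cases.

```python
import json
from typing import List, Sequence, Tuple


def parse_json_boxes(response: str, key: str = "bbox_2d") -> Tuple[bool, List[List[float]]]:
    """Return (parsed_ok, boxes); any deviation yields (False, [])."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False, []
    if not isinstance(data, list):
        return False, []
    boxes: List[List[float]] = []
    for item in data:
        if not isinstance(item, dict) or key not in item:
            return False, []
        box = item[key]
        is_four_numbers = (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box)
        )
        if not is_four_numbers:
            return False, []
        boxes.append([float(v) for v in box])
    return True, boxes


def format_adherence(responses: Sequence[str], key: str = "bbox_2d") -> float:
    """Fraction of responses that parse in the prompted format."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if parse_json_boxes(r, key)[0]) / len(responses)
```

A response that fails the check contributes an empty prediction list, so it lowers recall for that query in addition to lowering FA.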

Repository layout

configs/                   # YAML configs (edit paths before use)
    config.yaml            # output dir, model identifiers
    datasets.yaml          # per-dataset image-folder paths
src/findit/                # Python package
    axes.py                # task-name builder, dataset registry, rescale modes
    promptloader.py        # prompts for every (task, fmt, bbox_repr, json_key)
    dataset.py             # parquet -> in-memory loaders, one per dataset
    prepare_datasets.py    # build the on-disk Arrow datasets lmms-eval reads
    generate_tasks.py      # emit per-variant lmms-eval task YAMLs
    evaluate.py            # bbox extraction, Hungarian matching, metrics
    doc_utils.py           # lmms-eval hooks (doc_to_visual / process_results)
    task_templates/        # shared lmms-eval template
    plugin/                # custom model wrappers
        models/
            gemma4.py            # Gemma 4 with positional <image N> binding
            glm4v_nothink.py     # GLM-4.6V with thinking disabled
            qwen3_5_nothink.py   # Qwen3.5-VL with thinking disabled
            openrouter.py        # OpenRouter-routed proprietary models
            _interleave.py       # shared helper for <image N> placeholders
scripts/
    freeze_subsets.py      # documents how subsets/*.parquet were produced
subsets/                   # 1,000-query parquets (one per dataset variant)
data/README.md             # where to obtain the source images
LICENSE
pyproject.toml

Installation

pip install -e .

This installs findit and all of its dependencies, including lmms-eval.

Datasets

subsets/*.parquet ships the 1,000-query subsets used for every result in the paper. You still need the source images. Edit configs/datasets.yaml so each images: / scenes: / support: / frames: value points to your local copy. See data/README.md for download URLs.

After paths are set, build the on-disk Arrow datasets that lmms-eval consumes:

python -m findit.prepare_datasets
# or only a subset:
python -m findit.prepare_datasets --only pascal,refcoco_test,hr_insdet_easy

Built datasets land under src/findit/built_datasets/ (gitignored).

RefCOCO/+/g and RefL4 are pulled directly from HuggingFace at build time, so no local image folder is required for those.

Running the benchmark

findit.generate_tasks emits one lmms-eval task YAML per variant of the benchmark grid (task family × bbox representation × output format × JSON key × JSON structure × frame count). Example: object detection across three datasets, which emits 6 task YAMLs (3 datasets × 2 output formats):

python -m findit.generate_tasks \
    --model qwen3_vl \
    --dataset pascal,openimages,iground \
    --fmt text,json \
    --bbox-repr xyxy \
    --json-key bbox_2d \
    --json-structure per_box \
    --out src/findit/tasks \
    --group-name FindIt_objdet_xyxy

Then run lmms-eval with the emitted group:

lmms-eval \
    --include_path src/findit/tasks \
    --tasks FindIt_objdet_xyxy \
    --model qwen3_vl \
    --output_path ./Outputs/qwen3_vl_objdet_xyxy
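
The count of 6 task YAMLs in the object detection example can be checked by enumerating the grid. The sketch below assumes that the JSON key and JSON structure axes apply only to the json output format, which is consistent with the stated count but is not confirmed by the source; enumerate_variants is a hypothetical helper, not the logic in generate_tasks.py.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Variant = Tuple[str, str, str, Optional[str], Optional[str]]


def enumerate_variants(
    datasets: Sequence[str],
    fmts: Sequence[str],
    bbox_reprs: Sequence[str],
    json_keys: Sequence[str],
    json_structures: Sequence[str],
) -> List[Variant]:
    """List (dataset, fmt, bbox_repr, json_key, json_structure) combinations."""
    variants: List[Variant] = []
    for dataset, fmt, bbox_repr in product(datasets, fmts, bbox_reprs):
        if fmt == "json":
            # Assumption: only JSON prompts vary by key and structure.
            for key, structure in product(json_keys, json_structures):
                variants.append((dataset, fmt, bbox_repr, key, structure))
        else:
            variants.append((dataset, fmt, bbox_repr, None, None))
    return variants


variants = enumerate_variants(
    datasets=["pascal", "openimages", "iground"],
    fmts=["text", "json"],
    bbox_reprs=["xyxy"],
    json_keys=["bbox_2d"],
    json_structures=["per_box"],
)
print(len(variants))  # 6
```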

To reproduce the multi-stage format search from Section 4, which uses 50 queries for open-source models and 20 for proprietary models (set through --limit in the commands below):

# Stage 1: sweep bbox representations on Pascal at the default JSON key.
python -m findit.generate_tasks \
    --model qwen3_vl --dataset pascal \
    --fmt text,json --bbox-repr all --json-key bbox_2d \
    --group-name FindIt_stage1
lmms-eval \
    --include_path src/findit/tasks \
    --tasks FindIt_stage1 \
    --model qwen3_vl \
    --output_path ./Outputs/qwen3_vl_stage1 \
    --limit 50

# Stage 2: pin the winning bbox representation from Stage 1 (e.g. xyxy) and sweep JSON keys.
python -m findit.generate_tasks \
    --model qwen3_vl --dataset pascal --append \
    --fmt json --bbox-repr xyxy --json-key all \
    --group-name FindIt_stage2
lmms-eval \
    --include_path src/findit/tasks \
    --tasks FindIt_stage2 \
    --model qwen3_vl \
    --output_path ./Outputs/qwen3_vl_stage2 \
    --limit 50

Models

The paper evaluates six open-source models — Qwen2.5-VL, Qwen3-VL, Qwen3.5-VL (both with and without reasoning), InternVL3, Gemma 4, GLM-4.6V — and three proprietary models routed through OpenRouter — GPT-5.4, Claude Sonnet 4.5, Gemini 2.5 Flash. Identifiers and rescale modes are declared in src/findit/axes.py::MODEL_RESCALE. Custom lmms-eval model wrappers live under src/findit/plugin/models/:

| Wrapper | Purpose |
| --- | --- |
| `gemma4.Gemma4` | Reuses `lmms-eval`'s Gemma3 loop with the Gemma4 HF class; patches `apply_chat_template` so `<image N>` placeholders bind positionally. |
| `glm4v_nothink.GLM4VNoThink` | GLM-4.6V with `enable_thinking=False`. |
| `qwen3_5_nothink.Qwen3_5NoThink` | Qwen3.5-VL run without the `<think>...</think>` block. |
| `openrouter.OpenRouterNoThink` | OpenRouter-routed proprietary models (Claude / GPT / Gemini); disables reasoning, JPEG-encodes images >= 2048 px on the longest side, and pre-resizes scenes >= 4096 px to keep payloads under provider size caps. |

Coordinate-space assumptions per model are recorded in MODEL_RESCALE and matched in evaluate._get_scale:

standard — model returns coordinates in a 1000x1000 normalized grid.
no_rescale — model returns pixel coordinates of the input image.
unit_rescale — model returns coordinates in [0, 1].
smart_resize — Qwen2.5-VL: pixel coordinates of the smart-resized image (max_pixels=12.8M, factor=28); we invert the resize to recover original pixel space before scoring.
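
The four modes amount to different reference sizes for the returned coordinates. The sketch below maps an xyxy box back to original pixel space under that reading. It is an illustration rather than the body of evaluate._get_scale; the function name is hypothetical, and for smart_resize the resized dimensions are taken as an input instead of being recomputed.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_original_pixels(
    box: Sequence[float],
    mode: str,
    width: int,
    height: int,
    resized_size: Optional[Tuple[int, int]] = None,
) -> Box:
    """Map model coordinates to pixels of the original width x height image.

    resized_size is (resized_width, resized_height) and is only needed for
    smart_resize; computing it is left to the caller in this sketch.
    """
    x1, y1, x2, y2 = box
    if mode == "no_rescale":
        return (float(x1), float(y1), float(x2), float(y2))
    if mode == "standard":
        ref_w, ref_h = 1000.0, 1000.0  # 1000x1000 normalized grid
    elif mode == "unit_rescale":
        ref_w, ref_h = 1.0, 1.0  # coordinates in [0, 1]
    elif mode == "smart_resize":
        if resized_size is None:
            raise ValueError("smart_resize needs the resized image size")
        ref_w, ref_h = float(resized_size[0]), float(resized_size[1])
    else:
        raise ValueError(f"unknown rescale mode: {mode}")
    return (x1 * width / ref_w, y1 * height / ref_h,
            x2 * width / ref_w, y2 * height / ref_h)
```

For a 640x480 image, the standard-mode box (500, 250, 1000, 500) and the unit_rescale box (0.5, 0.25, 1.0, 0.5) both map to (320, 120, 640, 240) in pixels.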

GLM-4.6V wraps every box in <|begin_of_box|>...<|end_of_box|> and repeats the block; evaluate.extract_*_bboxes reads only the first block, matching the paper's protocol (Section 4, "Output parsing").
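
Reading only the first block can be sketched with a non-greedy regular expression. The helper first_box_block is hypothetical and stands in for whatever evaluate.extract_*_bboxes does internally; how the repository treats a response with no box markers is not stated, so this sketch simply returns None in that case.

```python
import re
from typing import Optional

# Non-greedy match so the pattern stops at the first closing marker.
_BOX_BLOCK = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)


def first_box_block(response: str) -> Optional[str]:
    """Return the text inside the first box block, or None if there is none."""
    found = _BOX_BLOCK.search(response)
    return found.group(1).strip() if found else None


raw = (
    "<|begin_of_box|>[[10, 20, 40, 60]]<|end_of_box|>"
    "<|begin_of_box|>[[10, 20, 40, 60]]<|end_of_box|>"
)
print(first_box_block(raw))  # [[10, 20, 40, 60]]
```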

Outputs

lmms-eval writes its run logs and per-sample predictions to whatever path you pass via --output_path; the included config defaults to ./Outputs/. The five aggregated metrics (format_adherence, mean_iou, precision_at_05, recall_at_05, f1_at_05) are produced by the aggregators in findit.doc_utils and reported by lmms-eval as the run summary.

Citation

If you find this repository useful, please consider citing our work:

@article{khandelwal2026findit,
  title   = {FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal {LLMs}},
  author  = {Khandelwal, Eshika and Pan, Jingjing and Zhang, Mingfang
             and Kong, Quan and Garattoni, Lorenzo and Kuehne, Hilde},
  journal = {arXiv},
  year    = {2026}
}

If you run into any issues setting up or running the benchmark, feel free to open a GitHub issue or reach out via email.

License

CC BY-NC-SA 4.0. See LICENSE.
