FindIt

Eshika Khandelwal¹, Jingjing Pan³, Mingfang Zhang³, Quan Kong³, Lorenzo Garattoni⁴, and Hilde Kuehne¹,²

¹ Tübingen AI Center, University of Tübingen; ² MIT-IBM Watson AI Lab; ³ Woven by Toyota, Inc., Tokyo, Japan; ⁴ Toyota Motor Europe, Brussels, Belgium.

Code release for FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs.

FindIt evaluates the promptable bounding-box localization ability of generalist MLLMs across four task families:

  1. Object detection (single- and multi-label) — Pascal VOC, OpenImages V7, iGround.
  2. Referring expression detection — RefCOCO, RefCOCO+, RefCOCO-g, RefL4, D3, PhraseCut, Flickr30k Entities, Synthetic Visual Genome.
  3. Instance detection (visual support image instead of a text query) — HR-InsDet (easy and hard) and RoboTools.
  4. Video detection (multi-frame extension) — iGround as multi-frame object detection, RoboTools as multi-frame instance detection.

For every (model, dataset) cell, the benchmark sweeps two further axes (Sections 3.2 and 3.3 of the paper):

  • Bounding-box representation: xyxy, xywh, yxyx, yxhw, cxcywh, all, all-labelled, plus unconstrained, which omits any format specification from the prompt.
  • Output format: plain text vs JSON. JSON variants additionally sweep the dictionary key (bbox, bbox_2d, coordinates, bounding_box, and class_name for multi-label queries). All paper results use the per-box structure (one JSON object per detected instance). A nested variant that collects all boxes under a single list key is also available via --json-structure nested, but it did not work well in practice.
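
To make the representation names concrete, the following sketch converts each one to corner format. It is not taken from the repository: the helper name to_xyxy is hypothetical, and the field order is an assumption read off the names (xywh as top-left corner plus size, yxhw as the same with axes swapped, cxcywh as box centre plus size).

```python
def to_xyxy(values, bbox_repr):
    """Convert four numbers in the given representation to (x1, y1, x2, y2).

    Assumed field orders (not confirmed by the repository):
      xyxy   -> x1, y1, x2, y2
      xywh   -> x1, y1, width, height
      yxyx   -> y1, x1, y2, x2
      yxhw   -> y1, x1, height, width
      cxcywh -> centre x, centre y, width, height
    """
    a, b, c, d = values
    if bbox_repr == "xyxy":
        return (a, b, c, d)
    if bbox_repr == "xywh":
        return (a, b, a + c, b + d)
    if bbox_repr == "yxyx":
        return (b, a, d, c)
    if bbox_repr == "yxhw":
        return (b, a, b + d, a + c)
    if bbox_repr == "cxcywh":
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"unknown bbox representation: {bbox_repr}")
```

Under these assumptions, [10, 20, 30, 40] in xywh and [20, 10, 40, 30] in yxhw describe the same box.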

The evaluation reports each model at the (bbox_repr, output_format, json_key, json_structure) combination that maximizes its average F1@0.5, chosen via the multi-stage format search described in Section 4 of the paper.
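
The final selection step amounts to an argmax over combinations. The sketch below shows only that step, assuming the per-dataset F1@0.5 scores for each combination are already available; the function name and the input layout are hypothetical, and the multi-stage search that narrows the candidates is not reproduced here.

```python
from statistics import mean


def best_combination(f1_by_combo):
    """Return the combination with the highest mean F1@0.5.

    f1_by_combo maps a (bbox_repr, output_format, json_key, json_structure)
    tuple to a non-empty list of per-dataset F1@0.5 scores. This layout is
    an assumption made for illustration.
    """
    return max(f1_by_combo, key=lambda combo: mean(f1_by_combo[combo]))
```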

Metrics

Predictions are matched to ground truth by Hungarian assignment with cost 1 - IoU (single-label) or 1 - IoU - 1[label matches] (multi-label, so label agreement dominates IoU). On the matched pairs we report:

  • F1@0.5: a prediction counts as a true positive when its IoU with the matched ground-truth box is >= 0.5; multi-label queries additionally require the predicted and ground-truth labels to agree.
  • mIoU: mean IoU averaged over matched pairs.
  • Format Adherence (FA): fraction of responses that parse as the prompted format. Non-parseable responses are scored as an empty prediction list, which lowers precision and recall.
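
The single-label matching and scoring can be sketched for one query as follows. This is not the code in evaluate.py: to stay dependency-free it finds the minimum-cost one-to-one assignment by exhaustive search, which gives the same optimum as a Hungarian solver such as scipy.optimize.linear_sum_assignment but is only practical for a handful of boxes. The function names and the per-query scoring are assumptions; how the repository aggregates across queries is not shown.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(preds, gts):
    """Minimum-cost one-to-one matching with cost 1 - IoU.

    Exhaustive stand-in for the Hungarian algorithm; returns (pred, gt) index pairs.
    """
    def cost(pairs):
        return sum(1.0 - iou(preds[i], gts[j]) for i, j in pairs)

    if len(preds) <= len(gts):
        candidates = (list(enumerate(p))
                      for p in permutations(range(len(gts)), len(preds)))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(len(preds)), len(gts)))
    return min(candidates, key=cost)


def score(preds, gts, thr=0.5):
    """F1@thr and mIoU over the matched pairs of a single query."""
    ious = [iou(preds[i], gts[j]) for i, j in match(preds, gts)]
    tp = sum(1 for v in ious if v >= thr)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "miou": miou}
```

An empty prediction list, which is how a non-parseable response is scored, yields zero precision, recall, and F1 in this sketch.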

No post-processing is applied to model outputs; we score what the model emits directly.
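
As an illustration of strict parsing for the per-box JSON structure, the sketch below accepts a response only if it is valid JSON made of objects that carry the prompted key with four numbers, and otherwise returns an empty prediction list. The function names are hypothetical and the acceptance rules are assumptions; the actual extraction logic lives in evaluate.py and may differ.

```python
import json


def parse_per_box(response, json_key="bbox_2d"):
    """Return (boxes, adhered) for a per-box JSON response.

    A response that does not parse as the expected structure gives ([], False).
    """
    try:
        data = json.loads(response)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return [], False
    if isinstance(data, dict):
        data = [data]  # a single object is treated as a one-box list
    if not isinstance(data, list):
        return [], False
    boxes = []
    for item in data:
        if not isinstance(item, dict) or json_key not in item:
            return [], False
        box = item[json_key]
        is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
        if not (isinstance(box, list) and len(box) == 4 and all(map(is_number, box))):
            return [], False
        boxes.append(tuple(float(v) for v in box))
    return boxes, True


def format_adherence(responses, json_key="bbox_2d"):
    """Fraction of responses that parse as the prompted format."""
    if not responses:
        return 0.0
    return sum(parse_per_box(r, json_key)[1] for r in responses) / len(responses)
```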

Repository layout

configs/                   # YAML configs (edit paths before use)
    config.yaml            # output dir, model identifiers
    datasets.yaml          # per-dataset image-folder paths
src/findit/                # Python package
    axes.py                # task-name builder, dataset registry, rescale modes
    promptloader.py        # prompts for every (task, fmt, bbox_repr, json_key)
    dataset.py             # parquet -> in-memory loaders, one per dataset
    prepare_datasets.py    # build the on-disk Arrow datasets lmms-eval reads
    generate_tasks.py      # emit per-variant lmms-eval task YAMLs
    evaluate.py            # bbox extraction, Hungarian matching, metrics
    doc_utils.py           # lmms-eval hooks (doc_to_visual / process_results)
    task_templates/        # shared lmms-eval template
    plugin/                # custom model wrappers
        models/
            gemma4.py            # Gemma 4 with positional <image N> binding
            glm4v_nothink.py     # GLM-4.6V with thinking disabled
            qwen3_5_nothink.py   # Qwen3.5-VL with thinking disabled
            openrouter.py        # OpenRouter-routed proprietary models
            _interleave.py       # shared helper for <image N> placeholders
scripts/
    freeze_subsets.py      # documents how subsets/*.parquet were produced
subsets/                   # 1,000-query parquets (one per dataset variant)
data/README.md             # where to obtain the source images
LICENSE
pyproject.toml

Installation

pip install -e .

This installs findit and all dependencies including lmms-eval.

Datasets

subsets/*.parquet ships the 1,000-query subsets used for every result in the paper. You still need the source images. Edit configs/datasets.yaml so each images: / scenes: / support: / frames: value points to your local copy. See data/README.md for download URLs.

After paths are set, build the on-disk Arrow datasets that lmms-eval consumes:

python -m findit.prepare_datasets
# or only a subset:
python -m findit.prepare_datasets --only pascal,refcoco_test,hr_insdet_easy

Built datasets land under src/findit/built_datasets/ (gitignored).

RefCOCO/+/g and RefL4 are pulled directly from HuggingFace at build time, so no local image folder is required for those.

Running the benchmark

findit.generate_tasks emits one lmms-eval task YAML per variant of the benchmark grid (task family × bbox representation × output format × JSON key × JSON structure × frame count). Example: object detection across three datasets, which emits 6 task YAMLs (3 datasets × 2 output formats):

python -m findit.generate_tasks \
    --model qwen3_vl \
    --dataset pascal,openimages,iground \
    --fmt text,json \
    --bbox-repr xyxy \
    --json-key bbox_2d \
    --json-structure per_box \
    --out src/findit/tasks \
    --group-name FindIt_objdet_xyxy

Then run lmms-eval with the emitted group:

lmms-eval \
    --include_path src/findit/tasks \
    --tasks FindIt_objdet_xyxy \
    --model qwen3_vl \
    --output_path ./Outputs/qwen3_vl_objdet_xyxy
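
The variant count for the object-detection example above can be checked with a small enumeration. This sketch is independent of findit.generate_tasks: the function name is hypothetical, and it assumes that the JSON key only multiplies the JSON variants while each text variant is counted once.

```python
from itertools import product


def enumerate_variants(datasets, fmts, bbox_reprs, json_keys):
    """List (dataset, bbox_repr, fmt, json_key) tuples for a grid slice.

    Assumption: text variants ignore the JSON key, so they appear once
    with json_key set to None.
    """
    variants = []
    for dataset, bbox_repr, fmt in product(datasets, bbox_reprs, fmts):
        if fmt == "json":
            variants.extend((dataset, bbox_repr, fmt, key) for key in json_keys)
        else:
            variants.append((dataset, bbox_repr, fmt, None))
    return variants


variants = enumerate_variants(
    datasets=["pascal", "openimages", "iground"],
    fmts=["text", "json"],
    bbox_reprs=["xyxy"],
    json_keys=["bbox_2d"],
)
print(len(variants))  # 3 datasets x 2 output formats = 6
```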

To reproduce the multi-stage format search from Section 4 (50 queries for open-source models, 20 for proprietary models):

# Stage 1: sweep bbox representations on Pascal at the default JSON key.
python -m findit.generate_tasks \
    --model qwen3_vl --dataset pascal \
    --fmt text,json --bbox-repr all --json-key bbox_2d \
    --group-name FindIt_stage1
lmms-eval \
    --include_path src/findit/tasks \
    --tasks FindIt_stage1 \
    --model qwen3_vl \
    --output_path ./Outputs/qwen3_vl_stage1 \
    --limit 50

# Stage 2: pin the winning bbox representation from Stage 1 (e.g. xyxy) and sweep JSON keys.
python -m findit.generate_tasks \
    --model qwen3_vl --dataset pascal --append \
    --fmt json --bbox-repr xyxy --json-key all \
    --group-name FindIt_stage2
lmms-eval \
    --include_path src/findit/tasks \
    --tasks FindIt_stage2 \
    --model qwen3_vl \
    --output_path ./Outputs/qwen3_vl_stage2 \
    --limit 50

Models

The paper evaluates six open-source models — Qwen2.5-VL, Qwen3-VL, Qwen3.5-VL (both with and without reasoning), InternVL3, Gemma 4, GLM-4.6V — and three proprietary models routed through OpenRouter — GPT-5.4, Claude Sonnet 4.5, Gemini 2.5 Flash. Identifiers and rescale modes are declared in src/findit/axes.py::MODEL_RESCALE. Custom lmms-eval model wrappers live under src/findit/plugin/models/:

Wrapper | Purpose
gemma4.Gemma4 | Reuses lmms-eval's Gemma3 loop with the Gemma4 HF class; patches apply_chat_template so <image N> placeholders bind positionally.
glm4v_nothink.GLM4VNoThink | GLM-4.6V with enable_thinking=False.
qwen3_5_nothink.Qwen3_5NoThink | Qwen3.5-VL run without the <think>...</think> block.
openrouter.OpenRouterNoThink | OpenRouter-routed proprietary models (Claude / GPT / Gemini); disables reasoning, JPEG-encodes images >= 2048 px on the longest side, and pre-resizes scenes >= 4096 px to keep payloads under provider size caps.

Coordinate-space assumptions per model are recorded in MODEL_RESCALE and matched in evaluate._get_scale:

  • standard — model returns coordinates in a 1000x1000 normalized grid.
  • no_rescale — model returns pixel coordinates of the input image.
  • unit_rescale — model returns coordinates in [0, 1].
  • smart_resize — Qwen2.5-VL: pixel coordinates of the smart-resized image (max_pixels=12.8M, factor=28); we invert the resize to recover original pixel space before scoring.
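
The four modes reduce to a per-axis scale factor, as the sketch below illustrates. It is not the body of evaluate._get_scale: the function name and signature are hypothetical, and for smart_resize it takes the resized dimensions as an input rather than recomputing them from max_pixels and factor.

```python
def to_pixels(box_xyxy, mode, width, height, resized=None):
    """Map a model-space (x1, y1, x2, y2) box to original-image pixels.

    width and height are the original image size. resized is the
    (width, height) the model actually saw and is only used by "smart_resize".
    """
    if mode == "standard":        # 1000x1000 normalized grid
        sx, sy = width / 1000.0, height / 1000.0
    elif mode == "unit_rescale":  # coordinates in [0, 1]
        sx, sy = float(width), float(height)
    elif mode == "no_rescale":    # already in input-image pixels
        sx, sy = 1.0, 1.0
    elif mode == "smart_resize":  # pixels of the resized image
        if resized is None:
            raise ValueError("smart_resize needs the resized (width, height)")
        sx, sy = width / resized[0], height / resized[1]
    else:
        raise ValueError(f"unknown rescale mode: {mode}")
    x1, y1, x2, y2 = box_xyxy
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
```

For example, on a 640x480 image the standard-mode box (500, 250, 1000, 750) maps to roughly (320, 120, 640, 360) in pixels.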

GLM-4.6V wraps every box in <|begin_of_box|>...<|end_of_box|> and repeats the block; evaluate.extract_*_bboxes reads only the first block, matching the paper's protocol (Section 4, "Output parsing").
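
Reading only the first block can be sketched with a regular expression. The helper name is hypothetical, and the fallback of returning the response unchanged when no block is present is an assumption rather than documented behaviour of evaluate.extract_*_bboxes.

```python
import re

# Non-greedy match so that a repeated block does not get merged into the first one.
BOX_BLOCK = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)


def first_box_block(response):
    """Return the content of the first <|begin_of_box|>...<|end_of_box|> block.

    If the response contains no such block, it is returned unchanged
    (an assumed fallback).
    """
    found = BOX_BLOCK.search(response)
    return found.group(1).strip() if found else response
```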

Outputs

lmms-eval writes its run logs and per-sample predictions to whatever path you pass via --output_path; the included config defaults to ./Outputs/. The five aggregated metrics (format_adherence, mean_iou, precision_at_05, recall_at_05, f1_at_05) are produced by the aggregators in findit.doc_utils and reported by lmms-eval as the run summary.

Citation

If you find this repository useful, please consider citing our work:

@article{khandelwal2026findit,
  title   = {FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal {LLMs}},
  author  = {Khandelwal, Eshika and Pan, Jingjing and Zhang, Mingfang
             and Kong, Quan and Garattoni, Lorenzo and Kuehne, Hilde},
  journal = {arXiv},
  year    = {2026}
}

If you run into any issues setting up or running the benchmark, feel free to open a GitHub issue or reach out via email.

License

CC BY-NC-SA 4.0. See LICENSE.
