This repository is the official implementation of the ECCV 2024 paper LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction.
LaMI-DETR adapts the DETR model by incorporating a frozen CLIP image encoder as the backbone.
LaMI harnesses LLMs to extract inter-category relationships. It uses this information to sample easy negative categories, which avoids overfitting to base categories, and to refine concept representations so that visually similar categories can be distinguished effectively.
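The negative-sampling idea can be sketched as follows. This is a minimal illustration, not the repository's implementation: the similarity matrix, function name, and sampling rule are all hypothetical, and in LaMI the inter-category relations come from LLM-generated visual descriptions rather than a precomputed matrix.

```python
import numpy as np

def sample_easy_negatives(sim, base_ids, num_neg, rng=None):
    """Sample negative categories that are visually dissimilar to the
    base (positive) categories, so training does not overfit to
    near-duplicates of the base classes.

    sim      : (C, C) category-to-category similarity matrix
    base_ids : indices of the base categories in the current batch
    num_neg  : how many negatives to draw
    """
    if rng is None:
        rng = np.random.default_rng()
    C = sim.shape[0]
    candidates = np.setdiff1d(np.arange(C), base_ids)
    # "easy" negatives have low maximum similarity to any base category
    hardness = sim[np.ix_(candidates, base_ids)].max(axis=1)
    probs = 1.0 - hardness
    probs = probs / probs.sum()
    return rng.choice(candidates, size=num_neg, replace=False, p=probs)
```

Categories similar to the base classes are down-weighted, so the sampled negatives are unlikely to be visual near-duplicates of the positives.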

The code is tested with python=3.9, torch=1.10.0, and cuda=11.7. Download the packed environment and extract it under your conda envs directory:

```shell
cd your_conda_envs_path
tar -xvf lami.tar
```

Then fix the pip shebang so it points at the extracted environment, and set CUDA_HOME:

```shell
vim your_conda_envs_path/lami/bin/pip
# change '#!~/.conda/envs/lami/bin/python' to '#!your_conda_envs_path/lami/bin/python'
export CUDA_HOME=/usr/local/cuda-11.7
```

Alternatively, create a conda environment, activate it, and install PyTorch following the official documentation.
For example:

```shell
conda create -n lami python=3.9
conda activate lami
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
export CUDA_HOME=/usr/local/cuda-11.7
```

Check the torch installation:
```shell
python
>>> import torch
>>> torch.cuda.is_available()
True
>>> from torch.utils.cpp_extension import CUDA_HOME
>>> CUDA_HOME
'/usr/local/cuda-11.7'
>>> exit()
```

Install detectron2 and detrex:
```shell
cd LaMI-DETR
pip install -e detectron2
pip install -e .
```

Download the MS-COCO dataset to dataset/coco.
Download and unzip the LVIS annotation to dataset/lvis.
Download and unzip the VG annotation to dataset/VisualGenome.
```
LaMI-DETR/dataset
├── coco/
│   ├── train2017/
│   └── val2017/
├── lvis
│   ├── lvis_v1_train_norare.json
│   ├── lvis_v1_val.json
│   ├── lvis_v1_minival.json
│   ├── lvis_v1_train_norare_cat_info.json
│   ├── lvis_v1_seen_classes.json
│   └── lvis_v1_all_classes.json
├── VisualGenome
│   ├── lvis_v1_all_classes.json
│   ├── lvis_v1_seen_classes.json
│   ├── vg_filter_rare_cat_info.json
│   ├── vg_filter_rare.json
│   └── images/
├── cluster
│   ├── lvis_cluster_128.npy
│   └── vg_cluster_256.npy
└── metadata
    ├── lvis_visual_desc_convnextl.npy
    ├── lvis_visual_desc_confuse_lvis_convnextl.npy
    └── concept_dict_visual_desc_convnextl.npy
```
Register the dataset paths referring to Detectron2's dataset registration:

```
detectron2/detectron2/data/datasets/builtin.py
detectron2/detectron2/data/datasets/builtin_meta.py
```

Then change "model.eval_query_path" in the config file accordingly.
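For reference, registering an LVIS-style annotation file with detectron2 looks roughly like the fragment below. This is a sketch, not a substitute for builtin.py: the dataset name is illustrative, and the repository's builtin.py defines the identifiers the configs actually expect.

```python
from detectron2.data.datasets.lvis import register_lvis_instances

register_lvis_instances(
    "lvis_v1_train_norare",  # illustrative name; check builtin.py for the real one
    {},                      # metadata, normally supplied via builtin_meta.py
    "dataset/lvis/lvis_v1_train_norare.json",
    "dataset/coco",          # LVIS images live under the COCO directories
)
```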
```
LaMI-DETR/pretrained_models
├── lami_convnext_large_12ep_lvis/
│   └── model_final.pth
├── lami_convnext_large_12ep_vg/
│   └── model_final.pth
├── lami_convnext_large_obj365_12ep.pth
├── clip_convnext_large_trans.pth
└── clip_convnext_large_head.pth
```
In the paper, we reported score-ensemble results from the p2 layer. This repository provides p3-layer results, which are generally higher. We found that the p2 and p3 layers of ConvNeXt yield similar results, but p3 is much faster, so we recommend using p3.
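The score ensemble combines the detector's per-class scores with CLIP scores computed from the chosen pyramid level. A minimal sketch of such a combination (the geometric-mean rule and the default `alpha` are illustrative assumptions, not the repository's exact formula or values):

```python
import numpy as np

def ensemble_scores(det_scores, clip_scores, alpha=0.65):
    """Geometric-mean style ensemble of detector and CLIP class scores,
    as commonly used in open-vocabulary detectors.

    det_scores, clip_scores : (N, C) per-box class probabilities
    alpha : weight of the CLIP branch (illustrative default)
    """
    det_scores = np.clip(det_scores, 1e-6, 1.0)
    clip_scores = np.clip(clip_scores, 1e-6, 1.0)
    return det_scores ** (1.0 - alpha) * clip_scores ** alpha
```

Under this rule, a higher `alpha` trusts the CLIP branch more, which typically helps on rare categories the detector never saw during training.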
| # | Training Data | Visual-desc API | Inference Data | AP | APr | Script | Init checkpoint | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| 1 | LVIS-base | gpt-3.5-turbo | LVIS | 41.6 | 43.3 | script | clip_convnext_large_trans.pth | lami_convnext_large_12ep_lvis/model_final.pth |
| 2 | VGdedup | gpt-3.5-turbo | LVIS | 35.4 | 38.8 | script | lami_convnext_large_obj365_12ep.pth | lami_convnext_large_12ep_vg/model_final.pth |
| 3 | LVIS-base | deepseek-v2-chat(MoE-236B) | LVIS | 41.2 | 41.5 | script | lami_convnext_large_obj365_12ep.pth | lami_convnext_large_12ep_lvis_deepseek/model_final.pth |
OV-LVIS:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/train_net.py --config-file lami_dino/configs/dino_convnext_large_4scale_12ep_lvis.py --num-gpus 4 --eval-only train.init_checkpoint=pretrained_models/lami_convnext_large_12ep_lvis/model_final.pth
```

Zero-shot LVIS:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python tools/train_net.py --config-file lami_dino/configs/dino_convnext_large_4scale_12ep_vg.py --num-gpus 4 --eval-only train.init_checkpoint=pretrained_models/lami_convnext_large_12ep_vg/model_final.pth
```

For quick debugging, update numpy to 1.24.0 and install lvis-debug, then comment out line 372 and uncomment line 373 in detectron2/detectron2/evaluation/lvis_evaluation.py:
```shell
pip uninstall lvis
git clone https://github.com/eternaldolphin/lvis-debug.git
cd lvis-debug
pip install -e .
cd ../
CUDA_VISIBLE_DEVICES=1 python tools/train_net.py --config-file lami_dino/configs/dino_convnext_large_4scale_12ep_lvis.py --num-gpus 1 --ddebug --eval-only
```
```shell
cd LaMI-DETR/lami_detr
python lamidetr_sam2_inference.py \
    --image ../examples/richhf/images/1.jpg \
    --visual_desc ../examples/richhf/visual_descs/1.json \
    --output ../examples/richhf/results/1_result.png
```

For batch inference:

```shell
cd LaMI-DETR/lami_detr
python lamidetr_sam2_batch.py
```

Here is an example for the richhf dataset.
We use the doubao-seed API; you need to add your own API key.
The meta.json in ./examples/richhf is copied from part of the richhf dataset.
How to run:

```shell
cd LaMI-DETR
python extract_richhf10_visual_des_doubao.py
```

OV-LVIS:
```shell
python tools/train_net.py --config-file lami_dino/configs/dino_convnext_large_4scale_12ep_lvis.py --num-gpus 8 train.init_checkpoint=pretrained_models/clip_convnext_large_trans.pth
```

Zero-shot LVIS:

```shell
python tools/train_net.py --config-file lami_dino/configs/dino_convnext_large_4scale_12ep_vg.py --num-gpus 8 train.init_checkpoint=pretrained_models/lami_convnext_large_obj365_12ep.pth
```

For quick debugging, update numpy to 1.24.0 and install lvis-debug, then comment out line 372 and uncomment line 373 in detectron2/detectron2/evaluation/lvis_evaluation.py:

```shell
CUDA_VISIBLE_DEVICES=1 python tools/train_net.py --config-file lami_dino/configs/dino_convnext_large_4scale_12ep_lvis.py --num-gpus 1 --ddebug
```

- Release inference codes.
- Release checkpoints.
- Release training codes.
- Release demo.
- Release coco and o365 inference codes.
```bibtex
@inproceedings{du2024lami,
  title={LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction},
  author={Du, Penghui and Wang, Yu and Sun, Yifan and Wang, Luting and Liao, Yue and Zhang, Gang and Ding, Errui and Wang, Yan and Wang, Jingdong and Liu, Si},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2024}
}
```
LaMI-DETR is built on detectron2 and detrex; thanks to all the contributors!