Detic model zoo


This file documents a collection of models reported in our paper. The training time was measured on Big Basin servers with 8 NVIDIA V100 GPUs & NVLink.

How to Read the Tables

The "Name" column contains a link to the config file. To train a model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml

To evaluate a trained or pretrained model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
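
When launching many runs from the tables below, it can help to assemble these commands programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the script name `train_net.py` is passed in explicitly as an assumption about the entry point. Only the flags shown in the commands above are used.

```python
import shlex


def build_command(script, config_file, num_gpus=8, weights=None):
    """Return the argument list for a training run, or an evaluation run if weights is given."""
    cmd = ["python", script, "--num-gpus", str(num_gpus), "--config-file", config_file]
    if weights is not None:
        # --eval-only switches to evaluation; MODEL.WEIGHTS is a trailing config override.
        cmd += ["--eval-only", "MODEL.WEIGHTS", weights]
    return cmd


# Script name and paths are placeholders (assumptions), as in the commands above.
train_cmd = build_command("train_net.py", "/path/to/config/name.yaml")
eval_cmd = build_command(
    "train_net.py", "/path/to/config/name.yaml", weights="/path/to/weight.pth"
)

print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.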

Third-party ImageNet-21K Pretrained Models

Our paper uses ImageNet-21K pretrained models that are not part of Detectron2 (ResNet-50-21K from MIIL and SwinB-21K from Swin-Transformer). Before training, please download the models, place them under DETIC_ROOT/models/, and follow this tool to convert the format.

Open-vocabulary LVIS

| Name | Training time | mask mAP | mask mAP_novel | Download |
|------|---------------|----------|----------------|----------|
| Box-Supervised_C2_R50_640_4x | 17h | 30.2 | 16.4 | model |
| Detic_C2_IN-L_R50_640_4x | 22h | 32.4 | 24.9 | model |
| Detic_C2_CCimg_R50_640_4x | 22h | 31.0 | 19.8 | model |
| Detic_C2_CCcapimg_R50_640_4x | 22h | 31.0 | 21.3 | model |
| Box-Supervised_C2_SwinB_896_4x | 43h | 38.4 | 21.9 | model |
| Detic_C2_IN-L_SwinB_896_4x | 47h | 40.7 | 33.8 | model |


  • The open-vocabulary LVIS setup is LVIS without rare class annotations in training. We evaluate rare classes as novel classes in testing.

  • The models with C2 are trained using our improved LVIS baseline (Appendix D of the paper), including the CenterNet2 detector, Federated Loss, large-scale jittering, etc.

  • All models use CLIP embeddings as classifiers. This is why the box-supervised models have non-zero mAP on novel classes.

  • The models with IN-L use the overlap classes between ImageNet-21K and LVIS as image-labeled data.

  • The models with CC use Conceptual Captions. CCimg uses image labels extracted from the captions (using a naive text-match) as image-labeled data. CCcapimg additionally uses the raw captions (Appendix C of the paper).

  • The Detic models are finetuned from the corresponding Box-Supervised models above (indicated by MODEL.WEIGHTS in the config files). Please train or download the Box-Supervised model and place it under DETIC_ROOT/models/ before training the Detic models.
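
To illustrate why a classifier built from text embeddings can score classes that never had box annotations, here is a simplified sketch. It is not the repository's implementation: the 3-dimensional vectors are made-up stand-ins for CLIP text embeddings, `classify` is a hypothetical helper, and a real detector would add learned projections and a calibrated score rather than raw cosine similarity.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def classify(region_feature, class_embeddings):
    """Score a region feature against each class embedding by cosine similarity."""
    f = l2_normalize(region_feature)
    scores = {}
    for name, emb in class_embeddings.items():
        e = l2_normalize(emb)
        scores[name] = sum(a * b for a, b in zip(f, e))
    return scores


# Toy vectors standing in for CLIP text embeddings (hypothetical values).
train_vocab = {"cat": [1.0, 0.1, 0.0], "dog": [0.9, 0.3, 0.1]}

# A "novel" class has no box annotations, but it still has a text embedding,
# so it can be appended to the classifier at test time.
test_vocab = dict(train_vocab, zebra=[0.0, 0.2, 1.0])

region = [0.1, 0.2, 0.9]  # hypothetical region feature
scores = classify(region, test_vocab)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the classifier is just a table of embeddings, extending the vocabulary amounts to adding rows; no detector weights change.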

Standard LVIS

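| Name | Training time | mask mAP | mask mAP_rare | Download |
|------|---------------|----------|---------------|----------|
| Box-Supervised_C2_R50_640_4x | 17h | 31.5 | 25.6 | model |
| Detic_C2_R50_640_4x | 22h | 33.2 | 29.7 | model |
| Box-Supervised_C2_SwinB_896_4x | 43h | 40.7 | 35.9 | model |
| Detic_C2_SwinB_896_4x | 47h | 41.7 | 41.7 | model |

| Name | Training time | box mAP | box mAP_rare | Download |
|------|---------------|---------|--------------|----------|
| Box-Supervised_DeformDETR_R50_4x | 31h | 31.7 | 21.4 | model |
| Detic_DeformDETR_R50_4x | 47h | 32.5 | 26.2 | model |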


  • All Detic models use the overlap classes between ImageNet-21K and LVIS as image-labeled data.

  • The models with C2 are trained using our improved LVIS baseline in the paper, including the CenterNet2 detector, Federated Loss, large-scale jittering, etc.

  • The models with DeformDETR are Deformable DETR models. We train the models with Federated Loss.

Open-vocabulary COCO

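| Name | Training time | box mAP50 | box mAP50_novel | Download |
|------|---------------|-----------|-----------------|----------|
| BoxSup_CLIP_R50_1x | 12h | 39.3 | 1.3 | model |
| Detic_CLIP_R50_1x_image | 13h | 44.7 | 24.1 | model |
| Detic_CLIP_R50_1x_caption | 16h | 43.8 | 21.0 | model |
| Detic_CLIP_R50_1x_caption-image | 16h | 45.0 | 27.8 | model |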


  • All models are trained with ResNet50-C4 without multi-scale augmentation. All models use CLIP embeddings as the classifier.

  • We extract class names from COCO-captions as image-labels. Detic_CLIP_R50_1x_image uses the max-size loss; Detic_CLIP_R50_1x_caption directly uses CLIP caption embedding within each mini-batch for classification; Detic_CLIP_R50_1x_caption-image uses both losses.

  • We report box mAP50 under the "generalized" open-vocabulary setting.
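
Extracting class names from captions as image labels can be sketched as a simple text match. The matching rule below (case-insensitive, whole-word) is an assumption for illustration, and `extract_image_labels` is a hypothetical helper; the exact procedure used for the released models may differ.

```python
import re


def extract_image_labels(caption, class_names):
    """Return the class names that occur as whole words or phrases in the caption."""
    text = caption.lower()
    labels = []
    for name in class_names:
        # \b anchors prevent "cat" from matching inside "catalogs".
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, text):
            labels.append(name)
    return labels


class_names = ["person", "dog", "hot dog", "cat"]  # illustrative subset
caption = "A dog chasing a cat past a person"
print(extract_image_labels(caption, class_names))
```

A match this naive has known failure modes: a caption mentioning "hot dog" would also yield the label "dog", and plurals such as "dogs" are missed.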

Cross-dataset evaluation

| Name | Training time | Objects365 box mAP | OpenImages box mAP50 | Download |
|------|---------------|--------------------|----------------------|----------|
| Box-Supervised_C2_SwinB_896_4x | 43h | 19.1 | 46.2 | model |
| Detic_C2_SwinB_896_4x | 47h | 21.2 | 53.0 | model |
| Detic_C2_SwinB_896_4x_IN-21K | 47h | 21.4 | 55.2 | model |
| Box-Supervised_C2_SwinB_896_4x+COCO | 43h | 19.7 | 46.4 | model |
| Detic_C2_SwinB_896_4x_IN-21K+COCO | 47h | 21.6 | 54.6 | model |


  • Box-Supervised_C2_SwinB_896_4x and Detic_C2_SwinB_896_4x are the same models as in the Standard LVIS section, but evaluated with the Objects365/OpenImages vocabulary (i.e., CLIP embeddings of the corresponding class names as the classifier). To run the evaluation on Objects365/OpenImages, run

    python train_net.py --num-gpus 8 --config-file configs/Detic_C2_SwinB_896_4x.yaml --eval-only DATASETS.TEST "('oid_val_expanded','objects365_v2_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/oid_clip_a+cname.npy','datasets/metadata/o365_clip_a+cnamefix.npy',)" MODEL.TEST_NUM_CLASSES "(500,365)" MODEL.MASK_ON False

  • Detic_C2_SwinB_896_4x_IN-21K trains on the full ImageNet-21K. We additionally use dynamic class sampling ("Modified Federated Loss" in Section 4.4) and a larger data sampling ratio of ImageNet images (1:16 instead of 1:4).

  • Detic_C2_SwinB_896_4x_IN-21K+COCO is a model trained on combined LVIS-COCO and ImageNet-21K, intended for demos. LVIS models do not detect persons well because of the LVIS federated annotation protocol. LVIS+COCO models give better visual results.
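
In the cross-dataset evaluation command above, the test datasets, classifier files, and class counts appear to be paired by position, so the three tuples must stay aligned. A small sketch of generating the overrides from one list of triples follows. The helper names are hypothetical, the positional pairing is an assumption read off the command, and only config keys that appear in the command are used.

```python
def tuple_literal(items):
    """Format strings as the quoted tuple used in the command, e.g. ('a','b',)."""
    return "(" + "".join(f"'{item}'," for item in items) + ")"


def int_tuple_literal(values):
    """Format integers as (500,365); a single value needs a trailing comma: (1203,)."""
    body = ",".join(str(v) for v in values)
    return "(" + body + ("," if len(values) == 1 else "") + ")"


def eval_overrides(test_sets):
    """Build config overrides from (dataset_name, classifier_path, num_classes) triples."""
    names = [t[0] for t in test_sets]
    classifiers = [t[1] for t in test_sets]
    num_classes = [t[2] for t in test_sets]
    return [
        "DATASETS.TEST", tuple_literal(names),
        "MODEL.RESET_CLS_TESTS", "True",
        "MODEL.TEST_CLASSIFIERS", tuple_literal(classifiers),
        "MODEL.TEST_NUM_CLASSES", int_tuple_literal(num_classes),
        "MODEL.MASK_ON", "False",
    ]


# Values copied from the Objects365/OpenImages command above.
overrides = eval_overrides([
    ("oid_val_expanded", "datasets/metadata/oid_clip_a+cname.npy", 500),
    ("objects365_v2_val", "datasets/metadata/o365_clip_a+cnamefix.npy", 365),
])
print(" ".join(overrides))
```

Keeping each dataset's name, classifier file, and class count in one triple makes it harder to reorder one tuple without the others.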

Real-time models

| Name | Run time (ms) | LVIS box mAP | Download |
|------|---------------|--------------|----------|
| Detic_C2_SwinB_896_4x_IN-21K+COCO (800x1333, no threshold) | 115 | 44.4 | model |
| Detic_C2_SwinB_896_4x_IN-21K+COCO | 46 | 35.0 | model |
| Detic_C2_ConvNeXtT_896_4x_IN-21K+COCO | 26 | 30.7 | model |
| Detic_C2_R5021k_896_4x_IN-21K+COCO | 23 | 29.0 | model |
| Detic_C2_R18_896_4x_IN-21K+COCO | 18 | 22.1 | model |

  • Detic_C2_SwinB_896_4x_IN-21K+COCO (800x1333, thresh 0.02) is the entry in the Cross-dataset evaluation section, run without the mask head. All other entries use a max-size of 640 and an output score threshold of 0.3, using the following command (e.g., with R50):

    python train_net.py --config-file configs/Detic_LCOCOI21k_CLIP_R5021k_640b32_4x_ft4x_max-size.yaml --num-gpus 2 --eval-only DATASETS.TEST "('lvis_v1_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/lvis_v1_clip_a+cname.npy',)" MODEL.TEST_NUM_CLASSES "(1203,)" MODEL.MASK_ON False MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_R5021k_640b32_4x_ft4x_max-size.pth INPUT.MIN_SIZE_TEST 640 INPUT.MAX_SIZE_TEST 640 MODEL.ROI_HEADS.SCORE_THRESH_TEST 0.3

  • All models are trained using the same training recipe except for different backbones.

  • The ConvNeXtT and Res50 models are initialized from their corresponding ImageNet-21K pretrained models. The Res18 model is initialized from its ImageNet-1K pretrained model.

  • The runtimes are measured on a local workstation with a Titan RTX GPU.