Available Architectures

The following is a list of all available architectures and their pre-trained parameters, with brief descriptions and references, classified into network families. Unless otherwise mentioned, the pre-trained parameters were learned through supervised training.

CaiT
ConvMixer
ConvNeXt
ConvNeXt V2
DaViT
EfficientNetV2
GC ViT
HorNet
MaxViT
NesT
PiT
PVT V2
RegNet
ResNet
ResNet V2
ResNet-D
ResNet-T
Wide ResNet
ResNeXt
SENet
ECANet
ResNet-RS
SKNet
ResNeSt
Swin
Swin-S3
VAN
VGG
ViT
DeiT 3
BEiT
XCiT

CaiT

Class attention image transformer (CaiT) from Going deeper with Image Transformers by Touvron et al. CaiT presents two novel modules, LayerScale and class attention, that enable ViTs to go significantly deeper with little saturation in accuracy at greater depths.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
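
The two modules are compact enough to sketch directly. Below is a minimal, single-head, unbatched JAX illustration; the parameter names (gamma, w_q, w_k, w_v) are placeholders, not the library's actual API.

```python
import jax
import jax.numpy as jnp


def layer_scale(branch_out, gamma):
    """LayerScale: multiply a residual branch's output by a learnable
    per-channel factor gamma (initialized to a small value such as 1e-5)
    before adding it back to the skip connection."""
    return gamma * branch_out  # gamma: (dim,), branch_out: (tokens, dim)


def class_attention(cls_token, patch_tokens, w_q, w_k, w_v):
    """Class attention: only the class token emits a query, while the class
    token plus every patch token supplies keys and values, so the cost is
    linear rather than quadratic in the number of patches."""
    tokens = jnp.concatenate([cls_token, patch_tokens], axis=0)  # (1 + n, dim)
    q = cls_token @ w_q                                          # (1, dim)
    k, v = tokens @ w_k, tokens @ w_v                            # (1 + n, dim)
    weights = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)
    return weights @ v                                           # (1, dim)
```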

cait_xxsmall24: CaiT-XXSmall, depth 24.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.

cait_xxsmall36: CaiT-XXSmall, depth 36.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.

cait_small24: CaiT-Small, depth 24.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.

cait_xsmall24: CaiT-XSmall, depth 24.

  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.

cait_small36: CaiT-Small, depth 36.

  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.

cait_medium36: CaiT-Medium, depth 36.

  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.

cait_medium48: CaiT-Medium, depth 48.

  • in1k_448: Trained on ImageNet1K at resolution 448 x 448.

ConvMixer

ConvMixer from Patches Are All You Need? by Trockman et al. ConvMixer is similar to isotropic architectures like ViT but uses convolutions with large kernel sizes to perform token mixing.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
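
A single ConvMixer block can be sketched in a few lines of JAX; the version below omits the batch normalization that follows each convolution, and the kernel arguments are illustrative placeholders.

```python
import jax


def convmixer_block(x, depthwise_kernel, pointwise_kernel):
    """One ConvMixer block, minus normalization: a large-kernel depthwise
    convolution mixes tokens spatially (with a residual connection), then a
    1 x 1 pointwise convolution mixes channels. x is NHWC, depthwise_kernel
    is (k, k, 1, C), and pointwise_kernel is (1, 1, C, C)."""
    dims = ('NHWC', 'HWIO', 'NHWC')
    spatial = jax.lax.conv_general_dilated(
        x, depthwise_kernel, (1, 1), 'SAME', dimension_numbers=dims,
        feature_group_count=x.shape[-1])               # token mixing
    x = x + jax.nn.gelu(spatial)                       # residual connection
    channel = jax.lax.conv_general_dilated(
        x, pointwise_kernel, (1, 1), 'SAME', dimension_numbers=dims)
    return jax.nn.gelu(channel)                        # channel mixing
```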

convmixer20_1024d_patch14_kernel9: ConvMixer, depth 20, token dimension 1024, patch size 14 x 14, kernel size 9 x 9.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

convmixer20_1536d_patch7_kernel9: ConvMixer, depth 20, token dimension 1536, patch size 7 x 7, kernel size 9 x 9.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

convmixer32_768d_patch7_kernel7: ConvMixer, depth 32, token dimension 768, patch size 7 x 7, kernel size 7 x 7.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ConvNeXt

ConvNeXt from A ConvNet for the 2020s by Liu et al. ConvNeXt borrows ideas from the vision transformer literature, such as larger kernel sizes and more aggressive downsampling in the stem, to modernize a plain ResNet and attain results on par with state-of-the-art vision transformers like Swin using a purely convolutional network.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
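
As a rough illustration of the resulting block design, here is a minimal JAX sketch of one ConvNeXt block, with the layer normalization simplified to a plain channel-wise normalization and all parameters passed in as placeholders.

```python
import jax
import jax.numpy as jnp


def convnext_block(x, dw_kernel, w_expand, w_project, gamma, eps=1e-6):
    """One ConvNeXt block: a 7 x 7 depthwise convolution for spatial mixing,
    channel-wise normalization, an inverted-bottleneck MLP (expand 4x, GELU,
    project back), LayerScale, and a residual connection. x is NHWC, dw_kernel
    is (7, 7, 1, C), w_expand is (C, 4C), w_project is (4C, C), gamma is (C,)."""
    y = jax.lax.conv_general_dilated(
        x, dw_kernel, (1, 1), 'SAME',
        dimension_numbers=('NHWC', 'HWIO', 'NHWC'),
        feature_group_count=x.shape[-1])
    y = (y - y.mean(axis=-1, keepdims=True)) / jnp.sqrt(
        y.var(axis=-1, keepdims=True) + eps)           # simplified LayerNorm
    y = jax.nn.gelu(y @ w_expand) @ w_project          # pointwise MLP
    return x + gamma * y                               # LayerScale + residual
```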

convnext_atto: ConvNeXt-Atto.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

convnext_femto: ConvNeXt-Femto.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

convnext_pico: ConvNeXt-Pico.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

convnext_nano: ConvNeXt-Nano.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in12k_224: Trained on ImageNet12K at resolution 224 x 224.
  • in12k_ft_in1k_224: Pre-trained on ImageNet12K and fine-tuned on ImageNet1K at resolution 224 x 224.

convnext_tiny: ConvNeXt-Tiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in12k_224: Trained on ImageNet12K at resolution 224 x 224.
  • in12k_ft_in1k_224: Pre-trained on ImageNet12K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in12k_ft_in1k_384: Pre-trained on ImageNet12K and fine-tuned on ImageNet1K at resolution 384 x 384.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

convnext_small: ConvNeXt-Small.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in12k_224: Trained on ImageNet12K at resolution 224 x 224.
  • in12k_ft_in1k_224: Pre-trained on ImageNet12K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in12k_ft_in1k_384: Pre-trained on ImageNet12K and fine-tuned on ImageNet1K at resolution 384 x 384.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

convnext_base: ConvNeXt-Base.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.
  • clip_laion2b_256: Trained using CLIP on LAION-2B at resolution 256 x 256.
  • clip_laion2b_augreg_256: Trained using CLIP on LAION-2B with additional augmentation & regularization at resolution 256 x 256.
  • clip_laiona_256: Trained using CLIP on LAION-Aesthetics at resolution 256 x 256.
  • clip_laiona_320: Trained using CLIP on LAION-Aesthetics at resolution 320 x 320.
  • clip_laiona_augreg_320: Trained using CLIP on LAION-Aesthetics with additional augmentation & regularization at resolution 320 x 320.
  • clip_laion2b_augreg_ft_in1k_256: Pre-trained using CLIP on LAION-2B with additional augmentation & regularization and fine-tuned on ImageNet1K at resolution 256 x 256.
  • clip_laiona_augreg_ft_in1k_384: Pre-trained using CLIP on LAION-Aesthetics with additional augmentation & regularization and fine-tuned on ImageNet1K at resolution 384 x 384.

convnext_large: ConvNeXt-Large.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

convnext_large_mlp: ConvNeXt-Large with an MLP in the head.

  • clip_laion2b_augreg_256: Trained using CLIP on LAION-2B with additional augmentation & regularization at resolution 256 x 256.
  • clip_laion2b_augreg_256_ft_320: Trained using CLIP on LAION-2B with additional augmentation & regularization at resolution 256 x 256 and fine-tuned at resolution 320 x 320.
  • clip_laion2b_soup_augreg_256_ft_320: A soup (i.e., parameters averaged) of 3 models trained using CLIP on LAION-2B with additional augmentation & regularization at resolution 256 x 256 and fine-tuned at resolution 320 x 320.
  • clip_laion2b_augreg_ft_in1k_256: Pre-trained using CLIP on LAION-2B with additional augmentation & regularization and fine-tuned on ImageNet1K at resolution 256 x 256.
  • clip_laion2b_augreg_ft_in1k_384: Pre-trained using CLIP on LAION-2B with additional augmentation & regularization and fine-tuned on ImageNet1K at resolution 384 x 384.

convnext_xlarge: ConvNeXt-XLarge.

  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

convnext_xxlarge: ConvNeXt-XXLarge.

  • clip_laion2b_rewind_256: Trained using CLIP on LAION-2B at resolution 256 x 256, with the last 10% of training rewound and resumed with slightly different settings.
  • clip_laion2b_soup_256: Trained using CLIP on LAION-2B at resolution 256 x 256, a soup (i.e., parameters averaged) of the original and rewound training runs.

ConvNeXt V2

ConvNeXt V2 from ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders by Woo et al. ConvNeXt V2 is based on ConvNeXt but is trained using a fully convolutional masked autoencoder training scheme (FCMAE) and also incorporates global response normalization (GRN), a normalization module that abates inter-channel feature redundancies and is particularly important for effective learning with FCMAE.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
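
GRN itself is small enough to sketch directly; the version below is a minimal per-image JAX illustration with placeholder gamma/beta parameters.

```python
import jax.numpy as jnp


def global_response_norm(x, gamma, beta, eps=1e-6):
    """Global response normalization (GRN): compute a global L2 norm for each
    channel, normalize it by the mean norm across channels to obtain a
    relative importance score, and use that score to recalibrate the features,
    with learnable per-channel gamma/beta and a residual connection.
    x is (H, W, C); gamma and beta are (C,)."""
    norms = jnp.linalg.norm(x, axis=(0, 1))        # (C,) global feature aggregation
    scores = norms / (norms.mean() + eps)          # (C,) divisive normalization
    return gamma * (x * scores) + beta + x         # feature recalibration
```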

convnextv2_atto: ConvNeXtV2-Atto.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

convnextv2_femto: ConvNeXtV2-Femto.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

convnextv2_pico: ConvNeXtV2-Pico.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

convnextv2_nano: ConvNeXtV2-Nano.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_384: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 384 x 384.

convnextv2_tiny: ConvNeXtV2-Tiny.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_384: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 384 x 384.

convnextv2_base: ConvNeXtV2-Base.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_384: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 384 x 384.

convnextv2_large: ConvNeXtV2-Large.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_384: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 384 x 384.

convnextv2_huge: ConvNeXtV2-Huge.

  • fcmae_in1k_224: Trained using FCMAE with no labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in1k_ft_in1k_224: Pre-trained using FCMAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • fcmae_in22k_ft_in22k_ft_in1k_384: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 384 x 384.
  • fcmae_in22k_ft_in22k_ft_in1k_512: Pre-trained using FCMAE with no labels on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 512 x 512.

DaViT

DaViT from DaViT: Dual Attention Vision Transformers by Ding et al. DaViT captures local spatial interactions through window attention and uses channel self-attention to model global interactions while maintaining linear complexity.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

davit_tiny: DaViT-Tiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

davit_small: DaViT-Small.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

davit_base: DaViT-Base.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

EfficientNetV2

EfficientNetV2 from EfficientNetV2: Smaller Models and Faster Training by Tan et al. EfficientNetV2 builds on EfficientNet but refines the architecture search space by introducing fused MBConv and takes training speed into account during the search, among other improvements.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

efficientnetv2_small: EfficientNetV2-Small.

  • in1k_300: Trained on ImageNet1K at resolution 300 x 300.
  • in22k_300: Trained on ImageNet22K at resolution 300 x 300.
  • in22k_ft_in1k_300: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 300 x 300.

efficientnetv2_medium: EfficientNetV2-Medium.

  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in22k_384: Trained on ImageNet22K at resolution 384 x 384.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

efficientnetv2_large: EfficientNetV2-Large.

  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in22k_384: Trained on ImageNet22K at resolution 384 x 384.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

efficientnetv2_xlarge: EfficientNetV2-XLarge.

  • in22k_384: Trained on ImageNet22K at resolution 384 x 384.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

GC ViT

Global context vision transformer (GC ViT) from Global Context Vision Transformers by Hatamizadeh et al. GC ViT's core contribution is global context attention, where global queries are matched against local keys and values to calculate global spatial relationships, which works in tandem with window attention to efficiently model both long- and short-range interactions without the need for sophisticated tricks like window shifting.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

gcvit_xxtiny: GCViT-XXTiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

gcvit_xtiny: GCViT-XTiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

gcvit_tiny: GCViT-Tiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

gcvit_small: GCViT-Small.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

gcvit_base: GCViT-Base.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

HorNet

High-order spatial interaction network (HorNet) from HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions by Rao et al. HorNet uses recursive gated convolutions (g^n convolutions) to model long-range spatial interactions in the input via convolutions.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

hornet_tiny: HorNet-Tiny, kernel size 7 x 7.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

hornet_small: HorNet-Small, kernel size 7 x 7.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

hornet_base: HorNet-Base, kernel size 7 x 7.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

hornet_large: HorNet-Large, kernel size 7 x 7.

  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.

MaxViT

Multi-axis vision transformer (MaxViT) from MaxViT: Multi-Axis Vision Transformer by Tu et al. MaxViT combines dilated global attention, a method for efficiently expressing global interactions, with MBConv and window attention for an architecture that is able to model global and local spatial relationships and can be scaled to high-resolution input sizes.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

maxvit_tiny: MaxViT-Tiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in1k_512: Trained on ImageNet1K at resolution 512 x 512.

maxvit_small: MaxViT-Small.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in1k_512: Trained on ImageNet1K at resolution 512 x 512.

maxvit_base: MaxViT-Base.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in1k_512: Trained on ImageNet1K at resolution 512 x 512.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.
  • in22k_ft_in1k_512: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 512 x 512.

maxvit_large: MaxViT-Large.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in1k_512: Trained on ImageNet1K at resolution 512 x 512.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.
  • in22k_ft_in1k_512: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 512 x 512.

maxvit_xlarge: MaxViT-XLarge.

  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.
  • in22k_ft_in1k_512: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 512 x 512.

NesT

Nested transformer (NesT) from Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding by Zhang et al. NesT partitions the input into non-overlapping blocks, separately processes them using transformer layers, and aggregates them via convolutions & pooling to obtain competitive performance - especially on small-scale datasets - without sacrificing the simplicity of the original ViT.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

nest_tiny: NesT-Tiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

nest_small: NesT-Small.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

nest_base: NesT-Base.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

PiT

Pooling-based vision transformer (PiT) from Rethinking Spatial Dimensions of Vision Transformers by Heo et al. PiT notes that convolutional neural networks, unlike transformers, have a pyramidal configuration where the number of channels gradually increases in later layers whereas the spatial dimensions decrease, and applies a similar design principle to ViTs for better performance.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

pit_tiny: PiT-Tiny.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pit_xsmall: PiT-XSmall.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pit_small: PiT-Small.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pit_base: PiT-Base.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

PVT V2

Pyramid vision transformer V2 (PVT V2) from PVT v2: Improved Baselines with Pyramid Vision Transformer by Wang et al. PVT V2 lowers the cost of self-attention by downsampling the input before generating keys and values, and couples this with convolutions in each MLP, which act as an adaptive position encoding, and overlapping patch embedding for strong image recognition performance.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
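
The downsampled attention can be sketched as follows: a minimal, single-head JAX illustration where plain average pooling stands in for the strided convolution (or pooling, in the linear variant) the paper uses to reduce keys and values, and the projection matrices are placeholders.

```python
import jax
import jax.numpy as jnp


def downsampled_attention(x, w_q, w_k, w_v, reduction=8):
    """Single-head sketch of PVT V2's self-attention: queries come from the
    full-resolution feature map, but keys and values are computed from a
    spatially reduced copy, shrinking the attention matrix by reduction**2.
    x is (H, W, C), with H and W assumed divisible by reduction."""
    h, w, c = x.shape
    q = x.reshape(h * w, c) @ w_q                               # (HW, C)
    pooled = x.reshape(h // reduction, reduction,
                       w // reduction, reduction, c).mean(axis=(1, 3))
    kv = pooled.reshape(-1, c)                                  # (HW / r**2, C)
    k, v = kv @ w_k, kv @ w_v
    weights = jax.nn.softmax(q @ k.T / jnp.sqrt(c), axis=-1)    # (HW, HW / r**2)
    return (weights @ v).reshape(h, w, c)
```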

pvtv2_b0: PVTV2-B0.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pvtv2_b1: PVTV2-B1.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pvtv2_b2: PVTV2-B2.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pvtv2_b3: PVTV2-B3.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pvtv2_b4: PVTV2-B4.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

pvtv2_b5: PVTV2-B5.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

RegNet

RegNet from Designing Network Design Spaces by Radosavovic et al. RegNet is a family of ResNeXt-like networks whose depths and widths are parameterized through a simple linear function that yields a surprisingly powerful collection of networks that work well under a variety of FLOPS settings.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
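
The width parameterization is simple enough to write down. The sketch below follows the quantized linear rule described in the paper; the example values are illustrative rather than the exact settings of any checkpoint listed here.

```python
import jax.numpy as jnp


def regnet_widths(depth, w_0, w_a, w_m):
    """RegNet's quantized linear width rule: block j is assigned a continuous
    width u_j = w_0 + w_a * j, which is then snapped to w_0 times the nearest
    integer power of w_m; runs of blocks with equal quantized width form a
    stage."""
    j = jnp.arange(depth)
    u = w_0 + w_a * j                                 # continuous widths
    s = jnp.round(jnp.log(u / w_0) / jnp.log(w_m))    # quantization exponents
    return (w_0 * w_m ** s).astype(int)               # per-block widths


# Example (illustrative parameters): 13 blocks grouped into a few stages.
print(regnet_widths(depth=13, w_0=24, w_a=36.0, w_m=2.5))
```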

regnetx_200mf: RegNetX costing approximately 200 mega FLOPS/0.2 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_400mf: RegNetX costing approximately 400 mega FLOPS/0.4 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_600mf: RegNetX costing approximately 600 mega FLOPS/0.6 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_800mf: RegNetX costing approximately 800 mega FLOPS/0.8 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_1600mf: RegNetX costing approximately 1600 mega FLOPS/1.6 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_3200mf: RegNetX costing approximately 3200 mega FLOPS/3.2 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_4000mf: RegNetX costing approximately 4000 mega FLOPS/4.0 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_6400mf: RegNetX costing approximately 6400 mega FLOPS/6.4 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_8000mf: RegNetX costing approximately 8000 mega FLOPS/8.0 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_12gf: RegNetX costing approximately 12000 mega FLOPS/12 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_16gf: RegNetX costing approximately 16000 mega FLOPS/16 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnetx_32gf: RegNetX costing approximately 32000 mega FLOPS/32 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_200mf: RegNetY costing approximately 200 mega FLOPS/0.2 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_400mf: RegNetY costing approximately 400 mega FLOPS/0.4 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_600mf: RegNetY costing approximately 600 mega FLOPS/0.6 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_800mf: RegNetY costing approximately 800 mega FLOPS/0.8 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_1600mf: RegNetY costing approximately 1600 mega FLOPS/1.6 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_3200mf: RegNetY costing approximately 3200 mega FLOPS/3.2 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_4000mf: RegNetY costing approximately 4000 mega FLOPS/4.0 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_6400mf: RegNetY costing approximately 6400 mega FLOPS/6.4 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_8000mf: RegNetY costing approximately 8000 mega FLOPS/8.0 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_12gf: RegNetY costing approximately 12000 mega FLOPS/12 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_16gf: RegNetY costing approximately 16000 mega FLOPS/16 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

regnety_32gf: RegNetY costing approximately 32000 mega FLOPS/32 giga FLOPS.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • seer_224: Trained using SwAV with no labels on 1 billion Instagram images (Facebook's SEER) at resolution 224 x 224.
  • seer_ft_in1k_384: Pre-trained using SwAV with no labels on 1 billion Instagram images (Facebook's SEER) and fine-tuned on ImageNet1K at resolution 384 x 384.

regnety_64gf: RegNetY costing approximately 64000 mega FLOPS/64 giga FLOPS.

  • seer_224: Trained using SwAV with no labels on 1 billion Instagram images (Facebook's SEER) at resolution 224 x 224.
  • seer_ft_in1k_384: Pre-trained using SwAV with no labels on 1 billion Instagram images (Facebook's SEER) and fine-tuned on ImageNet1K at resolution 384 x 384.

regnety_128gf: RegNetY costing approximately 128000 mega FLOPS/128 giga FLOPS.

  • seer_224: Trained using SwAV with no labels on 1 billion Instagram images (Facebook's SEER) at resolution 224 x 224.
  • seer_ft_in1k_384: Pre-trained using SwAV with no labels on 1 billion Instagram images (Facebook's SEER) and fine-tuned on ImageNet1K at resolution 384 x 384.

regnety_256gf: RegNetY costing approximately 256000 mega FLOPS/256 giga FLOPS.

  • seer_ft_in1k_384: Pre-trained using SwAV with no labels on 1 billion Instagram images (Facebook's SEER) and fine-tuned on ImageNet1K at resolution 384 x 384.

ResNet

ResNet from Deep Residual Learning for Image Recognition by He et al. ResNet proposes residual connections to facilitate the propagation of information throughout the network, making it possible for the first time to train models of unprecedented depth.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
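
The core idea fits in one line; here is a minimal JAX sketch in which block_fn is a placeholder for any residual branch, for example two 3 x 3 convolutions with normalization.

```python
import jax


def residual_block(x, block_fn):
    """A residual block: the branch block_fn learns only a residual that is
    added to its input through an identity skip connection, which lets
    signals and gradients propagate unimpeded through very deep stacks."""
    return jax.nn.relu(x + block_fn(x))
```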

resnet18: ResNet-18.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • ssl_ft_in1k_224: Pre-trained on a subset of YFCC100M using semi-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • swsl_ft_in1k_224: Pre-trained on 940 million Instagram images using semi-weakly supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnet34: ResNet-34.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet26: ResNet-26.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet50: ResNet-50.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • ssl_ft_in1k_224: Pre-trained on a subset of YFCC100M using semi-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • swsl_ft_in1k_224: Pre-trained on 940 million Instagram images using semi-weakly supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnet101: ResNet-101.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet152: ResNet-152.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ResNet V2

ResNet V2 from Identity Mappings in Deep Residual Networks by He et al. ResNet V2 is a descendant of ResNet that trains more smoothly and generalizes better thanks to its redesigned residual units, which abide by a pre-activation pattern where normalization modules and activation functions precede rather than succeed convolutional layers.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

resnetv2_bit_50x1: ResNetV2-50 (width multiplier 1x) with group normalization and weight-standardized convolutions, used by Google's Big Transfer (BiT).

  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_448: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 448 x 448.

resnetv2_bit_50x3: ResNetV2-50 (width multiplier 3x) with group normalization and weight-standardized convolutions, used by Google's Big Transfer (BiT).

  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_448: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 448 x 448.

resnetv2_bit_101x1: ResNetV2-101 (width multiplier 1x) with group normalization and weight-standardized convolutions, used by Google's Big Transfer (BiT).

  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_448: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 448 x 448.

resnetv2_bit_101x3: ResNetV2-101 (width multiplier 3x) with group normalization and weight-standardized convolutions, used by Google's Big Transfer (BiT).

  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_448: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 448 x 448.

resnetv2_bit_152x2: ResNetV2-152 (width multiplier 2x) with group normalization and weight-standardized convolutions, used by Google's Big Transfer (BiT).

  • teacher_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224, used as a teacher model to perform knowledge distillation.
  • teacher_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384, used as a teacher model to perform knowledge distillation.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_448: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 448 x 448.

ResNet-D

ResNet-D from Bag of Tricks for Image Classification with Convolutional Neural Networks by He et al. ResNet-D is identical to ResNet, except that it replaces the 7 x 7 convolution in the stem with three 3 x 3 convolutions and delegates spatial reduction in the identity branch of residual blocks to average pooling in lieu of a strided 1 x 1 convolution.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

resnet18d: ResNet-D-18.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet34d: ResNet-D-34.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet26d: ResNet-D-26.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet50d: ResNet-D-50.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet101d: ResNet-D-101.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet152d: ResNet-D-152.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet200d: ResNet-D-200.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ResNet-T

ResNet-T from PyTorch Image Models by Wightman. ResNet-T is identical to ResNet-D, except that the stem is tiered, i.e., the output dimensions of the stem convolutions are 24, 32, and 64 instead of 32, 32, and 64.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

resnet10t: ResNet-T-10.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet14t: ResNet-T-14.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnet26t: ResNet-T-26.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

Wide ResNet

Wide ResNet from Wide Residual Networks by Zagoruyko et al. Wide ResNet recommends widening the bottleneck dimension of residual blocks in ResNet as an alternative to deepening the model.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

wide_resnet50_2: Wide ResNet-50, width multiplier 2x.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

wide_resnet101_2: Wide ResNet-101, width multiplier 2x.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ResNeXt

ResNeXt from Aggregated Residual Transformations for Deep Neural Networks by Xie et al. ResNeXt is inspired by Inception's split-transform-merge strategy and has a multi-branch topology that aggregates a set of homogeneous transformations - 3 x 3 convolutions - in residual blocks for better performance.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

resnext50_32x4d: ResNeXt-50, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 4.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • ssl_ft_in1k_224: Pre-trained on a subset of YFCC100M using semi-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • swsl_ft_in1k_224: Pre-trained on 940 million Instagram images using semi-weakly supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnext101_32x4d: ResNeXt-101, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 4.

  • ssl_ft_in1k_224: Pre-trained on a subset of YFCC100M using semi-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • swsl_ft_in1k_224: Pre-trained on 940 million Instagram images using semi-weakly supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnext101_32x8d: ResNeXt-101, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • ssl_ft_in1k_224: Pre-trained on a subset of YFCC100M using semi-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • wsl_ft_in1k_224: Pre-trained on 940 million Instagram images using weakly-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • swsl_ft_in1k_224: Pre-trained on 940 million Instagram images using semi-weakly supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnext101_32x16d: ResNeXt-101, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 16.

  • ssl_ft_in1k_224: Pre-trained on a subset of YFCC100M using semi-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • wsl_ft_in1k_224: Pre-trained on 940 million Instagram images using weakly-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.
  • swsl_ft_in1k_224: Pre-trained on 940 million Instagram images using semi-weakly supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnext101_32x32d: ResNeXt-101, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 32.

  • wsl_ft_in1k_224: Pre-trained on 940 million Instagram images using weakly-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnext101_32x48d: ResNeXt-101, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 48.

  • wsl_ft_in1k_224: Pre-trained on 940 million Instagram images using weakly-supervised learning and fine-tuned on ImageNet1K at resolution 224 x 224.

resnext101_64x4d: ResNeXt-101, cardinality 64, bottleneck dimensionality per cardinal group in the first stage 4.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

SENet

Squeeze-and-excitation network (SENet) from Squeeze-and-Excitation Networks by Hu et al. SENet introduces squeeze-and-excitation (SE), a simple channel attention mechanism, for capturing relationships between different channels and adaptively weighing them.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
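
Squeeze-and-excitation is easy to sketch; the following is a minimal per-image JAX illustration with placeholder weight matrices, the reduction ratio r being folded into their shapes.

```python
import jax


def squeeze_excite(x, w_reduce, w_expand):
    """Squeeze-and-excitation: global average pooling squeezes each channel to
    a scalar, a small bottleneck MLP (reduce, ReLU, expand) models channel
    interdependencies, and a sigmoid gate rescales the channels. x is
    (H, W, C); w_reduce is (C, C // r); w_expand is (C // r, C)."""
    squeezed = x.mean(axis=(0, 1))                           # (C,) squeeze
    gate = jax.nn.sigmoid(jax.nn.relu(squeezed @ w_reduce) @ w_expand)
    return x * gate                                          # excitation
```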

seresnet50: SE-ResNet-50.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

seresnet152d: SE-ResNet-D-152.

  • in1k_256: Trained on ImageNet1K at resolution 256 x 256.

seresnext26d_32x4d: SE-ResNeXt-D-26, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 4.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

seresnext26t_32x4d: SE-ResNeXt-T-26, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 4.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

seresnext50_32x4d: SE-ResNeXt-50, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 4.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

seresnext101_32x8d: SE-ResNeXt-101, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

seresnext101d_32x8d: SE-ResNeXt-D-101, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ECANet

Efficient channel attention network (ECANet) from ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks by Wang et al. ECANet suggests an alternative to squeeze-and-excitation, dubbed efficient channel attention (ECA), that eliminates dimensionality reduction in the bottleneck layer of the excitation module for better accuracy, in addition to being cheaper.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
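
The mechanism can be sketched as below in JAX; the uniform kernel is a placeholder for ECA's learnable one, and in the paper the kernel size is chosen adaptively from the channel count rather than fixed.

```python
import jax
import jax.numpy as jnp


def efficient_channel_attention(x, kernel_size=3):
    """Efficient channel attention: channels are squeezed by global average
    pooling as in SE, but the gate is produced by a small 1D convolution
    across neighbouring channels, avoiding any dimensionality reduction.
    x is (H, W, C)."""
    squeezed = x.mean(axis=(0, 1))                            # (C,)
    kernel = jnp.ones(kernel_size) / kernel_size              # placeholder weights
    mixed = jnp.convolve(squeezed, kernel, mode='same')       # local cross-channel interaction
    return x * jax.nn.sigmoid(mixed)                          # channel reweighting
```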

ecaresnet26t: ECA-ResNet-T-26.

  • in1k_256: Trained on ImageNet1K at resolution 256 x 256.

ecaresnet50_light: Lightweight ECA-ResNet-50 where most of the layers are in stage 3.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ecaresnet50d: ECA-ResNet-D-50.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ecaresnet50t: ECA-ResNet-T-50.

  • in1k_256: Trained on ImageNet1K at resolution 256 x 256.

ecaresnet101d: ECA-ResNet-D-101.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ecaresnet269d: ECA-ResNet-D-269.

  • in1k_320: Trained on ImageNet1K at resolution 320 x 320.

ResNet-RS

ResNet-RS from Revisiting ResNets: Improved Training and Scaling Strategies by Bello et al. ResNet-RS is architecturally simple, differing little from SE-ResNet-D, but rivals the performance of more advanced networks like EfficientNet thanks to its enhanced training and scaling procedures.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

resnetrs50: ResNet-RS-50.

  • in1k_160: Trained on ImageNet1K at resolution 160 x 160.

resnetrs101: ResNet-RS-101.

  • in1k_192: Trained on ImageNet1K at resolution 192 x 192.

resnetrs152: ResNet-RS-152.

  • in1k_256: Trained on ImageNet1K at resolution 256 x 256.

resnetrs200: ResNet-RS-200.

  • in1k_256: Trained on ImageNet1K at resolution 256 x 256.

resnetrs270: ResNet-RS-270.

  • in1k_256: Trained on ImageNet1K at resolution 256 x 256.

resnetrs350: ResNet-RS-350.

  • in1k_288: Trained on ImageNet1K at resolution 288 x 288.

resnetrs420: ResNet-RS-420.

  • in1k_320: Trained on ImageNet1K at resolution 320 x 320.

SKNet

Selective kernel networks (SKNet) from Selective Kernel Networks by Li et al. SKNet uses selective kernel units (SK units), a branch attention mechanism, to adjust each residual block's kernel size according to the input.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

skresnet18: SK-ResNet-18.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

skresnet34: SK-ResNet-34.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

skresnext50_32x4d: SK-ResNeXt-50, cardinality 32, bottleneck dimensionality per cardinal group in the first stage 4.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ResNeSt

ResNeSt from ResNeSt: Split-Attention Networks by Zhang et al. ResNeSt compounds the power of multi-branch architectures with that of channel attention using a mechanism known as split attention (SplAt) that can be considered a generalization of SK but with static kernel sizes.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

resnest14_2s1x64d: ResNeSt-14, radix 2, cardinality 1, dimensionality per radix group in the first stage 64.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnest26_2s1x64d: ResNeSt-26, radix 2, cardinality 1, dimensionality per radix group in the first stage 64.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnest50_1s4x24d: ResNeSt-50, radix 1, cardinality 4, dimensionality per radix group in the first stage 24.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnest50_2s1x64d: ResNeSt-50, radix 2, cardinality 1, dimensionality per radix group in the first stage 64.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnest50_4s2x40d: ResNeSt-50, radix 4, cardinality 2, dimensionality per radix group in the first stage 40.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

resnest101_2s1x64d: ResNeSt-101, radix 2, cardinality 1, dimensionality per radix group in the first stage 64.

  • in1k_256: Trained on ImageNet1K at resolution 256 x 256.

resnest200_2s1x64d: ResNeSt-200, radix 2, cardinality 1, dimensionality per radix group in the first stage 64.

  • in1k_320: Trained on ImageNet1K at resolution 320 x 320.

resnest269_2s1x64d: ResNeSt-269, radix 2, cardinality 1, dimensionality per radix group in the first stage 64.

  • in1k_416: Trained on ImageNet1K at resolution 416 x 416.

Swin

Shifted window attention transformer (Swin transformer) from Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. Swin is a hierarchical vision transformer that efficiently calculates attention by limiting it to local windows and captures cross-window relationships using window shifting.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
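
The windowing machinery can be sketched compactly in JAX. The helpers below operate on a single (H, W, C) feature map with H and W divisible by the window size, and the attention masking Swin applies to wrapped-around regions after shifting is omitted.

```python
import jax.numpy as jnp


def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping window x window
    patches, the units within which Swin computes self-attention."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)


def shift_windows(x, window):
    """Cyclically shift the feature map by half a window so that the next
    layer's windows straddle the previous layer's window boundaries,
    letting information flow across windows."""
    return jnp.roll(x, shift=(-(window // 2), -(window // 2)), axis=(0, 1))
```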

swin_tiny_window7: Swin-Tiny, window size 7 x 7.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.

swin_small_window7: Swin-Small, window size 7 x 7.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.

swin_base_window6: Swin-Base, window size 6 x 6.

  • simmim_in1k_100ep_192: Trained using SimMIM with no labels on ImageNet1K for 100 epochs at resolution 192 x 192.
  • simmim_in1k_800ep_192: Trained using SimMIM with no labels on ImageNet1K for 800 epochs at resolution 192 x 192.
  • simmim_in1k_100ep_ft_in1k_192: Pre-trained using SimMIM with no labels on ImageNet1K for 100 epochs and fine-tuned with labels on ImageNet1K at resolution 192 x 192.

swin_base_window7: Swin-Base, window size 7 x 7.

  • simmim_in1k_100ep_ft_in1k_224: Pre-trained using SimMIM with no labels on ImageNet1K for 100 epochs and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • simmim_in1k_800ep_ft_in1k_224: Pre-trained using SimMIM with no labels on ImageNet1K for 800 epochs and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.

swin_base_window12: Swin-Base, window size 12 x 12.

  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in22k_384: Trained on ImageNet22K at resolution 384 x 384.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

swin_large_window7: Swin-Large, window size 7 x 7.

  • in22k_224: Trained on ImageNet22K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.

swin_large_window12: Swin-Large, window size 12 x 12.

  • simmim_in1k_800ep_192: Trained using SimMIM with no labels on ImageNet1K for 800 epochs at resolution 192 x 192.
  • in22k_384: Trained on ImageNet22K at resolution 384 x 384.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

swin_large_window14: Swin-Large, window size 14 x 14.

  • simmim_in1k_800ep_ft_in1k_224: Pre-trained using SimMIM with no labels on ImageNet1K for 800 epochs and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

Swin-S3

Swin-S3 from Searching the Search Space of Vision Transformer by Chen et al. Swin-S3 is composed of the same type of blocks as Swin but was discovered through neural architecture search.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

swin_s3_tiny_224: Swin-S3-Tiny, resolution 224 x 224.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

swin_s3_small_224: Swin-S3-Small, resolution 224 x 224.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

swin_s3_base_224: Swin-S3-Base, resolution 224 x 224.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

VAN

Visual attention network (VAN) from Visual Attention Network by Guo et al. VAN unifies the merits of convolutions and self-attention through large kernel attention (LKA), a cheap attention module with spatial and channel adaptability as well as the ability to capture long-range dependencies.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
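
LKA decomposes a large convolution into three cheaper ones; below is a minimal JAX sketch with the kernels passed in as placeholders (VAN uses a 5 x 5 depthwise convolution, a 7 x 7 depthwise convolution with dilation 3, and a 1 x 1 convolution).

```python
import jax


def large_kernel_attention(x, dw_kernel, dw_dilated_kernel, pw_kernel):
    """Large kernel attention (LKA): a depthwise convolution for local context,
    a dilated depthwise convolution for long-range context, and a pointwise
    convolution for channel mixing together produce an attention map that
    gates the input elementwise. x is NHWC; dw_kernel is (5, 5, 1, C);
    dw_dilated_kernel is (7, 7, 1, C); pw_kernel is (1, 1, C, C)."""
    dims = ('NHWC', 'HWIO', 'NHWC')
    channels = x.shape[-1]
    attn = jax.lax.conv_general_dilated(
        x, dw_kernel, (1, 1), 'SAME', dimension_numbers=dims,
        feature_group_count=channels)                             # local context
    attn = jax.lax.conv_general_dilated(
        attn, dw_dilated_kernel, (1, 1), 'SAME', rhs_dilation=(3, 3),
        dimension_numbers=dims, feature_group_count=channels)     # long-range context
    attn = jax.lax.conv_general_dilated(
        attn, pw_kernel, (1, 1), 'SAME', dimension_numbers=dims)  # channel mixing
    return x * attn                                               # elementwise gating
```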

van_b0: VAN-B0.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

van_b1: VAN-B1.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

van_b2: VAN-B2.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

van_b3: VAN-B3.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

VGG

VGG from Very Deep Convolutional Networks for Large-Scale Image Recognition by Simonyan et al. VGG stacks many 3 x 3 convolutions on top of one another, interleaved with non-linearities, pooling, and optionally batch normalization, and was one of the first architectures to demonstrate the importance of depth for vision tasks.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

vgg11: VGG-11.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

vgg11_bn: VGG-11 with batch normalization.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

vgg13: VGG-13.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

vgg13_bn: VGG-13 with batch normalization.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

vgg16: VGG-16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

vgg16_bn: VGG-16 with batch normalization.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

vgg19: VGG-19.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

vgg19_bn: VGG-19 with batch normalization.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.

ViT

Vision transformer (ViT) from An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. ViT closely resembles NLP transformers but the input is first patchified and linearly transformed to manage the quadratic complexity of self-attention.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
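
The patchification step is a small piece of code. The sketch below handles a single (H, W, C) image with H and W divisible by the patch size; w_embed is a placeholder projection matrix, and position embeddings and the class token are omitted.

```python
import jax.numpy as jnp


def patchify(image, w_embed, patch=16):
    """Turn an image into a sequence of patch embeddings: cut it into
    non-overlapping patch x patch squares, flatten each square, and project it
    linearly to the transformer's token dimension. w_embed has shape
    (patch * patch * C, dim)."""
    h, w, c = image.shape
    squares = image.reshape(h // patch, patch, w // patch, patch, c)
    squares = squares.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return squares @ w_embed                      # (num_patches, dim) tokens
```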

vit_tiny_patch16: ViT-Tiny, patch size 16 x 16.

  • augreg_in22k_224: Trained on ImageNet22K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.

vit_small_patch32: ViT-Small, patch size 32 x 32.

  • augreg_in22k_224: Trained on ImageNet22K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.

vit_small_patch16: ViT-Small, patch size 16 x 16.

  • dino_in1k_224: Trained using DINO on ImageNet1K at resolution 224 x 224.
  • augreg_in1k_224: Trained on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in1k_384: Trained on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.
  • augreg_in22k_224: Trained on ImageNet22K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.

vit_small_patch8: ViT-Small, patch size 8 x 8.

  • dino_in1k_224: Trained using DINO on ImageNet1K at resolution 224 x 224.

vit_base_patch32: ViT-Base, patch size 32 x 32.

  • sam_in1k_224: Trained using SAM on ImageNet1K at resolution 224 x 224.
  • augreg_in1k_224: Trained on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in1k_384: Trained on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.
  • augreg_in22k_224: Trained on ImageNet22K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.

vit_base_patch16: ViT-Base, patch size 16 x 16.

  • mae_in1k_224: Trained using MAE with no labels on ImageNet1K at resolution 224 x 224.
  • sam_in1k_224: Trained using SAM on ImageNet1K at resolution 224 x 224.
  • dino_in1k_224: Trained using DINO on ImageNet1K at resolution 224 x 224.
  • augreg_in1k_224: Trained on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in1k_384: Trained on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.
  • augreg_in22k_224: Trained on ImageNet22K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.

vit_base_pool_patch16: ViT-Base with global average pooling in the head, patch size 16 x 16.

  • mae_in1k_ft_in1k_224: Pre-trained using MAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

vit_base_patch8: ViT-Base, patch size 8 x 8.

  • dino_in1k_224: Trained using DINO on ImageNet1K at resolution 224 x 224.
  • augreg_in22k_224: Trained on ImageNet22K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.

vit_large_patch32: ViT-Large, patch size 32 x 32.

  • orig_in22k_224: Trained on ImageNet22K (original weights) at resolution 224 x 224.
  • orig_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K (original weights) at resolution 384 x 384.

vit_large_patch16: ViT-Large, patch size 16 x 16.

  • mae_in1k_224: Trained using MAE with no labels on ImageNet1K at resolution 224 x 224.
  • augreg_in22k_224: Trained on ImageNet22K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 224 x 224.
  • augreg_in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K with additional augmentation & regularization at resolution 384 x 384.

vit_large_pool_patch16: ViT-Large with global average pooling in the head, patch size 16 x 16.

  • mae_in1k_ft_in1k_224: Pre-trained using MAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

vit_huge_patch14: ViT-Huge, patch size 14 x 14.

  • mae_in1k_224: Trained using MAE with no labels on ImageNet1K at resolution 224 x 224.
  • orig_in22k_224: Trained on ImageNet22K (original weights) at resolution 224 x 224.

vit_huge_pool_patch14: ViT-Huge with global average pooling in the head, patch size 14 x 14.

  • mae_in1k_ft_in1k_224: Pre-trained using MAE with no labels on ImageNet1K and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

vit_base_clip_patch32: ViT-Base with an extra layer normalization before the transformer layers, patch size 32 x 32.

  • clip_openai_224: Trained using CLIP on 400 million (image, text) pairs at resolution 224 x 224.
  • clip_openai_ft_in1k_224: Pre-trained using CLIP on 400 million (image, text) pairs and fine-tuned on ImageNet1K at resolution 224 x 224.
  • clip_laion2b_224: Trained using CLIP on LAION-2B at resolution 224 x 224.
  • clip_laion2b_ft_in1k_224: Pre-trained using CLIP on LAION-2B and fine-tuned on ImageNet1K at resolution 224 x 224.

vit_base_clip_patch16: ViT-Base with an extra layer normalization before the transformer layers, patch size 16 x 16.

  • clip_openai_224: Trained using CLIP on 400 million (image, text) pairs at resolution 224 x 224.
  • clip_openai_ft_in1k_224: Pre-trained using CLIP on 400 million (image, text) pairs and fine-tuned on ImageNet1K at resolution 224 x 224.
  • clip_openai_ft_in1k_384: Pre-trained using CLIP on 400 million (image, text) pairs and fine-tuned on ImageNet1K at resolution 384 x 384.
  • clip_laion2b_224: Trained using CLIP on LAION-2B at resolution 224 x 224.
  • clip_laion2b_ft_in1k_224: Pre-trained using CLIP on LAION-2B and fine-tuned on ImageNet1K at resolution 224 x 224.
  • clip_laion2b_ft_in1k_384: Pre-trained using CLIP on LAION-2B and fine-tuned on ImageNet1K at resolution 384 x 384.

vit_large_clip_patch14: ViT-Large with an extra layer normalization before the transformer layers, patch size 14 x 14.

  • clip_openai_224: Trained using CLIP on 400 million (image, text) pairs at resolution 224 x 224.
  • clip_openai_ft_in1k_224: Pre-trained using CLIP on 400 million (image, text) pairs and fine-tuned on ImageNet1K at resolution 224 x 224.
  • clip_laion2b_224: Trained using CLIP on LAION-2B at resolution 224 x 224.
  • clip_laion2b_ft_in1k_224: Pre-trained using CLIP on LAION-2B and fine-tuned on ImageNet1K at resolution 224 x 224.
  • clip_laion2b_ft_in1k_336: Pre-trained using CLIP on LAION-2B and fine-tuned on ImageNet1K at resolution 336 x 336.

vit_huge_clip_patch14: ViT-Huge with an extra layer normalization before the transformer layers, patch size 14 x 14.

  • clip_laion2b_224: Trained using CLIP on LAION-2B at resolution 224 x 224.
  • clip_laion2b_ft_in1k_224: Pre-trained using CLIP on LAION-2B and fine-tuned on ImageNet1K at resolution 224 x 224.

vit_giant_clip_patch14: ViT-Giant with an extra layer normalization before the transformer layers, patch size 14 x 14.

  • clip_laion2b_224: Trained using CLIP on LAION-2B at resolution 224 x 224.

DeiT 3

Data-efficient image transformer III (DeiT 3) from DeiT III: Revenge of the ViT by Touvron et al. DeiT 3 is architecturally similar to ViT but is trained using a refined training recipe.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.

deit3_small_patch16: DeiT3-Small, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

deit3_medium_patch16: DeiT3-Medium, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.

deit3_base_patch16: DeiT3-Base, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

deit3_large_patch16: DeiT3-Large, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in1k_384: Trained on ImageNet1K at resolution 384 x 384.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_384: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 384 x 384.

deit3_huge_patch14_224: DeiT3-Huge, patch size 14 x 14.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • in22k_ft_in1k_224: Pre-trained on ImageNet22K and fine-tuned on ImageNet1K at resolution 224 x 224.

BEiT

BEiT from BEiT: BERT Pre-Training of Image Transformers by Bao et al. BEiT is architecturally similar to ViT but is trained using BERT-style masked image modelling.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
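To make the pre-training idea concrete, the sketch below shows one BEiT-style masked image modelling step under simplifying assumptions: patches are masked uniformly at random (the paper uses blockwise masking), and the patch embedder, tokenizer, encoder, and output projection are stand-in callables rather than this library's API.

```python
import jax
import jax.numpy as jnp


def beit_mim_loss(rng, images, patch_embed, tokenizer, encoder, to_logits,
                  mask_token, mask_ratio=0.4):
    patches = patch_embed(images)           # (batch, num_patches, dim)
    targets = tokenizer(images)             # (batch, num_patches) discrete token ids

    # Corrupt a random subset of patch embeddings with a learned mask token.
    batch, num_patches, _ = patches.shape
    mask = jax.random.bernoulli(rng, mask_ratio, (batch, num_patches))
    corrupted = jnp.where(mask[..., None], mask_token, patches)

    # Encode the corrupted sequence and predict visual tokens at every position.
    logits = to_logits(encoder(corrupted))  # (batch, num_patches, vocab_size)
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    token_ll = jnp.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]

    # Cross-entropy only at the masked positions.
    return -(token_ll * mask).sum() / jnp.maximum(mask.sum(), 1)
```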

beit_base_patch16: BEiT-Base, patch size 16 x 16.

  • beit_in22k_ft_in22k_224: Pre-trained using BEiT on ImageNet22K and fine-tuned with labels on ImageNet22K at resolution 224 x 224.
  • beit_in22k_ft_in22k_ft_in1k_224: Pre-trained using BEiT on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • beit_in22k_ft_in22k_ft_in1k_384: Pre-trained using BEiT on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 384 x 384.
  • beitv2_in1k_ft_in22k_224: Pre-trained using BEiT V2 on ImageNet1K and fine-tuned with labels on ImageNet22K at resolution 224 x 224.
  • beitv2_in1k_ft_in22k_ft_in1k_224: Pre-trained using BEiT V2 on ImageNet1K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

beit_large_patch16: BEiT-Large, patch size 16 x 16.

  • beit_in22k_ft_in22k_224: Pre-trained using BEiT on ImageNet22K and fine-tuned with labels on ImageNet22K at resolution 224 x 224.
  • beit_in22k_ft_in22k_ft_in1k_224: Pre-trained using BEiT on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.
  • beit_in22k_ft_in22k_ft_in1k_384: Pre-trained using BEiT on ImageNet22K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 384 x 384.
  • beitv2_in1k_ft_in22k_224: Pre-trained using BEiT V2 on ImageNet1K and fine-tuned with labels on ImageNet22K at resolution 224 x 224.
  • beitv2_in1k_ft_in22k_ft_in1k_224: Pre-trained using BEiT V2 on ImageNet1K, fine-tuned with labels on ImageNet22K, and fine-tuned with labels on ImageNet1K at resolution 224 x 224.

XCiT

Cross-covariance image transformer (XCiT) from XCiT: Cross-Covariance Image Transformers by El-Nouby et al. XCiT eliminates the quadratic complexity of traditional transformers using cross-covariance attention (XCA), a transposed version of self-attention that operates along the feature axis rather than the token axis of the input, and complements it with local patch interaction (LPI) modules that explicitly enable communication amongst patches, yielding scalable, efficient image transformers.
For the reference implementation, source of pre-trained parameters, and copyrights, please visit here.
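The sketch below illustrates XCA for a single head: queries and keys are L2-normalized along the token axis, the attention map is computed between feature channels (so its size is independent of the sequence length), and the values are re-weighted channel-wise. The learned per-head temperature of the paper is a plain argument here, and the code is a minimal single-head sketch rather than the library's implementation.

```python
import jax
import jax.numpy as jnp


def xca(q, k, v, temperature=1.0):
    # q, k, v: (num_tokens, num_features) for a single head.
    # L2-normalize along the token axis so channel-by-channel products
    # behave like cosine similarities.
    q = q / jnp.linalg.norm(q, axis=0, keepdims=True)
    k = k / jnp.linalg.norm(k, axis=0, keepdims=True)

    # (num_features, num_features) cross-covariance attention map.
    attn = jax.nn.softmax(temperature * (q.T @ k), axis=-1)

    # Re-weight the value features channel-wise: (num_tokens, num_features).
    return v @ attn.T
```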

xcit_nano12_patch16: XCiT-Nano, depth 12, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_nano12_patch8: XCiT-Nano, depth 12, patch size 8 x 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_tiny12_patch16: XCiT-Tiny, depth 12, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_tiny12_patch8: XCiT-Tiny, depth 12, patch size 8 x 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_tiny24_patch16: XCiT-Tiny, depth 24, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_tiny24_patch8: XCiT-Tiny, depth 24, patch size 8 x 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_small12_patch16: XCiT-Small, depth 12, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_small12_patch8: XCiT-Small, depth 12, patch size 8 x 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_small24_patch16: XCiT-Small, depth 24, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_small24_patch8: XCiT-Small, depth 24, patch size 8 x 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_medium24_patch16: XCiT-Medium, depth 24, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_medium24_patch8: XCiT-Medium, depth 24, patch size 8 x 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_large24_patch16: XCiT-Large, depth 24, patch size 16 x 16.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.

xcit_large24_patch8: XCiT-Large, depth 24, patch size 8 x 8.

  • in1k_224: Trained on ImageNet1K at resolution 224 x 224.
  • dist_in1k_224: Trained using distillation on ImageNet1K at resolution 224 x 224.
  • dist_in1k_384: Trained using distillation on ImageNet1K at resolution 384 x 384.