

# 1. Motivation



<img src="images/convnext1.png" width="40%" height="40%" />

ConvNeXt (CVPR 2022, Facebook/Meta) aims to answer a question:

**If we modernize classical CNNs with all the training tricks used in Vision Transformers (ViT), could a pure CNN match or surpass ViT?**

ConvNeXt = **ResNet-50/ResNet-200 rewritten with modern design choices inspired by ViT**, without using attention.

Key ideas:

* Replace the **3×3 convolutions** in bottleneck blocks with large-kernel **depthwise 7×7 convolutions**.
* Use **LayerNorm** instead of BatchNorm.
* Use **ConvNeXt blocks** that structurally resemble **Transformer blocks in ViT** (spatial mixing followed by an MLP).
* Use **inverted bottlenecks** with a 4× expansion (the same ratio as the MLP in ViT).

ConvNeXt is therefore a **CNN designed like a Vision Transformer**; the paper reports accuracy and throughput competitive with Swin Transformers of similar size.

---

# 2. Overall Architecture

ConvNeXt uses a **4-stage hierarchy** similar to ResNet / Swin Transformer:

| Stage   | Resolution        | Channels |
| ------- | ----------------- | -------- |
| Stage 1 | 224×224 → 56×56   | C        |
| Stage 2 | 56×56 → 28×28     | 2C       |
| Stage 3 | 28×28 → 14×14     | 4C       |
| Stage 4 | 14×14 → 7×7       | 8C       |

A Tiny model uses C=96.
A Base model uses C=128.
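
As a quick sanity check of the table, the resolutions and channel widths can be computed from the input size and the base width C. This is a minimal sketch: the function name is made up for illustration, and it assumes the standard ConvNeXt layout of a stride-4 stem followed by a stride-2 downsampling layer before each later stage.

```python
def convnext_stage_shapes(input_size=224, base_channels=96):
    """Return a list of (stage, resolution, channels) for the four stages."""
    shapes = []
    resolution = input_size // 4      # assumed stem: 4x4 conv with stride 4
    channels = base_channels
    for stage in range(1, 5):
        if stage > 1:
            resolution //= 2          # 2x2 stride-2 downsampling between stages
            channels *= 2             # channel width doubles at each stage
        shapes.append((stage, resolution, channels))
    return shapes


for stage, res, ch in convnext_stage_shapes(224, 96):
    print(f"Stage {stage}: {res}x{res}, {ch} channels")
```

With `base_channels=128` (Base), the last stage has 1024 channels at 7×7 resolution.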

---

# 3. ConvNeXt Block

The ConvNeXt block has **four main parts**:

1. **Depthwise convolution**
   Kernel size **7×7** (large for a CNN), applied per channel:

   $$
   y_{c} = x_{c} * k^{(c)}
   $$

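2. **LayerNorm**
   Applied at each spatial position, normalizing across the channel dimension (the mean $\mu$ and variance $\sigma^2$ are computed over the $C$ channels):

   $$
   \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
   $$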

3. **Pointwise MLP (two linear layers, equivalent to 1×1 convs)**
   Expansion ratio = **4×**:

   $$
   u = W_1 \hat{x}
   $$
   $$
   z = \text{GELU}(u)
   $$
   $$
   y = W_2 z
   $$

4. **Residual connection**:

   $$
   \text{Block}(x) = x + y
   $$

This mirrors a **Transformer block**: the **depthwise convolution** takes the place of self-attention as the spatial mixing step, and the pointwise layers form the same MLP used in ViT.

---

# 4. Comparison with a ResNet Bottleneck

ResNet bottleneck block:

* 1×1 → 3×3 → 1×1
* BatchNorm after every conv
* ReLU after each conv
* Standard bottleneck: the 3×3 conv runs at 1/4 of the block's output width (wide → narrow → wide)

ConvNeXt block:

* LayerNorm instead of BatchNorm, used once per block
* **7×7 depthwise conv**
* **Inverted bottleneck** (narrow → wide → narrow), like a ViT MLP block
* A single activation per block (GELU) instead of a ReLU after every conv

This makes ConvNeXt much closer to ViT.

---

# 5. Stage Downsampling Design

Between stages, ConvNeXt uses a **simple downsampling layer**:

* LayerNorm
* 2×2 stride-2 convolution

This mimics ViT patch embeddings and Swin Transformer patch merging.

---

# 6. Full Block Diagram (Text)

```
Input
 │
 ▼
Depthwise Conv (7×7)
 │
 ▼
LayerNorm
 │
 ▼
Pointwise 1×1 Conv (4× expansion)
 │
 ▼
GELU
 │
 ▼
Pointwise 1×1 Conv (projection)
 │
 ▼
Add residual
 │
Output
```

---

# 7. Mathematical Form of a Single Block

Let the input be a tensor
$$ x \in \mathbb{R}^{H \times W \times C} $$

Step 1: Depthwise convolution
$$
u_{h,w,c} = (x_{:, :, c} * k_c)(h,w)
$$

Step 2: LayerNorm (statistics computed over the $C$ channels at each position $(h, w)$)
$$
\hat{u}_{h,w,c} = \frac{u_{h,w,c} - \mu_{h,w}}{\sigma_{h,w}}
$$

Step 3: Expansion
$$
v_{h,w,c'} = \sum_{c} W^{(1)}_{c,c'}\,\hat{u}_{h,w,c}
$$
with expansion ratio 4, so $c'$ ranges over $4C$ hidden channels.

Step 4: GELU
$$
g_{h,w,c'} = \text{GELU}(v_{h,w,c'})
$$

Step 5: Projection
$$
p_{h,w,c} = \sum_{c'} W^{(2)}_{c',c}\,g_{h,w,c'}
$$

Step 6: Residual
$$
y = x + p
$$
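
To make the shapes in Steps 1 to 5 concrete, the sketch below counts the learnable parameters of one block. It assumes every layer has a bias and that LayerNorm has a per-channel scale and shift, as in standard PyTorch modules; `convnext_block_params` is a hypothetical helper, not part of any library.

```python
def convnext_block_params(dim, mlp_ratio=4, kernel_size=7, layer_scale=False):
    """Count learnable parameters per component of a single ConvNeXt block."""
    hidden = dim * mlp_ratio
    counts = {
        # depthwise conv: one k x k kernel plus one bias per channel
        "dwconv": dim * kernel_size * kernel_size + dim,
        # LayerNorm: scale and shift per channel (assumed affine)
        "norm": 2 * dim,
        # expansion: C -> 4C weights plus bias
        "pwconv1": dim * hidden + hidden,
        # projection: 4C -> C weights plus bias
        "pwconv2": hidden * dim + dim,
        # optional LayerScale vector (one value per channel)
        "layer_scale": dim if layer_scale else 0,
    }
    counts["total"] = sum(counts.values())
    return counts


for name, n in convnext_block_params(96).items():
    print(f"{name:12s} {n}")
```

For `dim=96` this gives 79,200 parameters in total. Almost all of them sit in the two pointwise layers ($8C^2$ weights), while the 7×7 depthwise convolution contributes only $49C$ weights.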

---

# 8. Why ConvNeXt Works So Well

ConvNeXt integrates the best ideas from both CNNs and Transformers:

### From Transformers:

* LayerNorm instead of BatchNorm
* GELU activation
* Large MLP expansion
* Fewer nonlinearities
* Simple stage transitions

### From CNNs:

* Efficient convolutions
* Local receptive fields
* No attention (faster, cheaper)

The result is **ViT-level accuracy with CNN-level speed**.

---

# 9. Model Variants

ConvNeXt models follow Swin-style naming:

* **ConvNeXt-T** (Tiny)
* **ConvNeXt-S** (Small)
* **ConvNeXt-B** (Base)
* **ConvNeXt-L** (Large)
* **ConvNeXt-XL** (Extra Large)
* **ConvNeXt-v2** (2023, a follow-up family that adds Global Response Normalization and FCMAE self-supervised pre-training, and drops LayerScale)

---

# 10. How it Performs

* ConvNeXt-Base (pure CNN) ≈ ViT-Base
* ConvNeXt-Large ≈ ViT-Large
* ConvNeXt-XL beats Swin-B and ViT-L in some tasks

Used widely for:

* Classification
* Detection (with FPN/Mask-RCNN)
* Segmentation (with UperNet / FPN)

---




## 1. PyTorch implementation of a ConvNeXt block

This is a **standalone, minimal but faithful** ConvNeXt block:

* Depthwise 7×7 convolution
* LayerNorm (channels-last)
* 1×1 conv MLP with 4× expansion
* GELU
* Optional LayerScale
* Optional stochastic depth (DropPath)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
import math
```

### 1.1 Utility: DropPath (stochastic depth)

```python
class DropPath(nn.Module):
    """
    Per-sample stochastic depth.
    """

    def __init__(self, drop_prob: float = 0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: Tensor) -> Tensor:
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        # shape: (batch, 1, 1, 1) so it is broadcast across H, W, C
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
        random_tensor = torch.floor(random_tensor)  # 0 or 1
        return x / keep_prob * random_tensor
```

### 1.2 Utility: LayerNorm for channels-last

ConvNeXt uses channels-last (`N, H, W, C`) internally for LayerNorm efficiency.

```python
class LayerNormChannelsLast(nn.LayerNorm):
    """
    LayerNorm expecting input in (B, H, W, C) format.
    Inherits nn.LayerNorm but just documents expected layout.
    """
    def __init__(self, normalized_shape, eps=1e-6):
        super().__init__(normalized_shape, eps=eps)
```

### 1.3 ConvNeXt block (single stage block)

```python
class ConvNeXtBlock(nn.Module):
    def __init__(
        self,
        dim: int,
        mlp_ratio: float = 4.0,
        drop_path: float = 0.0,
        layer_scale_init_value: float = 1e-6,
    ):
        """
        dim: number of channels (C)
        mlp_ratio: expansion ratio in the 1x1 conv MLP
        drop_path: stochastic depth rate
        layer_scale_init_value: if > 0, uses a learnable gamma vector
        """
        super().__init__()

        self.dwconv = nn.Conv2d(
            dim, dim,
            kernel_size=7,
            padding=3,
            groups=dim  # depthwise
        )

        self.norm = LayerNormChannelsLast(dim, eps=1e-6)

        hidden_dim = int(dim * mlp_ratio)
        self.pwconv1 = nn.Linear(dim, hidden_dim)  # channels-last, so use Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(hidden_dim, dim)

        if layer_scale_init_value > 0:
            self.gamma = nn.Parameter(
                layer_scale_init_value * torch.ones(dim),
                requires_grad=True
            )
        else:
            self.gamma = None

        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()

    def forward(self, x: Tensor) -> Tensor:
        """
        x: (B, C, H, W)
        """
        shortcut = x

        # 1) depthwise conv in NCHW
        x = self.dwconv(x)  # (B, C, H, W)

        # 2) convert to NHWC for LayerNorm + MLP
        x = x.permute(0, 2, 3, 1)  # (B, H, W, C)

        # 3) LayerNorm
        x = self.norm(x)

        # 4) MLP: Linear -> GELU -> Linear
        x = self.pwconv1(x)       # (B, H, W, hidden_dim)
        x = self.act(x)
        x = self.pwconv2(x)       # (B, H, W, C)

        # 5) LayerScale (optional)
        if self.gamma is not None:
            x = self.gamma * x

        # 6) back to NCHW
        x = x.permute(0, 3, 1, 2)  # (B, C, H, W)

        # 7) residual + drop_path
        x = shortcut + self.drop_path(x)
        return x
```

You can drop this into a stage like:

```python
class ConvNeXtStage(nn.Module):
    def __init__(self, dim, depth, mlp_ratio=4.0, drop_path_rate=0.0):
        super().__init__()
        dpr = torch.linspace(0, drop_path_rate, depth).tolist()  # different drop_rates
        self.blocks = nn.ModuleList(
            [
                ConvNeXtBlock(
                    dim=dim,
                    mlp_ratio=mlp_ratio,
                    drop_path=dpr[i],
                )
                for i in range(depth)
            ]
        )

    def forward(self, x: Tensor) -> Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x
```

---

## 2. Step-by-step flow diagram of the ConvNeXt block

Assume input tensor
$$
x \in \mathbb{R}^{B \times C \times H \times W}.
$$

High-level diagram:

```text
        ┌────────────────────────────┐
        │         Input x            │
        │     (B, C, H, W)           │
        └────────────┬───────────────┘
                     │
                     ▼
          Depthwise Conv 7×7 (groups=C)
          (B, C, H, W)
                     │
                     ▼
          Permute → (B, H, W, C)
                     │
                     ▼
                LayerNorm
          (over channels, per position)
                     │
                     ▼
              Linear (C → 4C)
                     │
                     ▼
                    GELU
                     │
                     ▼
              Linear (4C → C)
                     │
                     ▼
              (optional) γ ⊙ x
                LayerScale
                     │
                     ▼
          Permute → (B, C, H, W)
                     │
                     ▼
            DropPath (stochastic)
                     │
                     ▼
          Residual add with input
          y = x_input + Δx
                     │
                     ▼
                Output y
            (B, C, H, W)
```

Step-by-step narrative:

1. **Input**
   Take input feature map
   $$x \in \mathbb{R}^{B \times C \times H \times W}.$$

2. **Depthwise 7×7 convolution**
   Apply depthwise conv, one kernel per channel:
   $$u = \text{DWConv}_{7\times 7}(x) \in \mathbb{R}^{B \times C \times H \times W}.$$

3. **Change layout**
   Permute to channels-last:
   $$u' = \text{permute}(u) \in \mathbb{R}^{B \times H \times W \times C}.$$

4. **LayerNorm**
   Normalize across the channels at each spatial position:
   $$\hat{u} = \text{LayerNorm}(u').$$

5. **First linear (expansion)**
   $$v = \hat{u} W_1 + b_1,$$
   where
   $$W_1 \in \mathbb{R}^{C \times 4C}, \quad v \in \mathbb{R}^{B \times H \times W \times 4C}.$$

6. **Nonlinearity**
   $$g = \text{GELU}(v).$$

7. **Second linear (projection)**
   $$p = g W_2 + b_2,$$
   where
   $$W_2 \in \mathbb{R}^{4C \times C}, \quad p \in \mathbb{R}^{B \times H \times W \times C}.$$

8. **LayerScale (optional)**
   If using gamma:
   $$p' = \gamma \odot p,$$
   where
   $$\gamma \in \mathbb{R}^{C}.$$

9. **Back to NCHW**
   $$p'' = \text{permute}(p') \in \mathbb{R}^{B \times C \times H \times W}.$$

10. **DropPath**
    $$\Delta x = \text{DropPath}(p'').$$

11. **Residual add**
    $$y = x + \Delta x.$$

---

## 3. ConvNeXt training recipe (classification)

This is a standard, ImageNet-style training recipe adapted from common ConvNeXt usage.

### 3.1 Data preprocessing

**Input size**: 224×224 (for standard models)

**Training transforms**:

* RandomResizedCrop(224, interpolation=bilinear)
* RandomHorizontalFlip(0.5)
* Color jitter (optional, light)
* AutoAugment or RandAugment (recommended)
* Mixup + CutMix
* Random Erasing
* Normalize with ImageNet mean/std:

  * mean = [0.485, 0.456, 0.406]
  * std = [0.229, 0.224, 0.225]

**Validation transforms**:

* Resize shorter side to 256
* CenterCrop(224)
* Normalize with same mean/std

### 3.2 Optimizer and schedule

* Optimizer: **AdamW**
* Base learning rate for large batch (e.g. 4096 global):
  $$\text{lr}_{\text{base}} = 4 \times 10^{-3}.$$
* If your global batch size is smaller, scale linearly:
$$\text{lr} = \text{lr}_{\text{base}} \times \frac{\text{batch\_size}}{4096}.$$

* Weight decay:
  $$\text{wd} = 0.05.$$
* Betas: (0.9, 0.999)

**Scheduler**:

* Warmup: 20 epochs of linear warmup to peak lr
* Then cosine decay down to a small value
  $$\text{lr}_{\text{final}} \approx \text{lr}_{\text{max}} \times 10^{-2}.$$
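
Written out, this schedule is piecewise. As a sketch, let $t$ be the current epoch, $T_w$ the warmup length (20 here), and $T$ the total number of epochs; these three symbols are notation introduced for this formula and do not appear in the recipe itself:

$$
\text{lr}(t) =
\begin{cases}
\text{lr}_{\text{max}} \cdot \dfrac{t}{T_w}, & t < T_w \\[6pt]
\text{lr}_{\text{final}} + \dfrac{1}{2}\left(\text{lr}_{\text{max}} - \text{lr}_{\text{final}}\right)\left(1 + \cos\dfrac{\pi\,(t - T_w)}{T - T_w}\right), & T_w \le t \le T
\end{cases}
$$

At $t = T_w$ the cosine term equals 1, so the rate is $\text{lr}_{\text{max}}$; at $t = T$ it equals $-1$, so the rate has decayed to $\text{lr}_{\text{final}}$.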

**Epochs**:

* 300 epochs for ImageNet from scratch (common setting)
* You can do 100–150 epochs for smaller experiments.

**Regularization details**:

* Label smoothing:
  $$\epsilon = 0.1.$$
* Mixup: alpha = 0.8
* CutMix: alpha = 1.0
* DropPath (stochastic depth): linearly increased with depth, e.g. max 0.1–0.3 for deeper models.
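
The rule that DropPath increases linearly with depth can be sketched as follows. This assumes the rate grows from 0 at the first block to the maximum at the last block across the whole network, and it uses the ConvNeXt-T stage depths (3, 3, 9, 3) as the example; `drop_path_schedule` is an illustrative helper, not a library function.

```python
def drop_path_schedule(depths, max_rate):
    """Split a linear 0 -> max_rate ramp over all blocks into per-stage lists."""
    total = sum(depths)
    if total <= 1:
        return [[0.0] * d for d in depths]
    # one rate per block, evenly spaced over the whole network
    rates = [max_rate * i / (total - 1) for i in range(total)]
    per_stage, start = [], 0
    for depth in depths:
        per_stage.append(rates[start:start + depth])
        start += depth
    return per_stage


for i, stage_rates in enumerate(drop_path_schedule((3, 3, 9, 3), 0.1), start=1):
    print(f"stage {i}:", [round(r, 3) for r in stage_rates])
```

Each inner list can then be passed block by block as the `drop_path` argument of the blocks in that stage.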

### 3.3 Simple PyTorch training skeleton

Below is a **minimal** training loop skeleton for classification using a ConvNeXt backbone (you can plug in your own model):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_one_epoch(model, loader, optimizer, epoch, device, scaler=None, criterion=None):
    model.train()
    if criterion is None:
        criterion = nn.CrossEntropyLoss(label_smoothing=0.1).to(device)

    for step, (images, targets) in enumerate(loader):
        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)

        # optional mixed precision
        if scaler is not None:
            with torch.cuda.amp.autocast():
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

        if step % 100 == 0:
            print(f"Epoch {epoch} | Step {step}/{len(loader)} | Loss: {loss.item():.4f}")

def validate(model, loader, device):
    model.eval()
    correct1, total = 0, 0
    loss_sum = 0.0
    criterion = nn.CrossEntropyLoss().to(device)

    with torch.no_grad():
        for images, targets in loader:
            images = images.to(device, non_blocking=True)
            targets = targets.to(device, non_blocking=True)

            outputs = model(images)
            loss = criterion(outputs, targets)
            loss_sum += loss.item() * images.size(0)

            _, pred = outputs.topk(1, dim=1)
            correct1 += (pred.squeeze(1) == targets).sum().item()
            total += targets.size(0)

    top1 = 100.0 * correct1 / total
    avg_loss = loss_sum / total
    print(f"Validation: Loss={avg_loss:.4f}, Top-1 Acc={top1:.2f}%")
    return avg_loss, top1
```

And a high-level setup:

```python
from torch.optim import AdamW
from torch.cuda.amp import GradScaler

model = YourConvNeXtModel(num_classes=1000).to(device)

optimizer = AdamW(
    model.parameters(),
    lr=4e-3,      # adjust for your batch size
    weight_decay=0.05,
    betas=(0.9, 0.999),
)

scaler = GradScaler()

for epoch in range(num_epochs):
    # update lr via scheduler here if you use cosine
    train_one_epoch(model, train_loader, optimizer, epoch, device, scaler=scaler)
    validate(model, val_loader, device)
```

---



```python
# fmt: off
# isort: skip_file
# DO NOT reorganize imports - warnings filter must be FIRST!
import warnings
import os

warnings.filterwarnings('ignore')
os.environ['PYTHONWARNINGS'] = 'ignore'

import timm
# fmt: on


all_convnext = timm.list_models("*convnext*", pretrained=True)
for m in all_convnext:
    print(m)
```

```text
convnext_atto.d2_in1k
convnext_atto_ols.a2_in1k
convnext_base.clip_laion2b
convnext_base.clip_laion2b_augreg
convnext_base.clip_laion2b_augreg_ft_in1k
convnext_base.clip_laion2b_augreg_ft_in12k
convnext_base.clip_laion2b_augreg_ft_in12k_in1k
convnext_base.clip_laion2b_augreg_ft_in12k_in1k_384
convnext_base.clip_laiona
convnext_base.clip_laiona_320
convnext_base.clip_laiona_augreg_320
convnext_base.clip_laiona_augreg_ft_in1k_384
convnext_base.fb_in1k
convnext_base.fb_in22k
convnext_base.fb_in22k_ft_in1k
convnext_base.fb_in22k_ft_in1k_384
convnext_femto.d1_in1k
convnext_femto_ols.d1_in1k
convnext_large.fb_in1k
convnext_large.fb_in22k
convnext_large.fb_in22k_ft_in1k
convnext_large.fb_in22k_ft_in1k_384
convnext_large_mlp.clip_laion2b_augreg
convnext_large_mlp.clip_laion2b_augreg_ft_in1k
convnext_large_mlp.clip_laion2b_augreg_ft_in1k_384
convnext_large_mlp.clip_laion2b_augreg_ft_in12k_384
convnext_large_mlp.clip_laion2b_ft_320
convnext_large_mlp.clip_laion2b_ft_soup_320
convnext_large_mlp.c
```

# **ConvNeXt / ConvNeXt-v2 Model Names in timm**
Below is an explanation of the ConvNeXt model names in **timm**, including:

1. **Naming scheme (atto, pico, femto, nano...)**
2. **Parameters, depth, width, FLOPs** (approx).
3. **Differences between ConvNeXt and ConvNeXt-v2**
4. **Meaning of OLS, RMS, HNF, MLP variants**
5. **A table summarizing all models**

Everything is structured so you can quickly choose the right backbone.

---

# 1. ConvNeXt naming scheme in `timm`

timm uses SI-prefix (“scientific scale”) names for the smallest models:

| Name        | Meaning        | Typical Params |
| ----------- | -------------- | -------------- |
| **zepto**   | extremely tiny | ~1M–1.5M       |
| **atto**    | very tiny      | ~3.7M          |
| **femto**   | tiny           | ~5.2M          |
| **pico**    | small-ish tiny | ~9.1M          |
| **nano**    | small          | ~15.6M         |
| **tiny**    | medium-small   | ~28M           |
| **small**   | medium         | ~50M           |
| **base**    | large-ish      | ~88M           |
| **large**   | very large     | ~197M          |
| **xlarge**  | bigger         | ~350M          |
| **xxlarge** | huge           | ~846M          |

These correspond to the same conceptual hierarchy as MobileNet/ViT sizes (tiny → small → base → large).

---

# 2. Full explanation of model strings

## **convnext_atto**

* ConvNeXt-v1 “Atto”
* ~3.7M params
* Extremely lightweight model (MobileNet-like level)

## **convnext_atto_ols**

* Same as above, but **OLS = overlapping stem**: an experimental `timm` variant in which the non-overlapping patchify stem is replaced by a stem of overlapping convolutions.

## **convnext_atto_rms**

* Same model, but **RMSNorm** instead of LayerNorm.
* RMSNorm:
  $$
  y = \frac{x}{\sqrt{\text{mean}(x^2)}}\cdot g
  $$
  No bias; faster on some hardware.
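
To show the difference from LayerNorm, here is a small pure-Python sketch of both normalizations applied to one channel vector. The `eps` term is an assumption added for numerical safety (it is not in the formula above), and both function names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without affine parameters: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: divide by the root mean square, no mean subtraction, no bias."""
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


x = [3.0, 4.0]
print("LayerNorm:", layer_norm(x))  # centered: the outputs sum to zero
print("RMSNorm:  ", rms_norm(x))    # not centered: the signs of x are kept
```

RMSNorm skips the mean computation and the shift parameter, which is where the potential speed advantage comes from.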

---

## **convnext_base**

* Standard ConvNeXt-B (Base)
* 88M params
* 224×224 FLOPs ≈ 15.4 GFLOPs

Roughly comparable to ViT-B in ImageNet accuracy.

---

## **convnext_femto**

* ConvNeXt-v1 “Femto”
* ~5.2M params
* A bit larger than Atto

## **convnext_femto_ols**

* Same model with the overlapping stem (OLS)

---

## **convnext_large**

* ConvNeXt-L (Large)
* 197M params
* ≈ 34 GFLOPs
* Large-scale backbone (ImageNet ~86–87% top-1)

## **convnext_large_mlp**

* Same ConvNeXt large, but classifier head is changed to a bigger MLP-style head.
* Useful for some classification tasks.

---

## **convnext_nano**

* Nano: ~15.6M params
* Good compromise for edge devices
* Faster than EfficientNet-B0 with similar accuracy

## **convnext_nano_ols**

* Overlapping-stem (OLS) version

---

## **convnext_pico**

* Pico: ~9.1M params
* Very small but slightly more width than femto

## **convnext_pico_ols**

* OLS version

---

## **convnext_small**

* ConvNeXt-S (Small)
* 50M params
* ≈ 8.7 GFLOPs
* Good for moderate GPU training

---

## **convnext_tiny**

* ConvNeXt-T (Tiny)
* 28M params
* Positioned as a “ResNet-50 replacement”
* ≈ 4.5 GFLOPs

## **convnext_tiny_hnf**

* **hnf = head norm first** variant
* Another experimental variant in timm.
* Applies the final norm layer before global pooling in the head instead of after it; the backbone blocks are unchanged.

---

## **convnext_xlarge**

* ConvNeXt-XL (Extra Large)
* 350M params
* ~60 GFLOPs
* Very heavy backbone

## **convnext_xxlarge**

* ConvNeXt-XXL
* ~846M params
* More than 100 GFLOPs
* Research-grade only

---

## **convnext_zepto_rms**

* “Zepto” size (~1–1.5M params)
* Extreme low-resource
* RMSNorm version

## **convnext_zepto_rms_ols**

* RMSNorm + overlapping stem (OLS).

---

# 3. ConvNeXt-v2 models

ConvNeXt-v2 (Meta FAIR 2023) is an improved version:

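* Co-designed with **FCMAE** (fully convolutional masked autoencoder) self-supervised pre-training
* Adds **Global Response Normalization (GRN)** after the GELU in each block and drops LayerScale. For a feature map $X \in \mathbb{R}^{H \times W \times C}$ with channel slices $X_c$:
  $$
  G_c = \lVert X_c \rVert_2, \qquad N_c = \frac{G_c}{\frac{1}{C}\sum_{j} G_j + \epsilon}, \qquad \text{GRN}(X)_c = \gamma_c \, X_c \, N_c + \beta_c + X_c
  $$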
* Stronger performance at same parameter count
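
For reference, GRN as described in the ConvNeXt V2 paper has three steps: a per-channel L2 norm over all spatial positions, a divisive normalization by the mean of those norms across channels, and a learnable calibration with a residual connection. The pure-Python sketch below works on a nested list of shape (H, W, C); the function name and the list-based layout are chosen for illustration only.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization on a nested list x of shape (H, W, C)."""
    H, W, C = len(x), len(x[0]), len(x[0][0])
    # 1) global aggregation: L2 norm of each channel over all spatial positions
    g = [
        math.sqrt(sum(x[h][w][c] ** 2 for h in range(H) for w in range(W)))
        for c in range(C)
    ]
    # 2) divisive normalization: each channel norm relative to the mean norm
    mean_g = sum(g) / C
    n = [gc / (mean_g + eps) for gc in g]
    # 3) calibration with learnable gamma/beta, plus the residual input
    return [
        [
            [gamma[c] * x[h][w][c] * n[c] + beta[c] + x[h][w][c] for c in range(C)]
            for w in range(W)
        ]
        for h in range(H)
    ]


x = [[[1.0, 2.0], [3.0, 4.0]]]            # H=1, W=2, C=2
print(grn(x, [0.0, 0.0], [0.0, 0.0]))     # zero gamma/beta: identity mapping
print(grn(x, [1.0, 1.0], [0.0, 0.0]))     # channel 1 (larger norm) is boosted more
```

With `gamma` and `beta` initialized to zero, the layer starts as an identity mapping, so it can be added to a block without disturbing it at the start of training.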

### Names

## **convnextv2_atto**

* ~3.7M params
* New v2 block
* Much stronger than ConvNeXt-v1 Atto

## **convnextv2_base**

* 88M params (same as v1)
* Higher accuracy

## **convnextv2_femto**

* ~5.2M params

## **convnextv2_huge**

* ~660M params
* Very heavy research model

## **convnextv2_large**

* 197M params

## **convnextv2_nano**

* ~15.6M params

## **convnextv2_pico**

* ~9.1M params

## **convnextv2_small**

* ~50M params

## **convnextv2_tiny**

* ~28M params

---

# 4. test_convnext, test_convnext2, test_convnext3

These are **tiny test configurations in timm** (not for normal use):

* Intended for unit tests and quick smoke tests of the library
* Far too small to give competitive accuracy
* Should be ignored unless debugging timm source code

---

# 5. Summary table

### ConvNeXt-v1

| Model                  | Params | Notes                      |
| ---------------------- | ------ | -------------------------- |
| convnext_zepto_rms     | ~1M    | RMSNorm                    |
| convnext_zepto_rms_ols | ~1M    | RMSNorm + overlapping stem |
| convnext_atto          | ~3.7M  | standard                   |
| convnext_atto_ols      | ~3.7M  | overlapping stem           |
| convnext_atto_rms      | ~3.7M  | RMSNorm                    |
| convnext_femto         | ~5.2M  | tiny                       |
| convnext_femto_ols     | ~5.2M  | overlapping stem           |
| convnext_pico          | ~9.1M  |                            |
| convnext_pico_ols      | ~9.1M  | overlapping stem           |
| convnext_nano          | ~15.6M | good small backbone        |
| convnext_nano_ols      | ~15.6M | overlapping stem           |
| convnext_tiny          | 28M    | standard Tiny              |
| convnext_tiny_hnf      | 28M    | head norm first            |
| convnext_small         | 50M    |                            |
| convnext_base          | 88M    |                            |
| convnext_large         | 197M   |                            |
| convnext_large_mlp     | 197M   | bigger MLP head            |
| convnext_xlarge        | 350M   |                            |
| convnext_xxlarge       | ~846M  |                            |

### ConvNeXt-v2

| Model            | Params | Notes          |
| ---------------- | ------ | -------------- |
| convnextv2_atto  | ~3.7M  | v2 block + GRN |
| convnextv2_femto | ~5.2M  |                |
| convnextv2_pico  | ~9.1M  |                |
| convnextv2_nano  | ~15.6M |                |
| convnextv2_tiny  | 28M    |                |
| convnextv2_small | 50M    |                |
| convnextv2_base  | 88M    |                |
| convnextv2_large | 197M   |                |
| convnextv2_huge  | ~660M  |                |

---

# 6. Encoding the names (if you want to detect/parse automatically)

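A regex that encodes the naming scheme (split across lines for readability; join the lines or compile it in verbose mode). It covers the architecture name only, not the pretrained tag after the dot such as `.fb_in1k`:

```
^(convnextv2|convnext)_
(zepto|atto|femto|pico|nano|tiny|small|base|large|xlarge|xxlarge|huge)
((?:_(?:rms|ols|hnf|mlp))*)
$
```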

Where:

* The first group = **family**
* Second = **model size**
* Third = **modifier** (optional)
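
A sketch of how such a pattern could be used from Python. The pattern here is an extended variant: it accepts `huge`, stacked modifiers such as `_rms_ols`, and an optional pretrained tag after a dot. These extensions and the helper name `parse_convnext_name` are assumptions for illustration, not a timm API.

```python
import re

# Verbose mode lets the pattern span several lines; whitespace is ignored.
NAME_RE = re.compile(
    r"""
    ^(?P<family>convnextv2|convnext)_
    (?P<size>zepto|atto|femto|pico|nano|tiny|small|base|large|xlarge|xxlarge|huge)
    (?P<modifiers>(?:_(?:rms|ols|hnf|mlp))*)
    (?:\.(?P<tag>[\w.]+))?$
    """,
    re.VERBOSE,
)


def parse_convnext_name(name):
    """Split a model name into family, size, modifiers and pretrained tag."""
    match = NAME_RE.match(name)
    if match is None:
        return None  # e.g. test_convnext, or a non-ConvNeXt model
    modifiers = [m for m in match.group("modifiers").split("_") if m]
    return {
        "family": match.group("family"),
        "size": match.group("size"),
        "modifiers": modifiers,
        "tag": match.group("tag"),
    }


for name in ["convnext_zepto_rms_ols", "convnextv2_huge",
             "convnext_base.fb_in22k_ft_in1k_384", "test_convnext"]:
    print(name, "->", parse_convnext_name(name))
```

Names that do not follow the scheme return `None`, which makes it easy to filter them out of a model listing.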

---



# **ConvNeXt / ConvNeXt-v2 Models Ranking**
Below is a **clean, practical ranking** of all ConvNeXt / ConvNeXt-v2 models from **timm**, ordered by what *you should actually use in real projects*, not just theoretical size.

This ranking considers:

* Accuracy
* Stability
* Training difficulty
* Parameter efficiency
* GPU memory
* Usefulness for medical images (your case)
* Real-world performance vs FLOPs

Treat it as a **rule-of-thumb recommendation list**, not a benchmark.

---

# Final Ranking (Best → Worst for real use)

### **Tier 1: The best overall models (use these first)**

1. **convnextv2_base**
2. **convnextv2_small**
3. **convnextv2_tiny**
4. **convnext_tiny** (v1, still extremely strong)

These four offer the best *accuracy-to-compute ratio*.
They train easily and work on most GPUs.

---

# Tier 2: Very good models (if you want faster models)

5. **convnextv2_nano**
6. **convnextv2_pico**
7. **convnext_nano**
8. **convnext_small**

These are perfect for:

* Medium datasets
* Edge devices
* When you want high accuracy with low compute

---

# Tier 3: Large research-grade / heavy models (avoid unless you need max accuracy)

9. **convnextv2_large**
10. **convnextv2_huge**
11. **convnext_large**
12. **convnext_xlarge**
13. **convnext_xxlarge**

Reasons:

* Require huge batch sizes
* Hard to train
* Will **not** give you meaningful gain on a small dataset (like X-rays)
* Mainly for ImageNet-1k/22k or huge datasets
* Require multi-GPU or A100/H100 hardware

---

# Tier 4: Very tiny models (OK but lower accuracy)

14. **convnextv2_femto**
15. **convnextv2_atto**
16. **convnext_femto**
17. **convnext_pico**
18. **convnext_atto**
19. **convnext_zepto_rms**

Use these **only** for ultra-low-compute environments or mobile apps.
Not good for deep, subtle medical image patterns.

---

# Tier 5: Experimental / special variants (avoid)

20. **convnext_…_ols** (overlapping-stem variants)
21. **convnext_…_rms** (RMSNorm variants)
22. **convnext_tiny_hnf**
23. **convnext_large_mlp**

They are:

* Experimental
* No strong community benchmarks
* Mostly useful for profiling, not production

---

# Tier 6: Do not use (internal test models)

24. **test_convnext**
25. **test_convnext2**
26. **test_convnext3**

These exist **only for timm internal testing**.

---

# Best Choice for *You* (Medical Images, 4-class classification, dataset size ~6k)

### **Use these in order:**

1. **convnextv2_tiny** ← Best for your dataset + single GPU
2. **convnextv2_small** ← If you have 12–24 GB VRAM
3. **convnext_tiny** ← If you want classic ConvNeXt
4. **convnextv2_nano** ← If you want fast & light
5. **convnext_base** ← Only if you have a strong GPU and enough data (≥50k samples)

Anything larger than “base” is likely to **overfit your chest X-ray dataset**.

---

#  Simple rule of thumb

* **Small dataset (<20k images)** → Use **Tiny**, **Nano**, or **Small**
* **Medium dataset (50k–200k)** → Use **Small** or **Base**
* **Large dataset (>1M)** → Use **Large** or **Huge**

For **medical imaging** datasets of this size, Tiny and Small ConvNeXt models often match or outperform bigger ones because they overfit less.

---



# **ConvNext Training Recipe**
Below is a **model-size-specific recipe** for the ConvNeXt / ConvNeXt-v2 models in timm.
These recommendations are starting points drawn from the official ConvNeXt recipe, timm defaults, and empirical results.
Tune them for your own data.

---

# 1. Core Rule (Most Important)

ConvNeXt **does NOT like SGD**.
ConvNeXt **strongly prefers AdamW**, cosine decay, and long warmup.

ConvNeXt = Transformer-style training.

So:

* Optimizer = **AdamW** (always)
* Weight decay = **0.05**
* Betas = **(0.9, 0.999)**
* Learning rate = **scaled by batch size**
* Scheduler = **cosine + warmup**

This is true for all sizes: from atto → tiny → base → xxlarge.

---

# 2. Learning Rate Scaling

The official rule:

$$\text{lr} = 4 \times 10^{-3} \cdot \frac{\text{batch\_size}}{4096}$$

Example:

| Global batch | LR      |
| ------------ | ------- |
| 4096         | 4e-3    |
| 2048         | 2e-3    |
| 1024         | 1e-3    |
| 512          | 5e-4    |
| 256          | 2.5e-4  |
| 128          | 1.25e-4 |

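If you train on **one GPU**, your batch size may be 32 to 128, for which the linear rule gives roughly 3e-5 to 1.25e-4.
The per-size recommendations below use a somewhat higher LR between **1e-4 and 3e-4**, so treat the linear rule as a lower starting point and tune from there.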
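
The table can be reproduced with a one-line helper. This is a minimal sketch of the linear scaling rule stated above; `scaled_lr` is an illustrative name.

```python
def scaled_lr(global_batch_size, base_lr=4e-3, base_batch_size=4096):
    """Linear scaling rule: lr grows in proportion to the global batch size."""
    return base_lr * global_batch_size / base_batch_size


for batch in (4096, 2048, 1024, 512, 256, 128, 32):
    print(f"batch {batch:5d} -> lr {scaled_lr(batch):.3e}")
```

The global batch size is the per-GPU batch size times the number of GPUs times any gradient accumulation steps.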

---

# 3. Recommended settings *per ConvNeXt model size*

Below I provide **the best settings** for each name category.

---

# 3.1. For extremely tiny models

**`convnext_zepto*`, `convnext_atto*`, `convnext_femto*`, `convnext_pico*`, `convnext_nano*`**

### Optimizer

* AdamW

### Hyperparameters

* lr = **2e-4 → 4e-4** (for batch 128–256)
* weight decay = **0.05**
* betas = **(0.9, 0.999)**
* drop_path = **0.0 → 0.1**
* epochs = **300**
* warmup = **20 epochs**

### Notes

* These models tend to underfit, so use a smaller DropPath
* No need for strong regularization like mixup/cutmix
* Good for medical images (X-ray)

---

# 3.2. convnext_tiny (28M) and convnext_tiny_hnf

### Optimizer

* AdamW

### Hyperparameters

* lr = **3e-4** (batch 128)
* weight decay = **0.05**
* betas = **(0.9, 0.999)**
* drop_path = **0.1 → 0.2**
* epochs = **300**
* warmup = **20 epochs**

### Notes

* This is the “entry-level strong model”

---

# 3.3. convnext_small (50M)

### Optimizer

* AdamW

### Hyperparameters

* lr = **2e-4** (batch 128)
* weight decay = **0.05**
* drop_path = **0.2 → 0.4**
* epochs = **300**

### Notes

* Start using stronger regularization
* Mixup = 0.8, CutMix = 1.0 recommended

---

# 3.4. convnext_base (88M)

### Optimizer

* AdamW

### Hyperparameters

* lr = **1e-4 → 1.5e-4**
* weight decay = **0.05**
* betas = **(0.9, 0.999)**
* drop_path = **0.3 → 0.5**
* epochs = **300**
* warmup = **20**

### Notes

* Needs larger stochastic depth
* Sensitive to lr > 2e-4 → avoid too large an lr

---

# 3.5. convnext_large (197M) / convnext_large_mlp

### Optimizer

* AdamW

### Hyperparameters

* lr = **8e-5 → 1e-4**
* weight decay = **0.05**
* drop_path = **0.4 → 0.6**
* epochs = **300**
* warmup = **20**

### Notes

* If training from scratch:

  * GPU memory heavy
  * Requires gradient checkpointing

---

# 3.6. convnext_xlarge / convnext_xxlarge

### Optimizer

* AdamW

### Hyperparameters

* lr = **5e-5 → 8e-5**
* wd = **0.05**
* drop_path = **0.6 → 0.8**
* epochs = **300**
* warmup = **20**

### Notes

* Only realistic with multi-GPU training
* Very sensitive to learning rate

---

# 3.7. ConvNeXt-v2 variants

**convnextv2_atto to convnextv2_huge**

ConvNeXt-v2 uses **GRN**, but the training setup stays similar.

### Optimizer

* AdamW

### Hyperparameters

* lr = **3e-4** for the smallest models (atto to nano)
* lr = **2e-4** for tiny → small
* lr = **1e-4 → 8e-5** for base → large → huge
* weight decay = **0.05**
* drop_path = **0.1 → 0.6 depending on size**
* epochs = **300**
* warmup = **20**

---

# 4. Special model modifiers: OLS / RMS / HNF

### If the model name includes:

| Modifier | Meaning                          | Training impact                                              |
| -------- | -------------------------------- | ------------------------------------------------------------ |
| **_ols** | Overlapping stem                 | Same optimizer, identical hyperparameters                    |
| **_rms** | RMSNorm                          | Same optimizer and wd; retune lr only if training is unstable |
| **_hnf** | Head norm first (norm before pooling) | Same training setup                                     |

So:
You do **NOT** need to change optimizer or wd for RMS/OLS/HNF.

---

# 5. Which optimizer + params should YOU use (recommended per complexity)

Below is a clean shortlist depending on your dataset size:

---

## Small datasets (medical images, 5k–20k samples)

**Recommended models:**

* convnext_femto
* convnext_pico
* convnext_nano
* convnext_tiny

**Settings:**

```
optimizer = AdamW
lr = 1e-4 → 2e-4
weight_decay = 0.05
drop_path = 0.0 → 0.1
epochs = 100 → 150
warmup = 5 → 10
no Mixup / CutMix (medical data)
label smoothing = 0.05
```

---

## Medium datasets (50k–200k)

**Recommended:**

* convnext_tiny
* convnext_small
* convnext_base

**Settings:**

```
optimizer = AdamW
lr = 2e-4 → 3e-4
wd = 0.05
drop_path = 0.2
epochs = 200 → 300
warmup = 10 → 20
mixup = 0.8
cutmix = 1.0
label smoothing = 0.1
```

---

## Large datasets (ImageNet-scale, >1M images)

**Recommended:**

* convnext_base
* convnext_large
* convnext_xlarge

**Settings:**

```
optimizer = AdamW
lr = scaled by batch (1e-4 → 4e-4)
wd = 0.05
drop_path = 0.3 → 0.7
epochs = 300
warmup = 20
mixup = 0.8
cutmix = 1.0
label smoothing = 0.1
```
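
The three settings blocks above can be collected into one lookup table so a training script picks a preset from the dataset size. This is only a sketch: `PRESETS` and `select_preset` are hypothetical names, and the cut-offs at 50k and 1M samples are assumptions, since the ranges above leave 20k–50k and 200k–1M unassigned.

```python
# Values copied from the three settings blocks above; tuples are (low, high) ranges.
PRESETS = {
    "small": {
        "lr": (1e-4, 2e-4), "weight_decay": 0.05, "drop_path": (0.0, 0.1),
        "epochs": (100, 150), "warmup": (5, 10),
        "mixup": None, "cutmix": None, "label_smoothing": 0.05,
    },
    "medium": {
        "lr": (2e-4, 3e-4), "weight_decay": 0.05, "drop_path": (0.2, 0.2),
        "epochs": (200, 300), "warmup": (10, 20),
        "mixup": 0.8, "cutmix": 1.0, "label_smoothing": 0.1,
    },
    "large": {
        "lr": (1e-4, 4e-4), "weight_decay": 0.05, "drop_path": (0.3, 0.7),
        "epochs": (300, 300), "warmup": (20, 20),
        "mixup": 0.8, "cutmix": 1.0, "label_smoothing": 0.1,
    },
}

def select_preset(num_samples: int) -> str:
    """Map a dataset size to a preset name (assumed cut-offs, see lead-in)."""
    if num_samples < 50_000:
        return "small"
    if num_samples < 1_000_000:
        return "medium"
    return "large"

name = select_preset(6_000)
print(name, PRESETS[name]["lr"])
```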

---

# 6. Simple code snippet to load the recommended optimizer

```python
import torch
import torch.nn as nn
import timm

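device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 100  # example value; also the cosine schedule length (T_max)

model_name = "convnext_tiny"  # change here
model = timm.create_model(model_name, pretrained=True, num_classes=4).to(device)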

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # adjust for your batch size
    weight_decay=0.05,
    betas=(0.9, 0.999),
)

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_epochs,
    eta_min=1e-6,
)
```

Warmup can be done with a custom warmup scheduler (linear warmup).
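
One way to get linear warmup without writing a custom class is to chain PyTorch's built-in `LinearLR` and `CosineAnnealingLR` with `SequentialLR`. The sketch below uses a single dummy parameter in place of a real model; `num_epochs`, `warmup_epochs`, and the `start_factor` of 0.01 are assumed example values.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

num_epochs = 100     # assumed total training length
warmup_epochs = 5    # assumed warmup length
base_lr = 3e-4

# A dummy parameter stands in for model.parameters().
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=base_lr, weight_decay=0.05)

# Ramp linearly from 1% of base_lr, then hand over to cosine decay.
warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=num_epochs - warmup_epochs, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

lrs = []
for epoch in range(num_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used during this epoch
    optimizer.step()                             # the training loop would go here
    scheduler.step()

print(f"first: {lrs[0]:.2e}  peak: {max(lrs):.2e}  last: {lrs[-1]:.2e}")
```

The schedulers are stepped once per epoch here; stepping per iteration works the same way if the lengths are expressed in iterations instead.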

---


## Fine-Tune a Pretrained Model


```python
# fmt: off
# isort: skip_file
# DO NOT reorganize imports - warnings filter must be FIRST!

import warnings
import os

# Install the filters before importing torch/timm so their import-time warnings are hidden.
warnings.filterwarnings('ignore')
os.environ['PYTHONWARNINGS'] = 'ignore'

import torch
import torch.nn as nn
import torch.nn.functional as F
import timm
# fmt: on


model_name = "convnextv2_tiny"
model = timm.create_model(model_name, pretrained=True)
model_config = model.pretrained_cfg
```



## Getting Input, Feature, and Output Sizes

To fine-tune, we need to know the shape of the last stage's output. The layer definitions give only the number of channels, not the spatial resolution, because a `Conv2d` specifies only its input and output channel counts. Passing a dummy input through the model reveals the full shapes.

```python
# print(type(model_config["input_size"]))
print("input shape: ", model_config["input_size"])
(C, H, W) = model_config["input_size"]

x = torch.randn(1, C, H, W)

with torch.no_grad():
    features = model.forward_features(x)
    print("features shape:", features.shape)

    out = model(x)
    print("output shape: ", out.shape)
```

```
input shape:  (3, 224, 224)
features shape: torch.Size([1, 768, 7, 7])
output shape:  torch.Size([1, 1000])
```
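
The 7×7 resolution follows from the downsampling schedule in the stage table (224 → 56 → 28 → 14 → 7), i.e. a stride-4 stem followed by three stride-2 downsampling steps. A small sketch of that arithmetic; `final_feature_size` is a hypothetical helper, and the floor division is an assumption that is exact only for inputs divisible by 32:

```python
def final_feature_size(height, width, strides=(4, 2, 2, 2)):
    """Spatial size of the last-stage feature map for a given input size."""
    for s in strides:
        height, width = height // s, width // s
    return height, width

print(final_feature_size(224, 224))  # (7, 7)
print(final_feature_size(384, 384))  # (12, 12)
```

Because the resolution depends on the input size while the channel count does not, a head built on global pooling keeps working when the input resolution changes.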


## Setting Head


In **ConvNeXt / ConvNeXtV2**, the **correct and standard way** is:

### ✅ **Apply global average pooling → get a 768-dim vector → feed to your final FC layer**

Not the flattened 7×7×768 tensor.

Below is the reasoning and comparison.

---

# 1. **What the backbone outputs**

You observed:

* Output shape from last stage:
  **[B, 768, 7, 7]**

This is the **feature map** produced by ConvNeXtV2-Tiny.

To convert this into class predictions, you have **two options**:

---

# OPTION A: **Global Average Pooling (GAP)**

### Output:

* Input: [B, 768, 7, 7]
* After GAP: [B, 768]

### Classification head:

* FC(768 → 4)

### This is what ConvNeXt is designed for.

This is what timm does internally.

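### Why it's preferred:

1. **Reduces parameters dramatically**
   A fully connected layer on flattened features would take an input of size:

   $$768 \times 7 \times 7 = 37632$$

   So the FC layer would be
   **37,632 → 4** (150,532 parameters including biases)
   versus
   **768 → 4** (3,076 parameters including biases),
   roughly a 49× difference in the number of learnable weights.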

2. **Prevents overfitting**
   You have **6k images**. A head with ~150k parameters will overfit quickly.

3. **GAP is translation-invariant**
   For medical images (like X-rays), location can vary slightly due to patient misalignment.

4. **Most modern CNNs use GAP:**

   * ResNet
   * EfficientNet
   * MobileNet
   * ConvNeXt + ConvNeXtV2
   * RegNet
   * DenseNet

It is the standard because it turns feature maps into semantic vectors without exploding parameter count.
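
The parameter counts for the two options can be checked directly. A minimal sketch using a random tensor as a stand-in for the backbone output (no pretrained weights involved); `count_params` is a small helper defined here:

```python
import torch
import torch.nn as nn

features = torch.randn(2, 768, 7, 7)   # dummy stand-in for the last-stage feature map
num_classes = 4

# Option A: global average pooling, then a small linear classifier.
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(1), nn.Linear(768, num_classes))
# Option B: flatten the whole 7x7x768 map into one large linear layer.
flat_head = nn.Sequential(nn.Flatten(1), nn.Linear(768 * 7 * 7, num_classes))

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(gap_head(features).shape, count_params(gap_head))    # torch.Size([2, 4]) 3076
print(flat_head(features).shape, count_params(flat_head))  # torch.Size([2, 4]) 150532
```

Both heads produce the same output shape; only the number of weights that must be learned from the small dataset differs.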

---

# OPTION B: **Flatten (7×7×768) and feed through a large FC**

### Output:

* Flatten: [B, 37,632]

### Head:

* FC(37632 → 4)

### Problems:

1. Much larger parameter count
2. Overfitting risk increases significantly
3. Slower training
4. Ignores spatial pooling benefits
5. Not how ConvNeXt was designed to be used

### When is flattening used?

Rarely in modern CNNs.
Only older architectures such as AlexNet and VGG flattened the feature map into large FC layers.

---

# ✔️ Conclusion: Use Global Pooling

### **Use this:**

```python
model = timm.create_model(model_name, pretrained=True, num_classes=4)
```

timm will automatically:

* apply global average pooling
* replace the FC head with a 768 → 4 classifier

If you want manual control:

```python
model = timm.create_model(model_name, pretrained=True)
in_features = model.head.fc.in_features
model.head.fc = nn.Linear(in_features, 4)
```

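If an implementation exposes the head as a single `nn.Linear`, replace it directly. In timm, `model.head` also holds the pooling and norm layers, so prefer `model.head.fc` there: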

```python
model.head = nn.Linear(768, 4)
```

---

# 🔍 Summary (which one and why)

| Method                 | Input to classifier | Params    | Overfitting risk | Standard? | Recommended? |
| ---------------------- | ------------------- | --------- | ---------------- | --------- | ------------ |
| **Global Avg Pooling** | 768                 | Very low  | Low              | ✔️ Yes    | **✔️ Yes**   |
| **Flatten 7×7×768**    | 37,632              | Very high | High             | ❌ No     | **❌ No**    |

---


# ✔️ **Custom Head: Structure Review**



```python
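num_features = 768  # channels of the last ConvNeXtV2-Tiny stage
num_classes = 4     # number of target classes

custom_head_v1_b = nn.Sequential(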
    nn.AdaptiveAvgPool2d(1),     # [B, 768, 7, 7] -> [B, 768, 1, 1]
    nn.Flatten(1),               # -> [B, 768]
    nn.Dropout(0.3),
    nn.Linear(num_features, 128),
    nn.ReLU(),
    nn.Dropout(0.25),
    nn.Linear(128, num_classes)
)
```

This is a **valid, well-structured classifier head**.
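
A quick shape and size check of this head on a dummy feature map, as a sketch. It assumes `num_features = 768` and `num_classes = 4`, matching the ConvNeXtV2-Tiny output and the 4-class examples used here:

```python
import torch
import torch.nn as nn

num_features, num_classes = 768, 4   # assumed values, see lead-in

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # [B, 768, 7, 7] -> [B, 768, 1, 1]
    nn.Flatten(1),                    # -> [B, 768]
    nn.Dropout(0.3),
    nn.Linear(num_features, 128),
    nn.ReLU(),
    nn.Dropout(0.25),
    nn.Linear(128, num_classes),
)

head.eval()                           # disable dropout for a deterministic check
with torch.no_grad():
    logits = head(torch.randn(2, num_features, 7, 7))  # dummy feature map

n_params = sum(p.numel() for p in head.parameters())
print(logits.shape, n_params)         # torch.Size([2, 4]) 98948
```

At 98,948 parameters the head is larger than a single 768 → 4 layer (3,076) but still smaller than a flattened 37,632 → 4 layer (150,532).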

---

# ✔️ Why this head is correct

### 1. **Correct handling of ConvNeXtV2 output**

ConvNeXtV2-Tiny outputs:

* features: `[B, 768, 7, 7]`

You apply:

* global pooling → `[B, 768, 1, 1]`
* flatten → `[B, 768]`

This matches the pooling step of the official ConvNeXt head, which then applies a LayerNorm before its linear classifier.

---

# ✔️ Why your head design is GOOD

### 1. **Small MLP avoids overfitting**

Your dataset is ~6k images, which is very small.
A simple 1–2 layer MLP is ideal.

### 2. **Dropout is useful**

Chest X-ray datasets often have:

* class imbalance
* noisy labels
* small dataset size

Dropout of 0.3 before the hidden layer and 0.25 before the classifier is a **moderate** amount of regularization that usually does not hurt accuracy.

### 3. **128 hidden units is a good bottleneck**

A larger head (512 or 1024) is more likely to overfit on a dataset this small.
A smaller head (<64) might underfit.

128 is a reasonable middle ground in practice.

### 4. **This head adds nonlinearity**

Which improves decision boundaries (important for subtle findings).

---

# ✔️ Recommended improvement (not required)

### Add LayerNorm before the MLP

ConvNeXt models benefit from normalization before an FC layer:

```python
nn.LayerNorm(768)
```

### Improved version:

```python
custom_head_v1_b = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(1),
    nn.LayerNorm(768),
    nn.Dropout(0.3),
    nn.Linear(768, 128),
    nn.ReLU(),
    nn.Dropout(0.25),
    nn.Linear(128, num_classes)
)
```

LayerNorm stabilizes fine-tuning and is used extensively in ConvNeXt blocks.
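
What `nn.LayerNorm(768)` does to the pooled vector can be seen on dummy data. In this sketch the random features are deliberately shifted and scaled (an arbitrary choice for illustration) to show that each sample is brought back to roughly zero mean and unit variance before the MLP:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pooled = torch.randn(2, 768) * 5.0 + 3.0   # dummy pooled features with an offset and large scale
norm = nn.LayerNorm(768)                   # default init: weight 1, bias 0

with torch.no_grad():
    out = norm(pooled)

print(out.mean(dim=1))                     # close to 0 for every sample
print(out.std(dim=1, unbiased=False))      # close to 1 for every sample
```

The first linear layer of the head therefore sees inputs on a consistent scale regardless of how the backbone's feature magnitudes drift during fine-tuning.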

---


