

# 1. Motivation

ConvNeXt (CVPR 2022, Facebook/Meta) aims to answer a question:

**If we modernize classical CNNs with all the training tricks used in Vision Transformers (ViT), could a pure CNN match or surpass ViT?**

ConvNeXt = **a ResNet (the paper starts from ResNet-50 and ResNet-200) rewritten with modern design choices inspired by ViT and Swin Transformer**, without using attention.

Key ideas:

* Replace the **3×3 convolutions** in the bottleneck blocks with **depthwise 7×7 convolutions**, a much larger kernel than the usual 3×3.
* Use **LayerNorm** instead of BatchNorm.
* Use **ConvNeXt blocks** whose pointwise layers structurally resemble the **MLP blocks in ViT**.
* Use **inverted bottlenecks** with a 4× expansion (the same role as the MLP ratio in ViT).

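ConvNeXt is therefore a **pure CNN designed like a Vision Transformer**; the paper reports that it competes favorably with Swin Transformers in accuracy and scalability while keeping the simplicity and efficiency of a standard ConvNet.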

---

# 2. Overall Architecture

ConvNeXt uses a **4-stage hierarchy** similar to ResNet / Swin Transformer:

| Stage   | Resolution      | Channels |
| ------- | --------------- | -------- |
| Stage 1 | 224×224 → 56×56 | C        |
| Stage 2 | 56×56 → 28×28   | 2C       |
| Stage 3 | 28×28 → 14×14   | 4C       |
| Stage 4 | 14×14 → 7×7     | 8C       |

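The initial 4× reduction (224×224 → 56×56) is done by the stem, a 4×4 stride-4 convolution; each later stage starts with a 2× downsampling layer.

ConvNeXt-T (Tiny) uses C = 96; ConvNeXt-B (Base) uses C = 128.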
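
As a quick sanity check of the table, the sketch below computes the resolution and channel width of each stage from the input size and the base width C. The helper name `stage_shapes` is hypothetical, and it assumes that the first 4× reduction happens before Stage 1 and that every later stage halves the resolution and doubles the channels, as the table states.

```python
def stage_shapes(input_size: int = 224, base_channels: int = 96):
    """Return (stage, resolution, channels) for the four stages."""
    shapes = []
    resolution = input_size // 4      # 4x reduction before Stage 1
    channels = base_channels
    for stage in range(1, 5):
        shapes.append((stage, resolution, channels))
        resolution //= 2              # 2x reduction before the next stage
        channels *= 2                 # channel width doubles per stage
    return shapes


for stage, res, ch in stage_shapes(224, 96):
    print(f"Stage {stage}: {res}x{res}, {ch} channels")
```

With C = 96 this gives 56×56×96, 28×28×192, 14×14×384 and 7×7×768, matching the table.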

---

# 3. ConvNeXt Block

The ConvNeXt block has **four main parts**:

1. **Depthwise convolution**
   Kernel size **7×7** (very large), per-channel:

   $$
   y_{c} = x_{c} * k^{(c)}
   $$

2. **LayerNorm**
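   Applied over the channel dimension at each spatial position ($\mu$ and $\sigma$ are the mean and standard deviation across channels; the learnable scale and shift are omitted):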

   $$
   \hat{x} = \frac{x - \mu}{\sigma}
   $$

3. **Pointwise MLP (two linear layers using 1×1 convs)**
   Expansion ratio = **4×**:

   $$
   u = W_1 \hat{x}
   $$
   $$
   z = \text{GELU}(u)
   $$
   $$
   y = W_2 z
   $$

4. **Residual connection**:

   $$
   \text{Block}(x) = x + y
   $$

The pointwise part has **the same structure** as a ViT MLP block, and the block as a whole resembles a Transformer block in which the attention is replaced by a **depthwise convolution** for spatial mixing.

---

# 4. Comparison with a ResNet Bottleneck

ResNet bottleneck block:

* 1×1 → 3×3 → 1×1
* BatchNorm after every conv
* ReLU after every conv
* Regular bottleneck: the 1×1 convs reduce the channels 4× for the 3×3 conv, then restore them (wide → narrow → wide)

ConvNeXt block:

* No BatchNorm → a single LayerNorm per block
* **7×7 depthwise conv**
* **Inverted bottleneck** (narrow → wide → narrow), exactly like a ViT MLP block
* A single activation per block (GELU between the two pointwise layers) instead of a ReLU after every conv

This makes ConvNeXt much closer to ViT.
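
To make the "wide → narrow → wide" versus "narrow → wide → narrow" contrast concrete, here is a minimal sketch that counts convolution weights in both block types. The function names are hypothetical, biases and normalization parameters are ignored, and the widths (256 for the ResNet block, 96 for the ConvNeXt block) are illustrative choices.

```python
def resnet_bottleneck_weights(width: int, reduction: int = 4) -> int:
    """Weights of a 1x1 -> 3x3 -> 1x1 bottleneck (biases and norms ignored)."""
    mid = width // reduction                 # narrow middle
    return width * mid + 9 * mid * mid + mid * width


def convnext_block_weights(dim: int, expansion: int = 4, kernel: int = 7) -> int:
    """Weights of a ConvNeXt block (biases, norm and LayerScale ignored)."""
    hidden = dim * expansion                 # wide middle
    depthwise = kernel * kernel * dim        # one k x k filter per channel
    return depthwise + dim * hidden + hidden * dim


resnet = resnet_bottleneck_weights(256)
convnext = convnext_block_weights(96)
depthwise_share = 49 * 96 / convnext
print(resnet, convnext, f"{depthwise_share:.1%}")
```

The count shows that the 7×7 depthwise conv accounts for only a few percent of the block's weights; almost all of them sit in the two pointwise layers.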

---

# 5. Stage Downsampling Design

Between stages, ConvNeXt uses a **simple downsampling layer**:

* LayerNorm
* 2×2 stride-2 convolution

This mirrors the patch merging layers of Swin Transformer, much as the stem (a 4×4 stride-4 convolution) mirrors ViT's patch embedding.
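
A minimal PyTorch sketch of such a downsampling layer follows. The class name `Downsample` and the attribute names are hypothetical; the sketch assumes NCHW inputs and uses `nn.LayerNorm` on a channels-last view so that the normalization runs over the channel dimension.

```python
import torch
import torch.nn as nn


class Downsample(nn.Module):
    """LayerNorm over channels, then a 2x2 stride-2 conv (illustrative helper)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim, eps=1e-6)
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.LayerNorm normalizes the last dim, so move channels there and back.
        x = x.permute(0, 2, 3, 1)   # (B, H, W, C)
        x = self.norm(x)
        x = x.permute(0, 3, 1, 2)   # (B, C, H, W)
        return self.reduce(x)       # (B, out_dim, H/2, W/2)


x = torch.randn(2, 96, 56, 56)
y = Downsample(96, 192)(x)
print(tuple(y.shape))
```

Because the kernel size equals the stride, each output position sees a non-overlapping 2×2 patch of the input, which is what makes the layer comparable to patch merging.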

---

# 6. Full Block Diagram (Text)

```
Input
 │
 ▼
Depthwise Conv (7×7)
 │
 ▼
LayerNorm
 │
 ▼
Pointwise 1×1 Conv (4× expansion)
 │
 ▼
GELU
 │
 ▼
Pointwise 1×1 Conv (projection)
 │
 ▼
Add residual
 │
Output
```

---

# 7. Mathematical Form of a Single Block

Let the input be a tensor
$$ x \in \mathbb{R}^{H \times W \times C} $$

Step 1: Depthwise convolution
$$
u_{h,w,c} = (x_{:, :, c} * k_c)(h,w)
$$

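Step 2: LayerNorm (mean $\mu_{h,w}$ and standard deviation $\sigma_{h,w}$ taken over the $C$ channels at each position)
$$
\hat{u}_{h,w,c} = \frac{u_{h,w,c} - \mu_{h,w}}{\sigma_{h,w}}
$$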

Step 3: Expansion
$$
v_{h,w,c'} = \sum_{c} W^{(1)}_{c,c'}\,\hat{u}_{h,w,c}
$$
with expansion ratio 4, so $c'$ ranges over $4C$ hidden channels.

Step 4: GELU
$$
g_{h,w,c'} = \text{GELU}(v_{h,w,c'})
$$

Step 5: Projection
$$
p_{h,w,c} = \sum_{c'} W^{(2)}_{c',c}\, g_{h,w,c'}
$$

Step 6: Residual
$$
y = x + p
$$
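
Steps 2 to 5 act on the channel vector of each spatial position independently; only the depthwise convolution in Step 1 mixes positions. The pure-Python sketch below runs Steps 2 to 6 for a single position. It is an illustration under stated assumptions, not the reference implementation: LayerNorm statistics are taken over the channels, the learnable scale, shift and biases are omitted, and the weights are arbitrary toy values.

```python
import math


def layer_norm(u, eps=1e-6):
    """Step 2: normalize one position's channel vector (no learnable affine)."""
    mean = sum(u) / len(u)
    var = sum((t - mean) ** 2 for t in u) / len(u)
    return [(t - mean) / math.sqrt(var + eps) for t in u]


def gelu(t):
    """Step 4: exact GELU, t * Phi(t), via the error function."""
    return 0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0)))


def linear(vec, weight):
    """Steps 3 and 5: out[j] = sum_c weight[c][j] * vec[c]."""
    return [sum(row[j] * v for row, v in zip(weight, vec))
            for j in range(len(weight[0]))]


def pointwise_steps(u, w1, w2):
    u_hat = layer_norm(u)            # Step 2
    v = linear(u_hat, w1)            # Step 3: C -> 4C
    g = [gelu(t) for t in v]         # Step 4
    return linear(g, w2)             # Step 5: 4C -> C


C = 2
u = [1.0, 3.0]                       # depthwise-conv output at one (h, w)
x = [0.5, -0.5]                      # block input at the same (h, w)
w1 = [[0.1 * (c + j) for j in range(4 * C)] for c in range(C)]    # C x 4C
w2 = [[0.05 * (j - c) for c in range(C)] for j in range(4 * C)]   # 4C x C
p = pointwise_steps(u, w1, w2)
y = [xi + pi for xi, pi in zip(x, p)]   # Step 6: residual add
print(len(p), y)
```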

---

# 8. Why ConvNeXt Works So Well

ConvNeXt integrates the best ideas from both CNNs and Transformers:

### From Transformers:

* LayerNorm instead of BatchNorm
* GELU activation
* Large MLP expansion
* Fewer nonlinearities
* Simple stage transitions

### From CNNs:

* Efficient convolutions
* Local receptive fields
* No attention (faster, cheaper)

The result is **ViT-level accuracy with CNN-level speed**.

---

# 9. Model Variants

ConvNeXt models follow Swin-style naming:

* **ConvNeXt-T** (Tiny)
* **ConvNeXt-S** (Small)
* **ConvNeXt-B** (Base)
* **ConvNeXt-L** (Large)
* **ConvNeXt-XL** (Extra Large)
* **ConvNeXt V2** (2023 follow-up: masked-autoencoder pre-training (FCMAE) plus Global Response Normalization, which replaces LayerScale)
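
For reference, the original variants differ only in stage depths and base width. The sketch below writes them as a small config table; the dictionary and helper names are hypothetical, and the values are the ones given in the ConvNeXt paper for the T/S/B/L/XL models.

```python
# Stage depths and base width C per variant (values from the ConvNeXt paper).
CONVNEXT_VARIANTS = {
    "T":  {"depths": (3, 3, 9, 3),  "base_channels": 96},
    "S":  {"depths": (3, 3, 27, 3), "base_channels": 96},
    "B":  {"depths": (3, 3, 27, 3), "base_channels": 128},
    "L":  {"depths": (3, 3, 27, 3), "base_channels": 192},
    "XL": {"depths": (3, 3, 27, 3), "base_channels": 256},
}


def stage_widths(base_channels: int):
    """Channel width of the four stages: C, 2C, 4C, 8C."""
    return tuple(base_channels * 2 ** i for i in range(4))


for name, cfg in CONVNEXT_VARIANTS.items():
    print(name, cfg["depths"], stage_widths(cfg["base_channels"]))
```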

---

# 10. How it Performs

* ConvNeXt-Base (pure CNN) ≈ ViT-Base
* ConvNeXt-Large ≈ ViT-Large
* ConvNeXt-XL beats Swin-B and ViT-L in some tasks

Used widely for:

* Classification
* Detection (with FPN/Mask-RCNN)
* Segmentation (with UperNet / FPN)

---




## 1. PyTorch implementation of a ConvNeXt block

This is a **standalone, minimal but faithful** ConvNeXt block:

* Depthwise 7×7 convolution
* LayerNorm (channels-last)
* 1×1 conv MLP with 4× expansion
* GELU
* Optional LayerScale
* Optional stochastic depth (DropPath)

```python
import torch
import torch.nn as nn
from torch import Tensor
```

### 1.1 Utility: DropPath (stochastic depth)

```python
class DropPath(nn.Module):
    """
    Per-sample stochastic depth.
    """

    def __init__(self, drop_prob: float = 0.):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: Tensor) -> Tensor:
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        # shape: (batch, 1, 1, 1) so the per-sample mask broadcasts over all other dims
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
        random_tensor = torch.floor(random_tensor)  # 0 or 1
        return x / keep_prob * random_tensor
```
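
The division by `keep_prob` keeps the expected value of the residual branch unchanged during training. A small simulation makes this visible; it is a sketch with a hypothetical helper name, and it draws the keep decision as `u < keep_prob`, which has the same distribution as the `floor(keep_prob + u)` trick used above.

```python
import random


def drop_path_scale(keep_prob: float, rng: random.Random) -> float:
    """Per-sample multiplier of stochastic depth: 0 or 1 / keep_prob."""
    keep = rng.random() < keep_prob
    return (1.0 / keep_prob) if keep else 0.0


rng = random.Random(0)
keep_prob = 0.8
scales = [drop_path_scale(keep_prob, rng) for _ in range(100_000)]
mean_scale = sum(scales) / len(scales)
print(f"mean multiplier: {mean_scale:.3f}")  # close to 1.0
```

Each sample's branch is either dropped entirely or scaled up, so on average the branch contributes the same magnitude as at inference time, when `DropPath` is the identity.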

### 1.2 Utility: LayerNorm for channels-last

The block permutes to channels-last (`N, H, W, C`) internally so that `nn.LayerNorm` and `nn.Linear`, which act on the last dimension, can be applied directly.

```python
class LayerNormChannelsLast(nn.LayerNorm):
    """
    LayerNorm expecting input in (B, H, W, C) format.
    Inherits nn.LayerNorm but just documents expected layout.
    """
    def __init__(self, normalized_shape, eps=1e-6):
        super().__init__(normalized_shape, eps=eps)
```

### 1.3 ConvNeXt block (single stage block)

```python
class ConvNeXtBlock(nn.Module):
    def __init__(
        self,
        dim: int,
        mlp_ratio: float = 4.0,
        drop_path: float = 0.0,
        layer_scale_init_value: float = 1e-6,
    ):
        """
        dim: number of channels (C)
        mlp_ratio: expansion ratio in the 1x1 conv MLP
        drop_path: stochastic depth rate
        layer_scale_init_value: if > 0, uses a learnable gamma vector
        """
        super().__init__()

        self.dwconv = nn.Conv2d(
            dim, dim,
            kernel_size=7,
            padding=3,
            groups=dim  # depthwise
        )

        self.norm = LayerNormChannelsLast(dim, eps=1e-6)

        hidden_dim = int(dim * mlp_ratio)
        self.pwconv1 = nn.Linear(dim, hidden_dim)  # channels-last, so use Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(hidden_dim, dim)

        if layer_scale_init_value > 0:
            self.gamma = nn.Parameter(
                layer_scale_init_value * torch.ones(dim),
                requires_grad=True
            )
        else:
            self.gamma = None

        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()

    def forward(self, x: Tensor) -> Tensor:
        """
        x: (B, C, H, W)
        """
        shortcut = x

        # 1) depthwise conv in NCHW
        x = self.dwconv(x)  # (B, C, H, W)

        # 2) convert to NHWC for LayerNorm + MLP
        x = x.permute(0, 2, 3, 1)  # (B, H, W, C)

        # 3) LayerNorm
        x = self.norm(x)

        # 4) MLP: Linear -> GELU -> Linear
        x = self.pwconv1(x)       # (B, H, W, hidden_dim)
        x = self.act(x)
        x = self.pwconv2(x)       # (B, H, W, C)

        # 5) LayerScale (optional)
        if self.gamma is not None:
            x = self.gamma * x

        # 6) back to NCHW
        x = x.permute(0, 3, 1, 2)  # (B, C, H, W)

        # 7) residual + drop_path
        x = shortcut + self.drop_path(x)
        return x
```

You can drop this into a stage like:

```python
class ConvNeXtStage(nn.Module):
    def __init__(self, dim, depth, mlp_ratio=4.0, drop_path_rate=0.0):
        super().__init__()
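        # Linearly increasing drop rate per block. Here the ramp is per stage;
        # the official code spaces the rates over all blocks of the network.
        dpr = torch.linspace(0, drop_path_rate, depth).tolist()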
        self.blocks = nn.ModuleList(
            [
                ConvNeXtBlock(
                    dim=dim,
                    mlp_ratio=mlp_ratio,
                    drop_path=dpr[i],
                )
                for i in range(depth)
            ]
        )

    def forward(self, x: Tensor) -> Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x
```
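
The block above uses `nn.Linear` on channels-last tensors where the architecture description speaks of 1×1 convolutions. The two are the same operation, which the following sketch checks numerically by copying the weights of a `Linear` layer into a 1×1 `Conv2d`. The sizes and variable names are arbitrary illustration choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, hidden = 8, 32
linear = nn.Linear(C, hidden)
conv = nn.Conv2d(C, hidden, kernel_size=1)

# Copy the Linear parameters into the conv: (hidden, C) -> (hidden, C, 1, 1).
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(hidden, C, 1, 1))
    conv.bias.copy_(linear.bias)

x = torch.randn(2, C, 5, 5)                                      # NCHW input
out_conv = conv(x)                                               # (2, hidden, 5, 5)
out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # via NHWC
same = torch.allclose(out_conv, out_linear, atol=1e-5)
print(same)
```

Which form is faster depends on hardware and memory layout, so treat the choice of `nn.Linear` here as an implementation detail rather than a change to the architecture.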

---

## 2. Step-by-step flow diagram of the ConvNeXt block

Assume input tensor
$$
x \in \mathbb{R}^{B \times C \times H \times W}.
$$

High-level diagram:

```text
        ┌─────────────────────────────┐
        │         Input x            │
        │     (B, C, H, W)           │
        └────────────┬───────────────┘
                     │
                     ▼
          Depthwise Conv 7×7 (groups=C)
          (B, C, H, W)
                     │
                     ▼
          Permute → (B, H, W, C)
                     │
                     ▼
                LayerNorm
              (over channels)
                     │
                     ▼
              Linear (C → 4C)
                     │
                     ▼
                    GELU
                     │
                     ▼
              Linear (4C → C)
                     │
                     ▼
              (optional) γ ⊙ x
                LayerScale
                     │
                     ▼
          Permute → (B, C, H, W)
                     │
                     ▼
            DropPath (stochastic)
                     │
                     ▼
          Residual add with input
          y = x_input + Δx
                     │
                     ▼
                Output y
            (B, C, H, W)
```

Step-by-step narrative:

1. **Input**
   Take input feature map
   $$x \in \mathbb{R}^{B \times C \times H \times W}.$$

2. **Depthwise 7×7 convolution**
   Apply depthwise conv, one kernel per channel:
   $$u = \text{DWConv}_{7\times 7}(x) \in \mathbb{R}^{B \times C \times H \times W}.$$

3. **Change layout**
   Permute to channels-last:
   $$u' = \text{permute}(u) \in \mathbb{R}^{B \times H \times W \times C}.$$

4. **LayerNorm**
   Normalize over the channel dimension at each spatial position:
   $$\hat{u} = \text{LayerNorm}(u').$$

5. **First linear (expansion)**
   $$v = \hat{u} W_1 + b_1,$$
   where
   $$W_1 \in \mathbb{R}^{C \times 4C}, \quad v \in \mathbb{R}^{B \times H \times W \times 4C}.$$

6. **Nonlinearity**
   $$g = \text{GELU}(v).$$

7. **Second linear (projection)**
   $$p = g W_2 + b_2,$$
   where
   $$W_2 \in \mathbb{R}^{4C \times C}, \quad p \in \mathbb{R}^{B \times H \times W \times C}.$$

8. **LayerScale (optional)**
   If using gamma:
   $$p' = \gamma \odot p,$$
   where
   $$\gamma \in \mathbb{R}^{C}.$$

9. **Back to NCHW**
   $$p'' = \text{permute}(p') \in \mathbb{R}^{B \times C \times H \times W}.$$

10. **DropPath**
    $$\Delta x = \text{DropPath}(p'').$$

11. **Residual add**
    $$y = x + \Delta x.$$

---

## 3. ConvNeXt training recipe (classification)

This is a standard, ImageNet-style training recipe adapted from common ConvNeXt usage.

### 3.1 Data preprocessing

**Input size**: 224×224 (for standard models)

**Training transforms**:

* RandomResizedCrop(224, interpolation=bicubic, the default in the official code)
* RandomHorizontalFlip(0.5)
* Color jitter (optional, light)
* RandAugment (used in the paper) or AutoAugment
* Mixup + CutMix
* Random Erasing
* Normalize with ImageNet mean/std:

  * mean = [0.485, 0.456, 0.406]
  * std = [0.229, 0.224, 0.225]

**Validation transforms**:

* Resize shorter side to 256
* CenterCrop(224×224)
* Normalize with same mean/std

### 3.2 Optimizer and schedule

* Optimizer: **AdamW**
* Base learning rate for large batch (e.g. 4096 global):
  $$\text{lr}_{\text{base}} = 4 \times 10^{-3}.$$
* If your global batch size is smaller, scale linearly:
  $$\text{lr} = \text{lr}_{\text{base}} \times \frac{\text{batch\_size}}{4096}.$$

* Weight decay:
  $$\text{wd} = 0.05.$$
* Betas: (0.9, 0.999)

**Scheduler**:

* Warmup: 20 epochs of linear warmup to peak lr
* Then cosine decay down to a small value
  $$\text{lr}_{\text{final}} \approx \text{lr}_{\text{max}} \times 10^{-2}.$$

**Epochs**:

* 300 epochs for ImageNet from scratch (common setting)
* You can do 100–150 epochs for smaller experiments.
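
The learning-rate rules above (linear batch-size scaling, 20 warmup epochs, cosine decay to about 1% of the peak over 300 epochs) can be sketched as two small functions. The function names and the per-epoch granularity are assumptions made for illustration; real training scripts usually update the rate every step.

```python
import math


def scaled_lr(batch_size: int, base_lr: float = 4e-3, base_batch: int = 4096) -> float:
    """Linear scaling rule: lr = base_lr * batch_size / base_batch."""
    return base_lr * batch_size / base_batch


def lr_at_epoch(epoch: float, peak_lr: float, total_epochs: int = 300,
                warmup_epochs: int = 20, final_ratio: float = 1e-2) -> float:
    """Linear warmup to peak_lr, then cosine decay to peak_lr * final_ratio."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    min_lr = peak_lr * final_ratio
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


peak = scaled_lr(1024)  # global batch of 1024 instead of 4096
for epoch in (0, 10, 20, 160, 300):
    print(epoch, f"{lr_at_epoch(epoch, peak):.6f}")
```

For a global batch of 1024 the peak rate is 1e-3; the schedule starts at 0, reaches the peak at epoch 20, and ends at 1e-5.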

**Regularization details**:

* Label smoothing:
  $$\epsilon = 0.1.$$
* Mixup: alpha = 0.8
* CutMix: alpha = 1.0
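* DropPath (stochastic depth): rate increased linearly with depth; the maximum rate grows with model size, e.g. 0.1 for Tiny and up to 0.4–0.5 for the larger variants in the paper's ImageNet-1K setup.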
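
Label smoothing and Mixup both act on the training targets, which is easiest to see on plain lists. The sketch below is a toy illustration with hypothetical helper names: it builds a smoothed target with $\epsilon = 0.1$ and blends two samples with a weight drawn from Beta(0.8, 0.8). The "images" are short lists of numbers rather than real tensors.

```python
import random


def smooth_one_hot(label: int, num_classes: int, eps: float = 0.1):
    """Label smoothing: (1 - eps) on the true class plus eps spread uniformly."""
    uniform = eps / num_classes
    return [uniform + (1.0 - eps if k == label else 0.0) for k in range(num_classes)]


def mixup(x_a, x_b, t_a, t_b, lam: float):
    """Blend two inputs and their target distributions with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    t = [lam * a + (1.0 - lam) * b for a, b in zip(t_a, t_b)]
    return x, t


rng = random.Random(0)
lam = rng.betavariate(0.8, 0.8)                 # Mixup: lambda ~ Beta(alpha, alpha)
x_a, x_b = [0.0, 1.0, 2.0], [4.0, 3.0, 2.0]     # toy "images" as flat pixel lists
t_a = smooth_one_hot(0, num_classes=4)
t_b = smooth_one_hot(2, num_classes=4)
x_mix, t_mix = mixup(x_a, x_b, t_a, t_b, lam)
print(f"lambda={lam:.3f}", [round(t, 3) for t in t_mix])
```

The mixed target is still a probability distribution, so it has to be used with a loss that accepts class probabilities; `nn.CrossEntropyLoss` does so in recent PyTorch versions.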

### 3.3 Simple PyTorch training skeleton

Below is a **minimal** training loop skeleton for classification using a ConvNeXt backbone (you can plug in your own model):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_one_epoch(model, loader, optimizer, epoch, device, scaler=None, criterion=None):
    model.train()
    if criterion is None:
        criterion = nn.CrossEntropyLoss(label_smoothing=0.1).to(device)

    for step, (images, targets) in enumerate(loader):
        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)

        # optional mixed precision
        if scaler is not None:
            with torch.cuda.amp.autocast():
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

        if step % 100 == 0:
            print(f"Epoch {epoch} | Step {step}/{len(loader)} | Loss: {loss.item():.4f}")

def validate(model, loader, device):
    model.eval()
    correct1, total = 0, 0
    loss_sum = 0.0
    criterion = nn.CrossEntropyLoss().to(device)

    with torch.no_grad():
        for images, targets in loader:
            images = images.to(device, non_blocking=True)
            targets = targets.to(device, non_blocking=True)

            outputs = model(images)
            loss = criterion(outputs, targets)
            loss_sum += loss.item() * images.size(0)

            _, pred = outputs.topk(1, dim=1)
            correct1 += (pred.squeeze(1) == targets).sum().item()
            total += targets.size(0)

    top1 = 100.0 * correct1 / total
    avg_loss = loss_sum / total
    print(f"Validation: Loss={avg_loss:.4f}, Top-1 Acc={top1:.2f}%")
    return avg_loss, top1
```

And a high-level setup:

```python
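import torch
from torch.optim import AdamW
from torch.cuda.amp import GradScaler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 300  # common ImageNet setting; lower it for smaller experiments

# YourConvNeXtModel, train_loader and val_loader are placeholders you must provide.
model = YourConvNeXtModel(num_classes=1000).to(device)

optimizer = AdamW(
    model.parameters(),
    lr=4e-3,      # peak lr for a global batch of 4096; scale linearly for yours
    weight_decay=0.05,
    betas=(0.9, 0.999),
)

# Mixed precision is only used on GPU; with scaler=None training runs in fp32.
scaler = GradScaler() if device.type == "cuda" else None

for epoch in range(num_epochs):
    # update lr via scheduler here if you use cosine
    train_one_epoch(model, train_loader, optimizer, epoch, device, scaler=scaler)
    validate(model, val_loader, device)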
```

---

