# **DeiT (Data-Efficient Image Transformer)**

**Paper:** *Training data-efficient image transformers & distillation through attention* (Touvron et al., Facebook AI Research; arXiv 2020, ICML 2021)

---

## Motivation

**Vision Transformer (ViT)** achieved strong results but required **huge datasets** (ImageNet-21k, JFT-300M) and **massive compute**.
**DeiT** made ViT trainable **on ImageNet-1k alone (about 1.3 M images, no extra data)** through **strong data augmentation and regularization** plus **distillation through attention**.

---

## ViT Recap

1. **Patch embedding**

   Split an image of size $H\times W\times C$ into patches $P\times P$; number of patches:
   $$
   N=\frac{H\times W}{P^2}
   $$
   Each patch is flattened and linearly projected to a $D$-dimensional vector, giving $\mathbf X_p\in\mathbb R^{N\times D}$.

2. **Add tokens and positional embeddings**
   $$
   Z_0=[x_{\text{cls}},x_p^1,\dots,x_p^N]+E_{\text{pos}}
   $$

3. **Transformer encoder**

   $L$ blocks of MSA + FFN + residuals + layer norm.

4. **Classification head** uses final `[CLS]`.
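
The patch-count formula in step 1 can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `num_patches` and `sequence_length` are illustrative and not taken from any library.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tiles: N = (H * W) / P^2."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def sequence_length(height: int, width: int, patch: int, extra_tokens: int = 1) -> int:
    """Patch tokens plus learnable extra tokens (1 for [CLS])."""
    return num_patches(height, width, patch) + extra_tokens


print(num_patches(224, 224, 16))      # 196
print(sequence_length(224, 224, 16))  # 197
print(num_patches(384, 384, 16))      # 576
```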

---

## DeiT Architecture

Adds one more token — **[DIST]**:

$$
Z_0=[x_{\text{cls}},x_{\text{dist}},x_p^1,\dots,x_p^N]+E_{\text{pos}}
$$

Both `[CLS]` and `[DIST]` participate equally in attention through all layers.

### Output heads

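After the final block:
$$
Z_L=[z_{\text{cls}},z_{\text{dist}},z_p^1,\dots,z_p^N]
$$

Two separate linear heads then map $z_{\text{cls}}$ and $z_{\text{dist}}$ to the class logits $Z_{\text{cls}}$ and $Z_{\text{dist}}$ that appear in the training loss.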
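
At test time the DeiT paper describes a late fusion of the two heads: the softmax outputs of the `[CLS]` and `[DIST]` classifiers are added and the highest-scoring class is predicted. The sketch below illustrates that idea in plain Python; the function names and the toy logits are made up for illustration, and library implementations may fuse the heads slightly differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_heads(z_cls, z_dist):
    """Late fusion: add the two heads' softmax outputs, then take the arg-max."""
    scores = [a + b for a, b in zip(softmax(z_cls), softmax(z_dist))]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return pred, scores


# Toy logits for a 3-class problem (hypothetical numbers).
z_cls = [2.0, 1.0, 0.1]
z_dist = [1.2, 1.5, 0.3]
pred, scores = fuse_heads(z_cls, z_dist)
print(pred, [round(s, 3) for s in scores])
```

Here the `[DIST]` head alone would pick class 1, but the fused scores favor class 0 because the `[CLS]` head is more confident.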



<img src="images/deit.png" height="30%" width="30%" />


## Training loss

$$
\mathcal{L}_{\text{DeiT}} = (1-\lambda)\mathcal{L}_{\text{CE}}\big(\sigma(Z_{\text{cls}}), y\big) + \lambda\tau^2\mathrm{KL}\left( \sigma\left(\frac{Z_{\text{dist}}}{\tau}\right), \sigma\left(\frac{Z_t}{\tau}\right) \right)
$$

---


| Symbol                        | Meaning                                                         |
| :---------------------------- | :-------------------------------------------------------------- |
| $Z_{\text{cls}}$              | Logits from the **student’s `[CLS]` head**                      |
| $Z_{\text{dist}}$             | Logits from the **student’s `[DIST]` head**                     |
| $Z_t$                         | Teacher logits (from a pretrained CNN such as RegNetY)          |
| $y$                           | Ground-truth class label                                        |
| $\sigma(\cdot)$               | Softmax function                                                |
| $\mathcal{L}_{\text{CE}}$     | Cross-entropy loss with the ground truth                        |
| $\mathrm{KL}(\cdot,\cdot)$    | Kullback–Leibler divergence (between probability distributions) |
| $\tau$                        | **Temperature** used to soften the logits                       |
| $\lambda$                     | Balancing factor between supervised and distillation losses     |
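
To make the symbols concrete, here is a minimal, framework-free sketch of the soft-distillation loss. It assumes the KL term is computed as $\mathrm{KL}(\text{teacher}\,\|\,\text{student})$ on temperature-softened distributions, which is the usual knowledge-distillation convention; the function names, the default `tau` and `lam` values, and the toy logits are illustrative assumptions, not the paper's code.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax of logits / tau, computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a hard label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def deit_soft_loss(z_cls, z_dist, z_teacher, label, tau=3.0, lam=0.5):
    """(1 - lam) * CE([CLS] head, y) + lam * tau^2 * KL(teacher || [DIST] head)."""
    ce = cross_entropy(z_cls, label)
    kd = kl_divergence(softmax(z_teacher, tau), softmax(z_dist, tau))
    return (1.0 - lam) * ce + lam * tau ** 2 * kd


# Toy 3-class example (hypothetical logits).
z_cls = [2.0, 0.5, -1.0]
z_dist = [1.5, 1.0, -0.5]
z_teacher = [1.0, 1.5, -0.5]
print(round(deit_soft_loss(z_cls, z_dist, z_teacher, label=0), 4))
```

When the `[DIST]` logits equal the teacher logits the KL term vanishes and only the weighted cross-entropy remains.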


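Writing one term per token, the overall DeiT loss is:

$$
\mathcal{L}_{\text{DeiT}} =
(1-\lambda)\,\mathcal{L}_{\text{CE}}^{[\text{CLS}]} +
\lambda\tau^2\,\mathcal{L}_{\text{KL}}^{[\text{DIST}]}
$$

where $\mathcal{L}_{\text{CE}}^{[\text{CLS}]}$ is the cross-entropy of the `[CLS]` head against $y$ and $\mathcal{L}_{\text{KL}}^{[\text{DIST}]}$ is the KL divergence between the `[DIST]` head and the teacher at temperature $\tau$.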

---
#### Intuitive Meaning

The student (DeiT) is trained to satisfy **two goals simultaneously**:

1. **Match the true labels**
   → via the standard cross-entropy term

   $$
   (1-\lambda)\mathcal{L}_{\text{CE}}\big(\sigma(Z_{\text{cls}}), y\big)
   $$

2. **Mimic the teacher’s “dark knowledge”** (soft class probabilities)
   → via the KL divergence term
   $$
   \lambda\tau^2\mathrm{KL}\left( \sigma\left(\frac{Z_{\text{dist}}}{\tau}\right), \sigma\left(\frac{Z_t}{\tau}\right) \right)
   $$

The **teacher’s predictions** (even for incorrect classes) contain valuable information about class similarity — e.g., a cat image might get 0.7 cat, 0.2 dog, 0.1 fox.
These *soft targets* help the student generalize better than one-hot labels.

---

#### The Role of **Temperature $ \tau $**

* When $ \tau > 1 $, the softmax becomes **softer** — it spreads probability mass across classes.
* This reveals **relative similarities** between classes.

$$
\sigma_i(Z / \tau) = \frac{e^{Z_i / \tau}}{\sum_j e^{Z_j / \tau}}
$$

Typical values: $\tau \in [2, 5]$ (the DeiT paper uses $\tau = 3$ for soft distillation).
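
The softening effect is easy to see numerically. The sketch below applies the temperature softmax above to a made-up logit vector and prints the distribution and its entropy for several values of $\tau$; the helper names and the logits are illustrative.

```python
import math


def softmax_t(logits, tau):
    """Temperature softmax: sigma_i(Z / tau)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [4.0, 2.0, 1.0]  # hypothetical teacher logits
results = {tau: softmax_t(logits, tau) for tau in (1.0, 2.0, 5.0)}
for tau, probs in results.items():
    print(tau, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

As $\tau$ grows, the top probability shrinks and the entropy rises, so the smaller logits contribute more to the distillation signal.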

During training:

* Compute both soft teacher and soft student distributions at temperature $ \tau $.
* Multiply the KL term by $ \tau^2 $ (to keep gradient magnitudes consistent).
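
The $\tau^2$ factor can be motivated with a short derivation, sketched here under the assumption that the KL term is $\mathrm{KL}(p\,\|\,q)$ with teacher $p=\sigma(Z_t/\tau)$ and student $q=\sigma(Z_{\text{dist}}/\tau)$ over $K$ classes. The gradient with respect to one student logit is

$$
\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial Z_{\text{dist},i}} = \frac{1}{\tau}\,(q_i - p_i)
$$

For large $\tau$ and zero-mean logits, $q_i \approx \frac{1}{K}\left(1 + \frac{Z_{\text{dist},i}}{\tau}\right)$ and likewise for $p_i$, so

$$
\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial Z_{\text{dist},i}} \approx \frac{1}{K\tau^2}\,\big(Z_{\text{dist},i} - Z_{t,i}\big)
$$

The gradient shrinks as $1/\tau^2$, and multiplying the KL term by $\tau^2$ restores a scale comparable to the cross-entropy term.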

---

#### Role of **Balancing factor $ \lambda $**

Controls the tradeoff between:

* Fitting to **ground truth** (hard labels)
* Mimicking the **teacher** (soft labels)

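A common generic choice is $\lambda = 0.5$. The DeiT paper reports $\lambda = 0.1$ for soft distillation, and its hard-label variant weights the two cross-entropy terms equally ($1/2$ each).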

---

```python
# fmt: off
# isort: skip_file
# DO NOT reorganize imports - the warnings filter must run before the heavy imports.
import os
import warnings

warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

import timm
import torch
import torch.nn.functional as F
# fmt: on

# List every DeiT / DeiT-III model name registered in timm.
all_deit = timm.list_models("*deit*")
for m in all_deit:
    print(m)
```

Output:

```text
deit3_base_patch16_224
deit3_base_patch16_384
deit3_huge_patch14_224
deit3_large_patch16_224
deit3_large_patch16_384
deit3_medium_patch16_224
deit3_small_patch16_224
deit3_small_patch16_384
deit_base_distilled_patch16_224
deit_base_distilled_patch16_384
deit_base_patch16_224
deit_base_patch16_384
deit_small_distilled_patch16_224
deit_small_patch16_224
deit_tiny_distilled_patch16_224
deit_tiny_patch16_224
```


## **Student/Teacher Architectures For Every DeiT and DeiT-III** 

---

# 1. Original DeiT (2020)

Paper: **Training data-efficient image transformers & distillation through attention**

DeiT introduced two things:

1. A **distillation token**
2. A **CNN teacher network** (RegNetY-16GF)

Only **distilled variants** have a teacher.
Non-distilled variants are trained normally.

---

# Complete mapping for original DeiT

| Model name                           | Student architecture           | Teacher used? | Teacher architecture |
| ------------------------------------ | ------------------------------ | ------------- | -------------------- |
| **deit_tiny_patch16_224**            | ViT-Tiny                       | No            | None                 |
| **deit_tiny_distilled_patch16_224**  | ViT-Tiny + distillation token  | Yes           | **RegNetY-16GF**     |
| **deit_small_patch16_224**           | ViT-Small                      | No            | None                 |
| **deit_small_distilled_patch16_224** | ViT-Small + distillation token | Yes           | **RegNetY-16GF**     |
| **deit_base_patch16_224**            | ViT-Base                       | No            | None                 |
| **deit_base_patch16_384**            | ViT-Base                       | No            | None                 |
| **deit_base_distilled_patch16_224**  | ViT-Base + distillation token  | Yes           | **RegNetY-16GF**     |
| **deit_base_distilled_patch16_384**  | ViT-Base + distillation token  | Yes           | **RegNetY-16GF**     |

**Important:** In DeiT, all *distilled* models use the **same teacher**:
$$
\text{Teacher} = \text{RegNetY-16GF}
$$

---

# 2. DeiT-III (2022)

Paper: **DeiT III: Revenge of the ViT**

DeiT-III completely removes:

* teacher network
* distillation token

It relies only on:

* stronger data augmentation
* regularization
* training for more epochs
* better optimization settings

Therefore:

### **All DeiT-III models have NO teacher.**

---

# Complete mapping for DeiT-III

| Model name                   | Student architecture         | Teacher? | Teacher architecture |
| ---------------------------- | ---------------------------- | -------- | -------------------- |
| **deit3_small_patch16_224**  | ViT-Small                    | No       | None                 |
| **deit3_small_patch16_384**  | ViT-Small                    | No       | None                 |
| **deit3_medium_patch16_224** | ViT-Medium                   | No       | None                 |
| **deit3_base_patch16_224**   | ViT-Base                     | No       | None                 |
| **deit3_base_patch16_384**   | ViT-Base                     | No       | None                 |
| **deit3_large_patch16_224**  | ViT-Large                    | No       | None                 |
| **deit3_large_patch16_384**  | ViT-Large                    | No       | None                 |
| **deit3_huge_patch14_224**   | ViT-Huge (patch 14)          | No       | None                 |

---

# Summary (short answer)

### Original DeiT (2020)

* Distilled variants use the teacher **RegNetY-16GF**.
* Non-distilled variants have **no teacher**.

### DeiT-III (2022)

* **No teacher at all.**
* Entire family is trained without distillation.

---

# Widely used DeiT models (ranked)

Community usage depends on the **task**, **dataset size**, and **training budget**, so no single ordering holds everywhere.

The list below is an **informal, approximate ranking** of DeiT and DeiT-III models, from **most used → least used**. It is not backed by measured download or citation counts.

---

# 1. Most commonly used (top tier)

### **1. deit_base_patch16_224**

This is by far the most widely used DeiT variant.

Reason:

* It is the **original DeiT** baseline used in papers and experiments.
* Works well on medium-sized datasets.
* Does not need a teacher (no distillation token).
* Good speed and accuracy.

Used in:

* Classification (ImageNet fine-tuning)
* Transfer learning on medical images, plant disease, remote sensing
* Research baselines for Vision Transformers

---

### **2. deit_base_distilled_patch16_224**

Also extremely popular.

Reason:

* This is the **teacher-distilled version** from the original DeiT paper.
* Best accuracy for the base model.
* Still efficient.

Used in:

* Classification where you want a stronger pretrained backbone
* As a backbone for segmentation/detection (sometimes)

---

### **3. deit_small_patch16_224**

Often chosen for datasets with fewer than about 50k images.

Reason:

* Lower compute
* Competitive with, and often better than, CNNs of similar size
* Ideal for small dataset fine-tuning (medical, industrial)

---

### **4. deit_tiny_patch16_224**

Used when:

* Dataset is very small
* Training resources are very limited
* Need a fast model on CPU or edge device

---

# 2. Moderately used (middle tier)

### **5. deit3_base_patch16_224**

DeiT-III is newer, stronger, and used increasingly often.

Reasons:

* No teacher required
* More stable training
* Better regularization and data augmentation strategy

But:

* Still less common than the original DeiT-base because many repos did not update their baselines.

---

### **6. deit3_small_patch16_224**

Common in smaller training setups, good for:

* Kaggle
* Prototyping
* Edge usage

---

# 3. Rarely used (low tier)

### **7. deit3_large_patch16_224**

### **8. deit3_large_patch16_384**

### **9. deit3_huge_patch14_224**

These are rarely used because:

* Very expensive to train
* Need heavy regularization
* Require advanced data pipelines
* Compete with better alternatives like ViT-Large, Swin-Large, or PVT-Large

Only seen in:

* Large research labs
* Benchmark papers
* Large-scale pretraining setups

---

# 4. Very rare or rarely needed

### **10. 384-resolution versions**

* deit_base_patch16_384
* deit_base_distilled_patch16_384
* deit3_base_patch16_384
* deit3_large_patch16_384
* etc.

These are used only for:

* High-accuracy ImageNet validation
* Vision benchmarks

But rarely in applied projects because:

* About 3× more tokens per image (576 vs. 196 patches), with self-attention cost growing quadratically in the token count
* Minor accuracy gain
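
The cost of moving from 224 to 384 resolution can be estimated from the token counts alone. This is a rough sketch: it assumes patch size 16 and one `[CLS]` token, and it treats the attention-score cost as proportional to the square of the sequence length while the linear layers scale linearly, so the true overall factor lies between the two ratios printed below.

```python
def tokens(resolution: int, patch: int = 16, extra_tokens: int = 1) -> int:
    """Sequence length for a square image: patch tokens plus extra tokens."""
    return (resolution // patch) ** 2 + extra_tokens


n224 = tokens(224)  # 197
n384 = tokens(384)  # 577

token_ratio = n384 / n224          # growth of the per-token (linear) cost
attention_ratio = token_ratio ** 2  # growth of the N x N attention-score cost

print(n224, n384)
print(round(token_ratio, 2), round(attention_ratio, 2))
```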

---

# Summary Table

| Model                               | Community usage   | Why                                    |
| ----------------------------------- | ----------------- | -------------------------------------- |
| **deit_base_patch16_224**           | Widely used (top) | Standard DeiT baseline, stable, strong |
| **deit_base_distilled_patch16_224** | Widely used (top) | Highest accuracy among base DeiT       |
| **deit_small_patch16_224**          | Widely used       | Good for small datasets                |
| **deit_tiny_patch16_224**           | Common            | Fast, efficient                        |
| **deit3_base_patch16_224**          | Moderately used   | Improved training, newer               |
| **deit3_small_patch16_224**         | Moderately used   | Efficient                              |
| deit3_large / deit3_huge            | Rare              | Very expensive                         |
| 384-resolution versions             | Rare              | High compute                           |

---

## Name decoding of `deit_base_patch16_224` and Parameters 

Roughly speaking, each part of `deit_base_patch16_224` is a compact code for the architecture and input resolution.

Let’s decode it piece by piece:

---

### `deit`

Short for **DeiT = Data-efficient image Transformers**
This is the family name of the model, from the paper *“Training data-efficient image transformers & distillation through attention”*.

It means:

* Vision Transformer-style backbone (ViT)
* Trained with the DeiT recipe (strong augmentation, regularization, possibly distillation for some variants)

---

### `base`

This is the **model size** (analogous to the depth variants of ResNet, such as ResNet-18/50/101):

* `tiny`  → ViT-Tiny (very small)
* `small` → ViT-Small
* `base`  → **ViT-Base**

For `base`, the typical ViT-Base config is:

* Number of transformer layers:
  $$ L = 12 $$
* Embedding dimension:
  $$ D = 768 $$
* Number of heads:
  $$ H = 12 $$
* MLP hidden dim:
  $$ 4D = 3072 $$

So `base` is the mid-sized ViT configuration (roughly 86M parameters).

---

### `patch16`

This says how the image is split into patches:

* Patch size =
  $$ 16 \times 16 \text{ pixels} $$

For a $224 \times 224$ image, that gives:

* Number of patches per side:
  $$ 224 / 16 = 14 $$
* Total number of patches (tokens before adding [CLS]):
  $$ 14 \times 14 = 196 $$

So the sequence length is 196 image tokens + 1 class token = 197 tokens (198 for the distilled variants, which add a `[DIST]` token).

---

### `224`

This is the **default input resolution**:

* Image size:
  $$ 224 \times 224 \text{ pixels} $$

So:

* Input: $224 \times 224$,
* Split into $16 \times 16$ patches → $14 \times 14 = 196$ tokens,
* Processed by a ViT-Base transformer, trained with the DeiT recipe.

---

### Put together

`deit_base_patch16_224` means:

> “A DeiT model with a ViT-Base backbone, using 16×16 image patches, trained on 224×224 input images.”
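
The decoding above can be expressed as a small parser. This is an illustrative sketch only: the regular expression and the `decode` helper are hypothetical, written to match the model names printed earlier, and are not part of timm.

```python
import re

# family _ size [_distilled] _patch<P> _ <resolution>
_NAME = re.compile(
    r"^(?P<family>deit3?)_(?P<size>[a-z]+)(?P<distilled>_distilled)?"
    r"_patch(?P<patch>\d+)_(?P<res>\d+)$"
)


def decode(name: str) -> dict:
    """Split a DeiT model name into its parts and derive the token counts."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"not a recognised DeiT model name: {name}")
    patch, res = int(m["patch"]), int(m["res"])
    distilled = m["distilled"] is not None
    n_patches = (res // patch) ** 2
    return {
        "family": m["family"],
        "size": m["size"],
        "distilled": distilled,
        "patch": patch,
        "resolution": res,
        "num_patches": n_patches,
        # [CLS] only, or [CLS] + [DIST] for distilled variants
        "seq_len": n_patches + (2 if distilled else 1),
    }


print(decode("deit_base_patch16_224"))
print(decode("deit_base_distilled_patch16_224")["seq_len"])  # 198
```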


---

# 1. Input and Patch Embeddings

### Image

Resolution:
$$224 \times 224 \times 3$$

### Patch Size

$$16 \times 16$$

This gives:
$$\frac{224}{16}=14$$
So the number of patches:
$$14 \times 14 = 196$$

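Each patch is flattened:
$$16 \times 16 \times 3 = 768\text{ values}$$
and then linearly projected to the embedding dimension $D = 768$. The two numbers coincide for ViT-Base only because $16 \cdot 16 \cdot 3$ happens to equal $D$.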
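
The splitting and flattening step can be sketched on a toy image using nested lists. This is a minimal illustration, not how a framework does it (real implementations typically use a strided convolution or tensor reshapes), and the `patchify` helper is hypothetical.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches (row-major order)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches


# Toy 4x4 RGB image where each pixel stores (row, col, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
print(len(patches), len(patches[0]))  # 4 12
print(patches[0])
```

Applied to a $224 \times 224 \times 3$ image with patch size 16, the same function yields 196 patches of 768 values each.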

---

# 2. Tokenization

### Patch tokens

Shape:
$$196 \times 768$$

### CLS token

A learnable vector:
$$1 \times 768$$

Total sequence length:
$$196 + 1 = 197\text{ tokens}$$

---

# 3. Positional Embedding

A learnable embedding:
$$197 \times 768$$

Added to the token sequence (the `[CLS]` token followed by the 196 patch embeddings):
$$Z_0 = [x_{\text{cls}}, x_p^1, \dots, x_p^{196}] + E_{\text{pos}}$$

---

# 4. Transformer Encoder Structure

There are **12 transformer encoder blocks**.

Each block has:

### (1) LayerNorm

### (2) Multi-Head Self-Attention (MSA)

Parameters:

* Embedding dimension:
  $$D = 768$$

* Number of heads:
  $$H = 12$$

So dimension per head:
$$d_k = \frac{768}{12} = 64$$

Attention uses:

* Query:
  $$Q = X W_Q$$
* Key:
  $$K = X W_K$$
* Value:
  $$V = X W_V$$

Where each projection matrix has shape:
$$W_Q, W_K, W_V \in \mathbb{R}^{768 \times 768}$$

Attention output uses:
$$W_O \in \mathbb{R}^{768 \times 768}$$
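
Within each head, the projected queries, keys, and values are combined by scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$. The sketch below implements a single head on toy data in plain Python; in ViT-Base each of the 12 heads would run this on its own 64-dimensional slice before the results are concatenated and multiplied by $W_O$. The function names and inputs are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head attention. q, k: N vectors of size d_k; v: N vectors of size d_v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return out, weights


# Three toy tokens with d_k = 2 (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
print([[round(w, 3) for w in row] for row in weights])
```

Each row of `weights` sums to 1, so every output token is a weighted average of the value vectors.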

### (3) Skip connection

### (4) LayerNorm

### (5) MLP block

MLP has:

Hidden size:
$$3072 = 4 \times 768$$

Layers:

* Linear:
  $$768 \rightarrow 3072$$
* GELU
* Linear:
  $$3072 \rightarrow 768$$

### (6) Skip connection

This repeats for **12 layers**.
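
Read in order, steps (1) to (6) describe a pre-norm residual block. Using the $Z$ notation from earlier, and writing $\mathrm{LN}$ for LayerNorm, one block can be summarized as:

$$
\begin{aligned}
Z'_\ell &= Z_{\ell-1} + \mathrm{MSA}\big(\mathrm{LN}(Z_{\ell-1})\big) \\
Z_\ell &= Z'_\ell + \mathrm{MLP}\big(\mathrm{LN}(Z'_\ell)\big), \qquad \ell = 1,\dots,12
\end{aligned}
$$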

---

# 5. Output

We take the final representation of the **CLS token**:

$$z_{\text{CLS}} \in \mathbb{R}^{768}$$

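Pass it to a linear head (weights $W_{\text{head}}$, bias $b_{\text{head}}$) to obtain the class logits:
$$Z_{\text{cls}} = z_{\text{CLS}} W_{\text{head}} + b_{\text{head}}$$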

For ImageNet:
$$W_{\text{head}} \in \mathbb{R}^{768 \times 1000}$$

---

# 6. Summary Table

| Component            | Configuration     |
| -------------------- | ----------------- |
| Image size           | 224×224           |
| Patch size           | 16×16             |
| Number of patches    | 196               |
| Patch embedding dim  | 768               |
| Sequence length      | 197 (196 + CLS)   |
| Positional embedding | 197×768 (learned) |
| Encoder blocks       | 12                |
| MSA heads            | 12                |
| Dim per head         | 64                |
| MLP hidden dim       | 3072              |
| Final head           | 768 → 1000        |
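
The configuration in this table is enough to estimate the parameter count. The sketch below assumes bias terms in every linear layer, two LayerNorms per block, and one final LayerNorm before the head, which matches common ViT implementations but is an assumption here; it covers the non-distilled model only, and `vit_param_count` is a hypothetical helper.

```python
def vit_param_count(dim=768, depth=12, mlp_ratio=4, patch=16, in_chans=3,
                    img_size=224, num_classes=1000, extra_tokens=1):
    """Rough parameter count for a ViT-style encoder with a linear head."""
    n_patches = (img_size // patch) ** 2
    patch_embed = patch * patch * in_chans * dim + dim  # linear projection + bias
    tokens = extra_tokens * dim                         # learnable [CLS] token
    pos_embed = (n_patches + extra_tokens) * dim        # learned positions
    layer_norm = 2 * dim                                # scale + shift
    attn = 4 * (dim * dim + dim)                        # W_Q, W_K, W_V, W_O with biases
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    block = 2 * layer_norm + attn + mlp
    head = dim * num_classes + num_classes
    return patch_embed + tokens + pos_embed + depth * block + layer_norm + head


print(f"{vit_param_count():,}")  # 86,567,656
```

Almost all of the parameters sit in the 12 encoder blocks (about 7.1M each); the number of heads does not change the count because the heads share the same $768 \times 768$ projections.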

---

# 7. Important note

**Architecture = ViT-B/16.
What makes it DeiT is not the architecture but the training recipe.**

DeiT training includes:

* Mixup
* CutMix
* RandAugment
* Stochastic depth
* EMA
* Label smoothing
* Optional distillation (for distilled variants)
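
Two of these ingredients, label smoothing and Mixup, act directly on the training targets and are simple to sketch. The code below is an illustration under stated assumptions: label smoothing uses the common convention of mixing the one-hot target with a uniform distribution, the `eps` and `lam` values are placeholders rather than the paper's settings, and in practice the Mixup weight is usually sampled from a Beta distribution for each batch.

```python
def smooth_one_hot(label, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) * one-hot + eps * uniform."""
    off = eps / num_classes
    return [1.0 - eps + off if k == label else off for k in range(num_classes)]


def mixup(x_a, x_b, t_a, t_b, lam):
    """Blend two inputs and their target distributions with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    t = [lam * a + (1.0 - lam) * b for a, b in zip(t_a, t_b)]
    return x, t


# Two toy samples with 4 classes (hypothetical data).
t_cat = smooth_one_hot(0, 4)
t_dog = smooth_one_hot(1, 4)
x_mix, t_mix = mixup([1.0, 0.0], [0.0, 1.0], t_cat, t_dog, lam=0.7)
print([round(v, 3) for v in t_mix])
```

The mixed target is still a valid probability distribution, so it can be used with a soft-target cross-entropy loss.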

But the architecture stays **pure ViT Base Patch16**.

---

