

# **DINO — Distillation with NO Labels**

**DINO (Distillation with NO Labels)** is a **self-supervised learning method** introduced by Facebook AI Research (FAIR) in 2021 in the paper:

> **“Emerging Properties in Self-Supervised Vision Transformers”**
> *Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin, 2021*

DINO demonstrated that **Vision Transformers (ViTs)** can learn **powerful visual representations without labeled data** — and more importantly, **without collapsing** (i.e., without all representations becoming identical).

---

## **1. Main Idea**

DINO trains a model to produce **consistent representations for different augmented views** of the same image — without labels.

It does this via **knowledge distillation** between two networks:

* **Student network** — learns by gradient descent.
* **Teacher network** — provides *soft targets* but is **not trained via gradients**.

The teacher’s weights are an **Exponential Moving Average (EMA)** of the student’s:

$$
\theta_{\text{teacher}} \leftarrow \tau \theta_{\text{teacher}} + (1 - \tau) \theta_{\text{student}}
$$

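where $\tau \in [0, 1)$ is a momentum parameter (typically 0.996–0.9995).
Because the targets come from this teacher rather than from human annotations, training is a form of self-distillation that needs no labels, hence the name: **Distillation with NO labels (DINO).**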
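
In the DINO paper, $\tau$ is not held fixed: it is raised from 0.996 toward 1 over the course of training with a cosine schedule. The sketch below shows one way to compute such a schedule; the function name `momentum_schedule` and its arguments are illustrative, not taken from any particular codebase.

```python
import math

def momentum_schedule(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule that moves the EMA momentum from `base` to `final`."""
    progress = step / max(total_steps - 1, 1)
    return final - (final - base) * 0.5 * (1.0 + math.cos(math.pi * progress))

# Momentum at the start, middle and end of a 1000-step run
taus = [momentum_schedule(s, 1000) for s in (0, 499, 999)]
print(taus)
```

The teacher therefore follows the student more closely early in training and becomes almost frozen near the end.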

---

## **2. Training Setup**

Each input image is augmented into **multiple crops**:

| Crop Type    | Input Size | Used By           | Number per Image |
| ------------ | ---------- | ----------------- | ---------------- |
| Global crops | 224×224    | Teacher + Student | 2                |
| Local crops  | 96×96      | Student only      | 8                |

So, for each image:

1. The **teacher** sees only the 2 global crops.
2. The **student** sees both 2 global + 8 local crops.
3. The student’s predictions for *all* crops are trained to match the teacher’s outputs on the global crops.

This **multi-crop strategy** enforces **scale- and viewpoint-invariance** — forcing the model to focus on *semantic content* rather than low-level pixels.

---

## **3. Outputs and Objective**

Both teacher and student end with an **MLP projection head** that produces a **probability distribution** via softmax:

$$
p_t = \text{softmax}\left(\frac{z_t - c}{T_t}\right), \qquad p_s = \text{softmax}\left(\frac{z_s}{T_s}\right)
$$

where:

* $ z_t, z_s $ = projection head outputs
* $ T_t, T_s $ = temperature scalars
* $ c $ = center vector (running mean of teacher outputs)

The loss is the **cross-entropy** between teacher and student distributions:

$$
\mathcal{L}_{\text{DINO}} = -\sum_i p_t^{(i)} \log p_s^{(i)}
$$

Each **student crop** is matched with **each teacher global crop**, encouraging all crops of the same image to yield similar embeddings:

$$
\mathcal{L} = \sum_{i=1}^{N_t} \sum_{\substack{j=1 \\ j \neq i}}^{N_s} H(p_t^{(i)}, p_s^{(j)})
$$

with $N_t = 2$ (teacher global crops) and $N_s = 10$ (student crops).
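
The objective above can be sketched in plain Python for a single image. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the temperatures (0.04 for the teacher, 0.1 for the student) are assumed defaults. It returns the summed cross-entropy together with the number of (teacher, student) pairs that contributed.

```python
import math

def log_softmax(logits, temperature):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_norm for v in scaled]

def dino_loss(teacher_logits, student_logits, center, t_teacher=0.04, t_student=0.1):
    """Sum of H(p_t^(i), p_s^(j)) over all pairs with j != i.

    The first len(teacher_logits) student entries must be the same
    global views that the teacher saw.
    """
    total, n_terms = 0.0, 0
    for i, z_t in enumerate(teacher_logits):
        centered = [z - c for z, c in zip(z_t, center)]
        p_t = [math.exp(v) for v in log_softmax(centered, t_teacher)]
        for j, z_s in enumerate(student_logits):
            if j == i:  # skip the pair built from the same view
                continue
            log_p_s = log_softmax(z_s, t_student)
            total -= sum(p * lp for p, lp in zip(p_t, log_p_s))
            n_terms += 1
    return total, n_terms

# Toy example: K = 4 output dimensions, 2 teacher views, 10 student views
K = 4
teacher_logits = [[0.1 * k for k in range(K)], [0.2 * k for k in range(K)]]
student_logits = [[0.05 * (k + v) for k in range(K)] for v in range(10)]
loss, n_terms = dino_loss(teacher_logits, student_logits, center=[0.0] * K)
print(loss, n_terms)
```

With 2 teacher views and 10 student views, 18 of the 20 possible pairs remain after the same-view pairs are dropped.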

---

## **4. Avoiding Collapse**

To prevent trivial solutions (identical embeddings for all inputs), DINO stabilizes training via:

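1. **Temperature scaling (sharpening):** a low teacher temperature $T_t$ makes the teacher distribution peaked, which counteracts collapse to a uniform output.
2. **Centering:** subtracts a running mean $c$ from teacher outputs, which keeps any single output dimension from dominating: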
   $$
   p_t = \text{softmax}\left(\frac{z_t - c}{T_t}\right)
   $$
3. **EMA teacher:** ensures stable targets through slow updates.
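
The center $c$ is itself a running average of teacher outputs. A minimal sketch of that update, assuming the rule $c \leftarrow m\,c + (1 - m)\,\text{mean}(z_t)$ over the batch with a rate $m = 0.9$ (the function name and the value of $m$ are assumptions made for illustration):

```python
def update_center(center, teacher_batch, momentum=0.9):
    """EMA update of the center: c <- m * c + (1 - m) * mean(teacher outputs)."""
    n = len(teacher_batch)
    batch_mean = [sum(col) / n for col in zip(*teacher_batch)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

center = [0.0, 0.0, 0.0]
batch = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # two teacher outputs, K = 3
center = update_center(center, batch)
print(center)
```

Because the center only depends on batch statistics, it can be updated after each step without any gradient computation.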

---

## **5. EMA Update Explained**

The EMA equation:

$$
\theta_t \leftarrow \tau \theta_t + (1 - \tau)\theta_s
$$

means the teacher is a **momentum-averaged version** of past student weights.
If $ \tau = 0.996 $, the teacher changes slowly, producing stable targets.
Thus, the teacher acts as a **temporal ensemble** — a “memory” of recent student states.

**PyTorch-style implementation:**

```python
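import torch

# Assumes `teacher` and `student` are nn.Modules with matching parameters.
with torch.no_grad():  # the EMA step is not part of the autograd graph
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(tau).add_(s.detach(), alpha=1 - tau)  # in-place EMA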
```

This ensures:

* The **student** learns fast via gradients.
* The **teacher** evolves slowly and stably.


The first few updates, written out with $\tau = 0.996$:

```
Iteration 1
teacher = student

Iteration 2
teacher = 0.996 * teacher + 0.004 * new_student

Iteration 3
teacher = 0.996 * teacher + 0.004 * new_student
...
```

So by iteration 1000, the teacher is effectively a **smoothed ensemble** of many previous student versions — acting like a “memory bank” of good representations.
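
To make the "smoothed ensemble" statement concrete, the recursion can be unrolled: a student snapshot from $k$ updates ago enters the teacher with weight $(1 - \tau)\tau^k$. The sketch below assumes a constant $\tau = 0.996$ and sums these weights over recent updates; the helper name is illustrative.

```python
tau = 0.996

def student_weight(age, tau=tau):
    """Weight of the student snapshot from `age` updates ago in the teacher."""
    return (1 - tau) * tau ** age

# Share of the teacher that comes from the most recent 250 and 1000 updates
recent_250 = sum(student_weight(k) for k in range(250))
recent_1000 = sum(student_weight(k) for k in range(1000))
print(f"last 250 updates:  {recent_250:.3f}")
print(f"last 1000 updates: {recent_1000:.3f}")
```

Roughly 63% of the teacher comes from the last $1/(1-\tau) = 250$ student states and about 98% from the last 1000, so the initial copy of the student has almost no influence by iteration 1000.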



---



## **6. Architecture**

Both teacher and student have **identical architectures**, often **Vision Transformers (ViT)** such as ViT-S/16 or ViT-B/16.

| Role    | Architecture         | Gradient Updates | Purpose                      |
| ------- | -------------------- | ---------------- | ---------------------------- |
| Student | ViT-S/16 (or ResNet) | Yes (gradient descent) | Learns from teacher          |
| Teacher | ViT-S/16 (same)      | No (EMA update only)   | Provides stable soft targets |

At initialization:
$$
\theta_{\text{teacher}} = \theta_{\text{student}}
$$

---

#### 6.1. The original DINO paper (2021)

In *“Emerging Properties in Self-Supervised Vision Transformers”* (Caron et al., 2021),
the authors tested DINO with **two main families** of architectures:

| Family                           | Examples              | Notes                                                           |
| -------------------------------- | --------------------- | --------------------------------------------------------------- |
| **Vision Transformers (ViT)**    | ViT-S/16, ViT-B/16    | DINO works *extremely well* — emergent segmentation, clustering |
| **Convolutional Networks (CNN)** | ResNet-50, ResNet-101 | Works well, but ViTs show stronger semantic representations     |

---

#### 6.2. Why Vision Transformers (ViT) are usually preferred

DINO’s biggest discovery was that **self-supervised ViTs** (trained *without labels*)
can automatically learn **object-level representations** and **attention maps**.

Here’s why ViTs work so well in this setup:

1. **Patch tokens** naturally represent local regions — ideal for multi-crop learning.
2. **Self-attention** allows modeling global context easily.
3. EMA teacher stabilization + multi-crop augmentations yield strong object localization signals.

Result: even without labels, attention heads focus on *objects* in the image.

---

#### 6.3. Commonly used networks

| Network        | Params | Patch size | Resolution | Typical use                                   |
| -------------- | ------ | ---------- | ---------- | --------------------------------------------- |
| **ViT-S/16**   | ~22M   | 16×16      | 224×224    | ✅ Default in DINO (good balance of size/perf) |
| **ViT-B/16**   | ~86M   | 16×16      | 224×224    | High accuracy, heavier                        |
| **ViT-B/8**    | ~86M   | 8×8        | 224×224    | Better fine-grained detail, more compute      |
| **ResNet-50**  | ~25M   | –          | 224×224    | Works fine but weaker object-level features   |
| **ResNet-101** | ~44M   | –          | 224×224    | Higher performance CNN baseline               |

Later versions (like **DINOv2**) scale this up to much larger models such as **ViT-L/14** and **ViT-G/14**,
trained on 142 million curated images.

---

#### 6.4. Architecture summary in code (simplified)

```python
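# Assumes VisionTransformer and DINOHead are defined elsewhere,
# for example as in the DINO reference implementation.
student = VisionTransformer(patch_size=16, embed_dim=384)   # ViT-S/16
student_head = DINOHead(in_dim=384, out_dim=65536)

# Teacher: same architecture, initialized as a copy of the student
teacher = VisionTransformer(patch_size=16, embed_dim=384)
teacher_head = DINOHead(in_dim=384, out_dim=65536)
teacher.load_state_dict(student.state_dict())
teacher_head.load_state_dict(student_head.state_dict())

# The teacher is updated by EMA only, so it needs no gradients
for p in list(teacher.parameters()) + list(teacher_head.parameters()):
    p.requires_grad = False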
```

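The **DINOHead** is an MLP followed by L2 normalization and a final linear layer with `out_dim` outputs; the temperature-scaled softmax is applied to these outputs when the loss is computed.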

---

#### 6.5. Which to use (practical guide)

| GPU Memory         | Recommended Backbone                        | Notes                       |
| ------------------ | ------------------------------------------- | --------------------------- |
| ≤ 4 GB             | ViT-Ti/16 or ViT-S/16 with small batch size | Possible on laptop GPU      |
| 8–16 GB            | ViT-S/16 or ViT-B/16                        | Standard DINO               |
| 24 GB+             | ViT-B/16 or ViT-L/16                        | High-performance            |
| CPU or low compute | ResNet-50                                   | Simpler, slower convergence |

---

#### 6.6. For DINOv2 (2023, Meta)

The updated version **DINOv2** uses much larger models:

| Model    | Layers | Hidden dim | Patch size | Params |
| -------- | ------ | ---------- | ---------- | ------ |
| ViT-S/14 | 12     | 384        | 14         | ~22M   |
| ViT-B/14 | 12     | 768        | 14         | ~86M   |
| ViT-L/14 | 24     | 1024       | 14         | ~304M  |
| ViT-G/14 | 40     | 1536       | 14         | ~1B    |

All trained on **142 million curated images** (without labels).

---

#### 6.7. TL;DR Summary

| Aspect          | DINO (2021)           | DINOv2 (2023)                         |
| --------------- | --------------------- | ------------------------------------- |
| Typical network | ViT-S/16 or ViT-B/16  | ViT-L/14 or ViT-G/14                  |
| Input size      | 224×224               | 518×518                               |
| Patch size      | 16                    | 14                                    |
| Teacher/Student | same architecture     | same architecture                     |
| Supervision     | none (self-distilled) | none (self-distilled, larger dataset) |

---

✅ **In short:**

* You can use **any network**, but **Vision Transformers (ViT-S/16)** are the **standard** in DINO.
* The **teacher and student** share the *same architecture*.
* Only the **student** is trained by gradient descent; the **teacher** updates via EMA.

---






#### 6.8. Why This Works

This EMA update makes the teacher a **temporal ensemble** — it slowly tracks the student’s improvements, producing **stable targets**.

If both teacher and student were trained by gradients, the targets would change too fast → **training collapse**.

By keeping the teacher **slow and stable**, DINO ensures:

* The student always learns from a consistent signal.
* The teacher gradually improves as the student improves.

---

#### 6.9. Initialization

At the start:

* The **teacher** is initialized as a **copy** of the student:
  $$
  \theta_{\text{teacher}} = \theta_{\text{student}}
  $$

Then training begins — student learns, teacher slowly follows via EMA.

---

#### 6.10. Visual Summary

```
                     Input image
              ┌───────────┴───────────┐
       2 global crops     2 global + 8 local crops
              │                       │
              ▼                       ▼
        ┌───────────┐    EMA    ┌───────────┐
        │  Teacher  │ ◄──────── │  Student  │
        └───────────┘   update  └───────────┘
              │                       │
        soft targets             predictions
              └───► cross-entropy ◄───┘
                          │
                          ▼
           gradient descent (student only)
```

---



## **7. Cropping Images**
For each image:

1. Create 2 global and 8 local crops via random cropping and resizing.
2. Teacher processes **only** the 2 global crops.
3. Student processes **all 10 crops**.

```python
teacher_outputs = [teacher(c) for c in global_crops]
student_outputs = [student(c) for c in (global_crops + local_crops)]
```

Each output is a probability vector (e.g., 65,536 dimensions).
Loss is computed between every teacher and student pair (excluding same-view pairs).

---




#### 7.1. What “crop” means here

Here, a *crop* is a **sub-region** of the original image: part of the image is discarded,
and the remaining sub-region is **resized** to the network’s input size.

This is **data augmentation**: the goal is not to truncate data but to create *different views* of the same image.

---

#### 7.2. Step-by-step example

Suppose your **original image** is
$$
\text{Original size: } 480 \times 480
$$

DINO will create **multiple random crops** from it:

**(a) Global crops**

* Two random sub-regions (large portions of the image).
* Each crop might cover, for example, 50–100% of the area.
* These are then **resized** to
  $$224 \times 224$$
  before being fed to the network.

**(b) Local crops**

* Smaller sub-regions (e.g., 5–40% of the area).
* These are resized to
  $$96 \times 96$$
  before being fed to the network.

---

#### 7.3. Who sees what

| Network     | Crop type            | Input size        | Number per image              | Purpose               |
| ----------- | -------------------- | ----------------- | ----------------------------- | --------------------- |
| **Teacher** | Global crops only    | 224×224           | 2                             | Stable supervision    |
| **Student** | Global + Local crops | 224×224 and 96×96 | 2 global + 8 local = 10 total | Learns from all views |

So:

* The **teacher** only sees the two *large* views.
* The **student** sees all views — two global and eight local.

The student’s outputs for *each local/global crop* are matched (via cross-entropy) to the corresponding teacher outputs of the *global crops*.

---

#### 7.4. Concrete example

Let’s visualize one sample numerically.

**Original image: 480×480**

```
+------------------------------------------------+
|                                                |
|          [ Original image 480×480 ]            |
|                                                |
+------------------------------------------------+
```

**Cropping (random locations & scales)**

1. Global crop 1: region (x=20,y=10,width=420,height=420)
   → resize → (224×224)
2. Global crop 2: region (x=80,y=50,width=350,height=350)
   → resize → (224×224)
3. Local crop 1: region (x=200,y=100,width=150,height=150)
   → resize → (96×96)
4. Local crop 2: region (x=50,y=300,width=120,height=120)
   → resize → (96×96)
5. … and so on up to 8 local crops.

Now we have:

* 2 crops of 224×224 (global)
* 8 crops of 96×96 (local)
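
As a quick check on the example regions above, the share of the original 480×480 image that each crop covers can be computed directly. The helper name `area_fraction` is illustrative.

```python
def area_fraction(crop_w, crop_h, image_w=480, image_h=480):
    """Fraction of the original image area covered by a crop."""
    return (crop_w * crop_h) / (image_w * image_h)

crops = {
    "global 1": (420, 420),
    "global 2": (350, 350),
    "local 1": (150, 150),
    "local 2": (120, 120),
}
for name, (w, h) in crops.items():
    print(f"{name}: {area_fraction(w, h):.1%}")
```

The two global crops cover about 77% and 53% of the image, while the two local crops cover only about 10% and 6%.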

**Feeding to networks**

```python
# Example (pseudo-code)
global_crops = [crop1_224, crop2_224]
local_crops  = [crop3_96, crop4_96, ..., crop10_96]

teacher_outputs = [teacher(c) for c in global_crops]   # 2 × 224×224
student_outputs = [student(c) for c in (global_crops + local_crops)]  # 10 total
```

---

#### 7.5. Why DINO Does This

The **multi-crop strategy** helps the model learn **scale-invariant** and **location-invariant** representations:

* The same object seen in large or small view → should produce similar embeddings.
* Forces the model to focus on **semantics**, not exact pixels.

So for example:

* A cat’s face in a global crop and a zoomed-in cat’s ear (local crop)
  should still map to similar feature space representations.

---


#### 7.6. Intuitive Analogy

Imagine you’re showing two people (teacher and student) pictures of the same scene:

* The **teacher** sees two big photos of the whole scene.
* The **student** sees those same big photos plus several **zoomed-in patches** of details.

Then you ask the student to make their representations consistent with the teacher’s interpretation — even when looking at smaller parts.

---



## **8. Feeding Crops to the Network**

This section covers four points:

1. **Architecture setup (teacher vs student)**
2. **What the inputs are**
3. **How the inputs are fed (serial vs batch)**
4. **How the loss is computed when the number of crops differ**

---

#### 8.1. Architecture setup

**Both teacher and student have the same architecture** (e.g. ViT-S/16 or ViT-B/16).

They both can process any image size, but in DINO:

| Network     | Inputs                   | Input Size(s)     |
| ----------- | ------------------------ | ----------------- |
| **Teacher** | 2 global crops           | 224×224           |
| **Student** | 2 global + 8 local crops | 224×224 and 96×96 |

They share the same patch embedding structure (e.g., ViT divides image into 16×16 patches),
so smaller crops just yield fewer tokens — but the same network can process them.

---

#### 8.2. What the inputs are

For each original image (say 480×480):

* **Global crops (x2)** → resized to **224×224**
* **Local crops (x8)** → resized to **96×96**

So total 10 crops (views).

Let’s denote them:

$$
\text{Global views: } \{g_1, g_2\}, \quad \text{Local views: } \{l_1, l_2, \dots, l_8\}
$$

---

#### 8.3. How the inputs are fed

The crops cannot all be stacked into a single tensor, for two reasons:

* Global (224×224) and local (96×96) crops have different spatial sizes.
* The teacher gets **only the 2 global crops**, while the student gets **2 global + 8 local = 10 crops**.

So the crops are usually **looped over or batched per resolution**:

**Example (pseudo-code)**

```python
# Teacher: only global crops
teacher_outputs = [teacher(crop) for crop in global_crops]  # 2 outputs

# Student: global + local crops
student_outputs = [student(crop) for crop in (global_crops + local_crops)]  # 10 outputs
```

Each call produces a feature vector or a probability distribution, e.g. `[batch_size, dim]`.

In practice, DINO implements this efficiently by **concatenating** crops of the same resolution along the batch dimension,
so each network needs only one vectorized forward pass per resolution.
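
One way to organize the per-resolution batching is to split the crop list into consecutive runs of equal size. The sketch below works on the crop sizes only and uses an illustrative helper, `group_by_resolution`; it is not the function used in the DINO code.

```python
from itertools import groupby

def group_by_resolution(crop_sizes):
    """Split a crop list into consecutive runs that share one resolution.

    Returns (resolution, start_index, end_index) tuples; each run can be
    stacked into one batch and sent through the backbone in a single pass.
    """
    groups, start = [], 0
    for size, run in groupby(crop_sizes):
        count = len(list(run))
        groups.append((size, start, start + count))
        start += count
    return groups

crop_sizes = [224, 224] + [96] * 8  # 2 global crops, then 8 local crops
groups = group_by_resolution(crop_sizes)
print(groups)
```

For the standard 2 + 8 setup this yields two runs, so the student backbone runs twice per step: once on the stacked global crops and once on the stacked local crops.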

---

#### 8.4. How the loss is computed

The teacher provides **reference distributions** for the global crops:
$$
p_t^{(1)}, \; p_t^{(2)}
$$

The student provides **predictions** for all 10 crops:
$$
p_s^{(1)}, p_s^{(2)}, \dots, p_s^{(10)}
$$

**Matching rule:**

For **each student crop**, you compute a loss with **each teacher global crop**.
That is, every student view tries to predict what the teacher predicts for the same image (under different augmentations).

Formally, for an image $x$:

$$
\mathcal{L} = \sum_{i=1}^{N_t} \sum_{\substack{j=1 \\ j \neq i}}^{N_s} H(p_t^{(i)}, p_s^{(j)})
$$

where:

* $N_t = 2$: number of teacher global crops,
* $N_s = 10$: number of student crops (2 global + 8 local),
* $H(p_t, p_s)$ is the cross-entropy between teacher and student distributions.

The “$j \neq i$” part simply means we don’t match a student’s global view with the *same* global view used for that teacher prediction — only with *different* ones.

So, the loss encourages **all crops (local and global)** of the same image to yield similar semantic representations.
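
The matching rule can be enumerated explicitly. In this sketch, views are indexed from 0 and the first `N_T` student views are assumed to be the same global crops the teacher sees; the variable names are illustrative.

```python
N_T, N_S = 2, 10  # teacher views, student views

# All (teacher view, student view) pairs except those built from the same view
pairs = [(i, j) for i in range(N_T) for j in range(N_S) if j != i]

# Split by the type of student view
global_to_global = [(i, j) for i, j in pairs if j < N_T]
local_to_global = [(i, j) for i, j in pairs if j >= N_T]

print(len(pairs), len(global_to_global), len(local_to_global))
```

Of the 18 loss terms, only 2 compare one global view with the other; the remaining 16 ask a local view to predict the teacher's output for a global view.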

---

#### 8.5. Example dimensions

Assume:

* Student ViT produces a feature vector of size 384 (for ViT-S).
* Projection head → 65,536-dim softmax output (e.g. via MLP).

Then:

| Crop      | Network | Input size | Output shape |
| --------- | ------- | ---------- | ------------ |
| 2× global | Teacher | 224×224    | [2, 65536]   |
| 2× global | Student | 224×224    | [2, 65536]   |
| 8× local  | Student | 96×96      | [8, 65536]   |

All teacher and student outputs are **probability vectors** over the same dimension,
so you can compute cross-entropy pairwise.

---



#### 8.6. Intuitive picture

```
Original image
   │
   ├── 2 global crops (224×224) → teacher & student
   └── 8 local crops (96×96)   → student only

teacher_outputs = [p_t1, p_t2]
student_outputs = [p_s1 ... p_s10]

Loss = Σ H(p_ti, p_sj) for i∈{1,2}, j∈{1,...,10}, i≠j
```

---

#### 8.7. Why multi-crop matters

This design enforces **consistency across scales and viewpoints**:

* A small local crop should have the same representation as the global view of the same image.
* This drives the model to learn **semantic**, not pixel-level, similarity.

---

So, in short:

✅ **Both teacher and student** have the same architecture.
✅ **Teacher input:** 2× 224×224 (global crops).
✅ **Student input:** 2× 224×224 (global) + 8× 96×96 (local).
✅ **Feeds:** Usually batched or iterated, outputs concatenated.
✅ **Loss:** Cross-entropy between each student crop and each teacher global crop.
✅ **Gradients:** Student only; teacher updated via EMA.

---



## **9. ViT and Variable Input Sizes**

DINO can feed **224×224** (global) and **96×96** (local) crops into the same ViT.

### Patch calculation

For ViT-S/16:

* 224×224 → 14×14 = 196 patches → 197 tokens (with [CLS])
* 96×96 → 6×6 = 36 patches → 37 tokens (with [CLS])
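
The same arithmetic as a small helper, assuming square inputs whose side is a multiple of the patch size (the function name is illustrative):

```python
def vit_token_count(image_size, patch_size=16, cls_token=True):
    """Number of tokens a ViT sees for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if cls_token else 0)

for size in (224, 96):
    print(size, vit_token_count(size))
```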

### Positional embedding interpolation

Since ViT uses **learned positional embeddings** for 14×14 patches, they must be **interpolated** for 6×6 crops:

```python
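import torch
import torch.nn.functional as F

def interpolate_pos_encoding(model, x):
    """Resize model.pos_embed to match the token count of x ([B, tokens, dim])."""
    n_patches = x.shape[1] - 1          # tokens in x, minus [CLS]
    N = model.pos_embed.shape[1] - 1    # patches the embeddings were learned for
    if n_patches == N:
        return model.pos_embed
    dim = model.pos_embed.shape[-1]
    h = w = int(N ** 0.5)               # assumes a square grid
    # [1, N, dim] -> [1, dim, h, w] so the grid can be resized like an image
    pos = model.pos_embed[0, 1:].reshape(1, h, w, dim).permute(0, 3, 1, 2)
    new_hw = int(n_patches ** 0.5)
    pos = F.interpolate(pos, size=(new_hw, new_hw), mode='bicubic', align_corners=False)
    # back to [1, n_patches, dim], with the [CLS] embedding prepended
    pos = torch.cat([model.pos_embed[:, :1], pos.flatten(2).transpose(1, 2)], dim=1)
    return pos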
```

This enables one ViT model to handle **both resolutions** (96×96 and 224×224) with the same weights; crops of different sizes go through separate forward passes.

| Crop   | Input   | # Patches | Tokens | Positional Embedding | Used by           |
| ------ | ------- | --------- | ------ | -------------------- | ----------------- |
| Global | 224×224 | 14×14=196 | 197    | Learned              | Teacher + Student |
| Local  | 96×96   | 6×6=36    | 37     | Interpolated         | Student only      |

---



#### **How Vision Transformers handle variable input sizes**

Let’s go deep into how **Vision Transformers handle variable input sizes**, and how DINO feeds both **224×224** and **96×96** crops into the *same ViT-S/16* model.

---

#### 9.1. Recall what ViT does

In **ViT-S/16**, the input image (say 224×224) is split into **non-overlapping 16×16 patches**.

So for a 224×224 input:

$$
\text{Number of patches} = \frac{224}{16} \times \frac{224}{16} = 14 \times 14 = 196
$$

Each patch → flattened → linearly projected → **token** (dimension = 384 for ViT-S).
Then a **[CLS] token** is prepended, giving total **197 tokens**.

So the input to the transformer is:
$$
X \in \mathbb{R}^{197 \times 384}
$$

---

#### 9.2. What happens with a 96×96 crop?

Now if we input 96×96 into the *same* ViT-S/16:

$$
\frac{96}{16} = 6 \quad \text{patches per dimension}
$$

So total patches:
$$
6 \times 6 = 36 \text{ patches}
$$

Add one CLS token:
$$
\Rightarrow 37 \text{ tokens in total.}
$$

So now the transformer input is:
$$
X \in \mathbb{R}^{37 \times 384}
$$

✅ This is perfectly fine — the ViT can handle *different numbers of tokens*, as long as patch size and embedding dimension stay consistent.

---

#### 9.3. The positional embedding challenge

Here’s the **only tricky part**:
ViTs use **learned positional embeddings** that depend on the number of patches (e.g. 14×14 for 224×224).

So, for 224×224 input, the model stores:
$$
\text{pos\_embed} \in \mathbb{R}^{1 \times (196 + 1) \times 384}
$$

A 96×96 input instead needs:
$$
\text{pos\_embed} \in \mathbb{R}^{1 \times (36 + 1) \times 384}
$$

We don’t have positional embeddings pre-trained for the 6×6 grid — so we must **resize/interpolate** the positional embeddings.

That’s exactly what DINO does internally.

---

#### 9.4. How DINO handles this (positional embedding interpolation)

DINO resizes the **2D grid** of positional embeddings via **bicubic interpolation** to match the number of patches for each input size.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_encoding(model, x):
    # Get number of patches in current input
    n_patches = x.shape[1] - 1  # minus CLS token
    N = model.pos_embed.shape[1] - 1
    if n_patches == N:
        return model.pos_embed
    
    # Extract 2D positional embedding grid
    dim = model.pos_embed.shape[-1]
    h = w = int(N ** 0.5)
    pos_embed_2d = model.pos_embed[0, 1:].reshape(1, h, w, dim).permute(0, 3, 1, 2)
    
    # Interpolate to new resolution
    new_h = new_w = int(n_patches ** 0.5)
    pos_embed_2d = F.interpolate(pos_embed_2d, size=(new_h, new_w), mode='bicubic', align_corners=False)
    
    # Flatten back and concat CLS token
    pos_embed = torch.cat([model.pos_embed[:, :1], pos_embed_2d.flatten(2).transpose(1, 2)], dim=1)
    return pos_embed
```

This allows a **single ViT** to process both 224×224 and 96×96 crops by **adapting** positional embeddings on the fly.

---

#### 9.5. Feeding different-sized inputs

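Crops of different sizes cannot share one tensor, so DINO **concatenates crops of the same resolution** along the batch dimension and runs one backbone pass per resolution; the resulting features are then concatenated and passed through the head together:

```python
import torch

# global_crops: list of 2 tensors, each [B, 3, 224, 224]
# local_crops:  list of 8 tensors, each [B, 3, 96, 96]
# `backbone` and `head` stand for the student's ViT and projection head.
# One backbone pass per resolution; same-size crops are stacked on the batch axis.
global_feats = backbone(torch.cat(global_crops))  # [2B, D]
local_feats = backbone(torch.cat(local_crops))    # [8B, D]
student_output = head(torch.cat([global_feats, local_feats]))  # [10B, K]
```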

During the forward pass, the model interpolates the positional embeddings for each crop depending on its spatial size.

So even though **teacher** always sees 224×224 inputs,
the **student** can flexibly handle both 224×224 and 96×96 crops — same model weights.

---

#### 9.6. Why this works

* Patch size = 16 is fixed → feature dimension fixed (384).
* Only the **number of tokens** changes (196 vs 36).
* Positional embedding is resized → keeps spatial structure aligned.
* Transformer layers (self-attention, MLPs) can process any sequence length.

---


#### 9.7. Intuition

So even though the ViT’s positional embeddings are learned for a *“224×224”* input,
the **patch embedding + interpolation trick** lets it generalize to any crop size that’s a multiple of the patch size.

That’s how DINO’s **student network** can ingest both large (224×224) and small (96×96) views
— and match them semantically to the teacher’s global representations.

---

✅ **In short:**

* Both teacher and student = same ViT-S/16 architecture.
* Teacher always gets 224×224 crops.
* Student gets both 224×224 and 96×96 crops.
* Different crop sizes work fine because:

  * Patch embedding is fixed.
  * Positional embeddings are interpolated for smaller grids.

---



#### **When and Why Interpolation is needed in ViT** 


In the ViT forward pass, you typically see something like:

```python
x = x + self.pos_embed[:, :x.size(1), :]
```

Here:

* `x` → token sequence (shape `[B, N_tokens, D]`)
* `self.pos_embed` → learned positional embeddings of shape `[1, N_max, D]`, where
  $N_\text{max} = \text{num\_patches} + 1 = 197$ for 224×224 input (14×14 grid + CLS token)

So this code assumes:
$$
N_\text{tokens} = x.size(1) \leq N_\text{max}
$$

That is: the number of tokens (patches + CLS) in your **current input**
must be ≤ the number of positional embeddings stored in the model.

If smaller, it just slices a subset of embeddings, which runs without error.
If larger — e.g., 384×384 → 24×24 patches → 577 tokens — it would fail,
and **interpolation is needed** to *expand* the position grid.

---

**Why simple slicing does not work for smaller images**

It’s true that slicing (`[:, :x.size(1)]`) *works syntactically* for smaller images —
but **it gives the wrong spatial correspondence**.

Let’s see why.



| Input   | Grid          | Tokens  | Shape |
| ------- | ------------- | ------- | ----- |
| 224×224 | 14×14 patches | 196 + 1 | 197   |
| 96×96   | 6×6 patches   | 36 + 1  | 37    |

If you simply slice `pos_embed[:, :37]`, you’re **taking the first 37 embeddings**
from a **1D flattening** of the 14×14 positional grid.

That means you’re taking something like:

```
[CLS] embedding (1 token)
first row patches (14 tokens)
second row patches (14 tokens)
third row patches (first 8 tokens)
```

→ It doesn’t form a 6×6 *spatial grid* anymore!

So the positional meaning (e.g., top-left → bottom-right) is **completely misaligned**.
The network would treat the patches as if they were still arranged in a 14×14 grid, not a 6×6 one.

Hence, **simple slicing breaks spatial consistency** when the grid size changes.
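
The mismatch can be checked by listing which grid positions sit behind the sliced embeddings. This sketch ignores the CLS token and compares the first 36 patch positions of the flattened 14×14 grid with the 36 positions a 6×6 crop has; the variable names are illustrative.

```python
GRID, SMALL = 14, 6

# Patch positions (row, col) behind the first 36 patch embeddings of the
# flattened 14x14 grid, i.e. what slicing would hand to a 6x6 crop.
sliced = [(k // GRID, k % GRID) for k in range(SMALL * SMALL)]

# Positions the 6x6 crop actually has, in its own row-major order.
needed = [(r, c) for r in range(SMALL) for c in range(SMALL)]

rows_covered = sorted({r for r, _ in sliced})
mismatches = sum(1 for s, n in zip(sliced, needed) if s != n)
print(rows_covered, mismatches)
```

The sliced embeddings come only from the top three rows of the original grid, and 30 of the 36 positions carry a (row, column) index that differs from the one the crop needs.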

---

**Correct fix: interpolate the 2D positional grid**

To preserve spatial meaning, you must:

1. Reshape the original 1D positional embeddings (except CLS token)
   into a 2D grid → `[H_orig, W_orig, D]`
   (for 224×224, that’s 14×14×384).

2. **Interpolate** this 2D grid to the new grid size (e.g. 6×6 for 96×96 input).

3. Flatten it back and concatenate with the CLS token.

This keeps the “positional meaning” consistent for arbitrary input sizes.
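
The three steps can be illustrated without PyTorch on a scalar-valued grid. Note the swap: this sketch uses **bilinear** interpolation with half-pixel centers for brevity, whereas DINO uses bicubic interpolation; the function name is hypothetical and each "embedding" is a single number rather than a 384-dimensional vector.

```python
def resize_grid_bilinear(grid, new_h, new_w):
    """Resize a 2D list of floats with bilinear interpolation (half-pixel centers)."""
    old_h, old_w = len(grid), len(grid[0])
    out = []
    for i in range(new_h):
        # Map the center of output row i back into input coordinates
        y = min(max((i + 0.5) * old_h / new_h - 0.5, 0.0), old_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = min(max((j + 0.5) * old_w / new_w - 0.5, 0.0), old_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

# Step 1: a 14x14 grid whose value encodes the row-major position index
grid = [[float(r * 14 + c) for c in range(14)] for r in range(14)]
# Step 2: resample it to 6x6
small = resize_grid_bilinear(grid, 6, 6)
# Step 3: flatten back to a sequence of 36 positions
flat = [v for row in small for v in row]
print(len(flat), flat[0], flat[-1])
```

Unlike slicing, the resized grid still spans the whole original layout: its first entry lies near the top-left corner of the 14×14 map and its last entry near the bottom-right corner.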

---

**Visual intuition**

Imagine the learned positional embeddings as a **14×14 colored grid** —
each color corresponds to a patch location.

When the crop is **smaller (6×6)**:

* If you just slice the first 36 colors → you take the *wrong patches*.
* If you interpolate → you *resample* that 14×14 map into a 6×6 version.

That’s what interpolation ensures.

---

**When interpolation is needed**

| Case                       | # Patches            | Example   | Why interpolate?         |
| -------------------------- | -------------------- | --------- | ------------------------ |
| **Smaller input (96×96)**  | fewer patches (6×6)  | ✅ **Yes** | preserve spatial mapping |
| **Same input (224×224)**   | same patches (14×14) | ❌ No      | direct match             |
| **Larger input (384×384)** | more patches (24×24) | ✅ **Yes** | extend grid spatially    |

So **both smaller and larger** inputs require interpolation —
because the **2D spatial structure** changes, not just the count.

---

**Example with numbers**

* Original: 224×224 → 14×14 grid → 14×14 positional map
* Smaller: 96×96 → 6×6 grid

Interpolation rescales:
$$
14 \times 14 \;\rightarrow\; 6 \times 6
$$

so each of the 36 positions in the smaller image corresponds to
a proper spatially-aligned position in the original grid.

---

**Code snippet recap**

That’s why DINO includes this in its ViT forward:

```python
# During forward:
pos_embed = interpolate_pos_encoding(self, x)

x = x + pos_embed
```

and `interpolate_pos_encoding()` handles the grid resizing automatically.

---

**Summary**

| Situation                             | x.size(1)    | Problem                        | Solution             |
| ------------------------------------- | ------------ | ------------------------------ | -------------------- |
| Input smaller than pretrained (96×96) | fewer tokens | slicing breaks spatial mapping | **interpolate down** |
| Input equal (224×224)                 | same tokens  | works fine                     | use as is            |
| Input larger (384×384)                | more tokens  | not enough embeddings          | **interpolate up**   |

In other words, slicing only “works” numerically for smaller inputs,
but it **destroys the geometry**, so **interpolation is required** to maintain correct positional relationships.

---

✅ **In short:**

* Slicing keeps correct *count* but wrong *spatial layout*.
* Interpolation preserves *2D positional meaning* for any resolution (smaller or larger).
* That’s why DINO uses interpolation, even for smaller 96×96 crops.

---



## **Numerical and Visual Example**
This section builds a **numerical and visual example** of positional-embedding interpolation.

We’ll take a **ViT-S/16** model (patch size = 16) and compare two inputs:

* A **224 × 224** image (standard size)
* A **96 × 96** image (local crop, used by the student in DINO)

---

## 1. Number of patches and tokens

### For input 224×224

$$
N_{h} = N_{w} = \frac{224}{16} = 14
$$
$$
N_\text{patches} = 14 \times 14 = 196
$$
Adding the `[CLS]` token:
$$
N_\text{tokens} = 196 + 1 = 197
$$

### For input 96×96

$$
N_{h} = N_{w} = \frac{96}{16} = 6
$$
$$
N_\text{patches} = 6 \times 6 = 36
$$
Adding `[CLS]`:
$$
N_\text{tokens} = 37
$$

So:

| Input size | Grid  | # patches | # tokens (with CLS) |
| ---------- | ----- | --------- | ------------------- |
| 224 × 224  | 14×14 | 196       | 197                 |
| 96 × 96    | 6×6   | 36        | 37                  |

---

## 2. What the positional embeddings look like

The ViT has a **learned 2D grid** of positional embeddings
(ignoring the `[CLS]` token):

$$
\text{pos\_embed\_2D} \in \mathbb{R}^{14\times14\times384}
$$

Flattened into 196×384 when used.

For a smaller image (6×6 = 36 patches), we must **resize** this 14×14 grid → 6×6.

---

## 3. Visualizing the idea (grids)

### Original positional grid (14×14)

```
14×14 positions
┌───────────────────────────────┐
│▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
│▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
│▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
│............. 14×14 ...........│
└───────────────────────────────┘
```

### After interpolation to 6×6

```
6×6 positions
┌──────────────┐
│▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
│▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
│.....6×6......│
└──────────────┘
```

This interpolation keeps the spatial meaning (top-left → bottom-right mapping) rather than just slicing off the first 36 embeddings.

---

## 4. Numerical example in code

Below is a minimal **PyTorch snippet** to simulate this interpolation.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Simulate learned 2D positional embeddings for 14×14 grid
H_orig, W_orig = 14, 14
dim = 1  # for visualization (normally 384)
pos_embed = torch.arange(H_orig * W_orig).float().reshape(1, H_orig, W_orig, dim)
pos_embed = pos_embed.permute(0, 3, 1, 2)  # [1, dim, H, W]

# Interpolate down to 6×6 grid
H_new, W_new = 6, 6
pos_embed_resized = F.interpolate(pos_embed, size=(H_new, W_new),
                                  mode='bicubic', align_corners=False)

# Visualize
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.imshow(pos_embed[0, 0].numpy(), cmap='viridis')
plt.title("Original 14×14 positional grid")
plt.subplot(1, 2, 2)
plt.imshow(pos_embed_resized[0, 0].numpy(), cmap='viridis')
plt.title("Interpolated 6×6 positional grid")
plt.show()

print(f"Original grid shape: {pos_embed.shape} -> Interpolated: {pos_embed_resized.shape}")
```

**Output**

```
Original grid shape: torch.Size([1, 1, 14, 14]) -> Interpolated: torch.Size([1, 1, 6, 6])
```

…and you’ll see the coarse 6×6 grid maintaining the same smooth spatial pattern.

---

## 5. How the ViT uses this at runtime

During a forward pass, DINO’s code calls something like:

```python
# Inside the ViT's token-preparation step; w and h are the input width and height
x = x + self.interpolate_pos_encoding(x, w, h)
```

So each input crop (96×96 or 224×224) gets a **properly resized positional embedding**,
keeping the network’s spatial reasoning consistent.
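
A self-contained sketch of such a resizing step might look as follows. The function `resize_pos_embed` is a hypothetical helper written for this example, not DINO's own method; it assumes a square source grid and a `[CLS]` embedding stored first, which is kept unchanged.

```python
import math
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid):
    """pos_embed: [1, 1 + N, dim] with [CLS] first; new_grid: (H_new, W_new)."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    n, dim = patch_pos.shape[1], patch_pos.shape[2]
    side = math.isqrt(n)                         # assumes a square source grid
    # [1, N, dim] -> [1, dim, side, side] so F.interpolate sees a 2D "image"
    patch_pos = patch_pos.reshape(1, side, side, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=new_grid,
                              mode="bicubic", align_corners=False)
    # Back to a flat token sequence: [1, H_new * W_new, dim]
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

pos_embed = torch.randn(1, 197, 384)             # stand-in for learned ViT-S/16 embeddings
local_tokens = torch.randn(2, 37, 384)           # a batch of 96x96 crops: 36 patches + [CLS]
local_tokens = local_tokens + resize_pos_embed(pos_embed, (6, 6))
print(local_tokens.shape)
```

The same function leaves the embeddings untouched (up to floating-point error) when the target grid is 14×14, so one code path can serve both global and local crops.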

---

✅ **In short:**

* 224×224 → 14×14 patches → 197 tokens
* 96×96 → 6×6 patches → 37 tokens
* The positional embedding grid (14×14) is **interpolated** to 6×6
  so that patch positions correspond spatially.
* This is why DINO can use one ViT for both large and small crops.


## **9. Emergent Properties**

After self-supervised training:

* ViT attention heads automatically focus on **objects** without labels.
* Embedding space forms **semantic clusters** (e.g., all birds together).
* The frozen backbone works effectively as a **feature extractor** for downstream tasks (classification, detection, segmentation).
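
The object-focused attention is usually inspected by looking at how the `[CLS]` token attends to the patch tokens in the last block. The sketch below shows only that computation on random tensors; the function name `cls_attention_maps` and the assumption that per-head queries and keys are already available are illustrative, and extracting them from a real model is left out.

```python
import torch

def cls_attention_maps(q, k, grid_size):
    """q, k: [heads, tokens, head_dim] from one block; token 0 is [CLS].

    Returns one [grid_size, grid_size] map per head showing how much
    [CLS] attends to each patch position.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)   # [heads, tokens, tokens]
    cls_to_patches = attn[:, 0, 1:]                            # drop the CLS->CLS entry
    return cls_to_patches.reshape(-1, grid_size, grid_size)

# Random stand-ins with ViT-S/16 shapes: 6 heads, 197 tokens, 64 dims per head
q = torch.randn(6, 197, 64)
k = torch.randn(6, 197, 64)
maps = cls_attention_maps(q, k, grid_size=14)
print(maps.shape)
```

With a trained DINO model, plotting each of these maps with `imshow` is what produces the familiar object-shaped attention visualizations.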

---

## **10. Comparison with Other SSL Methods**

| Method  | Teacher Update | Contrastive? | Negatives? | Backbone | Key Feature                  |
| ------- | -------------- | ------------ | ---------- | -------- | ---------------------------- |
| SimCLR  | N/A            | Yes          | Yes        | CNN      | Simple contrastive loss      |
| BYOL    | EMA            | No           | No         | CNN      | Momentum teacher + predictor |
| DINO    | EMA            | No           | No         | ViT/CNN  | Multi-crop self-distillation |
| MoCo v3 | EMA            | Yes          | Yes        | ViT      | In-batch negatives, no queue |
| MAE     | N/A            | No           | No         | ViT      | Masked autoencoding          |

---

## **11. DINOv2 (2023)**

Meta’s **DINOv2** scales up DINO to **foundation-model level**:

* Trained on **142M curated images** (the LVD-142M dataset).
* Models up to **ViT-g/14 (~1.1B parameters)**.
* Its frozen features reach **linear-probe and transfer** performance competitive with weakly supervised image-text models, without fine-tuning.
* Serves as a **general-purpose vision backbone** for downstream tasks such as classification, segmentation, and depth estimation.

| Model    | Layers | Hidden dim | Patch | Params |
| -------- | ------ | ---------- | ----- | ------ |
| ViT-S/14 | 12     | 384        | 14    | ~22M   |
| ViT-B/14 | 12     | 768        | 14    | ~86M   |
| ViT-L/14 | 24     | 1024       | 14    | ~304M  |
| ViT-g/14 | 40     | 1536       | 14    | ~1.1B  |
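
The parameter counts in the table can be sanity-checked with a rough rule of thumb. The sketch below assumes a standard transformer block with an MLP expansion ratio of 4 and ignores patch embeddings, biases, and normalization layers, so it slightly underestimates the real numbers and does not model any architecture-specific changes.

```python
def approx_vit_params(layers, hidden_dim, mlp_ratio=4):
    """Rough parameter count of the transformer blocks only."""
    attention = 4 * hidden_dim ** 2              # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden_dim ** 2        # two linear layers in the MLP
    return layers * (attention + mlp)

configs = [("small", 12, 384), ("base", 12, 768), ("large", 24, 1024), ("giant", 40, 1536)]
for name, layers, dim in configs:
    print(f"{name:>5}: ~{approx_vit_params(layers, dim) / 1e6:.0f}M")
```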

---

## **12. Key Equations Summary**

1. **Teacher update:**
   $$
   \theta_t \leftarrow \tau \theta_t + (1 - \tau) \theta_s
   $$
2. **Centered teacher output:**
   $$
   p_t = \text{softmax}\left(\frac{z_t - c}{T_t}\right)
   $$
3. **Student output:**
   $$
   p_s = \text{softmax}\left(\frac{z_s}{T_s}\right)
   $$
4. **Distillation loss:**
   $$
   \mathcal{L}_{\text{DINO}} = -\sum_i p_t^{(i)} \log p_s^{(i)}
   $$
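
Equations 2 to 4 can be evaluated on toy logits to see how the pieces fit together. This is a minimal sketch: the temperatures (0.1 for the student, 0.04 for the teacher) and the random logits are assumptions chosen for illustration, and `dino_loss` is a name introduced here.

```python
import torch
import torch.nn.functional as F

def dino_loss(z_s, z_t, center, T_s=0.1, T_t=0.04):
    """Cross-entropy between centered, sharpened teacher and student outputs."""
    p_t = F.softmax((z_t.detach() - center) / T_t, dim=-1)   # equation 2 (no teacher gradient)
    log_p_s = F.log_softmax(z_s / T_s, dim=-1)               # log of equation 3
    return -(p_t * log_p_s).sum(dim=-1).mean()               # equation 4, averaged over the batch

torch.manual_seed(0)
z_t = torch.randn(4, 16)                          # teacher logits: batch of 4, K = 16
center = z_t.mean(dim=0, keepdim=True)            # center c, here just the batch mean
z_random = torch.randn(4, 16)                     # unrelated student logits
z_matched = (z_t - center) * (0.1 / 0.04)         # student logits that give p_s == p_t

print("random student: ", dino_loss(z_random, z_t, center).item())
print("matched student:", dino_loss(z_matched, z_t, center).item())
```

The matched student can never score worse than the random one, because cross-entropy is minimized when the two distributions coincide.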

---

## **13. Simplified PyTorch Pseudocode**

```python
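import torch
import torch.nn.functional as F

# Temperature and momentum defaults below are illustrative assumptions.
def dino_step(student, teacher, optimizer, global_crops, local_crops, center,
              T_student=0.1, T_teacher=0.04, tau=0.996, center_momentum=0.9):
    """One training step. Crops are lists of [B, C, H, W] tensors (from a
    multi-crop augmentation); both networks map a crop batch to [B, K] logits."""
    # Forward: the teacher sees only global crops and gets no gradients
    with torch.no_grad():
        teacher_out = [teacher(g) for g in global_crops]
        # Softmax with temperature, after centering
        p_t = [F.softmax((t - center) / T_teacher, dim=-1) for t in teacher_out]
    # The student sees all crops (global first, then local)
    student_out = [student(c) for c in global_crops + local_crops]
    log_p_s = [F.log_softmax(s / T_student, dim=-1) for s in student_out]

    # Cross-entropy loss over all (teacher view, student view) pairs,
    # skipping pairs where both networks saw the same global crop
    terms = [-(pt * lps).sum(dim=-1).mean()
             for i, pt in enumerate(p_t)
             for j, lps in enumerate(log_p_s) if i != j]
    loss = torch.stack(terms).mean()

    # Train student
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        # Update teacher via EMA
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(tau).add_(s, alpha=1 - tau)
        # Update the center as a running mean of teacher outputs
        batch_center = torch.cat(teacher_out).mean(dim=0, keepdim=True)
        center.mul_(center_momentum).add_(batch_center, alpha=1 - center_momentum)
    return loss.item()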
```
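
The EMA update at the end of the pseudocode presupposes a teacher that starts as a copy of the student and is never touched by the optimizer. A minimal sketch of that setup, using a tiny `nn.Linear` as a stand-in for the ViT and a hypothetical helper `ema_update`:

```python
import copy
import torch
import torch.nn as nn

student = nn.Linear(4, 3)                  # stand-in for the student ViT + head
teacher = copy.deepcopy(student)           # teacher starts with identical weights
for p in teacher.parameters():
    p.requires_grad_(False)                # the teacher is never trained by gradients

@torch.no_grad()
def ema_update(teacher, student, tau=0.996):
    """In-place update: theta_t <- tau * theta_t + (1 - tau) * theta_s."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(tau).add_(s, alpha=1 - tau)

# Simulate one optimizer step on the student, then let the teacher follow slowly
old_teacher_weight = teacher.weight.clone()
with torch.no_grad():
    student.weight.add_(1.0)
ema_update(teacher, student, tau=0.996)
print((teacher.weight - old_teacher_weight).abs().max().item())
```

With `tau` close to 1 the teacher moves only a small fraction of the way toward the student at each step, which is what makes it a stable target.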

---

✅ **In summary:**

* Both teacher and student share the same ViT architecture.
* The teacher is a slow EMA of the student (no gradients).
* Centering and sharpening (a low teacher temperature) of the teacher outputs prevent collapse; multi-crop augmentation encourages local-to-global consistency.
* ViT-S/16 achieves strong object-level representation without labels.
* DINOv2 scales this into a **foundation-level visual model** rivaling supervised ones.

---