# **Knowledge Distillation (KD)**

**Knowledge Distillation (KD)** is a training technique in deep learning where a **smaller (student) model** learns to imitate a **larger (teacher) model**.
It was introduced by **Geoffrey Hinton et al. (2015)** in *“Distilling the Knowledge in a Neural Network”*, with the goal of transferring the **“dark knowledge”** (soft, informative distributions) from a large model or ensemble into a compact, efficient model — preserving performance while reducing size and computation.

---

## 1. Motivation

Large models (teachers) have high capacity and accuracy but are computationally expensive to deploy.
Small models (students) are efficient but typically less accurate.

**Knowledge Distillation** bridges this gap by **compressing** the teacher’s knowledge into the student so that:

* The student learns the teacher’s **generalization behavior**.
* The student achieves accuracy close to the teacher while being much smaller.

---

## 2. Core Idea

Instead of learning only from **hard one-hot labels**, the student learns from the **soft output distribution** of the teacher.

### Teacher and Student Outputs

$$
p_t = \text{softmax}\left(\frac{z_t}{T}\right), \qquad
p_s = \text{softmax}\left(\frac{z_s}{T}\right)
$$

where:

* $ z_t, z_s $: logits from teacher and student
* $ T $: **temperature** — controls how smooth the probabilities are

Higher $ T $ (e.g. 2–4) produces **softer distributions**, revealing **inter-class similarities** — the so-called *dark knowledge*.
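
To make the effect of temperature concrete, here is a minimal sketch in plain Python (no framework needed). The helper name `softmax_with_temperature` and the example logits are illustrative assumptions, not values from the original paper.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes
logits = [5.0, 2.0, 0.5]

for T in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As $ T $ grows, the largest probability shrinks and the smaller ones grow, while the ranking of the classes stays the same.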

---

## 3. Loss Function

The student minimizes a **weighted combination** of two losses:

1. **Distillation Loss** — match soft outputs of teacher and student:
   $$
   \mathcal{L}_{KD} = T^2 \cdot \text{KL}(p_t || p_s)
   $$

2. **Cross-Entropy Loss** — match student to hard labels:
   $$
   \mathcal{L}_{CE} = -\sum_i y_i \log p_{s,i}(T=1)
   $$

Final combined objective:

$$
\mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{CE} + (1-\alpha) \mathcal{L}_{KD}
$$

where $ \alpha \in [0,1] $ balances between dataset labels and teacher supervision.
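
The role of the $ T^2 $ factor can be seen from the gradient of the distillation term with respect to the student logits. The short derivation below is a sketch that treats the teacher distribution as a constant and uses the standard softmax derivative; the per-class index notation $ z_{s,i} $, $ p_{t,i} $ is introduced here for illustration.

$$
\frac{\partial}{\partial z_{s,i}} \text{KL}(p_t \,\|\, p_s)
= -\sum_j p_{t,j} \frac{\partial \log p_{s,j}}{\partial z_{s,i}}
= \frac{1}{T}\left(p_{s,i} - p_{t,i}\right)
$$

$$
\frac{\partial \mathcal{L}_{KD}}{\partial z_{s,i}} = T \left(p_{s,i} - p_{t,i}\right)
$$

At high temperature the difference $ p_s - p_t $ itself shrinks roughly like $ 1/T $, so the unscaled gradient decays like $ 1/T^2 $. Multiplying by $ T^2 $ keeps the distillation term on a scale comparable to the cross-entropy term when $ T $ is changed.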

---

## 4. Why Soft Targets Help

Hard labels only tell the model which class is correct.
Soft targets reveal **how confident** the teacher is about all classes.

| Class | Hard Label | Teacher (T=3) |
| ----- | ---------- | ------------- |
| Cat   | 1          | 0.85          |
| Dog   | 0          | 0.10          |
| Fox   | 0          | 0.05          |

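This gives the student a richer training signal than the one-hot label alone:

* The relative probabilities encode similarity: here the teacher considers “Dog” closer to “Cat” than “Fox” is.
* In practice this often leads to **better generalization**, **faster convergence**, and **less overfitting**.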

---

## 5. Classical Knowledge Distillation Setup

### 1. **Teacher is pretrained and frozen**

The teacher is first trained on the dataset and then kept fixed:

$$
z_t = f_{\text{teacher}}(x), \qquad
p_t = \text{softmax}\left(\frac{z_t}{T}\right)
$$

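The teacher's parameters are never updated during distillation, so no gradients need to be computed for them:

$$
\theta_{\text{teacher}}^{(k+1)} = \theta_{\text{teacher}}^{(k)} \quad \text{for every training step } k
$$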

### 2. **Student learns to mimic the teacher**

The student is trained either from scratch or starting from pretrained weights.
For each input sample:

1. Teacher produces soft probabilities (no gradients).
2. Student predicts its own outputs.
3. KD and CE losses are combined.
4. Only the student’s parameters are updated:

$$
\theta_{\text{student}} \leftarrow \theta_{\text{student}} - \eta \frac{\partial \mathcal{L}}{\partial \theta_{\text{student}}}
$$

### 3. **Simplified PyTorch Loop**

```python
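import torch

# Assumes `teacher`, `student`, and `dataloader` already exist, and that
# `distillation_loss` is the function defined in Section 11.
teacher.eval()                  # fix dropout / batch-norm behaviour
teacher.requires_grad_(False)   # freeze all teacher weights

student.train()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for imgs, labels in dataloader:
    with torch.no_grad():       # teacher forward pass only, no graph is built
        teacher_logits = teacher(imgs)
    student_logits = student(imgs)
    loss = distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5)
    opt.zero_grad()
    loss.backward()
    opt.step()                  # only the student's parameters change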
```

---

## 6. Variants of Distillation

| Type                          | Description                                             | Example                     |
| ----------------------------- | ------------------------------------------------------- | --------------------------- |
| **Response-based**            | Student mimics teacher’s output probabilities           | Hinton et al., 2015         |
| **Feature-based**             | Student matches intermediate activations                | FitNets, Attention Transfer |
| **Relation-based**            | Student matches relations between samples/features      | RKD, CRD                    |
| **Online / EMA Distillation** | Teacher and student co-train (teacher = EMA of student) | DINO, BYOL, Mean Teacher    |

---

## 7. EMA Teacher (Used in DINO/BYOL)

In self-distillation (no labels), the teacher is updated as an **Exponential Moving Average (EMA)** of the student:

$$
\theta_t \leftarrow \tau \theta_t + (1-\tau)\theta_s
$$

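* $ \tau \approx 0.996 $ to $ 0.9995 $ controls update smoothness
* The teacher evolves slowly, providing **stable targets**
* Helps avoid representation collapse in label-free training, together with other mechanisms (e.g. centering and sharpening in DINO, the predictor head in BYOL)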

PyTorch implementation:

```python
with torch.no_grad():
    # In-place EMA update: theta_t <- tau * theta_t + (1 - tau) * theta_s
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(tau).add_(s_p, alpha=1 - tau)
```
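
To get a feel for how slowly an EMA teacher moves, the sketch below applies the same update rule to a single scalar "weight" in plain Python. The function name `ema_update` and the toy values are assumptions for illustration only; a real model applies the rule to every parameter tensor.

```python
def ema_update(teacher_value, student_value, tau):
    """One EMA step: theta_t <- tau * theta_t + (1 - tau) * theta_s."""
    return tau * teacher_value + (1.0 - tau) * student_value

tau = 0.996
teacher_w = 0.0      # toy teacher weight
student_w = 1.0      # toy student weight, held fixed here for simplicity

for step in range(1, 1001):
    teacher_w = ema_update(teacher_w, student_w, tau)
    if step in (1, 100, 1000):
        print(f"step {step:4d}: teacher weight = {teacher_w:.4f}")
```

With a fixed student, the remaining gap shrinks as $ \tau^n $ after $ n $ steps, so the teacher effectively averages over roughly $ 1/(1-\tau) = 250 $ recent student states for $ \tau = 0.996 $.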

---

## 8. Numerical Example — Hard vs. Soft Targets

### Setup

Consider a 3-class problem with classes ordered (cat, dog, wolf) and true label “dog”, so the hard label is $ y = [0, 1, 0] $.

### Case A — Hard labels

$$
p_s = [0.1, 0.7, 0.2], \quad
\mathcal{L}_{CE} = -\log(0.7) \approx 0.357
$$

Gradient:

$$
\frac{\partial \mathcal{L}_{CE}}{\partial z_s} = p_s - y = [0.1, -0.3, 0.2]
$$

### Case B — Teacher soft labels

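Teacher (at $ T=1 $): $ p_t = [0.2, 0.6, 0.2] $; distillation temperature $ T=2 $.

Softened (dividing the logits by $ T $ is equivalent to renormalizing $ p^{1/T} $):

$$
p_t(T=2) \approx [0.27, 0.46, 0.27], \quad p_s(T=2) \approx [0.20, 0.52, 0.28]
$$

Distillation gradient (the $ T^2 $ factor in $ \mathcal{L}_{KD} $ leaves an overall factor of $ T $):

$$
\frac{\partial \mathcal{L}_{KD}}{\partial z_s} = T\,\bigl(p_s(T)-p_t(T)\bigr) \approx [-0.14, +0.12, +0.02]
$$

| Class | Hard Gradient | Soft Gradient |
| ----- | ------------- | ------------- |
| Cat   | +0.10         | –0.14         |
| Dog   | –0.30         | +0.12         |
| Wolf  | +0.20         | +0.02         |

The hard-label gradient only raises the correct class and lowers the rest. The distillation gradient instead pulls the student toward the teacher's full distribution: here the student is more confident in “dog” than the teacher is, so the KD term tempers that confidence and shifts probability toward “cat”. Matching relative class probabilities in this way is what regularizes the student and supports generalization.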
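
The quantities in this example can be recomputed directly from the $ T=1 $ probabilities, because $ \text{softmax}(\log p / T) $ is just $ p^{1/T} $ renormalized. The sketch below does this in plain Python. The helper name `soften` and the use of $ \log p $ as stand-in logits are assumptions made for illustration, and the soft gradient includes the factor $ T $ that results from the $ T^2 $ scaling of $ \mathcal{L}_{KD} $.

```python
import math

def soften(probs, T):
    """Re-temper a T=1 distribution: softmax(log(p) / T) == normalized p**(1/T)."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [v / total for v in powered]

y   = [0.0, 1.0, 0.0]      # hard label ("dog")
p_s = [0.1, 0.7, 0.2]      # student probabilities at T=1
p_t = [0.2, 0.6, 0.2]      # teacher probabilities at T=1
T = 2.0

ce_loss   = -math.log(p_s[1])                       # cross-entropy on the true class
hard_grad = [ps - yi for ps, yi in zip(p_s, y)]     # dL_CE / dz_s = p_s - y

p_s_T = soften(p_s, T)
p_t_T = soften(p_t, T)
soft_grad = [T * (ps - pt) for ps, pt in zip(p_s_T, p_t_T)]  # dL_KD / dz_s

print("CE loss   :", round(ce_loss, 3))
print("hard grad :", [round(g, 2) for g in hard_grad])
print("p_t(T=2)  :", [round(p, 2) for p in p_t_T])
print("p_s(T=2)  :", [round(p, 2) for p in p_s_T])
print("soft grad :", [round(g, 2) for g in soft_grad])
```

Both gradient vectors sum to zero, as expected for a loss defined on a softmax output.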

---

## 9. Soft Labels Without a Teacher

You can also construct **soft labels by hand**, without any teacher; this is known as **label smoothing**:

$$
y_i^{\text{smooth}} =
\begin{cases}
1-\varepsilon & \text{if } i=y_{\text{true}} \\
\frac{\varepsilon}{C-1} & \text{otherwise}
\end{cases}
$$

Example ($ \varepsilon=0.1 $, $ C=3 $, true class is the second one):

$$
y^{\text{smooth}} = [0.05, 0.90, 0.05]
$$
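
As a minimal sketch, the smoothing rule above can be written as a small plain-Python function. The name `smooth_labels` and its arguments are hypothetical. Note that it follows the $ \varepsilon/(C-1) $ convention used here, whereas some libraries spread $ \varepsilon $ over all $ C $ classes instead.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Return a smoothed target: 1 - eps on the true class, eps / (C - 1) elsewhere."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == true_class else off_value for i in range(num_classes)]

# Three classes, true class at index 1
target = smooth_labels(true_class=1, num_classes=3, eps=0.1)
print([round(v, 2) for v in target])   # [0.05, 0.9, 0.05]
```

The same vector is produced for every sample of a given class, which is the key contrast with teacher-generated targets.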

### Difference from KD

| Aspect                | Label Smoothing | Knowledge Distillation              |
| --------------------- | --------------- | ----------------------------------- |
| Source                | Manual          | Teacher-generated                   |
| Same for all samples? | Yes             | No                                  |
| Semantic info?        | No              | Yes                                 |
| Data-dependent?       | No              | Yes                                 |
| Effect                | Regularization  | Knowledge transfer + regularization |

Thus:

* **Label smoothing** says: “be a little less certain, and equally so about every wrong class.”
* **Distillation** says: “be uncertain in the *same way the teacher is uncertain*.”

---

## 10. Applications

| Task             | Teacher      | Student       | Benefit                            |
| ---------------- | ------------ | ------------- | ---------------------------------- |
| Classification   | ResNet-152   | ResNet-18     | ~5× fewer parameters; KD narrows the accuracy gap |
| Object Detection | Faster R-CNN | MobileNet-SSD | Real-time on edge devices          |
| NLP              | BERT-base    | DistilBERT    | 40% smaller, ~97% of teacher's GLUE score |
| Self-supervised  | EMA teacher  | DINO, BYOL    | Label-free representation learning |

---

## 11. PyTorch Example (Simplified)

```python
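import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Hard-label term: standard cross-entropy at T = 1
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL(p_t || p_s) on temperature-scaled distributions
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)  # kl_div expects log-probabilities as input
    kd = F.kl_div(log_p_s, p_t, reduction='batchmean') * (T * T)
    return alpha * ce + (1 - alpha) * kd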
```

---

## 12. Summary Intuition

| Concept            | Analogy                           |
| ------------------ | --------------------------------- |
| Teacher            | Expert explaining reasoning       |
| Student            | Learner imitating reasoning       |
| Temperature        | Softness of teacher’s explanation |
| Distillation Loss  | Match reasoning                   |
| Cross-Entropy Loss | Match answers                     |

---

**In short:**
Knowledge Distillation teaches the student **how the teacher thinks**, not just **what it predicts** — transferring structured, data-dependent knowledge that improves efficiency and generalization.

---

