1. **Define teacher and student models**  
   The teacher is typically a large pretrained model; the student is a smaller model trained to reproduce the teacher's behavior.

2. **Softened output with temperature**  
   Use a temperature $T > 1$ to soften the logits before the softmax, for both teacher and student:
   $$
   q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}
   $$
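   where $z_i$ are the logits and $q_i$ is the softened probability of class $i$. Setting $T = 1$ recovers the standard softmax.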

3. **Loss function**  
   Combine:
   - Cross-entropy loss with true labels for the student
   - KL divergence loss between teacher and student softened outputs

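   A common formulation is:
   $$
   L = \alpha \cdot \mathrm{CE}(y, \hat{y}_s) + (1 - \alpha) \cdot T^2 \cdot \mathrm{KL}(q_t \,\|\, q_s)
   $$
   where $\alpha$ balances the two losses, $y$ is the true label, $\hat{y}_s$ is the student output (at $T = 1$), $q_t$ is the teacher's softened output, and $q_s$ is the student's softened output. The factor $T^2$ compensates for the gradients of the softened term scaling as $1/T^2$, so both terms stay comparable when $T$ changes.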

4. **Training loop**  
   For each batch:
   - Get teacher outputs (no gradient needed)
   - Get student outputs
   - Compute distillation loss
   - Backpropagate and optimize student model
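
To make the softening in step 2 concrete, here is a minimal sketch of the temperature-scaled softmax. The helper name `soften` and the logit values are illustrative assumptions, not part of any library or dataset.

```python
import torch
import torch.nn.functional as F


def soften(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    return F.softmax(logits / T, dim=-1)


# Hypothetical logits for a 3-class problem (illustrative values only).
logits = torch.tensor([4.0, 1.0, 0.5])

# Compare the standard softmax (T = 1) with a softened one (T = 4).
for T in (1.0, 4.0):
    q = soften(logits, T)
    print(f"T={T}: {q.tolist()}")
```

A higher temperature spreads probability mass away from the top class while keeping the class ranking unchanged, which is what exposes the teacher's relative preferences among the non-target classes.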

### Step-0 :


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets


In [None]:
# Use the GPU if one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
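
With the imports and device in place, the loss from step 3 and the training loop from step 4 could be sketched as follows. This is a sketch under stated assumptions rather than a reference implementation: the tiny `teacher` and `student` networks, the synthetic batch, and the values `T=4.0`, `alpha=0.5`, and `lr=0.1` are all placeholders chosen for illustration, and `distillation_loss` is a hypothetical helper name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """L = alpha * CE(y, y_s) + (1 - alpha) * T^2 * KL(q_t || q_s)."""
    # Hard-label term: cross-entropy on the unsoftened student logits.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: F.kl_div takes log-probabilities as input and
    # probabilities as target, which yields KL(q_t || q_s).
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1.0 - alpha) * (T ** 2) * kl


# Assumption: stand-in models. A real teacher would be large and pretrained.
torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
student = nn.Linear(20, 10).to(device)
teacher.eval()  # the teacher is not updated during distillation

optimizer = optim.SGD(student.parameters(), lr=0.1)

# Assumption: one synthetic batch in place of a real DataLoader.
inputs = torch.randn(32, 20, device=device)
labels = torch.randint(0, 10, (32,), device=device)

losses = []
student.train()
for step in range(5):
    # Teacher outputs: no gradient needed.
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    # Student outputs.
    student_logits = student(inputs)
    # Distillation loss, then backpropagate and update the student only.
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

Only the student's parameters are handed to the optimizer, and the teacher's forward pass runs under `torch.no_grad()`, so the teacher stays fixed throughout.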

### Step-Last :
0. [Knowledge Distillation: Principles, Algorithms, Applications](https://neptune.ai/blog/knowledge-distillation)
1. [Knowledge Distillation for Beginners using PyTorch](https://docs.pytorch.org/tutorials/beginner/knowledge_distillation_tutorial.html)
2. [KD-PyTorch part-1](https://www.kaggle.com/code/shivangitomar/knowledge-distillation-part-1-pytorch)
3. [KD-PyTorch part-2](https://www.kaggle.com/code/shivangitomar/knowledge-distillation-part-2-pytorch/)
4. [KD-PyTorch GitHub benchmarks](https://github.com/haitongli/knowledge-distillation-pytorch)
5. [YouTube - KD](https://youtu.be/l44uC7jfnvY?feature=shared)