
---

# 📘 What is Knowledge Distillation (KD)?

Knowledge Distillation is a technique to **compress a large, powerful model (teacher)** into a **smaller, faster model (student)** without losing much accuracy.

* **Teacher Model**: A big neural network (e.g., ResNet50, BERT-large) trained on a dataset. It gives strong predictions but is slow and heavy.
* **Student Model**: A smaller neural network (e.g., ResNet18, DistilBERT) that we want to train. It learns not only from ground truth labels but also from the **teacher’s soft predictions** (class probabilities computed from the teacher’s logits).

The key is that the **teacher’s soft labels contain "dark knowledge"**: a full probability distribution over classes rather than just 0/1 labels. That distribution reveals how similar the teacher considers the classes to be, which helps the student generalize better.
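
To make the idea of "dark knowledge" concrete, here is a minimal sketch in plain Python. The class names and logits are made-up illustrative values, not outputs of a real teacher.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one image whose true class is "cat".
classes = ["cat", "dog", "truck"]
teacher_logits = [5.0, 3.0, -1.0]

hard_label = [1.0, 0.0, 0.0]           # one-hot: says nothing about the wrong classes
soft_label = softmax(teacher_logits)   # teacher's full distribution over classes

for name, hard, soft in zip(classes, hard_label, soft_label):
    print(f"{name:>5}: hard={hard:.1f}  soft={soft:.4f}")
```

The soft label ranks "dog" far above "truck" as the second choice. That similarity signal is exactly what a one-hot label cannot express.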

---

# ⚙️ Loss Function in KD

We combine **two losses**:

1. **Hard Loss (CrossEntropy with true labels)**
   Student learns from actual dataset labels.

   $$
   L_{CE} = CE(y_{true}, y_{student})
   $$

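2. **Soft Loss (KL Divergence with teacher’s soft predictions)**
   Student mimics teacher’s distribution using a temperature-scaled softmax.

   $$
   L_{KD} = T^2 \cdot KLDiv(\sigma(z_{teacher}/T) \,\|\, \sigma(z_{student}/T))
   $$

   where $\sigma$ is the softmax, $z$ are the logits, and $T$ = temperature (usually 2–5). Higher $T$ makes the probabilities softer. The $T^2$ factor keeps the soft-loss gradients on a scale comparable to the hard loss, because they shrink roughly as $1/T^2$.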

3. **Final Loss**:

   $$
   L = \alpha L_{CE} + (1-\alpha) L_{KD}
   $$

   where α balances the two terms: a larger α puts more weight on the true labels, a smaller α puts more weight on the teacher's guidance.
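
The three formulas above can be worked through by hand. The following is a minimal sketch in plain Python for a single example; the logits are hypothetical, and the `T * T` factor mirrors the PyTorch implementation later in this document.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, true_idx):
    """Hard loss: negative log-probability of the true class (T = 1)."""
    return -math.log(softmax(student_logits)[true_idx])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(student_logits, teacher_logits, true_idx, T=4.0, alpha=0.5):
    """alpha * CE + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    ce = cross_entropy(student_logits, true_idx)
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kd = (T * T) * kl_divergence(p_teacher, p_student)
    return alpha * ce + (1 - alpha) * kd

# Hypothetical logits for a single 3-class example (true class = index 0).
teacher_logits = [5.0, 3.0, -1.0]
student_logits = [2.0, 2.5, 0.0]

# Higher T flattens the teacher's distribution.
for T in (1.0, 2.0, 5.0):
    probs = softmax(teacher_logits, T)
    print(f"T={T}: teacher probs = {[round(p, 3) for p in probs]}")

print("combined loss:", round(kd_loss(student_logits, teacher_logits, 0), 4))
```

Setting `alpha=1.0` reduces the result to plain cross-entropy, while `alpha=0.0` leaves only the teacher-matching term, which is zero when the student's logits equal the teacher's.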




### **🚀 Implementation in PyTorch (Fine-tuning with Knowledge Distillation)**

Here’s the code in Jupyter Notebook style:

In [None]:
# 📌 Step 1: Imports
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# Check GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using:", device)


### ***🔹 Step 2: Data Preparation (CIFAR-10 Example)***

In [None]:
# Data transforms
transform = transforms.Compose([
    transforms.Resize((224,224)),   # ResNet input size
    transforms.ToTensor(),
    # ImageNet mean/std, matching the statistics the pretrained weights were trained with
    transforms.Normalize((0.485,0.456,0.406), (0.229,0.224,0.225))
])

# CIFAR-10 dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
test_dataset  = datasets.CIFAR10(root='./data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=64, shuffle=False)


### **🔹 Step 3: Define Teacher & Student Models**
### ***Teacher = ResNet50 (pretrained on ImageNet)***
### ***Student = ResNet18 (smaller)***

In [None]:
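# Teacher Model (large), ImageNet weights via the torchvision >= 0.13 API
teacher_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
teacher_model.fc = nn.Linear(teacher_model.fc.in_features, 10)  # CIFAR-10 has 10 classes
teacher_model = teacher_model.to(device)
# NOTE: the new fc head is randomly initialized. Fine-tune the teacher on CIFAR-10
# (or load a fine-tuned checkpoint) before distilling, or its logits will be uninformative.

# Student Model (smaller)
student_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
student_model.fc = nn.Linear(student_model.fc.in_features, 10)
student_model = student_model.to(device)

# Freeze teacher (it only provides targets during distillation)
for param in teacher_model.parameters():
    param.requires_grad = False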


### **🔹 Step 4: Define Knowledge Distillation Loss**

In [None]:
class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super(DistillationLoss, self).__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = nn.CrossEntropyLoss()
        self.kl_loss = nn.KLDivLoss(reduction="batchmean")

    def forward(self, student_logits, teacher_logits, labels):
        # Hard loss (true labels)
        ce = self.ce_loss(student_logits, labels)

        # Soft loss (teacher guidance with temperature scaling)
        T = self.temperature
        student_soft = F.log_softmax(student_logits / T, dim=1)
        teacher_soft = F.softmax(teacher_logits / T, dim=1)
        kd = self.kl_loss(student_soft, teacher_soft) * (T*T)

        # Combine losses
        return self.alpha * ce + (1. - self.alpha) * kd
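
A common pitfall with `nn.KLDivLoss` is its argument convention: by default the first argument must be log-probabilities and the second plain probabilities. The sketch below uses random logits as stand-ins for real model outputs and checks that convention against a hand-written KL(teacher || student).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Random stand-ins for a batch of 8 examples with 10 classes (not real model outputs).
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
T = 4.0

log_p_student = F.log_softmax(student_logits / T, dim=1)  # input: log-probabilities
p_teacher = F.softmax(teacher_logits / T, dim=1)          # target: probabilities

# Library version: sums over classes, then divides by the batch size.
kl_library = nn.KLDivLoss(reduction="batchmean")(log_p_student, p_teacher)

# Manual KL(teacher || student), averaged over the batch.
kl_manual = (p_teacher * (p_teacher.log() - log_p_student)).sum(dim=1).mean()

print(kl_library.item(), kl_manual.item())
```

If the two numbers agree, the `log_softmax` / `softmax` pairing used in `DistillationLoss` is wired the right way round. Passing plain probabilities as the first argument would silently compute something else.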


### **🔹 Step 5: Training Loop with KD**

In [None]:
def train_kd(student, teacher, train_loader, optimizer, criterion, epoch):
    student.train()
    teacher.eval()
    total_loss = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        student_logits = student(images)
        with torch.no_grad():
            teacher_logits = teacher(images)

        # Loss
        loss = criterion(student_logits, teacher_logits, labels)

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch}, Loss: {total_loss/len(train_loader):.4f}")


### **🔹 Step 6: Evaluation Function**

In [None]:
def evaluate(model, test_loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, preds = torch.max(outputs, 1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return 100 * correct / total


### **🔹 Step 7: Run Training**

In [None]:
# Optimizer
optimizer = optim.Adam(student_model.parameters(), lr=1e-4)

# Criterion
criterion = DistillationLoss(temperature=4.0, alpha=0.7)

# Train Student with KD
for epoch in range(5):  # (for demo, increase epochs)
    train_kd(student_model, teacher_model, train_loader, optimizer, criterion, epoch)
    acc = evaluate(student_model, test_loader)
    print(f"Test Accuracy: {acc:.2f}%")


# 🎯 Key Takeaways

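* **KD trains a small student to retain most of a large teacher's accuracy**, so the smaller model can be deployed in its place.
* **Temperature (T)** softens the probabilities so the student can see how the teacher ranks the non-target classes.
* **α** balances between true labels and teacher guidance (larger α = more weight on true labels).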
* KD is useful in **model compression**, **edge deployment**, and **efficient inference**.
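
One simple way to quantify the compression is to count parameters. The sketch below is an illustration only: `count_parameters` is a hypothetical helper, and the two small `nn.Sequential` networks are stand-ins for a real teacher and student.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of parameters in a model."""
    return sum(p.numel() for p in model.parameters())

# Hypothetical stand-ins for a large teacher and a small student.
teacher = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))
student = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 10))

n_teacher = count_parameters(teacher)
n_student = count_parameters(student)
print(f"teacher: {n_teacher:,} params, student: {n_student:,} params")
print(f"compression ratio: {n_teacher / n_student:.1f}x")
```

In the notebook, calling `count_parameters(teacher_model)` and `count_parameters(student_model)` gives the same comparison for ResNet50 and ResNet18.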

---

### ***A complete notebook for learning Knowledge Distillation + fine-tuning in one shot***
---

# 📓 Knowledge Distillation + Fine-tuning — Full Teaching Notebook

# 🧠 Knowledge Distillation + Fine-tuning

This notebook explains **Knowledge Distillation (KD)** with **fine-tuning** in a simple and practical way.

---

## 📘 What is Knowledge Distillation?

- **Teacher Model (big)** → accurate but slow & heavy  
- **Student Model (small)** → fast & lightweight but less accurate  
- **Goal** → Train student to mimic teacher’s "dark knowledge" (soft predictions).

---

### 📌 KD Loss Function

We combine two parts into one final loss:

1. **Hard Loss (CE)** → Ground truth labels  
2. **Soft Loss (KL)** → Teacher soft predictions (with temperature T)  
3. **Final Loss**:  
   `Loss = α * CE(student, labels) + (1-α) * T² * KL(teacher/T ‖ student/T)`

- `T` (temperature) → softens probability distribution  
- `α` → balance between ground truth & teacher guidance

---

## 🔹 Step 1: Imports

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using:", device)
```

---

## 🔹 Step 2: Dataset (CIFAR-10)

```python
transform = transforms.Compose([
    transforms.Resize((224,224)),
    transforms.ToTensor(),
    # ImageNet mean/std, matching the statistics the pretrained weights were trained with
    transforms.Normalize((0.485,0.456,0.406), (0.229,0.224,0.225))
])

train_dataset = datasets.CIFAR10(root="./data", train=True, transform=transform, download=True)
test_dataset  = datasets.CIFAR10(root="./data", train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=64, shuffle=False)
```

---

## 🔹 Step 3: Teacher & Student Models

```python
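# Teacher (large, ResNet50), ImageNet weights via the torchvision >= 0.13 API
teacher_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
teacher_model.fc = nn.Linear(teacher_model.fc.in_features, 10)
teacher_model = teacher_model.to(device)
# NOTE: the new fc head is randomly initialized. Fine-tune the teacher on CIFAR-10
# (or load a fine-tuned checkpoint) before distilling, or its logits will be uninformative.

# Student (small, ResNet18)
student_model_ce = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)   # normal fine-tuning
student_model_ce.fc = nn.Linear(student_model_ce.fc.in_features, 10)
student_model_ce = student_model_ce.to(device)

student_model_kd = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)   # KD fine-tuning
student_model_kd.fc = nn.Linear(student_model_kd.fc.in_features, 10)
student_model_kd = student_model_kd.to(device)

# Freeze teacher (it only provides targets during distillation)
for p in teacher_model.parameters():
    p.requires_grad = False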
```
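
Because the teacher's final layer is replaced with a fresh 10-class head, the teacher should be fine-tuned on the target dataset before it is frozen; otherwise its soft predictions carry little signal. The following is a minimal sketch of that step. The helper name `finetune_then_freeze`, the toy model, and the random data are all illustrative assumptions, not part of the original notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def finetune_then_freeze(teacher, loader, epochs=1, lr=1e-3, device="cpu"):
    """Fine-tune `teacher` with cross-entropy, then freeze it for distillation."""
    teacher.to(device)
    for p in teacher.parameters():
        p.requires_grad = True          # make sure the teacher is trainable
    optimizer = optim.Adam(teacher.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    teacher.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(teacher(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Freeze: from here on the teacher only provides targets.
    for p in teacher.parameters():
        p.requires_grad = False
    teacher.eval()
    return teacher

# Tiny synthetic demo (random data and a toy model, purely illustrative).
torch.manual_seed(0)
data = TensorDataset(torch.randn(64, 20), torch.randint(0, 10, (64,)))
demo_loader = DataLoader(data, batch_size=16)
demo_teacher = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
demo_teacher = finetune_then_freeze(demo_teacher, demo_loader, epochs=2)
print(all(not p.requires_grad for p in demo_teacher.parameters()))
```

In the notebook, the same helper could be called as `finetune_then_freeze(teacher_model, train_loader, device=device)` before any student is trained, and `evaluate(teacher_model, test_loader)` can confirm the teacher is well above chance.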

---

## 🔹 Step 4: Define Loss Functions

```python
ce_loss = nn.CrossEntropyLoss()

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()
        self.kl = nn.KLDivLoss(reduction="batchmean")
    
    def forward(self, student_logits, teacher_logits, labels):
        ce = self.ce(student_logits, labels)
        T = self.temperature
        student_soft = F.log_softmax(student_logits / T, dim=1)
        teacher_soft = F.softmax(teacher_logits / T, dim=1)
        kd = self.kl(student_soft, teacher_soft) * (T*T)
        return self.alpha * ce + (1-self.alpha) * kd
```

---

## 🔹 Step 5: Training Functions

```python
def train_student_ce(model, loader, optimizer, epochs=3):
    model.train()
    losses = []
    for epoch in range(epochs):
        total_loss = 0
        for x,y in loader:
            x,y = x.to(device), y.to(device)
            out = model(x)
            loss = ce_loss(out, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        losses.append(total_loss/len(loader))
        print(f"Epoch {epoch}: CE Loss {losses[-1]:.4f}")
    return losses

def train_student_kd(student, teacher, loader, optimizer, criterion, epochs=3):
    student.train(); teacher.eval()
    losses = []
    for epoch in range(epochs):
        total_loss = 0
        for x,y in loader:
            x,y = x.to(device), y.to(device)
            s_out = student(x)
            with torch.no_grad():
                t_out = teacher(x)
            loss = criterion(s_out, t_out, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        losses.append(total_loss/len(loader))
        print(f"Epoch {epoch}: KD Loss {losses[-1]:.4f}")
    return losses
```

---

## 🔹 Step 6: Evaluation Function

```python
def evaluate(model, loader):
    model.eval()
    correct, total = 0,0
    with torch.no_grad():
        for x,y in loader:
            x,y = x.to(device), y.to(device)
            out = model(x)
            _,pred = torch.max(out,1)
            correct += (pred==y).sum().item()
            total += y.size(0)
    return 100*correct/total
```

---

## 🔹 Step 7: Run Training (CE vs KD)

```python
# Optimizers
opt_ce = optim.Adam(student_model_ce.parameters(), lr=1e-4)
opt_kd = optim.Adam(student_model_kd.parameters(), lr=1e-4)

# KD loss
kd_loss = DistillationLoss(temperature=4.0, alpha=0.7)

# Train
losses_ce = train_student_ce(student_model_ce, train_loader, opt_ce, epochs=3)
losses_kd = train_student_kd(student_model_kd, teacher_model, train_loader, opt_kd, kd_loss, epochs=3)

# Evaluate
acc_ce = evaluate(student_model_ce, test_loader)
acc_kd = evaluate(student_model_kd, test_loader)

print(f"Student (Normal Fine-tuning): {acc_ce:.2f}%")
print(f"Student (KD Fine-tuning): {acc_kd:.2f}%")
```

---

## 🔹 Step 8: Visualization of Loss Curves

```python
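# The curves track different objectives (CE vs. the combined KD loss): compare trends, not absolute values.
plt.plot(losses_ce, label="CE Student")
plt.plot(losses_kd, label="KD Student")
plt.xlabel("Epoch"); plt.ylabel("Training loss"); plt.legend(); plt.show()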
```

---

## 🔹 Step 9: Prediction Comparison (Teacher vs Student)

```python
import numpy as np

def show_predictions(images, labels, teacher, student_ce, student_kd):
    teacher.eval(); student_ce.eval(); student_kd.eval()
    with torch.no_grad():
        x = images.to(device)  # move the batch to the device once
        t_pred = torch.argmax(teacher(x), dim=1)
        s_ce_pred = torch.argmax(student_ce(x), dim=1)
        s_kd_pred = torch.argmax(student_kd(x), dim=1)
    for i in range(len(images)):
        # CHW -> HWC, then rescale to [0, 1]: normalized tensors fall outside imshow's valid range
        img = images[i].permute(1, 2, 0).cpu().numpy()
        img = (img - img.min()) / (img.max() - img.min() + 1e-8)
        plt.imshow(img)
        plt.title(f"Label:{labels[i]} | Teacher:{t_pred[i]} | CE:{s_ce_pred[i]} | KD:{s_kd_pred[i]}")
        plt.show()

data_iter = iter(test_loader)
images, labels = next(data_iter)
show_predictions(images[:5], labels[:5], teacher_model, student_model_ce, student_model_kd)
```

---

## 🔹 Step 10: Hyperparameter Experiment (T and α)

```python
temperatures = [1,2,4,10]
alphas = [0.3,0.5,0.7,0.9]

# (Optional: Loop over T & alpha, train small runs, record accuracy)
# Plot: Accuracy vs Temperature & Accuracy vs Alpha
```
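
The optional loop over `T` and `α` could look like the sketch below. It is deliberately decoupled from training: `grid_search` and `dummy_trial` are hypothetical names, and `dummy_trial` returns a made-up score so the example runs instantly.

```python
from itertools import product

def grid_search(run_trial, temperatures, alphas):
    """Run `run_trial(T, alpha)` for every combination and collect the scores."""
    results = {}
    for T, alpha in product(temperatures, alphas):
        results[(T, alpha)] = run_trial(T, alpha)
    best = max(results, key=results.get)
    return results, best

def dummy_trial(T, alpha):
    """Placeholder returning a fake score; replace with real training + evaluation."""
    return 100.0 - abs(T - 4) - 10 * abs(alpha - 0.5)

temperatures = [1, 2, 4, 10]
alphas = [0.3, 0.5, 0.7, 0.9]
results, best = grid_search(dummy_trial, temperatures, alphas)
print("best (T, alpha):", best, "score:", results[best])
```

In the notebook, a real `run_trial` would build a fresh ResNet18 student, train it with `train_student_kd` using `DistillationLoss(temperature=T, alpha=alpha)`, and return `evaluate(student, test_loader)`. Starting from a fresh student for each combination keeps the runs comparable.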

---

# 🎯 Summary

- Knowledge Distillation transfers "dark knowledge" from Teacher → Student.
- A student trained with KD often outperforms the same student fine-tuned with CE alone, but the gain depends on the teacher's quality and on the choice of T and α; compare `acc_ce` and `acc_kd` from Step 7 to check.
- **Temperature (T)** and **Alpha (α)** are key hyperparameters.
- KD is useful for **model compression** and **deploying lightweight models**.

✅ That covers the essentials of **KD + Fine-tuning**.

---

⚡ Running this notebook end to end gives you a **student-friendly and complete KD tutorial**.

