# Knowledge Distillation on MNIST
Knowledge distillation is the process of transferring the performance of a larger, more expensive model to a smaller one. In this notebook, we will explore this process on MNIST. To begin with, I have provided access to a pre-trained model that is large but performant. The exact architecture is not relevant (although you can easily inspect it if you wish). It is straightforward to load in PyTorch with

In [1]:
import torch
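# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'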

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.l1 = torch.nn.Linear(28**2,800)
        self.l2 = torch.nn.Linear(800,800)
        self.l3 = torch.nn.Linear(800,10)
        self.dropout2 = torch.nn.Dropout(0.5)
        self.dropout3 = torch.nn.Dropout(0.5)

    def forward(self, x):
        x = self.l1(x)
        x = torch.relu(x)
        x = self.dropout2(x)
        x = self.l2(x)
        x = torch.relu(x)
        x = self.dropout3(x)
        x = self.l3(x)
        return x
    
big_model = torch.load('pretrained_model.pt').to(device)

First, let's establish the baseline performance of the big model on the MNIST test set. Of course, we'll need access to the MNIST test set to do this. At the same time, let's also get our transfer set, which in this case will be an $n=10$k subset of the full MNIST training set (using a subset speeds up training of the distilled models and also helps showcase some of the improved performance due to model distillation).

In [3]:
from torchvision import transforms, datasets
transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
    ])

dataset_train = datasets.MNIST('./data', train=True, download=True, transform=transform)

dataset_test = datasets.MNIST('./data', train=False, download=True, transform=transform)

# torch.utils.data.Subset wraps a dataset and exposes only the given indices
first_10k = list(range(0, 10000))
dataset_transfer = torch.utils.data.Subset(dataset_train, first_10k)

batch_size = 32
num_workers = 4
transfer_loader = torch.utils.data.DataLoader(dataset_transfer,batch_size=batch_size,num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(dataset_test,batch_size=batch_size,num_workers=num_workers)

Here's a function that runs a model in evaluation mode and returns the number of correctly classified examples.

In [4]:
def test(model,test_loader):
    correct = 0
    counter = 0
    model.eval()
    with torch.no_grad():
        for data,target in test_loader:
            data, target = data.to(device), target.to(device)
            data = data.reshape(data.shape[0],-1)
            logits = model(data)
            pred = logits.argmax(dim=1,keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
            # count the examples actually seen; the last batch may be smaller than batch_size
            counter += target.shape[0]
    return correct

num_correct = test(big_model,test_loader)
print(f"Large Network Accuracy: {10000-num_correct} wrong {num_correct}/10,000 = {(num_correct/10000)*100:0.4}%")

Large Network Accuracy: 167 wrong 9833/10,000 = 98.33%


We find that the big model gets 167 examples wrong (not quite as good as the Hinton paper, but who cares). 

Now we would like to perform knowledge distillation by training a smaller model to approximate the larger model's performance on the transfer set.  First, let's build a smaller model.  You may use whatever architecture you choose, but I found that using two hidden layers, each with 200 units along with ReLU activations (and no regularization at all) worked fine.

In [5]:
class SmallNet(torch.nn.Module):
    def __init__(self):
        super(SmallNet, self).__init__()
        # One hidden layer of 400 units with light dropout (p=0.2)
        self.l1 = torch.nn.Linear(28**2,400)
        self.l2 = torch.nn.Linear(400,10)
        self.dropout1 = torch.nn.Dropout(0.2)

    def forward(self, x):
        # Linear -> ReLU -> dropout -> linear; returns raw logits
        x = self.l1(x)
        x = torch.relu(x)
        x = self.dropout1(x)
        x = self.l2(x)
        
        return x
    
small_model = SmallNet()
small_model.to(device)

SmallNet(
  (l1): Linear(in_features=784, out_features=400, bias=True)
  (l2): Linear(in_features=400, out_features=10, bias=True)
  (dropout1): Dropout(p=0.2, inplace=False)
)

**To establish a baseline performance level, train the small model on the transfer set**  

In [6]:
# I'm giving you this training function: you'll need to modify it below to do knowledge distillation
def train(model,train_loader,n_epochs):
    optimizer = torch.optim.Adam(model.parameters(),1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for epoch in range(n_epochs):
        avg_l = 0.0
        counter = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            data = data.reshape(data.shape[0],-1)
            optimizer.zero_grad()
            logits = model(data)
            L = loss_fn(logits,target)
            L.backward()
            optimizer.step()
            with torch.no_grad():
                avg_l += L
                counter += 1
        print(epoch,avg_l/counter)

train(small_model,transfer_loader,50)

0 tensor(0.4276, device='cuda:0')
1 tensor(0.1962, device='cuda:0')
2 tensor(0.1329, device='cuda:0')
3 tensor(0.0956, device='cuda:0')
4 tensor(0.0723, device='cuda:0')
5 tensor(0.0566, device='cuda:0')
6 tensor(0.0453, device='cuda:0')
7 tensor(0.0382, device='cuda:0')
8 tensor(0.0315, device='cuda:0')
9 tensor(0.0275, device='cuda:0')
10 tensor(0.0340, device='cuda:0')
11 tensor(0.0252, device='cuda:0')
12 tensor(0.0237, device='cuda:0')
13 tensor(0.0296, device='cuda:0')
14 tensor(0.0337, device='cuda:0')
15 tensor(0.0253, device='cuda:0')
16 tensor(0.0155, device='cuda:0')
17 tensor(0.0182, device='cuda:0')
18 tensor(0.0112, device='cuda:0')
19 tensor(0.0194, device='cuda:0')
20 tensor(0.0277, device='cuda:0')
21 tensor(0.0252, device='cuda:0')
22 tensor(0.0181, device='cuda:0')
23 tensor(0.0137, device='cuda:0')
24 tensor(0.0162, device='cuda:0')
25 tensor(0.0143, device='cuda:0')
26 tensor(0.0279, device='cuda:0')
27 tensor(0.0217, device='cuda:0')
28 tensor(0.0139, device='cuda

**Evaluate the small model on the test set, and comment on its accuracy relative to the big model.**  As you might expect, the performance is relatively worse.  

In [7]:
small_acc = test(small_model,test_loader)
big_acc = test(big_model,test_loader)

print(f"Large Network Accuracy: {10000-big_acc} wrong {big_acc}/10,000 = {(big_acc/10000)*100:0.4}%")
print(f"Small Network Accuracy: {10000-small_acc} wrong {small_acc}/10,000 = {(small_acc/10000)*100:0.4}%")


Large Network Accuracy: 167 wrong 9833/10,000 = 98.33%
Small Network Accuracy: 425 wrong 9575/10,000 = 95.75%


The small network makes roughly 2.5 times as many errors as the big one (425 vs. 167), which corresponds to 95.75% vs. 98.33% accuracy. That is considerably worse, but not terrible. This is probably because the original big network is itself not that complex compared with our small model.
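To put the size difference between the two networks into numbers, here is a minimal sketch that counts the trainable parameters of each architecture from the layer sizes defined in `Net` and `SmallNet` above. The helper `linear_params` is a hypothetical name introduced only for this illustration; dropout layers have no parameters and are ignored.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Number of parameters in a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# Layer sizes copied from the Net and SmallNet definitions above
big_params = (linear_params(28**2, 800)
              + linear_params(800, 800)
              + linear_params(800, 10))
small_params = linear_params(28**2, 400) + linear_params(400, 10)

print(f"Big model:   {big_params:,} parameters")
print(f"Small model: {small_params:,} parameters")
print(f"Ratio:       {big_params / small_params:.2f}x")
```

By this count the big model has about four times as many parameters as the small one, which is a modest gap by distillation standards.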

**The primary task of this notebook is now as follows: create a new training function similar to "train" above, but instead called "distill".**  "distill" should perform knowledge distillation as outlined in this week's paper.  It should accept a few additional arguments compared to train, namely the big model, the temperature hyperparameter, and a hyperparameter $\alpha$ that weights the relative magnitude of the soft target loss and the hard target loss.
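For intuition about the temperature hyperparameter, the following minimal sketch shows how dividing the logits by $T$ before the softmax flattens the resulting distribution. The logits are made-up values for a single three-class example, and `soften` is a hypothetical helper name, not part of the notebook.

```python
import torch

def soften(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Temperature-scaled softmax over the class dimension."""
    return torch.softmax(logits / T, dim=1)

# Hypothetical logits for one example with three classes
logits = torch.tensor([[8.0, 2.0, -1.0]])

for T in (1.0, 5.0, 20.0):
    probs = soften(logits, T).squeeze().tolist()
    print(f"T={T:>4}: " + ", ".join(f"{p:.4f}" for p in probs))
```

At $T=1$ almost all of the probability mass sits on the first class. As $T$ grows, the other classes receive more weight while the ranking of the classes stays the same, and it is this relative information about the wrong classes that the small model can learn from.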

In [8]:
distilled_model = SmallNet()
distilled_model.to(device)

# distill follows the structure of the train method above, but additionally uses
# the big_model, temperature, and alpha values to perform knowledge distillation
def distill(small_model,big_model,T,alpha,transfer_loader,n_epochs):
    # Soft targets are the big model's class probabilities at temperature T.
    # The soft loss is the cross entropy between those targets and the
    # small model's temperature-scaled logits.
    softMax = torch.nn.Softmax(dim=1)
    
    optimizer = torch.optim.Adam(small_model.parameters(),1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    small_model.train()
    big_model.eval()
    
    for epoch in range(n_epochs):
        avg_l = 0.0
        counter = 0
        for batch_idx, (data, target) in enumerate(transfer_loader):
            data, target = data.to(device), target.to(device)
            data = data.reshape(data.shape[0],-1)
            optimizer.zero_grad()
            
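            # evaluate the batch with both models; the big model is frozen,
            # so no gradients are needed for it
            with torch.no_grad():
                big_logits = big_model(data)
            small_logits = small_model(data)
            
            # soften the big model's predictions with the temperature
            soft_targets = softMax(big_logits/T)
            
            # soft loss: cross entropy against the soft targets
            # (class-probability targets require PyTorch >= 1.10)
            # NOTE: Hinton et al. recommend scaling this term by T**2;
            # that scaling is not applied here
            L_soft = loss_fn(small_logits/T,soft_targets)
            
            # hard loss: cross entropy against the true labels
            L_hard = loss_fn(small_logits,target)

            # alpha weights the hard loss, (1-alpha) the soft loss
            L = alpha*L_hard + (1-alpha)*L_soft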

            L.backward()
            optimizer.step()
            
            with torch.no_grad():
                avg_l += L
                counter += 1
                
        print(epoch,avg_l/counter)
        
T = 20
alpha = 1e-1
distill(distilled_model,big_model,T,alpha,transfer_loader,50)

0 tensor(1.7695, device='cuda:0')
1 tensor(1.6145, device='cuda:0')
2 tensor(1.5627, device='cuda:0')
3 tensor(1.5324, device='cuda:0')
4 tensor(1.5143, device='cuda:0')
5 tensor(1.5016, device='cuda:0')
6 tensor(1.4920, device='cuda:0')
7 tensor(1.4861, device='cuda:0')
8 tensor(1.4812, device='cuda:0')
9 tensor(1.4774, device='cuda:0')
10 tensor(1.4738, device='cuda:0')
11 tensor(1.4718, device='cuda:0')
12 tensor(1.4703, device='cuda:0')
13 tensor(1.4683, device='cuda:0')
14 tensor(1.4678, device='cuda:0')
15 tensor(1.4655, device='cuda:0')
16 tensor(1.4648, device='cuda:0')
17 tensor(1.4639, device='cuda:0')
18 tensor(1.4631, device='cuda:0')
19 tensor(1.4622, device='cuda:0')
20 tensor(1.4621, device='cuda:0')
21 tensor(1.4607, device='cuda:0')
22 tensor(1.4600, device='cuda:0')
23 tensor(1.4591, device='cuda:0')
24 tensor(1.4586, device='cuda:0')
25 tensor(1.4583, device='cuda:0')
26 tensor(1.4586, device='cuda:0')
27 tensor(1.4593, device='cuda:0')
28 tensor(1.4587, device='cuda
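
The `distill` function above combines the two losses directly. Hinton et al. note that the gradients from the soft targets scale as $1/T^2$ and suggest multiplying the soft loss by $T^2$ when it is mixed with the hard loss. The following is a minimal sketch of that variant as a standalone loss function, not the version used to produce the results in this notebook. The name `distillation_loss` and the random tensors are assumptions made for illustration, and $\alpha$ weights the hard loss as in `distill`.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T, alpha):
    """Weighted sum of hard-label and soft-target cross entropy.

    alpha weights the hard loss and (1 - alpha) the soft loss.
    """
    # Teacher probabilities and student log-probabilities at temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)

    # Cross entropy between the two distributions, averaged over the batch
    soft_loss = -(soft_targets * log_student).sum(dim=1).mean()

    # Ordinary cross entropy against the true labels (temperature 1)
    hard_loss = F.cross_entropy(student_logits, target)

    # T**2 compensates for the 1/T**2 scaling of the soft-target gradients
    return alpha * hard_loss + (1 - alpha) * (T ** 2) * soft_loss

# Random stand-ins for a batch of 4 examples and 10 classes
torch.manual_seed(0)
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
target = torch.tensor([3, 1, 4, 1])

loss = distillation_loss(student_logits, teacher_logits, target, T=20.0, alpha=0.1)
loss.backward()
print(loss.item(), student_logits.grad.shape)
```

With this scaling the loss values are no longer comparable to the ones printed above, and good values for $T$ and $\alpha$ would likely need to be tuned again.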

**Finally, test your distilled model (on the test set) and describe how it performs relative to both big and small models.**

In [9]:
dist_acc = test(distilled_model,test_loader)

print(f"Large Network Accuracy: {10000-big_acc} wrong {big_acc}/10,000 = {(big_acc/10000)*100:0.4}%")
print(f"Small Network Accuracy: {10000-small_acc} wrong {small_acc}/10,000 = {(small_acc/10000)*100:0.4}%")
print(f"Distilled Network Accuracy: {10000-dist_acc} wrong {dist_acc}/10,000 = {(dist_acc/10000)*100:0.4}%")

Large Network Accuracy: 167 wrong 9833/10,000 = 98.33%
Small Network Accuracy: 425 wrong 9575/10,000 = 95.75%
Distilled Network Accuracy: 324 wrong 9676/10,000 = 96.76%


The distilled model performs better than the small model trained on hard labels alone, although it still falls well short of the big model. The large network has an accuracy of 98.33%, the small model 95.75%, and the distilled model 96.76%, so distillation removes about 100 test errors (324 wrong instead of 425). Since the two small networks share the same architecture and transfer set, the gain can be attributed largely to the soft targets, which shows how powerful this technique can be, even on a simple example.
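
To make the comparison concrete, this short sketch works out two summary figures from the error counts printed above: the fraction of the small model's errors that distillation removes, and the fraction of the gap between the small and big models that it closes. The variable names are introduced only for this illustration.

```python
# Test-set error counts taken from the printed results above
big_wrong, small_wrong, dist_wrong = 167, 425, 324

errors_removed = small_wrong - dist_wrong                 # 101
relative_reduction = errors_removed / small_wrong         # share of small-model errors removed
gap_closed = errors_removed / (small_wrong - big_wrong)   # share of the small-to-big gap closed

print(f"Errors removed by distillation: {errors_removed}")
print(f"Relative error reduction:       {relative_reduction:.1%}")  # 23.8%
print(f"Gap to big model closed:        {gap_closed:.1%}")          # 39.1%
```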