# Knowledge Distillation on MNIST
Knowledge distillation is the process of transferring the performance of a large, expensive model to a smaller one.  In this notebook, we will explore this process on MNIST.  To begin with, I have provided access to a pre-trained model that is large but performant.  The exact architecture is not relevant (although you can inspect it easily if you wish).  It is straightforward to load in PyTorch:

In [6]:
# Gina Mazza


import torch
import torch.nn as nn
device = 'cpu'

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.l1 = torch.nn.Linear(28**2,800)
        self.l2 = torch.nn.Linear(800,800)
        self.l3 = torch.nn.Linear(800,10)
        self.dropout2 = torch.nn.Dropout(0.5)
        self.dropout3 = torch.nn.Dropout(0.5)

    def forward(self, x):
        x = self.l1(x)
        x = torch.relu(x)
        x = self.dropout2(x)
        x = self.l2(x)
        x = torch.relu(x)
        x = self.dropout3(x)
        x = self.l3(x)
        return x
    
big_model = torch.load('pretrained_model.pt').to(device)

First, let's establish the baseline performance of the big model on the MNIST test set.  Of course, we'll need access to the MNIST test set to do this.  At the same time, let's also get our transfer set, which in this case will be an $n=10$k subset of the full MNIST training set (using a subset speeds up training of the distilled models, and also helps showcase some of the improved performance due to model distillation).

In [2]:
from torchvision import transforms, datasets
transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
    ])

dataset_train = datasets.MNIST('./data', train=True, download=True, transform=transform)

dataset_test = datasets.MNIST('./data', train=False, download=True, transform=transform)

# This is a useful function that I didn't know about before
first_10k = list(range(0, 10000))
dataset_transfer = torch.utils.data.Subset(dataset_train, first_10k)

batch_size = 32
num_workers = 4
transfer_loader = torch.utils.data.DataLoader(dataset_transfer,batch_size=batch_size,num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(dataset_test,batch_size=batch_size,num_workers=num_workers)

Here's a function that runs a model in evaluation mode and returns the number of correctly classified examples:

In [3]:
def test(model,test_loader):
    correct = 0
    model.eval()  # disable dropout for evaluation
    with torch.no_grad():
        for data,target in test_loader:
            data, target = data.to(device), target.to(device)
            # Flatten each 28x28 image into a 784-vector
            data = data.reshape(data.shape[0],-1)
            logits = model(data)
            # Predicted class = index of the largest logit
            pred = logits.argmax(dim=1,keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    return correct

test(big_model,test_loader)

9833

We find that the big model gets 167 examples wrong (not quite as good as the Hinton paper, but who cares). 
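Since `test` returns a raw count, it can help to convert it into an error rate. The snippet below is a small sketch of that arithmetic; the helper name `error_rate` is hypothetical, and the default of 10,000 assumes the standard MNIST test set size.

```python
def error_rate(n_correct, n_total=10_000):
    """Fraction of misclassified examples (n_total assumed to be the MNIST test-set size)."""
    return (n_total - n_correct) / n_total

big_correct = 9833  # count returned by test(big_model, test_loader) above
print(f"errors: {10_000 - big_correct}, error rate: {error_rate(big_correct):.2%}")
# prints "errors: 167, error rate: 1.67%"
```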

Now we would like to perform knowledge distillation by training a smaller model to approximate the larger model's performance on the transfer set.  First, let's build a smaller model.  You may use whatever architecture you choose, but I found that using two hidden layers, each with 200 units along with ReLU activations (and no regularization at all) worked fine.

In [4]:
class SmallNet(torch.nn.Module):
    def __init__(self):
        super(SmallNet, self).__init__()
        # A single hidden layer of 128 units (smaller than the suggested two layers of 200)
        self.l1 = nn.Linear(784, 128)
        self.l2 = nn.Linear(128, 10)

    def forward(self, x):
        # Don't forget to put the right operations here too!
        a1 = self.l1(torch.flatten(x,start_dim=1))
        z1 = torch.relu(a1)
        
        a2 = self.l2(z1)
        
        return a2 

 
    
small_model = SmallNet()
small_model.to(device)

SmallNet(
  (l1): Linear(in_features=784, out_features=128, bias=True)
  (l2): Linear(in_features=128, out_features=10, bias=True)
)
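To get a feel for how much smaller this model is, the parameter counts can be worked out by hand. This is only a sketch: it assumes the pre-trained model really has the 784-800-800-10 layout of the `Net` class above, and `linear_params` is a hypothetical helper, not a PyTorch function.

```python
def linear_params(n_in, n_out):
    """Parameters in a fully connected layer: one weight per (input, output) pair plus one bias per output."""
    return n_in * n_out + n_out

# Assumed big-model layout, taken from the Net class: 784 -> 800 -> 800 -> 10
big = linear_params(784, 800) + linear_params(800, 800) + linear_params(800, 10)
# SmallNet as built in this notebook: 784 -> 128 -> 10
small = linear_params(784, 128) + linear_params(128, 10)

print(big, small, round(big / small, 1))
# prints "1276810 101770 12.5"
```

Under that assumption, the small model has roughly one twelfth as many parameters as the big one.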

**To establish a baseline performance level, train the small model on the transfer set**  

In [5]:
# I'm giving you this training function: you'll need to modify it below to do knowledge distillation
def train(model,train_loader,n_epochs):
    optimizer = torch.optim.Adam(model.parameters(),1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for epoch in range(n_epochs):
        avg_l = 0.0
        counter = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            data = data.reshape(data.shape[0],-1)
            optimizer.zero_grad()
            logits = model(data)
            L = loss_fn(logits,target)
            L.backward()
            optimizer.step()
            with torch.no_grad():
                avg_l += L
                counter += 1
        print(epoch,avg_l/counter)

train(small_model,transfer_loader,50)

0 tensor(0.4609)
1 tensor(0.2205)
2 tensor(0.1522)
3 tensor(0.1085)
4 tensor(0.0781)
5 tensor(0.0549)
6 tensor(0.0388)
7 tensor(0.0286)
8 tensor(0.0204)
9 tensor(0.0158)
10 tensor(0.0140)
11 tensor(0.0114)
12 tensor(0.0142)
13 tensor(0.0240)
14 tensor(0.0226)
15 tensor(0.0110)
16 tensor(0.0084)
17 tensor(0.0073)
18 tensor(0.0066)
19 tensor(0.0087)
20 tensor(0.0130)
21 tensor(0.0088)
22 tensor(0.0118)
23 tensor(0.0032)
24 tensor(0.0007)
25 tensor(0.0007)
26 tensor(0.0004)
27 tensor(0.0002)
28 tensor(0.0001)
29 tensor(0.0001)
30 tensor(0.0001)
31 tensor(9.9492e-05)
32 tensor(8.8442e-05)
33 tensor(7.9089e-05)
34 tensor(7.0739e-05)
35 tensor(6.3290e-05)
36 tensor(5.6534e-05)
37 tensor(5.0532e-05)
38 tensor(4.4993e-05)
39 tensor(4.0092e-05)
40 tensor(3.5377e-05)
41 tensor(3.1382e-05)
42 tensor(2.7865e-05)
43 tensor(2.4640e-05)
44 tensor(2.1876e-05)
45 tensor(1.9115e-05)
46 tensor(1.6773e-05)
47 tensor(1.4735e-05)
48 tensor(1.3037e-05)
49 tensor(1.1255e-05)


**Evaluate the small model on the test set, and comment on its accuracy relative to the big model.**  As you might expect, the performance is relatively worse.  

Yes: the small model misclassifies 400 test examples, compared with 167 for the big model.

In [7]:
test(small_model,test_loader)

9600

**The primary task of this notebook is now as follows: create a new training function similar to "train" above, but instead called "distill".**  "distill" should perform knowledge distillation as outlined in this week's paper.  It should accept a few additional arguments compared to train, namely the big model, the temperature hyperparameter, and a hyperparameter $\alpha$ that weights the relative magnitude of the soft target loss and the hard target loss.
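To see what the temperature hyperparameter does, here is a minimal pure-Python sketch of a temperature-scaled softmax. The function name `softmax_T` and the logits are made up for illustration; they do not come from the big model.

```python
import math

def softmax_T(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a flatter (softer) distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.0]                 # hypothetical logits for three classes
p_sharp = softmax_T(logits, T=1.0)       # nearly one-hot
p_soft = softmax_T(logits, T=8.0)        # probability mass spread over all classes
print([round(p, 3) for p in p_sharp])
print([round(p, 3) for p in p_soft])
```

At $T=1$ almost all of the probability sits on the top class; at $T=8$ the smaller logits receive a noticeable share, which is the extra information the soft targets pass on to the small model.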

In [16]:
distilled_model = SmallNet()
distilled_model.to(device)

# The body of this method is currently copied verbatim from the train method above: 
# you will need to modify it to utilize the big_model, temperature, and alpha values 
# to perform knowledge distillation


def distill(small_model,big_model,T,alpha,transfer_loader,n_epochs):
    optimizer = torch.optim.Adam(small_model.parameters(),1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    small_model.train()

    
    for epoch in range(n_epochs):
        avg_l = 0.0
        counter = 0
        for batch_idx, (data, target) in enumerate(transfer_loader):
            
            data, target = data.to(device), target.to(device)
            data = data.reshape(data.shape[0],-1)
            optimizer.zero_grad()
            
            
            logitsB = big_model(data)
            logitsS = small_model(data)
            
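            # Soft targets: the big model's class probabilities at temperature T
            softTarget = torch.nn.functional.softmax(logitsB/T, dim=1)
            # CrossEntropyLoss applies log-softmax itself, so logitsS/T is passed unnormalized
            L0 = loss_fn((logitsS/T), softTarget)
            # Hard-target loss against the true labels
            L1 = loss_fn(logitsS, target)
            
            # alpha weights the hard-target loss, (1-alpha) the soft-target loss
            L = ((1-alpha)*L0) + (alpha * L1)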
            
            L.backward()
            optimizer.step()
            with torch.no_grad():
                avg_l += L
                counter += 1
        print(epoch,avg_l/counter)
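T = 8
# NOTE: with alpha = 10 the soft-target weight (1 - alpha) is -9, so the soft-target
# loss is maximized rather than minimized, which is why the logged loss is negative.
# alpha is normally chosen in [0, 1].
alpha = 10
distill(distilled_model,big_model,T,alpha,transfer_loader,50)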

0 tensor(-12.1571)
1 tensor(-14.0227)
2 tensor(-14.5835)
3 tensor(-14.9200)
4 tensor(-15.1595)
5 tensor(-15.3314)
6 tensor(-15.4675)
7 tensor(-15.5780)
8 tensor(-15.6656)
9 tensor(-15.7316)
10 tensor(-15.7877)
11 tensor(-15.8348)
12 tensor(-15.8723)
13 tensor(-15.9047)
14 tensor(-15.9307)
15 tensor(-15.9538)
16 tensor(-15.9750)
17 tensor(-15.9972)
18 tensor(-16.0122)
19 tensor(-16.0239)
20 tensor(-16.0385)
21 tensor(-16.0496)
22 tensor(-16.0621)
23 tensor(-16.0732)
24 tensor(-16.0816)
25 tensor(-16.0892)
26 tensor(-16.0962)
27 tensor(-16.1041)
28 tensor(-16.1080)
29 tensor(-16.1098)
30 tensor(-16.1052)
31 tensor(-16.1077)
32 tensor(-16.0728)
33 tensor(-16.0859)
34 tensor(-16.1195)
35 tensor(-16.1829)
36 tensor(-16.2655)
37 tensor(-16.4236)
38 tensor(-16.5240)
39 tensor(-16.6827)
40 tensor(-16.8394)
41 tensor(-17.0197)
42 tensor(-17.3307)
43 tensor(-17.5763)
44 tensor(-17.9785)
45 tensor(-18.3520)
46 tensor(-18.7798)
47 tensor(-19.3766)
48 tensor(-19.9839)
49 tensor(-20.5636)
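A cross-entropy loss cannot be negative on its own, so the negative values in this log must come from the weighting. The sketch below reproduces the arithmetic of `L = ((1-alpha)*L0) + (alpha * L1)` with made-up loss values; `combined` is a hypothetical helper and the numbers are not taken from the run above.

```python
def combined(l_soft, l_hard, alpha):
    """Same weighting as in distill: (1 - alpha) on the soft loss, alpha on the hard loss."""
    return (1 - alpha) * l_soft + alpha * l_hard

l_soft, l_hard = 2.0, 0.5                     # hypothetical non-negative losses
print(combined(l_soft, l_hard, alpha=10))     # -13.0: the soft-target weight is -9
print(combined(l_soft, l_hard, alpha=0.5))    # 1.25: both weights are positive
```

With `alpha = 10`, lowering the total means raising the soft-target loss, so the optimizer is rewarded for disagreeing with the big model.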


**Finally, test your distilled model (on the test set) and describe how it performs relative to both big and small models.**


The big model outperformed both other models (9833 correct), and the small model (9600) outperformed the distilled model (9285). I expected the distilled model to land in the middle: better than the small model, because it learns from the big model's outputs rather than from the labels alone, but not quite as good as the big model. I would also expect it to move closer to the big model with more training. Since it came in last instead, I probably did something wrong in `distill`. One likely culprit is `alpha = 10`: the soft-target term is then weighted by $1-\alpha = -9$, so training pushes the small model away from the big model's predictions, which would also explain the negative loss values in the training log.

In [20]:
distilled = test(distilled_model, test_loader)
bigModel = test(big_model,test_loader)
smallModel = test(small_model,test_loader)

print(f'Big Model = {bigModel}, Small Model = {smallModel}, Distilled Model = {distilled}')

Big Model = 9833, Small Model = 9600, Distilled Model = 9285
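For comparison, here is a hedged sketch of how the distillation loss might be written so that it stays non-negative. It is not the code that produced the numbers above and has not been run on MNIST here. It assumes `alpha` lies in $[0, 1]$, keeps the same convention as `distill` (`alpha` on the hard-target loss), and multiplies the soft-target term by $T^2$ as suggested in the Hinton paper; the name `distillation_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T, alpha):
    """alpha * hard-target cross-entropy + (1 - alpha) * T^2 * soft-target cross-entropy."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    # Teacher probabilities at temperature T; detached so no gradient reaches the teacher
    soft_target = F.softmax(teacher_logits.detach() / T, dim=1)
    # Cross-entropy between the teacher's softened distribution and the student's
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    soft_loss = -(soft_target * log_p_student).sum(dim=1).mean()
    # Ordinary cross-entropy against the true labels (temperature 1)
    hard_loss = F.cross_entropy(student_logits, target)
    # T**2 keeps the soft-target gradients on a comparable scale as T changes
    return alpha * hard_loss + (1.0 - alpha) * (T ** 2) * soft_loss

# Illustrative call on random tensors (not MNIST data)
torch.manual_seed(0)
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
target = torch.tensor([3, 1, 4, 1])
loss = distillation_loss(student_logits, teacher_logits, target, T=8.0, alpha=0.5)
loss.backward()
print(loss.item())
```

Inside `distill`, this could stand in for the lines that compute `L0`, `L1`, and `L`, with `big_model.eval()` called once beforehand so that dropout is off while the soft targets are generated. Whether that closes the gap to the big model would have to be checked by rerunning the notebook.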
