# Knowledge Distillation on MNIST
Knowledge distillation is the process of transferring the higher performance of a more expensive model to a smaller one.  In this notebook, we will explore performing this process on MNIST.  To begin with, I have provided access to a pre-trained model that is large but performant.  The exact architecture is not relevant (although you can inspect it easily if you wish).  It is straightforward to load in PyTorch with the following:

In [29]:
import torch
device = 'cpu'

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.l1 = torch.nn.Linear(28**2,800)
        self.l2 = torch.nn.Linear(800,800)
        self.l3 = torch.nn.Linear(800,10)
        self.dropout2 = torch.nn.Dropout(0.5)
        self.dropout3 = torch.nn.Dropout(0.5)

    def forward(self, x):
        x = self.l1(x)
        x = torch.relu(x)
        x = self.dropout2(x)
        x = self.l2(x)
        x = torch.relu(x)
        x = self.dropout3(x)
        x = self.l3(x)
        return x
    
big_model = torch.load('pretrained_model.pt').to(device)

First, let's establish the baseline performance of the big model on the MNIST test set.  Of course, we'll need access to the MNIST test set to do this.  At the same time, let's also get our transfer set, which in this case will be an $n=10$k subset of the full MNIST training set (using a subset speeds up training of the distilled models, and also helps showcase some of the improved performance due to model distillation).

In [30]:
from torchvision import transforms, datasets
transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
    ])

dataset_train = datasets.MNIST('./data', train=True, download=True, transform=transform)

dataset_test = datasets.MNIST('../data', train=False, download=True, transform=transform)

# torch.utils.data.Subset wraps a dataset and exposes only the listed indices
first_10k = list(range(0, 10000))
dataset_transfer = torch.utils.data.Subset(dataset_train, first_10k)

batch_size = 32
num_workers = 4
transfer_loader = torch.utils.data.DataLoader(dataset_transfer,batch_size=batch_size,num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(dataset_test,batch_size=batch_size,num_workers=num_workers)

Here's a function that runs a model in evaluation mode and returns the number of correctly classified examples.

In [31]:
def test(model,test_loader):
    # Count correct top-1 predictions over the whole loader
    correct = 0
    model.eval()  # disable dropout for evaluation
    with torch.no_grad():
        for data,target in test_loader:
            data, target = data.to(device), target.to(device)
            data = data.reshape(data.shape[0],-1)  # flatten 28x28 images to 784-vectors
            logits = model(data)
            pred = logits.argmax(dim=1,keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    return correct

test(big_model,test_loader)

9833

We find that the big model gets 167 examples wrong (not quite as good as the Hinton paper, but who cares). 

Now we would like to perform knowledge distillation by training a smaller model to approximate the larger model's performance on the transfer set.  First, let's build a smaller model.  You may use whatever architecture you choose, but I found that using two hidden layers, each with 200 units along with ReLU activations (and no regularization at all) worked fine.

In [48]:
class SmallNet(torch.nn.Module):
    def __init__(self, width):
        super(SmallNet, self).__init__()
        self.l1 = torch.nn.Linear(28**2,width)
        self.l2 = torch.nn.Linear(width,width)
        self.l3 = torch.nn.Linear(width,10)
        self.dropout2 = torch.nn.Dropout(0.5)
        self.dropout3 = torch.nn.Dropout(0.5)

    def forward(self, x):
        # Only one hidden layer is used: l2 and dropout2 are defined above but never called
        x = self.l1(x)
        x = torch.relu(x) 
        x = self.dropout3(x)
        x = self.l3(x)
        return x
    
small_model = SmallNet(200)
small_model.to(device)

SmallNet(
  (l1): Linear(in_features=784, out_features=200, bias=True)
  (l2): Linear(in_features=200, out_features=200, bias=True)
  (l3): Linear(in_features=200, out_features=10, bias=True)
  (dropout2): Dropout(p=0.5, inplace=False)
  (dropout3): Dropout(p=0.5, inplace=False)
)
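To make the size difference concrete, the parameter counts can be worked out by hand from the layer shapes. The sketch below is plain Python with hypothetical helper names (`linear_params`, `mlp_params`); it assumes every layer is fully connected with a bias, as in the `Net` and `SmallNet` definitions above. Note that `SmallNet` registers `l2` even though its `forward` skips it, so the count of parameters actually used is smaller than the count PyTorch would report.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total parameters of a stack of fully connected layers."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# Big model: 784 -> 800 -> 800 -> 10
big = mlp_params([784, 800, 800, 10])

# Small model as actually executed in forward(): 784 -> 200 -> 10
small_used = mlp_params([784, 200, 10])

# l2 (200 -> 200) is registered on the module but unused in forward()
small_registered = small_used + linear_params(200, 200)

print("big model parameters:        ", big)
print("small model (used):          ", small_used)
print("small model (registered):    ", small_registered)
print("size ratio (big / small used):", round(big / small_used, 2))
```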

**To establish a baseline performance level, train the small model on the transfer set**  

In [49]:
# I'm giving you this training function: you'll need to modify it below to do knowledge distillation
def train(model,train_loader,n_epochs):
    optimizer = torch.optim.Adam(model.parameters(),1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    
    for epoch in range(n_epochs):
        avg_l = 0.0
        counter = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            data = data.reshape(data.shape[0],-1)
            optimizer.zero_grad()
            logits = model(data)
            L = loss_fn(logits,target)
            L.backward()
            optimizer.step()
            with torch.no_grad():
                avg_l += L
                counter += 1
        print(epoch,avg_l/counter)

train(small_model,transfer_loader,50)

0 tensor(0.5658)
1 tensor(0.3119)
2 tensor(0.2413)
3 tensor(0.2031)
4 tensor(0.1780)
5 tensor(0.1646)
6 tensor(0.1477)
8 tensor(0.1179)
9 tensor(0.1190)
10 tensor(0.1058)
11 tensor(0.1015)
12 tensor(0.0983)
13 tensor(0.0901)
14 tensor(0.0910)
15 tensor(0.0830)
16 tensor(0.0835)
17 tensor(0.0703)
19 tensor(0.0711)
20 tensor(0.0825)
21 tensor(0.0706)
22 tensor(0.0665)
23 tensor(0.0679)
24 tensor(0.0635)
25 tensor(0.0621)
26 tensor(0.0610)
27 tensor(0.0507)
28 tensor(0.0638)
29 tensor(0.0620)
30 tensor(0.0530)
31 tensor(0.0564)
32 tensor(0.0521)
33 tensor(0.0572)
34 tensor(0.0539)
35 tensor(0.0496)
36 tensor(0.0536)
37 tensor(0.0529)
38 tensor(0.0449)
39 tensor(0.0471)
40 tensor(0.0503)
41 tensor(0.0553)
42 tensor(0.0561)
43 tensor(0.0459)
44 tensor(0.0390)
45 tensor(0.0424)
46 tensor(0.0422)
47 tensor(0.0422)
48 tensor(0.0445)
49 tensor(0.0434)


**Evaluate the small model on the test set, and comment on its accuracy relative to the big model.**  As you might expect, the performance is relatively worse.  

In [50]:
test(small_model,test_loader)

9600

For the small model, I reduced the hidden layer width to 200 units and removed the second hidden layer entirely. Not surprisingly, performance dropped noticeably: the small model correctly classifies only 9600 test examples compared to the large model's 9833 (a drop of 2.33 percentage points in accuracy).
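The comparison can be checked with a few lines of arithmetic. This is a minimal sketch with a hypothetical helper (`summarize`), assuming the standard 10,000-example MNIST test set and the correct counts reported above.

```python
TEST_SET_SIZE = 10_000  # standard MNIST test set size (assumed)


def summarize(correct, total=TEST_SET_SIZE):
    """Return (accuracy, number of errors) for a count of correct predictions."""
    return correct / total, total - correct


big_acc, big_errors = summarize(9833)
small_acc, small_errors = summarize(9600)

# Absolute gap in accuracy, expressed in percentage points
gap_points = (big_acc - small_acc) * 100

print(f"big model:   {big_acc:.2%} accuracy, {big_errors} errors")
print(f"small model: {small_acc:.2%} accuracy, {small_errors} errors")
print(f"gap: {gap_points:.2f} percentage points")
```

Viewed as error counts rather than accuracy, the gap looks larger: the small model makes more than twice as many mistakes as the big one.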




**The primary task of this notebook is now as follows: create a new training function similar to "train" above, but instead called "distill".**  "distill" should perform knowledge distillation as outlined in this week's paper.  It should accept a few additional arguments compared to train, namely the big model, the temperature hyperparameter, and a hyperparameter $\alpha$ that weights the relative magnitude of the soft target loss and the hard target loss.
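To build some intuition for the temperature hyperparameter, here is a small plain-Python sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the example logits are made up for illustration; the point is only that dividing the logits by a larger $T$ spreads probability mass away from the top class, which is what makes the soft targets informative.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for a 3-class problem (not taken from the big model)
logits = [8.0, 2.0, 0.5]

for T in (1.0, 3.0, 6.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}:", [round(p, 3) for p in probs])
```

At $T=1$ nearly all of the mass sits on the largest logit; as $T$ grows, the smaller classes receive a visibly larger share while the ranking of the classes stays the same.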

In [58]:
distilled_model = SmallNet(150)
distilled_model.to(device)

# Adapted from the train method above: the loss now mixes a hard-target term (true labels)
# with a soft-target term (the big model's predicted probabilities), weighted by alpha.
# Note: unlike the paper, the teacher's softmax here is taken at temperature 1, and the
# hard-target term also uses logits/T rather than the unscaled logits.
def distill(small_model,big_model,T,alpha,transfer_loader,n_epochs):
    optimizer = torch.optim.Adam(small_model.parameters(),1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    softmax = torch.nn.Softmax(dim=1)

    small_model.train()
    big_model.eval()  # keep the teacher's dropout disabled

    for epoch in range(n_epochs):
        avg_l = 0.0
        counter = 0

        for batch_idx, (data, target) in enumerate(transfer_loader):
            data, target = data.to(device), target.to(device)             
            data = data.reshape(data.shape[0],-1)
            # The teacher only supplies targets, so no gradients are needed for it
            with torch.no_grad():
                soft_target = softmax(big_model(data))

            optimizer.zero_grad()
            logits = small_model(data)

            # Hard-target loss against the true labels
            L = loss_fn(logits/T, target)
            # Soft-target loss against the teacher's probabilities
            # (CrossEntropyLoss accepts probability targets in PyTorch >= 1.10)
            soft_L = loss_fn(logits/T, soft_target)
            L = (1-alpha)*L + alpha*soft_L

            L.backward()
            optimizer.step()

            with torch.no_grad():
                avg_l += L
                counter += 1

        print(epoch,avg_l/counter)

T = 6
alpha = 0.8

distill(distilled_model,big_model,T,alpha,transfer_loader,50)

0 tensor(0.8310)
1 tensor(0.4011)
2 tensor(0.3265)
3 tensor(0.2843)
4 tensor(0.2529)
5 tensor(0.2220)
6 tensor(0.2035)
7 tensor(0.1903)
8 tensor(0.1741)
9 tensor(0.1698)
10 tensor(0.1605)
11 tensor(0.1450)
12 tensor(0.1337)
13 tensor(0.1307)
14 tensor(0.1228)
15 tensor(0.1194)
16 tensor(0.1108)
17 tensor(0.1088)
18 tensor(0.1026)
19 tensor(0.0968)
20 tensor(0.0931)
21 tensor(0.0946)
22 tensor(0.0900)
23 tensor(0.0878)
24 tensor(0.0844)
25 tensor(0.0799)
26 tensor(0.0797)
27 tensor(0.0777)
28 tensor(0.0738)
29 tensor(0.0756)
30 tensor(0.0725)
31 tensor(0.0684)
32 tensor(0.0665)
33 tensor(0.0659)
34 tensor(0.0649)
35 tensor(0.0659)
36 tensor(0.0612)
37 tensor(0.0641)
38 tensor(0.0647)
39 tensor(0.0574)
40 tensor(0.0558)
41 tensor(0.0587)
42 tensor(0.0573)
43 tensor(0.0553)
44 tensor(0.0565)
45 tensor(0.0580)
46 tensor(0.0559)
48 tensor(0.0542)
49 tensor(0.0543)


**Finally, test your distilled model (on the test set) and describe how it performs relative to both big and small models.**

In [59]:
test(distilled_model,test_loader)

9613

Initially, I used an alpha value of 0.9 and a temperature of 3, resulting in 9603 correct predictions, which is hardly an improvement. Increasing the temperature to 6 and lowering alpha to 0.8 resulted in 9613 correct predictions. While this is still not competitive with the large model (9833), the distilled model edges out the small baseline (9600) even though its hidden layer is narrower (150 units versus 200), so the benefits of knowledge distillation are apparent.
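One thing that might be worth trying next: the `distill` function above applies the softmax to the big model's logits without dividing by $T$, and it divides the small model's logits by $T$ in the hard-target term as well. The Hinton paper instead softens both teacher and student with the same temperature for the soft term, keeps the hard term at $T=1$, and multiplies the soft term by $T^2$ to keep gradient magnitudes comparable. Below is a minimal sketch of such a loss; the function name `distillation_loss` and the random tensors are illustrative assumptions, and I have not tested whether this changes the results reported above.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, target, T, alpha):
    """Weighted sum of a soft-target and a hard-target cross-entropy.

    Assumed convention (matching distill above): alpha weights the soft term.
    """
    # Teacher and student are softened with the same temperature
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # Cross-entropy against soft targets, rescaled by T^2
    soft_loss = -(soft_targets * log_student).sum(dim=1).mean() * T**2
    # Ordinary cross-entropy against the true labels at temperature 1
    hard_loss = F.cross_entropy(student_logits, target)
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Illustrative usage on random tensors (batch of 4, 10 classes)
torch.manual_seed(0)
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)  # would come from big_model under torch.no_grad()
target = torch.randint(0, 10, (4,))

loss = distillation_loss(student_logits, teacher_logits, target, T=6.0, alpha=0.8)
loss.backward()
print(loss.item())
```

Inside the training loop, this would replace the three lines that compute `L`, `soft_L`, and their weighted sum, with `teacher_logits = big_model(data)` computed under `torch.no_grad()`.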