# Distillation
This notebook shows how to use the tool to perform knowledge distillation: a trained MNIST teacher model is distilled into a smaller student model.

## Set Up
* Import dependencies
* Import data loaders
* Import models

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms
import torch.optim as optim
import torch.nn.functional as F

# Add the project root to the import path so that `src` and `models` can be imported
sys.path.append("../")

import src.general as general
import src.compression.distillation as distill
import src.metrics as metrics
import src.evaluation as eval  # note: this alias shadows Python's built-in eval()
from models.mnist import *  # provides MnistSmallLinear, used as the student below

In [3]:
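# Load MNIST dataset
batch_size = 8
test_batch_size = 1000
use_cuda = False

# Extra DataLoader options are only applied when running on a GPU
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
mnist_transform = transforms.ToTensor()
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True, transform=mnist_transform),
    batch_size=batch_size, shuffle=True, **kwargs)
# Note: train=True means this loader also reads the 60,000-image training split,
# so the "test" figures below are measured on training data.
# Pass train=False to evaluate on the 10,000-image held-out split instead.
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True, transform=mnist_transform),
    batch_size=test_batch_size, shuffle=True, **kwargs)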

In [4]:
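model_state = "../models/mnist.pt"

device = general.get_device()
# The file holds the entire pickled model (not just a state_dict), so the result is used directly.
# Recent PyTorch releases default torch.load to weights_only=True; on those versions, loading a
# full model like this needs weights_only=False (only do that for files you trust).
teacher_model = torch.load(model_state, map_location=torch.device(device))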

Using cuda: False


## Distillation
The original model acts as the teacher model. 

For the student model, the user can either supply a model architecture of their own, defined in a `.py` file, or use the tool to design a student model automatically.
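
As an illustration of the first option, a user-supplied student architecture might look like the sketch below. The class name `TinyStudent` and its layer sizes are assumptions made for this example; this is not the definition of `MnistSmallLinear`, which lives in `models/mnist.py` and is not shown in this notebook. With a 50-unit hidden layer the sketch has 39,760 parameters, which happens to equal the student parameter count reported in the Evaluation section, but that alone does not confirm the two architectures are the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyStudent(nn.Module):
    """Hypothetical two-layer student for 28x28 MNIST images (illustrative only)."""

    def __init__(self, hidden: int = 50, num_classes: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.flatten(x, start_dim=1)  # (N, 1, 28, 28) -> (N, 784)
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # raw class scores (logits)


student = TinyStudent()
dummy_batch = torch.zeros(8, 1, 28, 28)  # same shape as one training batch in this notebook
logits = student(dummy_batch)
n_params = sum(p.numel() for p in student.parameters())
print(logits.shape, n_params)
```

A smaller `hidden` value shrinks the student further, at the likely cost of accuracy.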

In [5]:
# Load the student model
student_model = MnistSmallLinear()

In [6]:
# Test the performance of the student model before training
general.test(student_model, device, test_loader, criterion=F.nll_loss, epoch=1, metric=metrics.accuracy)

Test: 100%|██████████| 60/60 [00:01<00:00, 33.16it/s]

Average loss = 0.0080
Accuracy = 0.1214
Elapsed time = 1819.75 milliseconds (30.33 per batch)
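
The accuracy of 0.1214 is close to the 10% chance level for ten classes, as expected from an untrained model. For reference, a classification accuracy metric can be computed roughly as in the sketch below. The function name `batch_accuracy` is hypothetical, and the project's `metrics.accuracy` may be implemented differently.

```python
import torch


def batch_accuracy(outputs: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of samples whose highest-scoring class matches the label."""
    predictions = outputs.argmax(dim=1)  # index of the largest score per row
    return (predictions == targets).float().mean().item()


# Four samples, three classes; the third sample is misclassified.
outputs = torch.tensor([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.9, 0.8, 0.1],
                        [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 1, 1, 2])
acc = batch_accuracy(outputs, targets)
print(acc)
```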





In [7]:

epochs = 3
lr = 0.01

# Important: optimize the student model's parameters, not the teacher's
optimizer = optim.Adam(student_model.parameters(), lr=lr)
distil_criterion = F.mse_loss     # used for the distillation loss
eval_criterion = F.cross_entropy  # used for the test loss reported after each epoch

distill.distillation_train_loop(
    teacher_model, student_model, train_loader, test_loader,
    distil_criterion, eval_criterion, optimizer, epochs)

Distillation Training: 100%|██████████| 7500/7500 [00:14<00:00, 525.12it/s]
Distillation Validation: 100%|██████████| 60/60 [00:01<00:00, 34.64it/s]


Epoch: 0
Distillation loss: 5.476218223571777
Test loss: 0.24435090273618698, Test accuracy: 0.92655


Distillation Training: 100%|██████████| 7500/7500 [00:13<00:00, 536.90it/s]
Distillation Validation: 100%|██████████| 60/60 [00:01<00:00, 34.97it/s]


Epoch: 1
Distillation loss: 4.238290786743164
Test loss: 0.23007701511184375, Test accuracy: 0.92995


Distillation Training: 100%|██████████| 7500/7500 [00:13<00:00, 545.32it/s]
Distillation Validation: 100%|██████████| 60/60 [00:01<00:00, 34.39it/s]

Epoch: 2
Distillation loss: 5.2310590744018555
Test loss: 0.19692060003678005, Test accuracy: 0.94
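
To make the training loop above more concrete, here is a minimal sketch of what a single distillation step with an MSE criterion could look like. It assumes the loss is computed directly between student and teacher outputs. The function `distillation_step` and the stand-in models are hypothetical, and the internals of `distill.distillation_train_loop` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


def distillation_step(teacher, student, inputs, optimizer, criterion=F.mse_loss):
    """One optimization step that pulls the student's outputs toward the teacher's."""
    teacher.eval()
    student.train()
    with torch.no_grad():  # the teacher is frozen, so no gradients are needed
        teacher_out = teacher(inputs)
    student_out = student(inputs)
    loss = criterion(student_out, teacher_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # updates only the parameters registered with the optimizer
    return loss.item()


torch.manual_seed(0)
# Stand-in models for illustration; the notebook uses the loaded teacher and MnistSmallLinear.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
optimizer = optim.Adam(student.parameters(), lr=0.01)

teacher_before = [p.detach().clone() for p in teacher.parameters()]
student_before = [p.detach().clone() for p in student.parameters()]

inputs = torch.rand(8, 1, 28, 28)  # random stand-in for one MNIST batch
loss_value = distillation_step(teacher, student, inputs, optimizer)
print(f"distillation loss: {loss_value:.4f}")
```

Because the optimizer is built from `student.parameters()` only, the teacher's weights are left untouched by the step, which is why the optimizer in the cell above is created from the student model.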





## Evaluation
Analyze the metrics of the new student model.

In [8]:
# Test model performance after distillation
print("Teacher model performance:")
general.test(teacher_model, device, test_loader, criterion=F.nll_loss, epoch=1, metric=metrics.accuracy)
print("Student model performance:")
general.test(student_model, device, test_loader, criterion=F.nll_loss, epoch=1, metric=metrics.accuracy)
print('\n\n')


# Compare the number of parameters of the teacher and student model
teacher_params = eval.get_model_parameters(teacher_model)
student_params = eval.get_model_parameters(student_model)
print('Number of parameters: {} (Teacher) -> {} (Student)'.format(teacher_params, student_params))

# Compare the model size of the teacher and student model
teacher_size = eval.get_model_size(teacher_model)
student_size = eval.get_model_size(student_model)
print('Model Size: {} MB (Teacher) -> {} MB (Student)'.format(teacher_size, student_size))




Teacher model performance:


Test: 100%|██████████| 60/60 [00:04<00:00, 13.49it/s]


Average loss = 0.0363
Accuracy = 0.9891
Elapsed time = 4448.19 milliseconds (74.14 per batch)
Student model performance:


Test: 100%|██████████| 60/60 [00:01<00:00, 33.77it/s]

Average loss = 1.3392
Accuracy = 0.9400
Elapsed time = 1777.77 milliseconds (29.63 per batch)



Number of parameters: 431080 (Teacher) -> 39760 (Student)
Model Size: 1.65 MB (Teacher) -> 0.15 MB (Student)
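
For readers without access to the `src.evaluation` helpers, the sketch below shows one common way to count parameters and estimate in-memory parameter size in PyTorch. The function names `count_parameters` and `parameter_size_mb` are hypothetical. `eval.get_model_parameters` and `eval.get_model_size` may be computed differently (for example from the size of a saved file), so their values can differ slightly from this estimate.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of scalar parameters in the model."""
    return sum(p.numel() for p in model.parameters())


def parameter_size_mb(model: nn.Module) -> float:
    """Memory taken by the parameters alone, in MiB (buffers and file overhead excluded)."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return n_bytes / (1024 ** 2)


# Illustrative stand-in model, not one of the notebook's models.
model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
n_params = count_parameters(model)
size_mb = parameter_size_mb(model)
print(f"{n_params} parameters, {size_mb:.2f} MB")
```

For the figures reported above, 431080 / 39760 is roughly a 10.8x reduction in parameter count, while accuracy on the evaluated data drops from 0.9891 to 0.9400.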



