# Distilling the knowledge?

Deploying machine learning models comes with practical limitations, such as latency and high computational cost. These limitations often prevent the use of cumbersome models in production, even when they deliver good results.

Our goal here is to transfer the properties of cumbersome models (i.e. great metrics) to simpler and lighter models. 

NOTE: This notebook is an implementation (with additional ...) of the paper:

In [1]:
import numpy as np
import pandas as pd
import keras

Using TensorFlow backend.


# Data

We will use the popular MNIST data set for this experiment.

In [2]:
from keras.datasets import mnist
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


In [3]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

In [4]:
# Training and data set parameters
batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

In [5]:
# Reshape the images to add the channel axis where the backend expects it
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

In [6]:
# Cast pixel values from integers to float32
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Scale pixel values from [0, 255] to [0, 1]
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


# Model configuration

As suggested in the paper, several techniques help our cumbersome model make fewer errors:

- Using a high dropout (0.5)
- Weight constraints: One particular form of regularization was found to be especially useful for dropout: constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant $c$. In other words, if $w$ represents the vector of weights incident on any hidden unit, the neural network was optimized under the constraint $\|w\|_{2} \leq c$. This constraint was imposed during optimization by projecting $w$ onto the surface of a ball of radius $c$, whenever $w$ went out of it. This is also called max-norm regularization since it implies that the maximum value that the norm of any weight can take is $c$. The constant $c$ is a tunable hyperparameter, which is determined using a validation set. Max-norm regularization has been previously used in the context of collaborative filtering (Srebro and Shraibman, 2005). It typically improves the performance of stochastic gradient descent training of deep neural nets, even when no dropout is used.
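The max-norm projection described above can be sketched in a few lines of plain Python. This is only an illustration, not the training code of this notebook: the helper name `project_max_norm`, the example weight vectors and the value `c=2.0` are assumptions chosen for the demonstration.

```python
import math

def project_max_norm(w, c):
    """Return w unchanged if ||w||_2 <= c, otherwise rescale it onto the ball of radius c."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [x * scale for x in w]

# Hypothetical incoming weight vector of one hidden unit, with ||w||_2 = 5
w = [3.0, 4.0]
projected = project_max_norm(w, c=2.0)  # [1.2, 1.6] up to floating-point rounding
projected_norm = math.sqrt(sum(x * x for x in projected))

# A vector that already satisfies the constraint is left untouched
inside = project_max_norm([0.3, 0.4], c=2.0)
```

In Keras, the same idea is available as `keras.constraints.max_norm`, which can be passed to a layer through its `kernel_constraint` argument. The value of `c` would still have to be tuned on a validation set.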


# Distillation

<b>TL;DR:</b> Approximating a complex, hard-to-train model with a simpler one that is trained on the soft targets of the former.

## CNN basic architecture

Prior to delving into the details, let's recall the basic architecture of an image classifier.

![Convolutional Neural Network basic architecture](cnn_notebook.png)

The inputs (the handwritten digits) pass through a Convolutional Neural Network and the following operations are performed:
1. The convolutional and max-pooling layers create a rich, compressed representation of the input
2. This representation then goes into a fully-connected layer, which produces logits
3. The softmax function converts the logits $z_{i}$ into probabilities $q_{i}$, where $T$ is a temperature (with $T = 1$ giving the standard softmax):

$$ q_{i} = \frac{\exp(z_{i}/T)}{\sum_{j}\exp(z_{j}/T)} $$

4. The argmax function then returns the class with the highest probability (our digits)
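The softmax and argmax steps can be illustrated with a small, self-contained sketch. The logits below are made-up values for three classes, and the helper `softmax` is written here for illustration; it is not a function from this notebook.

```python
import math

def softmax(logits, T=1.0):
    """Convert logits into probabilities, dividing by the temperature T first."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes
logits = [1.0, 2.0, 5.0]

q_standard = softmax(logits, T=1.0)  # sharp distribution
q_soft = softmax(logits, T=5.0)      # softer distribution, same ranking

# argmax: index of the most probable class
predicted = max(range(len(logits)), key=lambda i: q_standard[i])
```

Raising `T` spreads the probability mass over the other classes without changing which class comes out on top.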

The key takeaway is that the class probabilities carry more information than the hard {0, 1} labels. The idea is to leverage this by constructing a simpler model that predicts the probability distribution.
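As a rough sketch of how a simpler model could be scored against such a distribution, the snippet below computes the cross-entropy between the softened probabilities of a cumbersome model and those of a lighter one. All logits, the temperature `T = 3.0` and the helper names are assumptions made for illustration; they are not results from this notebook.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, predictions, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(targets, predictions))

# Hypothetical logits for one image with three classes
cumbersome_logits = [1.0, 2.0, 5.0]
light_logits = [0.5, 2.5, 4.0]
T = 3.0  # assumed temperature

soft_targets = softmax(cumbersome_logits, T)
light_probs = softmax(light_logits, T)

# Loss against the soft targets of the cumbersome model
soft_loss = cross_entropy(soft_targets, light_probs)

# For comparison: loss against the hard {0, 1} label of the true class
hard_target = [0.0, 0.0, 1.0]
hard_loss = cross_entropy(hard_target, softmax(light_logits))

# Lower bound, reached when the lighter model matches the soft targets exactly
best_possible = cross_entropy(soft_targets, soft_targets)
```

The soft loss penalizes the lighter model for every class where its probability differs from the cumbersome model's, whereas the hard loss only looks at the probability assigned to the true class.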

# Experiments

Two cumbersome models will be developed:
- The one described in the paper
- A more fine-tuned one

We will then analyze and compare how well each of these two cumbersome models transfers to lighter models.

In [7]:
# Cumbersome model: two convolutional layers, max-pooling, and a dense classifier with dropout
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 60000 samples, validate on 10000 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12

KeyboardInterrupt: 

# Sources:

- https://arxiv.org/pdf/1503.02531.pdf
- http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
- https://arxiv.org/pdf/1207.0580.pdf
- https://cambridgespark.com/content/tutorials/neural-networks-tuning-techniques/index.html
- https://en.wikipedia.org/wiki/Softmax_function
- https://www.quora.com/What-does-Dr-Hinton-mean-by-hard-vs-soft-targets