# Distilling the knowledge?

Deploying machine learning algorithm has several limitations such as latency issues and high computational cost. This is usually a fundamental drawback that prevents the use of cumbersome models with good results.

Our goal here is to transfer the properties of cumbersome models (i.e. great metrics) to simpler and lighter models. 

NOTE: This notebook is an implementation (with additional ...) of the paper:

In [1]:
import numpy as np
import pandas as pd
import keras

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


# Data

We will use the popular MNSIT data set for this experiment. 

In [2]:
from keras.datasets import mnist
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [16]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Lambda, concatenate, Activation, Conv2D, MaxPooling2D
from keras.models import Model
from keras.losses import categorical_crossentropy as logloss
from keras.metrics import categorical_accuracy, top_k_categorical_accuracy
from keras import backend as K

In [4]:
# Parameters of the data set
batch_size = 128
num_classes = 10
epochs = 12

# Input image dimensions
img_rows, img_cols = 28, 28

In [5]:
# Reshaping
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

In [6]:
# Numerical data
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Normalizing
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples


# Model configuration

As suggested in the paper, several steps help our cumbersome model to make less errors:

- Using a high dropout (0.5)
- Weight constraints: One particular form of regularization was found to be especially useful for dropout— constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant c. In other words, if w represents the vector of weights incident on any hidden unit, the neural network was optimized under the constraint ||w||_2 ≤ c. This constraint was imposed during optimization by projecting w onto the surface of a ball of radius c, whenever w went out of it. This is also called max-norm regularization since it implies that the maximum value that the norm of any weight can take is c. The constant c is a tunable hyperparameter, which is determined using a validation set. Max-norm regularization has been previously used in the context of collaborative filtering (Srebro and Shraibman, 2005). It typically improves the performance of stochastic gradient descent training of deep neural nets, even when no dropout is used.
- 


# Distillation

<b>TL;DR:</b> Approximating a complex and hard to train model by a simpler one that uses the soft target of the former. 

## CNN basic architecture

Prior to delving into the details, let's recall the basic architecture of an image classifier.

![Convolutional Neural Network basic architecture](cnn_notebook.png)

The inputs (the handwritten digits) pass through a Convolutional Neural Network and the following operations are performed:
1. The convolutional and max-pooling layers creates a rich and compressed representation of the output
2. This representation then goes into a fully-connected layer, which produce logits
3. The softmax function converts the logits $z_{i}$ into probabilities $q_{i}$, by performing:

$$ q_{i} = \frac{exp(z_{i}/T)}{\sum_{j}exp(z_{j}/T)} $$

4. The argmax function then returns the class with the highest probability (our digits)

The key takeaway before we continue is that more informations are contained in the class probabilities rather than in  the ouput digit. The idea is therefore to leverage this by constructing a simpler model that predicts the probability distribution.

## Simplest form of distillation

The target of the distill model are the probabilities of the cumbersome model.

Train the cumbersome model with high temperature $T_{high}$ in the softmax and do the same with the distilled one. During training of the distilled model, the temperature $T_{high}$ will also be used. Once the training is done though, we set T = 1.

# Experiments

Two cumbersome models will be developed:
- The one described in the paper
- A more fine-tuned one

We will then analyze and compare the transferatbility to lighter models on these two cumbersome models.

In [7]:
x_train.shape

(60000, 1, 28, 28)

In [13]:
# Model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=1,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 60000 samples, validate on 10000 samples
Epoch 1/1
Test loss: 0.07937954030409455
Test accuracy: 0.9752


In [66]:
y_pred = model.predict(x_test)

In [14]:
model_test = model

In [83]:
# Converting to channels_last
#x_train_ = np.moveaxis(x_train, -1, 0)
#x_test_ = np.moveaxis(x_test, -1, 0)

In [17]:


temperature = 5.0

# remove softmax
model_test.layers.pop()

# usual probabilities
logits = model_test.layers[-1].output
probabilities = Activation('softmax')(logits)

# softed probabilities
logits_T = Lambda(lambda x: x/temperature)(logits)
probabilities_T = Activation('softmax')(logits_T)

output = concatenate([probabilities, probabilities_T])
model_test = Model(model_test.input, output)
# now model outputs 512 dimensional vectors

In [73]:
y_test[:, :10]

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [75]:
y_test[:, 10:]

array([], shape=(10000, 0), dtype=float64)

In [67]:
# Custom loss
def knowledge_distillation_loss(y_true, y_pred, lambda_const):    
    
    # split in 
    #    onehot hard true targets
    #    logits from xception
    y_true, logits = y_true[:, :10], y_true[:, 10:]
    
    # convert logits to soft targets
    y_soft = K.softmax(logits/temperature)
    
    # split in 
    #    usual output probabilities
    #    probabilities made softer with temperature
    y_pred, y_pred_soft = y_pred[:, :10], y_pred[:, 10:]    
    
    return lambda_const*logloss(y_true, y_pred) + logloss(y_soft, y_pred_soft)

In [63]:
def accuracy(y_true, y_pred):
    return categorical_accuracy(y_true, y_pred)

In [40]:
def top_5_accuracy(y_true, y_pred):
    y_true = y_true[:, :10]
    y_pred = y_pred[:, :10]
    return top_k_categorical_accuracy(y_true, y_pred)

In [41]:
def categorical_crossentropy(y_true, y_pred):
    y_true = y_true[:, 10:]
    y_pred = y_pred[:, 10:]
    return logloss(y_true, y_pred)

In [59]:
# logloss with only soft probabilities and targets
def soft_logloss(y_true, y_pred):     
    logits = y_true[:, 10:]
    y_soft = K.softmax(logits/temperature)
    y_pred_soft = y_pred[:, 10:]    
    return logloss(y_soft, y_pred_soft)

In [62]:
soft_logloss(y_test, y_pred)

AttributeError: 'numpy.ndarray' object has no attribute 'get_shape'

In [51]:
x_train_ = x_train.reshape()

(60000, 1, 28, 28)

In [52]:
y_train.shape

(60000, 10)

In [69]:
x_train.shape

(60000, 1, 28, 28)

In [72]:
lambda_const = 0.07

model_test.compile(loss=lambda y_true, y_pred: knowledge_distillation_loss(y_true, y_pred, lambda_const),
              optimizer=keras.optimizers.SGD(lr=1e-1, momentum=0.9, nesterov=True), 
              metrics=['accuracy'])

model_test.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=1,
              verbose=1,
              validation_data=(x_test, y_test))


Train on 60000 samples, validate on 10000 samples
Epoch 1/1


InvalidArgumentError: Incompatible shapes: [128,0] vs. [128,246]
	 [[Node: training_12/SGD/gradients/loss_12/concatenate_1_loss/mul_2_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@loss_12/concatenate_1_loss/mul_2"], _device="/job:localhost/replica:0/task:0/cpu:0"](training_12/SGD/gradients/loss_12/concatenate_1_loss/mul_2_grad/Shape, training_12/SGD/gradients/loss_12/concatenate_1_loss/mul_2_grad/Shape_1)]]

Caused by op u'training_12/SGD/gradients/loss_12/concatenate_1_loss/mul_2_grad/BroadcastGradientArgs', defined at:
  File "/Users/louisdevitry/anaconda/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/louisdevitry/anaconda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 389, in start
    ioloop.IOLoop.instance().start()
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 151, in start
    super(ZMQIOLoop, self).start()
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tornado/ioloop.py", line 866, in start
    handler_func(fd_obj, events)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 433, in _handle_events
    self._handle_recv()
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 465, in _handle_recv
    self._run_callback(callback, msg)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 407, in _run_callback
    callback(*args, **kwargs)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 252, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 213, in dispatch_shell
    handler(stream, idents, msg)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 362, in execute_request
    user_expressions, allow_stdin)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 175, in do_execute
    shell.run_cell(code, store_history=store_history, silent=silent)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2902, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3012, in run_ast_nodes
    if self.run_code(code, result):
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-72-b2601425dd21>", line 11, in <module>
    validation_data=(x_test, y_test))
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/engine/training.py", line 1608, in fit
    self._make_train_function()
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/engine/training.py", line 990, in _make_train_function
    loss=self.total_loss)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/optimizers.py", line 156, in get_updates
    grads = self.get_gradients(loss, params)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/optimizers.py", line 73, in get_gradients
    grads = K.gradients(loss, params)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2369, in gradients
    return tf.gradients(loss, variables, colocate_gradients_with_ops=True)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 482, in gradients
    in_grads = grad_fn(op, *out_grads)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/math_grad.py", line 610, in _MulGrad
    rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 411, in _broadcast_gradient_args
    name=name)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'loss_12/concatenate_1_loss/mul_2', defined at:
  File "/Users/louisdevitry/anaconda/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
[elided 15 identical lines from previous traceback]
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2902, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3006, in run_ast_nodes
    if self.run_code(code, result):
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-72-b2601425dd21>", line 5, in <module>
    metrics=['accuracy'])
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/engine/training.py", line 860, in compile
    sample_weight, mask)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/engine/training.py", line 460, in weighted
    score_array = fn(y_true, y_pred)
  File "<ipython-input-72-b2601425dd21>", line 3, in <lambda>
    model_test.compile(loss=lambda y_true, y_pred: knowledge_distillation_loss(y_true, y_pred, lambda_const),
  File "<ipython-input-67-4c2759a7b93e>", line 17, in knowledge_distillation_loss
    return lambda_const*logloss(y_true, y_pred) + logloss(y_soft, y_pred_soft)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/losses.py", line 50, in categorical_crossentropy
    return K.categorical_crossentropy(y_true, y_pred)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2861, in categorical_crossentropy
    return - tf.reduce_sum(target * tf.log(output),
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 884, in binary_op_wrapper
    return func(x, y, name=name)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1105, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1625, in _mul
    result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
  File "/Users/louisdevitry/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)

InvalidArgumentError (see above for traceback): Incompatible shapes: [128,0] vs. [128,246]
	 [[Node: training_12/SGD/gradients/loss_12/concatenate_1_loss/mul_2_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@loss_12/concatenate_1_loss/mul_2"], _device="/job:localhost/replica:0/task:0/cpu:0"](training_12/SGD/gradients/loss_12/concatenate_1_loss/mul_2_grad/Shape, training_12/SGD/gradients/loss_12/concatenate_1_loss/mul_2_grad/Shape_1)]]


In [None]:
probas = model.predict(x_train)

In [17]:
from keras.losses import categorical_crossentropy as logloss

def knowledge_distillation_loss(y_true, y_pred, lambda_const = 6):    
    
    # split in 
    #    onehot hard true targets
    #    logits from xception
    y_true, logits = y_true[:, :256], y_true[:, 256:]
    
    # convert logits to soft targets
    y_soft = K.softmax(logits/temperature)
    
    # split in 
    #    usual output probabilities
    #    probabilities made softer with temperature
    y_pred, y_pred_soft = y_pred[:, :256], y_pred[:, 256:]    
    
    return lambda_const*logloss(y_true, y_pred) + logloss(y_soft, y_pred_soft)

# Model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=loss,#keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, probas,
          batch_size=batch_size,
          epochs=1,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 60000 samples, validate on 10000 samples
Epoch 1/1
Test loss: 0.0876760777818039
Test accuracy: 0.9688


In [13]:
probas_test = model.predict(x_test)


In [14]:
probas_test

array([[0.        , 0.        , 0.        , ..., 1.1292136 , 0.        ,
        0.        ],
       [0.        , 0.        , 0.8081301 , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.9808517 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.25781712,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]], dtype=float32)

# Sources:

https://arxiv.org/pdf/1503.02531.pdf
http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
https://arxiv.org/pdf/1207.0580.pdf
https://cambridgespark.com/content/tutorials/neural-networks-tuning-techniques/index.html
https://en.wikipedia.org/wiki/Softmax_function
https://www.quora.com/What-does-Dr-Hinton-mean-by-hard-vs-soft-targets