This repository contains my work replicating Geoffrey Hinton's work on distilling the knowledge of a large ensemble network into a smaller neural network.
- To replicate the work, I created a custom optimizer, which you can find here. This optimizer makes sure that the norm of the weights coming into each individual neuron does not exceed a certain threshold.
- The main results of this work are as follows:
  - The misclassification error of the big ensemble network is 101.
  - The misclassification error of the smaller network is 196.
  - The misclassification error of the smaller network trained on the probabilities of the ensemble network is 134.
- Using the code, you can also learn how to use multiple TensorFlow graphs within one Python file. I created a separate TensorFlow graph for the ensemble model and for the distilled model, and I feed the probabilities generated by the ensemble model to the distilled model during training.
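The weight-norm constraint mentioned above (a max-norm constraint on the incoming weights of each neuron, as used in Hinton's dropout training setup) can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the repository's actual optimizer; the function name `apply_max_norm` and the threshold value are assumptions.

```python
import numpy as np

def apply_max_norm(W, max_norm=3.0):
    """Rescale each column of W (the incoming weights of one neuron)
    so that its L2 norm does not exceed max_norm.

    Hypothetical sketch of the constraint; the custom optimizer in this
    repository may implement it differently (e.g. inside the update step).
    """
    # L2 norm of each column (one column = one neuron's incoming weights)
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Scale factor is 1 for columns already within the threshold,
    # and max_norm / norm for columns that exceed it
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale
```

In practice this projection would be applied after every gradient update, so the weights stay inside the max-norm ball throughout training.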
We first need to train an ensemble model and save it. We can achieve this by running:

```shell
python model.py -n ensemble
```
We can then train a distilled model on the probabilities of this ensemble model by running:

```shell
python distill_knowledge.py
```
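The training objective behind `distill_knowledge.py` can be illustrated with a small NumPy sketch of Hinton's distillation loss: a cross-entropy against the ensemble's temperature-softened probabilities, mixed with the usual cross-entropy against the hard labels. The temperature `T`, mixing weight `alpha`, and function names here are assumptions for illustration, not the script's actual interface.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T > 1 produces softer probabilities."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_probs, labels, T=2.0, alpha=0.5):
    """Mix of soft-target and hard-label cross-entropy (illustrative values
    for T and alpha; the paper tunes these per experiment).

    teacher_probs: ensemble probabilities produced at temperature T.
    """
    student_soft = softmax(student_logits, T)
    # Cross-entropy against the ensemble's softened probabilities,
    # scaled by T^2 as recommended in the paper so gradient magnitudes
    # stay comparable across temperatures
    soft_ce = -np.mean(np.sum(teacher_probs * np.log(student_soft + 1e-12), axis=-1))
    # Standard cross-entropy against the hard labels (temperature 1)
    student_hard = softmax(student_logits)
    hard_ce = -np.mean(np.log(student_hard[np.arange(len(labels)), labels] + 1e-12))
    return alpha * T**2 * soft_ce + (1.0 - alpha) * hard_ce
```

In the repository's two-graph setup, the teacher probabilities would be computed once in the ensemble graph and then fed into the distilled model's graph as a placeholder during training.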
The model trained on the ensemble's probabilities clearly gives better results than the model trained on the hard labels (134 vs. 196 misclassifications). However, I was not able to directly replicate the numbers reported in the paper.