In [10]:
import numpy as np

# Terminology

Ensemble: a collection of models

soft targets: the class probabilities, such as the softmax probabilities, produced by a model

transfer set: the dataset used to transfer knowledge from the ensemble to the final (distilled) model; here, a subset of the training set that serves as the new training set

# Objectives for the final model

Encourage the final model to predict the true targets while also matching the soft targets provided by the ensemble, without excessive computation.

# Distillation

Distillation is closely related to statistical mechanics: the logit $z_i$ plays the role of the energy of state $i$ of an ensemble.

By the Gibbs distribution, $p_i =\frac{\exp(\beta{z_i})}{\sum_{j} \exp(\beta{z_j})}$ is the probability of state $i$, where $\beta = -\frac{1}{\kappa T}$, $\kappa$ is the Boltzmann constant, and $T$ is the temperature.

The paper uses $p_i =\frac{\exp(z_i/T)}{\sum_{j} \exp(z_j/T)}$, i.e. $\beta = \frac{1}{T}$, which formally corresponds to setting $\kappa = -1$ (equivalently, treating $z_i$ as a negative energy with $\kappa = 1$). If $T = 1$, this is the usual softmax probability $p_i =\frac{\exp(z_i)}{\sum_{j} \exp(z_j)}$.

Here $z_i$ is the $i$-th output (logit) of the neural network.

A higher $T$ gives a softer distribution, in the sense that the $p_i$ vary less across classes. As $T \rightarrow \infty$, $p$ tends to the uniform distribution.

The distilled model is trained on a transfer set, using as soft targets the distribution the ensemble produces at a high temperature $T$; the same $T$ is used in the distilled model's softmax during training. After training, we set $T=1$.
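
The effect of the temperature can be checked numerically. The following is a minimal sketch, not the paper's code: the helper names `softmax_with_temperature` and `distillation_loss`, the weight `alpha`, and the example logits are all assumptions made for illustration. The soft-target term is scaled by $T^2$, as the paper recommends when soft and hard targets are combined.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Return softmax(z / T) along the last axis."""
    scaled = np.asarray(z, dtype=float) / T
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scaled)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """Weighted sum of soft-target and hard-target cross-entropies.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    soft_targets = softmax_with_temperature(teacher_logits, T)
    student_soft = softmax_with_temperature(student_logits, T)
    student_hard = softmax_with_temperature(student_logits, 1.0)
    soft_ce = -np.sum(soft_targets * np.log(student_soft))
    hard_ce = -np.log(student_hard[true_label])
    return alpha * T**2 * soft_ce + (1.0 - alpha) * hard_ce

# Made-up logits for a 3-class problem
logits = np.array([2.0, 1.0, 0.1])
p_T1 = softmax_with_temperature(logits, T=1.0)
p_T5 = softmax_with_temperature(logits, T=5.0)
p_T1000 = softmax_with_temperature(logits, T=1000.0)

loss = distillation_loss(student_logits=np.array([1.5, 1.2, 0.3]),
                         teacher_logits=logits, true_label=0)
```

Comparing `p_T1`, `p_T5` and `p_T1000` shows the largest probability shrinking and the distribution approaching uniform as $T$ grows.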

# Training ensembles on very big datasets

Problem: excessive computation, because (1) each model is a huge neural network and (2) the dataset is large.


Solution: learn specialist models, each of which concentrates on a different confusable subset of the classes, while taking care to avoid overfitting.

(1) Find a set of confusable classes (e.g. different kinds of mushroom) for each specialist model to focus on.

(2) Combine all the classes the specialist does not care about (i.e. the remaining classes) into a single dustbin class.

(3) Make the proportion of training examples from the confusable classes higher in the specialist's training set.

(As specified in the paper: half of the examples come from the confusable classes and the other half from the single dustbin class.)

(4) Initialize the specialist with the weights of the generalist model.

(5) After training, correct for the biased training set by incrementing the logit of the dustbin class by the log of the proportion by which the specialist class is oversampled.

In [38]:
'''
Assume there are 5 training examples (columns) and 3 classes (rows).
Rows 1 and 2 of the logits z correspond to the two confusable classes;
row 3 corresponds to the single dustbin class.
'''
z = np.random.randn(3,5)
z

array([[ 0.43632898, -0.4978327 ,  1.67371774,  0.29129533,  0.57995694],
       [-0.78937289, -0.85106126, -0.82978901,  0.894575  ,  1.0703239 ],
       [-0.49457959, -0.42699805,  0.93182444,  1.43433881,  1.51625282]])

In [39]:
'''
Assume p holds the proportion of the specialist's training set drawn from each class.
pc1 is the proportion from confusable class 1; the two confusable classes
together make up one half and the dustbin class the other half.
'''
pc1 = np.random.uniform(low = 0, high = 0.5)
p = [pc1, 0.5-pc1, 0.5]
p

[0.3919959093788017, 0.10800409062119831, 0.5]

In [40]:
# Shift the dustbin logits (last row) by the log of the dustbin proportion p[-1]
z[-1,:] += np.log(p[-1])
z

array([[ 0.43632898, -0.4978327 ,  1.67371774,  0.29129533,  0.57995694],
       [-0.78937289, -0.85106126, -0.82978901,  0.894575  ,  1.0703239 ],
       [-1.18772677, -1.12014523,  0.23867726,  0.74119163,  0.82310564]])
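
Step (5) speaks of the proportion by which the specialist classes are oversampled, whereas the cell above uses the dustbin's sampling fraction `p[-1]` as a stand-in. A minimal sketch of the correction with an explicit oversampling factor follows. The fractions `frac_special_original` and `frac_special_training` are hypothetical numbers, and reading the "proportion" as their ratio is an assumption, not something confirmed by the paper.

```python
import numpy as np

# Hypothetical numbers (assumptions for illustration only):
# the confusable classes make up 2% of the original data
# but 50% of the specialist's training set.
frac_special_original = 0.02
frac_special_training = 0.5
oversampling = frac_special_training / frac_special_original

# Same layout as the cells above: rows are classes (last row = dustbin),
# columns are examples.
rng = np.random.default_rng(0)
z_specialist = rng.standard_normal((3, 5))

# Increment only the dustbin logits by the log of the oversampling factor.
z_corrected = z_specialist.copy()
z_corrected[-1, :] += np.log(oversampling)

def softmax_columns(logits):
    """Softmax over the class axis (rows), one distribution per column."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

p_before = softmax_columns(z_specialist)
p_after = softmax_columns(z_corrected)
```

Because the factor is larger than 1, the correction raises the dustbin probability for every example, which counteracts the specialist having seen its own classes far more often than they occur in the full data.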

# Details of training ensembles on very big datasets

(1) Find the confusable classes

An approach that does not require the true labels to construct the clusters is simpler and therefore preferable, so we avoid using the confusion matrix.

Instead, apply a clustering algorithm to the covariance matrix of the predictions of the generalist model.
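
As a rough sketch of this idea, the cell below clusters the columns of the covariance matrix of some synthetic generalist predictions. Everything here is an assumption made for illustration: the predictions are fabricated toy data with two groups of co-occurring classes, `kmeans` is a plain batch K-means written for this example (the paper describes an on-line variant), and the number of clusters is fixed by hand.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain batch K-means; returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # fancy indexing copies
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy generalist predictions: 6 classes in two groups that share probability mass.
rng = np.random.default_rng(0)
n_examples, n_classes = 2000, 6
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
preds = np.full((n_examples, n_classes), 0.01)
for i in range(n_examples):
    g = groups[rng.integers(2)]
    preds[i, g] += rng.dirichlet([5.0, 5.0, 5.0])
preds /= preds.sum(axis=1, keepdims=True)

# Covariance between class probabilities (classes are the columns of preds).
cov = np.cov(preds, rowvar=False)

# cov is symmetric, so clustering its rows is the same as clustering its columns.
labels = kmeans(cov, k=2)
```

Classes that tend to receive probability mass together end up with the same label, and each resulting cluster is a candidate confusable subset $S^m$ for one specialist.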

# Performing inference with ensembles of specialists

## KL divergence

Given probability distributions $\mathbb{P},\mathbb{Q}$, we have the probability densities $p = \frac{d\mathbb{P}}{dx}, q = \frac{d\mathbb{Q}}{dx}$, defined under certain assumptions (absolute continuity of the distributions).

$$KL(p || q) = \int \log\left(\frac{d\mathbb{P}}{d\mathbb{Q}}\right) \mathrm{d}\mathbb{P} = \int\log\left(\frac{p}{q}\right)p \,\mathrm{d} x $$

This measures how far $q$ is from $p$. It is not a true distance, since in general $KL(p || q) \neq KL(q || p)$.

Given the density $q$ and samples $x$ drawn from $p$, $KL(p || q)$ is the expectation under $p$ of the log-difference $\log p(x) - \log q(x)$.

$$KL(p || q_1) + KL(p || q_2) + KL(p || q_3) = \int \left(\log\frac{p}{q_1} + \log\frac{p}{q_2} + \log\frac{p}{q_3}  \right) p \,\mathrm{d} x$$
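
For discrete class probabilities the integral becomes a sum, which is easy to check numerically. This is a minimal sketch: the helper `kl_divergence` and the two example distributions are assumptions for illustration, and the function assumes $q_i > 0$ wherever $p_i > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_i p_i * (log p_i - log q_i).

    Terms with p_i == 0 contribute zero; q_i must be positive wherever p_i > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

# Two made-up distributions over three classes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
```

Comparing `kl_pq` with `kl_qp` illustrates the asymmetry, and `kl_pp` is zero because a distribution has no divergence from itself.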

## Steps

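(1) For each test case, find the $n$ most probable classes according to the generalist model; call this set $C$. In the paper, $n=1$.

(2) Take all specialist models $m$ whose confusable subset $S^m$ has a non-empty intersection with $C$, and call them the active set $A_k = \lbrace m : C\cap S^m \neq \emptyset \rbrace$ (which may be empty).

Then find the full distribution $q$ over all classes that minimizes

$$KL(p^g || q) + \sum_{m \in A_{k}}KL(p^m || q)$$

Note: $p^g$ is the generalist's distribution over all classes, and $p^m$ is specialist $m$'s distribution over its classes $S^m$ plus the single dustbin class. When computing $KL(p^m || q)$, the probabilities that $q$ assigns to the classes in $m$'s dustbin are summed.

In detail, $q = softmax(z)$ with $T=1$, and the logits $z$ are optimized by gradient descent.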
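
The procedure can be sketched for a single test case as follows. This is an illustrative sketch under several assumptions: the function name `combine_predictions`, the learning rate, the number of steps, and the example probabilities are all made up, $p^g$ is assumed strictly positive so that its log can initialize $z$, and each specialist is passed as a pair (class indices $S^m$, probabilities over $S^m$ followed by the dustbin).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combine_predictions(p_g, specialists, n_top=1, n_steps=2000, lr=0.5):
    """Minimize KL(p_g || q) + sum over active m of KL(p_m || q), with q = softmax(z)."""
    p_g = np.asarray(p_g, dtype=float)
    # Step 1: the n_top most probable classes under the generalist
    top = set(int(c) for c in np.argsort(p_g)[-n_top:])
    # Step 2: specialists whose class subset intersects that set
    active = [(np.asarray(S, dtype=int), np.asarray(p_m, dtype=float))
              for S, p_m in specialists
              if top & set(int(c) for c in S)]
    z = np.log(p_g)  # start from the generalist's prediction
    for _ in range(n_steps):
        q = softmax(z)
        grad = q - p_g  # gradient of KL(p_g || q) with respect to z
        for S, p_m in active:
            in_S = np.zeros(q.size, dtype=bool)
            in_S[S] = True
            # Expand p_m to all classes: dustbin mass is shared among the
            # dustbin classes in proportion to the current q.
            target = np.empty_like(q)
            target[S] = p_m[:-1]
            target[~in_S] = p_m[-1] * q[~in_S] / q[~in_S].sum()
            grad = grad + (q - target)  # gradient of KL(p_m || collapsed q)
        z = z - lr * grad
    return softmax(z)

# Made-up example: 4 classes, one specialist for classes {0, 1}
p_g = np.array([0.5, 0.3, 0.1, 0.1])
specialists = [([0, 1], [0.2, 0.7, 0.1])]  # last entry is the dustbin probability

q = combine_predictions(p_g, specialists)
q_generalist_only = combine_predictions(p_g, [])
q_inactive = combine_predictions(p_g, [([2, 3], [0.6, 0.3, 0.1])])
```

Collapsing $q$ onto a specialist's classes plus a dustbin gives a gradient of the form `q - target`, which keeps the update as simple as the one for the generalist term. When no specialist is active, the minimizer is just $p^g$.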