<a href="https://colab.research.google.com/github/jiahfong/incoherent-thoughts/blob/develop/Pseudo_label_simple_and_efficient_SSL_method_for_DNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pseudo-Label: Simple and Efficient Semi-supervised learning method for deep NNs.

## Executive summary

1. Pseudo-labelling is in effect equivalent to Entropy Regularization. It favors a low-density separation between classes, a commonly assumed prior for semi-supervised learning. 
2. **Cluster assumption**: By minimizing the entropy for unlabeled data, the overlap of the class probability distributions can be reduced (think: the max-margin principle in SVMs). The *cluster assumption* states that the decision boundary should lie in low-density regions to improve generalization performance (Chapelle et al., 2005).
3. The more general **continuity assumption** states that points near each other are likely to have the same class. A small change in input space is unlikely to change the target class.
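
One way to make the connection in point 1 concrete, as a sketch of the intuition rather than the paper's own derivation. Assume a softmax output $p = (p_1, \dots, p_C)$ over $C$ classes for a single unlabelled input (the symbols $p$, $C$ and $\hat{y}$ are notation introduced here). The pseudo-label is $\hat{y} = \arg\max_i p_i$, so the cross entropy against it reduces to

$$
\mathcal{L}_{\text{pseudo}}(p) = -\log p_{\hat{y}} = -\log \max_i p_i ,
$$

while the quantity penalised by entropy regularization is

$$
H(p) = -\sum_{i=1}^{C} p_i \log p_i .
$$

Since $-\log p_i \ge -\log \max_j p_j$ for every $i$, averaging over $p$ gives

$$
0 \le \mathcal{L}_{\text{pseudo}}(p) \le H(p) \le \log C .
$$

Both quantities are zero exactly when $p$ is one-hot and both equal $\log C$ when $p$ is uniform, so minimizing either one pushes the predictions on unlabelled data towards confident, low-overlap outputs.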


The algorithm goes as follows:

for every epoch $e = 1 \dots$ EPOCHS:
1. Calculate the regular supervised loss on the labelled data (e.g. cross entropy, CE).
2. Get pseudo-labels for the unlabelled data by taking the argmax of the softmax output.
3. Total loss = supervised loss (CE) + $\alpha(e)\ *$ unlabelled loss (CE, using the pseudo-labels as "true" labels).

$$
\alpha(e) = 
\begin{cases} 
      0 & e \lt T_1 \\
      \frac{e - T_1}{T_2 - T_1}\alpha_f & T_1 \leq e \lt T_2\\
      \alpha_f & T_2 \leq e
\end{cases}
$$

$\alpha_f = 3, T_1 = 100, T_2 = 600$ without pre-training and
$T_1 = 200, T_2 = 800$ if used with denoising autoencoder as pre-training.
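
The schedule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `unlabelled_weight` and its argument names are hypothetical, and the defaults are the no-pre-training values quoted above.

```python
def unlabelled_weight(epoch, alpha_f=3.0, t1=100, t2=600):
    """Weight alpha(e) applied to the unlabelled loss at a given epoch.

    Hypothetical helper: zero before t1, linear ramp from 0 to alpha_f
    between t1 and t2, constant alpha_f from t2 onwards.
    """
    if epoch < t1:
        return 0.0
    if epoch < t2:
        return alpha_f * (epoch - t1) / (t2 - t1)
    return alpha_f


# Sample the schedule at a few epochs to see the ramp.
schedule = {e: unlabelled_weight(e) for e in (0, 100, 350, 600, 1000)}
```

With the denoising-autoencoder settings, the same helper would be called as `unlabelled_weight(e, t1=200, t2=800)`.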

4. Intuitively, the unlabelled loss encourages the model to generalise. Because the pseudo-label is the model's own most likely class, the unlabelled CE is large only when even that class receives a low probability, i.e. when the softmax output is spread evenly across the classes. Minimising it forces the model to be decisive, which entails placing the decision boundary in low-density regions (see points 1 and 2 of the executive summary).
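
The intuition in point 4 can be checked numerically. The sketch below uses plain Python and assumes that both losses are averaged over their mini-batch; all function names (`softmax`, `pseudo_label_loss`, `total_loss`, ...) are hypothetical helpers written for this note, not code from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross entropy of a probability vector against an integer class label."""
    return -math.log(probs[label])


def supervised_loss(batch_logits, labels):
    """Mean CE over a labelled mini-batch."""
    losses = [cross_entropy(softmax(z), y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def pseudo_label_loss(batch_logits):
    """Mean CE over an unlabelled mini-batch, using argmax pseudo-labels."""
    losses = []
    for z in batch_logits:
        probs = softmax(z)
        pseudo = probs.index(max(probs))  # argmax of the softmax output
        losses.append(cross_entropy(probs, pseudo))
    return sum(losses) / len(losses)


def total_loss(labelled_logits, labels, unlabelled_logits, alpha):
    """Supervised loss plus alpha times the pseudo-label loss."""
    return supervised_loss(labelled_logits, labels) + alpha * pseudo_label_loss(unlabelled_logits)


# An undecided prediction is penalised heavily, a confident one barely at all.
undecided = pseudo_label_loss([[0.0, 0.0, 0.0]])   # uniform softmax, loss = log(3)
confident = pseudo_label_loss([[10.0, 0.0, 0.0]])  # near one-hot softmax, loss close to 0
```

For three classes, the undecided case gives the maximum possible pseudo-label loss, $\log 3$, which is what pushes the model away from decision boundaries that cut through the unlabelled data.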


## Closing remarks

1. Perhaps trivially, we can deduce that the initial pre-training/training with labelled samples is extremely important in this framework. If there are too few labelled samples for the model to predict usable pseudo-labels, then pseudo-labelling will probably make the model worse!