<a href="https://colab.research.google.com/github/jiahfong/incoherent-thoughts/blob/develop/Noisy_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Self-training with Noisy Student

# Abstract
Bootstrap self-training by pseudo-labelling unlabelled images (with soft or hard labels) using a teacher network initially trained on ImageNet, then train a noisy student (with capacity equal to or greater than the teacher's) on the combination of labelled and pseudo-labelled images. The student is expected to outperform the teacher because of the noise and its higher representational capacity. The student then becomes the next teacher and the process repeats. The resulting network achieves higher accuracy and better robustness to image perturbations (noise, occlusion, etc.).

# Introduction

An important element of this method is that the student must be noised (to be as robust as possible) while the teacher must not be (so that its pseudo labels are as accurate as possible).
Noising is implemented with RandAugment, dropout, and stochastic depth.
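
To make the two model-level noise sources concrete, here is a minimal NumPy sketch of dropout and stochastic depth. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `rate` and `survival_prob` parameters, and the "inverted" rescaling convention are all choices made for this example. RandAugment (input-level noise) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, training):
    """Inverted dropout: zero activations with probability `rate` while training."""
    if not training or rate == 0.0:
        return x  # the un-noised teacher takes this path
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)  # rescale so the expected value is unchanged

def residual_block(x, layer, survival_prob, training):
    """Residual block with stochastic depth: randomly skip `layer` while training."""
    if not training:
        return x + layer(x)
    if rng.random() < survival_prob:
        return x + layer(x) / survival_prob  # branch kept, rescaled
    return x  # branch dropped, identity only

x = np.ones(4)
teacher_out = residual_block(dropout(x, 0.5, training=False), np.tanh, 0.8, training=False)
student_out = residual_block(dropout(x, 0.5, training=True), np.tanh, 0.8, training=True)
```

With `training=False` both functions are deterministic, which mirrors the requirement that the teacher generates pseudo labels without noise.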

> Idea: consider other kinds of stochasticity for noise generation?


# Self-training with Noisy Student

1. Teacher model minimises cross-entropy loss on labelled dataset
2. Un-noised teacher model generates soft/hard pseudo labels (hard = one-hot; soft = softmax/continuous distribution of class labels)
3. _Noised_ Student model minimises the CE loss on both labelled and pseudo labelled images
4. The student becomes the teacher and the process repeats from step 2. This is referred to as iterative training; the authors showed that accuracy improves as each iteration produces a more accurate teacher (see Ablation study).
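
Steps 2 and 3 can be sketched in a few lines of NumPy. This is a hedged illustration only: `pseudo_labels` and `cross_entropy` are hypothetical helper names, and the logits are made-up values standing in for the outputs of real teacher and student networks.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def pseudo_labels(teacher_logits, hard=False):
    """Turn un-noised teacher logits into soft or hard (one-hot) targets."""
    probs = softmax(teacher_logits)
    if not hard:
        return probs
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(len(probs)), probs.argmax(axis=1)] = 1.0
    return one_hot

def cross_entropy(targets, student_logits):
    """Mean cross-entropy between target distributions and student predictions."""
    log_probs = np.log(softmax(student_logits) + 1e-12)
    return float(-(targets * log_probs).sum(axis=1).mean())

# Made-up teacher outputs for two unlabelled images and three classes.
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
soft = pseudo_labels(teacher_logits)             # continuous distributions
hard = pseudo_labels(teacher_logits, hard=True)  # one-hot vectors

# The (noised) student would minimise this loss on the pseudo-labelled
# images, alongside the usual loss on the labelled images.
student_logits = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
loss_soft = cross_entropy(soft, student_logits)
loss_hard = cross_entropy(hard, student_logits)
```

The same `cross_entropy` function covers both label types, since a one-hot vector is just a degenerate distribution.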

Additional techniques:

* __Data filtering__ - filter out images on which the teacher has low confidence; these are usually out-of-domain images. (It would be interesting to see how confidence calculated using _Bayesian NNs_ differs.)
* __Class balancing__ - balance the number of unlabelled images per class by duplicating images in classes that have too few and keeping only the most confident predictions in classes that have too many. (The benefits of class balancing are more pronounced in smaller models.)

Both soft and hard pseudo labels work well; soft labels are slightly better for out-of-domain unlabelled data.


Note that data filtering and class balancing are performed _once_, and the result is henceforth treated as the _unlabelled pool_. The (initial) trained teacher model is used to calculate the confidence for data filtering.

__IDEA__: Consider doing data filtering and class balancing in each iterative training step? Would that help, since confidence can change as the teacher model is (presumably) improving?

# Experiments

## Labeled, unlabelled datasets & model architecture

The models used are variants of [EfficientNets](https://arxiv.org/pdf/1905.11946.pdf). The teacher model is pretrained on ImageNet and the unlabelled images are obtained from the JFT dataset (it's labelled but the labels are ignored).

The authors performed data filtering and balancing on this corpus by keeping only images with confidence > 0.3 and then, for each class, selecting at most the 130K images with the highest confidence. For classes with fewer than 130K images, images are duplicated at random.
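
The filtering and balancing procedure described above might look roughly like the following NumPy sketch. The function name `filter_and_balance` and its arguments are hypothetical; the defaults mirror the 0.3 threshold and 130K cap quoted above, and the toy arrays use a much smaller cap so the behaviour is easy to follow. Skipping classes with no surviving images is an assumption of this sketch.

```python
import numpy as np

def filter_and_balance(confidences, predicted_classes, num_classes,
                       threshold=0.3, per_class=130_000, seed=0):
    """Return indices into the unlabelled pool after filtering and balancing."""
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(num_classes):
        # Keep images predicted as class c whose confidence clears the threshold.
        idx = np.flatnonzero((predicted_classes == c) & (confidences > threshold))
        if len(idx) == 0:
            continue  # nothing to duplicate for this class (assumption)
        if len(idx) > per_class:
            # Abundant class: keep only the most confident images.
            order = np.argsort(confidences[idx])[::-1]
            idx = idx[order[:per_class]]
        elif len(idx) < per_class:
            # Scarce class: pad with random duplicates.
            extra = rng.choice(idx, size=per_class - len(idx), replace=True)
            idx = np.concatenate([idx, extra])
        selected.append(idx)
    return np.concatenate(selected) if selected else np.array([], dtype=int)

# Toy pool: teacher confidence and predicted class for six unlabelled images.
conf = np.array([0.9, 0.8, 0.2, 0.6, 0.95, 0.4])
pred = np.array([0, 0, 0, 1, 0, 1])
pool = filter_and_balance(conf, pred, num_classes=2, per_class=2)
```

Here image 2 is dropped by the confidence filter, class 0 is trimmed to its two most confident images, and class 1 already has exactly two images so it is left unchanged.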

> They claim that these hyperparameters are not tuned extensively as their method is highly robust to them.

In most experiments, the authors performed iterative training unless the experiments take too long. In iterative training, the authors deliberately increased the unlabelled batch size in the final iteration to boost the performance (accuracy; see section 4.2). The ratio of unlabelled batch size to labelled batch size is important as the dataset sizes are probably significantly different. (See ablation study 7)

## Summary of experiments

1. Noisy student outperforms SOTA on ImageNet in both top-1 and top-5 accuracy.
2. The authors found that making the model larger improved accuracy only marginally (+0.5%), whereas using Noisy Student gave a larger gain (+2.9%).
3. Noisy Student improves all EfficientNet model sizes by ~0.8%, even without iterative training.
4. Noisy Student improved robustness on ImageNet-[A,C,P], despite the method not optimising for robustness directly.
5. Noisy Student improved accuracy even on adversarial examples, although the experiments conducted weren't very comprehensive.

# Ablation study

## Importance of noise in self-training
How could a student outperform the teacher if the CE loss against the teacher's pseudo labels is 0? The training signal would vanish! The authors claim that by introducing noise, the student is forced to perform better than the teacher. Presumably, aggressive noise keeps the loss above 0, and that noise also makes the student model much more robust to perturbations.
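
A small numerical illustration of the vanishing signal, under simplifying assumptions: with soft pseudo labels, the gradient of the cross-entropy with respect to the student's logits is `softmax(logits) - target`, so a student that reproduces the teacher's output exactly receives no gradient. The random perturbation of the logits below is only a stand-in for the effect of real noise (RandAugment, dropout, stochastic depth); the values are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad_wrt_logits(student_logits, target):
    """Gradient of cross-entropy(target, softmax(logits)) w.r.t. the logits."""
    return softmax(student_logits) - target

teacher_logits = np.array([2.0, 0.5, -1.0])
soft_target = softmax(teacher_logits)

# A student that exactly copies the un-noised teacher gets no gradient.
grad_clean = ce_grad_wrt_logits(teacher_logits, soft_target)

# Stand-in for noise: a random perturbation of the student's logits.
rng = np.random.default_rng(0)
noisy_logits = teacher_logits + rng.normal(scale=1.0, size=3)
grad_noisy = ce_grad_wrt_logits(noisy_logits, soft_target)

print(np.abs(grad_clean).max())  # 0.0
print(np.abs(grad_noisy).max())  # non-zero: the student still has something to learn
```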

> Adding noise to the teacher model has an adverse effect.

## Importance of iterative training
The authors demonstrated that accuracy increases if iterative training is employed. It's especially interesting to note that they make use of a _larger ratio between unlabelled batch size and labeled batch size in the final iteration_ to boost the final accuracy measure. (see ablation study #7)

## Additional study

### Ablation study 1: Powerful (initial) teacher models are better
Using a large teacher model with better performance leads to better results.

### Ablation study 2: The larger the unlabelled data size, the better
Title says it all. The authors performed an experiment and found that larger models might benefit from more data while small models with limited capacity might saturate.

### __Ablation study 3: Soft vs. Hard pseudo-labels depends on your situation__
1. soft and hard labels work well with _in-domain_ unlabelled images.
2. soft labels are more robust with _out-of-domain_ unlabelled images, where hard labelling can hurt performance.
3. using hard labels sometimes outperforms soft-labelling when a _large teacher model is employed_.
> Upshot: it depends.

### Ablation study 4: Unsurprisingly, large student models are better

See experiments in appendix A.2 if interested.

### Ablation study 5: Data balancing helps noisy student work well with all model sizes
Just use data balancing, it helps.

### __Ablation study 6: Joint training outperforms pretraining on unlabelled and finetuning on labeled.__

An alternative approach by Yalniz et al. (2019) suggests pretraining and finetuning. The authors noted that joint training is not only simpler, but also outperforms the alternative.

### __Ablation study 7: Scale unlabelled batch size accordingly__
Since the unlabelled pool is likely to be much larger than the labelled pool, make sure the batch sizes are scaled appropriately. For example, if the unlabelled pool is 10x the size of the labelled pool and the batch sizes are identical, then one epoch over the unlabelled pool takes as many steps as 10 epochs over the labelled pool.

As the authors put it: "... we would also like the model to be trained on unlabeled data for more epochs by using a larger unlabeled batch size so that it can fit the unlabeled data better", and "[u]sing a larger ratio between unlabeled batch size and labeled batch size, leads to substantially better performance for a large model."
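
The arithmetic behind the batch-size ratio can be checked with a short helper. This is a sketch assuming each training step draws one labelled batch and one unlabelled batch; the function name and the pool and batch sizes are hypothetical.

```python
def unlabelled_epochs_per_labelled_epoch(n_labelled, n_unlabelled,
                                         labelled_batch, unlabelled_batch):
    """Passes over the unlabelled pool completed during one labelled epoch,
    assuming every step draws one labelled and one unlabelled batch."""
    steps = n_labelled / labelled_batch
    return steps * unlabelled_batch / n_unlabelled

# Unlabelled pool is 10x the labelled pool.
same_batch = unlabelled_epochs_per_labelled_epoch(1_000, 10_000, 100, 100)
scaled_batch = unlabelled_epochs_per_labelled_epoch(1_000, 10_000, 100, 1_000)
print(same_batch)    # 0.1
print(scaled_batch)  # 1.0
```

With identical batch sizes the model sees only a tenth of the unlabelled pool per labelled epoch; making the unlabelled batch 10x larger brings the two pools into step.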


### __Ablation study 8: Restarting the training for student models in each iteration might be better__
The authors postulated that the student can sometimes get stuck in a local optimum if it inherits the model weights from the teacher. The authors recommend training from scratch to ensure the best performance.





# Conclusion

Noisy Student self-training emphasises noising the student and making the student outperform the teacher. Together, these improve overall accuracy whilst making the model slightly more robust.

# Closing remarks (relatability to AL project)

1. Would performance improve if the teacher's confidence were measured using a more robust method (e.g. a Bayesian NN)? Data filtering is shown to improve the results, so presumably a more "accurate" confidence might make it even better?
2. In ablation study #8, the authors showed that warm-starting the student still requires a large number of epochs (and the student can sometimes get stuck in a local optimum); can the same be said for active learning? What if the models were not re-trained after acquiring samples?
3. Consider doing data filtering and class balancing in each iterative training step? Would that help since confidence can change as the teacher model is (presumably) improving?
4. Training on __highly confident__ pseudo labels is essentially complementary to active learning, which queries the most informative labels from the unlabelled pool (with standard AL techniques, that means the points on which the model is __least confident__). How can we combine the two nicely (esp. in an iterative fashion)?
5. Possible extension: consider other types of noise for the student model.
6. Use (weighted) ensemble of (all; subset) past students when training future students.