<a href="https://colab.research.google.com/github/YueWangpl/DATA2040/blob/main/Cassava_Leaf_Disease_Classification_Part_3_Yue%2C_Tianqi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

Hello, we are one of the teams competing in Kaggle's Cassava Leaf Disease Classification challenge. The team members are Yue Wang and Tianqi Tang, both from Brown University's Data Science Initiative.

Cassava roots are a good source of carbohydrates, vitamin C, thiamine, riboflavin, and niacin. Cassava leaves, if prepared properly, can contain up to 25 percent protein. As a resilient crop, cassava is resistant to heat and does not require much fertilizer. However, it is vulnerable to bacterial and viral diseases. One way to detect these diseases is to inspect the appearance of the leaves. It is therefore valuable to identify the diseases affecting cassava leaves from images, which is exactly what this project tries to accomplish using deep learning. We hope that our modest contribution will be useful for the development of cassava disease treatments.

We built our deep learning model on ResNet-50. It achieved an accuracy above 0.72 on the test set. In the following sections, we discuss the model architecture, the hyperparameters, the optimization, and possible future improvements. Links relevant to our work are listed at the end of this notebook.

## Model Architecture

We employed an ensemble of three independent sub-models with the same structure but different hyperparameters and a different `random_state` in the `train_test_split` function from the *sklearn* library.  
Each sub-model uses a ResNet-50 with weights pretrained on *ImageNet* as its base model.  
<img src="https://miro.medium.com/max/700/1*_nmPcwwnsHE-AC69ASkj9w.jpeg" alt="ResNet Architecture">
<center><b>ResNet Architecture <a href="https://towardsdatascience.com/architecture-comparison-of-alexnet-vggnet-resnet-inception-densenet-beb8b116866d">[1]</a></b></center>

We usually want deeper networks so that they can learn the subtle features we want them to notice. However, deeper networks can overfit easily, and accuracy starts to decrease beyond a certain depth, because the layers compete with each other and each new layer tries to learn from whatever the previous layer output. Imagine a CNN trained to tell a basketball from a room door. The network has already learned that the two have different shapes, but later layers go into details such as color, texture, and the presence of a door knob. These details can mislead the classifier: a drop of water on the basketball might resemble a door knob, and some doors are covered in rubber, just like a basketball.  
For this simple task, the most important feature is the shape of the two objects. A model, like a human, can easily distinguish a sphere from a cuboid, so we do not want this feature to be lost by overtraining the network.

ResNet variants can exceed 100 layers, yet they achieve better results than other convolutional networks of similar depth because each residual block adds its input to its output through a skip connection. Features learned by earlier layers are therefore preserved after every new set of convolutional layers, which offsets the drawbacks of very deep networks.
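To make the skip connection concrete, here is a minimal pure-Python sketch of a residual block. It is not ResNet-50 itself: the function names, the tiny 3-element vectors, and the two small linear maps standing in for convolutions are all assumptions for illustration. In the ResNet design, the input of a block is added element-wise to the output of the block, so when the learned transformation contributes nothing, the input passes through unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x)."""
    fx = linear(w2, relu(linear(w1, x)))
    # Skip connection: add the block input back onto the transformed signal.
    return relu([f + a for f, a in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) is zero, so the block simply forwards x.
passthrough = residual_block(x, zeros, zeros)
```

With zero weights, `passthrough` equals `x`, which illustrates why stacking more residual blocks does not have to erase what earlier layers learned.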

The model does not include the top layers of ResNet-50. Instead, after the convolutional layers of ResNet-50, we added a global average pooling layer, flattened its output, and added two dense layers with ReLU activations. The dense layers have output sizes of 512 and 256, and each is followed by a batch normalization layer and a dropout layer.  
The final layer is another dense layer with 5 nodes, matching the number of classes. It uses softmax as the activation function to output the probability of each class.
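As a small illustration of what the softmax output layer does, the sketch below converts five raw scores into class probabilities using only the standard library. The logit values are hypothetical and do not come from our trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the 5-node final layer for one image.
logits = [0.2, -1.0, 0.5, 2.3, 0.1]
probabilities = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class = probabilities.index(max(probabilities))
```

The largest logit always receives the largest probability, so here the predicted class is index 3.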

<center><img src="https://lh4.googleusercontent.com/7VUZlhPScFEZ9_yq_sm3Khmb9E_2d8Xwhc6wx126vEdJxdaXaQUtxNdXLjpL2xzpjv_nL8hU7bnjpT_33A_qBG-lT9eFJee14Jg0z4AuNGRJan3q4_gOyzuRrisVv8NX3gCRcyI_" alt="ResNet Structure"></center>
<center><b>ResNet Structure</b></center>



We use Adam as the optimizer. The loss and the metric are `sparse_categorical_crossentropy` and `sparse_categorical_accuracy`, respectively.
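For readers unfamiliar with the "sparse" variants, the following sketch computes both quantities by hand. "Sparse" means the labels are integer class ids rather than one-hot vectors. The helper functions only mirror the idea behind the Keras loss and metric (they are not the library implementation), and the probabilities and labels are made up for illustration.

```python
import math


def sparse_categorical_crossentropy(labels, probabilities):
    """Mean of -log(p[true class]) over the batch; labels are integer ids."""
    losses = [-math.log(p[y]) for y, p in zip(labels, probabilities)]
    return sum(losses) / len(losses)


def sparse_categorical_accuracy(labels, probabilities):
    """Fraction of samples whose highest-probability class equals the label."""
    hits = [p.index(max(p)) == y for y, p in zip(labels, probabilities)]
    return sum(hits) / len(hits)


# Hypothetical softmax outputs for three images over 5 classes.
probabilities = [
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.20, 0.40, 0.10],
]
labels = [3, 0, 1]

loss = sparse_categorical_crossentropy(labels, probabilities)
accuracy = sparse_categorical_accuracy(labels, probabilities)
```

The third sample is misclassified (the model favors class 3 while the label is 1), so the accuracy is two out of three and that sample contributes the largest term to the loss.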


For fitting, we adopt learning rate scheduling and early stopping.

We set a limit of 500 epochs, but training terminates well before that limit because of early stopping.
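The logic behind early stopping can be sketched in a few lines. This is a simplified stand-in for the idea rather than the Keras callback itself, and the patience value and the validation losses below are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None if
    the loss history ends before patience runs out."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvement stalls after epoch 3.
history = [1.20, 0.95, 0.90, 0.88, 0.89, 0.91, 0.90, 0.92]
stop_epoch = early_stopping_epoch(history, patience=3)
```

In this toy history the best loss occurs at epoch 3, and training stops three non-improving epochs later, far short of any fixed epoch limit.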

Finally, we take the three trained sub-models and average their predicted probabilities to form the prediction of our final ensemble model.

<center><img src="https://lh4.googleusercontent.com/cDmcygmxCz5Mfi5alvkkl8VoY847DmCJbCBo-FjN46MKHeiqd5aUzllZEePN1ZfGIKx_VMCod7YHxJ2RQXEhWpt2m0mpth4Kcz64f9VLi--6JQL6nl_NjSGOd4iOwZwrn495vrxI" alt="Ensemble Model"></center>
<center><b>Ensemble Model</b></center>
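The averaging step of the ensemble can be sketched as follows. The function name and the probability values are assumptions for illustration; in practice each inner list would be the softmax output of one trained sub-model.

```python
def ensemble_predict(per_model_probabilities):
    """Average class probabilities across models, then pick the top class.

    per_model_probabilities[m][i][c] is model m's probability that
    sample i belongs to class c.
    """
    n_models = len(per_model_probabilities)
    averaged = []
    for sample_rows in zip(*per_model_probabilities):
        # sample_rows holds one probability row per model for the same sample.
        averaged.append([sum(column) / n_models for column in zip(*sample_rows)])
    predictions = [row.index(max(row)) for row in averaged]
    return averaged, predictions


# Hypothetical outputs of three sub-models for two images over 5 classes.
model_a = [[0.1, 0.1, 0.1, 0.6, 0.1], [0.5, 0.2, 0.1, 0.1, 0.1]]
model_b = [[0.2, 0.1, 0.1, 0.5, 0.1], [0.2, 0.5, 0.1, 0.1, 0.1]]
model_c = [[0.1, 0.2, 0.1, 0.5, 0.1], [0.3, 0.4, 0.1, 0.1, 0.1]]

averaged, predictions = ensemble_predict([model_a, model_b, model_c])
```

For the second image the sub-models disagree: one favors class 0 and two favor class 1. Averaging the probabilities resolves the disagreement in favor of class 1.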

## Hyperparameter Selection

### Unfreeze base model layers?

We did not unfreeze the original ResNet layers after comparing the metrics of 'unfreeze first 30 layers', 'unfreeze last 30 layers', and 'unfreeze all'. Even with a learning rate as small as $10^{-8}$, the models with unfrozen layers could not exceed an accuracy of 64%.  
One answer in a [Stack Overflow thread about unfreezing layers](https://stackoverflow.com/questions/64227483/what-is-the-right-way-to-gradually-unfreeze-layers-in-neural-network-while-learn) notes that there are cases where even $10^{-8}$ can be too large. We are not sure whether that applies to our situation.

<center><img src="https://lh4.googleusercontent.com/B5uohwVkWx2W9nf08PchqoIkxL02kAz2Nwr_pcebs1eRWSGjHXIg_rR2KX7xlYaOj4IpQ80n0hTg1gjQaFMut4QOb-uHaAw8_oPG81loK2x5OMf96Vw6cgh45RlY8ZEAFvLQEYL8"></center>
<center><b>Unfreeze All</b></center>

Training accuracy far exceeds validation accuracy, which suggests overfitting.

<center><img src="https://lh4.googleusercontent.com/AgVYAR9bDEsEFxgR3bZG8iy7HRQqnashjLJq0QzGcgvyCENHh5RkiQmzxIxVagJqygV9_MLkqB2fzOXjnORb_zXUZu1F_A5o-4SVwhqm1l-e_TLqUZUQiSdb86f_Ysg7lsAFjiWp"></center>
<center><b>Unfreeze First 30 Layers</b></center>

Validation accuracy stayed between 0.1 and 0.2, while the training loss kept decreasing.


When we unfroze the last 30 layers, the validation accuracy stayed around 0.6, peaking at 0.6462 after 100 epochs.

### Optimizer?
We tested four optimizers (Adagrad, Adam, Adadelta, and RMSprop) for 15 epochs each, keeping every other hyperparameter fixed, to determine which one finds the minimum most stably. Here is a plot of their behavior:
<center><img src="https://lh6.googleusercontent.com/xogtVtiT4zDFqZWmwgzDmfw6Ae7j6qLnAqtcdICK5ENRBM-CkyC-59Zzpk6A9cnGeulWhhDiKkQ9ftWSKFuT6bV3v1AX5-CSqgPljK45Eb1BmWEVPiFoAwY8I94geTyvBYDiruh1" alt="Optimizer performances of adagrad, adam, adadelta, rmsprop"></center>
<center><b>Optimizer performances of adagrad, adam, adadelta, rmsprop</b></center>


Adadelta appears to perform better than the others, which makes it the winner in this comparison.

### Learning Rate?

With a similar approach, we compared several learning rates over 15 epochs each. The plot below shows their behavior:

<center><img src="https://lh5.googleusercontent.com/PrVEMVP3-VLYMQ9XqF7pQtmv2diT3doFtR2SYZS1Ntj5slMxDp7WexX8MaVU80WywsmkxI_ZWs7EIfjVMZqtfz7gK9mTz8WBB-kyKtO8A2cVxVf71-LRT_tMeDUHXmun6cgKuTqU" alt="Performances of different learning rates"></center>
<center><b>Performances of different learning rates</b></center>



A learning rate of $10^{-4}$ appeared to be the most stable. However, all learning rates showed bumpy behavior, so before making a final decision we used a `LearningRateScheduler` callback to vary the learning rate during training and plot the training and validation loss.
 
```
tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-8*10**(epoch/20))
```
 
The resulting plot is shown below:
 
<center><img src="https://lh6.googleusercontent.com/7YzCb5OmpWrdpaqwN-j121Pnw5of55QPdE9fv55OiKzgkbS31ZuSP1G29KwUZ9Mss8y0kDl35vL5pYQdlp3Gb-89sI3j3M1kfDTVJf2EJIJLVUSXt8kiCXoRXHpboimG_5ACkBLm" alt="Callback to compare lr"></center>
<center><b>Callback to compare lr</b></center>

All learning rates between $10^{-8}$ and $10^{-4}$ lowered the loss smoothly. Hence, we chose $10^{-4}$ as our learning rate.
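To read the sweep plot, it helps to know which learning rate corresponds to which epoch. The sketch below reuses the formula from the callback above; the helper names `sweep_lr` and `epoch_for_lr` are our own, introduced only for this illustration.

```python
import math


def sweep_lr(epoch):
    """Learning rate used at a given epoch, same formula as the callback."""
    return 1e-8 * 10 ** (epoch / 20)


def epoch_for_lr(lr):
    """Invert the formula: the epoch at which the sweep reaches `lr`."""
    return 20 * math.log10(lr / 1e-8)


# The rate grows tenfold every 20 epochs.
for epoch in range(0, 101, 20):
    print(f"epoch {epoch:3d}: lr = {sweep_lr(epoch):.0e}")
```

Because the rate starts at $10^{-8}$ and grows by a factor of 10 every 20 epochs, it reaches $10^{-4}$ at epoch 80.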

### Decay Rate?

After that, different decay rates in `tf.keras.optimizers.schedules.ExponentialDecay` were also compared. However, since epoch of 15 is just a small amount

These are all the hyperparameters we chose.  
Here is the behavior of our final model with an early stopping callback:
<center><img src="https://lh3.googleusercontent.com/PN43RckvJb5DUqbKwiJfensjGpRualYYdg284KEDoValrSIurlC__WJ4_xZTflcbzTWxuhZCzfS0XAyTQIGdYGygsLBoATs1lrkCnjni-FrvQuE9V02UQB9G3Zd1oHHZp4wTV6nD" alt="3 models and the ensembled model"></center>
<center><b>3 models and the ensembled model</b></center>


As mentioned above, we ensembled three models trained with different random states in the train/test split.  
Here is the accuracy on half of our dataset:

<center><img src="https://lh5.googleusercontent.com/GJT9j6JdF2ixBwqTf00tEPEZQfvVt_9AOCsVIwKg3yroODgvTnvwrZFABS82blaJZ3vID67lH1A5fY5Y9kXSOeLtzjj4-9cMvrC_Vb985hxQdpU6gkgRlzm0X_eCfAHrb5GZuSbV" alt="3 models and the ensembled model"></center>
<center><b>Accuracy of local dataset</b></center>


Here is the difference in accuracy after submission. The bottom entry is the ensembled model.
<center><img src="https://lh4.googleusercontent.com/wcrdllFYKsRzH3vC_FN0wzNzjCpv_8fCIK44iGKyLpvub4aBcdKtnDH16McXOodMBnR5dfl5kDgJVcxx-B_hUpS0BYWDFLerWTl1wwxBz6-uxWrsr9HqcocFErv8n6QqZ2lxvSxQ" alt="3 models and the ensembled model"></center>
<center><b>3 models and the ensembled model</b></center>

## Next Steps



Due to the lack of time, we were unable to implement some techniques that may boost the accuracy of our model further.

1. Create a more balanced dataset. Our dataset is imbalanced: more than 60% of the data belongs to one of the 5 classes. This may bias our model, because predicting the majority class for every sample already yields a decent accuracy while performing poorly on the minority classes. Constructing a balanced dataset would address this concern and reduce the bias toward the majority class.

2. More fine-tuning. We did not have enough time to fine-tune various hyperparameters, and we would continue fine-tuning given more time.

3. Exploring more models. We tried various models, including VGG, MobileNet, and EfficientNet. Among them, ResNet yielded the best results, but this is not conclusive because we did not fully explore many of the models.

4. Collaborating with other teams. We wish to collaborate with other Kaggle teams to produce a better model.
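As an illustration of the first item, one simple way to construct a balanced dataset is random oversampling: minority-class samples are repeated until every class is as large as the majority class. We did not implement this, so the sketch below is only one possible approach under that assumption. The function name and the toy label list are hypothetical, and the class sizes do not reflect the real dataset.

```python
import random
from collections import Counter, defaultdict


def oversample(labels, seed=0):
    """Return sample indices in which minority classes are repeated until
    every class has as many entries as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        # Draw extra samples with replacement to reach the target count.
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    rng.shuffle(balanced)
    return balanced


# Toy labels: class 3 dominates, loosely mimicking an imbalanced dataset.
labels = [3] * 12 + [0] * 2 + [1] * 3 + [2] * 2 + [4] * 1
balanced_indices = oversample(labels)
balanced_counts = Counter(labels[i] for i in balanced_indices)
```

The returned indices can be used to select images and labels for training. Oversampling should be applied to the training split only, so that repeated images do not leak into validation.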

## Kaggle Submission

<img src="https://lh4.googleusercontent.com/dhTKyU9om0zrPXQ22QD1R5NEywoWvXmBsSvf9MsIrhV2yKJTanr_r3tJdJ0deji_LX16Vd8QrtL9dkT1m84hhVw-pqd3P8hkeJQ44VvaaEsjuq2D1ixrhOnZDcVTXTxwUkIdp4pd">

## Links

[GitHub Repo](https://github.com/YueWangpl/DATA2040)

[Kaggle Notebook Saved in Git Repo](https://github.com/YueWangpl/DATA2040/blob/main/tune-resnet-cassava-leaf-disease.ipynb)


## References 


[Sample Submission from Dan](https://www.kaggle.com/danpotter/blind-monkey-submission-example-data2040-sp21)  
[Sample model from Kaggle](https://www.kaggle.com/jessemostipak/getting-started-tpus-cassava-leaf-disease)  
[Ensemble keras models](https://medium.com/randomai/ensemble-and-store-models-in-keras-2-x-b881a6d7693f)  
[Ensemble models with scikit-learn](https://sailajakarra.medium.com/ensemble-scikit-learn-and-keras-be93206c54c4)  
[Save and load sklearn models](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn//)  
[Stack Overflow thread about unfreezing layers](https://stackoverflow.com/questions/64227483/what-is-the-right-way-to-gradually-unfreeze-layers-in-neural-network-while-learn)  
