Exercise – Learning rate scheduling and optimisers
--------------------------------------------------

In this exercise, you will practise using learning rate scheduling and different optimisers. You will experiment with different parameters, compare the results and identify an appropriate choice for the handwritten digit classification task. You will build on these skills in Assessment 2 and Assessment 3.

We will use the MNIST dataset again, which consists of images of handwritten digits (0 to 9). Here is the code base from Module 3, which you already know and can use to complete the tasks. It is a good idea to read through it again now. You can keep the code where it is in your Jupyter notebook and build on top of it when coding the tasks.

In [None]:
# Add appropriate imports here
%matplotlib inline
import numpy as np 
import matplotlib.pyplot as plt
plt.rc('figure', dpi=120) # set good resolution

import tensorflow as tf
from tensorflow import keras 
from sklearn.model_selection import train_test_split

In [None]:
# Load MNIST data
mnist = keras.datasets.mnist
(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()
print(X_train_full.shape)
XSIZE, YSIZE = X_train_full.shape[1:3]

In [None]:
# Use the train_test_split function to split the data into training and validation sets, given that a
# separate hold-out test set has already been provided. We will use an 80/20 split for training/validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, test_size =0.2, stratify = y_train_full)

In [None]:
# Create a Keras sequential model with the following setup (use code examples from main module)
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape = [XSIZE , YSIZE]))
model.add(keras.layers.Dense(300, activation = "relu"))
model.add(keras.layers.Dense(100, activation = "relu"))
model.add(keras.layers.Dense(10, activation = "softmax"))

# Print a summary of the model
model.summary()

In [None]:
# Compile the model
model.compile(loss='sparse_categorical_crossentropy', 
              optimizer=keras.optimizers.Adam(learning_rate=0.01), metrics=["accuracy"])

In [None]:
# Run fit
history = model.fit(X_train, y_train, epochs=10, batch_size=1000, 
                    validation_data=(X_valid, y_valid),verbose=0)

---
---
### Task: Write a function that builds and returns a dense neural network model, just like the one above.

In [None]:
# code here

---
---
### Task: Train a dense network with each of the optimisers RMSprop, Adam, Nadam and AdaMax. Use the function that you created in the previous task to build the models. Save the models and their training histories for analysis in the next task. Use 30 epochs and the accuracy metric.


In [None]:
# code here

---
---
### Task: Display the learning curves of training and validation accuracy of all optimiser runs in one single diagram. Practise unambiguous and organised visualisation by labelling the axes and using a legend, a title, axis limits and suitable marker and line styles. Choose the axis limits so that the important parts of the data are clearly visible.
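
As one possible starting point, the sketch below plots several runs in a single diagram, with one colour per run, solid lines for training and dashed lines for validation. The function name `plot_accuracy_curves`, the dictionary layout and the numbers in `demo` are assumptions made for illustration; the numbers are made up and are not real training results. The keys `"accuracy"` and `"val_accuracy"` mirror what `history.history` contains when the model is compiled with `metrics=["accuracy"]`.

```python
import matplotlib.pyplot as plt

def plot_accuracy_curves(histories, title, ylim=(0.9, 1.0)):
    """Plot training (solid) and validation (dashed) accuracy for each run.

    `histories` maps a run label to a dict holding "accuracy" and
    "val_accuracy" lists (hypothetical layout, like `history.history`).
    """
    fig, ax = plt.subplots()
    for i, (label, hist) in enumerate(histories.items()):
        colour = f"C{i}"  # same colour for both curves of one run
        epochs = range(1, len(hist["accuracy"]) + 1)
        ax.plot(epochs, hist["accuracy"], color=colour, linestyle="-",
                marker="o", markersize=3, label=f"{label} (training)")
        ax.plot(epochs, hist["val_accuracy"], color=colour, linestyle="--",
                marker="x", markersize=3, label=f"{label} (validation)")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Accuracy")
    ax.set_title(title)
    ax.set_ylim(*ylim)  # zoom in on the region where the curves differ
    ax.legend()
    return ax

# Made-up numbers, used only to exercise the function
demo = {
    "run A": {"accuracy": [0.91, 0.95, 0.97], "val_accuracy": [0.92, 0.94, 0.95]},
    "run B": {"accuracy": [0.93, 0.96, 0.98], "val_accuracy": [0.93, 0.95, 0.96]},
}
ax = plot_accuracy_curves(demo, title="Training and validation accuracy")
```

In your own solution you would pass the saved histories of the optimiser runs instead of `demo`, and adjust `ylim` until the differences between the curves are easy to see.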


In [None]:
# code here

---
---
### Question: Interpret the accuracy results across the range of optimisers. Which one converges fastest? Which one has the best result? Which one looks robust?

Answer: 

---
---
### Task: So RMSprop gives the best result but is slow. Let's add learning rate scheduling to improve this. Create a learning rate scheduler that implements the exponential decay from Equation 4.14. Implement the decay factor s and the initial learning rate eta0 as attributes of the function that implements the formula. You can then set these two parameters to different values before training the model.
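
The idea of storing parameters as attributes of a function can be sketched as follows. Equation 4.14 is not reproduced in this notebook, so the sketch assumes the commonly used form eta(epoch) = eta0 * 0.1^(epoch / s); check it against the equation in the module and adapt the formula if it differs. The function name `exponential_decay` is also an assumption.

```python
def exponential_decay(epoch, lr=None):
    """Return the learning rate for the given epoch.

    Assumed form: eta0 * 0.1 ** (epoch / s). The second argument is
    accepted because Keras schedulers may pass the current learning
    rate, but it is not used here.
    """
    return exponential_decay.eta0 * 0.1 ** (epoch / exponential_decay.s)

# Parameters live on the function object and can be changed between runs
exponential_decay.eta0 = 0.01
exponential_decay.s = 5

for epoch in (0, 5, 10):
    print(epoch, exponential_decay(epoch))
```

A function with this signature can be wrapped in `keras.callbacks.LearningRateScheduler(exponential_decay)` and passed to `model.fit` through the `callbacks` argument. Because the parameters are attributes, setting `exponential_decay.s` to a new value before the next call to `fit` changes the schedule without redefining the function.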

In [None]:
# code here

---
---
### Task: In order to save epochs, you should create an early stopping callback. If the training converges early, the training will then stop early. Create an early stopping callback with patience 5 that monitors the validation accuracy and restores the best weights at the end.
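
To build intuition for what patience means, here is a plain-Python sketch of the stopping rule. It is not the Keras implementation, and the function name `early_stopping_epoch` and the accuracy values are made up for illustration. Training stops once the monitored value has failed to improve for `patience` consecutive epochs, and the best epoch is remembered so that its weights could be restored.

```python
def early_stopping_epoch(val_accuracy, patience=5):
    """Return (stop_epoch, best_epoch) for a list of validation accuracies.

    Epochs are counted from 0. If the patience is never exhausted,
    stop_epoch is the last epoch in the list.
    """
    best = float("-inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, acc in enumerate(val_accuracy):
        if acc > best:
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_accuracy) - 1, best_epoch

# Made-up validation accuracies: the best value occurs at epoch 2
val_acc = [0.90, 0.95, 0.96, 0.955, 0.958, 0.959, 0.957, 0.956, 0.97]
stop_epoch, best_epoch = early_stopping_epoch(val_acc, patience=5)
print(stop_epoch, best_epoch)
```

In Keras, this behaviour is provided by `keras.callbacks.EarlyStopping`, whose `monitor`, `patience` and `restore_best_weights` arguments correspond to the three requirements of the task. Note that in the example the run stops before reaching the later, higher value at epoch 8, which is the trade-off a short patience makes.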

In [None]:
# code here

---
---
### Task: Run the training again with the RMSprop optimiser and the learning rate callback and early stopping callback that you just created. Use an initial learning rate eta0 of 0.01, which is higher than the default constant learning rate of 0.001. This will speed up the early part of the training. Repeat the training with decay factors of 1, 3, 5, 7, 9, and 11 and save the models and results.

In [None]:
# code here

---
---
### Task: Display the learning curves of training and validation accuracy of all runs in one single diagram. Practise unambiguous and organised visualisation by labelling the axes and using a legend, a title, axis limits and suitable marker and line styles. Choose the axis limits so that the important parts of the data are clearly visible.

In [None]:
# code here

---
---
### Question: Interpret the accuracy results across the range of decay factors. Which one converges fastest? Which one has the best result? Which decay factor may be the best trade-off in your results?

Answer: 