# Comp0188 Lab 3

Welcome to Lab 3A! 

In this lab we will introduce you to the Recurrent Neural Network (RNN) model by training it on the MNIST dataset. After visualizing the training history, you will experiment with setting the learning rate using a **learning rate scheduler**. 

## Setup

As usual, let's begin by installing the necessary libraries. 

In [None]:
!pip install numpy
!pip install matplotlib
!pip install tensorflow
!pip install scikit_learn
!pip install pydot
!pip install graphviz



The basic components of machine learning are:

1. The **data**. 
2. The **loss function**, which serves as the optimization objective. 
3. The model **parameters**, which will be updated during the training process. 
4. The **optimizer**, which minimizes the loss function, e.g. stochastic gradient descent or RMSProp.
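
To make the four components concrete, here is a minimal sketch on a toy problem that is not part of this lab: fitting a straight line with NumPy. The synthetic data, the names `mse_loss`, `w`, `b` and `learning_rate`, and the use of plain (full-batch) gradient descent instead of stochastic gradient descent or RMSProp are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Data: 100 noisy samples from the (made-up) line y = 3x + 1
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 1.0 + 0.01 * rng.normal(size=100)

# 2. Loss function: mean squared error between predictions and targets
def mse_loss(w, b):
    return np.mean((w * x + b - y) ** 2)

# 3. Parameters: slope and intercept, initialised at zero
w, b = 0.0, 0.0

# 4. Optimizer: plain gradient descent with a fixed learning rate
learning_rate = 0.1
initial_loss = mse_loss(w, b)
for _ in range(500):
    error = w * x + b - y
    w -= learning_rate * 2.0 * np.mean(error * x)  # d(loss)/dw
    b -= learning_rate * 2.0 * np.mean(error)      # d(loss)/db
final_loss = mse_loss(w, b)

print(f"w = {w:.2f}, b = {b:.2f}, loss {initial_loss:.3f} -> {final_loss:.5f}")
```

The same four roles appear in the rest of the lab: MNIST is the data, cross-entropy is the loss, the RNN weights are the parameters, and RMSProp is the optimizer.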

### Data

In this lab, we will consider the [MNIST](https://keras.io/api/datasets/mnist/) dataset. It consists of 60,000 training images and 10,000 test images of handwritten digits, each a 28×28 grayscale image paired with a label from 0 to 9. We will aim to predict the digit from the image. This problem can be cast as a **classification** problem over 10 distinct classes (the 10 digits).

To start off, let's visualize the data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist


# load mnist dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# 1 - Print the number of distinct labels
num_labels = len(np.unique(y_train))
print(f"Number of distinct labels in the train set: {num_labels}")

# 2 - Print the first 9 labels
print(f"First 9 labels: {y_train[:9]}")

# 3 - Visualize the first 9 images
for i in range(9):
    # define subplot (3x3 grid, position i + 1)
    plt.subplot(330 + 1 + i)
    # plot raw pixel data
    plt.imshow(x_train[i], cmap=plt.get_cmap('gray'))
# show the figure
print("First 9 images: ")
plt.show()

In [None]:
from tensorflow.keras.utils import to_categorical

# convert to one-hot vector
print("Labels before converting to one-hot: ", y_train[:3])
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print("Labels after converting to one-hot: ")
print(y_train[:3])

We'll also **normalize** the raw pixel values. Originally, these are in the range [0, 255], but we'll divide by 255 to convert them to the range [0, 1].

Note: be careful to only run the cell below **once**, or you will end up performing the division multiple times!

In [None]:

x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

Now, we'll define our model architecture. The model will accept input of size $(B, H, W)$ where $H = W = 28$ is our image size and $B$ is the batch size. It will output scores of size $(B, K)$ where $K=10$ is our number of classes. Each output is interpreted as a **score** for one class, and the class with the highest score is taken to be the model's prediction.
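
As a small illustration of how scores become predictions, the sketch below takes a made-up score array of shape $(B, K)$ with $B = 2$ and picks the highest-scoring class per image. The values in `scores` are hypothetical and do not come from the model.

```python
import numpy as np

# Hypothetical scores for a batch of B = 2 images over K = 10 classes
scores = np.array([
    [0.1, 2.3, -1.0, 0.0, 0.5, 0.2, -0.3, 1.1, 0.0, 0.4],
    [1.5, -0.2, 0.3, 0.0, 0.1, 0.9, 0.2, 0.0, 3.2, -1.0],
])

# The predicted class is the index of the largest score in each row
predictions = np.argmax(scores, axis=1)
print(predictions)  # [1 8]
```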

We will construct the model as an RNN with a dense classification layer on top. At a high level, the RNN processes each image sequentially - in this case, it processes the image one row (28 pixels) at a time, storing past information in its **hidden state**, until it reaches the end, at which point it has seen the full image. 
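
The row-by-row processing described above can be sketched in plain NumPy. This is a simplified version of the standard simple-RNN recurrence $h_t = \tanh(x_t W_{xh} + h_{t-1} W_{hh} + b_h)$, assuming random weights and a random stand-in image; the names `W_xh`, `W_hh` and `b_h` are hypothetical, and the Keras layer additionally handles batching, dropout and training.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 28, 28      # image height (timesteps) and width (features per step)
units = 256        # size of the hidden state

# Randomly initialised weights, for illustration only (hypothetical names)
W_xh = rng.normal(scale=0.1, size=(W, units))      # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(units, units))  # hidden-to-hidden
b_h = np.zeros(units)

image = rng.uniform(0.0, 1.0, size=(H, W))  # stand-in for one normalized image

h = np.zeros(units)  # the hidden state starts at zero
for row in image:    # one row of 28 pixels per timestep
    h = np.tanh(row @ W_xh + h @ W_hh + b_h)

print(h.shape)  # final hidden state, which the dense layer would classify
```

After the last row, `h` summarizes the whole image, which is why only the final hidden state needs to be passed to the classification layer.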

In [None]:
import numpy as np
# use the public tensorflow.keras API; tensorflow.python.keras is private
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN
from tensorflow.keras.utils import plot_model

# network parameters
input_shape = (28, 28)
batch_size = 128
units = 256
dropout = 0.2

# model is an RNN with 256 units; each image is read as 28 timesteps of 28-dim row vectors
model = Sequential()
model.add(SimpleRNN(units=units,
                    activation='tanh',
                    dropout=dropout,
                    input_shape=input_shape))
model.add(Dense(num_labels))
model.summary()

### Loss Function
A loss function provides a smooth optimization objective that can be used to improve the model. Put simply, a lower loss value means a better model, and differentiating the loss function gives a gradient with respect to the model parameters, which standard optimizers use to update them.

For the task of classification, we commonly use **cross-entropy**, which quantifies how well our predicted class labels agree with our ground-truth labels. The higher the level of agreement between these two sets of labels, the lower our loss (and the higher our classification accuracy, at least on the training set).

As the model outputs unnormalized scores or **logits**, we specify `from_logits=True` in the loss function constructor. 
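
To see what computing cross-entropy from logits involves, here is a minimal NumPy sketch: it applies a log-softmax to the scores and averages the negative log-probability of the true class. The helper name `cross_entropy_from_logits` and the example values are hypothetical; Keras's own implementation may differ in details.

```python
import numpy as np

def cross_entropy_from_logits(logits, one_hot_labels):
    # Subtract the row-wise maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax: log-probabilities that sum to 1 after exponentiation
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Average the negative log-probability of the true class over the batch
    return -np.mean(np.sum(one_hot_labels * log_probs, axis=1))

labels = np.array([[0.0, 1.0, 0.0]])       # true class is index 1
good_logits = np.array([[0.0, 5.0, 0.0]])  # highest score on the true class
bad_logits = np.array([[5.0, 0.0, 0.0]])   # highest score on a wrong class

good_loss = cross_entropy_from_logits(good_logits, labels)
bad_loss = cross_entropy_from_logits(bad_logits, labels)
print(good_loss, bad_loss)
```

The confident correct prediction gives a loss close to zero, while the confident wrong one is penalized heavily.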

In [None]:
from tensorflow.keras.losses import CategoricalCrossentropy

model.compile(loss=CategoricalCrossentropy(from_logits=True),
              optimizer='RMSprop',
              metrics=['accuracy'])
# train the network
model.fit(x_train, y_train, epochs=20, batch_size=batch_size)

_, acc = model.evaluate(x_test,
                        y_test,
                        batch_size=batch_size,
                        verbose=0)

print("\nTest accuracy: %.1f%%" % (100.0 * acc))

### RMSProp test acc is 97.4 ###


## Exercise 1 - Learning Rates

The default learning rate used by RMSProp is $0.001$. Now re-train the model with different learning rates of $0.01$ and $0.0001$ respectively. What do you notice? Which learning rate performs best? 

Hints:
1. Look up the documentation for the `RMSprop` optimizer in TensorFlow to see how to set the learning rate. 

In [None]:
### Your code here ###

## Exercise 2 - Learning Rate Schedulers

Learning rate scheduling is sometimes called learning rate **annealing** or **adaptive learning rates**. Intuitively, the model knows little at the start of training, so a higher learning rate helps it improve quickly. Near the end, when the model is close to an optimal solution, a lower learning rate prevents it from leaving the optimal region and 'forgetting' what it already knew. By adjusting the learning rate from epoch to epoch, we can reduce loss, increase accuracy, and in certain situations even reduce the total time it takes to train a network.
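
As a rough illustration of the idea, the sketch below computes a simple exponentially decaying schedule in plain Python. The function `exponential_decay` and its default values are hypothetical, and this is deliberately a different schedule from the step decay used in the exercise.

```python
def exponential_decay(epoch, initial_lr=0.001, decay_rate=0.9):
    # Hypothetical schedule: shrink the learning rate by 10% every epoch
    return initial_lr * decay_rate ** epoch

# Learning rate for each of 20 epochs
schedule = [exponential_decay(epoch) for epoch in range(20)]
for epoch in (0, 5, 10, 19):
    print(f"epoch {epoch:2d}: learning rate = {schedule[epoch]:.6f}")
```

The rate starts at its largest value and falls steadily, so early updates are large and later ones are small.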

Train the RNN model with RMSprop with a learning rate decay factor of `0.25`, and report the final test accuracy and validation accuracy. How does this compare with your results in Exercise 1?

Hints:
1. Look up the `LearningRateScheduler` callback in TensorFlow. 
2. Look at the documentation of `model.fit` in TensorFlow to see how to pass the learning rate callback to the training loop.
3. You may find the following function, which defines a learning rate at each epoch, to be useful in configuring the scheduler. As a bonus, to understand how the `step_decay` function works, you can write some code (using `matplotlib.pyplot`) to visualize the learning rate provided at each epoch. 

In [None]:
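def step_decay(epoch):
    initial_alpha = 0.2  # learning rate before any decay is applied
    factor = 0.25        # multiplicative decay factor
    decayE = 5           # number of epochs between decay steps

    # multiply by `factor` once per `decayE` epochs; because of the ceil,
    # the first decay is already applied at epoch 0
    alpha = initial_alpha * (factor ** np.ceil((1 + epoch) / decayE))
    print(epoch, alpha)
    # np.float was removed in NumPy 1.24; the built-in float is equivalent
    return float(alpha)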

In [None]:
### Your code here ###