# Deep Learning
## Summative assessment
### Coursework 1: MLPs and Backpropagation

#### Instructions

This coursework is released on **Wednesday 31st January 9.00** and is due by **Wednesday 7th February 23.59**. It is worth **10%** of your overall mark. There are 3 questions in this assessment, worth a total of 90 marks. A further 10 marks are awarded for good code quality, clarity and presentation. **You should attempt to answer all questions.** 

This assessment mainly assesses your understanding of the multilayer perceptron model and the backpropagation algorithm, as well as your ability to use the high-level Keras API.

You can make imports as and when you need them throughout the notebook, and add code cells where necessary. Make sure your notebook executes correctly in sequence before submitting.

#### Submission instructions

Ensure your notebook executes correctly in order. Save your notebook .ipynb file **after you have executed it** (so that outputs are all showing). It is recommended to also export a PDF file of your executed notebook. Upload a zip file containing your notebook (and separate PDF file) to Coursera by the deadline above.

In [2]:
# You will need the following imports for this assessment. You can make additional imports when you need them

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# The tensorflow.keras imports (commented out below) did not work in this environment, so import from keras directly:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Activation
from keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split

#from tensorflow.keras.models import Sequential
#from tensorflow.keras.layers import Dense, Flatten, Activation
#from tensorflow.keras.callbacks import EarlyStopping




### Question 1 (Total 30 marks)

a) Load the Boston housing dataset using the Keras API, with a 75/25 train/validation split. 

Standardise the input features by subtracting the mean and dividing by the standard deviation, where the per-feature statistics are computed from the training dataset. You can use numpy or sklearn for this part if you wish.
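As a rough illustration of the standardisation step, the sketch below computes per-feature statistics on the training split only and applies them to both splits. The helper name `standardise`, the small `eps` guard against zero variance, and the synthetic arrays are assumptions made for the example; they stand in for the real Boston housing features, which have 13 columns.

```python
import numpy as np

def standardise(x_train, x_val, eps=1e-8):
    """Standardise both splits using statistics from the training split only."""
    mean = x_train.mean(axis=0)   # per-feature mean, shape (num_features,)
    std = x_train.std(axis=0)     # per-feature standard deviation
    x_train_std = (x_train - mean) / (std + eps)
    x_val_std = (x_val - mean) / (std + eps)   # validation reuses training statistics
    return x_train_std, x_val_std

# Synthetic stand-in data (assumption): 13 features, like the Boston housing inputs
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=5.0, scale=3.0, size=(300, 13))
demo_val = rng.normal(loc=5.0, scale=3.0, size=(100, 13))

demo_train_std, demo_val_std = standardise(demo_train, demo_val)
print(demo_train_std.mean(axis=0).round(3))   # close to 0 for every feature
print(demo_train_std.std(axis=0).round(3))    # close to 1 for every feature
```

The validation split will generally not have exactly zero mean and unit variance after this transformation, because its statistics were deliberately not used.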

Load the data into `tf.data.Dataset` objects, shuffle and batch the datasets with a batch size of 32. Print out the `element_spec` of one of the Datasets. 
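The Dataset step might look something like the following minimal sketch. It uses synthetic arrays (an assumption, in place of the standardised housing data) and the variable names are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in arrays (assumption): 100 examples with 13 features each
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(100, 13)).astype(np.float32)
demo_targets = rng.normal(size=(100,)).astype(np.float32)

# Build a Dataset of (features, target) pairs, then shuffle and batch it
demo_dataset = tf.data.Dataset.from_tensor_slices((demo_features, demo_targets))
demo_dataset = demo_dataset.shuffle(buffer_size=len(demo_features)).batch(32)

# element_spec describes the shape and dtype of each batched element
print(demo_dataset.element_spec)
```

Shuffling before batching means each epoch draws different mini-batches; the leading dimension of each spec is `None` because the final batch can be smaller than 32.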

**(5 marks)**

In [10]:
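# load_data returns ((x_train, y_train), (x_val, y_val)) as nested tuples of numpy arrays;
# test_split=0.25 gives the required 75/25 train/validation split
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.boston_housing.load_data(test_split=0.25)
print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)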

b) Create a TensorFlow `Sequential` model object according to the following spec:

* The model should have 2 hidden layers, with 32 and 16 neurons respectively
* Each hidden layer should use a 'swish' activation

The model should be a multilayer perceptron (MLP) model suitable for regression on the Boston housing dataset.

Train the model for up to 300 epochs using the training Dataset object, but terminate the training if the validation mean absolute error (MAE) has not improved for 30 consecutive epochs. Use the stochastic gradient descent (SGD) optimizer with Nesterov momentum, with the momentum hyperparameter set to 0.9, and a learning rate of $10^{-3}$. You should use the high-level Keras API (the `compile` and `fit` methods) for this. The model should be trained with a mean squared error (MSE) loss function. The mean absolute error should also be computed and recorded on the training and validation sets.
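One possible way to wire up this configuration with the Keras API is sketched below. The function name `build_regression_mlp`, the single linear output unit, and the synthetic data are assumptions for illustration. The sketch passes arrays to `fit` and runs only 3 epochs to stay quick, whereas the task itself asks for Dataset objects and up to 300 epochs.

```python
import numpy as np
import tensorflow as tf

def build_regression_mlp(num_features):
    """Hypothetical helper: two swish hidden layers (32, 16) and a linear output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(32, activation="swish"),
        tf.keras.layers.Dense(16, activation="swish"),
        tf.keras.layers.Dense(1),   # assumption: one linear unit for the regression target
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9, nesterov=True)
    model.compile(optimizer=optimizer, loss="mse", metrics=["mae"])
    return model

# Stop once val_mae has gone 30 epochs without improving
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_mae", mode="min", patience=30, restore_best_weights=True)

# Synthetic stand-in data (assumption), split 96/32 into train/validation
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(128, 13)).astype("float32")
demo_y = rng.normal(size=(128, 1)).astype("float32")

demo_model = build_regression_mlp(13)
demo_history = demo_model.fit(
    demo_x[:96], demo_y[:96],
    validation_data=(demo_x[96:], demo_y[96:]),
    epochs=3, batch_size=32, callbacks=[early_stopping], verbose=0)
print(sorted(demo_history.history.keys()))
```

The `history` dictionary holds one list per recorded quantity, which is what the learning-curve plots can be drawn from.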

Plot the MSE and MAE learning curves for training and validation sets, and compute the MSE loss and MAE on the validation set for the best set of model parameters (according to the validation set MAE).

**(15 marks)**

c) What do you expect would be the effect of training the same model architecture on the Boston housing dataset where the input features have not been standardised? Briefly justify your answer. 

**(5 marks)**

d) In terms of the computations carried out, describe in a few sentences what the differences would be (if any) between standardising the input features as above, and inserting a batch normalisation layer before the first dense layer of the model. 
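To make the two computations concrete, the following numpy sketch applies both to the same mini-batch. It assumes batch normalisation in training mode (statistics from the current mini-batch, followed by a learnable scale `gamma` and shift `beta`), and the values of `eps`, `gamma` and `beta` are illustrative initial values rather than anything prescribed above. The data is synthetic.

```python
import numpy as np

# Synthetic stand-in data (assumption): 320 training examples, 13 features
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=10.0, scale=4.0, size=(320, 13))
batch = demo_train[:32]

# Fixed preprocessing: one set of statistics computed once from the whole training set
train_mean = demo_train.mean(axis=0)
train_std = demo_train.std(axis=0)
fixed = (batch - train_mean) / train_std

# Batch-norm style (training mode): statistics recomputed from each mini-batch,
# then a learnable scale and shift are applied
eps = 1e-3                                # illustrative value (assumption)
gamma, beta = np.ones(13), np.zeros(13)   # illustrative initial values (assumption)
batch_mean = batch.mean(axis=0)
batch_var = batch.var(axis=0)
bn = gamma * (batch - batch_mean) / np.sqrt(batch_var + eps) + beta

print(np.abs(fixed.mean(axis=0)).max())   # generally not zero for a single batch
print(np.abs(bn.mean(axis=0)).max())      # essentially zero by construction
```

At inference time a batch normalisation layer typically switches to moving averages of the batch statistics, which this sketch does not model.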

**(5 marks)**

### Question 2 (Total 30 marks)

In this question you will empirically study the post-activation statistics in the hidden layers of an MLP model under different initialisation strategies. 

Consider an MLP model with 5 hidden layers with 8192, 8192, 8192, 4096 and 4096 neurons respectively. Each hidden layer uses a tanh activation function. Let $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_k}$ and $\mathbf{b}^{(k)}\in\mathbb{R}^{n_{k+1}}$ denote the weight matrix and bias vector that map from hidden layer $k$ to hidden layer $k+1$ according to the following:

$$
\begin{align}
\mathbf{h}^{(k)} &= \tanh\left( \mathbf{W}^{(k-1)}\mathbf{h}^{(k-1)} + \mathbf{b}^{(k-1)} \right),\qquad k=1,\ldots, 5,
\end{align}
$$

where $\mathbf{h}^{(0)} \in \mathbb{R}^{1024}$ denotes the input layer, $n_k$ is the number of neurons in hidden layer $k$, and the tanh function is applied elementwise. Suppose the input features ${h}_i^{(0)}$ are each independently sampled from $N(0, \frac{1}{2})$.

a) Compute the (post-)activations of each hidden layer after passing a single input example through the network (where the input example is sampled as described above), and save them in a variable called `layer_activations`. The following initialisation strategy should be used for the model parameters:

1. Each element in each weight matrix $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_k}$ is sampled from a standard normal distribution
2. Each bias vector $\mathbf{b}^{(k)}\in\mathbb{R}^{n_{k+1}}$ is initialised to zero
    
Your answer for this part should use only TensorFlow objects and functions, and not use numpy or scipy at all. You can make use of the Keras module if you wish. Weight and bias parameters should be implemented with TF Variable objects.

**(10 marks)**

b) Create a plot for the normalised (density) histograms for the activation statistics in each of the hidden layers. Briefly comment on the result.

**(5 marks)**

c) Re-compute the activation statistics for the MLP under two different initialisation strategies:

1. Glorot normal distribution initialisation for the weights $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_k}$ and zero initialisation for the bias $\mathbf{b}^{(k)}\in\mathbb{R}^{n_{k+1}}$ for $k=0,\ldots,4$
2. Glorot uniform distribution initialisation for the weights $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_k}$ and zero initialisation for the bias $\mathbf{b}^{(k)}\in\mathbb{R}^{n_{k+1}}$ for $k=0,\ldots,4$

For each initialisation strategy above, plot normalised histograms for the activation statistics in each hidden layer.
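The kind of comparison involved can be previewed with a small numpy sketch at reduced layer widths. This is an illustration only, not an answer in the required form: part a) asks for TensorFlow objects and the full widths. The sketch assumes that $N(0, \frac{1}{2})$ denotes variance $\frac{1}{2}$, uses the usual Glorot scalings (standard deviation $\sqrt{2/(n_k + n_{k+1})}$ for the normal variant, limit $\sqrt{6/(n_k + n_{k+1})}$ for the uniform variant), and draws the Glorot normal weights from an untruncated normal, whereas the Keras initialiser uses a truncated one. All function names are hypothetical.

```python
import numpy as np

def forward_activations(weight_sampler, layer_sizes, rng):
    """Pass one random input through a tanh MLP with zero biases."""
    h = rng.normal(0.0, np.sqrt(0.5), size=layer_sizes[0])   # variance 1/2 (assumption)
    activations = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = weight_sampler(rng, n_out, n_in)   # shape (n_{k+1}, n_k), as in the text
        h = np.tanh(W @ h)
        activations.append(h)
    return activations

def standard_normal(rng, n_out, n_in):
    return rng.normal(0.0, 1.0, size=(n_out, n_in))

def glorot_normal(rng, n_out, n_in):
    # Untruncated normal used here for simplicity (assumption)
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def glorot_uniform(rng, n_out, n_in):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

# Widths reduced by a factor of 8 from those in the text to keep the sketch light
sizes = [128, 1024, 1024, 1024, 512, 512]
rng = np.random.default_rng(0)
samplers = {"standard normal": standard_normal,
            "glorot normal": glorot_normal,
            "glorot uniform": glorot_uniform}

saturation = {}
for name, sampler in samplers.items():
    acts = forward_activations(sampler, sizes, rng)
    # Fraction of units in each layer whose tanh output is nearly +/-1
    saturation[name] = [float(np.mean(np.abs(a) > 0.99)) for a in acts]
    print(name, np.round(saturation[name], 3))
```

The per-layer fraction of near-saturated units is only one summary; the histograms requested above show the full distribution.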

**(10 marks)**

d) Comment on your interpretation of the results in the previous parts of this question, what implications there might be for the successful training of the MLP model, and any limitations of the empirical study carried out.

**(5 marks)**

### Question 3 (Total 30 marks)

Consider the following MLP model, designed as an image classifier for the MNIST dataset:

$$
\begin{align}
\mathbf{h}^{(0)} &:= \mathbf{x}\\
\mathbf{h}^{(k)} &= \sigma\left( \mathbf{W}^{(k-1)}\mathbf{h}^{(k-1)} + \mathbf{b}^{(k-1)} \right),\qquad k=1,2\\
\hat{\mathbf{y}} &= \textrm{softmax}\left( \mathbf{W}^{(2)}\mathbf{h}^{(2)} + \mathbf{b}^{(2)} \right)
\end{align}
$$

where $\mathbf{x}\in\mathbb{R}^{784}$ is the flattened image input, $\mathbf{W}^{(k)}\in\mathbb{R}^{n_{k+1}\times n_k}$ and $\mathbf{b}^{(k)}\in\mathbb{R}^{n_{k+1}}$ ($k=0,1,2$) are the model weights and biases, and $n_k$ is the number of neurons in the $k$-th layer.

a) Construct this MLP model using the Sequential API. The model will have two hidden layers with 64 neurons each, using a sigmoid activation function, and take an input of shape `(28, 28)`. The output should be a 10-way softmax.

Load the MNIST dataset from the Keras API. Normalise the input pixel values to the interval $[0,1]$ and convert the labels to one-hot vectors. Do not shuffle the dataset. Save the training inputs and targets as Tensors `x_train` and `y_train` respectively. The validation/test partition can be discarded.
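The pixel scaling and one-hot conversion can be sketched in numpy as follows. The random images and labels are synthetic stand-ins (an assumption) for the MNIST arrays, and the final conversion to Tensors, for example with `tf.convert_to_tensor`, is left out to keep the sketch free of TensorFlow.

```python
import numpy as np

# Synthetic stand-ins (assumption): 5 greyscale images with uint8 pixels and integer labels
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
demo_labels = np.array([3, 0, 9, 1, 3])

# Scale pixel values from [0, 255] to [0, 1]
demo_x = demo_images.astype(np.float32) / 255.0

# One-hot encode: row i of the 10x10 identity matrix is the one-hot vector for class i
demo_y = np.eye(10, dtype=np.float32)[demo_labels]

print(demo_x.shape, demo_x.min(), demo_x.max())
print(demo_y)
```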

_Hint: you may find it helpful in later parts of this question to use separate Keras layers for the activation functions inside your model object._

**(5 marks)**

b) Suppose that the loss function used to train the MLP is the categorical cross entropy loss function, and $L_i$ denotes the (scalar-valued) loss with respect to the $i$-th input example. Consider the pre-activations in the final layer, given by $\mathbf{a}^{(3)} =  \mathbf{W}^{(2)}\mathbf{h}^{(2)} + \mathbf{b}^{(2)}$. 

Show that the error $\delta^{(3)} := \frac{\partial L_i}{\partial \mathbf{a}^{(3)}}$, the derivative of the loss with respect to the final-layer pre-activation values, is equal to $\hat{\mathbf{y}} - \mathbf{y}$, where $\hat{\mathbf{y}}$ is the output from the model, and $\mathbf{y}$ is the ground truth label, represented as a one-hot vector.
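A numerical sanity check can complement the written derivation, though it does not replace it. The sketch below compares a central finite-difference estimate of the gradient of the cross entropy with respect to the pre-activations against $\hat{\mathbf{y}} - \mathbf{y}$. The random pre-activation vector, the chosen class index and the step size `h` are arbitrary assumptions.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - a.max())   # subtract the max for numerical stability
    return z / z.sum()

def cross_entropy(a, y):
    """Categorical cross entropy of softmax(a) against the one-hot vector y."""
    return -np.sum(y * np.log(softmax(a)))

rng = np.random.default_rng(0)
a = rng.normal(size=10)   # arbitrary final-layer pre-activations
y = np.eye(10)[4]         # one-hot label for class 4 (arbitrary choice)

# Central finite differences, one coordinate of a at a time
h = 1e-6
numerical = np.array([
    (cross_entropy(a + h * np.eye(10)[j], y)
     - cross_entropy(a - h * np.eye(10)[j], y)) / (2 * h)
    for j in range(10)
])
analytic = softmax(a) - y

print(np.max(np.abs(numerical - analytic)))   # should be very small
```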

Write your answer below in Markdown. You do not need to write any code for this part.

**(7 marks)**

c) Write a function called `grads` that implements the backpropagation equations for this model to return the gradients of the categorical cross entropy loss function with respect to the parameters $\mathbf{W}^{(2)}$ and $\mathbf{b}^{(2)}$. This function should return these gradients as a list `[grads_W2, grads_b2]`, and it should take the following input arguments:

* `model`: your model object, defined using the Sequential API
* `inputs`: a Tensor of shape `(batch_size, 28, 28)`
* `y_true`: a Tensor of shape `(batch_size, 10)` containing the true labels as one-hot vectors

The function `grads` (and any other function it uses) should only use TensorFlow ops. In particular, it should not use automatic differentiation or other libraries (e.g. numpy).

You should make sure that your code is clearly written, and variables sensibly named. You might find it helpful to write separate helper functions to be used in the `grads` function.
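The shapes involved in the final-layer step can be checked with a short numpy sketch before writing the TensorFlow version. It is illustrative only, since the `grads` function itself must use TensorFlow ops. The sketch assumes the loss is averaged over the batch and stores the weights as $\mathbf{W}^{(2)}\in\mathbb{R}^{10\times 64}$ to match the equations above; a Keras `Dense` layer stores its kernel with shape `(64, 10)`, so the corresponding gradient would be the transpose. The helper names and random inputs are hypothetical.

```python
import numpy as np

def softmax_rows(a):
    z = np.exp(a - a.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def mean_cross_entropy(h2, W2, b2, y_true):
    y_pred = softmax_rows(h2 @ W2.T + b2)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def last_layer_grads(h2, W2, b2, y_true):
    """h2: (B, 64), W2: (10, 64), b2: (10,), y_true: (B, 10)."""
    y_pred = softmax_rows(h2 @ W2.T + b2)   # (B, 10)
    delta = y_pred - y_true                 # error at the final pre-activations
    batch_size = h2.shape[0]
    grads_W2 = delta.T @ h2 / batch_size    # (10, 64): batch mean of outer products
    grads_b2 = delta.mean(axis=0)           # (10,)
    return [grads_W2, grads_b2]

# Random stand-ins (assumption) for the second hidden layer's outputs and the labels
rng = np.random.default_rng(0)
h2 = rng.uniform(size=(16, 64))             # sigmoid outputs lie in (0, 1)
W2 = rng.normal(scale=0.1, size=(10, 64))
b2 = np.zeros(10)
y_true = np.eye(10)[rng.integers(0, 10, size=16)]

grads_W2, grads_b2 = last_layer_grads(h2, W2, b2, y_true)
print(grads_W2.shape, grads_b2.shape)
```

Comparing individual entries against finite differences of `mean_cross_entropy` is a cheap way to catch transposition or averaging mistakes.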

**(15 marks)**

d) Compute the gradients on the first 16 examples in the training set using your `grads` function and model. Print out the gradients that are computed. 

**(3 marks)**