# ELEC 400M / EECE 571M Assignment 2: Neural networks
(This assignment is a modified version of an assignment used in ECE 421 at the University of Toronto and kindly made available to us by the instructor.)

In this assignment, you will implement a neural network model for multi-class classification. The purpose is to demonstrate an understanding of the basic elements of neural network models, including their training. Hence, you will implement the network from scratch, using only functions from the NumPy library.

The neural network you will be implementing has the following structure:
* 3 layers: 1 input layer, 1 hidden layer with ReLU activation and 1 output layer with Softmax function
* The loss function is the Cross Entropy Loss.
* Training will be done using Gradient Descent with Momentum. 
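
The structure listed above can be sketched as a single forward pass. This is only an illustration under stated assumptions: the input `X` is random stand-in data, the hidden width `H = 16` is chosen to keep the example small, and the helper `add_bias` (a hypothetical name) prepends a column of ones so that each weight matrix also carries the bias weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, H, K = 4, 784, 16, 10  # samples, inputs, hidden units (small on purpose), classes

X = rng.random((N, d))                      # stand-in for N flattened images
W_h = rng.normal(0, 0.01, size=(d + 1, H))  # hidden-layer weights, first row = bias
W_o = rng.normal(0, 0.01, size=(H + 1, K))  # output-layer weights, first row = bias

def add_bias(A):
    # Prepend a column of ones (hypothetical helper for this sketch).
    return np.hstack([np.ones((A.shape[0], 1)), A])

hidden = np.maximum(0, add_bias(X) @ W_h)   # hidden layer with ReLU activation
s_out = add_bias(hidden) @ W_o              # inputs to the softmax function

# Row-wise softmax; subtracting the row maximum avoids overflow in exp.
e = np.exp(s_out - s_out.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

print(probs.shape)
```

Each row of `probs` is a probability vector over the 10 classes, which is what the cross entropy loss is evaluated on.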

## Data Set
We again consider the data set of images of letters in different fonts contained in the file notMNIST.npz (originally from http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html). This time we consider all 10 letters ("A" to "J") contained in this data set, and we want to classify the images according to the letter they display. The figure below shows 30 randomly selected image samples for the letters.

![](sample_images_2.eps)


You will apply the function `loadData` given below to load the data set, which includes 18720 images and their labels, which we also refer to as targets. The function shuffles the data and splits it into training (15000 samples), validation (1000 samples) and test (2720 samples) sets.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [None]:
def loadData():
    with np.load('notMNIST.npz') as data:
        Data, Target = data['images'], data['labels']
        np.random.seed(521)
        randIndx = np.arange(len(Data))
        np.random.shuffle(randIndx)
        Data = Data[randIndx]/255.0
        Target = Target[randIndx]
        trainData, trainTarget = Data[:15000], Target[:15000]
        validData, validTarget = Data[15000:16000], Target[15000:16000]
        testData, testTarget = Data[16000:], Target[16000:]
       
    return trainData, validData, testData, trainTarget, validTarget, testTarget

## Data preprocessing [5 points]

Input data: The classification should be based on the $d=28\times 28=784$ intensity values in an image (as for Assignment 1).
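
A minimal sketch of turning images into input vectors, assuming the images arrive as an array of shape `(N, 28, 28)`; the random array `images` stands in for the real data, and prepending a column of ones is one possible way to build the augmented input mentioned for `computeLayer`.

```python
import numpy as np

# Stand-in for image data; assumed shape (N, 28, 28) with values in [0, 1].
rng = np.random.default_rng(0)
images = rng.random((5, 28, 28))

# Flatten every image into a row vector of d = 784 intensity values.
X = images.reshape(images.shape[0], -1)

# Optionally prepend a column of ones so the bias becomes part of the weight matrix.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

print(X.shape, X_aug.shape)
```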

Output data: Since you will be performing multi-class classification, the labels will be converted into a one-hot encoding format. 

Please first briefly explain the meaning of one-hot encoding and why it is used (instead of keeping the numerical label values provided by the data set). State an example for a one-hot encoded label for the data set considered in this assignment.

YOUR ANSWER HERE

Now implement a function that one-hot encodes the labels (or targets) for the training, validation and test sets. 

In [None]:
def convertOneHot(trainTarget, validTarget, testTarget):
    trainTargetOneHot = np.zeros([trainTarget.shape[0],10])
    trainTargetOneHot[np.arange(trainTarget.size),trainTarget]=1
    
    validTargetOneHot = np.zeros([validTarget.shape[0],10])
    validTargetOneHot[np.arange(validTarget.size),validTarget] = 1
    
    testTargetOneHot = np.zeros([testTarget.shape[0],10])
    testTargetOneHot[np.arange(testTarget.size),testTarget]=1
    
    
    return trainTargetOneHot, validTargetOneHot, testTargetOneHot
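
To illustrate what the encoding looks like, here is a small sketch with made-up labels. The array `labels` is hypothetical, and the mapping of class index 0 to "A" through 9 to "J" is an assumption about the data set. Indexing the identity matrix is an equivalent, compact alternative to the loop-free assignment used above.

```python
import numpy as np

labels = np.array([0, 3, 9])   # hypothetical class indices (assumed: 0 = "A", ..., 9 = "J")
one_hot = np.eye(10)[labels]   # row i of the identity matrix is the code for class i

print(one_hot[1])              # one-hot code of label 3

# argmax along each row recovers the original integer labels.
recovered = np.argmax(one_hot, axis=1)
```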

## Structure of the network [2 points]

Sketch the structure of the network to classify the letters from the data set. Identify the dimensions of the network layers, include the activation functions, and do not forget the bias nodes. (You may sketch this by hand and upload a photo of your sketch.)

YOUR ANSWER HERE

## Helper functions [6 points]
To give the implementation of the network some structure, you will first implement five helper functions. 

Use NumPy arrays for your implementations, and organize data in vectors and matrices as appropriate for compact programming.

1. `relu`: This function will accept one argument and return the ReLU activation: 
    $$\mathrm{ReLU}(x)=\max(0,x).$$
    
2. `softmax`: This function will accept one argument and return the softmax activations:
    $$ [\sigma(\mathbf{z})]_j = \frac{\mathrm{e}^{z_j}}{\sum\limits_{k=1}^K\mathrm{e}^{z_k}},$$ $j=1,2,\ldots, K$,  for $K$ classes.

3. `computeLayer`: This function will accept two arguments, the input vector $\mathbf{x}$ for a layer and the weight matrix $\mathbf{W}$, and return a vector $\mathbf{s}=\mathbf{W}^T\mathbf{x}$, i.e., the input to the activation function of the layer (the notation for variables from the textbook is used). Don't forget to account for the bias term (which can be included in an augmented vector $\mathbf{x}$ as in the textbook).

4. `CE`: This function will accept two arguments, the one-hot encoded labels $\mathbf{y}_n$ and the inputs $\mathbf{s}_n$ to the softmax function, $n=1,2,\ldots N$. It will return the cross entropy loss
$$\mathrm{E}_{\mathrm{in}}=-\frac{1}{N}\sum\limits_{n=1}^N\sum\limits_{k=1}^Ky_{n,k}\log([\sigma(\mathbf{s}_n)]_k)$$

5. `gradCE`: This function will accept two arguments, the labels and the inputs to the softmax function. It will return the gradient of the cross entropy loss with respect to the inputs (i.e., it returns the sensitivity vector for the output layer as introduced in the textbook). 

First state the analytical expression for the gradient used in `gradCE` and then implement the five helper functions.

**Solution:**

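The analytical expression for the gradient used in `gradCE` is

$$\boldsymbol{\delta}^{(L)}_{n}=\frac{\partial e_{n}}{\partial \mathbf{s}^{(L)}_{n}}=\sigma\left(\mathbf{s}^{(L)}_{n}\right)-\mathbf{y}_{n},$$

where $e_n=-\sum_{k=1}^K y_{n,k}\log([\sigma(\mathbf{s}^{(L)}_n)]_k)$ is the loss of sample $n$, so that $E_{\mathrm{in}}=\frac{1}{N}\sum_{n=1}^N e_n$.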



In [None]:
def relu(x):
    return np.maximum(0,x)

In [None]:
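def softmax(x):
    # Row-wise softmax: each row of x holds the K scores of one sample.
    # Subtracting the row maximum leaves the result unchanged but avoids overflow in exp.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)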

In [None]:
def computeLayer(x,W):
    # Assuming x includes bias
    return np.matmul(x,W)

In [None]:
def CE(target, prediction):
    # target and prediction have shape (N, K): one row per sample, one column per class
    return (-1.0/target.shape[0])*np.sum(np.multiply(target,np.log(softmax(prediction))))

In [None]:
def gradCE(target, prediction):
    # Returns Sensitivity Vector for Layer L
    return softmax(prediction)-target
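
One way to gain confidence in the analytical gradient is a finite-difference check. The sketch below is self-contained and uses its own helpers (`softmax_rows` and `sample_losses_sum` are names introduced here, not part of the assignment); it compares $\sigma(\mathbf{s}_n)-\mathbf{y}_n$ against numerical derivatives of the per-sample losses on random data.

```python
import numpy as np

def softmax_rows(s):
    # Row-wise, numerically stable softmax (local helper for this sketch).
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sample_losses_sum(y, s):
    # Sum over n of e_n = -sum_k y[n, k] * log(softmax(s[n])[k]).
    return -np.sum(y * np.log(softmax_rows(s)))

rng = np.random.default_rng(0)
N, K = 4, 10
s = rng.normal(size=(N, K))                 # random softmax inputs
y = np.eye(K)[rng.integers(0, K, size=N)]   # random one-hot labels

analytic = softmax_rows(s) - y              # sensitivity of the output layer

# Central finite differences, perturbing one entry of s at a time.
eps = 1e-6
numeric = np.zeros_like(s)
for n in range(N):
    for k in range(K):
        d = np.zeros_like(s)
        d[n, k] = eps
        numeric[n, k] = (sample_losses_sum(y, s + d)
                         - sample_losses_sum(y, s - d)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

Because $e_n$ depends only on row $n$ of the inputs, differentiating the sum of the per-sample losses with respect to entry $(n,k)$ gives exactly the $k$-th component of $\boldsymbol{\delta}_n$. Note that the factor $1/N$ from $E_{\mathrm{in}}$ is not part of this per-sample quantity.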

## Backpropagation [2 points]

The training of the network will be done via backpropagation. First derive the following gradients:
1. $\frac{\partial E_{\mathrm{in}}}{\partial \mathbf{W}^{\mathrm{o}}}$, where $\mathbf{W}^{\mathrm{o}}$ is the weight matrix of the output layer.

2. $\frac{\partial E_{\mathrm{in}}}{\partial \mathbf{W}^{\mathrm{h}}}$, where $\mathbf{W}^{\mathrm{h}}$ is the weight matrix of the hidden layer.

Write the results using the steps and notation used in the textbook.

**Solution:**

1. $\frac{\partial E_{\mathrm{in}}}{\partial \mathbf{W}^{\mathrm{o}}}$, where $\mathbf{W}^{\mathrm{o}}$ is the weight matrix of the output layer.



## Network training [8 points]

Implement a function to train the network. The function uses the helper functions from above. The gradients are computed via backpropagation, and the weights are updated using Gradient Descent with Momentum:
$$\mathbf{V}(t)=\alpha \mathbf{V}(t-1)-\eta\frac{\partial E_{\mathrm{in}}}{\partial \mathbf{W}(t)}$$
and 
$$\mathbf{W}(t+1)=\mathbf{W}(t)+\mathbf{V}(t),$$
where $\eta$ is the learning rate and $\alpha$ is the momentum hyperparameter.
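
The two update equations can be sketched on a toy problem. The function name `momentum_step` and the quadratic objective $E(\mathbf{W})=\frac{1}{2}\|\mathbf{W}\|^2$ (whose gradient is $\mathbf{W}$) are assumptions made for illustration; the values of $\eta$ and $\alpha$ here are not the ones prescribed for the network.

```python
import numpy as np

def momentum_step(W, V, grad, eta, alpha):
    # V(t) = alpha * V(t-1) - eta * dE/dW(t)
    V_new = alpha * V - eta * grad
    # W(t+1) = W(t) + V(t)
    return W + V_new, V_new

# Toy objective E(W) = 0.5 * ||W||^2 with gradient W (illustrative only).
W = np.array([1.0, -2.0])
V = np.full_like(W, 1e-5)   # small initial velocity
for t in range(200):
    W, V = momentum_step(W, V, grad=W, eta=0.1, alpha=0.9)

print(np.linalg.norm(W))    # distance from the minimizer at the origin
```

In the network, the same step would be applied separately to the hidden-layer and output-layer weight matrices, each with its own velocity matrix.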

The training function accepts the following inputs: training data (features), training labels, weight matrix of the hidden layer, weight matrix of the output layer, number of iterations, parameters $\eta$ and $\alpha$, validation data, validation labels, test data, test labels. The validation and test inputs default to `None` and need not be passed. You will also need to initialize the velocity matrices $\mathbf{V}$ for both hidden layer and output layer weights to small values, e.g. $10^{-5}$.

The function outputs the updated weight matrices, the losses and classification accuracies for the training data, and if validation and test inputs were provided, then it also outputs the classification accuracies for the validation and test data.
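
A classification accuracy of the kind mentioned above could be computed as follows. This is a minimal sketch: `accuracy` is a hypothetical helper name, and the small `scores` and `targets` arrays are made up for illustration.

```python
import numpy as np

def accuracy(scores, targets_one_hot):
    # Fraction of samples whose highest-scoring class matches the one-hot label.
    predicted = np.argmax(scores, axis=1)
    actual = np.argmax(targets_one_hot, axis=1)
    return np.mean(predicted == actual)

# Three samples, three classes (made-up numbers).
scores = np.array([[2.0, 0.1, -1.0],
                   [0.0, 0.5, 0.2],
                   [1.0, 3.0, 0.0]])
targets = np.eye(3)[[0, 2, 1]]

acc = accuracy(scores, targets)
print(acc)
```

Since softmax preserves the ordering of its inputs within a row, the argmax can be taken either on the softmax inputs or on the softmax outputs.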

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Network test [4 points]

Write a script that constructs the neural network.

Initialize your weight matrices by drawing the elements i.i.d. at random from a zero-mean Gaussian distribution with variance $$\sigma_w^2=\frac{2}{\mbox{number of input nodes}+\mbox{number of output nodes}}$$ (Xavier initialization, see http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).
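
A minimal sketch of this initialization, assuming the weight matrices are stored with shape (inputs, outputs) and include one extra row for the bias weights. The function name `xavier_init` is hypothetical, and counting the bias row among the input nodes is a choice made here, not something the assignment prescribes.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Zero-mean Gaussian with variance 2 / (n_in + n_out).
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)
d, H, K = 784, 1000, 10           # input size, hidden units, classes

# The "+ 1" accounts for the bias row (assumption: augmented inputs).
W_h = xavier_init(d + 1, H, rng)  # hidden-layer weights
W_o = xavier_init(H + 1, K, rng)  # output-layer weights

print(W_h.shape, W_o.shape, W_h.var())
```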

Build a network with 1000 hidden units and train it for 200 epochs using $\alpha=0.9$ and $\eta=10^{-5}$. Plot the training, validation and testing accuracy curves. State the training, validation and testing accuracies after training. Show the plot and the accuracies in the next markdown cell.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Hyperparameter investigation [3 points]

Continue to use $\alpha=0.9$ and $\eta=10^{-5}$.

Test your network with 500, 1500 and 2500 hidden nodes and train for 200 epochs. Based on the validation accuracy, comment on the number of epochs after which training could be terminated early.

Plot the training and validation accuracy curves for all three network sizes and 200 training epochs, and report the test accuracy for your selected network size and training length. Show the plot and the accuracies in the next markdown cell.

(Training the largest network for 200 epochs should take about 30 to 60 minutes.)
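
One possible way to read an early-stopping point off a validation curve is a patience rule: stop once the validation accuracy has not improved by more than a tolerance for a given number of epochs. The sketch below is only an illustration; `early_stop_epoch`, `patience` and `tol` are names and parameters introduced here, and the curve `val_acc` is synthetic, not a result.

```python
import numpy as np

def early_stop_epoch(val_acc, patience=10, tol=1e-3):
    # Track the best epoch; stop once `patience` epochs pass without
    # an improvement larger than `tol`. Returns the best epoch index.
    best, best_epoch = -np.inf, 0
    for epoch, acc in enumerate(val_acc):
        if acc > best + tol:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Synthetic, saturating validation curve (made-up shape, for illustration only).
epochs = np.arange(200)
val_acc = 0.9 * (1.0 - np.exp(-epochs / 20.0))

stop = early_stop_epoch(val_acc)
print(stop)
```

With real curves, which fluctuate from epoch to epoch, the choice of `patience` and `tol` matters and should be justified from the plots.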

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE