# Neural Networks

<img src="MLP_diagram.png">
<br><br>
We use the neural network shown above to classify the MNIST database of handwritten digits (0-9). It is a multi-layer perceptron (MLP, another term for a fully connected feedforward network) for a $K$-class classification problem.

Let $(x\in\mathbb{R}^D, y\in\{1,2,\cdots,K\})$ be a labeled instance. Such an MLP performs the following computations, where the hidden activation $h$ is computed with either tanh or relu.
<br><br><br><br>
$$
\begin{align}
 \textbf{input features}: \hspace{15pt} & x \in \mathbb{R}^D \\
 \textbf{linear}^{(1)}: \hspace{15pt} & u = W^{(1)}x + b^{(1)} \hspace{2em}, W^{(1)} \in \mathbb{R}^{M\times D} \text{ and } b^{(1)} \in \mathbb{R}^{M}  \label{linear_forward}\\
 \textbf{tanh}:\hspace{15pt} & h =\cfrac{2}{1+e^{-2u}}-1 \label{tanh_forward}\\
 \textbf{relu}: \hspace{15pt} & h = \max\{0, u\} =
\begin{bmatrix}
\max\{0, u_1\}\\
\vdots \\
\max\{0, u_M\}\\
\end{bmatrix} \label{relu_forward}\\
 \textbf{linear}^{(2)}: \hspace{15pt} & a = W^{(2)}h + b^{(2)} \hspace{2em}, W^{(2)} \in \mathbb{R}^{K\times M} \text{ and } b^{(2)} \in \mathbb{R}^{K} \label{linear2_forward}\\
 \textbf{softmax}: \hspace{15pt} & z = \begin{bmatrix}
\cfrac{e^{a_1}}{\sum_{k} e^{a_{k}}}\\
\vdots \\
\cfrac{e^{a_K}}{\sum_{k} e^{a_{k}}} \\
\end{bmatrix}\\
 \textbf{predicted label}: \hspace{15pt} & \hat{y} = \arg\max_k z_k.
%& l = -\sum_{k} y_{k}\log{\hat{y_{k}}} \hspace{2em}, \vy \in \mathbb{R}^{k} \text{ and } y_k=1 \text{ if } \vx \text{ belongs to the } k' \text{-th class}.
\end{align}
$$
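
The computations above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation in `neural_networks`: the function name `mlp_forward`, the layer sizes, and the random parameters are assumptions made for the example, and the predicted label is a 0-based index rather than a value in $\{1,\cdots,K\}$.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, activation="relu"):
    """Forward pass of the two-layer MLP for one input x of shape (D,)."""
    u = W1 @ x + b1                      # linear^(1): shape (M,)
    if activation == "relu":
        h = np.maximum(0.0, u)           # element-wise max{0, u}
    elif activation == "tanh":
        h = np.tanh(u)                   # equals 2 / (1 + exp(-2u)) - 1
    else:
        raise ValueError("activation must be 'relu' or 'tanh'")
    a = W2 @ h + b2                      # linear^(2): shape (K,)
    e = np.exp(a - np.max(a))            # subtract max(a) for numerical stability
    z = e / e.sum()                      # softmax probabilities
    return z, int(np.argmax(z))          # probabilities and 0-based predicted label

# Illustrative sizes only (assumed, not taken from the data file).
rng = np.random.default_rng(0)
D, M, K = 784, 16, 10
W1 = rng.normal(0.0, 0.1, size=(M, D))
b1 = np.zeros(M)
W2 = rng.normal(0.0, 0.1, size=(K, M))
b2 = np.zeros(K)
x = rng.random(D)

z, y_hat = mlp_forward(x, W1, b1, W2, b2, activation="tanh")
print(z.sum(), y_hat)
```

Subtracting $\max_k a_k$ before exponentiating does not change $z$, since the shift cancels between numerator and denominator, but it avoids overflow for large logits.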


For a $K$-class classification problem, one popular loss function for training (i.e., to learn $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, $b^{(2)}$) is the cross-entropy loss.
Specifically, we denote the cross-entropy loss with respect to the training example $(x, y)$ by $l$:
<br><br>
$$
\begin{align}
  l = -\log (z_y) = \log \left( 1 + \sum_{k\neq y} e^{a_k - a_y} \right)
\end{align}
$$
<br><br>
Note that one should look at $l$ as a function of the parameters of the network, that is, $W^{(1)}, b^{(1)}, W^{(2)}$ and $b^{(2)}$.
For ease of notation, let us define the one-hot (i.e., 1-of-$K$) encoding of a class $y$ as

\begin{align}
y \in \mathbb{R}^K \text{ and }
y_k =
\begin{cases}
1, \text{ if }y = k,\\
0, \text{ otherwise}.
\end{cases} 
\end{align}
so that
\begin{align} 
l = -\sum_{k} y_{k}\log{z_k} = 
-y^T
\begin{bmatrix}
\log z_1\\
\vdots \\
\log z_K\\
\end{bmatrix}
= -y^T\log{z}.
\end{align}
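
As a quick numerical check, the three forms of the loss given above ($-\log z_y$, the $\log(1 + \sum_{k \neq y} e^{a_k - a_y})$ form, and $-y^T \log z$) can be compared on a small example. The sketch below assumes a 0-based integer label and made-up logits; the helper name `cross_entropy_sketch` is hypothetical and is not the `softmax_cross_entropy` class imported from `utils`.

```python
import numpy as np

def cross_entropy_sketch(a, y):
    """Cross-entropy loss -log(z_y) for logits a of shape (K,) and 0-based label y."""
    shifted = a - np.max(a)                               # stabilise the exponentials
    log_z = shifted - np.log(np.sum(np.exp(shifted)))     # log-softmax
    return -log_z[y], log_z

a = np.array([2.0, 1.0, 0.1])   # made-up logits
y = 0                           # made-up label (0-based)

loss, log_z = cross_entropy_sketch(a, y)

# Second form: log(1 + sum_{k != y} exp(a_k - a_y))
others = np.delete(a, y)
loss_alt = np.log(1.0 + np.sum(np.exp(others - a[y])))

# Third form: -y^T log z with a one-hot vector
one_hot = np.zeros_like(a)
one_hot[y] = 1.0
loss_one_hot = -one_hot @ log_z

print(loss, loss_alt, loss_one_hot)
```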

We then use error backpropagation, a way to compute partial derivatives (or gradients) of the loss w.r.t. the parameters of a neural network, together with gradient-based optimization to learn the parameters.
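
For the network defined above, applying the chain rule layer by layer gives the gradients below. This is a sketch for a single training example without dropout, where $y$ denotes the one-hot vector, $\odot$ the element-wise product, and $\textbf{1}[\cdot]$ the element-wise indicator; at $u_m = 0$ the relu derivative is taken to be 0 by convention.

$$
\begin{align}
\frac{\partial l}{\partial a} &= z - y, \\
\frac{\partial l}{\partial W^{(2)}} &= \frac{\partial l}{\partial a}\, h^T, \qquad
\frac{\partial l}{\partial b^{(2)}} = \frac{\partial l}{\partial a}, \qquad
\frac{\partial l}{\partial h} = {W^{(2)}}^T \frac{\partial l}{\partial a}, \\
\frac{\partial l}{\partial u} &=
\begin{cases}
\cfrac{\partial l}{\partial h} \odot (1 - h \odot h), & \text{for tanh},\\
\cfrac{\partial l}{\partial h} \odot \textbf{1}[u > 0], & \text{for relu},
\end{cases} \\
\frac{\partial l}{\partial W^{(1)}} &= \frac{\partial l}{\partial u}\, x^T, \qquad
\frac{\partial l}{\partial b^{(1)}} = \frac{\partial l}{\partial u}.
\end{align}
$$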

### 1. Mini-batch Gradient Descent 
Mini-batch gradient descent is a gradient-based optimization method used to learn the parameters of the neural network. 
<br>
$$
\begin{align}
\upsilon = \alpha \upsilon - \eta \delta_t\\
w_t = w_{t-1} + \upsilon
\end{align}
$$
<br>
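Use the formulas above to update the weights with momentum. <br>
Here,
$\alpha$ is the discount factor (momentum coefficient), with $\alpha \in (0, 1)$; setting $\alpha = 0$ recovers plain mini-batch gradient descent <br>
$\upsilon$ is the velocity update<br>
$\eta$ is the learning rate<br>
$\delta_t$ is the gradient of the loss w.r.t. the weights, computed on the current mini-batch<br>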
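
The update rule above can be sketched as a small function. The name `momentum_step` and the toy quadratic objective are assumptions made for illustration; this is not the `miniBatchGradientDescent` or `add_momentum` code used by the notebook.

```python
import numpy as np

def momentum_step(w, v, grad, learning_rate=0.01, alpha=0.9):
    """One momentum update: v <- alpha * v - eta * grad, then w <- w + v."""
    v = alpha * v - learning_rate * grad
    w = w + v
    return w, v

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)            # velocity starts at zero in this sketch
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)

print(np.linalg.norm(w))        # should be close to zero after 200 steps
```

With `alpha=0.0` the velocity carries no history, so each step reduces to `w - learning_rate * grad`.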

### 2. Linear Layer
This is the linear layer of the MLP. Initialize $W$ with random values using `np.random.normal` with mean 0 and standard deviation 0.1, and initialize the gradients to zeros in the same function. The backward pass computes the gradients of $W$ and $b$.

$$
\begin{align}
\text{forward pass:}\hspace{2em} &
u = \text{linear}^{(1)}\text{.forward}(x) = W^{(1)}x + b^{(1)},\\
&\text{where } W^{(1)} \text{ and } b^{(1)} \text{ are its parameters.}\nonumber\\ 
\nonumber\\
\text{backward pass:}\hspace{2em} &[\frac{\partial l}{\partial x}, \frac{\partial l}{\partial W^{(1)}}, \frac{\partial l}{\partial b^{(1)}}] = \text{linear}^{(1)}\text{.backward}(x, \frac{\partial l}{\partial u}).
\end{align}
$$
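
A minimal sketch of such a layer is shown below, assuming a mini-batch `X` of shape `(N, D)` with one example per row, so that $u = Wx + b$ becomes `X @ W.T + b`. The class name `LinearSketch` and the zero initialization of `b` are assumptions for illustration; the actual `linear_layer` in `neural_networks` may be organized differently.

```python
import numpy as np

class LinearSketch:
    """Hypothetical linear layer u = W x + b, applied row-wise to a batch."""

    def __init__(self, input_dim, output_dim, rng):
        # W ~ Normal(mean 0, std 0.1), as described in the text.
        self.W = rng.normal(0.0, 0.1, size=(output_dim, input_dim))
        self.b = np.zeros(output_dim)          # assumption: bias starts at zero
        self.grad_W = np.zeros_like(self.W)    # gradients initialised to zeros
        self.grad_b = np.zeros_like(self.b)

    def forward(self, X):
        # X: (N, D) -> U: (N, M)
        return X @ self.W.T + self.b

    def backward(self, X, grad_U):
        # grad_U: dl/dU of shape (N, M); gradients are summed over the batch.
        self.grad_W = grad_U.T @ X             # (M, D)
        self.grad_b = grad_U.sum(axis=0)       # (M,)
        return grad_U @ self.W                 # dl/dX of shape (N, D)

rng = np.random.default_rng(0)
layer = LinearSketch(input_dim=4, output_dim=3, rng=rng)
X = rng.normal(size=(5, 4))
U = layer.forward(X)
grad_X = layer.backward(X, np.ones_like(U))
print(U.shape, grad_X.shape, layer.grad_W.shape, layer.grad_b.shape)
```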

### 3. Activation function - tanh
The tanh activation function is applied element-wise to $u$. 
$$
\begin{align}
\textbf{tanh}:\hspace{15pt} & h =\cfrac{2}{1+e^{-2u}}-1\\
\end{align}
$$
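
A possible forward and backward pass for tanh is sketched below. The function names are hypothetical; the backward pass uses the standard identity $\frac{\partial h}{\partial u} = 1 - h^2$ and is compared against a finite-difference estimate.

```python
import numpy as np

def tanh_forward(u):
    """Element-wise tanh; equals 2 / (1 + exp(-2u)) - 1."""
    return np.tanh(u)

def tanh_backward(u, grad_h):
    """dl/du = dl/dh * (1 - tanh(u)^2), element-wise."""
    h = np.tanh(u)
    return grad_h * (1.0 - h ** 2)

u = np.linspace(-2.0, 2.0, 9)
analytic = tanh_backward(u, np.ones_like(u))

# Central finite differences as a sanity check on the derivative.
eps = 1e-6
numeric = (tanh_forward(u + eps) - tanh_forward(u - eps)) / (2.0 * eps)
print(np.max(np.abs(analytic - numeric)))
```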

### 4. Activation function - relu
ReLU (rectified linear unit) is another activation function, also applied element-wise to $u$. 

$$
\begin{align}
\textbf{relu}: \hspace{15pt} & h = \max\{0, u\} =
\begin{bmatrix}
\max\{0, u_1\}\\
\vdots \\
\max\{0, u_M\}\\
\end{bmatrix}
\end{align}
$$
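
A sketch of the relu forward and backward passes follows. The function names are hypothetical, and the derivative at exactly $u_m = 0$ is taken to be 0, which is a common convention rather than something stated in the text.

```python
import numpy as np

def relu_forward(u):
    """Element-wise max{0, u}."""
    return np.maximum(0.0, u)

def relu_backward(u, grad_h):
    """dl/du passes the gradient through only where u > 0."""
    return grad_h * (u > 0)

u = np.array([-1.0, 0.0, 2.0])
h = relu_forward(u)
grad_u = relu_backward(u, np.array([5.0, 5.0, 5.0]))
print(h, grad_u)
```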

### 5. Dropout
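To prevent overfitting, we usually add regularization. Dropout is another way of handling overfitting: during training it randomly zeroes entries of its input and rescales the remaining ones by $\frac{1}{1-r}$. We define the forward and the backward passes as follows.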

\begin{align}
\text{forward pass:}\hspace{2em} &
{s} = \text{dropout}\text{.forward}({q}\in\mathbb{R}^J) = \frac{1}{1-r}\times
\begin{bmatrix}
\textbf{1}[p_1 \geq r] \times q_1\\
\vdots \\
\textbf{1}[p_J \geq r] \times q_J\\
\end{bmatrix},
\\
\nonumber\\
&\text{where } p_j \text{ is sampled uniformly from }[0, 1), \forall j\in\{1,\cdots,J\}, \nonumber\\
&\text{and } r\in [0, 1) \text{ is a pre-defined scalar named dropout rate}.
\end{align}
\begin{align}
\text{backward pass:}\hspace{2em} &\frac{\partial l}{\partial {q}} = \text{dropout}\text{.backward}({q}, \frac{\partial l}{\partial {s}})=
\frac{1}{1-r}\times
\begin{bmatrix}
\textbf{1}[p_1 \geq r] \times \cfrac{\partial l}{\partial s_1}\\
\vdots \\
\textbf{1}[p_J \geq r] \times \cfrac{\partial l}{\partial s_J}\\
\end{bmatrix}.
\end{align}

Note that $p_j, j\in\{1,\cdots,J\}$ and $r$ are not learned, so we do not need to compute the derivatives w.r.t. them. Moreover, $p_j, j\in\{1,\cdots,J\}$ are re-sampled in every forward pass and kept for the following backward pass. The dropout rate $r$ is set to 0 during testing, so the layer then passes its input through unchanged.
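
The forward and backward passes above can be sketched as follows. The class name `DropoutSketch` and the `is_train` flag are assumptions made for this example (the flag plays the role of setting $r = 0$ at test time); the `dropout` class in `neural_networks` may differ.

```python
import numpy as np

class DropoutSketch:
    """Hypothetical inverted-dropout layer with rate r in [0, 1)."""

    def __init__(self, r, rng):
        self.r = r
        self.rng = rng
        self.mask = None

    def forward(self, q, is_train=True):
        if is_train:
            p = self.rng.random(q.shape)                    # p_j ~ Uniform[0, 1)
            self.mask = (p >= self.r) / (1.0 - self.r)      # 1[p_j >= r] / (1 - r)
        else:
            self.mask = np.ones_like(q)                     # same as r = 0
        return q * self.mask

    def backward(self, q, grad_s):
        # Reuse the mask sampled in the preceding forward pass.
        return grad_s * self.mask

rng = np.random.default_rng(0)
layer = DropoutSketch(r=0.25, rng=rng)
q = np.ones(10000)

s = layer.forward(q)                            # training mode: entries are 0 or 1 / (1 - r)
grad_q = layer.backward(q, np.ones_like(q))     # uses the same mask as the forward pass
s_test = layer.forward(q, is_train=False)       # test mode: identity
print(s.mean(), s_test.mean())
```

Scaling by $\frac{1}{1-r}$ keeps the expected value of each output equal to its input, which is why no extra rescaling is needed at test time.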

```python
import argparse
import sys

from neural_networks import main

# Clear the notebook kernel's command-line arguments so argparse uses the defaults below.
sys.argv = ['']

# Run 1: relu activation, no momentum (alpha = 0.0), no dropout.
parser = argparse.ArgumentParser()
parser.add_argument('--random_seed', default=42)
parser.add_argument('--learning_rate', default=0.01)
parser.add_argument('--alpha', default=0.0)
parser.add_argument('--lambda', default=0.0)
parser.add_argument('--dropout_rate', default=0.0)
parser.add_argument('--num_epoch', default=10)
parser.add_argument('--minibatch_size', default=5)
parser.add_argument('--activation', default='relu')
parser.add_argument('--input_file', default='mnist_subset.json')
args = parser.parse_args()
main_params = vars(args)  # convert the parsed arguments to a plain dict
main(main_params)
```

At epoch 1
Training loss at epoch 1 is 330.1989480383847
Training accuracy at epoch 1 is 0.8992
Validation accuracy at epoch 1 is 0.903
At epoch 2
Training loss at epoch 2 is 214.72871312475223
Training accuracy at epoch 2 is 0.9426
Validation accuracy at epoch 2 is 0.928
At epoch 3
Training loss at epoch 3 is 158.3293031904497
Training accuracy at epoch 3 is 0.956
Validation accuracy at epoch 3 is 0.932
At epoch 4
Training loss at epoch 4 is 122.53548073976626
Training accuracy at epoch 4 is 0.9732
Validation accuracy at epoch 4 is 0.936
At epoch 5
Training loss at epoch 5 is 101.20980823990539
Training accuracy at epoch 5 is 0.9802
Validation accuracy at epoch 5 is 0.941
At epoch 6
Training loss at epoch 6 is 78.78150386882093
Training accuracy at epoch 6 is 0.9874
Validation accuracy at epoch 6 is 0.942
At epoch 7
Training loss at epoch 7 is 68.65701316349937
Training accuracy at epoch 7 is 0.9898
Validation accuracy at epoch 7 is 0.947
At epoch 8
Training loss at epoch 8 is 59.5483

([330.1989480383847,
  214.72871312475223,
  158.3293031904497,
  122.53548073976626,
  101.20980823990539,
  78.78150386882093,
  68.65701316349937,
  59.54833394999111,
  46.89296297058256,
  43.517096785973166],
 [65.50061441820793,
  50.84967263201994,
  45.51357273448952,
  43.09664852512866,
  41.93889203204472,
  39.006270866201,
  37.51718816295636,
  35.69871360326422,
  35.70158095571034,
  35.10064515601373])

```python
# Run 2: tanh activation with momentum (alpha = 0.9) and dropout rate 0.25.
# Reuses argparse, main, and the cleared sys.argv from the previous cell.
parser = argparse.ArgumentParser()
parser.add_argument('--random_seed', default=42)
parser.add_argument('--learning_rate', default=0.01)
parser.add_argument('--alpha', default=0.9)
parser.add_argument('--lambda', default=0.0)
parser.add_argument('--dropout_rate', default=0.25)
parser.add_argument('--num_epoch', default=10)
parser.add_argument('--minibatch_size', default=5)
parser.add_argument('--activation', default='tanh')
parser.add_argument('--input_file', default='mnist_subset.json')
args = parser.parse_args()
main_params = vars(args)  # convert the parsed arguments to a plain dict
main(main_params)
```

At epoch 1
Training loss at epoch 1 is 1485.9423066914571
Training accuracy at epoch 1 is 0.815
Validation accuracy at epoch 1 is 0.83
At epoch 2
Training loss at epoch 2 is 948.4358639759693
Training accuracy at epoch 2 is 0.8904
Validation accuracy at epoch 2 is 0.88
At epoch 3
Training loss at epoch 3 is 861.7931225157394
Training accuracy at epoch 3 is 0.9076
Validation accuracy at epoch 3 is 0.891
At epoch 4
Training loss at epoch 4 is 465.93812790020087
Training accuracy at epoch 4 is 0.945
Validation accuracy at epoch 4 is 0.925
At epoch 5
Training loss at epoch 5 is 865.0176855759477
Training accuracy at epoch 5 is 0.9118
Validation accuracy at epoch 5 is 0.884
At epoch 6
Training loss at epoch 6 is 396.8032749306144
Training accuracy at epoch 6 is 0.958
Validation accuracy at epoch 6 is 0.928
At epoch 7
Training loss at epoch 7 is 354.224859912969
Training accuracy at epoch 7 is 0.9592
Validation accuracy at epoch 7 is 0.942
At epoch 8
Training loss at epoch 8 is 892.104517797

([1485.9423066914571,
  948.4358639759693,
  861.7931225157394,
  465.93812790020087,
  865.0176855759477,
  396.8032749306144,
  354.224859912969,
  892.1045177979262,
  320.97949501683354,
  157.1995671705035],
 [282.82032361844574,
  231.6451654596812,
  244.9950386003027,
  186.53752665336742,
  259.0499962866089,
  191.25386829647206,
  145.43853640120957,
  299.6363552804344,
  224.11053640911814,
  198.27128976289993])