# Feed-Forward Neural Network

Note: This notebook assumes a basic knowledge of neural nets: layers and layer sizes, activation functions, batching, softmax, and so on.

By the end of the notebook we will have built a simple feed-forward neural net that learns to recognize handwritten digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/).

We'll start by training a simple neural network to classify XOR:

<table>
    <thead><tr><td>a</td><td>b</td><td>a XOR b</td></tr></thead>
    <tbody>
        <tr><td>0</td><td>0</td><td>0</td></tr>
        <tr><td>0</td><td>1</td><td>1</td></tr>
        <tr><td>1</td><td>0</td><td>1</td></tr>
        <tr><td>1</td><td>1</td><td>0</td></tr>
    </tbody>
</table>

---

First, we define the structure of our network:
<img src="XOR-nn.png" width="60%">

- The first layer (aka the input layer) has two inputs corresponding to $a$ and $b$.
- The middle / hidden layer is composed of three neurons.
- The final layer (aka the output layer) has two outputs. 

The output of the neural network is a vector of length 2 where the first entry is the probability of the result being 0 and the second entry is the probability of the result being 1.

## Feed-Forward

It's called a **Feed-Forward Neural Net** because we **feed the input forward** through the network, from the input layer to the output layer.

Here's how we implement the feed-forward algorithm:

$$
Z_1 = X_1 \cdot W_1 \\
X_2 = \text{ReLU}(Z_1) \\
Z_2 = X_2 \cdot W_2 \\
\hat{Y} = \text{Softmax}(Z_2)
$$

Note: $\large \cdot$ represents matrix multiplication.

First, some notation:
1. $X_1$ is the input.
    - It can either be a single instance, e.g. \[0, 0\] (1 x 2), or a batch of instances, e.g. \[\[0,0\], \[0,1\], \[1,0\]\] (3 x 2).
2. $W_1$ is the first weight matrix, with a shape of (2 x 3).
3. $W_2$ is the second weight matrix, with a shape of (3 x 2).

- We forward our input through the first layer and get out $Z_1$. 
- We then apply a ReLU activation function on $Z_1$ and get $X_2$. 
- We then forward $X_2$ through the second layer and get $Z_2$.
- Finally we apply softmax on $Z_2$ to get $\hat{Y}$, a vector holding one probability per class (0 or 1).
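The four steps above can be sketched in plain NumPy. This is a minimal illustration rather than the `nn` module's implementation: the names `relu`, `softmax`, and `forward_sketch` are hypothetical, the weights are arbitrary random values, and no bias terms are used, matching the equations above.

```python
import numpy as np

def relu(z):
    # Element-wise max(z, 0)
    return np.maximum(z, 0)

def softmax(z):
    # Subtract the row-wise max before exponentiating for numerical stability
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward_sketch(X1, W1, W2):
    # Hypothetical helper mirroring the four equations above
    Z1 = X1 @ W1          # (batch x 2) @ (2 x 3) -> (batch x 3)
    X2 = relu(Z1)
    Z2 = X2 @ W2          # (batch x 3) @ (3 x 2) -> (batch x 2)
    Y_hat = softmax(Z2)
    return Z1, X2, Z2, Y_hat

# Arbitrary weights, only to exercise the shapes (an assumption, not trained values)
rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, size=(2, 3))
W2 = rng.uniform(-1, 1, size=(3, 2))

X1 = np.array([[0, 0], [0, 1], [1, 0]])  # a batch of three instances
Z1, X2, Z2, Y_hat = forward_sketch(X1, W1, W2)
print(Y_hat.shape)  # (3, 2)
```

Each row of `Y_hat` sums to 1. Because there is no bias term, the input \[0, 0\] always produces $Z_2 = 0$ and therefore the probabilities \[0.5, 0.5\], whatever the weights are.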

In [7]:
%load_ext autoreload
%autoreload 2

import numpy as np

import nn


In [24]:
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# y is one-hot encoded: row i of y is the label for row i of X.
# A 1 in position 0 means the output is 0;
# a 1 in position 1 means the output is 1.
y = np.array([[1,0],[0,1],[0,1],[1,0]])

print('X:')
print(X, '\n')
print('y:')
print(y)

print('''
y is "one-hotted". 

Each row in X corresponds to an output row in y. 

Since the first row of y has a 1 in 
the zeroth position, it means that 
the output is a 0.

Since in the second row the 1 is in the 1st
position, it means the output is a 1.''')

X:
[[0 0]
 [0 1]
 [1 0]
 [1 1]] 

y:
[[1 0]
 [0 1]
 [0 1]
 [1 0]]

y is "one-hotted". 

Each row in X corresponds to an output row in y. 

Since the first row of y has a 1 in 
the zeroth position, it means that 
the output is a 0.

Since in the second row the 1 is in the 1st
position, it means the output is a 1.
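As a small illustration of the encoding described above, the one-hot matrix can be built from integer class labels by indexing into an identity matrix, and `argmax` recovers the labels. The variable names here are hypothetical and chosen only for this sketch.

```python
import numpy as np

# XOR outputs for the four rows of X, as integer class labels
labels = np.array([0, 1, 1, 0])
n_classes = 2

# Row k of the identity matrix is the one-hot vector for class k
y_one_hot = np.eye(n_classes, dtype=int)[labels]
print(y_one_hot)

# argmax along each row maps a one-hot vector back to its class label
recovered = y_one_hot.argmax(axis=1)
print(recovered)
```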


## Initializing the layers

It's common practice to initialize the weights of each layer by drawing from a uniform distribution over the interval
$$
\left[-\sqrt{\frac{6}{n_{inputs} + n_{outputs}}},\ \sqrt{\frac{6}{n_{inputs} + n_{outputs}}}\right]
$$

a scheme known as **Glorot uniform** initialization.
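A minimal sketch of this initialization in NumPy follows. The function name `glorot_uniform` and its signature are assumptions made for illustration; `nn.Layer` may implement it differently.

```python
import numpy as np

def glorot_uniform(n_inputs, n_outputs, rng):
    # Hypothetical helper: draw an (n_inputs x n_outputs) weight matrix
    # uniformly from [-limit, limit] with limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_inputs + n_outputs))
    return rng.uniform(-limit, limit, size=(n_inputs, n_outputs))

rng = np.random.default_rng(0)
W1_sketch = glorot_uniform(2, 3, rng)  # same shape as W1 (2 x 3)
W2_sketch = glorot_uniform(3, 2, rng)  # same shape as W2 (3 x 2)
print(W1_sketch.shape, W2_sketch.shape)  # (2, 3) (3, 2)
```

For both layers of the XOR network $n_{inputs} + n_{outputs} = 5$, so every weight falls within roughly $\pm 1.095$.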

In [26]:
np.random.seed(0)
layer1 = nn.Layer(2, 3)
layer2 = nn.Layer(3, 2)

print('W1')
print(layer1.weights, '\n')
print('W2')
print(layer2.weights, '\n')

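# `forward` is not defined in this cell: it is assumed to be a helper that
# takes a list of weight matrices and returns the output of every layer.
Ws = [layer1.weights, layer2.weights]
print('Preliminary predictions:')
print(forward(Ws, X)[-1].argmax(axis=1))

W1
[[ 0.10694503  0.47145628  0.22514328]
 [ 0.09833413 -0.16726395  0.31963799]] 

W2
[[-0.13673957  0.85833164]
 [ 1.01583421 -0.25536684]
 [ 0.63913754  0.0633056 ]] 

Preliminary predictions: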

Right now the network is predicting everything to be a 0. 

So we need to learn the right weights to give the correct output. This is where backpropagation comes in.

## Backpropagation

Backpropagation is how we learn the set of weights that produces the desired output.

We assume a cost function $J$ (here, the cross-entropy loss) and use the chain rule to find its partial derivatives with respect to each weight matrix:

$$
\frac{\partial J}{\partial W_2} = \frac{\partial J}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial W_2} \\
\frac{\partial J}{\partial W_1} = \frac{\partial J}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial X_2} \cdot \frac{\partial X_2}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial W_1}
$$
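To make the chain rule concrete, here is a sketch of both gradients in NumPy for the two-layer network defined earlier. It rests on a few assumptions not stated in the notebook: $J$ is the cross-entropy averaged over the batch, the ReLU derivative at exactly 0 is taken to be 0, and the helper names (`loss_and_grads` and friends) are hypothetical. It uses the standard result that, for softmax followed by cross-entropy, $\frac{\partial J}{\partial Z_2} = (\hat{Y} - Y) / N$.

```python
import numpy as np

def softmax(z):
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(Y_hat, Y):
    # Assumption: J is the mean cross-entropy over the batch
    return -np.sum(Y * np.log(Y_hat)) / Y.shape[0]

def loss_and_grads(X1, Y, W1, W2):
    # Forward pass, same equations as in the Feed-Forward section
    Z1 = X1 @ W1
    X2 = np.maximum(Z1, 0)
    Z2 = X2 @ W2
    Y_hat = softmax(Z2)
    J = cross_entropy(Y_hat, Y)

    # Backward pass: each line is one factor of the chain rule
    dZ2 = (Y_hat - Y) / X1.shape[0]  # dJ/dZ2 (softmax + cross-entropy combined)
    dW2 = X2.T @ dZ2                 # dJ/dW2, shape (3 x 2)
    dX2 = dZ2 @ W2.T                 # dJ/dX2
    dZ1 = dX2 * (Z1 > 0)             # dJ/dZ1, ReLU passes gradient where Z1 > 0
    dW1 = X1.T @ dZ1                 # dJ/dW1, shape (2 x 3)
    return J, dW1, dW2

X1 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

# Arbitrary starting weights, used only to exercise the computation
rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, size=(2, 3))
W2 = rng.uniform(-1, 1, size=(3, 2))

J, dW1, dW2 = loss_and_grads(X1, Y, W1, W2)
print(dW1.shape, dW2.shape)  # (2, 3) (3, 2)
```

Each gradient has the same shape as the weight matrix it belongs to, which is what allows a gradient descent update of the form `W -= learning_rate * dW`.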