In [7]:
%load_ext autoreload
%autoreload 2

import numpy as np

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Feed-Forward Neural Network

By the end of this notebook we will have built a simple feed-forward neural network that learns to recognize handwritten digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/).

We'll start by training a simple neural network to classify XOR ($\oplus$):

<table>
    <thead><tr><td>a</td><td>b</td><td>$a \oplus b$</td></tr></thead>
    <tbody>
        <tr><td>0</td><td>0</td><td>0</td></tr>
        <tr><td>0</td><td>1</td><td>1</td></tr>
        <tr><td>1</td><td>0</td><td>1</td></tr>
        <tr><td>1</td><td>1</td><td>0</td></tr>
    </tbody>
</table>

---

First we define the structure of our network:
<div style="text-align: center;">
    <img src="static/2-3-1-nn.png" width="60%">
</div>

- The first layer (aka the input layer) has two inputs corresponding to $a$ and $b$.
- The middle / hidden layer is composed of three neurons.
- The final layer (aka the output layer) has a single output. 

The output of the neural network is the predicted probability that an instance belongs to the positive class.
So anything above 0.5 we can classify as 1 and everything below 0.5 we can classify as 0.
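
As a small illustration of that decision rule, the sketch below thresholds a few probabilities at 0.5. The values in `probs` are made up for the example and are not outputs of the network; sending a probability of exactly 0.5 to class 0 is also an assumption, since the text leaves that case open.

```python
import numpy as np

# Hypothetical predicted probabilities (illustrative values only).
probs = np.array([0.50, 0.63, 0.74, 0.12])

# Strictly above 0.5 -> class 1, everything else -> class 0.
preds = (probs > 0.5).astype(int)
print(preds)
```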

We start by setting up our data according to the table above.

In [8]:
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

y = np.array([0,1,1,0])

print('X:')
print(X, '\n')
print('y:')
print(y)

X:
[[0 0]
 [0 1]
 [1 0]
 [1 1]] 

y:
[0 1 1 0]


## Initializing the layers

It's common practice to initialize the weights of each layer by drawing from a uniform distribution over the interval
$$
\left[-\sqrt{\frac{6}{n_{inputs} + n_{outputs}}},\ \sqrt{\frac{6}{n_{inputs} + n_{outputs}}}\right]
$$

where $n_{inputs}$ and $n_{outputs}$ are the number of inputs and outputs of the layer. This is known as **Glorot uniform** initialization.

We'll create the network above, which means we have two weight matrices with shapes 2x3 and 3x1. The output of the first layer goes through a ReLU activation, while the output of the second layer goes through a sigmoid activation. The result is the output of our model.
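
One way to apply the Glorot uniform rule is to compute the limit separately for each layer from that layer's own input and output sizes. The sketch below does this for the 2x3 and 3x1 matrices. The helper name `glorot_uniform` is hypothetical, and it uses `np.random.default_rng` with an arbitrary seed, so the numbers it draws will not match the weights printed in this notebook.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Draw an (n_in, n_out) weight matrix from U(-limit, limit)."""
    # The limit depends only on this layer's fan-in and fan-out.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the example

w1 = glorot_uniform(2, 3, rng)  # input layer -> hidden layer
w2 = glorot_uniform(3, 1, rng)  # hidden layer -> output layer

print(w1.shape, w2.shape)
```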

In [11]:
np.random.seed(0)

n_inputs, hidden_size, n_outputs = 2, 3, 1

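# Note: a single bound, computed from the network's overall input and
# output sizes, is reused for both weight matrices below.
bounds = np.sqrt(6 / (n_inputs + n_outputs))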

weights1 = np.random.uniform(-bounds, bounds, (n_inputs, hidden_size))
weights2 = np.random.uniform(-bounds, bounds, (hidden_size, n_outputs))

print(weights1)
print()
print(weights2)

[[ 0.13806544  0.60864744  0.29065872]
 [ 0.12694881 -0.21593684  0.41265087]]

[[-0.17653002]
 [ 1.10810138]
 [ 1.31143633]]


In [29]:
X @ weights1

array([[ 0.        ,  0.        ,  0.        ],
       [ 0.12694881, -0.21593684,  0.41265087],
       [ 0.13806544,  0.60864744,  0.29065872],
       [ 0.26501425,  0.3927106 ,  0.70330959]])

In [46]:
# Forward pass through the network.
z1 = X @ weights1               # hidden-layer pre-activations
x2 = np.maximum(z1, 0)          # ReLU
z2 = x2 @ weights2              # output-layer pre-activations
y_hat = 1 / (1 + np.exp(-z2))   # sigmoid

print('Z1')
print(z1)
print('\nX2')
print(x2)
print('\nZ2')
print(z2)
print('\nYhat')
print(y_hat)

Z1
[[ 0.          0.          0.        ]
 [ 0.12694881 -0.21593684  0.41265087]
 [ 0.13806544  0.60864744  0.29065872]
 [ 0.26501425  0.3927106   0.70330959]]

X2
[[0.         0.         0.        ]
 [0.12694881 0.         0.41265087]
 [0.13806544 0.60864744 0.29065872]
 [0.26501425 0.3927106  0.70330959]]

Z2
[[0.        ]
 [0.51875506]
 [1.03125078]
 [1.31072593]]

Yhat
[[0.5       ]
 [0.62685661]
 [0.73715831]
 [0.78763461]]


Here's how to interpret $Z_1$. Our input was a batch of 4 instances with 2 features each. The hidden layer has 3 neurons, corresponding to the 3 columns of $Z_1$, and each row of $Z_1$ corresponds to one input instance.

So the number in the second row and third column of $Z_1$ (0.41265087) is the pre-activation output of the third neuron in the hidden layer for the second input instance (in this case \[0, 1\]).

The same logic holds for $X_2$, $Z_2$, and $\hat Y$.
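
The row/column reading can be checked directly by indexing. This sketch rebuilds `z1` from the first-layer weights copied from the printed output above and picks out the entry for the second instance and the third hidden neuron (NumPy indices are zero-based, so that is `z1[1, 2]`).

```python
import numpy as np

X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# First-layer weights copied from the printed output of the notebook.
weights1 = np.array([[0.13806544, 0.60864744, 0.29065872],
                     [0.12694881, -0.21593684, 0.41265087]])

z1 = X @ weights1   # one row per instance, one column per hidden neuron
print(z1.shape)     # (4, 3)

# Second instance ([0, 1]), third hidden neuron.
# Because the instance is [0, 1], this equals weights1[1, 2].
print(z1[1, 2])
```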

In [6]:

model = nn.NN([
    nn.Layer(2, 3, activation=nn.ReLU),
    nn.Layer(3, 1, activation=nn)
])

print(model)
print('Forwarding X:\n')
print(model.forward(X))

print("\nThe model's predictions:")
model.predict(X)

1) Fully connected layer: (2, 3)
2) Fully connected layer: (3, 1)

Forwarding X:

[[1.]
 [1.]
 [1.]
 [1.]]

The model's predictions:


array([0, 0, 0, 0])

Right now the network predicts 0 for every input.

So we need to learn the right weights to produce the correct output. This is where backpropagation comes in.

## Backpropagation

In backpropagation we compute how the cost changes with respect to each weight, so that we can learn the set of weights that gives the desired output.

In this example, assume our network is a general 3-layer network like the one below (instead of 2 layers). This helps illustrate the pattern that arises in backpropagation.

<div style="text-align: center;">
    <img src="static/3-layer-nn.png" width="60%" />
</div>

We assume we have a cost function (denoted $J$, in this case cross-entropy loss). Using the chain rule, we find the partial derivatives of the cost function with respect to each weight matrix:


$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_3} =  \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{W}_3}
$$

$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_2} = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} \cdot \frac{\partial \mathbf{Z}_2}{\partial \mathbf{W}_2}
$$

$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_1} = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} \cdot \frac{\partial \mathbf{Z}_2}{\partial \mathbf{X}_2} \cdot \frac{\partial \mathbf{X}_2}{\partial \mathbf{Z}_1} \cdot \frac{\partial \mathbf{Z}_1}{\partial \mathbf{W}_1}
$$

**Let's break down the formulas:**

$\large \frac{\partial \mathbf{J}}{\partial \mathbf{W}_3}$:

$\text{We'll set }\delta_3 = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} = \mathbf{\hat{Y}} - \mathbf{Y}$

This simplification holds because the output activation is a sigmoid and $J$ is the cross-entropy loss.

Since $\mathbf{Z}_3 = \mathbf{X}_3 \cdot \mathbf{W}_3$, we have $\frac{\partial \mathbf{Z}_3}{{\partial \mathbf{W}_3}} = \mathbf{X}_3$. So in total:

$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_3} = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{W}_3} = \delta_3 \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{W}_3} = {\mathbf{X}_3}^T \cdot \delta_3
$$
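
The formula ${\mathbf{X}_3}^T \cdot \delta_3$ can be sanity-checked against a finite-difference gradient. The sketch below is only an illustration under stated assumptions: the inputs `X3`, weights `W3`, and targets `Y` are made-up values, the output activation is a sigmoid, and $J$ is the cross-entropy summed (not averaged) over the batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_hat, y):
    # Binary cross-entropy summed over the batch (an assumption of this sketch).
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
X3 = rng.normal(size=(4, 3))   # made-up activations entering the last layer
W3 = rng.normal(size=(3, 1))   # made-up last-layer weights
Y = np.array([[0.], [1.], [1.], [0.]])

# Analytic gradient: X3^T . delta3, with delta3 = Y_hat - Y.
delta3 = sigmoid(X3 @ W3) - Y
grad_analytic = X3.T @ delta3

# Central finite differences, one weight at a time.
eps = 1e-6
grad_numeric = np.zeros_like(W3)
for i in range(W3.shape[0]):
    step = np.zeros_like(W3)
    step[i, 0] = eps
    loss_plus = cross_entropy(sigmoid(X3 @ (W3 + step)), Y)
    loss_minus = cross_entropy(sigmoid(X3 @ (W3 - step)), Y)
    grad_numeric[i, 0] = (loss_plus - loss_minus) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))
```

If the loss were averaged over the batch instead of summed, the analytic gradient would need an extra factor of $1/N$.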

$\large \frac{\partial \mathbf{J}}{\partial \mathbf{W}_2}$:


$\text{Now set }\delta_2 = \delta_3 \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2}$

Since $\mathbf{Z}_3 = \mathbf{X}_3 \cdot \mathbf{W}_3$, we have $\frac{\partial \mathbf{Z}_3}{{\partial \mathbf{X}_3}} = \mathbf{W}_3$, and since $\mathbf{X}_3 = \text{ReLU}(\mathbf{Z}_2)$, we have $\frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} = \text{ReLU}'(\mathbf{Z}_2)$. Therefore:

$$
\delta_2 = \delta_3 \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} = \delta_3 \cdot {\mathbf{W}_3}^T * \text{ReLU}'(\mathbf{Z}_2)
$$
*Note: $*$ indicates element-wise multiplication.*

Now notice:
$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_2} = \delta_2 \cdot \frac{\partial \mathbf{Z}_2}{{\partial \mathbf{W}_2}} = {\mathbf{X}_2}^T \cdot \delta_2
$$

$\large \frac{\partial \mathbf{J}}{\partial \mathbf{W}_1}$:


$\text{Now set }\delta_1 = \delta_2 \cdot \frac{\partial \mathbf{Z}_2}{\partial \mathbf{X}_2} \cdot \frac{\partial \mathbf{X}_2}{\partial \mathbf{Z}_1} \to \delta_1 = \delta_2 \cdot {\mathbf{W}_2}^T * \text{ReLU}'(\mathbf{Z}_1)$

and notice:
$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_1} = \delta_1 \cdot \frac{\partial \mathbf{Z}_1}{\partial \mathbf{W}_1} = {\mathbf{X}_1}^T \cdot \delta_1
$$
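
Put together, the three steps follow one repeating pattern: compute $\delta$ for a layer from the $\delta$ of the layer after it, then multiply by the transposed input of that layer. A minimal NumPy sketch of this pattern is below. It is an illustration rather than the notebook's `nn` implementation: the function names are hypothetical, the hidden widths of 4 are made up, ReLU is assumed on both hidden layers with a sigmoid output, and the ReLU derivative at exactly 0 is taken to be 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0)

def relu_grad(z):
    # Derivative of ReLU; the value at exactly 0 is taken to be 0 here.
    return (z > 0).astype(float)

def backward(X1, Y, W1, W2, W3):
    # Forward pass, keeping the intermediate values needed later.
    Z1 = X1 @ W1
    X2 = relu(Z1)
    Z2 = X2 @ W2
    X3 = relu(Z2)
    Y_hat = sigmoid(X3 @ W3)

    # Backward pass: each delta is built from the one after it.
    delta3 = Y_hat - Y
    delta2 = (delta3 @ W3.T) * relu_grad(Z2)
    delta1 = (delta2 @ W2.T) * relu_grad(Z1)

    # Gradients with respect to W1, W2 and W3.
    return X1.T @ delta1, X2.T @ delta2, X3.T @ delta3

rng = np.random.default_rng(0)
X1 = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
W1 = rng.normal(size=(2, 4))   # made-up layer sizes
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(4, 1))

dW1, dW2, dW3 = backward(X1, Y, W1, W2, W3)
print(dW1.shape, dW2.shape, dW3.shape)
```

Each gradient has the same shape as the weight matrix it belongs to, which is a quick check that the transposes are in the right places.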

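Finally, the derivative of the sigmoid $\sigma(\mathbf{Z}_3) = \frac{1}{1 + e^{-\mathbf{Z}_3}}$ follows from the quotient rule:

$$
\sigma'(\mathbf{Z}_3) = \left(\frac{1}{1 + e^{-\mathbf{Z}_3}}\right)' = \frac{0 \cdot (1 + e^{-\mathbf{Z}_3}) - 1 \cdot (-e^{-\mathbf{Z}_3})}{(1 + e^{-\mathbf{Z}_3})^2} = \frac{e^{-\mathbf{Z}_3}}{(1 + e^{-\mathbf{Z}_3})^2} = \sigma(\mathbf{Z}_3)\,(1 - \sigma(\mathbf{Z}_3))
$$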
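
The closed form of the sigmoid derivative, $\sigma(z)\,(1 - \sigma(z))$, can be compared numerically with a central finite difference. The grid of `z` values and the step size `eps` below are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)   # arbitrary test points

# Closed-form derivative of the sigmoid.
analytic = sigmoid(z) * (1 - sigmoid(z))

# Central finite-difference approximation of the same derivative.
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
```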