In [10]:
%load_ext autoreload
%autoreload 2

import numpy as np

import nn

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

y = np.array([[1,0],[0,1],[0,1],[1,0]])

print('X:')
print(X, '\n')
print('y:')
print(y)

print('''
y is "one-hotted". 

Each row in X corresponds to an output row in y. 

Since the first row of y has a 1 in 
the zeroth position, it means that 
the output is a 0.

Since in the second row the 1 is in the 1st
position, it means the output is a 1.''')

X:
[[0 0]
 [0 1]
 [1 0]
 [1 1]] 

y:
[[1 0]
 [0 1]
 [0 1]
 [1 0]]

y is "one-hotted". 

Each row in X corresponds to an output row in y. 

Since the first row of y has a 1 in 
the zeroth position, it means that 
the output is a 0.

Since in the second row the 1 is in the 1st
position, it means the output is a 1.
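
The one-hot encoding described above can be sketched in a few lines of NumPy. This is only an illustration: the names `labels`, `one_hot`, and `decoded` are made up here and are not part of the `nn` module.

```python
import numpy as np

# XOR class labels for the four rows of X (hypothetical helper array).
labels = np.array([0, 1, 1, 0])

# One-hot encode: row i gets a 1 in column labels[i].
n_classes = 2
one_hot = np.eye(n_classes, dtype=int)[labels]

# Decode: the column index of the 1 in each row is the class label.
decoded = one_hot.argmax(axis=1)

print(one_hot)
print(decoded)
```

Indexing the identity matrix with the label array picks out one row per sample, which is why `one_hot` matches `y` above.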


## Initializing the layers

It's common practice to initialize the weights of each layer by drawing from a uniform distribution on the interval
$$
\left[-\sqrt{\frac{6}{n_{inputs} + n_{outputs}}},\ \sqrt{\frac{6}{n_{inputs} + n_{outputs}}}\right]
$$

where $n_{inputs}$ and $n_{outputs}$ are the layer's input and output sizes. This scheme is known as **Glorot uniform** initialization.
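
As a rough sketch of the formula above, here is one way to draw such a weight matrix with NumPy. The function `glorot_uniform` is a hypothetical helper written for this example. How `nn.Layer` actually initializes its weights is not shown in this notebook, so treat this as an assumption.

```python
import numpy as np

def glorot_uniform(n_inputs, n_outputs, rng):
    """Draw an (n_inputs, n_outputs) matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (n_inputs + n_outputs))
    return rng.uniform(-limit, limit, size=(n_inputs, n_outputs))

rng = np.random.default_rng(0)

# Same shape as the first layer of the model, nn.Layer(2, 3).
W1 = glorot_uniform(2, 3, rng)
limit = np.sqrt(6.0 / (2 + 3))

print(W1.shape)
print(np.abs(W1).max() <= limit)
```

Because the bound shrinks as the layer gets wider, larger layers start with smaller weights.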

In [12]:
np.random.seed(0)
model = nn.NN([
    nn.Layer(2, 3),
    nn.Layer(3, 2)
])

print(model)
print('Forwarding X:\n')
print(model.forward(X))

print("\nThe model's predictions:")
model.predict(X)

1) Fully connected layer: (2, 3)
2) Fully connected layer: (3, 2)

Forwarding X:

[[0.5        0.5       ]
 [0.52153874 0.47846126]
 [0.65079658 0.34920342]
 [0.62154818 0.37845182]]

The model's predictions:


array([0, 0, 0, 0])

Right now the network predicts class 0 for every input, so it is right on only two of the four rows.

We need to learn weights that produce the correct outputs. This is where backpropagation comes in.
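
To see where the all-zero predictions come from, we can take the forwarded probabilities printed above and pick the most likely class per row. This assumes `model.predict` is an argmax over the columns of `model.forward`, which the notebook does not confirm. The array `probs` is copied by hand from the output.

```python
import numpy as np

# Probabilities copied from the "Forwarding X" output above.
probs = np.array([[0.5,        0.5       ],
                  [0.52153874, 0.47846126],
                  [0.65079658, 0.34920342],
                  [0.62154818, 0.37845182]])

# Assumed prediction rule: index of the largest probability in each row.
# On the tie in the first row, argmax returns the first index (0).
preds = probs.argmax(axis=1)

# Class labels recovered from the one-hot y.
targets = np.array([0, 1, 1, 0])
accuracy = (preds == targets).mean()

print(preds)
print(accuracy)
```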

## Backpropagation

Backpropagation computes the gradient of the cost with respect to every weight in the network. Those gradients tell us how to adjust the weights to move the output toward the desired one.

For this derivation, assume a general 3-layer network like the one below (instead of the 2-layer model above). The extra layer makes the pattern in the gradients easier to see.

<div style="text-align: center;">
    <img src="static/3-layer-nn.png" width="60%" />
</div>

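Given a cost function $J$ (here, cross-entropy loss), we apply the chain rule to find its partial derivatives with respect to each weight matrix. Here $\mathbf{X}_k$ is the input to layer $k$, $\mathbf{Z}_k = \mathbf{X}_k \cdot \mathbf{W}_k$ is its pre-activation, and $\mathbf{\hat{Y}}$ is the network's output: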


$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_3} =  \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{W}_3}
$$

$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_2} = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} \cdot \frac{\partial \mathbf{Z}_2}{\partial \mathbf{W}_2}
$$

$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_1} = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} \cdot \frac{\partial \mathbf{Z}_2}{\partial \mathbf{X}_2} \cdot \frac{\partial \mathbf{X}_2}{\partial \mathbf{Z}_1} \cdot \frac{\partial \mathbf{Z}_1}{\partial \mathbf{W}_1}
$$

**Let's break down the formulas:**

$\large \frac{\partial \mathbf{J}}{\partial \mathbf{W}_3}$:

$\text{We'll set }\delta_3 = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} = \mathbf{\hat{Y}} - \mathbf{Y}$

(This simplification holds when the output layer is a softmax and $J$ is the cross-entropy loss.)

Since $\mathbf{Z}_3 = \mathbf{X}_3 \cdot \mathbf{W}_3$, we have $\frac{\partial \mathbf{Z}_3}{\partial \mathbf{W}_3} = \mathbf{X}_3$. So in total:

$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_3} = \frac{\partial \mathbf{J}}{\partial \mathbf{\hat{Y}}} \cdot \frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{Z}_3} \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{W}_3} = \delta_3 \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{W}_3} = {\mathbf{X}_3}^T \cdot \delta_3
$$
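
One way to convince yourself of this result is to compare ${\mathbf{X}_3}^T \cdot \delta_3$ against a finite-difference estimate of the gradient. The sketch below assumes a softmax output layer and a cross-entropy loss summed over samples (no averaging). All the names and the random data are made up for the check.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y_hat, Y):
    """Cross-entropy summed over all samples (an assumption of this sketch)."""
    return -np.sum(Y * np.log(Y_hat))

rng = np.random.default_rng(0)
X3 = rng.normal(size=(4, 3))   # stand-in for the input to the last layer
W3 = rng.normal(size=(3, 2))   # stand-in for the last layer's weights
Y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

# Analytic gradient from the derivation: X3^T . delta3
delta3 = softmax(X3 @ W3) - Y
grad_W3 = X3.T @ delta3

# Numerical gradient via central differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W3)
for i in range(W3.shape[0]):
    for j in range(W3.shape[1]):
        W_plus, W_minus = W3.copy(), W3.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(softmax(X3 @ W_plus), Y)
                         - cross_entropy(softmax(X3 @ W_minus), Y)) / (2 * eps)

max_abs_error = np.abs(grad_W3 - numeric).max()
print(max_abs_error)
```

If the loss were averaged over the batch instead of summed, both gradients would simply be divided by the number of samples.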

$\large \frac{\partial \mathbf{J}}{\partial \mathbf{W}_2}$:


$\text{Now set }\delta_2 = \delta_3 \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2}$

Since $\mathbf{Z}_3 = \mathbf{X}_3 \cdot \mathbf{W}_3$, we have $\frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} = \mathbf{W}_3$. And since $\mathbf{X}_3 = \text{ReLU}(\mathbf{Z}_2)$, we have $\frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} = \text{ReLU}'(\mathbf{Z}_2)$. Therefore:

$$
\delta_2 = \delta_3 \cdot \frac{\partial \mathbf{Z}_3}{\partial \mathbf{X}_3} \cdot \frac{\partial \mathbf{X}_3}{\partial \mathbf{Z}_2} = \left(\delta_3 \cdot {\mathbf{W}_3}^T\right) * \text{ReLU}'(\mathbf{Z}_2)
$$
*Note:* $*$ indicates element-wise multiplication.

Now notice:
$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_2} = \delta_2 \cdot \frac{\partial \mathbf{Z}_2}{{\partial \mathbf{W}_2}} = {\mathbf{X}_2}^T \cdot \delta_2
$$
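
The shapes are worth checking here, since the matrix product and the element-wise product are easy to mix up. Below is a small sketch with made-up layer sizes and a random stand-in for $\delta_3$. The helper `relu_grad` is hypothetical and takes the derivative of ReLU to be 0 at exactly 0.

```python
import numpy as np

def relu_grad(Z):
    """Derivative of ReLU: 1 where Z > 0, else 0 (taking 0 at Z == 0)."""
    return (Z > 0).astype(Z.dtype)

rng = np.random.default_rng(1)
n, d_in, d_hidden, d_out = 4, 5, 3, 2      # illustrative sizes only

X2 = rng.normal(size=(n, d_in))            # input to layer 2
W2 = rng.normal(size=(d_in, d_hidden))
W3 = rng.normal(size=(d_hidden, d_out))
delta3 = rng.normal(size=(n, d_out))       # stand-in for Y_hat - Y

Z2 = X2 @ W2

# Matrix product first, then the element-wise product with ReLU'(Z2).
delta2 = (delta3 @ W3.T) * relu_grad(Z2)

# Gradient for W2 has the same shape as W2 itself.
grad_W2 = X2.T @ delta2

print(delta2.shape, grad_W2.shape)
```

Note that `delta2` is zero wherever `Z2` is not positive, so units that were switched off by the ReLU pass no gradient back.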

$\large \frac{\partial \mathbf{J}}{\partial \mathbf{W}_1}$:


$\text{Now set }\delta_1 = \delta_2 \cdot \frac{\partial \mathbf{Z}_2}{\partial \mathbf{X}_2} \cdot \frac{\partial \mathbf{X}_2}{\partial \mathbf{Z}_1}$

By the same argument as for $\delta_2$, with $\mathbf{X}_2 = \text{ReLU}(\mathbf{Z}_1)$:

$$
\delta_1 = \left(\delta_2 \cdot {\mathbf{W}_2}^T\right) * \text{ReLU}'(\mathbf{Z}_1)
$$

and notice:
$$
\frac{\partial \mathbf{J}}{\partial \mathbf{W}_1} = \delta_1 \cdot \frac{\partial \mathbf{Z}_1}{\partial \mathbf{W}_1} = {\mathbf{X}_1}^T \cdot \delta_1
$$
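
The pattern in the three derivations above (compute a $\delta$, multiply by the transposed layer input, then push $\delta$ back one layer) can be written as a single loop. The following is a minimal sketch, not the `nn` module's implementation. It assumes ReLU hidden layers, a softmax output, no biases, and a cross-entropy loss summed over samples. All function names are made up for this example. It checks the result against finite differences on the XOR data.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def forward(X, weights):
    """Return the layer inputs X_k, the pre-activations Z_k, and Y_hat."""
    inputs, pre_acts = [], []
    A = X
    for k, W in enumerate(weights):
        inputs.append(A)
        Z = A @ W
        pre_acts.append(Z)
        is_last = k == len(weights) - 1
        A = softmax(Z) if is_last else relu(Z)
    return inputs, pre_acts, A

def loss(X, Y, weights):
    Y_hat = forward(X, weights)[2]
    return -np.sum(Y * np.log(Y_hat))

def backward(X, Y, weights):
    """Gradients of the loss w.r.t. each weight matrix, last layer first."""
    inputs, pre_acts, Y_hat = forward(X, weights)
    grads = [None] * len(weights)
    delta = Y_hat - Y                              # delta for the last layer
    for k in reversed(range(len(weights))):
        grads[k] = inputs[k].T @ delta             # X_k^T . delta_k
        if k > 0:
            # delta_{k-1} = (delta_k . W_k^T) * ReLU'(Z_{k-1})
            delta = (delta @ weights[k].T) * (pre_acts[k - 1] > 0)
    return grads

def numeric_grads(X, Y, weights, eps=1e-6):
    """Central-difference estimate of the same gradients."""
    grads = [np.zeros_like(W) for W in weights]
    for W, G in zip(weights, grads):
        for idx in np.ndindex(*W.shape):
            original = W[idx]
            W[idx] = original + eps
            loss_plus = loss(X, Y, weights)
            W[idx] = original - eps
            loss_minus = loss(X, Y, weights)
            W[idx] = original
            G[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grads

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
weights = [rng.normal(size=(2, 3)),
           rng.normal(size=(3, 3)),
           rng.normal(size=(3, 2))]

analytic = backward(X, Y, weights)
numeric = numeric_grads(X, Y, weights)
max_error = max(np.abs(a - n).max() for a, n in zip(analytic, numeric))
print(max_error)
```

Because each `delta` is built from the one after it, the loop does the work once per layer instead of re-multiplying the whole chain for every weight matrix.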