# Neural Network Training Process: Feedforward & Backpropagation Step-By-Step 

### Notation

Let's start by introducing some notation for how we will describe layers, nodes, and weights.
 - $a\implies$ Neuron/Activation
 - $a_2\implies$ Bias unit (fixed at $+1$)
 - $L\implies$ Layer
 - $w\implies$ Weight
 - $jk\implies$ Weight indices, read right to left: the weight goes from node $k$ in the previous layer to node $j$ in the current layer
 - $y_0\implies$ First output
 - $y_1\implies$ Second output

### Neural Network Topology

Below is the 3-layer network topology that we'll be using as an example for the walkthrough.  The top image shows how the notation is defined, and the computational graph is shown below the horizontal line.

<figure style="text-align:center;">
  <img src="Images/simple_neural_network_3.png" style="width:80%;" />
</figure> 


### Analyze Dataset
The first thing to do is import a dataset.  We'll be using a test dataset for this walkthrough with the following properties:

- **Legs**
- **Language**




```python
import pandas as pd
data = pd.read_csv("Datasets/test.csv")
data.head()
```

| Unnamed: 0 | Legs | Language | Target |
|---|---|---|---|
| 0 | 2 | 1 | Human |
| 1 | 4 | 0 | Animal |


### Initialization
Let's initialize the neural network by setting the weights randomly.  In the image below you can see the random weights, drawn from the range 0 to 1 (exclusive); the bias nodes are set to $+1$. Read [Neural Network Initialization](NeuralNetworkInitialization.ipynb) to learn more about different methods to initialize a neural network. The output of the neural network is defined as follows.  

$\begin{bmatrix}a^{(3)}_{0} \\ a^{(3)}_{1} \end{bmatrix} = \begin{bmatrix}0 \\ 1 \end{bmatrix} = \rm{Human}$  
$\begin{bmatrix}a^{(3)}_{0} \\ a^{(3)}_{1} \end{bmatrix} = \begin{bmatrix}1 \\ 0 \end{bmatrix} = \rm{Animal}$  

<figure style="text-align:center;">
  <img src="Images/simple_neural_network_4.png" style="width:60%" />
</figure> 
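A minimal sketch of this kind of initialization, assuming NumPy and assuming each layer's weights are stored as a 2×3 matrix whose last column multiplies the bias unit. The names `init_weights` and `ONE_HOT` and the seed are hypothetical and not part of the original notebook, and the values it draws will not match the ones in the image. Note that `Generator.random` samples from $[0, 1)$, so an exact 0 is technically possible.

```python
import numpy as np

# Output encoding from the walkthrough: Human -> [0, 1], Animal -> [1, 0].
ONE_HOT = {"Human": np.array([0.0, 1.0]), "Animal": np.array([1.0, 0.0])}

def init_weights(n_layers=3, n_out=2, n_in=2, seed=0):
    """Return one (n_out, n_in + 1) weight matrix per layer.

    The extra column holds the weight attached to the bias unit (+1).
    """
    rng = np.random.default_rng(seed)  # the seed is an arbitrary choice
    return [rng.random((n_out, n_in + 1)) for _ in range(n_layers)]

weights = init_weights()
for layer, W in enumerate(weights, start=1):
    print(f"w^({layer}) =\n{W.round(2)}")
```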

### Feed Forward

##### Set the inputs, $a^{(0)}_{0}$ and $a^{(0)}_{1}$
$a^{(0)}_{0} = 2$  
$a^{(0)}_{1} = 1$  
Layers $a^{(1)}$ and $a^{(2)}$ will use the sigmoid activation function $g(x) = \dfrac{1}{1+e^{-x}}$  
Layer $a^{(3)}$ will use the softmax activation function $\sigma(z)_j = \dfrac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}\quad \textrm{for } j=1,\dots, K$  
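The two activation functions can be sketched in NumPy as follows. The function names are my own, and subtracting the maximum inside `softmax` is a common numerical-stability trick that is not part of the formula above; it does not change the result.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + e^{-x}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Softmax over a 1-D vector z; subtracting max(z) avoids overflow."""
    shifted = np.exp(z - np.max(z))
    return shifted / shifted.sum()

print(sigmoid(0.0))                    # 0.5
print(softmax(np.array([1.0, 1.0])))   # [0.5 0.5]
```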

##### Compute $a^{(1)}_{0}$
$z^{(1)}_0 = w^{(1)}_{0,0} \cdot a^{(0)}_{0} + w^{(1)}_{0,1} \cdot a^{(0)}_{1} + w^{(1)}_{0,2} \cdot a^{(0)}_{2} = 0.63 \cdot 2 + 0.39 \cdot 1 + 0.34 \cdot 1 = 1.99$  
$a^{(1)}_{0} = g(z^{(1)}_0) = \dfrac{1}{1+e^{-z^{(1)}_0}} = \dfrac{1}{1+e^{-1.99}} \approx 0.88$  

##### Compute $a^{(1)}_{1}$
$z^{(1)}_1 = w^{(1)}_{1,0} \cdot a^{(0)}_{0} + w^{(1)}_{1,1} \cdot a^{(0)}_{1} + w^{(1)}_{1,2} \cdot a^{(0)}_{2} = 0.50 \cdot 2 + 0.50 \cdot 1 + 0.71 \cdot 1 = 2.21$  
$a^{(1)}_{1} = g(z^{(1)}_1) = \dfrac{1}{1+e^{-z^{(1)}_1}} = \dfrac{1}{1+e^{-2.21}} \approx 0.90$  

##### Compute $a^{(2)}_{0}$
$z^{(2)}_0 = w^{(2)}_{0,0} \cdot a^{(1)}_{0} + w^{(2)}_{0,1} \cdot a^{(1)}_{1} + w^{(2)}_{0,2} \cdot a^{(1)}_{2} = 0.28 \cdot 0.88 + 0.42 \cdot 0.90 + 0.27 \cdot 1 = 0.89$   
$a^{(2)}_{0} = g(z^{(2)}_0) = \dfrac{1}{1+e^{-z^{(2)}_0}} = \dfrac{1}{1+e^{-0.89}} \approx 0.71$  

##### Compute $a^{(2)}_{1}$
$z^{(2)}_1 = w^{(2)}_{1,0} \cdot a^{(1)}_{0} + w^{(2)}_{1,1} \cdot a^{(1)}_{1} + w^{(2)}_{1,2} \cdot a^{(1)}_{2} = 0.82 \cdot 0.88 + 0.49 \cdot 0.90 + 0.45 \cdot 1 = 1.61$  
$a^{(2)}_{1} = g(z^{(2)}_1) = \dfrac{1}{1+e^{-z^{(2)}_1}} = \dfrac{1}{1+e^{-1.61}} \approx 0.83$   

##### Compute $a^{(3)}_{0}$ and $a^{(3)}_{1}$ using the softmax activation function
$z^{(3)}_0 = w^{(3)}_{0,0} \cdot a^{(2)}_{0} + w^{(3)}_{0,1} \cdot a^{(2)}_{1} + w^{(3)}_{0,2} \cdot a^{(2)}_{2} = 0.32 \cdot 0.71 + 0.83 \cdot 0.83 + 0.70 \cdot 1 = 1.62$  

$z^{(3)}_1 = w^{(3)}_{1,0} \cdot a^{(2)}_{0} + w^{(3)}_{1,1} \cdot a^{(2)}_{1} + w^{(3)}_{1,2} \cdot a^{(2)}_{2} = 0.70 \cdot 0.71 + 0.72 \cdot 0.83 + 0.59 \cdot 1 = 1.68$  

$\hat{y}_0 = a^{(3)}_{0} = \sigma(z^{(3)})_0 = \dfrac{e^{1.62}}{e^{1.62} + e^{1.68}} \approx 0.49$  
$\hat{y}_1 = a^{(3)}_{1} = \sigma(z^{(3)})_1 = \dfrac{e^{1.68}}{e^{1.62} + e^{1.68}} \approx 0.51$  
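The whole feed-forward pass can be reproduced with a short NumPy sketch. The weight matrices are read off the worked steps above, with the last column holding the bias weight; the helper names `with_bias` and `forward` are hypothetical. Because the code does not round intermediate activations to two decimals, the final probabilities can differ slightly from the hand-computed values in the second decimal place.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Weights taken from the worked example; last column multiplies the bias unit.
W1 = np.array([[0.63, 0.39, 0.34], [0.50, 0.50, 0.71]])
W2 = np.array([[0.28, 0.42, 0.27], [0.82, 0.49, 0.45]])
W3 = np.array([[0.32, 0.83, 0.70], [0.70, 0.72, 0.59]])

def with_bias(a):
    """Append the bias unit (+1) to an activation vector."""
    return np.append(a, 1.0)

def forward(x):
    """Return the activations of layers 1 and 2 and the softmax output."""
    a1 = sigmoid(W1 @ with_bias(x))
    a2 = sigmoid(W2 @ with_bias(a1))
    y_hat = softmax(W3 @ with_bias(a2))
    return a1, a2, y_hat

# Input from the first dataset row: Legs = 2, Language = 1.
a1, a2, y_hat = forward(np.array([2.0, 1.0]))
print(a1.round(2), a2.round(2), y_hat.round(3))
```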

### Calculate Error
To calculate the error we will use the multi-class cross-entropy loss function instead of the binary cross-entropy loss function.  The multi-class form is a generalization: for two classes it calculates exactly the same thing, but it can also handle more than two classes.  Note that the value of the cost function is never used directly in training; what training needs is the gradient of the cost function w.r.t. the weights and biases in order to update them.  Computing the loss itself is just a way to monitor how well we are doing.

$C = -\sum\limits^{K}_{k=1}y_{k}\log_2\left(\hat{y}_{k}\right)$

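This function can easily be adjusted to take more than one training example into account by averaging over the $m$ examples, where $y_{i,k}$ and $\hat{y}_{i,k}$ are the target and prediction for class $k$ of example $i$.

$C = -\dfrac{1}{m}\sum\limits^{m}_{i=1}\sum\limits^{K}_{k=1}y_{i,k}\log_2\left(\hat{y}_{i,k}\right)$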



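Now let's calculate the error for our single training example with the simplified (single-example) loss function.  The input was the Human row, so the target is $y_0 = 0$, $y_1 = 1$:

$C = -\left(y_0\log_2(\hat{y}_0) + y_1\log_2(\hat{y}_1)\right) = -\left(0 \cdot \log_2(0.49) + 1 \cdot \log_2(0.51)\right) \approx 0.97$  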
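As a sketch, the base-2 cross-entropy defined above can be evaluated in NumPy for the walkthrough example. The function name `cross_entropy` is hypothetical, the target `[0, 1]` follows the Human encoding from the initialization section, and the prediction uses the rounded outputs from the feed-forward step. The same function averages over $m$ rows when given a batch.

```python
import numpy as np

def cross_entropy(Y, Y_hat):
    """Base-2 cross-entropy averaged over the m rows of Y and Y_hat."""
    Y = np.atleast_2d(Y)
    Y_hat = np.atleast_2d(Y_hat)
    m = Y.shape[0]
    return float(-np.sum(Y * np.log2(Y_hat)) / m)

y = np.array([0.0, 1.0])        # one-hot target for Human
y_hat = np.array([0.49, 0.51])  # rounded network output
print(round(cross_entropy(y, y_hat), 2))  # 0.97
```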

### Backpropagation

$C = -\dfrac{1}{m}\sum\limits^{m}_{i=1}\sum\limits^{K}_{k=1}y_{i,k}\log_2\left(\hat{y}_{i,k}\right)$

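Derivative of the cross-entropy error function with respect to the prediction $\hat{y}_k$.  
$\dfrac{\partial C}{\partial \hat{y}_k} = -\dfrac{y_k}{\hat{y}_k} + \dfrac{1 - y_k}{1 - \hat{y}_k}$  
Note that this two-term expression is the derivative of the binary cross-entropy with the natural logarithm; differentiating the multi-class, base-2 form above for a single example gives only the first term, scaled by $1/\ln 2$.  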

Calculate the gradient of the cost with respect to the weights $w^{(3)}$  

$\dfrac{\partial C}{\partial w^{(3)}_{0,0}}$
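One way to evaluate this partial derivative, which the original leaves open, is sketched below. It assumes the softmax output layer and the base-2 cross-entropy from this walkthrough with a one-hot target, in which case the chain rule collapses to $\dfrac{\partial C}{\partial w^{(3)}_{j,k}} = \dfrac{\hat{y}_j - y_j}{\ln 2}\, a^{(2)}_{k}$. The function names are hypothetical, and the finite-difference comparison is only there as a sanity check on that assumed formula.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(W3, a2b, y):
    """Base-2 cross-entropy of softmax(W3 @ a2b) against one-hot y."""
    return -np.sum(y * np.log2(softmax(W3 @ a2b)))

def grad_W3(W3, a2b, y):
    """Analytic gradient under the stated assumptions: outer((y_hat - y) / ln 2, a2b)."""
    y_hat = softmax(W3 @ a2b)
    return np.outer((y_hat - y) / np.log(2.0), a2b)

W3 = np.array([[0.32, 0.83, 0.70], [0.70, 0.72, 0.59]])
a2b = np.array([0.71, 0.83, 1.0])  # rounded a^(2) plus the bias unit
y = np.array([0.0, 1.0])           # one-hot target for Human

analytic = grad_W3(W3, a2b, y)

# Central finite difference for w^(3)_{0,0} as a sanity check.
eps = 1e-6
Wp, Wm = W3.copy(), W3.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp, a2b, y) - loss(Wm, a2b, y)) / (2 * eps)

print(analytic[0, 0], numeric)
```

If the two printed numbers agree to several decimal places, the closed-form gradient is consistent with the loss as defined here.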



### Update Weights