# **Lecture: Neural Network, Full Breakdown**

In this notebook we will look at a full breakdown and computation of the back-propagation process. Below you will find all the individual derivatives calculated and identified. While it would be a good exercise to follow through the computations, the main objective of this notebook is to see that the pieces we need to perform back-propagation are (mostly) already computed in the forward pass. This means that to update the weights we need to save information on the forward pass and use it well as we move back through the network.

In [1]:
import numpy as np
import pandas as pd

## **The Data**

In [2]:
# Original data
DATA = pd.read_csv("~/Files/Data/iris_KNN.csv", 
                   names=["index",'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
                          'PetalWidthCm',"Species"],
                  header=None).iloc[1:]
DATA

Unnamed: 0,index,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,0.0,4.9,3.0,1.4,0.2,0
2,1.0,4.7,3.2,1.3,0.2,0
3,2.0,4.6,3.1,1.5,0.2,0
4,3.0,5.0,3.6,1.4,0.2,0
5,4.0,5.4,3.9,1.7,0.4,0
...,...,...,...,...,...,...
145,144.0,6.7,3.0,5.2,2.3,2
146,145.0,6.3,2.5,5.0,1.9,2
147,146.0,6.5,3.0,5.2,2.0,2
148,147.0,6.2,3.4,5.4,2.3,2


In [3]:
DATA = DATA.sample(frac=1)

X = DATA[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].astype(np.float32)
Y = DATA[['Species']].astype(np.float32)

X

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
117,7.7,3.8,6.7,2.2
141,6.9,3.1,5.1,2.3
37,4.9,3.1,1.5,0.1
91,6.1,3.0,4.6,1.4
100,6.3,3.3,6.0,2.5
...,...,...,...,...
36,5.5,3.5,1.3,0.2
124,6.7,3.3,5.7,2.1
79,5.7,2.6,3.5,1.0
1,4.9,3.0,1.4,0.2


## **Network Setup**

Next we are going to set up the network. For this we need starting weight matrices, as well as any activation functions that we are going to use. Here we will only use sigmoid as an activation function, but in practice there are many options.

The weight matrices will be filled with random entries and need to be shaped so that they are compatible for matrix multiplication: the number of columns of each matrix must match the number of rows of the next.
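As a small sketch of the shape requirement, the block below builds one weight matrix per pair of adjacent layers and checks that neighbours are compatible. The layer sizes 4, 10, 3, 1 match the matrices used in this notebook; the names `layer_sizes`, `demo_weights`, and the seeded `rng` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the sketch is repeatable

# Layer sizes: 4 input features -> 10 -> 3 -> 1 output.
layer_sizes = [4, 10, 3, 1]

# One weight matrix per pair of adjacent layers, entries uniform in [-1, 1).
demo_weights = [2 * rng.random((n_in, n_out)) - 1
                for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Neighbouring matrices are compatible when the columns of one
# match the rows of the next.
for left, right in zip(demo_weights[:-1], demo_weights[1:]):
    assert left.shape[1] == right.shape[0]

shapes = [w.shape for w in demo_weights]
print(shapes)  # [(4, 10), (10, 3), (3, 1)]
```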

In [4]:
# Establish random weights, uniform in [-1, 1).
W1 = 2 * np.random.rand(4,10) - 1
W2 = 2 * np.random.rand(10,3) - 1
W3 = 2 * np.random.rand(3,1) - 1

In [5]:
print(W1)
print(X)

[[-0.68346046  0.76055157 -0.81336474 -0.41852269 -0.68317457 -0.60233962
  -0.03404217  0.75877667  0.93325056 -0.7246356 ]
 [-0.49362079 -0.80908372  0.78791125  0.38723266 -0.18515252 -0.67131661
   0.92347466 -0.31854911 -0.96147297  0.43050549]
 [ 0.83321737  0.15161674  0.47708171 -0.40889739 -0.93244687 -0.336736
   0.43425408 -0.17456453 -0.1158559   0.85668546]
 [ 0.32474823  0.45304415 -0.10233126  0.33532477  0.39412694  0.68121698
   0.80384655 -0.58492206 -0.31597696  0.10847151]]
     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
117            7.7           3.8            6.7           2.2
141            6.9           3.1            5.1           2.3
37             4.9           3.1            1.5           0.1
91             6.1           3.0            4.6           1.4
100            6.3           3.3            6.0           2.5
..             ...           ...            ...           ...
36             5.5           3.5            1.3           0.2
124  

In [6]:
# Activation function.
def sigmoid(z):
    a = 1 / (1 + np.e ** (-z))
    return a
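
The backward pass later in this notebook uses the derivative of the sigmoid in the form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The block below is a quick numerical check of that identity against a central finite difference. The helper name `sigmoid_prime` and the step size `eps` are assumptions for this sketch, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Closed form of the derivative: sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Central finite difference approximation of the derivative.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# Largest disagreement between the closed form and the approximation.
print(np.max(np.abs(sigmoid_prime(z) - numeric)))
```

Because the derivative is written in terms of $\sigma(z)$ itself, it can be rebuilt from values saved on the forward pass.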

## **Forward Pass**

For the forward pass we are going to apply the composite function to the data. This will mean that we multiply by each matrix and apply activation between the layers.

For us, since we have one layer type (linear) and one activation type (sigmoid), our network is a standard fully-connected network. These networks are a good place to start, and are often where deep neural networks end. Later we will explore other types of layers and how we can use them in different applications.
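The composition described above can be sketched as a loop over the weight matrices that saves every pre-activation and activation along the way, since those are the values back-propagation reuses. This is a minimal sketch rather than the notebook's own function: `forward_with_cache`, `X_demo`, and `demo_weights` are hypothetical names, and the data is random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_cache(X, weights):
    """Apply linear -> sigmoid for each weight matrix, saving every z and a."""
    activations = [np.asarray(X, dtype=float)]  # the input plays the role of a0
    pre_activations = []
    for W in weights:
        z = activations[-1] @ W          # pre-activation of this layer
        pre_activations.append(z)
        activations.append(sigmoid(z))   # post-activation of this layer
    return pre_activations, activations

rng = np.random.default_rng(0)
X_demo = rng.random((5, 4))              # 5 made-up data points, 4 features
demo_weights = [2 * rng.random((4, 10)) - 1,
                2 * rng.random((10, 3)) - 1,
                2 * rng.random((3, 1)) - 1]

zs, acts = forward_with_cache(X_demo, demo_weights)
print([z.shape for z in zs])  # [(5, 10), (5, 3), (5, 1)]
```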

In [7]:
# The forward pass.
def forward_pass(X, W1, W2, W3):
    # Layer 1
    z1 = np.dot(X, W1) # Left hand side of node. (Pre-activation)
    a1 = sigmoid(z1) # Right hand side of the node. (Post-activation)
    
    # Layer 2
    z2 = np.dot(a1, W2) # Left hand side of node. (Pre-activation)
    a2 = sigmoid(z2) # Right hand side of the node. (Post-activation)
    
    # Layer 3
    z3 = np.dot(a2, W3) # Left hand side of node. (Pre-activation)
    a3 = sigmoid(z3) # Right hand side of the node. (Post-activation)
    
    # Returns the pre-activations and the guess(es) for a given datum's features or for a batch of data.
    return [z1, z2, z3, a3] 

Next we will check that we can move forward through the network.

In [8]:
Forward_Pass_Output = forward_pass(X, W1, W2, W3)

## **Evaluation**

Now we need to evaluate the prediction in the network. We can do this with different evaluation tools, but here we will use mean-squared error. 

$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 $$
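As a small worked example of the formula, the sketch below computes the mean-squared error for four made-up predictions and targets, dividing by the number of data points $N$. The function name `mse` and the values are assumptions for the illustration.

```python
import numpy as np

def mse(predictions, targets):
    """Mean-squared error: sum of squared differences divided by N."""
    predictions = np.asarray(predictions, dtype=float).reshape(-1)
    targets = np.asarray(targets, dtype=float).reshape(-1)
    return np.sum((predictions - targets) ** 2) / targets.shape[0]

preds = np.array([0.5, 0.0, 1.0, 0.5])
targs = np.array([1.0, 0.0, 0.0, 0.5])

# Squared differences are 0.25, 0, 1, 0, so the mean is 1.25 / 4.
print(mse(preds, targs))  # 0.3125
```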

In [9]:
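# Cost Equation (MSE)
# Note: np.shape(Y)[1] is the number of target columns (1 here), so this line
# returns the summed squared error; dividing by np.shape(Y)[0] gives the mean.
Cost = np.sum((Forward_Pass_Output[3] - Y) ** 2) / np.shape(Y)[1]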

In [10]:
Cost

Species    99.290848
dtype: float64

## **Back Propagation (update the weights)**

First we will start with the cost equation,
$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 $$
the activation function in the final node,

$$ a_{1}^{(3)} = \sigma(z_{1}^{(3)}) $$ 

and the linear combination that allows us to progress forward, 

$$ z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)}$$

The composition would look like,

$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N ( \sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) ^ 2 $$

This last line shows us where the first weights that we will be updating will be on the cost equation. We can compute the gradient with respect to these weights to determine the "way to head" on the cost surface to reduce the overall error for the data or a batch of the data. 
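To illustrate what "the way to head" means, here is a minimal sketch of gradient descent on a simple bowl-shaped cost rather than on the network's cost. The cost, its minimum at `target`, the step count, and the learning rate are all assumptions chosen so the behaviour is easy to follow.

```python
import numpy as np

target = np.array([3.0, -1.0])   # location of the minimum of the toy cost

def cost(w):
    return np.sum((w - target) ** 2)

def grad(w):
    # Gradient of the toy cost; it points uphill, so we step against it.
    return 2 * (w - target)

w = np.zeros(2)
learning_rate = 0.1
costs = [cost(w)]
for _ in range(50):
    w = w - learning_rate * grad(w)   # move a small step downhill
    costs.append(cost(w))

print(costs[0], costs[-1])
print(w)
```

Each step shrinks the distance to the minimum by a constant factor here; on the network's cost surface the same update rule applies, but the gradient has to be assembled with the chain rule as shown next.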

Let's take the gradient computation one step at a time.

$$ \nabla \textrm{Cost} = \nabla \frac{1}{N} \sum_{i=1}^N ( \sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) ^ 2 $$

$$= \frac{1}{N} \sum_{i=1}^N \nabla ( \sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) ^ 2 $$

$$= \frac{1}{N} \sum_{i=1}^N 2(\sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) \\ \left \langle \sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) a_{1}^{(2)},
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) a_{2}^{(2)},
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)})
a_{3}^{(2)} \right \rangle $$

$$ = \frac{1}{N} \sum_{i=1}^N 2(\sigma(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}) - y_i) \\ \left \langle \sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}),
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)}),
\sigma'(\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)})
 \right \rangle \\ \langle a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)} \rangle $$

 
Now let's simplify the notation by replacing $\color{Red}{w_{11}^{(3)}} a_{1}^{(2)} + \color{Red}{w_{21}^{(3)}} a_{2}^{(2)} + \color{Red}{w_{31}^{(3)}} a_{3}^{(2)} $ with $z_{1}^{(3)} $. The two vectors are multiplied position by position.

$$ \frac{1}{N} \sum_{i=1}^N 2(\sigma(z_{1}^{(3)}) - y_i) \left \langle \sigma'(z_{1}^{(3)} ), \sigma'(z_{1}^{(3)} ), \sigma'(z_{1}^{(3)} ) \right \rangle \langle a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)} \rangle $$

 
Now so that we can see the next step, let $N = 4$. We are going to use another subscript to indicate the datum that we are on. So the final notation will be, $z_{1_i}^{(3)}$ where $i$ is the datum that we are on. So

$$ \frac{1}{4} \sum_{i=1}^4 2(\sigma(z_{1_i}^{(3)}) - y_i) \color{green}{\left \langle \sigma'(z_{1_i}^{(3)}), \sigma'(z_{1_i}^{(3)}), \sigma'(z_{1_i}^{(3)}) \right \rangle} \color{blue}{\langle a_{1_i}^{(2)}, a_{2_i}^{(2)}, a_{3_i}^{(2)} \rangle} =$$

Applying the sum to the vectors can be done term by term,

$$ \frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) 
\color{green}{\left \langle \sigma'(z_{1_1}^{(3)}), \sigma'(z_{1_1}^{(3)}), \sigma'(z_{1_1}^{(3)}) \right \rangle}  
\color{blue}{\langle a_{1_1}^{(2)}, a_{2_1}^{(2)}, a_{3_1}^{(2)} \rangle} + \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) 
\color{green}{\left \langle \sigma'(z_{1_2}^{(3)}), \sigma'(z_{1_2}^{(3)}), \sigma'(z_{1_2}^{(3)}) \right \rangle}
\color{blue}{\langle a_{1_2}^{(2)}, a_{2_2}^{(2)}, a_{3_2}^{(2)} \rangle} + \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) 
\color{green}{\left \langle \sigma'(z_{1_3}^{(3)}), \sigma'(z_{1_3}^{(3)}), \sigma'(z_{1_3}^{(3)}) \right \rangle}
\color{blue}{\langle a_{1_3}^{(2)}, a_{2_3}^{(2)}, a_{3_3}^{(2)} \rangle} + \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\color{green}{\left \langle \sigma'(z_{1_4}^{(3)}), \sigma'(z_{1_4}^{(3)}), \sigma'(z_{1_4}^{(3)}) \right \rangle}
\color{blue}{\langle a_{1_4}^{(2)}, a_{2_4}^{(2)}, a_{3_4}^{(2)} \rangle}$$

Then we can collect the terms that belong to each vector position,

$$\left \langle 
\left( \frac{2}{4} (\sigma(z_{1_1}^{(3)}) - y_1)\color{green}{\sigma'(z_{1_1}^{(3)})}\color{blue}{a_{1_1}^{(2)}} + \frac{2}{4} (\sigma(z_{1_2}^{(3)}) - y_2)\color{green}{\sigma'(z_{1_2}^{(3)})}\color{blue}{a_{1_2}^{(2)}} + \frac{2}{4}(\sigma(z_{1_3}^{(3)}) - y_3)\color{green}{\sigma'(z_{1_3}^{(3)})}\color{blue}{a_{1_3}^{(2)}} + \frac{2}{4}(\sigma(z_{1_4}^{(3)}) - y_4)\color{green}{\sigma'(z_{1_4}^{(3)})}\color{blue}{a_{1_4}^{(2)}} \right), \\
\left( \frac{2}{4} (\sigma(z_{1_1}^{(3)}) - y_1)\color{green}{\sigma'(z_{1_1}^{(3)})}\color{blue}{a_{2_1}^{(2)}} + \frac{2}{4} (\sigma(z_{1_2}^{(3)}) - y_2)\color{green}{\sigma'(z_{1_2}^{(3)})}\color{blue}{a_{2_2}^{(2)}} + \frac{2}{4}(\sigma(z_{1_3}^{(3)}) - y_3)\color{green}{\sigma'(z_{1_3}^{(3)})} \color{blue}{a_{2_3}^{(2)}} + \frac{2}{4}(\sigma(z_{1_4}^{(3)}) - y_4)\color{green}{\sigma'(z_{1_4}^{(3)})}\color{blue}{a_{2_4}^{(2)}}  \right), \\
\left( \frac{2}{4} (\sigma(z_{1_1}^{(3)}) - y_1)\color{green}{\sigma'(z_{1_1}^{(3)})}\color{blue}{a_{3_1}^{(2)}} + \frac{2}{4} (\sigma(z_{1_2}^{(3)}) - y_2)\color{green}{\sigma'(z_{1_2}^{(3)})} \color{blue}{a_{3_2}^{(2)}} + \frac{2}{4}(\sigma(z_{1_3}^{(3)}) - y_3)\color{green}{\sigma'(z_{1_3}^{(3)})} \color{blue}{a_{3_3}^{(2)}} + \frac{2}{4}(\sigma(z_{1_4}^{(3)}) - y_4)\color{green}{\sigma'(z_{1_4}^{(3)})}\color{blue}{a_{3_4}^{(2)}}\right)
\right \rangle 
$$

Then we can break the sum in each of the vector positions into a dot product,

$$\left \langle \frac{\partial \textrm{Cost}}{\partial \color{Red}{w_{11}^{(3)}}}, \frac{\partial \textrm{Cost}}{\partial \color{Red}{w_{21}^{(3)}}}, \frac{\partial \textrm{Cost}}{\partial \color{Red}{w_{31}^{(3)}}} \right \rangle =
\color{blue}{\sigma\left(\begin{pmatrix} 
z_{1_1}^{(2)} & z_{1_2}^{(2)} & z_{1_3}^{(2)} & z_{1_4}^{(2)} \\
z_{2_1}^{(2)} & z_{2_2}^{(2)} & z_{2_3}^{(2)} & z_{2_4}^{(2)} \\
z_{3_1}^{(2)} & z_{3_2}^{(2)} & z_{3_3}^{(2)} & z_{3_4}^{(2)} 
\end{pmatrix}\right)}
\cdot
\left(
\begin{bmatrix} 
\frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\end{bmatrix}
\odot
\color{green}
{\sigma'
\begin{pmatrix} 
z_{1_1}^{(3)} \\
z_{1_2}^{(3)} \\ 
z_{1_3}^{(3)} \\
z_{1_4}^{(3)}
\end{pmatrix}}
\right) =
\color{blue}{\sigma\left(\begin{pmatrix} 
z_{1_1}^{(2)} & z_{2_1}^{(2)} & z_{3_1}^{(2)} \\
z_{1_2}^{(2)} & z_{2_2}^{(2)} & z_{3_2}^{(2)} \\
z_{1_3}^{(2)} & z_{2_3}^{(2)} & z_{3_3}^{(2)} \\
z_{1_4}^{(2)} & z_{2_4}^{(2)} & z_{3_4}^{(2)} 
\end{pmatrix}\right)^ {T}} 
\cdot
\left(
\begin{bmatrix} 
\frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\end{bmatrix}
\odot
\color{green}
{\sigma'
\begin{pmatrix} 
z_{1_1}^{(3)} \\
z_{1_2}^{(3)} \\ 
z_{1_3}^{(3)} \\
z_{1_4}^{(3)}
\end{pmatrix}}
\right)
$$

Here $\odot$ is the element-by-element product of the two column vectors, which is taken before the matrix product.

Where,
$$
\color{blue}{\textrm{Output of the previous layer activated} = 
\sigma\left(\begin{pmatrix} 
z_{1_1}^{(2)} & z_{2_1}^{(2)} & z_{3_1}^{(2)} \\
z_{1_2}^{(2)} & z_{2_2}^{(2)} & z_{3_2}^{(2)} \\
z_{1_3}^{(2)} & z_{2_3}^{(2)} & z_{3_3}^{(2)} \\
z_{1_4}^{(2)} & z_{2_4}^{(2)} & z_{3_4}^{(2)} 
\end{pmatrix}\right)} \\ 
\textrm{Difference between guess and target} = 
\begin{bmatrix} 
\frac{1}{4} 2(\sigma(z_{1_1}^{(3)}) - y_1) \\
\frac{1}{4} 2(\sigma(z_{1_2}^{(3)}) - y_2) \\
\frac{1}{4} 2(\sigma(z_{1_3}^{(3)}) - y_3) \\
\frac{1}{4} 2(\sigma(z_{1_4}^{(3)}) - y_4) 
\end{bmatrix}\\
\color{green}{\textrm{Derivative applied to the pre-activation final value} = \sigma'
\begin{pmatrix} 
z_{1_1}^{(3)} \\
z_{1_2}^{(3)} \\ 
z_{1_3}^{(3)} \\
z_{1_4}^{(3)}
\end{pmatrix}}
$$

While there are many details here, the main point of note is that we have all of these pieces from the forward pass. 
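As a check that this matrix form really is the gradient, the sketch below builds it for $N = 4$ made-up data points and compares it with a central finite-difference approximation of the cost. The arrays `a2`, `y`, and `W3_demo` are random stand-ins, not values from the notebook, and `eps` is an assumed step size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
N = 4
a2 = rng.random((N, 3))                              # stand-in for the activated layer-2 output
y = rng.integers(0, 2, size=(N, 1)).astype(float)    # made-up targets
W3_demo = 2 * rng.random((3, 1)) - 1

def cost(W):
    return np.sum((sigmoid(a2 @ W) - y) ** 2) / N

# Pieces that come straight from the forward pass.
z3 = a2 @ W3_demo
a3 = sigmoid(z3)

# (difference between guess and target) * sigma'(z3), element by element.
delta3 = (2.0 / N) * (a3 - y) * a3 * (1 - a3)

# Transposed layer-2 activations times the column above: shape (3, 1).
grad_W3 = a2.T @ delta3

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W3_demo)
for j in range(W3_demo.shape[0]):
    step = np.zeros_like(W3_demo)
    step[j, 0] = eps
    numeric[j, 0] = (cost(W3_demo + step) - cost(W3_demo - step)) / (2 * eps)

# Largest disagreement between the analytic and numerical gradients.
print(np.max(np.abs(grad_W3 - numeric)))
```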

Now let's take a quick look at the next set of weights as we move our way backwards through the network. Consider again the cost surface,
$$ \textrm{Cost} = \frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 $$

Now let's start building the derivatives. We are going to look at this in a slightly different way than we did above. In the next few lines we will find each of the composite functions that make up the Cost function. The arrow points to the derivative for each, and we can use the chain rule to build the derivative.

$$\frac{1}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i) ^ 2 \ \ \xrightarrow{\frac{\partial}{\partial a_{1}^{(3)} }} \ \  \frac{2}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i)$$

$$ a_{1}^{(3)} = \sigma(z_{1}^{(3)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{1}^{(3)} }} \ \ \sigma'(z_{1}^{(3)}) $$ 

which is the activation in the final node. And, 

$$ z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)} 
\ \ \xrightarrow{\frac{\partial}{\partial a_{1}^{(2)} }} \ \ w_{11}^{(3)} \\ 
z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)} 
\ \ \xrightarrow{\frac{\partial}{\partial a_{2}^{(2)} }} \ \ w_{21}^{(3)} \\
z_{1}^{(3)} = w_{11}^{(3)} a_{1}^{(2)} + w_{21}^{(3)} a_{2}^{(2)} + w_{31}^{(3)} a_{3}^{(2)} 
\ \ \xrightarrow{\frac{\partial}{\partial a_{3}^{(2)} }} \ \ w_{31}^{(3)}  $$

But to move back to the weights in the layer before we need to keep building. 

So,
$$
a_{1}^{(2)} = \sigma(z_{1}^{(2)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{1}^{(2)} }} \ \ \sigma'(z_{1}^{(2)})  \\
a_{2}^{(2)} = \sigma(z_{2}^{(2)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{2}^{(2)} }} \ \ \sigma'(z_{2}^{(2)}) \\
a_{3}^{(2)} = \sigma(z_{3}^{(2)}) \ \ \xrightarrow{\frac{\partial}{\partial z_{3}^{(2)} }} \ \ \sigma'(z_{3}^{(2)})
$$

and,
$$
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{11}^{(2)} }} \ \ a_{1}^{(1)} \\
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{21}^{(2)} }} \ \ a_{2}^{(1)} \\
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{31}^{(2)} }} \ \ a_{3}^{(1)} \\
z_{1}^{(2)} = w_{11}^{(2)} a_{1}^{(1)} + w_{21}^{(2)} a_{2}^{(1)} + w_{31}^{(2)} a_{3}^{(1)} + w_{41}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{41}^{(2)} }} \ \ a_{4}^{(1)} \\
$$

$$
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{12}^{(2)} }} \ \ a_{1}^{(1)} \\
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{22}^{(2)} }} \ \ a_{2}^{(1)} \\
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)}
 \ \ \xrightarrow{\frac{\partial}{\partial w_{32}^{(2)} }} \ \ a_{3}^{(1)} \\
z_{2}^{(2)} = w_{12}^{(2)} a_{1}^{(1)} + w_{22}^{(2)} a_{2}^{(1)} + w_{32}^{(2)} a_{3}^{(1)} + w_{42}^{(2)} a_{4}^{(1)} 
\ \ \xrightarrow{\frac{\partial}{\partial w_{42}^{(2)} }} \ \ a_{4}^{(1)} \\
$$

$$
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{13}^{(2)} }} \ \ a_{1}^{(1)} \\
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{23}^{(2)} }} \ \ a_{2}^{(1)} \\
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{33}^{(2)} }} \ \ a_{3}^{(1)} \\
z_{3}^{(2)} = w_{13}^{(2)} a_{1}^{(1)} + w_{23}^{(2)} a_{2}^{(1)} + w_{33}^{(2)} a_{3}^{(1)} + w_{43}^{(2)} a_{4}^{(1)}
\ \ \xrightarrow{\frac{\partial}{\partial w_{43}^{(2)} }} \ \ a_{4}^{(1)} \\
$$

**NOTE: The products between the scalars, vectors and matrices below are not clear. The pattern above would be needed to turn the sum to a matrix product. The following math is to showcase the pattern and the idea of what will happen in the layers that follow.** 

So the pieces that make up the gradient are, 
$$
\frac{2}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i)
\sigma'(z_{1}^{(3)})
\langle w_{11}^{(3)}, w_{21}^{(3)}, w_{31}^{(3)} \rangle
\langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle
\langle a_{1}^{(1)}, a_{2}^{(1)}, a_{3}^{(1)}, a_{4}^{(1)} \rangle
$$

And the pattern continues,
$$
\frac{2}{N} \sum_{i=1}^N (a_{1}^{(3)} - y_i)
\sigma'(z_{1}^{(3)})
\langle w_{11}^{(3)}, w_{21}^{(3)}, w_{31}^{(3)} \rangle
\langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle
\begin{bmatrix}
w_{11}^{(2)} & w_{12}^{(2)} & w_{13}^{(2)} \\
w_{21}^{(2)} & w_{22}^{(2)} & w_{23}^{(2)} \\
w_{31}^{(2)} & w_{32}^{(2)} & w_{33}^{(2)} \\
w_{41}^{(2)} & w_{42}^{(2)} & w_{43}^{(2)} 
\end{bmatrix}
\langle \sigma'(z_{1}^{(1)}), \sigma'(z_{2}^{(1)}), \sigma'(z_{3}^{(1)}),  \sigma'(z_{4}^{(1)})  \rangle
X
$$
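
The note above points out that the products in these expressions are not spelled out. One way to make them precise, as a sketch in matrix form, is the recursion below. The symbols are introduced here and are not the notebook's: rows index the data, $A^{(k)} = \sigma(Z^{(k)})$ is the activated output of layer $k$, $\odot$ is the element-by-element product, and $\Delta^{(k)}$ is a helper name for the quantity passed back from layer $k$.

$$
\Delta^{(3)} = \frac{2}{N} \left( A^{(3)} - Y \right) \odot \sigma'\left(Z^{(3)}\right), 
\qquad \nabla_{W^{(3)}} \textrm{Cost} = \left(A^{(2)}\right)^{T} \Delta^{(3)} \\
\Delta^{(2)} = \left( \Delta^{(3)} \left(W^{(3)}\right)^{T} \right) \odot \sigma'\left(Z^{(2)}\right), 
\qquad \nabla_{W^{(2)}} \textrm{Cost} = \left(A^{(1)}\right)^{T} \Delta^{(2)} \\
\Delta^{(1)} = \left( \Delta^{(2)} \left(W^{(2)}\right)^{T} \right) \odot \sigma'\left(Z^{(1)}\right), 
\qquad \nabla_{W^{(1)}} \textrm{Cost} = X^{T} \Delta^{(1)}
$$

Each gradient has the same shape as the weight matrix it updates, and every factor on the right is either a weight matrix or a value saved on the forward pass.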

In [11]:
Error = Forward_Pass_Output[3] - Y

### **Gradient Structure Reference:**
$$ \textrm{learning rate} \ \color{green}{(a_{1}^{(3)} - y_i)} 
\color{blue}{\sigma'(z_{1}^{(3)})} \langle a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)} \rangle
$$

$$
\textrm{learning rate} \ \color{green}{(a_{1}^{(3)} - y_i) \sigma'(z_{1}^{(3)}) \textrm{W3}}
\color{blue}{ \langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle}
\langle a_{1}^{(1)}, a_{2}^{(1)}, a_{3}^{(1)}, a_{4}^{(1)} \rangle
$$

$$
 \textrm{learning rate} \ \color{green}{(a_{1}^{(3)} - y_i) \sigma'(z_{1}^{(3)}) \textrm{W3}
 \langle \sigma'(z_{1}^{(2)}), \sigma'(z_{2}^{(2)}), \sigma'(z_{3}^{(2)}) \rangle \textrm{W2}}
\color{blue} {\langle \sigma'(z_{1}^{(1)}), \sigma'(z_{2}^{(1)}), \sigma'(z_{3}^{(1)}),  \sigma'(z_{4}^{(1)}) \rangle}
X
$$
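
The three reference lines above can be sketched as one function that returns a gradient per weight matrix without modifying anything in place. This is an illustration under assumptions, not the notebook's implementation: the names `forward`, `gradients`, and the `*_demo` arrays are hypothetical, the data is random, and the factor $2/N$ from the cost is kept explicitly rather than folded into the learning rate. One entry of the first-layer gradient is compared with a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2, W3):
    a1 = sigmoid(X @ W1)
    a2 = sigmoid(a1 @ W2)
    a3 = sigmoid(a2 @ W3)
    return a1, a2, a3

def mse(X, Y, W1, W2, W3):
    return np.sum((forward(X, W1, W2, W3)[2] - Y) ** 2) / X.shape[0]

def gradients(X, Y, W1, W2, W3):
    """Gradients of the MSE with respect to W1, W2, W3."""
    N = X.shape[0]
    a1, a2, a3 = forward(X, W1, W2, W3)
    # sigma'(z) is rebuilt from the saved activations as a * (1 - a).
    delta3 = (2.0 / N) * (a3 - Y) * a3 * (1 - a3)
    delta2 = (delta3 @ W3.T) * a2 * (1 - a2)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    return X.T @ delta1, a1.T @ delta2, a2.T @ delta3

rng = np.random.default_rng(2)
X_demo = rng.random((6, 4))
Y_demo = rng.integers(0, 2, size=(6, 1)).astype(float)
W1_demo = 2 * rng.random((4, 10)) - 1
W2_demo = 2 * rng.random((10, 3)) - 1
W3_demo = 2 * rng.random((3, 1)) - 1

g1, g2, g3 = gradients(X_demo, Y_demo, W1_demo, W2_demo, W3_demo)

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1_plus = W1_demo.copy()
W1_plus[2, 5] += eps
W1_minus = W1_demo.copy()
W1_minus[2, 5] -= eps
numeric = (mse(X_demo, Y_demo, W1_plus, W2_demo, W3_demo)
           - mse(X_demo, Y_demo, W1_minus, W2_demo, W3_demo)) / (2 * eps)

print(g1[2, 5], numeric)
```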

 <p style="text-align: center;"> <img src= nn_an.png width=500 alt='[img: SVM]'/>  </p>

In [12]:
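def back_prop(learning_rate, layers, weights, error):
    # Unpack the pre-activations and the output saved on the forward pass.
    z1 = layers[0]
    z2 = layers[1]
    z3 = layers[2]
    a3 = layers[3]
    
    W1 = weights[0]
    W2 = weights[1]
    W3 = weights[2]
    
    # Layer 3: error times sigma'(z3), then the activated layer-2 output.
    # The constant 2/N from the cost is left out; it acts as part of the learning rate.
    l3_delta = error * sigmoid(z3) * (1 - sigmoid(z3))
    W3_update = np.dot(sigmoid(z2).T, l3_delta)
    
    # Layer 2: send the delta back through W3, then through sigma'(z2).
    l2_error = np.dot(l3_delta, W3.T)
    l2_delta = l2_error * sigmoid(z2) * (1 - sigmoid(z2))
    W2_update = np.dot(sigmoid(z1).T, l2_delta)

    # Layer 1: same pattern. Note that X is read from the notebook's global scope.
    l1_error = np.dot(l2_delta, W2.T)
    l1_delta = l1_error * sigmoid(z1) * (1 - sigmoid(z1))
    W1_update = np.dot(X.T, l1_delta)

    # Gradient step. The -= operators modify the arrays held in `weights` in place.
    W3 -= learning_rate * W3_update
    W2 -= learning_rate * W2_update
    W1 -= learning_rate * W1_update
    
    return [W1,W2,W3]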
    

In [13]:
weights = [W1, W2, W3]
learning_rate = .1
back_prop(learning_rate, Forward_Pass_Output, weights, Error)

[array([[-0.46043   ,  0.78325456, -0.91052451, -0.39694355, -0.68295567,
         -0.60317385, -0.03734536,  0.76249926,  0.91878531, -0.69979317],
        [-0.39642542, -0.79194496,  0.77224922,  0.40463135, -0.18500572,
         -0.6724614 ,  0.92111597, -0.31865146, -0.97160508,  0.4060936 ],
        [ 1.02886178,  0.15454739,  0.32862695, -0.40883481, -0.9323759 ,
         -0.33550771,  0.43363429, -0.16707399, -0.11967727,  0.9567594 ],
        [ 0.39657398,  0.45239932, -0.16330949,  0.33340474,  0.39414106,
          0.6818863 ,  0.80385136, -0.58166735, -0.31649416,  0.15290468]]),
 array([[-0.05135891,  0.66011932,  0.64564383],
        [-0.6156931 ,  0.76471771, -0.29345988],
        [-0.14075517,  0.0919627 , -0.72450641],
        [-0.0060898 , -0.20976357, -0.14309217],
        [ 0.23222962, -0.72658663,  0.38378495],
        [-0.13890312,  0.59622625,  0.91708085],
        [ 0.15046392, -0.15810372,  0.33281517],
        [-0.19052994,  0.27285169,  0.0935852 ],
        [ 

## **Training**

For the training we need a bit of each of the above. The steps are:
- Loop over all training data (Each pass is an epoch)
    - Break the data into batches.
    - Send a batch forward through the network.
    - Update the weights with the gradients computed above.
    
Things that we will need to send into the training loop:
- the training data.
- the target data.
- number of epochs.
- learning rate.
- weights.
- layers that we will need to make the gradients.
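
The steps above, including the split into batches, can be sketched as follows. This is a minimal illustration under assumptions: the function `train`, its parameters `batch_size` and `seed`, and the random `X_demo`/`Y_demo` arrays are hypothetical, the inputs are plain NumPy arrays rather than DataFrames, and the factor 2 over the batch size from the cost is kept in the gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, weights, epochs=100, learning_rate=0.1, batch_size=16, seed=0):
    """Mini-batch gradient descent for the 3-layer sigmoid network."""
    rng = np.random.default_rng(seed)
    W1, W2, W3 = [w.copy() for w in weights]   # leave the caller's arrays untouched
    N = X.shape[0]
    history = []
    for _ in range(epochs):                    # each pass over the data is an epoch
        order = rng.permutation(N)             # reshuffle before splitting into batches
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], Y[idx]
            # Forward pass on the batch, keeping the activations.
            a1 = sigmoid(xb @ W1)
            a2 = sigmoid(a1 @ W2)
            a3 = sigmoid(a2 @ W3)
            # Backward pass: all deltas are computed before any weight changes.
            d3 = (2.0 / len(idx)) * (a3 - yb) * a3 * (1 - a3)
            d2 = (d3 @ W3.T) * a2 * (1 - a2)
            d1 = (d2 @ W2.T) * a1 * (1 - a1)
            W3 -= learning_rate * (a2.T @ d3)
            W2 -= learning_rate * (a1.T @ d2)
            W1 -= learning_rate * (xb.T @ d1)
        # Record the cost on the full data set after each epoch.
        out = sigmoid(sigmoid(sigmoid(X @ W1) @ W2) @ W3)
        history.append(float(np.sum((out - Y) ** 2) / N))
    return [W1, W2, W3], history

rng = np.random.default_rng(3)
X_demo = rng.random((64, 4))
Y_demo = (X_demo[:, [0]] > 0.5).astype(float)   # made-up binary target
start_weights = [2 * rng.random((4, 10)) - 1,
                 2 * rng.random((10, 3)) - 1,
                 2 * rng.random((3, 1)) - 1]

trained, history = train(X_demo, Y_demo, start_weights, epochs=20)
print(history[0], history[-1])
```

Copying the weights inside `train` is a design choice for the sketch: it keeps the starting weights available for comparison after training.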

In [53]:
# How many times to consider the entire training set.
epoch = 100

# Features to use in training.
training_features = X

# Loop over all the epochs.
# Note: range(epoch + 1) makes epoch + 1 passes over the data.
for e in range(epoch + 1):
    
    layers = forward_pass(training_features, *weights) # Forward pass with the current weights.
    error = layers[3] - Y # Evaluation.
    weights = back_prop(learning_rate, layers, weights, error) # Back propagation.

# These are the final weights after all the training. 
weights

[array([[-10.06241126,  -0.6076612 ,   3.34616238,   0.91196121,
           1.44865172,  -0.78545495,  -0.2525013 ,  -1.11429943,
           0.97463769,  -3.10609286],
        [ -1.90222476,  -2.55879662,  -0.11574561,   0.85121727,
          -0.18013704,  -0.07682424,  -1.72246612,   0.04395962,
           2.0243891 ,   0.27222629],
        [-11.92739448,   3.00294179,   5.28547422,  -0.3224051 ,
           0.76689075,  -0.60099398,   1.33517528,  -0.48844716,
          -3.53007104,  -8.80437355],
        [ -4.95079083,   1.79452415,   2.18770847,  -0.74872256,
          -0.4897849 ,  -0.41560967,   0.84695439,   0.87447157,
          -0.76801281,  -2.75561465]]),
 array([[-3.49876992, -2.29256909, -2.62067903],
        [ 3.52323414,  2.41287272,  2.72311001],
        [-0.54265251,  0.60435037,  0.45279704],
        [-0.81447975, -2.4117693 , -1.31586052],
        [-2.02630675, -1.23218241, -1.94299212],
        [-0.53057861, -0.10108294, -0.09705777],
        [ 0.97350753,  1.7182664

In [54]:
A = forward_pass(training_features, weights[0], weights[1], weights[2])