# 4. Back propagation and gradient descent - updating weights and biases
_Author: Maurice Snoeren_<br>
This notebook explains back propagation and how the weights and biases are updated using the gradient descent algorithm and its update rule. We will use the example from an earlier notebook, which is shown in the figure below.

<img src="./images/ann1.png" width="400px" />

Note that each node has inputs, weights, a bias and an activation function, as shown in the figure below. It is important to understand this, because we will need all the forward propagation equations and functions.

<img src="./images/perceptron2.png" width="200px" />

## Training
Training a neural network means finding the weights and biases that produce the desired output for a given input. We could try random numbers for the weights and biases. However, that would take far too much time because of the many variables. Our simple example already contains 20 weights and 6 biases, which gives 26 variables to fit. We need a way to calculate which weights and biases need to be changed to move towards the final solution.

That is where gradient descent comes in. For the gradient descent algorithm we need to define the error of the neural network based on a known input and its desired output. Another term for this error is the cost function, which is the term normally used in the literature. The cost function tells us how bad (or how good) the current solution is. It is based on a test sample $T$ (in the literature usually called a training sample) that contains a predefined input $\hat{x}$ and the corresponding desired output $\hat{y}$. We define this test sample as

$T = [\hat{x}, \hat{y}]$.

## Cost function
Every neural network starts by filling the weights and biases with random values, to have some starting point. From this moment, we are able to train the network using samples from the test set. In this example, we will use one sample to train the network. Generally this is a bad idea, but for now we are only focusing on how a neural network can be trained. When we have a sample $T = [\hat{x}, \hat{y}]$, we start by calculating the output of the neural network $y$ from the input vector $\hat{x}$, using the forward propagation equations. As a summary, these equations are given below:

$z_h = \hat{x} * W_{hx} + b_h$<br>
$h   = f(z_h)$

$z_y = h * W_{yh} + b_y$<br>
$y   = f(z_y)$
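
The forward propagation equations above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the layer sizes 3, 4 and 2 are chosen only because they give 20 weights and 6 biases, the activation function $f$ is assumed to be a sigmoid, the `*` in the equations is read as a matrix product, and the helper names `sigmoid` and `forward` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation function f; the text does not fix a specific one."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_hat, W_hx, b_h, W_yh, b_y, f=sigmoid):
    """Forward propagation through one hidden layer and one output layer."""
    z_h = x_hat @ W_hx + b_h   # weighted input of the hidden layer
    h = f(z_h)                 # hidden layer activation
    z_y = h @ W_yh + b_y       # weighted input of the output layer
    y = f(z_y)                 # network output
    return z_h, h, z_y, y

# Assumed layer sizes: 3 inputs, 4 hidden nodes, 2 outputs.
n_x, n_h, n_y = 3, 4, 2

# Random starting point for the weights and biases.
rng = np.random.default_rng(seed=0)
W_hx = rng.normal(size=(n_x, n_h))
b_h = rng.normal(size=n_h)
W_yh = rng.normal(size=(n_h, n_y))
b_y = rng.normal(size=n_y)

# Hypothetical sample input.
x_hat = np.array([0.5, -1.0, 2.0])

z_h, h, z_y, y = forward(x_hat, W_hx, b_h, W_yh, b_y)
print("output y:", y)
```

The intermediate values `z_h`, `h` and `z_y` are returned as well, because back propagation needs them later on.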

We have now fed the sample input $\hat{x}$ into the network and found the output $y$. This output is based on the current weights and biases of the network. When the network is not trained yet, the output will generally not be the correct solution. The error or cost can be seen as the distance between the calculated output $y$ and the desired output $\hat{y}$. (In the literature, the calculated output is called the hypothesis.) We can calculate it with the well-known quadratic cost function (there are many types of cost functions, but we will stick to this one):

$J = \frac{1}{2}(y - \hat{y})^2$
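
As a small illustration, the quadratic cost can be computed as below. This is a sketch, not a definitive implementation: the function name `quadratic_cost` and the example values are hypothetical, and summing over the output nodes when $y$ is a vector is an assumption that the formula above does not state explicitly.

```python
import numpy as np

def quadratic_cost(y, y_hat):
    """Quadratic cost J = 1/2 * (y - y_hat)^2.

    Assumption: for vector outputs the squared errors are summed.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return 0.5 * np.sum((y - y_hat) ** 2)

# Hypothetical desired output and two hypothetical network outputs.
y_hat = np.array([1.0, 0.0])
far = quadratic_cost([0.1, 0.9], y_hat)    # output far from the desired output
near = quadratic_cost([0.9, 0.1], y_hat)   # output close to the desired output

print("cost when far away:", far)
print("cost when close:   ", near)
```

The output that lies further from the desired output gives the higher cost.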

When the calculated output and the desired output are far apart, the cost will be high, indicating that the neural network is not (yet) delivering the correct solution. If the cost is low, the output of the neural network is almost the same as desired. We could stop the training process when the cost drops below a chosen threshold.

## Which weights should be changed?
At this point, the cost function gives us good insight into how well our neural network is trained. Next, we would like to know which weights need to be adapted. Generally, we would like to tune the weights that have the most influence on the cost. As an example, we could change a certain weight and see how the cost changes. When the cost changes a lot, we know that changing this particular weight can bring us closer to the desired output. When the cost does not change much, we know that this particular weight does not get us much closer to the desired output. 

Manually changing each weight and observing the effect on the cost function is very cumbersome. We can solve this by applying mathematics. In principle, we would like to know how much the cost function changes when we change a weight. In mathematics this is given by the derivative of the cost function with respect to the weight, $\frac{\partial{J}}{\partial{w}}$. We use the partial derivative to focus on one variable only. The derivative tells us how much the cost function will change when we change the weight. We calculate this for every weight in the network. Based on the outcome, we know which weights influence the cost function the most. That is exactly what we need.
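
The idea of changing one weight a little and watching the cost can be automated with a numerical estimate of the partial derivative. The sketch below does this for a single node with three weights. It assumes a sigmoid activation and a central difference with a small step `eps`; the names `cost` and `numerical_gradient` and all values are hypothetical. It only illustrates what $\frac{\partial{J}}{\partial{w}}$ means and is not the back propagation method itself.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x_hat, y_hat):
    """Quadratic cost of a single node with weights w and bias b."""
    y = sigmoid(x_hat @ w + b)
    return 0.5 * (y - y_hat) ** 2

def numerical_gradient(w, b, x_hat, y_hat, eps=1e-6):
    """Estimate dJ/dw for each weight by nudging one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        # Central difference: (J(w + eps) - J(w - eps)) / (2 * eps)
        grad[i] = (cost(w + step, b, x_hat, y_hat)
                   - cost(w - step, b, x_hat, y_hat)) / (2 * eps)
    return grad

# Hypothetical sample and starting weights.
x_hat = np.array([0.5, -1.0, 2.0])
y_hat = 1.0
w = np.array([0.1, 0.2, -0.3])
b = 0.0

grad = numerical_gradient(w, b, x_hat, y_hat)
most_influential = int(np.argmax(np.abs(grad)))

print("estimated dJ/dw:", grad)
print("weight with the most influence on the cost:", most_influential)
```

This approach needs two cost evaluations per weight, which becomes expensive for larger networks. That is one reason to calculate the derivatives analytically instead.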

If you need more information on derivatives, you could check the following links (both in Dutch): https://www.wiskunde.net/differentieren and https://wiskundeacademie.nl/onderwerpen/de-afgeleide. Google and YouTube have tons of information on this subject as well.

## Gradient descent
Once we have found the gradients of the weights $\frac{\partial{J}}{\partial{w}}$ and the gradients of the biases $\frac{\partial{J}}{\partial{b}}$, we are able to update our weights and biases to improve our neural network. Updating these variables is done using the gradient descent algorithm. This algorithm has the following update rules for the weights and biases respectively:

$w = w - \alpha * \frac{\partial{J}}{\partial{w}}$<br>
$b = b - \alpha * \frac{\partial{J}}{\partial{b}}$

The minus sign moves the weights and biases in the direction in which the cost function becomes smaller. The symbol $\alpha$ is the learning rate. When $\alpha$ is high the learning goes fast and when $\alpha$ is low the learning is slow. Note that when $\alpha$ is too large, it is possible that the solution is not found, because the steps are so large that they overshoot the minimum. Change the value of $\alpha$ when you see that the neural network is not learning correctly. You can see this by plotting the cost function after each iteration. Therefore, it is good practice to always plot the cost function versus the iterations.
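
The effect of the learning rate can be seen on a toy problem. The sketch below applies the update rule to a single variable $w$ with the hypothetical cost $J(w) = \frac{1}{2}(w - 3)^2$, whose derivative is $w - 3$. The target value 3, the function name `gradient_descent` and both learning rates are assumptions chosen for illustration; this is not the cost of the example network.

```python
def gradient_descent(w0, alpha, n_iter=20, target=3.0):
    """Run gradient descent on the toy cost J(w) = 1/2 * (w - target)^2."""
    w = w0
    costs = []
    for _ in range(n_iter):
        costs.append(0.5 * (w - target) ** 2)  # cost before the update
        grad = w - target                      # dJ/dw for the toy cost
        w = w - alpha * grad                   # update rule
    return w, costs

# A small learning rate: the cost shrinks every iteration.
w_small, costs_small = gradient_descent(w0=0.0, alpha=0.1)

# A learning rate that is too large: every step overshoots the minimum
# and the cost grows instead of shrinking.
w_large, costs_large = gradient_descent(w0=0.0, alpha=2.5)

print("alpha = 0.1, first and last cost:", costs_small[0], costs_small[-1])
print("alpha = 2.5, first and last cost:", costs_large[0], costs_large[-1])
```

The lists `costs_small` and `costs_large` hold the cost per iteration, so they can be passed directly to a plotting library to draw the cost versus the iterations.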

## Calculate the weight gradient matrix
The update rule of the gradient descent algorithm is used to update the weights and biases. This update rule contains the derivatives of the cost with respect to the weights and biases. We can calculate these derivatives by propagating backwards through the network: back propagation. First we will show how to calculate the weight gradient matrices of the network. In our example, we start with the last weight matrix. Therefore, we would like to fi