<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/umbcdata602/fall2020/blob/master/lab_backpropagation.ipynb">
<img src="http://introtodeeplearning.com/images/colab/colab.png?v2.0"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# Lab -- Backpropagation

The goal here is to develop an intuition for backpropagation, with minimal math.



# Supersimplified case

The supersimplified model is

$$
\hat{y} = wx
$$

* $x$ is the model input (feature)
* $\hat{y}$ is the model prediction of the target variable $y$
* $w$ is the trainable weight
* $\epsilon = y - \hat{y}$ is the residual of the model prediction

The cost function $J$ is the squared residual:
$$
J = \epsilon^2 = (y - wx)^2
$$
 
If we had multiple inputs and samples, then $J$ would involve multiple sums, which would make the mathematical manipulations more complicated. We'll hold off.

## The solution

$J$ is at a minimum where its derivative with respect to $w$ vanishes:

$$
\frac{\partial J}{\partial w} = 0
$$

The gradient of $J$ with respect to the weight $w$ is

$$
\frac{\partial J}{\partial w} = 2(y-wx)(-x) = -2x(y-wx) = -2x\epsilon
$$

Setting this gradient to zero (and assuming $x \neq 0$) requires the residual to vanish:

$$\epsilon = y - wx = 0$$

In this supersimplified case, the solution is obvious by inspection:

$$
w = \frac{y}{x}
$$

Here $x$ is the model input and $y$ is the target output, so a single sample with $x \neq 0$ determines $w$ exactly.
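As a quick sanity check of the derivation above, the sketch below compares the analytic gradient $-2x(y - wx)$ with a central-difference estimate and confirms that $w = y/x$ drives both the cost and the gradient to zero. The sample values and the helper names (`cost`, `grad`, `numeric_grad`) are illustrative assumptions, not part of the lab.

```python
def cost(w, x, y):
    """Squared-error cost J = (y - w*x)**2 for a single sample."""
    return (y - w * x) ** 2


def grad(w, x, y):
    """Analytic gradient dJ/dw = -2*x*(y - w*x)."""
    return -2.0 * x * (y - w * x)


def numeric_grad(w, x, y, h=1e-6):
    """Central-difference approximation of dJ/dw."""
    return (cost(w + h, x, y) - cost(w - h, x, y)) / (2.0 * h)


# Illustrative sample (assumed values): input x, target y, initial weight w
x, y, w = 2.0, 6.0, 1.0

analytic = grad(w, x, y)
numeric = numeric_grad(w, x, y)
print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)

# Closed-form solution w = y/x (requires x != 0)
w_star = y / x
print("w* =", w_star, " J(w*) =", cost(w_star, x, y), " dJ/dw at w* =", grad(w_star, x, y))
```

The two gradient values should agree to several decimal places, and any $w$ other than $w^* = y/x$ gives a strictly larger cost.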

# Slightly more complex case

We'll add some complexity, and perform the calculations in a sequence of "Layers."

Layer 1 multiplies the weight $w$ by the input $x$. The output of Layer 1 is the resulting product $z$,

$$
z = wx
$$

Layer 2 takes the output of Layer 1 and computes a nonlinear function $f(z)$. The output of Layer 2 is a prediction $\hat{y}$ of $y$:

$$
\hat{y} = f(z)
$$

The cost function $J$ can be computed with the result of Layer 2.

$$
J = (y - f(z))^2 = \epsilon^2
$$

By the chain rule, the derivative of $J$ with respect to the weight $w$ is

$$
\frac{\partial J}{\partial w} = 
2(y-f(z))\left( - \frac{\partial f}{\partial z} \frac{\partial z}{\partial w} \right) = -2 (y-f(z))\frac{\partial f}{\partial z} \frac{\partial z}{\partial w}
$$
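The chain-rule expression above can be checked numerically. The sketch below assumes $f(z) = \tanh(z)$ purely for illustration (the lab does not fix a particular $f$) and compares the analytic product $-2(y - f(z))\,\frac{\partial f}{\partial z}\,\frac{\partial z}{\partial w}$ with a central-difference estimate. The function names and sample values are hypothetical.

```python
import math


def dJ_dw(w, x, y, f, df_dz):
    """Chain rule: dJ/dw = -2*(y - f(z)) * df/dz * dz/dw, with z = w*x."""
    z = w * x            # Layer 1 output
    y_hat = f(z)         # Layer 2 output
    dz_dw = x            # local derivative of Layer 1
    return -2.0 * (y - y_hat) * df_dz(z) * dz_dw


def tanh_prime(z):
    """Derivative of tanh(z)."""
    return 1.0 - math.tanh(z) ** 2


def cost(w, x, y):
    """J = (y - f(w*x))**2 with f = tanh (assumed for this example)."""
    return (y - math.tanh(w * x)) ** 2


# Illustrative sample (assumed values)
x, y, w = 1.5, 0.8, 0.4

analytic = dJ_dw(w, x, y, math.tanh, tanh_prime)

h = 1e-6
numeric = (cost(w + h, x, y) - cost(w - h, x, y)) / (2.0 * h)

print("analytic:", analytic)
print("numeric: ", numeric)
```

Swapping in another differentiable $f$ only requires passing a matching pair `f`, `df_dz`; the rest of the chain is unchanged.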


## Solving the more complex case

To evaluate $\frac{\partial J}{\partial w}$, we need the outputs from each layer, $z$ and $\hat{y}$.
We also need $\frac{\partial f}{\partial z}$ and $\frac{\partial z}{\partial w}$, which we can compute in each layer during the forward pass. That's what happens in practice: the algorithm keeps track of what it needs for these derivatives during the forward pass through the network. In TensorFlow, that's the purpose of `tf.GradientTape`, which records the operations executed during the forward pass so that the gradients can be computed afterward.
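To make the bookkeeping concrete, here is a toy sketch in plain Python in which each layer returns its output together with its local derivative, and the forward pass stores those derivatives on a list. It assumes a sigmoid for $f(z)$ and uses hypothetical names (`layer1`, `layer2`, `forward_with_tape`, `backward_from_tape`). It illustrates the idea only and is not how TensorFlow implements `GradientTape` internally.

```python
import math


def layer1(w, x):
    """Layer 1: z = w*x. Also returns the local derivative dz/dw = x."""
    return w * x, x


def layer2(z):
    """Layer 2: y_hat = sigmoid(z). Also returns the local derivative df/dz."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return y_hat, y_hat * (1.0 - y_hat)


def forward_with_tape(w, x):
    """Run both layers, recording each local derivative in order."""
    tape = []
    z, dz_dw = layer1(w, x)
    tape.append(dz_dw)
    y_hat, df_dz = layer2(z)
    tape.append(df_dz)
    return y_hat, tape


def backward_from_tape(y, y_hat, tape):
    """Start from dJ/dy_hat and multiply by the recorded derivatives in reverse."""
    grad = -2.0 * (y - y_hat)      # dJ/dy_hat for J = (y - y_hat)**2
    for local in reversed(tape):   # df/dz first, then dz/dw
        grad *= local
    return grad


# Illustrative sample (assumed values)
x, y, w = 1.5, 0.8, 0.4
y_hat, tape = forward_with_tape(w, x)
dJ_dw = backward_from_tape(y, y_hat, tape)
print("y_hat =", y_hat)
print("dJ/dw =", dJ_dw)
```

The backward pass never re-evaluates the layers; it only reuses what the forward pass stored, which is the property the paragraph above describes.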

* If $f(z)$ is a linear function of $z$, then $J$ is a parabola in $w$ as before, and we can solve the problem right away.
* If $f(z)$ is a nonlinear function of $z$, then $J$ will have more complex structure and need not be convex. As long as $f(z)$ is "well-behaved" and strictly monotonic in $z$ with $\frac{\partial f}{\partial z} \neq 0$, and $y$ lies in the range of $f$, the only point where $\frac{\partial J}{\partial w} = 0$ (for $x \neq 0$) is the minimum, as before. In that case, we need to solve the equation $y = f(z)$ for $w$.
* In general, however, $J$ may not be convex (see Figure below from Raschka's [ch12.ipynb](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch12/ch12.ipynb)). In this case we perform gradient descent.


<img src="https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch12/images/12_13.png" width="600"/>

* If we use batch gradient descent, then we risk getting caught in a local minimum. That's where the "noisy" behavior of stochastic gradient descent can be helpful. As the algorithm progresses down the gradient in search of a global minimum in $J$, SGD can knock the algorithm out of local minima.
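A minimal sketch of gradient descent on the two-layer model follows, assuming a sigmoid for $f(z)$, a single training sample, and hand-picked values for the learning rate and step count (all of these are illustrative assumptions). With only one sample there is no difference between batch and stochastic updates, and this particular $J$ has a single minimum, so the example shows only the basic update rule $w \leftarrow w - \eta \frac{\partial J}{\partial w}$, not the escape from local minima.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_descent(x, y, w0=0.0, lr=0.5, steps=2000):
    """Minimize J = (y - sigmoid(w*x))**2 by repeated gradient steps."""
    w = w0
    for _ in range(steps):
        y_hat = sigmoid(w * x)                       # forward pass
        df_dz = y_hat * (1.0 - y_hat)                # sigmoid derivative
        grad = -2.0 * (y - y_hat) * df_dz * x        # chain rule, dJ/dw
        w -= lr * grad                               # step down the gradient
    return w


# Illustrative sample (assumed values); y must lie in (0, 1) for a sigmoid output
x, y = 1.5, 0.8

w_fit = gradient_descent(x, y)

# For comparison: solve y = sigmoid(w*x) directly via the logit
w_exact = math.log(y / (1.0 - y)) / x

print("gradient descent:", w_fit)
print("exact solution:  ", w_exact)
```

Because the sigmoid is strictly monotonic, the iterates here approach the same $w$ obtained by solving $y = f(z)$ directly; on a non-convex surface like the one in the figure, the result would depend on the starting point `w0`.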