# Gradient Descent Math
## Simple Neural Network

<img src="./simple_neural_network.png" width=500 height=400/>

## Error

$E = (y - \hat{y})^2$

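$y$ is the actual value and $\hat{y}$ is the prediction. Why square the difference instead of taking the absolute value? Both make every error positive, but squaring penalizes large errors (outliers) more heavily, and it is differentiable everywhere, which matters once we start taking gradients.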
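As a small illustration of how squaring weights large errors more heavily, the sketch below compares absolute and squared values for a few made-up residuals (the numbers are hypothetical, chosen only for this example):

```python
# Hypothetical residuals y - y_hat, including one outlier.
errors = [-0.5, 1.0, -4.0]

absolute = [abs(e) for e in errors]  # every value becomes positive
squared = [e ** 2 for e in errors]   # positive too, but the outlier dominates

print(absolute)  # [0.5, 1.0, 4.0]
print(squared)   # [0.25, 1.0, 16.0]
```

The outlier is 4 times the middle error in absolute terms but 16 times larger once squared.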

That is just for one point, however. For the entire data set, we sum the error over every data point $\mu$:
$$
E = \frac{1}{2} \sum_{\mu} (y^{\mu} - \hat{y}^{\mu})^2
$$

This is known as the **sum of the squared errors** (SSE). The factor of $\frac{1}{2}$ is a convenience: it cancels the 2 produced when the square is differentiated.

Recall that $\hat{y}^{\mu} = f(\sum_{i} w_ix_i^{\mu})$, where $f$ is the sigmoid function. Substituting this into the error gives:

$$
E = \frac{1}{2} \sum_{\mu} \left(y^{\mu} - f\left(\sum_{i} w_ix_i^{\mu}\right)\right)^2
$$
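The error formula can be sketched in plain Python as follows. The function names (`sigmoid`, `predict`, `sum_squared_error`) and the toy data are assumptions made for this illustration, not part of the original notes.

```python
import math

def sigmoid(h):
    """Logistic sigmoid f(h) = 1 / (1 + e^(-h))."""
    return 1.0 / (1.0 + math.exp(-h))

def predict(weights, x):
    """y_hat = f(sum_i w_i * x_i) for a single data point x."""
    h = sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(h)

def sum_squared_error(weights, inputs, targets):
    """E = 1/2 * sum over data points of (y - y_hat)^2."""
    return 0.5 * sum((y - predict(weights, x)) ** 2
                     for x, y in zip(inputs, targets))

# Hypothetical toy data: two points with two features each.
inputs = [[1.0, 2.0], [0.5, -1.0]]
targets = [1.0, 0.0]
weights = [0.0, 0.0]  # zero weights make every prediction sigmoid(0) = 0.5

print(sum_squared_error(weights, inputs, targets))  # 0.25
```

With all-zero weights both predictions are 0.5, so each squared error is 0.25 and the total is $\frac{1}{2}(0.25 + 0.25) = 0.25$.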

On each pass we update the weights as $w_i = w_i + \Delta w_i$, where the step $\Delta w_i$ points in the direction of the negative gradient.

The weight step is proportional to the negative of the partial derivative of the error with respect to that weight:
$$
\Delta w_i \propto - \frac{\partial E}{\partial w_i}
$$

Introducing the learning rate $\eta$ as the constant of proportionality turns this into an equality:
$$
\Delta w_i = - \eta \frac{\partial E}{\partial w_i}
$$

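To evaluate the derivative, consider a single data point and make the dependence of $\hat{y}$ on $w_i$ explicit:
$$
\begin{align}
\frac{\partial E}{\partial w_i}& = \frac{\partial}{\partial w_i} \frac{1}{2}(y - \hat{y})^2 \\
& =\frac{\partial}{\partial w_i} \frac{1}{2}(y - \hat{y}(w_i))^2
\end{align}
$$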


This requires the **chain rule** to solve. Quick refresher on the chain rule: 

$$
\frac{\partial}{\partial z} p(q(z)) = \frac{\partial p}{\partial q}\frac{\partial q}{\partial z}
$$

In our problem, $p = \frac{1}{2}q^2$ and $q = y - \hat{y}(w_i)$, so $\frac{\partial p}{\partial q} = q = y - \hat{y}$.

Applying the chain rule, and noting that $y$ does not depend on $w_i$:

$$
\begin{align}
\frac{\partial E}{\partial w_i} & = (y - \hat{y}) \frac{\partial}{\partial w_i} (y - \hat{y}) \\
& = -(y - \hat{y}) \frac{\partial \hat{y}}{\partial w_i}
\end{align}
$$

Recall that $\hat{y} = f(h)$, where $f$ is the activation function (such as the sigmoid) applied to $h$, and $h$ is the linear combination of the input values and the weights:
$$
h = \sum_{i} w_i x_i
$$
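A minimal sketch of this forward pass, assuming a sigmoid activation. The derivative helper uses the standard identity $f'(h) = f(h)(1 - f(h))$ for the sigmoid, which these notes do not derive; the input and weight values are hypothetical.

```python
import math

def sigmoid(h):
    """Activation f(h) = 1 / (1 + e^(-h))."""
    return 1.0 / (1.0 + math.exp(-h))

def sigmoid_prime(h):
    """Derivative of the sigmoid: f'(h) = f(h) * (1 - f(h))."""
    s = sigmoid(h)
    return s * (1.0 - s)

# Hypothetical inputs and weights, chosen so that h comes out to exactly 0.
x = [1.0, 2.0]
w = [0.5, -0.25]

h = sum(wi * xi for wi, xi in zip(w, x))  # 0.5*1.0 + (-0.25)*2.0 = 0.0
y_hat = sigmoid(h)                        # f(0) = 0.5
slope = sigmoid_prime(h)                  # f'(0) = 0.25

print(h, y_hat, slope)
```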

Using the chain rule again, $\frac{\partial \hat{y}}{\partial w_i} = f'(h) \frac{\partial h}{\partial w_i}$, so:
$$
\frac{\partial E}{\partial w_i} = -(y - \hat{y}) f'(h) \frac{\partial}{\partial w_i} \sum_{i} w_i x_i
$$

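In $\sum_{i} w_i x_i$, only one term depends on any given weight. Taking $w_1$ as an example:

$$
\begin{align}
\frac{\partial}{\partial w_1} \sum_{i} w_i x_i & = \frac{\partial}{\partial w_1} [w_1x_1 + w_2x_2 + \dots + w_nx_n]\\
& = x_1 + 0 + 0 + \dots \\
& = x_1
\end{align}
$$

In general, $\frac{\partial}{\partial w_i} \sum_{i} w_i x_i = x_i$.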

Putting it all together:
$$
\frac{\partial E}{\partial w_i} = -(y - \hat{y})f'(h)x_i
$$
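A finite-difference check is a quick way to confirm a derived gradient, including its sign. The sketch below compares a central-difference estimate of the derivative of the single-point error with the closed form $-(y - \hat{y})f'(h)x_i$, where the minus sign comes from differentiating $(y - \hat{y})$ in the chain-rule step. It assumes a sigmoid activation; the helper names and sample values are hypothetical.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def error(w, x, y):
    """Single-point error E = 1/2 * (y - y_hat)^2."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (y - sigmoid(h)) ** 2

def analytic_gradient(w, x, y):
    """dE/dw_i = -(y - y_hat) * f'(h) * x_i, with f'(h) = f(h) * (1 - f(h))."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = sigmoid(h)
    f_prime = y_hat * (1.0 - y_hat)
    return [-(y - y_hat) * f_prime * xi for xi in x]

def numeric_gradient(w, x, y, eps=1e-6):
    """Central-difference estimate of dE/dw_i, one weight at a time."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((error(w_plus, x, y) - error(w_minus, x, y)) / (2 * eps))
    return grad

# Hypothetical sample point and weights.
w = [0.1, -0.2]
x = [1.0, 2.0]
y = 1.0

analytic = analytic_gradient(w, x, y)
numeric = numeric_gradient(w, x, y)
print(analytic)
print(numeric)
```

The two lists should agree to several decimal places. Here the prediction is below the target and both inputs are positive, so every gradient component is negative and the update $-\eta \frac{\partial E}{\partial w_i}$ increases the weights.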

To update the weights:
$$
\Delta w_i = \eta (y - \hat{y})f'(h)x_i
$$

To simplify the notation, we can define an error term $\delta = (y - \hat{y})f'(h)$, so the weight step is $\Delta w_i = \eta \delta x_i$ and updating the weights is just:
$$
w_i = w_i + \eta \delta x_i
$$
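The update rule can be sketched as a small training loop on a single data point. This is an illustrative sketch rather than a full implementation: it assumes a sigmoid activation, and the function name `gradient_descent_step`, the learning rate, the step count, and the data are all hypothetical choices.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def gradient_descent_step(w, x, y, eta):
    """Return updated weights w_i + eta * delta * x_i for one data point."""
    y_hat = predict(w, x)
    # Error term: delta = (y - y_hat) * f'(h), with f'(h) = y_hat * (1 - y_hat).
    delta = (y - y_hat) * y_hat * (1.0 - y_hat)
    return [wi + eta * delta * xi for wi, xi in zip(w, x)]

# Hypothetical single training example and settings.
x, y = [1.0, 2.0], 1.0
w = [0.0, 0.0]
eta = 0.5

initial_error = 0.5 * (y - predict(w, x)) ** 2
for _ in range(100):
    w = gradient_descent_step(w, x, y, eta)
final_error = 0.5 * (y - predict(w, x)) ** 2

print(initial_error, final_error)
```

Each step moves the prediction toward the target, so the error after the loop is smaller than the initial value of 0.125.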

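If there are multiple output units, each output unit $j$ gets its own error term:

$$
\delta_j = (y_j - \hat{y}_j)f'(h_j)
$$

The update for the weight $w_{ij}$ connecting input $i$ to output $j$ is then:

$$
\Delta w_{ij} = \eta \delta_j x_i
$$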
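A minimal sketch of the multi-output update, assuming `weights[i][j]` holds the weight from input $i$ to output unit $j$, a sigmoid activation, and a learning rate applied once, in the $\Delta w_{ij}$ step. The function name `update_weights` and the sample values are hypothetical.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def update_weights(weights, x, y, eta):
    """Apply w_ij += eta * delta_j * x_i for every input i and output j."""
    n_in, n_out = len(x), len(y)
    new_weights = [row[:] for row in weights]  # copy so the input is untouched
    for j in range(n_out):
        h_j = sum(weights[i][j] * x[i] for i in range(n_in))
        y_hat_j = sigmoid(h_j)
        # delta_j = (y_j - y_hat_j) * f'(h_j), with f'(h_j) = y_hat_j * (1 - y_hat_j)
        delta_j = (y[j] - y_hat_j) * y_hat_j * (1.0 - y_hat_j)
        for i in range(n_in):
            new_weights[i][j] += eta * delta_j * x[i]
    return new_weights

# Hypothetical example: two inputs, two output units, all weights start at zero.
x = [1.0, 2.0]
y = [1.0, 0.0]
weights = [[0.0, 0.0], [0.0, 0.0]]

new_weights = update_weights(weights, x, y, eta=0.5)
print(new_weights)  # [[0.0625, -0.0625], [0.125, -0.125]]
```

With zero weights both outputs start at 0.5, so $\delta_1 = 0.125$ and $\delta_2 = -0.125$: weights feeding the first output increase and those feeding the second decrease, each scaled by the corresponding input.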

