# Gradient Descent and Backpropagation - Introduction  (Primer)
#### Author: Jay Mody
---
## Introduction

Gradient descent and backpropagation are often presented as a black-box process. You are usually shown a picture of descending a mountain, or told that the error is propagated from the end of the network back to the beginning. But how does it all actually work? How would you implement it in code? To understand and implement backpropagation and gradient descent, we need to understand the math behind them.

This notebook is designed for an audience that is already familiar with neural networks, so concepts like feed forward, hidden layers, and activation functions are assumed. Here's a summary of what you should already be familiar with:

**Required Knowledge:**
- Basic Python Skills
- NumPy
- Functions
- Calculus (derivatives, gradients, chain rule)
- Linear Algebra (matrices, matrix multiplication)
- Basic Understanding of a fully connected neural network (feed forward, activations, error, weights, bias)

Make sure you understand everything in this notebook before moving on. Let's start by refreshing our memory on what the feed forward process of a neural network looks like.

---

## The Neural Network
Let's look at a simple example of a neural network with a single hidden layer. The feed forward process would look something like this:

\begin{align}
\large
X \xrightarrow[W^1X + B^1]{\text{linear}} h_i \xrightarrow[f(h_i)]{\text{activation}} h_o \xrightarrow[W^2h_o + B^2]{\text{linear}} y_i \xrightarrow[f(y_i)]{\text{activation}} \hat{y} \xrightarrow[E(y, \hat{y})]{\text{error}} E
\end{align}

**Inputs and Outputs:**
- $X$ is the inputs ($x_1, x_2$)
- $h_i$ is the hidden layer inputs
- $h_o$ is the hidden layer outputs
- $y_i$ is the output layer inputs
- $\hat{y}$ is the predictions (output layer outputs)
- $E$ is the error of the network


**Labels:**
- $y$ is the correct labels


**Parameters:**
- $W^1$ is the weight matrix connecting the input layer to the hidden layer
- $B^1$ is the bias vector for the input layer to the hidden layer
- $W^2$ is the weight matrix connecting the hidden layer to the output layer
- $B^2$ is the bias vector for the hidden layer to the output layer


**Functions:**
- $f(x)$ is an activation function
- $E(y, \hat{y})$ is the error function


The final result is your predictions $\hat{y}$. Making accurate predictions is the whole purpose of a neural network. However, when training and testing the network, there is also the **error** step, which uses an **error function** that takes in the predictions $\hat{y}$ and the labels $y$ to "measure" how wrong the network is.
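The feed forward steps above can be sketched in NumPy. This is only a minimal sketch under stated assumptions: the layer sizes (2 inputs, 4 hidden units, 3 outputs), the sigmoid activation, the random initialization, and the names `sigmoid` and `feed_forward` are all chosen for illustration and are not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation, assumed here as the activation function f."""
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(X, W1, B1, W2, B2):
    """Run one sample through the network and return every intermediate value."""
    h_i = W1 @ X + B1      # linear step: hidden layer inputs
    h_o = sigmoid(h_i)     # activation: hidden layer outputs
    y_i = W2 @ h_o + B2    # linear step: output layer inputs
    y_hat = sigmoid(y_i)   # activation: predictions
    return h_i, h_o, y_i, y_hat

# Assumed sizes: 2 inputs, 4 hidden units, 3 outputs
rng = np.random.default_rng(0)
W1, B1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, B2 = rng.normal(size=(3, 4)), np.zeros(3)

X = np.array([0.5, -1.2])  # a single input sample (x_1, x_2)
h_i, h_o, y_i, y_hat = feed_forward(X, W1, B1, W2, B2)
print(y_hat.shape)  # (3,)
```

Returning the intermediate values as well as $\hat{y}$ is a design choice for this sketch: it makes each arrow in the diagram visible as a named variable.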

---
## Error
The **error** of a network measures how "wrong" the predictions were. The lower the error, the better the prediction; the higher the error, the worse the prediction. If we can find an intelligent way to change the parameters of the network so that the error goes down, we can train the network. Before we get to how that is achieved, let's look at ways of defining error.


### Error Function
There are many ways we can define the error. The simplest approach is to subtract the prediction ($\hat{y}$) from the label ($y$). The further the prediction is from the label, the greater the magnitude of the error.
For our loss function (also called the error function or cost function), we will end up using squared error, but let's start with the plain difference:

\begin{align}
\large
E(y, \hat{y}) = y - \hat{y}
\end{align}

One problem with this approach is that we can get negative errors. In both $0.9 - 0.7$ and $0.7 - 0.9$ the prediction is $0.2$ away from the label, yet one of them produces a negative error. That makes no sense, since the best achievable error is 0. To avoid this, we could take the absolute value of the difference, or we can square it. Squaring the difference exaggerates the error for "wronger" predictions, and it also has a more useful derivative, which will come in handy later on.
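The sign problem can be checked with a few lines of plain Python. This is an illustrative sketch; the function names `difference`, `absolute_error`, and `squared_error` are made up for this example.

```python
def difference(y, y_hat):
    """Plain difference: can be negative."""
    return y - y_hat

def absolute_error(y, y_hat):
    """Absolute value of the difference: never negative."""
    return abs(y - y_hat)

def squared_error(y, y_hat):
    """Squared difference: never negative, and grows faster for larger misses."""
    return (y - y_hat) ** 2

# Two predictions that are equally far from their labels
for y, y_hat in [(0.9, 0.7), (0.7, 0.9)]:
    print(difference(y, y_hat), absolute_error(y, y_hat), squared_error(y, y_hat))
```

The plain difference flips sign between the two cases, while the absolute and squared versions agree (up to floating point rounding). Squaring also means that a miss twice as large costs roughly four times as much.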

\begin{align}
\large
E(y, \hat{y}) = (y - \hat{y})^2
\end{align}

We are also going to add a $1/2$ constant to make the derivative of the function cleaner, since that will be required later on.

\begin{align}
\large
E(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2
\end{align}
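To see why the $1/2$ helps, differentiate the error with respect to the prediction $\hat{y}$ using the chain rule. The $2$ from the power rule cancels the $\frac{1}{2}$:

\begin{align}
\large
\frac{\partial E}{\partial \hat{y}} = \frac{1}{2} \cdot 2(y - \hat{y}) \cdot (-1) = -(y - \hat{y}) = \hat{y} - y
\end{align}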

Of course, the above would give you the error of a single class. If you want the error for all the predictions in the vector $\hat{y}$, then simply take the mean over all the classes:

\begin{align}
\large
E(y, \hat{y}) = \frac{1}{M} \sum_{j=1}^M \frac{1}{2}(y_j - \hat{y}_j)^2
\end{align}

where $M$ is the number of classes (in the case of our data, 3). If you wanted to measure the error of a whole dataset, you would again take the mean of the error over all the samples in the dataset:

\begin{align}
\large
E(y, \hat{y}) = \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \frac{1}{2}(y_{ij} - \hat{y}_{ij})^2
\end{align}

where $N$ is the total number of samples in the dataset. This error function is known as **Mean Squared Error** or MSE. Often, the equation for [MSE](https://en.wikipedia.org/wiki/Mean_squared_error) is written without the inner summation (as it is implied that $y_i - \hat{y}_i$ is the mean error over all the classes), and without the $\frac{1}{2}$ constant that we added:

\begin{align}
\large
E(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
\end{align}
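The formulas above translate almost directly into NumPy. The sketch below is one possible way to write it, not a definitive implementation: the function name `mean_squared_error`, the `half` flag, and the example labels and predictions are all made up for illustration.

```python
import numpy as np

def mean_squared_error(y, y_hat, half=True):
    """Mean over all N samples and M classes of the (optionally halved) squared error."""
    squared = (y - y_hat) ** 2
    if half:
        squared = 0.5 * squared  # the 1/2 constant added for a cleaner derivative
    return squared.mean()        # divides the double sum by N * M

# Made-up example: N = 2 samples, M = 3 classes, one-hot labels
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y_hat = np.array([[0.8, 0.1, 0.1],
                  [0.3, 0.6, 0.1]])

print(mean_squared_error(y, y_hat))              # with the 1/2 constant
print(mean_squared_error(y, y_hat, half=False))  # without the 1/2 constant
```

Because the $\frac{1}{2}$ is a constant factor, the two versions differ only by scale and are minimized by the same parameters.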

One last important concept remains before we can look for a way to change the parameters so that the error decreases: the idea that neural networks are functions.

## Neural Networks are Functions
The concept that a neural network is a function, specifically a composite function, is crucial to understanding neural networks. As you progress through the network, each layer is a function of the layer before it, so the composition gets deeper and deeper. For example, the error $E$ of the neural network is composed as follows:
- $E$ is a function of the labels $y$ and the predictions $\hat{y}$
- $\hat{y}$ is a function of $y_i$
- $y_i$ is a function of $h_o$
- $h_o$ is a function of $h_i$
- $h_i$ is a function of the inputs $x$

\begin{align}
\large
X \xrightarrow[W^1X + B^1]{\text{linear}} h_i \xrightarrow[f(h_i)]{\text{activation}} h_o \xrightarrow[W^2h_o + B^2]{\text{linear}} y_i \xrightarrow[f(y_i)]{\text{activation}} \hat{y} \xrightarrow[E(y, \hat{y})]{\text{error}} E
\end{align}


There are 3 categories of variables that go into a neural network:
1. **Parameters:** the weights and biases
2. **The inputs:** $x$
3. **The labels:** $y$

A neural network just combines all these variables using fancy functions and linear combinations to get a prediction and then an error. You could technically express $\hat{y}$ in terms of the parameters and the inputs. It would be a very long, ugly function, but a function nonetheless.
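For the single hidden layer network in this notebook, substituting each step of the feed forward process into the next gives the prediction, and then the error, as explicit functions of the inputs, labels, and parameters:

\begin{align}
\large
\hat{y} = f(W^2 f(W^1X + B^1) + B^2)
\end{align}

\begin{align}
\large
E = E(y, f(W^2 f(W^1X + B^1) + B^2))
\end{align}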

Think of it like a massive machine with hundreds and hundreds of knobs and dials that we can turn and twist (the knobs and dials representing the parameters). The inputs go into this machine, the knobs and dials do something with them, and an answer is spit out. We are simply trying to find the configuration of knobs and dials that gives us accurate answers.

What's important is that a neural network is not only a function, but a **differentiable function** (since the activation and error functions are differentiable, and so is a linear combination).

We know that reducing the error should train the network. We also know that a **derivative** models how one variable changes with respect to another. The **gradient** of a function indicates which direction of change in the variables would cause the greatest increase in the function, and it is expressed in the language of derivatives.
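A one-parameter toy example makes the idea concrete. Everything here is hypothetical and unrelated to the network itself: `toy_error` is a made-up function with a single parameter `w`, and `step_size` is an arbitrary value chosen for illustration.

```python
def toy_error(w):
    """A made-up error with a single parameter w, minimized at w = 3."""
    return (w - 3.0) ** 2

def toy_error_derivative(w):
    """Derivative of toy_error with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0
step_size = 0.1  # assumed value, chosen only for illustration

slope = toy_error_derivative(w)  # negative here: the error rises as w decreases
w_new = w - step_size * slope    # step against the direction of increase

print(toy_error(w), toy_error(w_new))
```

The derivative points in the direction in which the error increases, so moving `w` the opposite way gives a smaller error in this example.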

## Gradients
If we could find the gradient of the **error with respect to the weights/biases**, the negative of that gradient would tell us how to adjust the weights/biases to decrease error. 

As we said, $E$ is a function of the inputs, labels, and **all** the parameters. If we isolated how a single weight affects the network by taking the **partial derivative** of the error with respect to that single weight, we would be able to see how that weight can be changed to decrease the error. Do that for all possible weights and biases, and that would yield the **gradient vector** of the error.

The gradient vector of the error with respect to the weights and biases respectively would look like this. 

\begin{align}
\large
\nabla E_w = (\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \dots, \frac{\partial E}{\partial w_n})
\end{align}

\begin{align}
\large
\nabla E_b = (\frac{\partial E}{\partial b_1}, \frac{\partial E}{\partial b_2}, \dots, \frac{\partial E}{\partial b_n})
\end{align}
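One way to get a feel for these gradient vectors is to approximate every partial derivative numerically: nudge one parameter at a time and watch how the error changes. Note that this is finite differences, not backpropagation, and it is far too slow for real training. The sketch assumes a 2-4-3 network with sigmoid activations and random weights, and the names `network_error` and `numerical_gradient`, the `eps` value, and the `step_size` are all made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def network_error(params, X, y):
    """Error E of a 2-4-3 network (assumed sizes) as a function of its parameters."""
    W1, B1, W2, B2 = params
    h_o = sigmoid(W1 @ X + B1)
    y_hat = sigmoid(W2 @ h_o + B2)
    return np.mean(0.5 * (y - y_hat) ** 2)

def numerical_gradient(params, X, y, eps=1e-6):
    """Approximate dE/dp for every single parameter p with central differences."""
    grads = []
    for p in params:
        grad = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            original = p[idx]
            p[idx] = original + eps
            e_plus = network_error(params, X, y)
            p[idx] = original - eps
            e_minus = network_error(params, X, y)
            p[idx] = original  # restore the parameter before moving on
            grad[idx] = (e_plus - e_minus) / (2 * eps)
        grads.append(grad)
    return grads

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 2)), np.zeros(4), rng.normal(size=(3, 4)), np.zeros(3)]
X = np.array([0.5, -1.2])      # made-up input sample
y = np.array([1.0, 0.0, 0.0])  # made-up one-hot label

grads = numerical_gradient(params, X, y)
step_size = 0.1  # assumed value
new_params = [p - step_size * g for p, g in zip(params, grads)]

print(network_error(params, X, y), network_error(new_params, X, y))
```

Each entry of `grads` has the same shape as the parameter it belongs to, and moving every parameter a small step along the negative gradient should lower the error on this sample.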



Our inputs and labels are defined by our **data**. For the neural network, they are static: the inputs will always be the inputs, and the labels will always be the labels. The whole goal is to have the neural network find a relationship between the inputs and labels and predict from it.

The parameters are variable. The goal is t
