# Neural Networks

Neural Networks (NN) are a very broad category of algorithms and models whose designs were inspired by the patterns and complexities of the human mind. The design philosophy behind NN is fairly simple: iteratively stack relatively simple components to build a significantly more complex model.

Mathematically, NN are functions from one space to another. Training them therefore relies on Jacobians, the chain rule, and computational graphs, all of which are applied together in a technique known as backpropagation.

To begin with, we'll define and discuss the first and simplest component of Neural Networks: a perceptron.

![Perceptron](images/perceptron.png)

Let's break down what we see above. First, we have our inputs $x_1,x_2,\dots,x_n$. These are whatever variables or features you will be working with. Each input is multiplied by its own *weight* $w_1,w_2,\dots,w_n$. Alongside these there is an extra constant $w_0$ known as a *bias*, which can be thought of as the weight on a constant input of $1$. Generally the bias is optional, but favored. The weights and bias are also known as *parameters*, and we'll be using the terms interchangeably. The products of inputs and weights are then summed together to get $\sum_{i=1}^nx_iw_i$ (plus $w_0$ if a bias is used), and finally that total is *activated*, which means it has an *activation function* applied to it. In the case of a vanilla perceptron, the activation function is a basic step function:

$$
\sigma:\mathbb{R}\to\mathbb{R}\\ 
\sigma(x)=0\text{ if }x\leq 0\text{ otherwise }\sigma(x)=1
$$

In practice, though, this step function rarely sees actual use. Instead, the sigmoid function is favored:
$$
\sigma:\mathbb{R}\to\mathbb{R}\\ 
\sigma(x)=\frac{1}{1+e^{-x}}
$$ 

It serves a very similar purpose, but unlike the step function, it's smooth everywhere. This means that it's differentiable, which will be key when we train the model.
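To make the comparison concrete, here is a minimal sketch of both activations in plain Python. The function names `step` and `sigmoid` are our own choices for illustration, and the two-branch form of `sigmoid` is just one common way to avoid overflow for very negative inputs.

```python
import math


def step(x):
    """Step activation: 0 if x <= 0, otherwise 1."""
    return 0.0 if x <= 0 else 1.0


def sigmoid(x):
    """Sigmoid activation 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)


# Compare the two activations on a few sample points.
for x in (-4.0, -0.5, 0.0, 0.5, 4.0):
    print(f"x={x:+.1f}  step={step(x):.0f}  sigmoid={sigmoid(x):.3f}")
```

Both functions squash their input into the range from 0 to 1, but only the sigmoid changes gradually as `x` moves through 0.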


So let's formalize what our perceptron does mathematically. Let's start without a bias. We can consider our inputs as a vector $\vec x\in\mathbb{R}^n$ where $\vec x=[x_1,x_2,\dots,x_n]^T$, and similarly our weights as a vector $\vec w\in\mathbb{R}^n$ with $\vec w=[w_1,w_2,\dots,w_n]^T$. Then our weighted sum is $\sum_{i=1}^nx_iw_i=\vec x\cdot\vec w= \vec w^T\vec x$. If we include our bias, the only change to this equation is that we define $\vec x=[1,x_1,x_2,\dots,x_n]^T$ and $\vec w=[w_0,w_1,w_2,\dots,w_n]^T$, both now in $\mathbb{R}^{n+1}$, so that $\vec w^T\vec x=w_0+\sum_{i=1}^nx_iw_i$. This definition becomes especially convenient later when building up large Neural Networks.

We can consider our perceptron as a function of both the inputs and the weights:
$$
F: \mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}\\
F(\vec x,\vec w)=\sigma(\vec w^T\vec x)
$$
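As a rough sketch, the function $F$ can be written in a few lines of plain Python. The helper names (`perceptron`, `with_bias`) and the example numbers are made up for illustration; the bias is handled by prepending a constant $1$ to the input, as described above.

```python
import math


def sigmoid(z):
    """Sigmoid activation; fine for moderate z."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(x, w):
    """F(x, w) = sigmoid(w^T x) for equal-length lists of floats."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(weighted_sum)


def with_bias(x):
    """Prepend the constant 1 so that w[0] plays the role of the bias w_0."""
    return [1.0] + list(x)


features = [2.0, -1.0]        # hypothetical inputs x_1, x_2
weights = [0.5, 1.0, 2.0]     # hypothetical [w_0, w_1, w_2]
output = perceptron(with_bias(features), weights)
print(output)
```

Here the weighted sum is $0.5 + 2.0 - 2.0 = 0.5$, so the output is simply $\sigma(0.5)$.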

The real question is now: how do we figure out our weights? Well, we start by defining some loss function. For simplicity, we'll be using an $L^2$ loss function. For every input $\vec x_i$ there is an accompanying value known as a *label* or *target*, which is what we hope our model will output, written as $y_i\in\mathbb{R}$. Then the loss for a particular input is
$$
L:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}\\
L(\vec x_i, \vec w)=(F(\vec x_i,\vec w)-y_i)^2
$$
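A small sketch of this loss in plain Python, assuming the sigmoid perceptron from above. The function name `l2_loss` and the sample values are our own.

```python
import math


def perceptron(x, w):
    """F(x, w) = sigmoid(w^T x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def l2_loss(x, w, y):
    """Squared difference between the model output and the target y."""
    return (perceptron(x, w) - y) ** 2


x = [1.0, 2.0, -1.0]     # hypothetical input, leading 1 for the bias
w = [0.0, 0.0, 0.0]      # all-zero weights give an output of exactly 0.5
print(l2_loss(x, w, 1.0))  # (0.5 - 1)^2 = 0.25
print(l2_loss(x, w, 0.5))  # output equals the target, so the loss is 0.0
```

The loss is zero only when the output matches the target exactly, and it grows as the two move apart.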

This is essentially the squared *distance* between the model output $\hat y_i=F(\vec x_i,\vec w)$ and the target $y_i$. We want to minimize the distance between those two values, so we can apply gradient descent! In order to do that, first we must figure out what the gradient is. As a reminder, what we want is the Jacobian of $L$ with respect to $\vec w$, written as $\mathcal{D}L$. Note we'll be omitting the arguments to the functions, since those won't change until the last couple of steps. Let's work that out.

$$
\mathcal{D}L=\mathcal{D}[(F-y_i)^2]
\\=2(F-y_i)*\mathcal{D}[F-y_i]
\\=2(F-y_i)*(\mathcal{D}[F]-\mathcal{D}[y_i])
\\=2(F-y_i)*\mathcal{D}[F]
$$

So now in particular, we'll focus on the $\mathcal{D}[F]$ term. 

**Note**: $\vec w^T\vec x=\vec x^T\vec w$ 
$$
\mathcal{D}[F]=\mathcal{D}[\sigma(\vec x^T\vec w)]
\\=(\mathcal{D}[\sigma])(\vec x^T\vec w)\,\mathcal{D}[\vec x^T\vec w]
\\=\sigma(\vec x^T\vec w)(1-\sigma(\vec x^T\vec w))\vec x^T
$$
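The last step relies on the derivative of the sigmoid. For reference, it can be worked out directly from the definition, writing $z$ as a placeholder (our notation) for the scalar $\vec x^T\vec w$:

$$
\mathcal{D}[\sigma](z)=\frac{d}{dz}\left(1+e^{-z}\right)^{-1}
\\=\frac{e^{-z}}{(1+e^{-z})^2}
\\=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}
\\=\sigma(z)(1-\sigma(z))
$$

The final equality uses $1-\sigma(z)=\frac{e^{-z}}{1+e^{-z}}$. The remaining factor is $\mathcal{D}[\vec x^T\vec w]=\vec x^T$, since $\vec x^T\vec w$ is linear in $\vec w$.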

Plugging this back into our equation for $\mathcal{D}[L]$ gives
$$
\mathcal{D}[L]=2(F-y_i)*\mathcal{D}[F]=2(F-y_i)*\sigma(\vec x^T\vec w)(1-\sigma(\vec x^T\vec w))\vec x^T
\\=2(\sigma(\vec x^T\vec w)-y_i)*\sigma(\vec x^T\vec w)(1-\sigma(\vec x^T\vec w))\vec x^T
$$
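One way to gain confidence in a hand-derived gradient is to compare it against a finite-difference approximation. The sketch below does this in plain Python; all function names and the sample values are our own, and the numerical gradient is only a sanity check, not something you would use for training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def loss(x, w, y):
    """L(x, w) = (sigmoid(x^T w) - y)^2."""
    return (sigmoid(dot(x, w)) - y) ** 2


def loss_gradient(x, w, y):
    """Analytic D[L] w.r.t. w: 2 (s - y) s (1 - s) x^T, with s = sigmoid(x^T w)."""
    s = sigmoid(dot(x, w))
    scale = 2.0 * (s - y) * s * (1.0 - s)
    return [scale * xi for xi in x]


def numeric_gradient(x, w, y, h=1e-6):
    """Central finite differences, one weight at a time."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_plus[j] += h
        w_minus = list(w)
        w_minus[j] -= h
        grad.append((loss(x, w_plus, y) - loss(x, w_minus, y)) / (2.0 * h))
    return grad


x = [1.0, 2.0, -1.0]      # hypothetical input, leading 1 for the bias
w = [0.1, -0.3, 0.5]      # hypothetical weights
y = 1.0                   # hypothetical target

analytic = loss_gradient(x, w, y)
numeric = numeric_gradient(x, w, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
print("largest difference:", max_diff)
```

If the derivation is right, the two gradients should agree to several decimal places.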

Let's take a step back. Remember, we wanted to figure out *how much to change* $\vec w$ in order to minimize $L$. Now that we know $\mathcal{D}[L]$, we know the direction of *greatest increase*, so to minimize $L$, we only need to step in the opposite direction, $-\mathcal{D}[L]$. So we know the direction, but how large should our step be? This step size is known as the *learning rate*. The smaller the step, the more carefully the algorithm will home in on its closest local minimum. This leads to slower, less noisy training. Setting the learning rate too low will lead to the model spending far too much time training, and likely getting stuck in the first local minimum it finds. In practice, this can lead to very poor models. A larger step size leads to faster, noisier training. This noise can *knock* the model out of a local minimum, causing it to optimize in other directions. Setting the learning rate too high can lead to unstable, non-reproducible training that either diverges or fails to converge to any minimum at all.

Unfortunately, there's no golden rule for setting the learning rate. Repeated experiments are really the only way to tell what works. There are some heuristics, however. The larger the batch size, the larger you generally want your learning rate to be. If you don't intend to train the model for a long time, then larger steps are encouraged. It generally depends on the exact problem and circumstances of your model.

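For now, let's say that our learning rate is $\epsilon=10^{-7}$. Then in order to update our weights, we will set $\vec w_{new}=\vec w_{old}-\epsilon\mathcal{D}[L]^T$, where the transpose turns the row vector $\mathcal{D}[L]$ into a column vector like $\vec w$. This learning rate seems very small, but generally this process will be repeated millions to billions of times.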
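The update rule can be sketched as a short training loop in plain Python. This is only an illustration under simplifying assumptions: it trains on a single made-up example, and it uses a much larger learning rate ($0.5$) than the $10^{-7}$ above so that the effect is visible within a couple hundred steps. The function names are our own.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def loss(x, w, y):
    return (sigmoid(dot(x, w)) - y) ** 2


def gradient_step(x, w, y, learning_rate):
    """One update: w_new = w_old - learning_rate * D[L]^T."""
    s = sigmoid(dot(x, w))
    scale = 2.0 * (s - y) * s * (1.0 - s)
    return [wi - learning_rate * scale * xi for wi, xi in zip(w, x)]


def train(x, y, w, learning_rate, steps):
    """Repeat the update a fixed number of times on one example."""
    for _ in range(steps):
        w = gradient_step(x, w, y, learning_rate)
    return w


x = [1.0, 2.0, -1.0]      # hypothetical input, leading 1 for the bias
y = 1.0                   # hypothetical target
w_start = [0.0, 0.0, 0.0]

initial_loss = loss(x, w_start, y)
w_trained = train(x, y, w_start, learning_rate=0.5, steps=200)
final_loss = loss(x, w_trained, y)
print("loss before:", initial_loss)
print("loss after: ", final_loss)
```

Rerunning with a smaller `learning_rate` shows the trade-off discussed above: the loss still goes down, but far more steps are needed to get as close to the target.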

So, why even bother with perceptrons? Well, they're simple. Really, it's only 3 operations: multiplication, addition, and a simple thresholding using $\sigma$. If you think about it, 

## Backpropagation

Backpropagation (backprop for short (bkprp fr shrt (bp f s))) is the heart of deep learning.

### Jacobians




### Chain Rule


### Computational Graph
