# What is deep learning?

Everything in the universe can be thought of as a function that maps inputs to outputs. Some functions are well understood (how long it takes the Earth to orbit the Sun) while others are far too complex for humans to write down by hand (the English language). This is where deep learning comes in. Given inputs and the desired outputs, a model can learn to approximate an unknown function. Let's look at the mechanics of how this is done.

***
# The main deep learning topics we're covering

- Models
- Activation Functions
- Loss Functions
- Backpropagation
- Optimizers

***
# Models (Neural Network)

### Perceptron
A perceptron is the simplest part of a neural network. It takes a weighted sum of n inputs, adds a bias, and passes the result through a nonlinear activation function f.

$$
y = f\left( \sum_{i=1}^{n} w_i x_i + b \right) = f(\mathbf{w} \cdot \mathbf{x} + b)
$$

A single perceptron can only separate two linearly separable classes. With a step activation its output is either 1 or 0, and the boundary between the two outputs, $\mathbf{w} \cdot \mathbf{x} + b = 0$, is a straight line (a hyperplane in higher dimensions).
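As a rough illustration of the formula above, here is a minimal sketch of a perceptron with a step activation. The weights and bias are hand-picked for this example (they are not learned) so that the perceptron behaves like a logical AND, which is a linearly separable problem. The names `step`, `perceptron`, `w_and`, and `b_and` are made up for this sketch.

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def perceptron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through the step activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return step(z)


# Hand-picked (not learned) parameters that implement logical AND.
w_and, b_and = [1.0, 1.0], -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w_and, b_and))
```

Only the input `(1, 1)` lands on the positive side of the line `x1 + x2 - 1.5 = 0`, so it is the only one classified as 1.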

<img src="../images/perceptron.png" 
        alt="Picture" 
        style="display: block; margin: 0 auto" />

### Multi Layer Perceptron (MLP)
What if we stack multiple perceptrons together? We get something called a feedforward network, or multilayer perceptron (MLP). An MLP can model much more complicated functions than a single perceptron. The outputs of each layer are fed as inputs to the next layer. In an MLP, the individual perceptrons are called neurons. The layers between the input and output are called hidden layers.
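To make the idea of stacking concrete, here is a small sketch of a two-layer network that computes XOR, a function no single perceptron can represent because its classes are not linearly separable. The weights are hand-picked for illustration rather than trained, and the helper names `layer` and `xor_mlp` are assumptions of this sketch.

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` belongs to one neuron."""
    return [
        step(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]


def xor_mlp(x):
    # Hidden layer: the first neuron acts like OR, the second like AND.
    hidden = layer(x, weights=[[1.0, 1.0], [1.0, 1.0]], biases=[-0.5, -1.5])
    # Output neuron fires when OR is on and AND is off, which is XOR.
    output = layer(hidden, weights=[[1.0, -1.0]], biases=[-0.5])
    return output[0]


for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_mlp(x))
```

The hidden layer re-describes the input in a way that makes the final decision linearly separable, which is the basic reason extra layers add expressive power.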

<img src="../images/mlp.png" 
        alt="Picture" 
        style="display: block; margin: 0 auto" />

***
# Activation Functions

Activation functions are nonlinear functions applied to each neuron's weighted sum. Without them, a stack of layers would collapse into a single linear equation; with them, an MLP can model nonlinear functions. They also control the output range of a neuron. When training a model, activation functions must be differentiable (at least almost everywhere) so gradients can be computed during backpropagation.

### Sigmoid

The sigmoid activation function has an output range of (0, 1). Because its output can be read as a probability, it is commonly used in the output layer for binary classification. It suffers from the vanishing gradient problem: for inputs far from 0 the curve flattens out and the gradient becomes very small.

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
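The sketch below evaluates the sigmoid and its derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, at a few points to show how quickly the gradient shrinks away from 0. The two-branch form is just one common way to avoid overflow for very negative inputs; the function names are chosen for this example.

```python
import math


def sigmoid(x):
    """Sigmoid, written in two branches so math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 when x = 0 and is already tiny by x = 10, which is the vanishing gradient issue in miniature.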

<img src="../images/sigmoid.png" 
        alt="Picture" 
        style="display: block; margin: 0 auto" />

### Tanh

The tanh activation function is similar to sigmoid, but its output range is (-1, 1). It is generally a better choice than sigmoid for hidden layers because its outputs are centered around 0, but it still suffers from vanishing gradients.

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$
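For comparison with sigmoid, this short sketch prints tanh and its derivative, $1 - \tanh^2(x)$. It relies on `math.tanh` from the Python standard library; `tanh_grad` is a name introduced here.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in [0.0, 2.0, 5.0]:
    print(f"x={x:4.1f}  tanh={math.tanh(x):.5f}  gradient={tanh_grad(x):.5f}")
```

The gradient is 1 at x = 0, larger than sigmoid's maximum of 0.25, but it still decays toward 0 for large |x|.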

<img src="../images/tanh.png" 
        alt="Picture" 
        style="display: block; margin: 0 auto" />

### ReLU

ReLU is a special activation function. Its output range is [0, infinity). If you remember calculus, you may notice that this function is not differentiable at x = 0. To work around this, we simply pick a value for the derivative at that point, usually 0. ReLU is commonly used for hidden layers since it is cheap to compute. It suffers from the dead neuron problem: a neuron whose input is always negative outputs 0, receives a gradient of 0, and stops learning.

$$
\text{ReLU}(x) = \max(0, x)
$$
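A minimal sketch of ReLU and its derivative follows. The choice of 0 for the derivative at x = 0 is a convention assumed for this example.

```python
def relu(x):
    """ReLU: pass positive values through, clamp negatives to 0."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU; at x = 0 this sketch uses the value 0 by convention."""
    return 1.0 if x > 0 else 0.0


for x in [-2.0, 0.0, 3.0]:
    print(f"x={x:4.1f}  relu={relu(x):.1f}  gradient={relu_grad(x):.1f}")
```

Every negative input gets a gradient of exactly 0, which is how a neuron can end up "dead".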

<img src="../images/relu.png" 
        alt="Picture" 
        style="display: block; margin: 0 auto" />

### Leaky ReLU

Leaky ReLU is a modification of ReLU that addresses the dead neuron problem. The output range is (-infinity, infinity). Rather than mapping all values <0 to 0, it uses a line with a very small positive slope α (for example 0.01), so negative inputs still produce a small, nonzero gradient.

$$
\text{LeakyReLU}(x) =
\begin{cases}
x, & x \ge 0 \\
\alpha x, & x < 0
\end{cases}
$$
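The piecewise definition above translates almost directly into code. In this sketch the default `alpha=0.01` is a commonly used value, chosen here as an assumption rather than something the formula requires.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x >= 0, a small slope alpha for x < 0."""
    return x if x >= 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of leaky ReLU: 1 for x >= 0, alpha otherwise."""
    return 1.0 if x >= 0 else alpha


for x in [-2.0, 0.0, 3.0]:
    print(f"x={x:4.1f}  leaky_relu={leaky_relu(x):.2f}  gradient={leaky_relu_grad(x):.2f}")
```

Because the gradient for negative inputs is `alpha` instead of 0, the neuron can still receive weight updates.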

<img src="../images/leaky_relu.png" 
        alt="Picture" 
        style="display: block; margin: 0 auto" />

### Softmax

Unlike the other activation functions here, softmax is normally used only in the output layer rather than in hidden layers. It operates on a whole vector of inputs at once and turns it into a probability distribution, making it a natural fit for multi-class classification. Each output lies in (0, 1) and the outputs sum to 1.

$$
\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$ 
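Here is a small sketch of softmax over a list of scores. Subtracting the maximum score before exponentiating is a standard trick to avoid overflow; it does not change the result because the same factor cancels in the numerator and denominator. The input values are arbitrary.

```python
import math


def softmax(z):
    """Softmax of a list of scores, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 4) for p in probs], "sum =", round(sum(probs), 4))
```

The largest score gets the largest probability, and the ordering of the inputs is preserved in the outputs.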

*** 
# Loss Functions

Loss functions tell a model how far its predictions are from the expected values. During training, the optimizer updates a model's weights to minimize the loss. Different tasks use different loss functions.

Loss functions should be:
- Differentiable
- Suited for the training task

### Mean Squared Error (MSE): Regression

The most common loss for regression tasks. It penalizes larger errors more heavily than smaller errors due to squaring the error.

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2
$$
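A direct translation of the MSE formula into code might look like the sketch below. The sample predictions and targets are made up for illustration.

```python
def mse(y_pred, y_true):
    """Mean squared error between two equal-length lists of numbers."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mse(y_pred, y_true))
```

The single error of size 1 contributes 1.0 to the sum, while the two errors of size 0.5 contribute only 0.25 each, which shows how squaring weights larger errors more heavily.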

### Mean Absolute Error (MAE): Regression

Another common loss for regression tasks. It penalizes each error in proportion to its size, so it is less sensitive to outliers than MSE.

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^n \left| \hat{y}_i - y_i \right|
$$
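To see how MAE and MSE react differently to an outlier, the sketch below computes both on made-up data where three predictions are off by 1 and one is off by 10. Both helper functions are written just for this comparison.

```python
def mae(y_pred, y_true):
    """Mean absolute error between two equal-length lists of numbers."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def mse(y_pred, y_true):
    """Mean squared error, included only for comparison."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 1.0, 10.0]  # the last prediction is an outlier
print("MAE:", mae(y_pred, y_true))
print("MSE:", mse(y_pred, y_true))
```

The outlier accounts for most of the MSE, while in the MAE it counts only ten times as much as each of the small errors.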

### Binary Cross-Entropy (BCE): Binary Classification

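A loss function used for classification between two classes. Here $y_i$ is the true label (0 or 1) and $\hat{y}_i$ is the predicted probability of class 1, typically the output of a sigmoid. Confident wrong predictions are penalized heavily.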

$$
\text{BCE} = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$
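The following sketch implements the BCE formula for lists of predicted probabilities and 0/1 labels. The `eps` clipping is an assumption added here to avoid taking `log(0)`; it is not part of the formula itself.

```python
import math


def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy averaged over all samples."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)


y_true = [1, 0]
good = binary_cross_entropy([0.9, 0.1], y_true)  # confident and correct
bad = binary_cross_entropy([0.1, 0.9], y_true)   # confident and wrong
print(f"good={good:.4f}  bad={bad:.4f}")
```

The confidently wrong predictions produce a much larger loss than the confidently correct ones.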

### Categorical Cross-Entropy: Multi-Class Classification

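A loss function for classification across K classes. Here $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise (a one-hot label), and $\hat{y}_{i,c}$ is the predicted probability for that class, typically the output of a softmax.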

$$
\text{CCE} = -\frac{1}{n} \sum_{i=1}^n \sum_{c=1}^K y_{i,c} \log(\hat{y}_{i,c})
$$
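As a sketch of the CCE formula, the function below takes one-hot labels and predicted probability vectors. As with the binary case, the `eps` floor is an assumption added to keep `log` away from 0, and the sample data is made up.

```python
import math


def categorical_cross_entropy(y_pred, y_true, eps=1e-12):
    """Categorical cross-entropy averaged over samples, with one-hot labels."""
    total = 0.0
    for probs, target in zip(y_pred, y_true):
        total += sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))
    return -total / len(y_true)


y_true = [[1, 0, 0], [0, 1, 0]]              # one-hot labels
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # predicted probabilities
loss = categorical_cross_entropy(y_pred, y_true)
print(f"{loss:.4f}")
```

Because the labels are one-hot, only the predicted probability of the correct class contributes to each sample's loss.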

***
# Backpropagation

Backpropagation is the algorithm used to train neural networks. It computes the gradient of the loss function with respect to each weight in the model by applying the chain rule backward, from the output layer toward the input layer. These gradients are then used by an optimizer to update the weights and minimize the loss.
In short, backpropagation computes the partial derivative of the loss with respect to every weight and bias in the model.

$$
\text{partial derivative} = \frac{\partial L}{\partial w_{ij}}
$$
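To show the chain rule at work on the smallest possible case, here is a sketch for a single sigmoid neuron with a squared-error loss. The setup (one input, one weight, one bias, and the specific numbers) is an assumption made for illustration. The analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron on one sample."""
    y_hat = sigmoid(w * x + b)
    return (y_hat - y) ** 2


def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw (similarly for b)."""
    z = w * x + b
    y_hat = sigmoid(z)                 # forward pass
    dL_dyhat = 2.0 * (y_hat - y)       # derivative of the squared error
    dyhat_dz = y_hat * (1.0 - y_hat)   # derivative of the sigmoid
    dL_dz = dL_dyhat * dyhat_dz        # backward pass up to z
    return dL_dz * x, dL_dz            # dz/dw = x, dz/db = 1


w, b, x, y = 0.5, -0.2, 1.5, 1.0
dw, db = gradients(w, b, x, y)

# Finite-difference check of the same derivatives.
h = 1e-6
dw_numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
db_numeric = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
print(dw, dw_numeric)
print(db, db_numeric)
```

In a full network the same pattern repeats layer by layer: each layer reuses the gradient already computed for the layer after it.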

<img src="../images/backpropagation.png" 
        alt="Picture" 
        style="display: block; margin: 0 auto" />

***
# Optimization

Optimization is the process of adjusting a neural network's weights and biases to minimize the loss function. The loss function can be thought of as a high-dimensional surface. Ideally we would find its global minimum; in practice we settle for a point where the loss is low enough. Using the gradients found by backpropagation, the optimizer decides how to adjust each parameter.

<img src="../images/gradient_descent.png" 
        alt="Picture" 
        height=400
        width=600
        style="display: block; margin: 0 auto" />

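Adam is one of the most widely used optimizers and the only one we will talk about here. It combines momentum (a running average of past gradients) with adaptive, per-parameter learning rates (scaled by a running average of past squared gradients).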
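As a rough sketch of how Adam works, the code below applies the standard Adam update rule to a single parameter and uses it to minimize the toy function $f(w) = (w - 3)^2$. The toy function, the learning rate of 0.1, the step count, and the name `adam_minimize` are assumptions for this example; `beta1`, `beta2`, and `eps` use the commonly cited default values.

```python
import math


def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-parameter function with Adam, given its gradient function."""
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # momentum term
        v = beta2 * v + (1 - beta2) * g * g    # adaptive scaling term
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Gradient of f(w) = (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

Dividing by the square root of `v_hat` means the step size depends on the recent gradient scale, which is what "adaptive learning rate" refers to.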