# Formulas

### Notebooks
[01_data_visualization](#01_data_visualization.ipynb)  
[02_formulas](#02_formulas.ipynb)  
[03_demo](03_demo.ipynb)  

### This Notebook's Contents
[Layers](#Layers)  
[Scoring Function](#Scoring-Function)  
[Convolution Output Size](#Convolution-Output-Size)  
[Padding](#Padding)  
[ReLU](#ReLU)  
[Softmax Classifier](#Softmax-Classifier)  
[Categorical Cross-Entropy Loss](#Categorical-Cross-Entropy-Loss)  
[Backpropagation](#Backpropagation)  
[Stochastic Gradient Descent](#Stochastic-Gradient-Descent)  

## Layers

In its simplest form, a CNN classifier for CIFAR-10 might look like this:
- ConvolutionLayer: computes the output of neurons connected to local regions in the input; changes the output size.
    - Has parameters: weights and biases
    - Has hyperparameters: num_filters (K), filter_size (F), stride (S), zero-padding (P)
- ReluLayer: applies element-wise activation function max(0, x), no change in output size.
- PoolingLayer: downsamples along width and height, changes output size.
    - Has hyperparameters: filter_size (F), stride (S)
- DenseLayer: computes class scores, one score per CIFAR-10 class.
    - Has parameters: weights and biases
    - Has hyperparameter: units
- SoftMaxLayer: applies exponentiation and normalization to the DenseLayer output scores.
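
To make the size changes in the list above concrete, here is a minimal sketch that traces the output shape through each layer for one 32 x 32 x 3 CIFAR-10 image. The hyperparameter values (16 filters of size 3, 2 x 2 pooling, 10 units) and the helper names `conv_out` and `pool_out` are illustrative assumptions, not values taken from the library, and the pooling size rule is the usual unpadded one.

```python
def conv_out(n, f, p, s):
    """Spatial output size of a convolution (floor of (n + 2p - f) / s, plus 1)."""
    return (n + 2 * p - f) // s + 1


def pool_out(n, f, s):
    """Spatial output size of an unpadded pooling window (assumed rule)."""
    return (n - f) // s + 1


# Hypothetical hyperparameters, chosen only for illustration.
K, F, S, P = 16, 3, 1, 1       # ConvolutionLayer
POOL_F, POOL_S = 2, 2          # PoolingLayer
UNITS = 10                     # DenseLayer: one unit per CIFAR-10 class

n, depth = 32, 3               # CIFAR-10 images are 32 x 32 x 3
shapes = [("input", (n, n, depth))]

n, depth = conv_out(n, F, P, S), K
shapes.append(("ConvolutionLayer", (n, n, depth)))
shapes.append(("ReluLayer", (n, n, depth)))        # element-wise, same shape

n = pool_out(n, POOL_F, POOL_S)
shapes.append(("PoolingLayer", (n, n, depth)))

shapes.append(("DenseLayer", (UNITS,)))            # flattened input -> scores
shapes.append(("SoftMaxLayer", (UNITS,)))          # scores -> probabilities

for name, shape in shapes:
    print(f"{name:18s} {shape}")
```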

The formulas below are not exhaustive. They cover the core calculations used in the library. Note that when vectorized, a variable $w$ is written as $W$.

## Scoring Function

$$Z = WX + B$$

Dot-product layers in a neural network, such as convolution and dense layers, take some input $X$, multiply it by weights $W$, and add biases $B$ to produce output scores $Z$.

Deeper in the network, the input to a layer is the activation output of the previous layer. The output values $Z$ of a given layer $L$ can be expressed as:

$$Z^{[L]} = W^{[L]}A^{[L-1]} + B^{[L]}$$
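
As a rough illustration of the scoring function for a dense layer, the sketch below computes $Z = WA + B$ with NumPy. The function name `dense_forward` and the shape convention (one example per column) are assumptions made for this example, not the library's API.

```python
import numpy as np


def dense_forward(W, A_prev, B):
    """Compute Z = W @ A_prev + B.

    Assumed shapes: W is (units, n_in), A_prev is (n_in, batch),
    B is (units, 1) and is broadcast across the batch columns.
    """
    return W @ A_prev + B


rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4)) * 0.01   # 10 units, 4 input features
A_prev = rng.standard_normal((4, 3))      # activations for a batch of 3
B = np.zeros((10, 1))

Z = dense_forward(W, A_prev, B)
print(Z.shape)  # (10, 3): one column of 10 scores per example
```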

## Convolution Output Size
Given an $n \times n$ image, $f \times f$ filter, padding $p$, and stride $s$, the output volume of a convolution can be expressed as:

$$\left\lfloor \frac{n+2p-f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n+2p-f}{s} + 1 \right\rfloor \times \text{filters}$$

The floor brackets round down when the stride does not divide $n+2p-f$ evenly. The depth of the output volume is equal to the number of filters used in the convolution.
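
The output-size formula can be sketched as a small helper. The name `conv_output_shape` is hypothetical, and integer floor division stands in for the rounding-down step.

```python
def conv_output_shape(n, f, p, s, num_filters):
    """Output volume (height, width, depth) of a convolution on an n x n input."""
    size = (n + 2 * p - f) // s + 1   # floor division rounds down
    return (size, size, num_filters)


# 32 x 32 input, 5 x 5 filter, no padding, stride 1, 6 filters
print(conv_output_shape(32, 5, 0, 1, 6))    # (28, 28, 6)

# 32 x 32 input, 3 x 3 filter, padding 1, stride 1, 16 filters
print(conv_output_shape(32, 3, 1, 1, 16))   # (32, 32, 16)
```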

## Padding
Padding that keeps the output the same spatial size as the input (for stride $s = 1$ and an odd filter size $f$):

$$p = \frac{f-1}{2}$$

This is commonly used to preserve spatial size through convolutions.
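
A short sketch that checks the padding formula against the convolution output-size formula. Both helper names are hypothetical, and the check assumes stride 1 and an odd filter size.

```python
def same_padding(f):
    """Padding that preserves spatial size for stride 1 and odd filter size f."""
    if f % 2 == 0:
        raise ValueError("p = (f - 1) / 2 is only an integer for odd f")
    return (f - 1) // 2


def conv_size(n, f, p, s=1):
    """Spatial output size of a convolution."""
    return (n + 2 * p - f) // s + 1


for f in (1, 3, 5, 7):
    p = same_padding(f)
    print(f"f={f}, p={p}, output size for n=32: {conv_size(32, f, p)}")
```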

## ReLU

Rectified Linear Units, or ReLU, are the standard activation function in most neural networks, including CNNs:

$$f(x) = \max(0,x)$$

ReLU does not saturate in the positive region and is computationally efficient. In practice it tends to converge much faster than sigmoid/tanh (e.g. 6x), but its output is not zero-centered.
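
A minimal NumPy sketch of ReLU and its gradient. The function names are assumptions for illustration; the backward rule passes the upstream gradient through wherever the input was positive and blocks it elsewhere.

```python
import numpy as np


def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)


def relu_backward(d_out, x):
    """Gradient of ReLU: upstream gradient where x > 0, zero elsewhere."""
    return d_out * (x > 0)


x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))                            # [0.  0.  0.  1.5]
print(relu_backward(np.ones_like(x), x))  # [0. 0. 0. 1.]
```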

## Softmax Classifier

Once you get scores $Z$ out of the last Dense layer, it's up to you how to interpret them. The standard way to interpret scores in classification problems with more than two classes is Softmax.

Softmax interprets $Z$ scores as the unnormalized log probabilities of the classes. To convert to probabilities, exponentiate and normalize the scores. The probability for a class $k$ with score $s_k$ can be expressed as:

$$P(Y = k|X = x_i) = \frac{e^{s_k}}{\sum_{j} e^{s_j}}$$

You exponentiate the score for one class and divide by the sum of exponentiated scores for all classes.
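
The softmax computation can be sketched in NumPy as follows. Subtracting the per-example maximum before exponentiating is a common numerical-stability step that is not part of the formula above; it leaves the result unchanged because the shift cancels in the ratio. The function name and the one-example-per-column layout are assumptions.

```python
import numpy as np


def softmax(Z):
    """Column-wise softmax for scores Z of shape (num_classes, batch)."""
    shifted = Z - Z.max(axis=0, keepdims=True)   # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=0, keepdims=True)


# Two examples with three class scores each; the second column would
# overflow a naive exp() without the shift.
Z = np.array([[1.0, 1000.0],
              [2.0, 1000.0],
              [3.0, 1000.0]])
P = softmax(Z)
print(P.sum(axis=0))  # each column sums to 1
```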

## Categorical Cross-Entropy Loss

To optimize the network, we need a loss or cost function to minimize. If we want to maximize the likelihood of the correct class, then we want to minimize the negative log likelihood of the correct class:

$$L_i = -\log P(Y = y_i|X = x_i)$$

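The log is used because it is monotonic, so maximizing the log likelihood also maximizes the likelihood, and because it turns products of probabilities into sums, which are easier to differentiate and numerically more stable. Substituting in the Softmax formula, loss for a single example can be expressed as: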

$$L_i = -\log\frac{e^{s_{y_i}}}{\sum_{j} e^{s_j}}$$

Loss over the entire training set can be expressed as:

$$L = \frac{1}{N}\sum^N_{i=1}L_i + R(W)$$

The regularization term $R(W)$ (e.g. L2, L1) is a function of the weights only, not the data.
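
A sketch of the full loss in NumPy, assuming softmax probabilities are already available with one example per column. The L2 form of $R(W)$ and the names `cross_entropy_loss` and `reg_strength` are assumptions chosen for illustration.

```python
import numpy as np


def cross_entropy_loss(P, y, weights=(), reg_strength=0.0):
    """Mean negative log likelihood plus an (assumed) L2 penalty.

    P: softmax probabilities, shape (num_classes, N)
    y: integer class labels, shape (N,)
    weights: iterable of weight arrays that enter R(W)
    """
    n = y.shape[0]
    correct_class_probs = P[y, np.arange(n)]      # P(Y = y_i | X = x_i)
    data_loss = -np.log(correct_class_probs).mean()
    reg_loss = reg_strength * sum(np.sum(W ** 2) for W in weights)
    return data_loss + reg_loss


P = np.array([[0.7, 0.2],
              [0.2, 0.5],
              [0.1, 0.3]])
y = np.array([0, 1])
print(cross_entropy_loss(P, y))
```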

**Sanity check when kicking off classifier training**

At the beginning of training, your weights will be small, so the scores of all classes should be close to 0. Exponentiating 0 gives 1, and normalizing gives (1/num_classes). Thus, the *loss* when kicking things off should be about $-\log\frac{1}{\text{NumOfClasses}}$, which is $\log 10 \approx 2.30$ for the 10 classes of CIFAR-10. If it isn't, then something isn't set up properly.
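
The sanity check can be reproduced numerically. This sketch assumes all-zero scores as a stand-in for the near-zero scores of a freshly initialized network and ignores any regularization term.

```python
import numpy as np

num_classes = 10                       # CIFAR-10
scores = np.zeros(num_classes)         # stand-in for near-zero initial scores

probs = np.exp(scores) / np.exp(scores).sum()   # every class gets 1/10
initial_loss = -np.log(probs[0])                # same for any correct class

print(round(initial_loss, 4))  # 2.3026
```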

## Backpropagation
There is a calculus derivation of the loss gradient that won't be covered here. The result is that the key equation needed to initialize backprop, where $\hat Y$ is a prediction vector and $Y$ is a true label vector, is:

$$dZ^{[L]} = \hat Y - Y$$

$dZ^{[L]}$ is the partial derivative of the cost function with respect to the outputs of the last layer:

$$\frac{\partial J}{\partial Z^{[L]}}$$
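
One way to build confidence in $dZ^{[L]} = \hat Y - Y$ without the derivation is a numerical gradient check. The sketch below compares the analytic expression to central finite differences for a single example with a one-hot label; the helper names and the example scores are assumptions.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def loss(z, y_onehot):
    """Cross-entropy loss for one example."""
    return -np.sum(y_onehot * np.log(softmax(z)))


z = np.array([2.0, -1.0, 0.5])        # scores from the last layer
y = np.array([0.0, 1.0, 0.0])         # one-hot true label

analytic = softmax(z) - y             # dZ = Y_hat - Y

# Central finite differences, one score at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

For a loss averaged over a batch of $N$ examples, the same check applies per column, with the averaged gradient scaled by $\frac{1}{N}$.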

## Stochastic Gradient Descent

Batch gradient descent is the simplest parameter update method. For every layer $l$ with weights $W$, biases $b$, and learning rate $\alpha$:

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]}$$

$dW$ and $db$ are the gradients of the cost with respect to $W$ and $b$. They are computed via the chain rule during backpropagation, using the gradient passed back from the layer that follows.
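
The update rule can be sketched as a single in-place step over a layer's parameters. Storing parameters and gradients in dictionaries keyed `"W"`/`"b"` and `"dW"`/`"db"` is an assumption made for this example, not the library's layout.

```python
import numpy as np


def gradient_step(params, grads, learning_rate):
    """Apply param = param - learning_rate * d_param, in place."""
    for name in params:
        params[name] -= learning_rate * grads["d" + name]
    return params


params = {"W": np.ones((2, 2)), "b": np.zeros((2, 1))}
grads = {"dW": np.full((2, 2), 0.5), "db": np.ones((2, 1))}

gradient_step(params, grads, learning_rate=0.1)
print(params["W"])   # every entry is 1 - 0.1 * 0.5 = 0.95
print(params["b"])   # every entry is 0 - 0.1 * 1.0 = -0.1
```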

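Strictly speaking, stochastic gradient descent updates on one training example at a time. In practice the name usually refers to mini-batch gradient descent, which executes gradient descent over smaller batches of training data. Common mini-batch sizes are 32, 64, 128, and 256.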
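
A minimal sketch of how a training set might be split into shuffled mini-batches for one epoch. The generator name `iterate_minibatches` and the one-example-per-row layout of `X` are assumptions for illustration.

```python
import numpy as np


def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs covering the data once."""
    n = X.shape[0]
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))        # 100 examples, 8 features
y = rng.integers(0, 10, size=100)        # 10 classes

batch_sizes = [xb.shape[0] for xb, _ in iterate_minibatches(X, y, 32, rng)]
print(batch_sizes)  # [32, 32, 32, 4]: the last batch holds the remainder
```

Each yielded batch would then go through a forward pass, a backward pass, and one parameter update.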