# Formulas

## Notebooks
[01_data_visualization](#01_data_visualization.ipynb)  
[02_formulas](#02_formulas.ipynb)  
[03_demo](03_demo.ipynb)  

## Layers

In its simplest form, a CNN classifier for CIFAR-10 might look like this:
- ConvolutionLayer: computes the output of neurons connected to local regions in the input, changes output size.
    - Has parameters: weights and biases
    - Has hyperparameters: num_filters (K), filter_size (F), stride (S), zero-padding (P)
- ReluLayer: applies element-wise activation function max(0, x), no change in output size.
- PoolingLayer: downsamples along width and height, changes output size.
    - Has hyperparameters: filter_size (F), stride (S)
- DenseLayer: computes class scores, one number per CIFAR-10 class.
    - Has parameters: weights and biases
    - Has hyperparameter: units
- SoftMaxLayer: applies exponentiation and normalization to the DenseLayer output scores.
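
The layer stack above can be traced shape by shape. This is only a sketch: every hyperparameter value below (K=16, F=5, S=1, P=2, a 2x2 pool with stride 2, 10 units) is a hypothetical choice for illustration, not a value taken from the notebooks, and `trace_shapes` is an invented helper name.

```python
def conv_out(n, f, s, p):
    """Spatial output size for input size n, filter f, stride s, padding p."""
    return (n + 2 * p - f) // s + 1

def trace_shapes(height=32, width=32, channels=3,
                 K=16, F=5, S=1, P=2, pool_F=2, pool_S=2, units=10):
    """Return (layer name, output shape) pairs for the simple CNN above."""
    shapes = [("input", (height, width, channels))]
    h, w = conv_out(height, F, S, P), conv_out(width, F, S, P)
    shapes.append(("ConvolutionLayer", (h, w, K)))   # depth becomes K
    shapes.append(("ReluLayer", (h, w, K)))          # element-wise, same shape
    h, w = conv_out(h, pool_F, pool_S, 0), conv_out(w, pool_F, pool_S, 0)
    shapes.append(("PoolingLayer", (h, w, K)))       # downsampled width/height
    shapes.append(("DenseLayer", (units,)))          # one score per class
    shapes.append(("SoftMaxLayer", (units,)))        # one probability per class
    return shapes

for name, shape in trace_shapes():
    print(f"{name:18s}{shape}")
```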

## Scoring Function

$$Z = WX + B$$

Dot product layers in a neural network, such as convolution and dense layers, take some input X, multiply it by weights W, and add biases B to produce output scores Z.

Deeper in the network, the input to a layer is the activation output of the previous layer. The output values Z of a given layer L can be expressed as:
$$Z^{[L]} = W^{[L]}A^{[L-1]} + B^{[L]}$$
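
As a minimal NumPy sketch of the scoring function for a dense layer, assuming a layout with one example per column (the function name `dense_forward` and the shapes are illustrative assumptions):

```python
import numpy as np

def dense_forward(W, A_prev, B):
    """Compute Z = W @ A_prev + B for one dense layer.

    Assumed shapes: W (units, n_in), A_prev (n_in, m), B (units, 1).
    B is broadcast across the m example columns, giving Z (units, m).
    """
    return W @ A_prev + B

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4)) * 0.01   # 10 class scores from 4 inputs
A_prev = rng.standard_normal((4, 3))      # 3 examples, one per column
B = np.zeros((10, 1))
Z = dense_forward(W, A_prev, B)
print(Z.shape)  # (10, 3)
```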

## Softmax Classifier

Once you get scores Z out of the last Dense layer, it's up to you how to interpret them. The standard choice for classification problems with more than two classes is Softmax.

Softmax interprets Z scores as the unnormalized log probabilities of the classes. To convert to probabilities, exponentiate and normalize the scores. The probability for a class *k* with score *s* can be expressed as:

$$P(Y = k|X = x_i) = \frac{e^{s_k}}{\sum_{j} e^{s_j}}$$

You exponentiate the scores for one class, and divide by the sum of exponentiated scores for all classes.
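
A minimal NumPy sketch of this computation, assuming scores are stored with one example per row. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; scores has shape (num_examples, num_classes)."""
    # Shift so the largest score in each row is 0; avoids overflow in exp.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

probs = softmax(np.array([[1.0, 2.0, 3.0],
                          [0.0, 0.0, 0.0]]))
print(probs.sum(axis=1))  # each row sums to 1
```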

## Loss

To optimize the network, we need a loss or cost function to minimize. If we want to maximize the likelihood of the correct class, then we want to minimize the negative log likelihood of the correct class:
$$L_i = -\log P(Y = y_i|X = x_i)$$

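We want the log likelihood of the correct class to be high (equivalently, we want its negative to be low), where the likelihood is the softmax function of your scores. The log is used rather than the raw probability because it is monotonic, so it does not change which weights are best, and it turns products of probabilities into sums, which are easier to differentiate and more stable numerically.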

If we just substitute in the probability formula from above, loss becomes:
$$L_i = -\log\frac{e^{s_{y_i}}}{\sum_{j} e^{s_j}}$$
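
The per-example loss can be sketched in NumPy as follows, assuming one row of scores per example and integer class labels. Working with log-probabilities directly (a log-sum-exp form) is an implementation choice made here for stability; the function name is illustrative.

```python
import numpy as np

def softmax_loss_per_example(scores, y):
    """Return L_i = -log(softmax(scores)[i, y_i]) for every row i."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    # log softmax = shifted score minus log of the summed exponentials
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(scores.shape[0]), y]

scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
y = np.array([0, 2])          # correct class index for each row
losses = softmax_loss_per_example(scores, y)
print(losses)
```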

### Sanity check when kicking off classifier training

At the beginning of training, your weights will be small, so the scores of all classes should be close to 0. Exponentiating 0 gives 1, and normalizing gives 1/num_classes for every class. Thus, the *loss* when kicking things off (ignoring regularization) should be about
$$-\log\frac{1}{\text{NumOfClasses}}$$
which is roughly 2.3 for the 10 classes of CIFAR-10. If it isn't, then something isn't set up properly.
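
A quick way to compute the expected starting loss, sketched here for CIFAR-10's 10 classes (variable names are illustrative):

```python
import numpy as np

num_classes = 10  # CIFAR-10 has 10 classes
expected_initial_loss = float(-np.log(1.0 / num_classes))
print(round(expected_initial_loss, 4))  # 2.3026

# All-zero scores give uniform probabilities, so the loss matches exactly.
scores = np.zeros(num_classes)
probs = np.exp(scores) / np.exp(scores).sum()
loss_at_zero_scores = float(-np.log(probs[0]))
```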

## Full Loss

$$L = \frac{1}{N}\sum^N_{i=1}L_i + R(W)$$

This is the loss over the entire training set of N examples: the average of the per-example losses plus a regularization term R(W). Regularization is only a function of the weights, not the data.
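
A minimal sketch of the full loss. The text does not say which regularizer R(W) is used, so L2 regularization with a hypothetical strength `reg_strength` is assumed here purely for illustration.

```python
import numpy as np

def full_loss(per_example_losses, weights, reg_strength):
    """L = mean(L_i) + R(W), with R(W) ASSUMED to be L2 regularization."""
    data_loss = np.mean(per_example_losses)
    # Sum of squared weights over every weight matrix in the network.
    reg_loss = reg_strength * sum(np.sum(W ** 2) for W in weights)
    return float(data_loss + reg_loss)

L = full_loss(np.array([1.0, 3.0]), [np.ones((2, 2))], reg_strength=0.5)
print(L)  # 4.0  (data loss 2.0 + regularization 0.5 * 4)
```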

## Backpropagation Step
The key equation you need to start backprop is:

$$dZ^{[L]} = \hat Y - Y$$

dZ is the partial derivative of the cost function J with respect to the scores Z of the last layer (this form holds for a softmax output combined with the negative log likelihood loss above):
$$\frac{\partial J}{\partial Z^{[L]}}$$
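
A small NumPy sketch of this starting step, assuming one example per column and one-hot encoded labels Y. The gradient here is per example; when the cost is averaged over m examples, a 1/m factor is typically applied afterwards when computing the weight gradients.

```python
import numpy as np

def softmax_columns(Z):
    """Column-wise softmax; Z has shape (num_classes, num_examples)."""
    shifted = Z - Z.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

Z = np.array([[2.0, 0.0],
              [1.0, 0.0],
              [0.1, 0.0]])       # 3 classes, 2 examples
labels = np.array([0, 2])        # correct class of each example
Y = np.eye(3)[:, labels]         # one-hot columns, shape (3, 2)
Y_hat = softmax_columns(Z)
dZ = Y_hat - Y                   # gradient that starts backprop
print(dZ)
```

The entry for the correct class is negative (its score should go up) and each column sums to zero.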

## Optimization
Imagine you are standing on a loss landscape, blindfolded but carrying an altimeter, and you are trying to get to the bottom of the valley. Using the altimeter readings to decide which way to step is the process of optimization.

#### Numerical Approximation of Gradients
- When you implement backprop, do gradient checking
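
Gradient checking can be sketched as a centered finite difference compared against the analytic gradient. The helper names, the step size `eps`, and the toy function below are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Centered-difference estimate of df/dx for a scalar-valued f."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old                      # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    """Largest element-wise relative difference between two gradients."""
    return np.max(np.abs(a - b) / np.maximum(1e-12, np.abs(a) + np.abs(b)))

x = np.array([[1.0, -2.0], [0.5, 3.0]])
analytic = 2 * x                          # known gradient of sum(x**2)
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
print(relative_error(analytic, numeric))  # should be tiny
```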

#### Mini-batch Gradient Descent
- Common mini-batch sizes are 32/64/128 examples
- Krizhevsky ILSVRC ConvNet used 256 examples
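
A minimal sketch of how mini-batches might be drawn each epoch, assuming the data fits in memory as NumPy arrays (the generator name and toy data are illustrative):

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(200, dtype=float).reshape(100, 2)   # 100 toy examples
y = np.arange(100)
batches = list(iterate_minibatches(X, y, batch_size=32, rng=rng))
print([len(yb) for _, yb in batches])  # [32, 32, 32, 4]
```

The last batch is smaller when the dataset size is not a multiple of the batch size.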

## Neural Network

(Before) Linear score function: $$f = Wx$$

(Now) 2-layer Neural Network: $$f = W_2 \max(0, W_1 x)$$

or 3-layer Neural Network: $$f = W_3 \max(0, W_2 \max(0, W_1 x))$$
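
The 2-layer form translates almost directly to NumPy. This sketch omits biases, as the formulas above do, and uses small hand-picked weights so the result is easy to verify by hand.

```python
import numpy as np

def two_layer_scores(x, W1, W2):
    """f = W2 @ max(0, W1 @ x), with biases omitted as in the formula."""
    hidden = np.maximum(0, W1 @ x)   # ReLU applied element-wise
    return W2 @ hidden

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 2.0]])
x = np.array([3.0, 1.0])
print(two_layer_scores(x, W1, W2))  # [2.]
```

Here `W1 @ x` is `[2, -2]`, the ReLU zeroes the negative entry, and `W2` maps `[2, 0]` to `2`.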

## Activation Functions

**Sigmoid**: $$\sigma(x) = \frac{1}{1 + e^{-x}}$$

**tanh**: $$\tanh(x)$$

**ReLU**: $$\max(0,x)$$
Does not saturate in the positive region. Very computationally efficient. Converges much faster than sigmoid/tanh in practice (e.g. 6x). Its output is not zero-centered, though.  

**Leaky ReLU**: $$\max(0.01x, x)$$
The small slope for negative inputs keeps a gradient flowing, so units will not "die".  

**Maxout**: $$\max(w_1^T x + b_1, w_2^T x + b_2)$$

**ELU**: $$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \le 0 \end{cases}$$
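
The activation functions in this section can be sketched in NumPy as below. The leaky slope and the ELU `alpha` are tunable parameters whose defaults here are assumptions, and the ELU uses its commonly cited definition (x for positive inputs, alpha * (e^x - 1) otherwise).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.01):
    # For 0 < slope < 1 this equals x when x > 0 and slope * x otherwise.
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    # exp is only evaluated on non-positive values to avoid overflow.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def maxout(x, w1, b1, w2, b2):
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

x = np.array([-2.0, 0.0, 2.0])
for fn in (sigmoid, np.tanh, relu, leaky_relu, elu):
    print(fn.__name__, fn(x))
```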

## Update



## Learning Rate Decay

**step decay**:
e.g. decay learning rate by half every few epochs

**exponential decay**:
$$\alpha = \alpha_0e^{-kt}$$

**1/t decay**:
$$\alpha = \frac{\alpha_0}{1 + kt}$$
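
The three schedules can be sketched as plain functions, where `alpha0` is the initial learning rate, `k` the decay rate, and `t` the epoch or iteration count. The step-decay defaults (halve every 10 epochs) are hypothetical values chosen for illustration.

```python
import math

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(alpha0, k, t):
    """alpha = alpha0 * exp(-k * t)"""
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, k, t):
    """alpha = alpha0 / (1 + k * t)"""
    return alpha0 / (1 + k * t)

for t in (0, 10, 20):
    print(t, step_decay(0.1, t), exponential_decay(0.1, 0.05, t),
          inverse_time_decay(0.1, 0.05, t))
```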

## Convolution output size
*n* x *n* image  
*f* x *f* filter  
padding *p*  
stride *s*  

$$\left\lfloor \frac{n+2p-f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n+2p-f}{s} + 1 \right\rfloor$$
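
A small helper for this formula, assuming the result is rounded down when the stride does not divide evenly (the function name and example values are illustrative):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(32, 5, p=2, s=1))  # 32
print(conv_output_size(32, 2, p=0, s=2))  # 16
print(conv_output_size(7, 3, p=0, s=2))   # 3
```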

## Padding
Formula for the padding that keeps the output the same size as the input, assuming stride 1 and an odd filter size *f*. It is common to use this to preserve spatial size:
$$p = \frac{f-1}{2}$$
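
A brief sketch that computes this padding and checks that the spatial size is preserved. It assumes stride 1 and rejects even filter sizes, for which (f-1)/2 is not an integer; the function name is illustrative.

```python
def same_padding(f):
    """Padding that preserves spatial size for an odd f x f filter at stride 1."""
    if f % 2 == 0:
        raise ValueError("same padding of (f - 1) / 2 needs an odd filter size")
    return (f - 1) // 2

n = 32
for f in (1, 3, 5, 7):
    p = same_padding(f)
    out = (n + 2 * p - f) // 1 + 1   # output size at stride 1
    print(f, p, out)                 # out stays equal to n
```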