## Score Function

$$WX + B$$

Once you get scores out, it's up to you how to interpret them and develop a loss function. You can pipe them into SVMs or Softmax, for example. But there are others.

## SVM Classifier
You can interpret the scores with an SVM loss, where you just want the correct class's score to be at least some margin above each incorrect score. In practice softmax is used more often; it interprets the scores as unnormalized log probabilities.

$$L_i = \sum_{j\neq y_i}\max(0, s_j - s_{y_i} + 1)$$

The SVM loss only cares about a very local region of score space, the margin of 1 here, and beyond that it's invariant. So an incorrect-class score of -100 versus -200 wouldn't matter to the SVM loss, since both are already well past the margin, whereas it would still change the softmax loss.

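You don't count the correct class ($$j = y_i$$) in the sum, because its term is always $$\max(0, s_{y_i} - s_{y_i} + 1) = 1$$, so you'd be inflating the loss by 1 everywhere. Including it wouldn't change the final performance, so leaving it out is a somewhat arbitrary convention.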
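
A minimal sketch of the per-example SVM loss above, in plain Python. The function name `svm_loss` and the example scores are made up for illustration; the margin defaults to the 1 used in the formula.

```python
def svm_loss(scores, correct_class, margin=1.0):
    """Multiclass SVM (hinge) loss L_i for a single example.

    scores: list of class scores s_j
    correct_class: index y_i of the correct class
    """
    correct_score = scores[correct_class]
    return sum(
        max(0.0, s - correct_score + margin)
        for j, s in enumerate(scores)
        if j != correct_class  # skip the correct class, as in the formula
    )


# Hypothetical scores for three classes; class 0 is the correct one.
print(svm_loss([3.2, 5.1, -1.7], correct_class=0))  # roughly 2.9

# Once an incorrect score is past the margin, lowering it further changes nothing.
print(svm_loss([3.2, -100.0, -1.7], 0) == svm_loss([3.2, -200.0, -1.7], 0))
```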

## Softmax Classifier
The scores of a softmax classifier are interpreted as the unnormalized log probabilities of the classes. The probabilities are recovered by exponentiating the scores and normalizing them. The probability of class *k* is:

$$P(Y = k|X = x_i) = \frac{e^{s_k}}{\sum_{j} e^{s_j}}$$

You exponentiate the scores for one class, and divide by the sum of exponentiated scores for all classes.

Where the score *s* is a function of inputs and weights:

$$s = f(x_i;W)$$

We want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class:
$$L_i = -\log P(Y = y_i|X = x_i)$$

*We want the log likelihood of the correct class to be high (we want the negative of it to be low), and the likelihood is the softmax function of your scores*.

Putting it together:
$$L_i = -\log\frac{e^{s_{y_i}}}{\sum_{j} e^{s_j}}$$

*Summary*: You interpret the score outputs as unnormalized log probabilities. First you convert them to probabilities, and then you maximize the log probability of the correct class, which gives the loss function for softmax.
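
The softmax loss can be sketched in plain Python as below. The function names are hypothetical. Subtracting the maximum score before exponentiating is an addition for numerical stability that the notes don't mention; it leaves the probabilities unchanged because the shift cancels in the ratio.

```python
import math


def softmax_probs(scores):
    """Turn raw scores into probabilities: exponentiate, then normalize."""
    shift = max(scores)  # stability trick (assumption, not from the notes)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, correct_class):
    """Negative log probability of the correct class for one example."""
    return -math.log(softmax_probs(scores)[correct_class])


# Hypothetical scores for three classes; class 0 is the correct one.
scores = [3.2, 5.1, -1.7]
print(softmax_probs(scores))
print(softmax_loss(scores, correct_class=0))
```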

#### Sanity check when kicking off classifier training

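As a sanity check at the beginning of your optimization: your weights will be small, so the scores of all classes should be ~0 and the softmax probabilities roughly uniform. The initial *loss* (ignoring regularization) should then be about $$-\log\frac{1}{C} = \log C$$ where *C* is the number of classes, e.g. about 2.3 for 10 classes.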



## Full Loss

$$L = \frac{1}{N}\sum^N_{i=1}L_i + R(W)$$

Loss over the entire training set. Regularization is only a function of the weights, not the data.
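
A small sketch of the full loss. The notes leave $$R(W)$$ unspecified, so this assumes an L2 penalty, $$R(W) = \lambda\sum W^2$$; the names `l2_penalty`, `full_loss`, and `reg_strength` (playing the role of $$\lambda$$) are made up for illustration.

```python
def l2_penalty(W, reg_strength):
    """Assumed regularizer: reg_strength times the sum of squared weights."""
    return reg_strength * sum(w * w for row in W for w in row)


def full_loss(per_example_losses, W, reg_strength=1e-3):
    """Mean of the per-example losses L_i plus the regularization term R(W)."""
    data_loss = sum(per_example_losses) / len(per_example_losses)
    # The penalty depends only on the weights, never on the data.
    return data_loss + l2_penalty(W, reg_strength)


# Hypothetical per-example losses and a tiny 2x2 weight matrix.
print(full_loss([2.9, 0.0, 1.3], [[1.0, -2.0], [0.5, 0.0]], reg_strength=0.1))
```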

## Optimization
Imagine you have a loss landscape, and you're blindfolded, but you have an altimeter, and you're trying to get to the bottom of the valley. The altimeter reading is the loss; optimization is the process of using those readings to work your way downhill.

#### Numerical Approximation of Gradients
- When you implement backprop, do gradient checking
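
One common way to do a gradient check is to compare the analytic gradient against a centered finite difference. The sketch below assumes that approach; the toy loss, the step size `h`, and all function names are illustrative choices, not something the notes specify.

```python
def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of df/dx_i for every coordinate of x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad


def relative_error(a, b):
    """Largest elementwise relative difference between two gradients."""
    return max(abs(p - q) / max(1e-12, abs(p) + abs(q)) for p, q in zip(a, b))


def toy_loss(x):
    """Hypothetical loss used only for the check: sum of squares."""
    return sum(v * v for v in x)


def toy_loss_grad(x):
    """Analytic gradient of toy_loss, standing in for a backprop result."""
    return [2.0 * v for v in x]


point = [1.0, -2.0, 0.5]
error = relative_error(numerical_gradient(toy_loss, point), toy_loss_grad(point))
print(error)  # should be tiny if the analytic gradient is right
```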

#### Mini-batch Gradient Descent
- Common mini-batch sizes are 32/64/128 examples
- Krizhevsky's ILSVRC ConvNet used 256 examples
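
To make the idea concrete, here is a minimal mini-batch gradient descent loop on a toy problem: fitting the slope of a line with a squared-error loss. The problem, the learning rate, the epoch count, and the function names are all assumptions chosen for illustration.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches; the last one may be smaller."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


def sgd_fit_slope(data, batch_size=32, learning_rate=0.1, epochs=20, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            # Gradient of mean((w*x - y)^2) over this mini-batch only.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Hypothetical noiseless data generated from y = 3x.
data = [(i / 100.0, 3.0 * i / 100.0) for i in range(200)]
print(sgd_fit_slope(data))  # should end up close to 3
```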

## Neural Network

(Before) Linear score function: $$f = Wx$$

(Now) 2-layer Neural Network: $$f = W_2\max(0,W_1x)$$

or 3-layer Neural Network: $$f = W_3\max(0,W_2\max(0,W_1x))$$
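
The 2-layer forward pass can be sketched in plain Python with lists standing in for matrices. The weights and input below are arbitrary illustrative values, and biases are omitted to match the formulas above.

```python
def matvec(W, x):
    """Matrix-vector product, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu_vec(v):
    """Elementwise max(0, .)."""
    return [max(0.0, a) for a in v]


def two_layer_scores(W1, W2, x):
    """f = W2 max(0, W1 x)."""
    hidden = relu_vec(matvec(W1, x))
    return matvec(W2, hidden)


# Hypothetical sizes: 2 inputs, 3 hidden units, 2 output classes.
W1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
W2 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
x = [2.0, 1.0]
print(two_layer_scores(W1, W2, x))  # [4.0, -1.5]
```

A 3-layer network repeats the same pattern once more: `matvec(W3, relu_vec(matvec(W2, relu_vec(matvec(W1, x)))))`.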

## Activation Functions

**Sigmoid**: $$\sigma(x) = \frac{1}{1 + e^{-x}}$$

**tanh**: $$\tanh(x)$$

**ReLU**: $$\max(0,x)$$
Does not saturate in the positive region. Very computationally efficient. Converges much faster than sigmoid/tanh in practice (e.g. 6x). Its output is not zero-centered, though.  

**Leaky ReLU**: $$\max(0.01x, x)$$
Will not "die", because the gradient is nonzero for negative inputs too.  

**Maxout**: $$\max(w^T_1x + b_1, w^T_2x + b_2)$$

**ELU**: $$f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \le 0 \end{cases}$$
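
The activations listed above can be sketched as scalar Python functions. A few things here are assumptions rather than content from the notes: the leaky ReLU slope is left as a required parameter, the ELU uses its standard definition ($$x$$ for $$x > 0$$, otherwise $$\alpha(e^x - 1)$$) with $$\alpha = 1$$ as an illustrative default, and the sigmoid is written in a two-branch form to avoid overflow for very negative inputs.

```python
import math


def sigmoid(x):
    """1 / (1 + e^-x), split into two branches for numerical safety."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope):
    """max(slope * x, x); slope is a small positive constant below 1."""
    return max(slope * x, x)


def elu(x, alpha=1.0):
    """Standard ELU definition (assumed; the notes omit the formula)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def maxout(x, w1, b1, w2, b2):
    """max(w1.x + b1, w2.x + b2) for one maxout unit with two linear pieces."""
    def dot(w):
        return sum(wi * xi for wi, xi in zip(w, x))
    return max(dot(w1) + b1, dot(w2) + b2)


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), math.tanh(value), relu(value),
          leaky_relu(value, 0.01), elu(value))
```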

## Update



## Learning Rate Decay

**step decay**:
e.g. decay learning rate by half every few epochs

**exponential decay**:
$$\alpha = \alpha_0e^{-kt}$$

**1/t decay**:
$$\alpha = \frac{\alpha_0}{1 + kt}$$
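
The three schedules can be written as small functions of the initial rate $$\alpha_0$$, the decay constant $$k$$, and the step or epoch count $$t$$. The function names are made up, and the step-decay defaults (halve every 10 epochs) are just one illustrative reading of "every few epochs".

```python
import math


def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)


def exponential_decay(alpha0, k, t):
    """alpha = alpha0 * e^(-k t)."""
    return alpha0 * math.exp(-k * t)


def inverse_time_decay(alpha0, k, t):
    """alpha = alpha0 / (1 + k t)."""
    return alpha0 / (1.0 + k * t)


for t in (0, 10, 20, 30):
    print(t, step_decay(0.1, t), exponential_decay(0.1, 0.05, t),
          inverse_time_decay(0.1, 0.05, t))
```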

## Convolution output size
*n* x *n* image  
*f* x *f* filter  
padding *p*  
stride *s*  

$$\left\lfloor \frac{n+2p-f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n+2p-f}{s} + 1 \right\rfloor$$
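
A quick helper for the output-size formula, assuming the brackets mean rounding down (floor) when the stride doesn't divide evenly. The function name and the example sizes are illustrative.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1


print(conv_output_size(32, 5, p=2, s=1))  # 32
print(conv_output_size(7, 3, p=0, s=2))   # 3
print(conv_output_size(6, 3, p=0, s=2))   # 2 (the division is rounded down)
```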

## Padding
With stride 1 and an odd filter size *f*, it is common to use this much padding to preserve the spatial size:
$$p = \frac{f-1}{2}$$
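
As a small check of the padding rule, the sketch below computes the padding for a few odd filter sizes and confirms that a stride-1 convolution then keeps the input size. The function name and the input size of 32 are arbitrary choices for illustration.

```python
def same_padding(f):
    """Padding that preserves spatial size at stride 1, for odd filter size f."""
    if f % 2 == 0:
        raise ValueError("this rule assumes an odd filter size")
    return (f - 1) // 2


n = 32  # hypothetical input side length
for f in (1, 3, 5, 7):
    p = same_padding(f)
    output = n + 2 * p - f + 1  # stride-1 output size
    print(f, p, output)  # output stays equal to n
```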