# Stanford cs231n Notes

Winter 2016 series by Li Fei-Fei, Andrej Karpathy and Justin Johnson

[videos](https://www.youtube.com/playlist?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC)

[material](http://cs231n.stanford.edu/2016/syllabus.html)

# Lecture 3

## SVM

Consider a linear classifier of the form $f(x, W)=Wx$ that assigns class scores.

For an example $(x_i, y_i)$, where $x_i$ is the image and $y_i$ is the integer label, the score vector is $s = f(x_i, W)$.

The **Multi-class SVM Loss (Hinge Loss)** for an example is given by: 

$$ L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1) $$

The 1 above is a **safety margin** (a typical SVM hyperparameter). The detailed course notes show that its exact value does not matter, because the weights can scale to compensate for it.

The loss sums the margin violations over all the incorrect classes.

The **full training loss** is the mean over all examples in the training data:

$$ L = \frac{1}{N} \sum_{i=1}^{N} L_i $$

Note: the **Squared Hinge Loss** sometimes works better, but its meaning is different: margin violations are penalized quadratically instead of linearly.

$$ L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)^2 $$

**The problem with the above loss is that the weight matrix $W$ is not unique.** A $W$ that achieves zero loss can be scaled up and still achieve zero loss. Therefore we need regularization.

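```
import numpy as np

def L_i_vectorized(x, y, W):
    """Multi-class SVM loss for one example x with integer label y; W is (C, D)."""
    scores = np.dot(W, x)
    # hinge margin of every class against the correct class
    margins = np.maximum(0, scores - scores[y] + 1)
    # zero out the correct class, which would otherwise contribute the margin of 1
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i
```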

## Linear SVM Hinge Loss Gradient

See course [notes](http://cs231n.github.io/optimization-1/#gradcompute) for details. This is written for the assignment 1 SVM problems.

This SVM has the following definition, where $j$ is a class label (an index into the score vector $s$) and $y_i$ is the **true class label** of the $i$th of $N$ training examples:

Dimensions:

```
# D - data dimension
# C - no. of classes
# N - no. of training examples

W.shape == (D, C)
X.shape == (N, D)
scores.shape == (N, C)
```

$$
\begin{aligned}
s_i &= X_i W \\
Margin_{j, y_i} &= s_j - s_{y_i} + 1 = \text{incorrect class score - true class score + 1}\\
L(S) &= \sum_{j \neq y_i} \max(0, Margin_{j, y_i}) 
\end{aligned}
$$

For a single training example $i$, where $y_i$ is the correct class label for $i$:

$W_j$ is the column vector for the weights for class $j$.

$$
\begin{aligned}
s_{y_i} =& W_{y_i}^T X_i  \\
s_{j} =& W_j^T X_i & j \neq y_i\\
Margin_{j, y_i} =& s_j - s_{y_i} + 1 = \text{incorrect class score - true class score + 1}\\
L_i(S) =& \sum_{j \neq y_i} \max(0, Margin_{j, y_i}) \\
=& \sum_{j \neq y_i} \max\big(0, W_j ^T X_i - W_{y_i}^T X_i + \delta\big) & \delta = 1 \text{ is the margin}
\end{aligned}
$$

### Backprop

Dimension:

```
W_j - (D, 1)
W_yi - (D, 1)
X_i - (D, 1)
z - (1, D) * (D, 1) = (1, 1)
L_i - sum((1, C)) == (1, 1)
dz - sum((1, C)) == (1, 1)
dWj - (D, 1) * (1, 1) = (D, 1)
```

$$
\begin{aligned}
z_j =& W_j ^T X_i - W_{y_i}^T X_i + \delta \\
L_i(z_j) =& \sum_{j \neq y_i} \max\big(0, z_j\big) \\
dz =& \sum_{j \neq y_i} \mathbb{1}\big( W_j ^T X_i - W_{y_i}^T X_i + \delta > 0 \big) \\
=& \mathbb{1}\big( W_1 ^T X_i - W_{y_i}^T X_i + \delta > 0 \big) + \\
& \mathbb{1}\big( W_2 ^T X_i - W_{y_i}^T X_i + \delta > 0 \big) + \\
& \cdots \\
& \mathbb{1}\big( W_C ^T X_i - W_{y_i}^T X_i + \delta > 0 \big) \\
dW_{y_i} =& -X_i \cdot dz \\
=& -X_i \cdot \sum_{j \neq y_i} \mathbb{1}\big( W_j ^T X_i - W_{y_i}^T X_i + \delta > 0 \big) & 
\text{because } W_{y_i} \text{ appears in every term of } dz \\
dW_j =& X_i \cdot dz_j \\
=& X_i \cdot \mathbb{1}\big( W_j ^T X_i - W_{y_i}^T X_i + \delta > 0 \big) & \text{because other } W_j\text{ terms disappear}
\end{aligned}
$$

Here $\mathbb{1}(\cdot)$ is the indicator function.

**Essentially, the gradient with respect to the correct class weights $W_{y_i}$ is $-X_i$ scaled by a count of the incorrect classes that violate the margin.**

Vectorized:

$$
\begin{aligned}
S &= X W & (N, D) \times (D, C) = (N, C)\\
Margin_{j, y} &= S_j - S_{y} + 1 & (N, C)\\
L(S) &= \sum_{j \neq y_i} \bigg(\max\big(0, Margin_{j, y_i}\big)\bigg) & (1,)
\end{aligned}
$$

Derivatives are hence:

The key here is that we need to capture the gradient for the **correct** class, which is missing from $dS_j$ because $j \neq y_i$; it comes from $dS_{y}$.

$$
\begin{aligned}
dMargin &= \begin{cases}
1 & Margin_{j,y_i} > 0 \\
0 & \text{otherwise}
\end{cases} & (N, C)\\
ds_{j \neq y_i} &= dMargin & (N, C)\\
ds_{j = y_i} &= -\sum_{j \neq y_i} dMargin & (N, 1) \\
dW &= X^T \cdot ds & (D, N) \times (N, C) = (D, C), \text{ with } ds_{j = y_i} \text{ written into column } y_i \text{ of row } i
\end{aligned}
$$

In code form:

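```
# margins: (N, C) matrix max(0, S_j - S_y + 1), zeroed where j == y_i
dmargin = (margins > 0).astype(float)
ds = dmargin.copy()
ds[np.arange(N), y] = -dmargin.sum(axis=1)  # correct class: minus the count of violations
dW += np.dot(X.T, ds)                       # (D, N) x (N, C) = (D, C)
```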
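To make the vectorized derivation concrete, here is a minimal NumPy sketch of the mean hinge loss and its gradient, checked against a numerical gradient. The function names, the random test data and the choice to average over $N$ without regularization are assumptions for illustration, not the assignment's reference solution.

```python
import numpy as np

def svm_loss_vectorized(W, X, y, delta=1.0):
    """Mean multi-class hinge loss and its gradient w.r.t. W (no regularization)."""
    N = X.shape[0]
    scores = X.dot(W)                                   # (N, C)
    correct = scores[np.arange(N), y][:, None]          # (N, 1) true-class scores
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(N), y] = 0.0                      # skip the j == y_i terms
    loss = margins.sum() / N

    ds = (margins > 0).astype(float)                    # +1 for each violating class
    ds[np.arange(N), y] = -ds.sum(axis=1)               # -count for the correct class
    dW = X.T.dot(ds) / N                                # (D, N) x (N, C) = (D, C)
    return loss, dW

def numerical_gradient(f, W, h=1e-5):
    """Centered finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                                    # restore the weight
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
N, D, C = 5, 4, 3
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = rng.standard_normal((D, C))

loss, dW = svm_loss_vectorized(W, X, y)
dW_num = numerical_gradient(lambda w: svm_loss_vectorized(w, X, y)[0], W)
max_abs_diff = np.max(np.abs(dW - dW_num))
print(loss, max_abs_diff)
```

The hinge loss is piecewise linear, so away from the kinks the analytic and numerical gradients should agree to within floating point error.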

## Softmax Classifier

Loss function

$$ L_i = -\log \left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right) $$

**Tip**: Initially, the value of this loss function should be close to $-\log(1/C)$ where $C$ is the number of classes. This can be used as a sanity check for running optimizations.
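The sanity check in the tip can be sketched as follows. This is a minimal example with random stand-in data and a small random $W$; the sizes and the `1e-4` initialization scale are assumptions chosen only for illustration.

```python
import numpy as np

def softmax_loss(scores, y):
    """Mean cross-entropy loss for scores of shape (N, C) and integer labels y."""
    shifted = scores - scores.max(axis=1, keepdims=True)    # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(scores.shape[0]), y].mean()

rng = np.random.default_rng(0)
N, D, C = 100, 32, 10
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 1e-4 * rng.standard_normal((D, C))      # small random initialization

initial_loss = softmax_loss(X.dot(W), y)
expected = -np.log(1.0 / C)                 # loss of a uniform prediction
print(initial_loss, expected)
```

With near-zero weights all class scores are almost equal, so the predicted distribution is close to uniform and the two numbers should nearly match.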


## Lecture 4 Backprop

In a backprop computational graph:

**Add** gate: gradient distributor

**Max** gate: gradient router, gradient goes to the larger input, smaller input has zero gradient.

**Multiply** gate: gradient "switcher", each input receives the upstream gradient scaled by the value of the *other* input.

If a node's output is used by multiple consumer nodes then in backprop, gradients coming from the consumer nodes should be **added**.
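A small hand-written example may help here. The sketch below backpropagates through $f = (x + y) \cdot \max(z, w)$ and through a node whose value feeds two consumers; both functions and the input values are made up for illustration.

```python
def forward_backward(x, y, z, w):
    """Forward and backward pass for f = (x + y) * max(z, w)."""
    # forward pass
    a = x + y                       # add gate
    b = max(z, w)                   # max gate
    f = a * b                       # multiply gate
    # backward pass, starting from df/df = 1
    da = b * 1.0                    # multiply gate: upstream times the other input
    db = a * 1.0
    dx = da                         # add gate: distributes the gradient unchanged
    dy = da
    dz = db if z >= w else 0.0      # max gate: routes to the larger input only
    dw = db if w > z else 0.0
    return f, (dx, dy, dz, dw)

def fan_out(x):
    """x feeds two consumers, p = x * x and q = 3 * x, with out = p + q."""
    dx_from_p = 2.0 * x
    dx_from_q = 3.0
    return x * x + 3.0 * x, dx_from_p + dx_from_q   # gradients from consumers are added

f, grads = forward_backward(x=3.0, y=-4.0, z=2.0, w=-1.0)
out, dx_total = fan_out(2.0)
print(f, grads, out, dx_total)
```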

## Lecture 5

A good example in the video implements a sanity check after initializing a network: feed data through the network without regularization and check the loss. On CIFAR-10 the expected loss is $-\log(1/10) \approx 2.3$.

**Tip**: make sure that you can **overfit** a very small portion of the training data. If you can't, then something is broken.

### Hyperparameter Search

Run coarse search for 5 epochs, then run finer searches. 

The example showed a good result near the boundary of the learning rate search space. This may be **problematic**, because it may indicate that the optimal value lies **outside** the search space.

**Track the ratio of weight updates / weight magnitudes**. This should be around `1e-3`. If it is much higher, try decreasing the learning rate, and vice versa.
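One way to compute this ratio for a plain SGD step is sketched below. The helper name `update_ratio`, the use of L2 norms, and the random stand-in weights and gradient are assumptions for illustration.

```python
import numpy as np

def update_ratio(W, dW, learning_rate):
    """Norm of the vanilla SGD update divided by the norm of the weights."""
    update = -learning_rate * dW
    return np.linalg.norm(update) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((100, 10))
dW = 0.01 * rng.standard_normal((100, 10))    # stand-in for a real gradient
ratio = update_ratio(W, dW, learning_rate=1e-3)
print(ratio)
```

In a real training loop the same computation would be applied per parameter tensor every few iterations and logged alongside the loss.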



## Lecture 6

### Second Order Optimization

Uses the **Hessian** matrix. This gives faster convergence and needs **no learning rate hyperparameter**!

However, when training large networks, the Hessian matrix becomes huge, and inverting it is impractically expensive.

**L-BFGS** is the usual choice for second order optimization. However, it works well only for **deterministic** functions, which means it **does not work well in minibatch settings**.

Adam is usually a good choice. If you can afford to do full batch updates, then try L-BFGS.

### Ensemble Trick/Tips

Can get a small boost from averaging multiple model checkpoints of a single model.

Keep track of a **running average of parameter vector**, to use at test time. 
```
loss = nn.forward()
dx = nn.backward()
x += - learning_rate * dx
x_test = 0.995 * x_test + 0.005 * x
```

## Lecture 7 CNN

Input images are `width * height * depth`, e.g. 64 * 64 * 3, for 64x64 images with RGB colours.

**Filters**, aka kernels, slide across the image. A filter can be, e.g., 3x3x3; its depth is **always** the same as the depth of its input. At each position a filter computes a dot product with the patch it covers.

**Stride** is the step size to slide the filter. 

**Dimension**: NxN images, FxF filter size, **output size** is given by: 

$$(N-F) / stride + 1$$

**Multiple** filters can be used in a layer to generate multiple **activation maps**, e.g. 6 filters would give 6 maps in a layer, which then are fed to ReLU for example.

3D Convolution is just 2D convolution applied at the same time to the 3rd dimension. 


### Padding

**Zero** padding is added to the borders of the input images to **preserve the spatial size** of the activation maps after convolution operations.

If the filter size is FxF (F odd) and the stride is 1, then the zero-padding size is $(F-1)/2$.

**Example**: input image 32x32x3, 10 filters of size 5x5x3 with stride 1 and padding 2. The output after convolution is 32x32x10, since (32 + 2 * 2 - 5)/1 + 1 = 32.

Number of parameters = 5 * 5 * 3 * 10 + 10 = 760, i.e. 10 * (filter weights + 1 bias).

### Dimension Summary

For a Conv Layer, 

Input volume size $W_1 \times H_1 \times D_1$

Requires 4 hyperparameters:

* number of filters, $K$
* filter spatial extent, $F$
* stride, $S$
* the amount of zero padding, $P$
    
Output volume size $W_2 \times H_2 \times D_2$, where:

$$
\begin{aligned}
W_2 &= (W_1 - F + 2P) / S + 1 \\
H_2 &= (H_1 - F + 2P) / S + 1 \\
D_2 &= K
\end{aligned}
$$

With **parameter sharing**, this introduces $F \times F \times D_1$ weights per filter, for a total of $(F \times F \times D_1) \times K$ weights and $K$ biases.
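The formulas above are easy to check in code. The helper below is a minimal sketch (the function name and the error on uneven strides are assumptions); it reproduces the 32x32x3 padding example and the CONV1 layer from the AlexNet example in these notes.

```python
def conv_output_shape(w1, h1, d1, k, f, s, p):
    """Return the output volume (W2, H2, D2) and parameter count of a conv layer."""
    if (w1 - f + 2 * p) % s or (h1 - f + 2 * p) % s:
        raise ValueError("filter does not tile the padded input evenly")
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    params = f * f * d1 * k + k       # shared weights per filter plus one bias each
    return (w2, h2, k), params

# 32x32x3 input, 10 filters of 5x5x3, stride 1, padding 2
shape, params = conv_output_shape(32, 32, 3, k=10, f=5, s=1, p=2)

# AlexNet CONV1: 227x227x3 input, 96 filters of 11x11x3, stride 4, no padding
alexnet_shape, alexnet_params = conv_output_shape(227, 227, 3, k=96, f=11, s=4, p=0)
print(shape, params, alexnet_shape, alexnet_params)
```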

$K$ is usually chosen as powers of 2.



In each activation map, neurons share **weights** (from filters) and **local connectivity**.

### Pooling Layer

**Max pooling**: use a filter, e.g. 2x2, then take the max of each 2x2 area. **Average pooling**, performs as well as max pooling. 

Input volume size: $W_1 \times H_1 \times D_1$

Requires 2 hyperparameters:

* spatial extent, $F$
* stride, $S$

Output volume size: 

$$
\begin{aligned}
W_2 &= (W_1 - F) / S + 1 \\
H_2 &= (H_1 - F) / S + 1 \\
D_2 &= D_1
\end{aligned}
$$

Not common to use zero-padding for pooling layers.

Pooling **shrinks** the input volume.
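A minimal loop-based sketch of max pooling, assuming an `(H, W, D)` array layout and no padding. The function name is hypothetical, and a real implementation would be vectorized.

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling over an (H, W, D) volume with an f x f window and stride s."""
    h1, w1, d1 = x.shape
    h2 = (h1 - f) // s + 1
    w2 = (w1 - f) // s + 1
    out = np.empty((h2, w2, d1), dtype=x.dtype)
    for i in range(h2):
        for j in range(w2):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j, :] = window.max(axis=(0, 1))    # max per depth slice
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool(x, f=2, s=2)                        # 4x4x1 -> 2x2x1

# A 55x55x96 volume with 3x3 windows at stride 2 shrinks to 27x27x96.
alexnet_pool_shape = max_pool(np.zeros((55, 55, 96)), f=3, s=2).shape
print(pooled[:, :, 0], alexnet_pool_shape)
```

Pooling acts on each depth slice independently, which is why the depth is unchanged and there are no parameters.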

### AlexNet Example

Input: 227x227x3 images, (paper says 224, confused everyone...)

First conv layer (CONV1): 96 11x11x3 filters applied at stride 4, output volume size: 55x55x96, (227 - 11)/4 + 1 = 55. 

Total number of params = 11 * 11 * 3 * 96 + 96 = 34,944 (~35k)

Pooling layer after CONV1: 3x3 filters applied at stride 2, output size 27x27x96, since (55 - 3) / 2 + 1 = 27. 0 params.

### VGG Example

Total memory ~93MB / image for forward pass, ~2x for backward pass.

Total params: ~138M.

### ResNet

A lot deeper: 152 layers, ~60M parameters for the 152-layer model. It relies on skip connections; each layer is trained to be a **delta** that is added to its input. This allows gradients to flow back to the first layer easily.

Very rapid spatial reduction, but relying on many layers.

8 GPUs trained for 2 weeks....

As of 2017, Inception-V4 (ResNet + Inception) has the best top-1 accuracy of ~80%. VGG has the highest memory and compute usage. See this [video](https://www.youtube.com/watch?v=DAOcjicFr1Y&index=9&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv)

ResNeXt was published later.

### DenseNet

Dense blocks where each layer is connected to every other layer in feedforward fashion. 

Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse.

# CNN Implementation

Replace large convolutions with stacks of 3x3 convolutions

1x1 bottleneck convolutions are very efficient

Can factor NxN convolution into 1xN and Nx1

All of the above give fewer parameters, less compute, more non-linearity.
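The parameter savings can be checked with a little arithmetic. The sketch below assumes 64 input and output channels and a bottleneck that halves the channel count; both choices are illustrative assumptions, not values taken from the lecture.

```python
def conv_params(f_h, f_w, c_in, c_out):
    """Weights plus biases for a conv layer with f_h x f_w filters."""
    return f_h * f_w * c_in * c_out + c_out

C = 64  # channels in and out, chosen only for illustration

# One 7x7 layer vs three stacked 3x3 layers (same 7x7 receptive field)
one_7x7 = conv_params(7, 7, C, C)
three_3x3 = 3 * conv_params(3, 3, C, C)

# One 3x3 layer vs its 1x3 + 3x1 factorization
single_3x3 = conv_params(3, 3, C, C)
factored_3x3 = conv_params(1, 3, C, C) + conv_params(3, 1, C, C)

# One 3x3 layer vs a 1x1 -> 3x3 -> 1x1 bottleneck at half width
bottleneck = (conv_params(1, 1, C, C // 2)
              + conv_params(3, 3, C // 2, C // 2)
              + conv_params(1, 1, C // 2, C))
print(one_7x7, three_3x3, single_3x3, factored_3x3, bottleneck)
```

Each replacement also inserts extra non-linearities between the smaller layers, which is where the "more non-linearity" comes from.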

## Convolution 

### im2col

Input H x W x C, conv weights: D filters, each K x K x C

Reshape each K x K x C receptive field of the input into a column of $K^2 C$ elements. Repeat for all receptive field locations to get a $K^2C \times N$ matrix (N receptive field locations).

This uses a lot of memory, but in practice it is OK.

Reshape each filter to $K^2C$ length rows, making a $D \times K^2C$ matrix.

For mini-batches, concat each sample to the matrices above to form the batch.

Use matrix multiply to get results in dimension of $D \times N$, reshape to output tensor.
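The steps above can be sketched in NumPy for a single image, without padding or bias. The function names and the `(D, K, K, C)` filter layout are assumptions for illustration; a naive loop version is included only to check the result.

```python
import numpy as np

def im2col(x, k, stride=1):
    """Unroll every k x k x C receptive field of x (H, W, C) into a column."""
    h, w, c = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.empty((k * k * c, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            cols[:, i * out_w + j] = patch.ravel()
    return cols, out_h, out_w

def conv_im2col(x, filters, stride=1):
    """filters has shape (D, k, k, C); returns an (out_h, out_w, D) volume."""
    d, k = filters.shape[0], filters.shape[1]
    cols, out_h, out_w = im2col(x, k, stride)       # (k*k*C, N)
    w_rows = filters.reshape(d, -1)                 # (D, k*k*C)
    out = w_rows.dot(cols)                          # (D, N): one matrix multiply
    return out.reshape(d, out_h, out_w).transpose(1, 2, 0)

def conv_naive(x, filters, stride=1):
    """Reference implementation: explicit dot product at every location."""
    d, k = filters.shape[0], filters.shape[1]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            for n in range(d):
                out[i, j, n] = np.sum(patch * filters[n])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 3))
filters = rng.standard_normal((4, 3, 3, 3))
y_fast = conv_im2col(x, filters)
y_slow = conv_naive(x, filters)
print(np.allclose(y_fast, y_slow))
```

Overlapping receptive fields are copied into several columns, which is the source of the extra memory use.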

### FFT

Convolution Theorem: the Fourier transform of the convolution of $f$ and $g$ is equal to the element-wise product of their Fourier transforms:

$$ \mathcal{F}(f * g) = \mathcal{F}(f) \odot \mathcal{F}(g) $$

Steps:

1. compute FFT of weights: F(W)
2. compute FFT of image (activation map): F(X)
3. compute element-wise product of F(W) * F(X)
4. Compute inverse FFT: Y = inv_F[F(W) * F(X)]

Big speed up but only for large filters. **Not much speedup for 3x3 filters!**
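The four steps can be sketched for a single 2-D channel with `numpy.fft`. Two assumptions to note: both inputs are zero-padded to the full output size so that the circular convolution computed by the FFT equals the linear one, and the result is a true convolution (kernel flipped), whereas CNN layers usually compute cross-correlation.

```python
import numpy as np

def conv2d_fft(x, w):
    """Full 2-D linear convolution of x (H, W) with w (K, K) via the FFT."""
    out_h = x.shape[0] + w.shape[0] - 1
    out_w = x.shape[1] + w.shape[1] - 1
    fw = np.fft.fft2(w, s=(out_h, out_w))      # step 1: FFT of zero-padded weights
    fx = np.fft.fft2(x, s=(out_h, out_w))      # step 2: FFT of zero-padded image
    product = fw * fx                          # step 3: element-wise product
    return np.real(np.fft.ifft2(product))      # step 4: inverse FFT

def conv2d_direct(x, w):
    """Reference full convolution by shifting and accumulating, O(H W K^2)."""
    out = np.zeros((x.shape[0] + w.shape[0] - 1, x.shape[1] + w.shape[1] - 1))
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            out[i:i + x.shape[0], j:j + x.shape[1]] += w[i, j] * x
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))
y_fft = conv2d_fft(x, w)
y_direct = conv2d_direct(x, w)
print(np.allclose(y_fft, y_direct))
```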

### Strassen's Algorithm 

Naive matrix multiplication takes $O(N^3)$ operations. Strassen's algorithm reduces the complexity to $O(N^{\log_2 7})$.

Lavin and Gray 2015 paper: Fast Algorithms for Convolutional Neural Networks. Improved VGG speed by a factor of 2.

## Other Tips

Use lower precision floating point such as 16-bit. 

Store data in SSD drives. Write data into HDF5 file to reduce disk seek time. 

<a id='rnn'></a>
# RNN / LSTM Lecture 10

[char-rnn code](https://gist.github.com/karpathy/d4dee566867f8291f086)

[2017 video](https://www.youtube.com/watch?v=6niqTuYFZLQ)

Initial hidden state is usually initialized to zero.

Because of the recurrence, the same weight matrix $W$ is used many times; in a computational graph, this is essentially reusing the same node many times. In backprop, this means you end up summing the gradients for $dW$.

Input: $X = (x_1, x_2, \cdots, x_n)$

Hidden units: $H = (h_1, h_2, \cdots, h_n)$

Each input is passed together with the previous hidden unit to a recurrence function $f_W()$, whose output is the next hidden unit. Generally:

$$ f_W(x_{t}, h_{t-1}) = h_{t} $$

Loss: $L = \sum_{t=1}^{n} L_t$, where $L_t$ is the loss at step $t$, comparing the target $y_t$ with the prediction computed from $h_t$.

A **Sequence to Sequence** network is a) a Many-to-One RNN followed by b) a One-to-Many RNN, where the last hidden state from a) is the input to b).

**Truncated** Backprop Through Time: break the sequence into chunks and do forward/backward passes for each chunk separately. This is quite similar to minibatch SGD.
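As a concrete instance of $f_W$, the sketch below uses the common vanilla RNN form $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$. The weight names, sizes and random inputs are assumptions for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return np.tanh(W_xh.dot(x_t) + W_hh.dot(h_prev) + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 5, 4
W_xh = 0.1 * rng.standard_normal((hidden_size, input_size))
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

xs = rng.standard_normal((seq_len, input_size))
h = np.zeros(hidden_size)            # initial hidden state set to zero
hs = []
for x_t in xs:                       # the same weights are reused at every step
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    hs.append(h)
hs = np.stack(hs)                    # (seq_len, hidden_size)
print(hs.shape)
```

Because `W_xh` and `W_hh` appear in every iteration of the loop, their gradients are summed over all time steps in the backward pass.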

## LSTM

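**Gates**: $i, f, o, g$, aka **ifog**. **Cell state** $c_t$, **hidden state** $h_t$.

Inputs are $x_t$ (or the hidden state of the layer below) and $h_{t-1}$, together with the previous cell state $c_{t-1}$; outputs are $c_t$ and $h_t$.

RNN models are typically not deep, normally 2-4 layers.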

### Formula

Here when $l=1$, $h^{l-1}_t = x_t$

$$
\begin{aligned}
\left(
\begin{array}{cc}
i\\
f\\
o\\
g
\end{array} \right)
 &= 
\left( 
\begin{array}{cc}
sigmoid\\
sigmoid\\
sigmoid\\
\tanh
\end{array} \right) W^{l} \left( \begin{array}{cc}
h_t^{l-1} \\
h_{t-1}^{l}
\end{array} \right)\\
c^l_t &= f \odot c^l_{t-1} + i \odot g \\
h^l_t &= o \odot \tanh(c^l_t)
\end{aligned}
$$



Vanilla RNNs suffer from exploding / vanishing gradient problem. Backpropagation through multiplication gates results in multiplying $W^T$ for computing gradients of $h_t$. This means the gradient of $h_0$ involves many factors of $W^T$. 

Whether it explodes or vanishes depends on the **largest eigenvalue of the hidden state weight matrix $W$**. If larger than 1 it explodes, vanishes if less than 1.

Hack for exploding gradients: **gradient clipping**.

Forget gates can **turn off** gradient flow. At the start we typically initialize the forget gate biases to positive numbers so as **not** to turn off the gradient; the network then learns how to forget.
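A minimal sketch of clipping by global norm, one common variant (the notes do not say which variant the lecture used; element-wise clipping with `np.clip` is another). The function name and the toy gradients are assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # never scales gradients up
    return [g * scale for g in grads], total_norm

grads = [np.full((2, 2), 3.0), np.full(4, 4.0)]         # joint norm is 10
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Clipping only bounds the size of the update; it does nothing for vanishing gradients.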

### Dimension of weights

$x_t$ and $h_{t-1}$ are vectors of length $h$; stack them together to form a column vector of shape $2h \times 1$.

The weights for $i,f,o,g$ are combined into a single $W$ of shape $4h \times 2h$.
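The shapes can be verified with a single LSTM step in NumPy. This is a sketch under stated assumptions: the candidate $g$ uses $\tanh$ as in the standard LSTM, a bias vector `b` is added (the formula in these notes omits it), and the forget-gate bias is set to a positive value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W has shape (4h, 2h) and b has shape (4h,)."""
    h = h_prev.shape[0]
    stacked = np.concatenate([x_t, h_prev])     # (2h,)
    a = W.dot(stacked) + b                      # (4h,) pre-activations for i, f, o, g
    i = sigmoid(a[0:h])                         # input gate
    f = sigmoid(a[h:2 * h])                     # forget gate
    o = sigmoid(a[2 * h:3 * h])                 # output gate
    g = np.tanh(a[3 * h:4 * h])                 # candidate cell update
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
h = 4
W = 0.1 * rng.standard_normal((4 * h, 2 * h))
b = np.zeros(4 * h)
b[h:2 * h] = 1.0                                # positive forget-gate bias
x_t = rng.standard_normal(h)
h_t, c_t = lstm_step(x_t, np.zeros(h), np.zeros(h), W, b)
print(W.shape, h_t.shape, c_t.shape)
```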


### Backprop

Gradients need to be accumulated (summed) as they flow back in the backward pass.

For each cell, there will be gradients coming back from the output as well as from the **next** hidden state. This results in the hidden state weight $W_{hh}$ being applied repeatedly, resulting in the exploding/vanishing gradient problem.

Look at Andrej's char-rnn model and description [here](http://cs231n.github.io/neural-networks-case-study/#grad). The derivative derivation shows how gradients are accumulated.


# DL Libraries

Lots of examples in [slides](http://cs231n.stanford.edu/slides/2016/winter1516_lecture12.pdf).

Use cases:

**Caffe**

Feature extraction / fine-tuning existing models.

Not good for RNNs. Cumbersome for big networks such as GoogLeNet, ResNet.

**Torch / Lasagne**

Complex uses of pre-trained models.

Write your own layers.

Not great for RNNs. 

Lua easy to read.

**Theano or TensorFlow**

For crazy RNNs

**TensorFlow**

Huge models, need model parallelism.

Visualization during training: **TensorBoard**

Data and model parallelism, best of all frameworks

Computational graph abstraction, great for RNNs.

Best multi-GPU support.

Distributed models. 

Slower in other models right now. 

Not many pretrained models. 
