# Stanford cs231n Notes

Winter 2016 series by Li Fei-Fei, Andrej Karpathy and Justin Johnson

[videos](https://www.youtube.com/playlist?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC)

[material](http://cs231n.stanford.edu/2016/syllabus.html)

## Lecture 5

The video gives a good example of a sanity check after initializing a network: feed data through the network without regularization and check the loss. On CIFAR-10 (10 classes) the expected initial loss is $-\log(1/10) \approx 2.3$.
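The expected value comes from an untrained softmax classifier assigning roughly uniform probability to every class. A minimal sketch of that arithmetic (`expected_initial_loss` is a hypothetical helper name, not something from the lecture):

```python
import math

def expected_initial_loss(num_classes: int) -> float:
    """Cross-entropy loss when every class gets probability 1/num_classes."""
    return -math.log(1.0 / num_classes)

# CIFAR-10 has 10 classes, so an untrained softmax classifier
# (no regularization) should start near this value.
cifar10_loss = expected_initial_loss(10)
print(round(cifar10_loss, 2))  # 2.3
```

If the first measured loss is far from this number, the initialization or the loss code is worth checking before training any further.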

**Tip**: make sure that you can **overfit** a very small portion of the training data. If you can't, then something is broken.

### Hyperparameter Search

Run coarse search for 5 epochs, then run finer searches. 

The example showed a good result near the boundary of the learning rate search space. This may be **problematic**, because it suggests the optimal value may lie **outside** the search space.
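A rough sketch of a coarse search with a boundary check. Sampling the learning rate log-uniformly is an assumption here (a common practice, not stated in these notes), and the helper names, exponent range and 10% margin are all hypothetical:

```python
import random

def sample_learning_rate(rng, low_exp, high_exp):
    """Sample a learning rate log-uniformly from [10**low_exp, 10**high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)

def near_boundary(best_exp, low_exp, high_exp, margin=0.1):
    """True if the best exponent is within `margin` of the range width from either edge."""
    width = high_exp - low_exp
    return (best_exp - low_exp) < margin * width or (high_exp - best_exp) < margin * width

rng = random.Random(0)  # seeded so the sketch is reproducible
candidates = [sample_learning_rate(rng, -6, -3) for _ in range(5)]

# If the best run used lr = 10**-3.1, it sits at the upper edge of the range,
# which suggests widening the search before refining it.
print(near_boundary(-3.1, -6, -3))  # True
print(near_boundary(-4.5, -6, -3))  # False
```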

**Track the ratio of weight updates / weight magnitudes**. This should be roughly `1e-3`. If it is much higher, try decreasing the learning rate; if it is much lower, try increasing it.
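One way to compute this ratio for a plain SGD step, as a sketch. It assumes the parameters and gradients are available as flat lists and uses the L2 norm as the notion of magnitude; the numbers are made up for illustration:

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def update_ratio(weights, grads, learning_rate):
    """Ratio of update magnitude to weight magnitude for one SGD step."""
    updates = [-learning_rate * g for g in grads]
    return l2_norm(updates) / l2_norm(weights)

# Toy values, not taken from any real network.
weights = [0.5, -0.3, 0.8, 0.1]
grads = [0.2, -0.1, 0.4, 0.05]
ratio = update_ratio(weights, grads, learning_rate=1e-3)
print(f"{ratio:.2e}")  # 4.63e-04
```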



## Lecture 6

### Second Order Optimization

Uses the **Hessian** matrix. This gives faster convergence and needs **no hyperparameters** (e.g. no learning rate)!

However, when training large networks the Hessian becomes huge, and inverting it is impractically expensive.
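To see why no learning rate is needed, here is a small sketch of one Newton step $x \leftarrow x - H^{-1} g$ on a 2-D quadratic, where a single step lands on the minimum. The matrix `A`, the vector `b_vec` and the function names are illustrative assumptions. For a network with $N$ parameters the Hessian has $N \times N$ entries, which is what makes this impractical at scale.

```python
def newton_step_2d(x, grad, hessian):
    """One Newton update x - H^{-1} g for a 2-D problem, inverting H in closed form."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix applied to the gradient.
    step = [(d * grad[0] - b * grad[1]) / det,
            (-c * grad[0] + a * grad[1]) / det]
    return [x[0] - step[0], x[1] - step[1]]

# f(x) = 0.5 * x^T A x - b^T x, so the gradient is A x - b and the Hessian is A.
A = [[3.0, 1.0], [1.0, 2.0]]
b_vec = [1.0, 1.0]

def gradient(x):
    return [A[0][0] * x[0] + A[0][1] * x[1] - b_vec[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b_vec[1]]

x0 = [10.0, -7.0]
x1 = newton_step_2d(x0, gradient(x0), A)  # approximately [0.2, 0.4], the minimum
```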

**L-BFGS** is the usual choice for second order optimization. However, it works well only for **deterministic** functions (full batch), so it **does not work well in minibatch settings**.

Adam is usually a good default choice. If you can afford full batch updates, then try L-BFGS.

### Ensemble Trick/Tips

Can get a small boost from averaging multiple model checkpoints of a single model.

Keep track of a **running average of parameter vector**, to use at test time. 
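```
# Pseudocode for one step inside the training loop;
# nn, x, x_test and learning_rate are defined elsewhere.
loss = nn.forward()
dx = nn.backward()                   # gradient of the loss w.r.t. parameters x
x += - learning_rate * dx            # vanilla SGD update
x_test = 0.995 * x_test + 0.005 * x  # running average of x, used at test time
```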
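A runnable toy version of the same idea, as a sketch: a single scalar parameter, a noisy gradient of $(x-3)^2$ standing in for `nn.backward()`, and the same 0.995 / 0.005 averaging. The toy loss, the noise level and the step count are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # seeded so the run is reproducible
learning_rate = 0.1
x = 0.0        # single scalar "parameter"
x_test = x     # running average, initialised to the current parameter

for step in range(2000):
    # Noisy gradient of the toy loss (x - 3)^2, standing in for nn.backward().
    dx = 2.0 * (x - 3.0) + rng.gauss(0.0, 1.0)
    x += -learning_rate * dx
    # Smoothed copy of the parameter, to be used at test time.
    x_test = 0.995 * x_test + 0.005 * x
```

The raw `x` keeps jittering around 3 because of the gradient noise, while `x_test` averages much of that jitter out.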

## Lecture 7 CNN

Input images are `width * height * depth`, e.g. 64 * 64 * 3, for 64x64 images with RGB colours.

**Filters**, aka kernels, slide across the image. A filter can be e.g. 3x3x3; its depth is **always** the same as the depth of the input volume. At each position the filter computes a dot product with the patch it covers.

**Stride** is the step size to slide the filter. 

**Dimension**: NxN images, FxF filter size, **output size** is given by: 

$$(N-F) / stride + 1$$
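A quick sketch of the formula for a 7x7 input and a 3x3 filter at different strides. Returning `None` when the filter does not fit evenly is a design choice of this sketch, and `conv_output_size` is a hypothetical name:

```python
def conv_output_size(n, f, stride):
    """Output width for an n x n input and f x f filter, no padding.

    Returns None when the filter does not tile the input evenly.
    """
    if (n - f) % stride != 0:
        return None
    return (n - f) // stride + 1

for stride in (1, 2, 3):
    print(stride, conv_output_size(7, 3, stride))
# 1 5
# 2 3
# 3 None
```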

**Multiple** filters can be used in a layer to generate multiple **activation maps**, e.g. 6 filters would give 6 maps in a layer, which then are fed to ReLU for example.
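A naive, loop-based sketch of this, assuming the image is stored as nested lists `[H][W][D]` and each filter as `[F][F][D]`. It is meant to show the dot product per position and one activation map per filter, not to be efficient; the function names and toy values are illustrative.

```python
def conv_forward(image, filters, biases, stride=1):
    """Naive convolution: image is [H][W][D], each filter is [F][F][D].

    Returns activation maps as [H_out][W_out][K], one channel per filter.
    """
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    f = len(filters[0])
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            pixel = []
            for filt, bias in zip(filters, biases):
                # Dot product between the filter and the patch it covers,
                # summed over all spatial positions and the full depth.
                total = bias
                for di in range(f):
                    for dj in range(f):
                        for c in range(depth):
                            total += filt[di][dj][c] * image[i * stride + di][j * stride + dj][c]
                pixel.append(total)
            row.append(pixel)
        output.append(row)
    return output

def relu(volume):
    return [[[max(0.0, v) for v in pixel] for pixel in row] for row in volume]

# 4x4x1 image with values 0..15 and two 3x3x1 filters (all ones, all minus ones).
image = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
ones = [[[1.0] for _ in range(3)] for _ in range(3)]
neg_ones = [[[-1.0] for _ in range(3)] for _ in range(3)]
maps = relu(conv_forward(image, [ones, neg_ones], [0.0, 0.0]))
print(maps[0][0])  # [45.0, 0.0]
```

Two filters give an output of depth 2; the second map is all zeros here because ReLU clips its negative responses.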

Convolving over a 3D volume is just 2D convolution where each filter also spans the full depth (the 3rd dimension) of the input.


### Padding

**Zero** padding is added to the borders of the input images to **preserve the spatial size** of the activation maps after convolution.

If the filter size is FxF and the stride is 1, then a zero-padding size of $(F-1)/2$ preserves the spatial size.

**Example**: image input 32x32x3, 10 filters of size 5x5x3 with stride 1 and padding 2. The output volume is 32x32x10, since (32 + 2 * 2 - 5)/1 + 1 = 32.

Number of parameters = 5 * 5 * 3 * 10 + 10 = 760, i.e. 10 * (filter weights + 1 bias).

### Dimension Summary

For a Conv Layer, 

Input volume size $W_1 \times H_1 \times D_1$

Requires 4 hyperparameters:

* number of filters, $K$
* filter spatial extent, $F$
* stride, $S$
* the amount of zero padding, $P$

Output volume size $W_2 \times H_2 \times D_2$, where:

$$
\begin{aligned}
W_2 &= (W_1 - F + 2P) / S + 1 \\
H_2 &= (H_1 - F + 2P) / S + 1 \\
D_2 &= K
\end{aligned}
$$

With **parameter sharing**, this introduces $F \times F \times D_1$ weights per filter, for a total of $(F \times F \times D_1) \times K + K$ parameters (weights plus one bias per filter).
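The formulas above can be bundled into one small helper. This is a sketch with a hypothetical function name, checked against the 32x32x3 padding example from earlier in these notes:

```python
def conv_layer_summary(w1, h1, d1, k, f, s, p):
    """Output volume and parameter count for a conv layer with parameter sharing."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights_per_filter = f * f * d1
    total_params = weights_per_filter * k + k  # +k for one bias per filter
    return (w2, h2, k), total_params

# 32x32x3 input, 10 filters of 5x5, stride 1, padding 2.
print(conv_layer_summary(32, 32, 3, k=10, f=5, s=1, p=2))  # ((32, 32, 10), 760)
```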

$K$ is usually chosen as powers of 2.



In each activation map, neurons share **weights** (from filters) and **local connectivity**.

### Pooling Layer

**Max pooling**: use a filter, e.g. 2x2, then take the max of each 2x2 area. **Average pooling** takes the mean of each area instead and can perform comparably, though max pooling is the more common choice.

Input volume size: $W_1 \times H_1 \times D_1$

Requires 2 hyperparameters:

* spatial extent, $F$
* stride, $S$

Output volume size: 

$$
\begin{aligned}
W_2 &= (W_1 - F) / S + 1 \\
H_2 &= (H_1 - F) / S + 1 \\
D_2 &= D_1
\end{aligned}
$$

Not common to use zero-padding for pooling layers.

Pooling **shrinks** the input volume.
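A small sketch of 2x2 max pooling with stride 2 on a single 4x4 activation map. Pooling is applied to each depth slice independently, so one slice is enough to show the idea; `max_pool_2d` and the values are illustrative.

```python
def max_pool_2d(feature_map, f=2, stride=2):
    """Max pooling over one 2-D activation map given as a list of rows."""
    out_h = (len(feature_map) - f) // stride + 1
    out_w = (len(feature_map[0]) - f) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(f)
                for dj in range(f)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool_2d(feature_map))  # [[6, 8], [3, 4]]
```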

### AlexNet Example

Input: 227x227x3 images, (paper says 224, confused everyone...)

First conv layer (CONV1): 96 11x11x3 filters applied at stride 4, output volume size: 55x55x96, (227 - 11)/4 + 1 = 55. 

Total number of params = 11 * 11 * 3 * 96 + 96 ≈ 35k

After the first pooling layer (3x3 filters applied at stride 2), the output size is 27x27x96, i.e. (55 - 3) / 2 + 1 = 27, with 0 params.

### VGG Example

Total memory ~93MB / image for forward pass, ~2x for backward pass.

Total params: ~138M.
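The parameter total can be reproduced by summing over the layers. The sketch below assumes the standard VGG-16 layout (3x3 convolutions with biases, 2x2 max pooling, three fully connected layers, 224x224x3 input). That layout is not spelled out in these notes, so treat it as an assumption.

```python
# VGG-16 layout (assumed): numbers are conv output channels, "M" is 2x2 max pooling.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]

def count_vgg16_params(input_size=224, input_depth=3):
    depth, size, total = input_depth, input_size, 0
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2                               # pooling has no parameters
        else:
            total += 3 * 3 * depth * layer + layer   # 3x3 filters plus biases
            depth = layer
    fan_in = size * size * depth                     # flattened 7x7x512 volume
    for units in VGG16_FC:
        total += fan_in * units + units
        fan_in = units
    return total

print(count_vgg16_params())  # 138357544
```

Most of the total comes from the first fully connected layer (7 * 7 * 512 * 4096 weights), not from the convolutions.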

### ResNet

Much deeper: 152 layers, ~60M parameters. It relies on skip connections: each block is trained to produce a **delta** (residual) that is added to its input. This lets gradients flow back to the first layer easily.
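A heavily simplified sketch of the skip connection, $y = x + F(x)$. Real ResNet blocks use convolutions and batch normalization; the two small fully connected layers here are a stand-in chosen only to keep the example short, and all names are hypothetical.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def residual_block(x, w1, w2):
    """y = x + F(x), where F(x) = w2 @ relu(w1 @ x) is the learned delta."""
    delta = matvec(w2, relu(matvec(w1, x)))
    return [xi + di for xi, di in zip(x, delta)]

x = [1.0, -2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]
# With all-zero weights the delta vanishes and the block is the identity.
print(residual_block(x, zero, zero))  # [1.0, -2.0, 3.0]
```

Because the input is added back unchanged, the block only has to learn a correction, and the addition gives gradients a direct path around the weights.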

Very rapid spatial reduction early on, then relies on many layers.

Trained on 8 GPUs for 2 weeks...

As of 2017, Inception-V4 (ResNet + Inception) has the best top-1 accuracy of ~80%. VGG has the highest memory and compute usage. See this [video](https://www.youtube.com/watch?v=DAOcjicFr1Y&index=9&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv)

ResNeXt was published later.

### DenseNet

Dense blocks where each layer is connected to every other layer in a feedforward fashion.

Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse.
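A toy sketch of the connectivity pattern inside one dense block, assuming each layer receives the concatenation of the block input and all earlier layer outputs. The "layers" here are plain functions on flat lists, purely for illustration; real dense blocks concatenate feature maps along the channel dimension.

```python
def dense_block(x, layers):
    """Each layer sees the concatenation of the block input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for feat in features for v in feat]
        features.append(layer(concatenated))
    # The block output is everything concatenated, which is what enables feature reuse.
    return [v for feat in features for v in feat]

def toy_layer(inputs):
    # Hypothetical layer producing a single new feature: the sum of its inputs.
    return [sum(inputs)]

print(dense_block([1.0, 2.0], [toy_layer, toy_layer, toy_layer]))
# [1.0, 2.0, 3.0, 6.0, 12.0]
```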