# CNN Architectures

## Case Studies 
The following are the main architectures that are widely used in research today.

### LeNet-5 [LeCun et al. 1998]
This architecture was applied very successfully to handwritten digit recognition in the late 1990s.

CONV - POOL - CONV - POOL - FC - FC

Each convolution filter was 5x5 with a stride of 1.

### AlexNet [Krizhevsky et al. 2012]
AlexNet was the first large-scale convolutional neural network to do well on the ImageNet classification task.

CONV - MAX POOL - NORM - CONV - MAX POOL - NORM - CONV - CONV - CONV - MAX POOL - FC - FC - FC

The input has shape (N, 227, 227, 3). The first convolution layer applies 96 11x11 filters at `stride=4`, so its output has shape (N, 55, 55, 96) and the layer has about 35,000 parameters (96 x 11 x 11 x 3 weights plus 96 biases).
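The first-layer numbers can be checked with the standard convolution arithmetic. This is a minimal sketch: the helper names `conv_output_size` and `conv_param_count` are illustrative, and each filter is assumed to carry one bias term.

```python
def conv_output_size(input_size, filter_size, stride, pad=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    span = input_size - filter_size + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1


def conv_param_count(num_filters, filter_size, in_depth, bias=True):
    """F * F * depth weights per filter, plus an optional bias, times the filter count."""
    per_filter = filter_size * filter_size * in_depth + (1 if bias else 0)
    return num_filters * per_filter


out = conv_output_size(227, 11, stride=4)   # (227 - 11) / 4 + 1
params = conv_param_count(96, 11, 3)        # 96 * (11 * 11 * 3 + 1)
print(out, params)                          # 55 34944
```

The exact count of 34,944 is what the text rounds to 35,000.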

|Layer  |Output Dim.| Filter   | Stride | Pad | 
|-------|-----------|----------|--------|-----|
|Input  | 227x227x3 |          |        |     |
|CONV   | 55x55x96  | 96 11x11 | 4      | 0   |
|MAXPOOL| 27x27x96  | 3x3      | 2      |     |
|NORM   | 27x27x96  |          |        |     |
|CONV   | 27x27x256 | 256 5x5  | 1      | 2   |
|MAXPOOL| 13x13x256 | 3x3      | 2      |     |
|NORM   | 13x13x256 |          |        |     |
|CONV   | 13x13x384 | 384 3x3  | 1      | 1   | 
|CONV   | 13x13x384 | 384 3x3  | 1      | 1   |
|CONV   | 13x13x256 | 256 3x3  | 1      | 1   |
|MAXPOOL| 6x6x256   | 3x3      | 2      |     |
|FC     | 4096      |          |        |     |
|FC     | 4096      |          |        |     |
|FC     | 1000      |          |        |     |
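The spatial sizes in the table can be reproduced by applying the same output-size formula layer by layer. The sketch below assumes that the pooling layers use no padding (the table leaves that column blank) and skips the NORM layers, which keep the shape unchanged. The names `output_size`, `LAYERS`, and `trace_shapes` are illustrative.

```python
def output_size(input_size, filter_size, stride, pad=0):
    """Spatial size after a conv or pool layer: (W - F + 2P) // S + 1."""
    return (input_size - filter_size + 2 * pad) // stride + 1


# (layer, number of filters or None for pooling, filter size, stride, pad),
# copied from the table; the pad of 0 for MAXPOOL is an assumption.
LAYERS = [
    ("CONV", 96, 11, 4, 0),
    ("MAXPOOL", None, 3, 2, 0),
    ("CONV", 256, 5, 1, 2),
    ("MAXPOOL", None, 3, 2, 0),
    ("CONV", 384, 3, 1, 1),
    ("CONV", 384, 3, 1, 1),
    ("CONV", 256, 3, 1, 1),
    ("MAXPOOL", None, 3, 2, 0),
]


def trace_shapes(size=227, depth=3):
    """Return (layer name, (height, width, depth)) after each layer."""
    shapes = []
    for name, filters, f, stride, pad in LAYERS:
        size = output_size(size, f, stride, pad)
        # Pooling keeps the depth; a conv layer sets it to its filter count.
        depth = filters if filters is not None else depth
        shapes.append((name, (size, size, depth)))
    return shapes


shapes = trace_shapes()
for name, shape in shapes:
    print(f"{name:8s}{shape}")
```

The last entry is (6, 6, 256), which is the volume flattened into the first FC layer.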

### VGGNet
The idea is to use smaller filters in deeper networks. AlexNet had 8 layers, while VGG has 16 to 19 layers (the VGG16 and VGG19 variants). Every convolution filter is 3x3 with a stride of 1 and padding of 1, and every max pooling layer is 2x2 with a stride of 2.

#### Why use smaller filters (3x3)? 
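A stack of three 3x3 convolution layers with stride 1 has the same effective receptive field as one 7x7 convolution layer, but it is deeper, has more non-linearities, and uses fewer parameters: 3 x (3 x 3 x C x C) = 27C^2 weights versus 7 x 7 x C x C = 49C^2 for C channels per layer. The following diagram illustrates the receptive field: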

Assume we are looking at one dimension, along the x-axis, and sliding a filter of width 3 across 7 pixels with `stride = 1`:
```
1st Layer: = = = = = = =  
2nd Layer:   = = = = =
3rd Layer:     = = = 
Final:           =
```

This covers the same 7 pixels as a single filter of width 7:
```
1st Layer: = = = = = = =
Final:           =
```
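The same result can be computed rather than drawn. A minimal sketch, with illustrative function names: each layer widens the receptive field by (F - 1) steps, scaled by the product of the strides below it. The weight comparison assumes that every layer maps C channels to C channels and ignores biases; C = 64 is an arbitrary choice.

```python
def effective_receptive_field(filter_sizes, strides=None):
    """Receptive field, on the input, of one output unit of a stack of conv layers."""
    strides = strides or [1] * len(filter_sizes)
    field, jump = 1, 1
    for f, s in zip(filter_sizes, strides):
        field += (f - 1) * jump   # each layer widens the field by (F - 1) input steps
        jump *= s                 # spacing between adjacent units, measured on the input
    return field


def stack_weight_count(filter_sizes, channels):
    """Weights (no biases) when every layer maps `channels` to `channels`."""
    return sum(f * f * channels * channels for f in filter_sizes)


print(effective_receptive_field([3, 3, 3]))  # 7
print(effective_receptive_field([7]))        # 7

C = 64  # assumed channel count, for illustration only
print(stack_weight_count([3, 3, 3], C), stack_weight_count([7], C))
```

For any C, the three-layer stack uses 27C^2 weights against 49C^2 for the single 7x7 layer.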

#### VGG 16
* Input
* 64 3x3 conv
* 64 3x3 conv
* Pool
* 128 3x3 conv
* 128 3x3 conv
* Pool
* 256 3x3 conv
* 256 3x3 conv
* 256 3x3 conv
* Pool
* 512 3x3 conv
* 512 3x3 conv
* 512 3x3 conv
* Pool
* 512 3x3 conv
* 512 3x3 conv
* 512 3x3 conv
* Pool
* FC 4096
* FC 4096
* FC 1000
* Softmax

### GoogLeNet
A deeper network designed for computational efficiency.

* 22 layers
* Efficient "Inception" module
    * Design a good local network topology (network within network) and then stack these modules on top of each other
* No FC layers
* Only 5 million parameters, 12x fewer than AlexNet!
* Only 6.7% top-5 error on the ImageNet task

However, a naive inception module is computationally expensive. For example, suppose an input of 28x28x256 goes into an inception module that has 1x1 conv, 3x3 conv, 5x5 conv, and 3x3 pool branches. These branches produce outputs of 28x28x128, 28x28x192, 28x28x96, and 28x28x256 respectively, with zero padding used to maintain the spatial dimensions. The outputs are then concatenated depth-wise, so the depth blows up to 128 + 192 + 96 + 256 = 672.

**Solution**: Use bottleneck layers, which are 1x1 conv layers inserted before the 3x3 conv and 5x5 conv. Each 1x1 conv layer projects the feature depth to a lower dimension. For example, a 1x1 conv with 64 filters on a 28x28x256 input produces an output of 28x28x64.
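To see how much the bottleneck saves, one can count the multiplications in the example module with and without it. This is a rough sketch under stated assumptions: both 1x1 reductions use 64 filters (taken from the example above, not from the published network), pooling is treated as free, and `conv_multiplies` is an illustrative helper.

```python
H, W, IN_DEPTH = 28, 28, 256


def conv_multiplies(num_filters, filter_size, in_depth, height=H, width=W):
    """Multiplications for a padded, stride-1 conv: one F*F*depth dot product per output value."""
    return height * width * num_filters * filter_size * filter_size * in_depth


# Naive module: every conv branch reads the full 256-deep input.
naive_depth = 128 + 192 + 96 + IN_DEPTH   # the pool branch keeps the input depth
naive_ops = (
    conv_multiplies(128, 1, IN_DEPTH)
    + conv_multiplies(192, 3, IN_DEPTH)
    + conv_multiplies(96, 5, IN_DEPTH)
)

# Bottleneck module: a 64-filter 1x1 conv in front of the 3x3 and 5x5 branches.
REDUCED = 64  # assumption: same reduction for both branches
bottleneck_ops = (
    conv_multiplies(128, 1, IN_DEPTH)
    + conv_multiplies(REDUCED, 1, IN_DEPTH) + conv_multiplies(192, 3, REDUCED)
    + conv_multiplies(REDUCED, 1, IN_DEPTH) + conv_multiplies(96, 5, REDUCED)
)

print(naive_depth)             # 672
print(f"{naive_ops:,}")        # 854,196,224
print(f"{bottleneck_ops:,}")   # 258,506,752
```

In this sketch the concatenated depth stays the same; only the cost of the 3x3 and 5x5 branches drops, to roughly 30% of the naive count.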

![inception](inception.png)

Now you just stack all these inception modules together to create a deep network.

### ResNet
An extremely deep network that uses residual connections.

What happens when we continue stacking deeper layers on a *plain* convolutional neural network?
The network **DOES NOT** perform better!

![deepstack](deepstack.png)

This is not caused by overfitting, since the deeper network also has higher training error. It is an optimization problem: a deeper network is simply harder to optimize. In principle, the deeper model should perform at least as well as the shallower one, because a solution exists by construction: copy the learned layers from the shallower model and set the additional layers to identity mappings. ResNet's answer is to have the network layers fit a residual mapping F(x) = H(x) - x instead of directly fitting the desired underlying mapping H(x), so that each block outputs F(x) + x.
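The residual idea can be shown with a toy block in plain Python. This sketch is only illustrative: `make_linear_relu` is a hypothetical stand-in for the block's conv layers, and vectors are plain lists. The point is that when the layers output zero, the block falls back to the identity mapping, which is the solution by construction mentioned above.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the skip connection")
    return [f + xi for f, xi in zip(fx, x)]


def make_linear_relu(weights):
    """Toy stand-in for the block's layers: ReLU(W @ x) with a list-of-lists W."""
    def layer(x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return layer


x = [1.0, -2.0, 3.0]

# With all-zero weights F(x) = 0, so the block reduces to the identity mapping.
zero_layer = make_linear_relu([[0.0] * 3 for _ in range(3)])
print(residual_block(x, zero_layer))   # [1.0, -2.0, 3.0]
```

A plain (non-residual) layer with zero weights would output all zeros instead, so it has to learn the identity explicitly.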

Training ResNet in practice:

* Batch normalization after every CONV layer
* Xavier/2 initialization (He et al.)
* SGD with momentum 0.9
* Learning rate starts at 0.1 and is divided by 10 when the validation error plateaus
* Mini-batch size is 256
* Weight decay is 1e-4
* No dropout used
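The learning-rate rule in the list above can be sketched as a small scheduler. The notes do not say how a plateau is detected, so `patience` and `min_delta` are hypothetical parameters, and `PlateauSchedule` is an illustrative name rather than a library class.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` when validation error stops improving."""

    def __init__(self, lr=0.1, factor=10.0, patience=3, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience      # hypothetical: stale epochs to wait before cutting
        self.min_delta = min_delta    # hypothetical: smallest drop that counts as progress
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_error):
        """Record one epoch's validation error and return the learning rate to use next."""
        if val_error < self.best - self.min_delta:
            self.best = val_error
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor
                self.stale_epochs = 0
        return self.lr


schedule = PlateauSchedule()
errors = [0.60, 0.45, 0.40, 0.40, 0.40, 0.40, 0.39]  # made-up validation errors
rates = [schedule.step(e) for e in errors]
print(rates)
```

With these made-up errors, the rate stays at 0.1 until the third epoch in a row without improvement, then drops by a factor of 10.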
