# Case studies

## Classic networks

**LeNet5**
* [paper](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
* goal: recognize handwritten digits in grayscale images
* building blocks
    * input layer (32 x 32 x 1)
    * CNN1
        * 6 filters with size (5 x 5), stride 1 and no padding
        * output (28 x 28 x 6)
        * average-pooling, filter (2 x 2), stride 2
        * output (14 x 14 x 6)
    * CNN2
        * 16 filters with size (5 x 5), stride 1 and no padding
        * output (10 x 10 x 16)
        * average-pooling, filter (2 x 2), stride 2
        * output (5 x 5 x 16)
        * flatten to 400 units
    * FC1
        * 120 units
    * FC2
        * 84 units
    * output layer
        * 10 units
        * softmax activation
        * the original implementation used Euclidean radial basis function (RBF) units here instead of softmax

Summary  
* ~60k params
* with depth, $n_h$ and $n_w$ decrease while $n_c$ increases
* layer pattern: conv + pool, conv + pool, FC, FC, output
* sigmoid and tanh activations; ReLU was not used
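
The shapes and the parameter count listed above can be checked with a short sketch. The helper names (`conv_out`, `conv_params`, `fc_params`) are made up for this note. The count assumes every CNN2 filter sees all 6 input channels and every layer has a bias; the original paper used a sparser connection scheme in that layer, so its total differs slightly.

```python
def conv_out(n, f, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * pad - f) // stride + 1


def conv_params(f, c_in, c_out):
    """Weights plus one bias per filter (assumes full connectivity to all input channels)."""
    return f * f * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


n = 32
n = conv_out(n, 5)            # CNN1 conv -> 28
n = conv_out(n, 2, stride=2)  # CNN1 pool -> 14
n = conv_out(n, 5)            # CNN2 conv -> 10
n = conv_out(n, 2, stride=2)  # CNN2 pool -> 5
flat = n * n * 16             # flatten -> 400

lenet_total = (
    conv_params(5, 1, 6)
    + conv_params(5, 6, 16)
    + fc_params(flat, 120)
    + fc_params(120, 84)
    + fc_params(84, 10)
)
print(flat, lenet_total)  # 400 61706
```

Almost all of the parameters sit in FC1 (400 x 120 weights); the two conv layers together contribute fewer than 3k.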

**AlexNet**
* [paper](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf)
* building blocks
    * input layer (227 x 227 x 3)
    * CNN1
        * 96 filters with size (11 x 11), stride 4 and no padding
        * output (55 x 55 x 96)
        * max-pooling, filter (3 x 3), stride 2
        * output (27 x 27 x 96)
    * CNN2
        * 256 filters with size (5 x 5), same padding
        * output (27 x 27 x 256)
        * max-pooling, filter (3 x 3), stride 2
        * output (13 x 13 x 256)
    * CNN3-5
        * 384 filters with size (3 x 3), same padding
        * output (13 x 13 x 384)
        * 384 filters with size (3 x 3), same padding
        * output (13 x 13 x 384)
        * 256 filters with size (3 x 3), same padding
        * output (13 x 13 x 256)
        * max-pooling, filter (3 x 3), stride 2
        * output (6 x 6 x 256)
        * flatten to 9216 units
    * FC1
        * 4096 units
    * FC2
        * 4096 units
    * output layer
        * 1000 units
        * softmax activation

Summary  
* ~60M params
* trained on the large ImageNet dataset
* ReLU activations
* training split across multiple GPUs (two in the paper)
* local response normalization (LRN) layer: normalizes each spatial position across neighboring channels; it does not seem to help much and later networks mostly dropped it
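
The AlexNet output shapes above follow from the same output-size formula, $\lfloor (n + 2p - f) / s \rfloor + 1$. The sketch below traces them layer by layer. The names `out_size`, `alexnet_layers` and `trace_shapes` are illustrative, and "same padding" is written out as $p = (f - 1) / 2$ for stride 1.

```python
def out_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer."""
    return (n + 2 * pad - f) // stride + 1


# (name, filter size, stride, padding, output channels).
# Pooling layers keep the channel count, marked here with None.
alexnet_layers = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),   # same padding for a 5 x 5 filter: p = 2
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),   # same padding for a 3 x 3 filter: p = 1
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]


def trace_shapes(n, c, layers):
    """Return (name, height, width, channels) after each layer for a square input."""
    shapes = []
    for name, f, s, p, c_out in layers:
        n = out_size(n, f, s, p)
        c = c if c_out is None else c_out
        shapes.append((name, n, n, c))
    return shapes


alexnet_shapes = trace_shapes(227, 3, alexnet_layers)
for name, h, w, c in alexnet_shapes:
    print(f"{name}: {h} x {w} x {c}")

_, h, w, c = alexnet_shapes[-1]
flat_units = h * w * c  # 6 * 6 * 256 = 9216
```

With a 224 x 224 input and no padding, the first layer would give $(224 - 11) / 4 + 1 = 54.25$, which is not an integer; the 227 x 227 input used above makes the division exact.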

**VGG16**
* [paper](https://arxiv.org/pdf/1409.1556)
* 16 layers with params (13 conv + 3 FC)
* building blocks
    * input layer (224 x 224 x 3)
    * CNN1-2
        * 64 filters with size (3 x 3), stride 1, same padding
        * output (224 x 224 x 64)
        * 64 filters with size (3 x 3), stride 1, same padding
        * output (224 x 224 x 64)
        * max-pooling, filter (2 x 2), stride 2
        * output (112 x 112 x 64)
    * CNN3-4
        * 128 filters with size (3 x 3), stride 1, same padding
        * output (112 x 112 x 128)
        * 128 filters with size (3 x 3), stride 1, same padding
        * output (112 x 112 x 128)
        * max-pooling, filter (2 x 2), stride 2
        * output (56 x 56 x 128)
    * CNN5-7
        * 256 filters with size (3 x 3), stride 1, same padding
        * output (56 x 56 x 256)
        * 256 filters with size (3 x 3), stride 1, same padding
        * output (56 x 56 x 256)
        * 256 filters with size (3 x 3), stride 1, same padding
        * output (56 x 56 x 256)
        * max-pooling, filter (2 x 2), stride 2
        * output (28 x 28 x 256)
    * CNN8-10
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (28 x 28 x 512)
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (28 x 28 x 512)
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (28 x 28 x 512)
        * max-pooling, filter (2 x 2), stride 2
        * output (14 x 14 x 512)
    * CNN11-13
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (14 x 14 x 512)
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (14 x 14 x 512)
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (14 x 14 x 512)
        * max-pooling, filter (2 x 2), stride 2
        * output (7 x 7 x 512)
        * flatten to 25088 units
    * FC1-2
        * 4096 units
        * ReLU activation
        * 4096 units
        * ReLU activation
    * output layer
        * 1000 units
        * softmax activation

Summary  
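* ~138M params
* uniform architecture: every conv layer is (3 x 3), stride 1, same padding; every pooling layer is (2 x 2), stride 2
* each pooling step halves $n_h$ and $n_w$; $n_c$ doubles from block to block (64, 128, 256, 512) and then stays at 512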
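
Because the architecture is so regular, the whole network fits in a short configuration list, and the parameter count can be reproduced from it. This is a minimal sketch: `vgg16_cfg` and `count_vgg_params` are names invented for this note, and the count assumes one bias per filter and per FC unit.

```python
# Output channels of the 13 conv layers; "M" marks a (2 x 2), stride 2 max-pooling step.
vgg16_cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def count_vgg_params(cfg, in_size=224, in_channels=3, fc_units=(4096, 4096, 1000)):
    """Return (total params, flattened units) for a VGG-style stack of 3 x 3 same convs."""
    size, channels, total = in_size, in_channels, 0
    for item in cfg:
        if item == "M":
            size //= 2  # pooling halves n_h and n_w, has no params
        else:
            total += 3 * 3 * channels * item + item  # weights + biases
            channels = item  # same padding keeps n_h and n_w unchanged
    flat = size * size * channels
    n_in = flat
    for n_out in fc_units:
        total += n_in * n_out + n_out
        n_in = n_out
    return total, flat


vgg16_params, vgg16_flat = count_vgg_params(vgg16_cfg)
print(vgg16_flat, vgg16_params)  # 25088 138357544
```

The first FC layer alone accounts for roughly 103M of the ~138M parameters (25088 x 4096 weights), while the 13 conv layers together hold under 15M.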


## ResNet

* [paper](https://arxiv.org/pdf/1512.03385)

## Inception