# Case studies

## Classic networks

**LeNet5**
* [paper](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
* goal: recognize handwritten digits from grayscale images
* architecture
    * input layer (32 x 32 x 1)
    * CNN1
        * 6 filters with size (5 x 5), stride 1 and no padding
        * output (28 x 28 x 6)
        * average-pooling, filter (2 x 2), stride 2
        * output (14 x 14 x 6)
    * CNN2
        * 16 filters with size (5 x 5), stride 1 and no padding
        * output (10 x 10 x 16)
        * average-pooling, filter (2 x 2), stride 2
        * output (5 x 5 x 16)
        * flatten to 400 units
    * FC1
        * 120 units
    * FC2
        * 84 units
    * output layer
        * 10 units
        * softmax activation
        * the original implementation used a different output layer: Euclidean radial basis function (RBF) units

Summary  
* ~60k parameters
* with depth, $n_h$ and $n_w$ decrease while $n_c$ increases
* pattern: conv + pool, conv + pool, FC, FC, output
* sigmoid and tanh activations, no ReLU
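
The ~60k figure can be checked by counting weights and biases per layer. This is a minimal sketch, assuming every filter sees all input channels and pooling has no trainable parameters (the original paper differs slightly on both points); the helper names are made up for illustration.

```python
def conv_params(f, c_in, c_out):
    """Weights plus one bias per filter for an (f x f) convolution."""
    return (f * f * c_in + 1) * c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully-connected layer."""
    return (n_in + 1) * n_out


lenet5_params = {
    "CNN1": conv_params(5, 1, 6),        # 156
    "CNN2": conv_params(5, 6, 16),       # 2,416
    "FC1": fc_params(5 * 5 * 16, 120),   # 48,120
    "FC2": fc_params(120, 84),           # 10,164
    "output": fc_params(84, 10),         # 850
}
lenet5_total = sum(lenet5_params.values())
print(lenet5_params, lenet5_total)       # total is 61,706, i.e. roughly 60k
```

Most of the parameters sit in FC1, even in a network this small.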

**AlexNet**
* [paper](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf)
* architecture
    * input layer (227 x 227 x 3)
    * CNN1
        * 96 filters with size (11 x 11), stride 4
        * output (55 x 55 x 96)
        * max-pooling, filter (3 x 3), stride 2
        * output (27 x 27 x 96)
    * CNN2
        * 256 filters with size (5 x 5), same padding
        * output (27 x 27 x 256)
        * max-pooling, filter (3 x 3), stride 2
        * output (13 x 13 x 256)
    * CNN3-5
        * 384 filters with size (3 x 3), same padding
        * output (13 x 13 x 384)
        * 384 filters with size (3 x 3), same padding
        * output (13 x 13 x 384)
        * 256 filters with size (3 x 3), same padding
        * output (13 x 13 x 256)
        * max-pooling, filter (3 x 3), stride 2
        * output (6 x 6 x 256)
        * flatten to 9216 units
    * FC1
        * 4096 units
    * FC2
        * 4096 units
    * output layer
        * 1000 units
        * softmax activation

Summary  
* ~60m parameters
* trained on the large ImageNet dataset
* ReLU activations
* training distributed across multiple GPUs
* local response normalization (LRN) layer, which normalizes each spatial position across neighbouring channels; it does not seem to help much
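
The spatial sizes in the AlexNet listing follow from the usual output-size formula $\lfloor (n + 2p - f)/s \rfloor + 1$. The sketch below traces them, assuming "same" padding means $p = (f - 1)/2$ at stride 1; the function and variable names are illustrative only.

```python
def out_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * pad - f) // stride + 1


# (name, filter size, stride, padding)
alexnet_layers = [
    ("CNN1 conv", 11, 4, 0),
    ("CNN1 pool", 3, 2, 0),
    ("CNN2 conv", 5, 1, 2),
    ("CNN2 pool", 3, 2, 0),
    ("CNN3 conv", 3, 1, 1),
    ("CNN4 conv", 3, 1, 1),
    ("CNN5 conv", 3, 1, 1),
    ("CNN5 pool", 3, 2, 0),
]

size = 227
trace = []
for name, f, s, p in alexnet_layers:
    size = out_size(size, f, s, p)
    trace.append((name, size))

flat_units = size * size * 256           # channels after the last conv layer
print(trace)                             # sizes 55, 27, 27, 13, 13, 13, 13, 6
print(flat_units)                        # 9216
```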

**VGG16**
* [paper](https://arxiv.org/pdf/1409.1556)
* 16 layers with parameters (13 conv + 3 FC)
* architecture
    * input layer (224 x 224 x 3)
    * CNN1-2
        * 64 filters with size (3 x 3), stride 1, same padding
        * output (224 x 224 x 64)
        * 64 filters with size (3 x 3), stride 1, same padding
        * output (224 x 224 x 64)
        * max-pooling, filter (2 x 2), stride 2
        * output (112 x 112 x 64)
    * CNN3-4
        * 128 filters with size (3 x 3), stride 1, same padding
        * output (112 x 112 x 128)
        * 128 filters with size (3 x 3), stride 1, same padding
        * output (112 x 112 x 128)
        * max-pooling, filter (2 x 2), stride 2
        * output (56 x 56 x 128)
    * CNN5-7
        * 256 filters with size (3 x 3), stride 1, same padding
        * output (56 x 56 x 256)       
        * 256 filters with size (3 x 3), stride 1, same padding
        * output (56 x 56 x 256)     
        * 256 filters with size (3 x 3), stride 1, same padding
        * output (56 x 56 x 256)
        * max-pooling, filter (2 x 2), stride 2
        * output (28 x 28 x 256)
    * CNN8-10
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (28 x 28 x 512)
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (28 x 28 x 512)
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (28 x 28 x 512)
        * max-pooling, filter (2 x 2), stride 2
        * output (14 x 14 x 512)
    * CNN11-13
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (14 x 14 x 512)              
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (14 x 14 x 512)            
        * 512 filters with size (3 x 3), stride 1, same padding
        * output (14 x 14 x 512)    
        * max-pooling, filter (2 x 2), stride 2
        * output (7 x 7 x 512)
        * flatten to 25088 units
    * FC1-2
        * 4096 units
        * ReLU activation
        * 4096 units
        * ReLU activation
    * output layer
        * 1000 units
        * softmax activation

Summary  
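* ~138m parameters
* uniform architecture: only (3 x 3) convolutions with stride 1 and same padding, and (2 x 2) max-pooling with stride 2
* with depth, $n_h$ and $n_w$ halve at each pooling step, while $n_c$ doubles from block to block (64 -> 128 -> 256 -> 512) and then stays at 512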
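
Because the architecture is so regular, the ~138m parameter count can be reproduced from a short channel plan. The sketch below assumes biases on every layer; the `vgg16_cfg` list is an illustrative encoding of the listing above, not something taken from the paper.

```python
# Channel plan of the 13 convolutional layers; "M" marks a (2 x 2) max-pool.
vgg16_cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

size, c_in, conv_total = 224, 3, 0
for item in vgg16_cfg:
    if item == "M":
        size //= 2                                # pooling halves n_h and n_w
    else:
        conv_total += (3 * 3 * c_in + 1) * item   # (3 x 3) filters plus biases
        c_in = item

fc_total = 0
n_in = size * size * c_in                         # 7 * 7 * 512 = 25088
for n_out in (4096, 4096, 1000):
    fc_total += (n_in + 1) * n_out
    n_in = n_out

vgg16_total = conv_total + fc_total
print(conv_total, fc_total, vgg16_total)          # 14714688 123642856 138357544
```

Under these assumptions close to 90% of the parameters belong to the three fully-connected layers, with FC1 alone contributing about 103m.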


## ResNet

* [paper](https://arxiv.org/pdf/1512.03385)
* vanishing and exploding gradients in very deep networks are addressed through skip connections
* residual block
    * activations in consecutive layers $a^{[l]}$, $a^{[l+1]}$, $a^{[l+2]}$,...
    * main path
        * $z^{[l+1]} = W^{[l+1]}\cdot a^{[l]} + b^{[l+1]}$
        * $a^{[l+1]} = g(z^{[l+1]})$, apply ReLU
        * $z^{[l+2]} = W^{[l+2]}\cdot a^{[l+1]} + b^{[l+2]}$
        * $a^{[l+2]} = g(z^{[l+2]})$, apply ReLU
    * shortcut path (skip connection)
        * $a^{[l+2]} = g(z^{[l+2]}+a^{[l]})$, i.e. $a^{[l]}$ is added before the final ReLU, replacing the last step of the main path
        * feeds the signal from the earlier activation directly into the deeper layer
    * allows for training very deep networks
* a ResNet is built by stacking residual blocks on top of each other
* plain nets: in practice, beyond a certain depth, adding layers makes the training error go up
* res-nets: the training error keeps decreasing as the network gets deeper
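
The residual block equations above translate almost line by line into NumPy. This is a minimal sketch using fully-connected layers and random weights; the layer width and all variable names are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(a_l, W1, b1, W2, b2):
    """Two dense layers with a skip connection from a^{[l]} to a^{[l+2]}."""
    z1 = W1 @ a_l + b1          # z^{[l+1]}
    a1 = relu(z1)               # a^{[l+1]}
    z2 = W2 @ a1 + b2           # z^{[l+2]}
    return relu(z2 + a_l)       # shortcut is added before the final ReLU


rng = np.random.default_rng(0)
n_units = 4                     # hypothetical layer width
a_l = relu(rng.normal(size=n_units))
W1, W2 = rng.normal(size=(2, n_units, n_units))
b1, b2 = np.zeros(n_units), np.zeros(n_units)

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2.shape)               # (4,)
```

The only difference from a plain two-layer stack is the `+ a_l` term inside the last activation.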


* architecture (34-layer variant from the paper)
    * input layer
    * CNN1
        * 64 filters with size (7 x 7), stride 2, padded so that $n_h$ and $n_w$ are halved
        * max-pool layer, stride 2
    * CNN2-7
        * 64 filters with size (3 x 3), stride 1, same padding
        * skip connections two layers apart (identity shortcuts)
    * CNN8-15
        * first layer uses stride 2 instead of a pooling layer, first skip connection scaled with $W_s$
        * 128 filters with size (3 x 3), stride 1 (except the first layer), same padding
        * skip connections two layers apart
    * CNN16-27
        * first layer uses stride 2 instead of a pooling layer, first skip connection scaled with $W_s$
        * 256 filters with size (3 x 3), stride 1 (except the first layer), same padding
        * skip connections two layers apart
    * CNN28-33
        * first layer uses stride 2 instead of a pooling layer, first skip connection scaled with $W_s$
        * 512 filters with size (3 x 3), stride 1 (except the first layer), same padding
        * skip connections two layers apart
        * ending with a (global) average pooling layer
    * output layer
        * 1000 units
        * softmax activation

**Intuition example** 
* $x$ -> |big NN| -> $a^{[l]}$ (plain big NN)
* $x$ -> |big NN| -> $a^{[l]}$ -> FC1 -> FC2 -> $a^{[l+2]}$ (res-NN)
    * skip connection between $a^{[l]}$ & $a^{[l+2]}$
    * ReLU activations, thus $a \geq 0$
    * $a^{[l+2]} = g(z^{[l+2]}+a^{[l]})$ (skip connection)
    * $g(z^{[l+2]}+a^{[l]}) = g(W^{[l+2]}\cdot a^{[l+1]} + b^{[l+2]} + a^{[l]})$
        * if weight decay (L2 regularization) is applied, it may shrink $W^{[l+2]}$ and $b^{[l+2]}$ towards zero
        * then $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$, because $a^{[l]} \geq 0$
        * the identity function is easy to learn with a residual block!
    * for the core operation $a^{[l+2]} = g(z^{[l+2]}+a^{[l]})$ to work, $z^{[l+2]}$ and $a^{[l]}$ must have the same dimensions
        * this is why same padding is used in the CNN layers
        * otherwise a scaling matrix is added, changing the operation to $a^{[l+2]} = g(z^{[l+2]}+W_{s}\cdot a^{[l]})$, where $W_{s}$ can be learned or fixed
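
Both points of the intuition example can be checked numerically: with $W^{[l+2]}$ and $b^{[l+2]}$ at zero the block returns $a^{[l]}$ unchanged, and a projection $W_s$ handles a dimension mismatch. A sketch with made-up sizes and random values; `skip_output` is a hypothetical helper, not part of any library.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def skip_output(a_l, a_l1, W2, b2, W_s=None):
    """a^{[l+2]} = g(W2 a^{[l+1]} + b2 + shortcut), with shortcut a^{[l]} or W_s a^{[l]}."""
    shortcut = a_l if W_s is None else W_s @ a_l
    return relu(W2 @ a_l1 + b2 + shortcut)


rng = np.random.default_rng(1)
a_l = relu(rng.normal(size=5))           # a >= 0 because it comes out of a ReLU
a_l1 = relu(rng.normal(size=5))

# Weights shrunk all the way to zero: the block reduces to the identity.
identity_out = skip_output(a_l, a_l1, np.zeros((5, 5)), np.zeros(5))

# Dimension mismatch (5 -> 3 units): a projection W_s makes the sum well-defined.
W2_small = rng.normal(size=(3, 5))
W_s = rng.normal(size=(3, 5))            # learned in practice; random here for illustration
projected_out = skip_output(a_l, a_l1, W2_small, np.zeros(3), W_s)

print(np.allclose(identity_out, a_l), projected_out.shape)   # True (3,)
```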


## Inception

**Inception building block - network in network**
* [paper](https://arxiv.org/pdf/1312.4400)
* (1 x 1) convolution
* for a single-channel input, a (1 x 1) filter just scales the input by a constant
* for a multi-channel input (6 x 6 x 32), each (1 x 1 x 32) filter takes a weighted sum over the 32 channels at every position and applies an activation; with $n_c'$ filters the output is (6 x 6 x $n_c'$)
    * this can be seen as a fully-connected layer across channels applied at each position (hence "network in network")
* useful for shrinking the number of channels (as opposed to pooling, which shrinks height and width)
* when the number of channels is kept the same, it just adds non-linearity and lets the network learn more complex patterns
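
The "fully-connected across channels" view can be made concrete in a few lines of NumPy. The sketch assumes 16 filters (an arbitrary choice to show the channel reduction) and a ReLU activation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 32))          # input volume (n_h, n_w, n_c)
n_filters = 16                           # hypothetical number of (1 x 1 x 32) filters
W = rng.normal(size=(32, n_filters))     # one column per filter
b = np.zeros(n_filters)

# (1 x 1) convolution: the same dense layer applied at every spatial position.
y = np.maximum(np.einsum("hwc,cf->hwf", x, W) + b, 0.0)

# Check one position against an explicit fully-connected computation.
pixel = np.maximum(x[2, 3] @ W + b, 0.0)
print(y.shape, np.allclose(y[2, 3], pixel))   # (6, 6, 16) True
```

Height and width are untouched; only the channel dimension changes from 32 to the number of filters.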

**Motivation**
* [paper](https://arxiv.org/pdf/1409.4842)
* choosing the layer type and filter size by hand is hard, so why not apply several operations in parallel and let the network learn which ones to use

**Intuition example**
* inception block
    * input (28 x 28 x 192)
    * applying several operations in parallel to the same input
        * 64 filters with size (1 x 1), stride 1, output (28 x 28 x 64)
        * 128 filters with size (3 x 3), stride 1, same padding, output (28 x 28 x 128)
        * 32 filters with size (5 x 5), stride 1, same padding, output (28 x 28 x 32)
        * max-pool layer, stride 1, same padding, output (28 x 28 x 32) (pooling alone keeps all 192 channels; a (1 x 1) convolution reduces them to 32)
    * output (28 x 28 x 256)

* computational cost example
    * input (28 x 28 x 192), 32 filters with size (5 x 5), stride 1, same padding, output (28 x 28 x 32)
    * each of the 28 * 28 * 32 output values needs 5 * 5 * 192 multiplications, thus ~120m multiplications
    
* optimized example
    * input (28 x 28 x 192)
    * 16 filters with size (1 x 1), output (28 x 28 x 16)
    * 32 filters with size (5 x 5), stride 1, same padding, output (28 x 28 x 32)
    * computational cost: 28 * 28 * 16 * 192 ~ 2.4m for the (1 x 1) step plus 28 * 28 * 32 * 5 * 5 * 16 ~ 10m for the (5 x 5) step, thus ~12.4m multiplications (roughly 10x fewer)
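
The two cost estimates can be reproduced with a one-line formula: number of output values times the volume of one filter. A small sketch, counting multiplications only; `conv_mults` is an illustrative helper.

```python
def conv_mults(n_h, n_w, c_in, c_out, f):
    """Multiplications of a stride-1, same-padded convolution: outputs x filter volume."""
    return n_h * n_w * c_out * f * f * c_in


direct = conv_mults(28, 28, 192, 32, 5)           # 120,422,400
reduce_1x1 = conv_mults(28, 28, 192, 16, 1)       # 2,408,448
conv_5x5 = conv_mults(28, 28, 16, 32, 5)          # 10,035,200
bottleneck = reduce_1x1 + conv_5x5                # 12,443,648

print(direct, bottleneck, round(direct / bottleneck, 1))   # ratio is about 9.7
```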


* inception block optimized
    * input (previous activation) with dimensions (28 x 28 x 192)
    * 96 conv filters with size (1 x 1), followed by 128 conv filters with size (3 x 3), stride 1 and same padding, output (28 x 28 x 128)
    * 16 conv filters with size (1 x 1), followed by 32 conv filters with size (5 x 5), stride 1 and same padding, output (28 x 28 x 32)
    * 64 conv filters with size (1 x 1), output (28 x 28 x 64)
    * max-pool layer with filter (3 x 3), stride 1, same padding, followed by 32 conv filters with size (1 x 1), output (28 x 28 x 32)
    * the outputs of the four branches are concatenated along the channel axis (channel concat), giving (28 x 28 x 256)
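
The channel concat at the end of the block is just a concatenation along the last axis. The sketch below uses zero-filled placeholders with the branch shapes listed above, so it only illustrates the bookkeeping of shapes, not the convolutions themselves.

```python
import numpy as np

# Dummy branch outputs (values are irrelevant, only the shapes matter here).
branch_1x1 = np.zeros((28, 28, 64))
branch_3x3 = np.zeros((28, 28, 128))     # after its (1 x 1) reduction to 96 channels
branch_5x5 = np.zeros((28, 28, 32))      # after its (1 x 1) reduction to 16 channels
branch_pool = np.zeros((28, 28, 32))     # max-pool followed by a (1 x 1) convolution

block_out = np.concatenate([branch_1x1, branch_3x3, branch_5x5, branch_pool], axis=-1)
print(block_out.shape)                   # (28, 28, 256)
```

Concatenation only works because every branch uses stride 1 and same padding, so all four outputs share the same height and width.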

* architecture
    * inception blocks chained one after another
    * side branches
        * branch off the network after an intermediate inception block
        * FC layer
        * output layer for label prediction (softmax)
        * appear to have a regularizing effect

Summary  
* sometimes called *GoogLeNet*

## MobileNet

* CNNs can be computationally intensive
* normal convolution
    * input (n x n x n_c) * n_c' filters of size (f x f x n_c) = output (n_out x n_out x n_c') [no padding, stride 1]
    * number of multiplications = #filter params * #filter positions * #filters, e.g. 27 * 16 * 5 = 2160 for input (6 x 6 x 3), f = 3 and n_c' = 5

* depth-wise separable convolution
    * depth-wise convolution
        * input (n x n x n_c) * n_c filters of size (f x f), one per channel = output (n_out x n_out x n_c)
        * number of multiplications = #filter params * #filter positions * #filters, e.g. 9 * 16 * 3 = 432
    * point-wise convolution
        * depth-wise output (n_out x n_out x n_c) * n_c' filters of size (1 x 1 x n_c) = output (n_out x n_out x n_c')
        * number of multiplications = #filter params * #filter positions * #filters, e.g. 3 * 16 * 5 = 240
    * total cost is 432 + 240 = 672, i.e. ~31% of the normal convolution (a reduction by a factor of ~3) for this example; in general the cost ratio is $\frac{1}{n_c'}+\frac{1}{f^2}$
* the inference step becomes much more efficient!
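
The multiplication counts and the general cost ratio can be verified with a few helper functions. The example sizes below (input 6 x 6 x 3, f = 3, 5 output channels) are an assumption, chosen because they reproduce the counts quoted above (2160, 432, 240).

```python
def normal_conv_mults(n_out, f, n_c, n_c_out):
    """#filter params * #filter positions * #filters for a normal convolution."""
    return (f * f * n_c) * (n_out * n_out) * n_c_out


def depthwise_mults(n_out, f, n_c):
    """One (f x f) filter per input channel."""
    return (f * f) * (n_out * n_out) * n_c


def pointwise_mults(n_out, n_c, n_c_out):
    """n_c_out filters of size (1 x 1 x n_c)."""
    return n_c * (n_out * n_out) * n_c_out


n, f, n_c, n_c_out = 6, 3, 3, 5          # assumed example sizes
n_out = n - f + 1                        # no padding, stride 1

normal = normal_conv_mults(n_out, f, n_c, n_c_out)
separable = depthwise_mults(n_out, f, n_c) + pointwise_mults(n_out, n_c, n_c_out)
ratio = separable / normal
formula = 1 / n_c_out + 1 / f ** 2

print(normal, separable)                 # 2160 672
print(round(ratio, 3), round(formula, 3))   # 0.311 0.311
```

In real networks $n_c'$ is typically in the hundreds, so the ratio is dominated by $1/f^2$, roughly a 9x saving for (3 x 3) filters.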


**Architecture**

* using separable convolution (depth-wise and point-wise operations) instead of expensive traditional convolution operation
* [paper v2](https://arxiv.org/pdf/1801.04381)

* building blocks (v1)
    * input layer
    * separable convolution (13 layers)
    * pooling
    * fully-connected layer
    * output layer (softmax activation)

* building blocks (v2)
    * input layer
    * separable convolution (17 layers, "bottleneck block")
        * residual connection (more efficient gradient propagation)
        * expansion layer
        * projection layer
    * pooling
    * fully-connected layer
    * output layer (softmax activation)


* v2 bottleneck block
    * structure
        * input (n x n x 3)
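        * expansion layer: (1 x 1) convolution to n_c channels, where n_c is fairly large (e.g. 6 times the channels of the previous layer), output (n x n x n_c)
        * depth-wise convolution, same padding, output (n x n x n_c)
        * point-wise convolution, output (n x n x 3), sometimes called the projection step (projecting down from the depth-wise convolution)
        * residual connection from the block input to the block output
    * the expansion enables the block to learn richer functions
    * the projection keeps the activations passed between blocks small, which helps with the memory needed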
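
The four steps of the bottleneck block can be sketched in plain NumPy. Assumptions not stated in these notes: a (3 x 3) depth-wise filter, ReLU6 after the expansion and depth-wise steps, no activation after the projection, no biases or batch normalization, and a square input. All sizes and names are hypothetical.

```python
import numpy as np


def relu6(x):
    return np.clip(x, 0.0, 6.0)


def bottleneck_block(x, W_expand, K_depth, W_project):
    """x: (n, n, c); W_expand: (c, c_exp); K_depth: (3, 3, c_exp); W_project: (c_exp, c)."""
    n = x.shape[0]
    # 1) expansion: (1 x 1) convolution that widens the channel dimension
    expanded = relu6(x @ W_expand)                        # (n, n, c_exp)
    # 2) depth-wise (3 x 3) convolution, same padding: one filter per channel
    padded = np.pad(expanded, ((1, 1), (1, 1), (0, 0)))
    depth = np.zeros_like(expanded)
    for i in range(3):
        for j in range(3):
            depth += padded[i:i + n, j:j + n, :] * K_depth[i, j, :]
    depth = relu6(depth)
    # 3) projection: point-wise (1 x 1) convolution back to c channels
    projected = depth @ W_project                         # (n, n, c)
    # 4) residual connection to the block input
    return x + projected


rng = np.random.default_rng(0)
n, c, expansion = 8, 3, 6                                 # hypothetical sizes
x = rng.normal(size=(n, n, c))
out = bottleneck_block(
    x,
    rng.normal(size=(c, c * expansion)),
    rng.normal(size=(3, 3, c * expansion)),
    rng.normal(size=(c * expansion, c)),
)
print(out.shape)                                          # (8, 8, 3)
```

Only the small (n x n x 3) tensors cross block boundaries; the wide (n x n x 18) tensors exist only inside the block.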


## EfficientNet

* [paper](https://arxiv.org/pdf/1905.11946)
* the goal is to scale the NN up or down to fit the resources of the target (e.g. edge) device
* baseline architecture
    * input size
    * depth
    * width
* efficient architecture
    * higher/lower resolution image 
    * higher/lower depth
    * higher/lower width
* what are sensible choices for these three parameters (resolution, depth, width), i.e. what is the best trade-off for a given compute budget?
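
The paper's answer to this question is compound scaling: depth, width and resolution are scaled together by $\alpha^\phi$, $\beta^\phi$, $\gamma^\phi$ for a single coefficient $\phi$. The sketch below assumes the base values reported in the paper for its baseline network ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); the function names are made up for illustration.

```python
# Base multipliers reported in the EfficientNet paper (assumed here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15      # depth, width, resolution


def compound_scale(phi):
    """Multipliers for depth, width and input resolution at compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """FLOPs grow roughly with depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r, 2), round(flops_multiplier(phi), 2))
```

With these values each increment of $\phi$ roughly doubles the compute, which gives a simple knob for matching a device budget.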

# Practical advice

* open-source implementations: reuse existing building blocks and study the implementation details of particular solutions; they are usually published on GitHub, and pre-trained nets are a great starting point
* transfer learning (freezing or training the weights in selected layers) can be split into two steps: pre-compute the outputs of the frozen feature-extraction part and save them to disk, then train a shallow classification net on the saved results
* data augmentation might help when there is not enough data
    * mirroring/flipping along an axis, random cropping, rotation, shearing, local warping
    * color shifting (distorting channels; PCA approach in the AlexNet paper)
    * distortions can be generated on the CPU in parallel with training, during mini-batch generation
* computer vision (CV)
    * data vs hand-engineering, on a spectrum from little data (e.g. object detection) to lots of data (e.g. speech recognition)
        * with lots of data -> simple algorithms, less hand-engineering
        * with not much data -> hand-engineering features, architecture, other components, hacks; transfer learning
    * hacks
        * ensembling
        * multi-crop at test time: run the classifier on several crops of each test image (e.g. 10-crop) and average the predictions
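
A few of the augmentation steps and the 10-crop trick are easy to sketch in NumPy. The crop size, shift range and the random stand-in image below are arbitrary illustrative choices, not values from any of the papers.

```python
import numpy as np


def random_augment(img, crop, rng):
    """Mirror, randomly crop and colour-shift an (H, W, 3) float image in [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                          # horizontal mirroring
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop, :]     # random crop
    shift = rng.uniform(-0.1, 0.1, size=3)             # per-channel colour shift
    return np.clip(img + shift, 0.0, 1.0)


def ten_crop(img, crop):
    """Four corners plus centre, each with its mirror image: 10 crops in total."""
    h, w, _ = img.shape
    ct, cl = (h - crop) // 2, (w - crop) // 2
    corners = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop), (ct, cl)]
    crops = [img[t:t + crop, l:l + crop, :] for t, l in corners]
    crops += [c[:, ::-1, :] for c in crops]
    return np.stack(crops)


rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))                      # stand-in for a real image
augmented = random_augment(image, 224, rng)
crops = ten_crop(image, 224)
print(augmented.shape, crops.shape)                    # (224, 224, 3) (10, 224, 224, 3)
```

At test time the classifier would be run on all ten crops and the predicted probabilities averaged; during training `random_augment` would be called per image while the mini-batch is assembled.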