# CNN architectures

## History of significant breakthroughs

* implementing DNN is easy (stack layers)
* what architecture? hyperparams?
* intuition, math insights, trial and error, luck, time, effort, frustration, occasional success


Based on Zhang, Aston, Zachary C. Lipton, Mu Li, and Alexander J. Smola. ‘Dive into Deep Learning’. ArXiv Preprint ArXiv:2106.11342, 2021.



## LeNet - the grandfathers

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., & others. (**1998**). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

* Yann LeCun, Leon Bottou, Yoshua Bengio - names to remember!
* decade of research to get here
* first successful CNN with backprop
* handwritten digits recognition (MNIST dataset)
* outstanding results matching SVMs :)  (~99.05% test set accuracy)


![Yoshua Bengio, Geoff Hinton and Yann LeCun.](BengioHintonLeCun2.jpeg)

Yoshua Bengio - Uni Montreal and Mila, Geoff Hinton - Uni Toronto, and Yann LeCun - Facebook and New York University (French-born :) )

### LeNet-5 architecture

![LeNet for handwritten digit recognition.](lenet.svg)

* convolutional block = convolutional layer + sigmoid activation (ReLU not yet in use then!)
* subsampling average pooling (max pooling not yet used!)
* convolutions: 5x5 kernels, 1st - 6 output channels, padding 2. 2nd - 16 channels, padding 0
* pooling: 2x2 with stride 2
* 3 dense layers: 120, 84, 10 outputs

### Alternative depiction

![LeNet schematic depiction.](lenet-vert.svg)

In [10]:
# LeNet PyTorch implementation
import torch
from torch import nn

lenet = nn.Sequential(
  nn.Conv2d(1, 6, kernel_size=5, padding=2),
  nn.Sigmoid(),
  nn.AvgPool2d(kernel_size=2, stride=2),
  nn.Conv2d(6, 16, kernel_size=5),
  nn.Sigmoid(),
  nn.AvgPool2d(kernel_size=2, stride=2),
  nn.Flatten(),
  nn.Linear(16*5*5, 120),
  nn.Sigmoid(),
  nn.Linear(120, 84),
  nn.Sigmoid(),
  nn.Linear(84, 10),
  nn.Sigmoid()
)

In [11]:
print(lenet)

Sequential(
  (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (1): Sigmoid()
  (2): AvgPool2d(kernel_size=2, stride=2, padding=0)
  (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (4): Sigmoid()
  (5): AvgPool2d(kernel_size=2, stride=2, padding=0)
  (6): Flatten()
  (7): Linear(in_features=400, out_features=120, bias=True)
  (8): Sigmoid()
  (9): Linear(in_features=120, out_features=84, bias=True)
  (10): Sigmoid()
  (11): Linear(in_features=84, out_features=10, bias=True)
  (12): Sigmoid()
)


In [12]:
X = torch.rand(size=(5, 1, 28, 28))
for layer in lenet:
    X = layer(X)
    print(f'{layer.__class__.__name__} output shape: \t {X.shape}')

Conv2d output shape: 	 torch.Size([5, 6, 28, 28])
Sigmoid output shape: 	 torch.Size([5, 6, 28, 28])
AvgPool2d output shape: 	 torch.Size([5, 6, 14, 14])
Conv2d output shape: 	 torch.Size([5, 16, 10, 10])
Sigmoid output shape: 	 torch.Size([5, 16, 10, 10])
AvgPool2d output shape: 	 torch.Size([5, 16, 5, 5])
Flatten output shape: 	 torch.Size([5, 400])
Linear output shape: 	 torch.Size([5, 120])
Sigmoid output shape: 	 torch.Size([5, 120])
Linear output shape: 	 torch.Size([5, 84])
Sigmoid output shape: 	 torch.Size([5, 84])
Linear output shape: 	 torch.Size([5, 10])
Sigmoid output shape: 	 torch.Size([5, 10])
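
The spatial sizes printed above follow from the usual output-size rule for convolution and pooling, floor((n + 2p - k) / s) + 1. A minimal sketch in plain Python, where `out_size` is a hypothetical helper name introduced only for this illustration:

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

n = 28                                  # MNIST-sized input
n = out_size(n, kernel=5, padding=2)    # conv1: padding 2 keeps 28
n = out_size(n, kernel=2, stride=2)     # pool1: halves to 14
n = out_size(n, kernel=5)               # conv2: no padding, 14 -> 10
n = out_size(n, kernel=2, stride=2)     # pool2: halves to 5
flat_features = 16 * n * n              # 16 channels of 5x5 feed the first dense layer
print(n, flat_features)
```

This is where the `16*5*5` in the first `nn.Linear` comes from.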


## What happened next?

### Nothing! (well not quite)

* 1990 - ~2010: "traditional" computer vision methods outperformed NNs
* state-of-the-art: hand-crafted features combined with standard ML classifier (SIFT, SURF, HOG + SVM, NN, DT) 
* ML researchers: ML methods are cool, elegant, mathematically well motivated
* CV researchers: data are dirty, domain knowledge matters, ML methods are secondary

## Heroes

### Main idea: learn the features!

* guerrilla research exploring alternatives
* Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng, Shun-ichi Amari, Juergen Schmidhuber - more names to remember!

![Ng Amari Schmidhuber](NgAmariSchmidhuber.jpg)


## Missing bits

### Data

* research in 1990-2010 based on tiny datasets (UCI repository) - anecdotal from kernel world: "4000 instances is a BIG dataset worth special attention"
* 2009 wow! **ImageNet**: 1 million examples (3x224x224) = 1k each from 1k categories, Fei-Fei Li (remember!), Google Image Search, Amazon Mechanical Turk; the ImageNet Challenge (ILSVRC) followed in 2010

### Hardware

* DNN training relies on many cycles of simple operations
* GPUs were developed for graphics in gaming
* their requirements proved similar to what CNNs need
* NVIDIA and ATI develop general-purpose GPUs
* CPUs vs GPUs hardware pros and cons - go to MAI :)
* AlexNet - use GPUs for CNNs - breakthrough!

## AlexNet

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (**2012**). Imagenet classification with deep convolutional
neural networks. Advances in neural information processing systems (pp. 1097–1105).

* Alex Krizhevsky (??), Ilya Sutskever (OpenAI co-founder) and Geoff Hinton - even more names to remember
* let's use GPUs! wow! remember, no PyTorch (2017), no TensorFlow (2015) then!
* finally **beating** the manually constructed features: top-5 accuracy on ImageNet up from about 75% to 84.7%

![Krizhevsky and Sutskever](KrizhevskySutsakaver.jpeg)


## LeNet vs AlexNet

![LeNet vs AlexNet.](alexnet.svg)

### LeNet vs AlexNet

* input dimension much bigger
* AlexNet **deeper**
* ReLU instead of sigmoids

#### Architecture

* 11x11 kernel in the first layer to capture objects in large images
* reduce dimensionality rapidly (strides)
* many more channels (10x as many)

### Activation function - ReLU

* simpler / faster to compute (no exponentiation)
* less sensitive to initialization - sigmoid gradient is close to 0 when its output is close to 0 or 1
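
The saturation argument can be checked numerically. A minimal sketch in plain Python (the function names are illustrative, not from any library): the sigmoid derivative peaks at 0.25 and decays quickly, while the ReLU derivative stays at 1 for every positive input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # derivative of max(0, z); taken as 0 at z = 0 by convention
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```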

### Complexity control - dropout, augmentation

* **dropout**: applied only to the fully connected layers
* **data augmentation**: flipping, cropping, color changes: effectively larger training set, less overfitting
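
A minimal sketch of how `nn.Dropout` behaves, assuming p = 0.5 as in the AlexNet code: in training mode each activation is either zeroed or rescaled by 1 / (1 - p), and in evaluation mode the layer is the identity. The tensor shape is arbitrary and chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

drop.train()
y_train = drop(x)   # entries are 0 (dropped) or 1 / (1 - 0.5) = 2 (kept and rescaled)

drop.eval()
y_eval = drop(x)    # no units are dropped at inference time

print(y_train)
print(y_eval)
```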


In [None]:
# AlexNet PyTorch implementation
import torch
from torch import nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(6400, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000)    
)

## VGG net (Visual Geometry Group Oxford Uni)

Simonyan, K., & Zisserman, A. (**2014**). Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.

AlexNet great - finally it works! But how do we develop other NNs?

### Main idea: blocks

Move from thinking about neurons, then layers to thinking about blocks of layers with repeating patterns

* convolutional layer with padding to maintain spatial resolution (VGG 3x3 kernels, padding 1)
* nonlinearity (ReLU and friends)
* max pooling for downsampling (VGG 2x2, stride 2)

### From AlexNet to VGG

![AlexNet vs VGG.](vgg.svg)

VGG-11: 1 conv(64), 1 conv(128), 2 conv(256), 2 conv(512), 2 conv(512), FC(4096), FC(4096), FC(1000)
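
The block idea and the VGG-11 layout above can be sketched as follows. This is an illustration under assumptions: `vgg_block` and `conv_arch` are names chosen here, only the convolutional part is built (the three FC layers are left out to keep it light), and a 3x224x224 input is assumed.

```python
import torch
from torch import nn

def vgg_block(num_convs, in_channels, out_channels):
    """num_convs times (3x3 conv with padding 1 + ReLU), then 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# (number of convolutions, output channels) per block, as listed for VGG-11
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

blocks, in_channels = [], 3
for num_convs, out_channels in conv_arch:
    blocks.append(vgg_block(num_convs, in_channels, out_channels))
    in_channels = out_channels
vgg11_features = nn.Sequential(*blocks)

with torch.no_grad():
    out = vgg11_features(torch.rand(1, 3, 224, 224))
print(out.shape)  # five poolings halve the resolution five times
```

Eight convolutional layers plus the three FC layers give the 11 weight layers in the name.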

## NiN - Network in Network (Singapore)


Lin, M., Chen, Q., & Yan, S. (**2013**). Network in network. arXiv preprint arXiv:1312.4400.

LeNet -> AlexNet -> VGG net: stack convolutional blocks and finish by FC layers, improvements from more width and depth

### Main idea: use FC layers earlier

Hmm ... but how not to lose the spatial info? -> **1x1 convolution**

* FC layer acting at each pixel across channels
* each pixel an instance with channels as features
* parameters tied through common kernel
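
The equivalence between a 1x1 convolution and a per-pixel FC layer can be checked directly. A minimal sketch, with channel counts and input size picked arbitrarily for illustration: the weights of a 1x1 `nn.Conv2d` are copied into an `nn.Linear`, which is then applied to every pixel with channels as features.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv1x1 = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=1)
fc = nn.Linear(3, 4)

with torch.no_grad():
    # conv weight has shape (4, 3, 1, 1); drop the two trailing kernel dimensions
    fc.weight.copy_(conv1x1.weight.view(4, 3))
    fc.bias.copy_(conv1x1.bias)

    x = torch.rand(2, 3, 5, 5)
    y_conv = conv1x1(x)
    # channels last: every pixel becomes one instance with 3 features, then restore layout
    y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_fc, atol=1e-6))
```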


### From VGG to NiN

![Nin](nin.svg)

No FC layers at the end! The last NiN block has as many output channels as there are classes and is followed by global average pooling -> fewer parameters, but longer training time.
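
A minimal sketch of a NiN block and of the classification head described above. The helper name `nin_block`, the 384 input channels, the 5x5 feature map and the 10 classes are assumptions made only for this illustration.

```python
import torch
from torch import nn

def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    """One ordinary convolution followed by two 1x1 convolutions (per-pixel FC layers)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
    )

num_classes = 10
head = nn.Sequential(
    nin_block(384, num_classes, kernel_size=3, stride=1, padding=1),  # one channel per class
    nn.AdaptiveAvgPool2d((1, 1)),   # global average pooling over each channel
    nn.Flatten(),                   # (batch, num_classes, 1, 1) -> (batch, num_classes)
)

with torch.no_grad():
    out = head(torch.rand(2, 384, 5, 5))
print(out.shape)
```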

## GoogLeNet 

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (**2015**). Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition.

### Main idea: combine variously-sized kernels into single block - Inception block


![Inception block](inception.svg)

Preserves spatial dimension, main hyper-parameter is number of output channels per branch (93.3% top-5 accuracy on ImageNet)
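
A minimal sketch of an Inception block with four parallel branches (1x1, 1x1 then 3x3, 1x1 then 5x5, 3x3 max pooling then 1x1) whose outputs are concatenated on the channel dimension. The class name, argument names and the channel numbers in the usage line are illustrative choices, not taken from the slides.

```python
import torch
from torch import nn
from torch.nn import functional as F

class Inception(nn.Module):
    """c1..c4 give the output channels of the four branches."""
    def __init__(self, in_channels, c1, c2, c3, c4):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.b1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        # Branch 2: 1x1 convolution, then 3x3 convolution (padding keeps the size)
        self.b2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.b2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        # Branch 3: 1x1 convolution, then 5x5 convolution (padding keeps the size)
        self.b3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.b3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # Branch 4: 3x3 max pooling with stride 1, then 1x1 convolution
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)  # concatenate on channels

block = Inception(192, 64, (96, 128), (16, 32), 32)
with torch.no_grad():
    out = block(torch.rand(1, 192, 28, 28))
print(out.shape)  # channels: 64 + 128 + 32 + 32
```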

### GoogLeNet architecture


![inception-full](inception-full.svg)


## Batchnorm

Ioffe, S., & Szegedy, C. (**2015**). Batch normalization: accelerating deep network training by reducing
internal covariate shift. arXiv preprint arXiv:1502.03167.

### Main idea: deeper nets take long to train - we need to accelerate convergence through normalization

* Remember: data preprocessing in linear regression net via normalization to **align scaling** - speeds up convergence
* Deep net: each layer output (next layer input) has widely varying magnitudes (across nodes, channels, layers)
* Batch normalization idea: before each layer normalize each batch by its statistics (mean, std)
* Careful: with batch size = 1, subtracting the batch mean leaves every activation at 0, so nothing can be learned => batch size matters even more! (seems to work best for 50~100 range)





### Batchnorm math

$\mathbf{x} \in \mathcal{B}$ minibatch (input)

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}.$$

* $\hat{\boldsymbol{\mu}}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x}$ - batch mean 
* $\hat{\boldsymbol{\sigma}}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}})^2 + \epsilon$ - batch variance (sqrt is standard deviation), small $\epsilon$ to avoid division by 0
* after normalization (before scale and shift) the new mean = 0, new std = 1
* $\boldsymbol{\gamma}$ - elementwise scale parameter (learned)
* $\boldsymbol{\beta}$ - elementwise shift parameter (learned)

* Using noisy estimates $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and ${\hat{\boldsymbol{\sigma}}_\mathcal{B}}$ based on batches rather than full data - some randomness during training helps (think random perturbations, dropout).
* At test time use full train-sample statistics (in practice estimated by running averages collected during training)
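
The formula can be checked against PyTorch. A minimal sketch, assuming a fully connected setting with 4 features, a batch of 64 and the default initialization (gamma = 1, beta = 0); the data are random and purely illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(64, 4) * 3.0 + 5.0      # batch with mean about 5 and std about 3
eps = 1e-5

mu = x.mean(dim=0)                          # batch mean per feature
var = x.var(dim=0, unbiased=False) + eps    # batch variance (1/|B|) plus epsilon
gamma, beta = torch.ones(4), torch.zeros(4)
y_manual = gamma * (x - mu) / torch.sqrt(var) + beta

bn = nn.BatchNorm1d(4, eps=eps)
bn.train()                                  # training mode: use batch statistics
with torch.no_grad():
    y_torch = bn(x)

print(torch.allclose(y_manual, y_torch, atol=1e-5))
print(y_manual.mean(dim=0), y_manual.std(dim=0, unbiased=False))
```

After `bn.eval()` the layer would instead normalize with the running mean and variance it accumulated in training mode.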

### Batchnorm in FC and CNN layers

#### FC - after affine transform before nonlinearity

The original paper suggested: $\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}) ).$

#### CNN - after convolution before nonlinearity, independently for each channel
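
A minimal sketch of what "per channel" means for convolutional layers, with an arbitrary input shape: `nn.BatchNorm2d` computes one mean and one variance per channel, pooled over the batch and both spatial dimensions, and learns one scale and one shift per channel.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 3, 5, 5) * 2.0 + 1.0    # (batch, channels, height, width)

bn = nn.BatchNorm2d(3)
bn.train()
with torch.no_grad():
    y = bn(x)

# Gather all values of each channel in one row: shape (channels, batch * height * width)
per_channel = y.transpose(0, 1).reshape(3, -1)
channel_mean = per_channel.mean(dim=1)
channel_std = per_channel.std(dim=1, unbiased=False)
print(channel_mean, channel_std)
print(bn.weight.shape, bn.bias.shape)      # one gamma and one beta per channel
```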

### Controversy

* Intuitively, batchnorm smooths the optimization landscape. But does it really?
* "reducing *internal covariate shift*". What the h... is that?
* No, it actually does not. So how come it works?
* Example of intuition that works in practice while theory lags behind => disturbing
* Ali Rahimi, NeurIPS 2017: "deep learning = alchemy" :)

## Residual Network (ResNet)

He, K., Zhang, X., Ren, S., & Sun, J. (**2016**). Deep residual learning for image recognition. Proceedings
of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

Questions to consider:
* adding layers increases complexity, but does it help?
* is the deeper NN more expressive, or simply different?

### ResNet - some theory

* Trained NN - function $f : \mathcal{X} \to \mathcal{Y}$
* NN architecture - set of functions $f \in \mathcal{F}$ varying in parameter settings, hyperparameters, etc.
* $f^*$ the "best / truth" function, typically $f^* \notin \mathcal{F}$
* make $\mathcal{F}$ more complex - any closer to $f^*$?
* non-nested function classes $\approx$ no guarantee $\Rightarrow$ **use nested function classes**

![Non-nested vs nested function classes](functionclasses.svg)

### ResNet in practice

* construct layer that can train to $f(\mathbf{x})=\mathbf{x}$
* *residual (shortcut) connection* propagates inputs forward quickly
* the residual mapping $f(\mathbf{x})-\mathbf{x}$ can be trained to zero in the weight layers, which recovers the identity

![Residual block](residual-block.svg)

### ResNet block architecture

* follows the VGG net design
* two 3x3 convolutions with same number of output channels
* batch norm and ReLU
* preserve input dim through convolutions so can be added to the skip connection
* but can change skip-connection channels through 1x1 convolution

![ResNet block](resnet-block.svg)
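
A minimal sketch of the residual block described above: two 3x3 convolutions with batch norm and ReLU, plus an optional 1x1 convolution on the shortcut when the channel count or resolution changes. The class and argument names are illustrative. The last lines check that once the residual branch outputs zero (here forced by zeroing the second batch norm scale), the block reduces to ReLU of its input.

```python
import torch
from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Optional 1x1 convolution so the shortcut matches the branch output shape
        self.conv3 = (nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
                      if use_1x1conv else None)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.conv3 is not None:
            x = self.conv3(x)
        return F.relu(y + x)            # add the shortcut, then apply ReLU

x = torch.rand(4, 3, 6, 6)              # non-negative input, so relu(x) == x
same = Residual(3, 3)
down = Residual(3, 6, use_1x1conv=True, stride=2)
with torch.no_grad():
    out_same, out_down = same(x), down(x)
    nn.init.zeros_(same.bn2.weight)     # residual branch now outputs exactly zero
    identity_out = same(x)
print(out_same.shape, out_down.shape, torch.allclose(identity_out, x))
```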

## ResNet-18 model


![ResNet-18](resnet18.svg)

### ResNet layers

* inspired by GoogLeNet
* resolution decreases, channels increase
* use BatchNorm

## Densely Connected Network (DenseNet)

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (**2017**). Densely connected convolutional
networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708).


![densenet](densenet-block.svg)

* difference from ResNet: concatenation (on channels) instead of addition
* ResNet: $\mathbf{x} \to f_2(\mathbf{x} + f_1(\mathbf{x})) + \mathbf{x} + f_1(\mathbf{x})$
* DenseNet: $\mathbf{x} \to [\mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})])]$
* use 1x1 convolutions to reduce num of channels
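
A minimal sketch of the concatenation idea and of the 1x1 convolution used to shrink the channel count again. The names `conv_block`, `DenseBlock`, `transition_block` and the growth rate of 10 are assumptions made for this illustration; the batch norm, ReLU, convolution ordering inside `conv_block` is one common choice.

```python
import torch
from torch import nn

def conv_block(in_channels, out_channels):
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
    )

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, growth_rate):
        super().__init__()
        # Each conv block sees the input plus the outputs of all earlier blocks
        self.blocks = nn.ModuleList(
            [conv_block(in_channels + i * growth_rate, growth_rate) for i in range(num_convs)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = torch.cat((x, block(x)), dim=1)   # concatenate on channels instead of adding
        return x

def transition_block(in_channels, out_channels):
    """1x1 convolution reduces channels, average pooling halves the resolution."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )

x = torch.rand(4, 3, 8, 8)
with torch.no_grad():
    dense_out = DenseBlock(num_convs=2, in_channels=3, growth_rate=10)(x)
    trans_out = transition_block(23, 10)(dense_out)
print(dense_out.shape, trans_out.shape)   # channels grow 3 -> 23, then shrink to 10
```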