# Classic Networks

In this notebook, we'll learn about some of the classic CNN architectures.

<hr>

## LeNet-5 (1998)

LeNet-5 was designed to recognize handwritten digits in a $32 \times 32 \times 1$ grayscale image.

<div style="text-align:center">
    <img src="media/lenet5.png">
    <caption><font color="red">Parameters: 60k</font></caption>
</div>

<hr>

- **Initial Input:** The input is an image of size 32x32x1 (e.g., grayscale handwritten digits).


- **First Convolutional Layer:**
    - Uses 6 filters of size 5x5, with a stride of 1.
    - The absence of padding reduces the image size to 28x28.
    - The output dimension becomes 28x28x6.
- **First Pooling Layer:**
    - Applies average pooling (though modern implementations might prefer max pooling).
    - Uses a 2x2 window with a stride of 2, halving the dimensions to 14x14x6.


- **Second Convolutional Layer:**
    - Employs 16 filters of size 5x5, again without padding.
    - This reduces the dimensions to 10x10x16.
- **Second Pooling Layer:**
    - Similar to the first pooling layer, further reducing the dimensions to 5x5x16.


- **Flattening:**
    - The output from the previous layer is flattened, resulting in a vector of size 400 (5x5x16).
- **Fully Connected Layers:**
    - The network includes fully connected layers, with the first having 120 neurons, and the second having 84 neurons.
- **Output Layer:**
    - The original network used Euclidean radial basis function (RBF) output units, but modern implementations use a softmax layer for 10-way classification (digits 0-9).
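
The layer sizes listed above follow from the standard output-size formula $\lfloor (n + 2p - f) / s \rfloor + 1$. The snippet below is a minimal sketch that traces them; the helper names (`conv_output_size`, `trace_shapes`) and the layer table are illustrative and not part of any library.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial size after a conv or pooling window: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

# (name, window size, stride, output channels); None keeps the channel count (pooling).
LENET5_LAYERS = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, None),
]

def trace_shapes(size, channels, layers):
    """Return a list of (name, (height, width, channels)) after each layer."""
    shapes = []
    for name, f, stride, out_channels in layers:
        size = conv_output_size(size, f, stride)  # no padding anywhere in LeNet-5
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, (size, size, channels)))
    return shapes

lenet5_shapes = trace_shapes(32, 1, LENET5_LAYERS)
for name, shape in lenet5_shapes:
    print(f"{name}: {shape}")

# Flattening the last feature map gives the input size of the first dense layer.
h, w, c = lenet5_shapes[-1][1]
flattened = h * w * c
print("flattened:", flattened)
```

Running it reproduces the 28, 14, 10, 5 progression and the 400-element flattened vector.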

<hr>

**Characteristics**
- LeNet-5 has approximately 60,000 parameters, which is small compared to modern networks.
- As the network gets deeper, the spatial dimensions (height and width) decrease while the number of channels increases.
- The architecture typically follows a pattern of convolutional layers followed by pooling layers, and then fully connected layers leading to the output.
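
The 60k figure can be checked by counting weights and biases layer by layer. This is a rough sketch under one stated assumption: every filter in the second convolutional layer sees all 6 input channels, as in modern re-implementations. The helper names are illustrative.

```python
def conv_params(f, in_channels, out_channels):
    """Weights plus one bias per filter for an f x f convolution."""
    return (f * f * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_in + 1) * n_out

# Assumption: fully connected conv2 and a plain 10-unit output layer.
lenet5_params = {
    "conv1": conv_params(5, 1, 6),
    "conv2": conv_params(5, 6, 16),
    "fc1": dense_params(5 * 5 * 16, 120),
    "fc2": dense_params(120, 84),
    "output": dense_params(84, 10),
}

lenet5_total = sum(lenet5_params.values())
for name, count in lenet5_params.items():
    print(f"{name}: {count:,}")
print(f"total: {lenet5_total:,}")
```

Under these assumptions the total comes to 61,706, and most of it sits in the first fully connected layer rather than in the convolutions.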

**Historical Context**
- The original LeNet-5 used sigmoid and tanh activation functions instead of ReLU.
- Each feature map in the second convolutional layer was connected to only a subset of the preceding feature maps rather than all of them, in part because of the computational limitations of the time.
- The original LeNet-5 applied a non-linearity (sigmoid) after the pooling layers, which is uncommon in modern architectures.

[LeCun et al., 1998. Gradient-based learning applied to document recognition]

<hr>

## AlexNet (2012)

<br>

<div style="text-align:center">
    <img src="media/alexnet.png">
    <caption><font color="red">Parameters: 60M</font></caption>
</div>

<hr>

- **Input Dimensions:** AlexNet processes images of size 227x227x3. The original paper states 224x224x3, but the layer arithmetic only works out with 227x227x3.


- **First Convolutional Layer:**
    - Uses 96 filters, each of size 11x11, with a stride of 4.
    - With no padding, the output dimension becomes 55x55x96; the large stride shrinks the height and width by roughly a factor of 4.
- **First Max Pooling Layer:**
    - Applies max pooling with a 3x3 filter and a stride of 2.
    - Reduces the dimension to 27x27x96.


- **Second Convolutional Layer:**
    - Employs a 5x5 convolution with same padding, increasing the number of filters to 256.
    - Maintains the dimension at 27x27x256.
- **Second Max Pooling Layer:**
    - Further applies max pooling, reducing the dimensions to 13x13x256.


- **Additional Convolutional Layers:**
    - Three more convolutional layers follow, each using 3x3 filters with same padding.
    - The number of filters increases to 384, stays at 384, and then drops back to 256.
    - A final max pooling layer reduces the dimensions to 6x6x256.


- **Flattening:**
    - Flattens the output to prepare for fully connected layers, resulting in 9216 features (6x6x256).
- **Fully Connected Layers:**
    - Two fully connected layers with 4096 units each are followed by a final layer with softmax activation for 1000-way classification (reflecting the 1000 classes in the ImageNet challenge).
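
The same output-size formula explains both the 227 versus 224 discrepancy and the shapes listed above. The sketch below assumes padding of 2 for the 5x5 convolution and 1 for the 3x3 convolutions (that is, same padding), and three 3x3 layers with 384, 384, and 256 filters. The helper names and the layer table are illustrative.

```python
def output_size(n, f, stride=1, padding=0):
    """Spatial size after a conv or pooling window: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

def fits_exactly(n, f, stride=1, padding=0):
    """True when the last window ends exactly at the edge of the padded input."""
    return (n + 2 * padding - f) % stride == 0

# The first layer (11x11, stride 4, no padding) only divides evenly for 227.
for n in (224, 227):
    print(n, "->", output_size(n, 11, stride=4), "exact:", fits_exactly(n, 11, stride=4))

# (name, window, stride, padding, output channels); None keeps channels (pooling).
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

size, channels = 227, 3
alexnet_shapes = {}
for name, f, stride, padding, out_channels in ALEXNET_LAYERS:
    size = output_size(size, f, stride, padding)
    if out_channels is not None:
        channels = out_channels
    alexnet_shapes[name] = (size, size, channels)
    print(f"{name}: {alexnet_shapes[name]}")

alexnet_flattened = size * size * channels
print("flattened:", alexnet_flattened)
```

With a 224-pixel input the first convolution does not tile the image evenly, which is why 227 is the size usually quoted.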

<hr>

**Notable Characteristics**
- AlexNet is significantly larger than LeNet-5, with approximately 60 million parameters.
- It was one of the first successful applications of deep learning in computer vision, particularly in the ImageNet challenge.
- It uses ReLU (Rectified Linear Unit) activation functions, which were a key factor in its performance.
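
To see where the roughly 60 million parameters come from, the following sketch counts weights and biases per layer. It assumes a single-GPU layout in which every filter sees all input channels, two 4096-unit fully connected layers, and a 1000-unit output. The original two-GPU split connects some layers to only half of the channels, so its count is somewhat lower. The helper names are illustrative.

```python
def conv_params(f, in_channels, out_channels):
    """Weights plus one bias per filter for an f x f convolution."""
    return (f * f * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_in + 1) * n_out

# Assumption: ungrouped convolutions (single-GPU variant of the architecture).
alexnet_params = {
    "conv1": conv_params(11, 3, 96),
    "conv2": conv_params(5, 96, 256),
    "conv3": conv_params(3, 256, 384),
    "conv4": conv_params(3, 384, 384),
    "conv5": conv_params(3, 384, 256),
    "fc6": dense_params(6 * 6 * 256, 4096),
    "fc7": dense_params(4096, 4096),
    "fc8": dense_params(4096, 1000),
}

alexnet_total = sum(alexnet_params.values())
fc_total = sum(v for k, v in alexnet_params.items() if k.startswith("fc"))
fc_share = fc_total / alexnet_total

print(f"total parameters: {alexnet_total:,}")
print(f"share in fully connected layers: {fc_share:.1%}")
```

Under these assumptions the total is about 62 million, and well over 90% of it sits in the three fully connected layers.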

**Special Features**
- The network was trained across two GPUs because it did not fit in the memory of a single GPU at the time.
- Included Local Response Normalization (LRN), a technique not commonly used in modern architectures.

**Historical Impact**
- AlexNet significantly influenced the computer vision community, showcasing the effectiveness of deep learning in this field.
- It paved the way for further advances in deep learning across various applications, not just in computer vision.

[Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks]

<hr>

## VGG-16 (2015)

<br>

<div style="text-align:center">
    <img src="media/vgg16.png">
    <caption><font color="red">Parameters: ~ 138M</font></caption>
</div>

<hr>

- **Initial Input:** The input is an image of size 224x224x3.


- **Convolutional Layers:**
    - The first two layers use 64 filters of size 3x3, with same padding and a stride of 1, maintaining the dimension at 224x224x64.
    - This pattern of using 3x3 filters with a stride of 1 and same padding is consistent throughout the network.
- **Max Pooling Layers:**
    - A max pooling layer follows each set of convolutional layers.
    - Uses a 2x2 window with a stride of 2, halving the dimensions each time.
    - First pooling layer reduces the dimension to 112x112x64.


- **Increasing Filter Depth:**
    - After each of the first three pooling layers, the number of filters doubles $(64 \rightarrow 128 \rightarrow 256 \rightarrow 512)$; the final set of convolutional layers stays at 512.
    - Each pooling layer is preceded by a set of two or three convolutional layers that all use the same number of filters.

- **Fully Connected Layers:**
    - After the final pooling layer, the network flattens the output and passes it through three fully connected layers.
    - The first two fully connected layers have 4096 units each.
    - The final layer uses a softmax activation function for classification into 1000 classes.
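
Because every convolution keeps the spatial size and every pooling layer halves it, the whole network can be described by a short table. The sketch below assumes the standard VGG-16 layout of five blocks with 2, 2, 3, 3, and 3 convolutional layers; the name `VGG16_BLOCKS` is illustrative.

```python
# (number of 3x3 same-padded conv layers, filters per layer) for each block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size, channels = 224, 3
vgg16_shapes = []
for n_convs, filters in VGG16_BLOCKS:
    # Same padding with stride 1 keeps height and width; only channels change.
    channels = filters
    # The 2x2 max pool with stride 2 halves height and width.
    size //= 2
    vgg16_shapes.append((size, size, channels))
    print(f"{n_convs} convs + pool -> {vgg16_shapes[-1]}")

# Size of the vector fed into the first fully connected layer.
vgg16_flattened = size * size * channels
print("flattened:", vgg16_flattened)
```

The final feature map is 7x7x512, so the first fully connected layer receives 25,088 inputs.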

<hr>

**Network Size:**
- VGG-16 is a large network with approximately 138 million parameters.
- The '16' in VGG-16 refers to the number of layers with weights (convolutional and fully connected layers).
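
Both numbers can be recovered from the block layout. This is a minimal sketch assuming the standard configuration (five blocks of 3x3 convolutions, then fully connected layers of 4096, 4096, and 1000 units); the variable names are illustrative.

```python
# (number of 3x3 conv layers, filters per layer) for each block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_UNITS = [4096, 4096, 1000]

conv_total = 0
conv_layers = 0
in_channels = 3
for n_convs, filters in VGG16_BLOCKS:
    for _ in range(n_convs):
        # 3x3 weights over all input channels, plus one bias per filter.
        conv_total += (3 * 3 * in_channels + 1) * filters
        in_channels = filters
        conv_layers += 1

fc_total = 0
n_in = 7 * 7 * 512  # flattened output of the last pooling layer
for n_out in FC_UNITS:
    fc_total += (n_in + 1) * n_out
    n_in = n_out

vgg16_weight_layers = conv_layers + len(FC_UNITS)
vgg16_total = conv_total + fc_total

print("layers with weights:", vgg16_weight_layers)
print(f"conv parameters: {conv_total:,}")
print(f"fully connected parameters: {fc_total:,}")
print(f"total: {vgg16_total:,}")
```

The count gives 13 convolutional plus 3 fully connected layers, and the fully connected layers again account for most of the parameters.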


**Design Uniformity:**
- VGG-16 is known for its uniform architecture, making it easy to understand and modify.
- The network architecture systematically <font color='red'>doubles the number of filters (until it reaches 512) while halving the height and width of the feature maps.</font> For instance:
    - Filters: $(64 \rightarrow 128 \rightarrow 256 \rightarrow 512 \rightarrow 512)$
    - Feature Maps: $(224 \rightarrow 112 \rightarrow 56 \rightarrow 28 \rightarrow 14 \rightarrow 7)$


**VGG-19 Variant:**
- There is a larger variant known as VGG-19, but VGG-16 is more commonly used due to its comparable performance and slightly smaller size.
    

**Historical Significance:**
- VGG-16, developed by Karen Simonyan and Andrew Zisserman, is one of the most influential architectures in deep learning, particularly for its simplicity and uniformity.
- It was one of the key architectures that highlighted the effectiveness of deep convolutional networks in image recognition tasks.

[Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition]

<hr>