# Convolutional Neural Networks

Author: Binghen Wang

Last Updated: 6 Dec, 2022

<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Deep Learning Basics</a> |
    <a href="./Deep Learning Optimization.ipynb">Optimization</a>
    <br>
    <b>CNN navigation:</b> <a href="./Object Detection.ipynb">Object Detection</a> |
    <a href="./Face Recognition.ipynb">Face Recognition</a> |
    <a href="./Visualization and Neural Style Transfer.ipynb">Visualization and Neural Style Transfer</a>
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
Basics:
- [Standard Notation](#StN)
- [Convolutional Layer](#CL)
- [Pooling Layer](#PL)
- [$1\times1$ Convolutional Layer](#1b1)

Classic Models:
- [LeNet-5](#LeN)
- [AlexNet](#AlexN)
- [VGG-16](#VGG16)

Advanced Models:
- [ResNets](#ResN)
- [Inception Network](#IncN)
- [MobileNet](#MobN)
- [EfficientNet](#EffN)

## Basics

<a name ='StN'></a>
### Standard Notation (for a layer of CNN)

<table>
    <tr>
        <th colspan = "2" style="font-size:16px"> Shape parameters </th>
    </tr>
    <tr>
        <th> Object</th>
        <th> Meaning</th>
    </tr>
    <tr>
        <td> $$f^{[l]}$$ </td>
        <td> filter size </td>
    </tr>
    <tr>
        <td> $$p^{[l]}$$ </td>
        <td> padding </td>
    </tr>
    <tr>
        <td> $$s^{[l]}$$ </td>
        <td> stride </td>
    </tr>
    <tr>
        <td> $$n^{[l]}_C$$ </td>
        <td> number of channels </td>
    </tr>
    <tr>
        <td> $$n^{[l]}_H$$ </td>
        <td> height of 2D layer </td>
    </tr> 
    <tr>
        <td> $$n^{[l]}_W$$ </td>
        <td> width of 2D layer </td>
    </tr>
</table>

<table>
    <tr>
        <th colspan = "3" style="font-size:16px"> Model parameters and values </th>
    </tr>
    <tr>
        <th> Object</th>
        <th> Shape </th>
        <th> Meaning</th>
    </tr>
    <tr>
        <td> N/A </td>
        <td> $$f^{[l]} \times f^{[l]} \times n^{[l-1]}_C$$ </td>
        <td> one filter/kernel </td>
    </tr>
    <tr>
        <td> $$a^{[l-1]}$$ </td>
        <td> $$n^{[l-1]}_H \times n^{[l-1]}_W \times n^{[l-1]}_C$$ </td>
        <td> input to layer $l$ </td>
    </tr>
    <tr>
        <td> $$a^{[l]}$$ </td>
        <td> $$n^{[l]}_H \times n^{[l]}_W \times n^{[l]}_C$$ </td>
        <td> output of layer $l$ / activations of layer $l$ </td>
    </tr>
    <tr>
        <td> $$A^{[l-1]}$$ </td>
        <td> $$m \times n^{[l-1]}_H \times n^{[l-1]}_W \times n^{[l-1]}_C$$  </td>
        <td> vectorized input to layer $l$ </td>
    </tr>
    <tr>
        <td> $$A^{[l]}$$ </td>
        <td> $$m \times n^{[l]}_H \times n^{[l]}_W \times n^{[l]}_C$$  </td>
        <td> vectorized output of layer $l$ </td>
    </tr> 
    <tr>
        <td> $$W^{[l]}$$ </td>
        <td> $$f^{[l]} \times f^{[l]} \times n^{[l-1]}_C \times n^{[l]}_C$$ </td>
        <td> weights of layer $l$ (all $n^{[l]}_C$ filters stacked) </td>
    </tr>
    <tr>
        <td> $$b^{[l]}$$ </td>
        <td> $$1 \times 1 \times 1 \times n^{[l]}_C$$ </td>
        <td> bias of layer $l$ </td>
    </tr>
</table>

#### Relationship between the terms

$$
\begin{equation}
n^{[l]}_H = \left\lfloor \frac{n^{[l-1]}_H + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor \\
n^{[l]}_W = \left\lfloor \frac{n^{[l-1]}_W + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor
\end{equation}
$$

$$
\begin{equation}
z^{[l]} = a^{[l-1]} *_{\text{conv}} W^{[l]} + b^{[l]} \\
a^{[l]} = g^{[l]}(z^{[l]})
\end{equation}
$$
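
As a rough illustration of the two equations above, the following NumPy sketch computes $a^{[l]}$ from $a^{[l-1]}$ for a single example, using the shapes from the notation tables. The function name `conv_forward`, the default ReLU for $g^{[l]}$, and the explicit loops are assumptions made for readability, not part of the notes. Like most deep learning code, it computes a cross-correlation (the filter is not flipped).

```python
import numpy as np

def conv_forward(a_prev, W, b, pad=0, stride=1, g=lambda z: np.maximum(z, 0)):
    """Naive forward pass of one convolutional layer for a single example.

    a_prev: (n_H_prev, n_W_prev, n_C_prev)  input a^{[l-1]}
    W:      (f, f, n_C_prev, n_C)           all filters of the layer
    b:      (1, 1, 1, n_C)                  one bias per filter
    g:      activation function (ReLU by default; an assumption)
    """
    f, _, _, n_C = W.shape
    n_H_prev, n_W_prev, _ = a_prev.shape

    # Output size from the floor formula above.
    n_H = (n_H_prev + 2 * pad - f) // stride + 1
    n_W = (n_W_prev + 2 * pad - f) // stride + 1

    # Zero-pad height and width only, never the channel axis.
    a_pad = np.pad(a_prev, ((pad, pad), (pad, pad), (0, 0)))

    z = np.zeros((n_H, n_W, n_C))
    for i in range(n_H):
        for j in range(n_W):
            top, left = i * stride, j * stride
            patch = a_pad[top:top + f, left:left + f, :]  # (f, f, n_C_prev)
            # Sum over height, width, and input channels for every filter at once.
            z[i, j, :] = np.tensordot(patch, W, axes=([0, 1, 2], [0, 1, 2]))
    z += b.reshape(n_C)
    return g(z)

a_prev = np.ones((6, 6, 3))
W = np.ones((3, 3, 3, 5))
b = np.zeros((1, 1, 1, 5))
print(conv_forward(a_prev, W, b).shape)                    # (4, 4, 5)
print(conv_forward(a_prev, W, b, pad=1, stride=2).shape)   # (3, 3, 5)
```

A vectorized version over $m$ examples would add a leading batch axis, matching the shapes of $A^{[l-1]}$ and $A^{[l]}$ in the table.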

<a name ='CL'></a>
### Convolutional Layer


<div style = "text-align: center;">
    <img src="./images/a CNN layer.png" style="width:80%;" >
</div>


#### Padding

**Problems/properties without padding**:
- Shrinking output
- Pixels near the edges of the image contribute to fewer outputs than central pixels, so edge information is under-used

**Valid convolution (no padding)**:

$$
(h \times w) * (f \times f) \rightarrow (h-f+1) \times (w-f+1)
$$

**Same convolution (output-size-same-as-input-size padding)**:

Set
$$
\begin{equation}
 h + 2p - f +  1 = h \\
 w + 2p - f +  1 = w 
\end{equation}
$$
and solve for $p$. We get:
$$
p = \frac{f-1}{2}
$$
where $f$ is usually odd (to allow for symmetric padding).
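
A quick numerical check of this result, as a minimal sketch: the helper name `same_padding` and the example size $h = 28$ are chosen here for illustration only.

```python
def same_padding(f):
    """Padding p that keeps height and width unchanged at stride 1."""
    if f % 2 == 0:
        # An even f would need p = (f - 1) / 2, which is not an integer.
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2

h = 28  # any input height works; 28 is just an example
for f in (1, 3, 5, 7):
    p = same_padding(f)
    # Output height of a stride-1 convolution: h + 2p - f + 1
    assert h + 2 * p - f + 1 == h
    print(f"f={f}: p={p}")
```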


#### Stride
Denote the stride as $s$. Then the output shape of a strided convolution is:
$$
\left( \left\lfloor \frac{h+2p-f}{s}\right\rfloor + 1 \right) \times \left( \left\lfloor \frac{w+2p-f}{s}\right\rfloor + 1 \right)
$$
<div style = "text-align: center;">
    <img src="./images/padding and strided convolution.jpg" style="width:80%;" >
</div>
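
The output-shape formula for a padded, strided convolution can be sketched as a small helper. The name `conv_output_shape` and the example sizes are illustrative assumptions.

```python
def conv_output_shape(h, w, f, p=0, s=1):
    """Height and width after convolving an h x w input with an f x f filter."""
    # Integer division implements the floor in the formula.
    return (h + 2 * p - f) // s + 1, (w + 2 * p - f) // s + 1

print(conv_output_shape(6, 6, f=3))            # (4, 4)  valid convolution
print(conv_output_shape(6, 6, f=3, p=1))       # (6, 6)  same convolution
print(conv_output_shape(7, 7, f=3, s=2))       # (3, 3)  strided convolution
print(conv_output_shape(8, 8, f=3, s=2))       # (3, 3)  floor drops the partial window
```

The last call shows why the floor is needed: with an $8\times8$ input, the filter cannot take a fourth full step, so the incomplete window is discarded.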

<a name ='PL'></a>
### Pooling Layer

**Hyperparameters**:
- $f$: filter size
- $s$: stride
- type of pooling: max or average

<div style = "text-align: center;">
    <img src="./images/pooling layers.png" style="width:80%;" >
</div>
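
A minimal NumPy sketch of a pooling layer with the hyperparameters listed above. The function name `pool2d` and the loop-based implementation are assumptions for illustration. Pooling is applied to each channel independently and has no parameters to learn.

```python
import numpy as np

def pool2d(a, f=2, s=2, mode="max"):
    """Max or average pooling over an (n_H, n_W, n_C) activation volume."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce = np.max if mode == "max" else np.mean

    n_H_prev, n_W_prev, n_C = a.shape
    n_H = (n_H_prev - f) // s + 1
    n_W = (n_W_prev - f) // s + 1

    out = np.zeros((n_H, n_W, n_C))
    for i in range(n_H):
        for j in range(n_W):
            window = a[i * s:i * s + f, j * s:j * s + f, :]
            # Reduce over height and width; channels stay separate.
            out[i, j, :] = reduce(window, axis=(0, 1))
    return out

a = np.arange(16, dtype=float).reshape(4, 4, 1)
print(pool2d(a, mode="max")[:, :, 0])      # [[ 5.  7.] [13. 15.]]
print(pool2d(a, mode="average")[:, :, 0])  # [[ 2.5  4.5] [10.5 12.5]]
```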

<a name='1b1'></a>
### $1 \times 1$ Convolutions (Networks in Networks)
- Provides a way to adjust the number of channels $n_C$ (increase it, decrease it, or keep it the same)
- Can be used as a 'bottleneck' layer to save on computation (as in the Inception network).
<div style = "text-align: center;">
    <img src="./images/1x1 conv.png" style="width:80%;" >
</div>
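
One way to see why this is called a network in network: a $1\times1$ convolution applies the same small fully connected layer to the channel vector at every pixel. The sketch below assumes ReLU as the activation and uses illustrative sizes ($28\times28\times192$ reduced to 16 channels); the function name `conv1x1` is hypothetical.

```python
import numpy as np

def conv1x1(a, W, b):
    """1x1 convolution followed by ReLU.

    a: (n_H, n_W, n_C_in)   input volume
    W: (n_C_in, n_C_out)    the 1x1 filters, one column per filter
    b: (n_C_out,)           one bias per filter
    """
    # Matrix multiplication broadcasts over the two spatial axes, so every
    # pixel's channel vector is multiplied by the same weight matrix.
    z = a @ W + b
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
a = rng.normal(size=(28, 28, 192))
W = rng.normal(size=(192, 16))
b = np.zeros(16)

out = conv1x1(a, W, b)
print(out.shape)  # (28, 28, 16)
```

Height and width are unchanged; only the channel dimension goes from 192 to 16.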

## Classic Models

<a name ='LeN'></a>
### LeNet-5
**Note:** 
- To make sure smaller images match the input dimensions, padding could be used.
- By convention, a convolutional layer together with the subsampling layer that follows it (e.g., max pooling) is counted as a single **layer** of a CNN.
- The original input in the LeNet-5 has only **one channel** (grayscale), so the shape is $32\times32\times1$. Here we use an RGB input (with three channels) for illustration purposes.
- The third layer in the visualization uses a convolution layer with 120 filters of size $5\times5\times16$, resulting in activations of size $1\times1\times120$. This is then flattened. An **alternative** equivalent way is to first flatten the $5\times5\times16$ layer input into an input of size $400$ and then use a fully connected layer with $120$ units.
- In the original paper, the hidden layers used sigmoid/tanh activation functions; nowadays ReLU usually works better.

**Key Patterns:**
- Around 60k parameters.
- Going from left to right, $n_H$, $n_W \downarrow$ and $n_C \uparrow$
- conv + pool + conv + pool + conv/fc + fc + fc/output
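
The parameter count and the conv/fc equivalence mentioned in the notes can be checked with a short calculation. This is a sketch under assumptions: the filter counts (6 and 16), the 84-unit layer, and the 10 output classes are taken from the commonly described LeNet-5 layout and are not stated in the text above, and every filter is assumed to connect to all input channels.

```python
def conv_params(f, n_c_in, n_c_out):
    """Weights plus one bias per filter for an f x f convolution."""
    return (f * f * n_c_in + 1) * n_c_out

def fc_params(n_in, n_out):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_in + 1) * n_out

def lenet5_params(in_channels=1):
    # Pooling layers are omitted: they have no parameters.
    return (
        conv_params(5, in_channels, 6)   # conv 1 (6 filters: assumed)
        + conv_params(5, 6, 16)          # conv 2 (16 filters: assumed)
        + conv_params(5, 16, 120)        # conv/fc layer with 120 filters
        + fc_params(120, 84)             # fc (84 units: assumed)
        + fc_params(84, 10)              # output (10 classes: assumed)
    )

print(lenet5_params(in_channels=1))  # 61706, grayscale input
print(lenet5_params(in_channels=3))  # 62006, RGB input as in the illustration

# 120 filters of size 5x5x16 have as many parameters as a 400 -> 120 fc layer.
assert conv_params(5, 16, 120) == fc_params(5 * 5 * 16, 120)
```

Both totals are consistent with the "around 60k" figure; almost all of the parameters sit in the conv/fc and fc layers.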

<div style = "text-align: center;">
    <img src="./images/LeNet5.png" style="width:100%;" >
</div>

**Key Sections of the original paper**: **II**, III

<a name ='AlexN'></a>
### AlexNet
**Key Patterns:**
- Around 60 million parameters.
- Use of ReLU activation functions.
- Trained using ImageNet.
- Many hyperparameters (different choices of filter sizes, padding options, strides, etc.).

<div style = "text-align: center;">
    <img src="./images/AlexNet.png" style="width:100%;" >
</div>

<a name ='VGG16'></a>
### VGG-16
**Key Patterns:**
- Around 138 million parameters.
- Going from left to right, $n_H$, $n_W \downarrow$ and $n_C \uparrow$
- Made of 16 layers with weights (13 convolutional and 3 fully connected); pooling layers are not counted.

<div style = "text-align: center;">
    <img src="./images/VGG16.png" style="width:90%;" >
</div>
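
The "around 138 million" figure and the layer count can be reproduced with a short calculation. The configuration below is an assumption based on the standard VGG-16 layout (a $224\times224\times3$ input, $3\times3$ same convolutions, $2\times2$ max pooling with stride 2, and a 1000-class output); these details appear in the figure rather than in the text.

```python
# Number of filters per conv layer; "M" marks a 2x2 max pool with stride 2.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]

def vgg16_summary(h=224, w=224, c=3):
    """Return (total parameters, number of layers with weights)."""
    params = 0
    weight_layers = 0
    for item in VGG16_CONV:
        if item == "M":
            h, w = h // 2, w // 2          # pooling halves height and width
        else:
            params += (3 * 3 * c + 1) * item  # 3x3 filters plus biases
            c = item
            weight_layers += 1
    n_in = h * w * c                        # flatten before the fc layers
    for n_out in VGG16_FC:
        params += (n_in + 1) * n_out
        n_in = n_out
        weight_layers += 1
    return params, weight_layers

print(vgg16_summary())  # (138357544, 16)
```

Under these assumptions, roughly 124 million of the parameters belong to the three fully connected layers.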

## Advanced Models

<a name ='ResN'></a>
### ResNets
#### Residual Block
- The activations from two layers earlier, $a^{[l]}$, are added to $z^{[l+2]}$ **before applying the ReLU**.
- The addition works because, in a residual block, $z^{[l+2]}$ and $a^{[l]}$ have the **same dimensions**.

<div style = "text-align: center;">
    <img src="./images/residual block.png" style="width:100%;" >
</div>

#### Residual Network
<br>

<div style = "text-align: center;">
    <img src="./images/residual network.png" style="width:90%;" >
</div>
<br>

**ResNet helps with the vanishing and exploding gradient problems, which makes it possible to train deeper networks:**
- In theory, a deeper network could result in a smaller training error. In practice, this is usually not the case for a plain network, because the optimization algorithm has a much harder time training it.
- With ResNet, the training error can keep going down as the depth of the network increases.
<div style = "text-align: center;">
    <img src="./images/ResNet helps training.jpeg" style="width:90%;" >
</div>

**Key properties**:
- It is easy for residual blocks to learn identity functions. 
    - Assume ReLU activation is used for the hidden layers of the network, so $a^{[l]}\geq 0$.
      $$
      a^{[l+2]} = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]} + a^{[l]})
      $$
      Set $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$ and we have:
      $$
      a^{[l+2]} = g(a^{[l]}) = a^{[l]}
      $$
      where the last equality holds because ReLU leaves non-negative inputs unchanged.
- If $a^{[l+2]}$ and $a^{[l]}$ do not have matching dimensions, use the following equation instead:
    $$
    a^{[l+2]} = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]} + W^{[l+2]}_s a^{[l]})
    $$
    where $W^{[l+2]}_s$ can be learnt or pre-set to perform zero padding.
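
Both properties can be checked numerically. The sketch below uses fully connected layers to mirror the equations above; the function names, the layer sizes, and the choice of `np.eye` as a zero-padding $W_s$ are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2, W_s=None):
    """Two-layer residual block: a^{[l]} -> a^{[l+1]} -> a^{[l+2]}."""
    a_l1 = relu(W1 @ a_l + b1)
    # Identity shortcut, or a projection when the dimensions differ.
    shortcut = a_l if W_s is None else W_s @ a_l
    # The shortcut is added before the final ReLU.
    return relu(W2 @ a_l1 + b2 + shortcut)

rng = np.random.default_rng(0)
a_l = relu(rng.normal(size=4))      # non-negative, as after a ReLU
W1, b1 = rng.normal(size=(4, 4)), rng.normal(size=4)

# With W^{[l+2]} = 0 and b^{[l+2]} = 0 the block reduces to the identity.
identity_out = residual_block(a_l, W1, b1, np.zeros((4, 4)), np.zeros(4))
assert np.allclose(identity_out, a_l)

# Mismatched dimensions (4 -> 6): W_s = eye(6, 4) zero-pads a^{[l]}.
W_s = np.eye(6, 4)
padded_out = residual_block(a_l, W1, b1, np.zeros((6, 4)), np.zeros(6), W_s=W_s)
assert np.allclose(padded_out[:4], a_l) and np.allclose(padded_out[4:], 0)
```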

<a name ='IncN'></a>
### Inception Network
#### Raw inception module
**Idea:** Instead of choosing among several different types of filters, one could apply multiple types of filters in one layer (and use same convolution to make sure the output could be stacked together).  
<div style = "text-align: center;">
    <img src="./images/inception module.png" style="width:90%;" >
</div>

#### Computation cost of a raw inception module
One drawback of an inception module is its high computational cost. Take the above illustration as an example. The cost of the $5\times 5$ CONV branch (counting only the number of multiplications; the number of additions is similar) is:
$$
\begin{align}
 (n_H \times n_W \times n_{C,\text{out}}) \times ( n_{\text{kernel},H} \times n_{\text{kernel},W} \times n_{C,\text{in}}) = & 28\times28 \times 32 \times 5\times 5\times 192 \\ = & 120,422,400
\end{align}
$$


#### Bottleneck layer
In comparison, if we add a bottleneck layer ($1\times1$ convolution layer) in the middle, the computation cost can be reduced to:
$$
28\times28\times16\times 192 + 28\times28\times 32 \times (5\times 5\times 16) = 12,443,648
$$

<div style = "text-align: center;">
    <img src="./images/bottleneck layer.png" style="width:90%;" >
</div>
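
The two multiplication counts above can be reproduced with a small helper. The function name `conv_mults` is an assumption; the sizes are the ones used in the example ($28\times28\times192$ input, 16 bottleneck channels, 32 output channels).

```python
def conv_mults(n_h, n_w, n_c_out, f, n_c_in):
    """Multiplications for an f x f same convolution producing n_h x n_w x n_c_out."""
    return n_h * n_w * n_c_out * f * f * n_c_in

# Direct 5x5 convolution: 192 -> 32 channels.
direct = conv_mults(28, 28, 32, 5, 192)

# 1x1 bottleneck (192 -> 16 channels) followed by the 5x5 convolution (16 -> 32).
bottleneck = conv_mults(28, 28, 16, 1, 192) + conv_mults(28, 28, 32, 5, 16)

print(direct)                          # 120422400
print(bottleneck)                      # 12443648
print(round(bottleneck / direct, 3))   # 0.103
```

With these sizes the bottleneck version needs roughly a tenth of the multiplications.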

#### Inception module (with bottleneck layers)
- Add bottleneck layers before applying the $3\times3$ and $5\times5$ convolutions.
- Add a bottleneck layer after the MAX POOL layer to reduce the number of channels.

<div style = "text-align: center;">
    <img src="./images/inception module with bottleneck layers.png" style="width:90%;" >
</div>

#### GoogLeNet
- **MAX POOL layers** are used in between blocks of inception modules to reduce the height and width of activations.
- **Auxiliary classifiers** (each ending in a softmax activation) are attached to intermediate layers of the network. They are **utilized only during training** and removed during inference. The loss of each classifier is weighted and then added to the total loss. This has a regularization effect and helps prevent overfitting, because it ensures that predictions made from the earlier layers are not too bad.

<div style = "text-align: center;">
    <img src="./images/GoogLeNet.png" style="width:90%;" >
</div>

<a name ='MobN'></a>
### MobileNet
**Key features:**
- Low computational cost at deployment
- Useful for mobile and embedded vision applications
- Makes use of depthwise separable convolutions to reduce computational cost

#### Depthwise separable convolution
<div style = "text-align: center;">
    <img src="./images/normal conv.png" style="width:90%;" >
</div>

<br>

The computational cost of a normal convolution:
$$
3\times 3 \times 192 \times \underbrace{4 \times 4 \times 5}_{\text{output size}} = 138,240
$$

<br>


<div style = "text-align: center;">
    <img src="./images/depthwise separable conv.png" style="width:90%;" >
</div>

<br>

The computational cost of a depthwise separable convolution:
$$
3 \times 3 \times \underbrace{4\times 4 \times 192}_{\text{output size}} + 192 \times \underbrace{4 \times 4 \times 5}_{\text{output size}} = 43,008
$$
Ratio of the costs: $\frac{43,008}{138,240} \approx 0.31$

In general, for filter size $f$ and $n_{C, \text{out}}$ output channels, the ratio of the depthwise separable cost to the normal cost is:
$$
 \text{Ratio of costs} = \frac{1}{n_{C, \text{out}}} + \frac{1}{f^2}
$$
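
The costs and the general ratio can be verified numerically. The function names are illustrative assumptions; the sizes are those of the example ($3\times3$ filters, 192 input channels, a $4\times4\times5$ output).

```python
import math

def normal_conv_cost(f, n_c_in, out_h, out_w, n_c_out):
    """Multiplications for a standard convolution."""
    return f * f * n_c_in * out_h * out_w * n_c_out

def depthwise_separable_cost(f, n_c_in, out_h, out_w, n_c_out):
    """Multiplications for a depthwise step followed by a pointwise (1x1) step."""
    depthwise = f * f * out_h * out_w * n_c_in      # one f x f filter per input channel
    pointwise = n_c_in * out_h * out_w * n_c_out    # 1x1 convolution mixes the channels
    return depthwise + pointwise

normal = normal_conv_cost(3, 192, 4, 4, 5)
separable = depthwise_separable_cost(3, 192, 4, 4, 5)
print(normal, separable)  # 138240 43008

# Dividing both terms by the normal cost gives 1 / n_c_out + 1 / f**2.
assert math.isclose(separable / normal, 1 / 5 + 1 / 3**2)
```

The ratio depends only on $n_{C,\text{out}}$ and $f$, so with many output channels it approaches $1/f^2$.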

#### MobileNet V1
<div style = "text-align: center;">
    <img src="./images/MobileNetv1.png" style="width:90%;" >
</div>

#### MobileNet V2
Two improvements upon the first version:
- Added an expansion step ($1\times1$ convolution) before the depthwise convolution, allowing for more computation inside each bottleneck block.
- Added a skip connection to aid optimization.

<div style = "text-align: center;">
    <img src="./images/MobileNetv2.png" style="width:90%;" >
</div>


<a name ='EffN'></a>
### EfficientNet
- Helps you choose an efficient combination of input resolution, network depth, and network width for deployment on different devices.
- [Link](https://arxiv.org/pdf/1905.11946.pdf) to the paper.