# Convolutional Neural Networks<a id="Top"></a>

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
Table of Contents
<ul>
<li>1. <a href="#Part_1">What is a convolutional neural network (CNN)?</a></li>
<li>2. <a href="#Part_2">Components of a CNN</a></li>
    <ul>
        <li> 2.1 <a href="#Part_2_1">Convolutional layer</a></li>
        <li> 2.2 <a href="#Part_2_2">Pooling layer</a></li>
        <li> 2.3 <a href="#Part_2_3">Fully connected layer</a></li>
        <li> 2.4 <a href="#Part_2_4">A Keras example</a></li>
    </ul>
<li>3. <a href="#Part_3">Notable CNN architectures</a></li>
</ul>
</font>
</div>

In [2]:
import random
__counter__ = random.randint(0, int(2e9))  # randint needs integer bounds
from IPython.display import HTML, display

# 1. What is a convolutional neural network?<a id="Part_1"></a>
<a href="#Top">Back to page top</a>

__A convolutional neural network (CNN, or convnet) is a type of neural network model that is almost universally used in computer vision applications such as image recognition__. Before diving into what CNNs are, let's first see what happens when we apply an MLP, i.e. a fully connected neural network, to the MNIST dataset. The MNIST dataset is a collection of images of handwritten digits. Each image consists of 28$\times$28 grayscale pixels.

<img src="./images/fig_CNN-01.png" width=200>

Let's say we use a rather humble and small fully connected network with the following layer structure:

1. Input layer - 784 input neurons, one per pixel of the image (28$\times$28$=$784).
2. Hidden layer - 784 hidden neurons, for example.
3. Output layer - 10 neurons, because we have 10 digits.

The weights that connect the input and hidden layers then have 784$\times$784$=$614656 trainable parameters. Likewise, the weights that connect the hidden and output layers have 784$\times$10$=$7840 parameters. In total, ignoring biases, this small network has roughly 620K parameters. With an additional 784-neuron hidden layer, the number of parameters easily exceeds 1 million. And this is just the beginning. Suppose we are going to deal with larger images, say 640$\times$480$=$307200 pixels. Keeping the 784-neuron hidden layer, our 3-layer fully connected network would end up with roughly 240 million parameters to train (307200$\times$784 for the first weight matrix alone). The point is that using fully connected networks for image-related tasks is a rather bad idea, because the size of the network scales up quickly with the image dimension and the number of layers. The network would require a huge amount of memory to store the weights and the intermediate quantities needed for backpropagation. On top of that, the huge number of parameters will very likely lead to overfitting.
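The parameter counts for these fully connected networks can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `dense_param_count` is hypothetical, and biases are ignored by default so that the numbers line up with the weight counts quoted in the text.

```python
def dense_param_count(layer_sizes, include_bias=False):
    """Count trainable parameters of a fully connected network.

    layer_sizes lists the number of neurons per layer, input first.
    Biases are ignored by default to match the rough counts in the text.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out          # weight matrix between two layers
        if include_bias:
            total += n_out             # one bias per output neuron
    return total


mnist_small = dense_param_count([784, 784, 10])          # one hidden layer
mnist_deeper = dense_param_count([784, 784, 784, 10])    # two hidden layers
large_image = dense_param_count([640 * 480, 784, 10])    # 640x480 input

print(mnist_small, mnist_deeper, large_image)
```

Changing only the first entry of `layer_sizes` shows how quickly the count grows with the image size, since the first weight matrix scales linearly with the number of input pixels.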

The real deal breaker, I think, is that it is difficult for fully connected networks to learn features of an input image at different scales and generalize them to arbitrary positions of the image. This weakness of fully connected networks is exactly where CNNs shine. 

# 2. Components of a CNN<a id="Part_2"></a>
<a href="#Top">Back to page top</a>

A typical CNN looks like the following:

<img src="./images/fig_CNN-02.svg?10" width=850>

The main types of layers in this architecture are the convolutional layer (Conv), the pooling layer (Pooling), and the fully connected layer (FC). In more detail:

- __Input__: The input layer takes the raw pixel values of the image. The dimension of the input is `(batch, height, width, channels)`, where `batch` is the image batch size, `height` and `width` are the height and width of the input, and `channels` is the number of color channels of the image. A normal color image usually has 3 channels: R, G, B. Grayscale images have only one channel.
- __Convolutional layer__: This layer computes the output of neurons that are connected to local regions in the input, each computing a dot product between its weights and the small region of the input volume it is connected to.
- __Pooling layer__: This layer performs a downsampling operation along the spatial dimensions.
- __Flatten__: This is an operation that flattens the output of the last Conv/Pooling layer so that it can be attached to the following FC layer.
- __Fully connected layer__: This is just the dense layer we have encountered before. The last FC layer computes class probabilities, such as those of the 10 categories of the MNIST dataset.

Note: as the figure indicates, CNN layers (except for FC) are arranged in 3 dimensions: __height__, __width__, and __depth__ (or __channel__).

## 2.1 Convolutional layer<a id="Part_2_1"></a>
<a href="#Top">Back to page top</a>

The convolutional layer is the most important building block of a CNN. Before we look at how it works, let's ask the question: __What is convolution?__ Convolution is a mathematical operation that __slides__ one function over another and measures the integral of their pointwise multiplication. To be more specific, the convolution of functions $f(t)$ and $g(t)$ is a new function $(f*g)(t)$ defined as

$$ (f*g)(t) \equiv \int_{-\infty}^\infty f(\tau)\,g(t-\tau)d\tau  $$

One can interpret this formula as a weighted average of the function $f(\tau)$ at $t$ where the weighting is given by $g(-\tau)$ __shifted by amount $t$__. As $t$ changes, the weighting function can sample different parts of the input function. In other words, $g(t-\tau)$ acts like a sliding window picking up $f(\tau)$ at different locations. This is exactly what is happening in a convolutional layer. 
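For sampled signals the integral becomes a sum, $(f*g)[n] = \sum_k f[k]\,g[n-k]$. The following pure-Python sketch implements that sum directly; the function name `conv1d_full` and the sample values are made up for illustration.

```python
def conv1d_full(f, g):
    """Discrete convolution: out[n] = sum_k f[k] * g[n - k]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for k in range(len(f)):
            # g is only defined on indices 0 .. len(g) - 1
            if 0 <= n - k < len(g):
                out[n] += f[k] * g[n - k]
    return out


signal = [1.0, 2.0, 3.0]     # plays the role of f
kernel = [0.0, 1.0, 0.5]     # plays the role of g, the sliding window
result = conv1d_full(signal, kernel)
print(result)                # [0.0, 1.0, 2.5, 4.0, 1.5]
```

Each output entry is a weighted sum of the part of `signal` that the shifted `kernel` currently overlaps, which is the sliding-window picture described above.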

Central to any convolutional layer is the idea of __receptive fields__ or __filters__. Neurons in a convolutional layer are not connected to every single pixel in the input, but only to pixels in their receptive fields. The convolution operation extracts patches from the input, one per filter position, and applies the same transformation (the same filter weights) to all of these patches, producing an output map. The following animation is a demonstration. Here we are sliding a 3$\times$3 filter over the 6$\times$6 input, producing a 4$\times$4 output map:

<img src="./images/fig_CNN-03.gif" width=450>

At each step, the filter weights are multiplied element-wise with the pixel values covered by the filter. The results are summed and sent to the corresponding output neuron and its activation function. Although in this example there is only one filter, in practice one typically employs a collection of filters, resulting in a 3D filter tensor with dimension `(filter_height, filter_width, filter_depth)`.
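The sliding-filter operation can be sketched in plain Python as below. The input values and the vertical-edge filter are hypothetical, and the filter is applied without flipping (a cross-correlation), which is how most deep learning libraries implement their convolutional layers.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding, stride 1."""
    h, w = len(image), len(image[0])
    fh, fw = len(kernel), len(kernel[0])
    output = []
    for i in range(h - fh + 1):
        row = []
        for j in range(w - fw + 1):
            # Element-wise product of the filter and the patch, then sum.
            acc = 0
            for a in range(fh):
                for b in range(fw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 6x6 input (values 0..35) and a 3x3 vertical-edge filter.
image = [[6 * r + c for c in range(6)] for r in range(6)]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))   # 4 4
```

Because the same nine weights are reused at all 16 positions, the layer needs far fewer parameters than a dense layer connecting 36 inputs to 16 outputs.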

Clearly, the dimension of the receptive field determines the output geometry of a convolutional layer. Let's assume that the input and the filter are both square. Denote the linear dimension of the input by $W$ and the linear dimension of the filter by $F$. Then the output's linear dimension is $W-F+1$. However, there are two more parameters that control the output size: __zero-padding__ and __stride__.

### 2.1.1 Zero-padding

In our example where the input is a 6$\times$6 image, there are only 16 places where one can center the 3$\times$3 filter. As a result, the output shrinks from 36 pixels to 16 pixels. This is sometimes referred to as the border effect. If we wish to preserve the input geometry, i.e. have the output keep the same dimension as the input, we can use zero-padding.

Padding consists of adding an appropriate number of rows and columns of zero-valued pixels on each side of the input feature map, so that the convolution filter can be centered on every input pixel.

<img src="./images/fig_CNN-04.gif" width=450>

The figure demonstrates zero-padding with $P=1$. Sliding the 3$\times$3 filter over the padded input produces an output with the same 6$\times$6 geometry as the original input. Note that near the edges, the filter only picks up partial information from the input.
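As a rough illustration of the padding step, the sketch below surrounds a 6$\times$6 input with one ring of zeros and checks the resulting sizes. The helper `zero_pad` and the all-ones input are assumptions made for this example only.

```python
def zero_pad(image, p):
    """Surround a 2D list with `p` rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]           # top rows of zeros
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)   # left/right zeros
    padded.extend([0] * width for _ in range(p))       # bottom rows of zeros
    return padded


image = [[1] * 6 for _ in range(6)]      # hypothetical 6x6 input of ones
padded = zero_pad(image, 1)

padded_size = len(padded)                # 6 + 2*1 = 8
output_size = padded_size - 3 + 1        # 3x3 filter, stride 1 -> 6
print(padded_size, output_size)
```

The padded input is 8$\times$8, so a 3$\times$3 filter fits at 6$\times$6 positions and the output has the geometry of the unpadded input.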


### 2.1.2 Stride

Stride is yet another factor that influences the output size. The description of convolution so far has assumed that the centers of successive filter positions are adjacent. But the distance between two successive filter positions is a parameter of the convolution, called its stride, which defaults to 1. Strides larger than 1 give strided convolutions.

<img src="./images/fig_CNN-05.gif" width=450>

The above figure illustrates a convolution with stride $S=2$ over a 5$\times$5 input with zero-padding $P=1$. It can be seen from the animation that the filter moves 2 pixels at a time as it slides, skipping every other position. The resulting output is a 3$\times$3 feature map. Therefore, using a stride $S > 1$ effectively shrinks the output size.

### 2.1.3 Summary

1. To summarize, the following factors determine the output size of convolution operation:
    - Filter size $F$.
    - Zero-padding size $P$. 
    - Stride $S$.

    It can be shown that if the input has linear dimension $W$ (assuming square shape again), 
    then the output feature map's linear dimension is
    $$ \frac{1}{S}\,(W-F+2P) + 1 $$
    In principle, one should get an integer from the formula. 
    
2. In TensorFlow or Keras, zero-padding is controlled by the argument `padding='valid'` or `padding='same'` (often written `VALID` and `SAME`):
    - __If set to `padding='valid'`, the convolutional layer does not use zero padding__, and, in order 
    to center the filter, may ignore some rows and columns depending on the stride.

    - __If set to `padding='same'`, the convolutional layer uses zero padding if necessary__. In this case, 
    the number of output neurons is equal to the number of input neurons divided by the stride, rounded 
    up (for example, `ceil(13/5) = 3`). Then zeros are added as evenly as possible around the inputs.
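The output-size formula and the two padding modes can be checked numerically. The sketch below uses hypothetical helper names; it floors the division when the formula does not give an integer, which corresponds to ignoring the rows and columns that the filter cannot reach.

```python
import math


def conv_output_size(w, f, p=0, s=1):
    """Output length from (W - F + 2P) / S + 1, flooring any remainder."""
    return (w - f + 2 * p) // s + 1


def same_output_size(w, s=1):
    """Output length under 'same' padding: ceil(W / S)."""
    return math.ceil(w / s)


print(conv_output_size(6, 3))            # 6x6 input, 3x3 filter, no padding -> 4
print(conv_output_size(6, 3, p=1))       # same input with P = 1 -> 6
print(conv_output_size(5, 3, p=1, s=2))  # 5x5 input, P = 1, S = 2 -> 3
print(same_output_size(13, 5))           # the ceil(13/5) example -> 3
```

The first three calls correspond to the three animations in this section.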

## 2.2 Pooling layer<a id="Part_2_2"></a>
<a href="#Top">Back to page top</a>

The role of a pooling layer is to aggressively downsample its input, much like strided convolutions. This operation reduces the computational load, the memory usage, and the number of parameters in subsequent layers (thereby reducing the risk of overfitting). Reducing the input image size also makes the neural network tolerate small shifts of the image.

Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a receptive field (filter). One has to specify the following for the pooling layer:
- Filter size $F$.
- Zero-padding size $P$. 
- Stride $S$.
    
In Keras, the default settings for `MaxPooling1D`, `MaxPooling2D`, `MaxPooling3D` are:
- $F=2$.
- `padding='valid'`.
- $S=F$.

With these defaults, each spatial dimension of the input is halved. Other choices are of course possible.

Most importantly, unlike a convolutional layer, a neuron in a pooling layer has no weights. All it does is aggregate the inputs using an aggregation function such as the max or mean. An example is given in the following figure:

<img src="./images/fig_CNN-06.png" width=500>

In this case, $F=2$, $S=2$, and there is no zero-padding. As a result, the last column of pixels is ignored. The 2$\times$2 filter reads in four pixels `[1, 5, 3, 2]` and outputs the largest one, `5`. This is what __MaxPooling2D__ does in Keras. In addition to MaxPooling, Keras also offers __AveragePooling__, __GlobalAveragePooling__, and __GlobalMaxPooling__ layers. Each of these corresponds to a particular way of aggregating pixel values.
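Max pooling is simple enough to sketch in a few lines. The input below is hypothetical (it is not the one from the figure), but its top-left window contains the same four values `1, 5, 3, 2`, and its odd width shows how the last column is dropped when no padding is used.

```python
def max_pool2d(image, f=2, s=None):
    """Max pooling without padding; stride defaults to the filter size."""
    s = f if s is None else s
    h, w = len(image), len(image[0])
    pooled = []
    for i in range(0, h - f + 1, s):
        row = []
        for j in range(0, w - f + 1, s):
            window = [image[i + a][j + b] for a in range(f) for b in range(f)]
            row.append(max(window))   # no weights, just an aggregation
        pooled.append(row)
    return pooled


# Hypothetical 4x5 input; its top-left window holds 1, 5, 3, 2.
image = [[1, 5, 0, 2, 9],
         [3, 2, 4, 1, 9],
         [7, 0, 6, 3, 9],
         [1, 8, 2, 2, 9]]

pooled = max_pool2d(image)
print(pooled)   # [[5, 4], [8, 6]]
```

The column of 9s never appears in the output because no 2$\times$2 window with stride 2 reaches it. Replacing `max` with an average would give the behavior of an average-pooling layer.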

## 2.3 Fully connected layer<a id="Part_2_3"></a>
<a href="#Top">Back to page top</a>

The last output from the convolutional/pooling combo is sent to a fully connected layer. Before doing so, remember that the output of the combo is a 3D tensor, so it has to be reshaped/flattened before it is attached to the dense layer. In Keras, this is done by adding a `keras.layers.Flatten()` operation before the dense layer. Neurons in a dense layer have full connections to all activations in the previous convolutional/pooling layer, just as in regular neural networks.

The number of outputs in the last fully connected layer is task dependent. For classification problems, it is the number of categories, with a softmax (multiclass classification) or sigmoid (binary classification) activation function. For regression problems, on the other hand, there is only one output without any activation.
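To make the flattening and the output activations concrete, here is a small standard-library sketch. The tensor shape `(7, 7, 64)` and the raw scores are assumed values for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Squash a single score into (0, 1) for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


flattened_size = 7 * 7 * 64                   # a (7, 7, 64) tensor flattens to 3136 values
probs = softmax([2.0, 1.0, 0.1])              # hypothetical scores for 3 classes
print(flattened_size, probs, sigmoid(0.0))
```

Softmax preserves the ordering of the scores while making them sum to one, which is why the largest score is read off as the predicted class.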

## 2.4 A Keras example<a id="Part_2_4"></a>
<a href="#Top">Back to page top</a>

In this section, we'll use Keras to demonstrate the construction of a simple CNN for the MNIST dataset. The input tensor for MNIST has dimension `(batch, height, width, channel) = (batch_size, 28, 28, 1)`. The CNN has two sets of convolutional/MaxPooling2D combinations. The first convolutional layer has 32 3$\times$3 filters with `padding='same'`; the second convolutional layer has 64 3$\times$3 filters, also with `padding='same'`. The MaxPooling2D layers use the defaults $F=2$ and $S=F=2$ without padding.

In [1]:
from keras import layers
from keras import models

Using TensorFlow backend.


In [2]:
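# Stack the layers one after another.
model = models.Sequential()
# First conv/pool pair: 32 filters of size 3x3; 'same' padding keeps 28x28.
model.add(layers.Conv2D(32, 3, activation='relu', padding='same', input_shape=(28,28,1)))
model.add(layers.MaxPool2D(2))    # 28x28 -> 14x14
# Second conv/pool pair: 64 filters of size 3x3.
model.add(layers.Conv2D(64, 3, activation='relu', padding='same'))
model.add(layers.MaxPool2D(2))    # 14x14 -> 7x7
# Flatten the (7, 7, 64) tensor before the dense layers.
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))   # one probability per digit

model.summary()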

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 28, 28, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 14, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 64)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3136)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                200768    
_________________________________________________________________
dense_2 (Dense)              (None, 10)                650       
Total para

Using the above parameters, we can verify the output dimension of each convolutional and pooling layer:

conv2d_1: $$ \frac{1}{S}\,(W-F+2P)+1 = (28-3+2)+1 = 28 $$

max_pooling2d_1: $$ \frac{1}{S}\,(W-F+2P)+1 = \frac{1}{2}(28-2)+1 = 14 $$

conv2d_2: $$ \frac{1}{S}\,(W-F+2P)+1 = (14-3+2)+1 = 14 $$

max_pooling2d_2: $$ \frac{1}{S}\,(W-F+2P)+1 = \frac{1}{2}(14-2)+1 = 7 $$
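The `Param #` column of the summary can be checked in the same spirit. The sketch below assumes the usual counting rule, one weight per filter entry and input channel plus one bias per filter (and likewise one bias per dense neuron); the helper names are hypothetical.

```python
def conv2d_params(f, c_in, c_out):
    """F x F x C_in weights plus one bias, for each of C_out filters."""
    return (f * f * c_in + 1) * c_out


def dense_params(n_in, n_out):
    """One weight per input plus one bias, for each output neuron."""
    return (n_in + 1) * n_out


counts = {
    "conv2d_1": conv2d_params(3, 1, 32),       # 320
    "conv2d_2": conv2d_params(3, 32, 64),      # 18496
    "dense_1": dense_params(7 * 7 * 64, 64),   # 200768
    "dense_2": dense_params(64, 10),           # 650
}
total = sum(counts.values())
print(counts, total)
```

Most of the parameters sit in the first dense layer, not in the convolutional layers, which reflects the weight sharing of the filters.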

# 3. Notable CNN architectures<a id="Part_3"></a>
<a href="#Top">Back to page top</a>

We have introduced basic building blocks of CNNs in the previous sections. Although there are not too many components, implementation of the layers is highly task dependent. Over the years, various CNN architectures have been developed, leading to amazing progress in the field of image recognition. In the following, we will summarize several famous CNN models which have achieved very good results in competitions such as ImageNet.

## 3.1 LeNet-5

<a href='http://yann.lecun.com/exdb/lenet/'>LeNet-5</a> was created by Yann LeCun in 1998 for handwritten digit recognition on MNIST. It is one of the most widely known CNN architectures. It is composed of the layers shown in the following table:

<img src='./images/fig_CNN-LeNet5.png' width=550>

The original 28$\times$28 MNIST images are zero-padded to 32$\times$32 pixels and normalized before being fed to the network. The entire network does not use any zero-padding, so the outputs continue to shrink.

## 3.2 AlexNet

<a href='http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf'>AlexNet</a> was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. The model won the 2012 ImageNet ILSVRC challenge. The following table presents the architecture.

<img src='./images/fig_CNN-AlexNet.png' width=680>

It is quite similar to LeNet-5, only a lot deeper and larger. To avoid overfitting, the authors used two regularization techniques:

- Applying 50% dropout during training to layers F8 and F9.
- Implementing data augmentation by randomly shifting images by various offsets, horizontal flipping, and changing brightness.

Inspired by biological neurons, the authors use a competitive normalization step, called *local response normalization*, immediately after the ReLU step of layers C1 and C3. This form of normalization makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps.
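A rough sketch of local response normalization at a single pixel location is given below. It divides each activation $a_i$ by $(k + \alpha \sum_j a_j^2)^\beta$, where the sum runs over the `n` neighboring feature maps. The default values `n=5`, `k=2`, `alpha=1e-4`, `beta=0.75` are the ones reported in the AlexNet paper; the function name and the input activations are assumptions for illustration.

```python
def local_response_norm(activations, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize one pixel's activations across neighboring feature maps.

    `activations` holds the values of all feature maps at a single location.
    """
    depth = len(activations)
    half = n // 2
    normalized = []
    for i, a in enumerate(activations):
        lo, hi = max(0, i - half), min(depth - 1, i + half)
        # Sum of squares over the neighboring feature maps (clipped at the ends).
        energy = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(a / (k + alpha * energy) ** beta)
    return normalized


# Hypothetical activations: one strongly active map among weak neighbors.
acts = [1.0, 1.0, 50.0, 1.0, 1.0, 1.0, 1.0, 1.0]
normed = local_response_norm(acts)
print(normed)
```

The weak activation at index 1 sits next to the strong one and is scaled down more than the equally weak activation at index 7, which is the inhibition effect described above.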

## 3.3 GoogLeNet

The <a href='http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf'>GoogLeNet</a> was developed by Christian Szegedy et al. from Google Research. The network won the 2014 ILSVRC competition. Its layers are presented in the following diagram

<img src='./images/fig_CNN-GoogLeNet.png' width=600>

As can be seen from the figure, this network is much deeper than previous CNNs, yet it has a total of only $\sim$6 million parameters, roughly ten times fewer than AlexNet. The depth is one of the reasons for GoogLeNet's great performance. Another reason is the design of subnetworks called Inception modules, which let the network use its parameters efficiently. The architecture of an Inception unit is depicted as follows:

<img src='./images/fig_CNN-InceptionModule.png' width=580>

The notation "3$\times$3$+$1(S)" means that the layer uses a 3$\times$3 filter, stride 1, and SAME padding. A few remarks:
1. Every single layer uses a stride of 1 and SAME padding. The outputs all have the same dimension as the inputs. This makes it possible to concatenate all the outputs along the *depth* dimension.
2. The second level of convolutional layers uses different filter sizes 1$\times$1, 3$\times$3, 5$\times$5. The different sizes allow the filters to capture patterns at different scales.
3. The layers with 1$\times$1 filters before the 3$\times$3 and 5$\times$5 convolutions also serve as *bottleneck layers*, meaning they reduce dimensionality.
4. Each pair of convolutional layers such as [1$\times$1, 3$\times$3] and [1$\times$1, 5$\times$5] acts as a single powerful convolutional layer, capable of capturing more complex patterns.
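The effect of a bottleneck layer can be estimated by counting weights. In the sketch below, all channel counts are assumed values chosen for illustration, and biases are ignored.

```python
def conv_weights(f, c_in, c_out):
    """Number of weights (biases ignored) in an F x F convolution."""
    return f * f * c_in * c_out


# Assumed channel counts, chosen only for illustration.
c_in = 192          # depth of the module's input
c_reduce = 16       # depth after the 1x1 bottleneck
c_out = 32          # number of 5x5 filters

direct = conv_weights(5, c_in, c_out)
bottleneck = conv_weights(1, c_in, c_reduce) + conv_weights(5, c_reduce, c_out)

# With stride 1 and SAME padding, branch outputs share height and width,
# so concatenation along depth simply adds their channel counts.
branch_depths = [64, 128, 32, 32]   # assumed outputs of the four branches
concat_depth = sum(branch_depths)

print(direct, bottleneck, concat_depth)   # 153600 15872 256
```

With these numbers, the [1$\times$1, 5$\times$5] pair needs roughly a tenth of the weights of a direct 5$\times$5 convolution on the full input depth.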



## 3.4 ResNet

<a href='https://arxiv.org/pdf/1512.03385.pdf'>Residual Network</a>, or ResNet, was developed by Kaiming He et al. in 2015. The extremely deep 152-layer model had a top-5 error rate under 3.6% and went on to become the winner of the 2015 ILSVRC challenge. The key to training such a deep network without suffering from the vanishing gradient problem is to use skip connections, or residual units.

<img src='./images/fig_CNN-ResNet-01.png' width=520>

As the above figure indicates, the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. Generally speaking, when training a network, the goal is to make it model a target function $h(x)$. By adding the input $x$ to the output, the network is forced to model $f(x) = h(x) - x$ instead. Why is this useful? At early stages of training, the layers' output values are close to zero because the network has not learned anything yet. With the skip connection, the unit then simply outputs a copy of its input, i.e. it initially models the identity function, and training converges faster when the target is reasonably close to the identity. Moreover, adding many skip connections enables the network to start making progress even if several layers have not started learning yet, as depicted by the following figure:

<img src='./images/fig_CNN-ResNet-02.png' width=520>
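The behavior of skip connections at the start of training can be mimicked with a toy example. In this sketch, `untrained_layers` is a hypothetical stand-in for freshly initialized layers whose output is exactly zero, which is an idealization of "close to zero".

```python
def residual_unit(x, f):
    """Return f(x) + x element-wise, i.e. a layer stack with a skip connection."""
    return [fx + xi for fx, xi in zip(f(x), x)]


def untrained_layers(x):
    """Stand-in for freshly initialized layers whose output is near zero."""
    return [0.0 for _ in x]


signal = [0.5, -1.2, 3.0]
out = signal
for _ in range(10):                 # stack ten residual units
    out = residual_unit(out, untrained_layers)
print(out)                          # [0.5, -1.2, 3.0]
```

Without the skip connection, the same stack would output zeros and the input signal would be lost after the first unit. With it, the signal reaches the top of the stack unchanged.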

The ResNet architecture is summarized by the next diagram:

<img src='./images/fig_CNN-ResNet-03.png' width=620>

Overall, its structure is simpler than that of GoogLeNet. One sees a large stack of residual units. Each residual unit is composed of two convolutional layers with batch normalization, ReLU activation function, 3$\times$3 filters, and stride 1 with SAME padding (hence spatial dimensions are preserved).

The number of filters is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2). When this happens the inputs cannot be added directly to the outputs of the residual unit since they don’t have the same dimension (for example, this problem affects the skip connection represented by the dashed arrow in the above figure). To solve this problem, the inputs are passed through a 1$\times$1 convolutional layer with stride 2 and the right number of output filters, as shown in the following

<img src='./images/fig_CNN-ResNet-04.png' width=500>


## 3.5 VGG16

<a href='https://arxiv.org/pdf/1409.1556.pdf'>VGG16</a> is a CNN model proposed by K. Simonyan and A. Zisserman from the University of Oxford. VGG16 improves on AlexNet by replacing the large filters (11$\times$11 and 5$\times$5 in the first and second convolutional layers) with multiple 3$\times$3 filters stacked one after another. VGG16's architecture is depicted below:

<img src='./images/fig_CNN-VGG16.png' width=800>

Overall, the model construction is quite straightforward. No special units such as residual connections or Inception modules are used, just plain convolutional and pooling layers. The most notable property is that the depths/channels of the convolutional layers are much larger.

VGG16's convolutional layers use 3$\times$3 filters, stride 1, and SAME padding so that the spatial dimension is preserved after convolution operation. MaxPooling layers implement 2$\times$2 filters with stride 2. The 1$\times$1 convolutional filters can be seen as a linear transformation of the input depths/channels. 
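One way to see why stacked 3$\times$3 filters are attractive is to compare their receptive field and weight count with those of a single larger filter. The sketch below assumes stride 1 and the same number of channels in every layer; the value `channels = 64` is for illustration only.

```python
def stacked_receptive_field(filter_sizes):
    """Receptive field of consecutive stride-1 convolutions."""
    field = 1
    for f in filter_sizes:
        field += f - 1              # each layer widens the field by F - 1
    return field


def stacked_weights(filter_sizes, channels):
    """Weights (biases ignored) when every layer keeps `channels` channels."""
    return sum(f * f * channels * channels for f in filter_sizes)


channels = 64   # assumed channel count, for illustration only
print(stacked_receptive_field([3, 3]), stacked_weights([3, 3], channels))   # 5 73728
print(stacked_receptive_field([5]), stacked_weights([5], channels))         # 5 102400
```

Two stacked 3$\times$3 layers see the same 5$\times$5 region as one 5$\times$5 layer, with fewer weights and an extra nonlinearity in between.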

Two major drawbacks of VGG16:
1. It is extremely slow to train.
2. The trained network weights are quite large, which is costly in terms of disk space and bandwidth.

## 3.6 InceptionV3

<a href='https://arxiv.org/pdf/1512.00567.pdf'>Inception V3</a> is a variant of GoogLeNet. The network is also constructed using the concept of the Inception module. Through several iterations of improvements, Inception V3 pushes the network's performance (in terms of speed and accuracy) to an even higher level. A version of Inception V3 adopted in <a href='https://www.mdpi.com/2072-4292/10/7/1119'>this work</a> is shown below:

<img src='./images/fig_CNN-InceptionV3.png' width=850>

Just like GoogLeNet, the Inception unit adopted in Inception V3 performs convolutions on an input with 3 different filter sizes (1$\times$1, 3$\times$3, 5$\times$5). To reduce computational cost, an extra 1$\times$1 convolution is added before the 3$\times$3 and 5$\times$5 convolutions to limit the number of input channels. This is called __dimension reduction__ in the <a href='http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf'>original paper</a>.

On top of the Inception module, the Inception V3 explores the following (partial quoting, for details, see the <a href='https://arxiv.org/pdf/1512.00567.pdf'>Inception V3</a> paper):
1. Avoid representational bottlenecks, especially early in the network. 
2. Higher dimensional representations are easier to process locally within a network. 
   Increasing the activations per tile in a convolutional network allows for more disentangled features. 
   The resulting networks will train faster.
3. Spatial aggregation can be done over lower dimensional embeddings without much or any loss 
   in representational power.
4. Balance the width and depth of the network.   

---
__Figure Credits__:
- Pictures/Tables of LeNet, AlexNet, GoogLeNet, ResNet are taken from the book __A. Géron: Hands-On Machine Learning__.
- Picture of the VGG16 is taken from <a href='https://neurohive.io/en/popular-networks/vgg16/'>VGG16 – Convolutional Network for Classification and Detection</a>.
- Architecture overview of the InceptionV3 model is taken from the paper <a href='https://www.mdpi.com/2072-4292/10/7/1119'>Very Deep Convolutional Neural Networks for Complex Land Cover Mapping Using Multispectral Remote Sensing Imagery</a> by Masoud Mahdianpari et al.