# Convolutional Neural Networks

## Foundations of Convolutional Neural Networks

### Computer Vision

Computer vision is one of the fields advancing most rapidly thanks to deep learning. Applications of computer vision that use deep learning include self-driving cars and face recognition.

Rapid progress in computer vision is enabling applications that weren't possible a few years ago. Deep learning techniques for computer vision keep evolving and producing new architectures, which can also help in areas other than computer vision. For example, Andrew Ng took some ideas from computer vision and applied them to speech recognition.

Examples of computer vision problems include:
* Image classification.
* Object detection $\rightarrow$ detect objects and localize them.
* Neural style transfer $\rightarrow$ change the style of an image using another image.

One of the challenges of computer vision is that images can be extremely large while a fast and accurate algorithm is required.

For example, a $1000 \times 1000$ RGB image corresponds to $1000 \times 1000 \times 3 = 3$ million input features for a fully connected neural network. If the first hidden layer contains $1000$ units, then the weight matrix is $1000 \times 3$ million, which means $3$ billion parameters in the first layer alone, and that is computationally very expensive!
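The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `dense_weight_count` is illustrative, the image is assumed to have 3 channels (RGB), and biases are not counted.

```python
def dense_weight_count(height, width, channels, hidden_units):
    """Number of weights (biases excluded) in a fully connected layer
    whose input is the flattened image."""
    n_inputs = height * width * channels
    return n_inputs * hidden_units


# Assumption: an RGB image, so 3 channels.
n_inputs = 1000 * 1000 * 3
n_weights = dense_weight_count(1000, 1000, 3, 1000)

print(n_inputs)   # 3000000
print(n_weights)  # 3000000000
```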

One solution is to build the network with **convolution layers** instead of fully connected layers.

### Edge Detection Example

The convolution operation is one of the fundamental building blocks of a CNN. A classic example of convolution is edge detection in images.

The early layers of a CNN might detect edges, the middle layers parts of objects, and the later layers put these parts together to produce an output.

In an image we can detect vertical edges, horizontal edges, or edges in any direction. An example of a convolution operation that detects vertical edges:

* on the left there is a grayscale image (10 is brighter than 0)
* the convolution operator is denoted by $*$
* the second element is called *filter* or *kernel* $\rightarrow$ intuition: a vertical edge is a region with bright pixels on the left and dark pixels on the right
* each element of the resulting matrix is the sum of the elements of the filter, each multiplied by the corresponding element in the "overlapping" square of the left matrix (see red and green elements)

<img src="w1_edge_detection.PNG" width="600px" />

In Python the convolution operation is provided by `tf.nn.conv2d` (TensorFlow) or `Conv2D` (Keras).

Example of a convolution:

<img src="convolution-example-matrix.gif" width="600px" />

Consider instead a dark-to-light input image, with columns $[0,\dots,0,10,\dots,10]$: applying the convolution results in a gray-dark-gray image, with columns $[0,-30,0]$. If the direction of the transition does not matter, the absolute value of the output is usually taken.
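The two cases (light-to-dark and dark-to-light) can be reproduced with a small pure-Python sketch. The helper `conv2d_valid` is a hypothetical name, not a library function, and the vertical filter with columns $1, 0, -1$ is an assumption here, since the filter itself only appears in the figure.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and without flipping,
    summing the element-wise products at each position."""
    n_rows, n_cols = len(image), len(image[0])
    f_rows, f_cols = len(kernel), len(kernel[0])
    output = []
    for i in range(n_rows - f_rows + 1):
        row = []
        for j in range(n_cols - f_cols + 1):
            total = 0
            for a in range(f_rows):
                for b in range(f_cols):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Assumed vertical edge filter: bright on the left, dark on the right.
vertical_filter = [[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]]

light_to_dark = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
dark_to_light = [[0, 0, 0, 10, 10, 10] for _ in range(6)]

edges = conv2d_valid(light_to_dark, vertical_filter)
flipped_edges = conv2d_valid(dark_to_light, vertical_filter)

print(edges[0])                            # [0, 30, 30, 0]
print(flipped_edges[0])                    # [0, -30, -30, 0]
print([abs(v) for v in flipped_edges[0]])  # [0, 30, 30, 0]
```

Every row of the output is identical because every row of the input is identical; the sign of the response only encodes the direction of the transition.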

A horizontal filter would be made of rows $$\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]$$

Other filters have been proposed, such as the Sobel filter $\left[\begin{array}{ccc}1 & 0 & -1\\2 & 0 & -2\\1 & 0 & -1\end{array}\right]$ or the Scharr filter $\left[\begin{array}{ccc}3 & 0 & -3\\10 & 0 & -10\\3 & 0 & -3\end{array}\right]$, which put more weight on the central pixels to make the detector more robust.

With deep learning we don't need to handcraft these numbers: we can treat them as weights and learn them. The network can then learn horizontal, vertical, angled, or any other type of edge detector automatically instead of having them designed by hand:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right]$$

### Padding

When an $n \times n$ matrix is convolved with an $f \times f$ filter the result is a $(n-f+1) \times (n-f+1)$ matrix, therefore one issue with convolutions is that the resulting image is smaller than the input image.

A second issue is that the filter barely touches the corners and edges of the input image, while the pixels in the center are processed many times, so information near the border is underused.

When the convolution operation is applied multiple times, the image keeps shrinking and a lot of information is lost along the way.

For these reasons deep neural networks typically use **padding**: the input matrix is augmented with an additional border of *zeros*. If the border thickness is $p$ then the resulting matrix has dimension $(n+2p-f+1)\times(n+2p-f+1)$.

*Valid* convolutions do not apply padding, while in *same* convolutions the padding is chosen so that the output size equals the input size, which means $p = \frac{f-1}{2}$.
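As a minimal sketch of zero padding and of the *same* rule, the helpers below (`zero_pad` and `same_padding` are illustrative names, not library functions) pad a small matrix and check that a $3 \times 3$ filter would then give back the input size. An odd filter size is assumed.

```python
def zero_pad(matrix, p):
    """Return a copy of a 2-D list surrounded by a border of p zeros."""
    width = len(matrix[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in matrix:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * width for _ in range(p))
    return padded


def same_padding(f):
    """Padding p = (f - 1) / 2 that keeps the output size equal to the input size."""
    if f % 2 == 0:
        raise ValueError("this sketch assumes an odd filter size")
    return (f - 1) // 2


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
f = 3
p = same_padding(f)
padded = zero_pad(image, p)

print(p)                             # 1
print(len(padded), len(padded[0]))   # 5 5
print(len(padded) - f + 1)           # 3, the same as the input size
```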

By convention in computer vision $f$ is usually odd. One reason is that an odd-sized filter has a central position; another is that $p = \frac{f-1}{2}$ is then an integer, so the padding is symmetric.

### Strided Convolutions

A strided convolution fixes a number $s$, the stride, which defines how many pixels the filter jumps at each step. A stride of $s=2$ means that the filter moves across the input matrix by $2$ cells at a time.

The resulting matrix has dimension

$$\bigg(\frac{(n+2p-f)}{s}+1\bigg) \times \bigg(\frac{(n+2p-f)}{s}+1\bigg)$$

If this quantity is not an integer, it is rounded down with the `floor()` function, denoted by $\lfloor \dots \rfloor$.
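The size formula with the floor can be written as a one-line helper. This is a sketch with an illustrative function name; it relies on Python's integer division `//`, which rounds down for the non-negative values that occur here.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output: floor((n + 2p - f) / s) + 1."""
    if n + 2 * p < f:
        raise ValueError("the filter does not fit inside the padded input")
    return (n + 2 * p - f) // s + 1


print(conv_output_size(7, 3, p=0, s=2))  # 3
print(conv_output_size(6, 3, p=0, s=2))  # 2  (3 / 2 = 1.5 is rounded down to 1)
print(conv_output_size(6, 3, p=1, s=1))  # 6  (same convolution)
```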

In math textbooks the convolution operation flips the filter before applying it to the input matrix:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right] \rightarrow \left[\begin{array}{ccc}w_9 & w_8 & w_7\\w_6 & w_5 & w_4\\w_3 & w_2 & w_1\end{array}\right]$$

But in deep learning there is no flipping. The operation is still called convolution, even though strictly speaking it is a cross-correlation.
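The difference between the two operations can be illustrated with a short sketch. The function names are hypothetical and square inputs are assumed; `flip_kernel` reverses both the rows and the columns, exactly as in the matrices above.

```python
def flip_kernel(kernel):
    """Flip a 2-D kernel along both axes (a rotation by 180 degrees)."""
    return [row[::-1] for row in kernel[::-1]]


def cross_correlate(image, kernel):
    """What deep learning calls convolution: no flipping, no padding."""
    f = len(kernel)
    size = len(image) - f + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(size)]
            for i in range(size)]


def convolve_textbook(image, kernel):
    """Textbook convolution: flip the kernel first, then cross-correlate."""
    return cross_correlate(image, flip_kernel(kernel))


kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
# A single bright pixel in the top-left corner picks out one kernel entry.
impulse = [[1, 0, 0],
           [0, 0, 0],
           [0, 0, 0]]

print(flip_kernel(kernel))                 # [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(cross_correlate(impulse, kernel))    # [[1]]
print(convolve_textbook(impulse, kernel))  # [[9]]
```

When the filter values are learned, the flip makes no practical difference: the network would simply learn the flipped weights.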

### Convolutions Over Volume

When working with color images we add a depth dimension given by the number of channels (3 channels for RGB). An $(n \times n \times n_c)$ input image is convolved with an $(f \times f \times n_c)$ filter:

<img src="conv_over_volumns.png" width="600px" />

Each number in the filter is multiplied by the corresponding number in the input image, and all the products are summed up to give one output value.

It is possible to detect horizontal edges in only one channel by setting the filter to zero in the other channels:

$$\underbrace{\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]}_R \quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_G\quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_B$$

It is possible to use multiple filters at the same time, for example one vertical and one horizontal edge detector. The outputs are stacked together, with depth equal to the number of filters used, for example:

<img src="mult_filters.png" width="600px" />

$$
(6\times6\times3) \text{ input image} \rightarrow 
 \biggl\{\begin{array}{c}(3\times3\times3)\text{ "vertical" filter} \rightarrow (4\times4) \text{ matrix}\\
(3\times3\times3)\text{ "horizontal" filter} \rightarrow (4\times4) \text{ matrix}\end{array} \biggr\} \rightarrow
(4\times4\times2) \text{ output}
$$
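The shapes in this example can be verified with a pure-Python sketch. The layout `image[row][column][channel]`, the helper names, and the test image (bright left half, dark right half in every channel) are assumptions made only for illustration.

```python
def conv_volume(image, filt):
    """Convolve an n x n x n_c volume with an f x f x n_c filter (no padding).
    The products over all positions and channels are summed into one number."""
    n, f, n_c = len(image), len(filt), len(filt[0][0])
    size = n - f + 1
    return [[sum(image[i + a][j + b][c] * filt[a][b][c]
                 for a in range(f) for b in range(f) for c in range(n_c))
             for j in range(size)]
            for i in range(size)]


def conv_multi_filter(image, filters):
    """Apply several filters and stack the outputs along the last axis."""
    maps = [conv_volume(image, filt) for filt in filters]
    size = len(maps[0])
    return [[[m[i][j] for m in maps] for j in range(size)] for i in range(size)]


def to_volume(kernel_2d, n_c):
    """Copy a 2-D kernel across n_c channels."""
    return [[[value] * n_c for value in row] for row in kernel_2d]


vertical = to_volume([[1, 0, -1], [1, 0, -1], [1, 0, -1]], 3)
horizontal = to_volume([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], 3)

# 6 x 6 x 3 image: bright (10) on the left, dark (0) on the right.
image = [[[10 if j < 3 else 0] * 3 for j in range(6)] for i in range(6)]

output = conv_multi_filter(image, [vertical, horizontal])
print(len(output), len(output[0]), len(output[0][0]))  # 4 4 2
print(output[0][1])  # [90, 0]: strong vertical edge, no horizontal edge
```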

### One Layer of a Convolutional Network

In a layer of a CNN the filters play the same role as the weights $w^{[l]}$ of a NN. To the output of each filter's convolution we add a constant $b^{[l]} \in \mathbb{R}$ (a different one per filter) with broadcasting, so that the augmented output takes the role of $z^{[l]}$, to which we apply the non-linearity to get $a^{[l]} = g(z^{[l]})$, the final "stacked" output.

<img src="example_layer.png" width="700px" />

With ten $(3\times3\times3)$ filters we need $3*3*3*10+10=280$ parameters.

Notice that the number of parameters does not depend on the size of the input: it stays the same as long as the filter size and the number of filters are the same. That makes the layer less prone to overfitting.
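The count above follows from a simple rule: each filter has $f \times f \times n_c^{[l-1]}$ weights plus one bias. A minimal sketch, with an illustrative function name:

```python
def conv_layer_param_count(f, n_c_prev, n_filters):
    """Weights plus biases of a conv layer with square f x f filters."""
    weights_per_filter = f * f * n_c_prev
    return (weights_per_filter + 1) * n_filters


# Ten 3 x 3 x 3 filters, as in the text. The input height and width
# do not appear anywhere in the formula.
print(conv_layer_param_count(f=3, n_c_prev=3, n_filters=10))  # 280
```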

Summary of notation for a convolutional layer $l$:
* Hyperparameters:
 * $f^{[l]}$ = filter size
 * $p^{[l]}$ = padding (default is zero) $\rightarrow$ note that padding does not apply to the depth!
 * $s^{[l]}$ = stride
 * $n_c^{[l]}$ = number of channels/filters

* Input (height, width, channels): $(n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]})$

* Output (height, width, channels): $(n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$
 * where $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$, same for $n_W^{[l]}$

* Each filter is $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]})$, since it has to match the number of channels of the input.

* The activations $a^{[l]}$ correspond to the outputs; in vectorized notation (batch gradient descent with $m$ examples) $A^{[l]}$ has shape $(m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$

* The weights are $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]})$, where the last quantity is the total number of filters of layer $l$.

* The bias is a vector of size $n_c^{[l]}$, one entry for each filter, but it is easier to express it as a $(1 \times 1 \times 1 \times n_c^{[l]})$ tensor $\rightarrow$ a multidimensional array.

### Simple convolution network example

The dimension of the layers follows the rule $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$:

<img src="simple_cnn_example.png" width="800px" />

Finally we flatten the last volume into a $7*7*40=1960$ column vector and feed it to a logistic or softmax unit (depending on whether the output is binary or has multiple classes).

In the example the image gets smaller after each layer, which is a common trend in CNNs.
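The shape rule can be traced layer by layer with a short sketch. The input size and the hyperparameters of the three layers below are assumptions chosen so that the final volume is the $7 \times 7 \times 40$ one mentioned above; the figure's actual values are not stated in the text.

```python
def conv_output_size(n, f, p, s):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1


def trace_shapes(n, n_c, layers):
    """Return the (height, width, channels) of the input and of every layer output."""
    shapes = [(n, n, n_c)]
    for f, p, s, n_filters in layers:
        n = conv_output_size(n, f, p, s)
        shapes.append((n, n, n_filters))
    return shapes


# Hypothetical layers as (filter size, padding, stride, number of filters).
layers = [(3, 0, 1, 10), (5, 0, 2, 20), (5, 0, 2, 40)]
shapes = trace_shapes(n=39, n_c=3, layers=layers)

print(shapes)  # [(39, 39, 3), (37, 37, 10), (17, 17, 20), (7, 7, 40)]
height, width, channels = shapes[-1]
print(height * width * channels)  # 1960
```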

There are 3 types of layer in a convolutional network:
* Convolution
* Pooling
* Fully connected

### Pooling layers

Besides convolution layers, CNNs often use **pooling layers** to reduce the size of the representation, speed up computation, and make some of the detected features more robust:

<img src="max_pooling.png" width="600px" />

Notice that there are no parameters to be learned!

In case of input with multiple channels ($n_c^{[l-1]}$) the filter does the computation over all the channels independently and the output has $n_c^{[l]} = n_c^{[l-1]}$ as third dimension: the first matrix of the output takes the max of the elements from the first matrix of the input, the second from the second and so on...

The main reason people use pooling is that it works well in practice and reduces computation.

An alternative to max pooling is average pooling, which takes the mean of each window instead of the maximum.

The important thing is that pooling has hyperparameters (filter size $f$, stride $s$, max or average) but no parameters to learn.
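Both kinds of pooling fit in one small function. This is a sketch on a single channel with made-up input values; `pool2d` is an illustrative name and not a library call. For a multi-channel input the same function would be applied to each channel separately.

```python
def pool2d(matrix, f, s, mode="max"):
    """Max or average pooling over f x f windows moved with stride s."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    n_rows, n_cols = len(matrix), len(matrix[0])
    output = []
    for i in range(0, n_rows - f + 1, s):
        row = []
        for j in range(0, n_cols - f + 1, s):
            window = [matrix[i + a][j + b] for a in range(f) for b in range(f)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        output.append(row)
    return output


feature_map = [[1, 3, 2, 1],
               [2, 9, 1, 1],
               [1, 3, 2, 3],
               [5, 6, 1, 2]]

print(pool2d(feature_map, f=2, s=2))                  # [[9, 2], [6, 3]]
print(pool2d(feature_map, f=2, s=2, mode="average"))  # [[3.75, 1.25], [3.75, 2.0]]
```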

### CNN Example

This example is similar to LeNet-5, which was introduced by Yann LeCun in 1998.

By convention a conv layer and the pooling layer that follows it are counted together as one layer, since the pooling layer does not have weights.

<img src="nn_example.png" width="1000px" />


Conv layers need relatively few parameters: layer 1 has only $5*5*3*6+6 = 456$ and layer 2 only $5*5*6*16+16=2416$. This is far fewer than the fully connected layers 3 and 4, with $400*120+120=48120$ and $120*84+84=10164$ parameters respectively.

Generally, the deeper you go, the more the height and width decrease while the number of filters increases.
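The four counts can be reproduced with two small helpers. The function names and the layer labels in the dictionary are illustrative; the layer sizes are the ones quoted in the text.

```python
def conv_params(f, n_c_prev, n_filters):
    """(f * f * n_c_prev weights + 1 bias) per filter."""
    return (f * f * n_c_prev + 1) * n_filters


def dense_params(n_in, n_out):
    """n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out


counts = {
    "conv layer 1": conv_params(5, 3, 6),
    "conv layer 2": conv_params(5, 6, 16),
    "fully connected layer 3": dense_params(400, 120),
    "fully connected layer 4": dense_params(120, 84),
}
for name, count in counts.items():
    print(name, count)
# conv layer 1 456
# conv layer 2 2416
# fully connected layer 3 48120
# fully connected layer 4 10164
```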

### Why Convolutions?

CNNs are convenient because they need fewer parameters to train. In the example above the input image has $32*32*3=3072$ features and the output of the first conv layer has $28*28*6=4704$ units. The conv layer only needs $(5*5*3+1)*6=456$ parameters, while a fully connected layer between the same two volumes would need $3072*4704$ weights, more than $14$ million parameters.

* Parameter sharing: the same filter (same parameters) can be applied to multiple parts of the image, for example a vertical edge detector.

* Sparsity of connections: in each layer, each output value (element of the output matrix) depends only on a small number of inputs (those covered by the filter). Together with parameter sharing, this helps the network capture translation invariance: the result is robust to small shifts of the pixels in the input image.

## Case Studies

### Why look at case studies?

A neural network architecture that works well on one task often works well on other tasks too.

Here are some classical CNN networks:

* LeNet-5
* AlexNet
* VGG

ResNet, the architecture that won the 2015 ImageNet competition, has 152 layers. There is also an architecture called Inception, developed at Google, that is very useful to study and apply to your own tasks.

Reading about and trying out these models can give you a lot of ideas for solving your own task.

### Classic networks

**LeNet-5 (1998)**

The goal of this model was to recognize handwritten digits in a ($32\times32\times1$) grayscale image.

<img src="LeNet-5.png" width="1000px" />

This model was published in 1998. At that time it was common to use average pooling instead of max pooling, and the last layer did not use softmax.

It has around $60k$ parameters, very few compared to today's networks.

The height and width of the volume decrease while the number of channels increases.

The architecture `Conv ==> Pool ==> Conv ==> Pool ==> FC ==> FC ==> softmax` is quite common.

The activation functions used in the paper were sigmoid and tanh. Modern implementations use ReLU in most cases.

**AlexNet (2012)**

The model was designed for the ImageNet challenge, which classifies images into 1000 classes.

<img src="AlexNet.png" width="1000px" />

The architecture is `Conv ==> Max-pool ==> Conv ==> Max-pool ==> Conv ==> Conv ==> Conv ==> Max-pool ==> Flatten ==> FC ==> FC ==> Softmax`

It has 60 million parameters, compared to the 60k parameters of LeNet-5.

It used the ReLU activation function.

This paper convinced computer vision researchers of the importance of deep learning.

**VGG-16 (201)**

It always uses conv layers with ($3\times3$) filters, stride = 1 and same padding, and max-pool layers with ($2\times2$) filters and stride = 2.

<img src="VGG-16.png" width="1000px" />


The 16 in the name refers to the 16 layers with weights.

This network is large even by modern standards. It has around 138 million parameters.


### ResNets (Residual Networks)

Very deep NNs are difficult to train because of vanishing and exploding gradient problems.

A solution to this problem is to take the activation $a^{[l]}$ of one layer and feed it directly to a layer deeper in the NN, which makes it possible to train very deep NNs, even with more than 100 layers.

Instead of what happens in a **plain network**:

$a^{[l]} \Rightarrow \underbrace{z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}}_{\text{linear}} \Rightarrow \underbrace{a^{[l+1]} = g(z^{[l+1]})}_{\text{ReLU}} \Rightarrow \underbrace{z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}}_{\text{linear}} \Rightarrow \underbrace{a^{[l+2]} = g(z^{[l+2]})}_{\text{ReLU}}$

We take a *shortcut* / skip connection to make it a **residual network**:

$a^{[l]} \Rightarrow \dots \Rightarrow \underbrace{a^{[l+2]} = g(z^{[l+2]} + a^{[l]})}_{\text{ReLU}}$

The addition happens after the last linear operation and before the last ReLU operation and forms a **residual block**, which allows us to train much deeper NNs.

<img src="res_block.png" width="1000px" />

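In theory a deeper network should only reduce the training error, but in plain NNs the training error starts increasing once the network gets too deep. With ResNets the training error keeps decreasing as layers are added.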


<img src="resnet_error.png" width="1000px" />

### Why ResNets work

Consider a residual block such as

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

If, for example, L2 regularization shrinks $W^{[l+2]}$ and $b^{[l+2]}$ close to zero, then

$$a^{[l+2]} = g(a^{[l]})$$

And since the activation function is ReLU, all activations are non-negative, so

$$a^{[l+2]} = a^{[l]}$$

So the identity function is easy for a residual block to learn, and adding the two extra layers does not hurt performance. Moreover, the two layers may learn something useful, in which case they do better than the identity function.
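The identity argument can be checked numerically. The sketch below uses small fully connected layers with made-up weights instead of convolutions, and the helper names are illustrative; it only shows that with the last layer's weights and bias at zero the residual block returns $a^{[l]}$, while the plain block loses it.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(W, a, b):
    """z = W a + b for a weight matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def plain_block(a_l, W1, b1, W2, b2):
    a_next = relu(linear(W1, a_l, b1))   # a[l+1]
    return relu(linear(W2, a_next, b2))  # a[l+2] = g(z[l+2])


def residual_block(a_l, W1, b1, W2, b2):
    a_next = relu(linear(W1, a_l, b1))   # a[l+1]
    z_last = linear(W2, a_next, b2)      # z[l+2]
    # Skip connection: add a[l] before the last ReLU.
    return relu([z + a for z, a in zip(z_last, a_l)])


a_l = [0.5, 2.0]                     # non-negative, as after a ReLU
W1 = [[1.0, -1.0], [0.5, 2.0]]       # arbitrary example weights
b1 = [0.1, -0.3]
zero_W = [[0.0, 0.0], [0.0, 0.0]]    # last layer shrunk to zero
zero_b = [0.0, 0.0]

print(residual_block(a_l, W1, b1, zero_W, zero_b))  # [0.5, 2.0]
print(plain_block(a_l, W1, b1, zero_W, zero_b))     # [0.0, 0.0]
```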

Using a skip connection also helps the gradient backpropagate, and thus helps you train deeper networks.

The dimensions of $z^{[l+2]}$ and $a^{[l]}$ have to be the same in ResNets. If they differ, we add a matrix of parameters $W_s$ (which can be learned or fixed) such that

$$a^{[l+2]} = g( z^{[l+2]} + W_s a^{[l]})$$

where $W_s$ can also be a fixed zero-padding transformation.

### Networks in Networks and 1x1 Convolutions

A 1 x 1 convolution (also called Network in Network) is useful in many CNN models. It has been used in many modern CNN architectures such as ResNet and Inception.

A 1 x 1 convolution is useful when:

* We want to shrink the number of channels. We also call this feature transformation: $(28\times 28\times 192)\underbrace{\rightarrow}_{32 \text{ filters } 1\times 1}(28\times 28\times 32)$ 

    We will later see that by shrinking it we can save a lot of computations.

* We want to keep the number of channels: if the number of 1 x 1 Conv filters equals the number of input channels, the output has the same number of channels. Thanks to the non-linearity applied after it, the 1 x 1 Conv still adds a non-linear operation and lets the network learn a more complex function.
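A 1 x 1 convolution is the same small fully connected layer applied at every pixel across the channels. The sketch below assumes the layout `volume[row][column][channel]`, a ReLU non-linearity, and made-up numbers; `conv1x1` is an illustrative name.

```python
def conv1x1(volume, filters, biases):
    """Apply a 1 x 1 convolution followed by ReLU.

    volume:  H x W x C_in nested list
    filters: C_out weight vectors, each of length C_in
    biases:  C_out numbers
    Returns an H x W x C_out nested list.
    """
    return [[[max(0.0, sum(w * x for w, x in zip(weights, pixel)) + b)
              for weights, b in zip(filters, biases)]
             for pixel in row]
            for row in volume]


# 2 x 2 image with 3 channels, shrunk to 2 channels by two 1 x 1 filters.
volume = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
          [[2.0, 2.0, 2.0], [4.0, 0.0, 1.0]]]
filters = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
biases = [0.0, 0.0]

output = conv1x1(volume, filters, biases)
print(len(output), len(output[0]), len(output[0][0]))  # 2 2 2
print(output[0][0])  # [0.0, 3.0]
```

The height and width are untouched; only the channel dimension changes, from the length of each pixel vector to the number of filters.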

### Inception Network Motivation

The idea behind **inception networks** is to combine multiple different layers at once. For example one layer can be made of the concatenation of a $1 \times 1$ filter, a $3 \times 3$ filter, a $5 \times 5$ filter, and a pooling layer, all with different numbers of channels and same padding:

<img src="inception_layer.png" width="1000px" />

In this scenario we apply all the convs and pools we might want and let the NN learn which ones it wants to use most.

The problem is the computational cost. Counting only the multiplications of the $5 \times 5$ Conv gives $5*5*192*28*28*32$, which is more than $120$ million operations.

A solution to this is to use a $1 \times 1$ convolution as a **bottleneck layer**:

<img src="bottleneck_layer.PNG" width="800px" />

In this case the multiplications are $1*1*192*28*28*16 + 5*5*16*28*28*32$, which is about $2.4 + 10.0 = 12.4$ million, roughly one tenth of the first example.
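The two costs can be compared directly. This sketch counts multiplications only, with an illustrative helper name; the layer sizes are the ones quoted in the text.

```python
def conv_multiplications(f, n_c_in, out_h, out_w, n_filters):
    """Each output value needs f * f * n_c_in multiplications."""
    return f * f * n_c_in * out_h * out_w * n_filters


# Direct 5 x 5 convolution: 28 x 28 x 192 -> 28 x 28 x 32.
direct = conv_multiplications(5, 192, 28, 28, 32)

# Bottleneck: 1 x 1 conv down to 16 channels, then the 5 x 5 conv.
reduce_step = conv_multiplications(1, 192, 28, 28, 16)
expand_step = conv_multiplications(5, 16, 28, 28, 32)
bottleneck = reduce_step + expand_step

print(direct)       # 120422400
print(reduce_step)  # 2408448
print(expand_step)  # 10035200
print(bottleneck)   # 12443648
print(round(direct / bottleneck, 1))  # 9.7
```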

### Inception Network

The core building block of the **inception network** is the **inception unit**:

<img src="inception_unit.PNG" width="800px" />

Remember that the pooling layer operates on each channel independently, so its output has $n_c^{[l]} = n_c^{[l-1]}$ as third dimension: the first matrix of the output takes the max of the elements from the first matrix of the input, the second from the second, and so on. Therefore we add a $1 \times 1$ convolution after the pooling layer to shrink its number of channels to the desired dimension.

Example of an inception unit in Keras:

<img src="inception_keras.PNG" width="800px" />


An example of an inception network is GoogLeNet. It has 9 inception units and uses some max-pool layers to reduce dimension.

There are 3 softmax branches at different positions in the network. The intermediate ones help ensure that the features computed even in the intermediate layers are good enough to make a prediction, and they have a regularizing effect.

<img style="transform: rotate(90deg); width:250px" src="googlenet.PNG" />

Since the development of the Inception module, the authors and others have built other versions of this network, like Inception v2, v3, and v4. There is also a network that combines the Inception module with ResNet skip connections.

The name inception network comes from the movie *Inception* (2010):

<img src="inception.gif" width="400px"/>


## Practical Advice for Using ConvNets

### Using Open-Source Implementation

Many NNs are difficult to replicate because some details, such as parameter tuning, may not be presented in their papers.

Many deep learning researchers open-source their code on sites like GitHub.

One advantage of this is that you can download the network implementation along with its parameters/weights. The author might have used multiple GPUs and spent weeks to reach this result, and it's right in front of you after you download it.

### Transfer Learning

It is common practice to use a NN architecture that has already been trained, that is, to use its pretrained parameters/weights instead of random initialization. The pretrained models might have been trained on large datasets like ImageNet. This can save a lot of time.

For example, when reusing another NN with its weights, remove the softmax layer, put your own in its place, and train only the new layer while the other weights stay fixed/frozen.

Another trick that can speed up training is to run the pretrained NN without the final softmax layer, get an intermediate representation of your images, and save it to disk. These representations are then used as input to a shallow NN. This saves the time needed to run each image through all the layers at every step. It is like converting your images into vectors.

An alternative is to freeze only a few layers at the beginning of the pretrained network and learn the remaining weights, or to replace the remaining layers with your own.

If you have enough data, you can use the pretrained weights just as initialization (changing the softmax layer) and retrain the whole network.

### Data Augmentation

The more data you have, the better a deep NN tends to perform. Data augmentation is one of the techniques used in deep learning to improve the performance of deep NNs.

Some data augmentation methods used for computer vision tasks include:
* Mirroring
* Random cropping (take random portions of the image, note that they should be big enough)
* Rotation
* Shearing
* Local warping
* Color shifting: for example, we add small distortions to the R, G, and B channels, so that the image still looks the same to a human but is different for the computer. In practice the added values are drawn from some probability distribution and the shifts are relatively small. There is an algorithm called PCA color augmentation that chooses the shifts automatically.

Distortions can be implemented during training by using separate CPU threads that produce distorted mini-batches while the NN is training. Data augmentation also has some hyperparameters. A good place to start is to find an open-source data augmentation implementation and then use it as is or tune its hyperparameters.
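Three of the methods listed above (mirroring, random cropping, color shifting) can be sketched in pure Python. The image is assumed to be a nested list of `(r, g, b)` tuples with values in 0 to 255, the shifts are drawn from a uniform distribution, and the crop size, shift range, and function names are all illustrative choices rather than recommended settings.

```python
import random


def mirror(image):
    """Flip the image horizontally."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def color_shift(image, shifts):
    """Add one offset per channel and clip the result to [0, 255]."""
    return [[tuple(min(255, max(0, value + shift))
                   for value, shift in zip(pixel, shifts))
             for pixel in row]
            for row in image]


def augment(image, rng, crop_size, max_shift=20):
    """Random crop, mirror with probability 0.5, then a small color shift."""
    out = random_crop(image, crop_size, crop_size, rng)
    if rng.random() < 0.5:
        out = mirror(out)
    shifts = tuple(rng.randint(-max_shift, max_shift) for _ in range(3))
    return color_shift(out, shifts)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[(i * 10, j * 10, 128) for j in range(8)] for i in range(8)]
augmented = augment(image, rng, crop_size=6)
print(len(augmented), len(augmented[0]))  # 6 6
```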

### State of Computer Vision

When there is not much data, people tend to use some "hacks", like choosing a more complex NN architecture. This is typical in computer vision, where the complexity of the problem would require much more data.

Tips for doing well on benchmarks/winning competitions:
* Ensembling: train several networks independently and average their outputs. This generally increases performance, but it slows down prediction in production by a factor equal to the number of models in the ensemble. It also takes more memory, since all the models have to be kept in memory.
* Multi-crop at test time: run the classifier on multiple versions (crops) of each test image, not only of the training images, and average the results. A technique called 10-crop does this. It can give a better result in production.
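Both tips come down to averaging several predictions. The sketch below shows the averaging step and one common way to build the ten crops (the center and the four corners, each together with its mirror image). The function names and the probability values are made up for illustration, and no real classifier is involved.

```python
def average_predictions(predictions):
    """Element-wise mean of several probability vectors."""
    n = len(predictions)
    return [sum(values) / n for values in zip(*predictions)]


def ten_crop(image, size):
    """Four corner crops and the center crop, each with its horizontal mirror."""
    h, w = len(image), len(image[0])
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = []
    for top, left in offsets:
        crop = [row[left:left + size] for row in image[top:top + size]]
        crops.append(crop)
        crops.append([row[::-1] for row in crop])
    return crops


# Ensembling: average the outputs of two (hypothetical) models.
print(average_predictions([[0.5, 0.5], [0.75, 0.25]]))  # [0.625, 0.375]

# Multi-crop: each crop would be classified and the outputs averaged the same way.
image = [[10 * i + j for j in range(4)] for i in range(4)]
crops = ten_crop(image, size=2)
print(len(crops))  # 10
print(crops[0])    # [[0, 1], [10, 11]]
print(crops[1])    # [[1, 0], [11, 10]]
```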

Use architectures of networks published in the literature. Use open source implementations if possible. Use pretrained models and fine-tune on your dataset.