# Convolutional Neural Networks

## Foundations of Convolutional Neural Networks

### Computer Vision

Computer vision is one of the fields advancing rapidly thanks to deep learning. Applications of computer vision that use deep learning include self-driving cars and face recognition.

Rapid progress in computer vision is enabling new applications that weren't possible a few years ago. Deep learning techniques for computer vision are constantly evolving, producing new architectures that can also help in areas other than computer vision. For example, Andrew Ng took some ideas from computer vision and applied them to speech recognition.

Examples of computer vision problems include:
* Image classification.
* Object detection $\rightarrow$ detect objects and localize them.
* Neural style transfer $\rightarrow$ change the style of an image using another image.

One of the challenges of computer vision is that images can be extremely large while a fast and accurate algorithm is required.

For example, a $1000 \times 1000$ RGB image has $1000 \times 1000 \times 3 = 3$ million features/inputs to a fully connected neural network. If the following hidden layer contains $1000$ units, then the weight matrix is $1000 \times 3$ million, which is $3$ billion parameters in the first layer alone, and that is computationally very expensive!
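As a quick check of the arithmetic above, here is a minimal sketch. The variable names are illustrative, and bias terms are ignored:

```python
# Weight count for one fully connected layer on a flattened RGB image.
height, width, channels = 1000, 1000, 3   # image size from the text (3 = RGB)
hidden_units = 1000                        # units in the first hidden layer

n_inputs = height * width * channels       # number of input features
n_weights = n_inputs * hidden_units        # one weight per (input, unit) pair

print(f"{n_inputs:,} inputs -> {n_weights:,} weights")
# prints: 3,000,000 inputs -> 3,000,000,000 weights
```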

One solution is to build the network using **convolution layers** instead of fully connected layers.

### Edge Detection Example

The convolution operation is one of the fundamental building blocks of a CNN. A classic example of convolution is edge detection in images.

Early layers of a CNN might detect edges, the middle layers then detect parts of objects, and the later layers put these parts together to produce an output.

In an image we can detect vertical edges, horizontal edges, or edges of any orientation. An example of a convolution operation to detect vertical edges:

* on the left there is a grayscale image (10 is brighter than 0)
* the convolution operator is denoted by $*$
* the second element is called the *filter* or *kernel* $\rightarrow$ intuition: a vertical edge is where there are bright pixels on the left and dark pixels on the right
* each element of the resulting matrix is the sum of the elements of the filter, each multiplied by the corresponding element in the "overlapping" square of the left matrix (see red and green elements)

<img src="w1_edge_detection.PNG" width="600px" />
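The steps listed above can be sketched in plain Python. This assumes a $6 \times 6$ image that is bright on the left half and dark on the right half. The helper name `conv2d_valid` is hypothetical; it only illustrates the sliding-window sum and does not show how a library implements it:

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    n, f = len(image), len(kernel)
    out_size = n - f + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(f) for b in range(f))
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]

# Assumed 6x6 grayscale image: bright (10) on the left, dark (0) on the right.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# Vertical edge filter: bright-to-dark from left to right.
vertical_filter = [[1, 0, -1] for _ in range(3)]

edges = conv2d_valid(image, vertical_filter)
for row in edges:
    print(row)   # each row is [0, 30, 30, 0]
```

The two central columns of the $4 \times 4$ output light up, marking the vertical edge in the middle of the input.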

In Python the convolution operation is provided by `tf.nn.conv2d` (TensorFlow) or `Conv2D` (Keras).

Consider instead a dark-to-light input image, with columns $[0,...0,10...10]$: applying the convolution would result in a gray-dark-gray image, with columns $[0,-30,0]$. The negative values indicate a dark-to-light rather than a light-to-dark transition; if the direction of the edge does not matter, we can take the absolute value of the output.
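A minimal sketch of this sign flip, assuming a $6 \times 6$ dark-to-light image. The helper `conv2d_valid` is a hypothetical name for a plain sliding-window sum:

```python
def conv2d_valid(image, kernel):
    """Valid (no padding), stride-1 sliding-window sum of products."""
    n, f = len(image), len(kernel)
    out = n - f + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(out)] for i in range(out)]

# Assumed 6x6 image: dark (0) on the left, bright (10) on the right.
image = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
vertical_filter = [[1, 0, -1] for _ in range(3)]

raw = conv2d_valid(image, vertical_filter)
magnitude = [[abs(v) for v in row] for row in raw]   # drop the edge direction

print(raw[0])        # [0, -30, -30, 0]
print(magnitude[0])  # [0, 30, 30, 0]
```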

A horizontal edge filter would be made of rows $$\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]$$

Other filters have been proposed, such as the Sobel filter $\left[\begin{array}{ccc}1 & 0 & -1\\2 & 0 & -2\\1 & 0 & -1\end{array}\right]$ or the Scharr filter $\left[\begin{array}{ccc}3 & 0 & -3\\10 & 0 & -10\\3 & 0 & -3\end{array}\right]$, which put more weight on the central row to make the detector more robust.

Applying deep learning means that we don't need to handcraft these numbers: we can treat them as weights and learn them. The network can then learn horizontal, vertical, angled, or any other edge detector automatically rather than relying on hand-designed filters:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right]$$

### Padding

When an $n \times n$ matrix is convolved with an $f \times f$ filter the result is a $(n-f+1) \times (n-f+1)$ matrix, so one issue with convolutions is that the resulting image is smaller than the input image.

A second issue is that the filter barely touches the corners and edges of the input image, while the pixels in the center are processed many times.

When we apply the convolution operation multiple times, the image shrinks at every layer and we lose a lot of information in the process, in particular from the edge pixels, which are used less than the central ones.

For these reasons deep neural networks typically use **padding**: the input matrix is augmented with an additional border of *zeros*. If the border thickness is $p$ then the output of the convolution has dimension $(n+2p-f+1)\times(n+2p-f+1)$.

*Valid* convolutions do not apply padding, while in *same* convolutions the padding is chosen so that the output size is the same as the input size, which means that $p = \frac{f-1}{2}$.
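The two formulas above can be checked with a small sketch. The function names are hypothetical, stride is assumed to be 1, and the input size of 8 is just an example value:

```python
def padded_output_size(n, f, p):
    """Output side length for an n x n input, f x f filter, padding p, stride 1."""
    return n + 2 * p - f + 1

def same_padding(f):
    """Padding that keeps the output the same size as the input (odd f only)."""
    if f % 2 == 0:
        raise ValueError("symmetric same padding needs an odd filter size")
    return (f - 1) // 2

n = 8  # example input size
for f in (3, 5, 7):
    p = same_padding(f)
    valid = padded_output_size(n, f, 0)
    same = padded_output_size(n, f, p)
    print(f"f={f}: valid -> {valid}, same (p={p}) -> {same}")
```

With `n = 8` the valid outputs shrink to 6, 4 and 2, while the same-padded outputs stay at 8.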

By convention in computer vision $f$ is usually odd. One reason is that an odd-sized filter has a central position; another is that $p = \frac{f-1}{2}$ is then an integer, so the same padding is symmetric.

### Strided Convolutions

A strided convolution fixes a number $s$, the stride, which defines how many pixels the filter jumps each time it is applied. A stride of $s=2$ means that the filter moves across the input matrix by $2$ cells at a time.

The resulting matrix has dimension

$$\bigg(\frac{(n+2p-f)}{s}+1\bigg) \times \bigg(\frac{(n+2p-f)}{s}+1\bigg)$$

If the result is not an integer it is rounded down using the `floor()` function, denoted by $\lfloor \dots \rfloor$.
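A minimal sketch of the output-size formula with rounding down. The function name is hypothetical, and integer floor division stands in for $\lfloor \dots \rfloor$:

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 for an n x n input and an f x f filter."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 3, p=0, s=2))  # 3  (4 / 2 + 1, no rounding needed)
print(conv_output_size(6, 3, p=0, s=2))  # 2  (3 / 2 = 1.5 is rounded down to 1)
print(conv_output_size(6, 3, p=1, s=1))  # 6  (same convolution)
```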

In math textbooks the convolution operation flips the filter before applying it to the input matrix:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right] \rightarrow \left[\begin{array}{ccc}w_9 & w_8 & w_7\\w_6 & w_5 & w_4\\w_3 & w_2 & w_1\end{array}\right]$$

But in deep learning there is no flipping. The operation is still referred to as convolution even though it is technically a cross-correlation.
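The textbook flip is a 180-degree rotation of the filter, which can be sketched as follows. The function name is hypothetical, and the string entries stand for the symbolic weights above:

```python
def flip_kernel(kernel):
    """Rotate a 2-D kernel by 180 degrees (reverse the rows, then each row)."""
    return [row[::-1] for row in kernel[::-1]]

kernel = [["w1", "w2", "w3"],
          ["w4", "w5", "w6"],
          ["w7", "w8", "w9"]]

flipped = flip_kernel(kernel)
for row in flipped:
    print(row)
# ['w9', 'w8', 'w7']
# ['w6', 'w5', 'w4']
# ['w3', 'w2', 'w1']
```

Since the weights are learned anyway, skipping this step only changes which position each learned value ends up in.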

### Convolutions Over Volume

When working with color images we add a depth dimension given by the number of channels (3 channels for RGB). An $(n \times n \times n_c)$ input image is convolved with an $(f \times f \times n_c)$ filter:

<img src="conv_over_volumns.png" width="600px" />

Each number in the filter is multiplied by the corresponding number in the input image and the products are then summed up, giving a single number for each filter position.
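A rough sketch of this sum over all channels. It assumes the image is stored as nested lists indexed `[row][column][channel]`; the helper name `conv_volume` and the example values are illustrative:

```python
def conv_volume(image, kernel):
    """Valid, stride-1 convolution of an n x n x n_c image with an
    f x f x n_c filter; products are summed over rows, columns and channels."""
    n, f, n_c = len(image), len(kernel), len(kernel[0][0])
    out = n - f + 1
    return [[sum(image[i + a][j + b][c] * kernel[a][b][c]
                 for a in range(f) for b in range(f) for c in range(n_c))
             for j in range(out)] for i in range(out)]

# Assumed 6x6x3 image: every channel is bright (10) on the left, dark (0) on the right.
image = [[[10] * 3 if j < 3 else [0] * 3 for j in range(6)] for _ in range(6)]
# 3x3x3 vertical edge filter: columns (1, 0, -1) repeated in each channel.
kernel = [[[1] * 3, [0] * 3, [-1] * 3] for _ in range(3)]

result = conv_volume(image, kernel)
print(result[0])  # [0, 90, 90, 0]
```

The output is a flat $4 \times 4$ matrix: each channel contributes 30 at the edge, so the three channels add up to 90.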

It is possible to detect horizontal edges in a single channel only, by setting the filter entries for the other channels to zero:

$$\underbrace{\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]}_R \quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_G\quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_B$$

It is possible to use multiple filters at the same time, for example one vertical and one horizontal edge detector. The outputs are stacked together, with depth equal to the number of filters used, for example:

<img src="mult_filters.png" width="600px" />

$$
(6\times6\times3) \text{ input image} \rightarrow 
 \biggl\{\begin{array}{c}(3\times3\times3)\text{ "vertical" filter} \rightarrow (4\times4) \text{ matrix}\\
(3\times3\times3)\text{ "horizontal" filter} \rightarrow (4\times4) \text{ matrix}\end{array} \biggr\} \rightarrow
(4\times4\times2) \text{ output}
$$
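The shapes in this diagram can be reproduced with a small sketch. The helper `conv_volume`, the `[row][column][channel]` layout and the example image are assumptions made for illustration:

```python
def conv_volume(image, kernel):
    """Valid, stride-1 convolution summed over all channels (2-D output)."""
    n, f, n_c = len(image), len(kernel), len(kernel[0][0])
    out = n - f + 1
    return [[sum(image[i + a][j + b][c] * kernel[a][b][c]
                 for a in range(f) for b in range(f) for c in range(n_c))
             for j in range(out)] for i in range(out)]

# Assumed 6x6x3 image with a vertical edge in the middle of every channel.
image = [[[10] * 3 if j < 3 else [0] * 3 for j in range(6)] for _ in range(6)]

vertical = [[[1] * 3, [0] * 3, [-1] * 3] for _ in range(3)]       # 3x3x3
horizontal = [[[v] * 3 for _ in range(3)] for v in (1, 0, -1)]    # 3x3x3

# One 4x4 matrix per filter, then stack them along a new last axis.
maps = [conv_volume(image, k) for k in (vertical, horizontal)]
size = len(maps[0])
output = [[[fm[i][j] for fm in maps] for j in range(size)] for i in range(size)]

shape = (len(output), len(output[0]), len(output[0][0]))
print(shape)         # (4, 4, 2)
print(output[0][1])  # [90, 0]: vertical filter responds, horizontal does not
```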

### One Layer of a Convolutional Network

In a layer of a CNN the filters play the same role as the weights $w^{[l]}$ of a standard neural network. To the output of each filter's convolution we add a constant $b^{[l]} \in \mathbb{R}$ (a different one for each filter) with broadcasting, so that the result takes the role of $z^{[l]}$, to which we apply the non-linearity to get $a^{[l]} = g(z^{[l]})$; stacking these activations gives the output of the layer.

<img src="example_layer.png" width="700px" />

With ten $(3\times3\times3)$ filters we need $3*3*3*10+10=280$ parameters.

Notice that no matter the size of the input, the number of parameters stays the same as long as the filter size and the number of filters are the same. That makes the layer less prone to overfitting.
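A minimal sketch of this count, using a hypothetical helper `conv_params`. The input height and width do not appear in the formula at all, which is why the count is the same for any image size:

```python
def conv_params(f, n_c_prev, n_filters):
    """Weights plus one bias per filter: (f * f * n_c_prev + 1) * n_filters."""
    return (f * f * n_c_prev + 1) * n_filters

# Ten 3x3x3 filters, as in the example above.
n_params = conv_params(f=3, n_c_prev=3, n_filters=10)
print(n_params)  # 280

# The same layer applied to different (illustrative) input sizes.
for height, width in [(64, 64), (1000, 1000)]:
    print(f"{height}x{width}x3 input -> {conv_params(3, 3, 10)} parameters")
```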

Summary of notation for a convolutional layer $l$:
* Hyperparameters:
 * $f^{[l]}$ = filter size
 * $p^{[l]}$ = padding (default is zero) $\rightarrow$ note that padding does not apply to the depth!
 * $s^{[l]}$ = stride
 * $n_c^{[l]}$ = number of filters (which becomes the number of output channels)
 
* Input (height and width): $(n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]})$

* Output (height and width): $(n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$
 * where $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$, same for $n_W^{[l]}$

* Each filter is $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]})$, since it should match the number of channels of the input.

* The activations $a^{[l]}$ correspond to the outputs; in vectorized notation (batch gradient descent over $m$ examples), $A^{[l]}$ has shape $(m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$.

* The weights are $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]})$, where the last quantity is the total number of filters of layer $l$.

* The bias is a vector of length $n_c^{[l]}$, one entry per filter, but it is convenient to express it as a $(1 \times 1 \times 1 \times n_c^{[l]})$ tensor $\rightarrow$ a multidimensional array.
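The shapes in this summary can be collected in one helper. This is only a sketch: the function name, the dictionary keys and the example values (a batch of 32 images of size $6 \times 6 \times 3$ with two $3 \times 3$ filters) are assumptions for illustration:

```python
def conv_layer_shapes(n_h_prev, n_w_prev, n_c_prev, f, p, s, n_c, m=1):
    """Return the tensor shapes of one conv layer in the notation above."""
    n_h = (n_h_prev + 2 * p - f) // s + 1   # floor division implements the floor
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "input": (m, n_h_prev, n_w_prev, n_c_prev),
        "filter": (f, f, n_c_prev),
        "weights": (f, f, n_c_prev, n_c),
        "bias": (1, 1, 1, n_c),
        "activations": (m, n_h, n_w, n_c),
    }

shapes = conv_layer_shapes(n_h_prev=6, n_w_prev=6, n_c_prev=3,
                           f=3, p=0, s=1, n_c=2, m=32)
for name, shape in shapes.items():
    print(f"{name:12s} {shape}")
```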

### Simple convolution network example

The dimension of the layers follows the rule $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$:

<img src="simple_cnn_example.png" width="800px" />

Finally we flatten the last volume into a $7*7*40=1960$ column vector and feed it to a logistic or softmax unit (depending on whether the output is binary or has multiple classes).
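The sizes can be traced layer by layer with the rule above. The hyperparameters in this sketch are assumptions: the text only states the final $7 \times 7 \times 40$ volume, so the $39 \times 39 \times 3$ input and the per-layer values of $f$, $s$, $p$ and filter count are one configuration that reproduces it:

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

# Assumed layers as (f, s, p, number of filters); not taken from the text.
layers = [(3, 1, 0, 10), (5, 2, 0, 20), (5, 2, 0, 40)]

size, channels = 39, 3   # assumed input: 39 x 39 x 3
for f, s, p, n_filters in layers:
    size = conv_output_size(size, f, p, s)
    channels = n_filters
    print(f"{size} x {size} x {channels}")

flattened = size * size * channels
print(flattened)  # 1960
```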

In the example the height and width of the volume shrink after each layer, which is a common trend in CNN design.

There are 3 types of layer in a convolutional network:
* Convolution
* Pooling
* Fully connected

### Pooling layers

Besides convolution layers, CNNs often use **pooling layers** to reduce the size of the representation, speed up computation, and make some of the detected features more robust:

<img src="max_pooling.png" width="600px" />

Notice that there are no parameters to be learned!

For an input with multiple channels ($n_c^{[l-1]}$) the pooling is computed over each channel independently and the output has $n_c^{[l]} = n_c^{[l-1]}$ as its third dimension: the first matrix of the output takes the max of the elements from the first matrix of the input, the second from the second, and so on.

The main reason people use pooling is that it works well in practice and reduces computation.

An alternative to max pooling is average pooling, which takes the mean of each window instead of the maximum.

The important thing is that there are no parameters to learn: pooling only has hyperparameters (the filter size $f$ and the stride $s$).
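Max and average pooling on a single channel can be sketched as follows. The helper names and the $4 \times 4$ example values are illustrative, with $f=2$ and $s=2$ assumed:

```python
def pool2d(matrix, f, s, reduce_fn):
    """Apply `reduce_fn` to every f x f window, moving by stride s."""
    n = len(matrix)
    out = (n - f) // s + 1
    return [[reduce_fn([matrix[i * s + a][j * s + b]
                        for a in range(f) for b in range(f)])
             for j in range(out)] for i in range(out)]

def mean(values):
    return sum(values) / len(values)

x = [[1, 3, 2, 1],
     [2, 9, 1, 1],
     [1, 3, 2, 3],
     [5, 6, 1, 2]]

max_pooled = pool2d(x, f=2, s=2, reduce_fn=max)
avg_pooled = pool2d(x, f=2, s=2, reduce_fn=mean)

print(max_pooled)  # [[9, 2], [6, 3]]
print(avg_pooled)  # [[3.75, 1.25], [3.75, 2.0]]
```

For a multi-channel input the same function would be applied to each channel separately, so the depth is unchanged.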

### CNN Example

This example is similar to LeNet-5, which was introduced by Yann LeCun in 1998.

By convention, a conv layer together with the pooling layer that follows it is counted as a single layer, since the pooling layer has no weights.

<img src="nn_example.png" width="1000px" />


Conv layers need relatively few parameters: layer 1 has only $5*5*3*6+6 = 456$ and layer 2 only $5*5*6*16+16=2416$. This is far fewer than the fully connected layers 3 and 4, with $400*120+120=48120$ and $120*84+84=10164$ parameters respectively.
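These counts can be reproduced with two small helpers. The function names and the layer labels are hypothetical; the sizes are the ones quoted above:

```python
def conv_params(f, n_c_prev, n_filters):
    """(f * f * n_c_prev) weights plus one bias, for each filter."""
    return (f * f * n_c_prev + 1) * n_filters

def fc_params(n_in, n_out):
    """One weight per (input, output) pair plus one bias per output."""
    return n_in * n_out + n_out

counts = {
    "CONV1": conv_params(f=5, n_c_prev=3, n_filters=6),
    "CONV2": conv_params(f=5, n_c_prev=6, n_filters=16),
    "FC3": fc_params(400, 120),
    "FC4": fc_params(120, 84),
}
for layer, n in counts.items():
    print(f"{layer}: {n}")
# CONV1: 456
# CONV2: 2416
# FC3: 48120
# FC4: 10164
```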

Generally, the deeper you go, the more the height and width decrease, while the number of filters increases.

### Why Convolutions?

CNNs are convenient because they need fewer parameters to train. In the example above the input image has $32*32*3=3072$ features and the output of the first conv layer has $28*28*6=4704$ values. The conv layer only needs $(5*5*3+1)*6=456$ parameters (as counted for layer 1 above), while a fully connected layer between the same two volumes would need $3072*4704>14$ million weights. Two properties explain this:

* Parameter sharing: the same filter (same parameters) can be applied to multiple parts of the image, for example a vertical edge detector.

* Sparsity of connections: in each layer, each output value (element of the output matrix) depends only on a small number of inputs (those covered by the filter). Together with parameter sharing, this helps the network capture translation invariance: the result is robust to small shifts of pixels in the input image.
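The parameter comparison at the start of this section can be checked with a short sketch (variable names are illustrative). Each $5 \times 5$ filter spans all 3 input channels, which gives 456 parameters; counting a single input channel per filter would give $(5*5+1)*6=156$ instead:

```python
n_in = 32 * 32 * 3    # input features
n_out = 28 * 28 * 6   # values in the first conv layer's output

fc_weights = n_in * n_out                     # fully connected, biases ignored
conv_params_rgb = (5 * 5 * 3 + 1) * 6         # 5x5x3 filters, one bias each
conv_params_one_channel = (5 * 5 + 1) * 6     # if each filter saw one channel

print(f"fully connected: {fc_weights:,}")     # fully connected: 14,450,688
print(f"conv (5x5x3):    {conv_params_rgb}")  # conv (5x5x3):    456
print(f"conv (5x5x1):    {conv_params_one_channel}")  # conv (5x5x1):    156
```

Either way, the convolutional layer needs several orders of magnitude fewer parameters than the fully connected one.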

    (5*5+1)*100  # 2600
