# Convolutional Neural Networks

## Foundations of Convolutional Neural Networks

### Computer Vision

Computer vision is one of the fields advancing rapidly thanks to deep learning. Applications of computer vision that use deep learning include self-driving cars and face recognition.

Rapid progress in computer vision is enabling applications that weren't possible a few years ago. Deep learning techniques for computer vision keep evolving, producing new architectures that can also help in areas other than computer vision. For example, Andrew Ng took some ideas from computer vision and applied them to speech recognition.

Examples of computer vision problems include:
* Image classification.
* Object detection $\rightarrow$ detect objects and localize them.
* Neural style transfer $\rightarrow$ change the style of an image using another image.

One of the challenges of computer vision is that images can be extremely large while a fast and accurate algorithm is required.

For example, a $1000 \times 1000$ RGB image has $1000 \times 1000 \times 3 = 3$ million input features for a fully connected neural network. If the first hidden layer contains $1000$ units, the weight matrix has shape $1000 \times 3$ million, which is $3$ billion parameters in the first layer alone, and that is computationally very expensive!
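The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the $3 \times 3$ filter size used for comparison is an assumption chosen to match the filters shown later in these notes.

```python
# Parameter count for a fully connected first layer on a 1000x1000 RGB image.
height, width, channels = 1000, 1000, 3
n_inputs = height * width * channels   # 3,000,000 input features
n_hidden = 1000                        # units in the first hidden layer
n_weights = n_hidden * n_inputs        # weight matrix of shape (n_hidden, n_inputs)

# For comparison (assumed size): weights of one 3x3 filter spanning all channels.
f = 3
n_filter_weights = f * f * channels

print(f"inputs: {n_inputs:,}")
print(f"fully connected weights: {n_weights:,}")
print(f"weights in one 3x3 filter: {n_filter_weights}")
```

The filter's weight count does not depend on the image size, which is what makes convolution layers attractive for large inputs.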

One of the solutions is to build this using **convolution layers** instead of the fully connected layers.

### Edge Detection Example

The convolution operation is one of the fundamental building blocks of a CNN. A classic example of convolution is edge detection in an image.

Early layers of a CNN might detect edges, the middle layers might detect parts of objects, and the later layers put these parts together to produce an output.

In an image we can detect vertical edges, horizontal edges, or edges of any orientation. An example of a convolution operation that detects vertical edges:

* on the left there is a greyscale image (10 is brighter than 0)
* the convolution operator is denoted by $*$
* the second element is called the *filter* or *kernel* $\rightarrow$ intuition: a vertical edge is a region with bright pixels on the left and dark pixels on the right
* each element of the resulting matrix is the sum of the elements of the filter, each multiplied by the corresponding element in the "overlapping" square of the left matrix (see red and green elements)

<img src="w1_edge_detection.PNG" width="600px" />

In Python the convolution operation is provided by `tf.nn.conv2d` (TensorFlow) or `Conv2D` (Keras).

Consider instead a dark-to-light input image, with columns $[0, \dots, 0, 10, \dots, 10]$. Applying the same convolution results in a gray-dark-gray image, with columns $[0,-30,0]$: the sign of the output encodes the direction of the transition. If the direction does not matter, the absolute value of the output is generally taken.
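The vertical edge detector described above can be reproduced with a small NumPy sketch. The helper `conv2d_valid` is a hypothetical name, not a library function; it assumes a square image and a square filter, no padding, and a stride of 1. The $6 \times 6$ image size is also an assumption chosen to keep the example small.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    sum the elementwise products at each position."""
    f = kernel.shape[0]
    n_out = image.shape[0] - f + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):
        for j in range(n_out):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Vertical edge filter: bright on the left, dark on the right.
vertical_filter = np.array([[1, 0, -1]] * 3, dtype=float)

# Bright-to-dark image: left half 10, right half 0.
bright_to_dark = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
# Dark-to-bright image: the same edge with the opposite polarity.
dark_to_bright = bright_to_dark[:, ::-1]

pos = conv2d_valid(bright_to_dark, vertical_filter)
neg = conv2d_valid(dark_to_bright, vertical_filter)

print(pos[0])           # first row of the 4x4 output
print(neg[0])
print(np.abs(neg)[0])   # the absolute value ignores the edge polarity
```

Every row of the bright-to-dark output has the values 0, 30, 30, 0, and every row of the dark-to-bright output has 0, -30, -30, 0. After taking the absolute value the two outputs coincide.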

A horizontal edge filter is made of rows: $$\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]$$

Other filters have been proposed, such as the Sobel filter $\left[\begin{array}{ccc}1 & 0 & -1\\2 & 0 & -2\\1 & 0 & -1\end{array}\right]$ or the Scharr filter $\left[\begin{array}{ccc}3 & 0 & -3\\10 & 0 & -10\\3 & 0 & -3\end{array}\right]$, which put more weight on the central pixels to make the detector more robust.

With deep learning we don't need to handcraft these numbers: we can treat them as weights and learn them. The network can then learn horizontal, vertical, angled, or any other edge detector automatically rather than having them designed by hand:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right]$$

### Padding

When an $n \times n$ matrix is convolved with an $f \times f$ filter the result is an $(n-f+1) \times (n-f+1)$ matrix. One issue with convolutions is therefore that the resulting image is smaller than the input image.

A second issue is that the filter barely touches the corners and edges of the input image, while the pixels in the center are covered many times, so information near the border is underused.

When the convolution operation is applied many times, as in a deep network, the image shrinks at every layer and a lot of information is lost in the process.

For these reasons deep neural networks need **padding**: the input matrix is augmented with an additional border of *zeros*. If the border thickness is $p$, the padded input has dimension $(n+2p)\times(n+2p)$ and the resulting output has dimension $(n+2p-f+1)\times(n+2p-f+1)$.

*Valid* convolutions do not apply padding, while in *same* convolutions the padding is chosen so that the output size equals the input size, which means that $n+2p-f+1 = n$, i.e. $p = \frac{f-1}{2}$.
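A short sketch of zero padding using NumPy's `np.pad`, comparing the *valid* and *same* output sizes. The values $n = 6$ and $f = 3$ are assumptions chosen for illustration.

```python
import numpy as np

n, f = 6, 3
p = (f - 1) // 2   # padding for a "same" convolution (f is odd)

image = np.arange(n * n, dtype=float).reshape(n, n)
# Add a border of zeros of thickness p on every side.
padded = np.pad(image, p, mode="constant", constant_values=0)

valid_size = n - f + 1           # no padding
same_size = n + 2 * p - f + 1    # with padding p

print(padded.shape)   # padded input is (n + 2p) x (n + 2p)
print(valid_size, same_size)
```

With these numbers the padded input is $8 \times 8$, the valid output has side 4, and the same output has side 6, equal to the input.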

By convention in computer vision $f$ is usually odd. One reason is that an odd filter has a central position; another is that $p = \frac{f-1}{2}$ is then an integer, so the padding is symmetric.

### Strided Convolutions

A strided convolution fixes a number $s$, the stride, which is the number of pixels the filter jumps at each step. A stride of $s=2$ means that the filter covers the input matrix moving by $2$ cells each time.

The resulting matrix has dimension $\left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor+1\right) \times \left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor+1\right)$: if $\frac{n+2p-f}{s}$ is not an integer, it is rounded down.
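The output size formula can be written as a small helper. `conv_output_size` is a hypothetical name used only for this sketch, and the example values are illustrative.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 3, p=0, s=2))   # (7 - 3) / 2 = 2 exactly
print(conv_output_size(6, 3, p=1, s=1))   # "same" convolution
print(conv_output_size(6, 3, p=0, s=2))   # (6 - 3) / 2 = 1.5 is rounded down to 1
```

Integer division `//` implements the rounding down, so the last case gives a side of 2 rather than 2.5.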

In math textbooks the convolution operation flips the filter, both horizontally and vertically, before applying it to the input matrix:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right] \rightarrow \left[\begin{array}{ccc}w_9 & w_8 & w_7\\w_6 & w_5 & w_4\\w_3 & w_2 & w_1\end{array}\right]$$

But in deep learning there is no flipping. The operation is still referred to as convolution, even though technically it is a cross-correlation.
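The difference can be seen on a single patch. In this sketch the integers 1 to 9 stand in for $w_1, \dots, w_9$, and the patch is an assumption chosen so that only one weight contributes to the sum.

```python
import numpy as np

kernel = np.arange(1, 10).reshape(3, 3)   # stands in for w_1 ... w_9
flipped = kernel[::-1, ::-1]              # flipped on both axes: w_9 ... w_1

# Patch where only the top-left pixel is non-zero.
patch = np.array([[1, 0, 0],
                  [0, 0, 0],
                  [0, 0, 0]])

cross_correlation = np.sum(patch * kernel)   # no flipping, as done in deep learning
with_flipping = np.sum(patch * flipped)      # flipped filter, as in math textbooks

print(flipped)
print(cross_correlation, with_flipping)
```

Without flipping the top-left pixel is weighted by $w_1$; with flipping it is weighted by $w_9$. Since the weights are learned anyway, the network can absorb this difference, so skipping the flip loses nothing.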

### Convolutions Over Volume

When working with color images we add a depth dimension given by the number of channels (3 channels for RGB). An $(n \times n \times n_c)$ input image is convolved with an $(f \times f \times n_c)$ filter, and the number of channels of the image and of the filter must match:

<img src="conv_over_volumns.png" width="600px" />

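Each of the numbers of the filter is multiplied with the corresponding number in the input image and the products are summed up, so each filter position produces a single number and the output is a 2-D $(n-f+1) \times (n-f+1)$ matrix.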

It is possible to detect horizontal edges in a single channel only (here red) by setting the filter to zero for the other channels:

$$\underbrace{\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]}_R \quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_G\quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_B$$

It is possible to use multiple filters at the same time, for example one vertical and one horizontal edge detector. The outputs are stacked together into a volume whose depth equals the number of filters used.
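A minimal NumPy sketch of convolution over a volume with two filters. The helper `conv_volume`, the array layout `(n_filters, f, f, n_c)` for the filters, and the $6 \times 6$ test image are assumptions made for this illustration; deep learning libraries use their own layouts and much faster implementations.

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n, n, n_c); filters: (n_filters, f, f, n_c).
    Returns an (n - f + 1, n - f + 1, n_filters) volume (no padding, stride 1)."""
    n = image.shape[0]
    n_filters, f = filters.shape[0], filters.shape[1]
    n_out = n - f + 1
    out = np.zeros((n_out, n_out, n_filters))
    for k in range(n_filters):
        for i in range(n_out):
            for j in range(n_out):
                # Multiply the f x f x n_c window by filter k and sum everything.
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[k])
    return out

n_c = 3
vertical = np.array([[1, 0, -1]] * 3, dtype=float)
horizontal = vertical.T   # rows [1, 1, 1], [0, 0, 0], [-1, -1, -1]

# Copy each 2-D detector across all channels to form f x f x n_c filters.
filters = np.stack([
    np.repeat(vertical[:, :, None], n_c, axis=2),
    np.repeat(horizontal[:, :, None], n_c, axis=2),
])

# 6x6 image with the same vertical edge in all three channels.
grey = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
image = np.repeat(grey[:, :, None], n_c, axis=2)

out = conv_volume(image, filters)
print(out.shape)      # depth equals the number of filters
print(out[0, :, 0])   # response of the vertical detector
print(out[0, :, 1])   # response of the horizontal detector
```

The output has shape $(4, 4, 2)$. The vertical detector responds along the edge, with the three channels summed into one number per position, while the horizontal detector gives zero everywhere because the image contains no horizontal edge.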