# Convolutional Neural Networks

## Foundations of Convolutional Neural Networks

### Computer Vision

Computer vision is one of the fields advancing most rapidly thanks to deep learning. Applications of computer vision that use deep learning include self-driving cars and face recognition.

Rapid progress in computer vision is enabling applications that weren't possible a few years ago. Deep learning techniques for computer vision keep evolving, producing new architectures that can also help in other areas. For example, Andrew Ng took some ideas from computer vision and applied them to speech recognition.

Examples of computer vision problems include:
* Image classification.
* Object detection $\rightarrow$ detect objects and localize them.
* Neural style transfer $\rightarrow$ change the style of an image using another image.

One of the challenges of computer vision is that images can be extremely large while a fast and accurate algorithm is required.

For example, a $1000 \times 1000$ RGB image has $3$ million features/inputs to a fully connected neural network. If the first hidden layer contains $1000$ units, then the weight matrix is $1000 \times 3$ million, which is $3$ billion parameters in the first layer alone, and that is computationally very expensive!

One solution is to build the network with **convolution layers** instead of fully connected layers.

### Edge Detection Example

The convolution operation is one of the fundamental building blocks of a CNN. A classic example of convolution is edge detection in images.

The early layers of a CNN might detect edges, the middle layers parts of objects, and the later layers put these parts together to produce an output.

In an image we can detect vertical edges, horizontal edges, or edges in any direction. An example of a convolution operation that detects vertical edges:

* on the left there is a grayscale image (10 is brighter than 0)
* the convolution operator is denoted by $*$
* the second element is called the *filter* or *kernel* $\rightarrow$ intuition: a vertical edge has bright pixels on the left and dark pixels on the right
* each element of the resulting matrix is the sum of the elements of the filter, each multiplied by the corresponding element of the "overlapping" square of the left matrix (see red and green elements)

<img src="w1_edge_detection.PNG" width="600px" />

In Python the convolution operation is provided by `tf.nn.conv2d` (TensorFlow) or `Conv2D` (Keras).

Example of a convolution:

<img src="convolution-example-matrix.gif" width="600px" />

Consider instead a dark-to-light input image, with columns $[0,...0,10...10]$. Applying the same convolution results in a gray-dark-gray image, with columns $[0,-30,0]$. If the direction of the transition does not matter, take the absolute value of the output.
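The sign flip can be checked with a small sketch in plain Python. The helper `conv2d_valid` and the two $6\times6$ toy images are assumptions made for illustration, not part of the original example:

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1 (no kernel flip)."""
    n_h, n_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    out = []
    for i in range(n_h - f_h + 1):
        row = []
        for j in range(n_w - f_w + 1):
            # Sum of element-wise products over the overlapping window.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(f_h) for b in range(f_w)))
        out.append(row)
    return out

vertical_filter = [[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]]

# Toy images (assumed): bright half on the left, and the mirrored case.
light_to_dark = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
dark_to_light = [[0, 0, 0, 10, 10, 10] for _ in range(6)]

edges = conv2d_valid(light_to_dark, vertical_filter)          # rows of [0, 30, 30, 0]
flipped_edges = conv2d_valid(dark_to_light, vertical_filter)  # rows of [0, -30, -30, 0]
abs_edges = [[abs(v) for v in row] for row in flipped_edges]  # direction-agnostic edges
```

Taking the absolute value makes both images produce the same edge map.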

A horizontal filter would be made of rows $$\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]$$

Other filters have been proposed, such as the Sobel filter $\left[\begin{array}{ccc}1 & 0 & -1\\2 & 0 & -2\\1 & 0 & -1\end{array}\right]$ or the Scharr filter $\left[\begin{array}{ccc}3 & 0 & -3\\10 & 0 & -10\\3 & 0 & -3\end{array}\right]$, which put more weight on the central pixels to make the detector more robust.

With deep learning we don't need to handcraft these numbers: we can treat them as weights and learn them. The network can then learn horizontal, vertical, angled, or any other edge detector automatically:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right]$$

### Padding

When an $n \times n$ matrix is convolved with an $f \times f$ filter the result is a $(n-f+1) \times (n-f+1)$ matrix, therefore one issue with convolutions is that the resulting image is smaller than the input image.

A second issue is that the filter barely touches the corners and edges of the input images while the pixels in the center are processed many times.

If we apply the convolution operation multiple times and the image shrinks each time, we lose a lot of information in the process. Also, the edge pixels are used less than the central pixels of the image.

For these reasons deep neural networks rely on **padding**: the input matrix is augmented with an additional border of *zeros*. If the border thickness is $p$ then the resulting matrix has dimension $(n+2p-f+1)\times(n+2p-f+1)$.

*Valid* convolutions do not apply padding, while in *same* convolutions the padding is chosen so that the output size is the same as the input size, which means that $p = \frac{f-1}{2}$.

By convention in computer vision $f$ is usually odd. One reason is that an odd filter has a central position.

### Strided Convolutions

A strided convolution fixes a number $s$, the stride, that defines how many pixels the filter jumps at each step. A stride of $s=2$ means that the filter covers the input matrix moving by $2$ cells each time.

The resulting matrix has dimension

$$\bigg(\frac{(n+2p-f)}{s}+1\bigg) \times \bigg(\frac{(n+2p-f)}{s}+1\bigg)$$

If the result is not an integer it is rounded down using the `floor()` function, denoted by $\lfloor \dots \rfloor$.
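The output-size formulas for padding and stride can be collected in two small helpers. This is a minimal sketch; the function names are hypothetical, and it assumes $n + 2p \geq f$:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input size (stride 1, odd f)."""
    return (f - 1) // 2

valid_size = conv_output_size(6, 3)                       # valid convolution
same_size = conv_output_size(6, 3, p=same_padding(3))     # same convolution
strided_size = conv_output_size(7, 3, s=2)                # stride 2
floored_size = conv_output_size(6, 3, s=2)                # 3 / 2 is rounded down
```

Integer division `//` plays the role of the floor, since adding the integer $1$ after flooring gives the same result as flooring the whole expression.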

In math textbooks the convolution operation flips the filter before applying it to the input matrix:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right] \rightarrow \left[\begin{array}{ccc}w_9 & w_8 & w_7\\w_6 & w_5 & w_4\\w_3 & w_2 & w_1\end{array}\right]$$

But in deep learning there is no flipping. The operation is still called convolution even though it is technically a cross-correlation.

### Convolutions Over Volume

When working with color images we add the depth dimension given by the number of channels (3 channels for RGB). An $(n \times n \times n_c)$ input image is convolved with an $(f \times f \times n_c)$ filter:

<img src="conv_over_volumns.png" width="600px" />

Each number of the filter is multiplied by the corresponding number in the input image, and the products are summed up.

It is possible to detect horizontal edges only for a channel and keep the others equal to zero:

$$\underbrace{\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]}_R \quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_G\quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_B$$

It is possible to use multiple filters at the same time, for example one vertical and one horizontal edge detector. The two outputs can be stacked together, with depth equal to the number of filters used, for example:

<img src="mult_filters.png" width="600px" />

$$
(6\times6\times3) \text{ input image} \rightarrow 
 \biggl\{\begin{array}{c}(3\times3\times3)\text{ "vertical" filter} \rightarrow (4\times4) \text{ matrix}\\
(3\times3\times3)\text{ "horizontal" filter} \rightarrow (4\times4) \text{ matrix}\end{array} \biggr\} \rightarrow
(4\times4\times2) \text{ output}
$$
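The shape bookkeeping above can be reproduced with nested lists. In this sketch the helper names and the toy image are assumptions, and each 2D edge detector is simply copied across the three channels:

```python
def conv_volume(image, filters):
    """image: n x n x n_c nested lists; filters: list of f x f x n_c nested lists.
    Returns an (n-f+1) x (n-f+1) x len(filters) volume (valid, stride 1)."""
    n, n_c = len(image), len(image[0][0])
    f = len(filters[0])
    size = n - f + 1
    return [[[sum(image[i + a][j + b][c] * flt[a][b][c]
                  for a in range(f) for b in range(f) for c in range(n_c))
              for flt in filters]
             for j in range(size)]
            for i in range(size)]

def across_channels(kernel_2d, n_c=3):
    """Copy a 2D kernel over n_c channels to get an f x f x n_c filter."""
    return [[[v] * n_c for v in row] for row in kernel_2d]

vertical_2d = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
horizontal_2d = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

# Toy 6x6x3 image (assumed): bright left half, dark right half, in every channel.
image = [[[10, 10, 10] if j < 3 else [0, 0, 0] for j in range(6)] for i in range(6)]

output = conv_volume(image, [across_channels(vertical_2d), across_channels(horizontal_2d)])
shape = (len(output), len(output[0]), len(output[0][0]))  # (4, 4, 2)
```

The first output channel responds to the vertical edge, while the second stays at zero because the toy image has no horizontal edge.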

### One Layer of a Convolutional Network

In a layer of a CNN the filters play the same role as the weights $w^{[l]}$ of a NN. To the output of each filter we add a constant $b^{[l]} \in \mathbb{R}$ (a different one per filter) with broadcasting, so that the augmented output takes the role of $z^{[l]}$, to which we apply the non-linearity to get $a^{[l]} = g(z^{[l]})$, the final "stacked" output.

<img src="example_layer.png" width="700px" />

With ten $(3\times3\times3)$ filters we need $3*3*3*10+10=280$ parameters.

Notice that the number of parameters does not depend on the size of the input, only on the size and number of the filters. That makes the layer less prone to overfitting.

Summary of notation for a convolutional layer $l$:
* Hyperparameters:
 * $f^{[l]}$ = filter size
 * $p^{[l]}$ = padding (default is zero) $\rightarrow$ note that padding does not apply to the depth!
 * $s^{[l]}$ = stride
 * $n_c^{[l]}$ = number of channels/filters
 
* Input (height, width, and channels): $(n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]})$

* Output (height, width, and channels): $(n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$
 * where $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$, same for $n_W^{[l]}$

* Each filter is $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]})$, since it should match the number of channels of the input.

* The activations $a^{[l]}$ correspond to the outputs; in vectorized notation (batch gradient descent), $A^{[l]}$ has shape $(m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$

* The weights are $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]})$, where the last quantity is the total number of filters of layer $l$.

* The bias is a vector of size $n_c^{[l]}$, one entry for each filter, but it is easier to express it as a $(1 \times 1 \times 1 \times n_c^{[l]})$ tensor $\rightarrow$ a multidimensional array.

### Simple convolution network example

The dimension of the layers follows the rule $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$:

<img src="simple_cnn_example.png" width="800px" />

Finally we flatten the last volume into a $7*7*40=1960$ column vector and feed it to a logistic or softmax unit (depending on whether the output is binary or has multiple classes).

In the example the image gets smaller after each layer, which is a common trend in CNNs.

There are 3 types of layers in a convolutional network:
* Convolution
* Pooling
* Fully connected

### Pooling layers

Besides convolution layers, CNNs often use **pooling layers** to reduce the size of the inputs, speed up computation, and make some of the detected features more robust:

<img src="max_pooling.png" width="600px" />

Notice that there are no parameters to be learned!

In case of input with multiple channels ($n_c^{[l-1]}$) the filter does the computation over all the channels independently and the output has $n_c^{[l]} = n_c^{[l-1]}$ as third dimension: the first matrix of the output takes the max of the elements from the first matrix of the input, the second from the second and so on...

The main reason people use pooling is that it works well in practice and reduces computation.

An alternative to max pooling is average pooling.

The important point is that pooling, in either form, has no parameters to learn.
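Both pooling variants fit in a few lines of plain Python. This is a minimal sketch: the helper names and the $4\times4$ input values are made up for illustration, with filter size $f=2$ and stride $s=2$ as fixed hyperparameters:

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pool2d(matrix, f=2, s=2, op=max):
    """Apply `op` (max or mean) to every f x f window, moving by s cells."""
    n_h, n_w = len(matrix), len(matrix[0])
    return [[op(matrix[i + a][j + b] for a in range(f) for b in range(f))
             for j in range(0, n_w - f + 1, s)]
            for i in range(0, n_h - f + 1, s)]

matrix = [[1, 3, 2, 1],
          [2, 9, 1, 1],
          [1, 3, 2, 3],
          [5, 6, 1, 2]]

max_pooled = pool2d(matrix)            # [[9, 2], [6, 3]]
avg_pooled = pool2d(matrix, op=mean)   # [[3.75, 1.25], [3.75, 2.0]]
```

With multiple channels the same function would be applied to each channel independently.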

### CNN Example

This example is similar to LeNet-5, which was introduced by Yann LeCun in 1998.

By convention a conv layer and the pooling layer that follows it are counted as a single layer, since the pooling layer has no weights.

<img src="nn_example.png" width="1000px" />


Conv layers need relatively few parameters: layer 1 has only $5*5*3*6+6 = 456$ and layer 2 only $5*5*6*16+16=2416$. This is far fewer than the fully connected layers 3 and 4, with $400*120+120=48120$ and $120*84+84=10164$ parameters respectively.
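These counts follow from two simple formulas, sketched here with hypothetical helper names: a conv layer has one $f \times f \times n_c^{[l-1]}$ filter plus one bias per output channel, and a fully connected layer has one weight per input-output pair plus one bias per output.

```python
def conv_params(f, n_c_prev, n_filters):
    """(f * f * n_c_prev weights + 1 bias) for each of the n_filters filters."""
    return (f * f * n_c_prev + 1) * n_filters

def fc_params(n_in, n_out):
    """n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out

layer_params = {
    "conv1": conv_params(5, 3, 6),
    "conv2": conv_params(5, 6, 16),
    "fc3": fc_params(400, 120),
    "fc4": fc_params(120, 84),
}
```

The same `conv_params` helper gives $280$ parameters for ten $(3\times3\times3)$ filters.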

Generally, the deeper you go the smaller the input size becomes, while the number of filters increases.

### Why Convolutions?

CNNs are convenient because they need fewer parameters to be trained. In the example above the input image has $32*32*3=3072$ features and the output of the first conv layer has $28*28*6=4704$ units. In the conv layer we only need $(5*5*3+1)*6=456$ parameters, while a fully connected layer would need $3072*4704$, more than $14$ million parameters.

* Parameter sharing: the same filter (same parameters) can be applied to multiple parts of the image, for example a vertical edge detector.

* Sparsity of connections: in each layer, each output value (element of the output matrix) depends only on a small number of inputs (those where the filter is applied), which makes it good at capturing translation invariance: the result is robust to small shifts of pixels in the input image.

## Case Studies

### Why look at case studies?

A neural network architecture that works well on one task can often work well on other tasks too.

Here are some classical CNN networks:

* LeNet-5
* AlexNet
* VGG

ResNet, the architecture that won the ImageNet competition in 2015, has 152 layers. There is also an architecture called Inception, developed by Google, that is very useful to learn and apply to your tasks.

Reading about and trying these models can give you a lot of ideas for solving your own task.

### Classic networks

**LeNet-5 (1998)**

The goal for this model was to identify handwritten digits in a ($32\times32\times1$) gray image.

<img src="LeNet-5.png" width="1000px" />

This model was published in 1998. At that time it was common to use average pooling instead of max pooling, and the last layer did not use softmax.

It has around $60k$ parameters, very few compared to today's networks.

The height and width of the volume decrease as the number of channels increases.

The architecture: `Conv ==> Pool ==> Conv ==> Pool ==> FC ==> FC ==> softmax` is quite common.

The activation functions used in the paper were sigmoid and tanh. Modern implementations use ReLU in most cases.

**AlexNet (2012)**

The goal for the model was the ImageNet challenge which classifies images into 1000 classes.

<img src="AlexNet.png" width="1000px" />

The architecture is `Conv ==> Max-pool ==> Conv ==> Max-pool ==> Conv ==> Conv ==> Conv ==> Max-pool ==> Flatten ==> FC ==> FC ==> Softmax`

It has about 60 million parameters, compared to the 60k parameters of LeNet-5.

It used the ReLU activation function.

This paper convinced computer vision researchers of the importance of deep learning.

**VGG-16 (201)**

It always uses conv layers with ($3\times3$) filters, stride = 1 and same padding, and max-pool layers with ($2\times2$) filters and stride = 2.

<img src="VGG-16.png" width="1000px" />


The 16 in the name refers to the 16 layers with weights.

This network is large even by modern standards. It has around 138 million parameters.


### ResNets (Residual Networks)

Very deep NNs are difficult to train because of vanishing and exploding gradients problems.

A solution to this problem is a skip connection: take the activation $a^{[l]}$ of one layer and feed it directly to a layer deeper in the NN. This makes it possible to train very deep NNs, even with more than 100 layers.

Instead of what happens in a **plain network**:

$a^{[l]} \Rightarrow \underbrace{z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}}_{\text{linear}} \Rightarrow \underbrace{a^{[l+1]} = g(z^{[l+1]})}_{\text{ReLU}} \Rightarrow \underbrace{z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}}_{\text{linear}} \Rightarrow \underbrace{a^{[l+2]} = g(z^{[l+2]})}_{\text{ReLU}}$

We add a *shortcut* (skip connection) to make it a **residual network**:

$a^{[l]} \Rightarrow \dots \Rightarrow \underbrace{a^{[l+2]} = g(z^{[l+2]} + a^{[l]})}_{\text{ReLU}}$

The addition happens after the last linear operation and before the last ReLU. The result is a **residual block**, which makes it possible to train much deeper NNs.

<img src="res_block.png" width="1000px" />

In plain NNs, as the number of layers grows the training error eventually starts increasing:


<img src="resnet_error.png" width="1000px" />

### Why ResNets work

Consider a residual block such as

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

When applying L2 regularization the weights of layer $l+2$ shrink toward zero; if the bias is also close to zero, then

$$a^{[l+2]} = g(a^{[l]})$$

And when the activation function is ReLU all activations are non-negative so

$$a^{[l+2]} = a^{[l]}$$

The identity function is therefore easy for a residual block to learn, so adding the two extra layers does not hurt performance. If the extra layers do learn something useful, the network does better than it would with the identity function alone.
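The identity argument can be verified numerically. The sketch below uses small fully connected layers instead of conv layers to keep it short; the helper names and all numeric values are assumptions for illustration:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, a, b):
    """z = W a + b for a weight matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]

def residual_block(a_l, W1, b1, W2, b2):
    a_l1 = relu(linear(W1, a_l, b1))
    z_l2 = linear(W2, a_l1, b2)
    # Skip connection: add a_l after the last linear step, before the last ReLU.
    return relu([z + a for z, a in zip(z_l2, a_l)])

a_l = [0.5, 0.0, 2.0]  # non-negative, as if produced by an earlier ReLU
W1 = [[0.3, -0.2, 0.1], [0.0, 0.4, -0.5], [0.2, 0.2, 0.2]]
b1 = [0.1, 0.0, -0.1]
zero_W = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With the last layer's weights and bias at zero, the block returns its input.
identity_out = residual_block(a_l, W1, b1, zero_W, zero_b)
```

Whatever the first layer computes, zeroing the second layer's weights and bias reduces the block to $g(a^{[l]}) = a^{[l]}$.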

Using a skip-connection helps the gradient to backpropagate and thus helps you to train deeper networks.

The dimensions of $z^{[l+2]}$ and $a^{[l]}$ have to be the same in ResNets. If they differ, we add a matrix $W_s$ (which can be learned or fixed) such that

$$a^{[l+2]} = g( z^{[l+2]} + W_s a^{[l]})$$

where $W_s$ can also be a zero-padding transformation.

### Networks in Networks and 1x1 Convolutions

A 1 x 1 convolution, also called Network in Network, is useful in many CNN models. It has been used in lots of modern CNN implementations like ResNet and Inception models.

A 1 x 1 convolution is useful when:

* We want to shrink the number of channels. We also call this feature transformation: $(28\times 28\times 192)\underbrace{\rightarrow}_{32 \text{ filters } 1\times 1}(28\times 28\times 32)$ 

    We will later see that by shrinking it we can save a lot of computations.

* If the number of 1 x 1 conv filters equals the number of input channels, the output has the same number of channels. The 1 x 1 conv still adds a non-linearity, so the network can learn a more complex function.

### Inception Network Motivation

The idea behind **inception networks** is to combine multiple different layers at once. For example one layer can be made of the concatenation of a $1 \times 1$ filter, a $3 \times 3$ filter, a $5 \times 5$ filter, and a pooling layer, all with different numbers of channels and same padding:

<img src="inception_layer.png" width="1000px" />

In this scenario we apply all the convs and pools we might want and let the NN learn which ones it wants to use most.

The problem is the computational cost. Counting only the multiplications of the $5 \times 5$ conv gives $5*5*192*28*28*32$, which is more than $120$ million operations.

A solution to this is to use a $1 \times 1$ convolution as a **bottleneck layer**:

<img src="bottleneck_layer.PNG" width="800px" />

In this case the multiplications are $1*1*192*28*28*16 + 5*5*16*28*28*32$, which is $2.4 + 10 = 12.4$ million, about one tenth of the first example.
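The two multiplication counts can be reproduced directly. The helper name is hypothetical; it multiplies the size of one filter by the number of output positions and the number of filters:

```python
def conv_multiplications(f, n_c_in, n_h, n_w, n_filters):
    """f * f * n_c_in multiplications for each of the n_h * n_w * n_filters outputs."""
    return f * f * n_c_in * n_h * n_w * n_filters

# Direct 5x5 conv: 28x28x192 -> 28x28x32
direct = conv_multiplications(5, 192, 28, 28, 32)

# Bottleneck: 1x1 conv down to 16 channels, then 5x5 conv up to 32 channels
bottleneck = (conv_multiplications(1, 192, 28, 28, 16)
              + conv_multiplications(5, 16, 28, 28, 32))

ratio = bottleneck / direct  # roughly 0.10
```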

### Inception Network

The center of the **inception network** is the **inception unit**:

<img src="inception_unit.PNG" width="800px" />

Remember that the pooling layer does the computation over all the channels independently and the output has $n_c^{[l]} = n_c^{[l-1]}$ as third dimension: the first matrix of the output takes the max of the elements from the first matrix of the input, the second from the second and so on... Therefore we need another $1 \times 1$ filter to match the dimension.

Example of an inception unit in Keras:

<img src="inception_keras.PNG" width="800px" />


An example of an inception network is GoogLeNet. It has 9 inception units and uses some max-pool layers to reduce dimension.

There are 3 softmax branches at different positions to push the network toward its goal. They help to ensure that the features computed even in the intermediate layers are good enough to make a prediction. This has a regularizing effect.

<img style="transform: rotate(90deg); width:250px" src="googlenet.PNG" />

Since the development of the Inception module, the authors and others have built other versions of this network, such as Inception v2, v3, and v4. There is also a network that combines the Inception module with ResNet.

The name inception network comes from the movie *Inception* (2010):

<img src="inception.gif" width="400px"/>


## Practical advice for using ConvNets

### Using Open-Source Implementation

Many NNs are difficult to replicate because some details, such as parameter tuning, may not be presented in their papers.

Many deep learning researchers open-source their code on sites like GitHub.

One advantage is that you can download the network implementation along with its parameters/weights. The author might have used multiple GPUs and spent weeks to reach this result, and it is right in front of you after you download it.

### Transfer Learning

It is common practice to use a NN architecture that has been trained before. This means using its pretrained parameters/weights instead of random initialization and training from scratch. The pretrained models might have been trained on a large dataset like ImageNet. This can save a lot of time.

For example, when reusing another NN with its weights, remove the softmax layer, add your own, and train only the new layer while keeping the other weights fixed/frozen.

Another trick that can speed up training is to run the pretrained NN without its final softmax layer, compute the intermediate representation of each of your images, and save it to disk. You then train a shallow NN on these representations. This saves the time needed to run every image through all the layers at each training step. It is like converting your images into vectors.

An alternative is to freeze only the first few layers of the pretrained network and train the remaining weights, or replace the remaining layers with your own.

If you have enough data, you can use the pretrained weights only as initialization and retrain the whole network (after changing the softmax layer).

### Data Augmentation

The more data you have, the better a deep NN tends to perform. Data augmentation is one of the techniques used in deep learning to improve the performance of a deep NN.

Some data augmentation methods used for computer vision tasks include:
* Mirroring
* Random cropping (take random portions of the image, note that they should be big enough)
* Rotation
* Shearing
* Local warping
* Color shifting: for example, we add to R, G, and B some distortions that leave the image looking the same to a human but make it different for the computer. In practice the added values are drawn from some probability distribution and the shifts are relatively small. There is an algorithm called PCA color augmentation that chooses the shifts automatically.

It is possible to implement distortions during training by using a separate CPU thread that prepares distorted mini-batches while the NN is training. Data augmentation also has some hyperparameters. A good place to start is to find an open-source data augmentation implementation and then use it as is or tune its hyperparameters.
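Three of the methods listed above (mirroring, random cropping, and color shifting) can be sketched on an image stored as rows of `(r, g, b)` tuples. The function names, the toy image, and the fixed shift values are assumptions; in practice the shifts would be drawn from a probability distribution:

```python
import random

def mirror(image):
    """Horizontal flip: reverse the pixel order in every row."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def color_shift(image, shift):
    """Add a (dr, dg, db) offset to every pixel, clipping to [0, 255]."""
    return [[tuple(min(255, max(0, v + d)) for v, d in zip(pixel, shift))
             for pixel in row]
            for row in image]

image = [[(10, 20, 30), (40, 50, 60), (70, 80, 90)],
         [(5, 5, 5), (250, 250, 250), (0, 0, 0)]]

mirrored = mirror(image)
crop = random_crop(image, 1, 2, rng=random.Random(0))
shifted = color_shift(image, (10, -10, 0))
```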

### State of Computer Vision

When there is not much data, people tend to rely on "hacks", such as choosing a more complex NN architecture. This is typical in computer vision, where the complexity of the problem would require much more data.

Tips for doing well on benchmarks/winning competitions:
* Ensembling: train several networks independently and average their outputs. This generally increases performance but slows down prediction by a factor equal to the number of networks in the ensemble. It also takes more memory, since all the models must be kept in memory.
* Multi-crop at test time: run the classifier on multiple crops of each test image and average the results. There is a technique called 10-crop that does this. It can give better results in production.

Use architectures of networks published in the literature. Use open source implementations if possible. Use pretrained models and fine-tune on your dataset.

## Detection Algorithms

### Object Localization

Object localization is about drawing a box around the predicted object ("classification with localization"). It is the step before object detection (where there can be multiple objects of different classes).

<img src="object_det.jpeg" width="600px"/>

The prediction is not only about the class $1, ...C$, but also about the box that contains the object:

$$
y_i = \left[\begin{array}{ll}
p_c \in \{0,1\} & \text{is there any object?} \\
b_x & \text{x-axis center of the box} \\
b_y & \text{y-axis center of the box} \\
b_h & \text{height of the box} \\
b_w & \text{width of the box} \\
c_1 \in \{0,1\} & \text{class 1 object} \\
... & \text{...} \\
c_C \in \{0,1\} & \text{class C object}
\end{array}\right]
$$

The coordinates of the box go by convention from N-W: $\{0,0\}$ to S-E: $\{1,1\}$.

If there is no object then $p_c = 0$ and the rest of the vector $y_i$ is irrelevant.

The loss function can be quadratic. If there is an object ($p_c = 1$) then

$$L(\hat{y}_i,y_i) = (\hat{p}_c-p_c)^2+(\hat{b}_x-b_x)^2+...+(\hat{c}_C-c_C)^2$$

and just $(\hat{p}_c-p_c)^2$ if there is no object.
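The two cases of the loss can be written as one small function. This is a sketch with a hypothetical function name, assuming the target vector is laid out as $[p_c, b_x, b_y, b_h, b_w, c_1, \dots, c_C]$ as above:

```python
def localization_loss(y_hat, y):
    """Squared error over the whole vector if an object is present (p_c = 1),
    otherwise squared error on p_c only."""
    if y[0] == 1:
        return sum((predicted - target) ** 2 for predicted, target in zip(y_hat, y))
    return (y_hat[0] - y[0]) ** 2

# Example with C = 3 classes (values are made up).
y_object = [1, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]
y_empty = [0, 0, 0, 0, 0, 0, 0, 0]
y_hat = [0.5, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]

loss_object = localization_loss(y_hat, y_object)  # only p_c is off by 0.5
loss_empty = localization_loss(y_hat, y_empty)    # box and class entries are ignored
```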

### Landmark Detection

Landmark detection is the task of predicting strategic points in an image, for example the border of a face or the skeleton of a body.

<img src="landmark_det.jpeg" width="300px"/>

It can be an intermediate step of a larger model in which first you use a CNN to detect whether and where there is a face in the image, and then use another NN to classify the facial expression.

With $64$ landmarks the output is a vector with $1$ entry that tells whether the object is present, and $64$ pairs of coordinates, one pair per landmark, therefore $1+128=129$ entries.

$$
y_i = \left[\begin{array}{ll}
p_c \in \{0,1\} & \text{is there the object?} \\
l_{1,x} & \text{x-coord of first landmark} \\
l_{1,y} & \text{y-coord of first landmark} \\
l_{2,x} & \text{x-coord of second landmark} \\
l_{2,y} & \text{y-coord of second landmark} \\
... & \text{...} \\
l_{64,x} & \text{x-coord of 64th landmark} \\
l_{64,y} & \text{y-coord of 64th landmark}
\end{array}\right]
$$

### Object Detection

A first technique for object detection is the **sliding window detection algorithm**.

The first step consists in training a CNN on the targets you want to detect, using closely cropped images:
<img src="sliding_windows_1.png" width="150px"/>

Notice that if you want to predict bounding boxes then you need to provide them.

Then, at prediction time, you pick a window size, feed the small rectangular region of the image into the ConvNet, and predict whether that region contains the object. You continue by sliding the rectangle over the rest of the input image with a certain stride until you have covered all parts:

<img src="sliding_window.gif" width="200px"/>

You repeat the entire process a few times with larger rectangles, and in the end you keep the rectangles that contain the targets.

The problem with the sliding window detection algorithm is the computation time. Before deep learning, people used hand-crafted linear classifiers to classify the object and then applied the sliding window technique; the linear classifier was cheap to compute. In the deep learning era this is computationally very expensive because of the complexity of the deep learning model.

However, sliding window object detection can be implemented *convolutionally*, which is much more efficient.

### Convolutional Implementation of Sliding Windows

First, consider how to turn FC layers into convolutional layers (in a network that predicts one of four image classes). Mathematically a $400\times1$ vector and a $1\times1\times400$ tensor are identical:

<img src="sliding_windows_2.png" width="800px"/>


To solve the computational problem of applying a CNN sequentially to every part of the image, the idea is to perform all the computations at once by passing the entire image through the CNN, making the predictions for all the areas at the same time. This is feasible because the different windows share overlapping pixels.

Consider a ConvNet input of $14\times14\times3$ that eventually will output a $1\times1\times4$ volume (as above). In order to make a prediction on a test image of $16\times16\times3$, with a stride of $2$ you would need to pass four rectangles into the ConvNet (top left, top right, bottom left, bottom right): but notice that most of the pixels are overlapping! The operations in the four passes are highly duplicated.
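The window positions can be enumerated explicitly to see how many separate passes a naive implementation would need. The helper below is a hypothetical sketch for square images and square windows:

```python
def window_positions(image_size, window_size, stride):
    """Top-left corners (row, col) of every window that fits in the image."""
    last = image_size - window_size
    return [(top, left)
            for top in range(0, last + 1, stride)
            for left in range(0, last + 1, stride)]

# 14x14 windows with stride 2 on a 16x16 image: the four corners.
four_windows = window_positions(16, 14, 2)

# The same window and stride on a 28x28 image: an 8 x 8 grid of windows.
many_windows = window_positions(28, 14, 2)
```

A naive implementation runs the ConvNet once per entry of these lists, while the convolutional implementation produces all the predictions in a single pass.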

So, instead of feeding the ConvNet four times, feed the whole image into the same ConvNet. The output is a $2\times2\times4$ volume in which each $1\times1\times4$ element gives the result of running the same ConvNet on the corresponding (top left, top right, bottom left, bottom right) window:

<img src="sliding_windows_3.png" width="800px"/>

If you want to run sliding windows on a $28\times28\times3$ image, applying the same ConvNet as above leads to an $8\times8\times4$ volume, which is equivalent to running a $14\times14$ sliding window with a stride of $2$:

<img src="sliding_windows_4.png" width="800px"/>

**NOT CLEAR: should the window be of the same size of the input of the ConvNet? Does using multiple windows of different size mean to train multiple ConvNets with different input size?**

This method was introduced by [Sermanet et al., 2014, OverFeat: Integrated recognition, localization and detection using convolutional networks](https://arxiv.org/abs/1312.6229)

### YOLO Algorithm

#### Bounding Box Predictions

With sliding windows the bounding box may not be accurate, because the location of the window may not match the position of the object perfectly, or the real bounding box may not have the same shape as the window (e.g. a wider rectangle).

You split the image (e.g. $100\times100$) into grid cells and apply the classification with localization algorithm to each grid cell. The object is assigned to the grid cell where the centroid of the bounding box lies.

<img src="yolo.png" width="200px"/>

For each grid cell the output is again a vector of dimension $C+5$ (with $C$ total classes)

$$
y_i = \left[\begin{array}{ll}
p_c \in \{0,1\} & \text{is there any object?} \\
b_x \in [0,1] & \text{x-axis center of the box} \\
b_y \in [0,1] & \text{y-axis center of the box} \\
b_h > 0 & \text{height of the box} \\
b_w > 0 & \text{width of the box} \\
c_1 \in \{0,1\} & \text{class 1 object} \\
... & \text{...} \\
c_C \in \{0,1\} & \text{class C object}
\end{array}\right]
$$

Notice that while the center of the box is always between 0 (N-W) and 1 (S-E), the height and width are expressed relative to the grid cell dimensions and can be larger than $1$ if the object spans several grid cells.

To train the NN, the inputs are $100\times100\times3$ images and the output is a $3\times3\times(C+5)$ volume, where the channels at each position hold the prediction for one of the $9$ grid cells.
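Building the training target means finding the grid cell that contains the box center and rescaling the box relative to that cell. In the sketch below the function name and the pixel box format `(center_x, center_y, height, width)` are assumptions, and a $120\times120$ image is used instead of $100\times100$ only to keep the numbers round:

```python
def encode_box(box, image_size, grid_size):
    """Return (row, col, b_x, b_y, b_h, b_w) for a box given in pixels.
    b_x, b_y are offsets inside the cell; b_h, b_w are in units of the cell size."""
    cell = image_size / grid_size
    center_x, center_y, height, width = box
    col = min(int(center_x // cell), grid_size - 1)
    row = min(int(center_y // cell), grid_size - 1)
    return (row, col,
            center_x / cell - col, center_y / cell - row,
            height / cell, width / cell)

# A box centered at (70, 50), 60 pixels tall and 20 wide, on a 3x3 grid of 40-pixel cells.
encoded = encode_box((70, 50, 60, 20), image_size=120, grid_size=3)
```

Here the center falls in the middle cell, and $b_h = 1.5$ shows that the height can exceed the size of one cell.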

Notice two things:
* Unlike the sliding window algorithm, it outputs the bounding box coordinates explicitly. This allows the NN to output bounding boxes of any aspect ratio, with much more precise coordinates that are not dictated by the stride of the sliding window.

* This algorithm also has a convolutional implementation: you don't run the algorithm once per grid cell; it is a single ConvNet with a lot of computation shared between the grid cells, so it runs relatively fast.

There is a problem if more than one object falls in the same grid cell. For a $100\times100\times3$ image it is reasonable to use $19\times19$ grid cells, which makes this less likely.

This method is called **YOLO** and was presented in [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)

#### Intersection Over Union

Intersection Over Union (IoU) is a function used to evaluate an object detection algorithm.

Consider the red box as the ground truth and the purple box the prediction:

<img src="iou.png" width="200px"/>

It computes the size of the intersection and divides it by the size of the union. More generally, IoU is a measure of the overlap between two bounding boxes. By convention, a localization is counted as "correct" if IoU $\geq 0.5$.

The higher the IoU $\in [0,1]$, the more accurate the predicted box.
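A minimal IoU sketch follows. It assumes boxes are given by their corners `(x1, y1, x2, y2)` rather than by center, height, and width; converting between the two formats is straightforward:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

partial = iou((0, 0, 2, 2), (1, 1, 3, 3))    # overlap 1, union 7
perfect = iou((0, 0, 2, 2), (0, 0, 2, 2))    # identical boxes
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))   # no overlap
```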

#### Non-max Suppression

One of the problems of Object Detection is that the algorithm may find multiple detections of the same objects. Rather than detecting an object just once, it might detect it multiple times:

<img src="non-max-suppression.png" width="400px"/>

Non-max Suppression is a way to make sure that the algorithm detects the object just once.

With only one class, each grid cell outputs its own probability $p_c$ of containing the object.

* Discard all boxes with $p_c \leq 0.6$
* While there are any remaining boxes:
    * Pick the box with the largest $p_c$. Output that as a prediction
    * Discard any remaining box with IoU $> 0.5$ with the box output in the previous step, i.e. any box with high overlap (greater than the overlap threshold of $0.5$).
    
If there are multiple classes/object types (let's say $C$) you want to detect, you should run the Non-max suppression $C$ times, once for every output class.
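The procedure above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the corner box format `(x1, y1, x2, y2)`, the helper names `iou`, `non_max_suppression` and `per_class_nms`, and the flat list inputs are all assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) (assumed format)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest p_c first."""
    # Discard all boxes with p_c <= score_threshold
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    while candidates:
        # Pick the remaining box with the largest p_c and output it
        best = candidates.pop(0)
        kept.append(best)
        # Discard remaining boxes that overlap it too much
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


def per_class_nms(boxes, scores, classes, **kwargs):
    """Run non-max suppression independently for every predicted class."""
    kept = []
    for c in sorted(set(classes)):
        idx = [i for i, k in enumerate(classes) if k == c]
        keep_c = non_max_suppression([boxes[i] for i in idx],
                                     [scores[i] for i in idx], **kwargs)
        kept.extend(idx[j] for j in keep_c)
    return kept
```

Running suppression per class means that two heavily overlapping boxes survive together if they were predicted for different classes, which is the behaviour described for the multi-class case.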

#### Anchor Boxes

Another problem with the object detection algorithms seen so far is that each grid cell can predict only one object.

Suppose there are two kinds of objects which generally fit rectangles of different shapes. Pre-define these two shapes (the anchor boxes) and change the output as follows:

<img src="anchor-boxes.png" width="400px"/>

Previously, each object was assigned to the grid cell that contains that object's midpoint, so the output Y was $3\times3\times8$ because we had a $3\times3$ grid. For each grid position, we had an output vector with its $p_c$, then the bounding box, and the indicators for each class.

With the anchor box, each object is assigned to the same grid cell that contains the object's midpoint as before, but also to the anchor box with the highest IoU with the object's shape. For example the woman is assigned to anchor box 1 and the car to anchor box 2, so the output of the shared grid cell is

$$[\underbrace{p_c=1, b_x, b_y, b_h, b_w, c_1=1, c_2=0, c_3=0,}_{c_1 = 1 \text{ for the human}}\underbrace{p_c=1, b_x, b_y, b_h, b_w, c_1=0, c_2=1, c_3=0}_{c_2 = 1 \text{ for the car}}]$$

and the output Y is $3\times3\times16$.
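The assignment for a single grid cell might be sketched as below. Several things here are assumptions for illustration: anchors are given as `(height, width)` pairs, the IoU between an object and an anchor is computed on shapes only (as if both were centred on the same point), and the helper names `shape_iou` and `encode_cell` are hypothetical.

```python
import numpy as np


def shape_iou(hw_a, hw_b):
    """IoU of two (height, width) shapes centred on the same point."""
    inter = min(hw_a[0], hw_b[0]) * min(hw_a[1], hw_b[1])
    union = hw_a[0] * hw_a[1] + hw_b[0] * hw_b[1] - inter
    return inter / union


def encode_cell(objects, anchors, num_classes=3):
    """Build the target vector of one grid cell.

    objects: list of (b_x, b_y, b_h, b_w, class_index) for the objects
             whose midpoint falls in this cell.
    anchors: list of (height, width) anchor shapes.
    """
    slot = 5 + num_classes            # p_c, 4 box values, class indicators
    y = np.zeros(len(anchors) * slot)
    for b_x, b_y, b_h, b_w, cls in objects:
        # Assign the object to the anchor box with the highest shape IoU
        best = max(range(len(anchors)),
                   key=lambda k: shape_iou((b_h, b_w), anchors[k]))
        start = best * slot
        y[start:start + slot] = 0.0   # a later object overwrites an earlier one
        y[start:start + 5] = [1.0, b_x, b_y, b_h, b_w]
        y[start + 5 + cls] = 1.0
    return y


# Anchor 1 is tall and narrow, anchor 2 is wide and flat (made-up values)
anchors = [(0.9, 0.3), (0.3, 0.9)]
person = (0.5, 0.5, 0.8, 0.3, 0)      # class index 0
car = (0.5, 0.6, 0.3, 0.8, 1)         # class index 1
y_cell = encode_cell([person, car], anchors)
```

With two anchors and three classes `y_cell` has 16 entries, matching the last dimension of Y. The sketch does not resolve the case where two objects in the same cell prefer the same anchor; the later one simply overwrites the earlier one.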

There are still problems if there are more objects than anchor boxes in the same cell (e.g. 2 cars and one person) or when multiple objects in the same cell are associated with the same anchor box.

These cases are not common, especially when using a finer grid.

#### Putting together

Consider a problem with 3 classes (1: pedestrian, 2: car, 3: motorcycle) and two anchor boxes. Split the image into $3\times3$ cells; the output then has dimension $3\times3\times(2\times8)$, i.e. one 16-dimensional vector (two anchor boxes times eight values) for each cell.

When training, each cell has its associated prediction/box for each anchor box. Suppose the car gets associated with anchor box 2:

<img src="yolo_2.png" width="600px"/>

In the YOLO algorithm, at training time, only the cell containing the center/midpoint of an object is responsible for detecting that object.
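Finding the responsible cell is a small computation. The sketch below assumes the midpoint is given in normalized image coordinates in $[0, 1]$ with the origin at the top-left corner, and that $b_x, b_y$ are expressed relative to the top-left corner of the cell; the function name `responsible_cell` is hypothetical.

```python
def responsible_cell(x_mid, y_mid, grid_size=3):
    """Map an object's midpoint to the grid cell responsible for it.

    Returns (row, col, b_x, b_y), where b_x and b_y give the midpoint's
    position inside that cell as a fraction of the cell size.
    """
    # Clamp so that a midpoint exactly on the right/bottom border
    # still falls in the last cell
    col = min(int(x_mid * grid_size), grid_size - 1)
    row = min(int(y_mid * grid_size), grid_size - 1)
    b_x = x_mid * grid_size - col
    b_y = y_mid * grid_size - row
    return row, col, b_x, b_y


# A midpoint at the horizontal centre, near the bottom of the image
row, col, b_x, b_y = responsible_cell(0.5, 0.9)
```

In this example the object lands in the bottom-middle cell of the $3\times3$ grid; every other cell keeps $p_c = 0$ for it.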

When making a prediction the output associated to the blue cell will have "ideally" $p_c=0$ for both anchor boxes and some irrelevant values for the other elements. The output of the green cell should have $p_c=0$ for the first anchor box and some bounding box for the second anchor box.

Consider another example, this time applying non-max suppression.

With 2 anchor boxes, we get two predicted bounding boxes for each grid cell; some of them will have a very low probability $p_c$ of containing an object:

<img src="yolo_pred1.png" width="200px"/>

Get rid of the low probability predictions:

<img src="yolo_pred2.png" width="200px"/>

For each of the classes independently run non-max suppression for the objects that were predicted to come from that class:

<img src="yolo_pred3.png" width="200px"/>

### Region Proposal

One of the downsides of YOLO is that it processes a lot of areas where there are no objects.

R-CNN, which is based on region proposals, runs a segmentation algorithm to find regions that could contain objects and then runs a ConvNet only on those regions:

<img src="rcnn.png" width="300px"/>


It proposes regions and classifies them one at a time, outputting a label and a bounding box for each.

It is still pretty slow.

Fast R-CNN uses a convolutional implementation of sliding windows to classify all the proposed regions at once.

Faster R-CNN uses a convolutional network to propose the regions.

## Face Recognition

### What is face recognition?

A face recognition system identifies a person's face. It can work on both images and videos.

For example, an advanced application is liveness detection within a video: it prevents the face recognition system from being fooled by a static image of a person. It can be learned with supervised deep learning, using a dataset of live-human and non-live-human examples.

Face verification is the task of checking whether an input image is that of the claimed person. The input is an image and a name/ID, and it answers the question "is this the claimed person?"; it is a 1:1 problem.

Face recognition is the task of identifying an ID out of a sample of $K$ persons using an image as input. It answers the question "who is this person?"; it is a 1:$K$ problem.

We can use a face verification system to build a face recognition system. However, the accuracy of the verification system has to be high (let's say 99.9% or more) for it to be used reliably within a recognition system, because the chance of making an error in a 1:1 problem is roughly multiplied by the $K$ persons.
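A rough calculation shows how quickly the error grows. The sketch below assumes that the $K$ verification checks fail independently, which is a simplification; the helper `recognition_error` and the choice $K = 100$ are made up for illustration.

```python
def recognition_error(verification_accuracy, k):
    """Probability of at least one wrong answer in k independent 1:1 checks."""
    return 1.0 - verification_accuracy ** k


# Compare a 99% and a 99.9% accurate verifier on a database of 100 people
error_99 = recognition_error(0.99, 100)
error_999 = recognition_error(0.999, 100)
```

Under these assumptions, a 99% accurate verifier makes at least one mistake across 100 people about 63% of the time, while a 99.9% accurate one does so about 9.5% of the time, close to the $K \times 0.1\%$ approximation.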

### One Shot Learning

One of the challenges of face recognition is to solve the **one-shot learning problem**: the system must be able to recognize a person after learning from only one image of them. For example, a company has only one image of each employee and the doors must recognize them.

Historically, deep learning doesn't work well with a small amount of data. Training a ConvNet with a softmax output over every possible person does not work well. Moreover, if a new person is added to the database, the model has to be trained again.

This problem is solved by learning a similarity function:

$$d(\text{img}_1, \text{img}_2) = \text{ degree of difference between images}$$

We want $d$ to be low in case of the same faces. We use a threshold $\tau$ such that if $d(\text{img}_1, \text{img}_2) \leq \tau$ then the faces are considered the same.

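It is also robust to new inputs: a new person can be added to the database without retraining the model, because $d$ only compares pairs of images.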
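A minimal sketch of how such a function could be used for recognition is shown below. The names `recognize` and `pixel_distance` are hypothetical, and `pixel_distance` is only a toy stand-in for a learned $d$ (a real system would compare learned encodings, not raw pixels).

```python
import numpy as np


def pixel_distance(img_1, img_2):
    """Toy stand-in for a learned d: mean squared pixel difference."""
    a = np.asarray(img_1, dtype=float)
    b = np.asarray(img_2, dtype=float)
    return float(np.mean((a - b) ** 2))


def recognize(query, database, d, tau):
    """Return the closest person in the database, or None if nobody is within tau.

    database maps a name to the single reference image stored for that person.
    """
    best_name, best_dist = None, float("inf")
    for name, reference in database.items():
        dist = d(query, reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= tau else None


# One reference image per person (tiny 2x2 "images" for illustration)
database = {"alice": np.zeros((2, 2)), "bob": np.ones((2, 2))}
match = recognize(np.full((2, 2), 0.1), database, pixel_distance, tau=0.05)

# Adding a new person only requires storing one more image
database["carol"] = np.full((2, 2), 0.5)
```

Nothing is retrained when `carol` is added; only the lookup table grows.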

### Siamese Network

We apply the similarity function $d(\text{img}_1, \text{img}_2)$ to the output of two copies of the same ConvNet.

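The two images $x_1$ and $x_2$ are passed through two copies of the same ConvNet (with the same parameters), which output the "encoded" feature vectors $f(x_1)$ and $f(x_2)$ respectively. We then define the distance between the two images as the squared norm of the difference between their encodings: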

$$d(x_1, x_2) = || f(x_1) - f(x_2) ||^2$$

If $x_1$ and $x_2$ are images of the same person, we want $d(x_1, x_2)$ to be low.
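The structure can be illustrated with NumPy. In this sketch the encoder `f` is a fixed random linear map standing in for a trained ConvNet, and the $64\times64$ input size and 128-dimensional encoding are arbitrary assumptions; only the shape of the computation is meant to carry over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared parameters: one weight matrix used for BOTH inputs
W = rng.normal(size=(128, 64 * 64))


def f(x):
    """Toy encoder: map a 64x64 image to a 128-dimensional vector."""
    return W @ np.asarray(x, dtype=float).ravel()


def d(x_1, x_2):
    """Squared Euclidean distance between the two encodings."""
    diff = f(x_1) - f(x_2)
    return float(diff @ diff)


# Two random "images" just to exercise the functions
x_1 = rng.random((64, 64))
x_2 = rng.random((64, 64))
same = d(x_1, x_1)
different = d(x_1, x_2)
```

Because both inputs go through the same `f`, the distance of an image to itself is zero and `d` is symmetric in its arguments. Training would adjust the shared parameters so that images of the same person end up close together.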

This method was presented in the paper [Taigman et al., 2014. DeepFace: Closing the Gap to Human-Level Performance in Face Verification](https://www.cv-foundation.org/openaccess/content_cvpr_2014/html/Taigman_DeepFace_Closing_the_2014_CVPR_paper.html).