# Sequence Models

## Recurrent Neural Networks

Sequence models such as recurrent neural networks (RNNs) are useful in tasks like the following:

+ Speech recognition: input is audio $\rightarrow$ output is a sequence of words
+ Music generation: input can be none or some parameters $\rightarrow$ output is a sequence of sounds
+ Sentiment classification: input is a comment $\rightarrow$ output is a score
+ DNA sequence analysis: input is a sequence of letters AGCCCCTGTGAAGGCTAG $\rightarrow$ output is a labeling of which part of the sequence corresponds to what
+ Machine translation: input is a sentence $\rightarrow$ output is a sentence
+ Video activity recognition: input is a sequence of frames $\rightarrow$ output is the recognized activity
+ Named entity recognition: input is a sentence $\rightarrow$ output is the people's names in it

### Notation

Consider a named entity recognition example.

The i-th input is $x^{(i)}$: "Harry Potter and Hermione Granger invented a new spell". Each element of the sample $x^{(i)}$ is denoted by $x^{(i)<t>}$, where $t = 1, 2, \dots, T^{(i)}_x$. In this example $x^{(i)<1>}$ is "Harry".

The i-th target is $y^{(i)}:[1 \space 1 \space 0 \space 1 \space 1 \space 0 \space 0 \space 0 \space 0]$, where $y^{(i)<t>} = 1$ if the $t$-th element is part of a person's name and 0 otherwise. The length of $y^{(i)}$ is denoted by $T^{(i)}_y$.

Notice that the length can differ from one sample to another, and that the length of the input and the length of the output can also be different.

Consider a dictionary that takes the form of a vector whose elements are all the admitted words (generally around 50,000 words).

Words can be represented as one-hot vectors: vectors with the same length as the dictionary, with a 1 in the position of that word in the dictionary and zeros elsewhere.
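The word representation described above can be sketched in a few lines of NumPy. The tiny dictionary `vocab` and the helper `one_hot` are hypothetical names introduced only for illustration; a real dictionary would contain tens of thousands of words.

```python
import numpy as np

# Hypothetical toy dictionary (assumption): sorted words of the example sentence.
vocab = ["a", "and", "granger", "harry", "hermione",
         "invented", "new", "potter", "spell"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector as long as the dictionary with a 1 at the word's position."""
    vec = np.zeros(len(word_to_index))
    vec[word_to_index[word]] = 1.0
    return vec

x_1 = one_hot("harry", word_to_index)  # plays the role of x^{<1>} in the example
```

Here `x_1` has length 9 (the size of the toy dictionary) and a single 1 at index 3, the position of "harry".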

### Recurrent Neural Network Model

#### Why not a standard network?

+ Inputs and outputs can have different lengths in different samples
+ It would not share features learned across different positions of the text (if "Harry" is recognized as a name in one position it should be recognized in other positions too) $\rightarrow$ this is the same idea as parameter sharing in CNNs for images
+ Each element (word) of each input (sentence) has the length of the dictionary, so the input layer would be very large

### Basic Recurrent Neural Network

In a one-directional RNN the information is processed from left to right and the parameters are shared:

<img src="images/one-directional_rnn.jpg" width="600px" />

Starting from some vector of hidden units $a^{<0>}$, generally equal to zero, each element $x^{<t>}$ (a one-hot vector) is combined with the previous activation $a^{<t-1>}$ through the parameters $W_{aa}$, $W_{ax}$ and $b_a$ and an activation function $g_1()$ (generally $\tanh$) to compute the hidden layer $a^{<t>} = g_1(W_{aa}a^{<t-1>} + W_{ax}x^{<t>}+b_a)$. This layer is then used to make a prediction $\hat{y}^{<t>} = g_2(W_{ya}a^{<t>}+b_y)$, where $g_2()$ is another activation function (for example a sigmoid if the problem is binary).

<img src="images/description-block-rnn-ltr.png" width="600px" />


**The important thing is that the parameters $W_{aa}, W_{ax}, b_a, W_{ya}, b_y$ are the same across the whole sequence**. They will be updated through the backpropagation step.


So each activation value $a^{<t-1>}$ is passed to the next time step to make the following prediction. However, one weakness of this RNN is that it only uses the information that is earlier in the sequence to make a prediction. In particular, when predicting $\hat{y}^{<3>}$, it doesn't use information about the words $x^{<4>}$, $x^{<5>}$ and so on.

To simplify the notation let $a^{<t>} = g(W_a [a^{<t-1>} , x^{<t>}]+b_a)$, where $[a^{<t-1>} , x^{<t>}]$ means stacking the two vectors one on top of the other. Correspondingly, the matrix $W_a = [W_{aa} \mid W_{ax}]$ is obtained by placing $W_{aa}$ and $W_{ax}$ side by side (horizontal stacking).

For example, if $a^{<t-1>}$ is a vector of length 100 and $x^{<t>}$ a vector of length 10,000, the new vector $[a^{<t-1>} , x^{<t>}]$ has length 10,100 and $W_a$ has dimension $100 \times 10{,}100$.
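The equivalence between the two notations can be checked numerically. This is a minimal sketch with toy sizes (4 hidden units and a dictionary of 6 words, both arbitrary assumptions) and random weights; it only illustrates one forward step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 4, 6  # toy sizes (assumption); the text uses 100 and 10,000

W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
b_a = np.zeros(n_a)

a_prev = rng.normal(size=n_a)  # a^{<t-1>}
x_t = np.zeros(n_x)            # x^{<t>} as a one-hot vector
x_t[2] = 1.0

# Original notation: two separate matrix products.
a_separate = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

# Simplified notation: W_a = [W_aa | W_ax] applied to the stacked vector.
W_a = np.hstack([W_aa, W_ax])            # shape (n_a, n_a + n_x)
stacked = np.concatenate([a_prev, x_t])  # shape (n_a + n_x,)
a_stacked = np.tanh(W_a @ stacked + b_a)
```

Both `a_separate` and `a_stacked` hold the same activation $a^{<t>}$, up to floating point error.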




### Backpropagation

Consider again the forward propagation:
    
- Starting from some value of $a^{<0>}$, the initialized parameters $W_a,b_a$ and the first input $x^{<1>}$ we compute $a^{<1>}$
- We use $a^{<1>}$ together with initialized parameters $W_y,b_y$ to compute the probability $\hat{y}^{<1>}$
- We pass the same parameters $W_a,b_a$ and $W_y,b_y$ (together with $a^{<t-1>}$) to compute every $a^{<t>}$ and so $\hat{y}^{<t>}$

For every prediction $\hat{y}^{<t>}$ we compute the loss function $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log(\hat{y}^{<t>}) - (1-y^{<t>})\log(1-\hat{y}^{<t>})$, the typical logistic regression (cross-entropy) loss.

The overall loss is given by $L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
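As a small illustration, the per-step loss and its sum over the sequence can be computed as follows. The function name `sequence_loss` and the toy predictions are assumptions made for the example.

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Sum of the per-time-step logistic losses over one sequence."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    return per_step.sum()

# Toy targets and predicted probabilities for a sequence of length 3 (assumption).
y = [1, 1, 0]
y_hat = [0.9, 0.8, 0.1]
loss = sequence_loss(y_hat, y)
```

With these values the loss equals $-\log(0.9) - \log(0.8) - \log(0.9)$, since the last step contributes $-\log(1 - 0.1)$.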

Backpropagation requires doing the computations (passing messages) in the opposite direction, taking derivatives with respect to the parameters $W_a, b_a, W_y, b_y$ in order to update them with gradient descent.

The most significant message to be passed is the one from $a^{<t>}$ to $a^{<t-1>}$, which is why the procedure is called **backpropagation through time**.

### Examples of sequence data and RNN architectures

|Type of RNN|Illustration|Example|
|-|-|-|
| One-to-one $$T_x = T_y = 1$$ |<img src="images/rnn-one-to-one-ltr.png" width="200px" /> |Traditional neural network|
|One-to-many $$T_x = 1, T_y > 1$$|<img src="images/rnn-one-to-many-ltr.png" width="400px" />|Music generation|
|Many-to-one $$T_x > 1, T_y = 1$$|<img src="images/rnn-many-to-one-ltr.png" width="400px" /> |Sentiment classification|
|Many-to-many $$T_x = T_y $$|<img src="images/rnn-many-to-many-same-ltr.png" width="400px" />|Named entity recognition|
|Many-to-many $$T_x \neq T_y $$ |<img src="images/rnn-many-to-many-different-ltr.png" width="400px" /> |Machine translation|


### Language Model and Sequence Generation

A language model estimates the probability of a sentence, that is, the probability of a sequence of words $P(y^{<1>}, y^{<2>}, ...,y^{<T_y>})$. One application is speech recognition, where the input is an audio clip and the output is the most likely sentence among all possible sentences.

To train a language model you need a large corpus of text. Then you need to tokenize the text by mapping every word to a one-hot vector (zeros everywhere and a 1 in the word's position in the dictionary). Punctuation such as "." can also be tokenized as "end of sentence" $<EOS>$. Words not present in the dictionary are tokenized as "unknown" $<UNK>$.

Consider the example 

$$\text{Cats average 15 hours of sleep a day.}$$

The sequence generation has a one-to-many structure of the form

<img src="images/rnn-one-to-many-ltr.png" width="400px" />

First we feed $a^{<0>} = 0$ and $x^{<1>} = 0$ to compute $a^{<1>}$ and $\hat{y}^{<1>}$, which is the probability of every word in the dictionary being the first word.

Going forward, we feed $y^{<1>}$ = "Cats" as input to the following computation of $a^{<2>}$ and $\hat{y}^{<2>}$, so that this time $\hat{y}^{<2>}$ is the probability of every word in the dictionary **conditional on** $y^{<1>}$ = "Cats", and so on. Notice that the probability of a sequence of 3 words is $P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>})P(y^{<2>}|y^{<1>})P(y^{<3>}|y^{<1>}, y^{<2>})$.

The model works by minimizing the loss $L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$, where for every word $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = - \sum_{i=1}^M y_i^{<t>} \log(\hat{y}_i^{<t>})$. Here $M$ is the total number of words in the dictionary and $y^{<t>}$ is the tokenized (one-hot) version of the true word: if the true word is in position $j$ of the dictionary, $y^{<t>}$ has a 1 in position $j$ and zeros elsewhere. In that case minimizing $- \sum_{i=1}^M y_i^{<t>} \log(\hat{y}_i^{<t>})$ is equivalent to maximizing $\log(\hat{y}_j^{<t>})$, the log-probability assigned to the true word.
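The link between the loss and the probability of the whole sentence can be made concrete with a toy example. The predicted distributions below are made-up numbers over a hypothetical 4-word dictionary; they only serve to show that the summed loss is the negative log of the product of the conditional probabilities.

```python
import numpy as np

# Hypothetical predictions y_hat^{<t>} over a 4-word dictionary, for 3 time steps.
y_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.6]])
true_idx = np.array([0, 1, 3])  # position j of the true word at each step
y = np.eye(4)[true_idx]         # one-hot targets, shape (3, 4)

# Overall loss: sum over time steps and over dictionary entries.
loss = -(y * np.log(y_hat)).sum()

# Probability of the sentence via the chain rule: product of the
# probabilities assigned to the true word at each step.
sentence_prob = np.prod(y_hat[np.arange(3), true_idx])
```

Here `sentence_prob` is $0.7 \times 0.5 \times 0.6 = 0.21$ and `loss` equals $-\log(0.21)$, so minimizing the loss maximizes the probability the model assigns to the training sentence.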

### Vanishing gradients with RNNs

One problem of basic RNNs is that they are not very good at capturing long-term dependencies. For example, in the sentence

$$\text{The cat, which already [...], was full.}$$

the word "was" depends on the singular noun "cat", which appeared much earlier in the sequence.

The **vanishing gradient problem**, typical of deep NNs, means that it is difficult for the error computed at the end of the sequence on $\hat{y}^{<T_y>}$ to affect computations that happen much earlier, such as $a^{<1>}$. As a result, each prediction is influenced mostly by the closest words of the sequence.

Although less common, there may also be the problem of exploding gradients, which makes the parameters become very large and can result in numerical overflow.

A solution to exploding gradients is called "gradient clipping" and consists in capping the gradient at some maximum value.
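A minimal sketch of gradient clipping, assuming the gradients are stored in a dictionary of NumPy arrays and that the cap is applied element by element. The names `clip_gradients`, `dW_a` and `db_a` and the threshold of 5 are illustrative assumptions; other variants rescale the gradient by its norm instead.

```python
import numpy as np

def clip_gradients(grads, max_value):
    """Cap every gradient entry to the interval [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}

# Hypothetical gradients with a few exploding entries.
grads = {
    "dW_a": np.array([[0.5, -12.0], [30.0, 0.1]]),
    "db_a": np.array([-7.0, 2.0]),
}
clipped = clip_gradients(grads, max_value=5.0)
```

Entries already inside the interval are left unchanged, while `30.0` becomes `5.0` and `-12.0` becomes `-5.0`.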

### Gated Recurrent Unit (GRU)

## Foundations of Convolutional Neural Networks

### Computer Vision

Computer vision is one of the areas advancing rapidly thanks to deep learning. Applications of computer vision that use deep learning include self-driving cars and face recognition.

Rapid progress in computer vision is enabling applications that weren't possible a few years ago. Deep learning techniques for computer vision keep evolving and producing new architectures, which can also help in areas other than computer vision. For example, Andrew Ng took some ideas from computer vision and applied them to speech recognition.

Examples of computer vision problems include:
* Image classification.
* Object detection $\rightarrow$ detect objects and localize them.
* Neural style transfer $\rightarrow$ changes the style of an image using another image.

One of the challenges of computer vision is that images can be extremely large while a fast and accurate algorithm is required.

For example, a $1000 \times 1000$ RGB image corresponds to $1000 \times 1000 \times 3 = 3$ million features/inputs to a fully connected neural network. If the following hidden layer contains $1000$ units, then the matrix of weights is $1000 \times 3$ million, which is $3$ billion parameters in the first layer alone, and that is computationally very expensive!

One of the solutions is to build the network using **convolution layers** instead of fully connected layers.

### Edge Detection Example

The convolution operation is one of the fundamental building blocks of a CNN. A classic example of convolution is edge detection in images.

Early layers of a CNN might detect edges, the middle layers parts of objects, and the later layers put these parts together to produce an output.

In an image we can detect vertical edges, horizontal edges, or edges of any orientation. An example of the convolution operation used to detect vertical edges:

* on the left there is a greyscale image (10 is brighter than 0)
* the convolution operator is denoted by $*$
* the second element is called *filter* or *kernel* $\rightarrow$ intuition: a vertical edge is a region with bright pixels on the left and dark pixels on the right
* each element of the resulting matrix is given by the sum of the elements of the filter, each one multiplied by the corresponding element in the "overlapping" square of the left matrix (see red and green elements)

<img src="images/w1_edge_detection.PNG" width="600px" />

In Python the convolution operation is done by `tf.nn.conv2d` (TensorFlow) or `Conv2D` (Keras).

Example of a convolution:

<img src="images/convolution-example-matrix.gif" width="600px" />

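Consider instead a dark-to-light input image, with columns $[0,\dots,0,10,\dots,10]$. Applying the same filter results in a gray-dark-gray image, with columns $[0,-30,0]$: the negative values indicate a dark-to-light rather than a light-to-dark transition. If the direction of the transition does not matter, we can take the absolute value of the output.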
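The vertical edge example can be reproduced with a small NumPy sketch. It assumes a $6 \times 6$ image and the $3 \times 3$ filter with columns $[1, 0, -1]$ (bright on the left, dark on the right); the helper `conv2d` is a hypothetical, deliberately naive implementation without padding or stride.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1, no flipping)."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the filter with the overlapping square.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

vertical_filter = np.array([[1, 0, -1]] * 3)

# Light-to-dark image: bright (10) on the left, dark (0) on the right.
light_to_dark = np.hstack([np.full((6, 3), 10), np.zeros((6, 3))])
# Dark-to-light image: the mirrored version.
dark_to_light = light_to_dark[:, ::-1]

edges_a = conv2d(light_to_dark, vertical_filter)  # rows equal to [0, 30, 30, 0]
edges_b = conv2d(dark_to_light, vertical_filter)  # rows equal to [0, -30, -30, 0]
```

Taking `np.abs(edges_b)` gives the same map as `edges_a`, which is the absolute-value trick mentioned above.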

A horizontal filter would be made of rows $$\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]$$

Different filters have been proposed, such as the Sobel filter $\left[\begin{array}{ccc}1 & 0 & -1\\2 & 0 & -2\\1 & 0 & -1\end{array}\right]$ or the Scharr filter $\left[\begin{array}{ccc}3 & 0 & -3\\10 & 0 & -10\\3 & 0 & -3\end{array}\right]$, which put more weight on the central pixels to make the detector more robust.

Applying deep learning means that we don't need to handcraft these numbers: we can treat them as weights and learn them. The network can then learn horizontal, vertical, angled, or any other type of edge detector automatically rather than having them designed by hand:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right]$$

### Padding

When an $n \times n$ matrix is convolved with an $f \times f$ filter the result is a $(n-f+1) \times (n-f+1)$ matrix. Therefore one issue with convolutions is that the resulting image is smaller than the input image.

A second issue is that the filter barely touches the corners and edges of the input image, while the pixels in the center are processed many times.

When the convolution operation is applied multiple times, as in a deep network, the image keeps shrinking and the information at the borders is progressively lost.

For these reasons deep neural networks usually rely on **padding**: the input matrix is augmented with an additional border of *zeros*. If the border thickness is $p$ then the resulting matrix has dimension $(n+2p-f+1)\times(n+2p-f+1)$.

*Valid* convolutions do not apply padding, while in *same* convolutions the padding is chosen so that the output size is the same as the input size: $n+2p-f+1 = n$, which means that $p = \frac{f-1}{2}$.
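A short sketch of zero padding with `np.pad`, assuming a $6 \times 6$ input and a $3 \times 3$ filter (toy sizes chosen for illustration):

```python
import numpy as np

n, f = 6, 3
p = (f - 1) // 2  # padding needed for a "same" convolution

image = np.ones((n, n))
# Add a border of zeros of thickness p on every side.
padded = np.pad(image, p, mode="constant", constant_values=0)

valid_size = n - f + 1          # output size without padding
same_size = n + 2 * p - f + 1   # output size with "same" padding
```

With these values the padded image is $8 \times 8$, a valid convolution returns a $4 \times 4$ output and a same convolution returns a $6 \times 6$ output.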

By convention in computer vision $f$ is usually odd. One of the reasons is that an odd-sized filter has a central position; another is that $p = \frac{f-1}{2}$ is then an integer.

### Strided Convolutions

A strided convolution fixes a number $s$, the stride, which is the number of pixels the filter jumps each time it is moved. A stride of $s=2$ means that the filter covers the input matrix moving by $2$ cells each time.

The resulting matrix has dimension

$$\bigg(\frac{(n+2p-f)}{s}+1\bigg) \times \bigg(\frac{(n+2p-f)}{s}+1\bigg)$$

If the dimension is not made of integers it is rounded down using the `floor()` function, denoted by $\lfloor \dots \rfloor$. 
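The output size formula, including the rounding down, can be written as a small helper. The function name `conv_output_size` is an assumption introduced for this example.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1  # integer division rounds down

size_valid = conv_output_size(6, 3)            # no padding, stride 1
size_same = conv_output_size(6, 3, p=1)        # "same" padding
size_strided = conv_output_size(7, 3, s=2)     # (7 - 3) / 2 + 1
size_rounded = conv_output_size(8, 3, s=2)     # floor(2.5) + 1
```

The last call shows the rounding: $(8 - 3)/2 = 2.5$ is rounded down to $2$, so the output is $3 \times 3$.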

In math textbooks the convolution operation flips the filter (both horizontally and vertically) before applying it to the input matrix:

$$\left[\begin{array}{ccc}w_1 & w_2 & w_3\\w_4 & w_5 & w_6\\w_7 & w_8 & w_9\end{array}\right] \rightarrow \left[\begin{array}{ccc}w_9 & w_8 & w_7\\w_6 & w_5 & w_4\\w_3 & w_2 & w_1\end{array}\right]$$

But in DL there is no flipping. The operation is still referred to as convolution even though, technically, it is a cross-correlation.

### Convolutions Over Volume

When working with colored images we add the depth dimension, given by the number of channels (3 channels for RGB). An $(n \times n \times n_c)$ input image will be convolved with an $(f \times f \times n_c)$ filter:

<img src="images/conv_over_volumns.png" width="600px" />

Each of the numbers of the filter is multiplied by the corresponding number in the input image, and the products are summed up to give a single output value.

It is possible to detect horizontal edges in a single channel only, by setting the filter to zero in the other channels:

$$\underbrace{\left[\begin{array}{ccc}1 & 1 & 1\\0 & 0 & 0\\-1 & -1 & -1\end{array}\right]}_R \quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_G\quad \underbrace{\left[\begin{array}{ccc}0 & 0 & 0\\0 & 0 & 0\\0 & 0 & 0\end{array}\right]}_B$$

It is possible to use multiple filters at the same time, for example one vertical and one horizontal edge detector. The outputs can be stacked together, giving an output whose depth is equal to the number of filters used, for example:

<img src="images/mult_filters.png" width="600px" />

$$
(6\times6\times3) \text{ input image} \rightarrow 
 \biggl\{\begin{array}{c}(3\times3\times3)\text{ "vertical" filter} \rightarrow (4\times4) \text{ matrix}\\
(3\times3\times3)\text{ "horizontal" filter} \rightarrow (4\times4) \text{ matrix}\end{array} \biggr\} \rightarrow
(4\times4\times2) \text{ output}
$$
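The shapes in this example can be verified with a naive NumPy sketch. The helper `conv_volume` is hypothetical, the input image is random, and the two filters simply repeat the vertical and horizontal edge detectors across the three channels (an assumption made for illustration).

```python
import numpy as np

def conv_volume(image, filters):
    """image: (n, n, n_c); filters: (f, f, n_c, n_filters); no padding, stride 1."""
    n = image.shape[0]
    f, _, n_c, n_filters = filters.shape
    out_size = n - f + 1
    out = np.zeros((out_size, out_size, n_filters))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i:i + f, j:j + f, :]  # (f, f, n_c) block of the input
            for k in range(n_filters):
                # Multiply element-wise over all channels and sum to one number.
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6, 3))

vertical_2d = np.array([[1, 0, -1]] * 3)
horizontal_2d = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]])
# Repeat each 2D detector over the 3 channels to get (3, 3, 3) filters.
vertical = np.repeat(vertical_2d[:, :, None], 3, axis=2)
horizontal = np.repeat(horizontal_2d[:, :, None], 3, axis=2)

filters = np.stack([vertical, horizontal], axis=3)  # (3, 3, 3, 2)
output = conv_volume(image, filters)                # (4, 4, 2)
```

Each filter produces one $4 \times 4$ map, and the two maps are stacked along the last axis.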

### One Layer of a Convolutional Network

In a layer of a CNN the filters have the same role as the weights $w^{[l]}$ of a NN. To the output of each filter's convolution we add a (different) constant $b^{[l]} \in \mathbb{R}$ with broadcasting, so that the result takes the role of $z^{[l]}$, to which we apply the non-linearity to get $a^{[l]} = g(z^{[l]})$. Stacking these activations gives the final output of the layer.

<img src="images/example_layer.png" width="700px" />

With ten $(3\times3\times3)$ filters we need $(3*3*3+1)*10=280$ parameters.

Notice that no matter what the size of the input is, the number of parameters stays the same as long as the filter size and the number of filters are the same. That makes the layer less prone to overfitting.

Summary of notation for layer $l$ of a convolutional layer:
* Hyperparameters:
 * $f^{[l]}$ = filter size
 * $p^{[l]}$ = padding (default is zero) $\rightarrow$ note that padding does not apply to the depth!
 * $s^{[l]}$ = stride
 * $n_c^{[l]}$ = number of channels/filters
 
* Input (height and width): $(n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]})$

* Output (height and width): $(n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$
 * where $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$, same for $n_W^{[l]}$

* Each filter is $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]})$, since it should match the number of channels of the input.

* The activations $a^{[l]}$ correspond to the outputs; in vectorized notation (batch gradient descent), $A^{[l]}$ has dimension $(m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]})$

* The weights are $(f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]})$, where the last quantity is the total number of filters of layer $l$.

* The bias is a vector of length $n_c^{[l]}$, one value for each filter, but it is more convenient to express it as a $(1 \times 1 \times 1 \times n_c^{[l]})$ tensor $\rightarrow$ a multidimensional array.

### Simple convolution network example

The dimension of the layers follows the rule $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$:

<img src="images/simple_cnn_example.png" width="800px" />

Finally we flatten the last volume into a $7*7*40=1960$ column vector and feed it to a logistic or soft-max unit (depending on whether the output is binary or multi-class).

In the example the image gets smaller after each layer, which is a common trend in CNNs.

There are 3 types of layer in a convolutional network:
* Convolution
* Pooling
* Fully connected

### Pooling layers

Other than convolution layers, CNNs often use **pooling layers** to reduce the size of the representation, speed up computation, and make some of the detected features more robust:

<img src="images/max_pooling.png" width="600px" />

Notice that there are no parameters to be learned!

In case of input with multiple channels ($n_c^{[l-1]}$) the filter does the computation over all the channels independently and the output has $n_c^{[l]} = n_c^{[l-1]}$ as third dimension: the first matrix of the output takes the max of the elements from the first matrix of the input, the second from the second and so on...

The main reason why people use pooling is that it works well in practice and reduces computation.

An alternative to max pooling is average pooling, which takes the average of each window instead of the maximum.

The important thing is that there are no parameters to learn.
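A minimal sketch of max and average pooling over a multi-channel input. The helper `pool2d`, the window size $f = 2$ and the stride $s = 2$ are assumptions chosen for the example; each channel is pooled independently, so the number of channels is preserved.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """x: (n_h, n_w, n_c). Pools every channel independently; nothing is learned."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    reducer = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f, :]  # (f, f, n_c)
            out[i, j, :] = reducer(window, axis=(0, 1))       # one value per channel
    return out

# Toy input with 2 channels: x[i, j, c] = 2 * (4 * i + j) + c.
x = np.arange(32, dtype=float).reshape(4, 4, 2)
max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="average")
```

Both outputs have shape $(2 \times 2 \times 2)$: the spatial size is halved while the depth stays equal to the number of input channels.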

### CNN Example

This example is similar to LeNet-5, which was introduced by Yann LeCun in 1998.

By convention, a conv layer together with the following pooling layer is counted as a single layer, since the pooling layer does not have weights.

<img src="images/nn_example.png" width="1000px" />


Conv layers need relatively few parameters: layer 1 has only $5*5*3*6+6 = 456$ and layer 2 only $5*5*6*16+16=2416$. This is far fewer than the fully connected layers 3 and 4, with $400*120+120=48120$ and $120*84+84=10164$ parameters respectively.
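These counts follow directly from the layer shapes and can be reproduced with two small helpers. The function names `conv_params` and `fc_params` are assumptions introduced for this check.

```python
def conv_params(f, n_c_prev, n_filters):
    """Weights of n_filters filters of size f x f x n_c_prev, plus one bias each."""
    return f * f * n_c_prev * n_filters + n_filters

def fc_params(n_in, n_out):
    """Weight matrix of a fully connected layer plus one bias per output unit."""
    return n_in * n_out + n_out

layer1 = conv_params(5, 3, 6)    # 5*5*3*6 + 6
layer2 = conv_params(5, 6, 16)   # 5*5*6*16 + 16
layer3 = fc_params(400, 120)     # 400*120 + 120
layer4 = fc_params(120, 84)      # 120*84 + 84
```

The two conv layers together account for fewer than 3,000 parameters, against more than 58,000 for the two fully connected layers.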

Generally, the deeper you go, the more the input size decreases while the number of filters increases.

### Why Convolutions?

CNNs are convenient because they need fewer parameters to be trained. In the example above the input image has $32*32*3=3072$ features and the output of the first conv layer has $28*28*6=4704$ units. In the conv layer we only need $(5*5*3+1)*6=456$ parameters, while a fully connected layer would need $3072*4704>14$ million parameters.

* Parameter sharing: the same filter (same parameters) can be applied to multiple parts of the image, for example a vertical edge detector.

* Sparsity of connections: in each layer, each output value (element of the output matrix) depends only on a small number of inputs (those covered by the filter). This helps the network capture translation invariance: the result is robust to small shifts of pixels in the input image.

## Case Studies

### Why look at case studies?

Neural network architectures that work well on one task often work well on other tasks too.

Here are some classical CNN networks:

* LeNet-5
* AlexNet
* VGG

ResNet, the architecture that won the ImageNet competition in 2015, has a version with 152 layers. There is also an architecture called Inception, developed by Google, which is very useful to learn and apply to your own tasks.

Reading about and trying out the mentioned models can give you a lot of ideas on how to solve your own task.

### Classic networks

**LeNet-5 (1998)**

The goal for this model was to identify handwritten digits in a ($32\times32\times1$) gray image.

<img src="images/LeNet-5.png" width="1000px" />

This model was published in 1998. At that time it was common to use average pooling instead of max pooling, and the last layer did not use softmax.

It has around $60k$ parameters, very few compared to today's networks.

The dimensions of the image decrease as the number of channels increases.

The architecture: `Conv ==> Pool ==> Conv ==> Pool ==> FC ==> FC ==> softmax` is quite common.

The activation functions used in the paper were sigmoid and tanh. Modern implementations use ReLU in most cases.

**AlexNet (2012)**

The goal for the model was the ImageNet challenge which classifies images into 1000 classes.

<img src="images/AlexNet.png" width="1000px" />

The architecture is `Conv ==> Max-pool ==> Conv ==> Max-pool ==> Conv ==> Conv ==> Conv ==> Max-pool ==> Flatten ==> FC ==> FC ==> Softmax`

It has around 60 million parameters, compared to the 60k parameters of LeNet-5.

It used the ReLU activation function.

This paper convinced computer vision researchers of the importance of deep learning.

**VGG-16 (2014)**

It always uses conv layers with ($3\times3$) filters, stride = 1 and same padding, and max-pool layers with ($2\times2$) filters and stride = 2.

<img src="images/VGG-16.png" width="1000px" />


The 16 in the name refers to the 16 layers with weights.

This network is large even by modern standards. It has around 138 million parameters.


### ResNets (Residual Networks)

Very deep NNs are difficult to train because of vanishing and exploding gradients problems.

A solution to this problem is to take the activation from one layer $a^{[l]}$ and feed it directly to another layer much deeper in the NN, which allows you to train very deep NNs, even with more than 100 layers.

Instead of what happens in a **plain network**:

$a^{[l]} ==> \underbrace{z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}}_{\text{linear}} ==> \underbrace{a^{[l+1]} = g(z^{[l+1]})}_{\text{ReLU}} ==> \underbrace{z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}}_{\text{linear}} ==> \underbrace{a^{[l+2]} = g(z^{[l+2]})}_{\text{ReLU}}$

We take a *short cut* / skip connection to make it a **residual network**:

$a^{[l]} ==> ... ==> \underbrace{a^{[l+2]} = g(z^{[l+2]} + a^{[l]})}_{\text{ReLU}}$

This is done after the last linear operation and before the last ReLU operation and forms a **residual block**, which makes it possible to train much deeper NNs.

<img src="images/res_block.png" width="1000px" />

In plain NNs the training error starts increasing once the network becomes deep enough:


<img src="images/resnet_error.png" width="1000px" />

### Why ResNets work

Consider a residual block such as

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

When applying L2 regularization the weights $W^{[l+2]}$ shrink toward zero. If $W^{[l+2]} \approx 0$ and $b^{[l+2]} \approx 0$, then

$$a^{[l+2]} = g(a^{[l]})$$

And when the activation function is ReLU all activations are non-negative, so $g(a^{[l]}) = a^{[l]}$ and

$$a^{[l+2]} = a^{[l]}$$

So the identity function is easy for a residual block to learn, and adding the two extra layers does not hurt performance. On the other hand, the two extra layers may learn something useful, in which case the network does better than with the identity function alone.
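The argument can be checked numerically with a fully connected residual block. This is a sketch under simplifying assumptions: the layers are dense rather than convolutional, the sizes are arbitrary, and the last layer's weights and bias are set exactly to zero to mimic the effect of strong regularization.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2):
    a_next = relu(W1 @ a_l + b1)  # a^{[l+1]}
    z_out = W2 @ a_next + b2      # z^{[l+2]}
    return relu(z_out + a_l)      # skip connection added before the last ReLU

rng = np.random.default_rng(0)
n = 5
a_l = relu(rng.normal(size=n))  # non-negative, as if produced by an earlier ReLU
W1, b1 = rng.normal(size=(n, n)), rng.normal(size=n)
W2, b2 = np.zeros((n, n)), np.zeros(n)  # weights shrunk all the way to zero

a_out = residual_block(a_l, W1, b1, W2, b2)
```

With `W2` and `b2` equal to zero the block returns `a_l` unchanged, whatever the values of `W1` and `b1`.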

Using a skip-connection helps the gradient to backpropagate and thus helps you to train deeper networks.

The dimensions of $z^{[l+2]}$ and $a^{[l]}$ have to be the same in ResNets. In case they have different dimensions we add a matrix $W_s$ (which can be learned or fixed) such that

$$a^{[l+2]} = g( z^{[l+2]} + W_s a^{[l]})$$

where $W_s$ can also be a fixed matrix that implements zero padding.

### Networks in Networks and 1x1 Convolutions

A 1 x 1 convolution (also called Network in Network) is useful in many CNN models. It has been used in many modern CNN architectures such as ResNet and Inception.

A 1 x 1 convolution is useful when:

* We want to shrink the number of channels. We also call this feature transformation: $(28\times 28\times 192)\underbrace{\rightarrow}_{32 \text{ filters } 1\times 1}(28\times 28\times 32)$ 

    We will later see that by shrinking it we can save a lot of computations.

* If the number of 1 x 1 conv filters is the same as the number of input channels, the output keeps the same number of channels. The 1 x 1 conv still adds a non-linearity, so the network can learn a more complex function.

### Inception Network Motivation

The idea behind **inception networks** is to combine multiple different layers at once. For example one layer can be made of the concatenation of a $1 \times 1$ filter, a $3 \times 3$ filter, a $5 \times 5$ filter, and a pooling layer, all with different numbers of channels and same padding:

<img src="images/inception_layer.png" width="1000px" />

In this scenario we apply all the convs and pools we might want and let the NN learn which ones it wants to use most.

The problem is the computational cost. Counting only the multiplications of the $5 \times 5$ conv gives $5*5*192*28*28*32$, which is more than $120$ million operations.

A solution to this is to use a $1 \times 1$ convolution as a **bottleneck layer**:

<img src="images/bottleneck_layer.PNG" width="800px" />

In this case the multiplications are $1*1*192*28*28*16 + 5*5*16*28*28*32$, which is about $2.4 + 10.0 = 12.4$ million, roughly one tenth of the first example.
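The two costs can be recomputed with a one-line helper. The function name `conv_multiplications` is an assumption; the sizes are the ones used in the text.

```python
def conv_multiplications(f, n_c_in, n_h_out, n_w_out, n_filters):
    """Multiplications of a conv layer: filter size x output positions x filters."""
    return f * f * n_c_in * n_h_out * n_w_out * n_filters

# Direct 5x5 convolution from 192 to 32 channels on a 28x28 map.
direct = conv_multiplications(5, 192, 28, 28, 32)

# Bottleneck: 1x1 convolution down to 16 channels, then the 5x5 convolution.
bottleneck = (conv_multiplications(1, 192, 28, 28, 16)
              + conv_multiplications(5, 16, 28, 28, 32))

ratio = bottleneck / direct
```

Here `direct` is 120,422,400 and `bottleneck` is 12,443,648, so the ratio is about 0.10.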

### Inception Network

The core building block of the **inception network** is the **inception unit**:

<img src="images/inception_unit.png" width="800px" />

Remember that the pooling layer does the computation over all the channels independently and the output has $n_c^{[l]} = n_c^{[l-1]}$ as third dimension: the first matrix of the output takes the max of the elements from the first matrix of the input, the second from the second and so on... Therefore we need another $1 \times 1$ filter after the pooling layer to reduce the number of channels.

Example of an inception unit in Keras:

<img src="images/inception_keras.png" width="800px" />


An example of an inception network is GoogLeNet. It has 9 inception units and uses some max-pool layers to reduce the dimension.

There are 3 soft-max branches at different positions to push the network toward its goal. The side branches help to ensure that the features computed even in the intermediate layers are good enough to make a prediction, which has a regularizing effect.

<img style="transform: rotate(90deg); width:250px" src="images/googlenet.png" />

Since the development of the Inception module, the authors and others have built other versions of this network, such as Inception v2, v3, and v4. There is also a network that combines the Inception module with ResNet.

The name inception network comes from the movie *Inception* (2010):

<img src="images/inception.gif" width="400px"/>


### MobileNet

MobileNets are another foundational convolutional neural network architecture used for computer vision. MobileNets make it possible to build and deploy neural networks that work even in low-compute environments, such as a mobile phone.

MobileNets use a special type of convolution called **Depthwise Separable Convolution**.

<img src="images/normal_conv.png" width="600px" />

In a "normal" convolution the computation involves $\underbrace{f^{[l]} \times f^{[l]} \times n_c^{[l-1]}}_{\text{# filter params}} \times \underbrace{n_H^{[l]} \times n_W^{[l]}}_{\text{# filter positions}} \times \underbrace{n_c^{[l]}}_{\text{# of filters}}$ multiplications, in the example $3 \times 3 \times 3 \times 4 \times 4 \times 5  = 2160$ multiplications.

Depthwise Separable Convolutions are made of a depthwise step followed by a pointwise step:

<img src="images/dps_conv.png" width="600px" />

In the depthwise step each channel of the filter convolves only the corresponding channel of the input tensor:

<img src="images/dps_det1.png" width="600px" />


<img src="images/dps_det2.png" width="600px" />

This step requires $\underbrace{3 \times 3}_{\text{# filter params}} \times \underbrace{4 \times 4}_{\text{# filter positions}} \times \underbrace{3}_{\text{# of filters}} = 432$ multiplications

The pointwise step takes the output of the previous step and convolves it with $n_c'$ $(1 \times 1 \times n_c)$ filters

<img src="images/dps_point.png" width="600px" />

This step requires $\underbrace{1 \times 1 \times 3}_{\text{# filter params}} \times \underbrace{4 \times 4}_{\text{# filter positions}} \times \underbrace{5}_{\text{# of filters}} = 240$ multiplications

In this example the normal convolution requires 2160 multiplications versus 672 multiplications for the depthwise separable convolution, which is 31% of the operations.

In general a depthwise separable convolution takes a fraction $\frac{1}{n_c'}+\frac{1}{f^2}$ of the operations of a normal convolution. In the example it is $\frac{1}{5}+\frac{1}{9}=0.3111$, but generally the number of output channels is larger: with $n_c' = 512$ and $f = 3$ the fraction is about $0.11$, roughly a tenfold reduction.
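The counts of the example and the general ratio can be verified with a short sketch. The function names are assumptions; `n_out` denotes the output height and width (4 in the example), `n_c` the input channels and `n_c_prime` the number of pointwise filters.

```python
def normal_cost(f, n_c, n_out, n_c_prime):
    """Multiplications of a normal convolution."""
    return f * f * n_c * n_out * n_out * n_c_prime

def depthwise_separable_cost(f, n_c, n_out, n_c_prime):
    """Multiplications of the depthwise step plus the pointwise step."""
    depthwise = f * f * n_out * n_out * n_c
    pointwise = n_c * n_out * n_out * n_c_prime
    return depthwise + pointwise

normal = normal_cost(3, 3, 4, 5)
separable = depthwise_separable_cost(3, 3, 4, 5)
ratio = separable / normal

# Same ratio with a more typical number of output channels.
ratio_512 = depthwise_separable_cost(3, 3, 4, 512) / normal_cost(3, 3, 4, 512)
```

The ratio does not depend on the input size or on the number of input channels: it equals $\frac{1}{n_c'}+\frac{1}{f^2}$, which gives about 0.311 for the example and about 0.113 with 512 output channels.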

This procedure was introduced in [Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)

### MobileNet Architecture

The MobileNet presented in the paper stacks 13 layers with depthwise separable convolution, one pooling layer, one fully-connected layer and a soft-max step to solve a classification problem. This was the MobileNet v1 application.

A second version called MobileNet v2 uses a "bottleneck block" with a residual connection, repeated 17 times, instead of the simpler depthwise separable convolution.

<img src="images/MobileNets.png" width="600px" />

This version was introduced in [Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381)

The bottleneck block works as follows:

- first it passes the input via a residual connection directly to the output
- in the non residual connection part of the block it does an expansion, a depthwise step and a projection

<img src="images/MNv2.png" width="600px" />

This bottleneck block achieves two things:
- first, by using the expansion operation it increases the size of the representation within the bottleneck block, and this allows the neural network to learn a richer function
- second, when deployed on mobile devices with memory constraints, it uses the pointwise convolution to project the representation back to a smaller set of values, so that not much memory is needed when passing it to the next block

So it allows the network to learn complex functions while keeping the memory (that is, the size of the activations that need to be passed from layer to layer) relatively small.

### EfficientNet

EfficientNets were developed to deal with the tradeoff between memory and the cost of processing images. The cost can increase through a higher image resolution, a deeper NN or a wider representation (larger tensors). There are many open-source implementations based on the paper

[Tan and Le, 2019, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946)

## Practical advice for using ConvNets

### Using Open-Source Implementation

Many NNs are difficult to replicate because some details, such as parameter tuning, may not be presented in their papers.

A lot of deep learning researchers open source their code on sites like GitHub.

One advantage of this is that you can download the network implementation along with its parameters/weights. The author might have used multiple GPUs and spent some weeks to reach this result, and it is right in front of you after you download it.

### Transfer Learning

It is common practice to use a NN architecture that has already been trained, that is, to use its pretrained parameters/weights instead of random initialization and training from scratch. The pretrained models might have been trained on a large dataset like ImageNet. This can save a lot of time.

For example, when using another NN with its weights, remove the softmax layer, put your own in its place, and make the network learn only the new layer while the other weights are fixed/frozen.

Another trick that can speed up training is to run the pretrained NN without the final softmax layer, get an intermediate representation of your images and save it to disk. Then use these representations as the input of a shallow NN. This saves the time needed to run an image through all the layers at every training step. It is like converting your images into vectors.

An alternative is to freeze only the first few layers of the pretrained network and learn the other weights in the network, or to replace the later layers with your own.

If you have enough data, you can use the pretrained weights only as initialization and retrain the whole network (after changing the softmax layer).
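The "freeze the body, cache the features, train only the new output layer" idea can be sketched in NumPy. This is a toy under stated assumptions: the frozen body is a hypothetical fixed random projection (`W_frozen`) standing in for a real pretrained network, the data are random, and the new layer is a single logistic unit rather than a full softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the frozen body of a pretrained network:
# a fixed random projection followed by a ReLU. It is never updated.
W_frozen = rng.normal(size=(64, 16)) / np.sqrt(64)


def frozen_features(x):
    return np.maximum(x @ W_frozen, 0.0)


# Toy data standing in for images and binary labels.
X = rng.normal(size=(200, 64))
y = (X[:, 0] > 0).astype(float)

# Step 1: run the frozen body once and cache the representation.
features = frozen_features(X)


# Step 2: train only the new output layer on the cached features.
def log_loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


w, b = np.zeros(16), 0.0
initial_loss = log_loss(w, b)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    w -= 0.1 * (features.T @ (p - y)) / len(y)
    b -= 0.1 * float(np.mean(p - y))
final_loss = log_loss(w, b)
print(initial_loss, final_loss)
```

With a real framework the same effect is obtained by marking the pretrained layers as non-trainable; the cached `features` array plays the role of the representations saved to disk.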

### Data Augmentation

The more data you have, the better a deep NN will generally perform. Data augmentation is one of the techniques used in deep learning to improve the performance of a deep NN.

Some data augmentation methods that are used for computer vision tasks include:
* Mirroring
* Random cropping (take random portions of the image, note that they should be big enough)
* Rotation
* Shearing
* Local warping
* Color shifting: for example, add small distortions to the R, G, and B channels, so that the image still looks the same to a human but is different for the computer. In practice the added values are drawn from some probability distribution and the shifts are relatively small. There is an algorithm called PCA color augmentation that decides the shifts needed automatically.

It is possible to implement the distortions during training by using separate CPU threads that produce the distorted mini-batches while the NN is being trained. Data augmentation also has some hyperparameters. A good place to start is to find an open source data augmentation implementation and then use it as is or fine-tune its hyperparameters.
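A minimal NumPy sketch of three of the methods listed above (mirroring, random cropping and color shifting) could look as follows. The function name `augment`, the crop size and the maximum shift are illustrative assumptions; real pipelines usually rely on an existing library.

```python
import numpy as np


def augment(image, rng, crop_size=(24, 24), max_shift=20):
    """Return a randomly mirrored, cropped and color-shifted copy
    of an H x W x 3 uint8 image."""
    out = image.astype(np.int16)

    # Mirroring: flip left-right with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Random cropping: pick the top-left corner uniformly at random.
    h, w = out.shape[:2]
    ch, cw = crop_size
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = out[top:top + ch, left:left + cw, :]

    # Color shifting: one small random offset per channel.
    shift = rng.integers(-max_shift, max_shift + 1, size=3)
    out = np.clip(out + shift, 0, 255)
    return out.astype(np.uint8)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = augment(image, rng)
print(augmented.shape)  # (24, 24, 3)
```

The crop size and the shift range are examples of the hyperparameters mentioned above.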

### State of Computer Vision

When there is not much data, people tend to use some "hacks", like choosing a more complex NN architecture. This is typical in computer vision, where the complexity of the problem would require much more data.

Tips for doing well on benchmarks/winning competitions:
* Ensembling: train several networks independently and average their outputs. This generally increases performance, but it slows down prediction in production by a factor equal to the number of models in the ensemble. It also takes more memory, since all the models must be kept in memory.
* Multi-crop at test time: run the classifier on multiple versions (crops) of each test image, not only of the training images, and average the results. A technique called 10-crop does this. It can give you a better result in production.

Use architectures of networks published in the literature. Use open source implementations if possible. Use pretrained models and fine-tune on your dataset.

## Detection Algorithms

### Object Localization

Object localization is about drawing a box around the predicted label ("classification with localization"). It is the step before object detection (where there can be multiple objects of different classes).

<img src="images/object_det.jpeg" width="600px"/>

The prediction is not only about the class $1, ...C$, but also about the box that contains the object:

$$
y_i = \left[\begin{array}{ll}
p_c \in \{0,1\} & \text{is there any object?} \\
b_x & \text{x-axis center of the box} \\
b_y & \text{y-axis center of the box} \\
b_h & \text{height of the box} \\
b_w & \text{width of the box} \\
c_1 \in \{0,1\} & \text{class 1 object} \\
\vdots & \vdots \\
c_C \in \{0,1\} & \text{class C object}
\end{array}\right]
$$

By convention, the coordinates of the boxes go from the N-W corner of the image, $\{0,0\}$, to the S-E corner, $\{1,1\}$.

If there is no object then $p_c = 0$ and the rest of the vector $y_i$ is irrelevant.

The loss function can be quadratic. If there is an object ($p_c = 1$) it is

$$L(\hat{y}_i,y_i) = (\hat{p}_c-p_c)^2+(\hat{b}_x-b_x)^2+...+(\hat{c}_C-c_C)^2$$

and just $(\hat{p}_c-p_c)^2$ if there is no object.
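The two cases of the loss can be written as one small function. This is a sketch: the name `localization_loss` and the example vectors are made up for illustration, with three classes assumed.

```python
import numpy as np


def localization_loss(y_hat, y):
    """Squared-error loss for a target [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_C]."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if y[0] == 1:
        # An object is present: every component contributes.
        return float(np.sum((y_hat - y) ** 2))
    # No object: only the p_c component matters.
    return float((y_hat[0] - y[0]) ** 2)


y_object = [1, 0.5, 0.5, 0.3, 0.4, 0, 1, 0]
y_pred = [0.9, 0.5, 0.5, 0.3, 0.4, 0, 1, 0]
print(localization_loss(y_pred, y_object))   # about 0.01

y_empty = [0, 0, 0, 0, 0, 0, 0, 0]
y_pred2 = [0.2, 0.7, 0.1, 0.9, 0.3, 0.5, 0.5, 0.5]
print(localization_loss(y_pred2, y_empty))   # about 0.04, only p_c counts
```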

### Landmark Detection

Landmark detection is the task of predicting strategic points in an image, for example the border of a face or the skeleton of a body.

<img src="images/landmark_det.jpeg" width="300px"/>

It can be an intermediate step of a larger model, in which first you use a CNN to detect whether there is a face in the image and where, and then use another NN to classify the facial expression on the face.

With $64$ landmarks the output is a vector with $1$ entry to tell if there is the object or not, and $64$ couples of coordinates, one couple for every landmark, therefore $1+128=129$ entries. 

$$
y_i = \left[\begin{array}{ll}
p_c \in \{0,1\} & \text{is there the object?} \\
l_{1,x} & \text{x-coord of first landmark} \\
l_{1,y} & \text{y-coord of first landmark} \\
l_{2,x} & \text{x-coord of second landmark} \\
l_{2,y} & \text{y-coord of second landmark} \\
\vdots & \vdots \\
l_{64,x} & \text{x-coord of 64th landmark} \\
l_{64,y} & \text{y-coord of 64th landmark}
\end{array}\right]
$$

### Object Detection

A first technique for object detection is called **sliding window detection algorithm**.

The first step consists in training a CNN on the targets you want to detect, using closely cropped images:
<img src="images/sliding_windows_1.png" width="150px"/>

Notice that if you want to predict bounding boxes then you need to provide them.

Then, on the test image, you pick a certain window size, input the small rectangular region into the ConvNet and predict whether that portion contains the object or not. You continue by sliding the rectangle over another portion of the image, using a certain stride, until you have covered all parts:

<img src="images/sliding_window.gif" width="200px"/>

You repeat the entire process a few times with larger rectangles, and in the end you store the rectangles that contain the targets.
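The enumeration of window positions can be sketched as a generator. The function name `sliding_windows` and the sizes in the example are illustrative; in a full detector each yielded region would be cropped and passed to the trained classifier.

```python
def sliding_windows(image_height, image_width, window, stride):
    """Yield (top, left, bottom, right) for every square window position."""
    for top in range(0, image_height - window + 1, stride):
        for left in range(0, image_width - window + 1, stride):
            yield top, left, top + window, left + window


boxes = list(sliding_windows(16, 16, window=14, stride=2))
print(len(boxes))  # 4
print(boxes)
```

Repeating the call with larger `window` values corresponds to the passes with larger rectangles.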

The problem with the sliding window detection algorithm is the computation time. Before deep learning, people used hand-crafted linear classifiers to classify the object and then applied the sliding window technique; the linear classifier was cheap to compute. In the deep learning era, running the classifier on every window is computationally expensive because of the complexity of the deep learning model.

However, sliding windows object detection can be implemented *convolutionally*, which is much more efficient.

### Convolutional Implementation of Sliding Windows

First, consider how to turn a FC layer into convolutional layers (which predict an image class out of four classes). Mathematically a $400\times1$ vector and a $1\times1\times400$ tensor are identical:

<img src="images/sliding_windows_2.png" width="800px"/>


To solve the computational problem of applying a CNN sequentially to every part of the image, the idea is to perform all the computations at once by passing the entire image through the CNN, making the predictions on all the different areas at the same time. This is feasible because the different rectangles of the image share overlapping pixels.

Consider a ConvNet input of $14\times14\times3$ that eventually will output a $1\times1\times4$ volume (as above). In order to make a prediction on a test image of $16\times16\times3$, with a stride of $2$ you would need to pass four rectangles into the ConvNet (top left, top right, bottom left, bottom right): but notice that most of the pixels are overlapping! The operations in the four passes are highly duplicative.

So, instead of feeding the ConvNet four times, feed the whole image into the same ConvNet: the output is a $2\times2\times4$ volume in which each $1\times1\times4$ element gives the result of running the same ConvNet on the corresponding (top left, top right, bottom left, bottom right) window:

<img src="images/sliding_windows_3.png" width="800px"/>

If you want to run sliding windows on a $28\times28\times3$ image, applying the same ConvNet as above leads to an $8\times8\times4$ volume, which is equivalent to running a sliding window over $14\times14$ regions with a stride of $2$:

<img src="images/sliding_windows_4.png" width="800px"/>
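The output sizes quoted above follow from counting window positions along each side: $(n - 14)/2 + 1$ for an $n\times n$ input. A quick check, where the helper name `sliding_output_size` is made up for this sketch:

```python
def sliding_output_size(image_size, window=14, stride=2):
    """Number of window positions along one side of a square image."""
    return (image_size - window) // stride + 1


for n in (14, 16, 28):
    side = sliding_output_size(n)
    print(f"{n}x{n}x3 input -> {side}x{side}x4 output")
```

This reproduces the $1\times1\times4$, $2\times2\times4$ and $8\times8\times4$ volumes of the three examples.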

**NOT CLEAR: should the window be of the same size of the input of the ConvNet? Does using multiple windows of different size mean to train multiple ConvNets with different input size?**

This method was introduced by [Sermanet et al., 2014, OverFeat: Integrated recognition, localization and detection using convolutional networks](https://arxiv.org/abs/1312.6229)

### YOLO Algorithm

#### Bounding Box Predictions

With sliding windows you may not get an accurate bounding box, because the location of the window may not match the position of the object perfectly, or the real bounding box may not have the same shape as the window (e.g. it may be a wider rectangle).

You split the image (e.g. $100\times100$) in grid cells and apply the classification with localization algorithm to each grid cell. The object is assigned to the grid cell where the centroid of the bounding box lies.

<img src="images/yolo.png" width="200px"/>

For each grid cell the output is again a vector of dimension $C+5$ (with $C$ total classes)

$$
y_i = \left[\begin{array}{ll}
p_c \in \{0,1\} & \text{is there any object?} \\
b_x \in [0,1] & \text{x-axis center of the box} \\
b_y \in [0,1] & \text{y-axis center of the box} \\
b_h > 0 & \text{height of the box} \\
b_w > 0 & \text{width of the box} \\
c_1 \in \{0,1\} & \text{class 1 object} \\
\vdots & \vdots \\
c_C \in \{0,1\} & \text{class C object}
\end{array}\right]
$$

Notice that while the center of the box is always between 0 (N-W) and 1 (S-E) within its grid cell, the height and width are expressed relative to the grid cell dimensions and can be larger than 1 if the object spans several grid cells.

To train a NN, the inputs are $100\times100\times3$ images and the output is a $3\times3\times(C+5)$ volume, where the channels at each of the $9$ grid positions hold the prediction for that grid cell.
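A sketch of how such a target volume could be built from labeled boxes is given below. The function name `encode_yolo_target` and the input convention (centers and sizes given as fractions of the whole image) are assumptions made for this example.

```python
import numpy as np


def encode_yolo_target(objects, grid=3, num_classes=3):
    """Build a grid x grid x (5 + C) target volume.

    Each object is (x_center, y_center, height, width, class_index), with
    coordinates given as fractions of the whole image in [0, 1).
    """
    target = np.zeros((grid, grid, 5 + num_classes))
    for x, y, h, w, cls in objects:
        # The object belongs to the cell that contains its center.
        col, row = int(x * grid), int(y * grid)
        target[row, col, 0] = 1.0                # p_c
        target[row, col, 1] = x * grid - col     # b_x inside the cell
        target[row, col, 2] = y * grid - row     # b_y inside the cell
        target[row, col, 3] = h * grid           # b_h relative to the cell
        target[row, col, 4] = w * grid           # b_w relative to the cell
        target[row, col, 5 + cls] = 1.0          # class indicator
    return target


target = encode_yolo_target([(0.5, 0.7, 0.4, 0.5, 1)])
print(target.shape)   # (3, 3, 8)
print(target[2, 1])
```

In the example the box is 0.4 of the image high, which becomes a $b_h$ of about 1.2 grid cells, so the height exceeds 1 as described above.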

Notice two things: 
* Compared to the sliding window algorithm, it outputs the bounding box coordinates explicitly. This allows the NN to output bounding boxes of any aspect ratio, as well as much more precise coordinates that are not dictated by the stride size of the sliding window.

* This algorithm also has a convolutional implementation: you don't run the algorithm once for each grid cell, but use one ConvNet with a lot of shared computation between all the grid cells. It runs relatively fast.

There is a problem if there is more than one object in one grid cell. A finer grid makes this less likely: for a $100\times100\times3$ image it is reasonable to use $19\times19$ grid cells.

This method is called **YOLO** and was presented in [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)

#### Intersection Over Union

Intersection Over Union (IoU) is a function used to evaluate an object detection algorithm.

Consider the red box as the ground truth and the purple box the prediction:

<img src="images/iou.png" width="200px"/>

It computes the size of the intersection and divides it by the size of the union. More generally, IoU is a measure of the overlap between two bounding boxes. By convention, an object localization is considered "correct" if IoU $\geq 0.5$.

The higher the IoU $\in [0,1]$, the more accurate the predicted bounding box.
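A minimal IoU function, assuming boxes are given by their corner coordinates `(x1, y1, x2, y2)` (a convention chosen for this sketch, different from the center/size encoding used in the output vector):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the intersection are zero if the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
```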

#### Non-max Suppression

One of the problems of object detection is that the algorithm may find multiple detections of the same object. Rather than detecting an object just once, it might detect it multiple times:

<img src="images/non-max-suppression.png" width="400px"/>

Non-max Suppression is a way to make sure that the algorithm detects the object just once.

With only one class, each grid cell outputs its own probability $p_c$ of containing the object.

* Discard all boxes with $p_c \leq 0.6$
* While there are any remaining boxes:
    * Pick the box with the largest $p_c$. Output that as a prediction
    * Discard any remaining box with IoU $> 0.5$ with the box output in the previous step, i.e. any box with high overlap (greater than the overlap threshold of $0.5$).
    
If there are multiple classes/object types (let's say $C$) you want to detect, you should run the Non-max suppression $C$ times, once for every output class.
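The single-class procedure above can be sketched in plain Python. Detections are assumed to be `(p_c, box)` pairs with boxes as corner coordinates `(x1, y1, x2, y2)`; this representation and the function names are choices made for the example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """Keep one detection per object from a list of (p_c, box) pairs."""
    # Discard all boxes with p_c <= score_threshold, sort the rest by p_c.
    remaining = sorted(
        (d for d in detections if d[0] > score_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    while remaining:
        best = remaining.pop(0)       # box with the largest p_c
        kept.append(best)             # output it as a prediction
        # Discard the boxes that overlap too much with it.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box (IoU about 0.68)
    (0.7, (20, 20, 30, 30)),  # a separate object
    (0.3, (0, 0, 10, 10)),    # below the score threshold
]
print(non_max_suppression(detections))
```

For multiple classes the same function would be called once per class on the detections of that class.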

#### Anchor Boxes

Another problem with the object detection algorithms seen so far is that each grid cell can predict only one object.

Suppose there are two objects which generally fit in rectangles of different shapes; then pre-define the two shapes (the anchor boxes) and change the output as follows:

<img src="images/anchor-boxes.png" width="400px"/>

Previously, each object was assigned to the grid cell that contains that object's midpoint. The output Y was $3\times3\times8$ because we had a $3\times3$ grid and, for each grid position, an output vector with its $p_c$, then the bounding box, and the indicators for each class.

With the anchor box, each object is assigned to the same grid cell that contains the object's midpoint as before, but also to the anchor box with the highest IoU with the object's shape. For example the woman is assigned to anchor box 1 and the car to anchor box 2, so the output of the shared grid cell is

$$[\underbrace{p_c=1, b_x, b_y, b_h, b_w, c_1=1, c_2=0, c_3=0,}_{c_1 = 1 \text{ for the human}}\underbrace{p_c=1, b_x, b_y, b_h, b_w, c_1=0, c_2=1, c_3=0}_{c_2 = 1 \text{ for the car}}]$$

and the output Y is $3\times3\times16$.
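Choosing the anchor box for an object amounts to comparing shapes only. The sketch below aligns the two shapes on a common center before computing the IoU; the anchor shapes, the object shapes and the function names are made up for illustration.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two (height, width) shapes placed with a common center."""
    inter = min(shape_a[0], shape_b[0]) * min(shape_a[1], shape_b[1])
    union = shape_a[0] * shape_a[1] + shape_b[0] * shape_b[1] - inter
    return inter / union


def best_anchor(box_shape, anchors):
    """Index of the anchor box with the highest IoU with the object's shape."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_shape, anchors[i]))


# Hypothetical anchors: index 0 is tall (pedestrian-like), index 1 is wide (car-like).
anchors = [(2.0, 0.8), (0.8, 2.0)]
print(best_anchor((1.8, 0.6), anchors))  # 0, a tall object
print(best_anchor((0.7, 1.9), anchors))  # 1, a wide object
```

The returned index selects which group of 8 entries in the cell's output vector is filled for that object.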

There are still problems if there are more objects than anchor boxes in the same cell (e.g. 2 cars and one person) or when there are multiple objects in the same cell associated to the same anchor box.

These cases are not common, especially when using a finer grid.

#### Putting together

Consider a problem with 3 classes (1: pedestrian, 2: car, 3: motorcycle) and two anchor boxes. Split the image into $3\times3$ cells; the output then has dimension $3\times3\times(2 \times 8)$, that is one vector of $16$ entries for each cell.

When training, each cell has its associated prediction/box for each anchor box. Suppose the car gets associated to anchor box 2:

<img src="images/yolo_2.png" width="600px"/>

In the YOLO algorithm, at training time, only the cell containing the center/midpoint of an object is responsible for detecting that object.

When making a prediction, the output associated to the blue cell will "ideally" have $p_c=0$ for both anchor boxes and some irrelevant values for the other elements. The output of the green cell should have $p_c=0$ for the first anchor box, and $p_c=1$ with the bounding box of the car for the second anchor box.

Consider another example, in which non-max suppression is applied.

With 2 anchor boxes, for each grid cell we get two predicted bounding boxes; some of them will have a very low probability $p_c$ of containing an object:

<img src="images/yolo_pred1.png" width="200px"/>

Get rid of the low probability predictions:

<img src="images/yolo_pred2.png" width="200px"/>

For each of the classes independently run non-max suppression for the objects that were predicted to come from that class:

<img src="images/yolo_pred3.png" width="200px"/>

### Region Proposal

One of the downsides of YOLO is that it processes a lot of areas where there are no objects.

The region proposal approach (R-CNN) runs a segmentation algorithm to figure out what could be objects and then runs the ConvNet only on those regions:

<img src="images/rcnn.png" width="300px"/>


It proposes regions and classifies the proposed regions one at a time, outputting a label and a bounding box.

It is still pretty slow.

Fast R-CNN uses a convolutional implementation of sliding windows to classify all the proposed regions at once.

Faster R-CNN uses a convolutional network to propose the regions.

## Face Recognition

### What is face recognition?

A face recognition system identifies a person's face. It can work on both images and videos.

For example, an advanced application is liveness detection within a video: it prevents the face recognition system from accepting a static image in place of a live person. It can be learned by supervised deep learning using a dataset of live and non-live human examples.

Face verification is the task of checking whether an input image is that of the claimed person. The input is an image and a name/ID, and it answers the question "is this the claimed person?"; it is a 1:1 problem.

Face recognition is the task of identifying an ID out of a sample of $K$ persons using an image as input. It answers the question "who is this person?"; it is a 1:$K$ problem.

We can use a face verification system to build a face recognition system. However, the accuracy of the verification system has to be high (let's say 99.9% or more) for it to be used reliably within a recognition system, because the chance of making an error in a 1:1 problem is compounded over the $K$ persons.
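A back-of-the-envelope check of this compounding effect: assuming, for simplicity, that the $K$ comparisons fail independently, the probability that all of them are correct is the verification accuracy raised to the power $K$. The independence assumption and the value $K=100$ are illustrative.

```python
def all_comparisons_correct(verification_accuracy, k):
    """Probability that k independent 1:1 comparisons are all correct."""
    return verification_accuracy ** k


for accuracy in (0.99, 0.999):
    print(accuracy, round(all_comparisons_correct(accuracy, 100), 3))
```

With 99% verification accuracy the chance of getting all 100 comparisons right is only about 0.37, while with 99.9% it is about 0.90.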

### One Shot Learning

One of the challenges of face recognition is to solve the **one-shot learning problem**: the system must be able to recognize a person after learning from one image only. For example, a company has only one image of each employee and the doors must recognise them.

Historically, deep learning doesn't work well with a small amount of data. Training a ConvNet with a softmax output over every possible person does not work well. Moreover, if a new person is added to the database the model would have to be trained again.

This problem is solved using a similarity function

$$d(\text{img}_1, \text{img}_2) = \text{ degree of difference between images}$$

We want $d$ to be low in case of the same faces. We use a threshold $\tau$ such that if $d(\text{img}_1, \text{img}_2) \leq \tau$ then the faces are considered the same.

It is also robust to new inputs.

### Siamese Network

We apply the similarity function $d(\text{img}_1, \text{img}_2)$ to the output of two copies of the same ConvNet.

The two images $x_1$ and $x_2$ are passed through two copies of the same ConvNet, which outputs an "encoded" vector of features $f(x_1)$ and $f(x_2)$ respectively. We then compute the distance

$$d(x_1, x_2) = || f(x_1) - f(x_2) ||^2$$

If $x_1$ and $x_2$ are images of the same person, we want $d(x_1, x_2)$ to be low.
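Given two encodings, the distance and the threshold test fit in a few lines. The encodings and the value of $\tau$ below are made up; in practice $f$ is the trained ConvNet and $\tau$ is tuned on validation data.

```python
import numpy as np


def distance(f_x1, f_x2):
    """Squared Euclidean distance between two encodings."""
    diff = np.asarray(f_x1, dtype=float) - np.asarray(f_x2, dtype=float)
    return float(np.sum(diff ** 2))


def same_person(f_x1, f_x2, tau=0.5):
    """Declare the two faces the same if the distance is at most tau."""
    return distance(f_x1, f_x2) <= tau


anchor = [0.1, 0.9, 0.3]
print(same_person(anchor, [0.2, 0.8, 0.3]))   # True, the encodings are close
print(same_person(anchor, [0.9, 0.1, 0.8]))   # False, the encodings are far apart
```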

This method was presented in the paper [Taigman et al., 2014. DeepFace closing the gap to human level performance](https://www.cv-foundation.org/openaccess/content_cvpr_2014/html/Taigman_DeepFace_Closing_the_2014_CVPR_paper.html)

### Triplet Loss

The **Triplet Loss** is one of the loss functions that can be used to learn the similarity distance in a Siamese network. It takes three inputs: an anchor image (A) to be used as reference, a positive (P) image (one of the same person as the reference) and a negative (N) image (one of a different person).

It was introduced in [FaceNet: A Unified Embedding for Face Recognition and Clustering](https://arxiv.org/abs/1503.03832)

The triplet loss compares the distances within the two pairs of images, (A, P) and (A, N). Consider

$$||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + \alpha \leq 0$$ 

where $\alpha>0$ is a parameter called margin, which prevents the trivial solution in which the two distances are equal (for example when all encodings are zero). The larger $\alpha$, the larger the required gap between the two distances, since

$$||f(A)-f(P)||^2 + \alpha \leq ||f(A)-f(N)||^2$$

forces the distance between A and P to be smaller than the distance between A and N by at least $\alpha$.

The triplet loss function is defined as 

$$L(A,P,N) = \max \bigl\{||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha , 0\bigr\}$$

Therefore, as long as $||f(A) - f(P)||^2 - ||f(A) - f(N)||^2$ is larger than $-\alpha$, the first term inside the max is positive, and minimizing the loss means pushing $||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha$ down to zero or below; that is, the distance between A and N must be at least the distance between A and P plus $\alpha$. Once this holds, the loss is zero.
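The loss for a single triplet can be sketched directly from the formula. The encodings and the margin value `alpha=0.2` are illustrative assumptions.

```python
import numpy as np


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)."""
    f_a, f_p, f_n = (np.asarray(v, dtype=float) for v in (f_a, f_p, f_n))
    pos = np.sum((f_a - f_p) ** 2)   # squared distance anchor-positive
    neg = np.sum((f_a - f_n) ** 2)   # squared distance anchor-negative
    return float(max(pos - neg + alpha, 0.0))


# Easy triplet: the negative is far away, so the constraint holds and the loss is 0.
print(triplet_loss([0, 0], [0.1, 0], [1, 0]))     # 0.0

# Hard triplet: the negative is almost as close as the positive.
print(triplet_loss([0, 0], [0.1, 0], [0.2, 0]))   # about 0.17
```

The overall cost is then the sum of this quantity over all training triplets.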

The overall cost function is 

$$J = \sum_{i=1}^{m} L(A^{(i)},P^{(i)},N^{(i)})$$

where you sum across all pictures of all persons. When training, since you need both A and P, the dataset must contain multiple images of the same person. After training, you can apply the system to one-shot learning, where you only need a single picture of the person you are trying to recognize.

During training, given that A and P are from the same person and A and N are not, if you choose randomly it's very easy to satisfy $$||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + \alpha \leq 0$$

So you have to choose triplets that are hard to train on. For example one can choose P and N to be similar people in similar poses.

### Face Verification and Binary Classification

Another way to learn the parameters of a ConvNet for face recognition is to treat face verification as a straightforward binary classification problem.

<img src="images/face_recog.PNG" width="600px"/>

You can use the Siamese network to compute the two embeddings $f(x^{(i)})$ and $f(x^{(j)})$ and use them to make a prediction, where $\hat{y}=1$ if they are the same person and zero otherwise. For example, with embeddings of 128 elements the output is computed with a sigmoid function $$\hat{y} = \sigma \bigl(\sum_{k=1}^{128} w_k |f(x^{(i)})_k - f(x^{(j)})_k| + b\bigr)$$
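The prediction formula translates directly into NumPy. The embeddings, the weights `w` and the bias `b` below are random or hand-picked placeholders, not learned values.

```python
import numpy as np


def same_person_probability(f_i, f_j, w, b):
    """Sigmoid of a weighted sum of element-wise absolute differences."""
    z = np.dot(w, np.abs(f_i - f_j)) + b
    return float(1.0 / (1.0 + np.exp(-z)))


rng = np.random.default_rng(0)
f_i, f_j = rng.normal(size=128), rng.normal(size=128)

# Hypothetical "learned" parameters: negative weights make large
# differences between the embeddings lower the probability.
w, b = -0.1 * np.ones(128), 2.0

print(same_person_probability(f_i, f_i, w, b))  # identical embeddings: sigmoid(b)
print(same_person_probability(f_i, f_j, w, b))  # unrelated embeddings: much lower
```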

When a new employee walks in, you only need to compute the embedding of the new image with the upper part of the Siamese network, and compare it to the precomputed embeddings of the photos you have stored.

This was presented in [Taigman et al., 2014. DeepFace closing the gap to human level performance](https://www.cv-foundation.org/openaccess/content_cvpr_2014/html/Taigman_DeepFace_Closing_the_2014_CVPR_paper.html)

## Neural Style Transfer

### What is neural style transfer?

Neural style transfer is one application of ConvNets. It takes a content image C and a style image S and generates an image G with the content from C and the style from the S image.

<img src="images/neural_style_transfer.png" width="600px"/>


In order to implement Neural Style Transfer, you need to look at the features extracted by ConvNet at various layers, the shallow and the deeper layers of a ConvNet.

### What are deep ConvNets learning?

As presented in [Zeiler and Fergus., 2013, Visualizing and understanding convolutional networks](https://arxiv.org/pdf/1311.2901.pdf), the ConvNet learns gradually from the shallow to the deeper layers. What they did was to use a "deconvolution" method to map from the features back to the pixels, in order to find, for each layer, the nine image patches that maximize the activation of a unit.

In other words: for each layer, pick one unit and scan the images in the training set to find the 9 images that produce the largest activation value in that unit. Then "deconvolve" it to find the patches (in each of the 9 images) that give the largest activation value in that hidden unit. See comments: https://www.coursera.org/learn/convolutional-neural-networks/discussions/weeks/4/threads/45J-ecX0Eee3RBKmpYlRFA

They trained a CNN on the ImageNet 2012 training set with the following layers:

<img src="images/vis_cnn.PNG" width="800px"/>


The picture below shows 9 different units in layer 1, and for each unit the 9 image patches that maximize its activation. The top-left and top-center unit seem to learn diagonal shapes in different directions:

<img src="images/vis_l1.PNG" width="400px"/>

In deeper layers a hidden unit will "see" larger region of the image thanks to the convolutions. Consider that at the extreme the output $\hat{y}$ is based on the whole picture.

The picture below shows layer 2, which is detecting more complex shapes and patterns. The top-center unit seems to capture textures with lots of vertical lines, while the center-left unit looks to be highly activated when there is a round shape on the left part of the image.

<img src="images/vis_l2.PNG" width="300px"/>

The picture below shows layer 3. The center-center hidden unit seems to respond strongly to rounded shapes in the lower-left portion of the image and appears to be detecting cars, while the bottom-right unit is already detecting people.

<img src="images/vis_l3.PNG" width="300px"/>

From the paper: "The projections from each layer show the **hierarchical nature of the features** in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird’s legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4)."

<img src="images/vis_layers.png" width="900px"/>

See comments: https://www.coursera.org/learn/convolutional-neural-networks/discussions/weeks/4/threads/8ze-VZUyEeiHaQooPC_NRA

### Cost Function

To apply Neural Style Transfer we need a cost function $J(G)$ to measure how good the generated image G is. In particular

$$J(G) = \alpha J_C(C,G) + \beta J_S(S,G)$$

where $J_C(C,G)$ is the content cost: how similar is the content of C to the content of G; $J_S(S,G)$ is the style cost: how similar is the style of S to the style of G; and $\alpha$ and $\beta$ are hyperparameters to specify the relative weights to the two different costs.

The algorithm described was first presented in [Gatys et al., 2015. A Neural Algorithm of Artistic Style](https://arxiv.org/abs/1508.06576)

In order to find the generated image G:
* Initialize G randomly (for example G: 100 X 100 X 3)
* Use gradient descent to minimize J(G):
    * $G := G - dG$ where $dG$ is $\frac{\partial J(G)}{\partial G}$
    
Notice that in each iteration you are updating the pixel values of the generated image G.

So that from

<img src="images/nst_cost.png" width="500px"/>

You will move from a randomly initialized G towards

<img src="images/nst_cost_path.png" width="500px"/>

From [A brief introduction to Neural Style Transfer](https://towardsdatascience.com/a-brief-introduction-to-neural-style-transfer-d05d0403901d):

Now remember- while doing style transfer, we are not training a neural network. Rather, what we're doing is — we start from a blank image composed of random pixel values, and we optimize a cost function by changing the pixel values of the image. In simple terms, we start with a blank canvas and a cost function. Then we iteratively modify each pixel so as to minimize our cost function. To put it in another way, while training neural networks we update our weights and biases, but in style transfer, we keep the weights and biases constant, and instead, update our image.
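The idea of updating the image instead of the weights can be sketched on a toy cost. Here $J(G) = \frac{1}{2}||G - C||^2$ is a stand-in chosen only because its gradient is simple; it is not the style transfer cost, and the learning rate of 0.1 is an assumption added to the update rule.

```python
import numpy as np


def toy_cost_and_grad(G, C):
    """Toy cost 0.5 * ||G - C||^2 and its gradient with respect to G."""
    diff = G - C
    return 0.5 * float(np.sum(diff ** 2)), diff


rng = np.random.default_rng(0)
C = rng.random((4, 4, 3))    # stand-in for the target image
G = rng.random((4, 4, 3))    # randomly initialized generated image

learning_rate = 0.1
costs = []
for _ in range(50):
    cost, dG = toy_cost_and_grad(G, C)
    costs.append(cost)
    G = G - learning_rate * dG   # the pixels of G are what gets updated

print(costs[0], costs[-1])
```

In real neural style transfer the gradient is obtained by backpropagating $J(G)$ through the frozen pretrained ConvNet down to the pixels.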

### Content Cost Function

Intuitively, consider a layer $l$ and compute the content cost $J_C(C,G)$ at that layer. If $l$ is small (a shallow layer), the cost forces the generated image to have pixel values very similar to the content image. Whereas if you use a very deep layer, it just asks to replicate the content of C (e.g. a dog) somewhere in the generated image G.

In practice, layer $l$ is chosen somewhere in between: neither too shallow nor too deep in the neural network.

You can use a pre-trained ConvNet (e.g. the VGG network).

Let $a^{[l](C)}$ and $a^{[l](G)}$ be the *vectorized* activations of layer $l$ on the two images; the content cost is defined as

$$J_C(C,G) = \frac{1}{2}||a^{[l](C)}-a^{[l](G)}||^2$$

If $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content.
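A sketch of the content cost in NumPy, assuming the layer activations are already available as arrays of shape $n_H \times n_W \times n_C$ (here filled with placeholder values instead of real ConvNet activations):

```python
import numpy as np


def content_cost(a_C, a_G):
    """0.5 * squared L2 distance between the vectorized activations."""
    diff = a_C.ravel() - a_G.ravel()
    return 0.5 * float(np.sum(diff ** 2))


a_C = np.ones((2, 2, 3))    # placeholder activations for the content image
a_G = np.zeros((2, 2, 3))   # placeholder activations for the generated image
print(content_cost(a_C, a_G))  # 6.0, half of the 12 unit differences squared
print(content_cost(a_C, a_C))  # 0.0
```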

### Style Cost Function

Consider using layer $l$ to measure the style of an image. The style of an image is defined as the correlation between activations across different channels. Strictly speaking it is not a formal correlation (the activations are not mean-centered), but consider it as the intuition.

<img src="images/style.png" width="200px"/>

In the image above it means comparing every element of the red channel with the corresponding element in the yellow channel.

The correlation tells you how components of the image (textures, shapes, ...) might occur or not occur together in the same image.

Let $a_{i,j,k}^{[l]}$ be the activation at position $(i,j,k)$ of layer $l$ (height, width and depth), and compute the style matrix (Gram matrix) $G^{[l]}$ of dimension $n_C^{[l]} \times n_C^{[l]}$, such that element $G^{[l]}_{kk'}$ gives the correspondence between the activations of channel $k$ and channel $k'$ of layer $l$. More formally

$$G^{[l]}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a_{ijk}^{[l]}a_{ijk'}^{[l]}$$

If the activations tend to be large together then $G^{[l]}_{kk'}$ will be large, whereas if they are uncorrelated it will be small.
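The double sum over positions is a matrix product once the spatial dimensions are flattened. A sketch, with random placeholder activations:

```python
import numpy as np


def gram_matrix(a):
    """Style (Gram) matrix of activations a with shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)   # one row per spatial position
    return flat.T @ flat               # shape (n_C, n_C)


rng = np.random.default_rng(0)
a = rng.random((4, 5, 3))
G = gram_matrix(a)
print(G.shape)  # (3, 3)

# Entry (k, k') is the sum over positions of a[i, j, k] * a[i, j, k'].
print(np.isclose(G[0, 1], np.sum(a[:, :, 0] * a[:, :, 1])))  # True
```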

You compute it for both the style image and the generated image, obtaining $G^{[l](S)}_{kk'}$ and $G^{[l](G)}_{kk'}$, and define the style cost of layer $l$ as

$$\begin{align}
J_S^{[l]}(S,G) &= \frac{1}{(2 n_H^{[l]} n_W^{[l]} n_C^{[l]})^2} ||G^{[l](S)}-G^{[l](G)} ||^2_F \rightarrow \text{ Frobenius norm}\\
&= \frac{1}{(2 n_H^{[l]} n_W^{[l]} n_C^{[l]})^2} \sum_k \sum_{k'} (G^{[l](S)}_{kk'}-G^{[l](G)}_{kk'})^2
\end{align}$$

It turns out that you get more visually pleasing results if you use the style cost function from multiple different layers. So the overall style cost function can be defined as the sum, over all the different layers, of the style cost function for that layer

$$J_S(S,G) = \sum_l \lambda^{[l]} J_S^{[l]}(S,G)$$

with some weighting across layers $\lambda^{[l]}$.
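Putting the per-layer cost and the weighted sum together gives the sketch below. The activations are placeholders, and the layer weights in `lambdas` are arbitrary illustrative values.

```python
import numpy as np


def gram_matrix(a):
    """Style (Gram) matrix of activations a with shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)
    return flat.T @ flat


def layer_style_cost(a_S, a_G):
    """Squared Frobenius distance between Gram matrices, normalized."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2)) / (2 * n_H * n_W * n_C) ** 2


def style_cost(acts_S, acts_G, lambdas):
    """Weighted sum of the per-layer style costs."""
    return sum(
        lam * layer_style_cost(a_S, a_G)
        for a_S, a_G, lam in zip(acts_S, acts_G, lambdas)
    )


rng = np.random.default_rng(0)
acts_S = [rng.random((8, 8, 4)), rng.random((4, 4, 8))]   # two layers, style image
acts_G = [rng.random((8, 8, 4)), rng.random((4, 4, 8))]   # two layers, generated image
print(style_cost(acts_S, acts_G, lambdas=[0.5, 0.5]))
print(style_cost(acts_S, acts_S, lambdas=[0.5, 0.5]))     # 0.0 for identical styles
```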

## 1D and 3D Generalizations

ConvNets can also work with 1D and 3D data.

For example, an ECG is a 1D series of numbers corresponding to the heartbeat over time. You can convolve it with 1D filters to get a 2D output (where the depth depends on the number of filters)

<img src="images/1Dconv.PNG" width="600px"/>
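A 1D convolution (implemented as cross-correlation, as is usual in deep learning) can be sketched with a loop. The signal length of 14, the filter length of 5 and the 16 filters are illustrative values, and the filters are placeholders rather than learned weights.

```python
import numpy as np


def conv1d(signal, filters):
    """'Valid' 1D cross-correlation.

    signal:  array of shape (n,)
    filters: array of shape (f, num_filters)
    returns: array of shape (n - f + 1, num_filters)
    """
    n = len(signal)
    f, num_filters = filters.shape
    out = np.zeros((n - f + 1, num_filters))
    for t in range(n - f + 1):
        # Each output row applies all filters to the same window of the signal.
        out[t] = signal[t:t + f] @ filters
    return out


signal = np.arange(14, dtype=float)
filters = np.ones((5, 16))
output = conv1d(signal, filters)
print(output.shape)  # (10, 16)
```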

1D data come from many sources, such as waves, sounds and heartbeat signals. However, most applications that use 1D data use a recurrent neural network (RNN) instead of a CNN.

Examples of 3D data are CT scans or movies

<img src="images/3dconv.gif" width="300px"/>

The data have some height and width, plus a depth given by the slices of the scan. And the 3D input is convolved with a 3D filter, so that the output will be a 4D object (where the last dimension depends on the number of filters)
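The output shape of a "valid" convolution follows the same rule in any number of dimensions. A small helper (its name and the example sizes are illustrative) makes the 4D result explicit:

```python
def conv_output_shape(input_shape, filter_size, num_filters, stride=1):
    """Output shape of a 'valid' convolution with cubic filters.

    input_shape lists the spatial dimensions only, e.g. (height, width, depth).
    """
    spatial = tuple((n - filter_size) // stride + 1 for n in input_shape)
    return spatial + (num_filters,)


print(conv_output_shape((14, 14, 14), filter_size=5, num_filters=16))
# (10, 10, 10, 16)
```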

# See also

http://cs231n.stanford.edu/