# Course 4. Convolutional neural networks

## Foundations of Convolutional Neural Networks

### Computer Vision

Three major groups of computer vision problems are image classification, object detection, and neural style transfer.

### Edge detection example

This lesson describes the convolution layer of a neural network through an edge detection example.

Vertical edge detection can be done by applying the convolution operation shown below:
<img src="imgs/verticaledge.png">
<img src="imgs/verticaledge2.png">
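The operation in the figures can be sketched in NumPy. This is a minimal sketch, not a reference implementation: the 6x6 image (bright left half, dark right half) and the 3x3 filter are assumed values chosen for illustration, and `conv2d_valid` is a hypothetical helper name.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the kernel and the patch under it, then sum.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Assumed example image: bright left half (10), dark right half (0).
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Assumed 3x3 vertical edge filter.
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]], dtype=float)

edges = conv2d_valid(image, vertical_filter)
print(edges)  # every row is [0, 30, 30, 0]: the edge shows up in the middle columns
```

The non-zero band in the middle of the output marks where the bright region meets the dark one.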

### More Edge Detection

In the CV literature there are different kinds of filters used for vertical edge detection, for example the Sobel filter or the Scharr filter. However, we can also let the neural network learn its own filter, and this is exactly what we do by adding a convolutional layer to the network.
<img src="imgs/edges.png">

### Padding

If you take a closer look at the convolution operation, you will notice that pixels in the middle of the original image are included in the calculation more often than pixels on the borders of the image. The effect is that the filtered image emphasizes the central part of the original. To avoid that, a padding operation can be introduced:
<img src="imgs/padding.png">

In practice, convolution is most commonly applied in 2 ways:
1. Valid convolution (no padding) - original image $n \times n$, filter $f \times f$, which gives an output of dimensions $(n-f+1) \times (n-f+1)$
2. Same convolution (padding is chosen so that the image keeps its original dimensions) - original image $n \times n$, filter $f \times f$, padding $p = \frac{f - 1}{2}$, which gives an output of size $n \times n$

Usually the size of a filter is chosen to be an __odd__ number.
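The two cases can be checked with a small NumPy sketch. The sizes ($n = 6$, $f = 3$) and the helper name `same_padding` are assumptions made for illustration; zero padding is done with `np.pad`.

```python
import numpy as np

def same_padding(f):
    """Padding that keeps the output size equal to the input size (stride 1, odd f)."""
    if f % 2 == 0:
        raise ValueError("p = (f - 1) / 2 is only an integer for an odd filter size")
    return (f - 1) // 2

n, f = 6, 3
p = same_padding(f)  # 1 for a 3x3 filter
image = np.ones((n, n))
padded = np.pad(image, p, mode="constant", constant_values=0)  # zero padding

valid_size = n - f + 1               # valid convolution on the original image
same_size = padded.shape[0] - f + 1  # same convolution on the padded image
print(padded.shape, valid_size, same_size)  # (8, 8) 4 6
```

The check for an odd `f` also shows one reason odd filter sizes are convenient: with an even size the padding would have to be asymmetric.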

### Strided Convolutions

Stride is the step size by which the filter moves across the image.
<img src="imgs/stride.png">
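The notes do not state the output size for a strided convolution. The standard formula for an $n \times n$ input, $f \times f$ filter, padding $p$ and stride $s$ is $\lfloor \frac{n + 2p - f}{s} \rfloor + 1$ per side. A minimal sketch, with `conv_output_size` as a hypothetical helper name:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

# Assumed example: 7x7 input, 3x3 filter, no padding, stride 2.
print(conv_output_size(7, 3, p=0, s=2))  # 3
```

With `s=1` the function reduces to the valid ($p = 0$) and same ($p = \frac{f-1}{2}$) cases from the padding section.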

### Convolutions Over Volumes

In previous lessons we saw how to do the convolution operation on 2D images, where the image is represented as a 2-dimensional matrix (grayscale image). We can do a similar thing for RGB images or images with depth $n_c$ ($n_c$ channels).

<img src="imgs/convolutiononrgb.png">

What we do is multiply each element of the filter cube with the corresponding element of the image volume under it and sum the products into a single number. Sliding the filter over the image this way produces a matrix.

Similarly, we can apply multiple filters at the same time (for example, one filter to detect vertical edges and one to detect horizontal edges).

<img src="imgs/multiplefilters.png">
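A minimal NumPy sketch of convolution over a volume with several filters. The shapes (a 6x6x3 image and two 3x3x3 filters) and the random values are assumptions for illustration; `conv_volume` is a hypothetical helper name.

```python
import numpy as np

def conv_volume(image, filters):
    """Valid, stride-1 convolution of an (n, n, n_c) image with (k, f, f, n_c) filters."""
    n_h, n_w, n_c = image.shape
    k, f_h, f_w, f_c = filters.shape
    assert n_c == f_c, "filter depth must match the number of image channels"
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1, k))
    for idx in range(k):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + f_h, j:j + f_w, :]
                # Multiply the filter cube with the patch element-wise, sum to one number.
                out[i, j, idx] = np.sum(patch * filters[idx])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))       # assumed 6x6 RGB image
filters = rng.random((2, 3, 3, 3))  # two assumed 3x3x3 filters
out = conv_volume(image, filters)
print(out.shape)  # (4, 4, 2): one 4x4 matrix per filter, stacked along the last axis
```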


### One Layer of a Convolutional Net

As seen on previous slides, adding convolution operations can help detect features of images. Now, when we take the output, add a bias, and apply a non-linearity (ReLU), we get one layer of a conv net.

<img src="imgs/onelayerofconvnet.png">

Going back to the earlier notation for neural network layers, we had:

$$
z^{[i]} = w^{[i]} a^{[i-1]} + b^{[i]}\\
a^{[i]} = g(z^{[i]})
$$

$w^{[i]}$ corresponds to all parameters in the filters ($3 \times 3 \times 3 = 27$ parameters per filter, times $2$ filters, makes $54$ parameters in total)

$b^{[i]}$ has $2$ biases because we have $2$ filters.
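The count above can be reproduced with a few lines of Python. `conv_layer_param_count` is a hypothetical helper name; the arguments match the example in the notes (3x3 filters, 3 input channels, 2 filters).

```python
def conv_layer_param_count(f, n_c_prev, n_filters):
    """Number of weights and biases in a conv layer with f x f x n_c_prev filters."""
    weights = f * f * n_c_prev * n_filters
    biases = n_filters  # one bias per filter
    return weights, biases

weights, biases = conv_layer_param_count(f=3, n_c_prev=3, n_filters=2)
print(weights, biases, weights + biases)  # 54 2 56
```

Note that the count does not depend on the height or width of the input image, only on the filter size and the channel counts.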

Here is the notation for a convolutional network layer:
<img src="imgs/convnotation.png">

### Pooling layers

Pooling layers are used to speed up computation, reduce the size of the representation, and make some of the detected features more robust. The next image describes the max pooling operation:

<img src="imgs/pooling.png">

Besides max pooling, average pooling is sometimes used (take the average of each window instead of the max).
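Both pooling variants can be sketched with one NumPy function. The 4x4 input values, the 2x2 window and the stride of 2 are assumptions for illustration; `pool2d` is a hypothetical helper name.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Max or average pooling of a 2D array with an f x f window and stride s."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = reduce_fn(window)
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

max_pooled = pool2d(x, mode="max")  # [[9, 2], [6, 3]]
avg_pooled = pool2d(x, mode="avg")  # [[3.75, 1.25], [3.75, 2.0]]
print(max_pooled)
print(avg_pooled)
```

Pooling has hyperparameters (`f`, `s`) but no parameters to learn.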

## Deep convolutional models: case studies

###  Why look at case studies?

* Classic networks
    * LeNet-5
    * AlexNet
    * VGG
* Residual networks
* Inception


### Classic Network

__LeNet-5__ (Gradient-based learning applied to document recognition, 1998. https://ieeexplore.ieee.org/document/726791)

<img src="imgs/lenet5.png">

__AlexNet__ (ImageNet Classification with Deep Convolutional Neural Networks, 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
<img src="imgs/alexnet.png">

__VGG-16__ (Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015. https://arxiv.org/pdf/1409.1556.pdf)

<img src="imgs/vgg16.png">

### Resnets

Deep Residual Learning for Image Recognition, 2015, https://arxiv.org/abs/1512.03385

A couple of important points:
* Residual networks help with the problem of exploding / vanishing gradients

<img src="imgs/resnets.png">
<img src="imgs/resnets2.png">

### Why resnets work?

The answer to this question lies in the ability of a residual block to learn the identity function. In a regular architecture we have:

$$
z^{[l+1]} = w^{[l+1]} a^{[l]} + b^{[l+1]}\\
a^{[l+1]} = g(z^{[l+1]})\\
z^{[l+2]} = w^{[l+2]} a^{[l+1]} + b^{[l+2]}\\
a^{[l+2]} = g(z^{[l+2]}) = g(w^{[l+2]} a^{[l+1]} + b^{[l+2]})
$$

whereas if we have a skip connection from layer $l$ to layer $l+2$ we would have:

$$
z^{[l+1]} = w^{[l+1]} a^{[l]} + b^{[l+1]}\\
a^{[l+1]} = g(z^{[l+1]})\\
z^{[l+2]} = w^{[l+2]} a^{[l+1]} + b^{[l+2]}\\
a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(w^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})
$$

If for some reason $w^{[l+2]} a^{[l+1]} + b^{[l+2]}$ turns out to be zero, then $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$, because we are using ReLU activations and $a^{[l]}$ is itself the non-negative output of a ReLU. That means the block has learned the identity function, and that definitely won't hurt.
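The identity argument can be checked numerically. This is a minimal sketch with fully connected layers and assumed random values; `residual_block` and the argument names are hypothetical. The second layer's weights and bias are set to zero to mimic the case described above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, w1, b1, w2, b2):
    """Two layers with a skip connection from the input to the second activation."""
    a_l1 = relu(w1 @ a_l + b1)
    z_l2 = w2 @ a_l1 + b2
    return relu(z_l2 + a_l)  # skip connection adds a_l before the non-linearity

rng = np.random.default_rng(0)
a_l = relu(rng.normal(size=(4, 1)))  # output of an earlier ReLU, so non-negative
w1, b1 = rng.normal(size=(4, 4)), rng.normal(size=(4, 1))
w2, b2 = np.zeros((4, 4)), np.zeros((4, 1))  # second layer contributes nothing

a_l2 = residual_block(a_l, w1, b1, w2, b2)
print(np.allclose(a_l2, a_l))  # True: the block acts as the identity
```

The sketch assumes the input and output of the block have the same dimension, so that the addition is well defined.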

### Network in network

The idea is presented in a paper (Network in Network, Lin 2013, https://arxiv.org/pdf/1312.4400.pdf). The basic idea is to use 1x1 convolutions (if the input has only one channel, a 1x1 convolution just multiplies the matrix by a constant). However, if we have more than one channel we can use $1 \times 1 \times n_c$ filters and get a more complex non-linearity. To summarize, pooling layers help us reduce the height and width of a layer, and 1x1 convolution layers help us reduce the number of channels.

<img src="imgs/1x1conv.png">
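A 1x1 convolution can be read as applying the same small fully connected layer to the channel vector of every pixel. A minimal NumPy sketch under assumed shapes (28x28x192 input reduced to 32 channels); `conv1x1` is a hypothetical helper name.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution + ReLU. x: (n_h, n_w, n_c), w: (n_c, n_filters), b: (n_filters,)."""
    # Contract the channel axis of x with the first axis of w, pixel by pixel.
    z = np.tensordot(x, w, axes=([2], [0])) + b
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
x = rng.random((28, 28, 192))   # assumed input volume
w = rng.normal(size=(192, 32))  # 32 filters of size 1x1x192
b = np.zeros(32)

out = conv1x1(x, w, b)
print(out.shape)  # (28, 28, 32): height and width unchanged, channels reduced
```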

### Inception Network Motivation

Paper: Going deeper with convolutions, 2014, https://arxiv.org/pdf/1409.4842.pdf

The basic idea is that, instead of choosing a filter size or deciding whether a convolution or max pooling is better, you do all of them and stack the outputs into the same layer (brute force :))

<img src="imgs/inception1.png">

This will be really expensive to compute ... to see how expensive, let's look at just the 5x5 conv layer.

<img src="imgs/inception2.png">

Instead of doing ~120M multiplications, we can first apply a 1x1 convolution with 16 filters and then the 5x5 convolution, and get an output of the same size significantly faster.

<img src="imgs/inception3.png">
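The saving can be worked out with a short calculation. The shapes are assumptions, since the figures are not reproduced in text: a 28x28x192 input, a same-padded 5x5 convolution with 32 filters, and a 1x1 bottleneck with 16 filters. `conv_multiplications` is a hypothetical helper name.

```python
def conv_multiplications(out_h, out_w, n_filters, f, n_c_in):
    """Multiplications needed to compute an out_h x out_w x n_filters conv output."""
    return out_h * out_w * n_filters * f * f * n_c_in

# Direct 5x5 convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, 32, 5, 192)

# With a bottleneck: 1x1 conv down to 16 channels, then the 5x5 conv.
bottleneck = conv_multiplications(28, 28, 16, 1, 192)
expand = conv_multiplications(28, 28, 32, 5, 16)

print(direct)               # 120422400
print(bottleneck + expand)  # 12443648
```

Under these assumed shapes the bottleneck version needs roughly a tenth of the multiplications.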


### Inception network

An Inception network consists of many Inception modules, and an Inception module looks like this:

<img src="imgs/inceptionmodule.png">

### Using Open Source Implementation

Andrew Ng's suggestion is to first look for existing implementations of a paper before you start implementing it yourself.

### Transfer learning

If you need to build your own classifier, the suggested method is to use a pretrained net. Using a pretrained net can be done in 4 ways:

1) __Freeze flag__ - use exactly the same architecture as the pretrained net and add your own layers on top of it. ML frameworks usually have a way to flag which layers are trainable and which aren't (for example a freeze flag or a trainable flag on the layer)

2) __Save to disk__ - compute the activations that the pretrained net outputs for all training data, save them, and then train the shallow network you wanted to try out on those activations

3) __Combine 1) and 2)__

4) __Continue where pretrained finished__ - take the pretrained network, add your layers to it, and train the whole network on your dataset. You usually do this if you have a lot of data.

<img src="imgs/transferlearning2.png">
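The "save to disk" option can be sketched without any deep learning framework. Everything here is a toy stand-in and an assumption: `pretrained_features` plays the role of a frozen pretrained network (a fixed random layer, not a real model), the data is synthetic, and the new "shallow network" is a logistic regression trained with plain gradient descent.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained network: one fixed (frozen) layer with ReLU.
frozen_w = rng.normal(size=(64, 16)) / np.sqrt(64)

def pretrained_features(x):
    """Frozen part of the network: its weights are never updated."""
    return np.maximum(x @ frozen_w, 0)

# Synthetic data set: 200 examples, 64 input features, binary labels.
x_train = rng.normal(size=(200, 64))
y_train = (x_train[:, 0] > 0).astype(float)

# "Save to disk": compute the frozen activations once and cache them.
cache_path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(cache_path, pretrained_features(x_train))
features = np.load(cache_path)

# Train only the new shallow layer (logistic regression) on the cached features.
w = np.zeros(features.shape[1])
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(features @ w + b)))
    grad = p - y_train
    w -= 0.1 * features.T @ grad / len(y_train)
    b -= 0.1 * grad.mean()

accuracy = np.mean((features @ w + b > 0) == (y_train == 1))
print(accuracy)
```

Caching pays off because the frozen part is the expensive part: each training epoch of the new layer reads the stored activations instead of running the pretrained network again.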

### Data augmentation

There are several techniques that are often used: mirroring, random cropping, rotating, color shifting (apply PCA to boost dominant colors - I could maybe implement this in Python when I go back to the previous ML course)

<img src="imgs/dataaugmentation.png">

It's advised to implement this preprocessing before forming a mini-batch ... One thread iterates over the dataset, selecting images and performing augmentation, and the result is served to another thread that is in charge of training.

<img src="imgs/dataaugmentationimpl.png">
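Three of the techniques (mirroring, random cropping, color shifting) can be sketched in NumPy. The image is random noise standing in for a real picture, and the crop size and channel offsets are assumed values; the function names are hypothetical. The color shift here is a simple fixed per-channel offset, not the PCA-based variant.

```python
import numpy as np

def mirror(image):
    """Horizontal flip of an (h, w, c) image."""
    return image[:, ::-1, :]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

def color_shift(image, shift):
    """Add a per-channel offset and clip to the valid [0, 255] range."""
    return np.clip(image + np.asarray(shift), 0, 255)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)  # stand-in image

augmented = color_shift(random_crop(mirror(image), 24, 24, rng), shift=(20, -20, 20))
print(augmented.shape)  # (24, 24, 3)
```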


## Object detection

### Object localization

In this lesson we define the target labels for the object detection problem. The target is a vector in which the first element ($p_c$) indicates whether the image contains an object we are trying to detect, the next 4 elements describe the bounding box of the object if it is present (center coordinates $b_x$, $b_y$, height $b_h$ and width $b_w$), and the next 3 elements represent the class of the object.

<img src="imgs/objectdetectiony.png">

For $p_c$ we can use the logistic regression loss, for the bounding box we can use the mean squared error loss, and for the class we can use softmax with the log-likelihood loss.
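A minimal sketch of how the three losses could be combined for one example. The vector layout $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$, the equal weighting of the terms, and the choice to ignore the box and class terms when no object is present are assumptions, not something the notes specify; `localization_loss` is a hypothetical name.

```python
import numpy as np

def localization_loss(y_true, y_pred):
    """y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].
    y_pred holds p_c as a probability and the class entries as raw scores (logits)."""
    eps = 1e-12
    p_true, p_pred = y_true[0], np.clip(y_pred[0], eps, 1 - eps)
    # Logistic (binary cross-entropy) loss on "is there an object?".
    loss = -(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))
    if p_true == 1:
        # Mean squared error on the four bounding-box numbers.
        loss += np.mean((y_true[1:5] - y_pred[1:5]) ** 2)
        # Softmax + negative log-likelihood on the class scores.
        logits = y_pred[5:] - np.max(y_pred[5:])
        log_probs = logits - np.log(np.sum(np.exp(logits)))
        loss += -np.sum(y_true[5:] * log_probs)
    return loss

no_object = np.zeros(8)
prediction = np.array([0.5, 0.4, 0.6, 0.2, 0.3, 1.0, 2.0, 0.5])
print(localization_loss(no_object, prediction))  # only the p_c term counts here
```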

### Landmark detection

A landmark is a point of interest in the image, and the output of the neural network in a landmark detection task is a vector of $(x, y)$ coordinates of those points in the image.

<img src="imgs/landmarkdetection.png">

### Object Detection

The sliding window algorithm is described: train a conv net classifier to detect whether a car is in the image or not, then for each new image, crop some part of the image and feed it into the conv net ... repeat the process by sliding the crop box. This algorithm clearly has some downsides (it is computationally expensive).
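The cropping loop can be sketched as a generator. The image size, window size and stride are assumed values, and `sliding_windows` is a hypothetical name; each yielded crop would then be fed to the trained classifier.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (top, left, crop) for every window x window crop of the image."""
    h, w = image.shape[:2]
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            yield top, left, image[top:top + window, left:left + window]

image = np.zeros((16, 16, 3))  # stand-in image
crops = list(sliding_windows(image, window=14, stride=2))
print(len(crops))  # 4 crops, so 4 separate forward passes of the classifier
```

The cost grows quickly with a smaller stride or with several window sizes, which is the downside mentioned above.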

### Convolutional Implementation Sliding Windows

The idea is that a single forward pass gives us the probabilities for all the sliding windows we chose ... I didn't fully get how it works, but that's the idea.

https://arxiv.org/pdf/1312.6229.pdf

<img src="imgs/convimplofslidingwindow.png">
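One way to see why a single pass is enough: a layer that is fully connected over a fixed-size window is the same computation as a convolution whose filter covers that window, so applying it to a larger image yields one output per window position. The sketch below is a one-layer simplification of that idea with assumed shapes and random weights; `conv_valid` and `fc_as_conv` are hypothetical names.

```python
import numpy as np

def conv_valid(x, filters, stride=1):
    """x: (h, w, c); filters: (k, f, f, c). Returns (out_h, out_w, k)."""
    h, w, _ = x.shape
    k, f = filters.shape[0], filters.shape[1]
    out_h, out_w = (h - f) // stride + 1, (w - f) // stride + 1
    out = np.zeros((out_h, out_w, k))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            # One dot product per filter between the filter and the patch.
            out[i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
# Hypothetical "classifier": fully connected over a 5x5x3 window with 4 output
# scores, written as 4 filters of size 5x5x3.
fc_as_conv = rng.normal(size=(4, 5, 5, 3))
image = rng.normal(size=(8, 8, 3))

# One pass over the whole image gives a grid of score vectors, one per window.
grid = conv_valid(image, fc_as_conv)

# Cropping a single window and classifying it alone gives the same scores.
crop_scores = conv_valid(image[2:7, 1:6, :], fc_as_conv)
print(grid.shape, np.allclose(grid[2, 1], crop_scores[0, 0]))
```

In a deeper network the gain is larger, because overlapping windows share the intermediate activations of all the earlier layers instead of recomputing them per crop.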

### Bounding box predictions

https://pjreddie.com/media/files/papers/yolo.pdf

In the sliding window algorithm we have no guarantee that a sliding window will perfectly capture the object. In addition, the bounding box of an object doesn't always have to be perfectly square.