# Convolutional Neural Networks

This notebook is based on what I learned in **course 4** of the **Deep Learning Specialization** offered by **deeplearning.ai**. The course videos can be found on [YouTube](https://www.youtube.com/watch?v=ArPaAX_PhIs&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF) or [Coursera](https://www.coursera.org/specializations/deep-learning). Learning through Coursera is highly recommended for access to the quizzes and programming exercises throughout the course, as well as the course certificate upon completion. Personally, I completed the full specialization of 5 courses and earned the [Specialization Certificate](https://coursera.org/share/e590c28a5c258e500ca6d3ccb4ed57ba). Later, I discovered the YouTube videos and used them for review.

## Foundations of Convolutional Neural Networks

### Edge Detection Examples

Earlier layers might detect edges, somewhat later layers might detect parts of objects, and even later layers might detect complete objects.

Detecting a vertical edge in a 6 by 6 image (the $*$ sign means convolution):

$\begin{bmatrix}
3 & 0 & 1 & 2 & 7 & 4 \\
1 & 5 & 8 & 9 & 3 & 1 \\
2 & 7 & 2 & 5 & 1 & 3 \\
0 & 1 & 3 & 1 & 7 & 8 \\
4 & 2 & 1 & 6 & 2 & 8 \\
2 & 4 & 5 & 2 & 3 & 9 \\
\end{bmatrix} * \begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1 \\
\end{bmatrix} = \begin{bmatrix}
-5 & -4 & 0 & 8 \\
. & . & . & . \\
. & . & . & . \\
. & . & . & . \\
\end{bmatrix}$

- $3\times1 + 1\times1 + 2\times1 + 0\times0 + 5\times0 + 7\times0 + 1\times(-1) + 8\times(-1) + 2\times(-1) = -5$
- and so on

```python
# TensorFlow's 2-D convolution function (note the lowercase "d")
tf.nn.conv2d
```

When there's a vertical edge in the image:

$\begin{bmatrix}
10 & 10 & 10 & 0 & 0 & 0 \\
10 & 10 & 10 & 0 & 0 & 0 \\
10 & 10 & 10 & 0 & 0 & 0 \\
10 & 10 & 10 & 0 & 0 & 0 \\
10 & 10 & 10 & 0 & 0 & 0 \\
10 & 10 & 10 & 0 & 0 & 0 \\
\end{bmatrix} * \begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1 \\
\end{bmatrix} = \begin{bmatrix}
0 & 30 & 30 & 0 \\
0 & 30 & 30 & 0 \\
0 & 30 & 30 & 0 \\
0 & 30 & 30 & 0 \\
\end{bmatrix}$
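
The two examples above can be reproduced with a few lines of plain Python. This is only a minimal sketch, not how TensorFlow implements it: the helper name `conv2d_valid` is my own, and like most deep learning libraries it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    n_h, n_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    out = []
    for i in range(n_h - f_h + 1):
        row = []
        for j in range(n_w - f_w + 1):
            # Element-wise product of the current window and the kernel, summed.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(f_h) for b in range(f_w)))
        out.append(row)
    return out


vertical_filter = [[1, 0, -1] for _ in range(3)]

image = [
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
]
# Bright left half, dark right half: a single vertical edge in the middle.
edge_image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

first_example = conv2d_valid(image, vertical_filter)
edge_response = conv2d_valid(edge_image, vertical_filter)

print(first_example[0])  # [-5, -4, 0, 8]
print(edge_response[0])  # [0, 30, 30, 0]
```

The nonzero columns in `edge_response` mark where the bright-to-dark transition sits in the input.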

### More Edge Detection

Horizontal edge: $
\begin{bmatrix}
1 & 1 & 1 \\
0 & 0 & 0 \\
-1 & -1 & -1 \\
\end{bmatrix}$, Sobel filter: $
\begin{bmatrix}
1 & 0 & -1 \\
2 & 0 & -2 \\
1 & 0 & -1 \\
\end{bmatrix}$, Scharr filter: $
\begin{bmatrix}
3 & 0 & -3 \\
10 & 0 & -10 \\
3 & 0 & -3 \\
\end{bmatrix}$, or learn the nine parameters by backpropagation.
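
To compare the hand-designed filters, the sketch below applies each of them to the same vertical-edge image. `conv2d_valid` is a hypothetical helper written for these notes (stride 1, no padding, square inputs), not a library function.

```python
def conv2d_valid(image, kernel):
    """Valid cross-correlation of a square image with a square kernel."""
    f = len(kernel)
    size = len(image) - f + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(size)]
            for i in range(size)]


filters = {
    "vertical": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "horizontal": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "sobel": [[1, 0, -1], [2, 0, -2], [1, 0, -1]],
    "scharr": [[3, 0, -3], [10, 0, -10], [3, 0, -3]],
}

# Bright left half, dark right half: one vertical edge.
vertical_edge = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# Transposed copy: bright top half, dark bottom half: one horizontal edge.
horizontal_edge = [list(column) for column in zip(*vertical_edge)]

responses = {name: conv2d_valid(vertical_edge, k) for name, k in filters.items()}
for name, out in responses.items():
    print(f"{name:>10}: {out[0]}")
# vertical: [0, 30, 30, 0], horizontal: [0, 0, 0, 0],
# sobel: [0, 40, 40, 0], scharr: [0, 160, 160, 0]

# The horizontal filter only responds once the edge itself is horizontal.
horizontal_response = conv2d_valid(horizontal_edge, filters["horizontal"])
print([row[0] for row in horizontal_response])  # [0, 30, 30, 0]
```

Sobel and Scharr put more weight on the central row, which is why their responses are larger than the plain vertical filter's on the same edge.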


### Padding

Valid convolutions (without padding): 

- $(n, n) * (f, f) = (n-f+1, n-f+1)$

- $(6,6) * (3,3) = (4,4)$


Same convolutions (pad so that the output size equals the input size; $p=1$ in this example):

- $(n+2p, n+2p) * (f, f) = (n+2p-f+1, n+2p-f+1)$

- $(8,8) * (3,3) = (6,6)$

- $f$ is usually odd

- For a same convolution, $p = \frac{f-1}{2}$; for example, $p=2$ when $f=5$
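
The size formulas for valid and same convolutions can be checked with a small sketch. The function names (`same_padding`, `output_size`, `zero_pad`) are my own, and the code assumes stride 1 and an odd filter size.

```python
def same_padding(f):
    """Padding p that keeps the output as large as the input (stride 1, odd f)."""
    if f % 2 == 0:
        raise ValueError("this sketch assumes an odd filter size")
    return (f - 1) // 2


def output_size(n, f, p=0):
    """Side length of the output for an n x n input and an f x f filter."""
    return n + 2 * p - f + 1


def zero_pad(image, p):
    """Surround a 2-D list with p rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    padded += [[0] * p + list(row) + [0] * p for row in image]
    padded += [[0] * width for _ in range(p)]
    return padded


print(output_size(6, 3))                     # valid convolution: 4
print(output_size(6, 3, p=same_padding(3)))  # same convolution: 6
print(same_padding(5))                       # 2

padded = zero_pad([[1] * 6 for _ in range(6)], p=1)
print(len(padded), len(padded[0]))           # 8 8
```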


### Strided Convolutions

When stride $s = 2$, the filter moves 2 positions at a time instead of 1.

$\begin{bmatrix}
2 & 3 & 7 & 4 & 6 & 2 & 9 \\
6 & 6 & 9 & 8 & 7 & 4 & 3 \\
3 & 4 & 8 & 3 & 8 & 9 & 7 \\
7 & 8 & 3 & 6 & 6 & 3 & 4 \\
4 & 2 & 1 & 8 & 3 & 4 & 6 \\
3 & 2 & 4 & 1 & 9 & 8 & 3 \\
0 & 1 & 3 & 9 & 2 & 1 & 4 \\
\end{bmatrix} * \begin{bmatrix}
3 & 4 & 4 \\
1 & 0 & 2 \\
-1 & 0 & 3 \\
\end{bmatrix} = \begin{bmatrix}
91 & 100 & 88 \\
69 & 91 & 117 \\
44 & 72 & 74 \\
\end{bmatrix}$

- $(n+2p, n+2p) * (f, f) = (\lfloor\frac{n+2p-f}{s}+1\rfloor, \lfloor\frac{n+2p-f}{s}+1\rfloor)$
- If $\frac{n+2p-f}{s}$ is not an integer, we round down.
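
The strided example can be recomputed with the sketch below. `conv2d_strided` is a hypothetical helper (no padding, square inputs); the integer division `//` is what implements the rounding down in the size formula.

```python
def conv2d_strided(image, kernel, stride=1):
    """Cross-correlation without padding, moving `stride` positions per step."""
    n, f = len(image), len(kernel)
    size = (n - f) // stride + 1  # floor((n - f) / s) + 1
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(size)]
            for i in range(size)]


image = [
    [2, 3, 7, 4, 6, 2, 9],
    [6, 6, 9, 8, 7, 4, 3],
    [3, 4, 8, 3, 8, 9, 7],
    [7, 8, 3, 6, 6, 3, 4],
    [4, 2, 1, 8, 3, 4, 6],
    [3, 2, 4, 1, 9, 8, 3],
    [0, 1, 3, 9, 2, 1, 4],
]
kernel = [
    [3, 4, 4],
    [1, 0, 2],
    [-1, 0, 3],
]

result = conv2d_strided(image, kernel, stride=2)
for row in result:
    print(row)  # three rows of three values: a 3 x 3 output
```

With a 6 by 6 input the same call would give a 2 by 2 output, because $\frac{6-3}{2}+1 = 2.5$ is rounded down.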

### Convolutions over Volumes

**When there are multiple channels.**

More details: [video](https://www.youtube.com/watch?v=KTB_OFoAQcc&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=6).

- $(6, 6, 3) * (3, 3, 3) = (4, 4)$. The number of channels ($3$ here) must be the same in the image and the filter.
- We can use the filter to detect features in a certain color channel or in all of them.
- We can apply multiple filters and then stack the outputs as new channels.
- **The number of filters becomes the new number of channels.**
- Summary: $(n, n, n_c) * (f, f, n_c) = (n-f+1, n-f+1, n_c')$, where $n_c'$ is the number of filters.
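
A small sketch of convolution over a volume, using nested lists indexed as `[row][column][channel]`. The helper `conv_volume` and the choice of channel 0 as "red" are assumptions made for illustration.

```python
def conv_volume(image, filters):
    """Valid convolution of an (n, n, n_c) image with a list of (f, f, n_c) filters.

    Returns a nested list of shape (n - f + 1, n - f + 1, number of filters).
    """
    n, n_c = len(image), len(image[0][0])
    f = len(filters[0])
    for k in filters:
        if len(k[0][0]) != n_c:
            raise ValueError("filter channels must match image channels")
    size = n - f + 1
    return [[[sum(image[i + a][j + b][c] * k[a][b][c]
                  for a in range(f) for b in range(f) for c in range(n_c))
              for k in filters]  # one output channel per filter
             for j in range(size)]
            for i in range(size)]


# A 6 x 6 x 3 image whose three channels all contain the same vertical edge.
image = [[[v, v, v] for v in (10, 10, 10, 0, 0, 0)] for _ in range(6)]

vertical = [[1, 0, -1] for _ in range(3)]
all_channels = [[[w, w, w] for w in row] for row in vertical]  # edges in any channel
red_only = [[[w, 0, 0] for w in row] for row in vertical]      # edges in channel 0 only

out = conv_volume(image, [all_channels, red_only])
print((len(out), len(out[0]), len(out[0][0])))  # (4, 4, 2)
print(out[0][1])                                # [90, 30]
```

Each filter collapses all input channels into one number per position, so two filters give two output channels.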

### One Layer of a Convolutional Net

If you have 10 filters that are 3 x 3 x 3 in one layer of a neural network, how many parameters does that layer have?
- Each filter: 3x3x3 + 1 = 28 (27 weights plus 1 bias)
- All filters: 28x10 = 280
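
The same count as a one-line formula. The function name `conv_layer_params` is my own.

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Learnable parameters of a conv layer: f*f*n_c_prev weights + 1 bias per filter."""
    return (f * f * n_c_prev + 1) * n_filters


print(conv_layer_params(f=3, n_c_prev=3, n_filters=1))   # 28
print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))  # 280
```

The input height and width do not appear in the formula, so the count stays at 280 no matter how large the image is.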

If layer $l$ is a convolution layer:
- $f^{[l]} =$ filter size
- $p^{[l]} =$ padding
- $s^{[l]} =$ stride
- $n_C^{[l]} =$ number of filters

- Each filter is: $f^{[l]} \times f^{[l]} \times n_C^{[l-1]}$
- Activations: $a^{[l]} \rightarrow n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$
- Weights: $f^{[l]} \times f^{[l]} \times n_C^{[l-1]} \times n_C^{[l]}$
- Bias: $n_C^{[l]}$ values, stored with shape $(1, 1, 1, n_C^{[l]})$
- Input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_C^{[l-1]}$
- Output: $n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$
- $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\rfloor, n_W^{[l]} = ...$
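
The notation can be turned into a small shape calculator. This is a sketch: `conv_layer_shapes` is a hypothetical helper, and it assumes the width follows the same formula as the height.

```python
def conv_layer_shapes(input_shape, f, n_filters, p=0, s=1):
    """Shapes of the filter, activations, weights and bias for one conv layer."""
    n_h_prev, n_w_prev, n_c_prev = input_shape
    n_h = (n_h_prev + 2 * p - f) // s + 1  # floor division rounds down
    n_w = (n_w_prev + 2 * p - f) // s + 1  # assumed to mirror the height formula
    return {
        "filter": (f, f, n_c_prev),
        "activations": (n_h, n_w, n_filters),
        "weights": (f, f, n_c_prev, n_filters),
        "bias": (1, 1, 1, n_filters),
    }


shapes = conv_layer_shapes((32, 32, 3), f=5, n_filters=8)
for name, shape in shapes.items():
    print(f"{name:>11}: {shape}")
# filter (5, 5, 3), activations (28, 28, 8), weights (5, 5, 3, 8), bias (1, 1, 1, 8)

# Stride 2 with padding 1 on a 7 x 7 x 3 input.
print(conv_layer_shapes((7, 7, 3), f=3, n_filters=4, p=1, s=2)["activations"])  # (4, 4, 4)
```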


### Pooling Layers

Break the input into different regions and take the max of each region.

$\begin{bmatrix}
1 & 3 & 2 & 1 \\
2 & 9 & 1 & 1 \\
1 & 3 & 2 & 3 \\
5 & 6 & 1 & 2 \\
\end{bmatrix} \rightarrow \begin{bmatrix}
9 & 2 \\
6 & 3 \\
\end{bmatrix}$

Hyperparameters $f=2, s=2$

Intuition: a large number in a region might mean that a particular feature was detected there, and max pooling keeps that detection in the output. However, no one can really say for sure why it works; max pooling is used mainly because it performs well in a lot of experiments in the literature.

Also: average pooling - used less often

Hyperparameters:
- $f$: filter size
- $s$: stride
- Max or average pooling

There are no parameters/weights to learn, so pooling is usually not counted as a separate layer.

$n_H \times n_W \times n_C \rightarrow \lfloor \frac{n_H-f}{s} + 1 \rfloor \times \lfloor \frac{n_W-f}{s} + 1 \rfloor \times n_C$
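
A minimal pooling sketch for a single channel; with several channels the same operation would be applied to each channel independently. `pool2d` and its `mode` argument are names I made up for this note.

```python
def pool2d(image, f=2, s=2, mode="max"):
    """Max or average pooling over f x f windows with stride s (one channel)."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    n_h, n_w = len(image), len(image[0])
    out = []
    for i in range((n_h - f) // s + 1):
        row = []
        for j in range((n_w - f) // s + 1):
            # Collect the values of the current window.
            window = [image[i * s + a][j * s + b]
                      for a in range(f) for b in range(f)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


image = [
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
]

print(pool2d(image, f=2, s=2, mode="max"))      # [[9, 2], [6, 3]]
print(pool2d(image, f=2, s=2, mode="average"))  # [[3.75, 1.25], [3.75, 2.0]]
```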

### CNN Example

| Layer | Activation shape | Activation size | # parameters |
| --- | --- | --- | --- |
| Input | (32, 32, 3) | 3072 | 0 |
| CONV1 (f=5, s=1) | (28, 28, 8) | 6272 | 608 |
| POOL1 | (14, 14, 8) | 1568 | 0 |
| CONV2 (f=5, s=1) | (10, 10, 16) | 1600 | 3216 |
| POOL2 | (5, 5, 16) | 400 | 0 |
| FC3 | (120, 1) | 120 | 48120 |
| FC4 | (84, 1) | 84 | 10164 |
| Softmax | (10, 1) | 10 | 850 |
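
The shapes, activation sizes and parameter counts of this architecture can be recomputed from the layer formulas. The sketch below assumes that both pooling layers use $f=2, s=2$ (which matches the halving of height and width), that each conv filter has one bias, and that each fully connected unit has one bias; the helper names are my own.

```python
from math import prod


def conv(shape, f, n_filters, s=1, p=0):
    """Output shape and parameter count of a convolution layer."""
    n_h, n_w, n_c = shape
    out_shape = ((n_h + 2 * p - f) // s + 1, (n_w + 2 * p - f) // s + 1, n_filters)
    return out_shape, (f * f * n_c + 1) * n_filters  # +1 bias per filter


def pool(shape, f=2, s=2):
    """Output shape of a pooling layer; pooling has nothing to learn."""
    n_h, n_w, n_c = shape
    return ((n_h - f) // s + 1, (n_w - f) // s + 1, n_c), 0


def dense(shape, units):
    """Fully connected layer on the flattened input: weights + 1 bias per unit."""
    n_in = prod(shape)
    return (units, 1), n_in * units + units


layers = [
    ("CONV1", conv, {"f": 5, "n_filters": 8}),
    ("POOL1", pool, {}),  # assumption: f=2, s=2
    ("CONV2", conv, {"f": 5, "n_filters": 16}),
    ("POOL2", pool, {}),  # assumption: f=2, s=2
    ("FC3", dense, {"units": 120}),
    ("FC4", dense, {"units": 84}),
    ("Softmax", dense, {"units": 10}),
]

shape = (32, 32, 3)
summary = [("Input", shape, prod(shape), 0)]
for name, layer, kwargs in layers:
    shape, n_params = layer(shape, **kwargs)
    summary.append((name, shape, prod(shape), n_params))

# Columns: layer, activation shape, activation size, number of parameters.
for name, shape, size, n_params in summary:
    print(f"{name:<8}{str(shape):<14}{size:>6}{n_params:>8}")
```

Most of the parameters sit in the fully connected layers, while the activation size shrinks gradually from layer to layer.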


### Why Convolutions

Parameter sharing: A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image.

Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
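
To put rough numbers on both points, the sketch below compares a convolution layer that maps a (32, 32, 3) input to a (28, 28, 8) output with $f=5$ against a hypothetical fully connected layer between the same two volumes. The fully connected layer is only a thought experiment for comparison, not part of any network in these notes.

```python
from math import prod

input_shape = (32, 32, 3)
output_shape = (28, 28, 8)
f = 5

# Fully connected: every output value has its own weight for every input value,
# plus one bias per output value.
fc_params = prod(input_shape) * prod(output_shape) + prod(output_shape)

# Convolution: the same f x f x n_c filter (plus one bias) is reused at every
# position, so the count depends only on the filter size and number of filters.
conv_params = (f * f * input_shape[2] + 1) * output_shape[2]

# Sparsity of connections: each output value only looks at one f x f x n_c window.
inputs_per_output = f * f * input_shape[2]

print(f"fully connected parameters: {fc_params:,}")
print(f"convolution parameters:     {conv_params:,}")
print(f"inputs seen by one output:  {inputs_per_output} of {prod(input_shape)}")
```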

## Deep Convolutional Models: Case Studies

Classic networks:
- LeNet-5
- AlexNet
- VGG

ResNet

Inception