# Convolutional Neural Networks
## 1. Motivation
Let's consider one of the main applications of deep learning: computer vision. The images considered previously were all 64 by 64 pixels, so with three color channels the input has $3 \cdot 64 \cdot 64 = 12288$ features. High-resolution images are typically around $10^3 \times 10^3$ pixels, which makes the input vector on the order of $10^6$-dimensional (about $3 \cdot 10^6$ with three color channels). With $1000$ units in a fully connected first hidden layer, the first weight matrix alone would hold on the order of $10^9$ parameters, which is already very large. Convolution operations were introduced to address this problem.


## 2. Convolutions
### 2.1 The convolution operator
The convolution operator is a binary operator on matrices that is heavily used in convolutional networks. For a mathematical explanation, see the following [link](https://github.com/ayhem18/Towards_Data_science/blob/master/Machine_Learning/CNN/math_parts/Convolution_operator.pdf)
### 2.2 Edge detection
To detect vertical edges, a number of matrices, generally referred to as ***filters***, were conceived. Among the popular ones:
1. $\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1\end{bmatrix}$ 

2. $\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1\end{bmatrix}$: Sobel filter

3. $\begin{bmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3\end{bmatrix}$: Scharr filter

4. A filter whose values are learnable parameters instead of hard-coded numbers.

Such filters are used for vertical edge detection. The transposed matrices can be used to detect horizontal edges. 
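
As a rough illustration of the filters above, here is a minimal NumPy sketch. The helper name `conv2d_valid` and the toy 6 by 6 image are assumptions made for this example. Like most deep learning code, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the
    element-wise products at every position."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Toy image: bright left half (10), dark right half (0),
# so there is a single vertical edge in the middle.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# First filter from the list above.
vertical = np.array([[1, 0, -1]] * 3, dtype=float)

edges = conv2d_valid(image, vertical)
print(edges[0])                                     # [ 0. 30. 30.  0.]
print(np.all(conv2d_valid(image, vertical.T) == 0)) # True: no horizontal edge
```

The large values in the middle columns mark the vertical edge, while the transposed filter responds with zeros because the toy image has no horizontal edge.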

### 2.3 Padding
Given a picture of shape $(s, s)$ and a filter of shape $(f, f)$, the convolution operator produces an output of shape $(s - f + 1, s - f + 1)$. This approach has two shortcomings:
1. The picture shrinks with every convolution operation: in a deep network, the image might be significantly reduced by the later layers.
2. Pixels at the corners and edges are covered by far fewer filter positions than pixels in the center, so the information they carry is easily lost.

Padding is a possible solution. Before applying the convolution, a padding size $p$ is chosen and $p$ rows and columns are added on each side of the picture, producing a $(s + 2p, s + 2p)$ matrix. This way, it is possible to keep the matrix's size from decreasing as well as make better use of the information in the corner cells. \
There are two main approaches to padding:
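1. Valid: no padding ($p = 0$).
2. Same: enough padding to keep the output the same size as the input. With stride 1 and an odd filter size $f$, this means $p = \frac{f - 1}{2}$.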
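
The two padding modes can be sketched with `np.pad`. This is only a sketch: the helper `same_padding` is a hypothetical name, and it assumes zero padding, stride 1 and an odd filter size.

```python
import numpy as np

def same_padding(f):
    """Padding p that keeps the size unchanged for an odd filter size f
    and stride 1 (assumption of this sketch)."""
    if f % 2 == 0:
        raise ValueError("this sketch only handles odd filter sizes")
    return (f - 1) // 2

image = np.arange(16, dtype=float).reshape(4, 4)
f = 3

p = same_padding(f)                      # 1
padded = np.pad(image, p, mode="constant", constant_values=0)

print(padded.shape)                      # (6, 6)
# "Valid" output size without padding: 4 - 3 + 1 = 2
print(image.shape[0] - f + 1)            # 2
# "Same" output size after padding: 6 - 3 + 1 = 4, the original size
print(padded.shape[0] - f + 1)           # 4
```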

### 2.4 Striding
In addition to padding, deep learning researchers introduced the notion of a ***stride***: the number of squares the filter moves between two consecutive positions. In other words, we first compute the weighted sum between the filter and the window at the top left corner, then move the filter $s$ squares to the right (or down) to reach the next position. Given an $(n, n)$ input, an $(f, f)$ filter, padding $p$ and stride $s$, the output has shape $(\lfloor\frac{n + 2p - f}{s}\rfloor + 1, \lfloor\frac{n + 2p - f}{s}\rfloor + 1)$.
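
The output size formula is easy to check numerically. The function name `conv_output_size` below is an assumption made for this sketch; the floor is implemented with integer division.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  (valid, stride 1)
print(conv_output_size(6, 3, p=1))       # 6  (same padding)
print(conv_output_size(7, 3, s=2))       # 3
print(conv_output_size(7, 3, p=1, s=2))  # 4
```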

### 2.5 Convolution on 3D objects
Given a $(n_1, n_2, n_c)$ array and a $(f_1, f_2, n_c)$ filter, the result is a 2D array of shape $(\lfloor\frac{n_1 + 2p - f_1}{s}\rfloor + 1, \lfloor\frac{n_2 + 2p - f_2}{s}\rfloor + 1)$. The value at each square of the result is the sum, over the $n_c$ channels, of the 2D convolution of each input channel with the matching filter channel. This is why the input and the filter must share the same number of channels $n_c$. \
Because each filter yields one 2D output, several filters can be applied to the same input to detect different features at once. The image below, taken from the *CNN* course by ***Andrew Ng***, illustrates the operation:
[image](https://github.com/ayhem18/Towards_Data_science/blob/master/Machine_Learning/CNN/Convolution_3D.png?raw=true)

The square in the top left corner of the output is the sum of the weighted sums between the filter and the top left $(3, 3)$ windows in the red, green and blue channels. The same mechanism is applied to the rest of the squares.
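
A minimal sketch of this multi-channel operation, assuming zero padding and a hypothetical helper named `conv3d_single`. It also checks numerically that the result equals the sum of the per-channel 2D results.

```python
import numpy as np

def conv3d_single(volume, kernel, s=1, p=0):
    """Apply one (f_1, f_2, n_c) filter to an (n_1, n_2, n_c) volume.
    Returns a 2D array; zero padding is an assumption of this sketch."""
    n_1, n_2, n_c = volume.shape
    f_1, f_2, k_c = kernel.shape
    if n_c != k_c:
        raise ValueError("input and filter must have the same number of channels")
    # Pad height and width only, never the channel axis.
    padded = np.pad(volume, ((p, p), (p, p), (0, 0)))
    out_h = (n_1 + 2 * p - f_1) // s + 1
    out_w = (n_2 + 2 * p - f_2) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = padded[i * s:i * s + f_1, j * s:j * s + f_2, :]
            out[i, j] = np.sum(patch * kernel)  # sums over all channels at once
    return out

rng = np.random.default_rng(0)
volume = rng.normal(size=(6, 6, 3))   # e.g. a tiny RGB image
kernel = rng.normal(size=(3, 3, 3))

out = conv3d_single(volume, kernel)
# Same result computed channel by channel, then summed.
per_channel = sum(
    conv3d_single(volume[:, :, c:c + 1], kernel[:, :, c:c + 1]) for c in range(3)
)
print(out.shape)                       # (4, 4)
print(np.allclose(out, per_channel))   # True
```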

## 3. Convolutional Neural Networks
### 3.1 Convolutional Layer
Starting with a $(n_{hi}, n_{wi}, n_{ci})$ array, we apply the convolution operation with each of the $f$ filters, obtaining $f$ new arrays of shape $(n_{ho}, n_{wo})$. A bias $b_i$ is added to the $i$-th array, and an activation function is then applied element-wise. The final output of the layer is a $(n_{ho}, n_{wo}, f)$ array in which the $f$ resulting arrays are stacked on one another.
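
The forward pass of such a layer might look like the sketch below. The function name `conv_layer_forward`, the filter layout `(f_h, f_w, n_ci, f)` and the choice of ReLU as the activation are all assumptions made for illustration; the text above does not fix any of them.

```python
import numpy as np

def conv_layer_forward(x, filters, biases, s=1, p=0):
    """x: (n_hi, n_wi, n_ci); filters: (f_h, f_w, n_ci, f); biases: (f,).
    Returns the (n_ho, n_wo, f) activations, using ReLU (an assumption)."""
    n_hi, n_wi, n_ci = x.shape
    f_h, f_w, k_c, n_f = filters.shape
    if n_ci != k_c:
        raise ValueError("filters must match the input's channel count")
    x_pad = np.pad(x, ((p, p), (p, p), (0, 0)))
    n_ho = (n_hi + 2 * p - f_h) // s + 1
    n_wo = (n_wi + 2 * p - f_w) // s + 1
    out = np.zeros((n_ho, n_wo, n_f))
    for i in range(n_ho):
        for j in range(n_wo):
            patch = x_pad[i * s:i * s + f_h, j * s:j * s + f_w, :]
            # Contract the patch with every filter at once, then add the biases.
            out[i, j, :] = np.tensordot(patch, filters, axes=3) + biases
    return np.maximum(out, 0)  # element-wise ReLU

x = np.ones((5, 5, 3))
filters = np.ones((3, 3, 3, 4))               # 4 filters of shape (3, 3, 3)
biases = np.array([0.0, 1.0, -27.0, -100.0])

a = conv_layer_forward(x, filters, biases)
print(a.shape)      # (3, 3, 4)
print(a[0, 0])      # [27. 28.  0.  0.]
```

Each slice `a[:, :, i]` is the feature map produced by the `i`-th filter, and stacking them gives the third dimension of the output.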
### 3.2 Parameters
Among the crucial features of a CNN is that the number of parameters in a layer depends on the architecture and not on the spatial size of the input. More specifically, with $f$ filters of shape $(f_h, f_w, n_{ci})$, each filter holds $f_h \cdot f_w \cdot n_{ci}$ weights plus one bias, so the layer has $(f_h \cdot f_w \cdot n_{ci} + 1) \cdot f$ parameters. The input's height and width do not affect this count in any way.
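
To make the count concrete: each filter contributes one weight per entry (filter height times filter width times input channels) plus a single bias. The sketch below compares that with a fully connected layer on the 64 by 64 images from the motivation section. Both function names are hypothetical.

```python
def conv_layer_param_count(f_h, f_w, n_c_in, n_filters):
    """Weights of every filter plus one bias per filter."""
    return (f_h * f_w * n_c_in + 1) * n_filters

def dense_layer_param_count(n_in, n_units):
    """Fully connected layer: one weight per input per unit, plus biases."""
    return (n_in + 1) * n_units

# Ten 3x3 filters over an RGB input: the image height and width never appear.
print(conv_layer_param_count(3, 3, 3, 10))          # 280

# Fully connected layer with 1000 units on a flattened 64x64x3 image.
print(dense_layer_param_count(3 * 64 * 64, 1000))   # 12289000
```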
### 3.3 Notation
For the $l$-th layer, the following holds:
* $f^{[l]}$ = filter size
* $p^{[l]}$ = padding
* $s^{[l]}$ = stride
* $n_c^{[l]}$ = number of filters in the layer, which is also the number of channels in the output
* $n_c^{[l - 1]}$ = number of channels in the input
* input: $(n_h^{[l - 1]} \times n_w^{[l - 1]} \times n_c^{[l - 1]})$
* output: $(n_h^{[l]} \times n_w^{[l]} \times n_c^{[l]})$ where 
    * $n_h^{[l]} = \lfloor\frac{n_h^{[l - 1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}}\rfloor + 1$
    * $n_w^{[l]} = \lfloor\frac{n_w^{[l - 1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}}\rfloor + 1$
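
With this notation, the shape of every layer's output can be traced from the input shape and the per-layer hyperparameters. The three-layer stack below is a made-up example, and `conv_output_shape` is a hypothetical helper.

```python
def conv_output_shape(n_h, n_w, f, p, s, n_filters):
    """Shape of layer l's output given layer l-1's height and width
    and layer l's filter size f, padding p, stride s and filter count."""
    out_h = (n_h + 2 * p - f) // s + 1
    out_w = (n_w + 2 * p - f) // s + 1
    return out_h, out_w, n_filters

# Hypothetical stack: (f, p, s, number of filters) for layers 1, 2, 3.
layers = [(3, 0, 1, 10), (5, 0, 2, 20), (5, 0, 2, 40)]

shape = (39, 39, 3)          # assumed input image
shapes = [shape]
for f, p, s, n_filters in layers:
    n_h, n_w, _ = shape
    shape = conv_output_shape(n_h, n_w, f, p, s, n_filters)
    shapes.append(shape)

print(shapes)  # [(39, 39, 3), (37, 37, 10), (17, 17, 20), (7, 7, 40)]
```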


### 3.4 Example
For a first simple CNN example, consider the following [illustration](https://github.com/ayhem18/Towards_Data_science/blob/master/Machine_Learning/CNN/CNN_example.png?raw=true).

### Additional Notes
* A convolution extracts certain features from an input image. The values within the 3D array referred to as the ***filter*** determine which features are extracted.
* The resulting 2D array is referred to as the ***feature map***.
* Zero padding is useful for keeping as much information about the edges as possible while approximately maintaining the input's size throughout the network, which enables the creation of significantly deeper networks.
* Pooling layers gradually reduce the height and width of the input: they divide the input into regions and summarize each region in a single value.


### 3.5 Pooling Layers
In addition to convolutional layers, there are also pooling layers. There are two main types: 
* max pooling layers
* average pooling layers

When pooling is applied to an $(n, n)$ matrix with a filter of shape $(f, f)$ and stride $s$, the output is a matrix of shape $(\lfloor\frac{n - f}{s}\rfloor + 1, \lfloor\frac{n - f}{s}\rfloor + 1)$, where each square is either the ***max*** or the ***average*** of the elements in the corresponding window of the input.
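
Both pooling types can be sketched with one small function. The name `pool2d`, its `mode` argument and the 4 by 4 input are assumptions made for this example.

```python
import numpy as np

def pool2d(x, f, s, mode="max"):
    """Max or average pooling of a 2D array with an f x f window and stride s."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce = np.max if mode == "max" else np.mean
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = reduce(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

print(pool2d(x, f=2, s=2, mode="max"))
# [[9. 2.]
#  [6. 3.]]
print(pool2d(x, f=2, s=2, mode="average"))
# [[3.75 1.25]
#  [3.75 2.  ]]
```

Note that pooling has no learnable parameters: the window size and stride are fixed hyperparameters.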
#### Example
Consider the following [example](https://media.geeksforgeeks.org/wp-content/uploads/20190721025744/Screenshot-2019-07-21-at-2.57.13-AM.png).
#### Pooling in 3D
Extending the operation to higher dimensions is done by applying the same operation to each channel (layer) independently, so the number of channels is unchanged.
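
A minimal sketch of channel-wise pooling, with the hypothetical name `pool3d`. Each window is reduced over its height and width only, so every channel is pooled on its own.

```python
import numpy as np

def pool3d(x, f, s, mode="max"):
    """Pool an (n_h, n_w, n_c) array channel by channel."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce = np.max if mode == "max" else np.mean
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            # Reduce over height and width only, keeping the channel axis.
            out[i, j, :] = reduce(window, axis=(0, 1))
    return out

x = np.arange(32, dtype=float).reshape(4, 4, 2)   # 4x4 input with 2 channels
pooled = pool3d(x, f=2, s=2, mode="max")

print(pooled.shape)    # (2, 2, 2): height and width shrink, channels do not
print(pooled[0, 0])    # [10. 11.]
```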