## __The convolutional neural network (CNN)__
<font size=3>

In the [previous notebook](../3-mnist/3.5-handson-MNIST_experiment.ipynb), we ran two prediction experiments: rotating a handwritten digit image and creating its negative version. The results underscore a limitation of __dense layers__. Although a dense layer captures individual features from the input vector, it does not recognize the relationships between different parts of the vector that together form a data feature. Consequently, when the image is rotated or inverted, the values seen at each position of the input vector change entirely, and the previously learned features are lost.

The goal of our next type of layer, the __convolutional layer__, is to enhance feature extraction by considering both macro and micro __morphological patterns__ within the data and their correlations.

### __1. The convolutional layer:__
<font size=3>

To build a convolutional layer, we will adapt the dense layer, 
$$
    a_l^i = \sigma_l\left(W_l^{ij}\,a_{l-1}^j + b_l^i\right) \, ,
$$
to extract morphological features. 

For didactic purposes, we will consider 2-dimensional input data $a_0^{ij}$, such as an image. Unlike in the previous equation, feature extraction will not use big weight arrays $W_l^{ij}$ but instead small __filters/kernels__ $W_l^{mn}$ that _map_ a specific data pattern. In math, we express this idea as
$$
    a_l^{ij} = \sigma_l\left(\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} W_l^{mn}\,a_{l-1}^{i+m,\,j+n} + b_l\right) \, ,
$$
where:
* The bias $b_l$ is now a scalar for the entire layer $l$;
* The filter $W_l^{mn}$ of size $(k\times k)$ _scans_ the input data, summing the products of its coefficients with the corresponding $(k\times k)$ patch of the previous layer's data.
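
The equation above can be sketched directly in NumPy. This is a minimal, unoptimized illustration, not how Keras implements the layer; the helper name `conv2d_single`, the averaging filter, and the choice of ReLU for $\sigma_l$ are assumptions made only for this example.

```python
import numpy as np

def conv2d_single(a_prev, W, b=0.0, activation=lambda z: np.maximum(z, 0.0)):
    """One-filter conv layer with stride 1 and no padding (illustrative helper)."""
    k = W.shape[0]                      # square filter of shape (k, k)
    out_h = a_prev.shape[0] - k + 1
    out_w = a_prev.shape[1] - k + 1
    z = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # sum_{m,n} W[m, n] * a_prev[i + m, j + n] + b
            z[i, j] = np.sum(W * a_prev[i:i + k, j:j + k]) + b
    return activation(z)

a0 = np.arange(20 * 20, dtype=float).reshape(20, 20)   # toy (20, 20) input
W = np.ones((3, 3)) / 9.0                               # averaging filter, for illustration only
a1 = conv2d_single(a0, W)
print(a1.shape)   # (18, 18)
```

Each output element is the filter laid over one $(3,\,3)$ patch of the input, so a $(20,\,20)$ input yields an $(18,\,18)$ output.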

<br>

The figure below illustrates a convolutional operation where a filter of shape $(3,\, 3)$, so $k=3$, scans the input data of shape $(20,\, 20)$, $a_0^{rs} = a_0^{i+m,\, j+n}$ such that $r,\,s \in [0,20)$. In this operation, 9 filter coefficients/weights multiply 9 input elements to produce one element of the output conv layer. The scanning process goes from left to right and top to bottom to form the conv layer $a_1^{ij}$, for $i,\,j \in [0,\,18)$.

<center>
<img src="../figs/cnn1.png" width="500"/>
</center>

<center>
<em>(Illustration from the author's thesis)</em>
</center>

__Note__ that
- the filter operation reduces the data dimension to $\mathbf{(20} - (k-1)\mathbf{,\, 20} - (k-1)\mathbf{)}$, which for odd $k$ equals $20 - 2\cdot(k//2)$, where "//" means integer division. So, if $k=5$, the output shape would be $(16,\,16)$ (please check this statement against the figure);
- the filter slides _one_ element at a time, which means __strides = 1__. When we need the filter to jump further, we set the _strides_ hyperparameter to the number of elements it moves per step;
- if we want the conv output shape to be the same as the input data, we can pad its boundary with zeros (__padding__).
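
The shape rules in these notes can be checked with a small helper. This is a sketch: the function name `conv_output_size` is hypothetical, and the formulas assume the usual conventions for the `"valid"` (no padding) and `"same"` (zero padding) modes.

```python
import math

def conv_output_size(n, k, stride=1, padding="valid"):
    """Output length along one axis for input length n and filter size k."""
    if padding == "valid":      # no padding: the filter must fit entirely inside the input
        return (n - k) // stride + 1
    if padding == "same":       # zero padding: the size depends only on the stride
        return math.ceil(n / stride)
    raise ValueError(f"unknown padding: {padding!r}")

print(conv_output_size(20, 3))                  # 18
print(conv_output_size(20, 5))                  # 16
print(conv_output_size(20, 3, stride=2))        # 9
print(conv_output_size(20, 3, padding="same"))  # 20
```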


<font size=3>
<br>

Now have some water to refresh your mind, because a crucial point may have passed unannounced: we said $W_l^{mn}$ is __one__ filter in the layer $l$ that captures the input data features... so maybe one filter is not enough. Let's take a bunch of them! We then define a conv layer containing different __channels__, each of which comes from the operation of a different filter $W_{l,c}^{mn}$.

In the figure below, we consider a conv layer with 3 channels, which means 3 filters will extract the input data features.

<center>
<img src="../figs/cnn2.png" width="500"/>
</center>

<center>
<em>(Illustration from the author's thesis)</em>
</center>

<br>

__Note__ that for convolution operations, the input channel size must be set, even though the input data is 2-dimensional. For instance, for the MNIST handwritten digits, the input shape would be (28, 28, 1) since we are working in grayscale. If we were working with a color image, we would separate the colors into RGB arrays so that each color would be a channel.
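
As a small sketch of what the channel axis looks like in practice, assuming the channels-last layout `(height, width, channels)` and placeholder arrays of zeros instead of real images:

```python
import numpy as np

gray = np.zeros((28, 28))            # a grayscale image: height x width
gray = gray[..., np.newaxis]         # add the channel axis
print(gray.shape)                    # (28, 28, 1)

rgb = np.zeros((32, 32, 3))          # hypothetical 32x32 color image
red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]   # one array per channel
print(red.shape)                     # (32, 32)
```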

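The conv layer equation, for many channels $c$, can be written as

$$
    a_{l,c}^{ij} = \sigma_l\left(\sum_{c'}\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} W_{l,c}^{mn,c'}\,a_{l-1,c'}^{i+m,\,j+n} + b_{l,c}\right) \, ,
$$

where $c$ labels the output channel (one filter and one bias $b_{l,c}$ per channel) and $c'$ runs over the channels of the previous layer. For a single-channel input, such as grayscale MNIST, the sum over $c'$ has only one term and each filter reduces to $W_{l,c}^{mn}$.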
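
A minimal NumPy sketch of a multi-channel conv layer follows. It assumes the standard convention, the one `Conv2D` uses, in which every output channel sums over all channels of the previous layer; the helper name `conv2d`, the random weights, and the ReLU activation are illustrative choices only.

```python
import numpy as np

def conv2d(a_prev, W, b):
    """a_prev: (H, W_in, C_in); W: (k, k, C_in, C_out); b: (C_out,). Stride 1, no padding."""
    k, _, _, c_out = W.shape
    out_h = a_prev.shape[0] - k + 1
    out_w = a_prev.shape[1] - k + 1
    z = np.empty((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = a_prev[i:i + k, j:j + k, :]            # shape (k, k, C_in)
            # contract the patch with every filter over m, n and the input channels
            z[i, j, :] = np.tensordot(patch, W, axes=3) + b
    return np.maximum(z, 0.0)                              # ReLU as an example activation

rng = np.random.default_rng(0)
a0 = rng.normal(size=(20, 20, 1))     # single-channel toy input
W1 = rng.normal(size=(3, 3, 1, 3))    # 3 filters of size (3, 3)
b1 = np.zeros(3)
a1 = conv2d(a0, W1, b1)
print(a1.shape)   # (18, 18, 3)
```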

<br>

__Questions:__
- So, how do filters actually capture features?\
  The filters' weight values are initially set at random and are then adjusted to optimize the loss function. As the training progresses through epochs, the filters take the form of certain "patterns" that better extract information/features to pass to the next layer. As a result, filters trained on handwritten digits will show patterns of vertical lines, horizontal lines, round shapes, and so on.
  
- Why is it called a _"convolutional layer"_?\
  The convolution of two functions $f$ and $g$ returns another function that measures their "overlap". Its
  continuous form is given by
  $$
  (f * g)(t) = \int_{-\infty}^\infty f(t - \tau)\,g(\tau)\, d\tau \, ,
  $$
  while the discrete form is
  $$
  (f * g)[n] = \sum_{m=-M}^M f[n-m]\cdot g[m] \, ,
  $$
  which is equivalent to the conv layer equation above up to a flip of the filter indices ($n-m$ instead of $i+m$). Strictly speaking, the layer computes a cross-correlation, but since the filter weights are learned, the distinction has no practical effect.\
  Check the discrete convolution operation [here](https://en.wikipedia.org/wiki/Convolution).
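
A small NumPy check, offered as a sketch with arbitrary toy arrays, of how the discrete convolution relates to the sliding sum used in the conv layer: the two differ only by a flip of the kernel.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy "input data"
g = np.array([1.0, 0.0, -1.0])            # toy "filter"

conv = np.convolve(f, g, mode="valid")                  # true convolution: the kernel is flipped
corr = np.correlate(f, g, mode="valid")                 # sliding sum of products, as in the conv layer
corr_flipped = np.correlate(f, g[::-1], mode="valid")   # flipping the kernel recovers the convolution

print(conv)          # [2. 2. 2.]
print(corr)          # [-2. -2. -2.]
print(corr_flipped)  # [2. 2. 2.]
```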

<br>

__What about Keras?__\
In Keras, the [convolutional layers](https://keras.io/api/layers/convolution_layers/) can be programmed as
```python
# assumes `layers` is the Keras layers module and `x` is the input tensor
x = layers.Conv2D(filters=32, kernel_size=3, strides=(1, 1), padding="valid", activation="relu")(x)
```
<br>
which means 32 filters/channels, filter/kernel size $k=3$, strides $=(1,\,1)$ (one step to the right and one step down), and `padding="valid"`, i.e. no padding.

<br>

### __2. The pooling layer:__
<font size=3>

To accelerate the CNN encoding process, the conv layer can be combined with another layer that shrinks its output (by half along each dimension for a $(2,\,2)$ window) while retaining the largest (or average) feature values. This new type of layer is known as the __pooling layer__. The idea is similar to the filter scanning operation, but here, a small window slides over the conv output, computing the maximum or the average of each data portion. See the figure below for an illustration.

<center>
<img src="../figs/cnn3.png" width="500"/>
</center>

<center>
<em>(Illustration from the author's thesis)</em>
</center>

<br>

For instance, a __max-pooling__ layer with a $(2,\,2)$ window takes the maximum value among each group of 4 conv output elements. This operation
 - extracts the features with the highest values (_"most important"_);
 - _"zooms"_ into the data, which enhances more detailed features.
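
The max-pooling operation can be sketched in NumPy as follows. This is a minimal illustration, assuming a $(2,\,2)$ window that moves 2 elements at a time and an input with even height and width; `max_pool_2x2` is a hypothetical helper name.

```python
import numpy as np

def max_pool_2x2(a):
    """Max-pooling with a (2, 2) window and stride 2 on a (H, W) array with even H and W."""
    h, w = a.shape
    # group the array into 2x2 blocks, then take the maximum inside each block
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [7, 1, 2, 8]])
pooled = max_pool_2x2(a)
print(pooled)
# [[4 2]
#  [7 8]]
```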

The figure below shows the effect of two successive max-pooling operations with pool-size $=(2,\,2)$.

<center>
<img src="../figs/pool_effect.png" width="600"/>
</center>

__What about Keras?__\
In Keras, the [pooling layers](https://keras.io/api/layers/pooling_layers/) can be programmed as
```python
# assumes `layers` is the Keras layers module and `x` is the input tensor
x = layers.Conv2D(filters=32, kernel_size=3, strides=(1, 1), padding="valid", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
```
<br>
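
To see the shapes end to end, here is a self-contained sketch that wraps the two layers in a small model. It assumes the `keras` package is importable and an MNIST-like input of shape (28, 28, 1); the model is for shape checking only, not a complete classifier.

```python
import keras
from keras import layers

inputs = keras.Input(shape=(28, 28, 1))   # grayscale input with one channel
x = layers.Conv2D(filters=32, kernel_size=3, strides=(1, 1),
                  padding="valid", activation="relu")(inputs)   # -> (26, 26, 32)
x = layers.MaxPooling2D((2, 2))(x)                              # -> (13, 13, 32)

model = keras.Model(inputs, x)
print(model.output_shape)    # (None, 13, 13, 32)
print(model.count_params())  # 320 = 3*3*1*32 weights + 32 biases
```

The leading `None` is the batch dimension. The pooling layer has no trainable parameters, so all 320 parameters belong to the conv layer.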