# Convolutional neural networks

You've seen the basics of neural networks in the [structured data chapter](../1.structured_data/). Now we go a bit deeper and play with some **new kinds of layers** that make up a neural network, one layer in particular: the **convolutional layer**. It is so important in **computer vision** tasks that the whole type of network is named after it.

It is important to know, though, that convolutional neural networks do **not consist only of convolutional layers**. They are often a **combination** of more traditional layers and convolutional layers.



## 1. Convolution

### 1.1 Differences in data contents

Convolution is a key concept to understand about CNNs. It is a mathematical operation that **slides** two functions over each other to produce a third. What does this mean in the context of neural networks? To answer that, we need to understand **the fundamental difference** between a regular layer and a convolutional layer (referred to as a convnet layer below).

The difference lies in the **type of data handled**. Images are **data organised in 2D space**, as opposed to a 1D vector that you would usually feed to a NN.

![flattening](../assets/flattening.png)

Take the above image, for instance. Both sets of data **contain the same values** but are **organised differently**. If we flatten a 2D matrix into a 1D vector, we lose the spatial information: which pixels sit next to each other.
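To make the flattening point concrete, here is a minimal sketch in plain Python. The 3x3 "image" and its values are made up for illustration; each value simply labels its own position.

```python
# A tiny hypothetical 3x3 "image"; each value labels its own position.
image = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]

# Flatten row by row into a 1D vector.
flat = [pixel for row in image for pixel in row]

# In 2D, pixel 4 (the centre) sits directly below pixel 1.
# In the flat vector, the two are a full row width apart.
width = len(image[0])
distance_in_vector = flat.index(4) - flat.index(1)

print(flat)                # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(distance_in_vector)  # 3
```

The values all survive the flattening, but nothing in the vector itself says that positions 1 and 4 used to be vertical neighbours.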

An image is usually a set of three 2D matrices, one for each colour channel (r, g, b), as shown in the image below. The values range from 0 to 255, and the width and height depend on the actual image size.

![image](../assets/image_representation.png)
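As a rough sketch of that representation, a hypothetical 2x2 colour image can be written as three small matrices in plain Python. The pixel values below are invented for the example; real images are usually loaded with an image library into a single array.

```python
# A hypothetical 2x2 RGB image: one 2D matrix per colour channel.
red   = [[255,   0], [  0, 255]]
green = [[  0, 255], [  0, 255]]
blue  = [[  0,   0], [255, 255]]
channels = [red, green, blue]

depth = len(channels)   # number of colour channels
height = len(red)       # number of rows
width = len(red[0])     # number of columns

# The colour of one pixel is read from the same position in every channel.
top_right_pixel = tuple(channel[0][1] for channel in channels)

print((depth, height, width))  # (3, 2, 2)
print(top_right_pixel)         # (0, 255, 0)
```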

Okay, now you have an idea of what the data that is fed to our CNN looks like, but what does the actual layer look like? **What is being trained here**?



### 1.2 Shape of the convolutional layer

Oftentimes, when you look up the shape of a convnet layer, it is represented as a **3D block with the same width and height as the image, and a different depth**. This is not entirely wrong, but it's not what is actually being trained, and it doesn't explain the concept of convolution. Take the gif below:

![conv](../assets/convolving.gif)

This is what happens when you feed an image (in this case one channel of a 5x5px image) to a convnet layer. A **filter** (the red set of multipliers inside the yellow matrix) is **convolved** step by step over the entire image to produce a new feature to pass to the next layer.

It is important to realize that the **weights of the convnet are the multipliers of the filters**. In this case, we have 9 weights that can be trained to produce a different convolved result.
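The sliding operation can be sketched in a few lines of plain Python. The 5x5 image and the 3x3 filter values below are assumptions chosen for illustration, and the function name `convolve2d` is made up. Note that, like most deep learning libraries, this sketch does not flip the filter, which strictly speaking makes it a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the result."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    output = []
    for top in range(out_height):
        row = []
        for left in range(out_width):
            # Multiply the window element-wise with the kernel and sum up.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical single-channel 5x5 image.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

# The 9 trainable weights of a 3x3 filter (hand-picked here, not trained).
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

convolved = convolve2d(image, kernel)
for row in convolved:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

With stride 1 and no padding, a 5x5 input and a 3x3 filter give a 3x3 output, because the filter fits in 5 - 3 + 1 = 3 positions along each axis. Training changes only the 9 numbers in `kernel`.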

Okay, so the 2D representation of a neural network hopefully makes a bit more sense now. Take a look at the image below, taken from [Camacho's blog on CNNs](https://cezannec.github.io/Convolutional_Neural_Networks/).

![CNN](../assets/CNN.png)

Each of the layers of the **feature learning** section is represented by a 3D block that is fed the convolved output of the previous layer, but **why does each of these layers have a different depth**? This has to do with the **number of filters in the layer**. In CNNs, the number of filters determines how many **convolutional sets of weights** can be trained on a given input, and each filter produces one slice of the output depth. **A filter itself is a set of 2D kernels stacked into a 3D block, one kernel per input channel, so the depth of a filter is its number of kernels**!

[This](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1) article goes a bit more in depth on how convnets, filters and kernels all fit together, and it has pretty gifs!
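In the terminology of the article linked above, a filter holds one 2D kernel per input channel, and the number of filters sets the depth of the layer's output. The sketch below illustrates this in plain Python. All sizes and weight values are hypothetical, the helper names are made up, and biases are left out.

```python
def convolve2d(image, kernel):
    """Slide one 2D kernel over one 2D channel (stride 1, no padding)."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    return [
        [
            sum(image[top + i][left + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for left in range(out_width)
        ]
        for top in range(out_height)
    ]


def apply_filter(channels, filter_kernels):
    """One filter: convolve each channel with its own kernel, then sum the results."""
    maps = [convolve2d(ch, kern) for ch, kern in zip(channels, filter_kernels)]
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(width)] for r in range(height)]


def conv_layer(channels, filters):
    """One output feature map per filter, stacked along the depth axis."""
    return [apply_filter(channels, f) for f in filters]


# Hypothetical input: 3 channels of 4x4 pixels, all ones.
channels = [[[1] * 4 for _ in range(4)] for _ in range(3)]

# Two filters, each holding three 2x2 kernels (one per input channel).
# The same kernel values are reused per channel only to keep the example short.
filters = [
    [[[1, 0], [0, 1]] for _ in range(3)],
    [[[1, 1], [1, 1]] for _ in range(3)],
]

output = conv_layer(channels, filters)
print(len(output), len(output[0]), len(output[0][0]))  # 2 3 3
print(output[0][0])  # [6, 6, 6]
print(output[1][0])  # [12, 12, 12]
```

The input depth (3) fixes how many kernels each filter needs, while the number of filters (2) decides how deep the output block is.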




### 1.3 Feature extraction through convolution

More depth means a larger chance that all of the underlying patterns are captured, but it also implies a more complex model. This may raise the following question: **If we can have multiple filters for each layer, why do we need multiple convnet layers in our neural network?** The following image might help you understand:

![progression](../assets/convnet_progression.png)

Here, the **convolved outputs** of three convnet layers are presented (bottom-to-top), for four different convolutional neural networks (left-to-right). Each of the little squares of every layer represents a **trained kernel**, and all kernels of a single convnet layer together make up that layer's **filters**.

You can see that the first layer of each of the CNNs has extracted very **rudimentary features** such as lines. It's hard to fit more complex features than that on a bunch of raw pixels without **overfitting**.

These rudimentary features, however, are the building blocks for the next layer, where more complex features are constructed from these lines. The final layer takes it a step further, and you can already recognise the intended class from its features. If these features are present in the image, the corresponding part of the neural network will light up and the object is recognised.

Take note that all of these could be combined into a single neural network, but the **number of kernels** would have to be increased to capture all possible features of all classes.

A question that often comes up in CNN discussions is **how do I choose my kernel size**? The kernel size is the size of the 2D matrix you slide (convolve) over the input to the convnet layer. Bigger kernels imply less **locality** (inferring relations only in local areas instead of the whole image) and increase the model complexity, since you have more parameters to train. A rule of thumb, though, is to [keep kernels relatively small](https://www.sicara.ai/blog/2019-10-31-convolutional-layer-convolution-kernel) and go for [depth, instead of width and height](https://stats.stackexchange.com/questions/222883/why-are-neural-networks-becoming-deeper-but-not-wider). Over the years, cutting-edge CNNs have been converging towards smaller kernel sizes.
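One way to see why small kernels plus depth can pay off is to count weights. The sketch below compares one 5x5 layer with two stacked 3x3 layers. The channel count of 64 is a hypothetical choice, biases are ignored, and stride 1 is assumed throughout.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Trainable weights of one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def receptive_field(kernel_sizes):
    """Width of the input patch seen by one output value after stacked stride-1 layers."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


channels = 64  # hypothetical number of input and output channels

one_big_layer = conv_weights(5, channels, channels)
two_small_layers = 2 * conv_weights(3, channels, channels)

print(receptive_field([5]), one_big_layer)        # 5 102400
print(receptive_field([3, 3]), two_small_layers)  # 5 73728
```

Under these assumptions, both options look at the same 5x5 patch of the input, but the two small layers need fewer weights and get an extra nonlinearity in between.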



## 2. Pooling

Pool party! Anyone? No?

Sadly, pooling has nothing to do with water. It refers to the act of **pooling together input data**. Pooling is done in its own layer and usually comes in two flavours: **max pooling** and **average pooling**.

![pooling](../assets/pooling.png)

Max pooling takes the maximum value of the pooled pixels, while average pooling takes the average. Pretty simple stuff. But why do it?
- data reduction: Pooling layers are often alternated with convolution layers to reduce the model complexity and, in the case of max pooling, to retain only the 'sharpest' features.
- noise reduction: Max pooling has the added benefit of regularizing the data and filtering out noise. This is also the reason why max pooling is the more popular of the two methods.
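Both flavours can be sketched with one small helper in plain Python. The 4x4 feature map is made up, the helper names are hypothetical, and the windows are assumed to be 2x2 and non-overlapping.

```python
def pool2d(feature_map, size, reduce):
    """Apply `reduce` to every non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [4, 2, 0, 1],
    [3, 1, 2, 5],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, mean)

print(max_pooled)  # [[7, 8], [4, 5]]
print(avg_pooled)  # [[4.0, 5.0], [2.5, 2.0]]
```

Either way, the 16 input values shrink to 4, which is the data reduction mentioned in the first bullet.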



## 3. Flattening and dense layers

In the end, we still need to get to a number of classes, which are not represented in a 2D format as the convnet layers are. For this reason, we **flatten** our data after all the convolution and pooling.

![flattening](../assets/flattening.png)
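As a small sketch, flattening a stack of feature maps in plain Python is a single comprehension. The two 2x2 feature maps below are invented values standing in for the output of the last pooling layer.

```python
# Hypothetical output of the last pooling layer: 2 feature maps of 2x2.
feature_maps = [
    [[7, 8], [4, 5]],
    [[1, 0], [2, 3]],
]

# Flatten depth, rows and columns into one long vector.
flat_features = [value for fmap in feature_maps for row in fmap for value in row]

print(flat_features)       # [7, 8, 4, 5, 1, 0, 2, 3]
print(len(flat_features))  # 8
```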


If you just flatten the extracted features and stop there, though, the neural net will **perform very poorly**. We have successfully extracted features (through convolution and pooling), but actual classification hasn't been properly done yet! For this reason, we add some **fully connected dense layers** at the end of our neural network. These serve the same purpose as in normal neural networks: to **make nonlinear connections between the different features of the image**.

For example, "A nose, some eyes and a mouth are present in the image --> this will likely be a **person**"
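The example above can be turned into a toy calculation. This is only a sketch: the feature names, the second class ("car") and all weights are hand-picked assumptions rather than trained values, and a real network would usually stack several dense layers with nonlinear activations before the final softmax.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: every output is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical flattened feature activations: nose, eyes, mouth, wheel.
features = [0.9, 0.8, 0.7, 0.0]

# Hand-picked (not trained) weights, one row per class.
class_names = ["person", "car"]
weights = [
    [1.0, 1.0, 1.0, -1.0],    # person: nose, eyes and mouth count in favour
    [-1.0, -1.0, -1.0, 2.0],  # car: a wheel counts in favour
]
biases = [0.0, 0.0]

probabilities = softmax(dense(features, weights, biases))
prediction = class_names[probabilities.index(max(probabilities))]
print(prediction)  # person
```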

Take a look at the image below, taken from [this open access article on CNNs for impact detection](https://www.mdpi.com/1424-8220/19/22/4933/htm). It gives a good final overview of a pretty **standard CNN structure** with all the discussed elements present.

![CNN](../assets/CNN_deep.png)
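To tie the pieces together, the sketch below traces how the data shape changes through a small convolution, pooling, flatten pipeline. The layer sizes are hypothetical and are not read from the figure above. Stride-1 convolutions without padding and non-overlapping pooling are assumed.

```python
def conv_shape(shape, filters, kernel_size):
    """Shape after a stride-1 convolution without padding."""
    height, width, _ = shape
    return (height - kernel_size + 1, width - kernel_size + 1, filters)


def pool_shape(shape, size):
    """Shape after non-overlapping pooling with a size x size window."""
    height, width, depth = shape
    return (height // size, width // size, depth)


def flatten_size(shape):
    """Length of the vector obtained by flattening a 3D block."""
    height, width, depth = shape
    return height * width * depth


# Hypothetical 28x28 greyscale input image.
shape = (28, 28, 1)
trace = [shape]
shape = conv_shape(shape, filters=8, kernel_size=3)
trace.append(shape)
shape = pool_shape(shape, 2)
trace.append(shape)
shape = conv_shape(shape, filters=16, kernel_size=3)
trace.append(shape)
shape = pool_shape(shape, 2)
trace.append(shape)

print(trace)
# [(28, 28, 1), (26, 26, 8), (13, 13, 8), (11, 11, 16), (5, 5, 16)]
print(flatten_size(shape))  # 400
```

Width and height shrink while depth grows, and the final 400 values are what the dense layers would receive.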