# Deep Learning Basics

Deep learning is an ill-defined term that may refer to many different concepts. In this notebook, deep learning designates methods used to optimize a **network** to execute a task whose success is quantified by a **loss function**. This optimization, or learning process, is based on a **dataset**, whose samples are used to optimize the parameters of the network.
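
To make the roles of the network, the loss function and the dataset concrete, here is a minimal sketch in plain Python. It assumes a toy "network" with a single learnable parameter `w`, a mean squared error loss and plain gradient descent; the names `mse_loss` and `train`, the learning rate and the number of epochs are illustrative choices, not something prescribed by this tutorial.

```python
def mse_loss(w, dataset):
    """Loss function: mean squared error of the toy network y = w * x."""
    return sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)


def train(dataset, w=0.0, learning_rate=0.05, epochs=100):
    """Optimize the single parameter w with plain gradient descent (an assumption)."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in dataset) / len(dataset)
        # Move w in the direction that decreases the loss.
        w -= learning_rate * grad
    return w


# Toy dataset: each sample is an (input, target) pair following y = 2 * x.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

initial_loss = mse_loss(0.0, dataset)
w = train(dataset)
final_loss = mse_loss(w, dataset)
```

After training, `w` should be very close to 2 and `final_loss` much smaller than `initial_loss`. Real networks follow the same principle with many more parameters.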

Deep learning networks are a succession of functions (called **layers**) which transform their inputs into outputs (called **feature maps**). There are two types of layers:
- Layers including learnable parameters, which are updated during training to improve the value of the loss function (for example convolutions).
- Layers which have the same behaviour during the whole training process (for example pooling or activation functions).

Indeed, some characteristics are not modified during the training of the network. These components are fixed prior to training by **hyperparameters**, such as the number of layers or intrinsic characteristics of the layers. As a result, one of the main difficulties of deep learning is often not to train the networks, but to find good hyperparameters adapted to the task and the dataset. This problem gave birth to a research field called **Neural Architecture Search** (NAS), which is outside the scope of this tutorial.

<details>
<summary>
Why deep?
</summary>
Originally, the term deep was used to differentiate shallow networks, with only one layer, from those with two layers or more. Today the distinction is not really useful anymore, as most networks have many more than two layers!
</details>

## Common network layers

In a deep learning network, every function is called a layer, even though the operations that layers perform are very different. The layers composing the architectures used in the following sections of this tutorial are summarized below.

### Convolution

The aim of a convolution layer is to learn a set of filters (or kernels) which capture useful patterns in the data distribution. These filters slide over the input feature map by successive translations:

<img src="https://drive.google.com/uc?id=166EuqiwIZkKPMOlVzA-v5WemJE2tDCES" style="height: 200px;">

The learnable parameters in a filter are:
- weights: a tensor representing a local pattern. The filter produces high output values at every location where this pattern is found in the input feature map.
- bias: a scalar added to the feature map produced by the filter.

There are also hyperparameters that must be fixed:
- the number of filters,
- the size of each filter,
- the stride (number of pixels jumped during a translation of the filter).

Each output value of a convolution is a weighted linear combination of the input values covered by the filter, to which a scalar (the bias) is added.
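
The computation can be sketched for a single filter in plain Python. This is a minimal illustration without padding, not the implementation used by any library; the function name `conv2d` and the toy `image` and `edge_filter` values are assumptions made for the example.

```python
def conv2d(image, weights, bias=0.0, stride=1):
    """Slide one filter over a 2D input (no padding) and return the output feature map."""
    k_h, k_w = len(weights), len(weights[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the input patch covered by the filter, plus the bias.
            total = bias
            for di in range(k_h):
                for dj in range(k_w):
                    total += weights[di][dj] * image[i * stride + di][j * stride + dj]
            row.append(total)
        output.append(row)
    return output


# A 4x4 input containing a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 2x2 filter whose weights match a dark-to-bright vertical edge.
edge_filter = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, edge_filter, bias=0.0, stride=1)
strided_map = conv2d(image, edge_filter, bias=0.0, stride=2)
```

With a stride of 1, the output is high only in the middle column, where the filter sits exactly on the edge. With a stride of 2, the filter jumps over that position and the edge is missed, which shows how the stride changes both the size and the content of the output.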

### Batch Normalization

This layer learns to normalize feature maps, as proposed by [Ioffe & Szegedy, 2015](https://arxiv.org/abs/1502.03167). The following formula is applied to each feature map $FM_i$:

> $FM^{normalized}_i = \frac{FM_i - \mathrm{mean}(FM_i)}{\sqrt{\mathrm{var}(FM_i) + \epsilon}} \cdot \gamma_i + \beta_i$

*   $\epsilon$ is a hyperparameter of the layer, added for numerical stability (default in PyTorch: 1e-05)
*   $\gamma_i$ is the value of the scale for the $i$-th channel (learnable parameter)
*   $\beta_i$ is the value of the shift for the $i$-th channel (learnable parameter)

Adding this layer to the network may accelerate the training procedure.
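
The formula can be sketched in plain Python for one channel. This sketch assumes that `values` contains all the values of channel $i$ gathered over a mini-batch, flattened into a list, and that the variance is the biased one (divided by the number of values); the function name `batch_norm_channel` is illustrative.

```python
import math


def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the values of one channel, then apply the learnable scale and shift."""
    mean = sum(values) / len(values)
    # Biased variance (divided by the number of values), an assumption of this sketch.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) * gamma + beta for v in values]


channel_values = [1.0, 2.0, 3.0, 4.0]

# With gamma=1 and beta=0 the output has (almost) zero mean and unit variance.
normalized = batch_norm_channel(channel_values)

# gamma and beta let the network choose another scale and shift.
rescaled = batch_norm_channel(channel_values, gamma=2.0, beta=1.0)
```

In this sketch the mean of `rescaled` is close to `beta` and its standard deviation is close to `gamma`.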

### Activation function (Leaky ReLU)

In order to introduce non-linearity in the model, an activation function is applied after the convolutions. Without activation functions, the network could only learn linear combinations of its inputs!

Many activation functions have been proposed to solve deep learning problems. In the architectures implemented in `clinicadl` the activation function is Leaky ReLU:

<img src="https://sefiks.com/wp-content/uploads/2018/02/prelu.jpg?w=600" style="height: 200px;">
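
As a minimal sketch, Leaky ReLU can be written in one line of plain Python. The slope of 0.01 for negative inputs is the default of PyTorch's `nn.LeakyReLU`; the value actually used in `clinicadl` is not stated here, so treat it as an assumption.

```python
def leaky_relu(x, negative_slope=0.01):
    """Keep positive values unchanged and scale negative values by a small slope."""
    return x if x >= 0 else negative_slope * x


inputs = [-2.0, -0.5, 0.0, 3.0]
outputs = [leaky_relu(v) for v in inputs]
```

Positive values pass through unchanged, while negative values are strongly attenuated instead of being set to zero as in a standard ReLU.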


### Pooling function

The structure of the pooling layer is very similar to that of the convolutional layer: a kernel with a defined size and stride passes over the input. However, there are no learnable parameters in this layer: the kernel outputs the maximum value of the part of the feature map it covers.

Here is an example in 2D of the standard PyTorch layer `nn.MaxPool2d`:

<img src="https://drive.google.com/uc?id=1qh9M9r9mfpZeSD1VjOGQAl8zWqBLmcKz" style="height: 200px;">

We can observe that the last column may not be used, depending on the kernel size, the input size and the stride.
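
This behaviour can be reproduced with a minimal pure-Python sketch of max pooling without padding. The function name `max_pool2d` and the toy values are illustrative; the output size is computed with a floor division, which is what drops the incomplete last window.

```python
def max_pool2d(feature_map, kernel_size=2, stride=2):
    """Max pooling without padding: incomplete windows at the border are ignored."""
    out_h = (len(feature_map) - kernel_size) // stride + 1
    out_w = (len(feature_map[0]) - kernel_size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(kernel_size)
                for dj in range(kernel_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 4x5 input: the last column holds the largest values (9).
feature_map = [
    [1, 2, 3, 4, 9],
    [5, 6, 7, 8, 9],
    [1, 0, 2, 1, 9],
    [3, 4, 0, 5, 9],
]

pooled = max_pool2d(feature_map, kernel_size=2, stride=2)
```

Here `pooled` is `[[6, 8], [4, 5]]`: the value 9 never appears in the output, because a 2x2 kernel with a stride of 2 cannot fit a third window over a width of 5.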

In `clinicadl`, pooling layers with adaptive padding were implemented to exploit information from the whole feature map.

<img src="https://drive.google.com/uc?id=14R_LCTiV0N6ZXm-3wQCj_Gtc1LsXdQq_" style="height: 200px;">


### Dropout 

The aim of a dropout layer is literally to drop out (i.e. set to 0) a randomly chosen subset of the input values, each value being dropped with a fixed probability.

This behaviour is enabled during training to limit overfitting, then it is disabled during evaluation to obtain the best possible prediction.
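
The difference between the two modes can be sketched in plain Python. This sketch assumes the "inverted dropout" convention, used for instance by PyTorch's `nn.Dropout`, in which the values kept during training are rescaled by `1 / (1 - p)`; the function name `dropout` and its arguments are illustrative.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p during training; do nothing at evaluation.

    p must be strictly lower than 1.
    """
    if not training or p == 0.0:
        return list(values)
    # Rescale the kept values so that the expected output matches the input
    # (inverted dropout, an assumption of this sketch).
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


values = [1.0] * 1000
rng = random.Random(0)  # fixed seed so that the example is reproducible

train_output = dropout(values, p=0.5, training=True, rng=rng)
eval_output = dropout(values, p=0.5, training=False, rng=rng)
dropped_fraction = train_output.count(0.0) / len(train_output)
```

During training about half of the values are set to 0 and the others are doubled, while in evaluation mode the input is returned unchanged.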

### Fully-connected

Contrary to convolutions, in which relationships between values are studied locally, these layers compute global linear combinations of all the input values (hence the term fully-connected).
In convolutional neural networks they are often used at the end of the architecture to reduce the final feature maps to a number of nodes equal to the number of classes in the dataset.
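
As a minimal sketch, a fully-connected layer is a matrix-vector product followed by the addition of a bias. The example below assumes four flattened input features and two output nodes (one per class); the function name `fully_connected` and all numerical values are illustrative.

```python
def fully_connected(inputs, weights, biases):
    """Each output node is a linear combination of all the inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Flattened final feature map (4 values) reduced to 2 nodes, one per class.
features = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0.5, 0.0, 0.0, 0.5],   # weights of the node for class 0
    [0.0, 1.0, -1.0, 0.0],  # weights of the node for class 1
]
biases = [0.0, 1.0]

scores = fully_connected(features, weights, biases)
predicted_class = scores.index(max(scores))
```

Every output node depends on all the input values, and the number of learnable weights is the number of inputs multiplied by the number of outputs (here 4 x 2 = 8), plus one bias per output.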

<details>
<summary>
A bit of history
</summary>
One of the first deep learning network architectures was the Multi-Layer Perceptron, which was composed only of fully-connected and activation layers.
</details>

## Tasks & architectures

Deep learning methods have been used to learn many different tasks, such as classification, dimension reduction or data synthesis. In this notebook we focus on **classification of images** achieved with **convolutional neural networks** (CNN).

### Classification with a CNN

### Pretraining with an autoencoder

## Neuroimaging inputs
