### __1. Underfitting and overfitting:__
<font size=3>

During the __train-validation step__, the _optimizer_ uses the _training data_ to update the model's internal parameters (_i.e._, weights and biases) over the epochs so as to minimize the loss function. Meanwhile, the model's performance is measured at each epoch on the validation data. At this stage of the workflow, we adjust the neural network architecture to avoid [overfitting and underfitting](https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/).
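To make the train-validation step concrete, here is a minimal sketch in plain NumPy rather than Keras. It assumes a hypothetical toy problem (a noisy straight line), a two-parameter linear model, and full-batch gradient descent as the optimizer; the names `x_tr`, `x_va`, `lr`, and `history` are illustrative only. The keys `loss` and `val_loss` mirror the ones Keras stores in its training history.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: y = 3x - 1 plus a little Gaussian noise.
x = rng.uniform(-1.0, 1.0, 200)
y = 3.0 * x - 1.0 + rng.normal(0.0, 0.1, 200)

# Split into training and validation sets.
x_tr, y_tr = x[:150], y[:150]
x_va, y_va = x[150:], y[150:]


def mse(w, b, x, y):
    """Mean squared error of the linear model w*x + b."""
    return float(np.mean((w * x + b - y) ** 2))


w, b = 0.0, 0.0   # the model's inner parameters
lr = 0.1          # learning rate (arbitrary choice for this sketch)
history = {"loss": [], "val_loss": []}

for epoch in range(100):
    # Optimizer step: gradient of the training MSE with respect to w and b.
    err = w * x_tr + b - y_tr
    w -= lr * 2.0 * np.mean(err * x_tr)
    b -= lr * 2.0 * np.mean(err)

    # Performance is recorded once per epoch on both data sets.
    history["loss"].append(mse(w, b, x_tr, y_tr))
    history["val_loss"].append(mse(w, b, x_va, y_va))

print(f"w = {w:.3f}, b = {b:.3f}")
print(f"final loss = {history['loss'][-1]:.4f}, "
      f"final val_loss = {history['val_loss'][-1]:.4f}")
```

Only the training data drive the parameter updates; the validation data are used purely for monitoring, which is what lets us compare the two curves afterwards.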

__Underfitting__ means the NN fits poorly, _i.e._, the model does not learn the training data well, so its error stays high on both the training and the validation data. __Overfitting__, on the other hand, occurs when the model fits the training data very well but makes poor predictions on the validation data.

__To avoid underfitting__, we need to give the NN more capacity: more layers (greater depth) and/or more neurons per layer (greater width).

__To avoid overfitting__, we have two basic options: __i)__ decrease the number of neurons and/or layers, which is analogous to decreasing the degree of a polynomial function (check the [figure](https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/) again); __ii)__ apply dropout to a layer with a large number of neurons.
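The polynomial analogy can be tried out directly. The sketch below is an illustration only: the data (a noisy sine), the sample sizes, and the degrees 1, 3, and 12 are arbitrary assumptions, and `x_tr`, `x_va`, and `results` are hypothetical names. It fits polynomials of increasing degree to a small training set and reports the mean squared error on the training and validation data.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1] (toy data for this sketch)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y


x_tr, y_tr = make_data(20)    # small training set
x_va, y_va = make_data(200)   # larger validation set


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares fit on training data
    results[degree] = (mse(coeffs, x_tr, y_tr), mse(coeffs, x_va, y_va))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.4f}, "
          f"validation MSE = {results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree fit contains the lower-degree ones as special cases. A straight line is too simple for a sine (underfitting), while a high degree fitted to few points typically shows a training error far below its validation error (overfitting).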

### __2. The Dropout:__
<font size=3>

What is the dropout layer? During training, dropout _"switches off"_ neurons of the previous layer at random by setting their activations to zero! When training becomes rigid, this creates a kind of _"neuroplasticity"_ in the network, pushing it to form more flexible connections.

The figure below, from the pioneering [paper](https://paperswithcode.com/method/dropout), shows on the left an example of an NN with three hidden layers, each with five neurons, and an output with a single neuron. On the right, dropout applied to these three hidden layers illustrates how the neuron connections change (some are blocked). The new input -> output connections can make the _pathway_ more flexible, which helps prevent overfitting. Have a look at the paper's motivation section!

<center>
<img src="../figs/dropout.png" width="800"/>
</center>

<font size=3>

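In the __keras__ [layers.Dropout(q)](https://keras.io/api/layers/regularization_layers/dropout/) function, $q$ is the dropout rate, _i.e._, the fraction of the chosen layer's outputs that are randomly set to zero at each training step. Dropout is only active during training; at inference, the layer passes its input through unchanged. Below, we have a model with three hidden layers. The 2nd hidden layer (200 neurons) has a 50% dropout, so on average 100 of its neurons are set to zero at each training step.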

```python
In = keras.Input(shape=(x_train.shape[1],))

x = keras.layers.Dense(50, activation='sigmoid')(In)

x = keras.layers.Dense(200, activation='sigmoid')(x)

x = keras.layers.Dropout(0.5)(x)

x = keras.layers.Dense(20, activation='sigmoid')(x)

Out = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=In, outputs=Out)
```

<br>
In another example below, we have a model with three hidden layers, with 30% and 40% dropouts applied to the outputs of the 1st and 2nd hidden layers, respectively.

```python
In = keras.Input(shape=(x_train.shape[1],))

x = keras.layers.Dense(150, activation='sigmoid')(In)

x = keras.layers.Dropout(0.3)(x)

x = keras.layers.Dense(80, activation='sigmoid')(x)

x = keras.layers.Dropout(0.4)(x)

x = keras.layers.Dense(20, activation='sigmoid')(x)

Out = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=In, outputs=Out)
```
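To see what the dropout rate does numerically, here is a minimal NumPy sketch of _inverted_ dropout. It is not the Keras implementation, only an illustration of the behaviour described in the Keras documentation: a fraction `rate` of the inputs is zeroed during training, the surviving inputs are scaled up by `1 / (1 - rate)`, and nothing happens at inference. The function `dropout` and its `training` and `rng` arguments are hypothetical names chosen for this example.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout on an array of activations.

    Training: zero each entry with probability `rate` and rescale the
    survivors by 1 / (1 - rate). Inference: return the input unchanged.
    """
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate   # True with probability 1 - rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(42)

# 1000 samples of a 200-unit layer, all activations set to 1 for readability.
x = np.ones((1000, 200))

out_train = dropout(x, rate=0.5, training=True, rng=rng)
out_infer = dropout(x, rate=0.5, training=False, rng=rng)

zeros_per_sample = (out_train == 0).sum(axis=1)
print("mean number of dropped units per sample:", zeros_per_sample.mean())
print("min / max dropped units:", zeros_per_sample.min(), zeros_per_sample.max())
print("mean activation during training:", out_train.mean())
print("inference output unchanged:", np.array_equal(out_infer, x))
```

The number of dropped units fluctuates around 100 from sample to sample rather than being exactly 100, and the rescaling keeps the average activation close to its original value, so the next layer sees inputs of a similar magnitude in training and at inference.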

<br>

#### __2.1 What about the math?__
<font size=3>
    
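In the paper, the authors implement dropout by sampling, for each neuron $i$ of layer $l-1$, a mask value $r_{l-1}^i$ from a [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) with probability $p$ and multiplying it by the activation $a_{l-1}^i$ before it enters layer $l$. Note that $p$ here is the probability of _keeping_ a neuron, so it corresponds to one minus the rate passed to `layers.Dropout`: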

\begin{align}
    r_{l-1}^i &\sim \mathrm{Bernoulli}(p) \, , \\
    \tilde a_{l-1}^i &= r_{l-1}^i \odot a_{l-1}^i \, , \\
    a_l^i &= \sigma_l\left(W_l^{ij}\,\tilde a_{l-1}^j + b_l^i\right) \, .
\end{align}

Below is an example of how it works:

In [1]:
from numpy.random import randn
from scipy.stats import bernoulli

In [2]:
a = randn(20)
a

array([ 0.10461083, -0.78009315,  1.11124381,  0.04286497, -1.03652512,
       -2.37830578, -0.04181743,  2.0773545 , -0.50668722, -0.22254797,
        0.6602887 , -0.55147563,  0.21989477,  0.68738247,  0.27180098,
       -1.50879617, -0.94190003, -2.60679257,  0.05628814,  1.33111176])

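In [3]:
p = 0.5  # probability of keeping each entry

# Draw one 0/1 mask value per entry: 1 with probability p, 0 otherwise.
# (bernoulli.cdf evaluates the cumulative distribution at `a`; it does not
# draw random samples, so bernoulli.rvs is used instead.)
r = bernoulli.rvs(p, size=a.shape)

a = r*a  # dropped entries become zero, kept entries are unchanged

a

The output changes from run to run because the mask is random: on average, half of the 20 entries are set to zero, and the remaining ones keep their original values.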
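
Putting the three equations of this section together, a single dense layer with dropout on its inputs can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's code: the layer sizes, the random weights, and the names `sigmoid` and `dense_with_input_dropout` are hypothetical, and $p$ is the probability of keeping a neuron, as in the equations.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation, playing the role of sigma_l."""
    return 1.0 / (1.0 + np.exp(-z))


def dense_with_input_dropout(a_prev, W, b, p, rng):
    """One training-time forward step following the three equations.

    a_prev : activations of layer l-1, shape (n_in,)
    W, b   : weights (n_out, n_in) and biases (n_out,) of layer l
    p      : probability of keeping each neuron of layer l-1
    """
    r = rng.binomial(1, p, size=a_prev.shape)   # r ~ Bernoulli(p)
    a_tilde = r * a_prev                        # masked activations
    a = sigmoid(W @ a_tilde + b)                # activations of layer l
    return a, r


rng = np.random.default_rng(0)
a_prev = rng.standard_normal(5)     # 5 neurons in layer l-1
W = rng.standard_normal((3, 5))     # 3 neurons in layer l
b = np.zeros(3)

a, r = dense_with_input_dropout(a_prev, W, b, p=0.5, rng=rng)
print("mask r:      ", r)
print("activations: ", a)

# With p = 1 every neuron is kept and the layer reduces to a plain dense layer.
a_full, r_full = dense_with_input_dropout(a_prev, W, b, p=1.0, rng=rng)
print("no dropout:  ", a_full)
```

Unlike the Keras layer, this formulation does not rescale the kept activations during training; the paper instead compensates at test time by multiplying the weights by $p$.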