# Dropout

### Original Paper
Dropout: A Simple Way to Prevent Neural Networks from Overfitting by
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov (2014) http://jmlr.org/papers/v15/srivastava14a.html

### Description
At each training iteration, we disable a randomly selected subset of neurons, then run both the forward and backward pass with those neurons disabled. Numerically, we sample a vector from a Bernoulli distribution,
then multiply the activation matrix by this vector column-wise (i.e., the same neuron is disabled across all samples).

### Mathematical Definition
Let's define:
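* $ p = \text{probability of keeping a node} $ (note that the Keras `rate` argument described further down is the probability of dropping a node, i.e. $1 - p$)
* $ \mathbf{d} = \text{vector of independent random variables, one per node, each} \sim \text{Bernoulli}(p) $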

Now we can zero out randomly selected columns of the activation matrix and rescale the remaining ones, where $m$ indexes the node:
$$
\tilde{\mathbf{a}}_{:, m}^{[1]}
= \frac {\mathbf{a}_{:, m}^{[1]} \cdot d_m} {p}
$$
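The masking step above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function names `dropout_forward` and `dropout_backward`, the array sizes, and the choice of `p` are all assumptions made for the example. It follows this notebook's convention of one mask entry per column (node), whereas library layers such as Keras `Dropout` draw a separate mask entry per element by default.

```python
import numpy as np

def dropout_forward(a, p, rng):
    """Inverted dropout with one Bernoulli(p) draw per column (node)."""
    # d[m] = 1 keeps node m, d[m] = 0 disables it for every sample in the batch.
    d = rng.binomial(n=1, p=p, size=a.shape[1])
    # Broadcasting multiplies column m by d[m]; dividing by p rescales the survivors.
    a_tilde = a * d / p
    return a_tilde, d

def dropout_backward(grad_a_tilde, d, p):
    """Backward pass: reuse the same mask so disabled nodes receive no gradient."""
    return grad_a_tilde * d / p

rng = np.random.default_rng(0)
a = np.ones((4, 6))  # 4 samples (rows), 6 nodes (columns); sizes are arbitrary
a_tilde, d = dropout_forward(a, p=0.5, rng=rng)
grad_a = dropout_backward(np.ones_like(a), d, p=0.5)
```

Every row of `a_tilde` is identical here because the input is all ones and the mask is shared across samples; each column is either zeroed or scaled by $1/p$.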

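Why divide by $p$? To keep the expected activation unchanged. Because the mask $\mathbf{d}$ is sampled independently of the activations, the expectation of their product factorizes, so for any sample $s$: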

$$ 
\begin{align}
    \mathbb{E} [\tilde{\mathbf{a}}_{s, :}^{[1]}] 
    & = \frac {\mathbb{E}[\mathbf{a}_{s, :}^{[1]}] \cdot \mathbb{E}[\mathbf{d}]} {p} \\
    & = \frac {\mathbb{E}[\mathbf{a}_{s, :}^{[1]}] \cdot p}{p} \\
    & = \mathbb{E}[\mathbf{a}_{s, :}^{[1]}]
\end{align}
$$
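The derivation above can be checked empirically with a small Monte Carlo sketch. The activation values, the keep probability, and the number of draws are made up for illustration; the check holds the activations fixed and averages over many independent masks.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8
a_row = np.array([0.5, 1.0, 2.0, 4.0])  # one sample's activations (made-up values)

# Draw many independent masks, one Bernoulli(p) entry per node.
masks = rng.binomial(n=1, p=p, size=(100_000, a_row.size))

# Apply inverted dropout under every mask.
a_tilde = a_row * masks / p

# Averaging over the masks should approximately recover the original activations.
mean_estimate = a_tilde.mean(axis=0)
```

Without the division by $p$, the same average would instead land near $p$ times the original activations, which is the shift the rescaling is meant to remove.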

### Regularization
Why does this work as a regularization method? 
Intuitively, every node has to spread its reliance across all of its input nodes to minimize the impact of any particular input node being disabled.
Numerically, this tends to shrink the L2 norm of the weight vector.
Theoretically, it's also similar to training an ensemble of neural networks, since each sampled dropout mask defines a different sub-network that shares weights with the others.
Overall, dropout makes the model more robust to noise.

In [1]:
from tensorflow.keras.layers import Dropout
?Dropout

Init signature: Dropout(*args, **kwargs)
Docstring:
Applies Dropout to the input.

The Dropout layer randomly sets input units to 0 with a frequency of `rate`
at each step during training time, which helps prevent overfitting.
Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over
all inputs is unchanged.

Note that the Dropout layer only applies when `training` is set to True
such that no values are dropped during inference. When using `model.fit`,
`training` will be appropriately set to True automatically, and in other
contexts, you can set the kwarg explicitly to True when calling the layer.

(This is in contrast to setting `trainable=False` for a Dropout layer.
`trainable` does not affect the layer's behavior, as Dropout does
not have any variables/weights that can be frozen during training.)

>>> tf.random.set_seed(0)
>>> layer = tf.