# Regularizing Neural Networks

* Neural networks learn a set of weights that best map inputs to outputs.
* Deep neural networks are prone to overfitting quickly when the training dataset has few examples.

## Weight penalties

* Large weights can be a sign of an unstable network, where small changes in the input lead to large changes in the output. This often indicates that the network has overfit the training dataset and will likely perform poorly when making predictions on new data.
* A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called weight regularization and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model.
* We will mainly explore L1 regularization and L2 regularization.

### L1 norm and L2 norm

* Before delving into L1/L2 regularization, let's spend a bit of time introducing L1/L2 norm.
* In mathematics, a "norm" is a function from a vector space to the non-negative real numbers that behaves in certain ways "like the distance from the origin".
* L1 norm:

$$
\left\lVert x \right\rVert_1 = \sum_{i=1}^n \left| x_i \right|
$$

* L2 norm:

$$
\left\lVert x \right\rVert_2 = \sqrt{\sum_{i=1}^n x_i^2}
$$

* The graph below plots both norms; one interesting difference is how their derivatives change as $w$ changes:

<img src="./assets/regularization/l1-and-l2-norms.gif" alt="L1 and L2 norms" style="width: 500px;"/>
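
* The two norm definitions above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `l1_norm` and `l2_norm` are illustrative and do not come from any particular library.

```python
import math


def l1_norm(x):
    """L1 norm: sum of the absolute values of the components."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """L2 norm: square root of the sum of the squared components."""
    return math.sqrt(sum(v * v for v in x))


x = [3.0, -4.0]
print(l1_norm(x))  # |3| + |-4| = 7.0
print(l2_norm(x))  # sqrt(9 + 16) = 5.0
```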

### L1 regularization

* The formula for MSE with L1 regularization is:

$$
MSE_{L1} = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \left| w_j \right|
$$

* where:

    * $n$ is the number of samples in the dataset
    * $y_i$ is the true target value for the $i$-th sample
    * $\hat{y}_i$ is the predicted target value for the $i$-th sample
    * $p$ is the number of weights in the model
    * $w_j$ is the $j$-th weight in the model
    * $\lambda$ is the regularization parameter, which controls the strength of the L1 regularization penalty

* The general idea is that, apart from minimizing the loss function, we also want to penalize the L1 "distance" of the weight vector from the origin.
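
* The formula above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper name `mse_l1`, the sample values, and the choice `lam=0.1` are all made up for the example, and a real model would compute the predictions from the weights rather than take them as given.

```python
def mse_l1(y_true, y_pred, weights, lam):
    """Mean squared error plus an L1 penalty on the weights."""
    n = len(y_true)
    # Data term: average squared difference between targets and predictions.
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n
    # Penalty term: lambda times the L1 norm of the weight vector.
    penalty = lam * sum(abs(w) for w in weights)
    return mse + penalty


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]      # hypothetical predictions
weights = [0.5, -1.5, 2.0]    # hypothetical model weights
loss = mse_l1(y_true, y_pred, weights, lam=0.1)
print(loss)  # MSE of 4/3 plus a penalty of 0.1 * 4.0
```

* Setting `lam` to zero recovers the plain MSE, while larger values make large weights more expensive relative to the prediction error.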

## Dropout

* Ensembles of neural networks (i.e., a large number of neural networks) with different model configurations are known to reduce overfitting, but require the additional computational expense of training and maintaining multiple models.
* A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during **training**.
* Dropout is implemented per layer in a neural network. It can be used with most types of layers, such as dense (fully connected) layers, convolutional layers, and recurrent layers such as the long short-term memory (LSTM) layer.
* Dropout can be used after convolutional layers (e.g. Conv2D) and after pooling layers (e.g. MaxPooling2D).
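
* To make the idea of randomly dropping nodes during training concrete, here is a minimal sketch of dropout applied to one layer's activations. It assumes the common "inverted dropout" variant, where the surviving activations are scaled by `1 / (1 - rate)` so that nothing needs to change at prediction time; the notes above do not specify a variant, and the function name `dropout` and its parameters are illustrative only.

```python
import random


def dropout(activations, rate, training, rng):
    """Apply inverted dropout to a list of activations.

    rate     -- probability of dropping each node (must be < 1.0)
    training -- dropout is only active while training
    rng      -- a random.Random instance, passed in for reproducibility
    """
    if not training or rate == 0.0:
        # At prediction time every node is kept unchanged.
        return list(activations)
    keep_prob = 1.0 - rate
    # Keep each node with probability keep_prob and rescale it;
    # otherwise zero it out for this training step.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
layer_output = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations
print(dropout(layer_output, rate=0.5, training=True, rng=rng))
print(dropout(layer_output, rate=0.5, training=False, rng=rng))
```

* Because a different random subset of nodes is dropped on each training step, each step effectively trains a different "thinned" network that shares weights with all the others.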

## References
* [Use Weight Regularization to Reduce Overfitting of Deep Learning Models](https://machinelearningmastery.com/weight-regularization-to-reduce-overfitting-of-deep-learning-models/)
* [A Gentle Introduction to Dropout for Regularizing Deep Neural Networks](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/)
* [How to Reduce Overfitting With Dropout Regularization in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/)
* [How to Reduce Overfitting Using Weight Constraints in Keras](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-neural-networks-with-weight-constraints-in-keras/)