<a href="https://colab.research.google.com/github/deltorobarba/machinelearning/blob/master/regularization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regularization

*Author: Alexander Del Toro Barba*

## Overview

* Regularization is a technique for preventing over-fitting by penalizing a model for having large weights
* "Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model" (Andrew Gelman). "Overfitting of models is widely recognized as a concern. It is less recognized however that overfitting is not an absolute but involves a comparison. A model overfits if it is more complex than another model that fits equally well" (Douglas Hawkins).
* Weight regularization was borrowed from penalized regression models in statistics. Neural networks learn a set of weights that best map inputs to outputs. Large weights can be a sign of an unstable network, where small changes in the input lead to large changes in the output. This can indicate that the network has overfit the training dataset and will likely perform poorly on new data. One solution is to change the learning algorithm so that it encourages the network to keep the weights small. This is called weight regularization, and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model.
* **Measures against overfitting**: vector-norm penalties (L1, L2, Elastic Net), dropout, jitter (noise injection), batch size, ensemble models, a simpler model, more data.

<br>

**Benefits of regularization from a mathematical optimization point of view**
* Almost every machine learning model (linear regression, logistic regression, SVMs, neural networks, PCA, ICA, K-means) can be cast as an optimization problem, where we form a cost function and minimize it to find optimal values for the model's parameters. Some machine learning models, e.g. neural networks, have cost functions that are non-convex. From an optimization perspective, this is an undesirable property. Stationary points of such cost functions (flat regions, saddle points, poor local minima) are problematic because numerical optimization schemes like gradient descent can easily get stuck in them, leading to poor results.
* Regularization can be used as a way of convexifying a non-convex cost function to help gradient descent avoid some undesirable stationary points. The L2 regularizer, being an upward-facing convex function, can unflatten flat regions and curve up some stationary points without severely changing the locations of the minima (e.g. an L2-regularized cost may no longer have an issue with a saddle point, because the region surrounding it has been curved upwards).
* Furthermore, regularization can also help with the optimization of convex machine learning problems, for example linear regression when $X^{T}X$ is not invertible. One can easily verify that the solution to the L2-regularized version of linear regression is given by $w = (X^{T}X + \lambda I)^{-1}X^{T}y$, where $\lambda$ is the regularization parameter, which can be set large enough so that $X^{T}X + \lambda I$ becomes invertible and well-conditioned.
* Why regularize with the L1 norm? To answer this question, let's keep our optimization glasses on for a little longer. The L0 norm (not really a norm!) of a vector $w$, usually denoted by $\|w\|_{0}$, is the number of nonzero entries in $w$. So if you want to make $w$ sparse, a direct way to do so is by minimizing $\|w\|_{0}$. Unfortunately, in many cases this is not an easy task, so instead we replace it with the best convex approximation to it, which is the L1 norm $\|w\|_{1}$. You can therefore think of the L1 norm as a compromise between the L0 and L2 norms, inheriting the sparsity-inducing property from the former and convexity from the latter.
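
The invertibility argument can be illustrated with a small sketch. It assumes a toy dataset with two identical feature columns, so that $X^{T}X$ is singular, and uses a hypothetical helper `ridge_2d` written only for the two-feature case; it is not taken from any library.

```python
def ridge_2d(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam*I)^{-1} X^T y for two features."""
    # Entries of the symmetric 2x2 matrix X^T X + lam*I
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    # Entries of the vector X^T y
    g0 = sum(r[0] * t for r, t in zip(X, y))
    g1 = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X + lam*I is singular")
    # Apply the inverse of [[a, b], [b, d]] to (g0, g1)
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


# Toy data (assumption): both columns are identical and y = 2 * x
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

# With lam = 0 the matrix is singular; a small positive lam fixes that
w = ridge_2d(X, y, lam=0.1)
print(w)
```

With `lam=0.0` the helper raises an error because the two columns are perfectly collinear. With a small positive `lam` the weight is split evenly between the two identical features, and their sum stays close to the true slope of 2.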

<br>

**Variance-Bias-Tradeoff**
* Generally, we refer to an overfit model as having a large variance and a small bias. That is, the model is sensitive to the specific examples, including the statistical noise, in the training dataset.

![Bias Variance Tradeoff](https://raw.githubusercontent.com/deltorobarba/repo/master/6ECD1124-FDA7-424C-9705-419523281733.png)

Source: [Regularization and Geometry](https://towardsdatascience.com/regularization-and-geometry-c69a2365de19)

<br>

**Overfitting or Overtraining?**
* From [Mehmet Suzen](https://www.linkedin.com/in/mehmetsuzen/): With all due respect to [Andrew Ng](https://www.youtube.com/watch?v=NyG-7nRpsW8), regularisation does not prevent or even reduce overfitting. Regularisation was originally developed to reduce ill-conditioning in inverse problems. Regularisation, along with early stopping, cross-validation and dropout, reduces the generalisation error and provides a reliable measure of it. Overfitting, on the other hand, is about the 'fit', i.e., the model complexity. In deep nets, model complexity correlates with the full architecture and the activation functions. This may sound like a semantic issue to many, but in the 1990s the phenomenon was called 'overtraining' ([IEEE: Asymptotic statistical theory of overtraining and cross-validation](https://ieeexplore.ieee.org/document/623200)).
* Answer: left to itself, the learning process "will tend to learn more and more complex functions as the number of iterations increases". A model represented by a more complex function, and thus generalizing poorly, is an overfitting model. It follows from the statement above that such a model can be prevented by stopping the learning early (among other techniques). Regularization is the process of applying those techniques.
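
Stopping the learning early can be sketched as a patience rule on the validation loss. The function name `early_stopping_epoch`, the `patience` parameter and the loss values below are illustrative assumptions, not part of the notebook or of any library.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch


# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
stop = early_stopping_epoch(val_losses, patience=2)
print(stop)
```

With `patience=2`, training stops after two epochs without improvement and keeps epoch 2, so the later drop to 0.50 is never seen. A larger patience trades extra training time for the chance to catch such late improvements.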




---



![Regularization Types](https://raw.githubusercontent.com/deltorobarba/repo/master/5907C4B3-6EC5-40CD-BB51-A5AB75C3DC71.jpeg)

Source: ['Getting started with Regression'](https://medium.com/@savannahar68/getting-started-with-regression-a39aca03b75f)

## L1 Regularization

<br>
<p>
$\sum_{i=1}^{n}\left|u_{i}\right|=\sum_{i=1}^{n}\left|y_{i}-b_{0}-b_{1} x_{i}\right|$
</p><br>

* **Synonyms**: Lasso (L1 penalty on the weights), Manhattan distance. When the L1 norm is applied to the residuals instead, as in the formula above, the method is known as least absolute deviations (LAD) or least absolute errors (LAE).
* **Summary**: Sum of the absolute weights. Gives sparse solutions, since it does not use all features.
* **Advantages**: Less influenced by outliers (robust). Can shrink some coefficients to exactly zero as lambda increases, thereby performing variable selection. Generates sparse feature vectors (sparse: only very few entries in a matrix or vector are non-zero; the L1 norm tends to produce many coefficients with zero or very small values and a few large coefficients). Sparsity is sometimes desirable, e.g. in high-dimensional classification problems, and it makes later computation more efficient.
* **Disadvantages**: Because the absolute value is not differentiable at zero, L1 regularization does not work easily with all forms of training. It gives a solution with more large residuals and a lot of zeros.
* **Use Cases**: When only a subset of the features is correlated with the label, since in a lasso model some coefficients can be shrunk to zero. Very useful when you want to understand exactly which features contribute to a decision. When you can ignore the outliers in your dataset, or you need them to be there. When feature extraction is constrained: since the L1 norm gives a solution in which the weights of a large set of features are zero, you can avoid computing many computationally expensive features at the cost of some accuracy. An example is real-time detection or tracking of an object, face or material using a set of diverse handcrafted features with a large-margin classifier like an SVM in a sliding-window fashion, where feature computation should be as fast as possible.

*Bayesian: L1 usually corresponds to setting a Laplace prior. Some of the coefficients will shrink to zero; a similar effect would be achieved in Bayesian linear regression using a Laplace prior (strongly peaked at zero) on each of the beta coefficients.*
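
The sparsity effect can be sketched for a single weight. Assuming the simplified one-dimensional problems named in the docstrings (an illustration, not the full lasso or ridge fit), the L1 penalty leads to soft-thresholding, which sets small weights exactly to zero, while the L2 penalty only rescales them. The helper names are hypothetical.

```python
def soft_threshold(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*abs(v): shrink by lam, clip at zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + 0.5*lam*v**2: proportional shrinkage."""
    return w / (1.0 + lam)


weights = [3.0, -0.4, 0.2, -2.0]  # illustrative unregularized weights
l1_result = [soft_threshold(w, 0.5) for w in weights]
l2_result = [l2_shrink(w, 0.5) for w in weights]
print(l1_result)  # small weights become exactly 0.0
print(l2_result)  # all weights shrink, none becomes zero
```

Weights whose magnitude is below `lam` are removed entirely under L1, which is the variable-selection behaviour described above.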



## L2 Regularization

<br><p>
$\sum_{i=1}^{n} u_{i}^{2}=\sum_{i=1}^{n}\left(y_{i}-b_{0}-b_{1} x_{i}\right)^{2}$
</p><br>

* **Synonyms**: Weight decay, ridge regression, Tikhonov regularization, Euclidean distance. When the squared L2 norm is applied to the residuals instead, as in the formula above, the method is known as the method of least squares or least squares error (LSE).
* **Summary**: Sum of the squared weights. The most common type of regularization, also simply called "weight decay", with values often chosen on a logarithmic scale between 0 and 0.1, such as 0.1, 0.01, 0.001, 0.0001, etc.
* **Advantages**: Shrinks all coefficients (by the same proportion in the idealized case of uncorrelated features) but eliminates none. Leads to small, distributed weights in neural networks. L2 regularization heavily penalizes "peaky" weight vectors and prefers diffuse weight vectors. Often performs better than L1 empirically, and the fit is often more precise. Works with all forms of training. Smoother: fewer large residuals, along with fewer very small residuals. For linear regression the L2-regularized problem has an analytical solution, which allows it to be computed efficiently.
* **Disadvantages**: Sensitive to outliers, since L2 wants all errors to be tiny and heavily penalizes any that are not. Because no weight is set to zero, the resulting model is dense and can be more expensive to evaluate than a sparse L1 model. Does not give you implicit feature selection.
* **Use Cases**: Use ridge if all the features are correlated with the label, since ridge shrinks coefficients but does not set them exactly to zero.
* **Relationship to Dropout**: Dropout has been described as nothing more than an adaptive form of L2 regularization, with both methods having similar effects.
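
The preference for diffuse over "peaky" weight vectors can be checked with a small sketch. The two example vectors are illustrative assumptions chosen to have the same L1 norm.

```python
def l1_penalty(w):
    """Sum of absolute weights."""
    return sum(abs(v) for v in w)


def l2_penalty(w):
    """Sum of squared weights."""
    return sum(v * v for v in w)


peaky = [1.0, 0.0, 0.0, 0.0]        # all weight on one feature
diffuse = [0.25, 0.25, 0.25, 0.25]  # same total weight, spread out

print(l1_penalty(peaky), l1_penalty(diffuse))  # 1.0 1.0
print(l2_penalty(peaky), l2_penalty(diffuse))  # 1.0 0.25
```

Both vectors cost the same under L1, but the diffuse vector is four times cheaper under L2, which is why L2 pushes a network toward using many inputs a little rather than a few inputs a lot.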


## Elastic Net

* Method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.
* Even when there is a strong reason to use L1 given the number of features, elastic net is often the better choice. Granted, this is only a practical option for linear or logistic regression. In that case, however, elastic nets have been shown (in theory and in practice) to outperform L1/lasso. Elastic nets combine L1 and L2 regularization at the "only" cost of introducing another hyperparameter to tune (see the elastic net paper by Zou and Hastie for more details).
* Overcomes limitations of L1: in the "large p, small n" case (high-dimensional data with few examples), the lasso selects at most n variables before it saturates. Also, if there is a group of highly correlated variables, the lasso tends to select one variable from the group and ignore the others.
* Solution in elastic net: add a quadratic part (L2) to the penalty. The quadratic penalty term makes the loss function strictly convex, so it has a unique minimum.
* The naive version of the elastic net finds an estimator in a two-stage procedure: first, for each fixed λ2, it finds the ridge regression coefficients, and then it applies a lasso-type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve prediction performance, the authors rescale the coefficients of the naive version by multiplying the estimated coefficients by (1+λ2).
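
The combined penalty and the (1+λ2) rescaling can be sketched as follows. The per-coefficient formula assumes an orthonormal design and the cost |y − Xβ|² + λ2|β|² + λ1|β|₁, in which case each coefficient can be computed from its ordinary least squares value `b_ols`. All function names are hypothetical.

```python
def elastic_net_penalty(w, lam1, lam2):
    """Linear combination of the L1 and squared L2 penalties."""
    return lam1 * sum(abs(v) for v in w) + lam2 * sum(v * v for v in w)


def naive_elastic_net_coef(b_ols, lam1, lam2):
    """Naive elastic net coefficient, assuming an orthonormal design."""
    # Lasso-type soft-thresholding followed by ridge-type scaling
    magnitude = max(abs(b_ols) - lam1 / 2.0, 0.0)
    sign = 1.0 if b_ols > 0 else -1.0
    return sign * magnitude / (1.0 + lam2)


def elastic_net_coef(b_ols, lam1, lam2):
    """Rescaled coefficient: undo the extra ridge shrinkage with (1 + lam2)."""
    return (1.0 + lam2) * naive_elastic_net_coef(b_ols, lam1, lam2)


print(elastic_net_penalty([1.0, -2.0], lam1=1.0, lam2=0.5))
print(naive_elastic_net_coef(2.0, lam1=1.0, lam2=1.0))
print(elastic_net_coef(2.0, lam1=1.0, lam2=1.0))
```

In this simplified setting the naive estimate is shrunk twice (thresholding and division by 1+λ2), and the rescaled estimate keeps only the thresholding.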

## Lambda Value (λ)

* Lambda is the regularization hyperparameter; it controls how strongly the penalty term is weighted against the data loss
* Reasonable values of lambda range between 0 and 0.1
* L2 weight regularization with a very small regularization hyperparameter (e.g. 0.0005, i.e. 5 x 10^−4) may be a good starting point
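
The effect of lambda can be sketched by scanning a logarithmic grid. The example assumes a one-dimensional ridge fit without intercept on toy data with true slope 2; `ridge_slope` is a hypothetical helper.

```python
def ridge_slope(xs, ys, lam):
    """1-D ridge without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data (assumption): y = 2 * x exactly
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Logarithmic grid from 1e-5 up to 1e-1
lambdas = [10.0 ** -k for k in range(5, 0, -1)]
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]

for lam, slope in zip(lambdas, slopes):
    print(f"lambda={lam:g}  slope={slope:.6f}")
```

The fitted slope moves further below 2 as lambda grows. In practice lambda is usually chosen from such a grid by comparing validation performance.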

## Add Regularization to Cost Function

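* The regularized cost is the original loss plus the penalty term scaled by λ: $J(w) = L(w) + \lambda\,\Omega(w)$, where $\Omega(w) = \sum_{i}\left|w_{i}\right|$ for L1 and $\Omega(w) = \sum_{i} w_{i}^{2}$ for L2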
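
A regularized cost is typically formed by adding a λ-scaled penalty on the weights to the data loss. The sketch below assumes a simple linear model with intercept `b0` and slope `b1`, mean squared error as the data loss, and a penalty on the slope only (leaving the intercept unpenalized is a common convention, assumed here). The function name is hypothetical.

```python
def regularized_cost(b0, b1, xs, ys, lam, kind="l2"):
    """Mean squared error plus lam times an L1 or L2 penalty on the slope."""
    n = len(xs)
    mse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / n
    if kind == "l1":
        penalty = abs(b1)
    elif kind == "l2":
        penalty = b1 * b1
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return mse + lam * penalty


# Toy data (assumption): y = 2 * x, so b0 = 0 and b1 = 2 give zero data loss
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(regularized_cost(0.0, 2.0, xs, ys, lam=0.1, kind="l2"))
print(regularized_cost(0.0, 2.0, xs, ys, lam=0.1, kind="l1"))
```

Even at a perfect fit the cost is non-zero, because the penalty charges for the size of the weight. This is what pulls the minimizer toward smaller weights.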

## Dropout

* Goal: avoid overfitting
* Andrew Ng: dropout is nothing more than an adaptive form of L2 regularization, and both methods have similar effects
* Dropout randomly mutes some neurons in the neural network during training, which yields a sparse network and greatly decreases the possibility of overfitting. More importantly, dropout makes the weights spread over the input features instead of focusing on only some of them.
* Sources: https://hackernoon.com/is-the-braess-paradox-related-to-dropout-in-neural-nets-270ecb97cdeb and https://de.m.wikipedia.org/wiki/Dropout_(künstliches_neuronales_Netz)
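
The random muting can be sketched in plain Python. This is a minimal illustration of "inverted" dropout, in which surviving activations are scaled by 1/(1 − rate) during training so that their expected value is unchanged. The function name and the seed are assumptions made for the example.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # rng.random() is uniform in [0, 1), so a unit survives with probability keep
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
acts = [1.0] * 10
dropped = dropout(acts, 0.5, rng)
print(dropped)
```

With `rate=0.0` the activations pass through unchanged, which matches the `dropout = 0.0` setting used later in this notebook to switch dropout off.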

<br>

**Apply L2 and Dropout at the same time?**
* You can, but it is still not clear whether using both at the same time acts synergistically or rather makes things more complicated for no net gain.
* While L2 regularization is implemented with a clearly defined penalty term, dropout requires a random process of "switching off" some units, which cannot be expressed as a simple penalty term and is therefore mostly analyzed experimentally.
* Both try to avoid the network's over-reliance on spurious correlations, which is one of the consequences of overtraining that wreak havoc with generalization. More detailed research is necessary to determine whether and when they can "work together" or rather end up "fighting each other". So far, the results seem to vary from case to case.
* Using both can increase accuracy: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf (Srivastava et al., 2014)

# Neural Network Model

## Import & Prepare Data

In [0]:
import tensorflow as tf

# Fashion-MNIST: 28x28 grayscale images in 10 classes
fashion_mnist = tf.keras.datasets.fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
# Scale pixel values from [0, 255] to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

## Choose Regularization

The Dense layer takes three regularizers, which all default to None. 
* **kernel_regularizer**: Regularizer function applied to the kernel weights matrix.
* **bias_regularizer**: Regularizer function applied to the bias vector.
* **activity_regularizer**: Regularizer function applied to the output of the layer (its "activation")

In [0]:
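# L2 penalty on the kernel weights with lambda = 0.0005
# (passed positionally, since the keyword name differs between Keras versions)
kernel_regularizer = tf.keras.regularizers.l2(0.0005)
bias_regularizer = None
activity_regularizer = None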

**Add Dropout (optional)**

In [0]:
# Fraction of hidden units to drop during training; 0.0 disables dropout
dropout = 0.0

## Define Model & Run

In [0]:
# Flatten each 28x28 image, then one regularized hidden layer and a softmax output
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
model.add(tf.keras.layers.Dense(512, activation='relu',
                                kernel_regularizer=kernel_regularizer,
                                bias_regularizer=bias_regularizer,
                                activity_regularizer=activity_regularizer))
model.add(tf.keras.layers.Dropout(dropout))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Note: the test set is used as validation data here
model.fit(x=x_train, y=y_train, epochs=5, validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples