# Regularization
Regularization is a widely used technique in machine learning for controlling the complexity of a model and thereby improving its ability to generalize.

## What is Regularization?
Regularization is a technique to prevent overfitting by penalizing complex models. The idea is to add a penalty term to the cost function of the model, such that it becomes dependent on two factors:

$$\text{Cost}(h) = \text{Training Error}(h) + \lambda \text{Complexity}(h)$$

$\lambda$ is a hyperparameter (called the regularization coefficient) that controls the tradeoff between the bias and the variance. Higher $\lambda$ will induce a larger penalty on the complexity of the model, and thus will lead to simpler models with higher error on the training set but with smaller variance.
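The effect of $\lambda$ can be seen in a small experiment. The sketch below is illustrative only: it assumes a linear model without an intercept, uses the sum of squared errors as the training error and the sum of squared weights as the complexity measure (one of the measures described later in this article), and fits on synthetic data. The names `ridge_fit` and `sweep` are hypothetical helpers, not part of any library.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 in closed form (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def sweep(lams, X, y):
    """Return (lambda, weight norm, training MSE) for each lambda."""
    results = []
    for lam in lams:
        w = ridge_fit(X, y, lam)
        train_mse = np.mean((X @ w - y) ** 2)
        results.append((lam, np.linalg.norm(w), train_mse))
    return results

# Synthetic data (an assumption made purely for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

for lam, norm, mse in sweep([0.0, 1.0, 10.0, 100.0], X, y):
    print(f"lambda={lam:6.1f}  ||w||_2={norm:.3f}  train MSE={mse:.3f}")
```

As $\lambda$ grows, the weight norm shrinks (a simpler model) while the training error rises, which is the tradeoff described above.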

The complexity of the model can be measured in a variety of ways. For example, in models that consist of a vector of parameters (weights) $w$, such as linear regression or neural networks, we use the size of the parameters (the norm of the vector $w$) as a measure for the model’s complexity. In such models, there are two common types of regularization, depending on the norm of the vector $w$ that we are using:

1. **L1 regularization.** In this case, we use the $L1$ norm of the vector $w$, i.e., the sum of the absolute values of the weights. For example, in linear regression, if we have $m$ features in our data set, then the model will have $m$ parameters (weights) plus a bias term, thus we can write the $L1$ norm of $w$ as:

$$||w||_1 = |w_0| + |w_1| + \cdots + |w_m|$$

2. **L2 regularization.** In this case, we use the $L2$ norm of the vector $w$ (squared), i.e., the sum of the squares of the weights:

$$||w||_2 ^2 = w_0 ^2 + w_1 ^2 + \cdots + w_m ^2 $$
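The two penalties above can be computed directly. The following is a minimal sketch, assuming the weights are stored in a NumPy array and the training error is the mean squared error of a linear model; the function names `l1_penalty`, `l2_penalty`, and `regularized_cost` are hypothetical.

```python
import numpy as np

def l1_penalty(w):
    """L1 norm: sum of the absolute values of the weights."""
    return np.sum(np.abs(w))

def l2_penalty(w):
    """Squared L2 norm: sum of the squares of the weights."""
    return np.sum(w ** 2)

def regularized_cost(w, X, y, lam, penalty):
    """Training error (MSE, an assumption here) plus lam times the penalty."""
    training_error = np.mean((X @ w - y) ** 2)
    return training_error + lam * penalty(w)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(l1_penalty(w))  # 0.5 + 2.0 + 0.0 + 3.0 = 5.5
print(l2_penalty(w))  # 0.25 + 4.0 + 0.0 + 9.0 = 13.25
```

Passing `l1_penalty` or `l2_penalty` as the `penalty` argument of `regularized_cost` switches between the two types of regularization without changing the rest of the cost.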

In general, $L1$ regularization drives weights to zero more aggressively than $L2$. In $L1$ regularization, the penalty pulls each nonzero weight toward 0 at a constant rate (since the derivative of $|w_j|$ is $\text{sign}(w_j)$, which has magnitude 1 for every $w_j \neq 0$), while in $L2$ regularization, the pull becomes weaker as the weights approach 0 (since the derivative of $w_j^2$ is $2w_j$). Hence, $L1$ is more likely to zero out some of the weights, effectively removing their associated features from the model.
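The difference in shrinkage rates can be illustrated by applying gradient steps on the penalty term alone to a single weight. This is a simplified sketch, not a full training loop: the training error is ignored, the learning rate and $\lambda$ are arbitrary, and the $L1$ step is clipped at zero so that it does not overshoot (an assumption added here for numerical tidiness). The helper names `l1_step` and `l2_step` are hypothetical.

```python
import numpy as np

def l1_step(w, lr, lam):
    """Move w toward 0 by a fixed amount lr * lam, stopping at exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

def l2_step(w, lr, lam):
    """Gradient step on lam * w^2: the step size is proportional to w."""
    return w - lr * 2 * lam * w

w_l1 = w_l2 = 1.0
for _ in range(20):
    w_l1 = l1_step(w_l1, lr=0.1, lam=1.0)
    w_l2 = l2_step(w_l2, lr=0.1, lam=1.0)

print(w_l1)  # reaches exactly zero after a finite number of steps
print(w_l2)  # shrinks by a factor of 0.8 per step, so it stays positive
```

The $L1$ weight loses a fixed amount per step and hits zero, whereas the $L2$ weight decays geometrically and only approaches zero.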

Normally, the bias (intercept) $w_0$ is not regularized, since penalizing the intercept can have a dramatic effect on the resulting model. For example, in linear regression, changing $w_0$ shifts the regression hyperplane closer to or farther from the origin along the dimension of the target variable $y$, while setting $w_0$ to exactly 0 forces the hyperplane to pass through the origin.
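One way to leave the intercept unpenalized is to zero out its entry in the penalty matrix. The sketch below assumes a closed-form fit with the squared $L2$ penalty and synthetic data whose targets are offset far from the origin; `ridge_with_intercept` and its `penalize_bias` flag are hypothetical names used only for this illustration.

```python
import numpy as np

def ridge_with_intercept(X, y, lam, penalize_bias=False):
    """Fit weights [w_0, w_1, ..., w_m] with a squared L2 penalty."""
    # Prepend a column of ones so that w_0 acts as the intercept
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = lam * np.eye(Xb.shape[1])
    if not penalize_bias:
        penalty[0, 0] = 0.0  # exclude w_0 from the penalty
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

# Synthetic data with a true intercept of 50 (an illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 50.0 + X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=100)

w_free = ridge_with_intercept(X, y, lam=100.0)
w_pen = ridge_with_intercept(X, y, lam=100.0, penalize_bias=True)
print("intercept, bias not penalized:", w_free[0])
print("intercept, bias penalized:    ", w_pen[0])
```

With the intercept excluded from the penalty, the fitted $w_0$ stays close to the true offset; when it is penalized, $w_0$ is pulled toward 0 and the whole hyperplane is dragged toward the origin.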