# Regularization

Machine learning models need to generalize well to new examples that they have not seen during training. Regularization helps prevent a model from overfitting the training data, and it helps us decide how much weight higher-degree polynomial terms should get.

With regularization, we keep all the features but reduce the magnitude of the parameters $\theta_j$. This works well when we have a lot of features, each of which contributes a bit to predicting $y$.

## Intuition

Suppose we have a hypothesis that overfits the training data:

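$$h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$$

We can penalize the model so that $\theta_3$ and $\theta_4$ become very small, which makes the terms $\theta_3x^3$ and $\theta_4x^4$ approximately $0$. One way to do this is to add a large penalty on those two parameters to the cost function: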

$$J(\theta) = \frac{1}{N}\sum\limits_{i=1}^N(h_\theta (x^{(i)}) - y^{(i)})^2 + 1000\theta_3^2 + 1000\theta_4^2$$

Minimizing this cost drives $\theta_3$ and $\theta_4$ close to zero, so we've effectively turned an overfit model into a roughly quadratic function. It fits the training data less tightly but generalizes better.
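The effect of that penalty can be sketched numerically. The snippet below is only an illustration under stated assumptions: the data set is made up (a quadratic trend plus a small wiggle), the helper `fit` is hypothetical, and it solves for the minimum directly with NumPy's linear solver instead of running gradient descent, purely to keep the example short.

```python
import numpy as np

# Hypothetical data: a quadratic trend plus a small deterministic wiggle.
x = np.linspace(-1.0, 1.0, 15)
y = 1.0 + 2.0 * x - x**2 + 0.1 * np.sin(25.0 * x)

# Design matrix with columns 1, x, x^2, x^3, x^4.
X = np.vander(x, 5, increasing=True)
N = len(x)

def fit(penalty_weights):
    """Minimize (1/N) * sum of squared errors + sum_j penalty_weights[j] * theta_j^2."""
    P = np.diag(penalty_weights)
    # Setting the gradient to zero gives (X^T X + N * P) theta = X^T y.
    return np.linalg.solve(X.T @ X + N * P, X.T @ y)

theta_free = fit(np.zeros(5))
theta_penalized = fit(np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))

print("no penalty:  ", np.round(theta_free, 3))
print("with penalty:", np.round(theta_penalized, 3))
```

With the penalty in place, the last two entries of `theta_penalized` should come out very close to zero, leaving what is effectively a quadratic fit.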

So, small values for the theta parameters produce a simpler hypothesis that is less prone to overfitting. We can express this mathematically:

$$J(\theta) = \frac{1}{N}\sum\limits_{i=1}^N(h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum\limits_{j=1}^n \theta_j^2$$
<p style="text-align:center"> where $\lambda$ is the regularization parameter </p>

If $\lambda$ is large, such as $1000$ in the example above, the penalty dominates the cost, and the only way to minimize it is to make the penalized theta values very small. A large $\lambda$ therefore produces small theta values, which makes the corresponding polynomial terms nearly irrelevant to the model.
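As a rough sketch, the regularized cost can be computed as follows. The function name `regularized_cost` and the tiny data set are illustrative assumptions, and `X` is assumed to carry a leading column of ones so that `theta[0]` plays the role of $\theta_0$.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Mean squared error plus an L2 penalty on theta[1:].

    X is assumed to include a leading column of ones, so theta[0] is the
    intercept and is left out of the penalty (the sum starts at j = 1).
    """
    residuals = X @ theta - y
    data_term = np.mean(residuals ** 2)
    penalty = lam * np.sum(theta[1:] ** 2)
    return data_term + penalty

# Tiny hypothetical data set where y = 1 + 2x exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.array([1.0, 2.0])

print(regularized_cost(theta, X, y, lam=0.0))   # perfect fit, no penalty
print(regularized_cost(theta, X, y, lam=10.0))  # same fit, penalty = 10 * 2**2
```

Even though the fit is perfect in both calls, the second cost is nonzero because the penalty charges for the size of `theta[1]`.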

## Gradient Descent Implementation

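To implement this with gradient descent, we add the derivative of the regularization term to the usual gradient. Writing $J_0(\theta)$ for the unregularized cost (the first term of $J(\theta)$ above), the derivative of $\lambda \sum_{j=1}^n \theta_j^2$ with respect to $\theta_j$ is $2\lambda\theta_j$, so for $j \geq 1$:

$$\theta_j = \theta_j - \alpha \left[\frac{\partial}{\partial \theta_j} J_0(\theta) + 2\lambda\theta_j\right]$$

We can further simplify the above:

$$\theta_j = \theta_j(1 - 2\alpha\lambda) - \alpha \frac{\partial}{\partial \theta_j} J_0(\theta)$$

$(1 - 2\alpha\lambda)$ is typically a number slightly less than $1$. Each step therefore shrinks $\theta_j$ a little and then performs the unregularized update as usual.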

We also need to update $\theta_0$ separately, without the regularization term, because the penalty sum starts at $j = 1$:

$$\theta_0 = \theta_0 - \alpha \frac{2}{N} \sum\limits_{i=1}^N(h_\theta (x^{(i)}) - y^{(i)})x_0^{(i)}$$

This update rule has the same form for both linear and logistic regression, even though the hypothesis $h_\theta(x)$ differs between the two (the constant factor in front of the sum depends on how the cost is scaled).
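A minimal sketch of these updates for linear regression is shown below. It assumes the cost is exactly the $J(\theta)$ defined earlier (mean squared error plus $\lambda \sum_{j \geq 1} \theta_j^2$), so the gradient of the penalty is $2\lambda\theta_j$. The function name `gradient_descent_l2`, the learning rate, the iteration count, and the data are all illustrative choices.

```python
import numpy as np

def gradient_descent_l2(X, y, lam, alpha=0.1, num_iters=5000):
    """Batch gradient descent on mean squared error + lam * sum(theta[1:]**2).

    X is assumed to have a leading column of ones, so theta[0] is theta_0.
    """
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        residuals = X @ theta - y
        grad = (2.0 / N) * (X.T @ residuals)  # gradient of the data term
        grad[1:] += 2.0 * lam * theta[1:]     # penalty gradient; theta_0 is skipped
        theta = theta - alpha * grad
    return theta

# Hypothetical data: y = 1 + 2x on a symmetric grid.
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_plain = gradient_descent_l2(X, y, lam=0.0)
theta_reg = gradient_descent_l2(X, y, lam=1.0)
print("lambda = 0:", theta_plain)
print("lambda = 1:", theta_reg)
```

With `lam=0.0` the slope should converge to about 2. With `lam=1.0` the slope is pulled toward zero while the intercept, which is not penalized, stays near 1.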

## Normal Equation Implementation

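Setting the gradient of the regularized cost to zero gives a closed-form solution. With the cost defined above, where the data term is averaged over $N$ but the penalty is not, the penalty matrix picks up a factor of $N$:

$$\theta = \left(X^TX + N\lambda
\left[
\begin{array}{cccc}
0 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{array}
\right]
\right)^{-1}X^Ty$$

The matrix is the $(n+1) \times (n+1)$ identity with its top-left entry set to $0$, so $\theta_0$ is not penalized. If the penalty is instead scaled as $\frac{\lambda}{N}\sum_{j=1}^n \theta_j^2$, the factor of $N$ cancels and the term is simply $\lambda$ times this matrix.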
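A short sketch of the closed-form solution follows. It assumes the cost $J(\theta)$ defined earlier, in which the squared-error term is averaged over $N$ and the penalty is not. That is why the penalty is multiplied by `N` in the code, and the factor should be dropped if your $\lambda$ already absorbs it. The function name `ridge_normal_equation` and the data are illustrative, and `np.linalg.solve` is used rather than forming the inverse explicitly.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Closed-form minimizer of mean squared error + lam * sum(theta[1:]**2).

    X is assumed to have a leading column of ones, so theta[0] is theta_0.
    """
    N, d = X.shape
    L = np.eye(d)
    L[0, 0] = 0.0  # do not penalize theta_0
    # Solve (X^T X + N * lam * L) theta = X^T y without computing an inverse.
    return np.linalg.solve(X.T @ X + N * lam * L, X.T @ y)

# Hypothetical data: y = 1 + 2x on a symmetric grid.
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

print("lambda = 0:", ridge_normal_equation(X, y, 0.0))
print("lambda = 1:", ridge_normal_equation(X, y, 1.0))
```

Solving the linear system directly is generally more stable than computing the inverse, and for $\lambda > 0$ the added diagonal helps keep the system well conditioned.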