# Week 3: Regularization - Theory

# 1. Setting the Scene

Sometimes the model we create will either <b>underfit</b> or <b>overfit</b> our data. In both cases the model's predictions on new data are less accurate than they could be.

## 1.1. What are overfitting and underfitting?

1. <b>Overfitting</b> means the model matches the training data so closely that it fails to generalise.  It is accurate on the training set but doesn't perform well on new data.


2. <b>Underfitting</b> means the model does not capture the underlying trend even in the training dataset, so it is unlikely to accurately predict values for new data.

### 1.1.1. Linear Regression Example

<img src="../Images\logregressionOverfitting2.png" width=60%>

1. Leftmost figure shows fitting a y = θ<sub>0</sub> + θ<sub>1</sub>x hypothesis to the dataset.  This underfits because the data doesn't really lie on a straight line.


2. Middle figure shows fitting a y = θ<sub>0</sub> + θ<sub>1</sub>x + θ<sub>2</sub>x<sup>2</sup> hypothesis to the dataset (i.e. we added another feature, x<sup>2</sup>).  This seems to fit most of the data points well.


3. Rightmost figure shows fitting a 5<sup>th</sup> order polynomial, y=∑<sup>5</sup><sub>j = 0</sub> θ<sub>j</sub>x<sup>j</sup>.  This fits the dataset perfectly, but it is likely to fail at predicting new values.


As such, we can say (1) is <b>underfitted</b> in the sense that the model cannot capture the underlying trend of the data, and (3) is <b>overfitted</b> because it matches the training data so closely that it is unlikely to predict values for new data accurately.
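The three fits described above can be reproduced with a small sketch. Everything in it is an assumption made for illustration: the data are synthetic (a noisy quadratic, not the data behind the figure), and the names `true_curve`, `x_new` and `errors` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def true_curve(x):
    # Hypothetical "underlying trend": a quadratic chosen only for illustration.
    return 1.0 + 2.0 * x - 0.5 * x**2

# Six evenly spaced training points, so a 5th-order polynomial can pass through all of them.
x_train = np.linspace(0.0, 3.0, 6)
y_train = true_curve(x_train) + rng.normal(0.0, 0.3, x_train.size)

# Fresh points from the same process stand in for "new data".
x_new = rng.uniform(0.0, 3.0, 200)
y_new = true_curve(x_new) + rng.normal(0.0, 0.3, x_new.size)

def mean_squared_error(coeffs, x, y):
    # np.polyval evaluates the fitted polynomial at x.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    errors[degree] = (
        mean_squared_error(coeffs, x_train, y_train),
        mean_squared_error(coeffs, x_new, y_new),
    )
    print(f"degree {degree}: train MSE = {errors[degree][0]:.4f}, "
          f"new-data MSE = {errors[degree][1]:.4f}")
```

The training error can only go down as the degree increases, and it is essentially zero for the 5<sup>th</sup> order fit. The error on new data typically does not follow the same pattern, which is the gap that overfitting describes.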

### 1.1.2. Logistic Regression Example

<img src="../Images\logregressionOverfitting3.png" width=60%>

1. Leftmost underfits.


2. Middle best fits.


3. Rightmost overfits.

## 1.2. When does overfitting vs. underfitting occur?

1. Underfitting (aka <b>high bias</b>) occurs when the hypothesis is <b>too simple</b> or uses <b>too few features</b>.  


2. Overfitting (aka <b>high variance</b>) occurs when the hypothesis is an <b>overcomplicated function</b> with <b>too many features</b>.

## 1.3. How can we avoid overfitting and underfitting?

Answer: <b>Regularization</b>.  Essentially we are aiming to find the optimal number of features and keep those features regularized.  Specifically, to avoid overfitting we have two main options:

1. <b>Reduce</b> the number of features by:

    (a) Manually selecting which features to keep

    OR

    (b) Using a model selection algorithm (studied later in the course).


2. Use <b>Regularization</b> techniques to keep all the features, but <b>reduce the magnitude of parameters θ<sub>j</sub></b> (note: Regularization works well when we have a lot of slightly useful features).

# 2. How does regularisation work?

## 2.1. Regularisation and the Cost Function

### 2.1.1. The basic idea

Say we have two hypothesis functions fitted to the same data set, as in the linear regression example above:

1. the first one is $h_\theta(x)= \theta_0+\theta_1x +\theta_2x^{2}$ and it works well;


2. the second one is $h_\theta(x)= \theta_0 + \theta_1x + \theta_2x^{2}+ \theta_3x^{3} + \theta_4x^{4}$ and it suffers from overfitting.

As the training data is the same for both (1) and (2), the difference must come from the extra terms in (2). It turns out that the two extra parameters $\theta_3$ and $\theta_4$ contribute too much to the curviness of the function.

<b>The core idea:</b> penalize those additional parameters and make them very small, so that they contribute less, or even nothing at all, to the shape of the function.

### 2.1.2. Option 1 - Manually penalise $\theta$

If we manually set $\theta_3 \approx 0$ and $\theta_4 \approx 0$ (in words: set them to very small values, close to zero), hypothesis (2) would effectively reduce to (1), which fits the data well.
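One way to make this concrete, offered as an illustrative sketch rather than as the method behind the figure, is to add a large penalty on just those two parameters to the usual squared-error cost. The constant 1000 is an arbitrary assumption; any sufficiently large number would do:

\begin{align}
\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2
\end{align}

Because any non-negligible $\theta_3$ or $\theta_4$ now makes the cost very large, the minimiser is forced to choose $\theta_3 \approx 0$ and $\theta_4 \approx 0$, leaving an almost quadratic hypothesis.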

### 2.1.2. Option 2 - Use Regularization to penalise $\theta$

Rather than manually penalizing individual parameters, regularization pushes the idea even further: <b>all parameters' values</b> are reduced by some amount, producing a simpler, smoother hypothesis function (this can be shown mathematically).

For example, let us return to the usual linear regression problem of house price prediction. Suppose we have:

* 100 features: $x_1,x_2,⋯,x_{100}$ (size, number of floors, ...), which give us...


* 101 parameters: $\theta_0, \theta_1,⋯, \theta_{100}$ (one per feature, plus the intercept $\theta_0$)


Of course, it is nearly impossible to know in advance which parameters contribute more or less to the overfitting issue. So in regularization we modify the cost function to <b>shrink all parameters by some amount</b>.

The original cost function for linear regression is:

\begin{align}
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
\end{align}

The regularized version adds an extra term, called the regularization term, that shrinks all the parameters (here $n$ is the number of features, so the penalty sums over $\theta_1, \dots, \theta_n$):

\begin{align}
J_{reg}(\theta) = \frac{1}{2m} \bigg[\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2\bigg]
\end{align}

The lambda symbol ($\lambda$) is called the regularization parameter and it controls the trade-off between fitting the training set well and keeping each parameter small. By convention the first parameter $\theta_0$ is not penalised, as the sum in the regularization term starts from 1 (i.e. j=1).
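The regularized cost can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions: `X` is a design matrix whose first column is all ones (the $x_0 = 1$ feature), and the function name `regularized_cost` and the toy numbers are hypothetical.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost; theta[0] is excluded from the penalty."""
    m = len(y)
    residuals = X @ theta - y                # h_theta(x^(i)) - y^(i) for every example
    fit_term = np.sum(residuals ** 2)        # squared-error part
    penalty = lam * np.sum(theta[1:] ** 2)   # sum over j >= 1 only
    return (fit_term + penalty) / (2 * m)

# Toy data lying exactly on y = x, so theta = [0, 1] fits perfectly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

print(regularized_cost(theta, X, y, lam=0.0))  # no penalty: the fit is perfect
print(regularized_cost(theta, X, y, lam=3.0))  # same fit, but theta_1 is now penalised
```

With a perfect fit the first value is zero, and the second is purely the penalty $\frac{\lambda}{2m}\theta_1^2 = \frac{3}{6} \cdot 1 = 0.5$. Changing only $\theta_0$ never changes the penalty.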

### 2.1.3. But you still need to manually select λ... which can result in problems

<p align = "center">
<img src="../Images\Regularization1.png" width=60%>
</p>
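A small experiment can show why the choice of $\lambda$ matters. This sketch makes several assumptions: the data are synthetic, and the minimiser of $J_{reg}$ is found with a closed-form solve (a regularized normal equation, a technique this section does not cover) purely to keep the example short. `fit_regularized` and `train_mse` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(1)

# Six noisy points from a hypothetical quadratic trend.
x = np.linspace(-1.0, 1.0, 6)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)

# Design matrix with columns 1, x, x^2, ..., x^5 (column 0 is the x_0 = 1 feature).
X = np.vander(x, N=6, increasing=True)

def fit_regularized(X, y, lam):
    # Setting the gradient of J_reg to zero gives (X^T X + lam * L) theta = X^T y,
    # where L is the identity with L[0, 0] = 0 so that theta_0 is not penalised.
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def train_mse(theta):
    return float(np.mean((X @ theta - y) ** 2))

lambdas = (0.0, 1.0, 1e6)
fits = {lam: fit_regularized(X, y, lam) for lam in lambdas}
for lam, theta in fits.items():
    size = float(np.sum(theta[1:] ** 2))
    print(f"lambda = {lam:g}: sum of theta_j^2 (j >= 1) = {size:.4g}, "
          f"train MSE = {train_mse(theta):.4g}")
```

With $\lambda = 0$ the 5<sup>th</sup> order polynomial passes through every training point, which is the overfitting case. With a very large $\lambda$ every $\theta_j$ for $j \geq 1$ is pushed towards zero and the hypothesis collapses to a nearly flat line at roughly the mean of $y$, which underfits. A useful $\lambda$ lies somewhere in between.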

# 3. An Example: Regularised Linear Regression

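To build regularised linear regression we have to tweak the original gradient descent algorithm. The original algorithm, shown first in general form and then written out parameter by parameter, is: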

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & 
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j &
\newline \rbrace \end{align*}

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & 
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & 
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & 
\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & 
\cdots 
\newline \rbrace \end{align*}

With the regularisation term added, the partial derivative of the cost function with respect to $\theta_j$ (for $j \geq 1$) becomes:

\begin{align}
\frac{\partial}{\partial \theta_j} J_{reg}(\theta) = \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j
\end{align}
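A quick way to gain confidence in this derivative is to compare it with a finite-difference approximation of $J_{reg}$. The sketch below assumes random synthetic data and a design matrix whose first column is all ones; the function names are hypothetical.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    m = len(y)
    residuals = X @ theta - y
    return (np.sum(residuals ** 2) + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def regularized_gradient(theta, X, y, lam):
    m = len(y)
    grad = X.T @ (X @ theta - y) / m      # (1/m) * sum of residual * x_j, for every j
    grad[1:] += (lam / m) * theta[1:]     # extra (lambda/m) * theta_j, for j >= 1 only
    return grad

def finite_difference_gradient(theta, X, y, lam, eps=1e-6):
    # Central differences: nudge one parameter at a time and watch the cost change.
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (regularized_cost(theta + step, X, y, lam)
                   - regularized_cost(theta - step, X, y, lam)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 3))])  # intercept + 3 features
y = rng.normal(size=10)
theta = rng.normal(size=4)
lam = 5.0

analytic = regularized_gradient(theta, X, y, lam)
numeric = finite_difference_gradient(theta, X, y, lam)
print("largest difference:", np.max(np.abs(analytic - numeric)))
```

The two gradients should agree up to floating-point rounding, including the $\theta_0$ component, which carries no regularisation term.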

Plugging this partial derivative into the gradient descent algorithm gives:

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & 
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & 
\cdots  \newline \; & 
\theta_j := \theta_j - \alpha \bigg[ \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j \bigg] & \newline 
\rbrace \end{align*}

Note that $\theta_0$ is not regularised, so its update rule is unchanged.  For $j \geq 1$, collecting the $\theta_j$ terms gives an equivalent, more compact form:

\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & 
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \newline \; & 
\cdots  \newline \; & 
\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} & \newline 
\rbrace \end{align*}
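The rearranged update translates directly into code: on every step each $\theta_j$ (for $j \geq 1$) is first multiplied by the factor $1 - \alpha\frac{\lambda}{m}$, then the usual gradient step is applied. The following is a minimal sketch, assuming synthetic data, a design matrix with a leading column of ones, and hypothetical names and hyperparameter values (`alpha=0.5`, `n_iters=1000`).

```python
import numpy as np

def regularized_gradient_descent(X, y, lam, alpha=0.5, n_iters=1000):
    m, n_params = X.shape
    # Shrink factor (1 - alpha * lambda / m) for theta_1..theta_n; theta_0 keeps a factor of 1.
    shrink = np.full(n_params, 1.0 - alpha * lam / m)
    shrink[0] = 1.0
    theta = np.zeros(n_params)
    for _ in range(n_iters):
        data_gradient = X.T @ (X @ theta - y) / m
        # Simultaneous update of every theta_j, in the rearranged form.
        theta = theta * shrink - alpha * data_gradient
    return theta

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])           # columns: x_0 = 1, x_1 = x
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2, x.size)    # hypothetical noisy line

theta_plain = regularized_gradient_descent(X, y, lam=0.0)
theta_reg = regularized_gradient_descent(X, y, lam=10.0)

print("shrink factor per step:", 1.0 - 0.5 * 10.0 / len(y))
print("theta without regularisation:", theta_plain)
print("theta with lambda = 10:      ", theta_reg)
```

With $\alpha = 0.5$, $\lambda = 10$ and $m = 20$ the shrink factor is $1 - 0.5 \cdot \frac{10}{20} = 0.75$, so the slope is pulled towards zero on every iteration and ends up smaller in magnitude than in the unregularised fit.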



# 4. Useful Resources

General intro, overview and explanation of Logistic Regression:

- https://www.internalpointers.com/post/introduction-classification-and-logistic-regression
    
Detailed explanation of the Logistic Regression Cost Function:
    
- https://www.internalpointers.com/post/cost-function-logistic-regression

Detailed explanation of Regularisation for Logistic Regression:

- https://www.internalpointers.com/post/problem-overfitting-machine-learning-algorithms 
- http://enhancedatascience.com/2017/07/04/machine-learning-explained-regularization/ 

Detailed general overview:

- https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
- https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102