# Bias / Variance / Irreducible Error

## Bias

This part of the generalisation error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data (high MSE on the training data).


## Variance

This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data (low MSE on the training data, high MSE on the validation/test data).
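As a rough illustration of both failure modes, the sketch below fits polynomials of degree 1, 2 and 15 to noisy quadratic data with NumPy. The data-generating function, noise level, sample sizes and degrees are all assumptions picked for the example, not values from these notes.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_quadratic(n):
    # Hypothetical ground truth: y = 0.5 x^2 + x + 2 plus Gaussian noise.
    x = rng.uniform(-3, 3, n)
    y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, n)
    return x, y

x_train, y_train = make_quadratic(30)
x_val, y_val = make_quadratic(200)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 2, 15):
    # Least-squares polynomial fit of the given degree.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    print(f"degree {degree:>2}: train MSE = {results[degree][0]:.3f}, "
          f"validation MSE = {results[degree][1]:.3f}")
```

Typically the degree-1 model shows a high MSE on both sets (high bias), while the degree-15 model has the lowest training MSE but a validation MSE worse than the degree-2 model (high variance).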



## Sweet Spot
The goal is to find the sweet spot with reasonable bias and variance, preferably low bias and low variance. However, lowering the bias typically raises the variance (overfitting), and vice versa (underfitting).

The ways to optimise this are:
- Regularisation - reduces overfitting by adding a penalty on the weights to the cost function
- Boosting (ensemble)
- Bagging (e.g. random forest)
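To illustrate the bagging bullet, here is a sketch that compares one fully grown decision tree (a high-variance model) with a random forest, which averages many trees trained on bootstrap samples. The synthetic dataset, its size and noise level, and the hyperparameters are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy regression problem.
X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    scores[name] = (train_mse, test_mse)
    print(f"{name:>13}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")
```

The single tree usually memorises the training set (training MSE near zero) and does much worse on the test set; the forest's test MSE is usually lower, which is the variance reduction bagging aims for.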


## Irreducible Error
This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g. fix the data sources or remove outliers).

Bias and variance are always a trade-off driven by model complexity: a more complex model has less bias but more variance, and a simpler model has more bias but less variance.
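The three parts can be written as one sum. Assuming the targets are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ (these symbols are introduced here for illustration), the expected squared error of a fitted model $\hat{f}$ at a point $x$ is:

$$
E\left[(y - \hat{f}(x))^2\right] = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma^2
$$

The first term is the squared bias, the second is the variance and the third is the irreducible error. The expectation is taken over training sets and noise.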

# Regularisation

The scikit-learn documentation has a good article on this: https://scikit-learn.org/stable/modules/linear_model.html

**Note** 

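- The regularisation term is only added to the cost function during training; the trained model should be evaluated with the unregularised metric.
- The y-intercept (bias term $\theta_0$) is NOT regularised.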


# Ridge Regression
Ridge regression (Tikhonov regularisation) is a regularised version of linear regression. It forces the model not only to fit the data but also to keep the weights as small as possible.

The regularisation term is the sum of the squared weights, scaled by a hyperparameter $\lambda$ that controls how much regularisation is applied. If $\lambda = 0$, ridge regression is just ordinary linear regression. The higher the $\lambda$, the flatter the slope, which reduces variance and increases bias. As $\lambda$ grows, the weights approach zero asymptotically.

## Cost function
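**Note** the index $i$ starts from 1 in the regularisation term, so the intercept $\theta_0$ is not penalised.

$J(\theta) = MSE(\theta) + \frac{\lambda}{2}\sum_{i=1}^n \theta_i^2$

where:

$\sum_{i=1}^n \theta_i^2 = \lVert\theta\rVert_2^2$ (the squared $\ell_2$ norm of the weight vector)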
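The cost function can be sketched directly in NumPy. The data, the true weights and the helper names (`mse`, `ridge_cost`, `ridge_fit`) are hypothetical; the closed-form solve follows from setting the gradient of this particular cost to zero and is not necessarily how any library implements ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
y = X @ np.array([3.0, -2.0, 0.5]) + 5.0 + rng.normal(scale=1.0, size=m)

# Prepend a column of ones so theta[0] plays the role of the intercept.
X_b = np.c_[np.ones(m), X]

def mse(theta, X_b, y):
    # Plain, unregularised metric: use this for evaluation.
    return float(np.mean((X_b @ theta - y) ** 2))

def ridge_cost(theta, X_b, y, lam):
    # Training cost: MSE + lam/2 * sum(theta_i^2) for i >= 1 (intercept skipped).
    return mse(theta, X_b, y) + lam / 2 * float(np.sum(theta[1:] ** 2))

def ridge_fit(X_b, y, lam):
    # Zero gradient of ridge_cost gives
    # (X^T X + (m * lam / 2) * A) theta = X^T y,
    # where A is the identity with A[0, 0] = 0 so the intercept is not penalised.
    m = X_b.shape[0]
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0
    return np.linalg.solve(X_b.T @ X_b + (m * lam / 2) * A, X_b.T @ y)

thetas = {lam: ridge_fit(X_b, y, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, theta in thetas.items():
    print(f"lambda={lam:>5}: intercept={theta[0]:.3f}, "
          f"||weights||={np.linalg.norm(theta[1:]):.3f}, "
          f"training cost={ridge_cost(theta, X_b, y, lam):.3f}, "
          f"plain MSE={mse(theta, X_b, y):.3f}")
```

As $\lambda$ increases, the norm of the penalised weights shrinks while the plain MSE on the training data rises, which is the bias for variance exchange in action.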

In [15]:
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic linear data; no random_state is set, so the numbers change on every run.
X, y, coef = make_regression(coef=True, noise=10, bias=5)

# sklearn calls the regularisation strength "alpha" (lambda in the notes above).
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)

theta_1 = ridge_reg.coef_
bias_1 = ridge_reg.intercept_

# Same data, stronger regularisation.
ridge_reg = Ridge(alpha=4, solver="cholesky")
ridge_reg.fit(X, y)

theta_2 = ridge_reg.coef_
bias_2 = ridge_reg.intercept_

In [16]:
import numpy as np

print(np.mean(theta_1), np.mean(theta_2))
print(bias_1, bias_2)

3.944275448163377 3.4133960363438054
5.4007930428875 2.4395440048724204


## Lasso Regression
Least Absolute Shrinkage and Selection Operator (Lasso) regression uses the $\ell_1$ norm of the weights as the regularisation term instead of the squared $\ell_2$ norm.

### Cost function
$J(\theta) = MSE(\theta) + \lambda\sum_{i=1}^n \lvert\theta_i\rvert$
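A practical consequence of the $\ell_1$ penalty is that some weights end up exactly zero. The sketch below compares sklearn's `Lasso` and `Ridge` on synthetic data where only 5 of 20 features are informative; the dataset settings and `alpha` values are assumptions for illustration. Note that sklearn's `Lasso` scales the squared-error term by `1 / (2 * n_samples)` rather than using the plain MSE, so its `alpha` is not numerically identical to the $\lambda$ above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Hypothetical dataset: 20 features, only 5 of which influence y.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count the coefficients that were driven to exactly zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso: {n_zero_lasso} of {lasso.coef_.size} weights are exactly zero")
print(f"Ridge: {n_zero_ridge} of {ridge.coef_.size} weights are exactly zero")
```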


### Elastic Net
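$J(\theta) = MSE(\theta) + r\lambda\sum_{i=1}^n \lvert\theta_i\rvert + \frac{1-r}{2}\lambda\sum_{i=1}^n \theta_i^2$

where $r$ is the mix ratio: $r = 0$ gives ridge regression and $r = 1$ gives lasso regression.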
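In sklearn, `ElasticNet` exposes the mix ratio $r$ as `l1_ratio` and the overall strength as `alpha`. The sketch below tries a few ratios on synthetic data; the dataset settings and the chosen values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Hypothetical dataset: 20 features, only 5 of which influence y.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

coefs = {}
for r in (0.1, 0.5, 1.0):
    model = ElasticNet(alpha=1.0, l1_ratio=r).fit(X, y)
    coefs[r] = model.coef_
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"l1_ratio={r}: {n_zero} zero weights, "
          f"||weights||={np.linalg.norm(model.coef_):.2f}")

# With l1_ratio=1.0 the L2 part vanishes and the model reduces to Lasso.
lasso_coef = Lasso(alpha=1.0).fit(X, y).coef_
print("matches Lasso at l1_ratio=1.0:", np.allclose(coefs[1.0], lasso_coef))
```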

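Note that ridge regression is a good default, but it only shrinks the weights and does not remove features. If only a few features are likely to matter, Lasso or Elastic Net is the better choice, because the $\ell_1$ penalty tends to drive the weights of the least important features to exactly zero, which amounts to automatic feature selection.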