# Bias-variance trade-off

This notebook consists of 4 parts:
- A high-level summary of the bias-variance trade-off;
- A theoretical break-down of total model error into bias, variance, and noise;
- A practical example computation of each of these terms using polynomial regression;
- (Maybe) Examples to reduce bias or variance (Ridge, bagging)

Sources
- https://gist.github.com/fabgoos/6788818
- http://scott.fortmann-roe.com/docs/BiasVariance.html
- Book: The Elements of Statistical Learning (Hastie, Tibshirani, Friedman)

## Summary

We build a model using a specific algorithm on a given dataset.
The bias-variance trade-off describes where the model's error comes from:
the assumptions of the algorithm (bias) or the particular dataset it was trained on (variance).

The model's total error can be decomposed into variance, bias, and noise.
- **Variance** quantifies how much our model, built on this particular dataset,
would change if we used another dataset with the same algorithm.
High model variance usually means the model is overfitting.
- **Bias** quantifies how far the average prediction, taken over models built on many different datasets,
is from the true label. It's about how wrong our algorithm is.
High model bias usually means the model is underfitting.
- **Noise** is intrinsic to the domain.
Noise appears when 2 identical data points have different labels.
Noise can be measured, but it does not depend on our model, so no choice of model can reduce it.
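
To make the noise term concrete, here is a minimal sketch of measuring it by labelling the same input many times. The function `true_fn`, the point `x0`, and the value of `noise_sd` are assumptions chosen for illustration; they do not come from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical noise-free relationship, chosen only for illustration.
    return np.sin(2 * np.pi * x)

noise_sd = 0.3  # assumed standard deviation of the label noise

# One input measured many times: identical features, different labels.
x0 = 0.25
labels = true_fn(x0) + rng.normal(0.0, noise_sd, size=10_000)
print("first two labels at x0:", labels[:2])

# The spread of the labels at a fixed input estimates the noise variance.
noise_estimate = labels.var()
print("estimated noise variance:", noise_estimate)
print("assumed noise variance:  ", noise_sd ** 2)
```

No model appears anywhere in this computation, which is why the noise term cannot be reduced by changing the model.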



## Theoretical breakdown

First, let's introduce a lemma: $E[x^2] = E[(x - E[x])^2] + E[x]^2$

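Proof: 
$
E[(x - E[x])^2] 
\\= E[x^2 - 2xE[x] + E[x]^2] 
\\= E[x^2] - 2E[xE[x]] + E[E[x]^2]
\\= E[x^2] - 2E[x]^2 + E[x]^2
\\= E[x^2] - E[x]^2
$

The fourth line uses the fact that $E[x]$ is a constant. 
Moving $E[x]^2$ to the left-hand side gives the lemma.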
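
As a quick sanity check, the lemma can be verified numerically on a sample. This is a minimal sketch: the exponential distribution and the sample size are arbitrary assumptions, and expectations are replaced by sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
# Any distribution works; the exponential is an arbitrary choice.
x = rng.exponential(scale=2.0, size=100_000)

lhs = np.mean(x ** 2)                          # E[x^2]
variance_term = np.mean((x - x.mean()) ** 2)   # E[(x - E[x])^2]
rhs = variance_term + x.mean() ** 2            # ... + E[x]^2

print(lhs, rhs)  # the two values agree up to floating-point error
```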

Now to the bias-variance breakdown. 
We built our model and want to predict the value of a newly-seen data point. 
We define: 
- $x$ is the feature vector of the new data point.
- $y$ is the dependent variable (DV) of the new data point as we would observe it, noise included.
- $\epsilon$ is the noise in the DV $y$. We assume that $\epsilon$ is normally distributed around 0 and independent of the dataset used to build the model.
- $z$ is the true DV of the new data point, without noise, such that $y = z + \epsilon$. 
- $\hat{y}$ is the prediction of our model, $h$, for $y$: $\hat{y} = h(x)$.

Let's look at the expected value, over all the possible models we could build 
(one per possible training dataset) and over the noise, 
of the squared error of our prediction for this new data point. 

$
E[SE]
\\= E[(\hat{y}-y)^2]
\\= E[\hat{y}^2 - 2y\hat{y} + y^2]
\\= E[\hat{y}^2] - 2E[\hat{y}y] + E[y^2]
$

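Here we apply the lemma above to $E[\hat{y}^2]$ and $E[y^2]$.
Because the noise in the new data point is independent of the dataset the model was built on, 
$\hat{y}$ and $y$ are independent, so $E[\hat{y}y] = E[\hat{y}]E[y]$.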

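$
E[SE] 
\\= E[ (\hat{y}-E[\hat{y}])^2] + E[\hat{y}]^2 - 2E[y]E[\hat{y}] + E[(y-E[y])^2] + E[y]^2
\\= E[ (\hat{y}-E[\hat{y}])^2] + ( E[\hat{y}]-E[y])^2 + E[\epsilon^2]
\\= \text{variance} + \text{bias}^2 + \text{noise}
$

The last step uses $E[y] = z$ (the noise has mean 0), so $y - E[y] = \epsilon$ and the bias is $E[\hat{y}] - z$.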

We just broke down the expected prediction error into variance, bias, and noise.
The "trade-off" part is: the noise is fixed, so at a given total error, 
decreasing bias means increasing variance, and vice-versa.
In practice, more flexible models tend to have lower bias and higher variance.
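
The decomposition can also be checked by simulation. The sketch below makes several assumptions that are not part of the derivation: a hypothetical true function `true_fn`, a fixed grid of training inputs (only the label noise is redrawn for each dataset), a single test point `x0`, and polynomial models fitted with `np.polyfit`. For each polynomial degree it estimates the three terms and compares their sum with a direct estimate of the expected squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical noise-free relationship z = true_fn(x).
    return np.sin(2 * np.pi * x)

noise_sd = 0.3                       # assumed noise standard deviation
x_train = np.linspace(0.0, 1.0, 30)  # fixed design: inputs stay the same
n_datasets = 2000                    # number of simulated datasets
x0 = 0.3                             # the new data point
z = true_fn(x0)                      # its true, noise-free DV

def decompose(degree):
    """Estimate variance, bias^2, noise, and E[SE] at x0 for one degree."""
    preds = np.empty(n_datasets)
    sq_errors = np.empty(n_datasets)
    for i in range(n_datasets):
        # A new dataset: same inputs, freshly drawn label noise.
        y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x0)
        # A noisy observation of the new point, independent of the dataset.
        y_new = z + rng.normal(0.0, noise_sd)
        sq_errors[i] = (preds[i] - y_new) ** 2
    variance = preds.var()              # E[(y_hat - E[y_hat])^2]
    bias_sq = (preds.mean() - z) ** 2   # (E[y_hat] - E[y])^2
    noise = noise_sd ** 2               # E[eps^2]
    return variance, bias_sq, noise, sq_errors.mean()

results = {degree: decompose(degree) for degree in (1, 3, 9)}
for degree, (variance, bias_sq, noise, mse) in results.items():
    total = variance + bias_sq + noise
    print(f"degree {degree}: variance={variance:.4f} bias^2={bias_sq:.4f} "
          f"noise={noise:.4f} sum={total:.4f} E[SE]={mse:.4f}")
```

Under these assumptions, the sum of the three terms should match the directly estimated error up to sampling error, and raising the degree should lower the bias term while raising the variance term.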