# Bias-variance decomposition

It is instructive to analyze the generalization error of a model in terms of bias and variance from first principles. 
Suppose we have a regression problem in which we wish to predict a scalar response (output) $y$ given input variables $x=(x_1,\ldots,x_p)$, also often called features, covariates or predictors. We assume there is some relationship between $x$ and $y$ which we can write as

$$
y = f(x) + \epsilon
$$

where $f(x)$ is a fixed but unknown function of $x$ and $\epsilon$ is an error term that is independent of $x$ with $\mathbb{E}[\epsilon]=0$. 

Here $f(x)$ represents the systematic information the input features provide about the output $y$, and it is the function we wish to estimate with our models.

Why is there an error term $\epsilon$? The error term quantifies errors such as... The presence of this error term implies that there will in general be some error in all our models, even if we manage to learn $f(x)$ perfectly.
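To make the last point concrete, here is a minimal simulation sketch. The choice of $f$ (a sine curve), the Gaussian noise, the noise level `sigma` and the sample size are all assumptions made purely for illustration; none of them come from the text above. The sketch predicts with the true $f$ itself and shows that the mean squared error still does not go to zero.

```python
import numpy as np

rng = np.random.default_rng(0)


def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)


sigma = 0.5   # assumed noise standard deviation
n = 200_000   # assumed number of simulated observations

x = rng.uniform(0.0, 1.0, size=n)
eps = rng.normal(0.0, sigma, size=n)  # zero-mean noise, drawn independently of x
y = f(x) + eps

# Predict with the true function itself, i.e. a "perfect" model.
mse_perfect = np.mean((y - f(x)) ** 2)

print(f"MSE of the perfect model: {mse_perfect:.4f}")
print(f"Noise variance sigma^2:   {sigma ** 2:.4f}")
```

Under these assumptions the two printed numbers should be close to each other: the error that remains when $f$ is known exactly is the variance of $\epsilon$.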

If we choose our error function to be the squared error, $\mathcal{L}(y, \hat{y})=(y-\hat{y})^2$, where $\hat{y}=\hat{f}(x)$ is the model's prediction, we can decompose the *expected* generalization error as follows:

$$
\begin{align*}
\mathbb{E}[(y-\hat{y})^2] &= \mathbb{E}[(f(x)+\epsilon-\hat{f}(x))^2] \\
&= \mathbb{E}[(f(x)-\hat{f}(x))^2 + 2\epsilon\bigl(f(x)-\hat{f}(x)\bigr) + \epsilon^2] \\
&= \mathbb{E}[(f(x)-\hat{f}(x))^2] + \mathbb{E}[2\epsilon\bigl(f(x)-\hat{f}(x)\bigr)] + \mathbb{E}[\epsilon^2] \quad  \textrm{Apply linearity of $\mathbb{E}$} \\
&= (f(x)-\hat{f}(x))^2 + 2\bigl(f(x)-\hat{f}(x)\bigr)\mathbb{E}[\epsilon] + \mathbb{E}[\epsilon^2] \quad \textrm{For a fixed $x$ and a fixed fitted model, $f(x)$ and $\hat{f}(x)$ are not random variables, so $\mathbb{E}$ has no effect on them} \\
&= (f(x)-\hat{f}(x))^2 + \mathbb{E}[\epsilon^2]  \quad  \textrm{Recall that $\mathbb{E}[\epsilon]=0$ by assumption} \\
&= \underbrace{(f(x)-\hat{f}(x))^2}_{\substack{\text{Reducible} \\ \text{error}}} + \underbrace{\textrm{Var}[\epsilon]}_{\substack{\text{Irreducible} \\ \text{error}}}  \quad  \textrm{Recall that $\textrm{Var}[Z] = \mathbb{E}[Z^2] -  (\mathbb{E}[Z])^2$ for a random variable $Z$ and $\mathbb{E}[\epsilon]=0$ }
\end{align*}
$$
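The split into a reducible and an irreducible term can be checked numerically. The sketch below is only an illustration under stated assumptions: the input `x0`, the true value `f_x0`, the deliberately wrong prediction `f_hat_x0` and the Gaussian noise with standard deviation `sigma` are all hypothetical choices. It holds $x$ and $\hat{f}(x)$ fixed, as the derivation does, and averages the squared error over many draws of $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 0.5   # assumed noise standard deviation
n = 200_000   # assumed number of noise draws

x0 = 0.3                            # a single fixed input (hypothetical)
f_x0 = np.sin(2.0 * np.pi * x0)     # hypothetical true value f(x0)
f_hat_x0 = f_x0 + 0.2               # hypothetical fixed prediction, off by 0.2

# Only epsilon is random here: x0 and the prediction are held fixed.
y = f_x0 + rng.normal(0.0, sigma, size=n)

# Left-hand side: Monte Carlo estimate of E[(y - y_hat)^2].
lhs = np.mean((y - f_hat_x0) ** 2)

# Right-hand side: reducible error plus irreducible error.
reducible = (f_x0 - f_hat_x0) ** 2
irreducible = sigma ** 2
rhs = reducible + irreducible

print(f"estimated E[(y - y_hat)^2]: {lhs:.4f}")
print(f"reducible + irreducible:    {rhs:.4f}")
```

Improving the model shrinks only `reducible`; `irreducible` stays at the noise variance no matter how good the prediction is.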

$$
(f(x)-\hat{f}(x))^2
= f^2(x) - 2f(x)\hat{f}(x) + \hat{f}^2(x)
$$