## Error Analysis
### The bias-variance tradeoff -- linear regression case
In machine learning we often hear the terms "overfitting" and "underfitting". But what do they really mean? In this section we briefly explain these concepts through a mathematical derivation.
Assume we want to come up with a machine learning model $\hat{f}(x)$ that maps each data point $x$ in our data set to some label $y$. A natural way to measure the performance of the model is the mean squared error (MSE).

$$MSE = \operatorname{\mathbb{E}_{(x,y)}}\lvert \hat{f}(x) - y \rvert^2$$
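
In practice the expectation is unknown and is replaced by an average over a sample. A minimal sketch of that empirical MSE, assuming NumPy is available (the helper name `mse` and the toy numbers are illustrative, not from the original text):

```python
import numpy as np

def mse(y_pred, y_true):
    """Empirical MSE: the sample average of |f_hat(x) - y|^2."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Toy predictions and labels (made up for illustration).
example_mse = mse([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(example_mse)
```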

**Our intuition**
- Overfitting: the model does not generalize well to other datasets.
- Underfitting: the model was not trained well enough, and its predictions are far from the target $y$.
- Other reasons: the model's environment or the dataset itself is noisy.

**Intuition formalized:**
The data can be modeled as:
$$y_i = f(x_i) + \epsilon_i$$
where the noise satisfies $\operatorname{\mathbb{E}}(\epsilon_i) = 0$ and $Var(\epsilon_i) = \sigma^2$.
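
To make this data model concrete, here is a small simulation sketch. The true function `f`, the value of `sigma`, and the Gaussian shape of the noise are all assumptions made for illustration; the text above only requires zero mean and variance $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true function, chosen only for illustration.
    return 2.0 * x + 1.0

sigma = 0.5          # assumed noise standard deviation
n = 100_000          # number of simulated observations

x = rng.uniform(-1.0, 1.0, size=n)
eps = rng.normal(loc=0.0, scale=sigma, size=n)  # Gaussian noise is an assumption
y = f(x) + eps                                  # y_i = f(x_i) + eps_i

# The sample mean and variance of the noise should be close to 0 and sigma^2.
noise_mean = float(eps.mean())
noise_var = float(eps.var())
print(noise_mean, noise_var)
```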

The key observation here is that $\hat{f}(x)$ is a random variable: it is fit to training labels that contain the error term, and since the error term is itself a random variable, so is $\hat{f}(x)$.

$$
\begin{split}
MSE &= \operatorname{\mathbb{E}}\big[(y - \hat{f}(x))^2\big] \\
    &= \operatorname{\mathbb{E}}\big[(\epsilon + f(x) - \hat{f}(x))^2\big] \\
    &= \operatorname{\mathbb{E}}(\epsilon^2) + \operatorname{\mathbb{E}}\big[(f(x)-\hat{f}(x))^2\big] \\
    &= \sigma^2 + \big(\operatorname{\mathbb{E}}[f(x)-\hat{f}(x)]\big)^2 + Var(f(x)-\hat{f}(x)) \\
    &= \sigma^2 + (Bias\;\hat{f}(x))^2 + Var(\hat{f}(x))
\end{split}
$$
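In the third line, the cross term $2\operatorname{\mathbb{E}}\big[\epsilon\,(f(x)-\hat{f}(x))\big]$ has been dropped. Assume the noise $\epsilon$ in the observation being predicted is independent of $\hat{f}$. Then the cross term factors into $2\operatorname{\mathbb{E}}(\epsilon)\operatorname{\mathbb{E}}\big[f(x)-\hat{f}(x)\big]$, which is zero since $\operatorname{\mathbb{E}}(\epsilon) = 0$. The fourth line uses $\operatorname{\mathbb{E}}(Z^2) = (\operatorname{\mathbb{E}}Z)^2 + Var(Z)$ with $Z = f(x)-\hat{f}(x)$. The last line uses the fact that $f(x)$ is deterministic, so $Var(f(x)-\hat{f}(x)) = Var(\hat{f}(x))$, together with the definition $Bias\;\hat{f}(x) = \operatorname{\mathbb{E}}\hat{f}(x) - f(x)$.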

- High Bias <==> Underfitting
- High Variance <==> Overfitting
- Large $\sigma^2$ <==> Noisy data
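
The decomposition can be checked numerically. The sketch below is a Monte Carlo illustration under assumed settings: a hypothetical true function $f(x) = x^2$, Gaussian noise, a straight-line fit via `np.polyfit`, and a single test point `x0`. None of these choices come from the text above. It refits the model on many independent training sets and compares the directly estimated MSE at `x0` with $\sigma^2 + Bias^2 + Var$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true function (quadratic), so a straight line underfits it.
    return x ** 2

sigma = 0.3       # assumed noise standard deviation
n_train = 30      # size of each simulated training set
n_trials = 5000   # number of independent training sets
x0 = 0.9          # test point at which the error is decomposed

preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(-1.0, 1.0, n_train)
    y = f(x) + rng.normal(0.0, sigma, n_train)
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    preds[t] = slope * x0 + intercept           # f_hat(x0) for this training set

# Fresh noisy labels at x0, independent of every fitted model.
y0 = f(x0) + rng.normal(0.0, sigma, n_trials)

mse_hat = float(np.mean((y0 - preds) ** 2))       # direct estimate of the MSE
bias_sq = float((preds.mean() - f(x0)) ** 2)      # (E f_hat(x0) - f(x0))^2
variance = float(preds.var())                     # Var(f_hat(x0))
decomposed = sigma ** 2 + bias_sq + variance

print(f"MSE={mse_hat:.4f}  sigma^2+bias^2+var={decomposed:.4f}")
```

Up to Monte Carlo error the two printed numbers should agree. In this setup the bias term dominates, which matches the underfitting reading of high bias.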

Another key point is that, most of the time, reducing one will increase the other, so there is a tradeoff between bias and variance. Also, this derivation is done solely in the **linear regression** setting. For classification, there is no real agreement on which formalism is the right or most useful one.
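
As a rough illustration of the tradeoff, the sketch below fits polynomials of increasing degree (still linear regression in the coefficients) and estimates the squared bias and the variance averaged over a grid of test points. The true function $\sin(2\pi x)$, the noise level, the fixed training grid, and the degrees 1, 3, 7 are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical true function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

sigma = 0.3                              # assumed noise standard deviation
x_train = np.linspace(0.0, 1.0, 25)      # fixed training inputs
x_test = np.linspace(0.0, 1.0, 50)       # points where bias and variance are measured
n_trials = 500                           # number of simulated training sets

def bias_variance(degree):
    """Estimate average squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        y_train = f(x_train) + rng.normal(0.0, sigma, x_train.size)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        preds[t] = np.polyval(coeffs, x_test)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return float(bias_sq), float(variance)

results = {d: bias_variance(d) for d in (1, 3, 7)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}, variance={v:.4f}")
```

In this setup the estimated squared bias shrinks as the degree grows while the variance rises, which is the tradeoff described above. Other functions, noise levels, or sample sizes may show a less clean pattern.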