# Chapter 2: Statistical Learning

## General form
$$
Y = f(X) + \epsilon 
$$
We estimate $f$ for two purposes: (1) prediction and (2) inference.

## Parametric Methods and Non-parametric Methods
Parametric methods
- make an assumption about the functional form of $f$
- fit (train) the model on the data to estimate its parameters

Non-parametric methods
- make no explicit assumption about the functional form of $f$, but require a large number of observations to estimate it accurately
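
The contrast between the two approaches can be sketched on simulated data. This is a minimal illustration, not a method taken from the chapter: the data-generating function $f(x) = 1 + 2x$, the noise level, and the helper names `fit_linear` and `knn_predict` are all assumptions made for the example. The parametric fit reduces the problem to estimating two numbers, while the non-parametric fit (a nearest-neighbour average) keeps the whole training set and assumes no form for $f$.

```python
import random

def fit_linear(xs, ys):
    """Parametric: assume f(x) = b0 + b1*x and estimate b0, b1 by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def knn_predict(xs, ys, x0, k):
    """Non-parametric: average the responses of the k training points nearest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

random.seed(0)
# Hypothetical data: a truly linear f(x) = 1 + 2x plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

b0, b1 = fit_linear(xs, ys)
linear_pred = b0 + b1 * 5.0
knn_pred = knn_predict(xs, ys, 5.0, k=10)
print(f"parametric: b0={b0:.2f}, b1={b1:.2f}, prediction at x=5: {linear_pred:.2f}")
print(f"non-parametric (k=10) prediction at x=5: {knn_pred:.2f}")
```

Because the assumed linear form happens to be correct here, the parametric fit should recover the coefficients closely. If the true $f$ were strongly non-linear, the same two-parameter model would be badly biased, while the nearest-neighbour average would still adapt given enough observations.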

*There is a trade-off between prediction accuracy and model interpretability.*

## Supervised and Unsupervised Learning
- **Supervised learning**: for each observation of the predictor measurements $x_i$, there is an associated response measurement $y_i$.
- **Unsupervised learning**: for each observation there is a vector of measurements $x_i$, but no associated response $y_i$.
- **Semi-supervised learning**: only some of the observations have a response $y_i$; the rest have predictor measurements only.

## Assess Model Accuracy
### Quality of Fit
$$
MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{f}(x_i))^2
$$

- We care about test MSE more than training MSE.
- **Overfitting** happens when a method yields a small training MSE but a large test MSE.
- **Cross-validation**: a method for estimating the test MSE using the training data.
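
The three points above can be illustrated with a small simulation. This is a sketch under stated assumptions: the data come from a hypothetical $y = \sin(x) + \epsilon$, the learner is a K-nearest-neighbours average (chosen only because it is easy to write from scratch), and cross-validation is done with 5 strided folds. The helper names `knn_predict`, `mse`, `cv_mse`, and `simulate` are made up for this example.

```python
import math
import random

def knn_predict(train, x0, k):
    """Predict at x0 by averaging the responses of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """MSE on `data` of the k-NN fit built from `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def cv_mse(train, k, folds=5):
    """Estimate the test MSE with cross-validation, using the training data only."""
    errors = []
    for i in range(folds):
        held_out = train[i::folds]  # every folds-th observation forms one fold
        rest = [p for j, p in enumerate(train) if j % folds != i]
        errors.append(mse(rest, held_out, k))
    return sum(errors) / folds

def simulate(n, sigma=0.5):
    """Hypothetical data: y = sin(x) + Gaussian noise."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, sigma)) for x in xs]

random.seed(1)
train, test = simulate(200), simulate(500)
results = {}
for k in (1, 10, 100):
    results[k] = (mse(train, train, k), cv_mse(train, k), mse(train, test, k))
    print(f"K={k:>3}  train={results[k][0]:.3f}  cv={results[k][1]:.3f}  test={results[k][2]:.3f}")
```

With `K=1` each training point is its own nearest neighbour, so the training MSE is exactly zero while the test MSE is not: a direct instance of overfitting. The cross-validated estimate is computed without touching the test set, yet it should track the test MSE much more closely than the training MSE does.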

### Bias-Variance Trade-off
Decomposition of the expected test MSE:
$$
E\left[\left(y_0-\hat{f}(x_0)\right)^2\right] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)
$$

- **Variance** refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set, while **bias** refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, with a much simpler model.
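
Both quantities can be estimated by simulation, because on simulated data we can draw as many training sets as we like. The sketch below is an illustration only: the true function $\sin(x)$, the noise level `SIGMA`, the test point `X0`, and the K-nearest-neighbours estimator are assumptions chosen for the example. It refits the estimator on many fresh training sets, then compares the sum of the three terms with a directly simulated test MSE at `X0`.

```python
import math
import random
import statistics

SIGMA = 0.3          # noise standard deviation (assumed for this example)
X0 = math.pi / 2     # test point, chosen where sin(x) peaks

def true_f(x):
    return math.sin(x)

def knn_at_x0(k, n=100):
    """Draw a fresh training set and return the k-NN estimate at X0."""
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    train = [(x, true_f(x) + random.gauss(0, SIGMA)) for x in xs]
    nearest = sorted(train, key=lambda p: abs(p[0] - X0))[:k]
    return sum(y for _, y in nearest) / k

def decompose(k, reps=1000):
    """Monte Carlo estimates of variance, squared bias, and expected test MSE at X0."""
    preds = [knn_at_x0(k) for _ in range(reps)]
    variance = statistics.pvariance(preds)
    bias_sq = (statistics.fmean(preds) - true_f(X0)) ** 2
    # One fresh test response y0 = f(X0) + noise per training set.
    test_mse = statistics.fmean(
        (true_f(X0) + random.gauss(0, SIGMA) - p) ** 2 for p in preds
    )
    return variance, bias_sq, test_mse

random.seed(2)
results = {k: decompose(k) for k in (1, 25)}
for k, (variance, bias_sq, test_mse) in results.items():
    total = variance + bias_sq + SIGMA ** 2
    print(f"K={k:>2}  var={variance:.4f}  bias^2={bias_sq:.4f}  "
          f"var+bias^2+sigma^2={total:.4f}  simulated test MSE={test_mse:.4f}")
```

The flexible fit (`K=1`) should show high variance and almost no bias, while the smoother fit (`K=25`) trades a much lower variance for a visible bias, since averaging over a wide window flattens the peak of the sine curve. In both cases the sum of the three terms should agree with the simulated test MSE up to Monte Carlo error.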

### Classification Setting
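The **training error rate** $\frac{1}{n}\sum_{i=1}^{n}I(y_i\neq \hat{y}_i)$ is the usual measure of a classifier's accuracy on the training data, where $I(\cdot)$ is an indicator that equals 1 when $y_i \neq \hat{y}_i$ and 0 otherwise. As in regression, what we really want is the classifier with the smallest **test error rate**.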

The **Bayes classifier** assigns a test observation with predictor vector $x_0$ to the class $j$ for which $Pr(Y=j|X=x_0)$ is largest. It minimizes the test error rate on average; the resulting error rate is called the Bayes error rate.
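
When the conditional probabilities are known, which is only the case for simulated data, the rule can be written down directly. The sketch below assumes a hypothetical two-class problem in which $Pr(Y=1|X=x)$ follows a logistic curve; the function `prob_class1` and the shifted-threshold comparison rule are inventions for this example.

```python
import math
import random

def prob_class1(x):
    """Hypothetical, fully known conditional probability Pr(Y=1 | X=x)."""
    return 1 / (1 + math.exp(-2 * x))

def bayes_classify(x):
    """Assign the class with the largest conditional probability."""
    probs = {0: 1 - prob_class1(x), 1: prob_class1(x)}
    return max(probs, key=probs.get)

random.seed(3)
xs = [random.uniform(-3, 3) for _ in range(20000)]
ys = [1 if random.random() < prob_class1(x) else 0 for x in xs]

bayes_error = sum(bayes_classify(x) != y for x, y in zip(xs, ys)) / len(xs)
# Any other rule, such as a shifted threshold, should not do better on average.
shifted_error = sum((1 if x > 1.0 else 0) != y for x, y in zip(xs, ys)) / len(xs)
print(f"Bayes classifier test error: {bayes_error:.3f}")
print(f"shifted-threshold rule test error: {shifted_error:.3f}")
```

Even the Bayes classifier makes mistakes here, because the classes overlap: its error rate is the irreducible floor for this problem, analogous to $Var(\epsilon)$ in regression.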

In practice, we do not know the conditional distribution of $Y$ given $X$, so we must estimate it (e.g., with the **K-nearest neighbors classifier**).
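
A K-nearest neighbors classifier estimates $Pr(Y=j|X=x_0)$ by the fraction of the $K$ training points closest to $x_0$ that belong to class $j$, then picks the class with the largest fraction. The following is a minimal from-scratch sketch, assuming Euclidean distance and a hypothetical two-class data set of Gaussian clusters; `knn_classify` and `sample` are names made up for this example.

```python
import math
import random
from collections import Counter

def knn_classify(train, x0, k):
    """Majority vote among the k training points nearest to x0 (Euclidean distance)."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x0))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def sample(n):
    """Hypothetical data: two Gaussian classes centred at (0, 0) and (2, 2)."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        centre = 2.0 * label
        point = (random.gauss(centre, 1), random.gauss(centre, 1))
        data.append((point, label))
    return data

random.seed(4)
train, test = sample(200), sample(200)
test_error = sum(knn_classify(train, x, k=15) != y for x, y in test) / len(test)
print(f"K=15 test error rate: {test_error:.3f}")
```

The choice `k=15` is arbitrary. As with regression, a very small $K$ gives a flexible, high-variance decision boundary and a very large $K$ gives a rigid, high-bias one, so $K$ is usually chosen by estimating the test error rate.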



# Notes
## Derivation of the Bias-Variance Trade-off

Consider a regression problem where the data is generated as:

$$
y = f(x) + \varepsilon, \quad \text{with } \varepsilon \sim \mathcal{N}(0, \sigma^2)
$$

Our goal is to use a model $\hat{f}(x)$, fitted on a training set $\mathcal{D}$, to predict $y$ with minimal prediction error.

### 1. Define the Expected Mean Squared Error (MSE)

$$
\text{MSE}(x) = \mathbb{E}_{\mathcal{D}, \varepsilon} \left[ \left( y - \hat{f}(x) \right)^2 \right]
$$

Substitute $y = f(x) + \varepsilon$:

$$
\text{MSE}(x) = \mathbb{E}_{\mathcal{D}, \varepsilon} \left[ \left( f(x) + \varepsilon - \hat{f}(x) \right)^2 \right]
$$

Expand the square:

$$
= \mathbb{E} \left[ \left( f(x) - \hat{f}(x) \right)^2 + 2\left( f(x) - \hat{f}(x) \right)\varepsilon + \varepsilon^2 \right]
$$

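Since the test noise $\varepsilon$ is independent of $\hat{f}(x)$ (which depends only on the training set $\mathcal{D}$) and $\mathbb{E}[\varepsilon] = 0$, the cross term vanishes. Using $\mathbb{E}[\varepsilon^2] = \sigma^2$: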

$$
= \mathbb{E} \left[ \left( f(x) - \hat{f}(x) \right)^2 \right] + \mathbb{E}[\varepsilon^2] = \mathbb{E} \left[ \left( f(x) - \hat{f}(x) \right)^2 \right] + \sigma^2
$$

---

### 2. Decompose the Squared Error Term

Now decompose the term $\mathbb{E} \left[ \left( \hat{f}(x) - f(x) \right)^2 \right]$ (the same quantity as in step 1, since squaring removes the sign) by adding and subtracting the expected prediction $\mathbb{E}[\hat{f}(x)]$:

$$
\mathbb{E} \left[ \left( \hat{f}(x) - f(x) \right)^2 \right] = \mathbb{E} \left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - f(x) \right)^2 \right]
$$

Expand this expression:

$$
= \mathbb{E} \left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right]
+ \left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2
+ 2 \cdot \mathbb{E} \left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right) \left( \mathbb{E}[\hat{f}(x)] - f(x) \right) \right]
$$

The third term is zero: $\mathbb{E}[\hat{f}(x)] - f(x)$ is a constant, so it factors out of the expectation, and the remaining factor has mean zero:

$$
\mathbb{E}[\hat{f}(x) - \mathbb{E}[\hat{f}(x)]] = 0
$$

So we obtain:

$$
\mathbb{E} \left[ \left( \hat{f}(x) - f(x) \right)^2 \right]
= \underbrace{ \left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2 }_{\text{Bias}^2}
+ \underbrace{ \mathbb{E} \left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right] }_{\text{Variance}}
$$

---

### 3. Bias-Variance Decomposition Formula

Putting it all together:

$$
\mathbb{E} \left[ \left( y - \hat{f}(x) \right)^2 \right]
= \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
= (\mathbb{E}[\hat{f}(x)] - f(x))^2 + \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2] + \sigma^2
$$
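
As a quick sanity check of the formula, consider a deliberately crude estimator that is not part of the derivation above: ignore $x$ and always predict the mean of the training responses. The example assumes fixed training inputs $x_1, \dots, x_n$ and independent noise terms $\varepsilon_i$ with variance $\sigma^2$; the symbol $\bar{f}$ is introduced here only for this illustration.

$$
\hat{f}(x) = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad y_i = f(x_i) + \varepsilon_i
$$

Each term of the decomposition then has a closed form:

$$
\mathbb{E}[\hat{f}(x)] = \frac{1}{n}\sum_{i=1}^{n} f(x_i) =: \bar{f}, \qquad
\text{Bias}^2 = \left( \bar{f} - f(x) \right)^2, \qquad
\text{Variance} = \frac{\sigma^2}{n}
$$

$$
\mathbb{E} \left[ \left( y - \hat{f}(x) \right)^2 \right]
= \left( \bar{f} - f(x) \right)^2 + \frac{\sigma^2}{n} + \sigma^2
$$

More data shrinks the variance term toward zero, but the bias term stays unless $f$ really is constant, and the irreducible error $\sigma^2$ never goes away.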
