# Model Selection

Model selection means choosing the best model for the task, typically the one with the lowest expected prediction error on unseen data.

The expected prediction error (e.g., mean squared error) of a model at a point $x_0$ is the sum of the variance, the squared bias, and the irreducible error.

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \operatorname{Var}(\hat{f}(x_0)) + [\operatorname{Bias}(\hat{f}(x_0))]^2 + \operatorname{Var}(\epsilon)$$
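
The decomposition can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the true function `true_f`, the noise level `sigma`, the evaluation point `x0`, and the choice of a straight-line fit are all assumptions made for illustration. It repeatedly draws a training set, fits the line, and compares the Monte Carlo mean squared error at `x0` with the sum of the three terms.

```python
import random
import statistics


def true_f(x):
    # Hypothetical true regression function (an assumption for this demo).
    return x * x


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def decompose(x0=0.9, n=20, sigma=0.5, n_sims=2000, seed=0):
    """Monte Carlo estimate of both sides of the decomposition at x0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_sims):
        # Draw a fresh training set and fit the (misspecified) linear model.
        xs = [rng.uniform(0, 1) for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        pred = a + b * x0
        # Draw an independent test response at x0.
        y0 = true_f(x0) + rng.gauss(0, sigma)
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    return {
        "mse": statistics.fmean(sq_errors),
        "variance": statistics.pvariance(preds),
        "bias_sq": (statistics.fmean(preds) - true_f(x0)) ** 2,
        "noise": sigma ** 2,
    }


parts = decompose()
# The two printed numbers should be close, up to Monte Carlo error.
print(parts["mse"], parts["variance"] + parts["bias_sq"] + parts["noise"])
```

The agreement is only approximate because the left-hand side uses sampled noise while the right-hand side uses the known noise variance `sigma ** 2`.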

## BIC


## Cross Validation

K-fold cross-validation is a method for estimating the expected prediction error.
The data set $D$ is divided into $K$ disjoint subsets (folds) of roughly equal size, $D_1, D_2, \dots, D_K$.
For each subset $D_i$, we train the model on the rest of the data set, $D \setminus D_i$, and test it on $D_i$.
This yields $K$ fitted versions of the same model (that is, $K$ sets of fitted parameters, not hyperparameters) and $K$ test errors, whose average is the cross-validation estimate.
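
The procedure above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the helper names `k_fold_indices`, `fit_line`, and `k_fold_cv_mse` are hypothetical, the model is a one-feature least-squares line, and the error measure is mean squared error.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def k_fold_cv_mse(xs, ys, k=5, seed=0):
    """Return (average test MSE over folds, list of per-fold test MSEs)."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Train on D \ D_i ...
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # ... and test on D_i.
        fold_errors.append(
            statistics.fmean([(ys[i] - (a + b * xs[i])) ** 2 for i in fold])
        )
    return statistics.fmean(fold_errors), fold_errors


rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(50)]
ys = [1 + 2 * x + rng.gauss(0, 0.3) for x in xs]
cv_mse, per_fold = k_fold_cv_mse(xs, ys, k=5)
print(cv_mse, per_fold)
```

Shuffling before splitting is a design choice that guards against ordered data; for time series or grouped data a different splitting scheme would be needed.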

The larger the $K$, the larger each training set, and so the lower the bias of the estimate of the expected prediction error; the cost is more computation and, potentially, higher variance.
With $K = N$, so-called leave-one-out cross-validation, the estimate is approximately unbiased for the expected prediction error, but can have high variance because the $N$ training sets are so similar to one another.

**Why do similar training sets lead to high variance in the error estimate?** 
Since the models are trained on almost identical datasets, the fitted models, and hence the per-fold prediction errors, are highly correlated. Averaging highly correlated quantities reduces variance much less than averaging independent ones, so the averaged error estimate remains sensitive to the particular dataset drawn and can have a large variance.
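
One way to make this concrete, under the simplifying assumption (not from the notes above) that the $K$ per-fold errors $e_1, \dots, e_K$ each have variance $\sigma^2$ and share a common pairwise correlation $\rho$:

$$\operatorname{Var}\left(\frac{1}{K}\sum_{i=1}^{K} e_i\right) = \frac{\sigma^2}{K} + \frac{K-1}{K}\rho\sigma^2$$

The first term vanishes as $K$ grows, but the second approaches $\rho\sigma^2$, so when $\rho$ is close to 1, averaging removes very little variance.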


Cross-validation is mainly used for model selection: candidate models are compared by their cross-validation errors.
Say we have a linear regression model and a neural network model. We can use cross-validation to decide which one to use for the task, since the cross-validation error serves as an estimate of the expected prediction error.
- for each model (linear regression or neural network)
    - divide the training dataset into $K$ subsets and perform cross-validation to get the cross-validation error
- model selection
    - choose the model with the smallest cross-validation error
- final training
    - train the chosen model on the whole training dataset
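
The selection loop above might look like the sketch below. To keep it dependency-free, the two candidates are a constant-mean model and a least-squares line; these are stand-ins chosen for illustration, not the linear regression and neural network from the text. The names `fit_mean`, `fit_line`, `cv_error`, and `select_and_refit` are hypothetical.

```python
import random
import statistics


def fit_mean(xs, ys):
    """Constant model: always predict the training mean."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares line y = a + b * x, returned as a prediction function."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def cv_error(fit, xs, ys, k=5, seed=0):
    """K-fold cross-validation MSE for one candidate fitting procedure."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    errors = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(
            statistics.fmean([(ys[i] - model(xs[i])) ** 2 for i in fold])
        )
    return statistics.fmean(errors)


def select_and_refit(candidates, xs, ys, k=5):
    """Pick the candidate with the smallest CV error, then refit on all data."""
    scores = {name: cv_error(fit, xs, ys, k) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, candidates[best](xs, ys), scores


rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(60)]
ys = [3 * x + rng.gauss(0, 0.3) for x in xs]
best, final_model, scores = select_and_refit(
    {"mean": fit_mean, "linear": fit_line}, xs, ys
)
print(best, scores)
```

Every candidate is scored on the same folds (same `seed`), so the comparison is not affected by differences in how the data happened to be split.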


## Bootstrap

The bootstrap is a resampling method for estimating properties of a model or estimator, such as its bias, variance, and prediction error.
It resamples the training dataset with replacement to generate $B$ bootstrap datasets.
Each resampled dataset has the same size as the original dataset, but some of the original data points are missing and some are duplicated.

- for each bootstrap dataset
    - train the model on the bootstrap dataset
- evaluate the models on the original dataset
    - for each observation in the original dataset
        - collect the predictions from the bootstrap models whose training sets do not contain that observation
        - average the prediction errors of these models
    - average the per-observation errors to get the overall error estimate
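
A minimal sketch of this out-of-sample bootstrap error estimate follows. The names `fit_line` and `loo_bootstrap_error`, the least-squares line as the model, and squared error as the loss are all assumptions made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx  # fails only if a resample contains a single distinct x
    return my - b * mx, b


def loo_bootstrap_error(xs, ys, n_boot=200, seed=0):
    """Average squared error, using only models that did not see each point."""
    rng = random.Random(seed)
    n = len(xs)
    errors = [[] for _ in range(n)]  # errors[i]: out-of-sample errors for point i
    for _ in range(n_boot):
        # Resample n indices with replacement and train on the resample.
        sample = [rng.randrange(n) for _ in range(n)]
        in_sample = set(sample)
        a, b = fit_line([xs[i] for i in sample], [ys[i] for i in sample])
        # Score the model only on observations missing from its resample.
        for i in range(n):
            if i not in in_sample:
                errors[i].append((ys[i] - (a + b * xs[i])) ** 2)
    # Average per observation first, then across observations
    # (observations that were never left out are skipped).
    per_obs = [statistics.fmean(e) for e in errors if e]
    return statistics.fmean(per_obs)


rng = random.Random(7)
xs = [rng.uniform(0, 1) for _ in range(30)]
ys = [1 + 2 * x + rng.gauss(0, 0.3) for x in xs]
print(loo_bootstrap_error(xs, ys))
```

Each observation is absent from a given bootstrap dataset with probability $(1 - 1/N)^N \approx 0.368$, so with a few hundred resamples every observation is typically scored many times.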


Some properties:
- The bootstrap is a computer implementation of nonparametric or parametric maximum likelihood estimation. It allows us to compute maximum likelihood estimates of standard errors and other quantities in settings where no maximum likelihood formulas are available.
- The bootstrap mean is approximately a posterior average, from a Bayesian perspective.
  - Since the posterior mean (not mode) minimizes squared-error loss, it is not surprising that bagging (averaging predictions over bootstrap models) can often reduce mean squared error.

## Questions
1. How can cross-validation and the bootstrap help balance bias and variance?
    - cross-validation can estimate the expected prediction error
        - MSE as a measure of expected prediction error
        - MSE = bias$^2$ + variance + irreducible error
        - bias and variance typically trade off: as model complexity grows, bias tends to decrease and variance tends to increase
    - the model selected by cross-validation is the one with the smallest estimated expected prediction error
    - the model with the smallest expected prediction error is the one with the best trade-off, i.e., the smallest sum of squared bias and variance, not necessarily the smallest of each
2. What is the difference between cross-validation and ensemble learning?
    - cross-validation and the bootstrap are resampling methods, used to estimate the variance and bias of a model, to quantify the uncertainty of the model, and to select the best model in terms of generalization error.
    - ensemble learning combines multiple models to improve on the performance of any single model.
        - bagging: bootstrap aggregation. 
            - train multiple models on different bootstrap datasets and average the predictions
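
As an illustration of bagging, the sketch below trains one base model per bootstrap dataset and averages their predictions. The base learner here is a 1-nearest-neighbour regressor, chosen only because it is short and high-variance; `fit_1nn` and `bagging` are hypothetical names, not from any library.

```python
import random
import statistics


def fit_1nn(xs, ys):
    """1-nearest-neighbour regressor: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def bagging(fit, xs, ys, n_models=50, seed=0):
    """Train n_models base models on bootstrap datasets; average predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap dataset: n indices drawn with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit([xs[i] for i in sample], [ys[i] for i in sample]))
    # The ensemble prediction is the plain average of the base predictions.
    return lambda x: statistics.fmean([m(x) for m in models])


rng = random.Random(3)
xs = [rng.uniform(0, 1) for _ in range(40)]
ys = [x * x + rng.gauss(0, 0.2) for x in xs]
single = fit_1nn(xs, ys)
bagged = bagging(fit_1nn, xs, ys)
print(single(0.5), bagged(0.5))
```

Unlike cross-validation, the resampling here does not estimate anything: the bootstrap models themselves are kept and used together as the final predictor.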

