# Chapter 6: Linear Model Selection and Regularization
extend the linear framework: alternative fitting procedures (instead of least squares) that can yield better **prediction accuracy** and **model interpretability**.

three classes of methods: 
- **subset selection**: identify a subset of predictors first
- **shrinkage (regularization)**: the estimated coefficients are shrunken towards zero relative to the least squares estimates
- **dimension reduction**: project the $p$ predictors into an $M$-dimensional subspace, where $M < p$. (This is achieved by computing $M$ different linear combinations, or projections, of the variables. Then these $M$ projections are used as predictors to fit a linear regression model by least squares.)

Curse of Dimensionality:
1) Regularization or shrinkage plays a key role in high-dimensional problems.
2) Appropriate tuning parameter selection is crucial for good predictive performance
3) The test error tends to increase as the dimensionality of the problem (i.e. the number of features or predictors) increases, unless the additional features are truly associated with the response.

## Subset Selection

### Best Subset Selection
Idea: first, fit a separate least squares regression for each possible combination of the p predictors; second, identify the best one.

1. Let $M_0$ denote the *null model*, which contains no predictors. This model simply predicts the sample mean for each observation.

2. For $k = 1, 2, \ldots, p$:

   (a) Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.

   (b) Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here *best* is defined as having the smallest RSS, or equivalently largest $R^2$.

3. Select a single best model from among $M_0, \ldots, M_p$ using the prediction error on a validation set, $C_p$ (AIC), BIC, or adjusted $R^2$. Or use the cross-validation method.

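Disadvantages: (1) computational limitation: $2^p$ models must be fit, which is infeasible for large $p$; (2) as stated, the procedure uses RSS and so applies to least squares linear regression (for other models, e.g. logistic regression, the deviance plays the role of RSS)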
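The enumeration in step 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `rss_of_fit` and `best_subset` and the synthetic data are assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def rss_of_fit(X, y, cols):
    """RSS of a least squares fit of y on an intercept plus the columns in `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def best_subset(X, y):
    """Return {k: (best_cols, rss)} for k = 0..p, i.e. the models M_0, ..., M_p."""
    p = X.shape[1]
    best = {}
    for k in range(p + 1):
        # k = 0 yields a single empty subset: the null model (intercept only).
        candidates = combinations(range(p), k)
        best[k] = min(((cols, rss_of_fit(X, y, cols)) for cols in candidates),
                      key=lambda pair: pair[1])
    return best


# Synthetic data (assumed for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

models = best_subset(X, y)
for k, (cols, rss) in models.items():
    print(k, cols, round(rss, 2))
```

The loop fits $2^p$ models in total, which is why this approach stops being practical once $p$ grows beyond a few dozen. Step 3 (choosing among $M_0, \ldots, M_p$) is deliberately left out here.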

### Stepwise Selection

#### Forward Stepwise Selection

1. Let $M_0$ denote the *null model*, which contains no predictors.

2. For $k = 0, \ldots, p - 1$:

   (a) Consider all $p - k$ models that augment the predictors in $M_k$ with one additional predictor.

   (b) Choose the *best* among these $p - k$ models, and call it $M_{k+1}$. Here *best* is defined as having smallest RSS or highest $R^2$.

3. Select a single best model from among $M_0, \ldots, M_p$ using the prediction error on a validation set, $C_p$ (AIC), BIC, or adjusted $R^2$. Or use the cross-validation method.
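A sketch of steps 1 and 2 of forward stepwise selection, under the same assumptions as any toy example: the function names `fit_rss` and `forward_stepwise` and the synthetic data are hypothetical, and step 3 is left to the criteria discussed later.

```python
import numpy as np


def fit_rss(X, y, cols):
    """RSS of a least squares fit of y on an intercept plus the columns in `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def forward_stepwise(X, y):
    """Return the path [(cols, rss), ...] for M_0, M_1, ..., M_p."""
    p = X.shape[1]
    selected = []
    path = [(tuple(selected), fit_rss(X, y, selected))]  # M_0: null model
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Steps 2(a)-(b): try each remaining predictor, keep the smallest RSS.
        best_j = min(remaining, key=lambda j: fit_rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append((tuple(selected), fit_rss(X, y, selected)))
    return path


# Synthetic data (assumed for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

path = forward_stepwise(X, y)
for cols, rss in path:
    print(cols, round(rss, 2))
```

The search fits $1 + p(p+1)/2$ models instead of $2^p$, at the cost of not being guaranteed to find the best model of each size.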

#### Backward Stepwise Selection

1. Let $M_p$ denote the *full model*, which contains all $p$ predictors.

2. For $k = p, p - 1, \ldots, 1$:

   (a) Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k - 1$ predictors.

   (b) Choose the *best* among these $k$ models, and call it $M_{k-1}$. Here *best* is defined as having smallest RSS or highest $R^2$.

3. Select a single best model from among $M_0, \ldots, M_p$ using the prediction error on a validation set, $C_p$ (AIC), BIC, or adjusted $R^2$. Or use the cross-validation method.

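p.s. Backward stepwise requires $n > p$, so that the full model can be fit. Forward stepwise can be used even when $n < p$, so it is the only viable subset method when $p$ is very large.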
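The backward pass can be sketched in the same style, assuming $n > p$ so that the full model has a unique least squares fit. The names `fit_rss` and `backward_stepwise` and the synthetic data are illustrative assumptions.

```python
import numpy as np


def fit_rss(X, y, cols):
    """RSS of a least squares fit of y on an intercept plus the columns in `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def backward_stepwise(X, y):
    """Return the path [(cols, rss), ...] for M_p, M_{p-1}, ..., M_0."""
    p = X.shape[1]
    selected = list(range(p))
    path = [(tuple(selected), fit_rss(X, y, selected))]  # M_p: full model
    while selected:
        # Try dropping each predictor; keep the drop that leaves the smallest RSS.
        drop = min(selected,
                   key=lambda j: fit_rss(X, y, [c for c in selected if c != j]))
        selected.remove(drop)
        path.append((tuple(selected), fit_rss(X, y, selected)))
    return path


# Synthetic data (assumed for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

path = backward_stepwise(X, y)
for cols, rss in path:
    print(cols, round(rss, 2))
```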

#### Hybrid Approaches

Idea: Variables are added to the model sequentially, in analogy to forward selection. After adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit.

### Choosing the Optimal Model
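A fact: RSS and $R^2$ are training-error measures, so the model containing all $p$ predictors always has the smallest RSS and the largest $R^2$. They are therefore not suitable for selecting the best model among a collection of models with different numbers of predictors.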

Two approaches to estimate the test error:

1) indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting

| Criterion        | Formula (OLS case)                                                                                     | Parameters                                                                                         | Intuition                                                                                          |
|------------------|-------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| $C_p$            | $C_p = \frac{\mathrm{RSS}}{\hat{\sigma}^2} + 2d$                                                       | $\mathrm{RSS}$ = residual sum of squares; $\hat{\sigma}^2$ = estimate of error variance; $d$ = number of parameters (incl. intercept) | Balances fit and complexity; aims to select models with low prediction error.                     |
| AIC              | $\mathrm{AIC} = n \log\left(\frac{\mathrm{RSS}}{n}\right) + 2d$                                        | $n$ = sample size; $d$ = number of parameters; $\mathrm{RSS}$ = residual sum of squares            | Information-theoretic measure; penalizes complexity with $2d$, favors models that explain the data well. |
| BIC              | $\mathrm{BIC} = n \log\left(\frac{\mathrm{RSS}}{n}\right) + d\log(n)$                                  | $n$ = sample size; $d$ = number of parameters; $\mathrm{RSS}$ = residual sum of squares            | Stronger penalty than AIC when $n > 7$; more conservative, tends to choose simpler models.         |
| Adjusted $R^2$   | $\bar{R}^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$                                 | $\mathrm{TSS}$ = total sum of squares; $n$ = sample size; $d$ = number of predictors (excl. intercept) | Measures proportion of variance explained, adjusted for number of predictors; discourages overfitting. |
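The four formulas in the table translate directly into code. The sketch below follows the table's conventions ($d$ counts the intercept for $C_p$, AIC and BIC, but not for adjusted $R^2$). The function name `selection_criteria` and the input numbers are made up for illustration, and $\hat{\sigma}^2$ is assumed to be supplied by the caller (it is typically estimated from the full model).

```python
import numpy as np


def selection_criteria(rss, tss, n, n_predictors, sigma2_hat):
    """Evaluate C_p, AIC, BIC and adjusted R^2 for one fitted model."""
    d = n_predictors + 1  # parameters incl. intercept (C_p, AIC, BIC rows)
    cp = rss / sigma2_hat + 2 * d
    aic = n * np.log(rss / n) + 2 * d
    bic = n * np.log(rss / n) + d * np.log(n)
    # Adjusted R^2 uses the number of predictors, excluding the intercept.
    adj_r2 = 1 - (rss / (n - n_predictors - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "adj_R2": adj_r2}


# Hypothetical summary statistics for a model with 3 predictors.
out = selection_criteria(rss=80.0, tss=200.0, n=100, n_predictors=3, sigma2_hat=1.0)
print(out)
```

Lower is better for $C_p$, AIC and BIC; higher is better for adjusted $R^2$. Only differences between candidate models matter, so constant terms dropped from these formulas do not change the ranking.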

  
2) directly estimate the test error, using either a validation set approach or a cross-validation approach

  **one-standard-error rule**: We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
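A minimal sketch of the one-standard-error rule, assuming the cross-validation errors are available as a table with one row per model size (ordered from smallest to largest model) and one column per fold. The function name and the error values are hypothetical.

```python
import numpy as np


def one_standard_error_choice(fold_errors):
    """Pick a row of fold_errors (shape: n_sizes x n_folds) by the one-SE rule.

    Row i holds the per-fold test MSE of the i-th smallest model.
    """
    fold_errors = np.asarray(fold_errors, dtype=float)
    mean_err = fold_errors.mean(axis=1)
    # Standard error of the mean CV error for each model size.
    se = fold_errors.std(axis=1, ddof=1) / np.sqrt(fold_errors.shape[1])
    best = int(np.argmin(mean_err))
    threshold = mean_err[best] + se[best]
    # First (i.e. smallest) model whose mean error is within one SE of the minimum.
    return int(np.argmax(mean_err <= threshold))


# Hypothetical 5-fold CV errors for models with 0, 1, 2, 3, 4 predictors.
errors = [
    [10.0, 11.0, 9.0, 10.5, 9.5],
    [5.0, 5.4, 4.6, 5.2, 4.8],
    [4.0, 4.6, 3.6, 4.3, 3.5],
    [3.9, 4.5, 3.5, 4.4, 3.2],
    [4.1, 4.4, 3.6, 4.5, 3.4],
]
chosen = one_standard_error_choice(errors)
print("chosen model size:", chosen)
```

In this example the lowest mean error belongs to the 3-predictor model, but the 2-predictor model is within one standard error of it, so the rule prefers the simpler model.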

## Shrinkage Methods

### Ridge

minimize:

$$
\min_{\beta_0, \beta} \ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \ + \ \lambda \sum_{j=1}^p \beta_j^2
$$

p.s. We can use cross-validation to choose the proper $\lambda$ (tuning parameter).

Note the shrinkage penalty is applied to $\beta_1, \ldots, \beta_p$, but not to the intercept $\beta_0$.

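Ridge regression keeps all $p$ predictors in the final model: the penalty shrinks coefficients towards zero but does not set any of them exactly to zero. This can create a challenge in model interpretation in settings in which the number of variables $p$ is quite large. (then consider LASSO)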
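For a fixed $\lambda$ the ridge objective has a closed-form minimizer. The sketch below centers the data so the intercept stays unpenalized, then solves the penalized normal equations. The function name `ridge_fit` and the synthetic data are assumptions. Predictors are not standardized here, although in practice they usually are, because the ridge solution depends on the scale of each predictor.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2), leaving the intercept unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes beta_0 from the problem
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta


# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

for lam in [0.0, 10.0, 1000.0]:
    beta0, beta = ridge_fit(X, y, lam)
    print(lam, np.round(beta, 3))
```

With `lam = 0.0` this reproduces the least squares fit; as `lam` grows, all coefficients shrink towards zero but none becomes exactly zero.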

### LASSO

$$
\min_{\beta_0, \beta} \ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \ + \ \lambda \sum_{j=1}^p \left| \beta_j \right|
$$
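Unlike ridge, the lasso has no closed-form solution in general. One common solver is cyclic coordinate descent with soft-thresholding. That choice is an assumption of this sketch, not something the notes above prescribe. The function names and the synthetic data are also illustrative. With the objective written as RSS $+ \lambda \sum_j |\beta_j|$, each coordinate update thresholds at $\lambda/2$.

```python
import numpy as np


def soft_threshold(z, t):
    """Move z towards zero by t, and return zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum(|beta_j|) by cyclic coordinate descent."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # intercept is not penalized
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the fit of every predictor except j.
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    beta0 = y_mean - x_mean @ beta
    return beta0, beta


# Synthetic data (assumed for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

beta0, beta = lasso_fit(X, y, lam=100.0)
print(np.round(beta, 3))
```

A fixed number of sweeps keeps the sketch short; a practical solver would stop when the coefficients change by less than a tolerance.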

### Alternative Formulation

We can show that the lasso and ridge regression coefficient estimates solve the problems

**Lasso:**
$$
\begin{aligned}
&\min_{\beta_0, \beta} \ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \\
&\text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s
\end{aligned}
$$

**Ridge:**
$$
\begin{aligned}
&\min_{\beta_0, \beta} \ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \\
&\text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s
\end{aligned}
$$


vs **Best Subset Selection:** (computationally infeasible for large $p$)
$$
\begin{aligned}
& \min_{\beta_0, \beta} \quad \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \\
& \text{subject to} \quad \sum_{j=1}^p \mathbf{1}_{\{\beta_j \neq 0\}} \le k
\end{aligned}
$$

*Ridge vs LASSO*: In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size.

### Select the tuning parameter

1. Choose a grid of $\lambda$ values, and compute the cross-validation error for each value of $\lambda$.
2. Select the tuning parameter value for which the cross-validation error is smallest.
3. Refit the model using all of the available observations and the selected value of the tuning parameter.
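The three steps can be sketched for ridge regression with plain NumPy. The grid, the number of folds, the helper names `ridge_fit` and `cv_error`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta


def cv_error(X, y, lam, n_folds=5, seed=0):
    """Mean k-fold cross-validated MSE of ridge regression at a given lambda."""
    rng = np.random.default_rng(seed)  # same seed => same folds for every lambda
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        beta0, beta = ridge_fit(X[train], y[train], lam)
        pred = beta0 + X[test] @ beta
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))


# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

lambdas = 10.0 ** np.linspace(-2, 4, 25)          # step 1: grid of lambda values
cv = [cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv))]            # step 2: smallest CV error
beta0, beta = ridge_fit(X, y, best_lam)           # step 3: refit on all data
print(best_lam, np.round(beta, 3))
```

Reusing the same fold assignment for every $\lambda$ keeps the comparison across the grid fair, since all candidates are scored on identical splits.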

## Dimension Reduction Methods
Idea: transform the predictors and then fit a least squares model using the transformed variables

### Principal Components Analysis (PCA)
PCA seeks a projection direction that maximizes the variance of the data projected onto it. A larger variance means that the direction contains richer information and can better distinguish between samples.
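The variance-maximizing property can be checked numerically. The sketch below computes the first principal component direction from the SVD of the centered data and compares its projected variance with that of many random unit directions. The two-dimensional synthetic data are an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data stretched along the direction (1, 1).
z = rng.normal(size=(200, 1))
X = z @ np.array([[1.0, 1.0]]) + 0.1 * rng.normal(size=(200, 2))

Xc = X - X.mean(axis=0)
# Rows of Vt are the principal component directions (unit vectors).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]
scores = Xc @ phi1                  # first principal component scores
var_pc1 = scores.var(ddof=1)

# Projected variance along 1000 random unit directions, for comparison.
random_dirs = rng.normal(size=(1000, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
var_random = (Xc @ random_dirs.T).var(axis=0, ddof=1)

print("first direction:", np.round(phi1, 3))
print("PC1 variance is the largest:", bool(var_pc1 >= var_random.max()))
```

The sign of `phi1` is arbitrary: a direction and its negation give the same projected variance.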

### The Principal Components Regression Approach (PCR)
construct the first M principal components, $Z_1,...,Z_M$; use these components as the predictors in a linear regression model that is fit using least squares

PCR does not result in the development of a model that relies upon a small set of the original features. (In this sense, PCR is more closely related to ridge regression than to the lasso.)

Directions are identified in an **unsupervised** way, since the response Y is not used to help determine the principal component directions. So, there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
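A compact PCR sketch: standardize the predictors, take the first $M$ principal component scores, and regress the response on them by least squares. The function names `pcr_fit` and `pcr_predict` and the synthetic data are assumptions. Choosing $M$ (typically by cross-validation) is left out.

```python
import numpy as np


def pcr_fit(X, y, M):
    """Principal components regression with M components."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Xs = (X - x_mean) / x_std                      # standardize predictors
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    V = Vt[:M].T                                   # p x M loading vectors
    Z = Xs @ V                                     # n x M component scores
    theta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    beta_std = V @ theta                           # coefficients on standardized X
    return y.mean(), beta_std, (x_mean, x_std)


def pcr_predict(model, X):
    intercept, beta_std, (x_mean, x_std) = model
    return intercept + ((X - x_mean) / x_std) @ beta_std


# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

for M in [1, 2, 5]:
    model = pcr_fit(X, y, M)
    rss = float(np.sum((y - pcr_predict(model, X)) ** 2))
    print(M, np.round(model[1], 3), round(rss, 2))
```

Even with small $M$, `beta_std` generally has all entries nonzero, because each component mixes every original predictor. With $M = p$ the fit coincides with ordinary least squares.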


### Partial Least Squares (PLS)
- Standardize the predictors.
- Compute the first PLS direction $Z_1 = \sum_{j=1}^p \phi_{j1} X_j$, where each weight $\phi_{j1}$ is the coefficient from the simple linear regression of $Y$ onto $X_j$ (PLS identifies directions in a **supervised** way that makes use of $Y$).
- Regress each predictor on $Z_1$ and take the residuals, which removes the variation already explained by $Z_1$.
- Compute the next PLS direction $Z_2$ from these residuals in the same way as $Z_1$.
- Repeat to get orthogonal components $Z_1, Z_2, \ldots, Z_M$, then fit a least squares regression of $Y$ on them.
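The steps above can be sketched as follows. This version uses inner products with the centered response as weights, which for standardized predictors are proportional to the simple regression coefficients in the first step. The function name `pls_components` and the synthetic data are assumptions, and this is one of several equivalent-looking PLS formulations rather than a definitive one.

```python
import numpy as np


def pls_components(X, y, M):
    """Return the n x M matrix of PLS component scores Z_1, ..., Z_M."""
    Xm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize predictors
    yc = y - y.mean()
    Z = np.empty((len(y), M))
    for m in range(M):
        phi = Xm.T @ yc          # supervised weights: association of each column with y
        z = Xm @ phi             # current PLS direction
        Z[:, m] = z
        # Regress each predictor on z and keep the residuals.
        Xm = Xm - np.outer(z, z @ Xm) / (z @ z)
    return Z


# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

Z = pls_components(X, y, M=2)
# Final step: least squares regression of y on the PLS components.
design = np.column_stack([np.ones(len(y)), Z])
theta, *_ = np.linalg.lstsq(design, y, rcond=None)
print("corr(Z1, y):", round(float(np.corrcoef(Z[:, 0], y)[0, 1]), 3))
```

Because each new direction is built from residuals that are orthogonal to the previous components, the columns of `Z` come out mutually orthogonal.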


# Notes

## Ridge and LASSO

From a geometric perspective, because the constraint region of Lasso has sharp corners (located on the coordinate axes), the optimization solution is more likely to fall exactly on some axes, resulting in sparse solutions (some coefficients being exactly zero); whereas the Ridge constraint is a smooth circle, which rarely yields coefficients that are exactly zero.
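A simple special case makes the contrast concrete. Assume $n = p$, $X$ is the identity matrix, and there is no intercept (a simplification introduced here for illustration). The two problems then separate across coordinates and have closed-form solutions:

$$
\hat{\beta}_j^{R} = \frac{y_j}{1 + \lambda},
\qquad
\hat{\beta}_j^{L} =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\
0 & \text{if } |y_j| \le \lambda/2
\end{cases}
$$

Ridge shrinks every coefficient by the same proportion, so none becomes exactly zero. Lasso moves each coefficient towards zero by the constant amount $\lambda/2$ and sets the small ones exactly to zero (soft-thresholding).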

## Compare Dimension Reduction Methods

| Method | Objective                              | Basis for Dimension Reduction       |
|--------|--------------------------------------|------------------------------------|
| PCA    | Maximize variance of predictors      | Covariance matrix of predictors X  |
| PCR    | Principal component regression based on PCA | Principal components of predictors X |
| PLS    | Maximize covariance between predictors and response | Covariance between predictors X and response Y |
