# Contents

1. [Subset Selection](#Subset-Selection)
2. [Shrinkage Methods](#Shrinkage-Methods)
3. [Dimension Reduction Methods](#Dimension-Reduction-Methods)
4. [Considerations in High Dimensions](#Considerations-in-High-Dimensions)

---

# Introduction 
Before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures. Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better **prediction accuracy** and **model interpretability**, particularly when $n$, the number of observations, is not much larger than $p$, the number of variables.

1. *Prediction Accuracy*: Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If $n \gg p$, then the least squares estimates tend to also have low variance, and hence will perform well on test observations. However, if $n$ is not much larger than $p$, then there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. And if $p > n$, then there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all. **By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.** This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training.

2. *Model Interpretability*: It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables—that is, by setting the corresponding coefficient estimates to zero—we can obtain a model that is more easily interpreted.

In this chapter, we see some approaches for automatically performing feature selection or variable selection. We will discuss three important classes of methods.

- **Subset Selection**. This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
    
- **Shrinkage**. This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as **regularization**) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.
    
- **Dimension Reduction**. This approach involves projecting the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations, or projections, of the variables. Then these $M$ projections are used as predictors to fit a linear regression model by least squares.

---

# Subset Selection

## Best Subset Selection
To perform best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. We then look at all of the resulting models, with the goal of identifying the one that is best.

The problem of selecting the best model from among the $2^p$ possibilities considered by best subset selection is not trivial. This is usually broken up into two stages, as described in **Algorithm 6.1** below.

**Algorithm 6.1 -** *Best subset selection*

>1. Let $M_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
>2. For $k = 1, 2, \ldots, p$:
>
>    (a) Fit all $p \choose k$ models that contain exactly k predictors.
>
>    (b) Pick the best among these $p \choose k$ models, and call it $M_k$. Here *best* is defined as having the smallest $\text{RSS}$, or equivalently largest $R^2$.
>3. Select a single best model from among $M_0, \ldots ,M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.

In Algorithm 6.1, **Step 2** identifies the best model (on the training data) for each subset size, in order to reduce the problem from one of $2^p$ possible models to one of $p + 1$ possible models.

Now in order to select a single best model, we must simply choose among these $p + 1$ options. The problem is
that a low RSS or a high $R^2$ indicates a model with a low training error, whereas we wish to choose a model that has a low test error. Therefore, in **Step 3**, we use cross-validated prediction error, $C_p$ , BIC, or adjusted $R^2$ in order to select among $M_0 , M_1 , \ldots , M_p$.

Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression. In the case of logistic regression, instead of ordering models by
RSS in *Step 2* of Algorithm 6.1, we instead use the **deviance**, a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.

While best subset selection is a simple and conceptually appealing approach, it suffers from computational limitations. The number of possible models that must be considered grows rapidly as p increases. In general, there are $2^p$ models that involve subsets of p predictors. As a result, best subset selection becomes computationally infeasible for values of p greater than around 40, even with extremely fast modern computers. We present computationally efficient alternatives to best subset selection next.

## Stepwise Selection
For computational reasons, best subset selection cannot be applied with very large p. Best subset selection may also suffer from statistical problems when p is large. The larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data. **Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.**

### Forward Stepwise Selection
While the best subset selection procedure considers all $2^p$ possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models.

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. More formally, the forward stepwise selection procedure is given in **Algorithm 6.2**.

**Algorithm 6.2 -** *Forward stepwise selection*

>1. Let $M_0$ denote the null model, which contains no predictors. 
>2. For $k = 0, \ldots, p-1$:
>
>    (a) Fit all $p - k$ models that augment the predictors in $M_k$ with one additional predictor.
>
>    (b) Choose the *best* among these $p - k$ models, and call it $M_{k+1}$. Here *best* is defined as having the smallest $\text{RSS}$, or largest $R^2$.
>3. Select a single best model from among $M_0, \ldots ,M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.

Unlike best subset selection, which involved fitting $2^p$ models, forward stepwise selection involves fitting one null model, along with $p - k$ models in the $k$th iteration, for $k = 0, \ldots, p - 1$. This amounts to a total of $1 + \sum^{p-1}_{k=0} (p - k) = 1 + \frac{p(p + 1)}{2}$ models.
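
To get a feel for how different these two counts are, the short sketch below tallies the number of models fit by each procedure for a few values of $p$. The function names are illustrative only and do not come from any library.

```python
from math import comb

def n_models_best_subset(p):
    # Sum over k of C(p, k), including the null model (k = 0); this equals 2**p.
    return sum(comb(p, k) for k in range(p + 1))

def n_models_forward_stepwise(p):
    # One null model, plus (p - k) fits at iteration k = 0, ..., p - 1.
    return 1 + sum(p - k for k in range(p))

for p in (10, 20, 40):
    print(p, n_models_best_subset(p), n_models_forward_stepwise(p))
```

For $p = 20$, best subset selection requires fitting 1,048,576 models, whereas forward stepwise selection requires fitting only 211.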

Say we want to identify the best model from among those $p - k$ that augment $M_k$ with one additional predictor. We can do this by simply choosing the model with the lowest RSS or the highest $R^2$. However, in Step 3, we must identify the best model among a set of models with different numbers of variables. This is more challenging, and is discussed in the section [Choosing the Optimal Model](#Choosing-the-Optimal-Model).

Forward stepwise selection’s computational advantage over best subset selection is clear. Though forward stepwise tends to do well in practice, it is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the p predictors.

For instance, suppose that in a given data set with $p = 3$ predictors, the best possible one-variable model
contains $X_1$, and the best possible two-variable model instead contains $X_2$ and $X_3$. Then forward stepwise selection will fail to select the best possible two-variable model, because $M_1$ will contain $X_1$, so $M_2$ must also contain $X_1$ together with one additional variable. The table below demonstrates this issue.

| # Variables   	| Best subset                   	| Forward stepwise                  	|
|----------------	|-------------------------------	|-----------------------------------	|
| **One**         	| rating                        	| rating                            	|
| **Two**         	| rating, income                	| rating, income                    	|
| **Three**       	| rating, income,<br>student       	| rating, income,<br>student           	|
| **Four**        	| cards, income,<br>student, limit 	| rating, income,<br>student, limit 	|
>**Table 6.1.** The first four selected models for best subset selection and forward
stepwise selection on the **Credit** data set. The first three models are identical but
the fourth models differ.
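
The $p = 3$ scenario described before the table can also be reproduced on a small made-up data set. The sketch below is only an illustration under stated assumptions: the data are invented (not the Credit data), the helper names `solve`, `rss`, `best_subset`, and `forward_stepwise` are hypothetical, and least squares is fit through the normal equations in plain Python to keep the example self-contained. Column index 0 plays the role of $X_1$, which is built as $X_2 + X_3$ plus a little noise, while the response is exactly $X_2 + X_3$.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def rss(X, y, cols):
    """RSS of the least squares fit of y on an intercept plus the columns in cols."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    gram = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    rhs = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    beta = solve(gram, rhs)
    fitted = [sum(bj * zj for bj, zj in zip(beta, z)) for z in Z]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def best_subset(X, y):
    """Step 2 of Algorithm 6.1: the lowest-RSS model of each size k >= 1."""
    p = len(X[0])
    return {k: min(combinations(range(p), k), key=lambda c: rss(X, y, c))
            for k in range(1, p + 1)}

def forward_stepwise(X, y):
    """Step 2 of Algorithm 6.2: greedily add the predictor that lowers RSS most."""
    p = len(X[0])
    current, models = (), {}
    for k in range(p):
        candidates = [current + (j,) for j in range(p) if j not in current]
        current = min(candidates, key=lambda c: rss(X, y, c))
        models[k + 1] = tuple(sorted(current))
    return models

# Invented data: y is exactly x2 + x3, and x1 is a noisy copy of y.
x2 = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0]
x3 = [0.0, 0.0, 1.0, 1.0, -1.0, -1.0]
y = [a + b for a, b in zip(x2, x3)]
noise = [0.1, 0.1, -0.1, -0.1, 0.1, -0.1]
x1 = [yi + e for yi, e in zip(y, noise)]
X = [list(row) for row in zip(x1, x2, x3)]

best = best_subset(X, y)
fwd = forward_stepwise(X, y)
print("best subset:     ", best)
print("forward stepwise:", fwd)
```

On this data both procedures pick $X_1$ as the one-variable model, but the best two-variable model is $\{X_2, X_3\}$, which forward stepwise cannot reach because its two-variable model must contain $X_1$.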

Forward stepwise selection can be applied even in the high-dimensional setting where $n < p$; however, in this case, it is possible to construct sub-models $M_0 , \ldots , M_{n−1}$ only, since each submodel is fit using least squares, which will not yield a unique solution if $p \ge n$.


### Backward Stepwise Selection
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing
all p predictors, and then iteratively removes the least useful predictor, one-at-a-time. Details are given in **Algorithm 6.3**.

**Algorithm 6.3 -** *Backward stepwise selection*

>1. Let $M_p$ denote the full model, which contains $p$ predictors. 
>2. For $k = p, p-1, \ldots, 1$:
>
>    (a) Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k − 1$ predictors.
>
>    (b) Choose the *best* among these $k$ models, and call it $M_{k-1}$. Here *best* is defined as having the smallest $\text{RSS}$, or largest $R^2$.
>3. Select a single best model from among $M_0, \ldots ,M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.

Like forward stepwise selection, the backward selection approach searches through only $1 + \frac{p(p + 1)}{2}$ models, and so can be applied in settings where p is too large to apply best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors.

Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when $n < p$, and so is the only viable subset method when p is very large.

### Hybrid Approaches
The best subset, forward stepwise, and backward stepwise selection approaches generally give similar but not identical models. As another alternative, hybrid versions of forward and backward stepwise selection are
available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.

## Choosing the Optimal Model
Best subset selection, forward selection, and backward selection result in the creation of a set of models, each of which contains a subset of the p predictors. In order to implement these methods, we need a way to determine
which of these models is best. As we discussed, the model containing all of the predictors will always have the smallest RSS and the largest $R^2$, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error.

In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:
1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in Chapter 5.

### $C_p$, AIC, BIC, and Adjusted $R^2$
We saw in Chapter 2 that the training set MSE is generally an under-estimate of the test MSE. (Recall that $\text{MSE} = \frac{\text{RSS}}{n}$.) In particular, the training error will decrease as more variables are included in the model, but the test error may not. Therefore, training set RSS and training set $R^2$ cannot be used to select from among a set of models with different numbers of variables.

However, a number of techniques for adjusting the training error for the model size are available. These approaches can be used to select among a set of models with different numbers of variables. We now consider four such
approaches: $C_p$, *Akaike information criterion* (AIC), *Bayesian information criterion* (BIC), and *adjusted* $R^2$.

#### $C_p$
For a fitted least squares model containing $d$ predictors, the $C_p$ estimate of test MSE is computed using the equation

\begin{equation}\label{6.2}
    C_p = \frac{1}{n} (\text{RSS} + 2d\hat{\sigma}^2)
    \tag{6.2}
\end{equation}

where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. Typically $\hat{\sigma}^2$ is estimated using the full model containing all predictors.

Essentially, the $C_p$ statistic adds a penalty of $2d\hat{\sigma}^2$ to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. Clearly, the penalty increases as
the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS.

Though it is beyond the scope of this book, one can show that if $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$ in (\ref{6.2}), then $C_p$ is an unbiased estimate of test MSE. As a consequence, the $C_p$ statistic tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the lowest $C_p$ value.

#### Akaike Information Criterion
The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the model with Gaussian errors, maximum likelihood and least squares are the same thing. In this case AIC is given by

\begin{align*}
    \text{AIC} = \frac{1}{n\hat{\sigma}^2}(\text{RSS} + 2d\hat{\sigma}^2), 
\end{align*}

where, for simplicity, we have omitted an additive constant. Hence for least squares models, $C_p$ and AIC are proportional to each other.

#### Bayesian Information Criterion
BIC is derived from a Bayesian point of view, but ends up looking similar to $C_p$ (and AIC) as well. For the least squares model with $d$ predictors, the BIC is, up to irrelevant constants, given by

\begin{equation}\label{6.3}
    \text{BIC} = \frac{1}{n\hat{\sigma}^2}(\text{RSS} +\log(n)d\hat{\sigma}^2)
    \tag{6.3}
\end{equation}

Notice that BIC replaces the $2d\hat{\sigma}^2$ used by $C_p$ with a $\log(n)d\hat{\sigma}^2$ term, where $n$ is the number of observations. Since $\log n > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than $C_p$.

#### Adjusted $R^2$
The adjusted $R^2$ statistic is another popular approach for selecting among a set of models that contain different numbers of variables. Recall from Chapter 3 that the usual $R^2$ is defined as $1 - \text{RSS}/\text{TSS}$, where $\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2$ is the *total sum of squares* for the response.

Since RSS always decreases as more variables are added to the model, the $R^2$ always increases as more variables are added. For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as

\begin{equation}\label{6.4}
    \text{Adjusted } R^2 = 1 - \frac{\text{RSS} / (n - d - 1) }{\text{TSS} / (n - 1)}
    \tag{6.4} 
\end{equation}

Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error. Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{\text{RSS}}{n - d - 1}$. While RSS always decreases as the number of variables in the model increases, $\frac{\text{RSS}}{n - d - 1}$ may increase or decrease, due to the presence of $d$ in the denominator.

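The intuition behind the adjusted $R^2$ is that once all of the correct variables have been included in the model, adding additional noise variables will lead to only a very small decrease in RSS. Since adding noise variables increases $d$, such variables will increase $\frac{\text{RSS}}{n - d - 1}$, and consequently decrease the adjusted $R^2$.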

Here we have presented the formulas for AIC, BIC, and $C_p$ in the case of a linear model fit using least squares; however, these quantities can also be defined for more general types of models.
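
The four criteria are simple functions of RSS, TSS, $n$, $d$, and $\hat{\sigma}^2$. The sketch below transcribes the formulas as they are written in this section and applies them to two hypothetical models. The numbers are made up for illustration, and the function names are not from any library.

```python
from math import log

def cp(rss, n, d, sigma2):
    # Equation (6.2)
    return (rss + 2 * d * sigma2) / n

def aic(rss, n, d, sigma2):
    # Least squares form of AIC used in this section (additive constant omitted)
    return (rss + 2 * d * sigma2) / (n * sigma2)

def bic(rss, n, d, sigma2):
    # Equation (6.3)
    return (rss + log(n) * d * sigma2) / (n * sigma2)

def adjusted_r2(rss, tss, n, d):
    # Equation (6.4)
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical values: a fourth (noise) variable lowers RSS only from 400 to 396.
n, tss, sigma2 = 100, 1000.0, 4.0
for d, rss_d in [(3, 400.0), (4, 396.0)]:
    print(d,
          cp(rss_d, n, d, sigma2),
          aic(rss_d, n, d, sigma2),
          bic(rss_d, n, d, sigma2),
          adjusted_r2(rss_d, tss, n, d))
```

With these made-up numbers the training RSS favors the four-variable model, but $C_p$, AIC, and BIC all increase and the adjusted $R^2$ decreases, so every criterion prefers the three-variable model.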

![Cp BIC and adjusted R2](./figures/6.2.png)
>**Figure 6.2.** $C_p$, BIC, and adjusted $R^2$ are shown for the best models of each
size for the Credit data set. $C_p$ and BIC are estimates of test MSE. In the middle plot we see that the BIC estimate of test error shows an increase after four variables are selected. The other two plots are
rather flat after four variables are included.

### Validation and Cross-Validation
As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods discussed in Chapter 5. We can compute the validation set error or the cross-validation error for each model under consideration, and then select the model for which the resulting estimated test error is smallest.

This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model. It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$.

![Sqrt BIC Validation and CV error](./figures/6.3.png)
>**Figure 6.3.** For the **Credit** data set, three quantities are displayed for the
best model containing *d* predictors, for *d* ranging from 1 to 11. The overall best
model, based on each of these quantities, is shown as a blue cross.

Figure 6.3 displays, as a function of *d*, the BIC, validation set errors, and cross-validation errors on the Credit data, for the best *d*-variable model. In this case, the validation and cross-validation methods both result in a six-variable model. However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.

While a three-variable model clearly has lower estimated test error than a two-variable model, the estimated test
errors of the 3- to 11-variable models are quite similar. Furthermore, if we repeated the validation set approach using a different split of the data into a training set and a validation set, or if we repeated cross-validation using a different set of cross-validation folds, then the precise model with the lowest estimated test error would surely change.

In this setting, we can select a model using the **one-standard-error rule**. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.

**The rationale here is that if a set of models appear to be more or less equally good, then we might as well choose the simplest model—that is, the model with the smallest number of predictors**. In this case, applying the *one-standard-error rule* to the validation set or cross-validation approach leads to selection of the three-variable model.
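
A minimal sketch of the one-standard-error rule follows. It assumes the cross-validation error and its standard error have already been computed for each model size. The numbers are invented to mimic the qualitative shape described above and are not the actual Credit results, and `one_standard_error_rule` is a hypothetical helper name.

```python
def one_standard_error_rule(sizes, cv_errors, cv_ses):
    """Return the smallest model size whose estimated test error is within
    one standard error of the lowest estimated test error."""
    best = min(range(len(sizes)), key=lambda i: cv_errors[i])
    threshold = cv_errors[best] + cv_ses[best]
    eligible = [size for size, err in zip(sizes, cv_errors) if err <= threshold]
    return min(eligible)

# Invented estimates for models with 1 to 6 predictors.
sizes = [1, 2, 3, 4, 5, 6]
cv_errors = [52.0, 30.0, 11.5, 11.0, 10.8, 10.6]
cv_ses = [3.0, 2.0, 1.0, 1.0, 1.0, 1.0]

chosen = one_standard_error_rule(sizes, cv_errors, cv_ses)
print(chosen)
```

Here the lowest estimated error belongs to the six-variable model, but the three-variable model is already within one standard error of it, so the rule selects three variables.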

---

# Shrinkage Methods
The subset selection methods described in the above section involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all *p* predictors using a technique that **constrains** or **regularizes** the coefficient estimates, or equivalently, that **shrinks** the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but **it turns out that shrinking the coefficient estimates can significantly reduce their variance.**

The two best-known techniques for shrinking the regression coefficients towards zero are **ridge regression** and the **lasso**.

## Ridge Regression
Recall from Chapter 3 that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize

\begin{align*}
    \text{RSS} = \sum^{n}_{i=1} \left( y_i - \beta_0 - \sum^{p}_{j=1} \beta_j x_{ij} \right)^2 . 
\end{align*}

**Ridge regression** is very similar to least squares. In particular, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize

\begin{equation}\label{6.5}
    \sum^n_{i=1} \left( y_i - \beta_0 - \sum^{p}_{j=1} \beta_j x_{ij} \right)^2 + \lambda \sum^p_{j=1} \beta_j^2 = \text{RSS} + \lambda \sum^p_{j=1} \beta_j^2,
    \tag{6.5}
\end{equation}

where $\lambda \ge 0$ is a *tuning parameter*, to be determined separately.

As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS
small. However, the second term, $\lambda \sum_j \beta_j^2$, called a *shrinkage penalty*, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero.

The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates. When $\lambda = 0$, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as $\lambda \to \infty$, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.

Selecting a good value for λ is critical; we defer this discussion to the section [Selecting the Tuning Parameter](#Selecting-the-Tuning-Parameter), where we use cross-validation.

Note that in (\ref{6.5}), the shrinkage penalty is applied to $\beta_1, \ldots, \beta_p$, but not to the intercept $\beta_0$. We want to shrink the estimated association of each variable with the response; however, we do not want to shrink the intercept, which is simply a measure of the mean value of the response when $x_{i1} = x_{i2} = \ldots = x_{ip} = 0$. If we assume that the variables (that is, the columns of the data matrix $X$) have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form $\hat{\beta}_0 = \bar{y} = \sum^n_{i=1} y_i/n$.
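
To make the role of $\lambda$ concrete, consider the special case of a single centered predictor (an assumption made here only to keep the example small). Setting the derivative of (\ref{6.5}) with respect to $\beta_1$ to zero gives $\hat{\beta}_1^R = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$, with intercept $\bar{y}$. The sketch below evaluates this on made-up data; `ridge_simple` is a hypothetical helper name.

```python
def ridge_simple(x, y, lam):
    """Ridge fit for one predictor: returns (intercept, slope).
    The predictor is centered first, so the intercept is the mean of y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xc = [xi - x_bar for xi in x]
    sxy = sum(xi * (yi - y_bar) for xi, yi in zip(xc, y))
    sxx = sum(xi * xi for xi in xc)
    return y_bar, sxy / (sxx + lam)

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

for lam in (0.0, 1.0, 10.0, 100.0, 1e6):
    intercept, slope = ridge_simple(x, y, lam)
    print(lam, intercept, slope)
```

At $\lambda = 0$ the slope is the least squares estimate, and it shrinks steadily towards zero as $\lambda$ grows, while the intercept is left unpenalized.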

### An Application to the Credit Data

![Standardized Ridge Regression Coefficients](./figures/6.4.png)
>**Figure 6.4.** The standardized ridge regression coefficients are displayed for the Credit data set,  
as a function of $\lambda$ and $|| \hat{\beta}^R_\lambda ||_2 / ||\hat{\beta}||_2$

In Figure 6.4, the ridge regression coefficient estimates for the Credit data set are displayed. At the extreme left-hand side of the plot, $\lambda$ is essentially zero, and so the corresponding ridge coefficient estimates are the same as the usual least squares estimates. But as $\lambda$ increases, the ridge coefficient estimates shrink towards zero. When λ is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the *null model* that contains no predictors.

The right-hand panel of Figure 6.4 displays the same ridge coefficient estimates as the left-hand panel, but instead of displaying $\lambda$ on the x-axis, we now display $|| \hat{\beta}^R_\lambda ||_2 / ||\hat{\beta}||_2$, where $\hat{\beta}$ denotes the vector of least squares coefficient estimates. The notation $||\beta||_2$ denotes the $\ell_2$ *norm* of a vector, and is defined as $||\beta||_2 = \sqrt{\sum_{j=1}^p \beta_j^2}$. It measures the distance of $\beta$ from zero. As $\lambda$ increases, the $\ell_2$ norm of $\hat{\beta}^R_\lambda$ will always decrease, and so will $|| \hat{\beta}^R_\lambda ||_2 / ||\hat{\beta}||_2$. The latter quantity ranges from 1 (when $\lambda = 0$, in which case the ridge estimates equal the least squares estimates) to 0 (when $\lambda = \infty$, in which case the ridge estimates are all zero). Therefore, we can think of the x-axis in the right-hand panel of Figure 6.4 as the amount that the ridge regression coefficient estimates have been shrunken towards zero; a small value indicates that they have been shrunken very close to zero.

The standard least squares coefficient estimates discussed in Chapter 3 are **scale equivariant**: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when a predictor is multiplied by a constant: $X_j \hat{\beta}_{j,\lambda}^R$ will depend not only on the value of $\lambda$, but also on the scaling of the $j$th predictor. In fact, the value of $X_j \hat{\beta}_{j,\lambda}^R$ may even depend on the scaling of the other predictors! Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

\begin{equation}\label{6.6}
    \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}}
    \tag{6.6}
\end{equation}

so that they are all on the same scale. In (\ref{6.6}), the denominator is the estimated standard deviation of the $j$th predictor. Consequently, all of the standardized predictors will have a standard deviation of one. As a result the final fit will not depend on the scale on which the predictors are measured.
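
As a small illustration of (\ref{6.6}), the sketch below standardizes the same hypothetical predictor recorded in two different units. The values and the helper names `population_sd` and `standardize` are invented for this example. Note that, following (\ref{6.6}) as written, the column is divided by its standard deviation (computed with a $1/n$ factor) but is not centered.

```python
from math import sqrt

def population_sd(col):
    # Standard deviation with a 1/n factor, as in the denominator of (6.6).
    n = len(col)
    mean = sum(col) / n
    return sqrt(sum((v - mean) ** 2 for v in col) / n)

def standardize(col):
    # Divide each value by the column's standard deviation, as in (6.6).
    sd = population_sd(col)
    return [v / sd for v in col]

# The same made-up predictor measured in dollars and in thousands of dollars.
income_dollars = [20000.0, 35000.0, 50000.0, 80000.0]
income_thousands = [20.0, 35.0, 50.0, 80.0]

z_dollars = standardize(income_dollars)
z_thousands = standardize(income_thousands)
print(z_dollars)
print(z_thousands)
print(population_sd(z_dollars))
```

Up to floating-point rounding, the two standardized columns coincide and have standard deviation one, which is why a ridge fit on standardized predictors no longer depends on the original units.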

### Why Does Ridge Regression Improve Over Least Squares?
Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. This is illustrated in the left-hand panel of Figure 6.5, using a simulated data set containing $p = 45$ predictors and $n = 50$ observations.

![Ridge Mean Squared Error](./figures/6.5.png)
>**Figure 6.5.** Squared bias (black), variance (green), and test mean squared
error (purple) for the ridge regression predictions on a simulated data set, as a
function of $\lambda$ and $|| \hat{\beta}^R_\lambda ||_2 / ||\hat{\beta}||_2$.
The horizontal dashed lines indicate the minimum possible MSE.
The purple crosses indicate the ridge regression models for which
the MSE is smallest.

The minimum MSE is achieved at approximately $λ = 30$. Interestingly, because of its high variance, the MSE associated with the least squares fit, when $λ = 0$, is almost as high as that of the null model for which all coefficient estimates are zero, when $λ = \infty$. However, for an intermediate value of $λ$, the MSE is considerably lower.

The right-hand panel of Figure 6.5 displays the same curves as the left-hand panel, this time plotted against the $\ell_2$ norm of the ridge regression coefficient estimates divided by the $\ell_2$ norm of the least squares estimates. Now as we move from left to right, the fits become more flexible, and so the bias decreases and the variance increases.

In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates.

In particular, when the number of variables $p$ is almost as large as the number of observations $n$, as in the example in Figure 6.5, the least squares estimates will be extremely variable. And if $p > n$, then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.

Ridge regression also has substantial computational advantages over best subset selection, which requires searching through $2^p$ models. In contrast, for any fixed value of $λ$, ridge regression only fits a single model, and the model-fitting procedure can be performed quite quickly. In fact, one can show that the computations required to solve (\ref{6.5}), simultaneously for all values of $λ$, are almost identical to those for fitting a model using least squares.

## The Lasso
Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, **ridge regression will include all $p$ predictors in the final model.**

The penalty $λ \sum \beta_j^2$ in (\ref{6.5}) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless $λ = \infty$). This may not be a problem for prediction accuracy, but it can create a challenge in **model interpretation** in settings in which the number of variables $p$ is quite large.

For example, in the *Credit* data set, it appears that the most important variables are *income, limit, rating,* and *student*. So we might wish to build a model including just these predictors. However, ridge regression will always generate a model involving all ten predictors. Increasing the value of $\lambda$ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}_\lambda^L$, minimize the quantity

\begin{equation}\label{6.7}
    \sum^n_{i=1} \left( y_i - \beta_0 - \sum^{p}_{j=1} \beta_j x_{ij} \right)^2 + \lambda \sum^p_{j=1} |\beta_j| = \text{RSS} + \lambda \sum^p_{j=1} |\beta_j|.
    \tag{6.7}
\end{equation}

Comparing to (\ref{6.5}), we see the only difference is that the $\beta_j^2$ term in the ridge regression penalty (\ref{6.5}) has been replaced by $|\beta_j|$ in the lasso penalty (\ref{6.7}). In statistical parlance, the lasso uses an $\ell_1$ penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $||\beta||_1 = \sum|\beta_j|$.

As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, **the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $λ$ is sufficiently large.** Hence, much like best subset selection, the lasso performs variable selection. We say that the lasso yields *sparse models*—that is, models that involve only a subset of the variables.  As in ridge regression, selecting a good value of λ for the lasso is critical; we defer this discussion to section [Selecting the Tuning Parameter](#Selecting-the-Tuning-Parameter), where we use cross-validation.

![Lasso Coefficients on Credit Data](./figures/6.6.png)
>**Figure 6.6.** The standardized lasso coefficients on the Credit data set are
shown as a function of $\lambda$ and $|| \hat{\beta}^L_\lambda ||_1 / ||\hat{\beta}||_1$.

As an example, consider the coefficient plots in Figure 6.6, which are generated from applying the lasso to the Credit data set. When $λ = 0$, then the lasso simply gives the least squares fit, and when $λ$ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero. However, in between these two extremes, the ridge regression and lasso models are quite different from each other. 

Moving from left to right in the right-hand panel of Figure 6.6, we observe that at first the lasso results in a model that contains only the *rating* predictor. Then *student* and *limit* enter the model almost simultaneously, shortly followed by *income*. Eventually, the remaining variables enter the model.

Hence, depending on the value of $λ$, the lasso can produce a model involving any number of variables. In contrast, ridge regression will always include all of the variables in the model, although the magnitude of the coefficient estimates will depend on $λ$.
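
This contrast can be checked numerically. The following is a minimal sketch, not the method used to produce Figure 6.6: it uses synthetic data rather than the Credit data set, a hand-written cyclic coordinate descent solver for (\ref{6.7}), and the closed-form ridge solution. The function names (`lasso_cd`, `ridge`, `soft_threshold`), the data, and the grid of $\lambda$ values are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(c, t):
    """Shrink c towards zero by t, returning 0 when |c| <= t."""
    return np.sign(c) * max(abs(c) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize RSS + lam * sum|beta_j| by cyclic coordinate descent (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # x_j' x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of predictor j added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            c = X[:, j] @ partial_resid
            beta[j] = soft_threshold(c, lam / 2.0) / col_ss[j]
    return beta

def ridge(X, y, lam):
    """Closed-form minimizer of RSS + lam * sum beta_j^2 (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data: only the first two of eight predictors matter.
rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the predictors
true_beta = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ true_beta + rng.normal(size=n)
y = y - y.mean()                           # centering removes the intercept

for lam in [0.0, 50.0, 200.0, 1000.0]:
    n_lasso = int(np.sum(lasso_cd(X, y, lam) != 0))
    n_ridge = int(np.sum(ridge(X, y, lam) != 0))
    print(f"lambda={lam:7.1f}  nonzero lasso={n_lasso}  nonzero ridge={n_ridge}")
```

In this sketch the number of nonzero lasso coefficients falls as $\lambda$ grows, while the ridge count stays at $p$. In practice one would normally use a tested library solver rather than this hand-rolled loop.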

### Another Formulation for Ridge Regression and the Lasso
One can show that the lasso and ridge regression coefficient estimates solve the problems

\begin{equation}\label{6.8}
    {\text{minimize } \atop \beta} \left\{ \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij} \right)^2 \right\} \text{ subject to } \sum^p_{j=1} |\beta_j| \le s
    \tag{6.8}
\end{equation}

and  

\begin{equation}\label{6.9}
    {\text{minimize } \atop \beta} \left\{ \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij} \right)^2 \right\} \text{ subject to } \sum^p_{j=1} \beta_j^2 \le s
    \tag{6.9}
\end{equation}

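respectively. In other words, for every value of $λ$, there is some $s$ such that (\ref{6.7}) and (\ref{6.8}) will give the same lasso coefficient estimates. Similarly, for every value of $λ$ there is a corresponding $s$ such that (\ref{6.5}) and (\ref{6.9}) will give the same ridge regression coefficient estimates.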

We can think of (\ref{6.8}) as follows. When we perform the lasso we are trying to find the set of coefficient estimates that lead to the smallest RSS, subject to the constraint that there is a *budget* $s$ for how large $\sum^p_{j=1} |\beta_j|$ can be. When $s$ is extremely large, then this budget is not very restrictive, and so
the coefficient estimates can be large. In fact, if $s$ is large enough that the least squares solution falls within the budget, then (\ref{6.8}) will simply yield the least squares solution. In contrast, if $s$ is small, then $\sum_{j=1}^p |β_j|$ must be small in order to avoid violating the budget. Similarly, (\ref{6.9}) indicates that when we perform ridge regression we seek a set of coefficient estimates such that the RSS is as small as possible, subject to the requirement that $\sum_{j=1}^p β_j^2$ not exceed the budget $s$.
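
The correspondence between $\lambda$ and the budget $s$ can be made concrete for ridge regression. In the sketch below (synthetic data, hypothetical names), the budget implied by a given $\lambda$ is taken to be $s = \sum_j (\hat{\beta}^R_{\lambda,j})^2$, the squared norm of the penalized solution. This is an illustration under those assumptions, not a derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(size=n)
y = y - y.mean()                    # centered data, so no intercept is needed

def ridge(X, y, lam):
    """Closed-form minimizer of RSS + lam * sum beta_j^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# For each lambda, the implied budget is the squared norm of the solution.
lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
budgets = [float(np.sum(ridge(X, y, lam) ** 2)) for lam in lambdas]
for lam, s in zip(lambdas, budgets):
    print(f"lambda={lam:7.1f}  implied budget s={s:.4f}")
```

A larger $\lambda$ corresponds to a smaller budget, and $\lambda = 0$ corresponds to a budget just large enough to contain the least squares solution.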

The formulations (\ref{6.8}) and (\ref{6.9}) reveal a close connection between the lasso, ridge regression, and best subset selection. Consider the problem

\begin{equation}\label{6.10}
    {\text{minimize } \atop \beta} \left\{ \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij} \right)^2 \right\} \text{ subject to } \sum^p_{j=1} I( \beta_j \neq 0) \le s
    \tag{6.10}
\end{equation}

Here $I( \beta_j \neq 0)$ is an indicator variable, taking the value of 1 if $\beta_j \neq 0$ and zero otherwise. Then (\ref{6.10}) amounts to finding a set of coefficient estimates such that RSS is as small as possible, subject to the constraint that no more than $s$ coefficients can be nonzero. **The problem (\ref{6.10}) is equivalent to best subset selection.** Unfortunately, solving (\ref{6.10}) is computationally infeasible when $p$ is large, since it requires considering all $p \choose s$ models containing $s$ predictors.
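
To get a feel for how quickly the search space grows, the short sketch below counts candidate models. The helper names are hypothetical and the values of $p$ are arbitrary choices for illustration.

```python
from math import comb

def n_models_of_size(p, s):
    """Number of models containing exactly s of the p predictors."""
    return comb(p, s)

def n_models_up_to(p, s):
    """Number of models containing at most s of the p predictors."""
    return sum(comb(p, k) for k in range(s + 1))

for p in (10, 20, 40):
    print(f"p={p:2d}  size p/2: {n_models_of_size(p, p // 2):>15,d}  "
          f"all subsets: {n_models_up_to(p, p):>17,d}")
```

Each of these models would need its own least squares fit, which is why exhaustive search becomes impractical for even moderately large $p$.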

Therefore, we can interpret ridge regression and the lasso as computationally feasible alternatives to best subset selection that replace the intractable form of the budget in (\ref{6.10}) with forms that are much easier to solve. Of course, the lasso is much more closely related to best subset selection, since only the lasso performs feature selection for $s$ sufficiently small in (\ref{6.8}).

### The Variable Selection Property of the Lasso
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? The formulations (\ref{6.8}) and (\ref{6.9}) can be used to shed light on the issue. Figure 6.7 illustrates the situation.

![Lasso and Ridge Regression Contours](./figures/6.7.png)
>**Figure 6.7.** Contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, $|\beta_1 | + |\beta_2 | \le s$ and $\beta_1^2 + \beta_2^2 \le s$, while the red ellipses are the contours of the RSS.

The least squares solution is marked as $\hat{\beta}$, while the blue diamond and circle represent the lasso and ridge regression constraints in (\ref{6.8}) and (\ref{6.9}), respectively.  If s is sufficiently large, then the constraint regions will contain $\hat{\beta}$, and so the ridge regression and lasso estimates will be the same as the least squares estimates. However, in Figure 6.7 the least squares estimates lie outside of the diamond and the circle, and so the least squares estimates are not the same as the lasso and ridge regression estimates.

The ellipses that are centered around $\hat{\beta}$ represent regions of constant RSS. As the ellipses expand away from the least squares coefficient estimates, the RSS increases. Equations (\ref{6.8}) and (\ref{6.9}) indicate that the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero.

**However, the lasso constraint has *corners* at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero.** In higher dimensions, many of the coefficient estimates may equal zero simultaneously. In Figure 6.7, the intersection occurs at $\beta_1 = 0$, and so the resulting model will only include $\beta_2$.

In Figure 6.7, we considered the simple case of $p = 2$. When $p = 3$, then the constraint region for ridge regression becomes a sphere, and the constraint region for the lasso becomes a polyhedron. When $p > 3$, the constraint for ridge regression becomes a hypersphere, and the constraint for the lasso becomes a polytope. However, the key ideas depicted in Figure 6.7 still hold. In particular, the lasso leads to feature selection when $p > 2$ due to the sharp corners of the polyhedron or polytope.

### Comparing the Lasso and Ridge Regression
It is clear that the lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors. However, which method leads to better prediction accuracy?

![variance squared bias and test MSE for figure 6.5 data](./figures/6.8.png)
>**Figure 6.8.** *Left*: Plots of squared bias (black), variance (green), and test MSE
(purple) for the lasso on a simulated data set.
*Right*: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dotted).
Both are plotted against their $R^ 2$ on the training data, as a common form of indexing.
The crosses in both plots indicate the lasso model for which the MSE is smallest.

Figure 6.8 displays the variance, squared bias, and test MSE of the lasso applied to the same simulated data as in Figure 6.5. Clearly the lasso leads to qualitatively similar behavior to ridge regression, in that as $\lambda$ increases, the variance decreases and the bias increases.

In this example, the lasso and ridge regression result in almost identical biases. However, the variance of ridge regression is slightly lower than the variance of the lasso. Consequently, the minimum MSE of ridge regression is slightly smaller than that of the lasso.

However, the data in Figure 6.8 were generated in such a way that all 45 predictors were related to the response—that is, none of the true coefficients $\beta_1, \ldots , \beta_{45}$ equaled zero. The lasso implicitly assumes that a number of the coefficients truly equal zero. Consequently, it is not surprising that ridge regression outperforms the lasso in terms of prediction error in this setting.

![variance squared bias and test MSE for 2 of 45](./figures/6.9.png)
>**Figure 6.9.** *Left*: Plots of squared bias (black), variance (green), and test MSE
(purple) for the lasso. The simulated data is similar to that in Figure 6.8, except
that now only two predictors are related to the response. *Right*: Comparison of
squared bias, variance and test MSE between lasso (solid) and ridge (dotted). Both
are plotted against their $R^2$ on the training data, as a common form of indexing.
The crosses in both plots indicate the lasso model for which the MSE is smallest.

Figure 6.9 illustrates a similar situation, except that now the response is a function of only 2 out of 45 predictors. Now the lasso tends to outperform ridge regression in terms of bias, variance, and MSE.

These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. **In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero.** Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known *a priori* for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.

As with ridge regression, when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. Unlike ridge regression, the lasso performs variable selection, and hence results in models that are easier to interpret.

There are very efficient algorithms for fitting both ridge and lasso models; in both cases the entire coefficient paths can be computed with about the same amount of work as a single least squares fit.

### A Simple Special Case for Ridge Regression and the Lasso
In order to obtain a better intuition about the behavior of ridge regression and the lasso, consider a simple special case with $n = p$, and $\boldsymbol{X}$ a diagonal matrix with 1’s on the diagonal and 0’s in all off-diagonal elements. To simplify the problem further, assume also that we are performing regression without an intercept. With these assumptions, the usual least squares problem simplifies to finding $\beta_1, \ldots, \beta_p$ that minimize

\begin{equation}\label{6.11}
    \sum^p_{j=1}(y_j - \beta_j)^2.
    \tag{6.11}
\end{equation}

In this case, the least squares solution is given by

\begin{align*}
    \hat{\beta}_j = y_j
\end{align*}

And in this setting, ridge regression amounts to finding $\beta_1, \ldots , \beta_p$ such that

\begin{equation}\label{6.12}
    \sum^p_{j=1}(y_j - \beta_j)^2 + \lambda \sum^p_{j=1} \beta_j^2
    \tag{6.12}
\end{equation}

is minimized, and the lasso amounts to finding the coefficients such that

\begin{equation}\label{6.13}
    \sum^p_{j=1} (y_j - \beta_j)^2 + \lambda \sum^p_{j=1} | \beta_j |
    \tag{6.13}
\end{equation}

is minimized. One can show that in this setting, the ridge regression estimates take the form

\begin{equation}\label{6.14}
    \hat{\beta}_j^R = y_j/(1+\lambda),
    \tag{6.14}
\end{equation}

and the lasso estimates take the form

\begin{equation}\label{6.15}
    \hat{\beta}_j^L = \begin{cases} y_j - \lambda /2 &\text{ if } y_j > \lambda / 2 \\
    y_j + \lambda /2 &\text{ if } y_j < -\lambda / 2 \\
    0 &\text{ if } |y_j| \le \lambda / 2
    \end{cases}
    \tag{6.15}
\end{equation}

Figure 6.10 displays the situation.

![Ridge and Lasso vs Least Squares](./figures/6.10.png)
>**Figure 6.10.** The ridge regression and lasso coefficient estimates for a simple
setting with $n = p$ and $\boldsymbol{X}$ a diagonal matrix with $1$’s on the diagonal. *Left*: The
ridge regression coefficient estimates are shrunken proportionally towards zero,
relative to the least squares estimates. *Right*: The lasso coefficient estimates are
soft-thresholded towards zero.

We can see that ridge regression and the lasso perform two very different types of shrinkage. In ridge regression,
each least squares coefficient estimate is shrunken by the same proportion. In contrast, the lasso shrinks each least squares coefficient towards zero by a constant amount, $λ / 2$; the least squares coefficients that are less than $λ / 2$ in absolute value are shrunken entirely to zero. The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.
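
The two closed forms are easy to evaluate directly. The sketch below implements (\ref{6.14}) and (\ref{6.15}) and compares them against a brute-force grid search over the one-dimensional objectives in (\ref{6.12}) and (\ref{6.13}). The function names, the example values of $y_j$, and the grid are assumptions chosen for illustration.

```python
import numpy as np

def ridge_special_case(y, lam):
    """Equation (6.14): every coefficient is scaled by 1 / (1 + lam)."""
    return y / (1.0 + lam)

def lasso_special_case(y, lam):
    """Equation (6.15): soft-thresholding at lam / 2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([-3.0, -0.4, 0.0, 0.3, 1.0, 2.5])   # also the least squares estimates
lam = 1.0
print("least squares:", y)
print("ridge        :", ridge_special_case(y, lam))
print("lasso        :", lasso_special_case(y, lam))

# Brute-force check: minimize (y_j - b)^2 + lam * penalty(b) over a fine grid.
grid = np.linspace(-4.0, 4.0, 80001)

def brute_force(yj, lam, penalty):
    objective = (yj - grid) ** 2 + lam * penalty(grid)
    return grid[np.argmin(objective)]

print("grid lasso   :", [round(float(brute_force(v, lam, np.abs)), 3) for v in y])
print("grid ridge   :", [round(float(brute_force(v, lam, np.square)), 3) for v in y])
```

Because the objectives separate across $j$ in this special case, each coefficient can be checked independently, which is what the grid search does.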

In the case of a more general data matrix $\boldsymbol{X}$, the story is a little more complicated than what is depicted in Figure 6.10, but the main ideas still hold approximately: **ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.**

## Selecting the Tuning Parameter
Implementing ridge regression and the lasso requires a method for selecting a value for the tuning parameter $\lambda$, or equivalently, the value of the constraint $s$. Cross-validation provides a simple way to tackle this problem.

We choose a grid of $λ$ values, and compute the cross-validation error for each value of $λ$, as described in Chapter 5. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
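
The procedure can be sketched in a few lines for ridge regression. Everything below is an assumption made for illustration: the synthetic data, the helper names `ridge_fit` and `cv_error`, the choice of five folds, and the logarithmic grid. The predictors are generated on a common scale, so no separate standardization step is shown.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients with an unpenalized intercept, via centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def cv_error(X, y, lam, k=5, seed=0):
    """Mean k-fold cross-validation MSE; the same folds are used for every lam."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        b0, b = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - b0 - X[fold] @ b) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(2)
n, p = 50, 30
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=0.5, size=p) + rng.normal(size=n)

grid = 10.0 ** np.linspace(-2, 3, 30)            # candidate lambda values
cv = [cv_error(X, y, lam) for lam in grid]
best_lam = grid[int(np.argmin(cv))]
intercept, coef = ridge_fit(X, y, best_lam)      # refit on all observations
print(f"selected lambda = {best_lam:.3f}, CV error = {min(cv):.3f}")
```

Fixing the fold assignment across the grid keeps the comparison between $\lambda$ values from being affected by different random splits.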

When the selected value of $\lambda$ is relatively small, as in Figure 6.12, and a wide range of values around the *optimal* $\lambda$ gives very similar error, we might simply use the least squares solution.

![Cross Validation errors for Credit lambda](./figures/6.12.png)
>**Figure 6.12.** *Left*: Cross-validation errors that result from applying ridge
regression to the Credit data set with various values of λ. *Right*: The coefficient
estimates as a function of λ. The vertical dashed lines indicate the value of λ
selected by cross-validation.

---

# Dimension Reduction Methods
We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.

Let $Z_1, Z_2, \ldots , Z_M$ represent $M < p$ linear combinations of our original $p$ predictors. That is,

\begin{equation}\label{6.16}
    Z_m = \sum^p_{j=1} \phi_{jm}X_j
    \tag{6.16}
\end{equation}

for some constants $\phi_{1m} , \phi_{2m}, \ldots , \phi_{pm}$, $m = 1, \ldots , M$. We can then fit the linear regression model

\begin{equation}\label{6.17}
    y_i = \theta_0 + \sum^M_{m=1} \theta_m z_{im} + \epsilon_i, \; i = 1, \ldots, n
    \tag{6.17}
\end{equation}

using least squares. Note that in (\ref{6.17}), the regression coefficients are given by $\theta_0 , \theta_1 , \ldots , \theta_M$. If the constants $\phi_{1m} , \phi_{2m}, \ldots , \phi_{pm}$ are chosen wisely, then such dimension reduction approaches can often outperform least squares regression.

The term *dimension reduction* comes from the fact that this approach reduces the problem of estimating the $p + 1$ coefficients $\beta_0 , \beta_1 , \ldots , \beta_p$ to the simpler problem of estimating the $M + 1$ coefficients $\theta_0, \theta_1 , \ldots , \theta_M$, where $M < p$.  
In other words, the dimension of the problem has been reduced from $p + 1$ to $M + 1$.

Notice that from (\ref{6.16}),

\begin{align*}
    \sum^M_{m=1} \theta_m z_{im} = \sum^M_{m=1}\theta_m \sum^p_{j=1} \phi_{jm}x_{ij} = \sum^p_{j=1}\sum^M_{m=1} \theta_m \phi_{jm}x_{ij} = \sum^p_{j=1}\beta_j x_{ij}
\end{align*}

where

\begin{equation}\label{6.18}
    \beta_j = \sum^M_{m=1} \theta_m \phi_{jm}.
    \tag{6.18}
\end{equation}

Dimension reduction serves to constrain the estimated $\beta_j$ coefficients, since now they must take the form (\ref{6.18}). This constraint on the form of the coefficients has the potential to bias the coefficient estimates. However, in situations where $p$ is large relative to $n$, selecting a value of $M \ll p$ can significantly reduce the variance of the fitted coefficients.
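
The identity behind (\ref{6.18}) can be verified numerically. In the sketch below the matrix `Phi` (holding the $\phi_{jm}$) and the vector `theta` are filled with arbitrary random numbers, purely as an assumption for illustration; any choice would satisfy the same identity.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, M = 20, 6, 2
X = rng.normal(size=(n, p))
Phi = rng.normal(size=(p, M))     # phi_{jm}: column m holds the weights defining Z_m
theta = rng.normal(size=M)        # coefficients of the reduced model (6.17)

Z = X @ Phi                       # (6.16): z_{im} = sum_j phi_{jm} x_{ij}
beta = Phi @ theta                # (6.18): beta_j = sum_m theta_m phi_{jm}

lhs = Z @ theta                   # sum_m theta_m z_{im}
rhs = X @ beta                    # sum_j beta_j x_{ij}
print("identical fitted values:", np.allclose(lhs, rhs))
print("rank of [Phi, beta]    :", np.linalg.matrix_rank(np.column_stack([Phi, beta])))
```

The rank printed on the last line stays at $M$, which reflects the constraint: the implied $\beta$ is confined to the $M$-dimensional space spanned by the columns of `Phi`.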

All dimension reduction methods work in two steps. First, the transformed predictors $Z_1, Z_2, \ldots , Z_M$ are obtained. Second, the model is fit using these $M$ predictors. However, the choice of $Z_1, Z_2, \ldots , Z_M$, or equivalently, the selection of the $\phi_{jm}$’s, can be achieved in different ways. In this chapter, we will consider two approaches for this task: **principal components** and **partial least squares**.

## Principal Components Regression
PCA is discussed in greater detail as a tool for unsupervised learning in Chapter 10. Here we describe its use as a dimension reduction technique for regression.

### An Overview of Principal Components Analysis
PCA is a technique for reducing the dimension of an $n \times p$ data matrix $\boldsymbol{X}$. **The first principal component direction of the data is that along which the observations vary the most**. For instance, consider Figure 6.14, which shows population size (pop) in tens of thousands of people, and ad spending for a particular company (ad) in thousands of dollars, for 100 cities. The green solid line represents the first principal component direction of the data. We can see by eye that this is the direction along which there is the greatest variability in the data.

![Principal Component](./figures/6.14.png)
>**Figure 6.14.** The population size (pop) and ad spending (ad) for 100 different
cities are shown as purple circles. The green solid line indicates the first principal
component, and the blue dashed line indicates the second principal component.

The first principal component is displayed graphically in Figure 6.14, but how can it be summarized mathematically? It is given by the formula

\begin{equation}\label{6.19}
    Z_1 = 0.839 \times (\text{pop} - \overline{\text{pop}}) + 0.544 \times (\text{ad} - \overline{\text{ad}})
    \tag{6.19}
\end{equation}

Here $\phi_{11} = 0.839$ and $\phi_{21} = 0.544$ are the **principal component loadings**, which define the direction referred to above. The idea is that out of every possible linear combination of $\text{pop}$ and $\text{ad}$ such that $\phi_{11}^2 + \phi_{21}^2 = 1$, this particular linear combination yields the highest variance. It is necessary to consider only linear combinations satisfying $\phi_{11}^2 + \phi_{21}^2 = 1$, since otherwise we could increase $\phi_{11}$ and $\phi_{21}$ arbitrarily in order to blow up the variance. In (\ref{6.19}), the two loadings are both positive and have similar size, and so $Z_1$ is almost an average of the two variables.

Since $n = 100$, *pop* and *ad* are vectors of length $100$, and so is $Z_1$ in (\ref{6.19}). For instance,

\begin{equation}\label{6.20}
    z_{i1} = 0.839 \times (\text{pop}_i - \overline{\text{pop}}) + 0.544 \times (\text{ad}_i - \overline{\text{ad}})
    \tag{6.20}
\end{equation}

The values of $z_{11}, \ldots , z_{n1}$ are known as the **principal component scores**, and can be seen in the right-hand panel of Figure 6.15.
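
Loadings and scores of this kind can be computed from the singular value decomposition of the centered data. The sketch below uses made-up values for `pop` and `ad` (not the data behind Figure 6.14), so its loadings will differ from those in (\ref{6.19}). It also checks the defining property by comparing the variance of the scores against the variance along a sweep of other unit-length directions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
pop = rng.normal(40.0, 10.0, size=n)               # hypothetical population sizes
ad = 0.6 * pop + rng.normal(0.0, 3.0, size=n)      # hypothetical, correlated with pop
X = np.column_stack([pop, ad])
Xc = X - X.mean(axis=0)                            # center each variable

# Rows of Vt are the principal component directions (loading vectors).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]                                       # (phi_11, phi_21), unit length
z1 = Xc @ phi1                                     # scores z_11, ..., z_n1

# Variance of the data projected onto many other unit-length directions.
angles = np.linspace(0.0, np.pi, 721)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
variances = (Xc @ directions.T).var(axis=0)

print("loadings:", phi1, " squared norm:", float(phi1 @ phi1))
print("variance of scores:", z1.var(), " best over sweep:", variances.max())
```

The sign of a loading vector returned by an SVD is arbitrary: flipping both loadings flips the sign of every score but describes the same direction.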

![Population and Advertising principal component](./figures/6.15.png)
>**Figure 6.15.** *Left*: The first principal component direction is
shown in green. It is the dimension along which the data vary the most, and it also
defines the line that is closest to all $n$ of the observations. The distances from each
observation to the principal component are represented using the black dashed line
segments. *Right*: The left-hand panel has been rotated so that the
first principal component direction coincides with the x-axis.

There is also another interpretation for PCA: *the first principal component vector defines the line that is as close as possible to the data.* For instance, in Figure 6.14, the first principal component line minimizes the sum of the squared perpendicular distances between each point and the line. These distances are plotted as dashed line segments in the left-hand panel of Figure 6.15, in which the crosses represent the *projection* of each point onto the first principal component line. The first principal component has been chosen so that the projected observations are *as close as possible* to the original observations.

In the right-hand panel of Figure 6.15, the left-hand panel has been rotated so that the first principal component direction coincides with the x-axis. It is possible to show that the first principal component score for the $i$th observation, given in (\ref{6.20}), is the distance in the x-direction of the $i$th cross from zero.

We can think of the values of the principal component $Z_1$ as single-number summaries of the joint *pop* and *ad* budgets for each location. In this example, if $z_{i1} = 0.839 \times (\text{pop}_i - \overline{\text{pop}}) + 0.544 \times (\text{ad}_i - \overline{\text{ad}}) < 0$, then this indicates a city with below-average population size and below-average ad spending. A positive score suggests the opposite.

How well can a single number represent both pop and ad? In this case, Figure 6.14 indicates that pop and ad have approximately a linear relationship, and so we might expect that a single-number summary will work well.

So far we have concentrated on the first principal component. In general, one can construct up to $p$ distinct principal components. The second principal component $Z_2$ is a linear combination of the variables that is uncorrelated with $Z_1$, and has largest variance subject to this constraint. The second principal component direction is illustrated as a dashed blue line in Figure 6.14.

**It turns out that the zero correlation condition of $Z_1$ with $Z_2$ is equivalent to the condition that the direction must be *perpendicular*, or *orthogonal*, to the first principal component direction.** Since the advertising data has two predictors, the first two principal components contain all of the information that is in $\text{pop}$ and $\text{ad}$. However, by construction, the first component will contain the most information.
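
Both properties can be checked on a small example. The sketch below uses synthetic two-variable data (an assumption for illustration), computes both principal components by SVD, and reports the inner product of the two directions, the correlation of the two score vectors, and the total variance before and after the transformation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=n)           # strongly correlated with x1
Xc = np.column_stack([x1, x2])
Xc = Xc - Xc.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                                      # column m holds the scores of Z_m

dot = float(Vt[0] @ Vt[1])                         # inner product of the directions
corr = float(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1])  # correlation of the scores
total_var_X = float(Xc.var(axis=0).sum())
total_var_Z = float(Z.var(axis=0).sum())

print(f"direction inner product: {dot:.2e}")
print(f"score correlation      : {corr:.2e}")
print(f"total variance, X vs Z : {total_var_X:.4f} vs {total_var_Z:.4f}")
print(f"share in first PC      : {Z[:, 0].var() / total_var_Z:.3f}")
```

With two predictors, the two components together carry the same total variance as the original variables, with most of it concentrated in the first.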

Consider, for example, the much larger variability of $z_{i1}$ (the x-axis) versus $z_{i2}$ (the y-axis) in the right-hand panel of Figure 6.15. The fact that the second principal component scores are much closer to zero indicates that this component captures far less information.

With two-dimensional data, such as in our advertising example, we can construct at most two principal components. However, if we had other predictors, such as population age, income level, education, and so forth, then additional components could be constructed. They would successively maximize variance, subject to the constraint of being uncorrelated with the preceding components.

### The Principal Components Regression Approach
The principal components regression (PCR) approach involves constructing the first $M$ principal components, $Z_1 , \ldots , Z_M$, and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response. In other words, **we assume that the directions in which $X_1 , \ldots, X_p$ show the most variation are the directions that are associated with $Y$**. While this assumption is not guaranteed to be true, it often turns out to be a reasonable enough approximation to give good results.

If the assumption underlying PCR holds, then fitting a least squares model to $Z_1 , \ldots , Z_M$ will lead to better results than fitting a least squares model to $X_1, \ldots , X_p$, since most or all of the information in the data that relates to the response is contained in $Z_1 , \ldots, Z_M$, and by estimating only $M \ll p$ coefficients we can mitigate overfitting. In the advertising data, the first principal component explains most of the variance in both *pop* and *ad*, so a principal component regression that uses this single variable to predict some response of interest, such as *sales*, will likely perform quite well.

![Two Datasets](./figures/6.18.png)
>**Figure 6.18.** PCR was applied to two simulated data sets. *Left*: Simulated
data from Figure 6.8. *Right*: Simulated data from Figure 6.9.

Figure 6.18 displays the PCR fits on the simulated data sets from Figures 6.8 and 6.9. Recall that both data sets were generated using $n = 50$ observations and $p = 45$ predictors. However, *while the response in the first data set was a function of all the predictors, the response in the second data set was generated using only two of the predictors*. As more principal components are used in the regression model, the bias decreases, but the variance increases. This results in a typical U-shape for the mean squared error. When $M = p = 45$, then PCR amounts simply to a least squares fit using all of the original predictors.
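
A minimal sketch of PCR, under stated assumptions: the data are synthetic (not the simulations behind Figure 6.18), the helper name `pcr_fit_predict` is hypothetical, and the predictors are standardized before the components are computed. It also illustrates the last point above, that using all $p$ components reproduces the least squares fit.

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_new, M):
    """Regress y on the first M principal component scores, then predict."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    Xs = (X_train - mean) / std                      # standardized predictors
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    Phi = Vt[:M].T                                   # p x M matrix of loadings
    Z = Xs @ Phi                                     # n x M matrix of scores
    # The scores have mean zero, so the intercept is simply the mean of y.
    theta, *_ = np.linalg.lstsq(Z, y_train - y_train.mean(), rcond=None)
    Z_new = ((X_new - mean) / std) @ Phi
    return y_train.mean() + Z_new @ theta

rng = np.random.default_rng(6)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

train_mse = [float(np.mean((y - pcr_fit_predict(X, y, X, M)) ** 2))
             for M in range(1, p + 1)]

# Ordinary least squares with an intercept, for comparison with M = p.
design = np.column_stack([np.ones(n), X])
ols_fit = design @ np.linalg.lstsq(design, y, rcond=None)[0]
print("training MSE by M:", np.round(train_mse, 3))
print("M = p matches least squares:",
      np.allclose(pcr_fit_predict(X, y, X, p), ols_fit))
```

Only the training error is shown here, and it can only decrease as $M$ grows. The U-shape described above concerns test error, which would have to be estimated on held-out data or by cross-validation.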

The figure indicates that performing PCR with an appropriate choice of $M$ can result in a substantial improvement over least squares, especially in the left-hand panel. However, by examining the ridge regression and lasso results in Figures 6.5, 6.8, and 6.9, we see that PCR does not perform as well as the two shrinkage methods in this example.

The relatively worse performance of PCR in Figure 6.18 is a consequence of the fact that the data were generated in such a way that many principal components are required in order to adequately model the response. In contrast, PCR will tend to do well in cases when the first few principal components are sufficient to capture most of the variation in the predictors as well as the relationship with the response.

![PCR on favorable data](./figures/6.19.png)
>**Figure 6.19.** PCR, ridge regression, and the lasso were applied to a simulated
data set in which the first five principal components of X contain all the information
about the response $Y$. In each panel, the irreducible error $\text{Var}(\epsilon)$
is shown as a horizontal dashed line.

The left-hand panel of Figure 6.19 illustrates the results from another simulated data set designed to be more favorable to PCR. Here the response was generated in such a way that it depends exclusively on the first five principal components. Now the bias drops to zero rapidly as $M$, the number of principal components used in PCR, increases. The mean squared error displays a clear minimum at $M = 5$. The right-hand panel of Figure 6.19 displays the results on these data using ridge regression and the lasso. All three methods offer a significant improvement over least squares. However, PCR and ridge regression slightly outperform the lasso.

We note that even though PCR provides a simple way to perform regression using $M < p$ predictors, **it is not a feature selection method**. This is because each of the $M$ principal components used in the regression is a linear combination of all $p$ of the original features.

In PCR, the number of principal components, $M$, is typically chosen by cross-validation. The results of applying PCR to the Credit data set are shown in Figure 6.20. On these data, the lowest cross-validation error occurs when there are $M = 10$ components; this corresponds to almost no dimension reduction at all, since PCR with $M = 11$ is equivalent to simply performing least squares.

![PCR on Credit Data](./figures/6.20.png)
>**Figure 6.20.** *Left*: PCR standardized coefficient estimates on the Credit data
set for different values of $M$.  
*Right*: The ten-fold cross validation MSE obtained using PCR, as a function of $M$.

When performing PCR, we generally recommend **standardizing** each predictor, using (\ref{6.6}), prior to generating the principal components. This standardization ensures that all variables are on the same scale. **In the absence of standardization, the high-variance variables will tend to play a larger role in the principal components obtained, and the scale on which the variables are measured will ultimately have an effect on the final PCR model.** However, if the variables are all measured in the same units (say, kilograms, or inches), then one might choose not to standardize them.
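
The effect of scale is easy to demonstrate. In the sketch below (synthetic data, hypothetical helper names), one variable is multiplied by 1000, as would happen if its units were changed. Without standardization the first principal component direction is taken over almost entirely by that variable; with standardization the direction is unaffected.

```python
import numpy as np

def first_loading(X):
    """First principal component direction of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    v = Vt[0]
    # The sign is arbitrary; make the largest-magnitude entry positive.
    return v if v[np.argmax(np.abs(v))] > 0 else -v

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(7)
n = 200
a = rng.normal(size=n)
b = 0.5 * a + rng.normal(size=n)
X = np.column_stack([a, b])
X_rescaled = np.column_stack([a * 1000.0, b])      # same variable, different units

print("raw          :", first_loading(X), first_loading(X_rescaled))
print("standardized :", first_loading(standardize(X)),
      first_loading(standardize(X_rescaled)))
```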

## Partial Least Squares
The PCR approach that we just described involves identifying linear combinations, or directions, that best represent the predictors $X_1, \ldots, X_p$. These directions are identified in an unsupervised way, since the response $Y$ is not used to help determine the principal component directions. Consequently, PCR suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

A **supervised** alternative to PCR is **partial least squares** (PLS). Like PCR, PLS is a dimension reduction method, which first identifies a new set of features $Z_1 , \ldots, Z_M$ that are linear combinations of the original features, and then fits a linear model via least squares using these $M$ new features. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

**After standardizing the $p$ predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{j1}$ in (\ref{6.16}) equal to the coefficient from the simple linear regression of $Y$ onto $X_j$.** One can show that this coefficient is proportional to the correlation between $Y$ and $X_j$. Hence, in computing $Z_1 = \sum_{j=1}^p \phi_{j1} X_j$, PLS places the highest weight on the variables that are most strongly related
to the response.

Figure 6.21 displays an example of PLS on a synthetic data set with Sales in each of 100 regions as the response, and two predictors: Population Size and Advertising Spending. The solid green line indicates the first PLS direction, while the dotted line shows the first principal component direction.

![PLS](./figures/6.21.png)
>**Figure 6.21.** For the advertising data, the first PLS direction (solid line) and
first PCR direction (dotted line) are shown.

PLS has chosen a direction that has less change in the *ad* dimension per unit change in the *pop* dimension, relative to PCA. This suggests that *pop* is more highly correlated with the response than is *ad*. **The PLS direction does not fit the predictors as closely as does PCA, but it does a better job explaining the response.**

To identify the second PLS direction we first *adjust* each of the variables for $Z_1$, by regressing each variable on $Z_1$ and taking *residuals*. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction. We then compute $Z_2$ using this orthogonalized data in exactly the same fashion as $Z_1$ was computed based on the original data.

This iterative approach can be repeated $M$ times to identify multiple PLS components $Z_1 , \ldots, Z_M$. Finally, at the end of this procedure, we use least squares to fit a linear model to predict $Y$ using $Z_1, \ldots, Z_M$ in exactly the same fashion as for PCR.
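
The steps above can be sketched as follows. This is one possible reading of the procedure, not a reference implementation. It assumes that the weight for each (adjusted) predictor is its inner product with the centered response, which for standardized predictors is proportional to the simple regression coefficient, and that the weight vector is rescaled to unit length, which changes only the scale of $Z_m$. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls_components(X, y, M):
    """Build M PLS score vectors Z_1, ..., Z_M by weighting and adjusting."""
    Xm = (X - X.mean(axis=0)) / X.std(axis=0)      # standardized predictors
    yc = y - y.mean()
    Z = np.empty((X.shape[0], M))
    for m in range(M):
        phi = Xm.T @ yc                            # weight j: inner product with y
        phi = phi / np.linalg.norm(phi)            # rescaling only changes the scale of Z_m
        z = Xm @ phi
        Z[:, m] = z
        # Regress every column on z and keep the residuals.
        Xm = Xm - np.outer(z, z @ Xm) / (z @ z)
    return Z

def pls_fitted(X, y, M):
    """Least squares fit of y on the first M PLS components."""
    Z = pls_components(X, y, M)
    theta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    return y.mean() + Z @ theta

rng = np.random.default_rng(8)
n, p = 60, 4
X = rng.normal(size=(n, p)) + rng.normal(size=(n, 1))   # shared factor: correlated columns
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(size=n)

train_rss = [float(np.sum((y - pls_fitted(X, y, M)) ** 2)) for M in range(1, p + 1)]
print("training RSS by M:", np.round(train_rss, 3))
```

Because each adjusted predictor is orthogonal to the components already extracted, the resulting $Z_1, \ldots, Z_M$ are mutually orthogonal, and with $M = p$ the fit coincides with least squares on the original predictors in this example.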

As with PCR, the number $M$ of partial least squares directions used in PLS is a tuning parameter that is typically chosen by cross-validation. We generally standardize the predictors and response before performing PLS.

In practice, PLS often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.

---

# Considerations in High Dimensions
## High-Dimensional Data
Most traditional statistical techniques for regression and classification are intended for the low-dimensional setting in which $n$, the number of observations, is much greater than $p$, the number of features. In other words, the low-dimensional setting is the one in which $n \gg p$.

It is now commonplace to collect an almost unlimited number of feature measurements ($p$ very large). While $p$ can be extremely large, the number of observations $n$ is often limited due to cost, sample availability, or other considerations.

Data sets containing more features than observations are often referred to as **high-dimensional**. Classical approaches such as least squares linear regression are not appropriate in this setting. Many of the issues that arise in the analysis of high-dimensional data were discussed earlier in this book, since they apply also when $n > p$: these include the role of the bias-variance trade-off and the danger of overfitting.

We have defined the high-dimensional setting as the case where the number of features p is larger than the number of observations n. But the considerations that we will now discuss certainly also apply if p is slightly smaller than n, and are best always kept in mind when performing supervised learning.

## What Goes Wrong in High Dimensions?
Note that in this discussion, we will examine least squares regression. But the same concepts apply to logistic regression, linear discriminant analysis, and other classical statistical approaches.

When the number of features p is as large as, or larger than, the number of observations n, least squares as described in Chapter 3 cannot (or rather, *should not*) be performed. The reason is simple: **regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero**.

Consider the two fits of Figure 6.22, one where $p = 1$ and $n = 20$ and another plot with $p=1$ and $n = 2$.

![Two least squares fits](./figures/6.22.png)
>**Figure 6.22.** *Left*: Least squares regression in the low-dimensional setting.  
*Right*: Least squares regression with $n = 2$ observations and two parameters to be
estimated (an intercept and a coefficient).

When there are only two observations, then regardless of the values of those observations, the regression line will fit the data exactly. This is problematic because this perfect fit will almost certainly lead to overfitting of the data. In other words, though it is possible to perfectly fit the training data in the high-dimensional setting, the resulting linear model will perform extremely poorly on an independent test set, and therefore does not constitute a useful model.

The problem is simple: when $p > n$ or $p \approx n$, a simple least squares regression line is *too flexible* and hence overfits the data.
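
A small simulation makes the point concrete. The sketch below uses made-up data in which the response is pure noise, unrelated to any feature; the sizes, the seed, and the variable names are assumptions for illustration. With $p > n$, least squares still fits the training data perfectly, yet it predicts new data worse than simply predicting zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 30                        # more features than observations
X_train = rng.normal(size=(n, p))
y_train = rng.normal(size=n)         # pure noise: no true relationship
X_test = rng.normal(size=(1000, p))
y_test = rng.normal(size=1000)

# lstsq returns the minimum-norm solution when the system is underdetermined.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((y_train - X_train @ beta) ** 2)
test_mse = np.mean((y_test - X_test @ beta) ** 2)
null_mse = np.mean(y_test ** 2)      # baseline: always predict 0

print(f"training MSE: {train_mse:.2e}")
print(f"test MSE:     {test_mse:.2f}")
print(f"null MSE:     {null_mse:.2f}")
```

The training MSE is zero up to rounding error even though there is nothing to learn, which is exactly why a perfect training fit carries no information in this setting.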

In the section [Choosing the Optimal Model](#Choosing-the-Optimal-Model), we saw a number of approaches for adjusting the training set RSS or $R^2$ in order to account for the number of variables used to fit a least squares model. Unfortunately, the $C_p$, AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating $\hat{\sigma}^2$ is problematic. (For instance, the formula for $\hat{\sigma}^2$ from Chapter 3 yields an estimate $\hat{\sigma}^2 = 0$ in this setting.) Similarly, problems arise in the application of adjusted $R^2$ in the high-dimensional setting, since one can easily obtain a model with an adjusted $R^2$ value of 1. Clearly, alternative approaches that are better-suited to the high-dimensional setting are required.
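
To see why, it helps to write the quantities out. The display below is a sketch that assumes the usual definitions, with $d$ the number of predictors in the model being scored and $\hat{\sigma}^2$ estimated from the residuals of the full model with $p$ predictors; the notation may differ slightly from the earlier sections.

$$
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - p - 1}, \qquad
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right), \qquad
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}
$$

Once a model interpolates the training data, $\mathrm{RSS} = 0$. The numerator of $\hat{\sigma}^2$ then vanishes (and for $p \geq n - 1$ its denominator is no longer positive), so the penalty term $2 d \hat{\sigma}^2$ disappears and $C_p$ can no longer penalize model size. Likewise, any interpolating model with $d < n - 1$ has an adjusted $R^2$ of exactly 1.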

## Regression in High Dimensions
It turns out that many of the methods seen in this chapter for fitting *less flexible* least squares models, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, are particularly useful for performing regression in the high-dimensional setting.

Figure 6.24 illustrates the performance of the lasso in a simple simulated example.

![Lasso on p 20 50 2000](./figures/6.24.png)
>**Figure 6.24.** The lasso was performed with n = 100 observations and three
values of p, the number of features. Of the p features, 20 were associated with
the response. The boxplots show the test MSEs that result using three different
values of the tuning parameter λ in (\ref{6.7}).  
For ease of interpretation, rather than reporting λ,
the degrees of freedom are reported; for the lasso this turns out
to be simply the number of estimated non-zero coefficients.  
When p = 20, the lowest test MSE was obtained with the smallest amount of regularization.  
When p = 50, the lowest test MSE was achieved when there is a substantial amount of regularization.  
When p = 2,000 the lasso performed poorly regardless of the amount of regularization,
due to the fact that only 20 of the 2,000 features truly are associated with the outcome.

Figure 6.24 highlights three important points: (1) regularization or shrinkage plays a key role in high-dimensional problems, (2) appropriate tuning parameter selection is crucial for good predictive performance, and (3) the test error tends to increase as the dimensionality of the problem (i.e. the number of features or predictors) increases, unless the additional features are truly associated with the response.

The third point above is in fact a key principle in the analysis of high-dimensional data, which is known as the **curse of dimensionality**. In general, *adding additional signal features that are truly associated with the response will improve the fitted model*, in the sense of leading to a reduction in test set error. However, *adding noise features that are not truly associated with the response will lead to a deterioration in the fitted model*, and consequently an increased test set error. **This is because noise features increase the dimensionality of the problem, exacerbating the risk of overfitting (since noise features may be assigned nonzero coefficients due to chance associations with the response on the training set) without any potential upside in terms of improved test set error.**
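
The effect of noise features can be checked with a short simulation. Note that this sketch uses plain least squares rather than the lasso of Figure 6.24, and the sample sizes, coefficients, and function name are assumptions made for the example. Five features carry signal; the remaining columns are noise that the response does not depend on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_signal = 100, 2000, 5
beta_signal = np.ones(n_signal)      # hypothetical true coefficients

def avg_test_mse(n_noise, reps=20):
    """Average test MSE of least squares with n_noise extra noise features."""
    p = n_signal + n_noise
    mses = []
    for _ in range(reps):
        X_tr = rng.normal(size=(n_train, p))
        X_te = rng.normal(size=(n_test, p))
        # Only the first n_signal columns influence the response.
        y_tr = X_tr[:, :n_signal] @ beta_signal + rng.normal(size=n_train)
        y_te = X_te[:, :n_signal] @ beta_signal + rng.normal(size=n_test)
        beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
        mses.append(np.mean((y_te - X_te @ beta_hat) ** 2))
    return float(np.mean(mses))

results = {k: avg_test_mse(k) for k in (0, 20, 60)}
for k, mse in results.items():
    print(f"{k:2d} noise features: average test MSE = {mse:.2f}")
```

The signal is identical in all three settings, so any increase in test error comes purely from the variance of estimating coefficients for features that do not matter.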

Thus, we see that new technologies that allow for the collection of measurements for thousands or millions of features are a double-edged sword: they can lead to improved predictive models if these features are in fact relevant to the problem at hand, but will lead to worse results if the features are not relevant. Even if they are relevant, *the variance incurred in fitting their coefficients may outweigh the reduction in bias that they bring.*

## Interpreting Results in High Dimensions
When we perform the lasso, ridge regression, or other regression procedures in the high-dimensional setting, we must be quite cautious in the way that we report the results obtained. In Chapter 3, we learned about *multicollinearity*, the concept that the variables in a regression might be correlated with each other. In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables in the model.

Essentially, this means that we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients for use in the regression. At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome.
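
The extreme multicollinearity described above is easy to verify numerically. In this sketch (random data, with sizes and names chosen only for illustration), one feature is regressed on all of the others when $p > n$, and the residuals come out as zero: the feature is an exact linear combination of the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 15                        # p > n
X = rng.normal(size=(n, p))

# Regress the first feature on all the others (no intercept, for simplicity).
target, others = X[:, 0], X[:, 1:]
coef, *_ = np.linalg.lstsq(others, target, rcond=None)

max_abs_residual = np.max(np.abs(target - others @ coef))
print(f"largest absolute residual: {max_abs_residual:.2e}")
```

The features here were generated independently, so the perfect fit reflects only the geometry of having more columns than rows, not any real relationship among the variables.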

For instance, suppose that we are trying to predict blood pressure on the basis of half a million *single nucleotide polymorphisms* (SNPs), and that forward stepwise selection indicates that 17 of those SNPs lead to a good predictive model on the training data. It would be incorrect to conclude that these 17 SNPs predict blood pressure more effectively than the other SNPs not included in the model. If we were to obtain an independent data set and perform forward stepwise selection on that data set, we would likely obtain a model containing a different, and perhaps even non-overlapping, set of SNPs. This does not detract from the value of the model obtained—for instance, the model might turn out to be very effective in predicting blood pressure on an independent set of patients, and might be clinically useful for physicians. But we must be careful not to overstate the results obtained, and to make it clear that **what we have identified is simply one of many possible models for predicting blood pressure, and that it must be further validated on independent data sets.**

It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when $p > n$, it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, $R^2$ statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting. **It is important to instead report results on an independent test set, or cross-validation errors.** For instance, the MSE or $R^2$ on an independent test set is a valid measure of model fit, but the MSE on the training set certainly is not.
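
As a rough illustration of reporting cross-validation error instead of training error, the sketch below fits ridge regression with $p > n$ and compares the two. The helper names `ridge_fit` and `cv_mse`, the simulated data, and the value of the penalty are all assumptions for the example, not a recommended workflow.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients for centered data: solve (X'X + lam*I) b = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5, seed=0):
    """k-fold cross-validated MSE for ridge regression."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # Center using statistics from the training folds only.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        b = ridge_fit(X[train] - x_mean, y[train] - y_mean, lam)
        pred = (X[fold] - x_mean) @ b + y_mean
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Hypothetical high-dimensional data: 5 signal features out of 200.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)

lam = 1e-3                           # almost no shrinkage
Xc, yc = X - X.mean(axis=0), y - y.mean()
train_mse = float(np.mean((yc - Xc @ ridge_fit(Xc, yc, lam)) ** 2))
cv_error = cv_mse(X, y, lam)

print(f"training MSE:        {train_mse:.2e}")
print(f"cross-validated MSE: {cv_error:.2f}")
```

With so little shrinkage the model nearly interpolates the training data, so the training MSE is essentially zero while the cross-validated MSE gives a far less flattering, and far more honest, picture.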

---
# End Chapter