# Contents

1. [The Basics of Decision Trees](#The-Basics-of-Decision-Trees)
2. [Bagging, Random Forests, Boosting](#Bagging,-Random-Forests,-Boosting)

---

# The Basics of Decision Trees
## Regression Trees
In order to motivate regression trees, we begin with a simple example.
### Predicting Baseball Players’ Salaries Using Regression Trees
We use the Hitters data set to predict a baseball player’s Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)

![Hitters Tree](./figures/8.1.png)
>**Figure 8.1.** For the Hitters data, a regression tree for predicting the log
salary of a baseball player, based on the number of years that he has played in
the major leagues and the number of hits that he made in the previous year. At a
given internal node, the label (of the form $X_j < t_k$) indicates the left-hand branch
emanating from that split, and the right-hand branch corresponds to $X_j \ge t_k$.
For instance, the split at the top of the tree results in two large branches. The
left-hand branch corresponds to $\text{Years }<4.5$, and the right-hand branch corresponds
to $\text{Years } \ge 4.5$. The tree has two internal nodes and three terminal nodes, or
leaves. The number in each leaf is the mean of the response for the observations
that fall there.

Figure 8.1 shows a regression tree fit to this data. It consists of a series of splitting rules, starting at the top of the tree. Overall, the tree stratifies or segments the players into three regions of predictor space: players who have played for four or fewer years, players who have played for five or more years and who made fewer than 118 hits last year, and players who have played for five or more years and who made at least 118 hits last year.

These three regions can be written as $R_1 =\{X | \text{ Years }<4.5\}, R_2 =\{X | \text{ Years } \ge 4.5, \text{ Hits }<117.5\},$ and $R_3 =\{X | \text{ Years } \ge 4.5, \text{ Hits }  \ge 117.5\}$. Figure 8.2 illustrates the regions as a function of Years and Hits. The predicted salaries for these three groups are $\$1,000 \times e^{5.107} =\$165,174$, $\$1,000 \times e^{5.999} =\$402,834$, and $\$1,000 \times e^{6.740} =\$845,346$ respectively.
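The three regions and their leaf means can be written as a small prediction rule. The sketch below is illustrative only: the function names are hypothetical, and the leaf values are the rounded means quoted above, so the dollar amounts it produces may differ slightly from figures computed with unrounded means.

```python
import math

# Leaf means of log(Salary), with Salary in thousands of dollars,
# as quoted in the text for the tree in Figure 8.1 (rounded values).
LEAF_MEANS = {"R1": 5.107, "R2": 5.999, "R3": 6.740}


def region(years, hits):
    """Return the region label for a player, following the two splits."""
    if years < 4.5:
        return "R1"
    return "R2" if hits < 117.5 else "R3"


def predict_log_salary(years, hits):
    """Predicted log salary: the mean response of the player's region."""
    return LEAF_MEANS[region(years, hits)]


def predict_salary_dollars(years, hits):
    """Undo the log transform and convert thousands of dollars to dollars."""
    return 1000.0 * math.exp(predict_log_salary(years, hits))
```

Note that `hits` is never consulted for players with `years < 4.5`, which mirrors the structure of the tree: the second split applies only to the more experienced players.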

We might interpret the regression tree displayed in Figure 8.1 as follows: **Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players.**

![Three Tree Regions](./figures/8.2.png)
>**Figure 8.2.** The three-region partition for the Hitters data set from the
regression tree illustrated in Figure 8.1.

The regression tree shown in Figure 8.1 is likely an over-simplification of the true relationship between
Hits, Years, and Salary. However, it has advantages over other types of regression models (such as those seen in Chapters 3 and 6): it is easier to interpret, and has a nice graphical representation.

## Prediction via Stratification of the Feature Space
We now discuss the process of building a regression tree. Roughly speaking, there are two steps.

1. We divide the predictor space—that is, the set of possible values for $X_1, X_2, \ldots, X_p$ —into $J$ distinct and non-overlapping regions, $R_1 , R_2, \ldots , R_J$.
2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.

For instance, suppose that in Step 1 we obtain two regions, $R_1$ and $R_2$, and that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20. Then for a given observation $X = x$, if $x \in R_1$ we will predict a value of 10, and if $x \in R_2$ we will predict a value of 20.

How do we construct the regions $R_1 , \ldots, R_J$? In theory, the regions could have any shape. However, we
choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes $R_1, \ldots, R_J$ that minimize the RSS, given by

\begin{equation}\label{8.1}
    \sum^J_{j=1} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
    \tag{8.1}
\end{equation}

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box.

Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes. For this reason, we take a *top-down*, *greedy* approach that is known as **recursive binary splitting**. The approach is *top-down* because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is *greedy* because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

In order to perform recursive binary splitting, **we first select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $\{ X | X_j < s\}$ and $\{X | X_j \ge s\}$ leads to the greatest possible reduction in RSS**. That is, we consider all predictors $X_1 , \ldots , X_p$, and all possible values of the cutpoint $s$ for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS. In greater detail, for any $j$ and $s$, we define the pair of half-planes

\begin{equation}\label{8.2}
    R_1(j,s) = \{ X | X_j < s \} \text{ and } R_2(j,s) = \{ X | X_j \ge s \}
    \tag{8.2}
\end{equation}

and we seek the values of $j$ and $s$ that minimize

\begin{equation}\label{8.3}
    \sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2 (j,s)} (y_i - \hat{y}_{R_2})^2
    \tag{8.3}
\end{equation}

where $\hat{y}_{R_1}$ is the mean response for the training observations in $R_1(j,s)$, and $\hat{y}_{R_2}$ is the mean response for the training observations in $R_2(j, s)$. Finding the values of $j$ and $s$ that minimize (\ref{8.3}) can be done quite quickly, especially when the number of features $p$ is not too large.
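A single step of recursive binary splitting can be sketched as an exhaustive search over predictors and cutpoints. This is a minimal illustration rather than an optimized implementation; the function names are hypothetical, and taking candidate cutpoints at the midpoints between consecutive distinct values is an assumption (a common choice, not something fixed by the text).

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Search every predictor j and cutpoint s for the split minimizing (8.3).

    X is a list of rows (each a list of p numbers); y is a list of responses.
    Returns (j, s, rss_after_split), or None if no predictor can be split.
    """
    best = None
    p = len(X[0])
    for j in range(p):
        # Assumed candidate cutpoints: midpoints between distinct sorted values.
        distinct = sorted(set(row[j] for row in X))
        for lo, hi in zip(distinct, distinct[1:]):
            s = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] < s]    # R_1(j, s)
            right = [yi for row, yi in zip(X, y) if row[j] >= s]  # R_2(j, s)
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data: the response jumps when the first predictor exceeds about 6.
X_toy = [[1, 5], [2, 3], [3, 8], [10, 4], [11, 7], [12, 2]]
y_toy = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
split = best_split(X_toy, y_toy)
```

On this toy data the search selects the first predictor with a cutpoint of 6.5, since that split separates the two groups of responses.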

Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.

Once the regions $R_1, \ldots, R_J$ have been created, we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs. A five-region example of this approach is shown in Figure 8.3.

### Tree Pruning
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. This is because the resulting tree might be too complex. A smaller tree with fewer splits (that is, fewer regions $R_1, \ldots, R_J$) might lead to lower variance and better interpretation at the cost of a little bias.

One possible alternative to the process described above is to build the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold. This strategy will result in smaller trees, but is too short-sighted since a seemingly worthless split early on in the tree might be followed by a very good split—that is, a split that leads to a large reduction in RSS later on.

Therefore, a better strategy is to grow a very large tree $T_0$, and then prune it back in order to obtain a *subtree*. How do we determine the best way to prune the tree? Intuitively, our goal is to select a subtree that leads to the lowest test error rate. Given a subtree, we can estimate its test error using cross-validation or the validation set approach. However, **estimating the cross-validation error for every possible subtree would be too cumbersome, since there is an extremely large number of possible subtrees.** Instead, we need a way to select a small set of subtrees for consideration.

*Cost complexity pruning*—also known as *weakest link pruning*—gives us a way to do just this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$.

For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that

\begin{equation}\label{8.4}
    \sum^{|T|}_{m=1} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T|
    \tag{8.4}
\end{equation}

is as small as possible. Here $|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the predicted response associated with $R_m$—that is, the mean of the training observations in $R_m$.

The tuning parameter $\alpha$ controls a trade-off between the subtree’s complexity and its fit to the training data. When $\alpha = 0$, then the subtree $T$ will simply equal $T_0$, because then (\ref{8.4}) just measures the training error. However, as $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (\ref{8.4}) will tend to be minimized for a smaller subtree.
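The trade-off in (8.4) can be made concrete with a small sketch. The candidate subtrees below are made-up (number of terminal nodes, training RSS) pairs chosen purely for illustration, and the function names are hypothetical; a real implementation would derive the candidates from $T_0$ by weakest link pruning.

```python
def penalized_cost(training_rss, n_leaves, alpha):
    """Criterion (8.4): training RSS plus alpha times the number of leaves."""
    return training_rss + alpha * n_leaves


def select_subtree(candidates, alpha):
    """Pick the (n_leaves, training_rss) pair that minimizes (8.4)."""
    return min(candidates, key=lambda c: penalized_cost(c[1], c[0], alpha))


# Hypothetical nested subtrees: larger trees fit the training data better.
candidates = [(1, 100.0), (2, 60.0), (3, 40.0), (5, 34.0), (8, 30.0)]

chosen_sizes = [select_subtree(candidates, a)[0] for a in (0.0, 2.0, 5.0, 50.0)]
```

With $\alpha = 0$ the largest tree wins because only training error counts; as $\alpha$ grows, the selected subtree shrinks until only the root remains.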

It turns out that as we increase $\alpha$ from zero in (\ref{8.4}), branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of $\alpha$ is easy. **We can select a value of $\alpha$ using a validation set or using cross-validation.** We then return to the full data set and obtain the subtree corresponding to $\alpha$. This process is summarized in Algorithm 8.1.

**Algorithm 8.1 -** *Building a Regression Tree*

>1. Use recursive binary splitting to grow a large tree on the training
data, stopping only when each terminal node has fewer than some
minimum number of observations.
>2. Apply cost complexity pruning to the large tree in order to obtain a
sequence of best subtrees, as a function of $\alpha$.
>3. Use $K$-fold cross-validation to choose $\alpha$. That is, divide the training
observations into $K$ folds. For each $k = 1, \ldots, K$:
>
>    (a) Repeat Steps 1 and 2 on all but the $k$th fold of the training data.
>
>    (b) Evaluate the mean squared prediction error on the data in the left-out $k$th fold, as a function of $\alpha$.  
> Average the results for each value of $\alpha$, and pick $\alpha$ to minimize the
average error.
>
> 4. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$.

Figures 8.4 and 8.5 display the results of fitting and pruning a regression tree on the Hitters data, using nine of the features. First, we randomly divided the data set in half, yielding 132 observations in the training set
and 131 observations in the test set. We then built a large regression tree on the training data and varied $\alpha$ in (\ref{8.4}) in order to create subtrees with different numbers of terminal nodes. Finally, we performed six-fold cross-validation in order to estimate the cross-validated MSE of the trees as a function of $\alpha$. The unpruned regression tree is shown in Figure 8.4.

![Unpruned Tree](./figures/8.4.png)
>**Figure 8.4.** Regression tree analysis for the Hitters data. The unpruned tree
that results from top-down greedy splitting on the training data is shown.

![Mean Squared Error and Tree Size](./figures/8.5.png)
>**Figure 8.5.** Regression tree analysis for the Hitters data. The training,
cross-validation, and test MSE are shown as a function of the number of terminal nodes in the pruned tree. Standard error bands are displayed. The minimum cross-validation error occurs at a tree size of three.
The pruned tree containing three terminal nodes is shown in Figure 8.1.

## Classification Trees
Recall that for a regression tree, the predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node. In contrast, **for a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs**. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.

The task of growing a classification tree is quite similar to the task of growing a regression tree. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting, **RSS cannot be used as a criterion for making the binary splits**.

A natural alternative to RSS is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class:

\begin{equation}\label{8.5}
    E = 1 - \max_{k}(\hat{p}_{mk}).
    \tag{8.5}
\end{equation}

Here $\hat{p}_{mk}$ represents the *proportion of training observations in the $m$th region that are from the $k$th class*. **However, it turns out that classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable.**

The **Gini index** is defined by

\begin{equation}\label{8.6}
    G = \sum^K_{k=1} \hat{p}_{mk}(1-\hat{p}_{mk}),
    \tag{8.6}
\end{equation}

a measure of total variance across the *K* classes. It is not hard to see that the Gini index takes on a small value if all of the $\hat{p}_{mk}$’s are close to zero or one. For this reason the Gini index is referred to as a measure of node purity—a small value indicates that a node contains predominantly observations from a single class.

An alternative to the Gini index is **entropy**, given by

\begin{equation}\label{8.7}
    D = - \sum^K_{k=1} \hat{p}_{mk} \log \hat{p}_{mk}.
    \tag{8.7}
\end{equation}

Since $0 \le \hat{p}_{mk} \le 1$, it follows that $0 \le - \hat{p}_{mk} \log \hat{p}_{mk}$. One can show that
the entropy will take on a value near zero if the $\hat{p}_{mk}$’s are all near zero or near one. Therefore, like the Gini index, the entropy will take on a small value if the $m$th node is pure.

When building a classification tree, either the Gini index or the entropy is typically used to evaluate the quality of a particular split, since these two measures are more sensitive to node purity than is the classification error rate.
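The three node-level measures (8.5), (8.6), and (8.7) are short enough to write out directly. The sketch below takes the class proportions $\hat{p}_{mk}$ of a single node as a list; the function names are hypothetical, and treating $0 \log 0$ as $0$ in the entropy is the usual convention, assumed here.

```python
import math


def classification_error(props):
    """Equation (8.5): one minus the largest class proportion."""
    return 1.0 - max(props)


def gini(props):
    """Equation (8.6): sum over classes of p * (1 - p)."""
    return sum(p * (1.0 - p) for p in props)


def entropy(props):
    """Equation (8.7), using the natural log and skipping empty classes."""
    return -sum(p * math.log(p) for p in props if p > 0)


pure_node = [1.0, 0.0]    # all observations from one class
mixed_node = [0.5, 0.5]   # evenly split between two classes
```

All three measures are zero for `pure_node` and take their largest two-class values for `mixed_node`.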

**Any of these three approaches might be used when *pruning* the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal**.

Figure 8.6 shows an example on the Heart data set. These data contain a binary outcome `HD` for 303 patients who presented with chest pain. An outcome value of Yes indicates the presence of heart disease based on an angiographic test, while No means no heart disease.

There are 13 predictors including `Age`, `Sex`, `Chol` (a cholesterol measurement), and other heart and lung function measurements. Cross-validation results in a tree with six terminal nodes.

![Heart Data](./figures/8.6.png)
>**Figure 8.6.** Heart data. *Top*: The unpruned tree.  
*Bottom Left*: Cross-validation error, training, and test error, for different sizes of the pruned tree.  
*Bottom Right*: The pruned tree corresponding to the minimal cross-validation error.

In our discussion thus far, we have assumed that the predictor variables take on continuous values. However, decision trees can be constructed even in the presence of qualitative predictor variables. For instance, in the Heart data, some of the predictors, such as `Sex`, `Thal` (Thallium stress test), and `ChestPain`, are qualitative. Therefore, a split on one of these variables amounts to assigning some of the qualitative values to one branch and assigning the remaining to the other branch.

For instance, the text `Thal:a` indicates that the left-hand branch coming out of that node consists of observations with the *first value* of the `Thal` variable (normal), and the right-hand node consists of the *remaining observations* (fixed or reversible defects). The text `ChestPain:bc` two splits down the tree on the left indicates that the left-hand branch coming out of that node consists of observations with the second and third values of the `ChestPain` variable, where the possible values are typical angina, atypical angina, non-anginal pain, and asymptomatic.

Figure 8.6 has a surprising characteristic: **some of the splits yield two terminal nodes that have the same predicted value**. For instance, consider the split `RestECG<1` near the bottom right of the unpruned tree. Regardless of the value of `RestECG`, a response value of Yes is predicted for those observations. Why, then, is the split performed at all? The split is performed because it leads to increased *node purity*. That is, all 9 of the observations corresponding to the right-hand leaf have a response value of `Yes`, whereas 7/11 of those corresponding to the left-hand leaf have a response value of `Yes`.

Why is node purity important? Suppose that we have a test observation that belongs to the region given by that right-hand leaf. Then we can be pretty certain that its response value is `Yes`. In contrast, if a test observation belongs to the region given by the left-hand leaf, then its response value is probably `Yes`, but we are much less certain. **Even though the split `RestECG<1` does not reduce the classification error, it improves the Gini index and the entropy, which are more sensitive to node purity**.
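The `RestECG<1` example can be checked numerically using the leaf counts quoted above (7 of 11 `Yes` on the left, 9 of 9 `Yes` on the right, hence 16 of 20 `Yes` before the split). The sketch below is illustrative; the helper names are hypothetical, and each child is weighted by its share of the observations, which is the usual way to score a split.

```python
import math


def error_from_counts(counts):
    """Classification error rate of a node given its class counts."""
    return 1.0 - max(counts) / sum(counts)


def gini_from_counts(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return sum((c / n) * (1.0 - c / n) for c in counts)


def entropy_from_counts(counts):
    """Entropy of a node given its class counts (0 log 0 taken as 0)."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def weighted(measure, children):
    """Average a node measure over child nodes, weighted by node size."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * measure(c) for c in children)


# Class counts as (Yes, No), taken from the leaf counts quoted in the text.
left, right = (7, 4), (9, 0)
parent = (left[0] + right[0], left[1] + right[1])   # (16, 4)

error_before = error_from_counts(parent)
error_after = weighted(error_from_counts, [left, right])
gini_before = gini_from_counts(parent)
gini_after = weighted(gini_from_counts, [left, right])
entropy_before = entropy_from_counts(parent)
entropy_after = weighted(entropy_from_counts, [left, right])
```

The classification error is 0.2 both before and after the split, while the Gini index falls from 0.32 to roughly 0.255 and the entropy also decreases. This is why the split is made even though it does not change any predicted class.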

## Trees Versus Linear Models
Regression and classification trees have a very different flavor from the more classical approaches for regression and classification presented in Chapters 3 and 4. In particular, linear regression assumes a model of the form

\begin{equation}\label{8.8}
    f(X) = \beta_0 + \sum^p_{j=1} X_j \beta_j
    \tag{8.8}
\end{equation}

whereas regression trees assume a model of the form

\begin{equation}\label{8.9}
    f(X) = \sum^{M}_{m=1} c_m \cdot 1_{(X \in R_m)}
    \tag{8.9}
\end{equation}

where $R_1, \ldots, R_M$ represent a partition of feature space, as in Figure 8.3.

If the relationship between the features and the response is well approximated by a linear model as in (\ref{8.8}), then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure.

**If instead there is a highly non-linear and complex relationship between the features and the response as indicated by model (\ref{8.9}), then decision trees may outperform classical approaches**. An illustrative example is displayed in Figure 8.7. The relative performances of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation set approach.

![Linear and Nonlinear decision boundaries](./figures/8.7.png)
>**Figure 8.7.** *Top Row*: A two-dimensional classification example in which
the true decision boundary is linear, and is indicated by the shaded regions.
A classical approach that assumes a linear boundary (left) will outperform a decision
tree that performs splits parallel to the axes (right). 
*Bottom Row*: Here the true decision boundary is non-linear. Here a linear model is unable to capture
the true decision boundary (left), whereas a decision tree is successful (right).

## Advantages and Disadvantages of Trees
Decision trees for regression and classification have a number of advantages over the more classical approaches:
#### Pros
- Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
- Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
-  Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
- Trees can easily handle qualitative predictors without the need to create dummy variables.

#### Cons
- Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.
- Additionally, trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree.

However, by aggregating many decision trees, using methods like *bagging*, *random forests*, and *boosting*, the predictive performance of trees can be substantially improved.

---

# Bagging, Random Forests, Boosting
## Bagging
The decision trees discussed in [Section 8.1](#The-Basics-of-Decision-Trees) suffer from *high variance*. This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.

In contrast, linear regression tends to have *low variance* if the ratio of $n$ to $p$ is moderately large. **Bootstrap aggregation**, or **bagging**, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.

Recall that given a set of $n$ independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$. In other words, averaging a set of observations reduces variance. **Hence a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions**.

In other words, we could calculate $\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)$ using $B$ separate training sets, and average them in order to obtain a single low-variance statistical learning model, given by

\begin{align*}
    \hat{f}_{avg}(x) = \frac{1}{B} \sum^B_{b=1} \hat{f}^b(x)
\end{align*}

Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can **bootstrap**, by taking repeated samples from the (single) training data set. In this approach we generate $B$ different bootstrapped training data sets. We then train our method on the $b$th bootstrapped training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions, to obtain

\begin{align*}
    \hat{f}_{bag}(x) = \frac{1}{B} \sum^B_{b=1} \hat{f}^{*b}(x)
\end{align*}

This is called **bagging**.

To apply bagging to regression trees, we simply construct $B$ regression trees using $B$ bootstrapped training sets, and average the resulting predictions. **These trees are grown deep, and are not pruned.** Hence each individual tree has high variance, but low bias. Averaging these $B$ trees reduces the variance.
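Bagging regression trees can be sketched in a few dozen lines of plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: it handles a single predictor, uses midpoints between distinct values as candidate cutpoints, grows each tree until every leaf holds one observation (or only tied predictor values), and all function names are hypothetical.

```python
import random


def rss(values):
    """Residual sum of squares of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(xs, ys):
    """Grow an unpruned one-predictor regression tree by recursive splitting.

    A leaf is stored as a float (the mean response); an internal node is
    stored as a tuple (cutpoint, left_subtree, right_subtree).
    """
    distinct = sorted(set(xs))
    if len(ys) <= 1 or len(distinct) < 2:
        return sum(ys) / len(ys)
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        s = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        cost = rss(left) + rss(right)
        if best is None or cost < best[0]:
            best = (cost, s)
    s = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < s]
    right = [(x, y) for x, y in zip(xs, ys) if x >= s]
    return (s,
            fit_tree([x for x, _ in left], [y for _, y in left]),
            fit_tree([x for x, _ in right], [y for _, y in right]))


def predict_tree(tree, x):
    """Follow the splits down to a leaf and return its mean response."""
    while isinstance(tree, tuple):
        s, left, right = tree
        tree = left if x < s else right
    return tree


def bag(xs, ys, B, seed=0):
    """Fit B deep trees, each on a bootstrap sample of size n."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        trees.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_bagged(trees, x):
    """Average the B individual tree predictions."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)
```

A call such as `predict_bagged(bag(xs, ys, B=100), x0)` then averages 100 deep trees at the point `x0`; the value of `B` here is arbitrary.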

How can bagging be extended to a classification problem where $Y$ is qualitative? The simplest approach is as follows: for a given test observation, we record the class predicted by each of the $B$ trees and take a **majority vote**, so that the overall prediction is the most commonly occurring class among the $B$ predictions.
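The majority vote itself is a one-liner. In this minimal sketch the function name is hypothetical, and ties are broken in favor of the class that appears first in the list, which is an arbitrary choice made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most commonly occurring class among the tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical class predictions from B = 5 trees for one test observation.
votes = ["Yes", "No", "Yes", "Yes", "No"]
overall = majority_vote(votes)
```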

Figure 8.8 shows the results from bagging trees on the Heart data. The test error rate is shown as a function of $B$, the number of trees constructed using bootstrapped training data sets. We see that the bagging test error
rate is slightly lower in this case than the test error rate obtained from a single tree. **The number of trees $B$ is not a critical parameter with bagging; using a very large value of $B$ will not lead to overfitting**. In practice we use a value of $B$ sufficiently large that the error has settled down. Using $B = 100$ is sufficient to achieve good performance in this example.

![Bagging and Random Forest](./figures/8.8.png)
>**Figure 8.8.** Bagging and random forest results for the Heart data. The test
error (black and orange) is shown as a function of $B$, the number of bootstrapped
training sets used. Random forests were applied with $m = \sqrt{p}$. The dashed line
indicates the test error resulting from a single classification tree. The green and
blue traces show the OOB error, which in this case is considerably lower.

### Out-of-Bag Error Estimation
It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach. Recall that the key to bagging is that trees are
repeatedly fit to bootstrapped subsets of the observations. One can show that on average, **each bagged tree makes use of around two-thirds of the observations**. (See Exercise 5.2.) The remaining one-third of the observations not used to fit a given bagged tree are referred to as the **out-of-bag** (OOB) observations.
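The "around two-thirds" figure follows from a short calculation: a bootstrap sample draws $n$ observations with replacement, so a given observation is missed by all $n$ draws with probability $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$ as $n$ grows. The sketch below evaluates this; the function name is hypothetical.

```python
import math


def in_bag_probability(n):
    """Probability that a given observation appears in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


# The limiting value as n grows is 1 - 1/e, a little under two-thirds.
limit = 1.0 - math.exp(-1.0)
probabilities = {n: in_bag_probability(n) for n in (10, 100, 10000)}
```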

We can predict the response for the $i$th observation using each of the trees in which that observation was OOB. **This will yield around $B/3$ predictions for the $i$th observation.** In order to obtain a single prediction for the $i$th observation, we can average these predicted responses (if regression is the goal) or can take a majority vote (if classification is the goal). This leads to a single OOB prediction for the $i$th observation.

An OOB prediction can be obtained in this way for each of the $n$ observations, from which the overall OOB MSE (for a regression problem) or classification error (for a classification problem) can be computed. The resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation.
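The bookkeeping behind the OOB estimate for a regression problem can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the base learner is a one-split tree (a stump) on a single predictor, used only to keep the example short, whereas bagged trees are normally grown deep; all function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree on one predictor (placeholder base learner).

    Returns (cutpoint, left_mean, right_mean); cutpoint is None if no split helps.
    """
    def rss(v):
        m = sum(v) / len(v)
        return sum((a - m) ** 2 for a in v)

    mean = sum(ys) / len(ys)
    best = (rss(ys), None, mean, mean)
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        s = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        cost = rss(left) + rss(right)
        if cost < best[0]:
            best = (cost, s, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def predict_stump(model, x):
    s, left_mean, right_mean = model
    if s is None:
        return left_mean
    return left_mean if x < s else right_mean


def oob_mse(xs, ys, B, seed=0):
    """Return the OOB mean squared error and per-observation OOB counts."""
    rng = random.Random(seed)
    n = len(xs)
    sums = [0.0] * n    # running sum of OOB predictions for each observation
    counts = [0] * n    # number of trees for which each observation was OOB
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:   # predict only with trees that never saw i
                sums[i] += predict_stump(model, xs[i])
                counts[i] += 1
    # Average over observations that were OOB for at least one tree.
    errors = [(ys[i] - sums[i] / counts[i]) ** 2 for i in range(n) if counts[i] > 0]
    return sum(errors) / len(errors), counts
```

For classification, the averaging step would be replaced by a majority vote over the OOB predictions, and the squared error by the misclassification rate.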

It can be shown that with $B$ sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error. The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous.

### Variable Importance Measures
Recall that one of the advantages of decision trees is the attractive and easily interpreted diagram that results, such as the one displayed in Figure 8.1. However, when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure using a single tree, and it is no longer clear which variables are most important to the procedure.

Although the collection of bagged trees is much more difficult to interpret than a single tree, **one can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees)**. In the case of bagging regression trees, we can record the total amount that the RSS (\ref{8.1}) is decreased due to splits over a given predictor, averaged over all $B$ trees. *A large value indicates an important predictor*.

A graphical representation of the variable importances in the Heart data is shown in Figure 8.9. We see the mean decrease in Gini index for each variable, relative to the largest. The variables with the largest mean decrease
in Gini index are `Thal`, `Ca`, and `ChestPain`.

![Variable Importance](./figures/8.9.png)
>**Figure 8.9.** A variable importance plot for the Heart data. Variable
importance is computed using the mean decrease in Gini index, and expressed relative
to the maximum.

## Random Forests
Random forests provide an improvement over bagged trees by way of a small tweak that *decorrelates* the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. **But when building these decision trees, each time a split in a tree is considered, *a random sample of $m$ predictors* is chosen as split candidates from the full set of $p$ predictors**. The split is allowed to use only one of those $m$ predictors. A fresh sample of $m$ predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$; that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (4 out of the 13 for the Heart data).

In other words, in building a random forest, at each split in the tree, the algorithm is *not even allowed to consider* a majority of the available predictors. This may sound crazy, but it has a clever rationale.

Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other.

Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities.

Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average $(p - m)/p$ of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.
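The fraction of splits that never see a particular predictor can be checked with a quick simulation. The sketch below uses the Heart data dimensions ($p = 13$, $m = 4$); the function names, the seed, and the number of simulated splits are arbitrary choices made for illustration.

```python
import random


def fraction_excluding(p, m):
    """Expected fraction of splits whose m candidates omit one given predictor."""
    return (p - m) / p


def sample_split_candidates(p, m, rng):
    """Draw a fresh random subset of m predictor indices for one split."""
    return rng.sample(range(p), m)


rng = random.Random(0)
n_splits = 20000
# Treat predictor 0 as the "strong" predictor and count how often it is left out.
left_out = sum(0 not in sample_split_candidates(13, 4, rng) for _ in range(n_splits))
simulated_fraction = left_out / n_splits
```

With these dimensions the expected fraction is $9/13$, so the strong predictor is unavailable at well over half of the splits.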

#### Random Forests vs Bagging
The main difference between bagging and random forests is the choice of predictor subset size $m$. **For instance, if a random forest is built using $m = p$, then this amounts simply to bagging**. On the Heart data, random forests using $m = \sqrt{p}$ lead to a reduction in both test error and OOB error over bagging (Figure 8.8).

Using a small value of $m$ in building a random forest will typically be helpful when we have a large number of correlated predictors. We applied random forests to a high-dimensional biological data set consisting of expression measurements of 4,718 genes measured on tissue samples from 349 patients. In this data set, each of the patient samples has a qualitative label with 15 different levels: either normal or 1 of 14 different types of cancer. Our goal was to use random forests to predict cancer type based on the 500 genes that have the largest variance in the training set. We randomly divided the observations into a training and a test set, and applied random forests to the training set for three different values of the number of splitting variables $m$. The results are shown in Figure 8.10. As with bagging, random forests will not overfit if we increase $B$, so in practice we use a value of $B$ sufficiently large for the error rate to have settled down.

![Gene Expression using Bagging and Random Forests](./figures/8.10.png)
>**Figure 8.10.** Results from random forests for the 15-class gene expression
data set with $p = 500$ predictors. The test error is displayed as a function of
the number of trees. Each colored line corresponds to a different value of $m$, the
number of predictors available for splitting at each interior tree node. Random
forests ($m < p$) lead to a slight improvement over bagging ($m = p$). A single
classification tree has an error rate of $45.7 \%$.

## Boosting
Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Here we restrict our discussion of boosting to the context of decision trees.

Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees.

Boosting works in a similar way, except that the trees are grown *sequentially*: **each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set**.

Consider first the regression setting. Like bagging, boosting involves combining a large number of decision trees, $\hat{f}^1, \ldots, \hat{f}^B$. Boosting is described in Algorithm 8.2.

**Algorithm 8.2 -** *Boosting for Regression Trees*

>1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set.
>2. For $b = 1, 2, \ldots, B,$ repeat:
>
>    (a) Fit a tree $\hat{f}^b$ with $d$ splits ($d+1$ terminal nodes) to the training data $(X, r)$.
>
>    (b) Update $\hat{f}$ by adding in a shrunken version of the new tree:
\begin{equation}\label{8.10}
    \hat{f}(x) \gets \hat{f}(x) + \lambda \hat{f}^b(x).
    \tag{8.10}
\end{equation}
>
>    (c) Update the residuals,
\begin{equation}\label{8.11}
    r_i \gets r_i - \lambda \hat{f}^b(x_i).
    \tag{8.11}
\end{equation}
>3. Output the boosted model,
\begin{equation}\label{8.12}
    \hat{f}(x) = \sum^B_{b=1} \lambda \hat{f}^b(x).
    \tag{8.12}
\end{equation}

What is the idea behind this procedure? Unlike fitting a single large decision tree to the data, which amounts to *fitting the data hard* and potentially overfitting, the boosting approach instead learns *slowly*. Given the current model, we fit a decision tree to the residuals from the model. **That is, we fit a tree using the current residuals, rather than the outcome $Y$, as the response**. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter $d$ in the algorithm.

By fitting small trees to the residuals, we slowly improve $\hat{f}$ in areas where it does not perform well. The shrinkage parameter $\lambda$ slows the process down even further, allowing more and differently shaped trees to attack the residuals. *In general, statistical learning approaches that learn slowly tend to perform well.* Note that in boosting, unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown.
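
The steps of Algorithm 8.2 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single predictor, restricts each tree to a stump ($d = 1$), and the helper names `fit_stump`, `boost`, and `predict`, the toy data, and the values $B = 200$, $\lambda = 0.1$ are all chosen for the example.

```python
def fit_stump(xs, r):
    """Least-squares tree with one split (d = 1) on a single predictor."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate cutpoint between neighbouring values
        left = [ri for xi, ri in zip(xs, r) if xi < t]
        right = [ri for xi, ri in zip(xs, r) if xi >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or rss < best[0]:
            best = (rss, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr  # predict the mean of the region x falls in


def boost(xs, ys, B=200, lam=0.1):
    """Algorithm 8.2 with stumps: returns the fitted trees and the final residuals."""
    r = list(ys)  # step 1: f_hat = 0, so r_i = y_i
    trees = []
    for _ in range(B):  # step 2
        tree = fit_stump(xs, r)  # (a) fit to (X, r), not (X, y)
        trees.append(tree)  # (b) f_hat gains lam * tree; stored as a list of trees
        r = [ri - lam * tree(xi) for xi, ri in zip(xs, r)]  # (c) update residuals
    return trees, r


def predict(trees, lam, x):
    """Step 3: f_hat(x) is the sum over b of lam * f_hat^b(x)."""
    return sum(lam * tree(x) for tree in trees)


# Hypothetical toy data: a noiseless quadratic in one predictor.
xs = [i / 10 for i in range(50)]
ys = [(x - 2.5) ** 2 for x in xs]
lam = 0.1
trees, r = boost(xs, ys, B=200, lam=lam)
mse_start = sum(y ** 2 for y in ys) / len(ys)
mse_end = sum(ri ** 2 for ri in r) / len(r)
print(f"training MSE: {mse_start:.3f} -> {mse_end:.5f}")
```

Because each stump is fit to the current residuals, the final residuals equal $y_i - \hat{f}(x_i)$, and the training error can only shrink from one iteration to the next. The sketch says nothing about test error, which is what the choice of $B$ should be judged on.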

We have just described the process of boosting regression trees. Boosting classification trees proceeds in a similar but slightly more complex way, and the details are omitted here.

Boosting has three tuning parameters:
1. The number of trees $B$. Unlike bagging and random forests, **boosting can overfit if $B$ is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select $B$**.
2. The shrinkage parameter $\lambda$, a small positive number. **This controls the rate at which boosting learns**. Typical values are $0.01$ or $0.001$, and the right choice can depend on the problem. Very small $\lambda$ can require using a very large value of $B$ in order to achieve good performance.
3. The number $d$ of splits in each tree, which controls the complexity of the boosted ensemble. Often $d = 1$ works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted ensemble is fitting an additive model, since each term involves only a single variable. More generally, $d$ is the interaction depth, and controls the interaction order of the boosted model, since $d$ splits can involve at most $d$ variables.
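
The trade-off between $\lambda$ and $B$ can be made concrete with a back-of-the-envelope calculation. It rests on an idealized assumption that is not part of the text: if every tree reproduced the current residuals exactly, update (8.11) would give $r_i \gets (1 - \lambda) r_i$, so after $B$ trees the residuals would be scaled by $(1 - \lambda)^B$. The function name `trees_needed` and the 1% target are hypothetical.

```python
import math


def trees_needed(lam, target=0.01):
    """Smallest B with (1 - lam)**B <= target, under the idealized assumption above."""
    return math.ceil(math.log(target) / math.log(1 - lam))


for lam in (0.1, 0.01, 0.001):
    print(f"lambda = {lam}: B >= {trees_needed(lam)}")
```

Real trees fit the residuals only partially, so these counts are optimistic. The scaling is the point: under this assumption, dividing $\lambda$ by ten multiplies the required $B$ by roughly ten.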

In Figure 8.11, we applied boosting to the 15-class cancer gene expression data set, in order to develop a classifier that can distinguish the normal class from the 14 cancer classes. We display the test error as a function of the total number of trees and the interaction depth $d$.

We see that simple stumps with an interaction depth of one perform well if enough of them are included. This model outperforms the depth-two model, and both outperform a random forest. This highlights one difference between boosting and random forests: **in boosting, because the growth of a particular tree takes into account the other trees that have already been grown, smaller trees are typically sufficient. Using smaller trees can aid in interpretability as well; for instance, using stumps leads to an additive model**.
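
To see why stumps give an additive model, note that each stump depends on a single predictor, so the boosted sum can be regrouped into one function per predictor, $\hat{f}(x) = \sum_j f_j(x_j)$. The sketch below checks this on a hand-written ensemble. The stumps, the shrinkage value, and the observation are hypothetical values invented for the example, not fitted to any data.

```python
from collections import defaultdict

# Hypothetical boosted stumps: (predictor index j, cutpoint t, left value, right value).
stumps = [
    (0, 4.5, -0.8, 0.3),
    (1, 117.5, -0.2, 0.4),
    (0, 6.5, -0.1, 0.2),
]
lam = 0.1  # hypothetical shrinkage parameter


def stump_value(stump, x):
    """Prediction of one stump: it looks at predictor j only."""
    j, t, left, right = stump
    return left if x[j] < t else right


def ensemble(x):
    """Boosted prediction: the sum of shrunken stumps, as in (8.12)."""
    return sum(lam * stump_value(s, x) for s in stumps)


# Group the stumps by the single predictor each one splits on.
by_predictor = defaultdict(list)
for s in stumps:
    by_predictor[s[0]].append(s)


def component(j, xj):
    """f_j(x_j): total contribution of the stumps that split on predictor j."""
    return sum(lam * (left if xj < t else right) for _, t, left, right in by_predictor[j])


x = [5.0, 120.0]  # hypothetical observation with two predictors
additive = sum(component(j, x[j]) for j in by_predictor)
print(ensemble(x), additive)
```

The two printed values agree up to floating-point rounding. With $d \ge 2$ a single tree can split on two different predictors, and this regrouping into one-variable functions is no longer possible in general.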

![Boosting and Random Forests](./figures/8.11.png)
>**Figure 8.11.** Results from performing boosting and random forests on the
15-class gene expression data set in order to predict cancer versus normal. The
test error is displayed as a function of the number of trees. For the two boosted
models, $\lambda = 0.01$. Depth-1 trees slightly outperform depth-2 trees, and both
outperform the random forest, although the standard errors are around $0.02$, making
none of these differences significant. The test error rate for a single tree is $24\%$.

---

# End Chapter