from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this
HTML("""
<style>
div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.3em;
line-height:1.4em;
padding-left:1.5em;
padding-right:1.5em;
}
</style>
""")

<h1><center>Tree-Based Methods</center></h1>

<b>Tree-based</b> methods <b>stratify</b> or <b>segment</b> the predictor space into a number of simple regions. To make a prediction for a given observation, we use the <b>mean</b> (for regression) or the <b>mode</b> (for classification) of the training observations in the region to which it belongs. Because the set of splitting rules can be summarized in a tree, these methods are also known as <b>decision tree</b> methods. <b>Bagging, random forests</b> and <b>boosting</b> each produce multiple trees and then combine them into a single model to make the prediction. Combining many trees often improves accuracy substantially, at the cost of some interpretability.

### 8.1 The Basics of Decision Trees
#### 8.1.1 Regression Trees
##### Predicting Baseball Players’ Salaries Using Regression Trees

The regression tree that predicts the salaries of baseball players is shown below. It consists of a series of splitting rules that stratify or segment the players into three regions:

$$R_1 = \{X \mid \text{Years} < 4.5\}$$

$$R_2 = \{X \mid \text{Years} \geq 4.5, \ \text{Hits} < 117.5\}$$

$$R_3 = \{X \mid \text{Years} \geq 4.5, \ \text{Hits} \geq 117.5\}$$

<img src="images/decision_tree.PNG"  width="500px">

The regions $R_1, R_2, R_3$ are known as <b>terminal nodes</b> or <b>leaves</b>. The points along the tree where the predictor space is split are called <b>internal nodes</b>, and the segments of the tree that connect the nodes are called <b>branches</b>. A small regression tree like this one is easy to interpret, since every prediction follows from a short sequence of rules.
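
To make the splitting rules concrete, the three regions above can be written as a pair of nested conditions. This is only a sketch: the function name `region` and the sample inputs are illustrative assumptions, and no salary values are attached to the leaves because this section does not state them.

```python
def region(years, hits):
    """Return the leaf (terminal node) a player falls into,
    following the splitting rules for R1, R2 and R3."""
    if years < 4.5:        # first internal node: split on Years
        return "R1"
    if hits < 117.5:       # second internal node: split on Hits
        return "R2"
    return "R3"


# Hypothetical players, chosen only to exercise each branch
print(region(3, 150))   # R1
print(region(6, 100))   # R2
print(region(6, 150))   # R3
```

Note that `Hits` is never consulted for players with fewer than 4.5 years, which mirrors the shape of the tree.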

##### Prediction via Stratification of the Feature Space

A regression tree is built in two steps:

 -  Divide the predictor space, that is, the set of possible values of $X_1, X_2, ..., X_p$, into $J$ <b>distinct</b> and <b>non-overlapping</b> regions $R_1, R_2, ..., R_J$.
 
 - For every observation that falls into region $R_j$, make the same prediction: the mean of the response values of the training observations in $R_j$.
 
In theory, the regions could have any shape. In practice, the predictor space is divided into high-dimensional rectangles, or <b>boxes</b>, for simplicity and ease of interpretation. The goal is to find boxes $R_1, R_2, ..., R_J$ that minimize the residual sum of squares (RSS), given by

$$\sum_{j=1}^{J}\sum_{i \in R_j} (y_i - \widehat{y_{R_j}})^2$$

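where $\widehat{y_{R_j}}$ is the <b>mean</b> response of the training observations in the $j$th box. Considering every possible partition of the predictor space into $J$ boxes is computationally infeasible, so the regions are found by a <b>top-down, greedy approach</b> known as <b>recursive binary splitting</b>. It is top-down because it begins with all observations in a single region and successively splits the predictor space, and greedy because each split is the best one available at that step rather than one chosen with future steps in mind.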
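
As a small illustration of the RSS objective, the sketch below computes it for a partition that has already been fixed. The function name `partition_rss` and the response values are made up for the example; each inner list is assumed to hold the responses $y_i$ of the training observations in one region.

```python
from statistics import mean


def partition_rss(regions):
    """Sum, over all regions, of the squared deviations of each
    response from the mean response of its own region."""
    total = 0.0
    for ys in regions:
        y_hat = mean(ys)  # the prediction for every observation in this region
        total += sum((y - y_hat) ** 2 for y in ys)
    return total


# Made-up responses for two regions, purely illustrative
regions = [[1.0, 3.0], [10.0, 12.0, 14.0]]
print(partition_rss(regions))  # 10.0
```

The first region contributes $1 + 1 = 2$ and the second $4 + 0 + 4 = 8$, which gives the total of 10.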

In recursive binary splitting, we first select a predictor $X_j$ and a <b>cutpoint</b> $s$ such that splitting the predictor space into the regions $\{X|X_j < s\}$ and $\{X|X_j \geq s\}$ leads to the greatest possible reduction in RSS. We consider all predictors $X_1, X_2, ..., X_p$ and all possible cutpoints $s$ for each predictor, and then choose the predictor and cutpoint that give the lowest RSS. More specifically, for any $j$ and $s$, we define the pair of half-planes

$$R_1(j, s) = \{X|X_j < s\}$$

$$R_2(j, s) = \{X|X_j \geq s\}$$

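and we seek the values of $j$ and $s$ that minimize the following, where $\widehat{y_{R_1}}$ and $\widehat{y_{R_2}}$ denote the mean responses of the training observations in $R_1(j, s)$ and $R_2(j, s)$: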

$$\sum_{i: x_i \in R_1(j, s)} (y_i - \widehat{y_{R_1}})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \widehat{y_{R_2}})^2$$
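
The search over $j$ and $s$ can be sketched as an exhaustive loop. This is a minimal illustration rather than a reference implementation: the names `rss` and `best_split` and the toy data are assumptions, and taking candidate cutpoints as midpoints between consecutive distinct values of a predictor is one common choice that the text above does not prescribe.

```python
def rss(ys):
    """Residual sum of squares of the responses around their own mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(X, y):
    """Try every predictor j and candidate cutpoint s, and return the
    (j, s, total_rss) triple with the lowest total RSS.
    Returns None when no predictor has two distinct values."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Assumption: candidate cutpoints are midpoints between neighbours
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]    # R1(j, s)
            right = [yi for row, yi in zip(X, y) if row[j] >= s]  # R2(j, s)
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data: two predictors, four observations (made up for illustration)
X = [[1, 5], [2, 3], [3, 8], [4, 1]]
y = [1.0, 1.2, 5.0, 5.2]
j, s, total = best_split(X, y)
print(j, s)  # 0 2.5
```

Here the first predictor separates the low responses from the high ones cleanly at 2.5, so that split leaves almost no residual variation.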

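Once the best split has been found, the process is repeated, but instead of splitting the entire predictor space we split one of the previously identified regions, again choosing the predictor and cutpoint that give the greatest reduction in RSS. This continues until a stopping criterion is reached (for example, until no region contains more than 5 observations).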
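
Putting the pieces together, recursive binary splitting with the stopping rule mentioned above might look like the sketch below. The names `grow_tree`, `predict` and `max_leaf_size`, the dictionary representation of the tree, the midpoint cutpoints and the toy data are all assumptions made for illustration. Each region is split recursively on its own, which yields the same final tree here because the stopping rule depends only on the size of each region.

```python
def mean(ys):
    return sum(ys) / len(ys)


def rss(ys):
    """Residual sum of squares of the responses around their own mean."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(X, y):
    """Return the (j, s) pair with the lowest total RSS, or None if no split exists."""
    best, best_total = None, None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2  # assumed candidate cutpoint: midpoint
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best_total is None or total < best_total:
                best, best_total = (j, s), total
    return best


def grow_tree(X, y, max_leaf_size=5):
    """Split a region until it holds at most max_leaf_size observations."""
    if len(y) <= max_leaf_size:
        return {"prediction": mean(y)}      # leaf: mean response of the region
    split = best_split(X, y)
    if split is None:                       # all rows identical, cannot split
        return {"prediction": mean(y)}
    j, s = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < s]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= s]
    return {
        "feature": j,
        "cutpoint": s,
        "left": grow_tree([r for r, _ in left], [v for _, v in left], max_leaf_size),
        "right": grow_tree([r for r, _ in right], [v for _, v in right], max_leaf_size),
    }


def predict(tree, row):
    """Follow the splitting rules down to a leaf and return its mean."""
    while "prediction" not in tree:
        side = "left" if row[tree["feature"]] < tree["cutpoint"] else "right"
        tree = tree[side]
    return tree["prediction"]


# Toy data: one predictor, twelve observations (made up for illustration)
X = [[float(i)] for i in range(12)]
y = [1.0] * 6 + [5.0] * 6
tree = grow_tree(X, y)
print(tree["feature"], tree["cutpoint"])            # 0 5.5
print(predict(tree, [2.0]), predict(tree, [9.0]))   # 1.0 5.0
```

Because each half of the toy data still holds six observations after the first split, the stopping rule forces one further split on each side even though it no longer reduces the RSS.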