🆕 Please install the `Graphviz` package as an administrator, e.g., `sudo apt-get install graphviz` on Debian or Ubuntu.

🆕 Please install the `pydot` library as an administrator, i.e., `conda install -c anaconda pydot`.

🆕 Please install the `six` library as an administrator, i.e., `conda install -c anaconda six`.

<h1>The Basics of Decision Trees</h1>

Tree-based methods involve _stratifying_ or _segmenting_ the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the response values for the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as **decision tree** methods.

```
Hitters.csv

Major League Baseball Data from the 1986 and 1987 seasons containing 322 observations of major league players on the following 20 variables:

AtBat     Number of times at bat in 1986
Hits      Number of hits in 1986
HmRun     Number of home runs in 1986
Runs      Number of runs in 1986
RBI       Number of runs batted in in 1986
Walks     Number of walks in 1986
Years     Number of years in the major leagues
CAtBat    Number of times at bat during his career
CHits     Number of hits during his career
CHmRun    Number of home runs during his career
CRuns     Number of runs during his career
CRBI      Number of runs batted in during his career
CWalks    Number of walks during his career
League    A factor with levels A and N indicating player’s league at the end of 1986
Division  A factor with levels E and W indicating player’s division at the end of 1986
PutOuts   Number of put outs in 1986
Assists   Number of assists in 1986
Errors    Number of errors in 1986
Salary    1987 annual salary on opening day in thousands of dollars
NewLeague A factor with levels A and N indicating player’s league at the beginning of 1987
```

In [None]:
import pydot
from IPython.display import Image
from six import StringIO  
from sklearn.tree import export_graphviz

# Build a pydot graph of a fitted tree model; render it with graph.create_png()
def print_tree(estimator, features, class_names=None, filled=True):
    dot_data = StringIO()
    # Write the tree in Graphviz DOT format into an in-memory text buffer
    export_graphviz(estimator, out_file=dot_data, feature_names=features,
                    class_names=class_names, filled=filled)
    # graph_from_dot_data returns a list of graphs; exactly one is expected here
    (graph,) = pydot.graph_from_dot_data(dot_data.getvalue())
    return graph

<h1>Regression Tree: Example</h1>

💻 We first download the `Hitters` data set from the internet and drop all observations with missing values.

In [None]:
import pandas as pd
df = pd.read_csv('https://r-data.pmagunia.com/system/files/datasets/dataset-87300.csv').dropna()

💻 Next, we log-transform `Salary` so that its distribution has more of a typical bell shape. (Recall that `Salary` is measured in thousands of dollars.)

In [None]:
import numpy as np
X = df[['Years', 'Hits']]
y = np.log(df.Salary)

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
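# The seaborn styles were renamed in Matplotlib 3.6; fall back to the old name if needed
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)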

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(11,4))
ax1.hist(df.Salary,color='gold',edgecolor='black',linewidth=1.2,bins=20,density=True)
ax1.set_xlabel('Salary')
ax2.hist(y,color='gold',edgecolor='black',linewidth=1.2,bins=20,density=True)
ax2.set_xlabel('Log(Salary)');

In [None]:
from sklearn.tree import DecisionTreeRegressor
regr = DecisionTreeRegressor(max_leaf_nodes=3)
regr.fit(X, y)

In [None]:
graph = print_tree(regr, features=['Years', 'Hits'])
Image(graph.create_png())

👆🏼 The figure above shows a regression tree fit to this data. It consists of a series of splitting rules, starting at the top of the tree. The top split assigns observations having $\text{Years}\le 4.5$ to the left branch. The predicted salary for these players is obtained from the mean log salary of the 90 players in the data set with $\text{Years}\le 4.5$, i.e., $\$1,000 \times \exp(5.107)\approx \$165,174$.

👆🏼 Players with $\text{Years}>4.5$ are assigned to the right branch, and then that group is further subdivided by $\text{Hits}$.

📝 Overall, the tree stratifies or segments the players into three regions of predictor space: players who have played for four or fewer years, players who have played for five or more years and who made fewer than 118 hits last year, and players who have played for five or more years and who made at least 118 hits last year.

In [None]:
df.plot('Years', 'Hits', kind='scatter', color='orange', figsize=(7,6))
plt.xlim(0,25)
plt.ylim(bottom=-5)  # 'bottom' replaces the older 'ymin' keyword
plt.xticks([1, 4.5, 24])
plt.yticks([1, 117.5, 238])
plt.vlines(4.5, ymin=-5, ymax=250)
plt.hlines(117.5, xmin=4.5, xmax=25)
plt.annotate('R1', xy=(2,117.5), fontsize='xx-large')
plt.annotate('R2', xy=(11,60), fontsize='xx-large')
plt.annotate('R3', xy=(11,170), fontsize='xx-large')

These three regions can be written as
1. $R_{1}=\{\mathrm{X} \mid \text{Years} <4.5\}$,
2. $R_{2}=\{\mathrm{X} \mid \text{Years} \ge 4.5, \text{Hits} <117.5\}$, and
3. $R_{3}=\{\mathrm{X} \mid \text{Years} \ge 4.5, \text{Hits} \ge 117.5\}$.

The predicted salaries for these three groups are
1. $\$1,000 \times \exp(5.107)\approx \$165,174$,
2. $\$1,000 \times \exp(5.998)\approx \$402,834$, and
3. $\$1,000 \times \exp(6.740)\approx \$845,346$, respectively.
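The conversion from leaf values back to dollars can be checked with a few lines. This is a minimal sketch: the leaf means are the rounded values quoted in the text (the dictionary `leaf_log_means` is a name introduced here for illustration), so the last digits of the results can differ slightly from the figures listed above.

```python
import numpy as np

# Leaf means of log(Salary), rounded as quoted in the text (assumed inputs).
leaf_log_means = {"R1": 5.107, "R2": 5.998, "R3": 6.740}

# Salary is recorded in thousands of dollars, so undo the log and rescale.
predicted_dollars = {
    region: 1000 * np.exp(mean) for region, mean in leaf_log_means.items()
}

for region, dollars in predicted_dollars.items():
    print(f"{region}: ${dollars:,.0f}")
```

Note that exponentiating a mean of log salaries gives the geometric mean of the salaries in a region, which is never larger than their arithmetic mean.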

<h2>Terminology</h2>

👉🏼 $R_1$, $R_2$, and $R_3$ are known as <span style="color:blue">_terminal nodes_</span> or <span style="color:blue">_leaves_</span> of the tree.

👉🏼 The points along the tree where the predictor space is split are referred to as <span style="color:blue">_internal nodes_</span>; in this tree they are the splits $\text{Years}\le 4.5$ and $\text{Hits}\le 117.5$.

👉🏼 The segments of the tree that connect the nodes are called <span style="color:blue">_branches_</span>.

<h2>Interpretation</h2>

💡 Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players.

💡 Given that a player is less experienced, the number of hits that he made in the previous year seems to play little role in his salary.

💡 But among players who have been in the major leagues for five or more years, the number of hits made in the previous year does affect salary, and players who made more hits last year tend to have higher salaries.

<h2>Prediction via Stratification</h2>

There are two steps:
1. We divide the predictor space (that is, the set of possible values for $X_{1}, X_{2}, \ldots, X_{p}$) into $J$ distinct and non-overlapping regions, $R_{1}, R_{2}, \ldots, R_{J}$.
2. For every observation that falls into the region $R_{j}$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_{j}$.

<blockquote>For instance, suppose that in Step 1 we obtain two regions, $R_{1}$ and $R_{2}$, and that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20. Then for a given observation $X=x$, if $x \in R_{1}$ we will predict a value of 10, and if $x \in R_{2}$ we will predict a value of 20.</blockquote>
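The two steps can be sketched in a few lines. This is only an illustration: the split at $x=5$ and the toy training data are made up so that the region means come out as 10 and 20, and `region_index` and `predict` are hypothetical helper names.

```python
import numpy as np

def region_index(x):
    """Hypothetical Step 1 result: two regions separated at x = 5."""
    return 0 if x < 5 else 1

# Toy training data (made up): responses average 10 in R1 and 20 in R2.
x_train = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 9.0])
y_train = np.array([9.0, 10.0, 11.0, 18.0, 20.0, 22.0])

# Step 2: the prediction for each region is the mean training response in it.
regions = np.array([region_index(x) for x in x_train])
region_means = {j: y_train[regions == j].mean() for j in (0, 1)}

def predict(x):
    """Every observation in a region receives that region's mean."""
    return region_means[region_index(x)]

print(predict(3.0), predict(7.5))
```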

<h3>How do we construct the regions $R_1,\ldots,R_J$ ?</h3>

The goal is to find <span style="color:blue">_boxes_</span> $R_{1}, \ldots, R_{J}$ that minimize the $RSS$, given by
$$
\sum_{j=1}^{J} \sum_{i \in R_{j}}\left(y_{i}-\hat{y}_{R_{j}}\right)^{2}
$$
where $\hat{y}_{R_{j}}$ is the mean response for the training observations within the $j$th box.
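As a small illustration of this criterion, the sketch below evaluates the $RSS$ of a given partition. The function name `partition_rss` and the toy responses are assumptions made for the example; regions are encoded simply as integer labels, one per observation.

```python
import numpy as np

def partition_rss(y, region_labels):
    """Sum, over regions, of squared deviations from the region mean."""
    y = np.asarray(y, dtype=float)
    region_labels = np.asarray(region_labels)
    total = 0.0
    for label in np.unique(region_labels):
        y_region = y[region_labels == label]
        total += np.sum((y_region - y_region.mean()) ** 2)
    return total

y = [9.0, 10.0, 11.0, 18.0, 20.0, 22.0]
one_box = partition_rss(y, [0, 0, 0, 0, 0, 0])    # everything in one region
two_boxes = partition_rss(y, [0, 0, 0, 1, 1, 1])  # low and high responses separated
print(one_box, two_boxes)
```

Separating the low responses from the high ones reduces the $RSS$ sharply, which is exactly what a good partition is meant to achieve.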

⛔ <span style="color:red">Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes.</span>

<ins>_**Recursive Binary Splitting**_</ins>

In order to perform recursive binary splitting, we first select the predictor $X_{j}$ and the cutpoint $s$ such that splitting the predictor space into the regions $\left\{X \mid X_{j}<s\right\}$ and $\left\{X \mid X_{j} \geq s\right\}$ leads to the greatest possible reduction in $RSS$. (The notation $\left\{X \mid X_{j}<s\right\}$ means the region of predictor space in which $X_{j}$ takes on a value less than $s$.) That is, we consider all predictors $X_{1}, \ldots, X_{p}$, and all possible values of the cutpoint $s$ for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest $RSS$. In greater detail, for any $j$ and $s$, we define the pair of half-planes
$$
R_{1}(j, s)=\left\{X \mid X_{j}<s\right\} \text { and } R_{2}(j, s)=\left\{X \mid X_{j} \geq s\right\}
$$
and we seek the values of $j$ and $s$ that minimize the expression
$$
\sum_{i: x_{i} \in R_{1}(j, s)}\left(y_{i}-\hat{y}_{R_{1}}\right)^{2}+\sum_{i: x_{i} \in R_{2}(j, s)}\left(y_{i}-\hat{y}_{R_{2}}\right)^{2}
$$
where $\hat{y}_{R_{1}}$ is the mean response for the training observations in $R_{1}(j, s)$, and $\hat{y}_{R_{2}}$ is the mean response for the training observations in $R_{2}(j, s)$. Finding the values of $j$ and $s$ that minimize this expression can be done quite quickly, especially when the number of features $p$ is not too large.
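A single splitting step can be sketched as an exhaustive search over predictors and cutpoints. This is a minimal illustration, not how scikit-learn implements it internally: the function `best_split` and the toy data are assumptions. Only midpoints between consecutive distinct values are tried, since any cutpoint between the same two values produces the same pair of regions.

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search for the (j, s) that minimizes the two-region RSS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cutpoints: midpoints between consecutive distinct values.
        for s in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] < s]    # responses in R1(j, s)
            right = y[X[:, j] >= s]  # responses in R2(j, s)
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Toy data: the response jumps when the first predictor crosses 4.5;
# the second predictor is unrelated to the response.
X_toy = np.array([[1, 5], [2, 1], [3, 4], [6, 2], [7, 6], [8, 3]])
y_toy = np.array([9.0, 10.0, 11.0, 18.0, 20.0, 22.0])
j_star, s_star, rss_star = best_split(X_toy, y_toy)
print(j_star, s_star, rss_star)
```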

Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the $RSS$ within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the $RSS$. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
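The whole procedure can be sketched as a short recursion. This is an illustrative toy, not scikit-learn's algorithm: `best_split`, `grow`, and `leaves` are hypothetical helpers, and the data are synthetic. With a size-based stopping rule, the best split of a region depends only on the observations inside it, so splitting the regions depth-first yields the same final partition as splitting them one at a time.

```python
import numpy as np

def best_split(X, y):
    """Return the (j, s) minimizing the two-region RSS, or None if no split exists."""
    best, best_rss = None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] < s
            left, right = y[mask], y[~mask]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best, best_rss = (j, s), rss
    return best

def grow(X, y, max_region_size=5):
    """Split recursively until no region holds more than max_region_size points."""
    split = best_split(X, y) if len(y) > max_region_size else None
    if split is None:
        # Terminal node: predict the mean response of the region.
        return {"prediction": y.mean(), "n": len(y)}
    j, s = split
    mask = X[:, j] < s
    return {
        "feature": j,
        "cutpoint": s,
        "left": grow(X[mask], y[mask], max_region_size),
        "right": grow(X[~mask], y[~mask], max_region_size),
    }

def leaves(node):
    """Collect the terminal nodes of a grown tree."""
    if "prediction" in node:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])

# Synthetic data: the response steps up when the first predictor crosses 5.
rng = np.random.default_rng(0)
X_sim = rng.uniform(0, 10, size=(40, 2))
y_sim = np.where(X_sim[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.1, size=40)

tree = grow(X_sim, y_sim)
leaf_sizes = [leaf["n"] for leaf in leaves(tree)]
print(tree["feature"], tree["cutpoint"], leaf_sizes)
```

Growing until every region is this small produces many leaves, which is why the number of leaves is treated as a tuning parameter.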

In [None]:
X = df[['Years', 'Hits', 'RBI', 'PutOuts', 'Walks', 'CRuns']]
y = np.log(df.Salary)
regr = DecisionTreeRegressor(max_leaf_nodes=12,random_state=42)
regr.fit(X, y)
graph = print_tree(regr, features=['Years', 'Hits', 'RBI', 'PutOuts', 'Walks', 'CRuns'])
Image(graph.create_png())

<h3>Choosing Hyperparameters</h3>

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error

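# MSE is a loss: greater_is_better=False makes the scorer return the negated MSE,
# so that GridSearchCV (which maximizes the score) selects the lowest-error model
scoring = make_scorer(mean_squared_error, greater_is_better=False)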
g_cv = GridSearchCV(DecisionTreeRegressor(random_state=42),
              param_grid={'max_leaf_nodes': range(2, 14)},
              scoring=scoring, cv=5, refit=True)

g_cv.fit(X_train,y_train)
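scikit-learn's model selection tools treat larger scores as better, so error metrics are usually supplied in negated form. The sketch below runs the same kind of search on synthetic stand-in data (an assumption for illustration, not the `Hitters` set) with the built-in `'neg_mean_squared_error'` scorer; the names `X_demo`, `y_demo`, and `search` are introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a step in the first predictor plus noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.where(X_demo[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.3, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_leaf_nodes": range(2, 14)},
    scoring="neg_mean_squared_error",  # higher (closer to zero) is better
    cv=5,
)
search.fit(X_demo, y_demo)

best_leaves = search.best_params_["max_leaf_nodes"]
cv_mse = -search.best_score_  # flip the sign to recover the cross-validated MSE
print(best_leaves, round(cv_mse, 3))
```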

In [None]:
results_df = pd.DataFrame(g_cv.cv_results_)
# Put rank 1 (the configuration GridSearchCV selects) at the top of the table
results_df = results_df.sort_values(by=['rank_test_score'])
results_df = (
    results_df
    .set_index(results_df["params"].apply(
        lambda x: "_".join(str(val) for val in x.values()))
    )
    .rename_axis('max_leaf_nodes')
)
results_df[
    ['params', 'rank_test_score', 'mean_test_score', 'std_test_score']
]

In [None]:
mean_squared_error(y_test, g_cv.best_estimator_.predict(X_test))