## Chapter 6 - Decision Trees

### Regression Trees

Decision trees can perform regression as well as classification. The following is an example of a regression tree built using the housing dataset.

<img src="tree2.png" height="300" />

As with a classification tree, the splitting rules and the parts of the tree are unchanged. However, the prediction is now a numeric value: the mean of the target values of all training observations that fall within the node's region $R_m$.

### Training a Tree
When training a regression tree, the prediction is a value instead of a class or probability. Given $n$ training examples, each with $p$ features, $x_1, \cdots, x_i, \cdots, x_n \in \mathbb R^p$, and their associated values $y_1, \cdots, y_i, \cdots, y_n$, the training steps remain the same.

In the housing example, the tree yields 8 regions (leaf nodes), labelled from left to right $R_1, R_2, \cdots, R_8$. The response in each leaf is the mean price of all training observations in that leaf. So for a new observation $x^*$, if $x^* \in R_3$ then its predicted value is $90983.3$.
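
The claim that a leaf's prediction is the mean of the training targets in that leaf can be checked directly. The sketch below does so on synthetic data (the names `X_demo`, `y_demo` and `demo_tree` are illustrative and not part of the housing example), using the `apply` method to find which leaf each row lands in.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the housing set (an assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# apply() returns the index of the leaf that each row falls into
leaf_ids = demo_tree.apply(X_demo)

# A new observation x*: find its leaf, then average the training targets in that leaf
x_new = np.array([[2.5, 7.0]])
new_leaf = demo_tree.apply(x_new)[0]
leaf_mean = y_demo[leaf_ids == new_leaf].mean()

print(demo_tree.predict(x_new)[0], leaf_mean)  # the two values should agree
```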

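To construct the regions $R_m$, we look for the regions that minimise the mean squared error (MSE) of each region:

$$\text{MSE}_{R_m} = \frac{1}{n_{R_m}}\sum_{i \in R_m}(\hat{y}_{R_m}-y_i)^2$$

where $n_{R_m}$ is the number of training observations in $R_m$ and $\hat{y}_{R_m}$ is their mean target value, i.e. the prediction for that region.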

<b>Interpretation</b> - A low MSE means that, on average, the predicted values are close to the actual values of the observations in the region.
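
As a small worked example of the region MSE, the sketch below computes it for one region whose prediction is the region mean. The helper `region_mse` and the prices are made up for illustration.

```python
import numpy as np

def region_mse(y_region):
    """MSE of one region when every observation is predicted by the region mean."""
    y_region = np.asarray(y_region, dtype=float)
    y_hat = y_region.mean()                  # prediction shared by the whole region
    return float(np.mean((y_hat - y_region) ** 2))

# Hypothetical house prices that fall into a single region
prices = [100_000, 120_000, 80_000, 100_000]
print(region_mse(prices))  # mean is 100000, squared errors are 0, 4e8, 4e8, 0
```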

During training, the CART algorithm now minimises the cost function:

$$J(g,t_g) = \frac{m_{\text{left}}}{m}\text{MSE}_\text{left} + \frac{m_{\text{right}}}{m}\text{MSE}_\text{right}$$

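where $\text{MSE}_{\text{left or right}}$ is the MSE of the left or right subset, $m_{\text{left}}$ and $m_{\text{right}}$ are the numbers of observations in each subset, and $m$ is the number of observations in the node being split. Here $g$ is the feature used for the split and $t_g$ is its threshold. The split $(g, t_g)$ with the lowest weighted MSE across the two subsets is the one that minimises the cost function $J$.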
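
To make the cost function concrete, here is a minimal sketch of the threshold search for a single feature. It assumes candidate thresholds are the midpoints between consecutive distinct feature values; the helper names (`mse`, `split_cost`, `best_threshold`) and the toy data are illustrative, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def mse(y):
    """MSE of a subset when the prediction is the subset mean."""
    return float(np.mean((y - y.mean()) ** 2))

def split_cost(x, y, threshold):
    """Weighted MSE J for splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    m = len(y)
    return len(left) / m * mse(left) + len(right) / m * mse(right)

def best_threshold(x, y):
    """Try each midpoint between consecutive distinct values and keep the lowest J."""
    values = np.unique(x)                        # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints keep both sides non-empty
    costs = [split_cost(x, y, t) for t in candidates]
    best = int(np.argmin(costs))
    return float(candidates[best]), costs[best]

# Toy data with an obvious jump between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

t, cost = best_threshold(x, y)
print(t, cost)  # the best threshold is the midpoint 6.5
```

A full tree repeats this search over every feature at every node and then recurses into the two resulting subsets.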

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import export_graphviz

from graphviz import Source # For creating the visualisation of the decision tree

from sklearn.metrics import mean_squared_error, mean_absolute_error

In [2]:
# Ingest, preprocessing
X = pd.read_csv('housing_X_feateng_complete.csv')
y = pd.read_csv('housing_y_feateng_complete.csv')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [3]:
# Train
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X_train,y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=3,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [4]:
# Visualise tree
# graph = Source(export_graphviz(tree_reg, out_file=None, 
#                                feature_names=X.columns,))
# graph.view()
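
If Graphviz is not installed, a plain-text view of the splitting rules is one possible alternative. The sketch below uses `sklearn.tree.export_text` on a small synthetic dataset; the data and the feature names `f0`, `f1`, `f2` are placeholders, not the housing columns.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
demo_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Each indented line is a splitting rule; leaf lines show the predicted value
rules = export_text(demo_tree, feature_names=["f0", "f1", "f2"])
print(rules)
```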

In [5]:
# Predict
print(tree_reg.predict(X_test[:2]))
print(y_test[:2]['median_house_value'].tolist())

[183621.088      257344.33890049]
[136900.0, 241300.0]
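
Comparing two predictions by eye only goes so far. A sketch of scoring the whole test set with the metrics imported earlier might look like the following. It runs on synthetic data so that it is self-contained, and all names ending in `_demo`, `_tr` and `_te` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)
y_pred = demo_tree.predict(X_te)

# RMSE is in the same units as the target, which makes it easier to interpret than MSE
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
mae = mean_absolute_error(y_te, y_pred)
print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}")
```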


### Regularisation

If left unconstrained, the tree usually overfits the training data, leading to poor test set performance. There are several ways to regularise the model. In scikit-learn, you can restrict:
- `max_depth`: maximum depth of the tree, i.e. the largest number of splits on any path from root to leaf

- `min_samples_split`: minimum number of samples a node must have before it can be split

- `min_samples_leaf`: minimum number of samples a leaf node must have

- `max_leaf_nodes`: maximum number of leaf nodes

- `max_features`: maximum number of features evaluated when splitting each node
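
The effect of these constraints can be seen by fitting an unconstrained and a constrained tree side by side. The sketch below does this on synthetic data; the specific parameter values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Unconstrained tree: grows until every leaf is pure
full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Regularised tree: limited depth and a minimum leaf size
small_tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10,
                                   random_state=0).fit(X_tr, y_tr)

results = {}
for name, model in [("full", full_tree), ("small", small_tree)]:
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    results[name] = (train_mse, test_mse)
    print(f"{name}: leaves={model.get_n_leaves()}  "
          f"train MSE={train_mse:.1f}  test MSE={test_mse:.1f}")
```

A training MSE near zero combined with a much larger test MSE is the usual sign that the unconstrained tree has memorised the training set.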

### Tree Pruning

To overcome overfitting, one might consider fitting a smaller tree (with a lower depth). This can lead to lower variance and better interpretability at the cost of some bias. One alternative is to split a node only when the split results in a sufficiently high reduction in the residual sum of squares (RSS). Another is to grow a large tree first (for example, by allowing a large `max_depth`) and then <u>prune</u> it back to get a smaller <u>subtree</u>.

We can estimate the performance of this subtree using cross-validation.
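
One way to grow a large tree and prune it back in scikit-learn is minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter. The sketch below, on synthetic data, takes candidate values from `cost_complexity_pruning_path` and picks one by cross-validation. Subsampling every 15th alpha is an arbitrary choice to keep the search quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

# Grow the full tree and compute the sequence of effective alphas along the pruning path
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)

# Clip guards against tiny negative values from floating-point error
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))
candidate_alphas = [float(a) for a in alphas[::15]]

# Cross-validate over the candidate alphas; a larger alpha gives a smaller subtree
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {"ccp_alpha": candidate_alphas},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

pruned_tree = search.best_estimator_
print("best ccp_alpha:", search.best_params_["ccp_alpha"])
print("leaves:", full_tree.get_n_leaves(), "->", pruned_tree.get_n_leaves())
```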

In [6]:
# Best tree Cross Validation Example
tree_param_grid = [{'max_depth' : [1,2,3,4,5,6], 
                    'max_features' : [2,4,6,8], 
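                    # NOTE: 78 is probably meant to be 7, 8. With max_depth <= 6 a tree has
                    # at most 64 leaves, so max_leaf_nodes=78 never constrains the tree.
                    'max_leaf_nodes':[4,5,6,78,9,10,11,12,13,14,15,16]}]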
tree_reg = DecisionTreeRegressor()
grid_search = GridSearchCV(tree_reg, tree_param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',
                                             max_depth=None, max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             presort='deprecated',
                                             random_state=None,
                                             splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6],
                          'max_features': [2, 4, 6, 8],
                          'max_leaf_node

In [7]:
print(grid_search.best_estimator_)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=6,
                      max_features=8, max_leaf_nodes=78,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
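
Besides `best_estimator_`, the fitted search object exposes `best_params_` and `best_score_`. Because the scoring is `neg_mean_squared_error`, the score is a negated MSE, so a sketch of turning it into a cross-validated RMSE might look like this. It uses synthetic data and a deliberately small grid, both assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

demo_grid = {"max_depth": [2, 4, 6], "max_leaf_nodes": [4, 8, 16]}
demo_search = GridSearchCV(DecisionTreeRegressor(random_state=0), demo_grid,
                           cv=5, scoring="neg_mean_squared_error")
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds; flip the sign before the square root
cv_rmse = np.sqrt(-demo_search.best_score_)
print(demo_search.best_params_)
print(f"cross-validated RMSE: {cv_rmse:.2f}")
```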
