## Chapter 6 - Decision Trees

### The CART Training Algorithm

Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train Decision Trees. The algorithm first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$. It searches for the feature-threshold pair $(k,t_k)$ that produces the purest subsets, weighted by their size. The cost function to minimise is:

$$J(k,t_k) = \frac{m_{\text{left}}}{m}G_\text{left} + \frac{m_{\text{right}}}{m}G_\text{right}$$

where $G_{\text{left or right}}$ is the impurity of the left or right subset, $m_{\text{left or right}}$ is the number of instances in the left or right subset, and $m$ is the number of instances in the node being split. The impurity measure $G$ can be either Gini impurity or entropy.

Once the algorithm has split the training set in two, it splits each subset in the same way, recursively, until it reaches the maximum depth or cannot find a split that further reduces impurity.
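The search for a single split can be sketched in plain Python. This is a minimal, unoptimised illustration of the cost function above using Gini impurity, not Scikit-Learn's actual implementation; the helper names `gini` and `best_split` and the toy data are made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    m = len(labels)
    if m == 0:
        return 0.0
    return 1.0 - sum((count / m) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Return (k, t_k, J) for the split with the lowest weighted impurity.

    X is a list of rows (lists of floats) and y is a list of class labels.
    Returns (None, None, inf) when no feature has two distinct values.
    """
    m = len(y)
    best = (None, None, float("inf"))
    for k in range(len(X[0])):
        values = sorted(set(row[k] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[k] <= t]
            right = [label for row, label in zip(X, y) if row[k] > t]
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if cost < best[2]:
                best = (k, t, cost)
    return best


# Toy data (hypothetical): only feature 1 separates the two classes cleanly
X_toy = [[2.0, 1.0], [1.0, 1.5], [3.0, 3.5], [0.5, 4.0]]
y_toy = ["a", "a", "b", "b"]

k, t_k, cost = best_split(X_toy, y_toy)
print(k, t_k, cost)
```

A full tree would call `best_split` again on the left and right subsets until a stopping condition is met.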

The choice between Gini and entropy rarely matters, since both usually lead to similar trees. When they do differ, Gini tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce more balanced trees.
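To make the two measures concrete, the sketch below evaluates both on a few class distributions, assuming the standard definitions $G = 1 - \sum_k p_k^2$ and $H = -\sum_k p_k \log_2 p_k$. The function names are chosen for this example only.

```python
import math


def gini_impurity(proportions):
    """Gini impurity from class proportions that sum to 1."""
    return 1.0 - sum(p ** 2 for p in proportions)


def entropy_impurity(proportions):
    """Entropy in bits; classes with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


# Pure node, mostly pure node, and evenly mixed node
for proportions in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]:
    print(proportions,
          round(gini_impurity(proportions), 3),
          round(entropy_impurity(proportions), 3))
```

Both measures are zero for a pure node and largest for an evenly mixed one. In Scikit-Learn the measure is selected with the `criterion` hyperparameter of `DecisionTreeClassifier` (`"gini"` or `"entropy"`).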

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import export_graphviz

from graphviz import Source # For creating the visualisation of the decision tree

from sklearn.metrics import mean_squared_error, mean_absolute_error

### Regularisation Hyperparameters

Decision trees make very few assumptions about the training data; unlike linear models, they do not assume the relationship is linear. If left unconstrained, the tree usually overfits the training data. To regularise the model, you can restrict the tree's freedom during training with these hyperparameters:
- `max_depth`: maximum depth of the tree, i.e. the maximum number of splits on any path from the root to a leaf
- `min_samples_split`: minimum number of samples a node must have before it can be split
- `min_samples_leaf`: minimum number of samples a leaf node must have
- `max_leaf_nodes`: maximum number of leaf nodes
- `max_features`: maximum number of features evaluated for splitting each node
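The effect of these hyperparameters can be checked by comparing an unconstrained tree with a restricted one. The sketch below uses the synthetic `make_moons` dataset, which is an assumption made for illustration and not part of the original notebook; the particular values `max_depth=4` and `min_samples_leaf=5` are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy dataset (illustrative choice, not the housing data)
X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, random_state=42)

# No restrictions: the tree keeps splitting until every leaf is pure
unconstrained = DecisionTreeClassifier(random_state=42)
unconstrained.fit(X_train, y_train)

# Restricted depth and leaf size
regularised = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                     random_state=42)
regularised.fit(X_train, y_train)

for name, model in [("unconstrained", unconstrained),
                    ("regularised", regularised)]:
    print(name,
          "depth:", model.get_depth(),
          "leaves:", model.get_n_leaves(),
          "train acc:", model.score(X_train, y_train),
          "test acc:", model.score(X_test, y_test))
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting; the restricted tree trades some training accuracy for a simpler structure.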

### Regression

Decision trees are also capable of performing regression tasks.

In [2]:
X = pd.read_csv('housing_X_feateng_complete.csv')
y = pd.read_csv('housing_y_feateng_complete.csv')

In [3]:
tree_reg = DecisionTreeRegressor(max_depth=3)
tree_reg.fit(X,y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=3,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [4]:
# Visualise the tree: export it in DOT format and render it with graphviz
graph = Source(export_graphviz(tree_reg, out_file=None,
                               feature_names=X.columns))
graph.view()

'Source.gv.pdf'

Instead of predicting a class, the tree now predicts a value. The prediction is simply the average target value of all the training instances in the leaf node that the instance falls into.
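This can be verified directly: `apply` returns the index of the leaf each instance ends up in, so the prediction for an instance should match the mean target of the training instances sharing its leaf. The sketch below uses synthetic data (an assumption for illustration) rather than the housing files, which are not available here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

demo_reg = DecisionTreeRegressor(max_depth=3, random_state=0)
demo_reg.fit(X_demo, y_demo)

leaf_ids = demo_reg.apply(X_demo)   # leaf index for every training instance
y_hat = demo_reg.predict(X_demo)

# Mean training target of the leaf that each instance belongs to
leaf_means = np.array([y_demo[leaf_ids == leaf].mean() for leaf in leaf_ids])

print("distinct leaves:", len(np.unique(leaf_ids)))
print("predictions equal leaf means:", np.allclose(y_hat, leaf_means))
```

With `max_depth=3` the tree has at most eight leaves, so it can output at most eight distinct values.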

In [5]:
# RMSE on the training set (not an estimate of the generalisation error)
y_pred = tree_reg.predict(X)
np.sqrt(mean_squared_error(y, y_pred))

74660.69256640242

During training, the CART algorithm works the same way as for classification, except that it now splits the training set to minimise the MSE instead of the impurity:

$$J(k,t_k) = \frac{m_{\text{left}}}{m}\text{MSE}_\text{left} + \frac{m_{\text{right}}}{m}\text{MSE}_\text{right}$$

where $\text{MSE}_{\text{node}} = \frac{1}{m_{\text{node}}}\sum_{i \in \text{node}}(\hat{y}_{\text{node}}-y^{(i)})^2$ is the MSE of the left or right subset and $\hat{y}_{\text{node}} = \frac{1}{m_{\text{node}}}\sum_{i \in \text{node}} y^{(i)}$ is the mean target value of the instances in that subset. In other words, each node predicts the mean of its instances, and its MSE is measured around that mean.
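As a worked illustration of the regression cost, the sketch below evaluates $J(k,t_k)$ for every candidate threshold on a tiny one-feature dataset. The helper names `node_mse` and `split_cost` and the data are hypothetical; this is not how Scikit-Learn computes splits internally.

```python
import numpy as np


def node_mse(y):
    """MSE of a node around its own mean prediction (0 for an empty node)."""
    if len(y) == 0:
        return 0.0
    return float(np.mean((y - y.mean()) ** 2))


def split_cost(x, y, t):
    """Weighted MSE of the two subsets created by the split x <= t."""
    left, right = y[x <= t], y[x > t]
    m = len(y)
    return len(left) / m * node_mse(left) + len(right) / m * node_mse(right)


# Toy data with two clear groups of targets (values made up for illustration)
x_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_reg = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# Candidate thresholds: midpoints between consecutive (already sorted) values
thresholds = (x_reg[:-1] + x_reg[1:]) / 2
costs = [split_cost(x_reg, y_reg, t) for t in thresholds]

best_t = thresholds[int(np.argmin(costs))]
print(best_t, min(costs))
```

The cheapest split falls in the gap between the two groups, because each resulting subset then has targets close to its own mean.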

### Limitations

Decision trees have orthogonal decision boundaries (every split is perpendicular to an axis), which makes them sensitive to rotation of the training set. One way to limit this is to apply PCA first, which often gives the training data a better orientation. Decision trees are also sensitive to small variations in the training data. Random forests can limit this instability by averaging predictions over many trees.
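The rotation sensitivity can be illustrated with a small experiment: a dataset whose classes are separated by a vertical line needs a single split, but the same dataset rotated by 45° forces the tree to approximate a diagonal boundary with many axis-aligned steps. The data and variable names below are assumptions made for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Points in a square, labelled by which side of the line x0 = 0 they fall on
rng = np.random.default_rng(42)
X_square = rng.uniform(-1, 1, size=(200, 2))
y_square = (X_square[:, 0] > 0).astype(int)

# Rotate every point by 45 degrees; the labels stay the same
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
X_rotated = X_square @ rotation.T

tree_aligned = DecisionTreeClassifier(random_state=42).fit(X_square, y_square)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y_square)

print("aligned: depth", tree_aligned.get_depth(),
      "leaves", tree_aligned.get_n_leaves())
print("rotated: depth", tree_rotated.get_depth(),
      "leaves", tree_rotated.get_n_leaves())
```

Both trees fit their training set, but the rotated one needs a more complex structure to do so and is therefore less likely to generalise well.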