# Decision Trees (CART)
<br><br>
CART stands for Classification and Regression Tree; the term is used interchangeably with decision tree, and the method can be used for either classification or regression. At its core, a decision tree makes a sequence of binary decisions on the data, each time picking the feature and the split value that give the lowest possible cost. For classification, the cost of a region $R_{m'}$ is its misclassification rate:
<br>
$$ cost(x_i, y_i) \in R_{m'} = \frac{1}{N_{R_{m'}}} \sum_{x_i \in R_{m'}} \mathbb{1}[y_i \ne \hat{y}(R_{m'})]$$
<br><br>
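As a small worked example of the classification cost, here is a minimal sketch that computes the misclassification rate of one region. It assumes the region's prediction $\hat{y}(R_{m'})$ is the most common class among its training points; the function name `misclassification_cost` and the sample labels are made up for illustration.

```python
from collections import Counter


def misclassification_cost(labels):
    """Fraction of labels in a region that differ from the region's majority class."""
    if not labels:
        raise ValueError("region must contain at least one point")
    # Assumption: the region predicts its most common class.
    _, majority_count = Counter(labels).most_common(1)[0]
    return 1.0 - majority_count / len(labels)


# Hypothetical region holding five training points.
region = ["setosa", "setosa", "versicolor", "setosa", "virginica"]
print(misclassification_cost(region))  # 2 of the 5 points are not the majority class
```

A region that contains a single class has a cost of zero, which is why the tree keeps looking for splits that make regions purer.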
And for regression the cost is the sum of squared errors around $w_{m'}$, the mean response of the training points in the region:<br>
$$ cost(x_i, y_i) \in R_{m'} = \sum_{x_i \in R_{m'}} (y_i - w_{m'})^2$$
<br><br>
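The regression cost can be checked by hand on a few numbers. The sketch below assumes $w_{m'}$ is the mean of the targets in the region; the function name `regression_cost` and the sample prices are hypothetical.

```python
def regression_cost(targets):
    """Sum of squared deviations of the targets from their mean."""
    if not targets:
        raise ValueError("region must contain at least one point")
    # Assumption: the region's prediction w is the mean response.
    w = sum(targets) / len(targets)
    return sum((y - w) ** 2 for y in targets)


# Hypothetical region with four house prices (in thousands).
region_prices = [200.0, 220.0, 180.0, 200.0]
print(regression_cost(region_prices))  # mean is 200, so the cost is 0 + 400 + 400 + 0
```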
Using the ubiquitous iris classification example, the left side of the figure below shows a sample set of decisions made in a particular tree. At each step a greedy algorithm picks the split that minimizes the cost function at that node, without revisiting earlier splits. The right side of the figure shows how the feature space is partitioned into blocks whose boundaries are parallel to the feature axes. If you follow the decision tree on the left you will see how the partition on the right is derived.

![CART decision tree and the resulting partition of the feature space](cart.png)
Figure taken from Machine Learning: A Probabilistic Perspective, Kevin P. Murphy
<br><br>
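The greedy split search described above can be sketched in a few lines. This is an illustrative toy, not how any particular library implements it: it scores each candidate split by the number of misclassified points, whereas libraries such as scikit-learn use Gini impurity by default. The helper names and the six sample points are hypothetical.

```python
from collections import Counter


def misclassified(labels):
    """Number of points that differ from the majority class (0 for an empty region)."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_split(rows, labels):
    """Try every (feature, threshold) pair and keep the first one with the lowest cost."""
    best = None  # (cost, feature_index, threshold)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            cost = misclassified(left) + misclassified(right)
            if best is None or cost < best[0]:
                best = (cost, j, threshold)
    return best


# Hypothetical (petal length, petal width) measurements.
rows = [(1.4, 0.2), (1.5, 0.2), (4.5, 1.5), (4.7, 1.4), (5.1, 1.9), (5.9, 2.1)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
cost, feature, threshold = best_split(rows, labels)
print("split on feature", feature, "at", threshold, "with cost", cost)
```

A full tree would apply `best_split` again to the left and right subsets, which is what makes the procedure greedy: each split is chosen locally and never reconsidered.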
You could create a tree that exhaustively partitions the data so that there is a single training data point at every leaf. But that would amount to overfitting and may well be computationally wasteful. Therefore a **stopping condition** is usually defined, for example requiring a minimum number of data points per leaf (say 10). Lastly, you could also **prune** the tree to further improve performance. There are several methods to prune a tree, but a common one is to remove a leaf/node if doing so reduces the cost function on a held-out validation set (using the final test set for this would bias the reported test score).
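Both ideas can be tried in scikit-learn. The sketch below is only one possible setup: it uses the bundled iris data rather than any dataset from this notebook, and it swaps in scikit-learn's cost-complexity pruning (`ccp_alpha`), which is a different technique from the remove-a-leaf-if-validation-cost-drops method described above. The choice of `random_state=0` and the 60/40 split are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# Stopping condition: no split may leave fewer than 10 training points in a leaf.
stopped = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_train, y_train)

# Pruning: grow a full tree, then choose the pruning strength that scores best
# on the held-out validation data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
candidates = [
    # max(..., 0.0) guards against tiny negative values from floating-point error.
    DecisionTreeClassifier(ccp_alpha=max(alpha, 0.0), random_state=0).fit(X_train, y_train)
    for alpha in alphas
]
pruned = max(candidates, key=lambda tree: tree.score(X_val, y_val))

for name, tree in [("full", full), ("stopped", stopped), ("pruned", pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "validation accuracy:", tree.score(X_val, y_val))
```

Because the pruning strength is tuned on the validation split, a third, untouched test split would be needed to report an unbiased final score.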


```python
# In this example we read in a house description and sale dataset. For this classification we
# estimate whether a house will sell (and with what probability) within 90 days of being put
# on the market.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# this data has already been cleaned up, standardized, one-hot encoded and vetted
df = pd.read_csv("classification_house_sale_px_data.csv", parse_dates=True, sep=',', header=0)
df_labels = pd.read_csv("classification_house_sale_px_labels.csv", parse_dates=True, sep=',', header=0)

# split data into training (60%) and test (40%) sets
train, test, y_train, y_test = train_test_split(df, df_labels, train_size=.6, test_size=.4, shuffle=True)

# fit a shallow and a deep tree on the training data and score each on the test data
for max_depth in (5, 50):
    clf = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=5)
    clf.fit(train, y_train.label.values)
    # score() returns the fraction of test points classified correctly
    accuracy = clf.score(test, y_test.label.values)
    print("CART: Test set accuracy (% correct) when max_depth = {0}: {1:.3f}".format(max_depth, accuracy))
```

```
CART: Test set accuracy (% correct) when max_depth = 5: 0.600
CART: Test set accuracy (% correct) when max_depth = 50: 0.558
```


<br>
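Note that the deeper tree scores lower on the test set (0.558 with max_depth set to 50 versus 0.600 with max_depth set to 5), which is what you would expect from overfitting. Because the train/test split is random, the exact numbers will vary from run to run.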
<br>
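To look at the effect of depth more systematically, one option is to sweep `max_depth` and compare training and test accuracy. The sketch below does this on synthetic data from `make_classification`, since the house-sale files are not reproduced here; the sample size, the 20% label noise (`flip_y`) and the depth grid are all arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a noisy classification problem.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=0)

results = {}  # max_depth -> (training accuracy, test accuracy)
for depth in (1, 2, 5, 10, 20, 50):
    clf = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5, random_state=0)
    clf.fit(X_train, y_train)
    results[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print("max_depth = {0:2d}: train = {1:.3f}, test = {2:.3f}".format(depth, *results[depth]))
```

A widening gap between training and test accuracy as depth grows is the usual sign of overfitting.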

# Takeaways
- CART is a decision tree that can be used for classification or regression
- CART makes binary decisions, chosen greedily to minimize a cost function
- CART can be improved by pruning