# Decision Trees

**Decision Trees**, also known as Classification and Regression Trees (CART), are part of a family of machine learning models known as **Tree-Based Machine Learning Models**. One of their advantages is that they are easy to interpret, since they produce simple decision rules that can be understood even by a non-technical audience. Based on the features in the training data, a decision tree model learns a series of questions that it uses to infer the class labels of samples. The following figure shows a decision tree that helps a person choose an action by asking yes-or-no questions.

<img src="tree.png" alt="Drawing" style="width: 500px;"/>

Image taken from [4]. 

Some of the advantages of decision trees are the following:

- As mentioned before, trees are very easy to explain to people. In fact, they are even easier to explain than linear regression.
- Some people believe that decision trees mirror human decision-making more closely than other regression and classification approaches.
- Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
- Trees can easily handle qualitative predictors without the need to create dummy variables.

Unfortunately, decision trees tend to overfit the data and, generally, do not have the same level of predictive accuracy as some of the other regression and classification approaches [1].

The main goal of these models is to segment the feature space into simple rectangular regions, which is convenient since this allows us to do either regression or classification. 

## Regression Trees

Let us suppose that we have $n$ observations: 

$$D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\},$$

where $x_i\in\mathbb{R}^p$ is a vector of $p$ features and $y_i\in\mathbb{R}$, $i=1,2,\dots,n$. Say we want to split the space where our data $D$ lives into $m$ regions $R_1,R_2,\dots,R_m$. Then, the **hypothesis** that we can implement with a **regression tree** is the following:

$$h(x)=\sum_{i=1}^mc_i\mathbb{1}_{R_i}(x),$$

where $\mathbb{1}_{R_i}(x)$, $i=1,2,\dots,m$, is the **indicator function** of the region $R_i$: it is equal to one if $x\in R_i$ and equal to zero otherwise.
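To make the hypothesis concrete, here is a minimal sketch that evaluates $h(x)$ for a one-dimensional input. The thresholds and the constants $c_i$ are made-up values chosen for illustration; they do not come from any fitted tree.

```python
import numpy as np

# Hypothetical partition of the real line into three regions:
# R_1 = (-inf, 2], R_2 = (2, 5], R_3 = (5, inf)
thresholds = np.array([2.0, 5.0])
constants = np.array([1.5, 3.0, 0.5])  # c_1, c_2, c_3 (illustrative values)

def h(x):
    """Evaluate h(x) = sum_i c_i * 1_{R_i}(x) for a scalar x."""
    # One indicator per region; exactly one is 1 because the regions are disjoint.
    indicators = np.array([
        x <= thresholds[0],
        thresholds[0] < x <= thresholds[1],
        x > thresholds[1],
    ], dtype=float)
    return float(constants @ indicators)

print(h(1.0), h(4.0), h(7.0))  # 1.5 3.0 0.5
```

Since only one indicator is nonzero for any input, the prediction is simply the constant attached to the region that contains $x$.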

Given the data $D$, the training process should return the optimal values of both the constants $\{c_i\}_{i=1}^m$ and the regions $\{R_i\}_{i=1}^m$. Let us define the following half-planes:

$$R_1(j,s)=\{X:X_j\leq s\},$$
$$R_2(j,s)=\{X:X_j> s\},$$

where $X_j$ is the $j$-th variable of the vector of variables $X$. Then, the optimal constants and regions for a single split can be obtained by solving the following optimization problem:

$$\min_{j,s}\left\{\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right\}$$

For each splitting variable $j$, the best split point $s$ can be determined very quickly, so the best pair $(j, s)$ can be found by scanning through all of the inputs. Having found the best split, the data is partitioned into the two resulting regions, and the splitting process is repeated on each of them. This continues until we have $m$ regions.
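The search for the best pair $(j, s)$ can be sketched as an exhaustive scan. This is an illustrative implementation, not the one used by `scikit-learn`. It relies on the fact that, for a fixed region, the constant minimizing the sum of squared errors is the mean of the targets in that region. The function name `best_split`, the choice of midpoints as candidate split points, and the toy data are assumptions made for this example.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, sse) minimizing the two-region sum of squared errors."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)
    for j in range(n_features):
        values = np.unique(X[:, j])  # sorted distinct values of feature j
        # Candidate split points: midpoints between consecutive distinct values,
        # so both regions are always non-empty.
        for s in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= s]   # targets in R_1(j, s)
            right = y[X[:, j] > s]   # targets in R_2(j, s)
            # The optimal c_1 and c_2 are the region means.
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, float(s), float(sse))
    return best

# Toy data (made up): the target jumps when feature 0 exceeds 3.
X = np.array([[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0], [5.0, 5.0], [6.0, 2.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
j, s, sse = best_split(X, y)
print(j, s, sse)  # 0 3.5 0.0
```

A full regression tree would call a routine like this recursively on the two resulting subsets of the data.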

In the following figure, the top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. The top left panel shows a general partition that cannot be obtained from recursive binary splitting. The bottom left panel shows the tree corresponding to the partition in the top right panel, and the bottom right panel shows a perspective plot of the prediction surface.

<img src="split.png" alt="Drawing" style="width: 500px;"/>

This image was taken from [2].

## Classification Trees

If the target is a classification outcome taking values $1,2,...,K$, the only changes needed in the tree algorithm pertain to the criteria for splitting the nodes. In a node $m$, representing a region $R_m$ with $n_m$ observations, let 

$$p_{mk}=\frac{1}{n_m}\sum_{x_i\in R_m}\mathbb{1}_k(y_i)$$

be the proportion of class $k$ observations in node $m$. Now, let $k(m)$ be the *majority class in node $m$*:

$$k(m)=\text{argmax}_kp_{mk}.$$

Then, we can define the following impurity measures:

$$\text{Misclassification Error: } 1-p_{mk(m)},$$
$$\text{Gini Index: } \sum_{k=1}^Kp_{mk}(1-p_{mk}),$$
$$\text{Cross-entropy: } -\sum_{k=1}^Kp_{mk}\log(p_{mk}).$$

<img src="impure.png" alt="Drawing" style="width: 500px;"/>

Image taken from [2].
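The three impurity measures can be computed directly from the class proportions $p_{mk}$. The sketch below does so for a hypothetical node. The function names and the node contents are made up for illustration, and the classes are coded as $0,\dots,K-1$ rather than $1,\dots,K$ so that `np.bincount` can be used.

```python
import numpy as np

def class_proportions(labels, n_classes):
    """p_mk: fraction of the observations in the node that belong to each class k."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

def misclassification_error(p):
    return 1.0 - p.max()

def gini_index(p):
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    nonzero = p[p > 0]  # convention: 0 * log(0) = 0
    return float(-np.sum(nonzero * np.log(nonzero)))

# Hypothetical node with 10 observations: 8 of class 0 and 2 of class 1.
node_labels = np.array([0] * 8 + [1] * 2)
p = class_proportions(node_labels, n_classes=2)
print(misclassification_error(p), gini_index(p), cross_entropy(p))
```

All three measures are zero for a pure node (a single class) and largest when the classes are equally represented.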

Intuitively speaking, a classification tree is trained as follows. First, among all the features that we are working with, we pick as the root the one that splits the data best, that is, the feature whose split yields the lowest impurity. Then, given the partition created by the root node, for each child node we again choose the feature (and split point) that best separates the observations that reached that node. This splitting process may continue until no further split reduces the impurity of the "youngest" child nodes.
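As a rough illustration of what "splits the data best" means, the sketch below measures how much a candidate split reduces the Gini index. It assumes the usual convention of weighting each child's impurity by its share of the observations. The function names, the features, and the thresholds are made up for this example.

```python
import numpy as np

def gini(labels):
    """Gini index of a node, computed from its class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def impurity_decrease(feature, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

labels = np.array([0, 0, 0, 1, 1, 1])
informative = np.array([1.0, 1.2, 1.4, 4.5, 4.7, 5.1])  # separates the classes
noisy = np.array([2.0, 5.0, 3.0, 2.5, 4.0, 3.5])        # unrelated to the classes

print(impurity_decrease(informative, labels, 3.0))  # 0.5
print(impurity_decrease(noisy, labels, 3.2))
```

The informative feature removes all of the impurity, so it would be preferred as the root; the noisy feature barely reduces it.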

Consider the following image:

<img src="iris.png" alt="Drawing" style="width: 400px;"/>

In this case, using two features, petal length and petal width, we are using decision trees to distinguish among three types of **iris plants**. The data comes from the famous **iris dataset**, which is one of the many datasets included in the `scikit-learn` library. Image taken from [3].
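A tree of this kind can be fitted in a few lines. The following is a minimal sketch that uses only the two petal features of the iris dataset bundled with `scikit-learn`. The depth limit `max_depth=2` and the variable names are choices made for this example, so the resulting tree will not necessarily match the one in the image.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Columns 2 and 3 hold petal length and petal width (in cm).
X_petal = iris.data[:, 2:4]
y_iris = iris.target

# A shallow tree keeps the learned rules easy to read.
iris_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
iris_tree.fit(X_petal, y_iris)

# Print the learned decision rules as plain text.
print(export_text(iris_tree, feature_names=["petal length (cm)", "petal width (cm)"]))
```

Each line of the printed output is one yes-or-no question on a petal measurement, which corresponds to one axis-aligned cut of the feature space.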

Now, let us load some libraries.

In [None]:
import matplotlib.pyplot as plt
import numpy  as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree

Guess what? We will be using our diabetes dataset again.

In [None]:
diabetes = pd.read_csv('diabetes-dataset.csv')
diabetes.head()

In the following two cells we process our data to divide it into a training set and a test set. In this case, 80% of the data will be used for training and the rest for testing. For now we will focus on the training of the decision tree. Notice that we will be using the `train_test_split` function from the `scikit-learn` library.

In [None]:
y = diabetes['Outcome']
X_data = diabetes.copy()
X_data = X_data.drop(columns='Outcome')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size=0.2, random_state=0)

In [None]:
X_train

In [None]:
y_train

Now we can train our model.

In [None]:
clf = tree.DecisionTreeClassifier(random_state=0, max_depth=4)
clf.fit(X_train,y_train)

Furthermore, we can visualize the decision tree that we obtained!

In [None]:
# Use only the feature columns; the 'Outcome' target is not an input of the tree.
variables = list(X_data.columns)
plt.figure(figsize=[20,7])
tree.plot_tree(clf, feature_names=variables, class_names=['No Diabetes', 'Diabetes'], fontsize=8, filled=True)
plt.show()

And now we predict on the training set and compute the training accuracy...

In [None]:
y_pred = clf.predict(X_train)
hits = sum(y_pred == y_train)
accuracy = hits / len(y_train)
accuracy
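The accuracy computed above (hits divided by the number of samples) is also available as `accuracy_score` in `sklearn.metrics`. The sketch below checks on made-up toy labels that both computations agree; the arrays are illustrative and are not taken from the diabetes data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (made up for illustration).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

manual = np.sum(y_hat == y_true) / len(y_true)  # hits / number of samples
library = accuracy_score(y_true, y_hat)
print(manual, library)  # 0.75 0.75
```

In this notebook, the same call applied to the held-out split, `accuracy_score(y_test, clf.predict(X_test))`, would estimate how well the tree generalizes. Since decision trees tend to overfit, that number is usually more informative than the training accuracy.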

## Bibliography

[1] *James, Gareth, et al., "An introduction to statistical learning," Vol. 112. New York: Springer, 2013*.

[2] *Hastie, T., Tibshirani, R., Friedman, J. H., "The elements of statistical learning: data mining, inference, and prediction," New York: Springer, 2009*.

[3] *Raschka, Sebastian, and Vahid Mirjalili, "Python machine learning: Machine learning and deep learning with Python," Second edition, 2017*.

[4] *Dangeti, Pratap, "Statistics for machine learning," Packt Publishing Ltd., 2017*.