# Decision Trees

Decision trees are powerful algorithms that can fit linear, nonlinear, and even multi-output datasets. They are also the building blocks of random forests.
We are going to discuss how they work, how to train them, and how to regularize and validate them.

## Training and visualizing

Let's see how it works on the iris dataset:

In [11]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)
# We can visualize the tree using export_graphviz
export_graphviz(tree_clf, out_file="dTree.dot", 
                feature_names=["petal length (cm)", "petal width (cm)"], 
                class_names=iris.target_names, 
                rounded=True,
                filled=True)

# Source.from_file("dTree.dot")  # Displays the graph; commented out because Graphviz is not on my PATH

__Side Note__: Decision tree models do not require feature scaling at all, since each split only compares a single feature to a threshold.

## Training a decision tree using the CART (Classification And Regression Tree) algorithm

The CART algorithm works by splitting the training set into two subsets using a single feature _k_ and a threshold $t_k$. It searches for the pair
($k$, $t_k$) that minimizes the cost function below, then repeats the same procedure recursively on each subset. Because it optimizes each split locally, without checking whether that split leads to the best tree overall, CART is a greedy algorithm. The cost function is defined as follows:
$$J(k, t_k) = \frac{m_{left}}{m}G_{left} + \frac{m_{right}}{m}G_{right}$$
- $G_{left/right}$ measures the impurity in the left or the right subset.
- $m_{left/right}$ is the number of instances in the left or the right subset.
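
To make the cost function concrete, here is a minimal sketch of the search over thresholds for a single feature. It is not scikit-learn's implementation: the helper names (`gini`, `split_cost`, `best_threshold`) and the toy data are made up for illustration, and it assumes Gini impurity (defined in the next section) as the impurity measure $G$.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def split_cost(x, y, threshold):
    """Weighted impurity of splitting on feature values x at the given threshold."""
    left = x <= threshold
    m = y.size
    m_left = left.sum()
    m_right = m - m_left
    return m_left / m * gini(y[left]) + m_right / m * gini(y[~left])

def best_threshold(x, y):
    """Evaluate midpoints between consecutive distinct values and keep the cheapest."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    costs = [split_cost(x, y, t) for t in candidates]
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

# Hypothetical toy data: one feature, two perfectly separable classes
x_toy = np.array([1.0, 1.5, 2.0, 4.0, 4.5, 5.0])
y_toy = np.array([0, 0, 0, 1, 1, 1])

t_best, j_best = best_threshold(x_toy, y_toy)
print(t_best, j_best)
```

A full CART implementation would repeat this search over every feature, keep the best pair, and then recurse on the two resulting subsets.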

## Gini impurity or entropy

By default, decision trees in scikit-learn use the Gini impurity measure, which is defined by the following equation:
$$G_i = 1 - \sum_{k=1}^{n}p_{i, k}^2$$
where $G_i$ is the Gini impurity of the $i^{th}$ node, $p_{i, k}$ is the ratio of instances of class _k_ among the instances in node _i_, and $n$ is the number of classes. Alternatively, we can
choose to measure impurity using _entropy_ (by setting the _criterion_ hyperparameter to "entropy"):
$$H_i = -\sum_{k=1}^{n}p_{i, k}\log _2 (p_{i, k})$$
Terms with $p_{i, k} = 0$ are left out of the sum. For example, the entropy of a node is equal to 0 when all the instances in it belong to a single class.
The two measures usually produce similar trees. Gini impurity is slightly faster to compute, whereas entropy tends to produce slightly more
balanced trees.
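
As a quick worked example, the two formulas can be evaluated directly from the class counts of a node. The functions and the counts below are illustrative only (they are not part of scikit-learn's API).

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity from per-class instance counts in a node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy (base 2) from per-class instance counts in a node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # classes with no instances contribute nothing to the sum
    return -np.sum(p * np.log2(p))

# Hypothetical nodes: pure, nearly pure, and evenly mixed
for counts in ([50, 0, 0], [0, 49, 5], [10, 10, 10]):
    print(counts, round(gini_impurity(counts), 3), round(entropy(counts), 3))
```

Both measures are 0 for the pure node and reach their maximum for the evenly mixed one, which is why they usually rank candidate splits in a similar order.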

## Regularizing a decision tree

By default, scikit-learn places no limit on a decision tree's depth or number of nodes, so the tree keeps growing until it fits the training data nearly perfectly, often overfitting it. One
way to avoid overfitting is to limit the depth of the tree through the _max\_depth_ parameter. Other parameters also restrict the size
and shape of the tree:
- _max\_features_: The maximum number of features that are evaluated when searching for the best split at each node.
- _max\_leaf\_nodes_: The maximum number of leaf nodes the tree is allowed to have.
- _min\_samples\_split_: The minimum number of samples a node must have before it can be split.
- _min\_samples\_leaf_: The minimum number of samples a leaf node must have.

Let's take the moons dataset and train two trees, one unregularized and one with min_samples_leaf=5:

In [10]:
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=150, noise=0.2, random_state=42)
tree_clf1 = DecisionTreeClassifier(random_state=42)
tree_clf2 = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
tree_clf1.fit(X_moons, y_moons)
tree_clf2.fit(X_moons, y_moons)
# Let's evaluate these two models on a separate test set
X_moons_test, y_moons_test = make_moons(n_samples=1000, noise=0.2, random_state=43)
print(tree_clf1.score(X_moons_test, y_moons_test))
print(tree_clf2.score(X_moons_test, y_moons_test))

0.898
0.92
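
To see what the regularization actually changes, one option is to inspect the size of each fitted tree. The sketch below refits the same two models under new, illustrative variable names and uses `get_depth()`, `get_n_leaves()`, and the fitted `tree_` attribute; the exact numbers it prints depend on the scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=150, noise=0.2, random_state=42)
unrestricted = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
restricted = DecisionTreeClassifier(min_samples_leaf=5, random_state=42).fit(X_demo, y_demo)

# Compare the overall size of the two trees
for name, model in [("unrestricted", unrestricted), ("min_samples_leaf=5", restricted)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())

# In the fitted tree_ structure, leaves are the nodes whose children_left is -1
is_leaf = restricted.tree_.children_left == -1
smallest_leaf = restricted.tree_.n_node_samples[is_leaf].min()
print("smallest leaf in the restricted tree:", smallest_leaf)
```

Every leaf of the restricted tree holds at least 5 training samples, so that tree cannot isolate individual noisy points the way the unrestricted one can.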


## Regression tasks with decision trees

Let's train a decision tree regressor on a noisy quadratic dataset:

In [12]:
from sklearn.tree import DecisionTreeRegressor
import numpy as np

np.random.seed(42)
X_quad = np.random.rand(200, 1) - 0.5
y_quad = X_quad ** 2 + 0.025 * np.random.randn(200, 1)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

The difference from a classification tree is that the CART algorithm here tries to minimize the MSE instead of an impurity measure such as Gini or entropy.
Here is the exact cost function that is being used:
$$J(k, t_k) = \frac{m_{left}}{m}MSE_{left} + \frac{m_{right}}{m}MSE_{right}$$
where
$$MSE_{node} = \frac{\sum_{i \in node}(\hat{y}_{node} - y^{(i)})^2}{m_{node}}$$
$$\hat{y}_{node} = \frac{\sum_{i \in node} y^{(i)}}{m_{node}}$$
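
The two definitions above can be checked numerically: each leaf of a fitted regressor should predict the mean target of the training instances that fall in it. The sketch below is an illustration on freshly generated data (it uses NumPy's `default_rng`, so the values differ from the cell above), with variable names chosen only for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.random((200, 1)) - 0.5
y_demo = X_demo[:, 0] ** 2 + 0.025 * rng.standard_normal(200)

reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, y_demo)

leaf_ids = reg.apply(X_demo)       # index of the leaf each instance ends up in
predictions = reg.predict(X_demo)

max_gap = 0.0
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_mean = y_demo[in_leaf].mean()                      # y_hat for this node
    leaf_mse = np.mean((leaf_mean - y_demo[in_leaf]) ** 2)  # MSE for this node
    max_gap = max(max_gap, abs(predictions[in_leaf][0] - leaf_mean))
    print(f"leaf {leaf}: m={in_leaf.sum()}, y_hat={leaf_mean:.4f}, MSE={leaf_mse:.5f}")

print("largest gap between prediction and leaf mean:", max_gap)
```

With `max_depth=2` the tree has at most four leaves, so the model can output at most four distinct values.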

As with classification trees, we can regularize this model using the same hyperparameters described earlier.