Like SVMs, *Decision Trees* are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks.

Decision Trees are also the fundamental components of Random Forests, which are among the most powerful Machine Learning algorithms available today.

# Training and Visualizing a Decision Tree

In [2]:
# Importing the libraries

import os
import tarfile
import urllib

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

# Only for Jupyter notebooks
%matplotlib inline

In [3]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

In [4]:
iris = load_iris()
X = iris.data[:, 2:]  # Petal length and width
y = iris.target

In [5]:
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [6]:
from sklearn.tree import export_graphviz

In [9]:
export_graphviz(
    tree_clf,
    out_file="./iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
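The `.dot` file needs Graphviz to be rendered as an image. As a lighter alternative, here is a minimal sketch that prints the same tree as indented text. It assumes Scikit-Learn 0.21 or later, where `sklearn.tree.export_text` is available. The `random_state=42` value is an arbitrary choice added only for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# random_state is an assumption added here so that reruns give the same tree
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Render the fitted tree as plain text, one line per node
tree_text = export_text(tree_clf, feature_names=iris.feature_names[2:])
print(tree_text)
```

Each indentation level in the printed text corresponds to one level of depth in the tree.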

# Making Predictions

At the top of a Decision Tree, the first node is called the **root node**. Its child nodes come in two kinds. A **leaf node** (or terminal node) has no child nodes. A **decision node** (or interior node) is split further into two child nodes.

One of the many qualities of Decision Trees is that they require very little data preparation.

* A node's samples attribute counts how many training instances it applies to.
* A node's value attribute tells you how many training instances of each class this node applies to.
* A node's *gini* attribute measures its impurity. A node is pure (gini=0) if all training instances it applies to belong to the same class.
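The Gini impurity of a node is 1 minus the sum of the squared class proportions in that node. The sketch below recomputes it for every node of a fitted tree and compares it with the value Scikit-Learn stores. It reads the low-level `tree_` attribute (`children_left`, `value`, `n_node_samples`, `impurity`), and `random_state=42` is an arbitrary choice. The class distribution is normalized first, because depending on the Scikit-Learn version `tree_.value` may hold counts or proportions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

tree = tree_clf.tree_
manual_gini = []
for node_id in range(tree.node_count):
    # A node is a leaf when it has no left child (marked with -1)
    kind = "leaf" if tree.children_left[node_id] == -1 else "decision"
    # Normalize so the code works whether value holds counts or proportions
    distribution = tree.value[node_id][0]
    proportions = distribution / distribution.sum()
    gini = 1.0 - np.sum(proportions ** 2)
    manual_gini.append(gini)
    print(node_id, kind, tree.n_node_samples[node_id],
          round(gini, 3), round(tree.impurity[node_id], 3))
```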

/!\ : The CART algorithm, which Scikit-Learn uses, produces only binary trees, whereas other algorithms such as ID3 can produce Decision Trees with nodes that have more than two children.

Decision Trees are intuitive, and their decisions are easy to interpret. Such models are often called *white box models*. In contrast, as we will see, Random Forests or neural networks are generally considered *black box models*.

# Estimating Class Probabilities

A Decision Tree can also estimate the probability that an instance belongs to a particular class k. First it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node.

In [13]:
tree_clf.predict_proba([[5,1.5]])

array([[0.        , 0.90740741, 0.09259259]])

In [14]:
tree_clf.predict([[5,1.5]])

array([1])
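To check that the probabilities really are the class ratios in the leaf, the sketch below finds the leaf with `apply()`, counts the training instances of each class that land in it, and compares the result with `predict_proba()`. The variable names and `random_state=42` are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

instance = [[5, 1.5]]
leaf_id = tree_clf.apply(instance)[0]  # index of the leaf this instance reaches

# Training instances that end up in the same leaf
in_leaf = tree_clf.apply(X) == leaf_id
class_counts = np.bincount(y[in_leaf], minlength=3)

# Ratio of each class among those instances
manual_proba = class_counts / class_counts.sum()
print(manual_proba)
print(tree_clf.predict_proba(instance)[0])
```

The predicted class is simply the one with the highest ratio, which is why `predict()` returns class 1 for this instance.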

# The CART Training Algorithm

The CART algorithm (Classification and Regression Tree) works by first splitting the training set into two subsets using a single feature k and a threshold tk. It searches for the pair (k, tk) that produces the purest subsets, weighted by their size.

Once the CART algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on. It stops recursing once it reaches the maximum depth (the max_depth hyperparameter), or if it cannot find a split that will reduce impurity.
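The search for a single split can be sketched as a brute-force loop: for every feature k and every candidate threshold tk, compute the size-weighted Gini impurity of the two subsets and keep the lowest. This is an illustrative implementation, not Scikit-Learn's optimized one, and the helper names `gini` and `best_split` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris


def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    if labels.size == 0:
        return 0.0
    proportions = np.bincount(labels) / labels.size
    return 1.0 - np.sum(proportions ** 2)


def best_split(X, y):
    """Return the (feature, threshold, cost) with the lowest weighted impurity."""
    m, n_features = X.shape
    best = (None, None, np.inf)
    for k in range(n_features):
        values = np.unique(X[:, k])
        # Candidate thresholds: midpoints between consecutive distinct values
        thresholds = (values[:-1] + values[1:]) / 2
        for t_k in thresholds:
            left = y[X[:, k] <= t_k]
            right = y[X[:, k] > t_k]
            # Impurity of each subset, weighted by its share of the instances
            cost = (left.size / m) * gini(left) + (right.size / m) * gini(right)
            if cost < best[2]:
                best = (k, t_k, cost)
    return best


iris = load_iris()
X, y = iris.data[:, 2:], iris.target
feature, threshold, cost = best_split(X, y)
print(feature, threshold, cost)
```

CART is greedy: it applies this search at the current node only and does not check whether the split leads to the lowest possible impurity several levels down.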

# Gini Impurity or Entropy?

By default, the Gini impurity is used, but entropy can be selected by setting the criterion hyperparameter to "entropy". In machine learning, entropy is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class. A reduction of entropy is often called an *information gain*.

Most of the time, Gini impurity and entropy lead to similar trees. Gini impurity is slightly faster to compute. When the two differ, however, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
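The two measures can be compared on a single node. The sketch below uses the helper functions `gini` and `entropy`, which are hypothetical names, and illustrative class counts `[0, 49, 5]`, chosen because they match the ratios 49/54 and 5/54 returned by `predict_proba` earlier. It then trains a tree with `criterion="entropy"`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(class_counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # skip empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))


counts = [0, 49, 5]  # illustrative class counts for one node
print(round(gini(counts), 3), round(entropy(counts), 3))

# Selecting entropy as the impurity measure for training
iris = load_iris()
entropy_clf = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                                     random_state=42)
entropy_clf.fit(iris.data[:, 2:], iris.target)
```

Both measures are zero for a pure node and grow as the classes become more mixed, which is why they usually lead to similar trees.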

# Regularization Hyperparameters

Decision Trees make very few assumptions about the training data. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely and most likely overfitting it. Such a model is called a *nonparametric model*, because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. In contrast, a *parametric model* (such as a linear model) has a predetermined number of parameters, so its degree of freedom is limited, reducing the risk of overfitting (but increasing the risk of underfitting).

To avoid overfitting the training data, we need regularization. Increasing the min_* hyperparameters or reducing the max_* hyperparameters will regularize the model.
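As a rough illustration, the sketch below trains an unconstrained tree and a tree with `min_samples_leaf=10` on a noisy synthetic dataset, then compares their sizes and scores. The dataset (`make_moons`), the noise level, the hyperparameter value, and `random_state=42` are all assumptions chosen for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy dataset (an assumption for this example)
X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, random_state=42)

# Unconstrained tree versus a tree whose leaves must hold at least 10 instances
deep_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
reg_clf = DecisionTreeClassifier(min_samples_leaf=10,
                                 random_state=42).fit(X_train, y_train)

for name, clf in [("unconstrained", deep_clf), ("min_samples_leaf=10", reg_clf)]:
    print(name,
          "nodes:", clf.tree_.node_count,
          "train acc:", round(clf.score(X_train, y_train), 3),
          "test acc:", round(clf.score(X_test, y_test), 3))
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting. Whether the regularized tree generalizes better depends on the data and on the chosen value.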

Other algorithms work by first training the Decision Tree without restrictions, then pruning unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant. Standard statistical tests, such as the chi-square test, are used to estimate the probability that the improvement is purely the result of chance (the *null hypothesis*). If this probability, called the p-value, is higher than a given threshold (typically 5%, controlled by a hyperparameter), then the node is considered unnecessary and its children are deleted. The pruning continues until all unnecessary nodes have been pruned.
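The statistical test for one node can be sketched with `scipy.stats.chi2_contingency`. This is not how Scikit-Learn prunes trees; it only illustrates the idea. The node (the versicolor and virginica instances), the split (petal width ≤ 1.75) and the 5% threshold are assumptions chosen for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

# Hypothetical node holding only versicolor (1) and virginica (2) instances,
# split on petal width <= 1.75 (an assumed threshold for illustration)
in_node = y != 0
X_node, y_node = X[in_node], y[in_node]
goes_left = X_node[:, 1] <= 1.75

# Contingency table: rows are the two children, columns are the two classes
table = np.array([
    [np.sum(y_node[goes_left] == 1), np.sum(y_node[goes_left] == 2)],
    [np.sum(y_node[~goes_left] == 1), np.sum(y_node[~goes_left] == 2)],
])

chi2, p_value, dof, expected = chi2_contingency(table)

# Keep the split only if the improvement is unlikely to be due to chance
keep_split = p_value < 0.05
print(table, p_value, keep_split)
```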

# Regression

Decision Trees are also capable of performing regression tasks.

In [15]:
from sklearn.tree import DecisionTreeRegressor

In [16]:
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X,y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

The main difference between a classification Decision Tree and a regression Decision Tree is that instead of predicting a class in each node, the regression tree predicts a value. The prediction is the average target value of the training instances associated with the leaf node, and the mean squared error (MSE) reported for the node measures how far those instances fall from that average.
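The sketch below checks both statements on the same data as above: the prediction for an instance equals the mean target of the training instances in its leaf, and the impurity stored for that leaf equals their mean squared error around that mean. The variable names and `random_state=42` are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

instance = [[5, 1.5]]
leaf_id = tree_reg.apply(instance)[0]  # leaf reached by this instance

# Targets of the training instances that fall in the same leaf
leaf_targets = y[tree_reg.apply(X) == leaf_id]
leaf_mean = leaf_targets.mean()
leaf_mse = np.mean((leaf_targets - leaf_mean) ** 2)

print(leaf_mean, tree_reg.predict(instance)[0])
print(leaf_mse, tree_reg.tree_.impurity[leaf_id])
```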

The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.

The CART algorithm works in mostly the same way as earlier, but instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.
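For a single feature, the MSE-based search can be sketched as follows: try every midpoint between consecutive values and keep the threshold with the lowest size-weighted MSE. The helper name `best_mse_split` and the step-shaped toy data are assumptions for this example, and the loop is a brute-force illustration rather than Scikit-Learn's implementation.

```python
import numpy as np


def best_mse_split(x, y):
    """Return the (threshold, cost) minimizing the size-weighted MSE."""
    m = x.size
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2
    best_t, best_cost = None, np.inf
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        # np.var is the MSE of predicting the subset's mean for every instance
        cost = (left.size / m) * np.var(left) + (right.size / m) * np.var(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy data: the target jumps from 1.0 to 3.0 halfway along the feature
x = np.linspace(0, 1, 20)
y = np.where(x < 0.5, 1.0, 3.0)

threshold, cost = best_mse_split(x, y)
print(threshold, cost)
```

On this toy data the search places the threshold at the jump, where both subsets have a constant target and the MSE drops to zero.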

Just as with classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks.

# Instability

Decision Trees have a few limitations:
* They love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. One way to limit this problem is to use Principal Component Analysis (PCA), which often results in a better orientation of the training data.
* They are very sensitive to small variations in the training data. Actually, since the algorithm used by Scikit-Learn is stochastic (it randomly selects the set of features to evaluate at each node), we may get very different models even on the same training data (unless we set the random_state hyperparameter).
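The sensitivity to rotation can be seen on a toy dataset. In the sketch below, the classes are separated by a vertical line, so one split is enough. After a 45° rotation of the same points, the tree needs more nodes to approximate the diagonal boundary. The dataset, the angle and `random_state=42` are assumptions chosen for this example. The last lines show that fixing `random_state` makes training reproducible.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X_square = rng.rand(100, 2) - 0.5            # points in a centered unit square
y_square = (X_square[:, 0] > 0).astype(int)  # boundary: the vertical line x0 = 0

# Rotate every point by 45 degrees; the labels stay the same
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
X_rotated = X_square @ rotation

clf_original = DecisionTreeClassifier(random_state=42).fit(X_square, y_square)
clf_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y_square)

print("nodes, original data:", clf_original.tree_.node_count)
print("nodes, rotated data: ", clf_rotated.tree_.node_count)

# With the same random_state, retraining gives the same tree
clf_repeat = DecisionTreeClassifier(random_state=42).fit(X_rotated, y_square)
same_tree = np.array_equal(clf_rotated.tree_.threshold,
                           clf_repeat.tree_.threshold)
print("identical thresholds on retraining:", same_tree)
```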