# Training and Visualizing a Decision Tree

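The tree trained below is exported to a _.dot_ file, which can then be converted to another format such as PDF or PNG with the `dot` command-line tool from the _graphviz_ package [1], e.g. `dot -Tpng iris_tree.dot -o iris_tree.png`.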

[1] http://www.graphviz.org/

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [2]:
from sklearn.tree import export_graphviz
export_graphviz(tree_clf, out_file="iris_tree.dot",
               feature_names=iris.feature_names[2:],
               class_names=iris.target_names,
               rounded=True, filled=True)
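
When Graphviz is not installed, a plain-text view of the same tree can still be useful. The sketch below assumes a scikit-learn version that provides `sklearn.tree.export_text`; the `random_state=42` argument and the `tree_rules` name are additions for reproducibility and illustration, not part of the original cells.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width, as in the cells above
y = iris.target

# random_state is an assumption added here to make the sketch reproducible
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# export_text returns the fitted tree's decision rules as an indented string
tree_rules = export_text(tree_clf, feature_names=list(iris.feature_names[2:]))
print(tree_rules)
```

Each indented line is either a test on one feature or a leaf with its predicted class, so the output mirrors the node layout that Graphviz would draw.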

# Making Predictions

- A node's *gini* attribute measures its impurity. A node is "pure" (gini=0) if all the training instances it applies to belong to the same class.

$\text{G}_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$, where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i^{th}$ node, and $n$ is the number of classes.
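
As a worked instance of the formula, here is a minimal sketch that computes the Gini impurity from per-class instance counts. The function name `gini_impurity` and the example counts are illustrative assumptions; the counts `[0, 49, 5]` were chosen because they are consistent with the class probabilities printed by `predict_proba` later in this notebook.

```python
def gini_impurity(class_counts):
    """Gini impurity G_i = 1 - sum_k p_{i,k}^2 for a single node."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("a node must contain at least one instance")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical nodes, described by the number of instances of each class
pure_node = gini_impurity([50, 0, 0])     # only one class present
mixed_node = gini_impurity([0, 49, 5])    # mostly one class
root_node = gini_impurity([50, 50, 50])   # three equally frequent classes

print(pure_node, round(mixed_node, 3), round(root_node, 3))
```

The pure node evaluates to 0, the mixed node to roughly 0.168, and the balanced three-class node to 2/3, which is the largest value the measure can take with three classes.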

# Estimating class probabilities

In [3]:
tree_clf.predict_proba([[5, 1.5]])

array([[ 0.        ,  0.90740741,  0.09259259]])

In [4]:
tree_clf.predict([[5, 1.5]])

array([1])
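
The probabilities above are the class ratios among the training instances that fall in the same leaf as the query point. The following sketch checks this by hand; it refits the tree so that it is self-contained, and `random_state=42`, `sample`, and `manual_proba` are names and settings introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

# random_state is an assumption added for reproducibility
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

sample = np.array([[5, 1.5]])  # petal length 5 cm, petal width 1.5 cm

# apply() returns the index of the leaf each instance ends up in
leaf_id = tree_clf.apply(sample)[0]
in_same_leaf = tree_clf.apply(X) == leaf_id

# Class ratios among the training instances of that leaf
counts = np.bincount(y[in_same_leaf], minlength=len(iris.target_names))
manual_proba = counts / counts.sum()

print(manual_proba)
print(tree_clf.predict_proba(sample)[0])
```

Both printed vectors should agree, and `predict` simply returns the class with the highest ratio.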

# The CART training algorithm

- **CART cost function**

$\text{J}(k,t_k)=\frac{m_{left}}{m}\text{G}_{left} + \frac{m_{right}}{m}\text{G}_{right}$

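, where

$k$ is the feature and $t_k$ the threshold used to split the training set into two subsets.

$G_{left/right}$ measures the impurity of the left/right subset.

$m_{left/right}$ is the number of instances in the left/right subset, and $m$ is the number of instances in the node being split.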
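
To make the cost function concrete, the sketch below evaluates $J(k, t_k)$ for a given feature and threshold and then searches all candidate splits by brute force. It is a simplified illustration under stated assumptions, not scikit-learn's actual implementation: the helper names (`gini`, `cart_cost`, `best_split`) are hypothetical, and taking midpoints between consecutive distinct feature values as candidate thresholds is a choice made here.

```python
import numpy as np
from sklearn.datasets import load_iris

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def cart_cost(X, y, k, t_k):
    """J(k, t_k): size-weighted impurity of the two subsets."""
    left = X[:, k] <= t_k
    right = ~left
    m, m_left, m_right = len(y), left.sum(), right.sum()
    if m_left == 0 or m_right == 0:
        return np.inf  # not a real split
    return m_left / m * gini(y[left]) + m_right / m * gini(y[right])

def best_split(X, y):
    """Brute-force search for the (k, t_k) pair with the lowest cost."""
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])
        # Assumption: candidate thresholds are midpoints between sorted values
        for t_k in (values[:-1] + values[1:]) / 2:
            cost = cart_cost(X, y, k, t_k)
            if cost < best[2]:
                best = (k, t_k, cost)
    return best

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

cost_at_245 = cart_cost(X, y, k=0, t_k=2.45)  # petal length <= 2.45 cm
best_k, best_t, best_cost = best_split(X, y)
print(cost_at_245, best_k, best_t, best_cost)
```

Splitting on petal length at 2.45 cm isolates one class completely, so the left subset has impurity 0 and the cost reduces to (100/150) × 0.5 = 1/3. CART applies the same search recursively to each resulting subset until a stopping condition such as `max_depth` is reached.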

# Gini impurity or Entropy

- **Entropy** is used in ML as an impurity measure: a set's entropy is zero when it contains instances of only one class.

$\text{H}_i = -\sum_{k=1,\; p_{i,k}\neq 0}^{n} p_{i,k} \log(p_{i,k})$

- _Which one to use?_ Most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
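
The entropy formula can be evaluated the same way as the Gini impurity. The sketch below assumes a base-2 logarithm (the formula above does not state the base; a different base only rescales the values), and the function name and example counts are illustrative.

```python
import math

def entropy(class_counts):
    """H_i = -sum_k p_{i,k} * log2(p_{i,k}), skipping classes with p = 0."""
    total = sum(class_counts)
    probs = [count / total for count in class_counts if count > 0]
    return sum(-p * math.log2(p) for p in probs)

# Hypothetical nodes, described by the number of instances of each class
pure_entropy = entropy([50, 0, 0])
mixed_entropy = entropy([0, 49, 5])
balanced_entropy = entropy([50, 50, 50])

print(pure_entropy, round(mixed_entropy, 3), round(balanced_entropy, 3))
```

The pure node has entropy 0, the `[0, 49, 5]` node about 0.445, and the balanced three-class node $\log_2 3 \approx 1.585$. In scikit-learn, passing `criterion="entropy"` to `DecisionTreeClassifier` selects this measure instead of the default `"gini"`.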

# Regularization Hyperparameters

- *max_depth*: maximum depth of the Decision Tree (default `None`, i.e. unlimited).
- *min_samples_split*: minimum number of samples a node must have before it can split.
- *min_samples_leaf*: minimum number of samples a leaf node must have.
- *min_weight_fraction_leaf*: same as *min_samples_leaf*, but expressed as a fraction of the total number of weighted instances.
- *max_leaf_nodes*: maximum number of leaf nodes.
- *max_features*: maximum number of features that are evaluated for splitting at each node.
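
The effect of one of these hyperparameters can be inspected directly on a fitted tree. The following sketch compares an unrestricted tree with one constrained by `min_samples_leaf`; the value 5, the `random_state`, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

# No regularization: the tree grows until it can no longer reduce impurity
unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)

# Regularized: every leaf must keep at least 5 training instances
regularized = DecisionTreeClassifier(min_samples_leaf=5, random_state=42).fit(X, y)

# Number of training instances that land in each leaf of the regularized tree
leaf_sizes = np.bincount(regularized.apply(X))
leaf_sizes = leaf_sizes[leaf_sizes > 0]  # drop indices of internal nodes

print("unrestricted:", unrestricted.get_depth(), "levels,",
      unrestricted.get_n_leaves(), "leaves")
print("regularized: ", regularized.get_depth(), "levels,",
      regularized.get_n_leaves(), "leaves")
print("smallest regularized leaf:", leaf_sizes.min())
```

Increasing the `min_*` hyperparameters or reducing the `max_*` hyperparameters constrains the tree further, which generally reduces the risk of overfitting.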

# Regression

Instead of splitting the training set in a way that minimizes impurity, the CART algorithm now splits it in a way that minimizes the MSE.

$\text{J}(k,t_k) = \frac{m_{left}}{m}\text{MSE}_{left} + \frac{m_{right}}{m}\text{MSE}_{right}$
, where

$\text{MSE}_{node} = \frac{1}{m_{node}} \sum_{i\in\text{node}}{(\hat{y}_{node}-y^{(i)})^2}$

$\hat{y}_{node} = \frac{1}{m_{node}} \sum_{i\in\text{node}}y^{(i)}$
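
A small numeric example may help here. The sketch below takes the MSE of a node to be the mean of the squared differences between the node's prediction and its targets; the function names and the toy target values are hypothetical.

```python
def node_prediction(targets):
    """Prediction of a node: the mean target value of its instances."""
    return sum(targets) / len(targets)

def node_mse(targets):
    """Mean squared error of a node around its own prediction."""
    y_hat = node_prediction(targets)
    return sum((y_hat - t) ** 2 for t in targets) / len(targets)

def regression_cost(left_targets, right_targets):
    """J(k, t_k) for regression: size-weighted MSE of the two subsets."""
    m_left, m_right = len(left_targets), len(right_targets)
    m = m_left + m_right
    return (m_left / m) * node_mse(left_targets) + (m_right / m) * node_mse(right_targets)

# Toy targets of the instances on each side of a candidate split
left = [1.0, 2.0, 3.0]
right = [10.0, 14.0]

left_mse = node_mse(left)
right_mse = node_mse(right)
cost = regression_cost(left, right)
print(node_prediction(left), left_mse, node_prediction(right), right_mse, cost)
```

The left subset predicts 2.0 with an MSE of 2/3, the right subset predicts 12.0 with an MSE of 4.0, and the weighted cost is (3/5)(2/3) + (2/5)(4.0) = 2.0.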

In [8]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

# Instability

**The main issue with Decision Trees is that they are very sensitive to small variations in the training data.**
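
One way to probe this sensitivity is to retrain the same tree on slightly different subsets of the training data and compare the learned split rules. The sketch below is only an exploratory aid: the helper `split_rules`, the choice of leaving out 10 random instances, and the seeds are assumptions, and how much the rules change depends on the data and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

def split_rules(X_train, y_train):
    """Return the (feature index, threshold) pairs of a depth-2 tree."""
    clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)
    tree = clf.tree_
    internal = tree.children_left != -1  # leaves have no children (-1)
    return [(int(f), round(float(t), 3))
            for f, t in zip(tree.feature[internal], tree.threshold[internal])]

rng = np.random.default_rng(0)
rule_sets = []
for _ in range(5):
    # Leave out 10 randomly chosen training instances each time
    keep = rng.choice(len(y), size=len(y) - 10, replace=False)
    rule_sets.append(split_rules(X[keep], y[keep]))

print("full data:", split_rules(X, y))
for rules in rule_sets:
    print("subset:   ", rules)
```

Comparing the printed thresholds across subsets gives a rough picture of how stable the learned rules are for this dataset.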

# Exercises