# Lab9: Decision Trees
## Building a Classification Tree in `scikit-learn`

We'll build a classification tree using the **Iris** data set. We print the first 5 rows of this data set:

In [40]:
import pandas as pd
import matplotlib.pyplot as plt

# Read in the data.
path = './iris.csv'
iris = pd.read_csv(path, sep=',')

iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [41]:
# Define X and y.
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width' ]

X = iris[feature_cols]
y = iris.species

We now use the *sklearn* library to create the decision tree model:

In [42]:
# Fit a classification tree with max_depth=3 on all data.
from sklearn.tree import DecisionTreeClassifier

treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best')

We can now visualize the decision tree using the **export_graphviz** function:

In [46]:
from sklearn.tree import export_graphviz

# Create a Graphviz file.
export_graphviz(treeclf, out_file='./tree_iris.dot', feature_names=feature_cols)

# At the command line, run this to convert to PNG:
#   dot -Tpng tree_iris.dot -o tree_iris.png

# Or, just drag this image to your desktop or Powerpoint.

![Tree for iris data](./tree_iris.png)
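If Graphviz is not installed, the same tree can also be inspected as plain text. The following is a minimal sketch, assuming a scikit-learn version that provides `export_text` (0.21 or later) and using scikit-learn's bundled copy of the Iris data in place of `./iris.csv`; the integer class labels 0, 1, 2 are therefore an assumption of this sketch, not the species strings used above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled Iris data stands in for ./iris.csv,
# so the classes are the integers 0, 1, 2 rather than species names.
data = load_iris()
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Same hyperparameters as the tree fitted above.
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(data.data, data.target)

# Render the fitted tree as indented if/else rules.
tree_rules = export_text(treeclf, feature_names=feature_cols)
print(tree_rules)
```

Each indented line is one split or leaf, so the printed rules can be read top to bottom in the same way as the image. The exact splits may differ slightly between scikit-learn versions.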

We can **observe** that all the samples of the first class are classified correctly just by checking whether `petal_width` $\lt 0.8$. To separate the samples of the second and third classes, the tree again uses the `petal_width` feature, this time checking whether it is $\lt 1.75$. This split correctly classifies $49/50$ samples of the second class and $45/50$ samples of the third class.

The tree uses the `petal_length` feature to further separate the second and third classes, but it is not as effective as `petal_width`.
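The counts quoted above can be checked without fitting a tree, by applying the two `petal_width` thresholds by hand. This is a minimal sketch, assuming scikit-learn's bundled Iris data matches `./iris.csv`; the helper `rule` and the `predicted` column are hypothetical names introduced for illustration, and the species labels here are scikit-learn's (`setosa`, `versicolor`, `virginica`) rather than the `Iris-` prefixed strings in the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled Iris data stands in for ./iris.csv.
data = load_iris()
iris = pd.DataFrame(
    data.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
iris['species'] = data.target_names[data.target]


def rule(petal_width):
    """Classify a sample using only the two petal_width thresholds."""
    if petal_width < 0.8:
        return 'setosa'
    if petal_width < 1.75:
        return 'versicolor'
    return 'virginica'


iris['predicted'] = iris['petal_width'].apply(rule)

# Rows are the true species, columns are the rule's predictions.
counts = pd.crosstab(iris['species'], iris['predicted'])
print(counts)
```

The diagonal of `counts` gives the number of correctly classified samples per class, which can be compared with the $50/50$, $49/50$ and $45/50$ figures read off the tree.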

We can compute the importance of each of our features:

In [47]:
# Compute the feature importances (the normalized total reduction in Gini impurity contributed by each feature).

pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_})

|   | feature | importance |
|---|---|---|
| 0 | sepal_length | 0.000000 |
| 1 | sepal_width | 0.000000 |
| 2 | petal_length | 0.053936 |
| 3 | petal_width | 0.946064 |


Since `sepal_length` and `sepal_width` are not used in the decision tree, their importance is zero. The `petal_width` feature is more important than the `petal_length` one. This is expected, since `petal_width` alone is used to classify almost all of the samples.
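To make the meaning of these numbers concrete, the importances can be recomputed by hand from the fitted tree's internal arrays. The sketch below sums, for every split node, the weighted impurity decrease it achieves and credits it to the feature used at that node, then normalizes the totals to sum to one. It assumes scikit-learn's bundled Iris data in place of `./iris.csv`; the variable `manual` is a hypothetical name used only here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris data stands in for ./iris.csv.
data = load_iris()
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(data.data, data.target)

tree = treeclf.tree_
manual = np.zeros(data.data.shape[1])

for node in range(tree.node_count):
    left = tree.children_left[node]
    right = tree.children_right[node]
    if left == -1:
        # Leaf nodes have no split, so they contribute nothing.
        continue
    # Weighted Gini impurity before the split minus after the split.
    decrease = (
        tree.weighted_n_node_samples[node] * tree.impurity[node]
        - tree.weighted_n_node_samples[left] * tree.impurity[left]
        - tree.weighted_n_node_samples[right] * tree.impurity[right]
    )
    manual[tree.feature[node]] += decrease

# Normalize so the importances sum to 1.
manual /= manual.sum()

print(manual)
print(treeclf.feature_importances_)
```

The two printed arrays should agree, which shows why a feature that never appears in a split receives an importance of exactly zero.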