# Visualizing Decision Trees

Some advantages of decision trees:

- Can be used for categorical or numeric data
- White box model (can easily explain model using boolean logic)
- Simple to understand and interpret
- **Can be easily visualized**

### How do we make decisions?

Decisions we make every day may seem thoughtless and automatic, but they build on a lifetime of learning and internalized rules. By examining the parts that go into a simple decision, we can get a good intuition for how decision trees work.

![](ex_tree.jpg)

**Terminology:**
- **root**: The base node containing all examples, represented at the top of the tree
- **node**: Represents a subset of our samples; this is where splitting occurs as defined by some rule learned from the features
- **branch**: Represents the path data take as they move *down* the tree
- **leaf**: A terminal node; data are classified by the majority class of the leaf
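
To connect these terms to an actual model, here is a minimal sketch that fits a tree on the iris data and counts its nodes and leaves. It relies on the fitted estimator's `tree_` attribute, where scikit-learn marks a leaf by setting its child index to -1. The variable names (`clf`, `is_leaf`, and so on) are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

tree = clf.tree_
# A leaf has no children: scikit-learn stores -1 as its child index.
is_leaf = tree.children_left == -1

root_samples = int(tree.n_node_samples[0])  # node 0 is the root and holds every sample
n_nodes = int(tree.node_count)              # root + internal nodes + leaves
n_leaves = int(is_leaf.sum())

print(f"root holds {root_samples} samples")
print(f"{n_nodes} nodes in total, {n_leaves} of them leaves")
```

Because every split in a scikit-learn tree is binary, the number of nodes is always twice the number of leaves minus one.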


## Decision Trees in Python

sklearn provides decision tree estimators for both classification (`DecisionTreeClassifier`) and regression (`DecisionTreeRegressor`). We'll be working with classification for this example.

In [None]:
import sklearn.datasets as datasets
import pandas as pd

In [None]:
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

In [None]:
print(iris.DESCR)

In [None]:
df.describe()

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Fit an unconstrained tree on all four iris features
dtree = DecisionTreeClassifier()
dtree.fit(df, y)

### Visualizing Trees in Python

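sklearn can export a fitted tree to Graphviz's DOT format with `export_graphviz`. Rendering that output as an image requires the Graphviz system package as well as the `pydotplus` Python package. If you're getting errors regarding graphviz, we'll work on installing it in small groups, as installation is OS-specific.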

In [None]:
!pip install pydotplus

In [None]:
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

In [None]:
dot_data = StringIO()

export_graphviz(dtree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=list(df.columns),
                # class names in ascending order of the target values (0, 1, 2)
                class_names=['Setosa', 'Versicolour', 'Virginica']
                )


graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
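
If Graphviz is not available, a text rendering is one possible fallback. The sketch below uses `sklearn.tree.export_text` (available in scikit-learn 0.21 and later), which needs no external packages; it refits its own tree so that it runs independently of the cells above. `sklearn.tree.plot_tree` is a matplotlib-based alternative in the same module.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# export_text returns the learned rules as an indented plain-text string:
# one line per split, with leaves shown as "class: <label>".
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each level of indentation corresponds to one step down a branch, so the printed rules can be read as nested if/else statements.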

### Interpreting Decision Trees

- What is the split criterion for the root node?
- What can we say about the purity of the leaves?
- What path do data take to reach the far-left leaf?
- What is the minimum number of splits to reach a leaf? What's the maximum?
- Can we assume anything about which features are most important?
- Are there any hyperparameters you might change to reduce potential overfitting?
- Do you think that a decision tree is a good model for this dataset?
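
Several of these questions can also be checked against the fitted model rather than read off the picture. The following is a rough sketch, not the only way to do it: it refits a tree on iris (with `random_state=0`, an arbitrary choice) and reads the root split, leaf impurities, depth, and feature importances from the estimator.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
tree = clf.tree_

# Root split: the feature and threshold tested at node 0
root_feature = iris.feature_names[tree.feature[0]]
root_threshold = tree.threshold[0]
print(f"root split: {root_feature} <= {root_threshold:.2f}")

# Leaf purity: impurity (Gini by default) of every leaf; 0 means a pure leaf
is_leaf = tree.children_left == -1
leaf_impurity = tree.impurity[is_leaf]
print("largest leaf impurity:", leaf_impurity.max())

# Maximum number of splits between the root and any leaf
depth = clf.get_depth()
print("tree depth:", depth)

# Impurity-based feature importances (normalized to sum to 1)
importances = dict(zip(iris.feature_names, clf.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

Impurity-based importances describe this particular fitted tree, so they are a starting point for discussion rather than a final ranking of the features.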

### Independent Practice

Now let's try using the breast cancer dataset, which has many more features but only two classes. (You may wish to view the output in a new window as it's quite wide.)

In [None]:
cancer = datasets.load_breast_cancer()

In [None]:
# Your code here

In [None]:
dot_data = StringIO()

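# Assumes dtree was refit on the cancer data in the cell above
export_graphviz(dtree, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=list(cancer.feature_names),
                # target 0 is malignant, target 1 is benign
                class_names=['Malignant', 'Benign']
                )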


graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

What is the split criterion for the root node?

What can we say about the purity of the leaves?

What path do data take to reach the far-left leaf?

What is the minimum number of splits to reach a leaf? What's the maximum?

Can we assume anything about which features are most important?

Are there any hyperparameters you might change to reduce potential overfitting?

Do you think that a decision tree is a good model for this dataset?
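
One way to ground the overfitting and model-suitability questions is to compare an unconstrained tree with a depth-limited one using cross-validation. This is only a sketch under assumed settings: `max_depth=3`, `cv=5`, and `random_state=0` are illustrative choices, not recommendations from this lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 5-fold cross-validated accuracy on held-out folds
    scores = cross_val_score(clf, cancer.data, cancer.target, cv=5)
    results[depth] = scores.mean()
    print(f"max_depth={depth}: mean accuracy {scores.mean():.3f}")
```

If the shallower tree scores about as well as the full one, the extra depth is probably fitting noise. Other hyperparameters such as `min_samples_leaf` can be compared the same way.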