# Decision Trees

What are our learning objectives for this lesson?

* Learn how to apply Decision Trees in a classification problem
* Get more familiar with the scikit-learn library

In this lab, we will once again build a machine learning model that predicts the species of an iris from its petal and sepal dimensions. This time we will use another approach from the supervised learning toolbox, the decision tree, instead of the support vector machine used in the last lab.

Beginning with the root node, every node that is not a leaf node acts as a decision node in the tree. In essence, a decision tree is built by a greedy search: at each decision node, we look for the split point that best separates the training data at that step, without looking ahead to later splits. The decision nodes are where the data is split, and the leaf nodes represent outputs such as a class label.
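To make the greedy search concrete, here is a minimal sketch of how one split point could be chosen for a single feature using Gini impurity. The helper names `gini` and `best_threshold` and the toy data are made up for illustration; Scikit-Learn's own implementation is more elaborate, so treat this as a teaching aid rather than a description of the library's internals.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Greedy search over one feature (hypothetical helper).

    Tries the midpoint between each pair of adjacent distinct values and
    keeps the threshold with the lowest weighted impurity of the two sides.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    # (threshold, weighted impurity); None means no split beats the parent node
    best = (None, gini(labels))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Toy data, invented for this example: petal lengths (cm) and species labels
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

threshold, impurity = best_threshold(petal_length, species)
print("Best threshold:", threshold, "weighted Gini:", round(impurity, 3))
```

A full tree repeats this search over every feature at every decision node, then recurses into the left and right subsets until a stopping rule (such as a depth limit) is reached.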

Content used in this lesson is based upon information in the following sources:
* Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras and TensorFlow: concepts, tools, and techniques to build intelligent systems (2nd ed.). O’Reilly.

## Lab Tasks 

1. Import the iris dataset
2. Read the documentation for Scikit-Learn's DecisionTreeClassifier and use it
3. Visualize the tree using Scikit-Learn

### Import the Iris Dataset

* Import the Iris dataset
* Split the dataset into train and test sets using a 70:30 split ratio
    * You can reduce the dimension of the data by dropping features or by projecting the inputs to a lower dimension, or you can keep the data as it is.
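The two dimension-reduction options mentioned above could look roughly like the sketch below. Keeping only the petal columns and using `PCA` with two components are illustrative assumptions, not requirements of the lab; the variable names are also made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Option 1: drop features. Columns 2 and 3 hold petal length and petal width.
X_train_petal = X_train[:, 2:]
X_test_petal = X_test[:, 2:]

# Option 2: project the inputs onto 2 principal components.
# Fit the projection on the training set only, then reuse it on the test set.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(X_train_petal.shape, X_train_pca.shape, X_test_pca.shape)
```

Either reduced pair of arrays can be passed to the model in place of `X_train` and `X_test`.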

Here are some import statements to get you started.

In [12]:
!pip install -U scikit-learn



In [19]:
from sklearn import datasets
import sklearn.tree as tree
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score

In [20]:
# TODO: import the iris dataset here

iris = datasets.load_iris()

# TODO: Split the data into train and test sets (70:30).
#       Reduce the dimension first if you wish.

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

### Select a Model

Visit the [documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree) of ```sklearn.tree``` and select the tree model that best fits the task we are training the model to perform. 

Once you have selected a model, we want to determine the depth of the tree. In Scikit-Learn, we can adjust the ```max_depth``` parameter to limit how deep the tree is allowed to grow. The deeper the tree, the more splits we make in the data, and the more complex the model becomes. If the number of splits is too low, the model underfits the data, and if it is too high, the model overfits. 

Recall that the root node is considered to have a depth of 0. You can try different depths. For now, we can simply pass ```None``` (the default value for this optional parameter) and see how many levels end up in our tree.
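One way to compare depth limits is to cross-validate a tree for each candidate value, as in the sketch below. The list of depths, `cv=5`, and `random_state=42` are illustrative choices made for this example, not values prescribed by the lab.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Mean 5-fold cross-validation accuracy for a few depth limits.
# None lets the tree grow until every leaf is pure.
mean_scores = {}
for depth in [1, 2, 3, 4, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[depth] = scores.mean()

for depth, score in mean_scores.items():
    print(f"max_depth={depth}: mean CV accuracy = {score:.3f}")
```

A tree with ```max_depth=1``` has only two leaves, so it cannot separate all three species and should underfit; comparing its score with the deeper trees shows the effect described above.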

What we want to do in the cell below:
* Set up the model of choice
* Fit the model to the training set

In [22]:
# TODO: Set up the model of choice and fit the training data to it

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
print("Depth of tree:", clf.get_depth())
print("Cross-validation scores:", scores)
print("Accuracy:", accuracy)

Depth of tree: 6
Cross-validation scores: [1.         0.93333333 1.         0.93333333 0.93333333 0.86666667
 0.93333333 0.93333333 1.         1.        ]
Accuracy: 1.0


### Test the Fitted Model

Visit the documentation of your particular Scikit-Learn tree model to find different built-in methods for testing your model. Try the different methods available.
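As a starting point, the sketch below tries a few evaluation tools on a fitted classifier. It assumes a ```DecisionTreeClassifier``` and the same 70:30 split used earlier, and it adds helpers from ```sklearn.metrics```, which are not part of the tree model itself.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# For a classifier, score() returns mean accuracy, the same as accuracy_score
print("score():", clf.score(X_test, y_test))
print("accuracy_score:", accuracy_score(y_test, y_pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Class probabilities for the first three test samples
proba = clf.predict_proba(X_test[:3])
print(proba)

# Structural information about the fitted tree
print("Depth:", clf.get_depth(), "Leaves:", clf.get_n_leaves())
```

The confusion matrix is often more informative than a single accuracy number, because it shows which species are being mistaken for which.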

In [23]:
# TODO: Try different built-in testing methods

from sklearn.metrics import accuracy_score

# Evaluate the DecisionTreeClassifier fitted above on the held-out test set
y_pred = clf.predict(X_test)
# For a classifier, this matches clf.score(X_test, y_test)
accuracy = accuracy_score(y_test, y_pred)
print("Depth of tree:", clf.get_depth())
print("Accuracy:", accuracy)

Depth of tree: 6
Accuracy: 1.0


##### 💯 What is the accuracy of your model?

### Visualize the Tree

A decision tree makes its internal decisions intuitive to interpret. One way to examine the decision process of the model is to print out the tree and see for ourselves what kind of decision is being made in each node. We will use a function in ```sklearn.tree``` called ```export_graphviz``` to visualize the tree.

In order to visualize the graph inside this notebook (instead of saving .dot and image files in a local directory outside the notebook), we will install and import some modules. 

If you wish to save the .dot graph to a local directory and convert the .dot to an image file instead, note the following:
* pick the directory you would like to save the .dot to
    *  ```f = open("some/directory/on/your/machine/iris_tree.dot", 'w')```
* add  ```out_file=f``` to the parameters when you call the ```export_graphviz``` function
* you don't have to save the output of ```export_graphviz``` to a variable since the output is being saved to the specified directory
* run ```!dot -Tpng iris_tree.dot > iris_tree.png``` in the directory where the .dot is saved to obtain a png of the tree graph
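Put together, the steps for saving the .dot file locally might look like the sketch below. The temporary directory stands in for a directory of your choice, and the ```max_depth=3``` tree fitted on the full dataset is only there to make the example self-contained.

```python
import os
import tempfile

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# A temporary directory is used here as a placeholder for your own directory
out_dir = tempfile.mkdtemp()
dot_path = os.path.join(out_dir, "iris_tree.dot")

# Passing a file handle as out_file writes the DOT source to disk,
# so there is no return value to store in a variable.
with open(dot_path, "w") as f:
    export_graphviz(
        clf,
        out_file=f,
        feature_names=iris.feature_names,
        class_names=iris.target_names,
        filled=True,
        rounded=True,
    )

print("Wrote", dot_path)
```

Converting the resulting file to a PNG still requires the Graphviz ```dot``` command to be installed on your machine.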

In [32]:
# DELETE THIS CELL if you wish to save the image file locally

!pip install pydotplus
!pip install graphviz
import pydotplus
from IPython.display import Image, display



What we want to do here:
* Visit the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html) of ```export_graphviz``` to find out what parameters it takes
* Experiment with ```max_depth```: print out the graphs for different depths and find the ```max_depth``` value that yields the highest performance. 

In [37]:
from sklearn.tree import export_graphviz
import graphviz

# Create and fit the classifier first; export_graphviz raises
# NotFittedError if it is given an estimator that has not been fitted.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# TODO: fill in the parameters for this function
dot_data = export_graphviz(
    clf, out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True, rounded=True,
    special_characters=True
)
# DELETE THESE 2 LINES if you wish to save the image file locally
graph = graphviz.Source(dot_data)
graph

##### What is the depth that seemed to work the best for your model? Is it deeper or shallower than the initial depth with the default setting? Do you have a hypothesis on why that might be?