# Lab 2 - Decision Trees


## Decision Trees

Decision trees are great because they let us convey complex predictive models to people who are new to machine learning. We can build a model and then "drop" a new observation through it (like Plinko!) until we reach an outcome, whether quantitative or qualitative. However, remember their limitations: each split tests a single feature against a threshold, so a tree can only draw axis-parallel (horizontal/vertical) decision boundaries, which can limit its accuracy in real-world applications. Decision trees are also prone to overfitting (more later!) unless we make some adjustments to how the algorithm runs.
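
To see what "one feature against one threshold" looks like in practice, here is a minimal sketch on synthetic data. The data set from `make_classification` and the feature names `x1` and `x2` are illustrative assumptions, not part of this lab's data. Every printed rule compares a single feature to a cut-off, which is why the resulting boundaries are axis-parallel.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature, two-class data (illustrative only, not the housing data)
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

# Keep the tree shallow so the printed rules stay readable
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text prints each split as "<feature> <= <threshold>"
rules = export_text(shallow, feature_names=['x1', 'x2'])
print(rules)
```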

Since we are already becoming comfortable with the Boston Housing Prices data set, let's continue to use that. We converted our target variable, **MEDV**, to a binary outcome so we will do the same here and use decision trees for classification modeling.

## Shortcuts
*   Use the "+ Code" button in the top left corner to add a new block for running **code**, or the "+ Text" button to add a block for writing **text**
*   I suggest looking at "Tools-->Keyboard Shortcuts..." for additional ways to run Colaboratory, but here are a few useful ones:
> **Ctrl+F9** - Run all blocks   
> **Ctrl+Enter** - Run selected block   
> **Alt+Enter** - Run block and add a new block beneath   
> **Shift+Enter** - Run block and select next block   
> **Ctrl+F8** - Run all blocks before selected block   
> **Ctrl+F10** - Run selected block and all following blocks   
> **Ctrl+M+Y** - Convert selected block to a *code* block   
> **Ctrl+M+M** - Convert selected block to a *text* block

Also useful, Colaboratory supports code completion. Start typing code and press the **Tab** key. A drop down will appear with likely code based on what you typed. If only one possible command exists, it should complete it for you automatically.

Even better yet, if there is an error produced by your code, Colaboratory will provide a button at the bottom of the code output to search StackOverflow for an answer!

**Remember**: Code blocks need to be run in succession or they might produce errors!


In [0]:
!apt-get -qq install -y graphviz && pip install -q pydot
import pydot

In [0]:
#Install packages not native to Colaboratory
!pip -q install graphviz
!pip -q install pydot
import pydot

In [0]:
#Load our packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
from google.colab import files
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans
from sklearn import datasets
%matplotlib inline 

In [0]:
#Upload the 'HousingData.csv' file from your local computer 
files.upload()

In [0]:
#Assign data to objects
raw = pd.read_csv('HousingData.csv')

In [0]:
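#Repeat the data cleaning activities from Lab 1
#Fill missing values in these columns with each column's mean
num_cols = ['CRIM','INDUS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
raw[num_cols] = raw[num_cols].fillna(raw[num_cols].mean())

#Fill missing values in ZN and CHAS with each column's most frequent value
#(assigning the result back avoids pandas' chained inplace fillna, which may not update raw)
for col in ['ZN','CHAS']:
    raw[col] = raw[col].fillna(raw[col].mode()[0])

#Binary coding of target (0 = MEDV below 21.2, 1 = otherwise)
raw['MEDV'] = np.where(raw['MEDV'] < 21.2, 0, 1)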

Let's build a decision tree in scikit-learn using the default values for its parameters. Remember, you can always check the arguments a Python function takes by placing your cursor inside its parentheses "()" and pressing **Tab**.
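
If you prefer to check the defaults in code rather than with the Tab shortcut, scikit-learn estimators expose them through `get_params()`. The short sketch below prints a handful of the defaults that matter for tree size; the selection of parameter names is just an illustrative subset.

```python
from sklearn.tree import DecisionTreeClassifier

# get_params() returns every constructor argument with its current value
defaults = DecisionTreeClassifier().get_params()

# Print a few of the defaults that control how large the tree can grow
for name in ['criterion', 'max_depth', 'min_samples_split',
             'min_samples_leaf', 'min_impurity_decrease']:
    print(name, '=', defaults[name])
```

Running `help(DecisionTreeClassifier)` in a code block shows the full documentation for each argument.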

In [0]:
#Build and visualize a decision tree classifier
X = raw.iloc[:, :-1].values
y = raw['MEDV'].values

dt = DecisionTreeClassifier()
dt.fit(X, y)

dt_viz = tree.export_graphviz(dt, 
                              out_file=None, 
                              feature_names=list(raw.columns[:-1]),  #same columns as X
                              class_names=['low','high'],  
                              filled=True, 
                              rounded=True,
                              proportion=True,
                              special_characters=True)

graph = graphviz.Source(dt_viz)
graph.format = 'png'
graph

From the above, we can see many of the features we learned about in the decision trees lecture:

* The order in which variables are "split" on
* The Gini impurity score of each node (note that the darker the color of the node, the greater the proportion of one class relative to the other)
* The proportion of all observations included in each node (i.e. "samples = xx%")
* The proportion of observations of each class within a node (i.e. "value = [low, high]")
* The outcome we assign to the node based on the most frequently occurring class (i.e. "class = low")
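
As a reminder of where the Gini number in each node comes from, the sketch below computes it by hand from a node's class proportions using the standard formula, 1 minus the sum of squared class proportions. The helper name `gini_impurity` and the example proportions are made up for illustration.

```python
import numpy as np

def gini_impurity(class_proportions):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    p = np.asarray(class_proportions, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Hypothetical "value = [low, high]" proportions for three nodes
mixed = gini_impurity([0.5, 0.5])    # evenly mixed node (worst case for two classes)
skewed = gini_impurity([0.9, 0.1])   # mostly one class
pure = gini_impurity([1.0, 0.0])     # only one class present

print(mixed, skewed, pure)
```

A node with an even mix of the two classes has the highest impurity, and a pure node has an impurity of zero, which matches the lighter and darker node colors in the plot.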

We can also see that the decision tree is giant and has many branches because we built it without stopping or pruning it, which means it is very likely *overfitting* (most of the terminal nodes near the bottom contain only a few observations). Thankfully, we can adjust parameters like:

* `max_depth` - How *deep* will we let our tree grow overall (i.e. how many levels of splits are allowed)?
* `min_samples_split` - What is the minimum number of observations a node must contain for us to consider splitting it further?
* `min_samples_leaf` - What is the minimum number of observations a terminal node must contain?
* `min_impurity_decrease` - What is the minimum decrease in Gini impurity a split must achieve for us to split a node further?
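
To get a feel for how much each of these parameters shrinks a tree, the sketch below fits one tree per setting and reports its depth and number of leaves. It uses synthetic data from `make_classification` so that it runs on its own; the specific values (4, 50, 25, 0.01) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (illustrative only, not the housing data)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=3,
                                     n_redundant=0, flip_y=0.2, random_state=0)

# One stopping criterion per tree so their effects can be compared
settings = {
    'default': {},
    'max_depth=4': {'max_depth': 4},
    'min_samples_split=50': {'min_samples_split': 50},
    'min_samples_leaf=25': {'min_samples_leaf': 25},
    'min_impurity_decrease=0.01': {'min_impurity_decrease': 0.01},
}

sizes = {}
for label, params in settings.items():
    model = DecisionTreeClassifier(random_state=0, **params).fit(X_demo, y_demo)
    sizes[label] = (model.get_depth(), model.get_n_leaves())
    print(f"{label:28s} depth={sizes[label][0]:2d} leaves={sizes[label][1]:3d}")
```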

Let's play with these (and I encourage you to do the same)!

In [0]:
#Build and visualize a decision tree classifier with stopping/pruning criteria
dt = DecisionTreeClassifier(max_depth=4,
                            min_samples_leaf=25)
dt.fit(X, y)

dt_viz = tree.export_graphviz(dt, 
                              out_file=None, 
                              feature_names=list(raw.columns[:-1]),  #same columns as X
                              class_names=['low','high'],  
                              filled=True, 
                              rounded=True,
                              proportion=True,
                              special_characters=True)

graph = graphviz.Source(dt_viz)
graph.format = 'png'
graph

In the example above, we set `max_depth` to 4, meaning that no path from the root to a terminal node may pass through more than 4 splits. This is visually obvious in the decision tree above. The value of 25 for `min_samples_leaf` means that a split is only allowed if at least 25 observations would fall into each of the resulting nodes. You can rebuild this tree as many times as you'd like using different parameter values, or even some of the parameters we didn't use, like `min_samples_split` and `min_impurity_decrease`.
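
You can also confirm these two constraints in code instead of reading them off the plot. The sketch below is a stand-alone illustration on synthetic data (an assumption, so that it runs without the housing file): it fits a tree with the same two settings, then checks the fitted depth and the number of training observations in each terminal node.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=3,
                                     n_redundant=0, flip_y=0.2, random_state=0)

pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=25,
                                random_state=0).fit(X_demo, y_demo)

# apply() returns the index of the terminal node each observation lands in
leaf_ids = pruned.apply(X_demo)
_, leaf_sizes = np.unique(leaf_ids, return_counts=True)

print("depth of fitted tree:", pruned.get_depth())
print("observations per terminal node:", leaf_sizes)
```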

How well do you like the final tree you created? Why did you decide on the parameters you chose?

## Your Turn!
We've already loaded the **"Admission_Predict.csv"** data set above and transformed the **"Chance of Admit"** variable into a binary one. Let's explore it using decision trees next!

**Build a default decision tree to predict "Chance of Admit" using the remaining features of the "Admission_Predict.csv" data set. What feature is considered the most "important" in the context of decision trees?**

In [0]:
#Upload the "Admission_Predict.csv" file from your local computer
files.upload()
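
One way to approach the "importance" question is the `feature_importances_` attribute that scikit-learn stores on a fitted tree. The sketch below shows the pattern on synthetic data; the feature names `f0` to `f4` are placeholders, so you would substitute the columns of your own data set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with placeholder feature names (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']

model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Importances are normalized impurity reductions; they sum to 1 across features
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The feature at the top of the sorted output contributed the largest total reduction in Gini impurity across all of the splits that used it.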

**Build another tree. This time adjust the stopping/pruning parameters like we did above and find a tree you feel comfortable with. Why do you feel that way? Also, feel free to explore the other parameters within `DecisionTreeClassifier()` that you can adjust to optimize your tree. Why did you choose those?**
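
If you want something more than visual inspection to back up your choice, one option is to hold out part of the data and compare accuracy on the training and held-out portions. The sketch below is only a starting point: it uses synthetic data, and the candidate settings are hypothetical examples, not suggested answers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=3,
                                     n_redundant=0, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# Hypothetical candidate settings; swap in whichever parameters you want to compare
candidates = [
    {'max_depth': 2},
    {'max_depth': 4, 'min_samples_leaf': 25},
    {'max_depth': 8, 'min_samples_split': 20},
    {},  # default, fully grown tree
]

results = []
for params in candidates:
    model = DecisionTreeClassifier(random_state=0, **params).fit(X_tr, y_tr)
    results.append((params, model.score(X_tr, y_tr), model.score(X_te, y_te)))

for params, train_acc, test_acc in results:
    print(f"{str(params):45s} train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between training and held-out accuracy is a sign that a tree has grown too complex for the data.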