## Lab - Decision Trees

This is a multi-part lab. In the first part, you will train a simple decision tree on a dataset. In the second, you will plot the decision boundary of the classifier, and in the third you will output a decision tree diagram.

You will use the Heart dataset to predict whether a patient has AHD.
The dataset contains information about patients with heart conditions, including their age, sex, and other medical parameters.
Your task is to fit a decision tree model and predict the value of AHD (yes or no) for a given data sample.
You will use only two features, maximum heart rate (MaxHR) and age, to predict AHD.
The data is available at: https://raw.githubusercontent.com/colaberry/DSin100days/master/data/Heart.csv

Some of the data in this lab are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) by G. James, D. Witten, T. Hastie and R. Tibshirani.

In [None]:
# Importing pandas
import pandas as pd
import numpy as np 
heart = pd.read_csv('https://raw.githubusercontent.com/colaberry/DSin100days/master/data/Heart.csv', na_values='?').dropna()
heart.info()
heart.head()


## Part 1: Training the classifier
Start by training the decision tree classifier. This is similar to what we did in the main section. 

In [None]:
# get dataset  
data_set = heart[["Age","MaxHR","AHD"]]

In [None]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap 
from sklearn.preprocessing import LabelEncoder


labels = # use Label encoder to fit transform the AHD column data
colors = ['yellow','black']
cmap= ListedColormap(colors)
plt.figure(figsize=(10,10))
plt.xlabel('Age', fontsize=15)
plt.ylabel('MaxHR', fontsize=15)
# plot age vs maxhr 
plt.scatter(data_set['####'].values, data_set['####'].values, c=labels, cmap=cmap )


<img src="../../../images/age_vs_hr.png">
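As a reminder of how label encoding behaves, here is a minimal sketch on a toy list of labels. The values and variable names (`toy_target`, `encoder`, `toy_labels`) are hypothetical and are not taken from the Heart dataset; the sketch only illustrates that `LabelEncoder` sorts the distinct class names and maps each one to an integer.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target column (hypothetical values, for illustration only)
toy_target = ["No", "Yes", "Yes", "No", "Yes"]

encoder = LabelEncoder()
# fit_transform learns the distinct classes and returns one integer per entry
toy_labels = encoder.fit_transform(toy_target)

print(encoder.classes_)  # distinct class names, in sorted order
print(toy_labels)        # integer code for each entry of toy_target
```

The position of a name in `encoder.classes_` is the integer it is encoded as, which is useful later when class names have to be listed in encoded order.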

In [None]:
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split 

X = # age and maxhr form the features for our dataset (use a NumPy array, e.g. via .values, since to_3d later indexes it as x[:, 0])
y = # the target is the AHD value, but in label-encoded form
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1)
print("y value min and max are : {},{}".format('####')) # print the min and max of the target 

y value min and max are : 0,1

In [None]:
tr_clf = # get an instance of DecisionTreeClassifier by setting a max_depth of 2 and a random state of 12 

# train the decision tree and get predictions on the test set. Then calculate the accuracy.
# This takes at most 2 to 3 lines of code. Make sure that acc is a separate variable. You will be using the
# accuracy score as the metric

print("accuracy of the classifier on the test set {}".format(acc))

accuracy of the classifier on the test set 0.7333333333333333
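The fit, predict, and score pattern can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the data is randomly generated rather than taken from the Heart dataset, and the `toy_` names are hypothetical, so the accuracy printed will not match the value above.

```python
import numpy as np
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split

# Synthetic two-feature data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
toy_X = rng.uniform(0, 10, size=(200, 2))
toy_y = (toy_X[:, 0] > 5).astype(int)  # class depends on the first feature only

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.10, random_state=1)

# Shallow tree: fit on the training split, predict on the held-out split
toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=12)
toy_clf.fit(toy_X_train, toy_y_train)
toy_pred = toy_clf.predict(toy_X_test)

# accuracy_score is the fraction of test samples predicted correctly
toy_acc = metrics.accuracy_score(toy_y_test, toy_pred)
print("toy accuracy: {}".format(toy_acc))
```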


## Part 2: Plotting the decision surface
This part can be tricky. The goal of this section is to plot the decision surface. To do this you need to use NumPy's meshgrid function. We have written a simple function that takes the 1D ranges of the two features and converts them into 2D coordinate arrays that you can use for visualization.

The idea is to generate a grid of 2D points using the function `to_3d(x, y, plot_step)`, then run prediction on these points and plot the result as a contour graph. This is done using the function `plot_contour(xx, yy, Z)`. Note that Z must have the same shape as xx and yy for this to work. This is essentially a 3D plot, since x and y are the variables and Z is the predicted value. Instead of projecting Z into a third dimension, we treat it as a label and plot it as a color map.

Plots of this sort help visualize which region of the feature space the classifier assigns to each class.
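The grid-then-predict idea can be sketched on a tiny example before applying it to the lab data. This is a sketch under assumptions: the four training points, the coarse step of 0.5, and the names `grid_clf`, `gx`, `gy`, `grid_points`, and `grid_Z` are all hypothetical and chosen only to keep the shapes easy to follow.

```python
import numpy as np
from sklearn import tree

# Four hypothetical training points with two features each
toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_classes = np.array([0, 1, 0, 1])  # class follows the first feature
grid_clf = tree.DecisionTreeClassifier(max_depth=1, random_state=0)
grid_clf.fit(toy_points, toy_classes)

# Coarse grid: 5 values along the first axis, 3 along the second,
# so gx and gy both have shape (3, 5)
gx, gy = np.meshgrid(np.arange(-0.5, 2.0, 0.5), np.arange(-0.5, 1.0, 0.5))

# Flatten the grid into (n_points, 2) rows, predict one label per row,
# then restore the grid shape so the labels line up with gx and gy
grid_points = np.c_[gx.ravel(), gy.ravel()]
grid_Z = grid_clf.predict(grid_points).reshape(gx.shape)

print(grid_points.shape, grid_Z.shape)
```

Because `grid_Z` has the same shape as `gx` and `gy`, it can be passed directly to a contour-plotting function such as `plt.contourf`.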

In [None]:

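def to_3d(x, y, plot_step=0.01):
    # Build a 2D grid covering both feature columns of x, padded by 1 on each side.
    # y is unused; it is kept so the function can be called as to_3d(X, y).
    # Note: the default plot_step of 0.01 produces a very large grid.
    x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
    y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    return xx, yy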

def plot_contour(xx,yy,Z): 
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)
    return cs


In [None]:

xx, yy = to_3d(X,y)

Z = tr_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = # reshape Z to the same shape as xx

plt.figure(figsize=(10,10))
cmap= ListedColormap(colors)
_ = plot_contour(xx,yy, Z)
# you will need to add a line of code that plots the scatter plot for age vs maxhr.
# this is so that you can compare the regions of decision vs the ground truth data
plt.show()

<img src="../../../images/label_map_dt.png">

In [None]:
print("Z should be {}".format(Z.shape))

Z should be (13300, 5000)


In [None]:
def visualize_tree(sktree, features, classes, impurity = False, label = 'all', proportion = True):
    dot_data=StringIO()
    tree.export_graphviz(sktree
                         , feature_names=features
                         , class_names=classes
                         , filled=True
                         , rounded=True
                         , impurity = impurity
                         , label = label
                         , special_characters=True
                         , proportion = proportion
                         , out_file=dot_data)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    #  graph.write_pdf("tree.pdf") # Save to your current folder
    return(Image(graph.create_png()))


## Part 3: Visualizing the decision tree

In [None]:
# sklearn.externals.six was removed from recent scikit-learn releases; the standard library provides StringIO
from io import StringIO
import pydotplus
from IPython.display import Image

classes = ["No", "Yes"]  # class names of the AHD target in encoded order (assuming 0 = No, 1 = Yes)
features = # looking at the examples from the decision tree notebook. What should be the features? 
visualize_tree(tr_clf, features, classes)

<img src="../../../images/dt_diagram.png">
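If Graphviz or pydotplus is not available, a plain-text view of a fitted tree is one possible alternative. The sketch below assumes scikit-learn 0.21 or later, where `tree.export_text` is available; the toy data, the `demo_` names, and the placeholder feature names `f0` and `f1` are hypothetical and unrelated to the Heart dataset.

```python
import numpy as np
from sklearn import tree

# Hypothetical two-feature toy data; the feature names below are placeholders
demo_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
demo_y = np.array([0, 1, 0, 1])
demo_clf = tree.DecisionTreeClassifier(max_depth=1, random_state=0)
demo_clf.fit(demo_X, demo_y)

# export_text returns the split rules of the fitted tree as a plain string
demo_rules = tree.export_text(demo_clf, feature_names=["f0", "f1"])
print(demo_rules)
```

The text output lists one split condition per internal node and the predicted class at each leaf, which conveys the same structure as the diagram without any extra dependencies.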