### This is a simple notebook to build and visualize decision trees.

Author: Viviana Acquaviva

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/).
    
Some visualization-inspiration credits:

https://towardsdatascience.com/scikit-learn-decision-trees-explained-803f3812290d

https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176


In [None]:
import numpy as np

import matplotlib

import matplotlib.pyplot as plt

import matplotlib.patches as mpatches

import pandas as pd #to load data into a data frame

from sklearn.model_selection import train_test_split #we don't use it here, but it's a useful function!

from sklearn.tree import DecisionTreeClassifier #this is how we import the method (estimator) we want to use

from sklearn import metrics #this will give us access to evaluation metrics

In [None]:
font = {'size'   : 20}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=20) 
matplotlib.rc('ytick', labelsize=20) 
matplotlib.rcParams['figure.dpi'] = 300

In [None]:
#These packages are for visualization purposes only. This cell can be skipped if they are troublesome to install, but the pydotplus/graphviz tree figure below needs them.

from io import StringIO
from IPython.display import Image  
import pydotplus
from sklearn.tree import export_graphviz

### We use a selection of data from https://phl.upr.edu/projects/habitable-exoplanets-catalog

### We begin by reading in the data set using pandas.

In [None]:
LearningSet = pd.read_csv('HPLearningSet.csv')

In [None]:
pd.read_csv?

In [None]:
LearningSet

In [None]:
LearningSet = LearningSet.drop(LearningSet.columns[0], axis=1) #We want to drop the first column of the file

The structure we created is called a data frame.

It's nice because we can refer to columns with their names as well as indices, and it looks neat. 
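
As a minimal sketch of what "names as well as indices" means, the toy table below is hypothetical: its column names mimic the ones used in this notebook, but the values are made up and it is not the exoplanet data.

```python
import pandas as pd

# Hypothetical toy table; the values are made up for illustration only
toy = pd.DataFrame({
    'P_NAME': ['planet_a', 'planet_b', 'planet_c'],
    'S_MASS': [0.5, 1.0, 0.8],
    'P_HABITABLE': [1, 0, 1],
})

by_name = toy['S_MASS']        # select a column by its name
by_index = toy.iloc[:, 1]      # select the same column by its integer position
first_rows = toy.iloc[:2, :]   # first two rows, all columns
no_name = toy.drop(toy.columns[0], axis=1)  # drop the first column, as done above

print(by_name.equals(by_index))
print(list(no_name.columns))
```

Both selections return the same column; which one to use is mostly a matter of readability.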

In [None]:
LearningSet

### Let's pick the same train/test set we had in the slides.

Note the use of ".iloc" (integer location) to access indices in data frames.

In [None]:
TrainSet =  LearningSet.iloc[:13,:]  #normally this would happen at random, using the function train_test_split

TestSet = LearningSet.iloc[13:,:]
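
For comparison, here is a sketch of a fixed split by position next to a random split with `train_test_split`. The 18-row table is a hypothetical stand-in with made-up values, and the choice of `test_size=5` and `random_state=0` is only an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the learning set: 18 rows of made-up values
toy = pd.DataFrame({
    'S_MASS': [0.1 * i for i in range(1, 19)],
    'P_HABITABLE': [i % 2 for i in range(18)],
})

# Fixed split by position, in the style of the cell above
train_fixed = toy.iloc[:13, :]
test_fixed = toy.iloc[13:, :]

# Random split with the same sizes; random_state makes it reproducible
train_rand, test_rand = train_test_split(toy, test_size=5, random_state=0)

print(train_fixed.shape, test_fixed.shape)
print(train_rand.shape, test_rand.shape)
```

The random version keeps the original row labels in the index, so it is easy to check which objects ended up in each set.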

In [None]:
TrainSet

In [None]:
TestSet

### We split the train and test sets into features and labels.

In [None]:
Xtrain = TrainSet.drop(['P_NAME','P_HABITABLE'],axis=1)

Xtest = TestSet.drop(['P_NAME','P_HABITABLE'],axis=1)

In [None]:
Xtrain

In [None]:
ytrain = TrainSet.P_HABITABLE

ytest = TestSet.P_HABITABLE

In [None]:
ytrain

### And we are ready to fit the model with our decision tree!

Note: The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. 

To obtain a deterministic behaviour during fitting, random_state has to be fixed.
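
The effect of fixing `random_state` can be sketched on made-up data. The two feature columns below are deliberately identical, so a split on either one gives exactly the same improvement; this setup is an assumption chosen to produce ties, not something taken from the exoplanet table.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: two identical feature columns, so splits on either one tie exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([x, x])
y = np.array([0, 0, 0, 1, 1, 1])

# Fitting twice with the same random_state gives the same tree
tree_a = DecisionTreeClassifier(random_state=3).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=3).fit(X, y)
same = export_text(tree_a) == export_text(tree_b)
print(same)

# With different seeds, the feature chosen at the root may or may not change
used = {DecisionTreeClassifier(random_state=s).fit(X, y).tree_.feature[0] for s in range(20)}
print(used)
```

Whichever feature is picked, the predictions on this toy set are identical; only the description of the tree differs.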


In [None]:
model = DecisionTreeClassifier(random_state = 3) #This is how we specify which method we'd like to use, and any parameters.

model.fit(Xtrain, ytrain) #This tiny line is how we build models in sklearn.

### Finally, we can visualize the tree.

In [None]:
dot_data = StringIO()
export_graphviz(
            model,
            out_file =  dot_data,
            feature_names = ['Stellar Mass (M*)', 'Orbital Period (d)', 'Distance (AU)'],
            class_names = ['Not Habitable','Habitable'],
            filled = True,
            rounded = True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
nodes = graph.get_node_list()

for node in nodes:
    if node.get_label():
        #class counts at this node, parsed from the label text (float() also parses counts printed with a decimal point)
        values = [float(ii) for ii in node.get_label().split('value = [')[1].split(']')[0].split(',')]
        #rescale the counts so that they sum to (at most) 255
        values = [int(255 * v / sum(values)) for v in values]
            
        #the difference between the two classes sets the opacity: purer nodes are more opaque
        if values[0] > values[1]:
            alpha = int(values[0] - values[1])
            alpha = '{:02x}'.format(alpha) #turn into hexadecimal
            color = '#20B2AA'+str(alpha)
        else:
            alpha = int(values[1] - values[0])
            alpha = '{:02x}'.format(alpha)
            color = '#FF00FF'+str(alpha)
        node.set_fillcolor(color)

graph.set_dpi('300')

Image(graph.create_png())

#Image(graph.write_png('Graph.png'))
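
The coloring rule used in the loop above can be isolated in a small function, which makes it easier to see how class counts turn into a color string. This is only a sketch: the name `node_color` is hypothetical and is not part of pydotplus or sklearn.

```python
def node_color(n_not_habitable, n_habitable):
    """Return an '#RRGGBBAA' string for a node with the given class counts."""
    total = n_not_habitable + n_habitable
    # Rescale the counts so that they sum to (at most) 255
    scaled = [int(255 * n / total) for n in (n_not_habitable, n_habitable)]
    if scaled[0] > scaled[1]:
        base = '#20B2AA'   # teal: majority not habitable
    else:
        base = '#FF00FF'   # magenta: majority habitable (or a tie)
    # The opacity is the difference between the two rescaled counts
    alpha = abs(scaled[0] - scaled[1])
    return base + '{:02x}'.format(alpha)

print(node_color(8, 0))   # '#20B2AAff', a pure node is fully opaque
print(node_color(1, 3))   # '#FF00FF80'
print(node_color(2, 2))   # '#FF00FF00', an evenly mixed node is fully transparent
```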

### This is an alternative visualization, which only relies on the sklearn package.

In [None]:
from sklearn import tree

plt.figure(figsize=(40,20))  # customize according to the size of your tree

tree.plot_tree(model, feature_names = ['Stellar Mass (M*)', 'Orbital Period (d)', 'Distance (AU)'], class_names = ['Not Habitable','Habitable'])

plt.show()
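
If no plotting is needed at all, sklearn can also print a tree as plain text with `export_text`. The sketch below fits a tiny tree on made-up numbers (not the exoplanet data) and reuses the three feature names from the plots above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data with the same three feature names used in the plots above
X = [[0.5, 30.0, 0.2], [0.6, 40.0, 0.3], [1.1, 300.0, 1.0], [1.2, 400.0, 1.2]]
y = [1, 1, 0, 0]
names = ['Stellar Mass (M*)', 'Orbital Period (d)', 'Distance (AU)']

toy_model = DecisionTreeClassifier(random_state=3).fit(X, y)

# One line per node: the split condition for internal nodes, the predicted class for leaves
text = export_text(toy_model, feature_names=names)
print(text)
```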

### We can visualize the splits as well and then answer some questions.

In [None]:
plt.figure(figsize=(12,8))

cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#20B2AA','#FF00FF'])

#Will now plot the train set and test set points

plt.scatter(TrainSet['S_MASS'], TrainSet['P_PERIOD'], marker = '*',\
            c = TrainSet['P_HABITABLE'], s = 100, cmap=cmap, label = 'Train')

plt.scatter(TestSet['S_MASS'], TestSet['P_PERIOD'], marker = 'o',\
            c = TestSet['P_HABITABLE'], s = 100, cmap=cmap, label = 'Test')

plt.yscale('log')

plt.xlabel('Mass of Parent Star (Solar Mass Units)')

plt.ylabel('Period of Orbit (days)');

#I can add the splits to the plot

plt.axvline(x=0.83, linewidth =1, ls = '-', label = '1st split', c='k')

plt.axhline(y=4.891, xmin = 0, xmax = 0.655, linewidth =1, ls = '--', label = '2nd split',c='k')

plt.text(0.845, 10**3, '1st split', fontsize=14)
         
plt.text(0.65, 6, '2nd split', fontsize=14)

#Add legend, including unlabeled objects

bluepatch = mpatches.Patch(color='#20B2AA', label='Not Habitable')

magentapatch = mpatches.Patch(color='#FF00FF', label='Habitable')

plt.legend();

ax = plt.gca()

predhab = mpatches.Rectangle((0,4.891),0.83,ax.get_ylim()[1], 
                        fill = True,
                        color = '#FF00FF',
                        alpha = 0.3)

prednothab1 = mpatches.Rectangle((0.83,ax.get_ylim()[0]),ax.get_xlim()[1],ax.get_ylim()[1], 
                        fill = True,
                        color = '#20B2AA',
                        alpha = 0.3)

prednothab2 = mpatches.Rectangle((0,ax.get_ylim()[0]),0.83,4.891-ax.get_ylim()[0], 
                        fill = True,
                        color = '#20B2AA',
                        alpha = 0.3)

plt.gca().add_patch(predhab)
plt.gca().add_patch(prednothab1)
plt.gca().add_patch(prednothab2)

leg = ax.get_legend()
#the list of legend handles is called legend_handles in matplotlib >= 3.7 and legendHandles in older versions
handles = leg.legend_handles if hasattr(leg, 'legend_handles') else leg.legendHandles
handles[2].set_color('k') #handles 2 and 3 are the two split lines
handles[3].set_color('k')

plt.legend(handles=[handles[2], handles[3], magentapatch, bluepatch],\
           loc = 'upper left', fontsize = 14);


### Questions: 
    
- What is the accuracy (percentage of correct classifications) on the training set? 




- How about on the test set (you have to run the test examples through the tree, or look at the figure above)? 



In [None]:
#We want, of course, to be able to answer the questions in code as well.

ypred_train = ....

ypred_test = ....

In [None]:
metrics.accuracy_score(.... ) #test score, or comparison of
#real labels on test set with predicted labels on test set

In [None]:
metrics.accuracy_score(.... ) #train score, or comparison of
#real labels on train set with predicted labels on train set
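
As a reminder of how `predict` and `metrics.accuracy_score` fit together, here is a sketch on made-up, one-feature data that is unrelated to the exoplanet table. All the names ending in `_toy` are hypothetical.

```python
import numpy as np
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# Made-up one-feature data, unrelated to the exoplanet table
X_toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy_train = np.array([0, 0, 1, 1])
X_toy_test = np.array([[0.5], [2.2], [3.5], [1.5]])
y_toy_test = np.array([0, 1, 1, 0])

toy_model = DecisionTreeClassifier(random_state=3).fit(X_toy_train, y_toy_train)

# predict returns one label per row of the feature matrix
pred_train = toy_model.predict(X_toy_train)
pred_test = toy_model.predict(X_toy_test)

# Accuracy is the fraction of predictions that match the true labels
acc_train = metrics.accuracy_score(y_toy_train, pred_train)
acc_test = metrics.accuracy_score(y_toy_test, pred_test)
by_hand = np.mean(pred_test == y_toy_test)

print(acc_train, acc_test, by_hand)
```

The hand-computed fraction and `accuracy_score` agree, which is a quick sanity check that the true labels and the predictions were passed in the intended order.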

### Our final reflection will be an exercise in picking a different train/test split.

- Pick the first 5 objects for test, objects 5:18 for training;

- Build the train and test sets, for features and labels;

- Build a decision tree model on the new train set;

- Visualize the new tree;

- Calculate the train and test scores for this new model.

### Let's draw some conclusions together.

- Strengths of the decision tree (DT) algorithm?

- Limitations?

- Possible concerns?
