# Decision Trees

[<font color='#E8800A'>1 - Building a Decision Tree</font>](#first-bullet) <br>
[<font color='#E8800A'>2 - Avoiding Overfitting</font>](#first-bullet) <br>
[<font color='#E8800A'>3 - Feature importance with Decision Trees</font>](#first-bullet) <br>

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings('ignore')
import time
from sklearn.model_selection import KFold
import numpy as np
from sklearn import tree
import matplotlib.pyplot as plt

diabetes = pd.read_csv(r'diabetes.csv')
X = diabetes.iloc[:,:-1]
y = diabetes.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 15, stratify = y)

In [None]:
def avg_score(model):
    # apply kfold
    kf = KFold(n_splits=10)
    # create lists to store the results from the different models 
    score_train = []
    score_test = []
    timer = []
    for train_index, test_index in kf.split(X):
        # get the indexes of the observations assigned for each partition
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # start counting time
        begin = time.perf_counter()
        # fit the model to the data
        model.fit(X_train, y_train)
        # finish counting time
        end = time.perf_counter()
        # check the mean accuracy for the train
        value_train = model.score(X_train, y_train)
        # check the mean accuracy for the test
        value_test = model.score(X_test,y_test)
        # append the accuracies and the fit time to the corresponding lists
        score_train.append(value_train)
        score_test.append(value_test)
        timer.append(end-begin)
    # calculate the average and the std for each measure (accuracy and time)
    avg_time = round(np.mean(timer),3)
    avg_train = round(np.mean(score_train),3)
    avg_test = round(np.mean(score_test),3)
    std_time = round(np.std(timer),2)
    std_train = round(np.std(score_train),2)
    std_test = round(np.std(score_test),2)
    
    return str(avg_time) + '+/-' + str(std_time), str(avg_train) + '+/-' + str(std_train),\
str(avg_test) + '+/-' + str(std_test)

In [None]:
# in anaconda prompt: conda install python-graphviz
#!pip install pydotplus
from sklearn.tree import export_graphviz
import graphviz
import pydotplus

def plot_tree(model):
    dot_data = export_graphviz(model,
                               feature_names=X_train.columns,  
                               class_names=["No Diabetes", "Diabetes"],
                               filled=True)
    pydot_graph = pydotplus.graph_from_dot_data(dot_data)
    pydot_graph.set_size('"20,20"')
    return graphviz.Source(pydot_graph.to_string())

In [None]:
def show_results(df, *args):
    """
    Receive an empty dataframe and the different models and call the function avg_score
    """
    count = 0
    # for each model passed as argument
    for arg in args:
        # obtain the results provided by avg_score
        # (named fit_time to avoid shadowing the imported time module)
        fit_time, avg_train, avg_test = avg_score(arg)
        # store the results in the right row
        df.iloc[count] = fit_time, avg_train, avg_test
        count+=1
    return df

<div class="alert alert-block alert-info">
    

# 1. Building a Decision Tree
    
</div>

__`Step 1`__ - Create an instance of DecisionTreeClassifier with the default parameters and name it as __dt_gini__

In [None]:
dt_gini = DecisionTreeClassifier()

__`Step 2`__ - Fit your data to the model __dt_gini__ <br>

In [None]:
dt_gini.fit(X_train, y_train)

__`Step 3`__ __Predicted Values__ <br>
a) Check the predicted values for the test dataset using the method __predict()__ in your model<br>

In [None]:
y_pred = dt_gini.predict(X_test)
y_pred

b) Check the predicted class probabilities for the test dataset using the method __predict_proba()__ in your model<br>

In [None]:
y_pred_prob = dt_gini.predict_proba(X_test)
y_pred_prob

__`Step 4`__ Check the depth (__get_depth()__), the number of nodes (__.tree_.node_count__) and the number of leaves (__get_n_leaves()__) of the model __dt_gini__

In [None]:
print('The defined tree has a depth of ' + str(dt_gini.get_depth()) + ', ' + str(dt_gini.tree_.node_count) + 
      ' nodes and a total of ' + str(dt_gini.get_n_leaves()) + ' leaves.')

### <font color='#E8800A'>criterion | </font>  <font color='#3a7f8f'>Changing the split criteria</font> <a class="anchor" id="first-bullet"></a><br><br>`default = 'gini'`

- A decision tree chooses each split using an impurity measure, which quantifies how mixed the labels in a node are (a pure node contains a single class).
- There are two possibilities in sklearn:
    - Gini Index - The Gini impurity is the probability of misclassifying a randomly chosen sample from the node if it were labelled at random according to the node's class distribution. The node is split so that the resulting child nodes have the least weighted impurity.
    - Entropy - Information gain uses entropy as the impurity measure and splits a node so that it gives the largest information gain.

- In most cases, the choice of splitting criterion will not make much difference in the performance of the model. However, according to the "No Free Lunch" theorem, each criterion is superior in some cases and inferior in others.
- The main practical difference is that entropy might be a little slower to compute because it requires evaluating a logarithm.
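
To make the two impurity measures concrete, the sketch below computes them by hand for a few nodes. The helper names `gini_impurity` and `entropy` and the class counts are illustrative assumptions, not part of scikit-learn or of the diabetes data.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum(p_k ** 2)."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(class_counts):
    """Shannon entropy (base 2) of a node: sum(p_k * log2(1 / p_k))."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # empty classes contribute nothing (0 * log 0 is taken as 0)
    return float(np.sum(p * np.log2(1.0 / p)))

# hypothetical (No Diabetes, Diabetes) counts for three nodes
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, round(gini_impurity(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest for a perfectly mixed one, which is why they usually rank candidate splits in a similar way.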

__`Step 5`__ - Create an instance of DecisionTreeClassifier named as __dt_entropy__ and define the parameter __criterion='entropy'__, and fit the data to your model. Check the results.

In [None]:
dt_entropy = DecisionTreeClassifier(criterion = 'entropy').fit(X_train, y_train)

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['Gini','Entropy'])
show_results(df,dt_gini, dt_entropy)

<div class="alert alert-block alert-info">

# 2. Avoiding Overfitting
</div>

<div class="alert alert-block alert-success">

## Prepruning a tree

[2.1. - The splitter](#splitter) <br>
[2.2. - The maximum depth](#depth)<br>
[2.3. - The minimum number of samples required to split](#samples)<br>
[2.4. - The minimum samples in each leaf](#leaf)<br>
[2.5. - The minimum weight fraction in each leaf](#weight)<br>
[2.6. - The maximum number of features](#features)<br>
[2.7. - The maximum number of leaf nodes](#nodes)<br>
[2.8. - The minimum impurity decrease](#decrease)<br>

</div>

### <font color='#E8800A'>splitter| </font>  <font color='#3a7f8f'>Changing the splitter</font> <a class="anchor" id="first-bullet"></a><br><br>`default = 'best'`

If `splitter='random'`, the tree draws a random threshold for each candidate feature and keeps the best of these random splits, instead of searching for the optimal threshold.
- It is less computationally intensive than calculating the optimal split of every feature at every node.
- It should be less prone to overfitting.
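
The difference between the two strategies can be sketched on a single toy feature. This is a simplified illustration and not scikit-learn's internal implementation; the helpers `gini` and `weighted_gini` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Weighted Gini impurity of the two children produced by x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return gini(y)  # degenerate split: nothing is separated
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

rng = np.random.default_rng(0)
# toy feature in which class 1 tends to have larger values
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1, 50)])
y = np.array([0] * 50 + [1] * 50)

# 'best'-style search: evaluate every midpoint between consecutive sorted values
xs = np.sort(x)
candidates = (xs[:-1] + xs[1:]) / 2
best_threshold = min(candidates, key=lambda t: weighted_gini(x, y, t))

# 'random'-style search: draw a single threshold between the min and the max
random_threshold = rng.uniform(x.min(), x.max())

print('best  :', round(best_threshold, 3), round(weighted_gini(x, y, best_threshold), 3))
print('random:', round(random_threshold, 3), round(weighted_gini(x, y, random_threshold), 3))
```

The exhaustive search evaluates one candidate per pair of neighbouring values, while the random strategy evaluates a single threshold, which is where the saving in computation comes from.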

__`Step 6`__ - Create an instance of DecisionTreeClassifier named as __dt_random__ and define the parameter __splitter='random'__, and fit the data to your model. Check the results.

In [None]:
dt_random = DecisionTreeClassifier(splitter = 'random').fit(X_train, y_train)

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['best','random'])
show_results(df,dt_gini, dt_random)

### <font color='#E8800A'>max_depth | </font>  <font color='#3a7f8f'>Changing the maximum depth of a tree</font> <a class="anchor" id="first-bullet"></a><br><br>`default = None`



- If you don’t specify a depth for the tree, scikit-learn will expand the nodes until all leaves are pure (unless other parameters are defined)
- The deeper you allow your tree to grow, the more complex your model will be. 
- __`High Depth`__ - This increases the number of splits and captures more information about the data. However, it is one of the major causes of overfitting, since the model can fit the training data perfectly and then fail to generalize to the test data. 
- __`Low Depth`__ - This is one of the major causes of underfitting.
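
The trade-off can be observed by sweeping `max_depth` and comparing train and test accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so the numbers will differ from those for the diabetes dataset), and its variable names are chosen so that they do not overwrite the notebook's own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

depth_results = {}
for depth in [1, 2, 4, 8, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(Xd_train, yd_train)
    # store (actual depth, train accuracy, test accuracy)
    depth_results[depth] = (model.get_depth(),
                            model.score(Xd_train, yd_train),
                            model.score(Xd_test, yd_test))
    print(depth, depth_results[depth])
```

A widening gap between train and test accuracy as the depth grows is the usual sign of overfitting.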


__`Step 7`__ - Create an instance of DecisionTreeClassifier named as __dt_depth2__ and define the parameter __max_depth=2__, and fit the data to your model. Check the results.

In [None]:
dt_depth2 = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['full','depth2'])
show_results(df,dt_gini, dt_depth2)

__`Step 8`__ - Use the package graphviz to visualize the Decision Tree just created.

In [None]:
plot_tree(dt_depth2)

### <font color='#E8800A'>min_samples_split |</font>  <font color='#3a7f8f'>Changing the minimum number of samples required to split an internal node</font> <a class="anchor" id="first-bullet"></a><br><br>`default = 2`

- An internal node is a node that has been split further (a leaf, on the other hand, is a node without children)
- It is used to control overfitting
- __`High Values`__ - Prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
- __`Too high Values`__ - Can lead to underfitting
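
As a quick check of what the parameter does, the sketch below fits trees with several values of `min_samples_split` on synthetic stand-in data (an assumption) and inspects the fitted `tree_` structure: every node that was split must contain at least `min_samples_split` samples.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

split_check = {}
for mss in [2, 10, 50, 500]:
    model = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    model.fit(X_demo, y_demo)
    t = model.tree_
    internal = t.children_left != -1  # True for nodes that were split
    smallest = int(t.n_node_samples[internal].min()) if internal.any() else None
    # store (total number of nodes, size of the smallest node that was split)
    split_check[mss] = (t.node_count, smallest)
    print(mss, split_check[mss])
```

With a value larger than the number of training rows (500 here, against 200 rows), even the root cannot be split and the tree collapses to a single node, which is the underfitting extreme.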

__`Step 9`__ - Create an instance of DecisionTreeClassifier named as __dt_min10__ and define the parameter __min_samples_split=10__, and fit the data to your model.

In [None]:
dt_min10 = DecisionTreeClassifier(min_samples_split = 10).fit(X_train, y_train)

__`Step 10`__ - Create an instance of DecisionTreeClassifier named as __dt_min500__ and define the parameter __min_samples_split=500__, and fit the data to your model. Check the results for both models.

In [None]:
dt_min500 = DecisionTreeClassifier(min_samples_split = 500).fit(X_train, y_train)

df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['dt_min10','dt_min500'])
show_results(df, dt_min10, dt_min500)

__`Step 11`__ Plot the decision tree __dt_min500__

In [None]:
plot_tree(dt_min500)

### <font color='#E8800A'>min_samples_leaf |</font> <font color='#3a7f8f'>Changing the minimum number of samples required to be at a leaf node</font> <a class="anchor" id="first-bullet"></a><br><br>`default = 1`

- A leaf is a node without children
- It is used to control overfitting, by requiring each leaf to contain at least a given number of samples
- __`Small Values`__ - The tree is more likely to overfit
- __`Too high Values`__ - Can lead to underfitting
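
The effect on the leaves can be verified directly from the fitted `tree_` structure. The sketch below again relies on synthetic stand-in data (an assumption) and reports, for each setting, the number of leaves and the size of the smallest leaf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

leaf_check = {}
for msl in [1, 20, 200]:
    model = DecisionTreeClassifier(min_samples_leaf=msl, random_state=0)
    model.fit(X_demo, y_demo)
    t = model.tree_
    leaves = t.children_left == -1  # True for nodes without children
    # store (number of leaves, size of the smallest leaf)
    leaf_check[msl] = (int(leaves.sum()), int(t.n_node_samples[leaves].min()))
    print(msl, leaf_check[msl])
```

A split needs two children that each hold at least `min_samples_leaf` samples, so any value above half the number of training rows prevents every split and leaves only the root.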

__`Step 12`__ - Create an instance of DecisionTreeClassifier named as __dt_min_sam200__ and define the parameter __min_samples_leaf=200__, and fit the data to your model

In [None]:
dt_min_sam200 = DecisionTreeClassifier(min_samples_leaf = 200).fit(X_train, y_train)

__`Step 13`__ - Create an instance of DecisionTreeClassifier named as __dt_min_sam500__ and define the parameter __min_samples_leaf=500__, and fit the data to your model. Compare the results between the baseline model and the models created in steps 12 and 13

In [None]:
dt_min_sam500 = DecisionTreeClassifier(min_samples_leaf = 500).fit(X_train, y_train)

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['dt_min_sam1','dt_min_sam200','dt_min_sam500'])
show_results(df,dt_gini, dt_min_sam200, dt_min_sam500)

__`Step 14`__ Plot the decision tree __dt_min_sam200__

In [None]:
plot_tree(dt_min_sam200)

### <font color='#E8800A'>min_weight_fraction_leaf |</font> <font color='#3a7f8f'>Changing the minimum weighted fraction of samples required to be at a leaf node</font> <a class="anchor" id="first-bullet"></a><br><br>`default = 0.0`

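This is the minimum fraction of the total sample weight (summed over all input samples) that is required to be at a leaf node. The weights are given by sample_weight; when it is not provided, all samples have equal weight. Because the constraint is expressed in weights rather than raw counts, it is a way to deal with class imbalance. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights for each class to the same value.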
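
The sketch below illustrates the constraint with and without sample weights. The imbalanced synthetic data, the helper `leaf_weight_fractions` and the class-balancing weights are assumptions introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic, imbalanced stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

def leaf_weight_fractions(model):
    """Share of the total sample weight that ends up in each leaf."""
    t = model.tree_
    leaves = t.children_left == -1
    return t.weighted_n_node_samples[leaves] / t.weighted_n_node_samples[0]

# without sample_weight every row weighs the same
unweighted = DecisionTreeClassifier(min_weight_fraction_leaf=0.15, random_state=0)
unweighted.fit(X_demo, y_demo)

# hypothetical balancing weights: both classes get the same total weight
class_totals = np.bincount(y_demo)
sample_weight = 1.0 / class_totals[y_demo]
weighted = DecisionTreeClassifier(min_weight_fraction_leaf=0.15, random_state=0)
weighted.fit(X_demo, y_demo, sample_weight=sample_weight)

print('unweighted:', np.round(leaf_weight_fractions(unweighted), 3))
print('weighted  :', np.round(leaf_weight_fractions(weighted), 3))
```

In both cases every leaf holds at least 15% of the total weight, so the tree can have at most six leaves.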

__`Step 15`__ - Create an instance of DecisionTreeClassifier named as __dt_min_weight__ and define the parameter __min_weight_fraction_leaf=0.15__, and fit the data to your model. Compare the results with the baseline model.

In [None]:
dt_min_weight = DecisionTreeClassifier(min_weight_fraction_leaf = 0.15).fit(X_train, y_train)

df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['dt_gini','dt_min_weight'])
show_results(df,dt_gini, dt_min_weight)

__`Step 16`__ Plot the decision tree __dt_min_weight__

In [None]:
plot_tree(dt_min_weight)

### <font color='#E8800A'>max_features |</font> <font color='#3a7f8f'>Changing the number of features to consider when looking for the best split</font> <a class="anchor" id="first-bullet"></a><br><br>`default = None`

- It is computationally heavy to look at all the features at every split, so you can check only some of them using the various max_features options
- It also helps reduce overfitting - by choosing a reduced number of features, we can increase the stability of the tree and reduce variance

There are several options (let's imagine we are dealing with 32 variables): <br>
- `int` - The defined value is the maximum number of features considered at each split
    - A value of 10 will consider 10 features
- `float` - The defined value is multiplied by the number of features, and the result (rounded down) is the number of features considered at each split
    - A value of 0.5 will consider 16 features
- `sqrt` (named `auto` in older scikit-learn versions) - The number of features considered is equal to sqrt(total number of features), rounded down
    - 5 features will be considered (sqrt(32) is about 5.66)
- `log2` - The number of features considered is equal to log2(total number of features), rounded down
    - 5 features will be considered
- `None` - The number of features considered is equal to the total number of features
    - All 32 variables will be considered
    
The option to select depends on the number of features you have, the computational cost you want to reduce and the amount of overfitting you observe. If the computational cost is high or the model overfits heavily, you can start with “log2”. Depending on the result, you can either bring the number of features slightly up using “sqrt” or take it down further using a custom float value.
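
One way to see how each option is resolved is to fit a tree on data with 32 features and read the fitted attribute `max_features_`. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with 32 features (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=32,
                                     n_informative=8, random_state=0)

resolved = {}
for option in [10, 0.5, 'sqrt', 'log2', None]:
    model = DecisionTreeClassifier(max_features=option, random_state=0)
    model.fit(X_demo, y_demo)
    # max_features_ holds the number of features actually considered per split
    resolved[option] = model.max_features_
    print(repr(option), '->', model.max_features_)
```

Reading `max_features_` avoids guessing how fractional results are rounded.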

__`Step 17`__ - Create the following instances of a DecisionTreeClassifier:
- where __max_features = None__ and name it as __dt_none__ (The baseline model)
- where __max_features = 2__ and name it as __dt_int__
- where __max_features = 0.5__ and name it as __dt_float__
- where __max_features = 'sqrt'__ (the option named 'auto' in older scikit-learn versions) and name it as __dt_auto__
- where __max_features = 'log2'__ and name it as __dt_log2__


Check the results.

In [None]:
dt_none = DecisionTreeClassifier(max_features = None).fit(X_train, y_train)
dt_int = DecisionTreeClassifier(max_features = 2).fit(X_train, y_train)
dt_float = DecisionTreeClassifier(max_features = 0.5).fit(X_train, y_train)
# 'sqrt' is used because recent scikit-learn versions no longer accept 'auto'
dt_auto = DecisionTreeClassifier(max_features = 'sqrt').fit(X_train, y_train)
dt_log2 = DecisionTreeClassifier(max_features = 'log2').fit(X_train, y_train)

In [None]:
df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['None (Baseline)','Int','Float','Auto','Log2'])
show_results(df,dt_none, dt_int, dt_float, dt_auto, dt_log2)

__`Step 18`__ - Create the following instances of a DecisionTreeClassifier:
- where __max_features = 2__ and __max_depth = 2__ and name it as __dt_int2__
- where __max_features = 2__ and __max_depth = 2__ and name it as __dt_int3__

In [None]:
dt_int2 = DecisionTreeClassifier(max_features = 2, max_depth = 2).fit(X_train, y_train)
dt_int3 = DecisionTreeClassifier(max_features = 2, max_depth = 2).fit(X_train, y_train)

__`Step 19`__ Plot the decision tree __dt_int2__

In [None]:
plot_tree(dt_int2)

__`Step 20`__ Plot the decision tree __dt_int3__

In [None]:
plot_tree(dt_int3)

### <font color='#E8800A'>max_leaf_nodes |</font> <font color='#3a7f8f'>Define the maximum number of leaf nodes</font> <a class="anchor" id="first-bullet"></a><br><br>`default = None`

__`Step 21`__ - Create an instance of DecisionTreeClassifier named as __dt_maxleaf5__ and define the parameter __max_leaf_nodes=5__, and fit the data to your model. Compare the results with the baseline model.

In [None]:
dt_maxleaf5 = DecisionTreeClassifier(max_leaf_nodes = 5).fit(X_train, y_train)

df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['Baseline','dt_maxleaf5'])
show_results(df,dt_gini, dt_maxleaf5)

__`Step 22`__ Plot the decision tree __dt_maxleaf5__

In [None]:
plot_tree(dt_maxleaf5)

### <font color='#E8800A'>min_impurity_decrease |</font> <font color='#3a7f8f'>Decide if a node will be split according to the decrease of impurity</font> <a class="anchor" id="first-bullet"></a><br><br>`default = 0.0`

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
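
The quantity compared against this threshold is the weighted impurity decrease described in the scikit-learn documentation. A minimal sketch of that formula, with a hypothetical helper name and made-up node statistics, is:

```python
def weighted_impurity_decrease(n_total, n_node, n_left, n_right,
                               imp_node, imp_left, imp_right):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)."""
    return (n_node / n_total) * (
        imp_node
        - (n_right / n_node) * imp_right
        - (n_left / n_node) * imp_left
    )

# hypothetical node: 100 of the 400 training samples, Gini 0.5,
# split into a left child (60 samples, Gini 0.3) and a right child (40 samples, Gini 0.4)
decrease = weighted_impurity_decrease(400, 100, 60, 40, 0.5, 0.3, 0.4)
print(round(decrease, 4))
print('split allowed with min_impurity_decrease=0.02:', decrease >= 0.02)
```

Because the decrease is scaled by the share of samples reaching the node, the same local improvement counts for less deep in the tree, where nodes are small.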

__`Step 23`__ - Create an instance of DecisionTreeClassifier named as __dt_impurity02__ and define the parameter __min_impurity_decrease=0.02__, and fit the data to your model. Compare the results with the baseline model.

In [None]:
dt_impurity02 = DecisionTreeClassifier(min_impurity_decrease=0.02).fit(X_train, y_train)

df = pd.DataFrame(columns = ['Time','Train','Test'], index = ['Baseline','dt_impurity02'])
show_results(df,dt_gini, dt_impurity02)

__`Step 24`__ Plot the decision tree __dt_impurity02__

In [None]:
plot_tree(dt_impurity02)

<div class="alert alert-block alert-info">
    
# 3. Use a decision tree to evaluate feature importance
</div>

__`Step 25`__ Calculate the feature importance using the split criteria 'Gini' and 'Entropy'

In [None]:
gini_importance = DecisionTreeClassifier().fit(X_train, y_train).feature_importances_
entropy_importance = DecisionTreeClassifier(criterion='entropy').fit(X_train, y_train).feature_importances_

__`Step 26`__ Plot the feature importances for both criteria

In [None]:
import seaborn as sns

zippy = pd.DataFrame(zip(gini_importance, entropy_importance), columns = ['gini','entropy'])
zippy['col'] = X_train.columns
tidy = zippy.melt(id_vars='col').rename(columns=str.title)
tidy.sort_values(['Value'], ascending = False, inplace = True)

plt.figure(figsize=(15,8))
sns.barplot(y='Col', x='Value', hue='Variable', data=tidy)