In [None]:
import pandas as pd
import numpy as np

# Decision Trees

**Decision Tree** is a popular supervised learning algorithm known for its high interpretability. Decision Tree models are considered to have low computational cost in comparison to many other algorithms. However, Decision Trees are [non parametric](https://sebastianraschka.com/faq/docs/parametric_vs_nonparametric.html), which means the size of the model is not fixed in advance: the tree can keep growing as the dataset grows, so the [computational cost](https://www.thekerneltrip.com/machine/learning/computational-complexity-learning-algorithms/) of training increases with the number of samples and features (roughly in proportion to `n_features * n_samples * log(n_samples)` for a balanced tree).

## Pros & Cons

### Pros

- Easy to explain and visualize
- Irrelevant features will not be used by the model (built-in feature selection)
    - Decision Trees can be used to identify the predictive power of features with non-linear relationships! More [here](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598)
- Preprocessing needs are minimal
    - Scaling doesn't have a high impact on results.
    - Multicollinearity doesn't impact results.
    - Outliers have a low impact on results.
- Very fast prediction speed

### Cons

- High variance (easily overfits the training data)
- Training time grows faster than linearly as sample size increases.
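
To make the overfitting point concrete, here is a minimal sketch that compares an unconstrained tree with a depth-limited one. The synthetic dataset from `make_classification` and the variable names are assumptions chosen for illustration; they are not part of this lesson's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# An unconstrained tree keeps splitting until every training leaf is pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting the depth gives up training accuracy for a simpler model
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [('deep', deep_tree), ('shallow', shallow_tree)]:
    print(name, 'train:', model.score(X_tr, y_tr), 'test:', model.score(X_te, y_te))
```

A large gap between the training and testing scores of the deep tree is the usual sign of overfitting.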

### Pro + Con

- Decision Trees have a lot of tuning options (hyperparameters)
    - Pro
        - Models can be highly customizable and optimized!
    - Con
        - There is a learning curve to using these tuning options. 

## What is a decision tree?

A decision tree filters the data according to a series of `if` and `else` statements. 
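
As a rough sketch of that idea, a tiny hand-written "tree" might look like the function below. The umbrella scenario, the function name, and the 0.5 threshold are all hypothetical; they only illustrate how nested `if`/`else` checks lead to a prediction.

```python
def bring_umbrella(is_raining, chance_of_rain):
    """Hypothetical two-level decision tree written by hand."""
    # Root node: the first question asked of every observation
    if is_raining:
        return True
    # Internal node: only reached when it is not raining
    if chance_of_rain > 0.5:
        return True
    # Leaf node: every remaining observation ends up here
    return False


print(bring_umbrella(True, 0.0))
print(bring_umbrella(False, 0.8))
print(bring_umbrella(False, 0.2))
```

A fitted decision tree has the same shape; the difference is that the questions and thresholds are learned from data instead of being written by hand.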

<img src="simple_example.png" style="height:300px;">

Above is an extremely simple example of a decision tree. Let's look at an example that is a little more complicated.

**Say we have an unorganized drawer of pens and pencils. The pens and pencils are made with the following materials:**

<img src="writing_utensils.png" style="width:600px;">

In the cell below, we import data for this drawer:
> For the type column, 0 = pencil and 1 = pen

In [None]:
utensils = pd.read_csv('writing_utensils.csv')
utensils.head()

Whenever we are doing a classification problem, one of the first things we should check is the distribution of the classification column. We can do this by running `.value_counts()` on this column.

In [None]:
utensils.type.value_counts(normalize=True)

The data is very close to evenly split. That's a good thing!

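If we group by `type` and sum each column, we get the counts behind the joint probabilities for pens and pencils. Dividing a count by the total number of rows (shown in the cell after this one) gives the joint probability: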

In [None]:
utensils.groupby('type').sum()

In [None]:
utensils.shape[0]
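
To show how those counts turn into probabilities, here is a small sketch on a made-up stand-in for the drawer data. The rows in `toy` are invented for illustration, and the sketch assumes every feature column is coded as 0 or 1.

```python
import pandas as pd

# Made-up rows standing in for the drawer data (0 = pencil, 1 = pen)
toy = pd.DataFrame({
    'wood':    [1, 1, 0, 0, 0, 0],
    'has_cap': [0, 0, 1, 1, 0, 0],
    'metal':   [0, 0, 0, 1, 1, 0],
    'type':    [0, 0, 1, 1, 0, 1],
})

# Count of each feature within each class
counts = toy.groupby('type').sum()
# Joint probability, e.g. P(wood = 1 and type = 0)
joint = counts / toy.shape[0]
# Conditional probability, e.g. P(wood = 1 given type = 0)
conditional = toy.groupby('type').mean()

print(joint)
print(conditional)
```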

**Using the table above, what observations can we make about the pens and pencils in our drawer?**

1.   
2.   
3.  

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRjpSr7iFUUJLV8RrrLtBF3czpuhV4iKnAylzbhDLlrvYirUO6hmNxyEerJjC8uTyIbA8WpDN1ZvnHR/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="400" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

### Let's code out our decision tree!

In [None]:
# Your code here

**What observations did we classify incorrectly?**

In [None]:
# Your code here

**How can we do this with sklearn?**

In [None]:
from sklearn.tree import DecisionTreeClassifier

X = utensils.drop('type', axis=1)
y = utensils.type
dt = DecisionTreeClassifier()
dt.fit(X,y)
dt.score(X,y)

**Plotting an sklearn decision tree!**

In [None]:
from sklearn.tree import plot_tree
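# Passing a plain list avoids type errors in sklearn versions
# that do not accept a pandas Index for feature_names
plot_tree(dt, feature_names=list(X.columns), filled=True);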

## How does a decision tree make decisions?

There are two commonly used criteria (cost functions) that a decision tree can use to measure the quality of a split:
1. The Gini Index
2. Entropy

### Gini Index

$\text{Gini Index} = 1 - \sum_{j} p_{j}^{2}$

j = A class label

p = The proportion of observations in the split for the j class label

Let's turn this into a function!

In the cell below we create a `gini_index` function that:
1. Receives a pandas series
2. Calculates the proportions for each class in the series
3. Calculates and returns the gini index

In [None]:
# Your code here
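
One possible way to write this function is sketched below. It follows the three steps listed above and the formula in this section; treat it as one reasonable solution, not the only one.

```python
import pandas as pd


def gini_index(series):
    # Proportion of observations belonging to each class label
    proportions = series.value_counts(normalize=True)
    # 1 minus the sum of the squared proportions
    return 1 - (proportions ** 2).sum()


# An evenly mixed node is as impure as a two-class node can be
print(gini_index(pd.Series([0, 0, 1, 1])))
# A node with a single class is perfectly pure
print(gini_index(pd.Series([1, 1, 1, 1])))
```

For two classes the score ranges from 0 (pure) to 0.5 (evenly mixed).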

**Let's use our gini index function to calculate the gini score for each split of our decision tree!**

First we calculate the gini score for whether or not the utensil is made of wood. 

In [None]:
# Collect the index for observations 
# that are made of wood
wood_index = X[utensils.wood > .5].index

# Collect the index for observations 
# that are not made of wood
not_wood_index = X[utensils.wood <= .5].index

# Collect the labels for both groups
wood_labels = y.loc[wood_index]
not_wood_labels = y.loc[not_wood_index]

# Calculate the gini score for each!
wood_gini = gini_index(wood_labels)
not_wood_gini = gini_index(not_wood_labels)
print(f'Wood Gini: {wood_gini}')
print(f'Not Wood Gini: {not_wood_gini}')

Next, we take the observations that are not made of wood, and split them according to whether or not they have a cap. 

In [None]:
# Collect the index for observations 
# that are not made of wood and have a cap
mask = (X.loc[not_wood_index].has_cap > .5)
no_wood_with_cap_index = (X.loc[not_wood_index][mask]).index

# Collect the index for observations
# that are not made of wood and do not have a cap
mask = (X.loc[not_wood_index].has_cap <= .5)
no_wood_no_cap_index = (X.loc[not_wood_index][mask]).index

# Collect class labels for both groups
no_wood_with_cap_labels = y.loc[no_wood_with_cap_index]
no_wood_no_cap_labels = y.loc[no_wood_no_cap_index]

# Calculate the gini score for each!
no_wood_with_cap_gini = gini_index(no_wood_with_cap_labels)
no_wood_no_cap_gini = gini_index(no_wood_no_cap_labels)


print(f'No Wood With Cap Gini: {no_wood_with_cap_gini}')
print(f'No Wood No Cap Gini: {no_wood_no_cap_gini}')

Now, we take the observations that are not made of wood and do not have a cap, and we split them according to whether or not they are made of metal.

In [None]:
# Collect the index for observations that are not made
# of wood, do not have a cap, and are made of metal
mask = (X.loc[no_wood_no_cap_index].metal > .5)
no_wood_no_cap_metal_index = (X.loc[no_wood_no_cap_index][mask]).index

# Collect the index for observations that are not made
# of wood, do not have a cap, and are not made of metal
mask = (X.loc[no_wood_no_cap_index].metal <= .5)
no_wood_no_cap_no_metal_index = (X.loc[no_wood_no_cap_index][mask]).index

# Collect class labels for both groups
no_wood_no_cap_metal_labels = y.loc[no_wood_no_cap_metal_index]
no_wood_no_cap_no_metal_labels = y.loc[no_wood_no_cap_no_metal_index]

# Calculate the gini score for each!
no_wood_no_cap_metal_gini = gini_index(no_wood_no_cap_metal_labels)
no_wood_no_cap_no_metal_gini = gini_index(no_wood_no_cap_no_metal_labels)


print(f'No Wood No Cap Metal Gini: {no_wood_no_cap_metal_gini}')
print(f'No Wood No Cap No Metal Gini: {no_wood_no_cap_no_metal_gini}')

### How does a decision tree decide on splits?

For every split, the decision tree:
1. Loops over every column
2. Splits the data according to every unique value in the column
3. Calculates the gini score for every split
4. Returns the split that resulted in the best (lowest) gini score.

In [None]:
# For two classes, the maximum gini score is .5
# Any weighted score below .5 is an improvement,
# so we set the best score to .5 and check
# whether each calculated score is better
best_score = .5
# Variable to contain information about
# the best split of the data
best_split = None

# Loop over every column
for column in X.columns:
    # Find all the unique values in the column
    unique = utensils[column].unique()
    # Loop over every unique value
    for val in unique:
        # Split the data into two non-overlapping groups
        split_1 = utensils[utensils[column] <= val]
        split_2 = utensils[utensils[column] > val]
        # Skip thresholds that leave one side empty
        if len(split_1) == 0 or len(split_2) == 0:
            continue
        # Calculate the gini score for each split
        split_1_score = gini_index(split_1.type)
        split_2_score = gini_index(split_2.type)
        # Combine the gini scores, weighting each
        # split by its share of the observations
        weight_1 = len(split_1) / len(utensils)
        weight_2 = len(split_2) / len(utensils)
        score = weight_1 * split_1_score + weight_2 * split_2_score
        # If the score is less than the best score,
        # set the best score to the score we have
        # calculated and set the best split
        # to the name of the column and the value
        # we split on
        if score < best_score:
            best_score = score
            best_split = (column, val)

print(f'Best Score: {best_score}')
print(f'Best Split: {best_split}')

### Entropy

$\text{Entropy} = -\sum_{j} p_{j} \log_{2} p_{j}$

In [None]:
def entropy(series):
    proportions = series.value_counts(normalize=True)
    proportions = proportions * np.log2(proportions)
    proportions = proportions.sum()
    return -proportions
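
As a quick sanity check of the formula, the sketch below applies the same calculation to three small label series. The series are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up label series for illustration
examples = {
    'balanced': pd.Series([0, 0, 1, 1]),
    'skewed': pd.Series([0, 0, 0, 1]),
    'pure': pd.Series([1, 1, 1, 1]),
}

results = {}
for name, labels in examples.items():
    # Proportion of each class, then -sum(p * log2(p))
    p = labels.value_counts(normalize=True)
    results[name] = -(p * np.log2(p)).sum()
    print(name, round(results[name], 3))
```

For two classes, entropy is highest (1) when the classes are evenly mixed and falls to 0 when only one class is present.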

Let's rerun the split search from the Gini section, replacing `gini_index` with our `entropy` function.

In [None]:
# For two classes, the maximum score for entropy is 1
# Any weighted score below 1 is an improvement,
# so we set the best score to 1 and check
# whether each calculated score is better
best_score = 1
# Variable to contain information about
# the best split of the data
best_split = None

# Loop over every column
for column in X.columns:
    # Find all the unique values in the column
    unique = utensils[column].unique()
    # Loop over every unique value
    for val in unique:
        # Split the data into two non-overlapping groups
        split_1 = utensils[utensils[column] <= val]
        split_2 = utensils[utensils[column] > val]
        # Skip thresholds that leave one side empty
        if len(split_1) == 0 or len(split_2) == 0:
            continue
        # Calculate the entropy for each split
        split_1_score = entropy(split_1.type)
        split_2_score = entropy(split_2.type)
        # Combine the entropy scores, weighting each
        # split by its share of the observations
        weight_1 = len(split_1) / len(utensils)
        weight_2 = len(split_2) / len(utensils)
        score = weight_1 * split_1_score + weight_2 * split_2_score
        # If the score is less than the best score,
        # set the best score to the score we have
        # calculated and set the best split
        # to the name of the column and the value
        # we split on
        if score < best_score:
            best_score = score
            best_split = (column, val)

print(f'Best Score: {best_score}')
print(f'Best Split: {best_split}')

In [None]:
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X,y)
plot_tree(dt, feature_names=list(X.columns), filled=True);

What is the difference between gini and entropy?

Due to time, this question is slightly out of scope for this lesson. For a good explanation, see [this article](https://quantdare.com/decision-trees-gini-vs-entropy/).

The TL;DR is that entropy is slightly more computationally expensive than gini because it requires computing logarithms (so the model can take longer to fit), while in practice the two criteria usually produce very similar trees and accuracy.
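
To get a feel for how similar the two measures are, the sketch below evaluates both for a two-class node at a few class proportions. The helper names `gini_two_class` and `entropy_two_class` are made up for this illustration.

```python
import math


def gini_two_class(p):
    # p is the proportion of one class; 1 - p is the other
    return 1 - (p ** 2 + (1 - p) ** 2)


def entropy_two_class(p):
    # By convention 0 * log2(0) is treated as 0
    if p in (0, 1):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


for p in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(p, round(gini_two_class(p), 3), round(entropy_two_class(p), 3))
```

Both measures are 0 for a pure node and peak when the classes are evenly mixed, which is part of why they tend to rank candidate splits similarly.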

## Using Sklearn

In [None]:
df = pd.read_csv('https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/fa71405126017e6a37bea592440b4bee94bf7b9e/titanic.csv')

In [None]:
df.head()

In [None]:
# Import decision tree classifier

# Other imports:

In [None]:
# Select the features I'd like to use

# Drop null values according to the selected features

# Isolate the target

# Isolate the predictors

# Instantiate a label encoder

# Fit the label encoder to the categorical data

# Transform the categorical data


# Create a train test split


# Fit the decision tree
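
For reference, the steps outlined in the comments above could be sketched as follows. To keep the example self-contained it runs on a handful of made-up rows that borrow a few Titanic column names (`Survived`, `Pclass`, `Sex`, `Age`, `Fare`). The rows, the chosen features, and the `_toy` variable names are assumptions for illustration, not the lesson's solution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that mimic a few Titanic columns (illustration only)
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
    'Pclass':   [3, 1, 3, 3, 2, 1, 3, 2, 1, 3, 2, 1],
    'Sex':      ['male', 'female', 'female', 'male', 'male', 'female',
                 'male', 'female', 'female', 'male', 'male', 'female'],
    'Age':      [22, 38, 26, None, 35, 54, 2, 27, 14, 20, 39, 58],
    'Fare':     [7.25, 71.28, 7.92, 8.05, 13.0, 51.86,
                 21.08, 11.13, 30.07, 8.05, 13.0, 26.55],
})

# Select the features and drop rows with nulls in those columns
features = ['Pclass', 'Sex', 'Age', 'Fare']
toy = toy.dropna(subset=features + ['Survived'])

# Isolate the target and the predictors
y_toy = toy['Survived']
X_toy = toy[features].copy()

# Fit the label encoder to the categorical column and transform it
encoder = LabelEncoder()
X_toy['Sex'] = encoder.fit_transform(X_toy['Sex'])

# Create a train test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

# Fit the decision tree
toy_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(toy_tree.score(X_tr, y_tr), toy_tree.score(X_te, y_te))
```

`LabelEncoder` is used here because the comments above call for a label encoder. The sklearn documentation describes it as intended for target labels, and `OrdinalEncoder` is the feature-oriented alternative.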


Let's score the decision tree model on the training data

In [None]:
dt.score(X_train,y_train)

Let's score the decision tree model on testing data

In [None]:
dt.score(X_test,y_test)

Plot the decision tree

In [None]:
plot_tree(dt, feature_names=list(X.columns), filled=True);

### Tune the model!

In [None]:
second_model = DecisionTreeClassifier(max_depth=5, min_samples_split=10, max_leaf_nodes=100)

In [None]:
second_model.fit(X_train, y_train)
second_model.score(X_train,y_train)

In [None]:
second_model.score(X_test, y_test)

In [None]:
plot_tree(second_model, feature_names=list(X.columns), filled=True);

### Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid_params = {'min_samples_split' : [6,10,14],
               'max_depth': [2,4, 5],
               'max_leaf_nodes': [40,80,100]}

grid_search = GridSearchCV(dt, grid_params, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

In [None]:
grid_search.score(X_train, y_train)

In [None]:
grid_search.score(X_test, y_test)

In [None]:
grid_search.best_params_
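
Beyond `best_params_`, a fitted `GridSearchCV` also exposes `best_score_`, `best_estimator_`, and `cv_results_`. The sketch below shows one way to inspect them. It uses synthetic data from `make_classification` and a smaller parameter grid, both of which are assumptions made to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

params = {'max_depth': [2, 4], 'min_samples_split': [6, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      params, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# One row per parameter combination, ranked by mean test score
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))

# The tree refit on all of the data using the best parameters
best_tree = search.best_estimator_
```

Comparing the rows of `cv_results_` shows how much the score actually changes across the grid, which helps judge whether the chosen parameters matter or the differences are within noise.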