# Decision Trees

Decision trees need a criterion for choosing which feature and value to split on. Two popular criteria are the Gini index and entropy. Below you will code both.

In [1]:
# import libs
import numpy as np
import pandas as pd

# Part 1 (15 points)

## Entropy
The entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes.

Below is the formula for entropy

![entropy](images/entropy.png)

Here is an equivalent formula that spells out each term in more detail

![entropy2](images/entropy2.png)
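
In case the images do not render, here is the textbook definition of entropy written out, followed by a worked example with the counts used in the test cell below (7 positive, 3 negative). This is assumed to match the formulas in the images above.

$$
H = -\sum_{i} p_i \log_2 p_i
  = -\left(p_{+}\log_2 p_{+} + p_{-}\log_2 p_{-}\right),
\qquad p_{+} = \frac{7}{10},\quad p_{-} = \frac{3}{10}
$$

$$
H = -\left(0.7\log_2 0.7 + 0.3\log_2 0.3\right) \approx 0.3602 + 0.5211 = 0.8813
$$

Note that the formula uses class probabilities (count divided by total), not the raw counts.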

In [21]:
# implement entropy function
from math import log2


def entropy(num_positive_classes: int, num_negative_classes: int) -> float:
    ####################################STUDENT CODE########################################
    total = num_positive_classes + num_negative_classes
    entropy_value = 0.0
    for count in (num_positive_classes, num_negative_classes):
        if count > 0:  # an empty class contributes nothing (0 * log2(0) is treated as 0)
            probability = count / total  # the formula uses probabilities, not raw counts
            entropy_value -= probability * log2(probability)
    ######################################END CODE##########################################
    return entropy_value

In [22]:
# test entropy code
positive_class = 7
negative_class = 3
value = entropy(positive_class,negative_class)
np.testing.assert_approx_equal(value, 0.88129089)  # arguments are (actual, desired)
print('passed')

passed

## Gini Index
The Gini index (also called Gini impurity) measures how often a randomly chosen sample from a node would be misclassified if it were labeled at random according to the class distribution in that node. A node containing a single class has a Gini index of 0.

Below is the formula

![gini](images/gini.png)
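
In case the image does not render, here is the textbook definition of the Gini index, with a worked example using the counts from the test cell below (7 positive, 3 negative). This is assumed to match the formula in the image above.

$$
G = 1 - \sum_{i} p_i^2 = 1 - \left(p_{+}^2 + p_{-}^2\right)
$$

$$
G = 1 - \left(0.7^2 + 0.3^2\right) = 1 - (0.49 + 0.09) = 0.42
$$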


In [23]:
# implement gini index function
def gini_index(num_positive_classes: int, num_negative_classes: int) -> float:
    ####################################STUDENT CODE########################################
    total = num_positive_classes + num_negative_classes
    # convert the counts to class probabilities before squaring
    p_positive = num_positive_classes / total
    p_negative = num_negative_classes / total
    gini_value = 1 - (p_positive**2 + p_negative**2)
    ######################################END CODE##########################################
    return gini_value

In [16]:
# test gini index code
positive_class = 7
negative_class = 3
value = gini_index(positive_class,negative_class)
np.testing.assert_approx_equal(value, 0.4200000)  # arguments are (actual, desired)
print('passed')

passed

## Information Gain
Information gain measures how much a split reduces impurity: it is the impurity (for example, the entropy) of the parent node minus the weighted average impurity of the child nodes. You want to separate the nodes based on the features that give you the most information gain. In the picture below the left tree has more information gain because its classes are divided into separate child nodes. The child nodes of the tree on the left are more homogeneous and those of the tree on the right are more heterogeneous.

![tree](images/tree.png)

Information Gain is defined by

![info](images/infogain.png)
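
The definition can be sketched in a few lines of Python. This is a minimal illustration that assumes the entropy-based form of information gain (parent entropy minus the size-weighted entropy of the children). The helper names `node_entropy` and `information_gain` and the sample counts are made up for this example.

```python
from math import log2


def node_entropy(counts):
    """Entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(child) / total) * node_entropy(child) for child in children_counts
    )
    return node_entropy(parent_counts) - weighted_child_entropy


# Hypothetical parent node: 5 positive and 5 negative samples.
parent = [5, 5]
pure_split = [[5, 0], [0, 5]]   # each child holds a single class
mixed_split = [[3, 2], [2, 3]]  # each child is still mixed

gain_pure = information_gain(parent, pure_split)
gain_mixed = information_gain(parent, mixed_split)
print(gain_pure, gain_mixed)
```

The split into pure children recovers the full 1 bit of parent entropy, while the mixed split gains only about 0.03 bits, which is why a tree would prefer the first split.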

You have now seen how decision trees choose which values to split on. Next we will use a decision tree to make predictions on some data.

# Part 2 (15 points)

## Loan Data
The dataset we will be using is related to finances. A bank wants to know who it should give loans to. It does not want to lend money to customers who are likely to default on their loans. Using a decision tree algorithm, we will see if we can predict who will and won't default on their loans.

In [3]:
# load libraries needed
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

In [None]:
# read in the data and clean the data
df = None
path = 'loan_defualter.csv'
############################STUDENT CODE GOES HERE#########################
# read in csv

# there are three columns that are not needed
# remove the Surname, CustomerId, and RowNumber columns from the df

# There are two string columns: Geography and Gender.
# Computers cannot process strings easily and it is difficult to do math with strings.
# The two most common approaches to dealing with strings are One Hot Encoding and Label Encoding
# Use label encoding to convert these two columns to integer values
# use the sklearn LabelEncoder (imported above) for this

#################################END CODE##################################

assert(type(df) == pd.DataFrame)
assert(df['Gender'].max() == 1)
assert(df['Geography'].max() == 2)
df
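
As an illustration of label encoding (not the assignment solution), here is a minimal sketch on a made-up toy DataFrame. The rows and the name `toy_df` are assumptions for this example; only the column names come from the notebook. `LabelEncoder` assigns integers in sorted order of the class labels, so with three countries the largest code is 2 and with two genders the largest code is 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows standing in for the loan data (values are made up).
toy_df = pd.DataFrame(
    {
        "Geography": ["France", "Spain", "Germany", "France"],
        "Gender": ["Female", "Male", "Male", "Female"],
    }
)

# Fit a separate encoder per column so each column gets its own 0..n-1 codes.
for column in ["Geography", "Gender"]:
    encoder = LabelEncoder()
    toy_df[column] = encoder.fit_transform(toy_df[column])

print(toy_df)
```

Keep in mind that label encoding imposes an arbitrary ordering on the categories. Trees can usually cope with that, but distance-based models may be misled by it.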

### Split data into train and test sets
The data should be split into 80% train and 20% test. You should also shuffle the data. Use the sklearn function train_test_split for this.

In [None]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y= None,None,None,None
############################STUDENT CODE GOES HERE#########################
# split the data using train_test_split (set random_state arg to 42, train_size=0.8, test_size=0.2, shuffle=True)

#################################END CODE##################################
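
For reference, here is a minimal sketch of `train_test_split` on made-up arrays, using the same arguments the comment above asks for. The `toy_` names and the data are assumptions for this example; on the real data you would pass the feature columns and the label column of `df` instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, and alternating labels.
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.array([0, 1] * 5)

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_features,
    toy_labels,
    train_size=0.8,
    test_size=0.2,
    random_state=42,  # fixed seed so the shuffle is reproducible
    shuffle=True,
)

print(toy_train_x.shape, toy_test_x.shape)
```

With ten samples this gives eight training rows and two test rows.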

## Do Decision Trees need Scaled (Normalized) Data?

Remember that normalizing your data is very important for most ML algorithms. Trees are an exception in that the data does not have to be normalized. The reason is that a tree splits the data by comparing a feature against a threshold: each node asks whether feature x >= some value. Rescaling a feature changes the threshold but not which samples fall on each side of it. Normalizing won't hurt the tree algorithm, and you should get the same results either way. Here are some examples of why you want to normalize your data for most other types of algorithms.

Some examples of algorithms where feature scaling matters are:

* k-nearest neighbors with a Euclidean distance measure, if you want all features to contribute equally
* k-means (see k-nearest neighbors)
* logistic regression, SVMs, perceptrons, neural networks etc. if you are using gradient descent/ascent-based optimization, otherwise some weights will update much faster than others
* linear discriminant analysis, principal component analysis, kernel principal component analysis since you want to find directions of maximizing the variance (under the constraints that those directions/eigenvectors/principal components are orthogonal); you want to have features on the same scale since you’d emphasize variables on “larger measurement scales” more.
* Neural Networks because you want to optimize gradient descent and not introduce bias to the weights.

For now, since trees are the exception to this normalization rule, we will not normalize the data.
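
The claim that scaling does not change a tree's results can be checked with a small experiment. The sketch below uses made-up data (the features, the labeling rule, and all variable names are assumptions for this example). It trains one tree on raw features and one on min-max scaled features, then compares their predictions on the training samples.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Made-up features on very different scales: age in years, balance in dollars.
age = rng.integers(18, 80, size=200)
balance = rng.integers(0, 200, size=200) * 1000
X_raw = np.column_stack([age, balance]).astype(float)
# Made-up labeling rule, just so the tree has something to learn.
y = ((age > 45) & (balance > 50_000)).astype(int)

# Min-max scaling maps every feature to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_raw, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_scaled, y)

# Each tree predicts on the version of the data it was trained on.
same_predictions = np.array_equal(
    tree_raw.predict(X_raw), tree_scaled.predict(X_scaled)
)
print(same_predictions)
```

Because min-max scaling preserves the order of values within each feature, both trees partition the samples the same way and only the stored thresholds differ.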

In [None]:
# train the decision tree on the train data train_x and train_y
decision_tree = DecisionTreeClassifier(random_state=42)

############################STUDENT CODE GOES HERE#########################
# fit the tree to the data

#################################END CODE##################################

In [None]:
# plot the decision tree
# WARNING: This can take a while to run since this tree is large. took me around 1.2 minutes and I have good hardware
# The image is hard to see in the notebook but you can save it off to look at on your own.
# Wanted you to get a feel for how a real tree may look
############################STUDENT CODE GOES HERE#########################
# use the plot tree function that was imported a few cells above
# if you want a more viewable tree you can give the function another argument of max_depth=1

#################################END CODE##################################
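
If the plotted tree is too large to read, a text dump can be easier to inspect. The sketch below swaps in sklearn's `export_text` as a text alternative to `plot_tree`, fitted on a tiny made-up dataset (the data and the `toy_` names are assumptions for this example).

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: [age, balance]; label 1 marks a hypothetical defaulter.
toy_x = [[25, 1000], [30, 1500], [45, 20000], [50, 30000], [35, 800], [60, 50000]]
toy_y = [1, 1, 0, 0, 1, 0]

toy_tree = DecisionTreeClassifier(random_state=42)
toy_tree.fit(toy_x, toy_y)

# max_depth limits how many levels are printed, not how deep the tree was grown.
print(export_text(toy_tree, feature_names=["age", "balance"], max_depth=1))
```

Each printed line shows the feature and threshold tested at a node, which is the same information `plot_tree` draws in each box.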


## Evaluate the Decision Tree

You now have a decision tree that has been trained on the loan dataset. What we want to know now is if the model can be used on data outside of the training set. We want to see how the model generalizes to data it has not seen before. Common metrics for this are accuracy, precision, and recall. Compute these metrics below.

In [None]:
# the decision tree is trained so now we need to evaluate the model and see how it performs on the test set
from sklearn.metrics import accuracy_score, precision_score, recall_score

results_dict = {'Accuracy':0,'Precision':0,'Recall':0}
############################STUDENT CODE GOES HERE#########################
# use the decision tree to predict on the test data

# compare the truth labels with the predicted labels for accuracy, precision, and recall
# store the results in results_dict
#################################END CODE##################################

assert(results_dict['Accuracy'] > 0.7)
assert(results_dict['Precision'] > 0.38)
assert(results_dict['Recall'] > 0.45)
results_dict
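
To see what the three metrics measure, here is a minimal sketch on made-up labels (the lists and the `toy_` names are assumptions for this example). The predictions contain 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so accuracy is 6/8, precision is 3/4, and recall is 3/4.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 1 = defaulted, 0 = did not default.
toy_truth = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predictions = [1, 0, 0, 1, 0, 1, 1, 0]

toy_results = {
    # fraction of all predictions that are correct
    "Accuracy": accuracy_score(toy_truth, toy_predictions),
    # of the samples predicted positive, the fraction that really are positive
    "Precision": precision_score(toy_truth, toy_predictions),
    # of the truly positive samples, the fraction the model found
    "Recall": recall_score(toy_truth, toy_predictions),
}
print(toy_results)
```

For the loan problem, recall on the defaulter class matters a lot to the bank, since a missed defaulter is a loan that goes bad.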

You have now successfully trained and evaluated a decision tree. This is just a basic decision tree, and there are many things you can do to improve these metrics. One thing that helps is to get more information about how the model behaves as it trains. There are many ways to do this. Here we will visualize training with a learning curve.

## Visualize training a decision tree

Use the helper function plot_learning_curve to show how the tree performs during training.

In [None]:
import matplotlib.pyplot as plt  # needed for plt.show() at the end of this cell
from plot_helpers import plot_learning_curve
decision_tree = DecisionTreeClassifier(random_state=42)
############################STUDENT CODE GOES HERE#########################
# use the plot_learning_curve function to return a graph of your training
# this function will use cross validation for getting the validation metrics
# Hint: you can use the scoring argument and set it to a string to show a certain metric ("accuracy","precision","recall")
# plot at least one of the metrics
#################################END CODE##################################
plt.show()

You should notice that the training score is much higher than the validation score. This decision tree model is overfit to the training data, so it needs to generalize better. Decision trees are prone to overfitting, and reducing it is what you will work on in your own experiments. Congrats on training your decision tree and using some basic sklearn functions. This completes the notebook.
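
The gap between training and validation scores can also be seen as plain numbers. The sketch below does not use the course helper plot_learning_curve, whose signature is not shown here. It swaps in sklearn's own `learning_curve` function on a synthetic dataset (the data and variable names are assumptions for this example).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data, with 10% of the labels flipped as noise.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42),
    X,
    y,
    cv=5,  # 5-fold cross validation for the validation scores
    scoring="accuracy",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# Average over the folds and print one row per training-set size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for size, train, val in zip(train_sizes, mean_train, mean_val):
    print(f"{size:4d} samples: train={train:.2f}, validation={val:.2f}")
```

An unconstrained tree memorizes its training samples, so the training accuracy stays at 1.0 while the validation accuracy is lower. Arguments such as max_depth or min_samples_leaf on DecisionTreeClassifier are common starting points for narrowing that gap.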