## Decision Trees

### Data Exploration

In [6]:
import pandas as pd
import numpy as np
import random

# rename your dataset to car-price.csv
df = pd.read_csv('car-price.csv', delimiter=';')

# we will keep five categorical columns,
# we will use doornumber as our target
# and the others as our independent variables
df = df[['drivewheel','fueltype','aspiration','doornumber','carbody']]

df.sample(n=10)

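| | drivewheel | fueltype | aspiration | doornumber | carbody |
|---|---|---|---|---|---|
| 62 | fwd | gas | std | four | sedan |
| 130 | fwd | gas | std | four | wagon |
| 154 | 4wd | gas | std | four | wagon |
| 36 | fwd | gas | std | four | wagon |
| 175 | fwd | gas | std | four | hatchback |
| 102 | fwd | gas | std | four | wagon |
| 191 | fwd | gas | std | four | sedan |
| 103 | fwd | gas | std | four | sedan |
| 38 | fwd | gas | std | two | hatchback |
| 57 | rwd | gas | std | two | hatchback |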


### Calculating Gini Index

In [16]:
# from-scratch function that calculates the gini index
# for each column in a dataframe
# and prints out the best column to split on
import pandas as pd
import numpy as np

def gini_index(dataset, targetcol):
    
    # store all of our columns and gini scores
    gini_scores = []
    
    # iterate through each column in your dataframe
    for col in dataset.columns:
        
        # skip our target column
        # no information gain on target columns!
        # we can't split here
        if col == targetcol:
            continue
        
        # resets for each column in your dataset
        gini = 0
        
        # get the value counts for that column
        unique_values = dataset[col].value_counts()
        
        # iterate through each unique value for that column
        for key, val in unique_values.items():
        
            # get the target variable separated, based on
            # the independent variable
            filteredDf = dataset[targetcol][dataset[col] == key].value_counts()
            
            # need n for the length
            n = len(dataset)
            
            # sum of the value counts for that column
            ValueSum = filteredDf.sum()
            
            # running sum of the squared class probabilities
            p = 0
            
            # add the squared proportion of each target class in this branch
            for _, count in filteredDf.items():
                p += (count / ValueSum) ** 2
            
            # gini total for column 
            # is all uniques from each column
            gini += (val / n) * (1-p)

        print(f'Variable {col} has Gini Index of {round(gini,4)}\n')
        
        # append our column name and gini score
        gini_scores.append((col,gini))
    
    # pick the column with the lowest gini score
    split_pair = min(gini_scores, key=lambda x: x[1])
    
    # print out the best score
    print(f'''Split on {split_pair[0]} With Gini Index of {round(split_pair[1],3)}''')
    
    # return the (column, score) pair so the caller can use it
    return split_pair
        
        
final = gini_index(df, 'doornumber')

Variable drivewheel has Gini Index of 0.4865

Variable fueltype has Gini Index of 0.4745

Variable aspiration has Gini Index of 0.4921

Variable carbody has Gini Index of 0.2137

Split on carbody With Gini Index of 0.214
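
To make the weighted Gini calculation above easier to check by hand, here is a minimal sketch on a tiny made-up table. The helper name `weighted_gini` and the `toy` dataframe are assumptions for illustration only; they are not part of the notebook and the rows are not taken from car-price.csv.

```python
import pandas as pd

def weighted_gini(frame, feature, target):
    """Weighted Gini impurity of `target` after splitting on `feature`."""
    n = len(frame)
    total = 0.0
    for _, group in frame.groupby(feature):
        # class proportions of the target inside this branch
        probs = group[target].value_counts(normalize=True)
        # impurity of the branch: 1 minus the sum of squared proportions
        impurity = 1.0 - (probs ** 2).sum()
        # weight the branch by the share of rows it holds
        total += (len(group) / n) * impurity
    return total

# hypothetical toy data, invented for this example
toy = pd.DataFrame({
    'carbody':    ['sedan', 'sedan', 'hatchback', 'hatchback'],
    'fueltype':   ['gas', 'diesel', 'gas', 'diesel'],
    'doornumber': ['four', 'four', 'two', 'two'],
})

toy_gini_carbody = weighted_gini(toy, 'carbody', 'doornumber')
toy_gini_fueltype = weighted_gini(toy, 'fueltype', 'doornumber')
print(toy_gini_carbody, toy_gini_fueltype)
```

In this toy table `carbody` separates the door counts perfectly, so its weighted impurity is zero, while `fueltype` leaves every branch split 50/50 and scores 0.5.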


### Calculating Entropy

In [20]:
import numpy as np
import pandas as pd
import math

def entropy(dataset, targetcol):
    # store all of our columns and entropy scores
    entropy_scores = []
    
    # iterate through each column in your dataframe
    for col in dataset.columns:
        
        if col == targetcol:
            continue
        
        # get the value_counts normalized, saving us from having to iterate
        # through each unique value
        value_counts = dataset[col].value_counts(normalize=True, sort=False)
        
        # calculate the entropy (natural log, so in nats) of the column's
        # own value distribution; the target column is not used here
        entropy = -(value_counts * np.log(value_counts)).sum()
        
        print(f'Variable {col} has Entropy of {round(entropy,4)}\n')
        
        # append our column name and entropy score
        entropy_scores.append((col,entropy))
    
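    # pick the column with the lowest entropy score
    split_pair = min(entropy_scores, key=lambda x: x[1])
    
    # print out the best score
    # note: 1 - entropy is used here as a stand-in for information gain;
    # a true information gain would be H(target) - H(target | column)
    print(f'''Split on {split_pair[0]} With Information Gain of {round(1-split_pair[1],3)}''')
    
    # return the (column, score) pair so the caller can use it
    return split_pair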
        
final = entropy(df, 'carbody')

Variable drivewheel has Entropy of 0.8186

Variable fueltype has Entropy of 0.3197

Variable aspiration has Entropy of 0.4721

Variable doornumber has Entropy of 0.6857

Split on fueltype With Information Gain of 0.68
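
The function above scores each column by the entropy of its own values. For comparison, the textbook information gain subtracts the entropy of the target within each branch from the entropy of the target overall. The sketch below shows that variant on a tiny made-up table; `shannon_entropy`, `information_gain`, and the `toy` dataframe are hypothetical names and data introduced for illustration, and log base 2 is an assumption (the notebook uses the natural log).

```python
import numpy as np
import pandas as pd

def shannon_entropy(series):
    """Entropy in bits of the class distribution in `series`."""
    # value_counts only returns classes that occur, so no log(0) here
    probs = series.value_counts(normalize=True)
    return float((probs * np.log2(1.0 / probs)).sum())

def information_gain(frame, feature, target):
    """H(target) minus the weighted entropy of target within each branch."""
    n = len(frame)
    conditional = sum(
        (len(group) / n) * shannon_entropy(group[target])
        for _, group in frame.groupby(feature)
    )
    return shannon_entropy(frame[target]) - conditional

# hypothetical toy data, invented for this example
toy = pd.DataFrame({
    'carbody':    ['sedan', 'sedan', 'hatchback', 'hatchback'],
    'fueltype':   ['gas', 'diesel', 'gas', 'diesel'],
    'doornumber': ['four', 'four', 'two', 'two'],
})

ig_carbody = information_gain(toy, 'carbody', 'doornumber')
ig_fueltype = information_gain(toy, 'fueltype', 'doornumber')
print(ig_carbody, ig_fueltype)
```

Here the split with the highest gain is preferred: `carbody` removes all uncertainty about `doornumber` in the toy table, while `fueltype` removes none.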


_What are the key differences between these two metrics that help in determining how a feature should split the data to form homogeneous nodes (or leaves)?_

Gini score:
- measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution of the node.
- tends to be biased towards larger partitions.
- works well when the classes are imbalanced or when there is no distinct majority class.
- is less sensitive to outliers.
- nodes are split on the feature with the lowest weighted Gini impurity.

Entropy score:
- measures the average amount of information needed to identify the class of a sample.
- tends to create more balanced trees, and it can be sensitive to outliers.
- may be more suitable when there is a clear majority class in the dataset.
- nodes are split on the feature with the highest information gain, that is, the largest reduction in entropy.

_Which metric should be used in what scenarios?_

As mentioned above, Gini impurity is less sensitive to outliers, so it is better suited to datasets with an imbalanced class distribution or to more complex decision trees.

Entropy might be preferred when the class distribution in the dataset is relatively balanced. It may also be suitable when the goal is to create more balanced trees with smaller depths.
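
When weighing the two metrics, it can help to look at how each one scores the same two-class node. The sketch below is only an illustration: the function names `gini_impurity` and `entropy_bits` and the chosen probabilities are assumptions, not part of the notebook.

```python
import numpy as np

def gini_impurity(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2)

def entropy_bits(probs):
    """Shannon entropy in bits, treating 0 * log(0) as 0."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return np.sum(nonzero * np.log2(1.0 / nonzero))

# share of the majority class in a two-class node
for p in (0.5, 0.7, 0.9, 1.0):
    dist = [p, 1 - p]
    print(p, round(gini_impurity(dist), 3), round(entropy_bits(dist), 3))
```

Both scores are largest for a 50/50 node and fall to zero for a pure node, so they rank most candidate splits in the same order. The main difference is scale and the shape of the curve between those two extremes.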

_Which metric is computationally intensive?_

In [19]:
from datetime import datetime

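# Gini Impurity
# the timer starts before the function definition, so the measured
# duration covers the definition, the call, and the printing
start_time_gini = datetime.now()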
def gini_index(dataset, targetcol):
    
    # store all of our columns and gini scores
    gini_scores = []
    
    # iterate through each column in your dataframe
    for col in dataset.columns:
        
        # skip our target column
        # no information gain on target columns!
        # we can't split here
        if col == targetcol:
            continue
        
        # resets for each column in your dataset
        gini = 0
        
        # get the value counts for that column
        unique_values = dataset[col].value_counts()
        
        # iterate through each unique value for that column
        for key, val in unique_values.items():
        
            # get the target variable separated, based on
            # the independent variable
            filteredDf = dataset[targetcol][dataset[col] == key].value_counts()
            
            # need n for the length
            n = len(dataset)
            
            # sum of the value counts for that column
            ValueSum = filteredDf.sum()
            
            # need the probabilities of each class
            p = 0
            
            # we now have to send it to our gini impurity formula
            for i,j in filteredDf.items():
                p += (filteredDf[i] / ValueSum) ** 2
            
            # gini total for column 
            # is all uniques from each column
            gini += (val / n) * (1-p)

        
        # append our column name and gini score
        gini_scores.append((col,gini))
    
    # pick the column with the lowest gini score
    split_pair = min(gini_scores, key=lambda x: x[1])
    
    # print out the best score
    print(f'''Split on {split_pair[0]} With Gini Index of {round(split_pair[1],3)}''')

final = gini_index(df, 'doornumber')
end_time_gini = datetime.now()

print('Duration of Gini Index: {}'.format(end_time_gini - start_time_gini))

Split on carbody With Gini Index of 0.214
Duration of Gini Index: 0:00:00.017879


In [23]:
from datetime import datetime

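# Entropy
# the timer starts before the function definition, so the measured
# duration covers the definition, the call, and the printing
start_time_entropy = datetime.now()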
def entropy(dataset, targetcol):
    # store all of our columns and entropy scores
    entropy_scores = []
    
    # iterate through each column in your dataframe
    for col in dataset.columns:
        
        if col == targetcol:
            continue
        
        # get the value_counts normalized, saving us from having to iterate
        # through each unique value
        value_counts = dataset[col].value_counts(normalize=True, sort=False)
        
        # calculate the entropy (natural log, so in nats) of the column's
        # own value distribution; the target column is not used here
        entropy = -(value_counts * np.log(value_counts)).sum()
        
        # append our column name and entropy score
        entropy_scores.append((col,entropy))
    
    # pick the column with the lowest entropy score
    split_pair = min(entropy_scores, key=lambda x: x[1])
    
    # print out the best score
    print(f'''Split on {split_pair[0]} With Information Gain of {round(1-split_pair[1],3)}''')
        
final = entropy(df, 'carbody')
end_time_entropy = datetime.now()

print('Duration of Entropy: {}'.format(end_time_entropy - start_time_entropy))

Split on fueltype With Information Gain of 0.68
Duration of Entropy: 0:00:00.006418


In [26]:
gini_index_time = end_time_gini - start_time_gini
entropy_time = end_time_entropy - start_time_entropy
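# the faster method is the one with the smaller duration, so use min()
faster_method = min((gini_index_time, "Gini Index"), (entropy_time, "Entropy"))

print(f'The faster method is: {faster_method[1]} with time {faster_method[0]}')

The faster method is: Entropy with time 0:00:00.006418


In this run the entropy function finished faster (about 6.4 ms versus 17.9 ms for the Gini function), which is the opposite of what the formulas alone would suggest. The likely reason is the implementation rather than the metric: gini_index loops over every unique value in Python and filters the dataframe each time, while entropy makes a single vectorized value_counts call per column and never touches the target. Computed on the same inputs, Gini impurity is generally expected to be slightly less intensive because it does not involve logarithmic functions.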
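
A single `datetime` measurement that also covers the function definition and the printing is noisy, so a more controlled comparison may be useful. The sketch below times only the two formulas on the same probability vector with the standard-library `timeit` module. The vector size, the random seed, and the function names are arbitrary assumptions for illustration, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

# hypothetical class distribution: 1000 classes with random positive counts
rng = np.random.default_rng(0)
counts = rng.integers(1, 100, size=1000)
probs = counts / counts.sum()

def gini_impurity(p):
    # 1 minus the sum of squared probabilities, no logarithm needed
    return 1.0 - np.sum(p ** 2)

def entropy_nats(p):
    # Shannon entropy with the natural log; all entries of p are positive
    return -np.sum(p * np.log(p))

# take the best of 5 repeats, each repeat running the formula 1000 times
gini_seconds = min(timeit.repeat(lambda: gini_impurity(probs), number=1000, repeat=5))
entropy_seconds = min(timeit.repeat(lambda: entropy_nats(probs), number=1000, repeat=5))

print(f'Gini: {gini_seconds:.6f}s, Entropy: {entropy_seconds:.6f}s per 1000 calls')
```

Taking the minimum over several repeats reduces the effect of background activity on the machine, which matters when the durations are only a few milliseconds.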