# Quiz for Maximizing Information Gain

For the following quiz, consider the data in `ml-bugs.csv`, which records the species (Mobug or Lobug), color, and length in millimeters of twenty-four made-up insects.

Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?
* Color = Brown
* Color = Blue
* Color = Green
* Length < 17
* Length < 20
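
For reference, the quantities being compared can be written out as follows. This is the standard textbook formulation, and the symbols are notation assumed here rather than taken from the quiz: $S$ is the full set of insects, $p_c$ is the fraction of $S$ belonging to species $c$, and $S_A$, $S_{\bar A}$ are the subsets where the criterion $A$ is true and false.

$$
H(S) = -\sum_{c} p_c \log_2 p_c
$$

$$
\mathrm{IG}(S, A) = H(S) - \left( \frac{|S_A|}{|S|}\, H(S_A) + \frac{|S_{\bar A}|}{|S|}\, H(S_{\bar A}) \right)
$$

The criterion with the largest $\mathrm{IG}$ is the one that leaves the two resulting groups, on average, closest to pure.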


In [107]:
import numpy as np
import math
import pandas as pd

df = pd.read_csv('ml-bugs.csv')

def probability(name, items):
    """Return the fraction of entries in `items` equal to `name`."""
    items = np.asarray(items)
    return np.count_nonzero(items == name) / items.size

def entropy(items):
    """Return the Shannon entropy, in bits, of the labels in `items`."""
    result = 0
    for name in np.unique(items):
        p = probability(name, items)
        result -= p * math.log2(p)
    return result

def two_group_entropy(df, criteria):
    """Return the size-weighted average entropy of the two groups obtained
    by splitting `df` on `criteria`, a boolean Series aligned with `df`."""
    column = df.iloc[:, 0]  # assumes the class label (Species) is the first column
    total = column.count()
    first = column[criteria].count()
    return first / total * entropy(column[criteria]) + \
        (total - first) / total * entropy(column[~criteria])
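
As a sanity check on the idea behind these functions, the sketch below recomputes information gain in plain Python on a tiny, hypothetical label list (not taken from `ml-bugs.csv`). The names `label_entropy` and `split_gain` are illustrative only and are not part of the notebook. A split that separates the two species perfectly should gain a full bit, and a split that leaves each group evenly mixed should gain nothing.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def split_gain(labels, mask):
    """Information gain from splitting `labels` by the boolean list `mask`."""
    inside = [lab for lab, m in zip(labels, mask) if m]
    outside = [lab for lab, m in zip(labels, mask) if not m]
    # Weight each non-empty group's entropy by its share of the data.
    weighted = sum(len(part) / len(labels) * label_entropy(part)
                   for part in (inside, outside) if part)
    return label_entropy(labels) - weighted

# Hypothetical toy labels, chosen only to make the arithmetic easy to follow.
toy = ['Mobug', 'Mobug', 'Lobug', 'Lobug']
perfect = [True, True, False, False]   # separates the species exactly
useless = [True, False, True, False]   # each group stays a 50/50 mix

print(split_gain(toy, perfect))  # 1.0
print(split_gain(toy, useless))  # 0.0
```

Real criteria such as the ones in this quiz fall between these two extremes, which is why their gains are small positive numbers.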

In [108]:
base_entropy = entropy(df['Species'])  # entropy of the labels before any split

# Each candidate criterion as a boolean mask over the rows of df.
splits = {
    'Color = Brown': df['Color'] == 'Brown',
    'Color = Blue': df['Color'] == 'Blue',
    'Color = Green': df['Color'] == 'Green',
    'Length < 17': df['Length (mm)'] < 17,
    'Length < 20': df['Length (mm)'] < 20,
}
result = pd.DataFrame(
    [[name, base_entropy - two_group_entropy(df, mask)] for name, mask in splits.items()],
    columns=['Criteria', 'Information Gain'])
result

|   | Criteria | Information Gain |
|---|---|---|
| 0 | Color = Brown | 0.061573 |
| 1 | Color = Blue | 0.00059 |
| 2 | Color = Green | 0.042776 |
| 3 | Length < 17 | 0.112607 |
| 4 | Length < 20 | 0.100733 |


In [109]:
best_gain = result['Information Gain'].max()
best_criteria = result[result['Information Gain'] == best_gain]
print('Best criteria for discriminating Mobugs from Lobugs')
best_criteria

Best criteria for discriminating Mobugs from Lobugs


|   | Criteria | Information Gain |
|---|---|---|
| 3 | Length < 17 | 0.112607 |
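
An alternative way to pick the winner, sketched here under the assumption that only the single best row is wanted, is `idxmax`, which returns the index label of the first maximum. To stay self-contained, the snippet rebuilds the results from the values reported above instead of reading `ml-bugs.csv`; the names `gains`, `best_row`, and `ranked` are illustrative.

```python
import pandas as pd

# Information-gain values copied from the results table above.
gains = pd.DataFrame({
    'Criteria': ['Color = Brown', 'Color = Blue', 'Color = Green',
                 'Length < 17', 'Length < 20'],
    'Information Gain': [0.061573, 0.00059, 0.042776, 0.112607, 0.100733],
})

# idxmax gives the index label of the row with the largest gain.
best_row = gains.loc[gains['Information Gain'].idxmax()]
print(best_row['Criteria'])  # Length < 17

# Sorting shows how close the runner-up criteria are.
ranked = gains.sort_values('Information Gain', ascending=False)
print(ranked)
```

Unlike the equality filter used in the notebook, `idxmax` returns only one row when several criteria tie for the maximum, so the filter is the safer choice if ties matter.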
