This lab walks through using information gain to decide which feature to split a decision tree node on.
The dataset is for determining whether an image is of a cat or a dog, using three binary features:

Ear shape: Pointy = 1; Floppy = 0;

Face shape: Round = 1; Not Round = 0;

Whiskers: Present = 1; Absent = 0;

The output (the target label) is binary: Cat = 1; Dog = 0;

In [5]:
# Standard initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from utils import *

In [6]:
# Dataset
X_train = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 0]
])

y_train = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

In [8]:
X_train[0] # Pointy ears, round face, and whiskers

array([1, 1, 1])

From the notes:

$\text{Information Gain} = H(p_1^{node}) - \left(w^{left} \cdot H(p_1^{left}) + w^{right} \cdot H(p_1^{right})\right)$

Where $H$ is the entropy, defined as:

$H(p_1) = -p_1 \cdot \log_2(p_1) - (1 - p_1) \cdot \log_2(1 - p_1)$

$ $

Note that the logarithm is base 2.

$ $

At each node, compute the information gain for every feature by comparing the entropy of the node with the **weighted** entropy of the two nodes the split would produce, then split the node on the feature with the highest information gain.

$ $

The whole dataset has 5 cats and 5 dogs in it, so the root node, which holds every example, has:

$p_1^{node} = \frac{5}{10} = 0.5$

In [9]:
def entropy(p):
    # Compute the entropy of a group from the fraction p (between 0 and 1) of positive cases in it.
    if p == 0 or p == 1:
        return 0 # Pure set either way, no entropy.
    else:
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

In [11]:
print(entropy(0.5)) # The least pure possible

1.0
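
To get a feel for how entropy behaves away from 0.5, here is a small sketch that evaluates the same formula at a few other values. The name `entropy_sketch` and the chosen values of `p` are just for illustration; the function repeats the definition above so the cell runs on its own.

```python
import numpy as np

def entropy_sketch(p):
    """Binary entropy in bits; taken to be 0 for a pure group (p == 0 or p == 1)."""
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Entropy is symmetric around p = 0.5 and peaks there.
for p in [0.0, 0.2, 0.5, 0.8, 1.0]:
    print(f"p = {p:.1f} -> H = {entropy_sketch(p):.3f}")
```

A group that is 20% cats is exactly as impure as one that is 80% cats, which is why only the mix matters and not which class is in the majority.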


So now compute the information gain if we split the initial node on each of the available features...

In [12]:
def split_indices(X, index_feature):
    """Given a dataset, X, and an index feature, return two lists for the split nodes.
    The left node is the positive case, with the feature equal to 1.
    The right node is the negative case, with the feature equal to 0.
    From above:
    index_feature 0 -> ear shape
    index_feature 1 -> face shape
    index_feature 2 -> whiskers
    """
    
    left_indices = []
    right_indices = []
    for i, x in enumerate(X):
        if x[index_feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices


In [14]:
split_indices(X_train, 0) # Ear shape pointy, ear shape floppy.

([0, 3, 4, 5, 7], [1, 2, 6, 8, 9])
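
The loop above can also be written with NumPy indexing. This is only an alternative sketch, not part of the original lab: `split_indices_np` and `X_demo` are names made up here, and the function returns NumPy arrays rather than lists.

```python
import numpy as np

# Same ten examples as X_train above, repeated so this cell runs on its own.
X_demo = np.array([
    [1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
    [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0],
])

def split_indices_np(X, index_feature):
    """Return (left, right) row indices: left where the feature is 1, right otherwise."""
    column = X[:, index_feature]
    left = np.flatnonzero(column == 1)   # positions where the feature equals 1
    right = np.flatnonzero(column != 1)  # every other position
    return left, right

left, right = split_indices_np(X_demo, 0)
print(left.tolist(), right.tolist())
```

For feature 0 (ear shape) this gives the same two groups as the loop version.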

To get the weighted entropy of a split, we need $w^{left}$ and $w^{right}$ (the proportion of the parent node's examples that land in each split node) as well as $p_1^{left}$ and $p_1^{right}$ (the proportion of positive examples within each split node).

$ $

So for example, using the above results,

$w^{left} = \frac{5 \text{ (length of left array)}}{10 \text{ (length of parent node)}}$ and

$p_1^{left} = \frac{4 \text{ (number in left array that are positive)}}{5 \text{ (length of left array)}}$
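
Carrying the same ear shape example through by hand gives a number to check the code against. The right-hand values come from reading `y_train` at the right indices `[1, 2, 6, 8, 9]`, where only one example is a cat:

$w^{right} = \frac{5}{10} = 0.5$ and $p_1^{right} = \frac{1}{5} = 0.2$

$H(0.8) = -0.8 \cdot \log_2(0.8) - 0.2 \cdot \log_2(0.2) \approx 0.722$, and $H(0.2)$ has the same value by symmetry.

$w^{left} \cdot H(p_1^{left}) + w^{right} \cdot H(p_1^{right}) \approx 0.5 \cdot 0.722 + 0.5 \cdot 0.722 = 0.722$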

In [34]:
def weighted_entropy(X, y, left_indices, right_indices):
    """Given a dataset and the split dataset on some feature,
    return the weighted entropy of that split."""
    # Only len(X) is used from X here; len(y) would give the same weights.
    w_left = len(left_indices) / len(X)
    w_right = len(right_indices) / len(X)
    # An empty branch has weight 0, so treat its p as 0 rather than dividing by zero.
    p_left = sum(y[left_indices]) / len(left_indices) if len(left_indices) > 0 else 0
    p_right = sum(y[right_indices]) / len(right_indices) if len(right_indices) > 0 else 0

    return w_left * entropy(p_left) + w_right * entropy(p_right)

In [37]:
left_indices, right_indices = split_indices(X_train, 0)
WE = weighted_entropy(X_train, y_train, left_indices, right_indices)
print(WE) # Expect ~0.72

0.7219280948873623


So the information gain is the entropy of the parent node minus the weighted entropy of the split.

In [39]:
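def information_gain(X, y, left_indices, right_indices):
    """Return the entropy of the parent node minus the weighted entropy of the split."""
    p_node = sum(y) / len(y)  # Proportion of positive examples in the parent node
    h_node = entropy(p_node)
    w_entropy = weighted_entropy(X, y, left_indices, right_indices)

    return h_node - w_entropy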

In [40]:
information_gain(X_train, y_train, left_indices, right_indices) # Expect ~0.28

0.2780719051126377

Then do the same for each of the other features to find the highest information gain!

In [42]:
for i, feature_name in enumerate(['Ear Shape', 'Face Shape', 'Whiskers']):
    left_indices, right_indices = split_indices(X_train, i)
    i_gain = information_gain(X_train, y_train, left_indices, right_indices)
    print(f"Feature: {feature_name}\nInformation Gain on split: {i_gain:.2f}\n")

Feature: Ear Shape
Information Gain on split: 0.28

Feature: Face Shape
Information Gain on split: 0.03

Feature: Whiskers
Information Gain on split: 0.12



Based on this, ear shape is the best feature to split on, as we gain the most information from it.
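
The comparison can also be done in code rather than by reading the printed values. The following is a rough, self-contained sketch under a few assumptions: `entropy_bits`, `gain_for_feature`, `best_feature`, `X_demo` and `y_demo` are names invented for this cell, and the split is done with a boolean mask instead of the index lists used above.

```python
import numpy as np

# Same data as X_train / y_train above, repeated so this cell runs on its own.
X_demo = np.array([
    [1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
    [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0],
])
y_demo = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

def entropy_bits(p):
    """Binary entropy in bits; 0 for a pure group."""
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gain_for_feature(X, y, index_feature):
    """Information gain from splitting on one binary feature."""
    mask = X[:, index_feature] == 1
    h_node = entropy_bits(y.mean())
    weighted = 0.0
    for branch in (y[mask], y[~mask]):
        if len(branch) > 0:  # an empty branch contributes nothing
            weighted += len(branch) / len(y) * entropy_bits(branch.mean())
    return h_node - weighted

def best_feature(X, y):
    """Return the index of the feature with the highest gain, plus all the gains."""
    gains = [gain_for_feature(X, y, i) for i in range(X.shape[1])]
    return int(np.argmax(gains)), gains

best, gains = best_feature(X_demo, y_demo)
print(best, [round(float(g), 2) for g in gains])  # prints: 0 [0.28, 0.03, 0.12]
```

Index 0 is ear shape, matching the conclusion above. Note that `np.argmax` returns the first index on a tie, so with tied gains this sketch would quietly prefer the earlier feature.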