Information Gain: Information Gain measures how much knowing a feature reduces our uncertainty (entropy) about the target variable. Decision trees use it to choose which feature to split on.

Chi-Square: The chi-square test measures the association between two categorical variables. It is useful for testing whether a feature is independent of the target.

Entropy: It measures the amount of uncertainty or impurity in a set of labels. Decision trees use it to evaluate candidate splits: the lower the entropy of the resulting subsets, the better the split.
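
To make the entropy definition concrete, here is a minimal sketch that computes Shannon entropy in bits directly from a list of labels. It uses only the Python standard library, and `shannon_entropy` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    # H = sum(p * log2(1 / p)) over the observed classes, with p = count / n
    return sum((c / n) * log2(n / c) for c in counts.values())

print(shannon_entropy([0, 0, 1, 1]))  # balanced binary labels: maximum uncertainty
print(shannon_entropy([1, 1, 1, 1]))  # a single class: no uncertainty
```

A balanced binary target gives the largest possible value for two classes, while a set containing a single class gives zero.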

In [1]:
from sklearn.feature_selection import mutual_info_classif
import pandas as pd
from scipy.stats import entropy
import numpy as np
from sklearn.feature_selection import chi2

Read the CSV file

In [2]:
df = pd.read_csv('mtcars.csv')

Information Gain without parent entropy

In [3]:
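# Features: miles per gallon and number of cylinders
X = df[['mpg', 'cyl']]
# Target: hp is numeric, so mutual_info_classif treats each distinct value as its own class
y = df['hp']
# Estimated mutual information per feature (varies slightly between runs unless random_state is set)
info_gain = mutual_info_classif(X, y)
print(info_gain)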

[1.09634524 1.42821239]
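
For a discrete feature, information gain and mutual information are the same quantity: I(X; Y) = H(Y) - H(Y | X). The sketch below computes it exactly from counts, using only the standard library; the helper names and the toy data are hypothetical. It reports bits, whereas scikit-learn documents the `mutual_info_classif` estimates in nats and uses a nearest-neighbor estimator for continuous features, so these numbers are not directly comparable with the output above.

```python
from collections import Counter, defaultdict
from math import log2

def entropy_bits(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def mutual_information_bits(feature, target):
    """I(X; Y) = H(Y) - H(Y | X) for a discrete feature, in bits."""
    # Group the target labels by feature value
    groups = defaultdict(list)
    for x, y in zip(feature, target):
        groups[x].append(y)
    n = len(target)
    # H(Y | X): entropy of each group, weighted by the group's share of the samples
    conditional = sum(len(g) / n * entropy_bits(g) for g in groups.values())
    return entropy_bits(target) - conditional

# Toy data (made up for illustration, not taken from mtcars.csv)
cyl = [4, 4, 6, 6, 8, 8]
vs = [1, 1, 1, 0, 0, 0]
print(mutual_information_bits(cyl, vs))
```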


CHI-SQUARE

In [4]:
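# chi2 requires non-negative feature values; X and y are reused from the previous cell
chi_scores, p_values = chi2(X, y)
print(chi_scores)  # chi-square statistic per feature
print(p_values)    # corresponding p-values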

[55.18447141 15.54882155]
[6.64456551e-05 7.94441298e-01]
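
The sketch below shows the statistic as we understand scikit-learn's `chi2` to compute it for a single feature: the feature values are summed per class (observed) and compared with the share each class would receive if the feature were independent of the class (expected). It uses only the standard library, and `chi2_statistic` is a hypothetical helper written for illustration, not the library's implementation.

```python
from collections import defaultdict

def chi2_statistic(feature, labels):
    """Chi-square statistic for one non-negative feature against class labels."""
    n = len(labels)
    total = sum(feature)               # feature total over all samples
    observed = defaultdict(float)      # feature total per class
    class_size = defaultdict(int)      # number of samples per class
    for x, c in zip(feature, labels):
        observed[c] += x
        class_size[c] += 1
    stat = 0.0
    for c, size in class_size.items():
        expected = total * size / n    # share expected under independence
        stat += (observed[c] - expected) ** 2 / expected
    return stat

# Toy data (made up for illustration)
print(chi2_statistic([4, 4, 0, 0], [0, 0, 1, 1]))  # feature concentrated in one class
print(chi2_statistic([1, 1, 1, 1], [0, 0, 1, 1]))  # feature spread evenly across classes
```

The p-value is then read from a chi-square distribution with (number of classes - 1) degrees of freedom, which is why a target with many distinct values, such as hp here, needs a large statistic to reach significance.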


PARENT ENTROPY

In [8]:
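# vs is a binary column, so the maximum possible entropy is 1 bit
y = df['vs']
# Count how often each class occurs
values, counts = np.unique(y, return_counts=True)
probabilities = counts / len(y)
# Shannon entropy in bits (base 2)
ent = entropy(probabilities, base=2)
# For a binary target, bits * 100 is the percentage of the 1-bit maximum
print(f"Entropy: {ent*100}%")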

Entropy: 98.86994082884975%


Information Gain with the calculated entropy

In [11]:
# Features (X) and target (y)
X = df[['mpg', 'cyl']].values
y = df['vs'].values

In [12]:
# Calculating parent Entropy (S)
values, counts = np.unique(y, return_counts=True)
parent_entropy = entropy(counts / len(y), base=2)
print(f"Parent Entropy: {parent_entropy*100}%")

Parent Entropy: 98.86994082884975%


In [13]:
# Use the first feature column (mpg)
X_feature = X[:, 0]

# Split the data at the median of the feature
median_value = np.median(X_feature)

# Subsets for X_feature <= median and X_feature > median
y_left = y[X_feature <= median_value]
y_right = y[X_feature > median_value]

# Calculate the entropy of subsets
values_left, counts_left = np.unique(y_left, return_counts=True)
entropy_left = entropy(counts_left / len(y_left), base=2)

values_right, counts_right = np.unique(y_right, return_counts=True)
entropy_right = entropy(counts_right / len(y_right), base=2)

print(f"Entropy Left (<= median): {entropy_left}")
print(f"Entropy Right (> median): {entropy_right}")

# Subset weights: fraction of all samples that falls into each subset
left_weight = len(y_left) / len(y)
right_weight = len(y_right) / len(y)

Entropy Left (<= median): 0.672294817075638
Entropy Right (> median): 0.8366407419411673


In [14]:
# Information gain = parent entropy minus the weighted average of the subset entropies (in bits)
info_gain = parent_entropy - (left_weight * entropy_left + right_weight * entropy_right)
print(f"Information Gain: {info_gain}")

Information Gain: 0.23936743893214252
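
The steps in the cells above can be wrapped into one reusable function. This is a minimal sketch using only the standard library; `information_gain`, `entropy_bits`, and the toy data are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import log2
from statistics import median

def entropy_bits(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(feature, target, threshold):
    """Information gain (in bits) of the binary split feature <= threshold."""
    left = [y for x, y in zip(feature, target) if x <= threshold]
    right = [y for x, y in zip(feature, target) if x > threshold]
    n = len(target)
    # Weighted child entropy; an empty side contributes nothing
    children = sum(len(part) / n * entropy_bits(part) for part in (left, right) if part)
    return entropy_bits(target) - children

# Toy data (made up for illustration, not taken from mtcars.csv)
mpg = [15.0, 16.5, 18.0, 21.0, 24.5, 30.0]
vs = [0, 0, 0, 1, 1, 1]
print(information_gain(mpg, vs, median(mpg)))
```

Calling the function with several candidate thresholds and keeping the largest gain is how a decision tree would pick its split point, rather than always using the median.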
