### Entropy

Entropy measures the uncertainty of a random variable $X$. It is defined as:

\begin{equation*}
H(X) = -\sum_{i}{P(x_i)\log_2{P(x_i)}}
\end{equation*}

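where:
 * $P(x_i)$ is the probability of the outcome $x_i$, one of the possible values of $X$
 * the logarithm is taken in base 2 (as in the code further down), so $H$ is measured in bits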
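
As a quick sanity check of the definition, here is a minimal sketch that evaluates the entropy directly from a list of probabilities. The helper name `entropy_from_probs` and the example distributions are illustrative assumptions, not part of the notebook's own code.

```python
import math

def entropy_from_probs(probs):
    """Shannon entropy in bits of a discrete distribution (hypothetical helper)."""
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_from_probs([0.5, 0.5])    # two equally likely outcomes: 1.0 bit
biased_coin = entropy_from_probs([0.9, 0.1])  # more predictable, so less than 1 bit
certain = entropy_from_probs([1.0])           # a single sure outcome: 0.0 bits

print(fair_coin, biased_coin, certain)
```

The more evenly the probability is spread over the outcomes, the higher the entropy.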


### Information Gain
 
Information gain measures how much the entropy drops when a set of samples is split into subsets. The initial set is $X$, and its $n$ subsets after the split are $\{Y_1, Y_2, Y_3, \ldots, Y_n\}$. The information gain $G$ is defined as:

\begin{equation*}
G = H(X) - \frac{\sum_{i=1}^{n}{\omega_i H(Y_i)}}{\sum_{i=1}^{n}{\omega_i}}
\end{equation*}
 
where:
  * $H(X)$ is the entropy of $X$
  * $\omega_i$ is the number of samples in the $i$-th subset $Y_i$

The second term is a weighted average of the subset entropies, with each subset weighted by its number of samples. See the Wikipedia article on the weighted arithmetic mean for its statistical properties.
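
To make the formula concrete, the following sketch computes the gain for two hand-made splits of a toy label list. The names `entropy_bits` and `information_gain` and the toy labels are assumptions introduced only for this illustration.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Entropy in bits of a list of class labels (hypothetical helper)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted average of the subset entropies."""
    total = sum(len(s) for s in subsets)
    weighted = sum(len(s) * entropy_bits(s) for s in subsets) / total
    return entropy_bits(parent) - weighted

parent = ["A", "A", "B", "B"]

# A split that separates the classes perfectly removes all uncertainty
perfect = information_gain(parent, [["A", "A"], ["B", "B"]])

# A split that leaves both subsets as mixed as the parent gains nothing
useless = information_gain(parent, [["A", "B"], ["A", "B"]])

print(perfect, useless)  # 1.0 0.0
```

A gain close to the parent entropy means the split is very informative, while a gain near zero means it barely helps.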

In [35]:
import pandas as pd
import numpy as np
import math

In [4]:
data = pd.read_csv(r"data\ml-bugs.csv")
data.head()

Unnamed: 0,Species,Color,Length (mm)
0,Mobug,Brown,11.6
1,Mobug,Blue,16.3
2,Lobug,Blue,15.1
3,Lobug,Green,23.7
4,Lobug,Blue,18.4


In [63]:
splicings = {
    'Color':{
        'Brown':lambda x:x=='Brown',
        'Blue':lambda x:x=='Blue',
        'Green':lambda x:x=='Green'},
    'Length (mm)':{
        'Length17':lambda x:x<17.0,
        'Length20':lambda x:x<20.0
    }
}

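def split(df, col, criterion):
    """
    Split a dataframe into two dataframes according to a criterion.
    
    :param df: pandas dataframe
    :param col: name of the column the criterion is applied to
    :param criterion: function applied to df[col]; must return a boolean mask
    
    :returns: df1, df2 (rows where the criterion holds, rows where it does not)
    """
    mask = criterion(df[col])
    return df[mask], df[~mask]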
    
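def entropy(ary):
    """
    Compute the Shannon entropy (in bits) of an array of labels.
    
    :param ary: array of class labels
    
    :returns: entropy H as a float
    """
    n = len(ary)
    _, counts = np.unique(ary, return_counts=True)
    H = 0.0
    
    for c in counts:
        p = float(c)/float(n)  # relative frequency of the label
        H += -p*math.log2(p)
    return H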

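def info_gain(a, subs):
    """
    Compute the information gain obtained by splitting a into the subsets subs.
    
    :param a: array of labels before the split
    :param subs: list of label arrays, one per subset
    
    :returns: information gain G as a float
    """
    H = entropy(a)
    n = sum(len(sb) for sb in subs)
    Y = 0.0
    
    for sb in subs:
        Y += len(sb)*entropy(sb)  # subset entropy weighted by subset size
        
    G = H - Y/float(n)
    return G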

In [64]:
df1,df2 = split(data, 'Color', splicings['Color']['Green'])
df2

Unnamed: 0,Species,Color,Length (mm)
0,Mobug,Brown,11.6
1,Mobug,Blue,16.3
2,Lobug,Blue,15.1
4,Lobug,Blue,18.4
5,Lobug,Brown,17.1
6,Mobug,Brown,15.7
8,Lobug,Blue,22.9
9,Lobug,Blue,21.0
10,Lobug,Blue,20.5
12,Mobug,Brown,13.8


Calculate the entropy of a sample (the species labels of `df2`):

In [65]:
entropy(np.array(df2['Species']))

1.0

Calculate the information gain of a sample split (`Color == 'Green'`):

In [66]:
info_gain(np.array(data['Species']),[np.array(df1['Species']),np.array(df2['Species'])])

0.042776048498108454

Calculate the information gain for every splitting rule:

In [69]:
results = {}
for feature,rules in splicings.items():
    for rule_name, rule in rules.items():
        df1,df2 = split(data,feature,rule)
        results[rule_name] = info_gain(np.array(data['Species']),[np.array(df1['Species']),np.array(df2['Species'])])

results

{'Brown': 0.06157292259666325,
 'Blue': 0.000589596275060833,
 'Green': 0.042776048498108454,
 'Length17': 0.11260735516748954,
 'Length20': 0.10073322588651734}
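
One way to read these numbers is to rank the rules by gain and pick the largest. The sketch below copies the values from the output above into a literal dictionary so that it runs on its own; the names `ranked` and `best_rule` are illustrative assumptions.

```python
# Gains copied from the output above so the snippet is self-contained
results = {
    'Brown': 0.06157292259666325,
    'Blue': 0.000589596275060833,
    'Green': 0.042776048498108454,
    'Length17': 0.11260735516748954,
    'Length20': 0.10073322588651734,
}

# Sort the rules from most to least informative
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# The rule with the highest information gain
best_rule = max(results, key=results.get)

for name, gain in ranked:
    print(f"{name:10s} {gain:.4f}")
print("best:", best_rule)  # best: Length17
```

Among the rules tested here, splitting on `Length (mm) < 17.0` yields the largest gain.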