In [1]:
import pandas as pd
import numpy as np
import warnings
import pandasql as ps

warnings.filterwarnings('ignore')

In [2]:
def get_entropy(df, target):
    """Return the base-2 entropy of a binary (0/1) target column in df."""
    pos_list = df[df[target] == 1]
    neg_list = df[df[target] == 0]
    
    # class probabilities; drop zero entries because log2(0) is undefined
    prob_pos_neg = np.array([len(pos_list) / len(df), len(neg_list) / len(df)])
    prob_pos_neg = prob_pos_neg[prob_pos_neg != 0]
    log_prob_pos_neg = np.log2(prob_pos_neg)
    
    return -sum(prob_pos_neg * log_prob_pos_neg)


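def get_attribute_entropy(df, attribute, target):
    """Return the weighted average entropy of target after splitting df on attribute."""
    attribute_list = df[attribute].unique()
    attr_values = []
    
    for attr in attribute_list:
        df_attr = df[df[attribute] == attr]
        # weight each subset's entropy by its share of the rows
        attr_values.append((len(df_attr) / len(df)) * get_entropy(df_attr, target))
    
    return sum(attr_values)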


def attribute_entropy(df, target):
    """Return one DataFrame per attribute with the yes/no target counts of each attribute value."""
    column_list = [col for col in df if col != target]
    summary_list = []
    
    for col in column_list:
        query = """
            SELECT '{attribute}'                           AS attribute
                 , {attribute}                             AS details
                 , SUM(CASE WHEN {target_class} == 1 
                                 THEN 1 ELSE 0 END)        AS yes
                 , SUM(CASE WHEN {target_class} == 0 
                                 THEN 1 ELSE 0 END)        AS no
            FROM df
            GROUP BY 2
        """.format(attribute = col, target_class = target)
        
        summary_output = ps.sqldf(query, locals())
        summary_list.append(summary_output)
    
    return summary_list

# Entropy and Information Gain

### 1.1 Understanding Entropy
* Information Gain is an entropy-based criterion that Decision Tree algorithms use to select the attributes that best split a dataset.
    * In this context, Entropy is a metric that quantifies the <em>disorder</em> or <em>impurity</em> of the values in a set of data.
    * The mathematical formula for Entropy (often denoted as either $H$ or $E$):
        * $H(D) = -\sum_{i=1}^{m} p_i\log_2(p_i)$ where 
            * $D$ is the dataset
            * $m$ is the number of distinct classes in $D$
            * $p_i$ is the relative frequency (empirical probability) of class $i$ in $D$
* A Decision Tree algorithm seeks the features that best separate the values of the target variable. One way to measure how cleanly a split separates those values is to compute the entropy of the target variable in each resulting subset.
* The dataset below is used to calculate the initial entropy of the target variable.
    * We seek to answer whether a customer is likely to purchase a computer based on their age, income, student status, and credit rating.
    * We can use the formula above to calculate the target variable's entropy:
        * $H(D) = -\sum_{i=1}^{m} p_i\log_2(p_i)$
        * Each $i$ refers to one class of the target variable: 0 if a customer did not purchase a computer and 1 if the customer did.
        * With 9 purchases and 5 non-purchases among 14 customers: $H(D) = -\sum_{i=1}^{m} p_i\log_2(p_i) = - (\frac{9}{14}\log_2(\frac{9}{14}) + \frac{5}{14}\log_2(\frac{5}{14})) \approx 0.94$
    * For a binary target, an entropy of 1 means that the data is maximally disordered (the two classes are evenly split), whereas an entropy of 0 means that the data is perfectly ordered (only one class is present)
        * Entropy 1 (perfectly disordered): 10 positives and 10 negatives in a set of data
        * Entropy 0 (perfectly ordered): 20 positives and 0 negatives in a set of data
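
The entropy values quoted above can be checked with a few lines of plain Python. This is a minimal sketch, not the notebook's own `get_entropy`: the helper name `entropy_from_counts` is hypothetical, and it takes raw class counts rather than a DataFrame. It uses the equivalent form $\sum_i p_i \log_2(1/p_i)$ and skips empty classes.

```python
import math

def entropy_from_counts(counts):
    """Base-2 entropy of a class distribution given as raw counts (hypothetical helper)."""
    total = sum(counts)
    # Skip empty classes: p * log2(1/p) tends to 0 as p -> 0.
    probs = [c / total for c in counts if c > 0]
    # sum(p * log2(1/p)) is the same quantity as -sum(p * log2(p)).
    return sum(p * math.log2(1 / p) for p in probs)

print(round(entropy_from_counts([9, 5]), 2))  # 0.94 -> 9 buyers, 5 non-buyers
print(entropy_from_counts([10, 10]))          # 1.0  -> evenly split classes
print(entropy_from_counts([20, 0]))           # 0.0  -> a single class
```

Because the helper loops over however many counts it receives, it also covers targets with more than two classes, which matches the general $m$-class formula.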

In [3]:
#import dataset
df = pd.read_csv('https://raw.githubusercontent.com/AugustLONG/ML01/master/01decisiontree/AllElectronics.csv')

#cleansing data: remove RID column
df.drop("RID", axis = 1, inplace = True)

#cleansing data: rename target variable
df.rename({'class_buys_computer': 'target_class'}, axis=1, inplace=True)

#map target values from text to integers (no -> 0, yes -> 1)
class_mapper = {"no": 0, "yes": 1}
df['target_class'] = df['target_class'].map(class_mapper)

#print data
df

Unnamed: 0,age,income,student,credit_rating,target_class
0,youth,high,no,fair,0
1,youth,high,no,excellent,0
2,middle_aged,high,no,fair,1
3,senior,medium,no,fair,1
4,senior,low,yes,fair,1
5,senior,low,yes,excellent,0
6,middle_aged,low,yes,excellent,1
7,youth,medium,no,fair,0
8,youth,low,yes,fair,1
9,senior,medium,yes,fair,1


In [4]:
get_entropy(df, 'target_class')

0.9402859586706311

### 1.2 Information Gain Using Entropy

* As mentioned above, entropy is a metric used to measure the disorder or impurity of a dataset
    * In a decision tree, whenever we talk about entropy, we are referring to the entropy of the target variable
    * For our dataset, it is approximately 0.94 before any split
    * The goal of the Decision Tree algorithm is to find attributes (variables) to split on that reduce the target variable's entropy
* In order to quantify this reduction, **Information Gain** is introduced
    * $Information\space Gain(D, A) = Entropy(D) - Entropy(D, A)$ where
        * $Entropy(D)$ is the entropy of dataset $D$
        * $Entropy(D, A)$ is the weighted average entropy of dataset $D$ after being split on attribute (variable) $A$
        * $Entropy(D, A) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Entropy(D_j)$, where $v$ is the number of distinct values of $A$ and $D_j$ is the subset of $D$ in which $A$ takes its $j$-th value
* Steps for finding the attribute that best reduces the entropy of the dataset
    1. For every attribute $A$, find $Entropy(D, A)$
    2. Subtract $Entropy(D, A)$ from the parent node's entropy $Entropy(D)$ to find the reduction in entropy.
* Our target variable can be split in the ways shown in the table below. For example, if we first split on the **Age** attribute, three child nodes are created, and our goal is to calculate the weighted average of the three nodes' entropies
    * 4 Yes and 0 No in the **middle_aged Age** group
    * 3 Yes and 2 No in the **senior Age** group
    * 2 Yes and 3 No in the **youth Age** group
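
The weighted average for the **Age** split can be worked through by hand using only the counts listed above. The sketch below is illustrative: `entropy_from_counts` and `age_groups` are hypothetical names, and the counts are copied from the three bullets rather than read from `df`.

```python
import math

def entropy_from_counts(counts):
    """Base-2 entropy of a class distribution given as raw counts (hypothetical helper)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip empty classes
    return sum(p * math.log2(1 / p) for p in probs)

# (yes, no) counts per age group, copied from the list above
age_groups = {"middle_aged": (4, 0), "senior": (3, 2), "youth": (2, 3)}

n_total = sum(sum(c) for c in age_groups.values())  # 14 rows in total

# Entropy(D, age): each child's entropy weighted by its share of the rows
entropy_after_split = sum(
    (sum(c) / n_total) * entropy_from_counts(c) for c in age_groups.values()
)

# Parent counts are the column sums of the children: 9 yes, 5 no
parent_counts = [sum(col) for col in zip(*age_groups.values())]
information_gain = entropy_from_counts(parent_counts) - entropy_after_split

print(round(entropy_after_split, 3))  # 0.694
print(round(information_gain, 3))     # 0.247
```

The middle_aged node is pure (4 Yes, 0 No), so it contributes zero entropy, which is why splitting on Age removes so much of the parent's 0.94.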

In [5]:
pd.concat(attribute_entropy(df, 'target_class'), ignore_index=True)

Unnamed: 0,attribute,details,yes,no
0,age,middle_aged,4,0
1,age,senior,3,2
2,age,youth,2,3
3,income,high,2,2
4,income,low,3,1
5,income,medium,4,2
6,student,no,3,4
7,student,yes,6,1
8,credit_rating,excellent,3,3
9,credit_rating,fair,6,2


* We can calculate the **Information Gain** for every attribute, as shown below
* Age has the greatest Information Gain (reduction in Entropy), meaning that splitting our dataset on the values of the age variable produces the purest child nodes with respect to the target values

In [236]:
row_list = []
parent_entropy = get_entropy(df, 'target_class')  # entropy before any split

for col in [col for col in df if col != 'target_class']:
    # information gain = parent entropy - weighted entropy after splitting on col
    info_gain = parent_entropy - get_attribute_entropy(df, col, 'target_class')
    row_list.append([col, round(info_gain, 2)])
    
pd.DataFrame(row_list, columns = ['attribute', 'information_gain'])

Unnamed: 0,attribute,information_gain
0,age,0.25
1,income,0.03
2,student,0.15
3,credit_rating,0.05
