# Intro 
The data is from the 1994 census and contains information on each individual's marital status, age, type of work, and more. The target column, `high_income`, is what we want to predict: whether an individual makes at most 50k a year or more than 50k a year.

In [1]:
import pandas as pd

income = pd.read_csv('income.csv', index_col=False)
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


The data contains several categorical variables that must be converted to numerical variables before they can be used in the model. Converting a column to a categorical type assigns an integer code to each unique category, and these codes then replace the original values in the dataframe.

In [2]:
for name in ['workclass','education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'high_income']:
    # Categorical.from_array was removed from pandas; the constructor does the same job
    col = pd.Categorical(income[name])
    income[name] = col.codes
income.head()

  


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0
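
To see how the integer codes relate to the original labels, the following minimal sketch runs the same conversion on a small made-up list (the `toy` values are illustrative, not rows read from `income.csv`). It assumes the default behavior of `pd.Categorical`, which sorts the unique string values and numbers them from 0.

```python
import pandas as pd

# Hypothetical labels for illustration only
toy = pd.Series(['Private', 'State-gov', 'Private', 'Self-emp-inc'])

cat = pd.Categorical(toy)

# Categories are sorted, and each code is an index into that sorted list
categories = list(cat.categories)
codes = cat.codes.tolist()

# Build a code -> label lookup so encoded values can be decoded later
code_to_label = dict(enumerate(cat.categories))

print(categories)
print(codes)
print(code_to_label)
```

Keeping a lookup like `code_to_label` around is one way to confirm which label a given code stands for, rather than reading it off the two tables by eye.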


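Next, split the data on whether the person works in the private sector (`Private` maps to code 4 in the `workclass` column). Note that `public` below holds every row that is not `Private`, which includes self-employed workers as well as government employees.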

In [3]:
private = income[income['workclass'] == 4]
public = income[income['workclass'] != 4]
print(private.shape)
print(public.shape)

(22696, 15)
(9865, 15)


Splitting the dataset at a node (here, on whether the person works in the private sector) produces two branches because the test has only two possible outcomes. The goal is to keep splitting until the data is broken into terminal nodes that can be used to make predictions on new data. Ideally, all rows in a terminal node share a single value of the target column (in our case `high_income`, a binary categorical column).
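
The idea of a terminal node can be checked directly: a subset is pure when its target column holds only one distinct value. The sketch below uses a small made-up dataframe; `is_pure` is a hypothetical helper introduced for illustration, not something defined elsewhere in this notebook.

```python
import pandas as pd

def is_pure(target):
    # A node is pure when every row shares the same target value
    return target.nunique() == 1

# Made-up rows for illustration only
toy = pd.DataFrame({
    'workclass':   [4, 4, 7, 7],
    'high_income': [0, 0, 0, 1],
})

private_toy = toy[toy['workclass'] == 4]
other_toy = toy[toy['workclass'] != 4]

# The first branch could become a terminal node; the second needs more splits
print(is_pure(private_toy['high_income']))
print(is_pure(other_toy['high_income']))
```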

To better understand how the values in the target column are grouped together, it's common to look at a metric called entropy. In our case entropy refers to the disorder of the values, wherein more uniformity results in lower entropy. Entropy can be represented by the following formula:

$-\sum_{i=1}^{c}P(x_i)\log_bP(x_i)$

Here $i$ indexes the $c$ unique values in a single column (in this case, `high_income`). For each value $x_i$, we compute the probability of that value occurring in the data, $P(x_i)$, multiply it by its logarithm, $\log_b P(x_i)$, then sum the products over all values and negate the result. $b$ is the base of the logarithm; 2 is the common choice and gives entropy in bits, but 10 or another base can also be used.
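
Before applying the formula to real data, it can help to try it on a few hand-picked distributions. This is a minimal sketch; `entropy_from_probs` is a hypothetical helper that takes the probabilities directly instead of computing them from a column.

```python
import math

def entropy_from_probs(probs, base=2):
    # Skip zero probabilities: they contribute nothing and log(0) is undefined
    return sum(-p * math.log(p, base) for p in probs if p > 0)

fair = entropy_from_probs([0.5, 0.5])    # two equally likely values
skewed = entropy_from_probs([0.9, 0.1])  # mostly one value
single = entropy_from_probs([1.0])       # only one value ever occurs

print(fair, skewed, single)
```

A 50/50 column gives 1 bit, a column with a single value gives 0, and the 90/10 column falls in between at roughly 0.47 bits, which matches the description of entropy as a measure of disorder.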

The entropy of the `high_income` column is calculated below:

In [4]:
import math
prob_0 = income[income['high_income'] == 0].shape[0] / income.shape[0]
prob_1 = income[income['high_income'] == 1].shape[0] / income.shape[0]
income_entropy = -(prob_0*math.log(prob_0,2) + prob_1*math.log(prob_1, 2))
income_entropy

0.7963839552022132

Since the end goal is to reduce entropy as much as possible, we need a way to determine which split reduces it the most. Information gain does this by comparing the entropy of the original node with the weighted entropy of the branches produced by a split, where each branch is weighted by its share of the rows. Below, information gain is computed for a split of the `age` column at its median, with `high_income` as the target:

In [5]:
import numpy

def calc_entropy(column):
    # Compute the counts of each unique value in the column
    counts = numpy.bincount(column)
    # Divide by the total column length to get a probability
    probabilities = counts / len(column)
    
    entropy = 0
    # Loop through the probabilities, and add each one to the total entropy
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
    
    return -entropy

income_entropy = calc_entropy(income['high_income'])
median_age = income['age'].median()

left_split = income[income['age'] <= median_age]
right_split = income[income['age'] > median_age]

# Weight each branch's entropy by its share of the rows
left_weight = left_split.shape[0] / income.shape[0]
right_weight = right_split.shape[0] / income.shape[0]
age_information_gain = income_entropy - (
    left_weight * calc_entropy(left_split['high_income'])
    + right_weight * calc_entropy(right_split['high_income'])
)
age_information_gain

0.047028661304691965

This means that splitting on the `age` column at its median gains about 0.047 bits of information about whether a person has a 1 in the `high_income` column.
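
Written out, the quantity computed above is the parent entropy minus the weighted entropies of the branches. The notation is an assumption introduced here for illustration: $H$ is the entropy defined earlier, $T$ is the full set of rows, and $T_v$ is the subset of rows that falls into branch $v$ (at or below the median, or above it).

$IG(T) = H(T) - \sum_{v \in \{\text{left}, \text{right}\}} \frac{|T_v|}{|T|} H(T_v)$

In the `age` example, $H(T)$ corresponds to `income_entropy`, and the two weights $|T_v| / |T|$ are the fractions of rows in `left_split` and `right_split`.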

With the basics of information gain in place, we can apply it to the initial split to determine which variable is best to split on first (i.e. which variable has the highest information gain). Note that every column, including the integer-coded categorical ones, is split at its median here:

In [6]:
def calc_information_gain(data, split_name, target_name):
    # Calculate the original entropy
    original_entropy = calc_entropy(data[target_name])
    
    # Find the median of the column we're splitting
    column = data[split_name]
    median = column.median()
    
    # Make two subsets of the data, based on the median
    left_split = data[column <= median]
    right_split = data[column > median]
    
    # Loop through the splits and calculate the subset entropies
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = (subset.shape[0] / data.shape[0]) 
        to_subtract += prob * calc_entropy(subset[target_name])
    
    # Return information gain
    return original_entropy - to_subtract

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race",
           "sex", "hours_per_week", "native_country"]
information_gains = []

# Loop through the columns and calculate each information gain
for col in columns:
    information_gain = calc_information_gain(income, col, 'high_income')
    information_gains.append(information_gain)

# Pull the index of the highest gain and return the column name
highest_gain_index = information_gains.index(max(information_gains))
highest_gain = columns[highest_gain_index]
print('The variable with the highest gain is {} with {} information gain'.format(highest_gain, round(max(information_gains),3)))

The variable with the highest gain is marital_status with 0.111 information gain
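
As a sanity check on the selection logic, the sketch below reruns the same median-split calculation on a tiny made-up dataframe where the answer is known in advance: column `x` separates the target perfectly, while column `z` carries no information about it. The two functions are restated in compact form so the block runs on its own; the data and the column names `x`, `z`, and `y` are hypothetical.

```python
import math
import numpy as np
import pandas as pd

def calc_entropy(column):
    # Probability of each integer-coded value in the column
    probabilities = np.bincount(column) / len(column)
    return sum(-p * math.log(p, 2) for p in probabilities if p > 0)

def calc_information_gain(data, split_name, target_name):
    original_entropy = calc_entropy(data[target_name])
    median = data[split_name].median()
    subsets = [data[data[split_name] <= median], data[data[split_name] > median]]
    # Weight each branch's entropy by its share of the rows
    weighted = sum(
        (len(s) / len(data)) * calc_entropy(s[target_name]) for s in subsets
    )
    return original_entropy - weighted

# Made-up data for illustration only
toy = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'z': [1, 3, 2, 4],
    'y': [0, 0, 1, 1],
})

gains = {col: calc_information_gain(toy, col, 'y') for col in ['x', 'z']}
best = max(gains, key=gains.get)
print(best, gains)
```

Here `x` should come out with a gain of 1 bit and `z` with a gain of 0, which matches the intuition that a perfect split removes all uncertainty about the target while an uninformative one removes none.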
