# Census Income Decision Tree

We have data from the 1994 US Census and we are interested in a person's income. Specifically, we want to know what demographic factors we could use to accurately predict if a person makes more or less than $50K a year.

The data has 15 columns altogether: 'Age', 'Employeer', 'Num-Rep', 'Education', 'Edu-Num', 'Marital', 'Job', 'Relationship', 'Race', 'Sex', 'Cap-Gain', 'Cap-Loss', 'Hours', 'Nationality', 'Income'. 

The __Age__, __Race__, __Sex__, and __Nationality__ columns are all self-explanatory. Age is continuous, while the rest are categorical.

The __Employeer__ column is categorical and describes what type of employer the subject has, while the __Job__ column, also categorical, describes what type of job the subject has. The __Hours__ column is continuous and describes how many hours a week the subject works.

The __Num-Rep__ column is continuous and gives the number of people in the population that this subject is expected to represent. It will probably not be a useful predictor.

The __Education__ column describes the highest level of education the subject achieved as a category, and __Edu-Num__ encodes the same information as a number.

The __Marital__ column is categorical and describes the subject's marital status, while the __Relationship__ column expands on it and is also categorical.

Both the __Cap-Gain__ and __Cap-Loss__ columns are continuous and describe the subject's capital gains and losses respectively.

Finally, the __Income__ column is categorical and tells whether or not a subject earns more than $50K a year, and is special because it is our outcome variable.

I want to create a decision tree that can accurately predict whether or not a person's income is above $50K.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import random
from collections import Counter
from IPython.display import HTML, display  # display() is used in the cells below
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
import statsmodels.formula.api as smf
import time

%matplotlib inline

## Importing the Data

In [2]:
census = pd.read_csv('adult.csv')
census = census.dropna()
display(census.head())

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


Because the file has no header row, pandas has treated the first record as the column names, so that record is missing from the data.
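One way to avoid losing the first record, sketched here on the assumption that `adult.csv` is the raw headerless file with a space after each comma, is to tell `pd.read_csv` that there is no header and to supply the names directly. The `raw` buffer below is a stand-in for the file and holds the first two records shown above.

```python
import io
import pandas as pd

# Column names used throughout this notebook.
COLUMNS = ['Age', 'Employeer', 'Num-Rep', 'Education', 'Edu-Num', 'Marital', 'Job',
           'Relationship', 'Race', 'Sex', 'Cap-Gain', 'Cap-Loss', 'Hours',
           'Nationality', 'Income']

# Stand-in for adult.csv: two headerless records (assumed format of the raw file).
raw = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None keeps the first record as data; names= supplies the column labels.
# skipinitialspace=True drops the blank after each comma.
sample = pd.read_csv(raw, header=None, names=COLUMNS, skipinitialspace=True)
print(sample.shape)
```

With `skipinitialspace=True` the labels lose their leading space, so a mapping keyed on `' <=50K'` would need its keys adjusted to `'<=50K'`.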

In [3]:
first = list(census.columns.values)
census.columns = ['Age','Employeer','Num-Rep','Education','Edu-Num','Marital','Job','Relationship',
                  'Race','Sex','Cap-Gain','Cap-Loss','Hours','Nationality','Income']

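# DataFrame.append returned a new frame (and was removed in pandas 2.0), so this call
# never put the first record back; the outputs below have 32560 rows as a result.
# census.append(first)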
display(census.head())


Unnamed: 0,Age,Employeer,Num-Rep,Education,Edu-Num,Marital,Job,Relationship,Race,Sex,Cap-Gain,Cap-Loss,Hours,Nationality,Income
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [4]:
census['Income'] = census['Income'].apply({' <=50K':0,' >50K':1}.get)
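The dictionary lookup above returns `None` for any label that is not an exact key, including one with different whitespace. A slightly more defensive sketch strips whitespace first and then checks that nothing was left unmapped. The `labels` series is a hypothetical stand-in for `census['Income']`.

```python
import pandas as pd

# Hypothetical stand-in for census['Income']; the leading spaces mimic the raw labels.
labels = pd.Series([' <=50K', ' >50K', ' <=50K'])

# Strip surrounding whitespace, then map each label to 0/1.
encoded = labels.str.strip().map({'<=50K': 0, '>50K': 1})

# Any label missing from the mapping becomes NaN, so this check catches surprises.
assert encoded.notna().all(), "unmapped income label found"
print(encoded.tolist())
```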

In [5]:
display(census.shape)
census.head()

(32560, 15)

Unnamed: 0,Age,Employeer,Num-Rep,Education,Edu-Num,Marital,Job,Relationship,Race,Sex,Cap-Gain,Cap-Loss,Hours,Nationality,Income
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,0


In [6]:
categorical = census.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

Employeer
9
Education
16
Marital
7
Job
15
Relationship
6
Race
5
Sex
2
Nationality
42


## Creating a Simple Data Set

Now I'll create a simple data set to use as a baseline. This data set will contain only 2 columns and 100 rows. The first column, Value, is simply the numbers from 0 to 99. The second column, Greater, is a binary label ('0' or '1') indicating whether the number in Value is 50 or greater.

In [7]:
simple = pd.DataFrame()
simple['Value'] = range(100)
simple['Greater'] = ['0']*50 + ['1']*50

## Testing the Run Time and Accuracy

Now that I have both of the data sets I want to work with, I want to see the computational cost of each model. I can do that by printing the run time of each model. I will start with a basic decision tree for my complex data set, then look at a random forest for both my complex and simple data sets.
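Since each timed cell below does several things at once (building features, fitting, scoring, and in some cases cross-validating), it can help to time each stage separately. The following is a minimal sketch of a reusable timer. The `timed` helper and the stand-in workloads are illustrative assumptions, not part of the original notebook.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-in workloads; in the notebook these would be the fit and cross-validation calls.
with timed('fit'):
    sum(i * i for i in range(10_000))
with timed('cross-validation'):
    sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} seconds")
```

Timing the stages separately makes it clearer whether a slow cell is dominated by a single fit or by the repeated fits of cross-validation.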

In [8]:
from sklearn import tree
start_time = time.time()
# Initialize and train our tree.
decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=4,
)
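X = census.drop('Income', axis=1)  # pass axis by keyword; the positional form is deprecated
Y = census['Income']
X = pd.get_dummies(X)  # one-hot encode the categorical columns
decision_tree.fit(X, Y)
print(decision_tree.n_features_in_)  # named n_features_ in older scikit-learn releases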
print(decision_tree.score(X,Y))
print("\n--- %s seconds ---" % (time.time() - start_time))

108
0.759490171990172

--- 0.11569452285766602 seconds ---
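A training accuracy is easier to interpret next to the accuracy of always predicting the most common class. The sketch below computes that baseline with `Counter`. The `labels` list is a made-up example, and in the notebook `Y` would take its place.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels: three 0s and one 1, so always predicting 0 is right 75% of the time.
labels = [0, 0, 0, 1]
print(majority_baseline(labels))
```

If the tree's score is close to this baseline, the tree has learned little beyond the class balance. With `max_features=1` each split considers only one randomly chosen feature, which may contribute to a weak tree.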


In [9]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
import time
start_time = time.time()
rfc = ensemble.RandomForestClassifier()
X = census.drop('Income', axis=1)
Y = census['Income']
X = pd.get_dummies(X)
rfc.fit(X,Y)

print(rfc.score(X,Y))
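score_ols = cross_val_score(rfc, X, Y, cv=10)
# cross_val_score uses accuracy for a classifier by default, so the value
# labelled "Error" here is the mean accuracy across the 10 folds.
print("\nError: %0.2f (+/- %0.2f)" % (score_ols.mean(), score_ols.std() * 2))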
print("\n--- %s seconds ---" % (time.time() - start_time))

0.9875307125307126

Error: 0.85 (+/- 0.01)

--- 7.448711633682251 seconds ---


In [10]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

start_time = time.time()
rfc = ensemble.RandomForestClassifier()
X = simple['Value']
Y = simple['Greater']
X = pd.get_dummies(X)
rfc.fit(X,Y)

print(rfc.score(X,Y))
score_ols = cross_val_score(rfc, X, Y, cv=10)
# The printed value is the mean cross-validated accuracy, not an error rate.
print("\nError: %0.2f (+/- %0.2f)" % (score_ols.mean(), score_ols.std() * 2))
print("\n--- %s seconds ---" % (time.time() - start_time))

0.98

Error: 0.50 (+/- 0.00)

--- 0.19551396369934082 seconds ---
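The cross-validated score of 0.50 on the simple data set is worth a closer look. When `pd.get_dummies` is applied to a Series, it encodes every distinct value as its own indicator column, so the 100 numbers become 100 columns and a held-out value shares no active feature with the training rows. The sketch below only illustrates the shapes involved. Keeping `Value` as a single numeric column is a suggestion, not what the cell above does.

```python
import pandas as pd

# Rebuild the simple data set from the earlier cell.
simple = pd.DataFrame()
simple['Value'] = range(100)
simple['Greater'] = ['0'] * 50 + ['1'] * 50

# One-hot encoding the Series: one indicator column per distinct value.
one_hot = pd.get_dummies(simple['Value'])
print(one_hot.shape)   # 100 rows, 100 columns

# Alternative (an assumption, not the original approach): keep Value numeric.
numeric = simple[['Value']]
print(numeric.shape)   # 100 rows, 1 column
```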


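It is striking how much more computationally expensive a random forest is than a single decision tree. The decision tree fit to the census data, which is orders of magnitude larger and more complex than the simple data set, ran in about 0.12 seconds, less than the roughly 0.20 seconds taken by the random forest on the simple data set. Note that the random forest timings also include 10-fold cross-validation, while the decision tree timing covers a single fit and score.

On the census data the random forest cell took about 7.4 seconds, roughly 38 times longer than on the simple data set, so the cost of a random forest grows considerably with the size of the data.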

## References

http://archive.ics.uci.edu/ml/datasets/Adult