# Introduction

The purpose of this project was to find a classification model that predicts whether a US citizen earns more or less than 50K a year, which is roughly the average salary.

This dataset was uploaded to Kaggle by the UCI ML Repository. The data consists of over 32,000 records of many interesting variables taken from the US census.


# Import

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import accuracy_score
import zipfile
from ipywidgets import interact, fixed
from IPython.display import display
import seaborn as sns
import matplotlib.pyplot as plt
from altair import *



In [2]:
zFile = !ls /data/brucerowan/ 
with zipfile.ZipFile('/data/brucerowan/'+zFile[1],'r') as zf:
    df = pd.read_csv(zf.open('adult.csv'))
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


# Tidy

The missing values in the dataset were recorded as '?', so I changed them to NaN. After seeing that I would still have about 30,000 rows, I decided to simply drop every row containing a NaN.

I changed marital.status to mStatus and native.country to nCountry, because having periods in variable names caused problems when graphing. 

I also couldn't make sense of how the relationship variable differed from marital status, so I decided to drop that column.
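
As an alternative to calling `replace` after loading, the '?' placeholder can be converted while the file is read. The snippet below is a minimal sketch on a made-up three-row sample (the sample rows and the variable names `sample`, `sample_clean`, and `missing_per_column` are assumptions for illustration, not part of the project data).

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like a few columns of adult.csv
sample_csv = io.StringIO(
    "age,workclass,occupation,income\n"
    "90,?,?,<=50K\n"
    "82,Private,Exec-managerial,<=50K\n"
    "41,Private,Prof-specialty,>50K\n"
)

# na_values="?" turns the placeholder into NaN while the file is parsed
sample = pd.read_csv(sample_csv, na_values="?")

# Count the missing values per column before deciding to drop rows
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
sample_clean = sample.dropna()
print(len(sample), "rows before,", len(sample_clean), "rows after")
```

Checking the per-column counts first makes it easier to judge how many rows a blanket `dropna()` will cost.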

In [3]:
df.replace('?', np.nan, inplace=True)

df['mStatus']=df['marital.status']
df=df.drop('marital.status', axis=1)
df['nCountry']=df['native.country']
df=df.drop('native.country', axis=1)

#one hot encode income 
Over_fifty= pd.get_dummies(df['income'], drop_first=True)
Tidy= df.join(Over_fifty)
Tidy=Tidy.dropna()
Tidy=Tidy.drop('relationship', axis=1)
Tidy.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,occupation,race,sex,capital.gain,capital.loss,hours.per.week,income,mStatus,nCountry,>50K
1,82,Private,132870,HS-grad,9,Exec-managerial,White,Female,0,4356,18,<=50K,Widowed,United-States,0
3,54,Private,140359,7th-8th,4,Machine-op-inspct,White,Female,0,3900,40,<=50K,Divorced,United-States,0
4,41,Private,264663,Some-college,10,Prof-specialty,White,Female,0,3900,40,<=50K,Separated,United-States,0
5,34,Private,216864,HS-grad,9,Other-service,White,Female,0,3770,45,<=50K,Divorced,United-States,0
6,38,Private,150601,10th,6,Adm-clerical,White,Male,0,3770,40,<=50K,Separated,United-States,0


### Preprocessing (One Hot Encoding)
One hot encoding is the process of transforming a single categorical column into separate columns of 0's and 1's, one per category.
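
To make the idea concrete, here is a small sketch of `pd.get_dummies` on a toy column. The `toy` frame is a hypothetical example; only the category labels are borrowed from the dataset's `sex` column.

```python
import pandas as pd

# Hypothetical toy column; the values mirror the 'sex' column in the dataset
toy = pd.DataFrame({"sex": ["Female", "Male", "Female"]})

# One 0/1 column per category
full = pd.get_dummies(toy["sex"], dtype=int)
print(full)

# drop_first=True drops the first category in sorted order ('Female'),
# leaving a single 0/1 column that carries the same information
reduced = pd.get_dummies(toy["sex"], drop_first=True, dtype=int)
print(reduced)
```

Dropping the first level avoids a redundant column, since a row that is 0 in every remaining column must belong to the dropped category.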

In [4]:
df=Tidy
Tidy=Tidy.drop('income', axis=1)

#1 hot encode native country (nan dropped)
Native= pd.get_dummies(Tidy['nCountry'], drop_first=True)
Clean= Tidy.join(Native)
Clean=Clean.drop('nCountry',axis=1)

#1 hot encode sex(females dropped)
Male= pd.get_dummies(df['sex'], drop_first=True)
Clean= Clean.join(Male)
Clean=Clean.drop('sex',axis=1)

#1 hot encode Race (Amer-Indian-Eskimo dropped)
Race = pd.get_dummies(df['race'],drop_first=True)
Clean= Clean.join(Race)
Clean=Clean.drop('race',axis=1)

#1 hot encoding occupation
Occupation = pd.get_dummies(df['occupation'],drop_first=True)
Clean= Clean.join(Occupation)
Clean=Clean.drop('occupation',axis=1)

#Table of one hot encoded Marital Status
MS=pd.get_dummies(df['mStatus'],drop_first=True)
Clean= Clean.join(MS)
Clean=Clean.drop('mStatus',axis=1)

Clean=Clean.drop('education',axis=1)

#Table of one hot encoded classes 
Classes=pd.get_dummies(df['workclass'],drop_first=True)
Clean= Clean.join(Classes)
Clean=Clean.drop('workclass',axis=1)

#For Presentation Purposes 
processed= Clean

## Examining the number of records in each category of Income level

In [5]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,occupation,race,sex,capital.gain,capital.loss,hours.per.week,income,mStatus,nCountry,>50K
1,82,Private,132870,HS-grad,9,Exec-managerial,White,Female,0,4356,18,<=50K,Widowed,United-States,0
3,54,Private,140359,7th-8th,4,Machine-op-inspct,White,Female,0,3900,40,<=50K,Divorced,United-States,0
4,41,Private,264663,Some-college,10,Prof-specialty,White,Female,0,3900,40,<=50K,Separated,United-States,0
5,34,Private,216864,HS-grad,9,Other-service,White,Female,0,3770,45,<=50K,Divorced,United-States,0
6,38,Private,150601,10th,6,Adm-clerical,White,Male,0,3770,40,<=50K,Separated,United-States,0


In [None]:
Chart(df).mark_bar().encode(
    X('income'),
    Y('count(*)')
)

It was a little strange to see how uneven the classes were, given that the median US salary is about 50K. After more research on the dataset, I suspect this is because its creators pulled an equal number of records from each state without weighting by each state's population.

## Examining correlations between variables and making over 50k

In [None]:
cors= processed.corrwith(processed['>50K'])
print("Highest positive correlations:")
print(cors.sort_values(axis=0, ascending=False).head(16))
print("\nLowest negative correlations:")
print(cors.sort_values(axis=0, ascending=True).head(15))

### Reduce dataset to only columns with an absolute correlation above 0.07

In [None]:
x=[]
for i in range(len(cors)):
    # cors is indexed by column name, so use .iloc for positional access
    if(abs(cors.iloc[i])>0.07):
        x.append(cors.index[i])

Features = processed[x]
Features.head()

## Correlation Matrix / HeatMap

In [None]:
corr = Features.corr()
sns.heatmap(corr)
plt.show()

## Examining Marital Status on income

In [None]:
Chart(df).mark_bar().encode(
    Y('mStatus',sort=SortField(field='>50K', op='mean', order='descending')),
    X('count(*):Q'),
    Color('income', scale=Scale(range=['red','green']))
)

## Examining the income level when grouped by sex

In [None]:
Chart(df).mark_bar().encode(
    X('sex'),
    Y('count(*):Q'),
    Color('income', scale=Scale(range=['red','green']))       
)

## Effects of age on income

In [None]:
Chart(df).mark_bar().encode(
    X('age'),
    Y('count(*):Q'),
    Color('income', scale=Scale(range=['red','green']))
)

## Function to make graphs in Altair

Returns a bar chart relating any explanatory variable to the target.

Bars are sorted by the proportion of people who make over 50K.

In [None]:
def Graph(col):
    '''
    Simple function to create color-coded bar charts in Altair relating any variable to the income classification
    '''
    return Chart(df).mark_bar().encode(
              X(col,sort=SortField(field='>50K', op='mean', order='descending')),
              Y('count(*):Q'),
              Color('income', scale=Scale(range=['red','green']))       
    )

#saving the column names as a list 
variables=list(df)

In [None]:
interact(Graph,col=variables)

In [None]:
cors= processed.corrwith(df['>50K'])
print("Highest positive correlations:")
print(cors.sort_values(axis=0, ascending=False).head(16))
print("\nLowest negative correlations:")
print(cors.sort_values(axis=0, ascending=True).head(15))

## Creating DataFrame of Features

Adding all columns with a correlation absolute value greater than 0.1.

I didn't run a grid search, but I ran each algorithm with several correlation thresholds and this one was the most successful.

Target is income >50k

In [None]:
x=[]
for i in range(len(cors)):
    # cors is indexed by column name, so use .iloc for positional access
    if(abs(cors.iloc[i])>0.1):
        x.append(cors.index[i])

Features = processed[x]
Features=Features.drop('>50K',axis=1)
Features=Features.drop('Never-married', axis=1)

Target= df['>50K']
Features.head()
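
The manual comparison of correlation thresholds described above could be automated along the following lines. This is only a sketch on synthetic data: the `demo` frame, its column names, and the list of thresholds are all assumptions, and logistic regression stands in for whichever model is being tuned.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the processed table: two informative columns,
# two noise columns, and a binary target (all names are hypothetical)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
noise = rng.normal(scale=0.5, size=n)
target = ((demo["signal_a"] + demo["signal_b"] + noise) > 0).astype(int)

# Correlation of every candidate column with the target
demo_cors = demo.corrwith(target)

results = {}
for threshold in [0.0, 0.05, 0.1, 0.2]:
    # Keep the columns whose absolute correlation exceeds the threshold
    kept = demo_cors[demo_cors.abs() > threshold].index.tolist()
    if not kept:
        continue
    scores = cross_val_score(LogisticRegression(), demo[kept], target, cv=5)
    results[threshold] = (len(kept), scores.mean())

for threshold, (n_kept, mean_acc) in results.items():
    print(f"threshold={threshold}: {n_kept} columns, mean accuracy={mean_acc:.3f}")
```

Because the correlations here are computed on the full data before cross-validation, the scores are slightly optimistic; computing them inside each training fold would be the stricter variant.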

## Test/Train Split

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(Features, Target, random_state=0, train_size=0.7)

## Null Error Rate

In [None]:
1-Target.mean()

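This means that we can correctly predict about 75% of our dataset just by always picking "<=50K". Strictly speaking, this figure is the null accuracy, so the null error rate is about 25%. Hopefully we can create a model that does better than that.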
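
The same baseline can be reproduced with scikit-learn's `DummyClassifier`. The sketch below uses a made-up target with a 75/25 split (the arrays `X` and `y` are assumptions for illustration, not the project data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical target with a 75/25 imbalance: 1 means income >50K
y = np.array([0] * 75 + [1] * 25)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Share of the majority class, computed the same way as 1 - Target.mean()
null_accuracy = 1 - y.mean()

# strategy="most_frequent" always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))

print(null_accuracy, baseline_accuracy)
```

Both numbers agree, which confirms that `1 - Target.mean()` is the accuracy of a majority-class guess rather than its error rate.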

## Gaussian Naive-Bayes Classifier

A good starting classifier because it trains very quickly. However, its accuracy score wasn't much higher than the 75% baseline from always predicting "<=50K".

In [None]:
model1 = GaussianNB()
model1.fit(Xtrain, ytrain)

predicted=model1.predict(Xtest)
accuracy_score(ytest, predicted)

In [None]:
scores=cross_val_score(model1, Features, Target, cv=5)
scores

In [None]:
print("mean: ",scores.mean())

## Random Forest Classifier

In [None]:
model2 = RandomForestClassifier(n_estimators=100)
model2.fit(Xtrain, ytrain)

predicted=model2.predict(Xtest)
accuracy_score(ytest, predicted)

In [None]:
scores=cross_val_score(model2, Features, Target, cv=5)
scores

In [None]:
print("mean: ",scores.mean())

## Feature Importance for Random Forest

In [None]:
importances= model2.feature_importances_

d={'Feature Names': Features.columns, 'Importance': importances}
features = pd.DataFrame(data=d)
# DataFrame.sort was removed from pandas; sort_values is the replacement
features.sort_values('Importance', ascending=False)

Importance of features in the Random Forest Classifier. We see that "age", "married", and "education" are the most influential.
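
For readers who want to try the ranking step in isolation, here is a minimal sketch on synthetic data. The frame `demo_X`, its column names, and the label rule are hypothetical; only the pattern of pairing `feature_importances_` with column names and sorting comes from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only 'informative' drives the label (names are hypothetical)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "informative": rng.normal(size=400),
    "noise_1": rng.normal(size=400),
    "noise_2": rng.normal(size=400),
})
demo_y = (demo_X["informative"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(demo_X, demo_y)

# Pair each importance with its column name and sort from most to least important
ranking = pd.Series(forest.feature_importances_, index=demo_X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

The importances are normalized to sum to 1, so each value can be read as a share of the forest's total impurity reduction.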

## Logistic Regression
This ended up being my highest performing classifier. Logistic regression is well suited to binary classification. Although the concept isn't the most complicated, it gets the job done well.

In [None]:
model3 = LogisticRegression()

model3.fit(Xtrain, ytrain)
predicted=model3.predict(Xtest)

accuracy_score(ytest, predicted)

In [None]:
scores=cross_val_score(model3, Features, Target, cv=5)
scores

In [None]:
print("mean: ",scores.mean())

In [None]:
print(metrics.classification_report(ytest, predicted))

In [None]:
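# one row per feature with its fitted coefficient (coef_ has shape (1, n_features) for a binary target)
coefs=pd.DataFrame({'Feature': Features.columns, 'Coefficient': model3.coef_[0]})
coefs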

Coefficients of the Logistic Regression. We see that "Married", "Occupation", and "Education" are influential factors.

## Support Vector Machine
Takes a very long time to train and didn't give the best results.

In [None]:
model4 = svm.SVC(kernel='rbf')
model4.fit(Xtrain, ytrain)

predicted=model4.predict(Xtest)
accuracy_score(ytest, predicted)

In [None]:
scores=cross_val_score(model4, Features, Target, cv=3)
print(scores)

print("mean: ",scores.mean())

# Conclusion

Our most successful models are Random Forests and Logistic Regression. Out of the two, I would go with logistic regression because it takes less time to train.

Our other 2 classifiers were only slightly less accurate. Using any of the 4 models, we would expect to see about 80% accuracy given the inputs of the model. We see that age, education, marital status, and occupation are the most influential factors, depending on which model you decide to use.

Given more time, I would experiment with interaction terms, normalize the features, and come up with a more rigorous approach to deciding which features to include.
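
As a starting point for that follow-up work, interaction terms and feature normalization could be combined in a scikit-learn pipeline. The sketch below is one possible arrangement on synthetic data, not something that was run for this project; the arrays `X_demo` and `y_demo` and the choice of `PolynomialFeatures` with `StandardScaler` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with columns on very different scales (hypothetical)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(40, 12, size=300),          # an age-like column
    rng.normal(190000, 100000, size=300),  # an fnlwgt-like column
])
y_demo = (X_demo[:, 0] > 40).astype(int)

# interaction_only=True adds pairwise products without squared terms;
# StandardScaler then rescales each column to zero mean and unit variance
pipeline = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(),
)

# The scaler is refit inside each fold, so no test-fold statistics leak in
scores_demo = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(scores_demo.mean())
```

Wrapping the preprocessing in the pipeline keeps the cross-validation honest, because every transformation is learned from the training folds only.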