# Introduction

## Can we predict whether someone is high income or low income?

To investigate this, we can use machine learning on University of Massachusetts adult income dataset.
This dataset (taken in 1995) records 13 demographic columns for workers such as age, education, marital status,
etc, and displays if the person makes > 50k per yer (high income) or less than 50k per year (low income).

This is a binary classification question.  We're trying to use demographic features to predict if a person is Class A
or Class B.  Here we will generate both a Logistic Regression and a Decision Tree model and compare their predictive accuracy. 

### Import Packages

In [51]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, roc_curve, auc
from sklearn.preprocessing import StandardScaler, LabelEncoder

#### Data Prep

We start with some simple data prep.  We download the dataset into a pandas data frame, and drop rows with missing data.

In [52]:
# download data set
# dataset describes an individual using a varierty of parameters and lists if 
# the person makes more than or less than 50k
income = pd.read_csv('http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data', header=None)
income.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [53]:
income.shape

(32561, 15)

In [54]:
# drop columns with missing rows
income = income.drop(income[income.workclass == '?'].index)
income = income.drop(income[income.occupation == '?'].index)
income = income.drop(income[income.native_country == '?'].index)

Next, we need to transform the data into a workable matrix.  
We'll label encode the discrete columns, then normalize the whole matrix

In [55]:
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [56]:
# turn categorical columns into labels
income['workclass'] = LabelEncoder().fit_transform(income['workclass'])
income['marital_status'] = LabelEncoder().fit_transform(income['marital_status'])
income['occupation'] = LabelEncoder().fit_transform(income['occupation'])
income['relationship'] = LabelEncoder().fit_transform(income['relationship'])
income['race'] = LabelEncoder().fit_transform(income['race'])
income['sex'] = LabelEncoder().fit_transform(income['sex'])
income['native_country'] = LabelEncoder().fit_transform(income['native_country'])
income['education'] = LabelEncoder().fit_transform(income['education'])
income['education_num'] = income['education']
y = LabelEncoder().fit_transform(income['income'])

In [57]:
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,7,77516,9,9,4,1,1,4,1,2174,0,40,39,<=50K
1,50,6,83311,9,9,2,4,0,4,1,0,0,13,39,<=50K
2,38,4,215646,11,11,0,6,1,4,1,0,0,40,39,<=50K
3,53,4,234721,1,1,2,6,0,2,1,0,0,40,39,<=50K
4,28,4,338409,9,9,2,10,5,2,0,0,0,40,5,<=50K


In [58]:
# normalizing numeric columns
scaler = StandardScaler()
income[['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']] = \
    scaler.fit_transform(income[['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race', 'sex',
                'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']])

In [59]:
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.030671,2.150579,-1.063611,-0.335437,-0.335437,0.921634,-1.317809,-0.277805,0.393668,0.703071,0.148453,-0.21666,-0.035429,0.291569,<=50K
1,0.837109,1.463736,-1.008707,-0.335437,-0.335437,-0.406212,-0.608387,-0.900181,0.393668,0.703071,-0.14592,-0.21666,-2.222153,0.291569,<=50K
2,-0.042642,0.09005,0.245079,0.181332,0.181332,-1.734058,-0.135438,-0.277805,0.393668,0.703071,-0.14592,-0.21666,-0.035429,0.291569,<=50K
3,1.057047,0.09005,0.425801,-2.402511,-2.402511,-0.406212,-0.135438,-0.900181,-1.962621,0.703071,-0.14592,-0.21666,-0.035429,0.291569,<=50K
4,-0.775768,0.09005,1.408176,-0.335437,-0.335437,-0.406212,0.810458,2.211698,-1.962621,-1.422331,-0.14592,-0.21666,-0.035429,-4.054223,<=50K


Now we have an input matrix populated entirely by normalized numbers!

We need to save the income column separately from our data frame, so one matrix is used as input data, and the other matrix (y is used for the label)

In [60]:
y
income = income.drop(columns='income')

### Model Training

Now we're ready to begin training the model.  We'll train a logisitic regression model and a decision tree model and compare their accuracy.

In [61]:
# split data into training and test set (using test size 20% as per the standard)
X_train, X_test, y_train, y_test = train_test_split(income, y, test_size=0.2)

# run logistic regression on data because we can explore the significance of each feature
clf = LogisticRegression(C=50. / len(X_train), multi_class='ovr', 
                         penalty='l1', solver='saga', tol=0.1)
clf.fit(X_train, y_train)
print('coefficients:')
print(clf.coef_) # each row of this matrix corresponds to each one of the classes of the dataset
clf_pred = clf.predict(X_test)

coefficients:
[[ 2.96727393e-01  9.81287571e-05  0.00000000e+00  5.96564102e-02
   5.96564102e-02 -2.24229638e-01  1.76003439e-02 -2.11901328e-01
   1.78662137e-05  1.89943067e-01  8.69719135e-01  1.77063540e-01
   3.36250731e-01  0.00000000e+00]]


In [62]:
# run random forest on the data for comparison
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

In [63]:
# present accuracy measures
print('Logistic Regression Predicition Accuracy: ', accuracy_score(y_test, clf_pred))
print('Decision Tree Predicition Accuracy: ', accuracy_score(y_test, dt_pred))

Logistic Regression Predicition Accuracy:  0.8022416705051436
Decision Tree Predicition Accuracy:  0.8086903116843237
