Problem Statement

This data was extracted from the census bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html
Donor: Ronny Kohavi and Barry Becker,

Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk@sgi.com for questions.

Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
48842 instances, mix of continuous and discrete (train=32561, test=16281)
45222 if instances with unknown values are removed (train=30162, test=15060)
Duplicate or conflicting instances : 6
Class probabilities for adult.all file
Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
Extraction was done by Barry Becker from the 1994 Census database. A set of
reasonably clean records was extracted using the following conditions:
((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

Conversion of original data as follows:
1. Discretized gross income into two ranges with threshold 50,000.
2. Convert U.S. to US to avoid periods.
3. Convert Unknown to "?"
4. Run MLC++ GenCVFiles to generate data, test.
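
The first three conversion steps could be reproduced in pandas roughly as follows. This is a minimal sketch on made-up rows: the column names `agi` and `country` and the helper `convert_raw` are hypothetical, not names from the original extraction, and treating exactly 50,000 as the lower class is an assumption based on the '<=50K' label.

```python
import pandas as pd

def convert_raw(df):
    """Apply conversion steps 1-3 to a raw extract (hypothetical column names)."""
    out = df.copy()
    # Step 1: discretize gross income into two ranges with threshold 50,000.
    out["Salary"] = out["agi"].gt(50_000).map({True: ">50K", False: "<=50K"})
    # Step 2: convert "U.S." to "US" to avoid periods (literal match, not a regex).
    out["country"] = out["country"].str.replace("U.S.", "US", regex=False)
    # Step 3: convert Unknown to "?".
    out = out.replace("Unknown", "?")
    return out.drop(columns="agi")

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "agi": [32_000, 50_000, 81_500],
    "country": ["U.S.", "Unknown", "Outlying-U.S."],
})
converted = convert_raw(raw)
print(converted)
```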

Description of fnlwgt (final weight)
The weights on the CPS files are controlled to independent estimates of the civilian
noninstitutional population of the US. These are prepared monthly for us by Population
Division here at the Census Bureau. We use 3 sets of controls.
These are:
1. A single cell estimate of the population 16+ for each state.
2. Controls for Hispanic Origin by age and sex.
3. Controls by Race, age and sex.
We use all three sets of controls in our weighting program and "rake" through them 6
times so that by the end we come back to all the controls we used.
The term estimate refers to population totals derived from CPS by creating "weighted
tallies" of any specified socio-economic characteristics of the population. People with
similar demographic characteristics should have similar weights. There is one important
caveat to remember about this statement. That is that since the CPS sample is actually a
collection of 51 state samples, each with its own probability of selection, the statement
only applies within state.
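
The "rake" step described above is commonly implemented as iterative proportional fitting: weights are rescaled to match one set of control totals, then the next, and the cycle is repeated. A minimal sketch with two made-up control sets follows. The real CPS procedure uses three sets and its own cell definitions; the function name `rake` and all numbers here are hypothetical.

```python
import numpy as np

def rake(weights, groups_list, controls_list, passes=6):
    """Iterative proportional fitting of sample weights to several control totals.

    groups_list[k][i] is the cell index of person i under control set k;
    controls_list[k][c] is the target population total for cell c.
    Assumes every cell contains at least one person.
    """
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(passes):
        for groups, controls in zip(groups_list, controls_list):
            # Current weighted tally per cell, then rescale each cell to its control.
            tallies = np.bincount(groups, weights=w, minlength=len(controls))
            w *= (np.asarray(controls, dtype=float) / tallies)[groups]
    return w

# Toy sample of 6 people: sex (0/1) and age band (0/1/2).
sex = np.array([0, 0, 0, 1, 1, 1])
age = np.array([0, 1, 2, 0, 1, 2])
start = np.ones(6)
raked = rake(start, [sex, age], [[480, 520], [300, 400, 300]])
print(np.bincount(sex, weights=raked))
print(np.bincount(age, weights=raked))
```

After raking, the weighted tallies by sex and by age band both match their control totals, which is the sense in which the procedure "comes back to all the controls".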

Dataset Link
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/

Problem 1:
Prediction task is to determine whether a person makes over 50K a year.

Problem 2:
Which factors are important

Problem 3:
Which algorithms are best for this dataset

In [1]:
#import libraries 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [2]:
cols = ['Age','Workclass','FinalWeight','Education','EduNumber','MaritalStatus','Job','Family','Race','Gender','CapitalGain','CapitalLoss','HrsWeek','NativeCountry','Salary']

In [3]:
data_ = pd.read_csv("adult.data.csv", names=cols, sep=', ', engine='python')  # multi-character separators need the python engine
test_ = pd.read_csv("adult.test.csv",names=cols)
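
The raw UCI files separate fields with a comma followed by a space. One way to read that layout without a multi-character separator is `skipinitialspace=True`, which also removes the leading blank from every string value. A minimal sketch on two inline rows follows; the `sample` text is illustrative and the second row is made up. Note that stripping the blanks would affect later cells that compare against `' >50K'` or select dummy columns such as `Workclass_ Private`, since those assume the leading space is kept.

```python
import io
import pandas as pd

cols = ['Age','Workclass','FinalWeight','Education','EduNumber','MaritalStatus','Job','Family','Race','Gender','CapitalGain','CapitalLoss','HrsWeek','NativeCountry','Salary']

# Two inline rows standing in for the file contents.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)
# skipinitialspace drops the blank after each comma; na_values marks "?" as missing.
df = pd.read_csv(sample, names=cols, skipinitialspace=True, na_values="?")
print(df[["Age", "Workclass", "Salary"]])
```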

In [4]:
data_.head()

Unnamed: 0,Age,Workclass,FinalWeight,Education,EduNumber,MaritalStatus,Job,Family,Race,Gender,CapitalGain,CapitalLoss,HrsWeek,NativeCountry,Salary
0,"""39",State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,"<=50K"""
1,"""50",Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,"<=50K"""
2,"""38",Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,"<=50K"""
3,"""53",Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,"<=50K"""
4,"""28",Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,"<=50K"""


In [None]:
data_.loc[data_.Salary == " >50K","CapitalGain"].shape

In [None]:
data_.loc[data_.Salary != " >50K","CapitalGain"].shape

In [11]:
#drop first row with bad data
test_.drop(0, inplace=True)
test_.reset_index(drop=True,inplace=True)

In [12]:
data_.isnull().any()

Age              False
Workclass        False
FinalWeight      False
Education        False
EduNumber        False
MaritalStatus    False
Job              False
Family           False
Race             False
Gender           False
CapitalGain      False
CapitalLoss      False
HrsWeek          False
NativeCountry    False
Salary           False
dtype: bool
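
`isnull()` reports no missing values here because, as noted in the problem statement, unknowns were converted to "?" rather than left empty. A small sketch of counting those markers per column, on a made-up frame (the leading space in the values mirrors how other cells compare against `' >50K'`):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Workclass": [" Private", " ?", " State-gov", " ?"],
    "Job": [" Sales", " ?", " ?", " ?"],
    "HrsWeek": [40, 20, 38, 45],
})

# Convert every column to text, strip blanks, and count exact "?" markers.
unknown_counts = toy.apply(lambda s: s.astype(str).str.strip().eq("?").sum())
print(unknown_counts)
```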

In [13]:
target =data_.Salary

In [14]:
#find important features 
data_['Education'] = data_['Education'].str.strip()

In [15]:
data_.Education.unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

In [None]:
data_.loc[(data_.Salary ==' >50K')&(data_.Education == 'Masters'),:].shape[0]/data_.shape[0]

In [None]:
for p in data_.Workclass.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.Workclass == p),:].shape[0]/data_.shape[0]))

In [None]:
for p in data_.Education.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.Education == p),:].shape[0]/data_.shape[0]))
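
The loops above print, for each category value, the fraction of all rows that both have that value and earn more than 50K. The same numbers can be obtained in one step with a grouped sum. A minimal sketch on made-up rows; `joint_share` is a hypothetical helper name:

```python
import pandas as pd

def joint_share(df, column, label=" >50K"):
    """Fraction of ALL rows that have each value of `column` and the given Salary label."""
    is_high = df["Salary"].eq(label)
    # Summing booleans counts the matching rows in each group.
    return is_high.groupby(df[column]).sum() / len(df)

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Education": ["Masters", "Masters", "HS-grad", "HS-grad", "HS-grad"],
    "Salary": [" >50K", " <=50K", " >50K", " <=50K", " <=50K"],
})
shares = joint_share(toy, "Education")
print(shares)
```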

In [19]:
#classify Education by EduNumber into groups: <Bachelors, Bachelors, Advanced degrees, to see if more information can be derived
data_.loc[data_['Education'] == 'Bachelors']

Unnamed: 0,Age,Workclass,FinalWeight,Education,EduNumber,MaritalStatus,Job,Family,Race,Gender,CapitalGain,CapitalLoss,HrsWeek,NativeCountry,Salary
0,"""39",State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,"<=50K"""
1,"""50",Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,"<=50K"""
4,"""28",Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,"<=50K"""
9,"""42",Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,">50K"""
11,"""30",State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,">50K"""
12,"""23",Private,122272,Bachelors,13,Never-married,Adm-clerical,Own-child,White,Female,0,0,30,United-States,"<=50K"""
25,"""56",Local-gov,216851,Bachelors,13,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,40,United-States,">50K"""
32,"""45",Private,386940,Bachelors,13,Divorced,Exec-managerial,Own-child,White,Male,0,1408,40,United-States,"<=50K"""
41,"""53",Self-emp-not-inc,88506,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,"<=50K"""
42,"""24",Private,172987,Bachelors,13,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,50,United-States,"<=50K"""


In [20]:
#EduNumber below 13 -> 'NoDegree', 13 -> 'Bachelors', above 13 (Masters, Prof-school, Doctorate) -> 'AdvDegree'
def f(row):
    if row['EduNumber'] < 13:
        val = 'NoDegree'
    elif row['EduNumber'] == 13:
        val = 'Bachelors'
    else:
        val = 'AdvDegree'
    return val

In [21]:
data_['EduClass'] = data_.apply(f, axis=1)
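
Row-wise `apply` calls the Python function once per record. The same three-way grouping can be written as a vectorized expression; the sketch below uses `np.select` on a made-up `EduNumber` column and is meant as an equivalent alternative, not a change in the grouping itself.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"EduNumber": [9, 13, 14, 16, 7]})

# Conditions are checked in order; rows matching none fall through to the default.
conditions = [toy["EduNumber"] < 13, toy["EduNumber"] == 13]
choices = ["NoDegree", "Bachelors"]
toy["EduClass"] = np.select(conditions, choices, default="AdvDegree")
print(toy)
```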

In [None]:
#this looks like more useful information
for p in data_.EduClass.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.EduClass == p),:].shape[0]/data_.shape[0]))

In [23]:
data_.drop('Education',axis=1,inplace=True)

In [24]:
data_.drop('EduNumber',axis=1,inplace=True)

In [25]:
#classify HrsWeek
def hrs(row):
    if row['HrsWeek'] < 40:
        val = 'PartTime'
    elif row['HrsWeek'] == 40:
        val = 'FullTime'
    else:
        val = 'WorksALot'
    return val

In [26]:
data_['WorkRate'] = data_.apply(hrs, axis=1)

In [None]:
#difference between PartTime and Fulltime
for p in data_.WorkRate.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.WorkRate == p),:].shape[0]/data_.shape[0]))

In [28]:
data_.drop('HrsWeek',axis=1,inplace=True)

In [29]:
#classify Age
def age_(row):
    if row['Age'] < 20:
        val = 'LessThan20'
    elif row['Age'] >= 20 and row['Age'] <30:
        val = 'Twenties'
    elif row['Age'] >= 30 and row['Age'] <40:
        val = 'Thirties'
    elif row['Age'] >= 40 and row['Age'] <50:
        val = 'Forties'
    elif row['Age'] >= 50 and row['Age'] <60:
        val = 'Fifties'
    elif row['Age'] >= 60 and row['Age'] <70:
        val = 'Sixties'
    else:
        val = 'Elderly'
    return val

In [None]:
data_['AgeClass'] = data_.apply(age_, axis=1)
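
Decade bands like these can also be expressed with `pd.cut`, which states the bin edges explicitly so that every age falls in exactly one band. A minimal sketch on made-up ages; the edges assume the intended bands are under 20, 20-29, 30-39, 40-49, 50-59, 60-69, and 70 or older.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around the band boundaries.
ages = pd.Series([17, 20, 25, 39, 40, 59, 60, 69, 70, 90])

edges = [-np.inf, 20, 30, 40, 50, 60, 70, np.inf]
labels = ["LessThan20", "Twenties", "Thirties", "Forties", "Fifties", "Sixties", "Elderly"]
# right=False makes each bin closed on the left: [20, 30), [30, 40), ...
age_class = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_class.tolist())
```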

In [None]:
#not a good feature
for p in data_.AgeClass.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.AgeClass == p),:].shape[0]/data_.shape[0]))

In [32]:
data_.drop('Age',axis=1,inplace=True)

In [None]:
for p in data_.Job.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.Job == p),:].shape[0]/data_.shape[0]))

In [None]:
for p in data_.Family.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.Family == p),:].shape[0]/data_.shape[0]))

In [None]:
for p in data_.Gender.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.Gender == p),:].shape[0]/data_.shape[0]))

In [None]:
for p in data_.Race.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.Race == p),:].shape[0]/data_.shape[0]))

In [None]:
for p in data_.NativeCountry.unique():
    print(p + "  {0:.6f}".format(data_.loc[(data_.Salary ==' >50K')&(data_.NativeCountry == p),:].shape[0]/data_.shape[0]))

Problem #2: Which factors are important

The following factors are considered important: those whose share computed above is >= 0.10 (10%).

Workclass.Private

Workclass.Local-gov

MaritalStatus.Married-civ-spouse

Family.Husband

Gender.Male

Race.White

NativeCountry.United-States

EduClass.NoDegree

WorkRate.WorksALot

So there are 9 important factors that are above 10%. 
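
The share used above divides by the total number of rows, so it grows with the size of a category as well as with its proportion of high earners. A complementary view, sketched here on made-up rows, is the rate within each category (the fraction of that category's rows labelled >50K). This is offered as an additional check, not as the measure the notebook used.

```python
import pandas as pd

# Made-up rows: a large category with few high earners, a small one with many.
toy = pd.DataFrame({
    "Workclass": [" Private"] * 8 + [" Self-emp-inc"] * 2,
    "Salary": [" >50K", " >50K"] + [" <=50K"] * 6 + [" >50K", " >50K"],
})
is_high = toy["Salary"].eq(" >50K")

joint = is_high.groupby(toy["Workclass"]).sum() / len(toy)   # share of all rows
within = is_high.groupby(toy["Workclass"]).mean()            # rate inside each category
print(pd.DataFrame({"joint": joint, "within": within}))
```

In this toy data both categories have the same joint share, while the within-category rates differ sharply.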

In [38]:
#drop unimportant columns
data_.drop('FinalWeight',axis=1,inplace=True)
data_.drop('CapitalGain',axis=1,inplace=True)
data_.drop('CapitalLoss',axis=1,inplace=True)

In [39]:
#transform the test data
test_.drop('FinalWeight',axis=1,inplace=True)
test_.drop('CapitalGain',axis=1,inplace=True)
test_.drop('CapitalLoss',axis=1,inplace=True)
test_['WorkRate'] = test_.apply(hrs, axis=1)
test_['EduClass'] = test_.apply(f, axis=1)

In [40]:
#cast Age in test data from string to int
test_.Age = pd.to_numeric(test_.Age, errors='coerce')

In [41]:
test_['AgeClass'] = test_.apply(age_, axis=1)

In [42]:
test_.drop('Age',axis=1,inplace=True)
test_.drop('Education',axis=1,inplace=True)
test_.drop('EduNumber',axis=1,inplace=True)
test_.drop('HrsWeek',axis=1,inplace=True)

In [None]:
#convert Gender values into numbers.
#Male =1, Female = 0
dfGender=data_.Gender
dfGender=dfGender.str.strip()
dfGenTest=test_.Gender
dfGenTest=dfGenTest.str.strip()
ser1 = pd.Series(np.where(dfGender == 'Male', 1,0))
ser2 = pd.Series(np.where(dfGenTest == 'Male', 1,0))
dfGender = pd.DataFrame(data=ser1,columns=['Gender'])
dfGenTest = pd.DataFrame(data=ser2, columns=['Gender'])

In [None]:
#transform Salary columns  <=50K = 1, >50K = 0
dfSalary = data_.Salary.str.strip()
#labels in the UCI adult.test file end with a period; remove it if present
dfSalTest = test_.Salary.str.strip().str.rstrip('.')
dfSalary = pd.Series(np.where(dfSalary == '<=50K', 1,0))
dfSalTest = pd.Series(np.where(dfSalTest == '<=50K', 1,0))

In [45]:
data_.drop('Salary',axis=1,inplace=True)
test_.drop('Salary',axis=1,inplace=True)

In [None]:
#create dummy variables 
dfStrEncode = pd.get_dummies(data=data_)
#astype returns a new frame, so assign the result back
dfStrEncode = dfStrEncode.astype('int32')
dfStrEncTest = pd.get_dummies(data=test_)
dfStrEncTest = dfStrEncTest.astype('int32')
dfStrEncTest.tail()

In [47]:
#insert a column in test data with all zeros as the test data is missing an entry for NativeCountry_ Holand-Netherlands
a = np.zeros(shape=(len(dfStrEncTest),1))
dfHoland = pd.DataFrame(a,columns=['NativeCountry_ Holand-Netherlands'])

In [None]:
dfStrEncode.columns.get_loc('NativeCountry_ Holand-Netherlands')

In [49]:
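#put data in column order: split the test columns at the position the missing column has in the training data
pos = dfStrEncode.columns.get_loc('NativeCountry_ Holand-Netherlands')
temp1 = dfStrEncTest.iloc[:,0:pos]
temp2 = dfStrEncTest.iloc[:,pos:]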

In [50]:
dfStrEncTest2 = pd.concat([temp1,dfHoland,temp2],axis=1)
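
When train and test dummies differ by a category, the test columns can also be aligned to the training columns by name instead of by position. A minimal sketch on made-up frames using `DataFrame.reindex` with `fill_value=0`; note that `reindex` would also drop any column that appears only in the test data.

```python
import pandas as pd

# Made-up frames: the test data lacks one category seen in training.
train = pd.DataFrame({"NativeCountry": [" Cuba", " Holand-Netherlands", " India"]})
test = pd.DataFrame({"NativeCountry": [" India", " Cuba"]})

train_enc = pd.get_dummies(train).astype("int32")
test_enc = pd.get_dummies(test).astype("int32")

# Add any column seen only in training (filled with 0) and match the column order.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```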

In [None]:
#combine dataframes and assign to testing variable
x_train = pd.concat([dfStrEncode,dfGender], axis=1)
x_test = pd.concat([dfStrEncTest2,dfGenTest], axis=1)
y_train = dfSalary
y_test = dfSalTest

In [None]:
# train the decision tree
dtree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=50)
dtree.fit(x_train, y_train)

In [None]:
y_pred = dtree.predict(x_test)

In [None]:
x_train.columns[np.where(dtree.feature_importances_!=0)]

In [None]:
len(dtree.feature_importances_)

In [None]:
#check accuracy
from sklearn import metrics
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))
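
Because roughly 76% of records are '<=50K', accuracy alone can look good for a model that rarely predicts '>50K'. A minimal sketch, on made-up labels, of comparing accuracy with the majority-class baseline and with the recall for each class (1 stands for '<=50K' and 0 for '>50K', matching the encoding used in this notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])   # made-up labels: six '<=50K', two '>50K'
y_hat = np.array([1, 1, 1, 1, 1, 1, 1, 0])    # made-up predictions

# Baseline: always predict the most frequent class.
baseline = max(np.mean(y_true == 1), np.mean(y_true == 0))
accuracy = accuracy_score(y_true, y_hat)
recall_low = recall_score(y_true, y_hat, pos_label=1)    # recall for '<=50K'
recall_high = recall_score(y_true, y_hat, pos_label=0)   # recall for '>50K'
print('Baseline: {:.2f}  Accuracy: {:.2f}'.format(baseline, accuracy))
print('Recall <=50K: {:.2f}  Recall >50K: {:.2f}'.format(recall_low, recall_high))
```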

In [None]:
#The model above used every column. Now use only the previously identified important columns
important_cols = ["Workclass_ Private","Workclass_ Local-gov","MaritalStatus_ Married-civ-spouse","Family_ Husband","Gender_ Male","Race_ White","NativeCountry_ United-States","EduClass_NoDegree","WorkRate_WorksALot"]
x_train2 = x_train[important_cols]
x_test2 = x_test[important_cols]

In [None]:
# train the decision tree
dtree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=50)
dtree.fit(x_train2, y_train)

In [None]:
y_pred = dtree.predict(x_test2)

In [None]:
#check accuracy again
from sklearn import metrics
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))

In [None]:
#check other models
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=10)

In [None]:
bag_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=10, max_samples=0.5,
                            bootstrap=True, random_state=3)

In [None]:
#bagging with the full data set
bag_knn.fit(x_train, y_train)
bag_knn.score(x_test, y_test)

In [None]:
#random forest
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf=RandomForestClassifier(n_estimators=20)
clf.fit(x_train,y_train)
y_pred2=clf.predict(x_test)

In [None]:
#check accuracy
count_misclassified = (y_test != y_pred2).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred2)
print('Accuracy: {:.2f}'.format(accuracy))

In [None]:
#check xgboost
import xgboost as xgb

In [None]:
#'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)

In [None]:
xg_reg.fit(x_train.values,y_train.values)
preds = xg_reg.predict(x_test.values)

In [None]:
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
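
RMSE from a regressor is not on the same scale as the accuracy reported for the classifiers. One way to put them side by side, sketched here with made-up numbers, is to threshold the regressor's outputs at 0.5 and score the resulting labels. The 0.5 cut-off is an assumption that treats the outputs as rough probabilities of class 1.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])                  # made-up 0/1 labels
scores = np.array([0.81, 0.35, 0.62, 0.44, 0.52])   # made-up regressor outputs

rmse = np.sqrt(np.mean((y_true - scores) ** 2))
labels = (scores >= 0.5).astype(int)                # 0.5 cut-off is an assumption
accuracy = np.mean(labels == y_true)
print("RMSE: %f  Accuracy: %.2f" % (rmse, accuracy))
```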

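Problem #3: Which algorithms are best for this dataset

Out of the methods tried, the random forest gave the highest accuracy. XGBoost was fit as a regressor and scored by RMSE, so its result is not directly comparable to the accuracy figures.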

END OF ASSIGNMENT