## Problem Statement
<b>In this assignment, students predict whether a person earns more than 50K per year, using XGBoost on the classic Adult dataset. The dataset is described below.</b>

<b>Data Set Information:</b>
<br>Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).

<b>Attribute Information:</b>
<br>Listing of attributes:
<br>target: >50K, <=50K.
<br>age: continuous.
<br>workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
<br>fnlwgt: continuous.
<br>education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
<br>education-num: continuous.
<br>marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
<br>occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
<br>relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
<br>race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
<br>sex: Female, Male.
<br>capital-gain: continuous.
<br>capital-loss: continuous.
<br>hours-per-week: continuous.
<br>native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

<b>Following is the code to load required libraries and data:</b>

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score
import xgboost as xgb

In [2]:
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)
test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', skiprows = 1, header = None)

In [3]:
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status','occupation','relationship',
              'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'wage_class']
train_set.columns = col_labels
test_set.columns = col_labels

In [4]:
train_set.shape

(32561, 15)

In [5]:
test_set.shape

(16281, 15)
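
The raw files put a space after every comma, so each string value is loaded with a leading space (hence the `' ?'` and `' >50K'` literals in later cells). The following is a minimal sketch of how `skipinitialspace=True` and `na_values='?'` in `pd.read_csv` could strip that space and mark missing values at load time. It assumes a small hypothetical in-memory sample, trimmed to five columns, in place of the UCI files. The rest of this notebook does not use this approach and keeps the leading spaces.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the adult.data layout, trimmed to five columns.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, <=50K\n"
    "50, ?, 83311, HS-grad, >50K\n"
    "38, Private, 215646, HS-grad, <=50K\n"
)
cols = ['age', 'workclass', 'fnlwgt', 'education', 'wage_class']

# skipinitialspace drops the blank after each comma; na_values turns '?' into NaN.
sample_df = pd.read_csv(sample, header=None, names=cols,
                        skipinitialspace=True, na_values='?')

print(sample_df['wage_class'].tolist())
print(sample_df['workclass'].isnull().sum())
```

With the spaces stripped at load time, later comparisons could use `'>50K'` and `'United-States'` directly instead of the space-prefixed literals.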

In [6]:
train_set.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
wage_class        0
dtype: int64

In [7]:
# Flag training and test rows, then combine both sets so they are cleaned together.
train_set['train_ind'] = 1
test_set['train_ind'] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
combined_data = pd.concat([train_set, test_set])

In [8]:
combined_data.shape

(48842, 16)

In [9]:
combined_data.describe()

             age     fnlwgt  education_num  capital_gain  capital_loss  hours_per_week  train_ind
count    48842.0    48842.0        48842.0       48842.0       48842.0         48842.0    48842.0
mean   38.643585   189664.1      10.078089   1079.067626     87.502314       40.422382    0.66666
std     13.71051   105604.0       2.570973   7452.019058    403.004552       12.391444   0.471412
min         17.0    12285.0            1.0           0.0           0.0             1.0        0.0
25%         28.0   117550.5            9.0           0.0           0.0            40.0        0.0
50%         37.0   178144.5           10.0           0.0           0.0            40.0        1.0
75%         48.0   237642.0           12.0           0.0           0.0            45.0        1.0
max         90.0  1490400.0           16.0       99999.0        4356.0            99.0        1.0


In [10]:
combined_data.describe(include = 'O')

        workclass  education      marital_status      occupation  relationship   race    sex  native_country  wage_class
count       48842      48842               48842           48842         48842  48842  48842           48842       48842
unique          9         16                   7              15             6      5      2              42           4
top       Private    HS-grad  Married-civ-spouse  Prof-specialty       Husband  White   Male   United-States       <=50K
freq        33906      15784               22379            6172         19716  41762  32650           43832       24720


In [11]:
# Replace ' ?' placeholders (stored with a leading space) with NaN.
df1 = combined_data.replace(' ?', np.nan)

In [12]:
df1.isnull().sum()

age                  0
workclass         2799
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     857
wage_class           0
train_ind            0
dtype: int64

In [13]:
#fill nulls with the value 'unknown'
df1.fillna(' unknown', inplace = True)

In [14]:
df1['wage_class'].unique()

array([' <=50K', ' >50K', ' <=50K.', ' >50K.'], dtype=object)

In [15]:
# wage_class has 4 distinct values because the labels in the test file carry a trailing '.'.
# Map ' >50K' and ' >50K.' to 1, and ' <=50K' and ' <=50K.' to 0.
df1['target_variable'] = df1['wage_class'].isin([' >50K', ' >50K.']).astype(int)

In [16]:
df1['target_variable'].value_counts()

0    37155
1    11687
Name: target_variable, dtype: int64
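
The four labels differ only by the leading space and by the trailing period that the test file appends. As an alternative sketch (not the method used in this notebook), the labels could be normalized with string methods before encoding. The `labels` series below is a hypothetical stand-in for `df1['wage_class']`.

```python
import pandas as pd

# Hypothetical stand-in for df1['wage_class'], covering the four observed variants.
labels = pd.Series([' <=50K', ' >50K', ' <=50K.', ' >50K.'])

# Strip surrounding whitespace, then drop the trailing period used in adult.test.
normalized = labels.str.strip().str.rstrip('.')

# Encode >50K as 1 and <=50K as 0.
target = (normalized == '>50K').astype(int)

print(sorted(normalized.unique()))
print(target.tolist())
```

Normalizing first leaves exactly two label values, so the encoding step no longer has to list every variant.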

In [17]:
#One Hot Encoding

In [18]:
dummies_relationship = pd.get_dummies(df1['relationship'], prefix = 'relationship')
dummies_workclass = pd.get_dummies(df1['workclass'], prefix = 'workclass')
dummies_education = pd.get_dummies(df1['education'], prefix = 'education')
dummies_marital_status = pd.get_dummies(df1['marital_status'], prefix = 'marital_status')
dummies_occupation = pd.get_dummies(df1['occupation'], prefix = 'occupation')
dummies_race = pd.get_dummies(df1['race'], prefix = 'race')
dummies_sex = pd.get_dummies(df1['sex'], prefix = 'sex')

In [19]:
df1 = pd.concat([df1,dummies_relationship,dummies_workclass,dummies_education,
                 dummies_marital_status,dummies_occupation,dummies_race,dummies_sex],axis = 1)

In [20]:
df1.drop(['relationship','workclass', 'education', 'marital_status', 'occupation', 'race', 'sex',
          'wage_class'], axis = 1, inplace = True)

In [21]:
df1.shape

(48842, 69)
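
The encoding in cells 18 to 20 can also be written as a single call: `pd.get_dummies` accepts a `columns` argument, encodes each listed column using its name as the prefix, and drops the original columns. The following is a minimal sketch on a hypothetical toy frame whose values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df1; values are illustrative only.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': [' Male', ' Female', ' Male'],
    'race': [' White', ' Black', ' White'],
})

# Encode the listed columns in place of the originals; other columns are kept as-is.
encoded = pd.get_dummies(toy, columns=['sex', 'race'])

print(encoded.columns.tolist())
print(encoded.shape)
```

Because the leading space is part of each value, it also appears in the generated column names (for example `sex_ Male`).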

In [22]:
# Collapse native_country into a binary flag: 1 for United-States, 0 for every other country (including unknown).
df1['country'] = (df1['native_country'] == ' United-States').astype(int)
df1.drop('native_country', axis = 1, inplace = True)

In [23]:
# Split the combined data back into the final training and test sets.
# .copy() gives each subset its own data, so the in-place drops below do not raise SettingWithCopyWarning.
final_train_set = df1[df1["train_ind"] == 1].copy()
final_test_set = df1[df1["train_ind"] == 0].copy()

In [24]:
final_train_set.drop('train_ind', axis =1, inplace = True)
final_test_set.drop('train_ind', axis =1, inplace = True)

In [25]:
final_train_set.shape

(32561, 68)

In [26]:
final_test_set.shape

(16281, 68)

In [27]:
# Split the training data into training and validation sets.

In [28]:
final_test_set_y = final_test_set.pop('target_variable')

In [29]:
final_test_set_X = final_test_set

In [30]:
y = final_train_set.pop('target_variable')

In [31]:
X = final_train_set

In [32]:
Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size = 0.2, random_state = 1982)
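
About 24% of the combined rows belong to the >50K class (11687 of 48842), so a purely random split can shift the class ratio slightly between the training and validation data. As an optional variation that this notebook does not use, `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both splits. The following is a minimal sketch on hypothetical toy arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 75 negatives and 25 positives.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 75 + [1] * 25)

# stratify=y_toy preserves the 3:1 class ratio in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1982, stratify=y_toy)

print(y_tr.mean(), y_va.mean())
```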

In [33]:
# Train XGBoost on the training split, track AUC on the validation split, then compare against the AUC on the test set.

In [34]:
xgtrain = xgb.DMatrix(Xtrain, label = ytrain)
xgval = xgb.DMatrix(Xval, label = yval)
xgtest = xgb.DMatrix(final_test_set_X)

In [35]:
watchlist = [(xgtrain,'train'),(xgval, 'eval')]

In [36]:
params = {}
params["objective"] =  "binary:logistic"
params["booster"] = "gbtree"
params["max_depth"] = 7
params["eval_metric"] = 'auc'
params["subsample"] = 0.9
params["colsample_bytree"] = 0.9
params["verbosity"] = 0  # "silent" was replaced by "verbosity" in newer XGBoost releases
params["seed"] = 4
params["eta"] = 0.1

plst = list(params.items())

In [37]:
num_rounds = 500
model_cv = xgb.train(plst, xgtrain, num_rounds, evals = watchlist, early_stopping_rounds = 10, 
                     verbose_eval = True)

[0]	train-auc:0.907192	eval-auc:0.89853
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 10 rounds.
[1]	train-auc:0.909089	eval-auc:0.899711
[2]	train-auc:0.916455	eval-auc:0.906834
[3]	train-auc:0.916949	eval-auc:0.907187
[4]	train-auc:0.918477	eval-auc:0.908451
[5]	train-auc:0.920223	eval-auc:0.910784
[6]	train-auc:0.920477	eval-auc:0.910036
[7]	train-auc:0.921532	eval-auc:0.911106
[8]	train-auc:0.922063	eval-auc:0.911469
[9]	train-auc:0.922357	eval-auc:0.911525
[10]	train-auc:0.923359	eval-auc:0.912061
[11]	train-auc:0.924452	eval-auc:0.912493
[12]	train-auc:0.924947	eval-auc:0.912952
[13]	train-auc:0.925404	eval-auc:0.913145
[14]	train-auc:0.925752	eval-auc:0.913346
[15]	train-auc:0.926366	eval-auc:0.913575
[16]	train-auc:0.926849	eval-auc:0.914162
[17]	train-auc:0.927316	eval-auc:0.914654
[18]	train-auc:0.927868	eval-auc:0.914761
[19]	train-auc:0.928212	eval-auc:0.914949
[20]	train-auc:0.928706	eval-a
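
The message "Will train until eval-auc hasn't improved in 10 rounds" describes a patience rule: training stops once the validation metric has gone 10 rounds without beating its best value. A pure-Python sketch of that rule follows. `best_round` is a hypothetical helper written for illustration, not XGBoost's internal implementation, and `eval_auc` holds the first eight eval-auc readings from the log above.

```python
def best_round(scores, patience):
    """Return (best_index, stop_index) for a higher-is-better metric.

    stop_index is the round at which the patience rule halts, or the
    last round if the metric kept improving often enough.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i                      # new best value, reset the wait
        elif i - best_idx >= patience:
            return best_idx, i                # waited `patience` rounds without gain
    return best_idx, len(scores) - 1


# First eight eval-auc values from the training log.
eval_auc = [0.89853, 0.899711, 0.906834, 0.907187,
            0.908451, 0.910784, 0.910036, 0.911106]

print(best_round(eval_auc, patience=10))
print(best_round([0.90, 0.91, 0.905, 0.904, 0.903], patience=2))
```

In the logged rounds the metric dips once at round 6 and recovers at round 7, so a patience of 10 does not trigger a stop there.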

<b>We are getting an AUC value of 92.6% for the test set.</b>

In [53]:
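# ntree_limit and best_ntree_limit were deprecated and later removed in newer XGBoost releases;
# iteration_range selects the same trees, up to and including the best iteration.
test_pred = model_cv.predict(xgtest, iteration_range=(0, model_cv.best_iteration + 1))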

In [54]:
print(roc_auc_score(final_test_set_y,test_pred))

0.9275510240353322
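
The value returned by `roc_auc_score` can be read as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one, with ties counted as one half. The following pure-Python sketch applies that definition to hypothetical labels and scores (not the notebook's predictions). `pairwise_auc` is an illustrative helper, and its quadratic loop is only practical for small inputs.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))   # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75.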


<b>The final check against final_test_set gives an AUC of 92.75%, in line with the 92.6% reported above, which suggests the model generalizes well to unseen data.</b>