# <font color=red>SMU DS 7331 DATA MINING - LAB 2 CLASSIFICATION</font>

**Team Members:**
- YuMei Bennett
- Liang Huang
- Ganesh Kodi
- Eric McCandless

## <font color=blue>DATA PREPARATION PART 1</font>

**Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.**

Initial prep of data:

In [76]:
# import all necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#Read in dataset.
col_names = ['age', 'employ_type', 'pop_num', 'edu_level', 'edu_years', 'marital', 'occ', 'relation', 'race', 'gender', 'cap_gain', 'cap_loss', 'hours_week', 'country_orig', 'income']
df = pd.read_csv('adult.csv', names=col_names, header=None)

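# Replace the "?" placeholder for missing values with "Other_cat"
# (regex=False so that "?" is matched literally instead of being parsed as a regex quantifier)
df['employ_type'] = df['employ_type'].str.replace('?', 'Other_cat', regex=False)
df['occ'] = df['occ'].str.replace('?', 'Other_cat', regex=False)
df['country_orig'] = df['country_orig'].str.replace('?', 'Other_cat', regex=False)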

# Binary encoding of the target variable
df['income'] = df['income'].apply(lambda inc: 0 if inc ==" <=50K" else 1) 

#Transform employ_type into multiple columns with 0 and 1
# use pd.concat to join the new columns with original dataframe then drop the original 'employ_type' column (don't need it anymore)
df = pd.concat([df,pd.get_dummies(df['employ_type'], prefix='emp')],axis=1)
df.drop(['employ_type'],axis=1, inplace=True)

#Transform gender into multiple columns with 0 and 1
# use pd.concat to join the new columns with original dataframe then drop the original 'gender' column (don't need it anymore)
df = pd.concat([df,pd.get_dummies(df['gender'], prefix='gen')],axis=1)
df.drop(['gender'],axis=1, inplace=True)

#Transform race into multiple columns with 0 and 1
# use pd.concat to join the new columns with original dataframe then drop the original 'race' column (don't need it anymore)
df = pd.concat([df,pd.get_dummies(df['race'], prefix='rac')],axis=1)
df.drop(['race'],axis=1, inplace=True)

#Transform education_level into multiple columns with 0 and 1
# use pd.concat to join the new columns with original dataframe then drop the original 'edu_level' column (don't need it anymore)
df = pd.concat([df,pd.get_dummies(df['edu_level'], prefix='edu')],axis=1)
df.drop(['edu_level'],axis=1, inplace=True)

# Consolidate education levels, because many of them have a similar relationship with the income target.
df['edu_ SomeCollege'] = df['edu_ Some-college'] + df['edu_ Assoc-acdm'] + df['edu_ Assoc-voc']
df['<HS'] = (df['edu_ 12th'] + df['edu_ 11th'] + df['edu_ 10th'] + df['edu_ 9th']
             + df['edu_ 7th-8th'] + df['edu_ 5th-6th'] + df['edu_ 1st-4th'] + df['edu_ Preschool'])
df = df.drop(['edu_ Some-college', 'edu_ Assoc-acdm', 'edu_ Assoc-voc', 'edu_ 12th', 'edu_ 11th', 'edu_ 10th',
              'edu_ 9th', 'edu_ 7th-8th', 'edu_ 5th-6th', 'edu_ 1st-4th', 'edu_ Preschool'], axis=1)

# Drop edu_years, as it is highly correlated with edu_level.
df = df.drop(['edu_years'], axis=1)

#Transform relation into multiple columns with 0 and 1
# use pd.concat to join the new columns with original dataframe then drop the original 'relation' column (don't need it anymore)
df = pd.concat([df,pd.get_dummies(df['relation'], prefix='rel')],axis=1)
df.drop(['relation'],axis=1, inplace=True)

#Transform marital into multiple columns with 0 and 1
# use pd.concat to join the new columns with original dataframe then drop the original 'marital' column (don't need it anymore)
df = pd.concat([df,pd.get_dummies(df['marital'], prefix='mar')],axis=1)
df.drop(['marital'],axis=1, inplace=True)

# Consolidate marital status because there are too many similar categories: Married-civ-spouse and Married-AF-spouse are similar, as are the non-married categories.
df['Married'] = df['mar_ Married-civ-spouse'] + df['mar_ Married-AF-spouse']
df['Sep_Div_Absent_Wid'] = df['mar_ Divorced'] + df['mar_ Separated'] + df['mar_ Widowed'] + df['mar_ Married-spouse-absent']
df['Never_Married'] = df['mar_ Never-married']
df = df.drop(['mar_ Married-civ-spouse', 'mar_ Married-AF-spouse', 'mar_ Divorced', 'mar_ Separated', 'mar_ Widowed', 'mar_ Married-spouse-absent', 'mar_ Never-married'], axis=1)

#Transform occ into multiple columns with 0 and 1
# use pd.concat to join the new columns with original dataframe then drop the original 'occ' column (don't need it anymore)
df = pd.concat([df,pd.get_dummies(df['occ'], prefix='occu')],axis=1)
df.drop(['occ'],axis=1, inplace=True)

# Consolidate occupation by combining 'Other-service', 'Other_cat', and 'Armed-Forces'. The 'Other' categories are combined because they are not defined, and Armed-Forces has an extremely small number of occurrences.
df['occu_ Other'] = df['occu_ Other-service'] + df['occu_ Other_cat'] + df['occu_ Armed-Forces']
df = df.drop(['occu_ Other-service', 'occu_ Other_cat', 'occu_ Armed-Forces'], axis=1)

# Drop pop_num: the population number is an assigned index number, so it has no meaning for, or contribution to, the income target.
df = df.drop(['pop_num'], axis=1)

# Combine all non-U.S. countries of origin, since only ~10% of people are not from the U.S.; encode as binary (1 = United-States)
df['country_orig'] = df['country_orig'].apply(lambda inc: 1 if inc == " United-States" else 0)

# Merge cap_gain and cap_loss into a single net feature: cap_gain-loss = cap_gain - cap_loss.
df['cap_gain-loss'] = df['cap_gain'] - df['cap_loss']
df = df.drop(['cap_gain', 'cap_loss'], axis=1)
df.head(10)

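| | age | hours_week | country_orig | income | emp_ Federal-gov | emp_ Local-gov | emp_ Never-worked | emp_ Other_cat | emp_ Private | emp_ Self-emp-inc | ... | occu_ Handlers-cleaners | occu_ Machine-op-inspct | occu_ Priv-house-serv | occu_ Prof-specialty | occu_ Protective-serv | occu_ Sales | occu_ Tech-support | occu_ Transport-moving | occu_ Other | cap_gain-loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 40 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2174 |
| 1 | 50 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 38 | 40 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 53 | 40 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 28 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 37 | 40 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 49 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 52 | 45 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 31 | 50 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 14084 |
| 9 | 42 | 40 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5178 |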
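
The encode-then-consolidate pattern used in the preparation cell can be seen in isolation on a small example. This is only a minimal sketch: the five-row `toy` frame and the `result` frame are hypothetical stand-ins for `adult.csv`, and `dtype=int` is passed because recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# Hypothetical toy data; values keep the leading space found in the raw file.
toy = pd.DataFrame({"marital": [" Married-civ-spouse", " Divorced",
                                " Never-married", " Married-AF-spouse", " Widowed"]})

# One-hot encode, forcing integer 0/1 columns.
dummies = pd.get_dummies(toy["marital"], prefix="mar", dtype=int)

# Consolidate related categories by summing their mutually exclusive indicators.
married = dummies[["mar_ Married-civ-spouse", "mar_ Married-AF-spouse"]].sum(axis=1)
sep_div_wid = dummies[["mar_ Divorced", "mar_ Widowed"]].sum(axis=1)

result = pd.DataFrame({"Married": married,
                       "Sep_Div_Absent_Wid": sep_div_wid,
                       "Never_Married": dummies["mar_ Never-married"]})
print(result)
```

Because each row belongs to exactly one original category, every row of `result` has a single 1, which is a quick sanity check that no category was dropped or double counted during consolidation.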


## <font color=blue>DATA PREPARATION PART 2</font>

**Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).**

In [77]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 50 columns):
age                        32561 non-null int64
hours_week                 32561 non-null int64
country_orig               32561 non-null int64
income                     32561 non-null int64
emp_ Federal-gov           32561 non-null uint8
emp_ Local-gov             32561 non-null uint8
emp_ Never-worked          32561 non-null uint8
emp_ Other_cat             32561 non-null uint8
emp_ Private               32561 non-null uint8
emp_ Self-emp-inc          32561 non-null uint8
emp_ Self-emp-not-inc      32561 non-null uint8
emp_ State-gov             32561 non-null uint8
emp_ Without-pay           32561 non-null uint8
gen_ Female                32561 non-null uint8
gen_ Male                  32561 non-null uint8
rac_ Amer-Indian-Eskimo    32561 non-null uint8
rac_ Asian-Pac-Islander    32561 non-null uint8
rac_ Black                 32561 non-null uint8
rac_ Other                 

In [82]:
num_high_income = sum(df['income']!=0)
print (num_high_income)

7841


In [44]:
from numpy import count_nonzero

# calculate sparsity: the fraction of cells in the dataset that are zero
sparsity = 1.0 - ( count_nonzero(df) / float(df.size) )
print(sparsity)
# only about 20.5% of the values are non-zero, so the dataset is quite sparse

0.7946678541813826


In [52]:
# Separate the target from the features, without mutating df
from types import SimpleNamespace

ds = SimpleNamespace(
    target=df['income'].values,               # binary labels: 1 = income >50K, 0 = <=50K
    data=df.drop(columns=['income']).values,  # feature matrix as a NumPy array
)

print('features shape:', ds.data.shape)  # (number of instances, number of features)
print('target shape:', ds.target.shape)

features shape: (32561, 49)
target shape: (32561,)


## <font color=blue>MODELING AND EVALUATION 1</font>

**Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.**

In [None]:
# We gauge model effectiveness with the F-measure (F1 score), for two reasons:
# 1. The dataset is imbalanced: only ~24% of instances (7,841 of 32,561) have a target value of 1 (>50K).
# 2. The F-measure combines precision and recall, and for the "income" target there is
#    no benefit to favoring one over the other.
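
To make the imbalance argument concrete, the sketch below scores a hypothetical baseline that always predicts class 0 (<=50K). It uses only the row count and positive count reported earlier in this notebook; the baseline itself is an illustration, not one of the models evaluated here.

```python
# Class counts reported in this notebook: 32,561 rows, 7,841 with income == 1.
n_total = 32561
n_pos = 7841
n_neg = n_total - n_pos

# Hypothetical baseline: a classifier that always predicts 0 (<=50K).
tp, fp, fn, tn = 0, 0, n_pos, n_neg

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
# F1 = 2TP / (2TP + FP + FN); it is 0 here because no positive is ever found.
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"positive share: {n_pos / n_total:.3f}")
print(f"baseline accuracy: {accuracy:.3f}, recall: {recall:.3f}, F1: {f1:.3f}")
```

The baseline reaches roughly 76% accuracy while never identifying a single high-income instance, which is why accuracy alone is a weak yardstick for this dataset and F1 is more informative.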

In [70]:
#from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score


#cv = StratifiedShuffleSplit(ds.target, n_iter = 1, test_size = 0.5, train_size=0.5)
data=ds.data
labels=ds.target
sss = StratifiedShuffleSplit(n_splits=2,test_size = 0.5, train_size=0.5)
iter_num=0
for train_index, test_index in sss.split(data, labels):
    x_train, x_test = data[train_index], data[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    
    # train the reusable KNN classifier on the training data       
    clf = KNeighborsClassifier(n_neighbors=8, weights='uniform', metric='euclidean')
    clf.fit(x_train, y_train)
    y_hat = clf.predict(x_test) 

    # compute accuracy, precision, recall, F1 and the confusion matrix for this iteration of training/testing
    acc = accuracy_score(y_test,y_hat)
    precision = precision_score(y_test, y_hat)
    recall = recall_score(y_test, y_hat)
    f1=f1_score(y_test, y_hat)
    conf = confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("f1 score", f1 )
    print("precision score", precision )
    print("recall score", recall )
    print("confusion matrix\n",conf)
    iter_num+=1




====Iteration 0  ====
accuracy 0.8466310423192679
f1 score 0.6323074657635105
precision score 0.7480836236933798
recall score 0.5475643968375414
confusion matrix
 [[11637   723]
 [ 1774  2147]]
====Iteration 1  ====
accuracy 0.8475523616485474
f1 score 0.6283318358790058
precision score 0.7609720710917665
recall score 0.535067584799796
confusion matrix
 [[11701   659]
 [ 1823  2098]]
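
The scores printed above can be re-derived by hand from the confusion matrix, which helps when reading the output. The sketch below does this for iteration 0, assuming scikit-learn's layout for `confusion_matrix` (rows are true classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`).

```python
# Confusion matrix of iteration 0 above: [[TN, FP], [FN, TP]].
tn, fp = 11637, 723
fn, tp = 1774, 2147

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted >50K, how many really are
recall = tp / (tp + fn)      # of the real >50K, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Recall is the weak spot: the 1,774 false negatives mean that a little under half of the true >50K instances in the test split are missed, even though accuracy looks high.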


## <font color=blue>MODELING AND EVALUATION 2</font>

**Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.**

In [None]:
# We tried 10-fold cross-validation, and the results of all 10 rounds were remarkably similar, so we do not
# believe 10-fold validation is needed for this dataset.
# We also normalized the data, but that did not produce any noticeable model improvement. The most likely reason is
# the nature of the dataset: it is large, only a few features are continuous inputs, and the rest are already
# one-hot encoded, so data normalization does not play a big role here. Another reason is that, in the mini lab
# with the same dataset, we saw that no single feature is significantly important in predicting the target.
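
One detail worth keeping in mind when interpreting the normalization result: `preprocessing.normalize` rescales each row (sample) to unit length, which is different from putting each feature on a common scale. The sketch below contrasts the two on made-up data; `X_demo` is hypothetical, and `StandardScaler` is shown only for comparison and is not what this notebook uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical two-feature rows; the second row is the first one doubled.
X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0]])

# normalize() works per row: each sample is divided by its own L2 norm,
# so both rows collapse onto the same unit vector.
row_normalized = normalize(X_demo)

# StandardScaler works per column: each feature gets zero mean and unit variance,
# so the two rows stay distinguishable.
col_standardized = StandardScaler().fit_transform(X_demo)

print(row_normalized)
print(col_standardized)
```

For a distance-based model such as KNN, the choice matters: row normalization removes magnitude differences between samples, while per-feature scaling keeps them and only balances the influence of each column.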

In [89]:
from sklearn import preprocessing
X = ds.data
y = ds.target
# normalize the data: preprocessing.normalize rescales each row (sample) to unit L2 norm
normalized_X = preprocessing.normalize(X)

Ndata=normalized_X
labels=y
sss = StratifiedShuffleSplit(n_splits=5,test_size = 0.5, train_size=0.5)
iter_num=0
for train_index, test_index in sss.split(Ndata, labels):
    x_train, x_test = Ndata[train_index], Ndata[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    
    # train the reusable KNN classifier on the training data       
    clf = KNeighborsClassifier(n_neighbors=8, weights='distance', metric='euclidean')
    clf.fit(x_train, y_train)
    y_hat = clf.predict(x_test) 

    # compute accuracy, precision, recall, F1 and the confusion matrix for this iteration of training/testing
    acc = accuracy_score(y_test,y_hat)
    precision = precision_score(y_test, y_hat)
    recall = recall_score(y_test, y_hat)
    f1=f1_score(y_test, y_hat)
    conf = confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("f1 score", f1 )
    print("precision score", precision )
    print("recall score", recall )
    print("confusion matrix\n",conf)
    iter_num+=1



====Iteration 0  ====
accuracy 0.8281432344450587
f1 score 0.6245303274288782
precision score 0.6590201076182385
recall score 0.5934710533027289
confusion matrix
 [[11156  1204]
 [ 1594  2327]]
====Iteration 1  ====
accuracy 0.8266077022295928
f1 score 0.6244512438472795
precision score 0.6526696329254728
recall score 0.598571792909972
confusion matrix
 [[11111  1249]
 [ 1574  2347]]
====Iteration 2  ====
accuracy 0.8288802899084823
f1 score 0.631578947368421
precision score 0.6558637736885471
recall score 0.6090283091048202
confusion matrix
 [[11107  1253]
 [ 1533  2388]]
====Iteration 3  ====
accuracy 0.8234752165100424
f1 score 0.6213438735177866
precision score 0.642681929681112
recall score 0.6013771996939556
confusion matrix
 [[11049  1311]
 [ 1563  2358]]
====Iteration 4  ====
accuracy 0.8269762299613046
f1 score 0.6260454002389485
precision score 0.6528239202657807
recall score 0.6013771996939556
confusion matrix
 [[11106  1254]
 [ 1563  2358]]
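
The claim that the splits behave almost identically can be quantified by summarizing the five F1 scores printed above. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

# F1 scores of the five stratified shuffle splits printed above.
f1_scores = [0.6245303274288782, 0.6244512438472795, 0.631578947368421,
             0.6213438735177866, 0.6260454002389485]

f1_mean = mean(f1_scores)
f1_std = stdev(f1_scores)  # sample standard deviation (n - 1 in the denominator)

print(f"F1 mean = {f1_mean:.4f}, std = {f1_std:.4f}")
```

A standard deviation well below 0.01 on a mean of about 0.63 supports the observation that additional splits add little new information for this dataset.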


## <font color=blue>MODELING AND EVALUATION 3</font>

**Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!**

In [None]:
# We investigated k values from 2 to 20; this did not make a significant difference and, in some cases,
# produced a less favorable result. So we keep k=8: large enough to discount noise, small enough not to
# waste resources, while still producing a model of the same quality.
# We also investigated stratified 10-fold cross-validation.
# Open questions:
# - Should we change the distance metric? Will the computer be able to handle it?
# - Should we change the weights?
# - What else?
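
The open questions about k, the distance metric and the weighting scheme could be explored together with a grid search scored on F1. The following is a sketch under stated assumptions, not the procedure used in this lab: `X_demo` and `y_demo` are synthetic data from `make_classification` standing in for `ds.data` and `ds.target`, and the grid values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (assumption: roughly 24% positives, as in this dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.76, 0.24], random_state=0)

# Illustrative grid: even k from 2 to 20, both weighting schemes, two distance metrics.
param_grid = {
    "n_neighbors": list(range(2, 21, 2)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `search.cv_results_` would hold the mean F1 for every combination, which makes it possible to check whether any setting is clearly better than the current `k=8` configuration before spending more compute.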

## <font color=blue>MODELING AND EVALUATION 4</font>

**Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.**

## <font color=blue>MODELING AND EVALUATION 5</font>

**Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.**

## <font color=blue>MODELING AND EVALUATION 6</font>

**Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.**

## <font color=blue>DEPLOYMENT</font>

**How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?**

## <font color=blue>EXCEPTIONAL WORK</font>

**You have free rein to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?**