#  <span style="text-decoration:underline;">Assignment 3</span>: Predicting Salaries via Classification

## Introduction

In this assignment, you will be working with [US census data](https://raw.githubusercontent.com/lapets/course-data-science/master/assignment-2018-12-03-data.tsv). You can find the schema for the data set [here](https://raw.githubusercontent.com/lapets/course-data-science/master/assignment-2018-12-03-schema.txt). The data is stored in a tab-separated value file in which each line represents an individual person. The data set was extracted from the 1994 US census data. Your goal is to evaluate different models for determining whether a person has an annual salary that is $50,000 or above.  

## Submission

Please use the following invitation link to create your assignment repository for this assignment: [https://classroom.github.com/a/3Af24wlA](https://classroom.github.com/a/3Af24wlA). Include your BU username within your submission by adding it here: **asadeg02**.

Do not delete the output of your code cells. This assignment must be completed **individually** by each student.

## <span style="text-decoration:underline;">Problem 1</span>: Feature Dimension Representations

**<span style="text-decoration:underline;">Part A</span> (20 points):** This data set contains categorical values in its feature dimensions. Most of the algorithms that were presented during lectures can only handle numeric quantities. Thus, it is necessary to create a new feature dimension for every unique value of each categorical variable. To accomplish this, you can use `pandas.get_dummies`. An example is provided below.

In [2]:
import pandas as pd
raw_data = {'age': [23, 62, 31, 48, 59],
        'salary': [60000, 100000, 120000, 150000, 95000],
        'education': ['Bachelor', 'Masters', 'PhD', 'Jd', 'Masters']}
df = pd.DataFrame(raw_data, columns = ['age', 'salary', 'education'])
df_edu = pd.get_dummies(df['education'], prefix = 'edu')
df_new = pd.concat([df, df_edu], axis=1)
df_new = df_new.drop( ['education'], axis = 1 )
df_new

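| | age | salary | edu_Bachelor | edu_Jd | edu_Masters | edu_PhD |
|---|---|---|---|---|---|---|
| 0 | 23 | 60000 | 1 | 0 | 0 | 0 |
| 1 | 62 | 100000 | 0 | 0 | 1 | 0 |
| 2 | 31 | 120000 | 0 | 0 | 0 | 1 |
| 3 | 48 | 150000 | 0 | 1 | 0 | 0 |
| 4 | 59 | 95000 | 0 | 0 | 1 | 0 |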


Convert all categorical feature dimensions within the data set in this way, storing the result as a new data frame.
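
As a minimal sketch of the same idea applied to several columns at once, `pandas.get_dummies` also accepts a whole data frame together with a `columns` list. The toy frame and its values below are made up for illustration and are not taken from the census file.

```python
import pandas as pd

# Toy frame with two categorical columns (illustrative values only).
toy = pd.DataFrame({
    'age': [23, 62, 31],
    'education': ['Bachelor', 'Masters', 'PhD'],
    'sex': ['Male', 'Female', 'Male'],
})

# Encode several columns in one call. The original categorical columns are
# dropped automatically, and each new column is prefixed with its source name.
encoded = pd.get_dummies(toy, columns=['education', 'sex'], dtype=int)
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one level per variable, which avoids perfectly collinear indicator columns in models such as unregularized logistic regression.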

In [1]:
import pandas as pd
file_name = 'US_sensus_data.tsv'
column_names = ['age', 'workclass','fnlwgt','education','education-num','marital-status','occupation',
                'relationship', 'race','sex','capital-gain','capital-loss','hours-per-week','nativecountry','class']
df1 = pd.read_table(file_name, header = None, names = column_names, index_col=False)

# Clean the data: remove rows that have '?' as the value in any of these columns
df1 = df1[df1.workclass != '?']
df1 = df1[df1.occupation != '?']
df1 = df1[df1.nativecountry != '?']

categorical_variables = ['workclass', 'education', 'marital-status', 'occupation', 'relationship','race','sex','nativecountry']

def convert_to_numerical(df, categorical_variables):
    for variable in categorical_variables:
        df_variable = pd.get_dummies(df[variable], prefix = variable)
        df = pd.concat([df, df_variable], axis=1)
        df = df.drop([variable], axis = 1 )
    return df    
          
df1 = convert_to_numerical(df1, categorical_variables)

# Convert the class column to 0 for <=50K and 1 for >50K
df1['class'] = df1['class'].map(lambda x: 0 if x=='<=50K' else 1)
df1

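| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | class | workclass_Federal-gov | workclass_Local-gov | workclass_Private | ... | nativecountry_Portugal | nativecountry_Puerto-Rico | nativecountry_Scotland | nativecountry_South | nativecountry_Taiwan | nativecountry_Thailand | nativecountry_Trinadad&Tobago | nativecountry_United-States | nativecountry_Vietnam | nativecountry_Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 37 | 284582 | 14 | 0 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 49 | 160187 | 5 | 0 | 0 | 16 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 52 | 209642 | 9 | 0 | 0 | 45 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8 | 31 | 45781 | 14 | 14084 | 0 | 50 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | 42 | 159449 | 13 | 5178 | 0 | 40 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |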
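
The cleaning step in the cell above filters `'?'` out of three columns by name. A possible alternative, sketched here on made-up rows rather than the real file, is to let pandas treat `'?'` as a missing value while reading and then drop incomplete rows. The three-column layout and the sample values are assumptions for illustration only.

```python
import io
import pandas as pd

# Three made-up rows in a tab-separated layout (not taken from the real file).
sample = "39\tPrivate\t<=50K\n50\t?\t>50K\n28\tLocal-gov\t<=50K\n"

toy = pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                  names=['age', 'workclass', 'class'],
                  na_values='?', skipinitialspace=True)

# dropna() removes every row that has a missing value in any column.
cleaned = toy.dropna()
print(len(toy), len(cleaned))
```

This handles `'?'` in every column at once, so no column has to be listed explicitly.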


## <span style="text-decoration:underline;">Problem 2</span>: Classification Methods

For each part below, you may use either the data in its original form or the transformed representation of the data that you generated in Problem 1.

**<span style="text-decoration:underline;">Part A</span> (20 points):** Use $k$-nearest neighbors to implement an algorithm that predicts whether an individual has an annual salary of 50,000 dollars or above. Note that the target feature dimension is discrete; you may use a boolean value or $\{0,1\}$. Explain how you chose $k$ and report the accuracy of your model on the data. Using `KNeighborsClassifier` is permitted.

In [2]:
#################### readme ####################################################
# For the first two parts I tried several approaches to computing scores.
# The first approach shuffles the data and uses the first 80% of the records
# as the training set. For k-NN, the scores obtained from the different
# approaches are almost the same.
#
# To choose k, I tried k-NN with different values of k on the same training set;
# k = 25 gave a reasonable score.
################################################################################

from sklearn.neighbors import KNeighborsClassifier
import sklearn.utils as utils

labels = df1['class']  # store the class column as a separate Series

data = df1.drop(columns=['class'])#remove the class column from data

(data, labels) = utils.shuffle(data, labels, random_state=1)#shuffling data

num_rows, num_columns = (df1.shape)#getting total number of data entries in data set

train_set_size = int((num_rows * 80)/100)
test_set_size = num_rows - train_set_size
data_train = data[:train_set_size]
labels_train = labels[:train_set_size]
data_test = data[train_set_size:]
labels_test = labels[train_set_size:]

k = 25
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(data_train, labels_train)

labels_pred_test = knn.predict(data_test)
print('Accuracy on test data: {}'.format(knn.score(data_test, labels_test)))

y_pred_train = knn.predict(data_train)
print('Accuracy on training data: {}'.format(knn.score(data_train, labels_train)))

y_pred_data = knn.predict(data)
print('Accuracy on entire data: {}'.format(knn.score(data, labels)))


Accuracy on test data: 0.791710205503309
Accuracy on training data: 0.7986764193660746
Accuracy on entire data: 0.7972831765935214
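
The readme above says that $k$ was picked by trying several values. One way to make that search reproducible is a cross-validated grid search on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the census features, and the candidate values of $k$ are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census features (assumption: not the real data).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Candidate k values are arbitrary; only the training split is searched.
candidate_k = [1, 5, 15, 25, 35]
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': candidate_k}, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_['n_neighbors']
held_out_accuracy = search.score(X_test, y_test)  # uses the refitted best model
print('best k:', best_k)
print('held-out accuracy:', held_out_accuracy)
```

Keeping the test split out of the search means the reported test accuracy is not influenced by the choice of $k$.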


In [3]:
################ readme ##################################
# Here the scores are obtained by calling cross_val_score.
# With this approach the score cannot be reported separately for the train and test data.
################################################################

import sklearn.model_selection as cross_validation
import sklearn.utils as utils

#first shuffle the data:
(data, labels) = utils.shuffle(data, labels, random_state=1)
k = 25
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_validation.cross_val_score(knn, data, labels, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.79057349 0.79463664 0.79600604 0.79133767 0.7970274 ]
Accuracy: 0.79 (+/- 0.01)
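
Because k-NN relies on distances, a column with very large values (such as `fnlwgt`) can dominate the neighbor search. A common remedy is to standardize the features inside a pipeline so that the scaler is fitted on each training fold only. The sketch below uses synthetic data with one artificially inflated feature; whether scaling helps on the census data is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the census data).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X[:, 0] *= 100000  # inflate one feature to mimic a large-valued column

raw_knn = KNeighborsClassifier(n_neighbors=25)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=25))

# Five-fold cross-validated accuracy for both variants.
raw_scores = cross_val_score(raw_knn, X, y, cv=5)
scaled_scores = cross_val_score(scaled_knn, X, y, cv=5)
print('unscaled: %0.2f' % raw_scores.mean())
print('scaled:   %0.2f' % scaled_scores.mean())
```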


In [4]:
################ readme ##################################
# Here I draw random splits with train_test_split,
# run k-NN five times, and report the average score.
################################################################
import numpy as np

test_size = 0.2
k = 25
knn = KNeighborsClassifier(n_neighbors=k)

def get_scores(data, labels, estimator, test_size):
    scores = {}
    scores['test_scores'] = []
    scores['train_scores'] = []
    for i in range(0,5):
        (X_train, X_test, y_train, y_test) = cross_validation.train_test_split(data, labels, test_size=test_size)
        estimator.fit(X_train, y_train)
        scores['test_scores'].append(estimator.score(X_test, y_test))
        scores['train_scores'].append(estimator.score(X_train, y_train))
    return scores 

scores = get_scores(data, labels, knn, test_size)
test_scores = np.asarray(scores['test_scores'])
train_scores = np.asarray(scores['train_scores'])
knn_mean_score_on_tests = test_scores.mean()
knn_mean_score_on_trains = train_scores.mean()
print(scores)
print(" knn Accuracy on test data: %0.2f (+/- %0.2f)" % (test_scores.mean(), test_scores.std() * 2))
print(" knn Accuracy on train data: %0.2f (+/- %0.2f)" % (train_scores.mean(), train_scores.std() * 2))
   

{'test_scores': [0.7989086264948334, 0.7958899338209683, 0.7991408336235922, 0.7893881342157204, 0.8022756298618368], 'train_scores': [0.797979797979798, 0.7973992801579008, 0.7978346685243237, 0.799373040752351, 0.7964123998606757]}
 knn Accuracy on test data: 0.80 (+/- 0.01)
 knn Accuracy on train data: 0.80 (+/- 0.00)


**<span style="text-decoration:underline;">Part B</span> (20 points):** Use [decision trees](http://scikit-learn.org/stable/modules/tree.html) to build a model that predicts the same target feature dimension (income of 50,000 or above). Report your accuracy and compare it to your results from part (a). Using `tree` from `sklearn` is permitted.

In [5]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
dtc.fit(data_train, labels_train)

y_pred_test = dtc.predict(data_test)
print('DT accuracy on test data: ', dtc.score(data_test, labels_test))

y_pred_train = dtc.predict(data_train)
print('DT accuracy on training data: {}'.format(dtc.score(data_train, labels_train)))

y_pred_data = dtc.predict(data)
print('DT accuracy on entire data: {}'.format(dtc.score(data, labels)))

DT accuracy on test data:  0.8036688726343899
DT accuracy on training data: 0.9998838964356206
DT accuracy on entire data: 0.9606408916753745


In [6]:
dtc = tree.DecisionTreeClassifier()
dtc.fit(data_train, labels_train)

scores = cross_validation.cross_val_score(dtc, data, labels, cv=5)
print(scores)
print("DTC Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.81448804 0.81274669 0.80726808 0.81444496 0.81351602]
DTC Accuracy: 0.81 (+/- 0.01)


In [7]:
test_size = 0.2
dtc = tree.DecisionTreeClassifier()
dtc.fit(data_train, labels_train)


scores = get_scores(data, labels, dtc, test_size)
test_scores = np.asarray(scores['test_scores'])
train_scores = np.asarray(scores['train_scores'])
dtc_mean_score_on_tests = test_scores.mean()
dtc_mean_score_on_train = train_scores.mean()
print(scores)
print(" dtc Accuracy on test data: %0.2f (+/- %0.2f)" % (test_scores.mean(), test_scores.std() * 2))
print(" dtc Accuracy on train data: %0.2f (+/- %0.2f)" % (train_scores.mean(), train_scores.std() * 2))

##################### comparison with knn ########################
# The decision tree's accuracy on the training set is about 100%, whereas
# k-NN's accuracy on the training data is 80%. This is probably because the
# tree was grown on the training data without a depth limit, so it fits it almost exactly.
# The decision tree's accuracy on the test data is 0.81 and k-NN's is 0.80,
# so both have almost the same accuracy on the test data.
#########################################################

{'test_scores': [0.8110995007546732, 0.8073841866945315, 0.8036688726343899, 0.8102867758040172, 0.8101706722396378], 'train_scores': [0.9999419482178102, 0.9999419482178102, 0.9999129223267155, 0.9998838964356206, 0.9999419482178102]}
 dtc Accuracy on test data: 0.81 (+/- 0.01)
 dtc Accuracy on train data: 1.00 (+/- 0.00)
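
The gap between training and test accuracy above is typical of a tree grown without limits. A minimal sketch of how a depth limit changes that gap follows. It uses synthetic data, and the depth value 5 is an arbitrary assumption rather than a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption: not the census data).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in [None, 5]:  # None grows the tree until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print('max_depth=%s train=%0.2f test=%0.2f' % ((depth,) + results[depth]))
```

In practice `max_depth` could be chosen by cross-validation in the same way as $k$ for k-NN.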


**<span style="text-decoration:underline;">Part C</span> (20 points):** Build a support vector machine model that solves the same problem. Report your accuracy and compare it to your results from parts (a) and (b). 

In [None]:
import sklearn.svm as svm
import sklearn.utils as utils

C=1.0
(data, labels) = utils.shuffle(data, labels, random_state=1)#shuffling data
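# Note: on a top-to-bottom run, train_set_size is computed from the test_set_size
# set in the k-NN cell (20% of the rows); the reassignment on the next line
# does not affect the split below.
train_set_size = num_rows - test_set_size
test_set_size = 43000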
data_train = data[:train_set_size]
labels_train = labels[:train_set_size]
data_test = data[train_set_size:]
labels_test = labels[train_set_size:]
svc = svm.SVC(kernel='rbf', C=C)
svc.fit(data_train, labels_train)
print("here")
#y_pred_test = svc.predict(data_test)
print("Accuracy of SVM on test set:", svc.score(data_test, labels_test))

'''y_pred_train = svc.predict(data_train)
print('Accuracy of SVM on training data: {}'.format(svc.score(data_train, labels_train)))

y_pred_data = svc.predict(data)
print('Accuracy of SVM on entire data: {}'.format(svc.score(data, labels)))
'''
############################## readme ####################################
# Results for this part are on the branch svc-results.
# SVM accuracy on test data: 0.75; SVM accuracy on train data: 0.98.
####################################################################
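
The reported SVM scores (0.98 on training data, 0.75 on test data) suggest overfitting. The RBF kernel depends on distances between points, so unscaled features with very different ranges can contribute to this. The sketch below shows one common setup, standardizing inside a pipeline before the SVM, on synthetic stand-in data. It is not the configuration used for the results above, and its effect on the census data is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the census data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The scaler is fitted on the training data only, then applied to both splits.
svm_model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
svm_model.fit(X_train, y_train)

train_acc = svm_model.score(X_train, y_train)
test_acc = svm_model.score(X_test, y_test)
print('SVM accuracy on train data: %0.2f' % train_acc)
print('SVM accuracy on test data: %0.2f' % test_acc)
```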



In [11]:
'''(data, labels) = utils.shuffle(data, labels, random_state=1)
scores = cross_validation.cross_val_score(svc, data, labels, cv=5)
print(scores)
print("SVC Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
'''


In [12]:
import sklearn.svm as svm
C=1.0
test_size = 30000
svc = svm.SVC(kernel='rbf', C=C)

scores = get_scores(data, labels, svc, test_size)
test_scores = np.asarray(scores['test_scores'])
train_scores = np.asarray(scores['train_scores'])
svm_mean_score_on_tests = test_scores.mean()
svm_mean_score_on_train = train_scores.mean()
print(scores)
print("svm Accuracy on test data: %0.2f (+/- %0.2f)" % (test_scores.mean(), test_scores.std() * 2))
print("svm Accuracy on train data: %0.2f (+/- %0.2f)" % (train_scores.mean(), train_scores.std() * 2))
#----------------------readme---------------------------
# Results from this part are on the branch svc-results.
#--------------------------------------------------------



**<span style="text-decoration:underline;">Part D</span> (20 points):** Build a logistic regression model that solves the same problem. Report your accuracy and compare it to your results from parts (a), (b), and (c). 

In [55]:
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
import sklearn.metrics as metrics

X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, labels,test_size=0.2)
logit = sm.Logit(y_train, X_train)
result = logit.fit(disp=False)

y_pred_test = result.predict(X_test)
y_pred_test = y_pred_test > 0.5

y_pred_train = result.predict(X_train)
y_pred_train = y_pred_train > 0.5

precision_test = metrics.precision_score(y_test, y_pred_test)
recall_test = metrics.recall_score(y_test, y_pred_test)
F1score_test = metrics.f1_score(y_test, y_pred_test)  
print("Precision on test data: " + str(precision_test))
print("recall on test data: " + str(recall_test))
print("F1 score on test data: " + str(F1score_test))

precision_train = metrics.precision_score(y_train, y_pred_train)
recall_train = metrics.recall_score(y_train, y_pred_train)
F1score_train = metrics.f1_score(y_train, y_pred_train) 
print("Precision on train data: " + str(precision_train))
print("recall on train data: " + str(recall_train))
print("F1 score on train data: " + str(F1score_train))

result.summary2()  # model summary; summary2's first argument is a display name (yname), not predictions


Precision on test data: 0.7306434023991276
recall on test data: 0.6121516674280494
F1 score on test data: 0.6661695252299279
Precision on train data: 0.738544474393531
recall on train data: 0.6118227758843577
F1 score on train data: 0.6692376912199511


| | | | |
|---|---|---|---|
| Model: | Logit | Pseudo R-squared: | 0.426 |
| Dependent Variable: | 30939 False 35449 False 12562 False ... 28085 True 414 True Length: 8613, dtype: bool | AIC: | 22317.2902 |
| Date: | 2018-12-07 22:17 | BIC: | 23128.2332 |
| No. Observations: | 34452 | Log-Likelihood: | -11063.0 |
| Df Model: | 95 | LL-Null: | -19258.0 |
| Df Residuals: | 34356 | LLR p-value: | 0.0 |
| Converged: | 0.0000 | Scale: | 1.0 |
| No. Iterations: | 35.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| age | 0.0262 | 0.0016 | 16.2780 | 0.0000 | 0.0230 | 0.0293 |
| fnlwgt | 0.0000 | 0.0000 | 4.8135 | 0.0000 | 0.0000 | 0.0000 |
| education-num | 0.2572 | | | | | |
| capital-gain | 0.0003 | 0.0000 | 32.4837 | 0.0000 | 0.0003 | 0.0004 |
| capital-loss | 0.0007 | 0.0000 | 18.2643 | 0.0000 | 0.0006 | 0.0007 |
| hours-per-week | 0.0293 | 0.0016 | 18.5573 | 0.0000 | 0.0262 | 0.0324 |
| workclass_Federal-gov | -1.0069 | 438218.1226 | -0.0000 | 1.0000 | -858892.7445 | 858890.7306 |
| workclass_Local-gov | -1.5374 | 467462.4713 | -0.0000 | 1.0000 | -916211.1453 | 916208.0705 |
| workclass_Private | -1.3828 | 471799.6333 | -0.0000 | 1.0000 | -924711.6720 | 924708.9063 |
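
The cell above reports precision, recall, and F1, while Part D asks for accuracy. A minimal sketch of computing accuracy from thresholded probabilities follows. It swaps in scikit-learn's `LogisticRegression` (which applies L2 regularization by default) instead of the `statsmodels` `Logit` model used above, and it runs on synthetic stand-in data, so the printed numbers say nothing about the census results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the census data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Threshold the predicted probability of the positive class at 0.5,
# mirroring the thresholding step in the cell above.
test_pred = (clf.predict_proba(X_test)[:, 1] > 0.5).astype(int)
train_pred = (clf.predict_proba(X_train)[:, 1] > 0.5).astype(int)

test_accuracy = accuracy_score(y_test, test_pred)
train_accuracy = accuracy_score(y_train, train_pred)
print('Accuracy on test data: %0.2f' % test_accuracy)
print('Accuracy on train data: %0.2f' % train_accuracy)
```

With the `statsmodels` result from the cell above, the same `accuracy_score` call could be applied to the thresholded `y_pred_test` and `y_pred_train` arrays.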


In [3]:
import pandas as pd
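# A separate name is used here so that the `data` feature frame is not overwritten.
comparison_scores = {"knn": [0.80, 0.80],
                     "dtc": [0.81, 1.00],
                     "svc": [0.75, 0.98],
                     # the logistic regression values are F1 scores, not accuracies
                     "logistic-regression (F1)": [0.67, 0.67]}
columns = ["test", "train"]
comparison_table = pd.DataFrame.from_dict(comparison_scores, orient='index', columns=columns)
comparison_table

| | test | train |
|---|---|---|
| knn | 0.8 | 0.8 |
| dtc | 0.81 | 1.0 |
| svc | 0.75 | 0.98 |
| logistic-regression (F1) | 0.67 | 0.67 |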
