In [59]:
import pandas as pd 

1. Load the dataset, do basic data preprocessing, and split the dataset.

In [60]:
dataset = pd.read_csv('income.csv')
dataset.head()

Unnamed: 0,income,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week
0,0,39,State-gov,Bachelors,NotMarried,Adm-clerical,Not-in-family,White,Male,40
1,0,50,Self-emp-not-inc,Bachelors,Married,Exec-managerial,Husband,White,Male,13
2,0,38,Private,HS-grad,Separated,Handlers-cleaners,Not-in-family,White,Male,40
3,0,53,Private,11th,Married,Handlers-cleaners,Husband,Black,Male,40
4,0,28,Private,Bachelors,Married,Prof-specialty,Wife,Black,Female,40


(a) Describing dataset before preprocessing

In [61]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26215 entries, 0 to 26214
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   income          26215 non-null  int64 
 1   age             26215 non-null  int64 
 2   workclass       24819 non-null  object
 3   education       26215 non-null  object
 4   marital-status  26215 non-null  object
 5   occupation      24814 non-null  object
 6   relationship    26215 non-null  object
 7   race            26215 non-null  object
 8   sex             26215 non-null  object
 9   hours-per-week  26215 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 2.0+ MB


(b) Handling missing values

In [62]:
dataset = dataset.dropna()
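
Before dropping rows, it can help to see how many values are missing per column and how many rows `dropna()` would remove. The following is a minimal sketch on a made-up toy frame; the values and the names `toy`, `missing_per_column`, and `rows_dropped` are illustrative assumptions, not taken from income.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the income data; all values are made up.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", np.nan, "Private", "Private"],
    "occupation": ["Adm-clerical", "Sales", np.nan, "Sales"],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
rows_before = len(toy)
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)
print("rows dropped:", rows_dropped)
```

Comparing the row counts before and after makes the cost of dropping explicit, which matters when missing values are concentrated in one or two columns.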

(c) Remove duplicates

In [63]:
dataset = dataset.drop_duplicates()

(d) Handle categorical values

In [64]:
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation',
                       'relationship', 'race', 'sex']
# Print the distinct categories of each categorical column
for column in categorical_columns:
    print(f"{column}: ", dataset[column].unique())

workclass:  ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov'
 'Self-emp-inc' 'Without-pay']
education:  ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '12th' '10th' '5th-6th'
 '1st-4th' 'Preschool']
marital-status:  ['NotMarried' 'Married' 'Separated' 'Widowed']
occupation:  ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' 'Protective-serv'
 'Priv-house-serv' 'Armed-Forces']
relationship:  ['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']
race:  ['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
sex:  ['Male' 'Female']


In [65]:
dataset['sex'] = dataset['sex'].replace({
    'Male': 0, 
    'Female': 1
    })
dataset['education'] = dataset['education'].replace({
    'Preschool': 1, 
    '1st-4th': 2, 
    '5th-6th': 3, 
    '7th-8th': 4, 
    '9th': 5, 
    '10th': 6, 
    '11th': 7, 
    '12th': 8, 
    'HS-grad': 9, 
    'Some-college': 10, 
    'Assoc-voc': 11, 
    'Assoc-acdm': 12, 
    'Bachelors': 13, 
    'Masters': 14, 
    'Prof-school': 15, 
    'Doctorate': 16
})
dataset['race'] = dataset['race'].replace({
    'White': 0,
    'Black': 1,
    'Asian-Pac-Islander': 2,
    'Amer-Indian-Eskimo': 3,
    'Other': 4
    })
dataset['relationship'] = dataset['relationship'].replace({
    'Not-in-family': 0,
    'Husband': 1,
    'Wife': 2,
    'Own-child': 3,
    'Unmarried': 3,  # same code as 'Own-child', so these two categories are merged
    'Other-relative': 4
    })
dataset['occupation'] = dataset['occupation'].replace({
    'Adm-clerical': 0,
    'Exec-managerial': 1,
    'Handlers-cleaners': 2,
    'Prof-specialty': 3,
    'Other-service': 4,
    'Sales': 5,
    'Craft-repair': 6,
    'Transport-moving': 7,
    'Farming-fishing': 8,
    'Machine-op-inspct': 9,
    'Tech-support': 10,
    'Protective-serv': 11,
    'Priv-house-serv': 12,
    'Armed-Forces': 13
    })
dataset['marital-status'] = dataset['marital-status'].replace({
    'NotMarried': 0,
    'Married': 1,
    'Separated': 2,
    'Widowed': 3
})
dataset['workclass'] = dataset['workclass'].replace({
    'State-gov': 1,
    'Self-emp-not-inc': 2,
    'Private': 3,
    'Federal-gov': 4,
    'Local-gov': 5, 
    'Self-emp-inc': 6,
    'Without-pay': 7
})
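
A dictionary-based encoding silently leaves any category that is missing from the dictionary unchanged (with `replace`) or turns it into NaN (with `map`). The sketch below shows one way to guard against that using `Series.map`. The helper name `encode_column` is hypothetical, and the sketch assumes the column has no missing values, as is the case after `dropna()`.

```python
import pandas as pd

def encode_column(series, mapping):
    """Hypothetical helper: encode a column and fail on unmapped categories."""
    encoded = series.map(mapping)
    # Categories that were present in the data but absent from the mapping.
    unmapped = series[encoded.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"unmapped categories: {list(unmapped)}")
    return encoded.astype(int)

toy_sex = pd.Series(["Male", "Female", "Male"])
encoded_sex = encode_column(toy_sex, {"Male": 0, "Female": 1})
print(encoded_sex.tolist())

# A mapping that misses a category raises instead of passing silently.
try:
    encode_column(toy_sex, {"Male": 0})
    raised = False
except ValueError:
    raised = True
print("raised on incomplete mapping:", raised)
```

Integer codes impose an order, which suits `education` but is arbitrary for columns such as `race` or `occupation`; one-hot encoding is a common alternative for those.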


In [66]:
dataset.head()

Unnamed: 0,income,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week
0,0,39,1,13,0,0,0,0,0,40
1,0,50,2,13,1,1,1,0,0,13
2,0,38,3,9,2,2,0,0,0,40
3,0,53,3,7,1,2,1,1,0,40
4,0,28,3,13,1,3,2,1,1,40


(e) Split the dataset into training and testing

In [67]:
df = dataset.copy()

In [68]:
array = df.values
print(array)

[[ 0 39  1 ...  0  0 40]
 [ 0 50  2 ...  0  0 13]
 [ 0 38  3 ...  0  0 40]
 ...
 [ 0 27  3 ...  0  1 38]
 [ 0 58  3 ...  0  1 40]
 [ 1 52  6 ...  0  1 40]]


In [96]:
y = array[:, 0]
print(y)
X = array[:, 1:]
print(X)

[0 0 0 ... 0 0 1]
[[39  1 13 ...  0  0 40]
 [50  2 13 ...  0  0 13]
 [38  3  9 ...  0  0 40]
 ...
 [27  3 12 ...  0  1 38]
 [58  3  9 ...  0  1 40]
 [52  6  9 ...  0  1 40]]


In [70]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
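
With an imbalanced target, a plain random split can give the test set a different class ratio than the training set. The sketch below uses the `stratify` argument of `train_test_split` to preserve the ratio. The toy arrays are made up for illustration, and stratification is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with 80% zeros and 20% ones; the feature values are placeholders.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=1, stratify=y_toy
)

print("positive rate, train:", y_tr.mean())
print("positive rate, test:", y_te.mean())
```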

(f) Apply normalization on X_train & X_test

In [71]:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training split only so test-set statistics do not leak in
norm = MinMaxScaler().fit(X_train)

X_train = norm.transform(X_train)
X_test = norm.transform(X_test)
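
One way to keep held-out data out of the scaler's fit, including inside cross-validation, is to wrap the scaler and the model in a pipeline so the scaler is refit on each training fold. This is a sketch on synthetic data from `make_classification`; the names `X_demo`, `pipe`, and `cv` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the income features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1
)

# The scaler is fit only on the data passed to pipe.fit, fold by fold.
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
cv = KFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(pipe, X_tr, y_tr, cv=cv)
print("mean CV accuracy:", scores.mean())

# Final fit on the full training split, then a single evaluation on the test split.
pipe.fit(X_tr, y_tr)
test_score = pipe.score(X_te, y_te)
print("test accuracy:", test_score)
```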

2. Train and evaluate the 2 classification models on the training set with the 
cross-validation method, optimize the models and evaluate models on the 
test set.

(a) Define the two classification models, Logistic Regression and SVM, 
with their default settings

In [72]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

model1 = LogisticRegression()
model2 = SVC()

(b) Define 10-fold cross-validation to train and evaluate the two models 
based on the average score

In [73]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=2)

In [74]:
from sklearn.model_selection import cross_val_score

results1 = cross_val_score(model1, X_train, y_train, cv=kfold)
print("Average Accuracy of LR:",results1.mean())

results2 = cross_val_score(model2, X_train, y_train, cv=kfold)
print("Average Accuracy of SVM:",results2.mean())

Average Accuracy of LR: 0.7679406096468608
Average Accuracy of SVM: 0.8027646361603734
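
The mean alone hides how much the accuracy varies between folds. The sketch below reports the mean and standard deviation for both model types on synthetic data; the data and the `summary` dictionary are illustrative assumptions, and the numbers it prints say nothing about the income dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=2)

summary = {}
for name, model in [("LR", LogisticRegression()), ("SVM", SVC())]:
    # One accuracy score per fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between two means is smaller than the fold-to-fold spread, the ranking of the models is not well established.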


(c) Apply parameter finetuning steps to the two models separately to 
optimize the model performances and compare the cross-validated 
results before and after finetuning for each model.

Logistic Regression

In [75]:
from sklearn.model_selection import GridSearchCV

grid_params_lr = {
    'penalty': ['l1', 'l2'],
    'C': [1, 10],
    'solver': ['saga', 'liblinear']
}

lr = LogisticRegression(max_iter=150)
gs_lr_result = GridSearchCV(lr, grid_params_lr, cv=kfold).fit(X_train, y_train)
print(gs_lr_result.best_score_)

0.7681986075828773


In [76]:
test_accuracy = gs_lr_result.best_estimator_.score(X_test, y_test)
print("Accuracy in testing:", test_accuracy)

Accuracy in testing: 0.7469823584029712


In [77]:
gs_lr_result.best_params_

{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
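
Beyond `best_score_` and `best_params_`, the fitted search keeps the score of every candidate in `cv_results_`, which shows how close the alternatives were. A sketch on synthetic data follows; the smaller grid and the names `search` and `table` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=2)

# Reduced grid: liblinear supports both the l1 and l2 penalties.
param_grid = {"penalty": ["l1", "l2"], "C": [1, 10], "solver": ["liblinear"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv)
search.fit(X_demo, y_demo)

# One row per parameter combination, best candidate first.
results = pd.DataFrame(search.cv_results_)
table = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(table.sort_values("rank_test_score").to_string(index=False))
```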

SVM

In [78]:
grid_params_svc = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [1, 10],
    'degree': [3, 8],
    'gamma': ['auto','scale']
}

svc = SVC(max_iter=100000)
gs_svc_result = GridSearchCV(svc, grid_params_svc, cv=kfold, n_jobs=-1, verbose=1).fit(X_train, y_train)
print(gs_svc_result.best_score_)

Fitting 10 folds for each of 24 candidates, totalling 240 fits




0.8049831256842467


In [79]:
test_accuracy = gs_svc_result.best_estimator_.score(X_test, y_test)
print("Accuracy in testing:", test_accuracy)

Accuracy in testing: 0.797585886722377


In [80]:
gs_svc_result.best_params_

{'C': 10, 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf'}

(d) Evaluate the two optimized models (with the best parameter setting from 
the above step for each model type) on the test set, and compare the 
results with what you got from 2b.

In [81]:
model1_1 = LogisticRegression(
    C = 1,
    penalty = 'l1',
    solver = 'liblinear'
)
model2_2 = SVC(
    kernel= 'rbf',
    C= 10,
    degree= 3,
    gamma= 'scale'
)


In [85]:
results1_1 = cross_val_score(model1_1, X_train, y_train, cv=kfold)
print("Average Accuracy of LR_1:",results1_1.mean())

Average Accuracy of LR_1: 0.768250180558638


In [83]:
results2_2 = cross_val_score(model2_2, X_train, y_train, cv=kfold)
print("Average Accuracy of SVM_2:",results2_2.mean())

Average Accuracy of SVM_2: 0.8049831256842467


Comparison

In [86]:
print("LR has been optimized from accuracy: ", results1.mean(), "->", results1_1.mean(), '\n')
print("SVC has been optimized from accuracy: ", results2.mean(), "->", results2_2.mean())

LR has been optimized from accuracy:  0.7679406096468608 -> 0.768250180558638 

SVC has been optimized from accuracy:  0.8027646361603734 -> 0.8049831256842467
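
The comparison above uses cross-validated scores on the training split. A test-set comparison fits each model on the whole training split and scores it once on the held-out split. This is a sketch on synthetic data; the names `default_model`, `tuned_model`, and `test_scores` are assumptions, and the tuned parameters simply reuse the SVM setting reported by the grid search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1
)

# Fit both variants on the training split only.
default_model = SVC().fit(X_tr, y_tr)
tuned_model = SVC(kernel="rbf", C=10, gamma="scale").fit(X_tr, y_tr)

# Score each model once on the held-out test split.
test_scores = {
    "default": default_model.score(X_te, y_te),
    "tuned": tuned_model.score(X_te, y_te),
}
for name, score in test_scores.items():
    print(f"{name} SVC test accuracy: {score:.3f}")
```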


3. Apply K Means clustering on the normalized training input X, and 
understand the grouping of training data by investigating the prototype 
from each cluster

(a) Apply clustering on the normalized training input X (you can determine 
the number of clusters by considering how many classes for the target y)

In [99]:
from sklearn.cluster import KMeans

norm = MinMaxScaler().fit(X)
X_norm = norm.transform(X)

kmeans = KMeans(n_clusters=2, random_state=0).fit(X_norm)


(b) Identify how many data samples have been assigned to each cluster

In [100]:
import numpy as np

unique_labels, unique_counts = np.unique(kmeans.labels_, return_counts=True)
print("Number of data samples assigned to cluster 0: ", unique_counts[0])
print("Number of data samples assigned to cluster 1: ", unique_counts[1])

Number of data samples assigned to cluster 0:  7013
Number of data samples assigned to cluster 1:  14524


(c) Extract a prototype from each cluster and investigate their similarity and 
difference.

In [101]:
from sklearn.metrics.pairwise import pairwise_distances_argmin

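# For each cluster center, find the closest sample in the normalized space the
# model was fit on. The recorded output below came from comparing against the
# unscaled X, which is why it shows the same row for both clusters.
closest = pairwise_distances_argmin(kmeans.cluster_centers_, X_norm)

dataset.iloc[closest, :]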

Unnamed: 0,income,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week
12338,0,17,3,6,0,5,3,0,0,5
12338,0,17,3,6,0,5,3,0,0,5
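
A prototype is meaningful only if the distance to the cluster center is measured in the same space the clustering was fit in. The sketch below fits K-means on scaled synthetic blobs and picks, for each center, the nearest sample in that scaled space, then reads the prototype back from the unscaled data. The data and the names `X_raw`, `X_scaled`, and `prototype_idx` are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with two groups.
X_raw, _ = make_blobs(n_samples=200, centers=2, n_features=4, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X_raw)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Index of the sample nearest to each center, measured in the scaled space.
prototype_idx = pairwise_distances_argmin(km.cluster_centers_, X_scaled)

# Show each prototype in its original, unscaled units.
for cluster_id, row in enumerate(prototype_idx):
    print(f"cluster {cluster_id}: row {row} -> {X_raw[row]}")
```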


(d) Evaluate the clustering accuracy with the testing set and compare with 
the results from 2d.

In [103]:
from sklearn.metrics import accuracy_score

kmeans_labels = kmeans.labels_

accuracy = accuracy_score(y, kmeans_labels)
print("k-means prediction accuracy:", accuracy)

k-means prediction accuracy: 0.5246784603240934
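
K-means cluster ids are arbitrary, so cluster 0 need not correspond to class 0. For two clusters, a fair accuracy takes the better of the direct and the flipped assignment. The helper name `best_binary_cluster_accuracy` and the toy labels below are hypothetical, for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def best_binary_cluster_accuracy(y_true, cluster_labels):
    """Hypothetical helper: accuracy under the better of the two label assignments."""
    y_true = np.asarray(y_true)
    cluster_labels = np.asarray(cluster_labels)
    direct = accuracy_score(y_true, cluster_labels)
    flipped = accuracy_score(y_true, 1 - cluster_labels)
    return max(direct, flipped)

# Toy example: the cluster ids are mostly the inverse of the class labels.
y_toy = [0, 0, 1, 1, 1]
labels_toy = [1, 1, 0, 0, 1]
aligned_accuracy = best_binary_cluster_accuracy(y_toy, labels_toy)
print("aligned accuracy:", aligned_accuracy)
```

To score the held-out split instead of the data the clustering was fit on, the same helper can be applied to `kmeans.predict` called on the normalized test inputs.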
