# Final Project: DTSA5510 - Introduction to Machine Learning: Supervised Learning

## Summary 

This project uses the "Adult" dataset from the UCI Machine Learning Repository to predict whether a person makes over 50K a year. The dataset has 14 attributes and 1 target variable, with both numerical and categorical features. UCI distributes it as 32,561 training samples and 16,281 testing samples; here the two parts are loaded together as 48,842 rows and re-split after cleaning. Preprocessing consists of dropping rows with missing values, one-hot encoding the categorical features, and scaling four of the numerical features with `StandardScaler`. The data is then split into training and testing sets (67% / 33%) and used to train eight models: Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, AdaBoost, and XGBoost. The models are evaluated using accuracy, precision, recall, and F1-score. In the run recorded below, Random Forest has the best macro-averaged and weighted F1-scores (0.38 and 0.48) with an accuracy of 0.52, while Gradient Boosting and XGBoost have the highest accuracy (0.57).

The dataset is available here: https://archive.ics.uci.edu/dataset/2/adult

## Problem to Solve

The problem is a binary classification problem. The task is to predict whether a person makes over 50K a year. 



## Importing Libraries

In [69]:
from ucimlrepo import fetch_ucirepo

# fetch dataset 
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes) 
X = adult.data.features
y = adult.data.targets


In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotly.offline import iplot
import plotly as py

py.offline.init_notebook_mode(connected=True)
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings

warnings.filterwarnings('ignore')

In [71]:
all_data = pd.concat([X, y], axis=1)

## Data Exploration (EDA)

The first five rows of the data are shown below.

In [72]:
all_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [73]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [74]:
all_data.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


In [75]:
# Try to find the missing values
all_data.isnull().sum()

age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
income              0
dtype: int64

In [76]:
all_data['native-country'].unique()


array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands', nan], dtype=object)

The dataset has 14 features and 48,842 rows; the features are a mix of numerical and categorical columns. The target variable is `income`, which is binary (`<=50K` or `>50K`) and imbalanced, with `<=50K` as the majority class. Missing values appear both as NaN (in `workclass`, `occupation`, and `native-country`) and as the placeholder `?`.
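The two checks described above (class balance and the two kinds of missing-value marker) can be sketched as follows. This is a minimal illustration on a small hypothetical frame named `toy`; its values are made up and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for all_data; values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private", "?", "Private"],
    "native-country": ["United-States", "?", "Cuba", np.nan, "India", "United-States"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Share of each target class; unequal shares indicate imbalance.
class_share = toy["income"].value_counts(normalize=True)

# Missing values can hide in two forms: real NaN and the '?' placeholder.
nan_counts = toy.isna().sum()
placeholder_counts = (toy == "?").sum()

print(class_share)
print(pd.DataFrame({"NaN": nan_counts, "'?'": placeholder_counts}))
```

Counting both markers side by side makes it easier to see which columns need which kind of cleaning.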

## Data Cleaning

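We start by dropping rows with missing values: rows where `native-country` or `occupation` is `?`, and rows where `workclass` or `occupation` is NaN. NaN entries in `native-country` are not dropped by this cell, so 255 of them remain in the check below.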

In [82]:
# Drop rows where native-country or occupation holds the '?' placeholder
all_data.drop(all_data[all_data['native-country'] == '?'].index, inplace=True)
all_data.drop(all_data[all_data['occupation'] == '?'].index, inplace=True)
# Drop rows where workclass or occupation is NaN
all_data.dropna(subset=['workclass', 'occupation'], inplace=True)

In [83]:
# Check if the missing values are removed
print(all_data['native-country'].unique())
print(all_data['occupation'].unique())
all_data.isnull().sum()

['United-States' 'Cuba' 'Jamaica' 'India' 'Mexico' 'Puerto-Rico'
 'Honduras' 'England' 'Canada' 'Germany' 'Iran' 'Philippines' 'Poland'
 'Columbia' 'Cambodia' 'Thailand' 'Ecuador' 'Laos' 'Taiwan' 'Haiti'
 'Portugal' 'Dominican-Republic' 'El-Salvador' 'France' 'Guatemala'
 'Italy' 'China' 'South' 'Japan' 'Yugoslavia' 'Peru'
 'Outlying-US(Guam-USVI-etc)' 'Scotland' 'Trinadad&Tobago' 'Greece'
 'Nicaragua' 'Vietnam' 'Hong' 'Ireland' 'Hungary' 'Holand-Netherlands' nan]
['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Transport-moving' 'Farming-fishing'
 'Machine-op-inspct' 'Tech-support' 'Craft-repair' 'Protective-serv'
 'Armed-Forces' 'Priv-house-serv']


age                 0
workclass           0
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation          0
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    255
income              0
dtype: int64
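The check above still reports NaN values in `native-country`. One way to treat `?` and NaN uniformly is to convert the placeholder to NaN first and then drop every incomplete row in a single step. The sketch below uses a hypothetical frame `toy`; it is not what the notebook ran, and applying it to `all_data` would change the row counts and the results reported later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for all_data with both kinds of missing marker.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", np.nan],
    "occupation": ["Sales", "?", "Adm-clerical", np.nan],
    "native-country": ["United-States", "Cuba", np.nan, "?"],
})

# Turn every '?' into NaN, then drop any row that has at least one NaN.
cleaned = toy.replace("?", np.nan).dropna().reset_index(drop=True)

print(cleaned)
print(cleaned.isna().sum())
```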

## Data Preprocessing and Feature Engineering

Because several features are categorical, we use `pandas.get_dummies` to one-hot encode them.
The `income` and `sex` columns are converted to integer codes with a label encoder.

In [84]:
# Label-encode the target (income) and the sex feature
le = LabelEncoder()
all_data['income'] = le.fit_transform(all_data['income'])
all_data['sex'] = le.fit_transform(all_data['sex'])

# Encoding categorical variables
all_data = pd.get_dummies(all_data, drop_first=True)

pd.set_option('display.max_columns', 100)  #to display all columns
all_data.head()

Unnamed: 0,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,income,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_11th,education_12th,education_1st-4th,education_5th-6th,education_7th-8th,education_9th,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college,marital-status_Married-AF-spouse,marital-status_Married-civ-spouse,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed,occupation_Armed-Forces,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,native-country_Canada,native-country_China,native-country_Columbia,native-country_Cuba,native-country_Dominican-Republic,native-country_Ecuador,native-country_El-Salvador,native-country_England,native-country_France,native-country_Germany,native-country_Greece,native-country_Guatemala,native-country_Haiti,native-country_Holand-Netherlands,native-country_Honduras,native-country_Hong,native-country_Hungary,native-country_India,native-country_Iran,native-country_Ireland,native-country_Italy,native-country_Jamaica,native-country_Japan,native-country_Laos,native-country_Mexico,native-country_Nicaragua,native-country_Outlying-US(Guam-USVI-etc),native-country_Peru,native-country_Philippines,native-country_Poland,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,
native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,39,77516,13,1,2174,0,40,0,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
1,50,83311,13,1,0,0,13,0,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,38,215646,9,1,0,0,40,0,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
3,53,234721,7,1,0,0,40,0,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
4,28,338409,13,0,0,0,40,0,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
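To make the encoding above easier to read, here is a small sketch of how `pd.get_dummies(..., drop_first=True)` behaves on a hypothetical one-column frame `toy`. The first category (alphabetically) gets no column of its own, and a NaN entry is encoded as all zeros as well, so the two become indistinguishable. In the table above the dropped `native-country` category is `Cambodia`. The `dummy_na=True` variant is shown only as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three categories and one missing value.
toy = pd.DataFrame({"native-country": ["Cuba", "India", "Mexico", np.nan]})

# drop_first=True removes the column for the first category ('Cuba' here).
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies)

# With dummy_na=True, missing values get their own indicator column.
with_na = pd.get_dummies(toy, drop_first=True, dummy_na=True)
print(with_na)
```

In `dummies`, the `Cuba` row and the NaN row both contain only zeros (or `False`, depending on the pandas version).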


All columns are now numeric. Four of the continuous features (`age`, `fnlwgt`, `education-num`, `hours-per-week`) are standardized with `StandardScaler`; `capital-gain` and `capital-loss` are left on their original scale.

In [85]:
scale = StandardScaler()
all_data[['age', 'fnlwgt', 'education-num', 'hours-per-week']] = scale.fit_transform(
    all_data[['age', 'fnlwgt', 'education-num', 'hours-per-week']])
all_data.head()

Unnamed: 0,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,income,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_11th,education_12th,education_1st-4th,education_5th-6th,education_7th-8th,education_9th,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college,marital-status_Married-AF-spouse,marital-status_Married-civ-spouse,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed,occupation_Armed-Forces,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,native-country_Canada,native-country_China,native-country_Columbia,native-country_Cuba,native-country_Dominican-Republic,native-country_Ecuador,native-country_El-Salvador,native-country_England,native-country_France,native-country_Germany,native-country_Greece,native-country_Guatemala,native-country_Haiti,native-country_Holand-Netherlands,native-country_Honduras,native-country_Hong,native-country_Hungary,native-country_India,native-country_Iran,native-country_Ireland,native-country_Italy,native-country_Jamaica,native-country_Japan,native-country_Laos,native-country_Mexico,native-country_Nicaragua,native-country_Outlying-US(Guam-USVI-etc),native-country_Peru,native-country_Philippines,native-country_Poland,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,
native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0.033731,-1.062985,1.1257,1,2174,0,-0.077984,0,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
1,0.866053,-1.00811,1.1257,1,0,0,-2.326545,0,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,-0.041935,0.245028,-0.439371,1,0,0,-0.077984,0,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
3,1.09305,0.425658,-1.221906,1,0,0,-0.077984,0,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
4,-0.798592,1.407525,1.1257,0,0,0,-0.077984,0,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
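The scaler above is fitted on all rows before the train/test split, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that pattern on hypothetical random data; the frame `toy`, the random labels, and the `numeric_cols` list are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric features, one binary feature, random labels.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=200),
    "hours-per-week": rng.integers(1, 99, size=200),
    "sex": rng.integers(0, 2, size=200),
})
target = pd.Series(rng.integers(0, 2, size=200), name="income")

# Split first; stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.33, random_state=42, stratify=target)

# Scale only the numeric columns and pass the remaining columns through.
numeric_cols = ["age", "hours-per-week"]
scaler = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols)], remainder="passthrough")

X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler.transform(X_te)      # the same statistics are reused here
```

The scaled columns come first in the output array, followed by the passed-through columns.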


## Model Building

In [86]:
# Splitting the data into training and testing sets
# X, y prepared above cannot be used as they are not cleaned. 
X_train, X_test, y_train, y_test = train_test_split(all_data.drop('income', axis=1), all_data['income'], test_size=0.33,
                                                    random_state=42)

In [89]:
# Build all models
# Logistic Regression
lr = LogisticRegression()
# K-Nearest Neighbors
knn = KNeighborsClassifier()
# Decision Tree
dt = DecisionTreeClassifier()
# Random Forest
rf = RandomForestClassifier()
# Gradient Boosting
gb = GradientBoostingClassifier()
# Support Vector Machine
svm = SVC()
# AdaBoost
ada = AdaBoostClassifier()
# XGBoost
xgb = XGBClassifier()

In [90]:
# Fit all models
lr.fit(X_train, y_train)
knn.fit(X_train, y_train)
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)
gb.fit(X_train, y_train)
svm.fit(X_train, y_train)
ada.fit(X_train, y_train)
xgb.fit(X_train, y_train)

In [91]:
# Predict on the test set with every model
lr_pred = lr.predict(X_test)
knn_pred = knn.predict(X_test)
dt_pred = dt.predict(X_test)
rf_pred = rf.predict(X_test)
gb_pred = gb.predict(X_test)
svm_pred = svm.predict(X_test)
ada_pred = ada.predict(X_test)
xgb_pred = xgb.predict(X_test)

# Note: the scores printed below are accuracies on the training set, not the test set
print('Logistic Regression:', lr.score(X_train, y_train))
print('K-Nearest Neighbors:', knn.score(X_train, y_train))
print('Decision Tree:', dt.score(X_train, y_train))
print('Random Forest:', rf.score(X_train, y_train))
print('Gradient Boosting:', gb.score(X_train, y_train))
print('Support Vector Machine:', svm.score(X_train, y_train))
print('AdaBoost:', ada.score(X_train, y_train))
print('XGBoost:', xgb.score(X_train, y_train))

Logistic Regression: 0.5391053201614756
K-Nearest Neighbors: 0.6523351603268897
Decision Tree: 0.9996389773212118
Random Forest: 0.9996389773212118
Gradient Boosting: 0.574846565361515
Support Vector Machine: 0.5270274705438315
AdaBoost: 0.5658866388788605
XGBoost: 0.6451147067511241
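The scores above are computed on `X_train`, so a value close to 1.0 (as for Decision Tree and Random Forest) mainly shows that the model memorized the training data. Printing training and test accuracy side by side makes that gap visible. The sketch below does this on synthetic data from `make_classification`; the data, the `models` dictionary, and the reduced model list are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned, encoded data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each model once and record (train accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={scores[name][0]:.3f}  test={scores[name][1]:.3f}")
```

Keeping the models in a dictionary also avoids repeating one `fit`, `predict`, and `print` line per model.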




In [92]:
# Evaluate all models
print('Logistic Regression:')
print(classification_report(y_test, lr_pred))
print('K-Nearest Neighbors:')
print(classification_report(y_test, knn_pred))
print('Decision Tree:')
print(classification_report(y_test, dt_pred))
print('Random Forest:')
print(classification_report(y_test, rf_pred))
print('Gradient Boosting:')
print(classification_report(y_test, gb_pred))
print('Support Vector Machine:')
print(classification_report(y_test, svm_pred))
print('AdaBoost:')
print(classification_report(y_test, ada_pred))
print('XGBoost:')
print(classification_report(y_test, xgb_pred))

Logistic Regression:
              precision    recall  f1-score   support

           0       0.55      0.96      0.70      7496
           1       0.00      0.00      0.00      3776
           2       0.50      0.36      0.42      2479
           3       0.33      0.00      0.00      1257

    accuracy                           0.54     15008
   macro avg       0.34      0.33      0.28     15008
weighted avg       0.38      0.54      0.42     15008

K-Nearest Neighbors:
              precision    recall  f1-score   support

           0       0.59      0.74      0.66      7496
           1       0.29      0.18      0.23      3776
           2       0.49      0.49      0.49      2479
           3       0.25      0.11      0.16      1257

    accuracy                           0.51     15008
   macro avg       0.40      0.38      0.38     15008
weighted avg       0.47      0.51      0.48     15008

Decision Tree:
              precision    recall  f1-score   support

           0      

## Observations from the models 
Judged by F1-score, the best performing model is Random Forest, with an accuracy of 0.52 and, for class 0, a precision of 0.59, a recall of 0.78, and an F1-score of 0.67.

Most models struggle with predicting classes 1 and 3, as evident from the low precision, recall, and F1-scores for these classes across all models.

The Support Vector Machine and AdaBoost models completely fail to predict classes 1 and 3, with a precision of 0.00 for these classes.

Overall accuracy is low for every model: K-Nearest Neighbors reaches 0.51, Random Forest 0.52, and Logistic Regression 0.54, while the Gradient Boosting and XGBoost models have the highest overall accuracy of 0.57.

The macro-averaged F1-scores across all classes are relatively low for all models, ranging from 0.26 (Support Vector Machine) to 0.38 (Random Forest), indicating that the models struggle to perform well across all classes.

The weighted-average F1-scores, which weight each class by its support, are slightly better than the macro-averaged F1-scores, but still relatively low, ranging from 0.40 (Support Vector Machine) to 0.48 (Random Forest).
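As a worked example of the two averages, the sketch below recomputes them from the rounded per-class F1-scores and supports in the K-Nearest Neighbors report above. The macro average is the plain mean over classes, and the weighted average weights each class by its support.

```python
import numpy as np

# Per-class F1 and support, copied from the K-Nearest Neighbors report.
f1 = np.array([0.66, 0.23, 0.49, 0.16])
support = np.array([7496, 3776, 2479, 1257])

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # large classes count more

print(f"macro avg F1:    {macro_f1:.3f}")
print(f"weighted avg F1: {weighted_f1:.3f}")
```

The results are close to the 0.38 and 0.48 in the report; small differences are expected because the inputs here are already rounded to two decimals.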

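Overall, the Random Forest model appears to be the best performing model in these reports, but there is still room for improvement, especially in handling the class imbalance and improving the performance for the minority classes (1 and 3). Note that the reports show four classes (0 to 3) even though the task is defined as binary. This suggests the `income` column held more than two distinct label strings when it was label-encoded; in the UCI files the test-partition labels carry a trailing period (`<=50K.`, `>50K.`), which would produce exactly this pattern. If so, the scores above understate what the models can achieve on the binary task.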
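The task is binary, yet the reports above list four classes. A plausible cause, assuming the `income` column contains trailing-period variants such as `<=50K.` (the notebook never prints `all_data['income'].unique()`, so this should be checked first), is that the label encoder saw four distinct strings. The sketch below shows the effect and one possible fix on a hypothetical series.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels mixing both spellings of each class.
income = pd.Series(["<=50K", ">50K", "<=50K.", ">50K.", "<=50K"])

# Encoding the raw strings yields four classes instead of two.
raw_codes = LabelEncoder().fit_transform(income)

# Strip whitespace and the trailing period before encoding.
income_clean = income.str.strip().str.rstrip(".")
le = LabelEncoder()
clean_codes = le.fit_transform(income_clean)

print(sorted(set(raw_codes)))
print(list(le.classes_), list(clean_codes))
```

If the assumption holds, applying the same normalization to `all_data['income']` before the encoding cell would restore the two-class target.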

## Conclusion

In this project, we explored the "Adult" dataset from the UCI Machine Learning Repository and used it to predict the income level of individuals based on various features. We preprocessed the data by dropping rows with missing values, encoding categorical features, scaling numerical features, and splitting the data into training and testing sets. We then trained and evaluated eight supervised machine learning models: Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, AdaBoost, and XGBoost. In the recorded run, Random Forest had the best F1-scores (macro average 0.38, weighted average 0.48) with an accuracy of 0.52, and Gradient Boosting and XGBoost had the highest accuracy at 0.57. These scores are low, and the near-perfect training accuracy of the Decision Tree and Random Forest models (0.9996) points to overfitting. There is room for improvement, especially in handling the class imbalance and improving the performance for the minority classes. Overall, the Random Forest model is a starting point for predicting the income level of individuals based on the features in the "Adult" dataset.
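One way to follow up on the class imbalance mentioned above is a small grid search that compares class weighting and tree depth, scored with macro-averaged F1. The sketch below runs on synthetic imbalanced data; the data, the grid values, and the choice of `f1_macro` are assumptions for illustration and were not part of this notebook's run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data: roughly 75% of samples in the majority class.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42)

# Hypothetical grid: tree depth and whether to reweight classes.
param_grid = {"max_depth": [4, 8, None], "class_weight": [None, "balanced"]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid, scoring="f1_macro", cv=3)
search.fit(X_tr, y_tr)

test_macro_f1 = search.score(X_te, y_te)  # uses the same f1_macro scorer
print(search.best_params_)
print("test macro-F1:", test_macro_f1)
```

Scoring with macro F1 rather than accuracy keeps the minority class from being ignored when the best parameters are chosen.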