## Income (>50K / <=50K) Classification: KNN, LogReg, Random Forest, XGBoost & Neural Network
## for Loyola University New Orleans: COSC-A406-001: Machine Learning
### Carter Roberts
### May 4th, 2025

Semester final assignment for the Loyola University New Orleans Machine Learning course, using the CSV from the Kaggle dataset at https://www.kaggle.com/datasets/lodetomasi1995/income-classification. The goal is to predict whether an individual earns more than 50K a year (`>50K`) or not (`<=50K`).

## Every model I tried reached above 82% accuracy on both train and test, and the neural network held a train / test loss of only about 0.34 / 0.34 on average.

In [333]:
# Importing libraries...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch 
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split

In [334]:
# acquire dataset
data = pd.read_csv("./income_evaluation.csv")
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [335]:
# check for column types and # non-nulls
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlwgt          32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
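
The `info()` output shows that every column name except `age` carries a leading space, which is why later cells refer to names such as `' capital-gain'`. One alternative, shown here only as a minimal sketch, is to strip that whitespace while loading. The inline CSV is a made-up two-row stand-in for the real file, and the rest of this notebook keeps the original names with the leading space.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for income_evaluation.csv, using the same
# "comma followed by a space" layout that produces the leading spaces.
raw = "age, workclass, income\n39, State-gov, <=50K\n50, ?, >50K\n"

# skipinitialspace=True drops the blank that follows each delimiter,
# so both the header names and the string values come out clean.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(clean.columns))
print(clean["workclass"].tolist())
```

With clean names and values, a single `'?'` check would be enough and the column names could be written without the leading space.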


In [336]:
# get all unique values
for col in data.columns:
    unique = data[col].unique()
    print("unique values for", col, "are:", unique)

unique values for age are: [39 50 38 53 28 37 49 52 31 42 30 23 32 40 34 25 43 54 35 59 56 19 20 45
 22 48 21 24 57 44 41 29 18 47 46 36 79 27 67 33 76 17 55 61 70 64 71 68
 66 51 58 26 60 90 75 65 77 62 63 80 72 74 69 73 81 78 88 82 83 84 85 86
 87]
unique values for  workclass are: [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']
unique values for  fnlwgt are: [ 77516  83311 215646 ...  34066  84661 257302]
unique values for  education are: [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
unique values for  education-num are: [13  9  7 14  5 10 12 11  4 16 15  3  6  2  1  8]
unique values for  marital-status are: [' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']
unique values for  occupation are: ['

In [337]:
# replace the "?" placeholder in non-numeric columns with that column's most common value
# (values in this CSV carry a leading space, so both '?' and ' ?' are handled)
for col in data.columns:
    if not pd.api.types.is_numeric_dtype(data[col]):
        new = data[col].mode()[0] # mode() returns a Series; take its first entry
        data[col] = data[col].replace(['?', ' ?'], new)

# check uniques of non-numerics
for col in data.columns:
    if not pd.api.types.is_numeric_dtype(data[col]):
        unique = data[col].unique()
        print("unique values for", col, "are:", unique)

unique values for  workclass are: [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' Self-emp-inc' ' Without-pay' ' Never-worked']
unique values for  education are: [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
unique values for  marital-status are: [' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']
unique values for  occupation are: [' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']
unique values for  relationship are: [' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']
unique values for  race are: [' White' ' Black' ' Asia
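
The mode above is computed while the `' ?'` placeholder is still in the column. That works as long as the placeholder is not itself the most frequent value. A slightly more defensive variant, sketched here on a made-up Series rather than the real data, marks the placeholder as missing first so it can never be picked as the fill value.

```python
import numpy as np
import pandas as pd

# Made-up column in which the placeholder ' ?' is itself the most common value.
s = pd.Series([" ?", " ?", " ?", " Private", " Private", " State-gov"])

# Treat the placeholder as missing first, so it cannot win the mode.
s_missing = s.replace(" ?", np.nan)
fill_value = s_missing.mode()[0]   # mode() ignores NaN by default
s_filled = s_missing.fillna(fill_value)

print(repr(fill_value))
print(s_filled.tolist())
```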

In [338]:
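# remove numeric outliers with the 1.5 * IQR rule, one column at a time
# (each column's quartiles are computed on the rows that survived the previous columns)
for col in data.columns:
    if pd.api.types.is_numeric_dtype(data[col]):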
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1
        lowerb = q1 - 1.5*iqr
        upperb = q3 + 1.5*iqr
        data = data.drop(data[(data[col] < lowerb) | (data[col] > upperb)].index)
    
# check non-null and indexes
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19004 entries, 2 to 32558
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              19004 non-null  int64 
 1    workclass       19004 non-null  object
 2    fnlwgt          19004 non-null  int64 
 3    education       19004 non-null  object
 4    education-num   19004 non-null  int64 
 5    marital-status  19004 non-null  object
 6    occupation      19004 non-null  object
 7    relationship    19004 non-null  object
 8    race            19004 non-null  object
 9    sex             19004 non-null  object
 10   capital-gain    19004 non-null  int64 
 11   capital-loss    19004 non-null  int64 
 12   hours-per-week  19004 non-null  int64 
 13   native-country  19004 non-null  object
 14   income          19004 non-null  object
dtypes: int64(6), object(9)
memory usage: 2.3+ MB


In [339]:
# get all unique values, again, to ensure things are running fine
for col in data.columns:
    unique = data[col].unique()
    print("unique values for", col, "are:", unique)

unique values for age are: [38 53 28 37 52 30 32 40 25 43 35 59 56 19 49 23 20 48 21 31 24 57 44 41
 29 42 36 27 46 33 34 76 47 22 39 61 70 45 51 18 60 50 65 58 26 64 77 63
 55 62 54 17 72 69 73 67 66 75 68 74 71 78]
unique values for  workclass are: [' Private' ' Self-emp-not-inc' ' State-gov' ' Federal-gov' ' Local-gov'
 ' Self-emp-inc' ' Never-worked' ' Without-pay']
unique values for  fnlwgt are: [215646 234721 338409 ... 321865 257302 151910]
unique values for  education are: [' HS-grad' ' 11th' ' Bachelors' ' Masters' ' Assoc-acdm' ' Assoc-voc'
 ' 9th' ' Some-college' ' Doctorate' ' Prof-school' ' 10th' ' 12th']
unique values for  education-num are: [ 9  7 13 14 12 11  5 10 16 15  6  8]
unique values for  marital-status are: [' Divorced' ' Married-civ-spouse' ' Never-married' ' Separated'
 ' Widowed' ' Married-spouse-absent' ' Married-AF-spouse']
unique values for  occupation are: [' Handlers-cleaners' ' Prof-specialty' ' Exec-managerial' ' Sales'
 ' Craft-repair' ' Farming-fishi

In [340]:
# outlier removal left only zeros in these two columns, so drop them
data = data.drop(columns=[' capital-gain', ' capital-loss'])
# check again
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19004 entries, 2 to 32558
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              19004 non-null  int64 
 1    workclass       19004 non-null  object
 2    fnlwgt          19004 non-null  int64 
 3    education       19004 non-null  object
 4    education-num   19004 non-null  int64 
 5    marital-status  19004 non-null  object
 6    occupation      19004 non-null  object
 7    relationship    19004 non-null  object
 8    race            19004 non-null  object
 9    sex             19004 non-null  object
 10   hours-per-week  19004 non-null  int64 
 11   native-country  19004 non-null  object
 12   income          19004 non-null  object
dtypes: int64(4), object(9)
memory usage: 2.0+ MB
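
Why the capital columns end up all zero can be seen in a small sketch. The Series below is made up, assuming only that most rows are 0 (as is typical for this kind of column): when both quartiles are 0, the IQR is 0, both fences sit at 0, and every nonzero row counts as an outlier.

```python
import pandas as pd

# Made-up column that is zero for most rows, like the capital columns.
gains = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 2174, 14084])

q1, q3 = gains.quantile(0.25), gains.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# With q1 == q3 == 0 both fences are 0, so only the zero rows survive.
kept = gains[(gains >= lower) & (gains <= upper)]
print(lower, upper, kept.unique())
```

This also means the IQR rule removes every person with any capital gain or loss, which is part of why the row count falls from 32561 to 19004.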


In [341]:
# automatically one-hot encode the object columns that don't need a specific encoding
data = pd.get_dummies(data, columns=[' workclass', ' marital-status', ' relationship', ' race', ' sex'], drop_first = True)


In [342]:
# check that this was done correctly
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19004 entries, 2 to 32558
Data columns (total 31 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   age                                     19004 non-null  int64 
 1    fnlwgt                                 19004 non-null  int64 
 2    education                              19004 non-null  object
 3    education-num                          19004 non-null  int64 
 4    occupation                             19004 non-null  object
 5    hours-per-week                         19004 non-null  int64 
 6    native-country                         19004 non-null  object
 7    income                                 19004 non-null  object
 8    workclass_ Local-gov                   19004 non-null  bool  
 9    workclass_ Never-worked                19004 non-null  bool  
 10   workclass_ Private                     19004 non-null  bool  
 11   workcl

In [343]:
# now encode the remaining object columns, starting with the ones that have a natural order
from sklearn.preprocessing import LabelEncoder

# education & income are the only ones that need a specific mapping rather than a generic label encoding
# note: ' Prof-school' is ranked 6 here, between ' HS-grad' and ' Some-college', whereas the
# dataset's own education-num column ranks it just below ' Doctorate'
education_map = {' 9th': 1, ' 10th': 2, ' 11th': 3, ' 12th': 4, ' HS-grad': 5, ' Prof-school': 6, ' Some-college': 7, 
                 ' Assoc-acdm': 8, ' Assoc-voc': 9, ' Bachelors': 10, ' Masters': 11, ' Doctorate': 12}
data['education'] = data[' education'].map(education_map)
data = data.drop([' education'], axis=1)
# map income to a binary 0/1 target, then drop the original string column
income_map = {' <=50K': 0, ' >50K': 1}
data['income'] = data[' income'].map(income_map)
data = data.drop([' income'], axis=1)

# these just get generic label encodings, since their order is largely arbitrary (native-country has too many values to one-hot encode)
le = LabelEncoder()
data[' occupation'] = le.fit_transform(data[' occupation'])
data[' native-country'] = le.fit_transform(data[' native-country'])

data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19004 entries, 2 to 32558
Data columns (total 31 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   age                                     19004 non-null  int64
 1    fnlwgt                                 19004 non-null  int64
 2    education-num                          19004 non-null  int64
 3    occupation                             19004 non-null  int64
 4    hours-per-week                         19004 non-null  int64
 5    native-country                         19004 non-null  int64
 6    workclass_ Local-gov                   19004 non-null  bool 
 7    workclass_ Never-worked                19004 non-null  bool 
 8    workclass_ Private                     19004 non-null  bool 
 9    workclass_ Self-emp-inc                19004 non-null  bool 
 10   workclass_ Self-emp-not-inc            19004 non-null  bool 
 11   workclass_ State-go
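
For reference, `LabelEncoder` assigns integer codes in sorted order of the class names, so the resulting numbers carry no meaning beyond alphabetical position. The sketch below uses made-up values. It also shows that refitting the same encoder on a second column overwrites `classes_`, so a shared `le` can only decode the last column it was fitted on.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Codes follow the sorted order of the distinct strings.
codes = le.fit_transform([" Sales", " Adm-clerical", " Tech-support", " Sales"])
occupation_classes = le.classes_.tolist()
print(occupation_classes)   # [' Adm-clerical', ' Sales', ' Tech-support']
print(codes.tolist())       # [1, 0, 2, 1]

# Refitting replaces the stored classes with those of the new column.
le.fit_transform([" Cuba", " United-States"])
country_classes = le.classes_.tolist()
print(country_classes)      # [' Cuba', ' United-States']
```

If the codes ever need to be mapped back to names, keeping one encoder per column avoids this overwrite.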

In [344]:
# convert all bool columns to int64 (0 or 1) so every feature is numeric
for col in data.columns:
    if data[col].dtype == bool:
        data[col] = data[col].astype(int)
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19004 entries, 2 to 32558
Data columns (total 31 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   age                                     19004 non-null  int64
 1    fnlwgt                                 19004 non-null  int64
 2    education-num                          19004 non-null  int64
 3    occupation                             19004 non-null  int64
 4    hours-per-week                         19004 non-null  int64
 5    native-country                         19004 non-null  int64
 6    workclass_ Local-gov                   19004 non-null  int64
 7    workclass_ Never-worked                19004 non-null  int64
 8    workclass_ Private                     19004 non-null  int64
 9    workclass_ Self-emp-inc                19004 non-null  int64
 10   workclass_ Self-emp-not-inc            19004 non-null  int64
 11   workclass_ State-go

In [345]:
# scale every column to [0, 1] by dividing by its maximum (all values are non-negative)
for col in data.columns:
    mxv = data[col].max()
    data[col] = data[col] / mxv

# check
data.head()

Unnamed: 0,age,fnlwgt,education-num,occupation,hours-per-week,native-country,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,...,relationship_ Own-child,relationship_ Unmarried,relationship_ Wife,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Male,education,income
2,0.487179,0.517863,0.5625,0.384615,0.769231,0.948718,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.416667,0.0
3,0.679487,0.563671,0.4375,0.384615,0.769231,0.948718,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.25,0.0
4,0.358974,0.812672,0.8125,0.692308,0.769231,0.102564,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.833333,0.0
5,0.474359,0.68341,0.875,0.230769,0.769231,0.948718,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.916667,0.0
7,0.666667,0.503445,0.5625,0.230769,0.865385,0.948718,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.416667,1.0
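
The scaling above uses each column's maximum over the whole dataset, so rows that later land in the test set influence the scale. A common alternative, sketched here on made-up frames rather than the real data, is to take the maxima from the training rows only and reuse them for the test rows.

```python
import pandas as pd

# Made-up train/test frames standing in for the real split.
train = pd.DataFrame({"age": [20, 40, 60], "hours": [10, 40, 50]})
test = pd.DataFrame({"age": [30, 80], "hours": [20, 45]})

col_max = train.max()            # statistics come from the training rows only
train_scaled = train / col_max   # division aligns on column names
test_scaled = test / col_max     # test values may exceed 1.0; that is expected

print(train_scaled)
print(test_scaled)
```

For plain max-scaling the difference is usually small, but the same pattern matters more for statistics such as the mean and standard deviation.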


## Preprocessing is now done. Next: separate X and y, split into train/test, and fit the models.

In [346]:
# separate X and Y by features and target
X = data.drop(columns=['income'])
y = data['income']

# separate x and y into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(15203, 30) (15203,)
(3801, 30) (3801,)
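
The split above is random on every run, which is why the accuracies below are quoted as "around" a value. A reproducible variant is sketched here on synthetic arrays. The names `X_demo` and `y_demo` and the seed 42 are illustrative assumptions, not values from this notebook. `stratify` keeps the class proportions nearly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, roughly 25% positives.
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 3))
y_demo = (rng.random(1000) < 0.25).astype(int)

# random_state fixes the shuffle; stratify preserves the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```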


## KNN: around 84% / 84% train/test consistently

In [349]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=37)
knn.fit(X_train, y_train)
y_pred_train = knn.predict(X_train)
y_pred_test = knn.predict(X_test)

# Model Evaluation:
print('Train Accuracy:', accuracy_score(y_train, y_pred_train), 
      'Test Accuracy:', accuracy_score(y_test, y_pred_test))

Train Accuracy: 0.8425968558837071 Test Accuracy: 0.8403051828466193
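
Accuracy figures like these are easier to judge next to the accuracy of always predicting the most frequent class. The notebook does not report the class balance after outlier removal, so the sketch below only shows how that baseline could be computed. The helper name and the labels are made up for illustration.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / y.size

y_demo = np.array([0, 0, 0, 1])            # made-up labels: 75% zeros
print(majority_baseline_accuracy(y_demo))  # 0.75
```

Calling `majority_baseline_accuracy(y_test)` would give the number that each model's test accuracy has to beat.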


## Logistic Regression: around 83% / 84% train/test consistently

In [352]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=999)
lr.fit(X_train, y_train)
y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

# Model evaluation:
print('Train Accuracy:', accuracy_score(y_train, y_pred_train), 
      'Test Accuracy:', accuracy_score(y_test, y_pred_test))  

Train Accuracy: 0.833322370584753 Test Accuracy: 0.8410944488292554


## Random Forest: around 83% / 84% train/test consistently

In [355]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=150, max_depth=5)
rf.fit(X_train, y_train)
y_pred_train = rf.predict(X_train)
y_pred_test = rf.predict(X_test)

# Model Evaluation:
print('Train Accuracy:', accuracy_score(y_train, y_pred_train), 
      'Test Accuracy:', accuracy_score(y_test, y_pred_test))

Train Accuracy: 0.8311517463658489 Test Accuracy: 0.842409892133649


## XGBoost: around 86% / 85% train/test consistently

In [358]:
import xgboost as xgb
xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.05, max_depth=5)
xgb_model.fit(X_train, y_train)
y_pred_train = xgb_model.predict(X_train)
y_pred_test = xgb_model.predict(X_test)

# Model Evaluation: 
print('Train Accuracy:', accuracy_score(y_train, y_pred_train), 
      'Test Accuracy:', accuracy_score(y_test, y_pred_test))

Train Accuracy: 0.8620666973623627 Test Accuracy: 0.8521441725861615


## The neural network needs tensor conversion and a model class first.

In [406]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cpu


In [407]:
# change to numpy arrays first
xtrt = X_train.to_numpy()
xtst = X_test.to_numpy()
ytrt = y_train.to_numpy()
ytst = y_test.to_numpy()

# define tensors for train & test sets
X_train_t = torch.tensor(xtrt, dtype = torch.float32).to(device)
X_test_t = torch.tensor(xtst, dtype=torch.float32).to(device)
# view(-1, 1) reshapes y into a column vector so it matches the model's (n, 1) output
y_train_t = torch.tensor(ytrt, dtype=torch.float32).view(-1, 1).to(device)
y_test_t = torch.tensor(ytst, dtype=torch.float32).view(-1, 1).to(device)

print(X_train_t.shape)
print(X_test_t.shape)
print(y_train_t.shape)
print(y_test_t.shape)

torch.Size([15203, 30])
torch.Size([3801, 30])
torch.Size([15203, 1])
torch.Size([3801, 1])


In [408]:
# define neural network class
class nnIC(nn.Module):
    # initializing constructor
    def __init__(self):
        # initialize parent
        super().__init__()
        self.layer1 = nn.Linear(30, 20)
        self.layer2 = nn.Linear(20, 15)
        self.layer3 = nn.Linear(15, 10)
        self.layer4 = nn.Linear(10, 5)
        self.layer5 = nn.Linear(5, 1)
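    def forward(self, x):
        # each layer multiplies its input by weights and adds a bias;
        # no activation is applied between layers, so the stack stays linear until the final sigmoid
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.layer5(x)
        return torch.sigmoid(x) # squash the output to a probability in (0, 1)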
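
Because `forward` applies no activation between the linear layers, the five layers compose into a single linear map followed by a sigmoid. That is the same functional form as logistic regression. The NumPy sketch below illustrates this for two layers with random, made-up weights, and shows that inserting a ReLU breaks the equivalence. It is an illustration of the algebra, not a change to the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(20, 30)), rng.normal(size=20)   # like nn.Linear(30, 20)
W2, b2 = rng.normal(size=(15, 20)), rng.normal(size=15)   # like nn.Linear(20, 15)
x = rng.normal(size=30)

# Two stacked layers with no activation in between ...
stacked = W2 @ (W1 @ x + b1) + b2
# ... equal one layer with weights W2 @ W1 and bias W2 @ b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = W @ x + b
same_without_activation = np.allclose(stacked, collapsed)
print(same_without_activation)   # True

# With a ReLU between the layers the composition is no longer a single linear map.
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
same_with_relu = np.allclose(with_relu, collapsed)
print(same_with_relu)
```

Adding an activation such as `F.relu` after each hidden layer would let the network model non-linear relationships. The losses reported below would then no longer apply.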

## Neural Network: consistent train / test loss of about 0.34 / 0.34

In [412]:
# training of neural network
model = nnIC().to(device)
loss = nn.BCELoss() # binary cross entropy, the standard loss for binary classification with sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr = 0.01) 

In [413]:
# train the model (full-batch: every step uses the whole training set)
# number of passes over the training set
epochs = 1000
# training loop
for epoch in range(epochs):
    y_pred = model(X_train_t) # get predicted y value in model
    loss_fn = loss(y_pred, y_train_t) # BCE: -1/n * sum(y*log(yp) + (1-y)*log(1-yp))
    optimizer.zero_grad() # zero out gradient in network
    loss_fn.backward() # compute derivatives
    optimizer.step() # Update weights
    if ((epoch+1) % 100 == 0):
        print("loss =", loss_fn.item())

loss = 0.354406476020813
loss = 0.3479243516921997
loss = 0.34707126021385193
loss = 0.3469681441783905
loss = 0.3471136689186096
loss = 0.34695133566856384
loss = 0.34695836901664734
loss = 0.3469509482383728
loss = 0.34695157408714294
loss = 0.3470097482204437
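
To put these numbers in context, binary cross entropy can be computed by hand. The NumPy sketch below is a hypothetical helper, not PyTorch's implementation: it clips probabilities with a small `eps`, whereas `nn.BCELoss` instead clamps the log terms. A model that outputs 0.5 for every row scores ln 2 ≈ 0.693, so a loss near 0.35 is roughly half of that uninformed value.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all rows."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) can never occur (an assumption of this sketch).
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

print(binary_cross_entropy([1, 0], [0.5, 0.5]))   # ln 2, about 0.693
print(binary_cross_entropy([1, 0], [0.9, 0.1]))   # lower: confident and correct
```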


In [414]:
with torch.no_grad():
    y_pred = model(X_test_t)
loss_fn = loss(y_pred, y_test_t)
print("test loss =", loss_fn)

test loss = tensor(0.3409)
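
The other models are reported by accuracy, while the neural network is reported by loss only. To compare them on the same footing, the sigmoid outputs can be thresholded into class predictions. The helper below is a minimal sketch with a made-up name and an assumed threshold of 0.5.

```python
import numpy as np

def accuracy_from_probs(y_true, y_prob, threshold=0.5):
    """Fraction of rows where (probability >= threshold) matches the 0/1 label."""
    y_true = np.asarray(y_true).reshape(-1)
    y_hat = (np.asarray(y_prob).reshape(-1) >= threshold).astype(int)
    return float(np.mean(y_hat == y_true))

# Made-up example: predictions are 1, 0, 0, 1 against labels 1, 0, 1, 0.
print(accuracy_from_probs([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))   # 0.5
```

With the tensors from the cell above, `accuracy_from_probs(y_test_t.cpu().numpy(), y_pred.cpu().numpy())` would give the network's test accuracy.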
