#**Midterm project**

##**Describe the problem and how ML can help**
The **Adult (Census Income)** dataset from the UCI Machine Learning Repository contains demographic and employment data from the 1994 U.S. Census.
The goal is to predict whether a person earns more than `$50,000` per year based on features such as age, education, occupation, hours worked per week, and others.

This is a binary classification problem (`<=50K` vs. `>50K`).

Machine learning can help by:
- Automatically learning relationships between demographic factors and income level.
- Identifying key predictors of high income (e.g., education, occupation, work hours).
- Supporting data-driven decision making for social, economic, or marketing analyses.


Common algorithms used include **Logistic Regression**, **Decision Trees**, **Random Forests** and **XGBoost**.

In [22]:
import numpy as np
import pandas as pd
import seaborn as sns
import zipfile

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
%matplotlib inline

##**Load dataset**

In [23]:
!wget -O adult.csv https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

--2025-11-09 10:55:08--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘adult.csv’

adult.csv               [  <=>               ]   3.79M  17.5MB/s    in 0.2s    

2025-11-09 10:55:08 (17.5 MB/s) - ‘adult.csv’ saved [3974305]



In [24]:
columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]

df = pd.read_csv('adult.csv', header=None, names=columns, sep=',', engine='python')
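
The raw `adult.data` file separates fields with a comma followed by a space and marks unknown values with `?`, so a default parse keeps the leading spaces and treats `?` as an ordinary string. The sketch below runs on two illustrative rows held in memory rather than on the downloaded file, and shows one possible way to handle both with `skipinitialspace=True` and `na_values='?'`. The names `raw`, `df_naive`, and `df_clean` are made up for this example.

```python
import io
import pandas as pd

columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]

# Two illustrative rows in the raw file layout (", " separators, "?" for unknowns).
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

# Default parsing: " ?" is kept as a string, so nothing is counted as missing.
df_naive = pd.read_csv(io.StringIO(raw), header=None, names=columns)

# Strip the space after each comma and map "?" to NaN.
df_clean = pd.read_csv(io.StringIO(raw), header=None, names=columns,
                       skipinitialspace=True, na_values='?')

naive_missing = int(df_naive.isnull().sum().sum())
clean_missing = int(df_clean.isnull().sum().sum())
print(naive_missing, clean_missing)
```

If the same options were applied to the full file, a missing-value summary would be expected to report non-zero counts for any column that contains `?` entries.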

##**EDA**

In [25]:
df.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [26]:
pd.DataFrame({
    'dtype': df.dtypes,
    'missing_count': df.isnull().sum(),
    'missing_%': (df.isnull().sum() / len(df) * 100).round(2)
})

| column | dtype | missing_count | missing_% |
|---|---|---|---|
| age | int64 | 0 | 0.0 |
| workclass | object | 0 | 0.0 |
| fnlwgt | int64 | 0 | 0.0 |
| education | object | 0 | 0.0 |
| education-num | int64 | 0 | 0.0 |
| marital-status | object | 0 | 0.0 |
| occupation | object | 0 | 0.0 |
| relationship | object | 0 | 0.0 |
| race | object | 0 | 0.0 |
| sex | object | 0 | 0.0 |


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [28]:
df.describe(include='all', percentiles=[.01,.05,.25,.5,.75,.95,.99]).T

| column | count | unique | top | freq | mean | std | min | 1% | 5% | 25% | 50% | 75% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 32561.0 |  |  |  | 38.581647 | 13.640433 | 17.0 | 17.0 | 19.0 | 28.0 | 37.0 | 48.0 | 63.0 | 74.0 | 90.0 |
| workclass | 32561.0 | 9.0 | Private | 22696.0 |  |  |  |  |  |  |  |  |  |  |  |
| fnlwgt | 32561.0 |  |  |  | 189778.366512 | 105549.977697 | 12285.0 | 27185.8 | 39460.0 | 117827.0 | 178356.0 | 237051.0 | 379682.0 | 510072.0 | 1484705.0 |
| education | 32561.0 | 16.0 | HS-grad | 10501.0 |  |  |  |  |  |  |  |  |  |  |  |
| education-num | 32561.0 |  |  |  | 10.080679 | 2.57272 | 1.0 | 3.0 | 5.0 | 9.0 | 10.0 | 12.0 | 14.0 | 16.0 | 16.0 |
| marital-status | 32561.0 | 7.0 | Married-civ-spouse | 14976.0 |  |  |  |  |  |  |  |  |  |  |  |
| occupation | 32561.0 | 15.0 | Prof-specialty | 4140.0 |  |  |  |  |  |  |  |  |  |  |  |
| relationship | 32561.0 | 6.0 | Husband | 13193.0 |  |  |  |  |  |  |  |  |  |  |  |
| race | 32561.0 | 5.0 | White | 27816.0 |  |  |  |  |  |  |  |  |  |  |  |
| sex | 32561.0 | 2.0 | Male | 21790.0 |  |  |  |  |  |  |  |  |  |  |  |


##**Preparing the dataset**

In [29]:
# Create a binary target variable:
# 0 = income of at most 50K (<=50K), 1 = income above 50K (>50K)
df['income'] = df['income'].str.strip()
df['target'] = (df['income'] == '>50K').astype(int)
df = df.drop(['income'], axis=1)
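
The `str.strip()` call matters because the raw labels carry a leading space. As a minimal sketch on a hand-written Series (the values and the variable names `income`, `target_unstripped`, and `positive_share` are illustrative, not taken from the dataset), the comparison without stripping silently yields an all-zero target:

```python
import pandas as pd

# Illustrative labels with the leading space found in the raw file.
income = pd.Series([' <=50K', ' >50K', ' <=50K', ' >50K', ' <=50K'])

# Without stripping, no label equals '>50K', so every row becomes class 0.
target_unstripped = (income == '>50K').astype(int)

# With stripping, the positive class is encoded as intended.
target = (income.str.strip() == '>50K').astype(int)

# Share of positive examples, useful for judging class imbalance.
positive_share = float(target.mean())

print(target_unstripped.tolist())  # [0, 0, 0, 0, 0]
print(target.tolist())             # [0, 1, 0, 1, 0]
print(positive_share)              # 0.4
```

On the real data, `df['target'].mean()` gives the same kind of check on how imbalanced the two classes are.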

In [30]:
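# Hold out 20% for testing, then split the remaining 80% into 75% / 25%,
# which gives an overall 60% / 20% / 20% train / validation / test split.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)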

In [31]:
len(df_train), len(df_val), len(df_test)

(19536, 6512, 6513)
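
The split above is purely random. When the classes are imbalanced, one optional variation is to pass `stratify=` so that each subset keeps roughly the same share of positives. The sketch below uses synthetic data with an assumed 24% positive rate (a made-up figure for illustration) rather than the notebook's dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, 24% positives (assumed rate).
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))
y = np.zeros(n, dtype=int)
y[:240] = 1

# Same 60/20/20 scheme as above, but stratified on the target.
X_full, X_test, y_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.25, stratify=y_full, random_state=1)

print(len(y_train), len(y_val), len(y_test))  # 600 200 200
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Whether stratification changes the results noticeably on a dataset of this size is an empirical question; it mainly guards against an unlucky split.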

In [32]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [33]:
y_train = df_train.target.values
y_val = df_val.target.values
y_test = df_test.target.values

del df_train['target']
del df_val['target']
del df_test['target']

In [34]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

##**Logistic regression classifier**

In [35]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

In [36]:
# Predictions on the validation set (not scored in this notebook)
y_pred_val = model.predict(X_val)

In [37]:
y_pred = model.predict(X_test)
test_score = accuracy_score(y_test, y_pred)
print(f"Test score: {test_score:.2f}")

Test score: 0.81
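
An accuracy figure is easier to interpret next to a trivial baseline that always predicts the most frequent class. The helper below, `majority_baseline_accuracy`, is a hypothetical name introduced for this sketch, and the label array is a toy example rather than the notebook's `y_test`.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of a classifier that always predicts the most frequent class (binary 0/1 labels)."""
    y = np.asarray(y)
    positive_share = y.mean()
    return float(max(positive_share, 1.0 - positive_share))

# Toy labels: three negatives and one positive.
y_example = np.array([0, 0, 0, 1])
baseline = majority_baseline_accuracy(y_example)
print(baseline)  # 0.75
```

In the notebook, calling `majority_baseline_accuracy(y_test)` would give the reference value against which the reported test scores can be compared.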


##**XGBoost classifier**

In [38]:
model = XGBClassifier(n_estimators=500,
                      learning_rate=0.05,
                      eval_metric="logloss",
                      early_stopping_rounds=5,
                      n_jobs=-1)

In [39]:
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          verbose=False)
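
With `early_stopping_rounds=5` and an `eval_set`, boosting stops once the validation log loss has failed to improve for five consecutive rounds, so fewer than the 500 requested trees may be built. The pure-Python function below is a simplified illustration of that stopping rule, not XGBoost's actual implementation; `simulate_early_stopping` and the loss values are invented for the example.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (best_round, last_round) for a sequence of validation losses.

    Training is considered stopped once `patience` consecutive rounds
    have passed without a new best (lowest) loss.
    """
    best_loss = float("inf")
    best_round = -1
    last_round = -1
    for round_idx, loss in enumerate(val_losses):
        last_round = round_idx
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            break
    return best_round, last_round

# Invented validation losses: the best value occurs at round 3.
losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.45, 0.48, 0.49, 0.43, 0.42]
best_round, last_round = simulate_early_stopping(losses, patience=5)
print(best_round, last_round)  # 3 8
```

Note that the later improvements at rounds 9 and 10 are never reached, which is the usual trade-off of a small patience value.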

In [40]:
y_pred = model.predict(X_test)
test_score = accuracy_score(y_test, y_pred)
print(f"Test score: {test_score:.2f}")

Test score: 0.87
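
Accuracy alone can hide how well the minority `>50K` class is predicted. As a sketch of a possible complementary evaluation, the block below computes a confusion matrix, precision, and recall with scikit-learn on small hand-written label arrays. These arrays are illustrative only and are not the notebook's predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hand-written labels: 6 negatives and 4 positives.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)

print(cm.tolist())  # [[5, 1], [2, 2]]
print(acc, precision, recall)
```

Here the accuracy is 0.7 even though only half of the positive examples are recovered, which is the kind of gap these metrics are meant to expose. The same calls could be applied to `y_test` and `y_pred` from either model.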
