# Optimization of Fairness–Accuracy Trade-offs in Gradient-Based Classification Models

**Course:** COSC 3P99 – Independent Research Project  
**Student:** David Shodipo  
**Supervisor:** Dr. Blessing Ogbuokiri  
**Term:** Winter 2026  


## Project Overview

Machine learning classification models are commonly trained with gradient-based optimization, with predictive accuracy as the primary objective. However, optimizing for accuracy alone can leave a model performing better for some demographic groups than for others, which raises fairness and ethical concerns.

This project investigates the **trade-off between predictive accuracy and fairness** in gradient-based classification models by introducing fairness-aware regularization during training. The study focuses on how optimization parameters such as learning rate, training duration, and fairness regularization strength influence both accuracy and group fairness metrics.
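As a rough illustration of what fairness-aware regularization could look like, the sketch below adds a demographic-parity penalty to the log loss of a logistic regression trained by plain gradient descent. The function name `train_fair_logreg`, the squared-gap penalty, the synthetic data, and the parameters `lam`, `lr`, and `epochs` are all assumptions made for illustration; they are not necessarily the formulation this project ends up using.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_fair_logreg(X, y, s, lam=0.0, lr=0.1, epochs=500):
    """Gradient descent on: mean log loss + lam * (gap in mean score between groups)^2.

    X: (n, d) features, y: (n,) labels in {0, 1}, s: (n,) binary group indicator.
    Hypothetical helper for illustration only.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    g1, g0 = (s == 1), (s == 0)
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        # Gradient of the mean log loss
        err = p - y
        grad_w = X.T @ err / n
        grad_b = err.mean()
        # Demographic-parity surrogate: gap in mean predicted probability
        gap = p[g1].mean() - p[g0].mean()
        dp = p * (1.0 - p)  # derivative of the sigmoid w.r.t. its input
        dgap_w = (X[g1] * dp[g1][:, None]).mean(axis=0) - (X[g0] * dp[g0][:, None]).mean(axis=0)
        dgap_b = dp[g1].mean() - dp[g0].mean()
        # Chain rule for lam * gap^2
        grad_w += lam * 2.0 * gap * dgap_w
        grad_b += lam * 2.0 * gap * dgap_b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data (illustrative only): group membership shifts the first feature
rng = np.random.default_rng(2026)
n = 2000
s = rng.integers(0, 2, size=n)
X = np.column_stack([rng.normal(size=n) + 1.0 * s, rng.normal(size=n)])
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

def dp_gap(w, b):
    """Absolute gap in mean predicted probability between the two groups."""
    p = sigmoid(X @ w + b)
    return abs(p[s == 1].mean() - p[s == 0].mean())

w0, b0 = train_fair_logreg(X, y, s, lam=0.0)  # accuracy-only baseline
w1, b1 = train_fair_logreg(X, y, s, lam=5.0)  # with fairness penalty
print("gap without penalty:", round(dp_gap(w0, b0), 3))
print("gap with penalty:   ", round(dp_gap(w1, b1), 3))
```

Sweeping `lam`, `lr`, and `epochs` in a sketch like this is one way to trace how the optimization parameters move a model along the fairness–accuracy trade-off.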


## Research Questions

This project aims to answer the following questions:

1. How do fairness constraints affect predictive accuracy in gradient-based classifiers?
2. How do optimization parameters influence the fairness–accuracy trade-off?
3. Are fairness effects consistent across different application domains/groups?



## Objectives

- Train baseline accuracy-only classifiers
- Introduce fairness-aware regularization into the loss function
- Measure changes in:
  - Accuracy / AUC
  - Demographic Parity Difference
  - Equal Opportunity Difference
- Visualize and interpret fairness–accuracy trade-offs
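The two group fairness metrics listed above can be sketched directly in NumPy. This assumes a binary sensitive attribute coded as 0/1 and hard 0/1 predictions, and uses one common definition of each metric: the absolute gap in positive-prediction rates (demographic parity) and the absolute gap in true positive rates (equal opportunity). The function names and the toy arrays are illustrative, not taken from the project code.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rates between the two groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_0 = y_pred[group == 0].mean()
    rate_1 = y_pred[group == 1].mean()
    return abs(rate_0 - rate_1)

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute gap in true positive rates between the two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = []
    for g in (0, 1):
        # Restrict to members of group g whose true label is positive
        positives = (group == g) & (y_true == 1)
        tpr.append(y_pred[positives].mean())
    return abs(tpr[0] - tpr[1])

# Toy example (made-up labels and predictions)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

dpd = demographic_parity_difference(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print(f"DPD={dpd:.3f}  EOD={eod:.3f}")
```

A value of 0 for either metric means the two groups are treated identically under that criterion; larger values indicate a larger disparity.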


## Datasets

Two publicly available datasets are used in this study:

### 1. Healthcare Dataset
- **Dataset:** UCI Heart Disease Dataset
- **Task:** Predict presence of heart disease
- **Sensitive Attribute:** Sex (optional extension: age group)

### 2. Non-Healthcare Dataset
- **Dataset:** UCI Adult Income Dataset
- **Task:** Predict whether income exceeds $50K (as defined in the dataset documentation)
- **Sensitive Attribute:** Sex or Race


In [7]:
# Libraries
%pip install pandas numpy matplotlib seaborn scikit-learn
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, roc_curve




In [8]:
# Reproducibility
np.random.seed(2026)  # Fix the seed so experiments give consistent, reproducible results across runs
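A global seed depends on the order in which cells are executed, so it can help to also pass the seed explicitly to any routine that accepts one. The sketch below assumes a hypothetical constant `SEED` and made-up demo arrays, and checks that two splits made with the same `random_state` are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 2026  # hypothetical constant; same value as the global seed above

# Made-up demo data drawn from a seeded generator
rng = np.random.default_rng(SEED)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Passing random_state makes the split independent of global RNG state
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=SEED)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=SEED)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
```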

## Week 1: Problem Setup and Baseline Models

The goal of Week 1 is to clean and preprocess both datasets and to train baseline models that optimize only for accuracy using standard log loss. These models will serve as the benchmark for measuring the effect of the fairness constraints added later in the project.
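A minimal sketch of such an accuracy-only baseline is shown here, using the scikit-learn classes imported at the top of the notebook. The toy DataFrame, its column values, and the variable names (`X_toy`, `baseline`, and so on) are assumptions for illustration; the real pipeline would be fitted on the cleaned datasets instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a cleaned dataset (all values are synthetic)
rng = np.random.default_rng(2026)
n = 400
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "hours-per-week": rng.integers(10, 60, size=n),
    "sex": rng.choice(["Male", "Female"], size=n),
})
# Synthetic label loosely tied to the numeric features
score = 0.04 * (toy["age"] - 40) + 0.05 * (toy["hours-per-week"] - 35)
toy["income"] = np.where(score + rng.normal(scale=0.5, size=n) > 0, ">50K", "<=50K")

X_toy = toy.drop(columns="income")
y_toy = (toy["income"] == ">50K").astype(int)

# Scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# Logistic regression minimizes (regularized) log loss with no fairness term
baseline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=2026
)
baseline.fit(X_train, y_train)

proba = baseline.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, baseline.predict(X_test))
auc = roc_auc_score(y_test, proba)
print(f"accuracy={acc:.3f}  AUC={auc:.3f}")
```

Keeping preprocessing inside the `Pipeline` means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the baseline.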

### Installing the ucimlrepo package

In [9]:
%pip install ucimlrepo
# To fetch datasets from the UCI Machine Learning Repository
from ucimlrepo import fetch_ucirepo





In [None]:
# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features # type: ignore
y = adult.data.targets # type: ignore

# Combine features and target into a single DataFrame for easier manipulation
data = pd.concat([X, y], axis=1)

# Preview the first five rows
display(data.head())

data.shape


|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


(48842, 15)

In [15]:
# Print DataFrame info. info() prints its summary directly and returns None,
# so it is called without display() to avoid a stray "None" in the output.
# Note: the execution counter suggests this cell was run after the cleaning
# step, which would explain 48790 rows here versus the 48842 loaded above.
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 48790 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48790 non-null  int64 
 1   workclass       45995 non-null  object
 2   fnlwgt          48790 non-null  int64 
 3   education       48790 non-null  object
 4   education-num   48790 non-null  int64 
 5   marital-status  48790 non-null  object
 6   occupation      45985 non-null  object
 7   relationship    48790 non-null  object
 8   race            48790 non-null  object
 9   sex             48790 non-null  object
 10  capital-gain    48790 non-null  int64 
 11  capital-loss    48790 non-null  int64 
 12  hours-per-week  48790 non-null  int64 
 13  native-country  47934 non-null  object
 14  income          48790 non-null  object
dtypes: int64(6), object(9)
memory usage: 6.0+ MB

In [12]:
import pandas as pd
import numpy as np

# Cleaning the data

def clean_adult_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean the Adult Income dataset by standardizing strings,
    handling missing values, fixing label formatting,
    and removing duplicate records.
    """
    # Work on a copy to avoid modifying the original DataFrame
    df = df.copy()

    # Identify all categorical (string) columns
    categorical_cols = df.select_dtypes(include="object").columns

    # Strip leading/trailing whitespace from string columns
    for col in categorical_cols:
        df[col] = df[col].str.strip()

    # Convert Adult dataset's '?' placeholder to proper missing values
    df = df.replace("?", np.nan)

    # Fix income labels in the test set (e.g., '>50K.' -> '>50K')
    if "income" in df.columns:
        df["income"] = df["income"].str.replace(".", "", regex=False)

    # Remove exact duplicate rows
    df = df.drop_duplicates()

    return df
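To see what each cleaning step does, the toy frame below (made-up values) applies the same four operations inline: stripping whitespace, converting `?` to missing values, removing the trailing period from income labels, and dropping duplicates. Selecting `"string"` alongside `"object"` is an added assumption so the sketch also covers pandas versions that infer a dedicated string dtype.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the quirks handled by the cleaning function
raw = pd.DataFrame({
    "workclass": [" Private", "?", " Private", " State-gov "],
    "age": [39, 50, 39, 28],
    "income": ["<=50K", ">50K.", "<=50K.", ">50K"],
})

cleaned = raw.copy()

# 1. Strip leading/trailing whitespace from text columns
for col in cleaned.select_dtypes(include=["object", "string"]).columns:
    cleaned[col] = cleaned[col].str.strip()

# 2. Turn the '?' placeholder into a proper missing value
cleaned = cleaned.replace("?", np.nan)

# 3. Remove the trailing period so '>50K.' and '>50K' become one label
cleaned["income"] = cleaned["income"].str.replace(".", "", regex=False)

# 4. Drop exact duplicates (rows 0 and 2 are identical after steps 1 and 3)
cleaned = cleaned.drop_duplicates()

print(cleaned)
```

The order matters in this sketch: duplicates are dropped last because some rows only become identical once whitespace and label formatting have been normalized.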


In [14]:
data = clean_adult_df(data)
# Preview the cleaned data
display(data.head())

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
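The first rows alone do not reveal missing values, although the `info()` summary reports fewer non-null entries for `workclass`, `occupation`, and `native-country`. A per-column count is one way to inspect this; the sketch below runs on a made-up frame, and the names `toy` and `missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "occupation": ["Sales", np.nan, np.nan, "Tech-support"],
    "age": [39, 50, 38, 53],
})

# Count and percentage of missing values per column, worst first
missing = (
    toy.isna().sum()
    .to_frame("n_missing")
    .assign(pct_missing=lambda t: 100 * t["n_missing"] / len(toy))
    .sort_values("n_missing", ascending=False)
)
print(missing)
```

The same expression applied to the cleaned Adult DataFrame would show which columns need imputation or row removal before the baseline models are trained.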
