---
title: Auditing Bias
author: Evan Flaks
date: '2025-03-09'
image: "bias.jpeg"
description: "Building a model that predicts employment status and auditing racial bias"
format: html
---

# Abstract

In this project, I train a machine learning model to predict the employment status of Maryland residents from non-racial demographic features, and then audit the model for racial bias.

# Downloading Data

First, we download Public Use Microdata Sample (PUMS) data for the state of Maryland using the `folktables` package.

In [27]:
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MD"

data_source = ACSDataSource(survey_year='2023', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()

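|  | RT | SERIALNO | DIVISION | SPORDER | PUMA | REGION | STATE | ADJINC | PWGTP | AGEP | ... | PWGTP71 | PWGTP72 | PWGTP73 | PWGTP74 | PWGTP75 | PWGTP76 | PWGTP77 | PWGTP78 | PWGTP79 | PWGTP80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P | 2023GQ0000068 | 5 | 1 | 201 | 3 | 24 | 1019518 | 27 | 62 | ... | 27 | 29 | 24 | 25 | 29 | 27 | 25 | 29 | 28 | 28 |
| 1 | P | 2023GQ0000079 | 5 | 1 | 502 | 3 | 24 | 1019518 | 13 | 21 | ... | 13 | 26 | 17 | 15 | 13 | 32 | 0 | 2 | 2 | 13 |
| 2 | P | 2023GQ0000088 | 5 | 1 | 1400 | 3 | 24 | 1019518 | 25 | 35 | ... | 32 | 25 | 70 | 54 | 42 | 24 | 2 | 34 | 49 | 23 |
| 3 | P | 2023GQ0000093 | 5 | 1 | 1300 | 3 | 24 | 1019518 | 19 | 61 | ... | 14 | 19 | 23 | 27 | 2 | 22 | 27 | 3 | 20 | 2 |
| 4 | P | 2023GQ0000100 | 5 | 1 | 802 | 3 | 24 | 1019518 | 106 | 73 | ... | 136 | 206 | 9 | 15 | 106 | 209 | 115 | 210 | 211 | 15 |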


# Data Cleaning

The dataset has a *lot* of features. For our modeling task, we will consider only the following candidate features:

In [28]:
possible_features=['AGEP', 'SCHL', 'MAR', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()

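|  | AGEP | SCHL | MAR | DIS | ESP | CIT | MIG | MIL | ANC | NATIVITY | DEAR | DEYE | DREM | SEX | RAC1P | ESR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 17.0 | 5 | 1 | NaN | 1 | 1.0 | 4.0 | 4 | 1 | 2 | 1 | 1.0 | 1 | 1 | 6.0 |
| 1 | 21 | 19.0 | 5 | 2 | NaN | 1 | 1.0 | 4.0 | 2 | 1 | 2 | 2 | 2.0 | 2 | 9 | 6.0 |
| 2 | 35 | 16.0 | 5 | 2 | NaN | 1 | 1.0 | 4.0 | 1 | 1 | 2 | 2 | 2.0 | 1 | 2 | 6.0 |
| 3 | 61 | 18.0 | 3 | 1 | NaN | 1 | 3.0 | 4.0 | 4 | 1 | 2 | 2 | 1.0 | 1 | 1 | 6.0 |
| 4 | 73 | 13.0 | 5 | 1 | NaN | 1 | 1.0 | 4.0 | 4 | 1 | 2 | 2 | 2.0 | 1 | 1 | 6.0 |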


For documentation on what these features mean, please consult the appendix of [this paper](https://arxiv.org/pdf/2108.04884) that introduced the package.

The features I will train my model on are educational attainment (SCHL), employment status of parents (ESP), mobility status (MIG), and age (AGEP). I will use these features to predict employment status (ESR).

In [29]:
features_to_use = ["SCHL","ESP","MIG","AGEP"]

Now we can construct a `BasicProblem` that uses these features to predict employment status (ESR), with race (RAC1P) as the group label.

In [30]:
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, nan=-1),  # fill missing values with -1 (the second positional argument is `copy`, not the fill value)
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)
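Two details of this problem definition are easy to miss: `target_transform` turns the ESR codes into a boolean label, and `postprocess` relies on `np.nan_to_num`, whose second positional parameter is `copy` rather than the fill value. The sketch below illustrates both on a small made-up array. The ESR values here are hypothetical stand-ins, not rows from the Maryland data.

```python
import numpy as np

# Hypothetical ESR-like values: 1 means employed; NaN marks a missing entry.
esr = np.array([1.0, 6.0, np.nan, 3.0, 1.0])

# Same rule as target_transform: True only where ESR equals 1.
# NaN compares unequal to everything, so missing entries become False.
is_employed = esr == 1
print(is_employed)

# Positional form: -1 is taken as the `copy` argument, so NaN is filled
# with the default value 0.0.
filled_positional = np.nan_to_num(esr, -1)

# Keyword form: NaN is filled with -1.
filled_keyword = np.nan_to_num(esr, nan=-1)

print(filled_positional)
print(filled_keyword)
```

Whether missing values end up as 0 or -1 matters for a model only if those values would otherwise collide with a real category code, so it is worth checking which one the pipeline actually produces.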

Before we touch the data any further, we should perform a train-test split:

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

# Basic Descriptives

Now we want to answer some basic questions about the data we are working with. We can answer these questions by turning our training data into a pandas `DataFrame` for easy analysis.

In [32]:
import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["group"] = group_train
df["label"] = y_train

1. How many individuals are in the data?

In [41]:
num_rows = df.shape[0]
print(f"Number of people in the data: {num_rows}")


Number of people in the data: 49624


2. Of these individuals, what proportion are employed?
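Since the label is boolean, the proportion employed is simply the mean of the `label` column. Here is a minimal sketch of that calculation. The helper name `employment_rate` and the toy frame are my own illustrative assumptions; in the notebook, the training `DataFrame` built above would be passed in instead.

```python
import pandas as pd


def employment_rate(frame: pd.DataFrame, label_col: str = "label") -> float:
    """Return the fraction of rows whose boolean label is True."""
    return float(frame[label_col].mean())


# Toy stand-in with the same "group" and "label" columns (hypothetical values).
toy = pd.DataFrame(
    {
        "group": [1, 1, 2, 2, 6],
        "label": [True, False, True, True, False],
    }
)

rate = employment_rate(toy)
print(f"Proportion employed: {rate:.2f}")
```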

3. Of these individuals, how many are in each racial group?

In [43]:
race_counts = df['group'].value_counts()  # race (RAC1P) is stored in the "group" column
print(race_counts)
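Race is stored in the `group` column of the frame as numeric RAC1P codes, so the counts are easier to read with labels attached. The sketch below maps codes to names before printing. The label text paraphrases my reading of the PUMS data dictionary and should be checked against the official one; the helper name and toy frame are illustrative assumptions.

```python
import pandas as pd

# RAC1P code labels, paraphrased from the PUMS data dictionary (verify before relying on them).
RAC1P_LABELS = {
    1: "White alone",
    2: "Black or African American alone",
    3: "American Indian alone",
    4: "Alaska Native alone",
    5: "American Indian and Alaska Native tribes specified, or not specified and no other races",
    6: "Asian alone",
    7: "Native Hawaiian and Other Pacific Islander alone",
    8: "Some Other Race alone",
    9: "Two or More Races",
}


def labeled_group_counts(frame: pd.DataFrame, group_col: str = "group") -> pd.Series:
    """Count rows per group and replace numeric codes with readable labels."""
    counts = frame[group_col].value_counts()
    counts.index = counts.index.map(
        lambda code: RAC1P_LABELS.get(int(code), f"code {code}")
    )
    return counts


# Toy stand-in for the training frame (hypothetical values).
toy = pd.DataFrame({"group": [1, 1, 2, 6, 9]})
labeled = labeled_group_counts(toy)
print(labeled)
```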

4. In each group, what proportion of individuals have target label equal to 1?
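One way to answer this is a group-by on the `group` column followed by the mean of the boolean label. The sketch below also reports each group's size, because a rate computed from a handful of people is much noisier than one computed from thousands. The helper name and toy values are hypothetical.

```python
import pandas as pd


def positive_rate_by_group(
    frame: pd.DataFrame, group_col: str = "group", label_col: str = "label"
) -> pd.DataFrame:
    """Per-group share of positive labels, alongside the group size."""
    return frame.groupby(group_col)[label_col].agg(rate="mean", n="size")


# Toy stand-in for the training frame (hypothetical values).
toy = pd.DataFrame(
    {
        "group": [1, 1, 2, 2, 6],
        "label": [True, False, True, True, False],
    }
)

summary = positive_rate_by_group(toy)
print(summary)
```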

5. Now I will look for some intersectional trends.
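As a sketch of what such an analysis might look like, the code below crosses race with an age band and reports the employment rate in each cell. Crossing with age is my own assumption, chosen only because AGEP is one of the columns in the training frame; the band edges, helper name, and toy values are likewise hypothetical.

```python
import pandas as pd


def intersectional_rates(
    frame: pd.DataFrame,
    group_col: str = "group",
    age_col: str = "AGEP",
    label_col: str = "label",
) -> pd.DataFrame:
    """Employment rate for each (group, age band) combination."""
    # Left-closed bins: [0, 25), [25, 55), [55, 120). The edges are illustrative.
    bands = pd.cut(
        frame[age_col],
        bins=[0, 25, 55, 120],
        right=False,
        labels=["under 25", "25-54", "55+"],
    )
    return (
        frame.assign(age_band=bands)
        .groupby([group_col, "age_band"], observed=True)[label_col]
        .mean()
        .unstack("age_band")
    )


# Toy stand-in for the training frame (hypothetical values).
toy = pd.DataFrame(
    {
        "group": [1, 1, 1, 2, 2, 2],
        "AGEP": [20, 40, 70, 22, 45, 60],
        "label": [False, True, False, False, True, True],
    }
)

rates = intersectional_rates(toy)
print(rates)
```

Cells with very few people deserve caution, so it helps to look at the cell counts alongside the rates before reading much into any single difference.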