# Check for Bias in Datasets

In this notebook, we check for some common fairness evaluation metrics in two new datasets that we think may be good options to stress test our selected algorithms on. We are aiming to stress test these algorithms to see how good they are at reducing bias.

We use disparate impact as the key metric for how fair the data inherently. It is represented as $\frac{P(Y = 1 | unpriviledged\;group)}{P(Y = 1 | priviledged\;group)}$, where $Y=1$ represents the probability of receiving the positive outcome. We use the AIF360 package to calculate this metric.

- The first dataset is the adult dataset, which predicts whether someone's income exceeds $50k based on census data. The sensitive variables in this dataset are race, sex, marital status, age, native country. This dataset has a pretty comprehensive set of protected variables that we can test for bias with, but the problem with the data is that the target variable – assessing income – differs quite a bit from the objective of our clients. 

- The second dataset we are testing is the PDKK dataset, which is a credit lending dataset using Brazilian company data. The main protected variable here is sex, we have a marital status variable but there's no information on how it's encoded so it's pretty much useless to us. 

- The third datset we are testing is the 2018 Home Mortgage Disclosure Act data. It has a lot of information on the mortage themselves and credit characteristics of the borrowers as well as the borrowers' races, genders and age. 

In [1]:
import pandas as pd
import numpy as np
# Dataset repo from UCI
from ucimlrepo import fetch_ucirepo 
from aif360.datasets import AdultDataset, StandardDataset
# Fairness metrics
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.metrics import ClassificationMetric
# Explainers
from aif360.explainers import MetricTextExplainer

pip install 'aif360[LawSchoolGPA]'
pip install 'aif360[Reductions]'
pip install 'aif360[Reductions]'
pip install 'aif360[Reductions]'


### Adult Dataset

In [30]:
# fetch dataset 
adult = AdultDataset()  



Here, we build a binary label fairness metric object using the AIF 360 package and fit it using the Adult dataset. 

In [31]:
### Get disparate impact on the intersection of race and sex
adult_fairness_metric = BinaryLabelDatasetMetric(
    adult,
    unprivileged_groups=[{"race":0, "sex":0}],
    privileged_groups=[{"race":1, "sex":1}]
)
print(
    "Disparate Impact based on intersection of race and sex: " +
    f"{adult_fairness_metric.disparate_impact():.3f}.")

### Get disparate impact based on race only
adult_fairness_metric = BinaryLabelDatasetMetric(
    adult,
    unprivileged_groups=[{"race":0}],
    privileged_groups=[{"race":1}]
)
print(
    "Disparate Impact based on race alone: " +
    f"{adult_fairness_metric.disparate_impact():.3f}.")

### Get disparate impact based on sex only
adult_fairness_metric = BinaryLabelDatasetMetric(
    adult,
    unprivileged_groups=[{"sex":0}],
    privileged_groups=[{"sex":1}]
)
print(
    "Disparate Impact based on sex alone: " +
    f"{adult_fairness_metric.disparate_impact():.3f}.")

Disparate Impact based on intersection of race and sex: 0.235.
Disparate Impact based on race alone: 0.604.
Disparate Impact based on sex alone: 0.363.


### PDKK Dataset

In [55]:
pdkk = pd.read_table(
    "../data/PAKDD2010_Modeling_Data.txt", delimiter='\t', encoding='latin1', 
    names=['ID_CLIENT', 'CLERK_TYPE', 'PAYMENT_DAY', 'APPLICATION_SUBMISSION_TYPE', 
    'QUANT_ADDITIONAL_CARDS', 'POSTAL_ADDRESS_TYPE', 'SEX', 'MARITAL_STATUS', 
    'QUANT_DEPENDANTS', 'EDUCATION_LEVEL', 'STATE_OF_BIRTH', 'CITY_OF_BIRTH', 
    'NACIONALITY', 'RESIDENCIAL_STATE', 'RESIDENCIAL_CITY', 'RESIDENCIAL_BOROUGH', 
    'FLAG_RESIDENCIAL_PHONE', 'RESIDENCIAL_PHONE_AREA_CODE', 'RESIDENCE_TYPE', 
    'MONTHS_IN_RESIDENCE', 'FLAG_MOBILE_PHONE', 'FLAG_EMAIL', 
    'PERSONAL_MONTHLY_INCOME', 'OTHER_INCOMES', 'FLAG_VISA', 'FLAG_MASTERCARD', 
    'FLAG_DINERS', 'FLAG_AMERICAN_EXPRESS', 'FLAG_OTHER_CARDS', 
    'QUANT_BANKING_ACCOUNTS', 'QUANT_SPECIAL_BANKING_ACCOUNTS', 
    'PERSONAL_ASSETS_VALUE', 'QUANT_CARS', 'COMPANY', 'PROFESSIONAL_STATE', 
    'PROFESSIONAL_CITY', 'PROFESSIONAL_BOROUGH', 'FLAG_PROFESSIONAL_PHONE', 
    'PROFESSIONAL_PHONE_AREA_CODE', 'MONTHS_IN_THE_JOB', 'PROFESSION_CODE', 
    'OCCUPATION_TYPE', 'MATE_PROFESSION_CODE', 'EDUCATION_LEVEL_MATE', 
    'FLAG_HOME_ADDRESS_DOCUMENT', 'FLAG_RG', 'FLAG_CPF', 'FLAG_INCOME_PROOF', 
    'PRODUCT', 'FLAG_ACSP_RECORD', 'AGE', 'RESIDENCIAL_ZIP_3', 'PROFESSIONAL_ZIP_3', 'TARGET_LABEL'])

# Clean the sex column by removing the Ns and empty entries
pdkk = pdkk.loc[(pdkk["SEX"] != "N") & (pdkk["SEX"] != " ")]

  pdkk = pd.read_table(


In [64]:
### Instantiate the PDKK dataset as a BinaryLabelDataset

# Protected attributes
protected_attributes = ['SEX']
# We keep certain features 
selected_features = ['SEX', 'AGE', 'TARGET_LABEL']
# Priviledged classes
privileged_classes = [["M"]]
# Favorble Target label
favorable_target_label = [0]
# Create the binary label dataset object
pdkk_dataset = StandardDataset(
    df = pdkk,
    label_name = "TARGET_LABEL",
    favorable_classes = favorable_target_label,
    protected_attribute_names = protected_attributes,
    privileged_classes = privileged_classes,
    features_to_keep = selected_features
)

In [66]:
privileged_group = [{"SEX": 1}]
unprivileged_group = [{"SEX": 0}]

pdkk_metric = BinaryLabelDatasetMetric(
    pdkk_dataset,
    unprivileged_groups=unprivileged_group,
    privileged_groups=privileged_group
)

pdkk_metric.disparate_impact()

1.028081281563987

Not much disparate impact there fortunately for the brazillian population and unfortunately for our use case.

### HMDA

This dataset is from the 2018 Home Mortgage Disclosure Act (HMDA). 

Filtering Out Specific Loans:

- Loans that were withdrawn by the applicants are removed. This is because withdrawn applications do not provide information about the lending decision.
- Loans that were closed due to incompleteness are also removed, as these did not reach a decision point due to missing information.
- Loans purchased by other institutions are excluded because the focus is on original lending decisions, not secondary market transactions.

Mapping Action Outcomes:

- The action_taken variable is recoded to simplify the analysis. Loans with outcomes of 1 (loan originated), 2 (application approved but not accepted), and 8 (preapproval request approved but not accepted) are mapped to 1, indicating approval.
- Outcomes of 3 (application denied by financial institution) and 7 (preapproval request denied by financial institution) are mapped to 0, indicating denial.
This step effectively transforms the action_taken variable into a binary indicator of loan approval status.

Recode the Race Variable:

- Observations where the race is not available or is listed as "Joint" are removed to focus on applications with clear racial identification. 
- The race variable is then recoded into a binary format, where "White" is coded as 0 and all other non-white races are combined and coded as 1. This simplifies the analysis to a comparison between White and Non-White applicants.

Recode the Sex Variable:

- Similar to race, observations where sex is not available or listed as "Joint" are removed.
- The sex variable is recoded into a binary format, with "Male" coded as 0 and "Female" as 1, facilitating gender-based analysis.


In [23]:
# Load in data
hmda = pd.read_csv("../data/HMDA 2022 New Jersey.csv")

# Take out loans that were withdrawn by applicants
hmda = hmda.loc[hmda["action_taken"] != 4]
# Take out loans that were closed for incompleteness
hmda = hmda.loc[hmda["action_taken"] != 5]
# Take out loans that were purchased by other institutions
hmda = hmda.loc[hmda["action_taken"] != 6]

# Map action outcomes 1, 2, 8 as approved
hmda["action_taken"] = hmda["action_taken"].replace([1, 2, 8], 1)
hmda["action_taken"] = hmda["action_taken"].replace([3, 7], 0)

# Recode the race variable
hmda = hmda.loc[hmda["derived_race"] != "Race Not Available"]
hmda = hmda.loc[hmda["derived_race"] != "Joint"]

# Replace all non-white observations as non-white
hmda.loc[hmda["derived_race"] != "White", "derived_race"] = "Non-White"
hmda["derived_race"] = hmda["derived_race"].map({"White": 0, "Non-White": 1})

# Recode the sex variable
hmda = hmda.loc[hmda["derived_sex"] != "Sex Not Available"]
hmda = hmda.loc[hmda["derived_sex"] != "Joint"]

# Encode into 0 and 1
hmda["derived_sex"] = hmda["derived_sex"].map({"Male": 0, "Female": 1})

print(f"The resulting shape of the dataframe is {hmda.shape}")

  hmda = pd.read_csv("../data/HMDA 2022 New Jersey.csv")


The resulting shape of the dataframe is (136442, 99)


In [24]:
## Build a standard dataset with the HDMA data

# Protected attributes
protected_attributes = ['derived_sex', 'derived_race']
# We keep certain features 
selected_features = ['derived_sex', 'derived_race', 'action_taken']
# Priviledged classes
privileged_classes = [[0,0]]
# Favorble Target label
favorable_target_label = [1]
# Create the binary label dataset object
hmda_dataset = StandardDataset(
    df = hmda,
    label_name = "action_taken",
    favorable_classes = favorable_target_label,
    protected_attribute_names = protected_attributes,
    privileged_classes = privileged_classes,
    features_to_keep = selected_features
)

In [28]:
hmda_sex_and_race = BinaryLabelDatasetMetric(
    hmda_dataset,
    unprivileged_groups=[{"derived_sex": 0, "derived_race": 0}],
    privileged_groups=[{"derived_sex": 1, "derived_race": 1}]
)

print(
    "Disparate Impact based on intersection of race and sex: " +
    f"{hmda_sex_and_race.disparate_impact():.3f}.")

hmda_sex = BinaryLabelDatasetMetric(
    hmda_dataset,
    unprivileged_groups=[{"derived_sex": 0}],
    privileged_groups=[{"derived_sex": 1}]
)

print(
    "Disparate Impact based on sex alone: " +
    f"{hmda_sex.disparate_impact():.3f}.")

hmda_race = BinaryLabelDatasetMetric(
    hmda_dataset,
    unprivileged_groups=[{"derived_race": 0}],
    privileged_groups=[{"derived_race": 1}]
)

print(
    "Disparate Impact based on race alone: " +
    f"{hmda_race.disparate_impact():.3f}.")

Disparate Impact based on intersection of race and sex: 1.083.
Disparate Impact based on sex alone: 0.993.
Disparate Impact based on race alone: 1.086.
