<center><img src="./images/logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# ML through Application 
## Module 3, Lab 4, Notebook 1: Bias Mitigation during Preprocessing

This notebook shows how to implement reweighting as a bias mitigation method to use before a model is trained.

You will learn how to implement a bias mitigation step by using reweighting (manually and using a fairness Python library). The Python library you will use calls this process "reweighing" instead of "reweighting" (but uses the same technique as described in the course).

__Dataset:__ 
You will use [Folktables](https://github.com/zykls/folktables) to download a dataset for this lab. Folktables provides an API to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files, which the U.S. Census Bureau manages. The data itself is governed by the terms of use that are provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html). 

You will filter the ACS PUMS data sample to include only individuals who are above the age of 16, reported usual working hours of at least 1 hour per week in the past year, and have an income of at least \\$100. 
The threshold of \\$50,000 was chosen so that this dataset can serve as a comparable replacement for the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult), but the income threshold can be changed easily to define new prediction tasks. Historically, the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult) served as the basis for the development and comparison of many algorithmic fairness interventions, but it has limited documentation, outdated feature encodings, and only a binary target label, which can lead to misrepresentations for certain subpopulations. To compare your results with scientific findings that use the UCI Adult dataset, and to have greater control and flexibility in setting up the problem, you will use the ACS PUMS data with the filters and thresholds described above.
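
To make the filters and the target definition concrete, here is a minimal sketch on a toy DataFrame. The values are made up, the column names `AGEP`, `WKHP`, and `PINCP` are taken from the feature list used later in this notebook, and the conditions follow the description above rather than Folktables' own filter code.

```python
import pandas as pd

# Toy records using ACS PUMS column names (values are made up for illustration)
toy = pd.DataFrame(
    {
        "AGEP": [15, 30, 45, 52, 23],  # age
        "WKHP": [10, 40, 0, 38, 20],  # usual hours worked per week
        "PINCP": [500, 62000, 30000, 48000, 50],  # total income
    }
)

# Filters as described in the text:
# older than 16, at least 1 working hour per week, income of at least 100
mask = (toy["AGEP"] > 16) & (toy["WKHP"] >= 1) & (toy["PINCP"] >= 100)
filtered = toy[mask].copy()

# Binary target: income above 50,000
filtered[">50k"] = filtered["PINCP"] > 50000
print(filtered)
```

Only the second and fourth records pass all three filters, and only the second one receives a positive label.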

__ML problem:__ 
The goal is to predict whether an individual's income is above \\$50,000. 
This is a binary prediction task that can enable organizations and businesses to target their marketing efforts more effectively. Alternatively, governments could leverage these predictions to design better social welfare programs and allocate resources efficiently. Keep these kinds of problems in mind when working through the notebook.

Reference: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

----

You will be presented with activities throughout the notebook: <br/>

|<img style="float: center;" src="./images/activity.png" alt="Activity" width="125"/>| 
| --- | 
|<p style="text-align:center;"> No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p>|

## Index

- [Read in the dataset](#Read-in-the-dataset)
- [Data processing](#Data-processing)
- [Conclusion](#Conclusion)

Before loading in the dataset, make sure to install and import all required libraries.

In [None]:
# Use pip to install libraries
!pip install --no-deps -U -q -r requirements.txt

In [None]:
%%capture

# Import the libraries needed for the notebook

# Reshaping/basic libraries
import pandas as pd
import numpy as np
import io
import seaborn as sns

sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Import questions
from MLUMLA_EN_M3_Lab4_quiz_questions import *

# Fairness libraries
from folktables.acs import *
from folktables.folktables import *
from folktables.load_acs import *
from aif360.datasets import BinaryLabelDataset, Dataset
from aif360.algorithms.preprocessing import Reweighing

# Jupyter(lab) libraries
import warnings

warnings.filterwarnings("ignore")

---
## Read in the dataset

Import the data from Folktables.

In [None]:
income_features = [
    "AGEP",  # age individual
    "COW",  # class of worker
    "SCHL",  # educational attainment
    "MAR",  # marital status
    "OCCP",  # occupation
    "POBP",  # place of birth
    "RELP",  # relationship
    "WKHP",  # hours worked per week past 12 months
    "SEX",  # sex
    "RAC1P",  # recorded detailed race code
    "PWGTP",  # persons weight
    "GCL",  # grandparents living with grandchildren
]

# Define the prediction problem and features
ACSIncome = folktables.BasicProblem(
    features=income_features,
    target="PINCP",  # total person's income
    target_transform=lambda x: x > 50000,
    group="RAC1P",
    preprocess=adult_filter,  # applies the following conditions; ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
    postprocess=lambda x: x,  # applies post processing, for example: fill all NAs
)

# Initialize year, duration ("1-Year" or "5-Year") and granularity (household or person)
data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
# Specify region (here: California) and load data
ca_data = data_source.get_data(states=["CA"], download=True)
# Apply transformation as per problem statement above
ca_features, ca_labels, ca_group = ACSIncome.df_to_numpy(ca_data)

# Convert NumPy array to DataFrame
df = pd.DataFrame(
    np.concatenate((ca_features, ca_labels.reshape(-1, 1)), axis=1),
    columns=income_features + [">50k"],
)

# For further modeling, use only two groups
df = df[df["RAC1P"].isin([6, 8])].copy(deep=True)

---
## Data processing

Split the categorical and numerical features to keep them separate. Start by creating a list for each feature type.

In [None]:
categorical_features = [
    "COW",
    "SCHL",
    "MAR",
    "OCCP",
    "POBP",
    "RELP",
    "SEX",
    "RAC1P",
]

numerical_features = ["AGEP", "WKHP", "PWGTP"]

In [None]:
# Cast categorical features to `object`
df[categorical_features] = df[categorical_features].astype("object")

# Cast numerical features to `int`
df[numerical_features] = df[numerical_features].astype("int")

Now you can separate model features from the model target to explore them separately.

In [None]:
model_target = ">50k"
model_features = categorical_features + numerical_features

print("Model features: ", model_features)
print("Model target: ", model_target)

In [None]:
# Check that the target is not accidentally part of the features
model_target in model_features

This looks good. You made sure that the target is not in the feature list. If the output of the previous cell is `True`, you need to remove the target by calling `model_features.remove(model_target)`.

Next, you will look for missing values.

### Check for missing values
The quickest way to check for missing values is to use `.isna().sum()`. This will provide a count of missing values.

You can also see the count of missing values with `.info()` because the function provides a count of non-null values.

In [None]:
# Show missing values
df.isna().sum()

Implement a threshold-based way to drop columns with too many missing values. You want to drop all columns where 20 percent or more of the data is missing. (Note that this is an example threshold that doesn't apply universally.)

In [None]:
df = df.loc[:, df.isnull().mean() < 0.2]
df.reset_index(drop=True, inplace=True)
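
The following sketch shows the same column filter on a small toy DataFrame so that you can see which columns survive. The values are made up, and the column names are borrowed from this notebook only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; column names are borrowed from the notebook
toy = pd.DataFrame(
    {
        "AGEP": [25, 40, 33, 58, 61, 19, 47, 52, 36, 29],  # no missing values
        "WKHP": [40, np.nan, 35, 20, 45, 38, 40, 12, 50, 40],  # 10% missing
        "GCL": [np.nan] * 8 + [1.0, 2.0],  # 80% missing
    }
)

# Fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Keep only the columns whose missing share is below the threshold
kept = toy.loc[:, missing_share < 0.2]
print(list(kept.columns))
```

Here `GCL` is dropped, while `WKHP` is kept even though it contains one missing value.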

### Feature transformation

Now that you have dropped the columns with too many missing values, you want to compute reweighting factors for the data points before they are passed to the model. You will use AIF360 for this and initialize the `Reweighing` class.

In [None]:
# Declare the attribute values of the privileged and unprivileged groups
priv_group = [{"RAC1P": 6}]
unpriv_group = [{"RAC1P": 8}]

rw = Reweighing(unprivileged_groups=unpriv_group, privileged_groups=priv_group)

In [None]:
# Create a dataset construct for AIF360
binaryLabelDataset = BinaryLabelDataset(
    df=df,
    label_names=[model_target],
    protected_attribute_names=["RAC1P"],
    favorable_label=1.0,
    unfavorable_label=0.0,
)

binaryLabelDataset_transform = rw.fit_transform(binaryLabelDataset)
weights = pd.DataFrame(
    {
        "weights": binaryLabelDataset_transform.convert_to_dataframe()[1][
            "instance_weights"
        ]
    }
).round(2)

In [None]:
# Look at the transformed dataset
df_transformed = binaryLabelDataset_transform.convert_to_dataframe()[0]
df_transformed.head()

In [None]:
# Compare to the original dataset
df.head()

Notice that no differences exist between the datasets. You didn't change the data directly but got a list of weights that you can use instead.
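
For reference, the custom function later in this notebook computes each weight as the probability you would expect if the sensitive attribute $S$ and the label $Y$ were independent, divided by the probability that is actually observed. Written out, with $N$ the number of rows and $N_s$, $N_y$, and $N_{s,y}$ the corresponding counts (this notation is introduced here for illustration only):

$$
w(s, y) = \frac{P(S = s)\,P(Y = y)}{P(S = s, Y = y)} = \frac{N_s\,N_y}{N\,N_{s,y}}
$$

For example, with made-up counts $N = 10$, $N_s = 4$, $N_y = 5$, and $N_{s,y} = 1$, the weight is $(4 \cdot 5) / (10 \cdot 1) = 2$, so each row in that underrepresented combination counts double.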

Alternatively, if you prefer not to use AIF360, you can code a custom reweigh function. An example is provided in the following cell.

In [None]:
def reweighing(data, label, sensitive_attr, return_list=True):
    """Calculate reweighting factors based on the given model target and sensitive attribute."""
    # Initialize dict for the different classes of labels
    label_dict = dict()
    try:
        # Loop through different labels
        for outcome in data[label].unique():
            # Initialize empty dict to store weight values
            weight_map = dict()
            # Loop through different sensitive attributes
            for val in data[sensitive_attr].unique():
                # Count per outcome type, per sensitive attribute class
                nom = (
                    len(data[data[sensitive_attr] == val])
                    / len(data)
                    * len(data[data[label] == outcome])
                    / len(data)
                )
                denom = len(
                    data[(data[sensitive_attr] == val) & (data[label] == outcome)]
                ) / len(data)
                # Calculate fraction to obtain weight
                weight_map[val] = round(nom / denom, 2)
            # Store the weight map for this outcome in the dict
            label_dict[outcome] = weight_map
        # Map values back to correct data points
        data["weights"] = list(
            map(lambda x, y: label_dict[y][x], data[sensitive_attr], data[label])
        )
        # Enable to return a list of the weights
        if return_list:
            return data["weights"].to_list()
        else:
            return label_dict
    # Catch error
    except Exception as err:
        print(err)
        print("Dataframe might have no entries.")

In [None]:
# Use a custom function to calculate reweighting
weight_manual = pd.DataFrame({"weights": reweighing(df, model_target, "RAC1P", True)})
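
One way to sanity-check a set of reweighting factors is to confirm that the weighted share of favorable labels is the same in every group. The sketch below does this on a small made-up dataset. The columns `group` and `label` are hypothetical stand-ins for the sensitive attribute and `>50k`, and the weights are left unrounded so that the check holds exactly.

```python
import pandas as pd

# Toy data: group "a" has a higher share of positive labels than group "b"
toy = pd.DataFrame(
    {
        "group": ["a"] * 6 + ["b"] * 4,
        "label": [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    }
)

n = len(toy)
p_group = (toy["group"].value_counts() / n).to_dict()  # P(S = s)
p_label = (toy["label"].value_counts() / n).to_dict()  # P(Y = y)
p_joint = (toy.groupby(["group", "label"]).size() / n).to_dict()  # P(S = s, Y = y)

# Weight per row: expected probability under independence / observed probability
toy["weight"] = [
    p_group[g] * p_label[y] / p_joint[(g, y)]
    for g, y in zip(toy["group"], toy["label"])
]

# Share of positive labels per group, without and with the weights
unweighted = toy.groupby("group")["label"].mean()
weighted_sum = (toy["label"] * toy["weight"]).groupby(toy["group"]).sum()
weighted = weighted_sum / toy.groupby("group")["weight"].sum()

print(unweighted)
print(weighted)
```

Without weights, the positive share is about 0.67 for group `a` and 0.25 for group `b`. With the weights applied, both groups have a positive share of 0.5.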

### Compare outputs
Now, check whether the weights that you calculated manually with the custom function agree with the output from AIF360.

In [None]:
weight_manual.head()

In [None]:
weights.head()

Notice that the outputs are the same between the AIF360 library and the custom function. 
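
Because `.head()` shows only the first five rows, you might prefer a check that covers every row. The sketch below shows one possible approach with `np.allclose`. The two small DataFrames are hypothetical stand-ins for the library and custom weight tables, and the tolerance of 0.01 is an assumption that matches the two-decimal rounding used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two weight tables (made-up values)
weights_library = pd.DataFrame({"weights": [0.75, 1.5, 2.0, 0.67]})
weights_custom = pd.DataFrame({"weights": [0.75, 1.5, 2.0, 0.67]})

# True only if every pair of weights agrees within the rounding tolerance
same = np.allclose(
    weights_library["weights"], weights_custom["weights"], atol=0.01
)
print(same)
```

This comparison is positional, so it assumes that both tables list the rows in the same order.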

<div style="border: 4px solid coral; text-align: center; margin: auto;"> 
    <h3><i>Try it yourself!</i></h3>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Individuals can have several different sensitive attributes. How would you calculate the weighting factors, assuming that SEX is the sensitive attribute?</p>
    <p style=" text-align: center; margin: auto;">To answer the question, run the following cell.</p>
    <br>
</div>

In [None]:
# Run this cell for a knowledge check question
question_1

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">In a new cell, run the code from the correct answer to the previous question. Are the weights when using SEX as the sensitive attribute the same as the weights when using RAC1P? You can create a new code cell and run the <code>reweighing</code> function for RAC1P and for SEX to check whether the values are the same.</p>
        <p style=" text-align: center; margin: auto;">To answer the question, run the following cell.</p>
    <br>
</div>

In [None]:
# Run this cell for a knowledge check question
question_2

---
## Conclusion

Different combinations of attributes will yield different weights. You can implement reweighing by using existing Python libraries or by using custom functions.

**To finish this lab, continue to notebook 2.**