<center><img src="https://raw.githubusercontent.com/dssg/aequitas/master/docs/_images/aequitas_logo.svg" width="450"></center>

# How to add a new algorithm to Aequitas Flow


**TL;DR**: If you have developed an algorithm that you want to benchmark, this notebook shows how to include it in the Aequitas package. It is also a useful reference if you want to contribute to the package.



In [None]:
# Install Aequitas
!pip install "aequitas==1.0.0" &> /dev/null
# This only needs to run once, or after you lose your runtime environment in Colab.

In [None]:
# Cleaning the default logger of Google Colab (logs appear repeated otherwise)
from aequitas.flow.utils.logging import clean_handlers

clean_handlers()

## Method Type Definition

One of the specific cases we want to address is how to grow the pool of **algorithms** in the package.

This is especially useful to **benchmark novel methods** of FairML, and compare them to the baselines we have. Additionally, it is a great way to **contribute** to the package.

The first step is to determine what type of FairML algorithm you have. This depends mainly on the **input** and **output** of the method, and follows taxonomies presented in several surveys of Fairness in Machine Learning ([Caton and Haas, 2023](https://dl.acm.org/doi/pdf/10.1145/3616865), [Mehrabi et al., 2022](https://arxiv.org/pdf/1908.09635.pdf), [Pessach and Shmueli, 2022](https://dl.acm.org/doi/pdf/10.1145/3494672?casa_token=WE_KHZCRimkAAAAA:KuJ6HNEHlyNLMYfRA3ZylG7hUVX2PmH9ZiUO1QHhEpPXGinJ-cuuO7KBidKTICsRzIpoq8v-pddYkA)).

We additionally have an option for common ML algorithms, such as Random Forests, Neural Networks, etc., in a separate type of method named `BaseEstimator`.

<img src="https://raw.githubusercontent.com/dssg/aequitas/fairflow-release/src/aequitas_webapp/static/images/fairflow_method_contribution.svg">

---

## Example: Adding a New Method

In this example notebook, we integrate an implementation of **Data Repairer** ([Feldman et al., 2015](https://arxiv.org/abs/1412.3756)). This method changes the distribution of the features within each group to match the global distribution, *i.e.*, after the repair, the value at any percentile of a dataset feature is independent of the protected attribute (race, in the ACS Income data used here).

It is therefore a pre-processing method: it transforms the data into a fairer representation before passing it to a classification model.
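To make the idea concrete, here is a minimal sketch of a quantile-based repair for a single feature. It is not the implementation downloaded below: the function name `repair_feature_sketch` and its parameters are hypothetical, and `n_quantiles` is only assumed to play a role similar to the `definition` hyperparameter used later in this notebook. It follows the description above by mapping each group onto the pooled (global) distribution.

```python
import numpy as np


def repair_feature_sketch(values, groups, repair_level=1.0, n_quantiles=101):
    """Move each group's distribution of one feature toward the global one."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    probs = np.linspace(0.0, 1.0, n_quantiles)
    global_q = np.quantile(values, probs)  # quantiles of the pooled data

    repaired = values.copy()
    for group in np.unique(groups):
        mask = groups == group
        group_q = np.quantile(values[mask], probs)  # quantiles inside the group
        # Percentile rank of every value within its own group ...
        ranks = np.interp(values[mask], group_q, probs)
        # ... mapped to the value found at that rank in the global distribution.
        target = np.interp(ranks, probs, global_q)
        # repair_level interpolates between no change (0) and full repair (1).
        repaired[mask] = (1 - repair_level) * values[mask] + repair_level * target
    return repaired


# Synthetic data: two groups with clearly different distributions.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(30, 5, 2000), rng.normal(50, 10, 2000)])
groups = np.array(["a"] * 2000 + ["b"] * 2000)

repaired = repair_feature_sketch(values, groups, repair_level=1.0)
gap_before = abs(np.median(values[groups == "a"]) - np.median(values[groups == "b"]))
gap_after = abs(np.median(repaired[groups == "a"]) - np.median(repaired[groups == "b"]))
print(f"median gap before: {gap_before:.2f}, after: {gap_after:.2f}")
```

With `repair_level=0.0` the data is returned unchanged, and intermediate values give a partial repair.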

We download the implementation from the [aequitas repository](https://github.com/dssg/aequitas) and place it in the `examples/methods` directory of the Google Colab environment, the same location it has in the repository.

In [None]:
from aequitas.flow.utils.colab import get_examples

get_examples("methods/data_repair")

Opening the file shows that it contains **three functions**:
- The first, `get_quantiles`, **calculates the quantiles** of each group, as well as the global quantiles, in the dataset.
- The second, `repair_features`, **transforms the dataset** so that the distribution of the features is independent of the protected attribute.
- The third is a private helper that handles the formatting of pandas DataFrames.


---

## Method Interface

Now, let's **check the interface** for pre-processing methods in Aequitas Flow.

In [None]:
from aequitas.flow.methods.preprocessing.preprocessing import PreProcessing
import inspect  # To help get the code for the class

source = inspect.getsource(PreProcessing)
print(source)

From the code above, we can see that a FairML method that inherits from the `PreProcessing` abstract class has to implement **two methods**: one to **fit** to data, and another to **transform** data. The definition above also shows what the inputs and outputs of each method are.
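For readers without the package at hand, the shape of such an interface can be sketched with Python's `abc` module. The class names `PreProcessingSketch` and `Identity` are hypothetical stand-ins, not the Aequitas classes; the signatures mirror the ones used by the `DataRepair` class later in this notebook.

```python
from abc import ABC, abstractmethod
from typing import Optional

import pandas as pd


class PreProcessingSketch(ABC):
    """Illustrative stand-in; not the actual Aequitas class."""

    @abstractmethod
    def fit(self, X: pd.DataFrame, y: pd.Series, s: Optional[pd.Series] = None) -> None:
        """Learn whatever the method needs from the training data."""

    @abstractmethod
    def transform(
        self, X: pd.DataFrame, y: pd.Series, s: Optional[pd.Series] = None
    ) -> tuple[pd.DataFrame, pd.Series, Optional[pd.Series]]:
        """Return the transformed features, labels, and protected attribute."""


class Identity(PreProcessingSketch):
    """Smallest possible subclass: learns nothing and changes nothing."""

    def fit(self, X, y, s=None):
        pass

    def transform(self, X, y, s=None):
        return X, y, s


# Tiny toy inputs, only to exercise the two methods.
X = pd.DataFrame({"AGEP": [25, 40, 61]})
y = pd.Series([0, 1, 1])
s = pd.Series(["a", "b", "a"])

method = Identity()
method.fit(X, y, s)
X_out, y_out, s_out = method.transform(X, y, s)
```

Because both methods are abstract, the base class cannot be instantiated directly, and a subclass that forgets either `fit` or `transform` raises a `TypeError` at instantiation.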

---
## Method Implementation

Let's integrate this method into Aequitas Flow. The comments in the following cell explain each step as the class is implemented.

In [None]:
import pandas as pd
from typing import Optional

from examples.methods.data_repair import get_quantiles, repair_features
from aequitas.flow.utils import create_logger


class DataRepair(PreProcessing):  # Data repair is a pre-processing method
    def __init__(self, repair_level, definition, columns):
        # In instantiation, we pass all the hyperparameters for the method
        # But first, let's create a logger
        self.logger = create_logger("methods.preprocessing.DataRepair")
        self.logger.info("Creating DataRepair object.")
        self.repair_level = repair_level
        self.columns = columns
        self.definition = definition

    # Now, let's use the methods in the examples to make the DataRepair class
    def fit(self, X: pd.DataFrame, y: pd.Series, s: Optional[pd.Series] = None) -> None:
        self.logger.info("Fitting DataRepair.")
        # We get the quantiles of the training data and save them locally.
        self.group_quantiles, self.global_quantiles = get_quantiles(
            X[self.columns],
            s,
            self.definition,
        )
        self.logger.info("DataRepair fitted.")

    def transform(
        self, X: pd.DataFrame, y: pd.Series, s: Optional[pd.Series] = None
    ) -> tuple[pd.DataFrame, pd.Series, pd.Series]:
        self.logger.info("Transforming with DataRepair.")
        # We use the gathered quantiles and the hyperparams to transform the data.
        transformed_X = repair_features(
            X[self.columns],
            s,
            self.definition,
            self.repair_level,
            self.group_quantiles,
            self.global_quantiles
        )
        self.logger.info("Data transformed with DataRepair.")
        return_X = X.copy()
        for column in transformed_X.columns:
            return_X[column] = transformed_X[column]
        return return_X, y, s

---
## Testing the Method

After the implementation, it is easy to test the usage with **Aequitas Flow's resources**. We will instantiate a dataset, and check if the data is being repaired as intended.

In [None]:
from aequitas.flow.datasets import FolkTables

dataset = FolkTables("ACS_sample")
dataset.load_data()
dataset.create_splits()

We will first check the original distribution of one variable to be transformed: **Age** (AGEP).

In [None]:
import seaborn as sns
for race in dataset.data["RAC1P"].unique():
    sns.kdeplot(dataset.data[dataset.data["RAC1P"]==race], x="AGEP", bw_adjust=1.9)

Now let's create a repaired version of this dataset. To do so, we will **instantiate the class** we created and call its **`fit` and `transform` methods**. We will then visualize the **resulting distributions**.

In [None]:
repairer = DataRepair(repair_level=1.0, definition=101, columns=["AGEP"])

repairer.fit(dataset.data.X, dataset.data.y, dataset.data.s)
repaired_x, y, s = repairer.transform(dataset.data.X, dataset.data.y, dataset.data.s)

# Adding the protected attribute to X to accelerate plots.
repaired_x["RAC1P"] = s
repaired_x = repaired_x.copy()

In [None]:
import seaborn as sns
for race in repaired_x["RAC1P"].unique():
    sns.kdeplot(repaired_x[repaired_x["RAC1P"]==race], x="AGEP", bw_adjust=1.9)

We can see that the **distributions of age are very similar for each group** within the sensitive attribute, which is the expected behavior for the method.
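The visual check can be complemented with a number. The sketch below computes the largest gap between per-group quantiles of a feature; the helper `max_quantile_gap` is hypothetical (it is not part of Aequitas), and the data is synthetic so that the cell runs on its own.

```python
import numpy as np
import pandas as pd


def max_quantile_gap(df, feature, group_col, n_quantiles=9):
    """Largest spread between group quantiles of `feature`, over all levels."""
    # Stay away from the extremes (min/max), which are noisy.
    probs = list(np.linspace(0.1, 0.9, n_quantiles))
    # Rows: groups; columns: quantile levels.
    per_group = df.groupby(group_col)[feature].quantile(probs).unstack()
    return float((per_group.max(axis=0) - per_group.min(axis=0)).max())


rng = np.random.default_rng(0)
group_labels = [1] * 2000 + [2] * 2000

# Two groups with different distributions (similar to unrepaired data).
different = pd.DataFrame({
    "AGEP": np.concatenate([rng.normal(30, 5, 2000), rng.normal(50, 10, 2000)]),
    "RAC1P": group_labels,
})
# Two groups drawn from the same distribution (what a full repair aims for).
similar = pd.DataFrame({"AGEP": rng.normal(40, 10, 4000), "RAC1P": group_labels})

gap_different = max_quantile_gap(different, "AGEP", "RAC1P")
gap_similar = max_quantile_gap(similar, "AGEP", "RAC1P")
print(f"gap, different groups: {gap_different:.2f}")
print(f"gap, similar groups:   {gap_similar:.2f}")
```

In this notebook, the same helper could be applied as `max_quantile_gap(repaired_x, "AGEP", "RAC1P")`; a value close to zero would agree with the plot above.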


Note that the method was called exactly as Aequitas Flow calls it in the pipeline, so it is ready to be included in any experiment. Because it is a pre-processing method, an experiment also needs an in-processing method or base estimator to score the transformed dataset.

---