# Grid Search with the COMPAS Dataset

This notebook demonstrates the use of the grid search algorithm from `fairlearn` on the [COMPAS dataset from ProPublica](https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv). This dataset comes from the criminal justice system. The labels (0 or 1) record two-year recidivism, specifically whether or not a given offender is rearrested within two years (with a 0 representing no rearrest). Models based on this dataset are used in bail decisions.

## Loading and Examining the Data

We start by loading the dataset using the `tempeh` package (there may be some warnings, if you do not have `pytorch`, `keras` or `tensorflow` installed in your environment; these may be ignored). The data are already split into training and test sets:

In [1]:
import pandas as pd
import numpy as np
from tempeh.configurations import datasets

compas_dataset = datasets['compas']()
X_train = pd.DataFrame(compas_dataset.X_train, columns=compas_dataset.features)
y_train = pd.Series(compas_dataset.y_train.reshape(-1).astype(int), name="two_year_recid")
X_test = pd.DataFrame(compas_dataset.X_test, columns=compas_dataset.features)
y_test = pd.Series(compas_dataset.y_test.reshape(-1).astype(int), name="two_year_recid")
sensitive_features_train = pd.Series(compas_dataset.race_train)
sensitive_features_test = pd.Series(compas_dataset.race_test)

No module named 'torch'. If you want to use pytorch with tempeh please install pytorch separately first.
No modules named 'keras' and 'tensorflow'. If you want to use keras and tensorflow with tempeh please install keras and tensorflow separately first.


We can examine the features:

In [2]:
X_train

| | sex | age | juv_fel_count | juv_misd_count | juv_other_count | priors_count | age_cat_25 - 45 | age_cat_Greater than 45 | age_cat_Less than 25 | c_charge_degree_F | c_charge_degree_M |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 25.000000 | 0.0 | -2.340451 | 1.0 | -15.010999 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 26.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 21.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 1.0 | 29.129788 | 0.0 | 0.000000 | 0.0 | 6.000000 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 42.487893 | 0.0 | 0.000000 | 0.0 | 7.513697 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3531 | 1.0 | 33.000000 | 0.0 | 0.000000 | 0.0 | 3.000000 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3532 | 1.0 | 22.249714 | 0.0 | 0.000000 | 0.0 | 23.000266 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3533 | 1.0 | 35.000000 | 0.0 | 0.000000 | 0.0 | 7.000000 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3534 | 1.0 | 53.954471 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In this example, we treat race as the sensitive attribute. The dataset has already been reduced to only two values, "African-American" and "Caucasian", with roughly 61% of the training samples being African-American:

In [6]:
np.unique(sensitive_features_train, return_counts=True)

(array(['African-American', 'Caucasian'], dtype=object),
 array([2147, 1389], dtype=int64))

Note that race is not one of the columns in the feature data itself.
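The share of each group can be checked directly from the counts in the output above. This is a minimal sketch that hard-codes those two counts rather than reloading the data; the names `group_counts` and `group_shares` are illustrative and not part of the notebook:

```python
# Counts copied from the np.unique output above (training set only).
group_counts = {"African-American": 2147, "Caucasian": 1389}

total = sum(group_counts.values())

# Fraction of the training samples belonging to each group.
group_shares = {group: count / total for group, count in group_counts.items()}

for group, share in group_shares.items():
    print(f"{group}: {share:.1%}")
# African-American: 60.7%
# Caucasian: 39.3%
```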

## Training an Unmitigated Model

Before attempting to mitigate any disparity, we should first train a model without regard to fairness. For simplicity, we will use a logistic regression model, as implemented by `scikit-learn`:

In [4]:
from sklearn.linear_model import LogisticRegression

unconstrained_predictor = LogisticRegression(solver='liblinear')
unconstrained_predictor.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

With the model trained, we can examine it in the Fairness Dashboard. There are a number of sections which we can examine.

First is the Accuracy - the fraction of cases where the model gave the right answer. While the overall accuracy is a little over 66%, this number hides some complexity. Both subgroups have a similar overall accuracy, but African-Americans have a much higher overestimation error (i.e. the model predicts that they will be rearrested when they were not), while Caucasians have a much higher underestimation error (i.e. the model predicts that they will not be rearrested, but they were).

If we instead look at the Recall (which measures the model's ability to find all of the positive samples), we can see a much lower score for Caucasians than for African-Americans. The story for the Specificity (which measures the ability of a model to find all of the negative samples - in this case, those where there was no rearrest) is reversed, with Caucasians having a specificity of nearly 80%, but African-Americans only showing a specificity of about 65%.

In [5]:
from fairlearn.widget import FairlearnDashboard

predicted_ys = [unconstrained_predictor.predict(X_test).tolist()]
sensitive_features_mapped = list(map(lambda x: [x], sensitive_features_test.values))

FairlearnDashboard(sensitive_features=sensitive_features_mapped,
                   true_y=y_test.values,
                   predicted_ys=predicted_ys,
                   class_names=None,
                   feature_names=["Race"],
                   is_classifier=True)

FairlearnWidget(value={'true_y': [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1…

<fairlearn.widget.fairlearnDashboard.FairlearnDashboard at 0x2e1144040b8>
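The per-group accuracy, recall and specificity described above can also be computed by hand. The sketch below does this with `numpy` on a small set of made-up labels and predictions (not the COMPAS data); the helper `group_metrics` and the group names `"a"` and `"b"` are hypothetical, and the dashboard's own implementation may differ:

```python
import numpy as np

# Toy labels, predictions and group membership (hypothetical, not COMPAS data).
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])


def group_metrics(y_true, y_pred):
    """Return accuracy, recall and specificity for one set of samples."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    return {
        "accuracy": float((tp + tn) / len(y_true)),
        "recall": float(tp / (tp + fn)),
        "specificity": float(tn / (tn + fp)),
    }


# Evaluate the same metrics separately on each subgroup.
metrics = {
    str(g): group_metrics(y_true[group == g], y_pred[group == g])
    for g in np.unique(group)
}

for g, values in metrics.items():
    print(g, values)
```

In this toy data, group `"a"` has perfect recall but lower specificity, while group `"b"` shows the opposite pattern, which is the kind of contrast a single accuracy figure hides.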

## Selecting the Disparity Constraint

Before we can try to reduce disparity, we must first ask what the relevant constraint on the disparity should be. There are currently two options in `fairlearn` - Demographic Parity and Equalized Odds. While `fairlearn` produces models which reduce violation of the specified constraint, that does not mean that the models are *fairer* in the broader societal context.

In the following, we use $A$ for the sensitive attribute, $Y$ for the true values and $\hat{Y}$ for the predicted values. Since we have a binary classification problem, $Y , \hat{Y} \in \{ 0, 1 \}$.

Demographic Parity requires that $P( \hat{Y} | A ) = P(\hat{Y})$. That is, each subgroup (African-Americans and Caucasians in this case) should be equally likely to get a positive prediction (which in this example means "rearrested").
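To make the Demographic Parity condition concrete, the sketch below compares the rate of positive predictions between two groups on made-up predictions (not the COMPAS data). The names `selection_rates` and `dp_difference` are illustrative, and using the largest gap between groups as a summary is an assumption made here, not something taken from `fairlearn`:

```python
import numpy as np

# Toy predictions and group membership (hypothetical, not COMPAS data).
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

# Estimate P(Y_hat = 1 | A = a) for each group as the mean of the 0/1 predictions.
selection_rates = {
    str(g): float(y_pred[group == g].mean()) for g in np.unique(group)
}

# Demographic Parity holds exactly when all selection rates are equal,
# so the largest gap between groups measures the violation.
dp_difference = max(selection_rates.values()) - min(selection_rates.values())

print(selection_rates)
print(dp_difference)
```

Note that the true labels do not appear anywhere in this calculation: Demographic Parity depends only on the predictions and the sensitive attribute.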

Equalized Odds requires that $P( \hat{Y} | A, Y ) = P( \hat{Y} | Y)$, which corresponds to two separate equations for the two possible values of $Y$. For the case $Y=1$, this is equivalent to equalizing the true positive rates (also known as "Recall") across groups. In the $Y=0$ case, this is equivalent to equalizing the false positive rates (also known as "Fall-Out") across groups.
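Since the predictions are binary, it is enough to state the two Equalized Odds equations for $\hat{Y} = 1$. Written out, with $a$ ranging over the values of $A$ (the symbol $a$ is introduced here for illustration):

$$
P( \hat{Y} = 1 \mid A = a, Y = 1 ) = P( \hat{Y} = 1 \mid Y = 1 ) \qquad \text{(equal true positive rates)}
$$

$$
P( \hat{Y} = 1 \mid A = a, Y = 0 ) = P( \hat{Y} = 1 \mid Y = 0 ) \qquad \text{(equal false positive rates)}
$$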

If we are using our model to make bail decisions, we want to minimise the number of offences committed while out on bail. We use the rearrest label as a proxy for this (note that there are a number of issues with doing so). Demographic Parity does not make much sense in this case - it would only equalise the chance of predicting a rearrest across groups, regardless of the actual outcomes. In contrast, Equalized Odds does - we aim to correctly predict rearrests at equal rates for African-Americans and Caucasians, and also to predict rearrests which would not actually have occurred at equal rates.