# Grid Search with the COMPAS Dataset

This notebook demonstrates the use of the grid search algorithm from `fairlearn` on the [COMPAS dataset from ProPublica](https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv). The dataset comes from the criminal justice system. Each label (0 or 1) records two-year recidivism: whether or not a given offender is re-arrested within two years, with 0 meaning no re-arrest. Models based on this dataset are used in bail decisions.

## Loading and Examining the Data

We start by loading the dataset using the `tempeh` package. Some warnings may appear if `pytorch`, `keras` or `tensorflow` is not installed in your environment; these can be ignored. The data are already split into training and test sets:

In [None]:
import pandas as pd
import numpy as np
from tempeh.configurations import datasets

# Load the COMPAS dataset; tempeh provides it already split into train and test
compas_dataset = datasets['compas']()

# Wrap the feature arrays in DataFrames and flatten the labels into integer Series
X_train = pd.DataFrame(compas_dataset.X_train, columns=compas_dataset.features)
y_train = pd.Series(compas_dataset.y_train.reshape(-1).astype(int), name="two_year_recid")
X_test = pd.DataFrame(compas_dataset.X_test, columns=compas_dataset.features)
y_test = pd.Series(compas_dataset.y_test.reshape(-1).astype(int), name="two_year_recid")

# The sensitive feature (race) is kept separately from the feature data
sensitive_features_train = pd.Series(compas_dataset.race_train)
sensitive_features_test = pd.Series(compas_dataset.race_test)

We can examine the features:

In [None]:
X_train

And we can see the values of the sensitive feature, which is race in this example:

In [None]:
np.unique(sensitive_features_train, return_counts=True)

In this case, race has been reduced to a binary attribute. Note also that race is not one of the columns in the feature data itself.
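
Before training, it can also help to look at how the labels are distributed within each group. The following is a minimal sketch rather than part of the original notebook: the helper `label_rate_by_group` is hypothetical, and the toy values below are made up for illustration and are not COMPAS data. With the real data you would pass `y_train` and `sensitive_features_train` instead.

```python
import pandas as pd


def label_rate_by_group(labels, groups):
    """Return, for each group, the sample count and the fraction of positive labels."""
    frame = pd.DataFrame({"label": list(labels), "group": list(groups)})
    # Named aggregation: one row per group, one column per statistic
    return frame.groupby("group")["label"].agg(count="count", positive_rate="mean")


# Toy stand-in data (NOT real COMPAS values), only to make the sketch runnable
toy_groups = pd.Series(["A", "A", "B", "B", "B", "A"])
toy_labels = pd.Series([1, 0, 1, 1, 0, 0])

summary = label_rate_by_group(toy_labels, toy_groups)
print(summary)
```

A large difference in `positive_rate` between groups would suggest that a model trained without constraints may produce different outcomes for those groups.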

## Training an Unmitigated Model

Before attempting to mitigate any disparity, we should first train a model without regard to fairness. For simplicity, we will use a logistic regression model, as implemented by `scikit-learn`:

In [None]:
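from sklearn.linear_model import LogisticRegression

# Fit a plain logistic regression; the sensitive feature is not used here
unconstrained_predictor = LogisticRegression(solver='liblinear')
unconstrained_predictor.fit(X_train, y_train)

# Display the sensitive feature values for the test set
sensitive_features_test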

With the model trained, we can examine it in the Fairness Dashboard:

In [None]:
from fairlearn.widget import FairlearnDashboard

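# The dashboard takes a list of prediction lists, one entry per model
predicted_ys = [unconstrained_predictor.predict(X_test).tolist()]
# Wrap each sensitive feature value in its own single-element list (one row per sample)
sensitive_features_mapped = [[x] for x in sensitive_features_test.values]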

FairlearnDashboard(sensitive_features=sensitive_features_mapped,
                   true_y=y_test.values,
                   predicted_ys=predicted_ys,
                   class_names=None,
                   feature_names=X_test.columns.values.tolist(),
                   is_classifier=True)
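
The dashboard is interactive, so for a quick non-interactive check it may be useful to compute per-group accuracy and selection rate (the fraction of samples predicted as 1) by hand. The following is a minimal sketch and not part of `fairlearn`: the helper `group_summary` is hypothetical, and the toy lists are made-up stand-ins for `y_test`, `unconstrained_predictor.predict(X_test)` and `sensitive_features_test`.

```python
import numpy as np
import pandas as pd


def group_summary(y_true, y_pred, groups):
    """Return per-group accuracy and selection rate as a DataFrame."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    frame = pd.DataFrame({
        "accuracy": y_true == y_pred,    # True where the prediction is correct
        "selection_rate": y_pred == 1,   # True where the model predicts 1
        "group": np.asarray(groups),
    })
    # The mean of a boolean column is the fraction of True values
    return frame.groupby("group")[["accuracy", "selection_rate"]].mean()


# Toy stand-in values (NOT real model output), only to make the sketch runnable
toy_y_true = [1, 0, 1, 0, 1, 0, 1, 0]
toy_y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
toy_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

metrics = group_summary(toy_y_true, toy_y_pred, toy_groups)
# Difference between the highest and lowest selection rate across groups
selection_rate_gap = metrics["selection_rate"].max() - metrics["selection_rate"].min()
print(metrics)
print("Selection rate gap:", selection_rate_gap)
```

With the real data, a large gap in either column between the two groups would be the kind of disparity that mitigation aims to reduce.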