# Using Fairlearn with Census Data

This notebook shows how to use `fairlearn` and the Fairness dashboard to generate models for the Census dataset. This dataset poses a classification problem: given a range of data about 32,000 individuals, predict whether each person's annual income is above or below fifty thousand dollars.

Income data are well known to contain biases; in this case, we will attempt to mitigate bias with respect to gender.

## Loading the Dataset

For simplicity, we import the dataset from the `shap` package, which contains the data in a cleaned format. We start by importing the various modules we're going to use:

In [None]:
import sys
# Put the parent directory first on the import path (e.g. so a local fairlearn checkout is picked up)
sys.path.insert(0, "../")

from fairlearn.metrics import DemographicParity
from fairlearn.reductions import GridSearch
from fairlearn.reductions.grid_search.simple_quality_metrics import SimpleClassificationQualityMetric

from sklearn import svm, neighbors, tree
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import shap

import numpy as np

print(sys.version)

shap.initjs()

We can now load the dataset itself from the `shap` package. The data come in two parts: a large matrix of features, and an array of the corresponding labels:

In [None]:
X_raw, y = shap.datasets.adult()

With the data loaded, we can inspect the available columns:

In [None]:
X_raw

We are going to treat the gender of each individual as a protected attribute, and in this particular case we are going to separate it out and drop it from the main data. We use `get_dummies` to convert any categorical columns to indicator variables, and then ensure that the data are scaled to similar magnitudes:

In [None]:
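# Separate out the protected attribute and drop it from the features
A = X_raw["Sex"]
X = X_raw.drop(labels=["Sex"], axis=1)
# Convert categorical columns to indicator (dummy) columns
X = pd.get_dummies(X)

# Scale every column to zero mean and unit variance
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)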
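To see what these two steps do, here is a small sketch on a made-up three-row frame (the columns `Age` and `Workclass` and their values are purely illustrative, not taken from the Census data). `get_dummies` leaves numeric columns alone and expands each categorical column into one indicator column per category; `StandardScaler` then gives every column zero mean and unit variance.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [25, 40, 58],
    "Workclass": pd.Categorical(["Private", "State-gov", "Private"]),
})

# One indicator column per category, named <column>_<category>
encoded = pd.get_dummies(toy)

# Cast to float (recent pandas returns boolean dummies), then standardise each column
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded.astype(float)),
    columns=encoded.columns,
)
```

Scaling matters here mainly because the indicator columns are 0/1 while numeric columns such as age span a much wider range.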

We can also look at the supplied labels:

In [None]:
y

These also need to be converted to numeric values; `LabelEncoder` maps the two classes to 0 and 1:

In [None]:
le = LabelEncoder()
y = le.fit_transform(y)
y
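As a small illustration of how `LabelEncoder` chooses the integers, the sketch below uses hypothetical string labels (the exact raw label values returned by `shap` are not shown here, so these are an assumption). The encoder sorts the distinct values and numbers them from 0 in that order; booleans are handled the same way, with `False` becoming 0 and `True` becoming 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels for four individuals
raw = ["<=50K", ">50K", ">50K", "<=50K"]

le = LabelEncoder()
encoded = le.fit_transform(raw)   # sorted classes are numbered 0, 1, ...

# inverse_transform maps the integers back to the original labels
decoded = le.inverse_transform([1, 0])
```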

Finally, we perform the usual split of the data into training and test sets, stratifying on the labels so that both sets have the same class balance:

In [None]:
from sklearn.model_selection import train_test_split
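x_train, x_test, y_train, y_test, a_train, a_test = train_test_split(
    X_scaled, y, A, test_size=0.2, random_state=0, stratify=y)

# Work around an indexing bug: train_test_split keeps the original (now shuffled)
# pandas index labels, so reset them to 0..n-1 to match the NumPy label arrays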
x_train = x_train.reset_index(drop=True)
a_train = a_train.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
a_test = a_test.reset_index(drop=True)
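The notebook does not say which indexing bug the reset works around. One plausible cause (an assumption) is that the split keeps each row's original index label, so label-based access and index alignment in pandas no longer agree with row position, while `y_train` and `y_test` are plain NumPy arrays that only know positions. The toy Series below is hypothetical:

```python
import pandas as pd

# Hypothetical Series whose index was shuffled, as a train/test split leaves it
a = pd.Series(["F", "M", "F", "M"], index=[2, 0, 3, 1])

by_label = a.loc[0]      # the row whose *label* is 0, i.e. the second row: "M"
by_position = a.iloc[0]  # the *first* row: "F"

# After reset_index(drop=True), labels and positions agree again
a_reset = a.reset_index(drop=True)
```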

## Training an unmitigated model

To show the effect of `fairlearn`, we will first train a model without it. For speed of demonstration, we use a simple logistic regression learner from `sklearn`:

In [None]:
unmitigated_model = LogisticRegression(solver='liblinear', fit_intercept=True)

unmitigated_model.fit(x_train, y_train)

We can load this model into the Fairness dashboard and examine how it is unfair. (The dashboard emits a warning about AzureML because `fairlearn` is not yet integrated with that product.)

In [None]:
from azureml.contrib.explain.model.visualize import FairnessDashboard

FairnessDashboard(unmitigated_model, x_test, y_test.tolist(), pd.DataFrame(a_test).values.tolist(), True, list(x_test.columns), [0, 1], ["Sex"])

The dashboard should make any disparity between the two values of `Sex` visible; use it to compare the model's accuracy and selection rates across the two groups.

## Mitigation with GridSearch

The `GridSearch` class in `fairlearn` implements a simplified version of the exponentiated gradient algorithm of (some paper reference). By restricting the problem to binary classifiers with binary protected attributes, the exponentiated gradient algorithm can be reduced to trying a sequence of reweightings and relabellings of the training data. This sequence is parameterised by $\lambda$ because, in the full algorithm, the sweep comes from a Lagrange multiplier.
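The sketch below shows one standard way a single value of $\lambda$ can be turned into a reweighting and relabelling, via a cost-sensitive reduction. It assumes binary labels, a protected attribute coded 0/1, and that the parity term in the Lagrangian is $\lambda$ times (selection rate of group 1 minus selection rate of group 0); the function name and sign convention are our own, and this is not necessarily `fairlearn`'s exact implementation. Each sample's cost of predicting 0 minus its cost of predicting 1 becomes a signed weight: its sign gives the new label and its magnitude the sample weight.

```python
import numpy as np

def reweight_and_relabel(y, a, lam):
    """Hypothetical helper: map one grid value `lam` to (labels, sample_weights)."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a)
    p1 = np.mean(a == 1)   # fraction of samples in group 1
    p0 = 1.0 - p1          # fraction of samples in group 0
    # Extra cost of predicting 1 that the parity term adds to each sample
    constraint_cost = lam * ((a == 1) / p1 - (a == 0) / p0)
    # Cost of predicting 0 minus cost of predicting 1 (error cost is 1 per mistake)
    w = (2.0 * y - 1.0) - constraint_cost
    labels = (w > 0).astype(int)   # new label: whichever prediction is cheaper
    weights = np.abs(w)            # how much it costs to get this sample wrong
    return labels, weights

y = [1, 0, 1, 0, 1, 0]
a = [0, 0, 0, 1, 1, 1]
labels0, weights0 = reweight_and_relabel(y, a, 0.0)   # lambda = 0: the original problem
labels1, weights1 = reweight_and_relabel(y, a, 1.0)   # large lambda: labels flip by group
```

With $\lambda = 0$ the labels are unchanged and every weight is 1; as $\lambda$ grows, samples in group 0 are pushed towards label 1 and those in group 1 towards label 0, which narrows the gap in selection rates.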

To work, `GridSearch` requires an underlying learner (we shall use the same `LogisticRegression` learner as above), a fairness metric (we will specify demographic parity), and a quality metric. The quality metric is used to pick the 'best' model from the sweep, which is then used to serve subsequent calls to the `predict()` method. The quality metric we supply here seeks to maximise the sum of the accuracy and parity values.
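A minimal sketch of that selection rule, assuming each model has already been summarised as an (accuracy, parity) pair where parity is a higher-is-better score such as 1 minus the demographic parity difference. The helper name and the numbers are hypothetical, not output from `SimpleClassificationQualityMetric`.

```python
def select_best(points):
    """Index of the (accuracy, parity) pair with the largest accuracy + parity."""
    scores = [accuracy + parity for accuracy, parity in points]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up (accuracy, parity) pairs for three hypothetical models
candidates = [(0.85, 0.80), (0.82, 0.95), (0.80, 0.99)]
best = select_best(candidates)
```

Summing the two values means one point of parity is traded one-for-one against one point of accuracy; that is a design choice, and a weighted sum would express a different preference.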

In this case, we are going to extract the full set of models, and then look at how they behave in accuracy/parity space. We can then choose the model which best meets our requirements for accuracy and parity.

In [None]:
sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),
                   fairness_metric=DemographicParity(),
                   quality_metric=SimpleClassificationQualityMetric())

Our algorithms provide `fit()` and `predict()` methods, so they behave in a similar manner to other ML packages in Python. We do, however, have to specify two extra arguments to `fit()`: the column of protected attribute labels, and the number of values we wish to use for the Lagrange multiplier. The grid search will call the underlying learner once for each value, which makes it an ideal point to integrate with AzureML and leverage large-scale execution in the cloud.

After `fit()` completes, we extract the full set of models from the `GridSearch` object.

In [None]:
sweep.fit(x_train, y_train,
          protected_attribute=a_train,
          number_of_lagrange_multipliers=71)

models = [x["model"] for x in sweep.all_models]
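Conceptually, the sweep is a loop that fits the underlying learner once per grid value. The sketch below assumes an evenly spaced grid on $[-\lambda_{max}, \lambda_{max}]$ (how `fairlearn` actually chooses its 71 values is not stated here), a protected attribute coded 0/1, and the cost-sensitive reweighting described for demographic parity; `grid_sweep` and `lam_max` are hypothetical names, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def grid_sweep(X, y, a, n_values=11, lam_max=1.0):
    """Hypothetical sweep: fit one LogisticRegression per grid value of lambda."""
    y = np.asarray(y)
    a = np.asarray(a)
    p1 = np.mean(a == 1)
    p0 = 1.0 - p1
    results = []
    for lam in np.linspace(-lam_max, lam_max, n_values):
        # Reweight and relabel the training data for this lambda
        w = (2.0 * y - 1.0) - lam * ((a == 1) / p1 - (a == 0) / p0)
        labels = (w > 0).astype(int)
        model = LogisticRegression(solver="liblinear")
        model.fit(X, labels, sample_weight=np.abs(w))
        results.append({"lambda": lam, "model": model})
    return results

# Synthetic data in which the label depends partly on the protected attribute
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
a = (rng.random(200) < 0.5).astype(int)
y = (X[:, 0] + 0.5 * a + rng.normal(scale=0.5, size=200) > 0).astype(int)

sweep_results = grid_sweep(X, y, a, n_values=5)
```

Because each iteration is independent, the loop parallelises trivially, which is why the text singles it out as a natural place to hand work off to cloud execution.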

We could load these models into the Fairness dashboard now. However, the plot would be cluttered by the sheer number of models. Instead, we are going to remove the models from the sweep that are dominated by others in accuracy-parity space (note that parity is only calculated for the specified protected attribute; other potentially protected attributes are not mitigated). In general, one might not want to do this, since there may be considerations beyond the strict maximisation of accuracy and parity (for the given protected attribute).

We start by evaluating the accuracy and parity of each model generated by the sweep on the test dataset extracted above:
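A minimal sketch of this evaluation for a single model, assuming parity is defined as 1 minus the demographic parity difference (the largest gap in selection rate, i.e. the fraction predicted positive, between groups). The helper name and the four-sample data are hypothetical.

```python
import numpy as np

def accuracy_and_parity(y_true, y_pred, a):
    """(accuracy, parity) for one model, with parity = 1 - demographic parity difference."""
    y_true, y_pred, a = np.asarray(y_true), np.asarray(y_pred), np.asarray(a)
    accuracy = float(np.mean(y_true == y_pred))
    # Selection rate: fraction of each group that the model predicts as positive
    rates = [float(np.mean(y_pred[a == group])) for group in np.unique(a)]
    parity = 1.0 - (max(rates) - min(rates))
    return accuracy, parity

# In the notebook this would be applied to every model from the sweep, e.g.
# points = [accuracy_and_parity(y_test, m.predict(x_test), a_test) for m in models]
point = accuracy_and_parity([1, 0, 1, 0], [1, 0, 0, 0], ["F", "F", "M", "M"])
```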

We can plot all the models in accuracy-parity space:
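A minimal matplotlib sketch of such a plot, assuming each model has already been reduced to an (accuracy, parity) pair with both values higher-is-better. The points below are made up for illustration, and `plot_tradeoff` is a hypothetical helper.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

def plot_tradeoff(points, ax=None):
    """Scatter one point per model: parity on x, accuracy on y."""
    if ax is None:
        _, ax = plt.subplots()
    parities = [parity for _, parity in points]
    accuracies = [accuracy for accuracy, _ in points]
    ax.scatter(parities, accuracies)
    ax.set_xlabel("Parity (1 - demographic parity difference)")
    ax.set_ylabel("Accuracy")
    return ax

# Made-up (accuracy, parity) pairs for three hypothetical models
ax = plot_tradeoff([(0.85, 0.80), (0.82, 0.95), (0.80, 0.99)])
```

With both axes oriented so that larger is better, the most desirable models sit towards the upper right, and dominated models lie closer to the origin.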

We can always devise a new model by interpolating between two others. This means that any model lying closer to the origin than a line connecting any pair of models is said to be 'dominated' by that pair. In practice, this means that we want to extract the models on the upper-right boundary of the convex hull of points in this space (where larger accuracy and larger parity are both better). These will form an approximation to the so-called "Pareto front" of models, which represent the best possible set of trade-offs between accuracy and parity.
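One way to make "interpolating between two models" concrete (our interpretation, not something the notebook spells out) is a randomized classifier that uses model 1 with probability $t$ and model 2 otherwise. Its expected accuracy and each group's expected selection rate are linear in $t$, so it sits on the straight line between the two models in accuracy. With two groups, the selection-rate gap is also linear in $t$, so its absolute value is convex and parity (1 minus that gap) is at least the linear interpolation. The helper name and data below are hypothetical.

```python
import numpy as np

def mixture_metrics(y_true, pred_1, pred_2, a, t):
    """Expected accuracy and per-group selection rates of a model that uses
    pred_1 with probability t and pred_2 with probability 1 - t."""
    y_true, a = np.asarray(y_true), np.asarray(a)
    # Probability that the randomized model predicts 1 on each sample
    p = t * np.asarray(pred_1, dtype=float) + (1 - t) * np.asarray(pred_2, dtype=float)
    accuracy = float(np.mean(np.where(y_true == 1, p, 1 - p)))
    rates = {g: float(np.mean(p[a == g])) for g in np.unique(a)}
    return accuracy, rates

y_true = [1, 0, 1, 0]
pred_1 = [1, 0, 1, 0]   # accuracy 1.0
pred_2 = [1, 1, 1, 1]   # accuracy 0.5
acc_half, rates_half = mixture_metrics(y_true, pred_1, pred_2, [0, 0, 1, 1], t=0.5)
acc_quarter, _ = mixture_metrics(y_true, pred_1, pred_2, [0, 0, 1, 1], t=0.25)
```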

We use a library routine to extract the convex hull, and replot to show that we have the correct points:
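The library routine used for this step is not named. As a dependency-free sketch, the function below builds the upper hull with Andrew's monotone chain algorithm and keeps only its Pareto part, assuming each model is an (accuracy, parity) pair with both values higher-is-better; `pareto_hull` and the points are hypothetical.

```python
def pareto_hull(points):
    """Indices of the (accuracy, parity) points on the upper-right convex hull,
    ordered by increasing accuracy."""
    def cross(o, p, q):
        # > 0 for a counter-clockwise turn o -> p -> q, < 0 for clockwise
        return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

    order = sorted(range(len(points)), key=lambda i: points[i])
    upper = []
    for i in order:
        # Drop the last hull point while it fails to make a clockwise turn
        while len(upper) >= 2 and cross(points[upper[-2]], points[upper[-1]], points[i]) >= 0:
            upper.pop()
        upper.append(i)
    # The upper hull rises to the highest-parity point, then falls towards the
    # most accurate point; only that falling part is Pareto-optimal
    top = max(range(len(upper)), key=lambda k: (points[upper[k]][1], k))
    return upper[top:]

# Made-up (accuracy, parity) pairs; models 3 and 4 are dominated
candidates = [(0.85, 0.80), (0.84, 0.90), (0.80, 0.99), (0.75, 0.85), (0.83, 0.86)]
hull = pareto_hull(candidates)
```

Collinear points are dropped as well, since they can be matched by interpolating between their neighbours on the hull.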

Finally, we can put the models from the convex hull into the Fairness dashboard. We also add in the original, unmitigated model; it will be the one highlighted when the dashboard first loads.

In [None]:
# Pass the unmitigated model first, followed by the models from the sweep
FairnessDashboard([unmitigated_model] + models, x_test, y_test.tolist(), pd.DataFrame(a_test).values.tolist(), True, list(x_test.columns), [0, 1], ["Sex"])

From here, you can explore the models in the dashboard and choose the one whose trade-off between accuracy and parity best meets your requirements.