# Using Fairlearn with Census Data

This notebook shows how to use `fairlearn` and the Fairness dashboard to generate models for the Census dataset. The task is binary classification: given a range of data about roughly 32,000 individuals, predict whether each person's annual income is above or below fifty thousand dollars.

For simplicity, we import the dataset from the `shap` package, which contains the data in a cleaned format.

In [None]:
import sys
sys.path.insert(0, "../")

from fairlearn.metrics import DemographicParity
from fairlearn.reductions import GridSearch
from fairlearn.reductions.grid_search.simple_quality_metrics import SimpleClassificationQualityMetric

from sklearn import svm, neighbors, tree
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import shap

import numpy as np

print(sys.version)

shap.initjs()
# Load the cleaned Census (Adult) data: features and income labels
X_raw, y = shap.datasets.adult()

We can inspect the raw data to see the available columns:

In [None]:
X_raw

We are going to treat the gender of each individual as a protected attribute, so we separate it out and drop it from the main data. We use `get_dummies` to convert any categorical columns to indicator variables, and then scale the data so that all columns have similar magnitudes:

In [None]:
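# Separate the protected attribute from the features
A = X_raw["Sex"]
X = X_raw.drop(labels=["Sex"], axis=1)

# Convert any categorical columns to indicator variables
X = pd.get_dummies(X)

# Scale each column to zero mean and unit variance, keeping the column names
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)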
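
As a small illustration of what these two steps do, here is a minimal sketch on a toy frame. The column names and values are made up for the example and are not taken from the Census data; `dtype=float` is passed to `get_dummies` only to keep the toy frame purely numeric.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    "Age": [25, 40, 55],
    "Workclass": ["Private", "State-gov", "Private"],
})

# Each category becomes its own indicator column; numeric columns pass through
encoded = pd.get_dummies(toy, dtype=float)
print(list(encoded.columns))

# StandardScaler returns a NumPy array, so we re-wrap it to keep the column names
scaled = pd.DataFrame(StandardScaler().fit_transform(encoded),
                      columns=encoded.columns)
print(scaled.mean().round(6).tolist())
```

After scaling, every column has mean 0 and (population) standard deviation 1, which is what puts the indicator columns and the numeric columns on a comparable footing.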

We can also look at the supplied labels:

In [None]:
y

These need to be converted to indicator values as well:

In [None]:
le = LabelEncoder()
y = le.fit_transform(y)
y
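
The following sketch shows the same conversion on a handful of made-up labels. It assumes boolean labels purely for illustration; `LabelEncoder` sorts the distinct values and maps them to 0, 1, and so on, and it can map the integers back again.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels: True means income above the threshold
toy_labels = np.array([False, True, True, False])

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

print(toy_le.classes_)   # the distinct labels, in sorted order
print(toy_encoded)       # integer codes, as a NumPy array

# The mapping is reversible
toy_decoded = toy_le.inverse_transform(toy_encoded)
```

Note that the result is a NumPy array, not a pandas object.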

We now perform the normal split of the data into training and test sets:

In [None]:
# Split data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test, a_train, a_test = train_test_split(X_scaled, 
                                                    y, 
                                                    A,
                                                    test_size = 0.2,
                                                    random_state=0,
                                                    stratify=y)
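
To make the behaviour of this call concrete, here is a minimal sketch on toy data (all names and values are invented for the example). It shows that the splits stay row-aligned across the inputs, that stratification preserves the class balance, and that pandas inputs come back as pandas objects while NumPy inputs come back as arrays.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy inputs: a DataFrame, a NumPy label array, and a Series
X_toy = pd.DataFrame({"f": np.arange(10.0)})
y_toy = np.array([0, 1] * 5)
a_toy = pd.Series(["F", "M"] * 5, name="Sex")

X_tr, X_te, y_tr, y_te, a_tr, a_te = train_test_split(
    X_toy, y_toy, a_toy, test_size=0.2, random_state=0, stratify=y_toy)

print(len(X_tr), len(X_te))            # sizes of the two splits
print(y_te.sum(), len(y_te))           # stratification keeps the 50/50 balance
print(X_tr.index.equals(a_tr.index))   # pandas outputs share the same row labels
print(type(y_tr).__name__)             # the label array stays a NumPy array
```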

There is now a slight subtlety, which `fairlearn` itself does not yet address. Note that `y` is a NumPy array (the output of `LabelEncoder.fit_transform`), whereas `X_scaled` is a pandas DataFrame and `A` is a pandas Series. In the `train_test_split` call above, the array is simply split, but the pandas objects keep the row labels (indices) that the selected rows had in the original data. Inside `fairlearn` we combine `a_train` and `y_train` into a new DataFrame with one column for each variable (both have the same number of elements). When pandas does this, it sees that `a_train` carries an index and aligns the columns on those labels rather than on position. The labels refer to the original, unsplit data, so the combined DataFrame ends up with many more rows than `y_train` has elements; pandas fills the missing values automatically, but the rest of `fairlearn` subsequently fails because the combined DataFrame has many more rows than expected.

To avoid this, we copy `x_train` and `a_train` into new pandas objects, with the indices reset to be sequential (and dropping the old indices entirely):

In [None]:
x_train = x_train.reset_index(drop=True)
a_train = a_train.reset_index(drop=True)

We can now run our grid search. We use a simple logistic regression classifier in the interests of speed. We also specify that our equality goal is demographic parity. The quality metric seeks to maximise the sum of accuracy and parity (as measured by demographic parity). This quality metric then allows the `GridSearch` object to select a model to use for `predict` calls. In this case, though, we just extract the generated models:

In [None]:
sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),
                   fairness_metric=DemographicParity(),
                   quality_metric=SimpleClassificationQualityMetric())

sweep.fit(x_train, y_train,
          protected_attribute=a_train,
          number_of_lagrange_multipliers=71)

# Extract the fitted model from each entry produced by the sweep
models = [m["model"] for m in sweep.all_models]
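
For intuition about the two quantities being traded off, here is a minimal sketch that computes accuracy and a demographic parity gap by hand for a set of made-up predictions. It is an illustration only, not how `fairlearn` or `SimpleClassificationQualityMetric` computes its score; the variable names and the choice of "largest minus smallest selection rate" as the gap are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels, predictions, and protected attribute values
y_true = np.array([1, 0, 1, 0, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])

# Accuracy: fraction of predictions that match the true labels
accuracy = float((y_true == y_pred).mean())

# Selection rate: fraction of positive predictions within each group
selection_rate = pd.Series(y_pred).groupby(group).mean()

# Demographic parity holds when the selection rates are equal across groups,
# so the gap between the largest and smallest rate measures the violation
parity_gap = float(selection_rate.max() - selection_rate.min())

print(accuracy)
print(selection_rate.to_dict())
print(parity_gap)
```

A model with a smaller gap treats the groups more evenly in terms of positive predictions, usually at some cost in accuracy, which is the trade-off the sweep explores.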

Now we can look at the models in the dashboard. First we import the code (since we do not yet have integration with AzureML, there is a warning):

In [None]:
from azureml.contrib.explain.model.visualize import FairnessDashboard


And then we can open the visualisation. We can see a clear Pareto front forming. Individual models can be selected for further exploration on the other tabs.

In [None]:
FairnessDashboard(models,
                  x_test,
                  y_test.tolist(),
                  pd.DataFrame(a_test).values.tolist(),
                  True,
                  list(x_test.columns),
                  [0, 1],
                  ["Sex"])