# Mitigating Unfairness in the Law School Dataset

In this example, we will examine the well-known Law School Admissions dataset, provided by [Project SEAPHE](http://www.seaphe.org/databases.php). The motivation was to gain a better understanding of race in law school admissions, and to ensure that students who would ultimately pass the bar exam were treated fairly.

We shall train a model for an admissions scenario. The model's task will be to provide a means of ranking applicants, and offers would be made to the top students from that list.

## Obtaining the Data

We obtain the data from the `tempeh` package. The main feature data (which we will refer to as $X$) has two features: undergraduate GPA and LSAT score. The label (which we call $y$) is 0 or 1 depending on whether the student passed the bar exam. Finally, we also have the race of the students ('black' or 'white') as a sensitive attribute, which we will refer to as $A$.

We start by loading the data, which have already been split into "train" and "test" subsets for us.

In [None]:
import numpy as np
import pandas as pd

from tempeh.configurations import datasets
dataset = datasets['lawschool_passbar']()

X_train = pd.DataFrame(dataset.X_train, columns=dataset.features)
X_test = pd.DataFrame(dataset.X_test, columns=dataset.features)

y_train = pd.Series(dataset.y_train.squeeze(), name="Pass Bar", dtype=int)
y_test = pd.Series(dataset.y_test.squeeze(), name="Pass Bar", dtype=int)

A_train = pd.Series(dataset.race_train, name="Race")
A_test = pd.Series(dataset.race_test, name="Race")

Now, let us examine the data. First, we can look at the breakdown of students by race in the dataset. We see that there are far more white students than black students, which already suggests bias in the data:

In [None]:
# Count how many training examples belong to each race
races, counts = np.unique(dataset.race_train, return_counts=True)
for race, count in zip(races, counts):
    print("Number of {0} students is {1}".format(race, count))

We can also start using the group metrics from `fairlearn` to examine things such as the final pass rate for the bar exam. Both rates are high, although higher for whites:

In [None]:
from fairlearn.metrics import group_mean_prediction

def group_metric_printer(name, group_metric_result):
    print("{0} overall {1:.3f}".format(name, group_metric_result.overall))
    for k, v in group_metric_result.by_group.items():
        print("{0} for {1:8} {2:.3f}".format(name, k, v))

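# group_mean_prediction does not use its first argument (y_true), so we pass a
# placeholder and supply the true labels where the predictions would normally go
unused = np.ones(len(dataset.y_train))
group_metric_printer("Pass Rate", group_mean_prediction(unused, y_train, A_train))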

Looking at the raw numbers, we see how dominant whites who pass the bar exam are in the dataset:

In [None]:
for r in ['black', 'white']:
    Ys = y_train[A_train==r]
    print(r)
    print(Ys.value_counts())

We can also examine the [ROC-AUC scores](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) for each feature (that is, LSAT score and undergraduate GPA) for the overall dataset and by race. Used in this way, the ROC-AUC score is a measure of how predictive each feature is of the final label. A score of 0.5 would mean that the feature is no better than a coin flip (a coin biased to produce a desired fraction of positives), while a score of 1 means that the feature is perfectly discriminating:

In [None]:
from fairlearn.metrics import group_roc_auc_score

for column_name in X_train:
    column_data = X_train[column_name]
    title = "ROC-AUC {0}".format(column_name)
    group_metric_printer(title, group_roc_auc_score(y_train, column_data, A_train))
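To make the interpretation of these numbers concrete, here is a minimal sketch of what the ROC-AUC of a single feature measures: the probability that a randomly chosen positive example has a higher feature value than a randomly chosen negative one, with ties counted as half. The helper `feature_roc_auc` and the toy arrays are hypothetical and are not part of the notebook or of `fairlearn`.

```python
import numpy as np

def feature_roc_auc(labels, feature):
    """ROC-AUC of one feature, computed by comparing every positive with every negative."""
    labels = np.asarray(labels)
    feature = np.asarray(feature, dtype=float)
    pos = feature[labels == 1]
    neg = feature[labels == 0]
    # diff[i, j] > 0 means positive i is ranked above negative j
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Toy data (illustrative only): two negatives followed by two positives
toy_labels = np.array([0, 0, 1, 1])
perfect_feature = np.array([1.0, 2.0, 3.0, 4.0])       # separates the classes
partial_feature = np.array([0.1, 0.4, 0.35, 0.8])      # one pair out of order
constant_feature = np.array([5.0, 5.0, 5.0, 5.0])      # carries no information

print(feature_roc_auc(toy_labels, perfect_feature))    # 1.0
print(feature_roc_auc(toy_labels, partial_feature))    # 0.75
print(feature_roc_auc(toy_labels, constant_feature))   # 0.5
```

The pairwise comparison is quadratic in the number of examples, so it is only suitable for small illustrations like this one.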

We can also examine the cumulative distribution functions (CDFs) of the LSAT scores and GPAs for whites and blacks:

In [None]:
import matplotlib.pyplot as plt
from scipy.stats import cumfreq

def plot_separated_cdf(data, A):
    for a in np.unique(A):
        subset = data[A==a]
        
        cdf = cumfreq(subset, numbins=20)
        x = cdf.lowerlimit + np.linspace(0, cdf.binsize*cdf.cumcount.size, cdf.cumcount.size)
        plt.plot(x, cdf.cumcount / len(subset), label=a)
    plt.xlabel(data.name)
    plt.ylabel("Cumulative Frequency")
    plt.legend()
    plt.show()
        

plot_separated_cdf(X_train['lsat'], A_train)
plot_separated_cdf(X_train['ugpa'], A_train)

## An Unmitigated Predictor

As a point of comparison for later, we can train a predictor without regard to fairness.

In [None]:
from sklearn.linear_model import LogisticRegression

unmitigated_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)

unmitigated_predictor.fit(X_train, y_train)

We can examine some statistics for this predictor. Since we have an admissions scenario, with a goal of ranking applicants, it is not useful to look at the binary prediction itself (especially since we already know that most students will pass the bar exam). Instead, we use the `predict_proba` method of the `LogisticRegression` estimator, which provides probabilities for each class. We have a binary label column, so we will focus on the "1" class, and we shall refer to the value as a 'score' (to be used in ranking) rather than a probability.

In [None]:
unmitigated_scores = pd.Series(unmitigated_predictor.predict_proba(X_test)[:,1], name="Unmitigated Score")
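In the admissions scenario described at the start, these scores would be used to order applicants rather than to classify them. The following is a small sketch of that step on made-up scores; the function `top_k_offers` and the number of offers are assumptions for illustration only.

```python
import numpy as np

def top_k_offers(scores, k):
    """Return the positions of the k highest-scoring applicants, best first."""
    scores = np.asarray(scores, dtype=float)
    # Sort in descending order of score; a stable sort keeps ties in input order
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Toy scores for five hypothetical applicants
toy_scores = np.array([0.91, 0.55, 0.97, 0.80, 0.62])
offers = top_k_offers(toy_scores, k=2)
print(offers)  # [2 0]
```

Because only the ordering matters here, any monotonic transformation of the scores would lead to the same offers.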

We can now look at the mean predicted score:

In [None]:
group_metric_printer("Predicted Score", group_mean_prediction(y_test, unmitigated_scores, A_test))

We can also look at the ROC-AUC scores for the predictions. Because the data are heavily imbalanced, with most students passing the bar exam, this is a more informative metric than model accuracy.

In [None]:
group_roc_auc_score_unmitigated = group_roc_auc_score(y_test, unmitigated_scores, A_test)
group_metric_printer("Unmitigated ROC-AUC score", group_roc_auc_score_unmitigated)

Finally, we can examine the distribution of predicted scores in more detail. Also marked is the maximum distance between the two cumulative frequency curves. This is an alternative measure of disparity in the model.

In [None]:
def compare_cdfs(data, A, num_bins=100):
    cdfs = {}
    assert len(np.unique(A)) == 2
    
    limits = ( min(data), max(data) )
    s = 0.5 * (limits[1] - limits[0]) / (num_bins - 1)
    limits = ( limits[0]-s, limits[1] + s)
    
    for a in np.unique(A):
        subset = data[A==a]
        
        cdfs[a] = cumfreq(subset, numbins=num_bins, defaultreallimits=limits)
        
    lower_limits = [v.lowerlimit for _, v in cdfs.items()]
    bin_sizes = [v.binsize for _,v in cdfs.items()]
    actual_num_bins = [v.cumcount.size for _,v in cdfs.items()]
    
    assert len(np.unique(lower_limits)) == 1
    assert len(np.unique(bin_sizes)) == 1
    assert np.all([num_bins==v.cumcount.size for _,v in cdfs.items()])
    
    xs = lower_limits[0] + np.linspace(0, bin_sizes[0]*num_bins, num_bins)
    
    disparities = np.zeros(num_bins)
    for i in range(num_bins):
        cdf_values = np.clip([v.cumcount[i]/len(data[A==k]) for k,v in cdfs.items()],0,1)
        disparities[i] = max(cdf_values)-min(cdf_values)  
    
    return xs, cdfs, disparities
    
    
def plot_and_compare_cdfs(data, A, num_bins=100):
    xs, cdfs, disparities = compare_cdfs(data, A, num_bins)
    
    for k, v in cdfs.items():
        plt.plot(xs, v.cumcount/len(data[A==k]), label=k)
    
    assert disparities.argmax().size == 1
    d_idx = disparities.argmax()
    
    xs_line = [xs[d_idx],xs[d_idx]]
    counts = [v.cumcount[d_idx]/len(data[A==k]) for k, v in cdfs.items()]
    ys_line = [min(counts), max(counts)]
    
    plt.plot(xs_line, ys_line, 'o--')
    disparity_label = "Max Disparity = {0:.3f} \nat {1:0.3f} ".format(disparities[d_idx], xs[d_idx])
    plt.text(xs[d_idx], 1, disparity_label, ha="right", va="top")
    
    plt.xlabel(data.name)
    plt.ylabel("Cumulative Frequency")
    plt.legend()
    plt.show()

    
plot_and_compare_cdfs(unmitigated_scores, A_test)
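The maximum vertical distance between two empirical CDFs is the two-sample Kolmogorov-Smirnov statistic. The binned calculation above approximates it; the sketch below computes it exactly by evaluating both CDFs at every observed value. The function `max_cdf_gap` and the toy samples are hypothetical and only illustrate the idea (`scipy.stats.ks_2samp` reports the same statistic).

```python
import numpy as np

def max_cdf_gap(sample_a, sample_b):
    """Largest absolute difference between the empirical CDFs of two samples."""
    sample_a = np.sort(np.asarray(sample_a, dtype=float))
    sample_b = np.sort(np.asarray(sample_b, dtype=float))
    # The empirical CDFs only change at observed values, so check those points
    points = np.concatenate([sample_a, sample_b])
    cdf_a = np.searchsorted(sample_a, points, side="right") / sample_a.size
    cdf_b = np.searchsorted(sample_b, points, side="right") / sample_b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Toy scores for two hypothetical groups
group_one = [0.2, 0.4, 0.6, 0.8]
group_two = [0.5, 0.7, 0.9, 1.0]

print(max_cdf_gap(group_one, group_two))  # 0.5
print(max_cdf_gap(group_one, group_one))  # 0.0
```

A value of 0 means the two score distributions coincide on the sample, while a value of 1 means they do not overlap at all.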

## Unfairness Mitigation with Grid Search

In this section, we will attempt to mitigate the unfairness in the incoming data using the `GridSearch` algorithm of `fairlearn`. We shall apply constraints of demographic parity - that is, we will attempt to equalise the positive prediction rates between whites and blacks. This is appropriate for affirmative action scenarios.
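As a rough illustration of what demographic parity measures, the sketch below computes the positive prediction rate for each group and the gap between them on made-up binary predictions. The helpers `selection_rates` and `demographic_parity_gap` are hypothetical names used for this example and are not `fairlearn` functions.

```python
import numpy as np

def selection_rates(predictions, groups):
    """Fraction of positive predictions within each group."""
    predictions = np.asarray(predictions)
    groups = np.asarray(groups)
    return {str(g): float(np.mean(predictions[groups == g]))
            for g in np.unique(groups)}

def demographic_parity_gap(predictions, groups):
    """Difference between the highest and lowest group selection rates."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Toy predictions (illustrative only): four applicants per group
toy_predictions = np.array([1, 1, 1, 0, 1, 0, 0, 0])
toy_groups = np.array(["white"] * 4 + ["black"] * 4)

print(selection_rates(toy_predictions, toy_groups))         # {'black': 0.25, 'white': 0.75}
print(demographic_parity_gap(toy_predictions, toy_groups))  # 0.5
```

A gap of zero would mean that both groups receive positive predictions at the same rate, which is the condition the constraint aims for.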

We will compute 41 models, on a grid covering the range $[-10, 10]$. The following cell may take a couple of minutes to run:

In [None]:
from fairlearn.reductions import GridSearch, DemographicParity

sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),
                   constraints=DemographicParity(),
                   grid_size=41,
                   grid_limit=10)

sweep.fit(X_train, y_train, sensitive_features=A_train)
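To relate a model's position in the sweep to a multiplier value, one can sketch the grid by hand. This assumes that the 41 grid points are evenly spaced over $[-10, 10]$ and that the models are stored in order of increasing multiplier; neither assumption is confirmed by the notebook, and `multiplier_for_index` is a hypothetical helper rather than part of `GridSearch`.

```python
import numpy as np

grid_size = 41
grid_limit = 10

# Assumption: evenly spaced multipliers covering [-grid_limit, grid_limit]
assumed_multipliers = np.linspace(-grid_limit, grid_limit, grid_size)
step = assumed_multipliers[1] - assumed_multipliers[0]

def multiplier_for_index(index):
    """Multiplier at a given grid index, under the even-spacing assumption."""
    return float(assumed_multipliers[index])

print(step)                      # 0.5
print(multiplier_for_index(0))   # -10.0
print(multiplier_for_index(20))  # 0.0
print(multiplier_for_index(33))  # 6.5
```

Under these assumptions the middle of the grid (index 20) corresponds to a multiplier of zero.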

We can plot the mean opportunity (given by the mean prediction in this case) of these models as a function of the multiplier used. We can also see that the opportunity is equalised for blacks and whites with a multiplier of around six.

In [None]:
def metric_sweep_plot(all_results, metric_func):
    xs = range(len(all_results))
    metrics = [metric_func(y_test, x.predictor.predict_proba(X_test)[:,1], A_test)
               for x in all_results]
    
    for r in ['black', 'white']:
        plt.plot(xs, [x.by_group[r] for x in metrics], label=r)
    plt.plot(xs, [x.overall for x in metrics], label='overall')
    plt.xlabel("Index")
    # Drop the leading "group_" (6 characters) from the function name for the axis label
    plt.ylabel(metric_func.__name__[6:])
    plt.legend()
    plt.show()
    
metric_sweep_plot(sweep.all_results, metric_func=group_mean_prediction)

We can examine the ROC-AUC score for this set of models:

In [None]:
metric_sweep_plot(sweep.all_results, metric_func=group_roc_auc_score)

We can also plot the minimum of the ROC-AUC score against the disparity in opportunity for each model in the sweep. This gives us an overview of the tradeoffs available to us:

In [None]:
def roc_auc_disparity_sweep_plot(all_results):
    roc_auc = np.zeros(len(all_results))
    disparity = np.zeros(len(all_results))
    
    for i in range(len(all_results)):
        preds = all_results[i].predictor.predict_proba(X_test)[:,1]
        roc_auc[i] = group_roc_auc_score(y_test, preds, A_test).minimum
        disparity[i] = group_mean_prediction(y_test, preds, A_test).range
        
    plt.scatter(roc_auc, disparity)
    plt.xlabel("Minimum ROC AUC score")
    plt.ylabel("Disparity in Opportunity")
    plt.show()
    print("Index of minimum disparity", disparity.argmin())
    
roc_auc_disparity_sweep_plot(sweep.all_results)

As an alternative to looking at the disparity in opportunity, we can use the maximum distance between the score CDFs (for blacks and whites) as the disparity metric for each model. This gives the following plot:

In [None]:
def roc_auc_cdf_disparity_sweep_plot(all_results):
    roc_auc = np.zeros(len(all_results))
    disparity = np.zeros(len(all_results))
    
    for i in range(len(all_results)):
        preds = all_results[i].predictor.predict_proba(X_test)[:,1]
        roc_auc[i] = group_roc_auc_score(y_test, preds, A_test).minimum
        _, _, dis = compare_cdfs(preds, A_test)
        disparity[i] = dis.max()
        
    plt.scatter(roc_auc, disparity)
    plt.xlabel("Minimum ROC AUC score")
    plt.ylabel("Disparity from CDF")
    plt.show()
    print("Index of minimum Disparity ", disparity.argmin())
    
roc_auc_cdf_disparity_sweep_plot(sweep.all_results)

We can now look at several different models, identified by their index in the results from the grid search. An obvious candidate is the model with minimum disparity, which occurs at index 33 regardless of the disparity metric chosen. However, the ROC-AUC score indicates that we are barely better than choosing at random here (slightly better than random for whites, worse than random for blacks; in such a case, one would flip the prediction).

In [None]:
def roc_auc_and_cdf(predictor):
    scores = pd.Series(predictor.predict_proba(X_test)[:,1], name="Scores")
    
    group_metric_printer("Chosen ROC-AUC score", group_roc_auc_score(y_test, scores, A_test))
    print("Disparity in Opportunity {0:.3f}".format(group_mean_prediction(y_test, scores, A_test).range))
    
    plot_and_compare_cdfs(scores, A_test)

roc_auc_and_cdf(sweep.all_results[33].predictor)

We can substantially increase the ROC-AUC score by accepting a slightly higher disparity. The change in disparity in opportunity is minimal, while the change based on the cumulative frequencies is rather larger. However, compared to the unmitigated model above, this is still a substantial improvement.

In [None]:
roc_auc_and_cdf(sweep.all_results[30].predictor)

## Mitigation with Threshold Optimisation

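We can also use the post-processing approach from `fairlearn`, `ThresholdOptimizer`, which takes an existing estimator and chooses group-specific thresholds on its output to satisfy the chosen constraint.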

In [None]:
from fairlearn.postprocessing import ThresholdOptimizer

class LogisticRegressionAsRegression:
    def __init__(self, logistic_regression_estimator):
        self.logistic_regression_estimator = logistic_regression_estimator
    
    def fit(self, X, y):
        self.logistic_regression_estimator.fit(X, y)
    
    def predict(self, X):
        # use predict_proba to get real values instead of 0/1, select only prob for 1
        scores = self.logistic_regression_estimator.predict_proba(X)[:,1]
        return scores

est = LogisticRegressionAsRegression(LogisticRegression(solver='liblinear', fit_intercept=True))

postprocess_estimator = ThresholdOptimizer(estimator=est,
                                          constraints="demographic_parity")

postprocess_estimator._plot = True
postprocess_estimator.fit(X_train, y_train, sensitive_features=A_train)
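The idea behind group-specific thresholds can be illustrated with a deliberately simplified sketch. It picks a deterministic cutoff per group so that roughly the same fraction of each group is selected. This is not the algorithm used by `ThresholdOptimizer`; the functions `group_thresholds_for_rate` and `apply_thresholds`, the target rate, and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def group_thresholds_for_rate(scores, groups, target_rate):
    """Per-group cutoffs that select roughly target_rate of each group."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    thresholds = {}
    for g in np.unique(groups):
        group_scores = scores[groups == g]
        # About target_rate of the group lies above its (1 - target_rate) quantile
        thresholds[str(g)] = float(np.quantile(group_scores, 1.0 - target_rate))
    return thresholds

def apply_thresholds(scores, groups, thresholds):
    """Binary predictions: 1 where a score exceeds the cutoff for its group."""
    scores = np.asarray(scores, dtype=float)
    cutoffs = np.array([thresholds[str(g)] for g in np.asarray(groups)])
    return (scores > cutoffs).astype(int)

# Toy scores (illustrative only): one group scores higher on average
toy_scores = np.array([0.6, 0.7, 0.8, 0.9, 0.3, 0.4, 0.5, 0.6])
toy_groups = np.array(["white"] * 4 + ["black"] * 4)

thresholds = group_thresholds_for_rate(toy_scores, toy_groups, target_rate=0.5)
toy_preds = apply_thresholds(toy_scores, toy_groups, thresholds)
print(toy_preds)                     # [0 0 1 1 0 0 1 1]

# A single shared cutoff, by contrast, selects the groups at different rates
shared = (toy_scores > 0.55).astype(int)
print(shared[:4].mean(), shared[4:].mean())  # 1.0 0.25
```

With separate cutoffs, half of each group is selected in this toy example, whereas the shared cutoff of 0.55 selects every member of one group and a quarter of the other.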

In [None]:
pp_preds = postprocess_estimator.predict(X_test, sensitive_features=A_test)

In [None]:
pp_mean_predictions = group_mean_prediction(y_test, # Actually unused
                                            pp_preds,
                                            A_test)
group_metric_printer("Predicted Pass Rate", pp_mean_predictions)

In [None]:
print(np.unique(pp_preds))

In [None]:
pp_roc_auc_score = group_roc_auc_score(y_test, pp_preds, A_test)

group_metric_printer("PP ROC-AUC", pp_roc_auc_score)