# Binary Classification with the UCI Credit-card Default Dataset
_**Mitigating introduced demographic disparities in the UCI credit-card default dataset**_

## Summary

1. [Introduction](#Introduction)
2. [The UCI Credit-card Default Dataset](#The-UCI-Credit-card-Default-Dataset)
3. [Using a Fairness Unaware Model](#Using-a-Fairness-Unaware-Model)
4. [Mitigating Equalized Odds Difference with Postprocessing](#Mitigating-Equalized-Odds-Difference-with-Postprocessing)
5. [Mitigating Equalized Odds Difference with GridSearch](#Mitigating-Equalized-Odds-Difference-with-GridSearch)

# Introduction

In this example, we emulate the problem of demographic disparities arising in loan decisions. Specifically, we consider scenarios in which algorithmic tools are trained on historical data and their predictions about loan applicants are used to make decisions about those applicants. See [here](https://www.nytimes.com/2019/11/10/business/Apple-credit-card-investigation.html) for an example involving sex-based discrimination in credit-limit decisions.

For this scenario, we use the [UCI dataset](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) on credit-card defaults in 2005 in Taiwan. For the sake of this exercise, we modify the original UCI dataset: we introduce a synthetic feature that has strong predictive power for female clients but is uninformative for male clients. We fit a variety of models for predicting the default of a client. We show that a fairness-unaware training algorithm can lead to a predictor that achieves much better accuracy for women than for men, and that it is insufficient to simply remove the sensitive feature (in this case sex) from training. We then use Fairlearn to mitigate this disparity in accuracy with either `ThresholdOptimizer` or `GridSearch`.

In [None]:
# General imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Data processing
from sklearn.model_selection import train_test_split

# Models
import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV

# Fairlearn algorithms and utils
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.reductions import GridSearch, EqualizedOdds
from fairlearn.widget import FairlearnDashboard

# Metrics
from fairlearn.metrics import group_selection_rate, group_zero_one_loss, group_recall_score, \
                              metric_by_group, group_roc_auc_score
from sklearn.metrics import balanced_accuracy_score, accuracy_score, recall_score, roc_auc_score

# The UCI Credit-card Default Dataset

The UCI dataset contains data on 30,000 clients and their credit card transactions at a bank in Taiwan. In addition to static client features, the dataset contains the history of credit card bill payments between April and September 2005, as well as the balance limit of the client's credit card. The target is whether the client will default on a card payment in the following month, October 2005. One can imagine that a model trained on this data could be used in practice to determine whether a client is eligible for another product offering, such as an auto loan.

In [None]:
# Load the data
data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
dataset = pd.read_excel(io=data_url, header=1).drop(columns=['ID']).rename(columns={'PAY_0':'PAY_1'})
dataset.head()

Dataset columns:

* `LIMIT_BAL`: credit card limit, will be replaced by a synthetic feature
* `SEX, EDUCATION, MARRIAGE, AGE`: client demographic features
* `PAY_[1-6]`: repayment status for April-September (treated as categorical below)
* `BILL_AMT[1-6]`: amount on bill statement for April-September
* `PAY_AMT[1-6]`: payment amount for April-September
* `default payment next month`: target, whether the customer defaulted the following month

In [None]:
# Extract the sensitive feature
A = dataset["SEX"]
A_str = A.map({2: "female", 1: "male"})
# Extract the target
Y = dataset["default payment next month"]
categorical_features = ['EDUCATION', 'MARRIAGE','PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
for col in categorical_features:
    dataset[col] = dataset[col].astype('category')

## Introduce a Synthetic Feature

We manipulate the balance-limit feature `LIMIT_BAL` to make it highly predictive for women but not for men. For example, we can imagine that a lower credit limit indicates that a female client is less likely to default, but provides no information on a male client's probability of default.

In [None]:
dist_scale = 0.5
np.random.seed(12345)
# Make 'LIMIT_BAL' informative of the target
dataset['LIMIT_BAL'] = Y + np.random.normal(scale=dist_scale, size=dataset.shape[0])
# But then make it uninformative for the male clients
dataset.loc[A==1, 'LIMIT_BAL'] = np.random.normal(scale=dist_scale, size=dataset[A==1].shape[0])

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
# Plot distribution of LIMIT_BAL for men
dataset['LIMIT_BAL'][(A==1) & (Y==0)].plot(kind='kde', label="Payment on Time", ax=ax1, 
                                           title="LIMIT_BAL distribution for men")
dataset['LIMIT_BAL'][(A==1) & (Y==1)].plot(kind='kde', label="Payment Default", ax=ax1)
# Plot distribution of LIMIT_BAL for women
dataset['LIMIT_BAL'][(A==2) & (Y==0)].plot(kind='kde', label="Payment on Time", ax=ax2, 
                                           legend=True, title="LIMIT_BAL distribution for women")
dataset['LIMIT_BAL'][(A==2) & (Y==1)].plot(kind='kde', label="Payment Default", ax=ax2, 
                                           legend=True).legend(bbox_to_anchor=(1.6, 1))
plt.show()

We notice from the above figures that the new `LIMIT_BAL` feature is indeed highly predictive for women, but not for men.
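
The visual impression can also be checked numerically. The sketch below does not touch the UCI data: it rebuilds the same kind of feature on randomly generated labels (the names `y`, `sex`, and `class_mean_gap` are illustrative assumptions) and compares the mean feature value of defaulters and non-defaulters within each group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
# Hypothetical stand-ins for the notebook's target and sensitive feature
y = rng.integers(0, 2, size=n)      # 0 = paid on time, 1 = default
sex = rng.integers(1, 3, size=n)    # 1 = male, 2 = female

# Same construction as in the notebook: informative for everyone first,
# then overwritten with pure noise for the male clients
feature = y + rng.normal(scale=0.5, size=n)
is_male = sex == 1
feature[is_male] = rng.normal(scale=0.5, size=is_male.sum())

def class_mean_gap(x, target):
    """Mean feature value of defaulters minus that of non-defaulters."""
    return x[target == 1].mean() - x[target == 0].mean()

gap_female = class_mean_gap(feature[~is_male], y[~is_male])
gap_male = class_mean_gap(feature[is_male], y[is_male])
print(f"female gap: {gap_female:.2f}, male gap: {gap_male:.2f}")
```

Under these assumptions the gap should be close to 1 for women (the feature tracks the label) and close to 0 for men (the feature is noise).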

In [None]:
# Train-test split
df_train, df_test, Y_train, Y_test, A_train, A_test, A_str_train, A_str_test = train_test_split(
    dataset.drop(columns=['SEX', 'default payment next month']), 
    Y, 
    A, 
    A_str,
    test_size = 0.3, 
    random_state=12345,
    stratify=Y)

# Using a Fairness Unaware Model

We train an out-of-the-box `lightgbm` model on the modified data and assess several disparity metrics. 

In [None]:
lgb_params = {
    'objective' : 'binary',
    'metric' : 'auc',
    'learning_rate': 0.03,
    'num_leaves' : 10,
    'max_depth' : 3
}

In [None]:
model = lgb.LGBMClassifier(**lgb_params)

In [None]:
model.fit(df_train, Y_train)

In [None]:
# Scores on test set
test_scores = model.predict_proba(df_test)[:, 1]

In [None]:
# Train AUC
roc_auc_score(Y_train, model.predict_proba(df_train)[:, 1])

In [None]:
# Predictions (0 or 1) on test set
test_preds = (test_scores >= np.mean(Y_train)) * 1

In [None]:
# LightGBM feature importance 
lgb.plot_importance(model, height=0.6, title="Feature importance (LightGBM)",
                    importance_type="gain", max_num_features=15)
plt.show()

We notice that the synthetic feature `LIMIT_BAL` appears as the most important feature in this model although it has no predictive power for an entire demographic segment in the data. 

In [None]:
# Helper functions
def group_equalized_odds_diff(y_true, y_pred, group):
    TPR_diff = group_recall_score(y_true, y_pred, group).range
    TNR_diff = group_recall_score(1-y_true, 1-y_pred, group).range
    return max(TPR_diff, TNR_diff)

def get_metrics_df(models_dict, y_true, group):
    metrics_dict = {
        "Demographic parity difference": (lambda x: group_selection_rate(y_true, x, group).range, True),
        "Demographic parity ratio": (lambda x: group_selection_rate(y_true, x, group).range_ratio, True),
        # The range of per-group balanced accuracy equals the range of per-group balanced error rate
        "Balanced error rate difference": (
            lambda x: metric_by_group(balanced_accuracy_score, y_true, x, group).range, True),
        "Equal opportunity difference": (lambda x: group_recall_score(y_true, x, group).range, True),
        "Equalized odds difference": (lambda x: group_equalized_odds_diff(y_true, x, group), True),
        "Group AUC difference": (lambda x: group_roc_auc_score(y_true, x, group).range, False),
        "Overall AUC": (lambda x: roc_auc_score(y_true, x), False)
        }
    df_dict = {}
    for metric_name, (metric_func, use_preds) in metrics_dict.items():
        df_dict[metric_name] = [metric_func(preds) if use_preds else metric_func(scores) 
                                for model_name, (preds, scores) in models_dict.items()]
    return pd.DataFrame.from_dict(df_dict, orient="index", columns=models_dict.keys())

We calculate several disparity metrics below:

In [None]:
# Metrics
models_dict = {"Unmitigated": (test_preds, test_scores)}
get_metrics_df(models_dict, Y_test, A_str_test)

Throughout the rest of the exercise, we focus on mitigating the *equalized odds difference*, which we use to quantify the disparity in accuracy experienced by different demographics. Our goal is to ensure that neither of the two groups (men and women) has a substantially larger false-positive rate or false-negative rate than the other. The equalized odds difference quantifies this. It is equal to the larger of the following two numbers: the difference between the false-positive rates of the two groups, and the difference between the false-negative rates of the two groups.
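
As a worked illustration of this definition, the following minimal sketch computes the metric from scratch with NumPy on a small made-up data set. The function names `rates` and `equalized_odds_difference` are assumptions for this example and are not part of the Fairlearn API.

```python
import numpy as np

def rates(y_true, y_pred):
    """Return (false-positive rate, false-negative rate) for 0/1 arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fpr = np.mean(y_pred[y_true == 0] == 1)
    fnr = np.mean(y_pred[y_true == 1] == 0)
    return fpr, fnr

def equalized_odds_difference(y_true, y_pred, group):
    """Larger of the between-group FPR gap and the between-group FNR gap."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    fprs, fnrs = zip(*(rates(y_true[group == g], y_pred[group == g])
                       for g in np.unique(group)))
    return max(max(fprs) - min(fprs), max(fnrs) - min(fnrs))

# Toy data (made up): 4 negatives followed by 4 positives in each group
group = np.array(["female"] * 8 + ["male"] * 8)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1] * 2)
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1,    # female: FPR 0.25, FNR 0.00
                   0, 0, 1, 1, 1, 1, 0, 0])   # male:   FPR 0.50, FNR 0.50

eo_diff = equalized_odds_difference(y_true, y_pred, group)
print(eo_diff)
```

Here the false-positive rates differ by 0.25 and the false-negative rates by 0.50, so the equalized odds difference is 0.50.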

As the overall performance metric, we use the _area under the ROC curve_ (AUC), which is suited to classification problems with a large imbalance of positive and negative examples. For predictors that output hard 0/1 labels rather than scores, the AUC coincides with _balanced accuracy_.
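
The relationship between AUC and balanced accuracy for 0/1 predictions can be verified on a toy example. The sketch below uses plain Python with made-up labels; `auc` is a simple pairwise (rank-based) implementation written for this illustration, not the scikit-learn function.

```python
def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    n_pos = sum(y == 1 for y in y_true)
    n_neg = sum(y == 0 for y in y_true)
    tpr = sum(p == 1 for y, p in zip(y_true, y_pred) if y == 1) / n_pos
    tnr = sum(p == 0 for y, p in zip(y_true, y_pred) if y == 0) / n_neg
    return (tpr + tnr) / 2

# Made-up data: 6 negatives, 4 positives, hard 0/1 predictions
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

print(auc(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

Both values equal 17/24 here (TPR 3/4, TNR 4/6). With real-valued scores instead of 0/1 labels the two metrics generally differ.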

# Mitigating Equalized Odds Difference with Postprocessing

We attempt to mitigate the disparities in the `lightgbm` predictions using the Fairlearn postprocessing algorithm `ThresholdOptimizer`. This algorithm finds suitable group-specific thresholds for the scores (class probabilities) produced by the `lightgbm` model by maximizing accuracy under the constraint that the equalized odds difference (on training data) is zero. Since our goal is to optimize balanced accuracy, we resample the training data to have the same number of positive and negative examples. (This means that `ThresholdOptimizer` is effectively optimizing balanced accuracy on the original data.)
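
To see why plain accuracy on a class-balanced sample approximates balanced accuracy on the original data, consider the following sketch. It uses made-up labels and predictions in plain Python and is only meant to illustrate the resampling argument, not the internals of `ThresholdOptimizer`.

```python
import random

# Made-up data: 10,000 negatives (80% predicted correctly)
# and 2,000 positives (60% predicted correctly)
y_true = [0] * 10_000 + [1] * 2_000
y_pred = ([int(i % 5 == 0) for i in range(10_000)]
          + [int(i % 5 < 3) for i in range(2_000)])

def accuracy(idx):
    """Fraction of correct predictions among the given row indices."""
    return sum(y_true[i] == y_pred[i] for i in idx) / len(idx)

pos_idx = [i for i, y in enumerate(y_true) if y == 1]
neg_idx = [i for i, y in enumerate(y_true) if y == 0]

# Downsample the majority class to the size of the minority class
rng = random.Random(1234)
balanced_idx = pos_idx + rng.sample(neg_idx, len(pos_idx))

plain_accuracy = accuracy(range(len(y_true)))                     # dominated by negatives
balanced_accuracy = (accuracy(pos_idx) + accuracy(neg_idx)) / 2   # (TPR + TNR) / 2
accuracy_on_balanced_sample = accuracy(balanced_idx)

print(plain_accuracy, balanced_accuracy, accuracy_on_balanced_sample)
```

In this toy setup, plain accuracy on the full data is about 0.77 while balanced accuracy is 0.70. The accuracy on the balanced sample lands near 0.70, up to sampling noise in which negatives are drawn.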

In [None]:
postprocess_est = ThresholdOptimizer(
    estimator=model,
    constraints="equalized_odds")

In [None]:
# Balanced data set is obtained by sampling the same number of points from the majority class (Y=0)
# as there are points in the minority class (Y=1)
balanced_idx1 = df_train[Y_train==1].index
pp_train_idx = balanced_idx1.union(Y_train[Y_train==0].sample(n=balanced_idx1.size, random_state=1234).index)

In [None]:
df_train_balanced = df_train.loc[pp_train_idx, :]
Y_train_balanced = Y_train.loc[pp_train_idx]
A_train_balanced = A_train.loc[pp_train_idx]

In [None]:
postprocess_est.fit(df_train_balanced, Y_train_balanced, sensitive_features=A_train_balanced)

In [None]:
postprocess_preds = postprocess_est.predict(df_test, sensitive_features=A_test)

In [None]:
# ThresholdOptimizer outputs only 0/1 predictions, so they also stand in for its scores
models_dict = {"Unmitigated": (test_preds, test_scores),
              "ThresholdOptimizer": (postprocess_preds, postprocess_preds)}
get_metrics_df(models_dict, Y_test, A_str_test)

Note that the `ThresholdOptimizer` method significantly reduces the disparity according to multiple metrics. 

Below, we compare this model with the unmitigated `lightgbm` model using the Fairlearn dashboard. 

**Unmitigated Model vs ThresholdOptimizer: Dashboard Demo**

In [None]:
FairlearnDashboard(sensitive_features=A_str_test, sensitive_feature_names=['Sex'],
                   y_true=Y_test,
                   y_pred={"Unmitigated": test_preds,
                          "ThresholdOptimizer": postprocess_preds})

# Mitigating Equalized Odds Difference with GridSearch

We now attempt to mitigate disparities using the `GridSearch` algorithm from Fairlearn. Unlike `ThresholdOptimizer`, the predictors produced by `GridSearch` do not access the sensitive feature at test time. Also, rather than training a single model, we train multiple models corresponding to different trade-off points between the performance metric (balanced accuracy) and fairness metric (equalized odds difference).
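
When comparing many trade-off points, it helps to keep only the models that are not dominated, that is, models for which no other model is at least as fair and strictly more accurate. The following sketch illustrates this filter in plain Python on hypothetical (balanced accuracy, equalized odds difference) pairs; the model names and numbers are invented for the example.

```python
# Hypothetical (balanced accuracy, equalized odds difference) pairs
candidates = {
    "m0": (0.80, 0.30),
    "m1": (0.78, 0.10),
    "m2": (0.75, 0.12),   # dominated by m1: less accurate and more disparate
    "m3": (0.70, 0.02),
    "m4": (0.69, 0.05),   # dominated by m3
}

def non_dominated(models):
    """Keep models that reach the best accuracy among all models
    whose disparity is less than or equal to their own."""
    keep = []
    for name, (acc, disp) in models.items():
        best_acc = max(a for a, d in models.values() if d <= disp)
        if acc >= best_acc:
            keep.append(name)
    return keep

selected = non_dominated(candidates)
print(selected)   # ['m0', 'm1', 'm3']
```

The surviving models trace out the frontier along which accuracy can only be gained by accepting more disparity.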

In [None]:
# Train GridSearch
sweep = GridSearch(model,
                   constraints=EqualizedOdds(),
                   grid_size=41,
                   grid_limit=2)

sweep.fit(df_train_balanced, Y_train_balanced, sensitive_features=A_train_balanced)

In [None]:
# Note: `_predictors` is a private attribute; its name may differ across Fairlearn versions
sweep_preds = [predictor.predict(df_test) for predictor in sweep._predictors]
sweep_scores = [predictor.predict_proba(df_test)[:, 1] for predictor in sweep._predictors]

In [None]:
equalized_odds_sweep = [
    group_equalized_odds_diff(Y_test, preds, A_str_test)
    for preds in sweep_preds
]
balanced_accuracy_sweep = [balanced_accuracy_score(Y_test, preds) for preds in sweep_preds]
auc_sweep = [roc_auc_score(Y_test, scores) for scores in sweep_scores]

In [None]:
# Select only non-dominated models (with respect to balanced accuracy and equalized odds difference)
all_results = pd.DataFrame(
    {"predictor": sweep._predictors, "accuracy": balanced_accuracy_sweep, "disparity": equalized_odds_sweep}
) 
non_dominated = [] 
for row in all_results.itertuples(): 
    accuracy_for_lower_or_eq_disparity = all_results["accuracy"][all_results["disparity"] <= row.disparity] 
    if row.accuracy >= accuracy_for_lower_or_eq_disparity.max(): 
        non_dominated.append(True)
    else:
        non_dominated.append(False)

equalized_odds_sweep_non_dominated = np.asarray(equalized_odds_sweep)[non_dominated]
balanced_accuracy_non_dominated = np.asarray(balanced_accuracy_sweep)[non_dominated]
auc_non_dominated = np.asarray(auc_sweep)[non_dominated]

In [None]:
# Plot equalized odds difference vs balanced accuracy
plt.scatter(balanced_accuracy_non_dominated, equalized_odds_sweep_non_dominated, label="GridSearch Models")
plt.scatter(balanced_accuracy_score(Y_test, test_preds), group_equalized_odds_diff(Y_test, test_preds, A_str_test), 
           label="Unmitigated Model")
plt.scatter(balanced_accuracy_score(Y_test, postprocess_preds), 
            group_equalized_odds_diff(Y_test, postprocess_preds, A_str_test) , label="ThresholdOptimizer Model")
plt.xlabel("Balanced Accuracy")
plt.ylabel("Equalized Odds Difference")
plt.legend(bbox_to_anchor=(1.55, 1))
plt.show()

In [None]:
# Plot equalized odds difference vs auc
plt.scatter(auc_non_dominated, equalized_odds_sweep_non_dominated, label="GridSearch Models")
plt.scatter(roc_auc_score(Y_test, test_scores), group_equalized_odds_diff(Y_test, test_preds, A_str_test), 
            label="Unmitigated Model")
plt.scatter(roc_auc_score(Y_test, postprocess_preds), 
            group_equalized_odds_diff(Y_test, postprocess_preds, A_str_test) , label="ThresholdOptimizer Model")
plt.xlabel("AUC")
plt.ylabel("Equalized Odds Difference")
plt.legend(bbox_to_anchor=(1.55, 1))
plt.show()

In [None]:
model_sweep_dict = {"GridSearch_{}".format(i): sweep_preds[i] for i in range(len(sweep_preds)) if non_dominated[i]}
model_sweep_dict.update({"Unmitigated": test_preds, "ThresholdOptimizer": postprocess_preds})

**Grid Search: Dashboard Demo**

We compare the `GridSearch` candidate models with the unmitigated `lightgbm` model and the `ThresholdOptimizer` model using the Fairlearn dashboard.

In [None]:
FairlearnDashboard(sensitive_features=A_str_test, sensitive_feature_names=['Sex'],
                   y_true=Y_test,
                   y_pred=model_sweep_dict)