# Bias and Explainability with Amazon SageMaker Clarify

## Overview
Biases are imbalances in the training data, or in the prediction behavior of the model across different groups. Sometimes these biases can cause harm to demographic subgroups, e.g. based on age or income bracket. The field of machine learning provides an opportunity to address biases by detecting and measuring them in your data and model.

![ Bias and Explainability with SageMaker Clarify ](clarify_explainability_arch.png)

Amazon SageMaker Clarify provides machine learning developers with greater visibility into their training data and models, so they can identify and limit bias and explain predictions.

In this notebook, we are going to go through each stage of the ML lifecycle, and show where you can include Clarify.

## Problem Formation

In this notebook, we are looking to predict the final grade for students in a math class, from the popular [Student Performance dataset](https://archive.ics.uci.edu/ml/datasets/Student+Performance) courtesy of UC Irvine.

For this dataset, final grades range from 0 to 20, where 15-20 are the most favorable outcomes. The task could be framed as multi-class classification over the 21 possible grades; in this notebook the model is trained as a regression (XGBoost with the `reg:squarederror` objective) that predicts the grade as a number.

The benefit of using ML here is that we can estimate a grade for a student who is unable to attend the final exam due to circumstances outside their control.

The notebook will take 90 minutes to execute and will cost approximately $2.

## Prerequisites
1. This notebook works in the following environments.
   - Notebook Instances: Jupyter
   - Notebook Instances: JupyterLab
   - Studio
1. Use the Python 3 (Data Science) kernel on an ml.m5.large instance.
1. This is a standalone notebook, and it does not depend on other notebooks.


In [None]:
import sagemaker
import boto3
import pandas as pd

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = "sagemaker/student-data-xgb"


---

## Dataset Construction

Data Construction is certainly a stage where bias can be introduced. We need to consider the following:

- Is the training data representative of different groups?
- Are there biases in labels or features?
- Does the data need to be modified to mitigate bias?
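
The first two questions above can be checked with a few lines of pandas before any Clarify job is run. The following is a minimal sketch on hypothetical toy data (the values are invented for illustration; only the column names mirror this notebook), assuming that a final grade of 11 or more counts as a favorable label.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's encoded frame.
toy = pd.DataFrame(
    {
        "Gender": [1, 1, 1, 0, 0, 1, 0, 1],
        "FinalGrade": [15, 9, 12, 8, 16, 11, 7, 14],
    }
)

# Share of rows per group: is each group represented?
representation = toy["Gender"].value_counts(normalize=True).sort_index()

# Share of favorable labels per group (assumption: grades >= 11 are favorable).
favorable = toy["FinalGrade"] >= 11
favorable_rate = favorable.groupby(toy["Gender"]).mean()

print(representation)
print(favorable_rate)
```

A large gap in either series is a prompt to look more closely, not a verdict on its own.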


### Download the data

In [None]:
# Cleanup files from previous partial runs
!rm -rf student.zip student-merge.R student-por.csv student.txt student-mat.csv

!wget -O student.zip http://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
!unzip student.zip
!rm -rf student.zip student-merge.R student-por.csv student.txt
!ls
print("done!")

### Inspect the Data

In [None]:
local_data_path = "./student-mat.csv"
data = pd.read_csv(local_data_path, sep=";")

pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 500)
data.head()

### Preprocessing

This dataset is anonymized, with some encoding. We now need to encode various categorical variables.
We will not be including First Grade (G1) or Second Grade (G2), as these are strong predictors of the final grade - we want to understand how the other attributes contribute.

In [None]:
def _preprocess_data(students):

    students.columns = [
        "School",  # 1. (binary)      student's school
        "Gender",  # 2. (binary)      student's gender
        "Age",  # 3. (numerical)   student's age
        "Address",  # 4. (binary)      student's home address type
        "FamilySize",  # 5. (binary)      family size
        "ParentCohabStatus",  # 6. (binary)      parent's cohabitation status
        "MotherEducation",  # 7. (numeric)     mother's education
        "FatherEducation",  # 8. (numeric)     father's education
        "MotherJob",  # 9. (nominal)     mother's job
        "FatherJob",  # 10.(nominal)     father's job
        "SchoolChoiceReason",  # 11.(nominal)     reason to choose this school
        "Guardian",  # 12.(nominal)     student's guardian
        "TravelTime",  # 13.(numerical)   home to school travel time
        "StudyTime",  # 14.(numerical)   weekly study time
        "Failures",  # 15.(numerical)   number of past class failures
        "SchoolSup",  # 16.(binary)      extra educational support
        "FamilySup",  # 17.(binary)      family educational support
        "ExtraPaidClasses",  # 18.(binary)      extra paid classes within the course subject
        "ExtraActivities",  # 19.(binary)      extra-curricular activities
        "Nursery",  # 20.(binary)      attended nursery school
        "WantsHigherEdu",  # 21.(binary)      wants to take higher education
        "HasInternet",  # 22.(binary)      Internet access at home
        "Romantic",  # 23.(binary)      with a romantic relationship
        "FamilyRelQuality",  # 24.(numerical)   quality of family relationships
        "FreeTime",  # 25.(numerical)   free time after school
        "GoOut",  # 26.(numerical)   going out with friends
        "WorkdayAlcohol",  # 27.(numerical)   workday alcohol consumption
        "WeekendAlcohol",  # 28.(numerical)   weekend alcohol consumption
        "HealthStatus",  # 29.(numerical)   current health status
        "Absences",  # 30.(numerical)   number of school absences
        "FirstGrade",  # 31.(numerical)   G1 - first period grade
        "SecondGrade",  # 32.(numerical)   G2 - second period grade
        "FinalGrade",  # 33.(numerical)   G3 - final grade (TARGET)
    ]

    # For xgboost, we need to put target variable in the first column.
    df = pd.DataFrame(students.FinalGrade)

    # Encode the Attributes.
    res = students.School.map({"GP": 1, "MS": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.Gender.map({"F": 1, "M": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    df = pd.concat([df, students.Age], axis=1, sort=False)

    res = students.Address.map({"U": 1, "R": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.FamilySize.map({"LE3": 1, "GT3": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.ParentCohabStatus.map({"T": 1, "A": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    df = pd.concat([df, students.MotherEducation], axis=1, sort=False)

    df = pd.concat([df, students.FatherEducation], axis=1, sort=False)

    res = pd.get_dummies(students.MotherJob, prefix="MotherJob")
    df = pd.concat([df, res], axis=1, sort=False)

    res = pd.get_dummies(students.FatherJob, prefix="FatherJob")
    df = pd.concat([df, res], axis=1, sort=False)

    res = pd.get_dummies(students.SchoolChoiceReason, prefix="SchoolChoiceReason")
    df = pd.concat([df, res], axis=1, sort=False)

    df = pd.concat([df, students.TravelTime], axis=1, sort=False)

    df = pd.concat([df, students.StudyTime], axis=1, sort=False)

    df = pd.concat([df, students.Failures], axis=1, sort=False)

    res = students.SchoolSup.map({"yes": 1, "no": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.FamilySup.map({"yes": 1, "no": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.ExtraPaidClasses.map({"yes": 1, "no": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.Nursery.map({"yes": 1, "no": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.WantsHigherEdu.map({"yes": 1, "no": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.HasInternet.map({"yes": 1, "no": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    res = students.Romantic.map({"yes": 1, "no": 0})
    df = pd.concat([df, res], axis=1, sort=False)

    df = pd.concat([df, students.FamilyRelQuality], axis=1, sort=False)
    df = pd.concat([df, students.FreeTime], axis=1, sort=False)
    df = pd.concat([df, students.GoOut], axis=1, sort=False)
    df = pd.concat([df, students.WorkdayAlcohol], axis=1, sort=False)
    df = pd.concat([df, students.WeekendAlcohol], axis=1, sort=False)
    df = pd.concat([df, students.HealthStatus], axis=1, sort=False)
    df = pd.concat([df, students.Absences], axis=1, sort=False)

    # We will not be including G1 or G2, as these are strong predictors
    # of the final grade - we want to understand how the other attributes contribute.
    print("DF Shape: {}".format(df.shape))
    print("DF columns: {}".format(df.columns))

    # X will be our dataframe of attributes only, without the target:
    X = df.drop(["FinalGrade"], axis=1)

    # y will be our array of target values, the final grades.
    y = df.FinalGrade

    return X, y, df


X, y, df = _preprocess_data(data)
df.head()
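
The two encoding patterns used in the preprocessing function can be seen in isolation on a tiny frame. This is a sketch on hypothetical rows, not the notebook's data: binary attributes are mapped to 1/0, and nominal attributes are expanded into one indicator column per category. The `dtype=int` argument is an addition here; it keeps the indicators as 0/1 integers on pandas versions that would otherwise return booleans.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame(
    {
        "School": ["GP", "MS", "GP"],
        "MotherJob": ["teacher", "health", "other"],
    }
)

# Binary attribute: map the two categories to 1/0.
school = toy["School"].map({"GP": 1, "MS": 0})

# Nominal attribute: one indicator column per category.
mother_job = pd.get_dummies(toy["MotherJob"], prefix="MotherJob", dtype=int)

encoded = pd.concat([school, mother_job], axis=1)
print(encoded)
```

Note that `map` returns `NaN` for any category missing from the dictionary, so it is worth checking for nulls after encoding.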

### Split data into training and validation sets

We are going to use the **train_test_split** from sklearn which will randomize the rows and split into two groups for training and validation.

In [None]:
from sklearn.model_selection import train_test_split

train_data, validation_data = train_test_split(df, test_size=0.3, random_state=300)
print(train_data.shape)
print(validation_data.shape)

### Upload training and validation sets to S3

Before we can create a pre-training bias report for Clarify, we need to upload our data to S3.

In [None]:
from sagemaker.s3 import S3Uploader
from sagemaker.inputs import TrainingInput

# SageMaker's built-in XGBoost expects CSV input without a header row,
# and the Clarify DataConfig below supplies the column names via `headers`.
train_data.to_csv("train_data.csv", index=False, header=False)
train_data_s3_path = S3Uploader.upload(
    "train_data.csv", "s3://{}/{}".format(bucket, prefix + "/train")
)
print("Train data uploaded to: " + train_data_s3_path)

validation_data.to_csv("validation_data.csv", index=False, header=False)
validation_data_s3_path = S3Uploader.upload(
    "validation_data.csv", "s3://{}/{}".format(bucket, prefix + "/validation")
)
print("Validation data uploaded to: " + validation_data_s3_path)

from IPython.display import HTML

url = "https://s3.console.aws.amazon.com/s3/buckets/{}?region={}&prefix={}/&showversions=false".format(
    bucket, session.boto_region_name, prefix
)
HTML('<a target="_blank" href="{}">Click here to view datasets in S3</a>'.format(url))

### SageMaker Clarify - Pre-training Bias Report

This step takes around 12 minutes.

In [None]:
from sagemaker import clarify

pretraining_bias_instance_count = 1
pretraining_bias_instance_type = "ml.c5.xlarge"

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=pretraining_bias_instance_count,
    instance_type=pretraining_bias_instance_type,
    sagemaker_session=session,
)

#### Define Data Configuration

In [None]:
bias_pretrain_report_output_path = "s3://{}/{}/clarify-pretrain-bias".format(bucket, prefix)

bias_data_config = clarify.DataConfig(
    s3_data_input_path=train_data_s3_path,
    s3_output_path=bias_pretrain_report_output_path,
    label="FinalGrade",
    headers=train_data.columns.to_list(),
    dataset_type="text/csv",
)

#### Define Bias Configuration

Now we can configure the pre-training bias analysis over the favorable values; in the configuration below these are final grades of 11 and above.

In [None]:
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    facet_name="Gender",
    group_name="Age",
)
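
To make the configuration more concrete, here is a minimal sketch of two of the pre-training metrics that Clarify reports, Class Imbalance (CI) and Difference in Proportions of Labels (DPL), computed by hand. The data is hypothetical, and which facet value is treated as facet *a* versus facet *d* is an assumption made for this sketch only.

```python
import pandas as pd

# Hypothetical toy data; not taken from the Student Performance dataset.
toy = pd.DataFrame(
    {
        "Gender": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
        "FinalGrade": [12, 15, 9, 11, 14, 10, 8, 13, 10, 9],
    }
)

# Same favorable grades as the BiasConfig above.
favorable_values = list(range(11, 21))
is_favorable = toy["FinalGrade"].isin(favorable_values)

# Assumption for this sketch: facet a is Gender == 0, facet d is Gender == 1.
facet_a = toy["Gender"] == 0
facet_d = toy["Gender"] == 1

# CI = (n_a - n_d) / (n_a + n_d): how unevenly the facets are represented.
n_a, n_d = facet_a.sum(), facet_d.sum()
class_imbalance = (n_a - n_d) / (n_a + n_d)

# DPL = q_a - q_d: gap in the share of favorable labels between facets.
q_a = is_favorable[facet_a].mean()
q_d = is_favorable[facet_d].mean()
dpl = q_a - q_d

print(f"CI  = {class_imbalance:.3f}")
print(f"DPL = {dpl:.3f}")
```

Both metrics are 0 when the facets are balanced, and move toward +1 or -1 as one facet dominates.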

#### Run Pre-training Bias Job

Run the SageMaker Clarify processing job; this takes about 5 minutes.

In [None]:
%%time

clarify_processor.run_pre_training_bias(
    data_config=bias_data_config, data_bias_config=bias_config, methods="all"
)

Once the report has been generated, we can inspect what our values are looking like.

In [None]:
from IPython.display import HTML, display

url = "https://s3.console.aws.amazon.com/s3/buckets/{}?region={}&prefix={}/clarify-pretrain-bias/&showversions=false".format(
    bucket, session.boto_region_name, prefix
)
# display() is needed because this is not the last expression in the cell.
display(
    HTML(
        '<a target="_blank" href="{}">Click here to view pre-training bias reports in S3</a>'.format(
            url
        )
    )
)
from sagemaker.s3 import S3Downloader

local_databias_report_path = "./pretraining_bias_reports"

S3Downloader.download(
    local_path=local_databias_report_path,
    s3_uri=bias_pretrain_report_output_path,
    sagemaker_session=session,
)

### Inspect the Pretraining Bias Report

You can view the report [here](./pretraining_bias_reports/report.ipynb).

## Algorithm Selection

During algorithm selection we need to consider the following:

- Do fairness constraints need to be included in the objective function?

For the model, we will use the [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). It can handle a variety of data types, relationships, and distributions, and has a number of hyperparameters that you can fine-tune.

In [None]:
from sagemaker.s3 import S3Uploader
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(train_data_s3_path, content_type="csv")
validation_input = TrainingInput(validation_data_s3_path, content_type="csv")

## Training Process

For this demo, we will train 10 models and pick the best one to deploy, based on the lowest **Root Mean Square Error (RMSE).**

In [None]:
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator

container = retrieve("xgboost", session.boto_region_name, version="1.2-1")

training_instance_count = 1
training_instance_type = "ml.m5.xlarge"

xgb = Estimator(
    container,
    role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    disable_profiler=True,
    sagemaker_session=session,
)

xgb.set_hyperparameters(
    eval_metric="rmse",
    objective="reg:squarederror",
    num_round=100,
    rate_drop=0.3,
    tweedie_variance_power=1.4,
)
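
The `eval_metric="rmse"` setting above is what the tuning job later uses to rank models. As a reminder of what that number means, here is a small sketch of the RMSE calculation on hypothetical grades; the `rmse` helper is written for this example and is not part of the SageMaker SDK.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical actual and predicted final grades.
actual = [10, 14, 8, 16]
predicted = [12, 13, 8, 12]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single prediction that is off by 4 grades weighs more than four predictions that are each off by 1.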

#### Objective Metric

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
    "subsample": ContinuousParameter(0, 1),
}
objective_metric_name = "validation:rmse"
tuner = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=5,
    objective_type="Minimize",
)

%time tuner.fit({"train": train_input, "validation": validation_input}, include_cls_metadata=False)

This tuning job will take around 8 minutes on this dataset.

In [None]:
boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]

Deploy the model, and grab the model name for passing to post-training jobs.

In [None]:
xgb_model_name = "xgb-student-model"
predictor_instance_count = 1
predictor_instance_type = "ml.c5.xlarge"

xgb_predictor = tuner.deploy(
    initial_instance_count=predictor_instance_count,
    instance_type=predictor_instance_type,
    model_name=xgb_model_name,
)

print("Model is successfully deployed.")
xgb_predictor.endpoint_name

## Testing Process

This is where we can do some post-training testing with Clarify, both for bias and explainability.

### Post-training bias report

In [None]:
bias_posttrain_report_output_path = "s3://{}/{}/clarify-posttrain-bias".format(bucket, prefix)

bias_posttrain_data_config = clarify.DataConfig(
    s3_data_input_path=train_data_s3_path,
    s3_output_path=bias_posttrain_report_output_path,
    label="FinalGrade",
    headers=train_data.columns.to_list(),
    dataset_type="text/csv",
)

posttraining_bias_instance_count = 1
posttraining_bias_instance_type = "ml.c5.2xlarge"

model_config = clarify.ModelConfig(
    model_name=xgb_model_name,
    instance_type=posttraining_bias_instance_type,
    instance_count=posttraining_bias_instance_count,
    accept_type="text/csv",
    content_type="text/csv",
)

predictions_config = clarify.ModelPredictedLabelConfig()

clarify_processor.run_post_training_bias(
    data_config=bias_posttrain_data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    methods="all",
)
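
Post-training metrics compare the model's predictions across facets rather than the labels. The sketch below computes two of them by hand, Difference in Positive Proportions in Predicted Labels (DPPL) and Disparate Impact (DI). The predictions are hypothetical and are not produced by the model above; the favorable cut-off of 11 and the choice of facet *a* and facet *d* are assumptions for illustration.

```python
import numpy as np

# Hypothetical model outputs and facet membership.
predicted_grade = np.array([12.4, 9.1, 13.8, 11.2, 10.5, 8.7, 12.9, 10.1])
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Assumption: a predicted grade of 11 or more counts as a favorable prediction.
predicted_favorable = predicted_grade >= 11

# Assumption for this sketch: facet a is gender == 0, facet d is gender == 1.
q_hat_a = predicted_favorable[gender == 0].mean()
q_hat_d = predicted_favorable[gender == 1].mean()

# DPPL = q_hat_a - q_hat_d: gap in favorable prediction rates.
dppl = q_hat_a - q_hat_d

# DI = q_hat_d / q_hat_a: ratio of favorable prediction rates.
disparate_impact = q_hat_d / q_hat_a

print(f"DPPL = {dppl:.3f}")
print(f"DI   = {disparate_impact:.3f}")
```

A DPPL of 0 and a DI of 1 indicate equal favorable prediction rates for the two facets.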

### Download the Post-Training Bias Report

In [None]:
from sagemaker.s3 import S3Downloader

local_posttrain_bias_report_path = "./posttraining_bias_reports"

S3Downloader.download(
    local_path=local_posttrain_bias_report_path,
    s3_uri=bias_posttrain_report_output_path,
    sagemaker_session=session,
)

### Inspect the Post-Training Bias Report

If the report has been successfully downloaded, you can view the report [here](./posttraining_bias_reports/report.ipynb).


### Explainability report

In [None]:
# Need to establish a baseline with our data
test_features = validation_data.drop(["FinalGrade"], axis=1)
explainability_report_output_path = "s3://{}/{}/clarify-explainability".format(bucket, prefix)

explainability_data_config = clarify.DataConfig(
    s3_data_input_path=train_data_s3_path,
    s3_output_path=explainability_report_output_path,
    label="FinalGrade",
    headers=train_data.columns.to_list(),
    dataset_type="text/csv",
)

shap_config = clarify.SHAPConfig(
    baseline=[test_features.iloc[0].values.tolist()], num_samples=50, agg_method="mean_abs"
)
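
Clarify estimates feature attributions with Kernel SHAP, which approximates Shapley values relative to the baseline by sampling (hence `num_samples`). To show what is being approximated, the sketch below computes exact Shapley values by enumerating every feature ordering. This is a different, brute-force technique that is only practical for a handful of features, and the `model` function is a hypothetical stand-in, not the trained XGBoost model.

```python
from itertools import permutations


def model(x):
    # Hypothetical stand-in for a trained model: linear terms plus an interaction.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start from the baseline row
        previous = predict(current)
        for i in order:
            current[i] = instance[i]  # switch feature i to the instance's value
            value = predict(current)
            contributions[i] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


instance = [3.0, 1.0, 4.0]
baseline = [1.0, 0.0, 2.0]
phi = shapley_values(model, instance, baseline)

print(phi)
print(sum(phi), model(instance) - model(baseline))
```

The attributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is why the choice of baseline row matters for the report.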

#### Run explainability job

In [None]:
clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
)

### Viewing the Explainability Report

We can use the [SageMaker S3 Downloader utility](https://sagemaker.readthedocs.io/en/stable/api/utility/s3.html#sagemaker.s3.S3Downloader) to download files to inspect them locally.

In [None]:
local_explainability_report_path = "./explainability_reports"

S3Downloader.download(
    local_path=local_explainability_report_path,
    s3_uri=explainability_report_output_path,
    sagemaker_session=session,
)

If the report has been successfully downloaded, you can view the report [here](./explainability_reports/report.ipynb).

## Deployment

We need to consider:

- Is the model deployed on a population for which it was not trained or evaluated?
- Are there unequal effects across users?

`tuner.deploy` deploys the best model found by the tuning job; this is the model that was deployed to the endpoint earlier in this notebook.

## Monitoring & Feedback

- Does the model encourage feedback loops that can produce increasingly unfair outcomes?

To address this, we can pair SageMaker Model Monitor with Clarify bias and explainability monitoring; that is outside the scope of this demo.

#### Thank you!

## Cleanup
If you do not plan to use the endpoint further, delete it to avoid incurring additional charges. Deleting the endpoint does not delete the datasets and Clarify reports stored in Amazon S3; that data persists until you delete it yourself.

In [None]:
# Clean up model endpoints
xgb_predictor.delete_model()
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

# clean up local files
!rm -rf student-mat.csv train_data.csv validation_data.csv

## Next steps

AI services and machine learning are helping organizations build data-driven applications that are innovative and highly attuned to their customers’ needs, but AI applications require crucial customer data to train machine learning models. Application logic is delegated to these models, which can introduce unfairness and bias into an application.

In this notebook we reviewed machine learning techniques and AWS services you can use to understand and reduce these risks.



## References
1. Pauline Kelly - [Building AI applications that avoid bias and maintain privacy and fairness](https://anz-resources.awscloud.com/aws-summit-online-anz-2021-data-scientist/building-ai-applications-that-avoid-bias-and-maintain-privacy-and-fairness-1)