# Explaining Credit Decisions

Given the increasing complexity of machine learning models, the need for model
explainability has been growing lately. Some governments have also introduced
stricter regulations that mandate a *right to explanation* from machine
learning models. In this solution, we take a look at how [Amazon
SageMaker](https://aws.amazon.com/sagemaker/) can be used to explain
individual predictions from machine learning models.

As an example application, we classify credit applications and predict whether
the credit will be paid back or not. Given a credit application from a bank
customer, the aim of the bank is to predict whether or not the customer will
pay back the credit in accordance with their repayment plan. When a customer
can't pay back their credit, often called a *credit default*, the bank loses
money and the customer's credit score will be impacted. On the other hand,
denying trustworthy customers credit also has a set of negative impacts. Using
accurate machine learning models to classify the risk of a credit application
can help find a good balance between these two scenarios, but this provides no
comfort to those customers who have been denied credit. Using explainability
methods, it's possible to determine actionable factors that had a negative
impact on the application. Customers can then take action to increase their
chance of obtaining credit in subsequent applications.

We train a tree-based [LightGBM](https://lightgbm.readthedocs.io/en/latest/)
model using [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and explain
its predictions using a game theoretic approach called
[SHAP](https://github.com/slundberg/shap) (SHapley Additive exPlanations). We
deploy an endpoint that returns the credit default risk score, alongside an
explanation.

## Imports
We start by importing a variety of packages that will be used throughout the
notebook. One of the most important packages is the Amazon SageMaker Python
SDK (i.e. `import sagemaker`). We also import modules from our own custom
package that can be found at `./src/package`.

In [None]:
from bokeh.plotting import output_notebook
import sagemaker
from sagemaker.sklearn import SKLearn
from sagemaker.local import LocalSession
from sagemaker.predictor import json_serializer, json_deserializer, CONTENT_TYPE_JSON
from pathlib import Path
import boto3
import shap
import numpy as np
import os

from src.package import glue
from src.package import config
from src.package import schemas
from src.package import datasets
from src.package import containers
from src.package import utils
from src.package import visuals

## Datasets
When creating the AWS CloudFormation stack, a collection of synthetic datasets
was generated and stored in our solution's Amazon S3 bucket with a prefix of
`dataset`. Most of the features contained in these datasets are based on the
[German Credit
Dataset](http://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data))
(UCI Machine Learning Repository), but there are some synthetic data fields
too. All personal information was generated using
[`Faker`](https://faker.readthedocs.io/en/master/). We have 3 datasets in
total: credits, people and contacts.

### Dataset #1: Credits

Our credits dataset contains features directly related to the credit application.

It is a CSV file (i.e. Comma Separated Values file) that has a header row with feature names. Of particular note is the feature called `default`. It is the target variable that we're trying to predict with our LightGBM model. We show the first two rows of the dataset below:

```
"credit_id","person_id","amount","duration","purpose","installment_rate","guarantor","coapplicant","default"
"51829372","f032303d",1169,6,"electronics",4,0,0,False
```

### Dataset #2: People

Our people dataset contains features related to the people making the credit applications (i.e. the applicants).

It's a [JSON Lines](http://jsonlines.org/) file, where each line is a separate JSON object. Of particular note is the feature called `person_id`. You'll notice that this feature was also included in the credits dataset. It is used to connect the credit application with the applicant. We show the first row of the dataset below (formatted across multiple lines for readability):

```
{
    "person_id": "f032303d",
    "finance": {
        "accounts": {
            "checking": {
                "balance": "negative"
            }
        },
        "repayment_history": "very_poor",
        "credits": {
            "this_bank": 2,
            "other_banks": 0,
            "other_stores": 0
        },
        "other_assets": "real_estate"
    },
    "personal": {
        "age": 67,
        "gender": "male",
        "relationship_status": "single",
        "name": "Peter Jones"
    },
    "dependents": [
        {
            "gender": "male",
            "name": "Michael Morales"
        }
    ],
    "employment": {
        "type": "professional",
        "title": "Learning disability nurse",
        "duration": 11,
        "permit": "foreign"
    },
    "residence": {
        "type": "own",
        "duration": 4
    }
}
```
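
Because each line of a JSON Lines file is an independent JSON object, the file can be parsed one line at a time. The sketch below is only an illustration and not the solution's own loader: `read_json_lines` and `flatten` are hypothetical helper names, the inline text stands in for the file stored in Amazon S3, and the second record is made up. It also shows one way nested fields could be flattened into double-underscore names such as `personal__age`, which is the naming style used for features later in this notebook.

```python
import json


def read_json_lines(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def flatten(record, parent="", sep="__"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # lists (e.g. dependents) are kept as-is
    return flat


# Two abbreviated records; the second one is invented for illustration.
raw = (
    '{"person_id": "f032303d", "personal": {"age": 67}, "residence": {"type": "own", "duration": 4}}\n'
    '{"person_id": "a1b2c3d4", "personal": {"age": 31}, "residence": {"type": "rent", "duration": 1}}\n'
)
people = [flatten(person) for person in read_json_lines(raw)]
print(people[0])
```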

### Dataset #3: Contacts

Our contacts dataset contains contact information for the applicants.

It is a CSV file that has a header row with feature names. Once again we have `person_id`. We show the first two rows of the dataset below:

```
"contact_id","person_id","type","value"
"5996e20a","f032303d","telephone","(716)406-9514x345"
```

## AWS Glue

One of the most time-consuming tasks in developing a machine learning workflow
is data preparation. AWS Glue can be used to simplify this process. As a
demonstration of how it can be used to infer data schemas and perform extract,
transform and load (ETL) jobs in Spark, we'll prepare our dataset using AWS
Glue. Although our sample datasets are small, there are many real-world
scenarios that will benefit from the scalability of AWS Glue.

When creating the AWS CloudFormation stack, a number of AWS Glue resources
were created:

* A
  [Database](https://docs.aws.amazon.com/glue/latest/dg/define-database.html)
  is used to organize the solution's tables.
* A [Crawler](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) is
  used to infer the formats and schemas of the datasets above.
* A [Custom
  Classifier](https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html)
  is used to help the crawler infer the schema of the contacts dataset.
    * All fields are of type 'string', so we need to indicate that the first
      row is a header row rather than data.
* A [Job](https://docs.aws.amazon.com/glue/latest/dg/author-job.html) is used
  to join the datasets together, drop certain features, create other features,
  and split the data into train and test sets.
* A
  [Workflow](https://docs.aws.amazon.com/glue/latest/dg/orchestrate-using-workflows.html)
  (and associated
  [triggers](https://docs.aws.amazon.com/glue/latest/dg/trigger-job.html)) is
  used to orchestrate the above crawler and job.
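
To make the job's steps more concrete, here is a plain-Python sketch of the same kind of logic: joining on `person_id`, deriving features, dropping identifying fields, and splitting into train and test sets. It is not the actual AWS Glue job, which runs in Spark and is not shown in this notebook. The helper names (`build_rows`, `split_rows`), the way `contact__has_telephone` and `personal__num_dependents` are derived, and the 80/20 split are all assumptions made for illustration.

```python
import random

# Abbreviated versions of the sample rows shown in the Datasets section.
credits = [
    {"credit_id": "51829372", "person_id": "f032303d", "amount": 1169, "default": False},
]
people = {
    "f032303d": {"age": 67, "name": "Peter Jones", "dependents": [{"name": "Michael Morales"}]},
}
contacts = [
    {"person_id": "f032303d", "type": "telephone", "value": "(716)406-9514x345"},
]


def build_rows(credits, people, contacts):
    """Join the three datasets on person_id and derive a few features."""
    phone_owners = {c["person_id"] for c in contacts if c["type"] == "telephone"}
    rows = []
    for credit in credits:
        person = people[credit["person_id"]]  # join on person_id
        rows.append({
            "credit__amount": credit["amount"],
            "credit__default": credit["default"],
            "personal__age": person["age"],
            # Derived features (assumed logic):
            "personal__num_dependents": len(person["dependents"]),
            "contact__has_telephone": credit["person_id"] in phone_owners,
            # Identifying fields such as names and IDs are left out.
        })
    return rows


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle deterministically and split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = build_rows(credits, people, contacts)
print(rows[0])
```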

You can explore the service console for AWS Glue for more details, but for now
we'll start the workflow. 

In [None]:
glue_run_id = glue.start_workflow(config.GLUE_WORKFLOW)

Our workflow takes around 10 minutes to complete. Most of this time is spent
on resource provisioning, but there is a [preview
feature](https://pages.awscloud.com/glue-reduced-spark-times-preview-2020.html)
for reduced start times. We'll wait until the AWS Glue workflow has completed
before continuing. We need the dataset before training our model in Amazon
SageMaker.

In [None]:
glue.wait_for_workflow_finished(config.GLUE_WORKFLOW, glue_run_id)
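
The implementation of `glue.wait_for_workflow_finished` lives in the solution's package and is not shown here. As a rough sketch of how such a wait could work, the hypothetical helper below polls a status callable until the run reaches a terminal state. The status strings follow the AWS Glue workflow run statuses; with `boto3`, the status could be read from `get_workflow_run(Name=..., RunId=...)["Run"]["Status"]`, but here a stub is used so that the example runs without AWS access.

```python
import time


def wait_until_finished(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll `get_status()` until the run completes, fails, or times out."""
    waited = 0
    while True:
        status = get_status()
        if status == "COMPLETED":
            return status
        if status in ("STOPPED", "ERROR"):
            raise RuntimeError(f"workflow run ended with status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError("workflow run did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds


# Stubbed status source and a no-op sleep, so the example finishes instantly.
statuses = iter(["RUNNING", "RUNNING", "COMPLETED"])
result = wait_until_finished(lambda: next(statuses), sleep=lambda seconds: None)
print(result)
```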

With our AWS Glue workflow complete, we should now have 4 additional datasets
in our solution's Amazon S3 bucket: `data_train`, `label_train`, `data_test`
and `label_test`. We show the first row of `data_train` below (although it may
wrap onto two lines):

```
false,338,0,6,0,4,electronics,8,foreign,professional,negative,high,0,0,2,car_or_other,very_poor,52,male,1,single,4,own
```

We now have 23 features that describe a credit application and its applicant.
We no longer have a header row of feature names, but fortunately all of this
schema information is stored in our AWS Glue catalog. Since we're interested
in explaining the model predictions, and our explanations attribute each
prediction to individual features, it's useful if our feature names are
understandable.

**Advanced**: We can also organize features in a hierarchy (using a separator
in the feature names), which enables summarization of the explanations. As an
example, `employment__type` and `employment__duration` are both
`employment`-related features. We use two consecutive underscores (`__`) as
our level separator.
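
As a sketch of how such a hierarchy could be used, the hypothetical `summarize` helper below sums per-feature attributions that share the same top-level prefix. Summing is reasonable here because SHAP values are additive. The attribution values are made up, and this is not necessarily how the solution's own `visuals.summarize_explanation` is implemented.

```python
from collections import defaultdict


def summarize(feature_names, shap_values, level=1, sep="__"):
    """Sum attributions over features sharing the first `level` name parts."""
    totals = defaultdict(float)
    for name, value in zip(feature_names, shap_values):
        group = sep.join(name.split(sep)[:level])
        totals[group] += value
    return dict(totals)


names = ["employment__type", "employment__duration", "credit__amount"]
values = [0.5, -0.25, 0.125]  # illustrative attributions, not model output
summary = summarize(names, values)
print(summary)
```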

## Schema
Schemas can be used to keep track of feature names, descriptions and types.
Our solution uses
[`jsonschema`](https://python-jsonschema.readthedocs.io/en/stable/) as the
primary schema format. We have the added bonus of being able to use schemas to
validate input to the trained model and deployed endpoints.
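
To illustrate the idea of validating input against a schema, here is a small self-contained sketch. The schema layout (an array with one positional entry per feature) is an assumption about what such a schema might look like, not the solution's actual schema, and the hand-written `check` function only covers `type` and `minimum`. In practice, `jsonschema.validate(instance, schema)` performs full validation.

```python
# Hypothetical, heavily shortened schema: one entry per positional feature.
schema = {
    "type": "array",
    "items": [
        {"title": "credit__amount", "type": "integer", "minimum": 0},
        {"title": "credit__purpose", "type": "string"},
        {"title": "contact__has_telephone", "type": "boolean"},
    ],
}

PYTHON_TYPES = {"integer": int, "string": str, "boolean": bool}


def check(instance, schema):
    """Return a list of human-readable problems (empty when valid)."""
    items = schema["items"]
    if len(instance) != len(items):
        return [f"expected {len(items)} items, got {len(instance)}"]
    problems = []
    for value, item in zip(instance, items):
        expected = PYTHON_TYPES[item["type"]]
        # bool is a subclass of int in Python, so reject it for integers.
        wrong_type = not isinstance(value, expected) or (
            item["type"] == "integer" and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{item['title']}: expected {item['type']}")
        elif "minimum" in item and value < item["minimum"]:
            problems.append(f"{item['title']}: below minimum {item['minimum']}")
    return problems


print(check([433, "electronics", False], schema))
print(check([-1, "electronics", "no"], schema))
```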

We already have most of this schema information in our AWS Glue catalog, so
let's start by retrieving the table schema for `data_train`.

In [None]:
data_schema = glue.get_table_schema(
    database_name=config.GLUE_DATABASE, table_name="data_train"
)

We can now add additional information, such as feature descriptions, which
will be shown in the tooltips of the visuals later on.

In [None]:
# flake8: noqa: E501
data_schema.title = "Credit Application"
data_schema.description = "An array of items used to describe a credit application."
item_descriptions_dict = {
    "contact__has_telephone": "Customer has a registered telephone number.",
    "credit__amount": "Amount of money requested as part of credit application (in EUR).",
    "credit__coapplicant": "Co-applicant on credit application.",
    "credit__duration": "Amount of time the credit is requested for (in months).",
    "credit__guarantor": "Guarantor on credit application.",
    "credit__installment_rate": "Credit installment rate (as a percentage of the customer's disposable income).",
    "credit__purpose": "Customer's reason for requiring credit.",
    "employment__duration": "Amount of time the customer has been employed at their current employer (in years).",
    "employment__permit": "Customer's current work permit type.",
    "employment__type": "Customer's current job classification.",
    "finance__accounts__checking__balance": "Customer's checking account balance.",
    "finance__accounts__savings__balance": "Customer's savings account balance.",
    "finance__credits__other_banks": "Count of credits the customer has at other banks.",
    "finance__credits__other_stores": "Count of credits the customer has at other stores.",
    "finance__credits__this_bank": "Count of credits the customer has at this bank.",
    "finance__other_assets": "Customer's most significant asset.",
    "finance__repayment_history": "Quality of the customer's repayment history.",
    "personal__age": "Customer's age in years.",
    "personal__gender": "Customer's gender.",
    "personal__num_dependents": "Count of the customer's dependents.",
    "personal__relationship_status": "Customer's relationship status.",
    "residence__duration": "Amount of time the customer has been at their current residence (in years).",
    "residence__type": "Class of the customer's residence."
}
data_schema.item_descriptions_dict = item_descriptions_dict

We do the same for `label_train` too.

In [None]:
label_schema = glue.get_table_schema(
    database_name=config.GLUE_DATABASE, table_name="label_train"
)
label_schema.title = "Credit Application Outcome"
item_descriptions_dict = {
    "credit__default": (
        "0 if the customer successfully made credit payments, "
        "1 if the customer defaulted on credit payments.")
}
label_schema.item_descriptions_dict = item_descriptions_dict

Since the schemas for train and test datasets are the same, we can skip
`data_test` and `label_test`.

We can save our updated schemas to disk, in preparation for uploading to
Amazon S3.

In [None]:
current_folder = utils.get_current_folder(globals())
schema_folder = Path(current_folder, "schemas")
data_schema_filepath = Path(schema_folder, "data.schema.json")
data_schema.save(data_schema_filepath)
label_schema_filepath = Path(schema_folder, "label.schema.json")
label_schema.save(label_schema_filepath)

Up next, we create a SageMaker Session. A SageMaker Session can be used to
conveniently perform certain AWS actions, such as uploading and downloading
files from Amazon S3. We use the SageMaker Session to upload our schemas to
Amazon S3.

In [None]:
boto_session = boto3.session.Session(region_name=config.AWS_REGION)
sagemaker_session = sagemaker.Session(boto_session)

sagemaker_session.upload_data(
    path=str(schema_folder),
    bucket=config.S3_BUCKET,
    key_prefix=config.SCHEMAS_S3_PREFIX
)

## Container
We now build our custom Docker image that will be used for model training and
deployment. It extends the official Amazon SageMaker framework image for
Scikit-learn, by adding additional packages such as
[LightGBM](https://lightgbm.readthedocs.io/en/latest/) and
[SHAP](https://github.com/slundberg/shap). After building the image, we upload
it to our solution's Amazon ECR repository.

In [None]:
scikit_learn_image = containers.scikit_learn_image()
custom_image = containers.custom_image()

dockerfile = Path(current_folder, 'containers/Dockerfile')
custom_image.build(
    dockerfile=dockerfile,
    buildargs={'SCIKIT_LEARN_IMAGE': str(scikit_learn_image)}
)
custom_image.push()

## Model Training
Amazon SageMaker provides two modes for training and deploying models. You
can start by quickly testing and debugging models on the Amazon SageMaker
Notebook instance using Local Mode (set `local = True`). After this, you can
scale up training with SageMaker Mode on dedicated instances and deploy the
model on a dedicated instance too (set `local = False`). Since this is a
pre-developed solution, we'll start with SageMaker Mode.

In [None]:
local = False
if local:
    train_instance_type = 'local'
    deploy_instance_type = 'local'
    session = LocalSession(boto_session)
else:
    train_instance_type = 'ml.c5.xlarge'
    deploy_instance_type = 'ml.c5.xlarge'
    session = sagemaker_session

Up next, we configure our SKLearn estimator. We will use it to coordinate
model training and deployment. We reference our custom container (see
`image_name`) and our custom code (see `entry_point` and `source_dir`). At
this stage, we also reference the instance type (and instance count) that will
be used during training, and the hyperparameters we wish to use. Lastly, we
set the `output_path` for trained model artifacts and `code_location` for a
snapshot of the training script that was used.

In [None]:
hyperparameters = {
    "tree-n-estimators": 42,
    "tree-max-depth": 2,
    "tree-min-child-samples": 1,
    "tree-boosting-type": "dart"
}

estimator = sagemaker.sklearn.SKLearn(
    image_name=str(custom_image),
    source_dir=str(Path(current_folder, 'src').resolve()),
    entry_point='entry_point_explanations.py',
    hyperparameters=hyperparameters,
    role=config.SAGEMAKER_IAM_ROLE,
    train_instance_count=1,
    train_instance_type=train_instance_type,
    sagemaker_session=session,
    output_path='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX)),
    code_location='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX))
)

With our estimator now initialized, we can start the Amazon SageMaker training
job. Our entry point script expects a number of data channels to be defined,
so we provide them when calling `fit`. When referencing `s3://` folders, the
contents of these folders will be automatically downloaded from Amazon S3
before the entry point script is run. When using local mode, it's possible to
avoid this data transfer and reference a local folder using the `file://`
prefix instead, e.g. `{'schemas': 'file://' + str(schema_folder)}`.

You can expect this step to take approximately 5 minutes.

In [None]:
estimator.fit({
    'schemas': 's3://' + str(Path(config.S3_BUCKET, config.SCHEMAS_S3_PREFIX)),
    'data_train': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'data_train')),
    'label_train': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'label_train')),
    'data_test': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'data_test')),
    'label_test': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'label_test'))
})
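
The channel URIs above are built with `'s3://' + str(Path(...))`, which works because the notebook runs on a system where `Path` joins parts with forward slashes. As a sketch of a platform-independent alternative, the hypothetical helpers below use `posixpath.join` instead. The bucket and prefix values are placeholders rather than the solution's real configuration.

```python
import posixpath


def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI using forward slashes."""
    return "s3://" + posixpath.join(bucket, *parts)


def build_channels(bucket, datasets_prefix, schemas_prefix):
    """Build the five data channels expected by the entry point script."""
    names = ["data_train", "label_train", "data_test", "label_test"]
    channels = {name: s3_uri(bucket, datasets_prefix, name) for name in names}
    channels["schemas"] = s3_uri(bucket, schemas_prefix)
    return channels


# Placeholder values; the notebook reads the real ones from `config`.
inputs = build_channels("example-bucket", "datasets", "schemas")
print(inputs["data_train"])
```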

## Model Deployment
Our Amazon SageMaker training job has now completed, and we should have a
number of trained model artifacts that can be deployed. Calling `deploy` will
start a container to host the model (and an instance to run the container if
you're not running in local mode). Using `estimator.deploy` means that we'll
use the same entry point script as used for training, but the model deployment
functions (i.e. `model_fn`, `input_fn`, `predict_fn`, etc) will be used
instead of the model training function (i.e. `train_fn`).
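
The sketch below shows how the four hosting functions fit together for a single request. It is a simplified stand-in rather than the solution's `entry_point_explanations.py`: `ConstantModel` and its `predict_with_explanation` method are hypothetical, and a real `model_fn` would load the trained model artifacts from `model_dir`.

```python
import json


class ConstantModel:
    """Hypothetical stand-in for the trained model and its explainer."""

    def predict_with_explanation(self, features):
        shap_values = [0.0 for _ in features]  # one attribution per feature
        return {
            "prediction": 0.5,
            "explanation": {"expected_value": 0.5, "shap_values": shap_values},
        }


def model_fn(model_dir):
    # A real implementation would load artifacts saved under model_dir.
    return ConstantModel()


def input_fn(request_body, request_content_type):
    if request_content_type != "application/json":
        raise ValueError(f"unsupported content type: {request_content_type}")
    return json.loads(request_body)


def predict_fn(input_data, model):
    return model.predict_with_explanation(input_data)


def output_fn(prediction, accept):
    if accept != "application/json":
        raise ValueError(f"unsupported accept type: {accept}")
    return json.dumps(prediction)


# Simulate one request passing through the four functions in order.
model = model_fn("/opt/ml/model")
body = json.dumps({"credit__amount": 433, "credit__duration": 18})
response = output_fn(predict_fn(input_fn(body, "application/json"), model), "application/json")
print(response)
```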

**Caution**: When using local mode, you may see `docker-compose` errors if
trying to deploy the estimator multiple times. You need to manually stop the
original hosting container before deploying a second time. Uncomment and
execute the following command to stop the original hosting container.

In [None]:
# !docker container stop $(docker ps -a -q --filter ancestor={config.ECR_REPOSITORY})

You can expect this step to take approximately 5 minutes.

**Note**: AWS CloudFormation will delete this endpoint (and endpoint
configuration) during stack deletion if the `endpoint_name` is kept as is. You
will need to manually delete the endpoint (and endpoint configuration) after
stack deletion if you change this.

In [None]:
explainer = estimator.deploy(
    endpoint_name="{}-endpoint".format(config.STACK_NAME),
    instance_type=deploy_instance_type,
    initial_instance_count=1
)

When calling the `explainer` endpoint from the notebook, we first need to
convert the features, stored as a Python list or dictionary, into a JSON
string. With the Amazon SageMaker Python SDK, we can simply set the
`serializer` to the built-in `json_serializer`. Additionally, we notify the
endpoint that the content being sent is in fact JSON by setting the
`content_type`. Similarly, we can request a response from the endpoint in JSON
format by setting `accept`. Lastly, we can convert the JSON response back into
Python objects by setting `deserializer` to `json_deserializer`.

You should be aware that these changes only affect endpoint calls from this
notebook.

In [None]:
explainer.serializer = json_serializer
explainer.content_type = CONTENT_TYPE_JSON  # 'application/json'
explainer.accept = CONTENT_TYPE_JSON        # 'application/json'
explainer.deserializer = json_deserializer
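
Conceptually, the serializer and deserializer perform a JSON round trip similar to the one sketched below with the standard library. The response body here is invented for illustration. The real endpoint response contains more fields.

```python
import json

# Request side: Python objects become a JSON string (False becomes false).
payload = {"credit__amount": 433, "contact__has_telephone": False}
request_body = json.dumps(payload)
print(request_body)

# Response side: a JSON string becomes Python objects again.
response_body = '{"prediction": 0.25, "explanation": {"shap_values": [0.1, -0.05]}}'
decoded = json.loads(response_body)
print(decoded["prediction"])
```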

## Model Explanations

In [None]:
# data_train = datasets.read_csv_dataset(Path(datasets_folder, 'data_train'), data_schema)
# sample = data_train[0, :].tolist()
sample = {
    'contact__has_telephone': False,
    'credit__amount': 433,
    'credit__coapplicant': 1,
    'credit__duration': 18,
    'credit__guarantor': 0,
    'credit__installment_rate': 3,
    'credit__purpose': 'electronics',
    'employment__duration': 0,
    'employment__permit': 'foreign',
    'employment__type': 'professional',
    'finance__accounts__checking__balance': 'no_account',
    'finance__accounts__savings__balance': 'low',
    'finance__credits__other_banks': 0,
    'finance__credits__other_stores': 0,
    'finance__credits__this_bank': 1,
    'finance__other_assets': 'real_estate',
    'finance__repayment_history': 'good',
    'personal__age': 22,
    'personal__gender': 'male',
    'personal__num_dependents': 1,
    'personal__relationship_status': 'married',
    'residence__duration': 4,
    'residence__type': 'rent'
}

When calling `predict`, the `sample` will be serialized and sent to the
`explainer` endpoint. Our endpoint, after running the model deployment
functions in the entry point, will return a predicted probability of credit
default and an explanation for this prediction.

**Caution**: the probability returned by this model has not been calibrated.
When the model gives a probability of credit default of 20%, for example, this
does not necessarily mean that 20% of applications with a probability of 20%
resulted in credit default. Calibration is a useful property in certain
circumstances, but is not required in cases where discrimination between cases
of default and non-default is sufficient.
[CalibratedClassifierCV](https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html)
from [Scikit-learn](https://scikit-learn.org/stable/modules/calibration.html)
can be used to calibrate a model. Calibration also has an impact on the
explanations. Since the calibration process is typically non-linear, it breaks
the additive property of Shapley Values.
[`KernelExplainer`](https://shap.readthedocs.io/en/latest/) can handle this
case, but is typically much slower to compute the explanations.
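
The additive property means that the baseline plus the sum of the SHAP values equals the model output in the space being explained. The small numeric sketch below shows why a non-linear calibration step breaks this. It uses a logistic function as a stand-in for the calibration map and made-up attribution values, so it is an illustration of the principle rather than a description of this solution's model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


expected_value = -2.0            # illustrative baseline (uncalibrated score)
shap_values = [1.5, -0.5, 2.0]   # illustrative attributions

# Additivity holds before calibration: baseline + contributions = output.
raw_score = expected_value + sum(shap_values)

# Apply a non-linear map to the output (stand-in for calibration).
calibrated = sigmoid(raw_score)

# Mapping the baseline and each contribution separately does not add up
# to the calibrated output, so the attributions are no longer additive.
naive = sigmoid(expected_value) + sum(sigmoid(v) - sigmoid(0.0) for v in shap_values)

print(raw_score, round(calibrated, 4), round(naive, 4))
```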

In [None]:
output = explainer.predict(sample)
prediction = output['prediction']
print("prediction: {:.2%}".format(prediction))
explanation = output['explanation']

## Visualizing Explanations

In [None]:
output_notebook()

### Summary Explanation

In [None]:
explanation_summary = visuals.summarize_explanation(explanation)
summary_waterfall = visuals.WaterfallChart(
    baseline=explanation_summary['expected_value'],
    shap_values=explanation_summary['shap_values'],
    names=explanation_summary['feature_names'],
    descriptions=explanation_summary['feature_descriptions'],
    max_features=10,
    x_axis_label='Credit Default Risk Score (%)',
)
summary_waterfall.show()

### Detailed Explanation

In [None]:
detailed_waterfall = visuals.WaterfallChart(
    baseline=explanation['expected_value'],
    shap_values=explanation['shap_values'],
    names=explanation['feature_names'],
    feature_values=explanation['feature_values'],
    descriptions=explanation['feature_descriptions'],
    max_features=10,
    x_axis_label='Credit Default Risk Score (%)'
)
detailed_waterfall.show()
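
A waterfall chart starts at the baseline and stacks each feature's SHAP value on top of the running total, ending at the prediction. The sketch below computes the bar positions in plain Python. The internals of `visuals.WaterfallChart` are not shown in this notebook, so the ranking by absolute value and the grouping of the remaining features into a single bar when `max_features` is exceeded are assumptions. The input values are made up.

```python
def waterfall_steps(baseline, names, shap_values, max_features=10):
    """Return the start and end position of each bar in a waterfall chart."""
    ranked = sorted(zip(names, shap_values), key=lambda pair: abs(pair[1]), reverse=True)
    shown, rest = ranked[:max_features], ranked[max_features:]
    if rest:
        # Collapse the least important features into one bar (assumed behavior).
        shown.append(("other features", sum(value for _, value in rest)))
    steps, position = [], baseline
    for name, value in shown:
        steps.append({"name": name, "start": position, "end": position + value})
        position += value
    return steps


steps = waterfall_steps(
    baseline=0.5,
    names=["credit__amount", "finance__repayment_history", "personal__age"],
    shap_values=[0.125, -0.5, 0.0625],  # illustrative values
    max_features=2,
)
for step in steps:
    print(step)
```

The `end` of the last bar equals the baseline plus the sum of all SHAP values, which is the additive property that makes this chart type a natural fit for SHAP explanations.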

### Counterfactual Example

In [None]:
sample['finance__accounts__checking__balance'] = 'negative'  # from 'no_account'
explanation = explainer.predict(sample)['explanation']
detailed_waterfall = visuals.WaterfallChart(
    baseline=explanation['expected_value'],
    shap_values=explanation['shap_values'],
    names=explanation['feature_names'],
    feature_values=explanation['feature_values'],
    descriptions=explanation['feature_descriptions'],
    max_features=10,
    x_axis_label='Credit Default Risk Score (%)',
)
detailed_waterfall.show()
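
To compare a counterfactual against the original application numerically, the per-feature attributions of the two explanations can be subtracted. The hypothetical `attribution_changes` helper below assumes both explanations carry the `feature_names` and `shap_values` keys used above. The values are made up.

```python
def attribution_changes(before, after):
    """Per-feature change in SHAP value, largest absolute change first."""
    old = dict(zip(before["feature_names"], before["shap_values"]))
    new = dict(zip(after["feature_names"], after["shap_values"]))
    changes = {name: new[name] - old[name] for name in old}
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)


# Illustrative explanations for the original and the modified sample.
original = {
    "feature_names": ["credit__amount", "finance__accounts__checking__balance"],
    "shap_values": [0.125, -0.25],
}
counterfactual = {
    "feature_names": ["credit__amount", "finance__accounts__checking__balance"],
    "shap_values": [0.125, 0.5],
}
changes = attribution_changes(original, counterfactual)
print(changes[0])
```

Since the cell above overwrites `explanation`, a copy of the original explanation would need to be kept before requesting the counterfactual in order to use a comparison like this.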

## Clean Up
When you've finished with this solution, make sure that you delete all
unwanted AWS resources. AWS CloudFormation can be used to automatically delete
all standard resources that have been created by the solution and notebook.

**Caution**: You need to manually delete any extra resources that you may have
created in this notebook. Some examples include extra Amazon S3 buckets (in
addition to the solution's default bucket), extra Amazon SageMaker endpoints
(using a custom name), and extra Amazon ECR repositories.

You can explicitly delete the Amazon SageMaker endpoint (and endpoint configuration)
using the Amazon SageMaker Python SDK, but it is also deleted along with the
AWS CloudFormation stack if you forget.

In [None]:
# explainer.delete_endpoint()

You can now return to AWS CloudFormation and delete the stack.