# Debugging SageMaker XGBoost Training Jobs with Tornasole

This notebook uses the MNIST dataset to demonstrate a classification task using Tornasole with XGBoost.
For a regression problem, see [xgboost_regression.ipynb](xgboost_regression.ipynb).

## Overview

Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs.
It lets you monitor training in near real time using rules, and it alerts you
once it detects an inconsistency in training.

Using Tornasole is a two-step process: saving tensors and analysis.
Let's look at each one of them closely.

### Saving tensors (and scalars)

In deep learning algorithms, tensors define the state of the training job
at any particular instant in its lifecycle.
Tornasole exposes a library which allows you to capture these tensors and
save them for analysis.
Although XGBoost is not a deep learning algorithm, Tornasole is highly customizable
and can help provide interpretability by saving insightful metrics, such as
performance metrics or feature importances, at different frequencies.
Refer to [DeveloperGuide_XGBoost](../DeveloperGuide_XG.md) for details on how to
save the metrics you want.

### Analysis

Analysis of the tensors emitted is captured by the Tornasole concept called ***Rules***.
At a high level, a rule is Python code that detects certain conditions during training.
Some of the conditions that a data scientist training an algorithm may care about are
monitoring for gradients getting too large or too small, detecting overfitting, and so on.
Tornasole will come pre-packaged with certain rules.
Users can write their own rules using the Tornasole APIs.
You can also analyze raw tensor data outside of the Rules construct, for example in a SageMaker notebook,
using Tornasole's full set of APIs.
Please refer to [DeveloperGuide_Rules](../../../rules/DeveloperGuide_Rules.md) for more details about analysis.

This example guides you through installation of the required components for emitting tensors in a 
SageMaker training job and applying a rule over the tensors to monitor the live status of the job. 

## Setup

We first install the tools required to emit (save) tensors and to apply rules that analyze them. This step is needed only for this private beta. Once it completes, we are ready to use Tornasole.

You'll probably have to restart this notebook after running the following code cell.

In [None]:
! aws s3 sync s3://tornasole-external-preview-use1/sdk/ ~/SageMaker/tornasole-preview-sdk/
! pip3 -q install ~/SageMaker/tornasole-preview-sdk/ts-binaries/tornasole_xgboost/py3/latest/tornasole-* --user
! chmod +x ~/SageMaker/tornasole-preview-sdk/installer.sh && ~/SageMaker/tornasole-preview-sdk/installer.sh

### If you are running this notebook for the first time, please wait for the above setup to complete and restart the notebook by selecting *Kernel -> Restart Kernel* before proceeding.

We have built SageMaker XGBoost containers with Tornasole, which SageMaker can pull from ECR. The cell below constructs the image URI. Please use the image from the region in which you want your jobs to run.

In [None]:
import os
import boto3
from sagemaker import get_execution_role

# Use the region in which this notebook is running
REGION = boto3.Session().region_name
ROLE = get_execution_role()
os.environ["AWS_REGION"] = REGION

TAG = "latest"
docker_image_name = "072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-xgboost-0.90-cpu:{}".format(REGION, TAG)

## Training XGBoost models in SageMaker with Tornasole

### SageMaker XGBoost as a framework

We'll train a few XGBoost models in this notebook with Tornasole enabled and monitor the training jobs with Tornasole Rules. This will be done using the SageMaker XGBoost 0.90 container as a framework. The [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) can be used as a built-in algorithm or as a framework such as TensorFlow. Using XGBoost as a framework provides more flexibility than using it as a built-in algorithm, because it enables more advanced scenarios in which pre-processing and post-processing steps are incorporated into your training script.

Let us first train a simple example training script [xgboost_mnist_basic_hook_demo.py](../scripts/xgboost_mnist_basic_hook_demo.py) with Tornasole enabled in SageMaker using the SageMaker Estimator API, along with a LossNotDecreasing rule to monitor the training job in real time. A Tornasole rule is essentially Python code that analyzes tensors saved by Tornasole and validates some condition. LossNotDecreasing is a first party (1P) rule provided by Tornasole. For other 1P rules that can be used with XGBoost, refer to [FirstPartyRules.md](../../../rules/FirstPartyRules.md).

During training, Tornasole will capture tensors as specified in its configuration, and the LossNotDecreasing rule job will monitor whether you are running into a situation where the loss is not going down. The rule will emit a CloudWatch event if it finds that the performance metrics are not decreasing during training.
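
As a rough illustration of the kind of check such a rule performs, here is a minimal sketch in plain Python. It is not Tornasole's implementation: the function name `first_stalled_step` and the way the comparison window is applied are assumptions made for this example, and it works on an in-memory list rather than on saved tensors.

```python
from typing import Optional, Sequence


def first_stalled_step(values: Sequence[float], num_steps: int) -> Optional[int]:
    """Return the first index whose value is not lower than the value
    recorded `num_steps` earlier, or None if the metric keeps decreasing.

    Hypothetical helper for illustration only.
    """
    for i in range(num_steps, len(values)):
        if values[i] >= values[i - num_steps]:
            return i
    return None


# Made-up error curves, one value per boosting round.
decreasing = [0.30, 0.22, 0.17, 0.13, 0.11, 0.10]
plateau = [0.30, 0.22, 0.17, 0.17, 0.17, 0.18]

print(first_stalled_step(decreasing, num_steps=2))  # None
print(first_stalled_step(plateau, num_steps=2))     # 4
```

In the second curve the value at index 4 is no lower than the value two steps earlier, so that is where this simplified check would flag the run.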

### Enabling Tornasole in the script

You can see in the script that we have made a couple of simple changes to enable Tornasole. We created a TornasoleHook, which we pass as a callback function when creating a Booster. We passed a SaveConfig object telling the hook to save the evaluation metrics, feature importances, and SHAP values at regular intervals. Note that Tornasole is highly configurable: you can choose exactly what to save. The changes are described in a bit more detail below, after we train this example, and in even more detail in our [Developer Guide for XGBoost](../DeveloperGuide_XG.md).

```python
import xgboost
from tornasole.xgboost import TornasoleHook, SaveConfig

# `frequency` is the save interval in steps, defined elsewhere in the script
save_config = SaveConfig(save_interval=frequency)
hook = TornasoleHook(save_config=save_config)

bst = xgboost.train(
    ...
    callbacks=[hook]
)
```
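
The save interval has to reach the script somehow. When XGBoost is used as a framework, SageMaker passes hyperparameters to the entry point as command-line arguments, so one plausible way to read a value such as `tornasole_frequency` (set later in this notebook) is with `argparse`. The sketch below is an assumption about how such a script could be organized, not a copy of the demo script; `parse_hyperparameters` and the default values are hypothetical.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse a few hyperparameters from a command-line style list.

    Hypothetical helper; defaults are placeholders chosen for this example.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.3)
    parser.add_argument("--num_round", type=int, default=10)
    parser.add_argument("--tornasole_frequency", type=int, default=1)
    # Ignore arguments this sketch does not declare (e.g. --objective).
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters([
    "--max_depth", "5",
    "--eta", "0.5",
    "--num_round", "10",
    "--tornasole_frequency", "1",
    "--objective", "multi:softmax",
])
print(args.tornasole_frequency)  # 1
```

The parsed `args.tornasole_frequency` could then be used as the `save_interval` value in a snippet like the one above.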

### XGBoost for Classification

We use the [MNIST data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) stored in [LIBSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) format.

Refer to [XGBoost for Classification](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/xgboost_mnist)
for an example of using classification from Amazon SageMaker's implementation of
[XGBoost](https://github.com/dmlc/xgboost).

In [None]:
entry_point_script = "../scripts/xgboost_mnist_basic_hook_demo.py"

hyperparameters={
    "max_depth": "5",
    "eta": "0.5",
    "gamma": "4",
    "min_child_weight": "6",
    "silent": "0",
    "objective": "multi:softmax",
    "num_class": "10",  # num_class is required for 'multi:*' objectives
    "num_round": "10",
    "tornasole_frequency": "1"
}

In [None]:
from sagemaker.xgboost import XGBoost

estimator = XGBoost(
    image_name=docker_image_name,
    base_job_name="demo-tornasole-xgboost-classification",
    entry_point=entry_point_script,
    hyperparameters=hyperparameters,
    train_instance_type="ml.m4.4xlarge",
    train_instance_count=1,
    framework_version="0.90-1",
    py_version="py3",
    role=ROLE,
    
    # These are Tornasole-specific parameters.
    # debug=True means the rules listed in rules_specification
    # will run as rule jobs.
    # Below, we run the first party rule LossNotDecreasing
    # on an ml.c5.4xlarge instance.
    debug=True,
    rules_specification=[
        {
        "RuleName": "LossNotDecreasing",
        "InstanceType": "ml.c5.4xlarge",
        "RuntimeConfigurations": {
            "use_losses_collection": "False",
            "tensor_regex": "train-merror,validation-merror",
            "num_steps" : "10"
            }
        }
    ]
)


*Note that Tornasole is only supported for `py_version='py3'` currently.*

In [None]:
# This is a fire and forget event.
# By setting wait=False, we just submit the job to run in the background.
# In the background SageMaker will spin off 1 training job and 1 rule job for you.
# Please follow this notebook to see status of the training job and the rule job.
estimator.fit(wait=False)

### Result
As a result of the above command, SageMaker will spin off 1 training job and 1 rule job for you - the first one being the job which produces the tensors to be analyzed and the second one, which analyzes the tensors to check if `train-merror` and `validation-merror` are not decreasing at any point during training.

### Describing the training job
We can check the status of the training job by running the following command:

In [None]:
# Below command will give the status of training job
# Note: In the output of below command you will see DebugConfig parameter 
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)

In [None]:
# The status of the training job can be seen below
description["TrainingJobStatus"]

Once your training job has started, SageMaker will spin up a rule execution job to run the LossNotDecreasing rule.
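
Because `fit(wait=False)` returns immediately, you may want to poll the status yourself. The following is a minimal sketch of such a loop. The function `wait_for_terminal_status` is hypothetical, and the `describe` argument is a stand-in so the example runs without AWS access; with a real client you would pass `client.describe_training_job`.

```python
import time

# TrainingJobStatus values after which the job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal status and return it."""
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Stand-in for client.describe_training_job, returning canned responses.
_canned = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_canned)}


final_status = wait_for_terminal_status(
    fake_describe, "demo-job", poll_seconds=0, sleep=lambda seconds: None
)
print(final_status)  # Completed
```

Injecting `describe` and `sleep` keeps the loop testable offline; the polling interval is a placeholder you would tune to your job length.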

### Tornasole specific parameters in the description
The **DebugConfig** parameter has details about the Tornasole-related configuration. The key parameters to look for below are:

*S3OutputPath*: This is the path where the output tensors from Tornasole are saved.  
*RuleConfig*: This is the rule configuration that was passed when creating the training job. Here you should be able to see details of the rule that ran for training.

In [None]:
description["DebugConfig"]

### Check the status of the Rule Execution Job
To get the rule execution job that SageMaker started for you, run the command below. It shows you the `RuleName`, `RuleStatus`, `FailureReason` if any, and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. These details are also available as part of the response `description` above under `description['RuleMonitoringStatuses']`.


The logs of the rule execution job are available in the CloudWatch log group `/aws/sagemaker/TrainingJobs`, in the log stream for the job identified by `RuleExecutionJobArn`.

Once the rule execution job starts, it monitors the saved metrics. If it identifies a loss-not-decreasing situation in the training job, it raises the `RuleEvaluationConditionMet` exception and ends the job.

**Note that the next cell blocks until the rule execution job ends. You can stop it at any point to proceed to the rest of the notebook. Once `RuleStatus` is `Started` and the `RuleExecutionJobArn` is shown, you can look at the status of the rule being monitored. At that point, we can also look at the logs as shown in the next cell.**

In [None]:
estimator.describe_rule_execution_jobs()

### Check logs of the rule execution jobs

If you want to access the logs of a particular rule job name, you can do the following. First, you need to get the rule job name (`RuleExecutionJobArn` field from the training job description). Note that this is only available after the rule job reaches Started stage. Hence the next cell waits till the job name is available.
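
The ARN-to-name step can be sketched in isolation. The helpers `job_name_from_arn` and `started_rule_job_names` below are hypothetical, and the sample entries (including the account ID and job name) are made up; only the field names come from the training job description.

```python
def job_name_from_arn(arn):
    """Return the resource name that follows the final '/' in a job ARN."""
    _prefix, separator, name = arn.rpartition("/")
    if not separator or not name:
        raise ValueError("ARN has no job name component: {!r}".format(arn))
    return name


def started_rule_job_names(rule_statuses):
    """Map rule names to job names for rules that already expose an ARN."""
    return {
        status["RuleName"]: job_name_from_arn(status["RuleExecutionJobArn"])
        for status in rule_statuses
        if "RuleExecutionJobArn" in status
    }


# Made-up example of a RuleMonitoringStatuses list.
sample_statuses = [
    {
        "RuleName": "LossNotDecreasing",
        "RuleStatus": "Started",
        "RuleExecutionJobArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:training-job/example-rule-job"
        ),
    },
    {"RuleName": "Confusion"},  # no ARN yet, so it is skipped
]

print(started_rule_job_names(sample_statuses))
# {'LossNotDecreasing': 'example-rule-job'}
```

Taking the text after the last `/` avoids depending on how many slashes the ARN contains.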

In [None]:
import time

rule_descr = client.describe_training_job(TrainingJobName=job_name)["RuleMonitoringStatuses"]
print("Waiting for rule execution job to start")
while "RuleExecutionJobArn" not in rule_descr[0]:
    time.sleep(5)
    rule_descr = client.describe_training_job(TrainingJobName=job_name)["RuleMonitoringStatuses"]

rule_job_arn = rule_descr[0]["RuleExecutionJobArn"]
print("Rule execution job has started. The job ARN is {}".format(rule_job_arn))
rule_job_name = rule_job_arn.split('/')[1]

Now we can attach to this job to see its logs

In [None]:
from sagemaker.estimator import Estimator
loss_not_decreasing = Estimator.attach(rule_job_name)

In the above example, the `LossNotDecreasing` rule completed without producing an alert because both `train-merror` and `validation-merror` decreased steadily throughout the training run. To see an example of the rule when performance metrics stop decreasing during training, see [xgboost_regression.ipynb](xgboost_regression.ipynb).

## Data Analysis - Manual

Now that we have trained the system we can analyze the data. Here we focus on after-the-fact analysis.

We import a basic analysis library, which defines a concept of `Trial` that represents a single training run.

In [None]:
import os
from urllib.parse import urlparse
from tornasole.trials import create_trial

s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"]
trial = create_trial(s3_output_path)

We can list all the tensors we know something about. Each one of these names is the name of a tensor - the name is a combination of the feature name (which, in these cases, is auto-assigned by XGBoost) and whether it's an evaluation metric, feature importance, or SHAP value. We also have `y/validation` for true labels from the validation set and `y_hat/validation` for predicted labels on the same validation set.

In [None]:
trial.tensors()[:10]

For each tensor we can ask for which steps we have data. In this case that should be every step, because `tornasole_frequency` is set to 1.

In [None]:
print(list(trial.tensor("validation-merror").steps()))

We can obtain each tensor at each step as a `numpy` array

In [None]:
type(trial.tensor("train-merror").step(5).value)

### Performance metrics

We can also create a simple function that visualizes the training and validation errors
as the training progresses.
We expect each error to get smaller over time, as the system converges to a good solution.
Now, remember that this is an interactive analysis - we are showing these tensors to give an idea of the data. 
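
For reference, `merror` is XGBoost's multiclass classification error rate: the number of wrong predictions divided by the total number of cases. A minimal sketch of that calculation, using made-up labels rather than the saved tensors, looks like this:

```python
def merror(labels, predictions):
    """Multiclass error rate: (# wrong cases) / (# all cases)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    wrong = sum(1 for y, y_hat in zip(labels, predictions) if y != y_hat)
    return wrong / len(labels)


# Made-up digits: two of the eight predictions are wrong.
labels = [3, 1, 4, 1, 5, 9, 2, 6]
predictions = [3, 1, 4, 7, 5, 9, 2, 0]

print(merror(labels, predictions))  # 0.25
```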

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define a function that, for the given tensor name, walks through all 
# the iterations for which we have data and fetches the value.
# Returns the set of steps and the values
def get_data(trial, tname):
    tensor = trial.tensor(tname)
    steps = tensor.steps()
    vals = [tensor.value(s) for s in steps]
    return steps, vals

In [None]:
metrics_to_plot = ["train-merror", "validation-merror"]
for metric in metrics_to_plot:
    steps, data = get_data(trial, metric)
    plt.plot(steps, data, label=metric)
plt.xlabel('Iteration')
plt.ylabel('Classification error')
plt.legend()
plt.show()

### Feature importances

We can also visualize the feature importances as determined by
[xgboost.get_fscore()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_fscore).
Note that feature importances with zero values are not included here
(which means that those features were not used in any split conditions).
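
If you need a value for every feature, the missing entries can be filled with zeros. The sketch below assumes the importances are available as a dictionary keyed by XGBoost's default feature names (`f0`, `f1`, ...), which is the shape `get_fscore()` returns when no feature names are supplied. The helper `dense_fscore` and the sample scores are hypothetical.

```python
def dense_fscore(fscore, num_features):
    """Expand a sparse {feature_name: score} dict into a dense list,
    using 0 for features that never appear in a split."""
    return [fscore.get("f{}".format(i), 0) for i in range(num_features)]


# Made-up scores: only features 0 and 3 were used in splits.
sparse_scores = {"f0": 12, "f3": 4}

print(dense_fscore(sparse_scores, num_features=5))  # [12, 0, 0, 4, 0]
```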

In [None]:
def plot_collections(trial, collection_name, ylabel=''):
    
    plt.figure(
        num=1, figsize=(8, 8), dpi=80,
        facecolor='w', edgecolor='k')

    features = trial.collection(collection_name).get_tensor_names()

    # to avoid cluttering, we will plot only one out of 20 features
    for feature in list(features)[::20]:
        steps, data = get_data(trial, feature)
        label = feature.replace('/' + collection_name, '')
        plt.plot(steps, data, label=label)

    plt.legend(bbox_to_anchor=(1.04,1), loc='upper left')
    plt.xlabel('Iteration')
    plt.ylabel(ylabel)
    plt.show()

In [None]:
plot_collections(trial, "feature_importance", "Feature importance")

### SHAP

[SHAP](https://github.com/slundberg/shap) (SHapley Additive exPlanations) is
another approach to explain the output of machine learning models.
SHAP values represent a feature's contribution to a change in the model output.
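
To make "contribution to a change in the model output" concrete, here is a small worked sketch for the simplest case, a linear model with independent features, where the SHAP value of feature $i$ is $w_i (x_i - \mathbb{E}[x_i])$. This is only an illustration: tree ensembles such as XGBoost compute SHAP values with a different algorithm, and the sketch says nothing about how the `average_shap` collection is aggregated. The function `linear_shap` and all numbers are made up.

```python
def linear_shap(weights, bias, background, x):
    """SHAP values for f(x) = bias + sum(w_i * x_i), assuming independent features.

    Returns (base_value, contributions), where base_value is the model
    output at the background mean.
    """
    n_rows = len(background)
    means = [sum(row[i] for row in background) / n_rows for i in range(len(weights))]
    base_value = bias + sum(w * m for w, m in zip(weights, means))
    contributions = [w * (xi - m) for w, xi, m in zip(weights, x, means)]
    return base_value, contributions


weights = [2.0, -1.0]
bias = 0.5
background = [[0.0, 0.0], [2.0, 4.0]]  # feature means are [1.0, 2.0]
x = [3.0, 1.0]

base_value, contributions = linear_shap(weights, bias, background, x)
prediction = bias + sum(w * xi for w, xi in zip(weights, x))

print(base_value)      # 0.5
print(contributions)   # [4.0, 1.0]
print(prediction)      # 5.5
```

The contributions add up to the difference between the prediction and the base value (5.5 - 0.5 = 5.0), which is the additivity property that gives SHAP its name.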

In [None]:
plot_collections(trial, "average_shap", "SHAP values")

### Confusion matrix

In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix
from IPython.display import display, clear_output

fig, ax = plt.subplots()

for step in range(0, 9):
    cm = confusion_matrix(
        trial.tensor('labels').step(step).value,
        trial.tensor('predictions').step(step).value
    )
    normalized_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(normalized_cm, cmap="bone", ax=ax, cbar=False, annot=cm, fmt='')
    print(f"iteration: {step}")
    display(fig)
    plt.pause(1)
    ax.clear()
    clear_output(wait=True)

## 1P rule: Confusion matrix

As another example of using a first party (1P) rule provided by Tornasole, let us again train the example training script [xgboost_mnist_basic_hook_demo.py](../scripts/xgboost_mnist_basic_hook_demo.py) and use the 1P rule `Confusion` to monitor the training job in real time.

During training, the `Confusion` rule job monitors whether the ratio of on-diagonal and off-diagonal values in the confusion matrix falls outside a specified range. In other words, this rule evaluates the goodness of a confusion matrix for a classification problem. It creates a matrix of size `category_no` $\times$ `category_no` and populates it with data coming from (`y`, `y_hat`) pairs. For each (`y`, `y_hat`) pair, the count in `confusion[y][y_hat]` is incremented by 1. Once the matrix is fully populated, the ratios of on- and off-diagonal values are evaluated according to:

- For elements on the diagonal:

$$ \frac{ \text{confusion}_{ii} }{ \sum_j \text{confusion}_{jj} } \geq \text{min_diag} $$

- For elements off the diagonal:

$$ \frac{ \text{confusion}_{ji} }{ \sum_j \text{confusion}_{ji} } \leq \text{max_off_diag} $$

If the condition is met, the rule will emit a CloudWatch event.
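
The matrix construction and the two ratios can be sketched in a few lines of plain Python. This transcribes the formulas above literally and is not the packaged rule's code, whose exact normalization may differ. The helper names and the two-class sample data are hypothetical.

```python
def build_confusion(labels, predictions, category_no):
    """Count (y, y_hat) pairs into a category_no x category_no matrix."""
    confusion = [[0] * category_no for _ in range(category_no)]
    for y, y_hat in zip(labels, predictions):
        confusion[y][y_hat] += 1
    return confusion


def diagonal_ratios(confusion):
    """confusion[i][i] divided by the sum of all diagonal entries."""
    trace = sum(confusion[j][j] for j in range(len(confusion)))
    if trace == 0:
        return [0.0] * len(confusion)
    return [confusion[i][i] / trace for i in range(len(confusion))]


def off_diagonal_ratios(confusion):
    """confusion[j][i] divided by the total of column i, for every j != i."""
    size = len(confusion)
    ratios = {}
    for i in range(size):
        column_total = sum(confusion[j][i] for j in range(size))
        if column_total == 0:
            continue  # nothing was predicted as class i
        for j in range(size):
            if j != i:
                ratios[(j, i)] = confusion[j][i] / column_total
    return ratios


# Made-up two-class data: one example of class 0 is predicted as class 1.
labels = [0, 0, 0, 1, 1, 1]
predictions = [0, 0, 1, 1, 1, 1]

confusion = build_confusion(labels, predictions, category_no=2)
print(confusion)                       # [[2, 1], [0, 3]]
print(diagonal_ratios(confusion))      # [0.4, 0.6]
print(off_diagonal_ratios(confusion))  # {(1, 0): 0.0, (0, 1): 0.25}
```

The rule would then compare ratios like these against the `min_diag` and `max_off_diag` thresholds.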

Note that this rule will infer the default parameters if configurations are not specified, so you can simply use

```python
rules_specification = [
    {
        "RuleName": "Confusion",
        "InstanceType": "ml.c5.4xlarge"
    }
]
```
If you want to specify the optional parameters, you can do so by using `RuntimeConfigurations`:

```python
rules_specification = [
    {
        "RuleName": "Confusion",
        "InstanceType": "ml.c5.4xlarge",
        "RuntimeConfigurations": {
            "category_no": "10",
            "min_diag": "0.8",
            "max_off_diag": "0.2"
        }
    }
]
```

For `Confusion` Rule API and other 1P rules that can be used in XGBoost, refer to [FirstPartyRules.md](../../../rules/FirstPartyRules.md).

In [None]:
estimator = XGBoost(
    image_name=docker_image_name,
    base_job_name="demo-tornasole-xgboost-confusion",
    entry_point=entry_point_script,
    hyperparameters=hyperparameters,
    train_instance_type="ml.m4.4xlarge",
    train_instance_count=1,
    framework_version="0.90-1",
    py_version="py3",
    role=ROLE,

    debug=True,
    rules_specification=[
        {
        "RuleName": "Confusion",
        "InstanceType": "ml.c5.4xlarge"
        }
    ]
)


In [None]:
estimator.fit(wait=False)

job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)

description["TrainingJobStatus"]

In [None]:
description["DebugConfig"]

In [None]:
estimator.describe_rule_execution_jobs()

This notebook showed two examples of using 1P rules provided by Tornasole, but you can also write your own rules, looking at these 1P rules for inspiration. Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more on the APIs you can use to write your own rules, as well as descriptions of the 1P rules that we provide. [xgboost_regression.ipynb](xgboost_regression.ipynb) also demonstrates how to use a custom rule that monitors the ratio of feature importance values.