# Debugging SageMaker Training Jobs with Tornasole - Custom Rules

## Overview

Tornasole is a new capability in Amazon SageMaker designed to be a debugger for machine learning models. It lets you go beyond looking at scalars such as losses and accuracies, and gives you full visibility into all tensors 'flowing through the graph' during training or inference.

Using Tornasole is a two-step process:

### Saving tensors

The Tornasole library helps you save tensors from a running training job (an MXNet job in this notebook). It lets you collect the tensors you want, at the frequency you want, and save them for analysis.

### Analysis

Analyzing the saved tensors requires the `tornasole_rules` package. Refer to that package's documentation for installation and analysis details. To provide an end-to-end flow, this notebook includes a few example analysis commands; they require `tornasole_rules` to be installed.

This notebook walks through installing the components required to emit tensors from a SageMaker training job, and then applies a custom rule over those tensors to monitor the live status of the job.

## Setup

As a first step, we'll install the tools required to emit (save) tensors and to apply rules that analyze them.

In [1]:
! aws s3 cp --recursive s3://tornasole-external-preview-use1/ .
! chmod +x sdk/installer.sh && ./sdk/installer.sh

download: s3://tornasole-external-preview-use1/frameworks/mxnet/api.md to frameworks/mxnet/api.md
download: s3://tornasole-external-preview-use1/frameworks/mxnet/DeveloperGuide_MXNet.md to frameworks/mxnet/DeveloperGuide_MXNet.md
download: s3://tornasole-external-preview-use1/frameworks/mxnet/examples/mnist_mxnet.py to frameworks/mxnet/examples/mnist_mxnet.py
download: s3://tornasole-external-preview-use1/frameworks/mxnet/examples/mxnet.ipynb to frameworks/mxnet/examples/mxnet.ipynb
download: s3://tornasole-external-preview-use1/frameworks/pytorch/api.md to frameworks/pytorch/api.md
download: s3://tornasole-external-preview-use1/frameworks/tensorflow/examples/simple/simple.py to frameworks/tensorflow/examples/simple/simple.py
download: s3://tornasole-external-preview-use1/frameworks/tensorflow/api.md to frameworks/tensorflow/api.md
download: s3://tornasole-external-preview-use1/frameworks/pytorch/DeveloperGuide_PyTorch.md to frameworks/pytorch/DeveloperGuide_PyTorch.md
download: s3://t

Now that we've completed the setup, we're ready to spin off a SageMaker training job with debugging enabled.

## Training with Script Mode

We'll train an MXNet Gluon model on the Fashion-MNIST dataset, using the SageMaker MXNet container in Script Mode.

In [2]:
import boto3
import sagemaker
from sagemaker.mxnet import MXNet

### Inputs

In this section, we'll look at how to configure the inputs for the training job. For Script Mode, you need to place your MXNet training script in the notebook's directory. This training script contains the code that initializes the Tornasole hook.

After you've done that, you can proceed below.


In [3]:
# Set the input values required by the estimator
entry_point_script = '../frameworks/mxnet/scripts/mnist_mxnet.py'
docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:latest'

#### Parameters

Now we'll call the MXNet Estimator to kick off a training job. As mentioned earlier, we want to enable debugging for this job through Tornasole. Because the hook is already initialized in the training script, tensors are saved when the script runs. The configuration of these tensors (for instance the save interval and the regexes that select them) is controlled by a JSON configuration, `tornasole_hook_config_json`, which you can write in the script.

For analysis, i.e. rule evaluation, the estimator has some new parameters that let you specify which rules to evaluate and how. The new parameters to look out for are:
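Because the estimator takes the hook configuration as a string, it can be convenient to build it as a dictionary and serialize it. The sketch below is only illustrative: `config_name` is the one key that appears in this notebook, and the value `my-mxnet-config` is a made-up placeholder.

```python
import json

# Hypothetical hook settings; "config_name" is the only key shown in this notebook.
hook_config = {"config_name": "my-mxnet-config"}

# The estimator parameter expects a string, so serialize the dict.
tornasole_hook_config_json = json.dumps(hook_config)

# Round-trip check: the string parses back to the same dict.
assert json.loads(tornasole_hook_config_json) == hook_config
print(tornasole_hook_config_json)
```

Serializing with `json.dumps` avoids quoting mistakes that are easy to make when writing the JSON string by hand.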

##### `debug` (bool)
This indicates whether debugging should be enabled for the training job. Setting it to `True` makes Tornasole available for use with the SageMaker training job.

##### `tornasole_hook_config_json` (str)
This is the stringified JSON configuration used to instantiate the Tornasole hook in your training script.

##### `rules_specification` (list[*dict*])
This is a list of Python dictionaries, where each `dict` has the following form:
```
{
    "RuleName": <str> # The name of the class implementing the Tornasole Rule interface. (required)
    "SourceS3Uri": <str> # S3 URI of the rule script containing the class in 'RuleName'. If left empty, SageMaker looks for the class among the first-party rules provided by Amazon; otherwise it looks for the class in this script.
    "InstanceType": <str> # The ml instance type in which the rule evaluation should run
    "VolumeSizeInGB": <int> # The volume size to store the runtime artifacts from the rule evaluation
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and invoke the rule
        <str>: <str>
    }
}
```
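To catch mistakes before submitting a job, a specification can be checked locally against the form described above. The following is a minimal sketch: `validate_rule_spec` is a hypothetical helper, not part of the SageMaker SDK, and it assumes that only `RuleName` is mandatory because that is the only key marked as required.

```python
REQUIRED_KEYS = {"RuleName"}
OPTIONAL_KEYS = {"SourceS3Uri", "InstanceType", "VolumeSizeInGB", "RuntimeConfigurations"}


def validate_rule_spec(spec):
    """Return a list of problems found in one rule specification dict."""
    problems = []
    keys = set(spec)
    for key in sorted(REQUIRED_KEYS - keys):
        problems.append(f"missing required key: {key}")
    for key in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unknown key: {key}")
    if "VolumeSizeInGB" in spec and not isinstance(spec["VolumeSizeInGB"], int):
        problems.append("VolumeSizeInGB must be an int")
    # RuntimeConfigurations is described as a string-to-string map.
    for name, value in spec.get("RuntimeConfigurations", {}).items():
        if not isinstance(name, str) or not isinstance(value, str):
            problems.append(f"RuntimeConfigurations[{name!r}] must map str to str")
    return problems


print(validate_rule_spec({"RuleName": "MyRule", "RuntimeConfigurations": {"threshold": 5}}))
```

An empty list means the dictionary matches the documented shape; it does not guarantee that the service accepts it.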
#### Storage
By default, the tensors are saved under the S3 output location of the training job, in the folder **`/tensors-<job name>`**. This keeps the tensors from different training jobs on separate paths so that rules can be evaluated correctly. Re-using the same path for different training jobs causes rules to be evaluated incorrectly.

If you don't provide an output path, SageMaker will create one for you as
**`s3://sagemaker-<region>-<account_id>/`**
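As an illustration of the layout described above, the helpers below compose the default output location and the per-job tensor folder. Both function names are hypothetical, and the exact way SageMaker joins the output path with the `tensors-<job name>` folder is an assumption based only on this description.

```python
def default_output_path(region, account_id):
    """Default bucket location described above: s3://sagemaker-<region>-<account_id>/"""
    return f"s3://sagemaker-{region}-{account_id}/"


def tensor_path(output_path, job_name):
    """Assumed tensor location: <output path>/tensors-<job name>."""
    return f"{output_path.rstrip('/')}/tensors-{job_name}"


# Placeholder region, account ID and job names, for illustration only.
base = default_output_path("us-west-2", "123456789012")
print(tensor_path(base, "job-a"))
print(tensor_path(base, "job-b"))
```

Because the job name is part of the path, two jobs that share an output location still write their tensors to different folders.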

### Your Rule

To use rule evaluation with SageMaker training jobs, you can either bring your own rule as a Python script or use one of the rules that Tornasole provides out of the box. This notebook focuses on using your own custom rule script and having it evaluated against the output of your SageMaker training job. Read more about how to write a rule script in the documentation in this notebook's directory.

This notebook uses a custom rule script which attempts to identify steps with bad weight update ratios during training. We've uploaded the file to **'s3://tornasole-test-artifacts/rules/weight_update_ratio.py'**. The script contains a class called *`WeightUpdateRatio`* that implements the `Rule` interface. The class implements all the methods of the interface in a way that lets it catch steps with a bad weight update ratio.

The constructor of this class has the following signature:
```
__init__(self, base_trial, large_threshold=10, small_threshold=0.00000001, epsilon=0.000000001)
```
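To give a feel for what such a rule might check, here is a rough sketch of a weight update ratio test in plain Python. It is not the code in `weight_update_ratio.py`: the formula (mean absolute update divided by mean absolute weight plus `epsilon`) and both function names are assumptions, and only the default values are taken from the signature above.

```python
def update_ratio(old_weights, new_weights, epsilon=0.000000001):
    """Mean absolute update divided by (mean absolute weight + epsilon)."""
    n = len(old_weights)
    mean_update = sum(abs(new - old) for old, new in zip(old_weights, new_weights)) / n
    mean_weight = sum(abs(old) for old in old_weights) / n
    return mean_update / (mean_weight + epsilon)


def is_bad_ratio(ratio, large_threshold=10, small_threshold=0.00000001):
    """Flag ratios that are too large or too small."""
    return ratio > large_threshold or ratio < small_threshold


# A step where the weights did not move at all is flagged as too small.
frozen = update_ratio([1.0, 1.0], [1.0, 1.0])
print(frozen, is_bad_ratio(frozen))

# A moderate update falls between the two thresholds.
moderate = update_ratio([1.0, -1.0], [1.5, -0.5])
print(moderate, is_bad_ratio(moderate))
```

The `epsilon` term keeps the division well defined when all weights are zero.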

Keep in mind that for SageMaker to be able to evaluate your rule, the rule class will need to have a signature conforming to the spec defined by Tornasole.

To initialize your class, you need to pass values for every parameter except `self` and `base_trial`. You do this by putting the parameters and their values as a string-to-string map in `RuntimeConfigurations` in the `rules_specification`, as described earlier.
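Since every value in `RuntimeConfigurations` is a string, something has to turn those strings back into typed constructor arguments. How SageMaker does this is not described here; the sketch below shows one plausible approach that casts each string to the type of the parameter's default. `WeightUpdateRatioStub` and `constructor_kwargs` are hypothetical names, and the stub only mirrors the constructor signature quoted above.

```python
import inspect


class WeightUpdateRatioStub:
    """Stand-in with the constructor signature quoted above; not the real rule."""

    def __init__(self, base_trial, large_threshold=10, small_threshold=0.00000001, epsilon=0.000000001):
        self.base_trial = base_trial
        self.large_threshold = large_threshold
        self.small_threshold = small_threshold
        self.epsilon = epsilon


def constructor_kwargs(rule_cls, runtime_config):
    """Pick out constructor parameters and cast them to the type of their default."""
    params = inspect.signature(rule_cls.__init__).parameters
    kwargs = {}
    for name, raw in runtime_config.items():
        if name in ("self", "base_trial") or name not in params:
            continue  # e.g. "start-step" is used for invocation, not construction
        default = params[name].default
        caster = type(default) if default is not inspect.Parameter.empty else str
        kwargs[name] = caster(raw)
    return kwargs


config = {"large_threshold": "100", "small_threshold": "0.00000002", "start-step": "1"}
kwargs = constructor_kwargs(WeightUpdateRatioStub, config)
rule = WeightUpdateRatioStub(None, **kwargs)
print(kwargs)
```

Casting by default type is a simplification: a value such as `"100.5"` for an integer default would raise a `ValueError`, so a real implementation would need a more forgiving conversion.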

In [4]:
rules_specification=[
    {
        "RuleName": "WeightUpdateRatio", # Defines the class within the rule script you want to evaluate
        "SourceS3Uri": "s3://ibhatt-tornasole-test-artifacts/rules/weight_update_ratio.py",
        "InstanceType": "ml.c5.4xlarge", # Instance type you want to have the rule evaluation done on
        "VolumeSizeInGB": 10,            # Volume of the disk attached to the instance on which evaluation is done
        "RuntimeConfigurations": {
            # For class constructor initialization
            "large_threshold": "100",
            "small_threshold": "0.00000002",
            
            # For rule invocation
            "start-step": "1",
            "end-step": "20",
        }
    }
]

### Estimator

Now that everything is in place, we are ready to initialize our estimator and kick off a SageMaker training job. This job emits tensors based on the configuration you defined in the training script and, alongside training, runs a rule evaluation that checks weight update ratios (again, based on the rule script you provided).

In [5]:
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet-trsl-ibhatt-test-nb',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  framework_version='1.4.1',
                  tornasole_hook_config_json='{"config_name": "ibhatt-mxnet-config"}',
                  py_version='py3',
                  debug=True,
                  rules_specification=rules_specification
                 )

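To kick off the job, we call the `fit()` method on the MXNet estimator. Passing `wait=False` returns control to the notebook immediately instead of blocking until the job finishes.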

In [6]:
estimator.fit(wait=False)

## Result

As a result of the above command, SageMaker starts a training job, which produces the tensors to be analyzed, and a rule evaluation job, which evaluates the rule you specified in `rules_specification`.

### Training Job
Describing the training job reveals the state of the job as a whole.

In [7]:
job_details = estimator.sagemaker_session.sagemaker_client.describe_training_job(
    TrainingJobName=estimator._current_job_name
)
print(job_details)

{'TrainingJobName': 'mxnet-trsl-ibhatt-test-nb-2019-08-20-22-16-58-274', 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:072677473360:training-job/mxnet-trsl-ibhatt-test-nb-2019-08-20-22-16-58-274', 'TrainingJobStatus': 'InProgress', 'SecondaryStatus': 'Starting', 'HyperParameters': {'sagemaker_container_log_level': '20', 'sagemaker_enable_cloudwatch_metrics': 'false', 'sagemaker_job_name': '"mxnet-trsl-ibhatt-test-nb-2019-08-20-22-16-58-274"', 'sagemaker_program': '"mnist_mxnet.py"', 'sagemaker_region': '"us-west-2"', 'sagemaker_submit_directory': '"s3://sagemaker-us-west-2-072677473360/mxnet-trsl-ibhatt-test-nb-2019-08-20-22-16-58-274/source/sourcedir.tar.gz"'}, 'AlgorithmSpecification': {'TrainingImage': '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:latest', 'TrainingInputMode': 'File'}, 'RoleArn': 'arn:aws:iam::072677473360:role/service-role/AmazonSageMaker-ExecutionRole-20190614T145575', 'InputDataConfig': [{'ChannelName': 'tornasole-config', 'Data
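Because the job was started without waiting, a single describe call only captures a snapshot. One way to follow the job is to poll the describe call until `TrainingJobStatus` reaches a terminal value. The helper below is a sketch: `wait_for_training_job` is a hypothetical function, and the describe callable is passed in so the loop can be tried without AWS access.

```python
import time

# Terminal values of TrainingJobStatus in the DescribeTrainingJob response.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll describe(TrainingJobName=...) until the job reaches a terminal status."""
    for _ in range(max_polls):
        details = describe(TrainingJobName=job_name)
        status = details["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# In this notebook the call would look like:
# wait_for_training_job(
#     estimator.sagemaker_session.sagemaker_client.describe_training_job,
#     estimator._current_job_name,
# )
```

Bounding the number of polls keeps a stuck job from blocking the notebook indefinitely.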

#### Events
Whenever there is a change of status of your training job, SageMaker will record a CloudWatch event which you can react to. See https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html. 

#### Logs
The logs for the training job should be visible under `/aws/sagemaker/TrainingJobs/<your job name>` in the CloudWatch console.
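The same logs can also be read programmatically. The sketch below assumes a CloudWatch Logs client created with `boto3.client("logs")` and uses its `filter_log_events` call, with the job name as the log stream prefix inside the `/aws/sagemaker/TrainingJobs` log group. `fetch_training_log_messages` is a hypothetical helper; the client is passed in as an argument so the function can be exercised with a stand-in object.

```python
def fetch_training_log_messages(logs_client, job_name, limit=50):
    """Return up to `limit` log messages from streams that start with the job name."""
    response = logs_client.filter_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamNamePrefix=job_name,
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]


# In this notebook the call would look like:
# import boto3
# messages = fetch_training_log_messages(boto3.client("logs"), estimator._current_job_name)
```

Only the first page of results is returned; a longer job would need to follow `nextToken` to read everything.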

### Rule execution result
To get the result of the rule execution that SageMaker started for you, call the Describe API on the parent training job and inspect the `RuleMonitoringStatuses` field in the response.

In [8]:
job_details['RuleMonitoringStatuses']

[{'RuleName': 'WeightUpdateRatio',
  'RuleStatus': 'NotStarted',
  'LastModifiedTime': datetime.datetime(2019, 8, 20, 22, 16, 59, 571000, tzinfo=tzlocal())}]

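In the output above, the rule status is `NotStarted` because the job was only just submitted. Repeat the describe call after the job has run for a while to see the updated `RuleStatus`.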
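When several rules are attached to one job, it can help to condense the statuses into a small mapping. The helper below is a sketch: `rule_statuses` is a hypothetical function, and it relies only on the `RuleName` and `RuleStatus` fields visible in the output above.

```python
def rule_statuses(job_details):
    """Map each rule name to its current status from a describe response."""
    return {
        rule["RuleName"]: rule["RuleStatus"]
        for rule in job_details.get("RuleMonitoringStatuses", [])
    }


# Sample shaped like the describe output shown above.
sample = {"RuleMonitoringStatuses": [{"RuleName": "WeightUpdateRatio", "RuleStatus": "NotStarted"}]}
print(rule_statuses(sample))
```

A response without the `RuleMonitoringStatuses` field yields an empty mapping rather than an error.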

#### Event

Just as for a training job, when the status of a rule execution job changes, SageMaker emits a CloudWatch event for each rule you ran: https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html.

You can configure a CloudWatch Events rule to receive and process these events by setting up a target (for example a Lambda function or an SNS topic).

For our training job, log into the CloudWatch console and check whether an event was recorded for your rule execution job.
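As a sketch of what a Lambda target could look like, the handler below summarizes a SageMaker training job state change event. The event pattern uses the `aws.sagemaker` source and the `SageMaker Training Job State Change` detail type. Whether the event detail carries a `RuleMonitoringStatuses` list for rule executions is an assumption, so the handler treats that field as optional.

```python
# Event pattern for a CloudWatch Events rule that matches training job state changes.
EVENT_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
}


def lambda_handler(event, context):
    """Summarize the job status and, if present, the rule statuses in the event."""
    detail = event.get("detail", {})
    summary = {
        "job": detail.get("TrainingJobName"),
        "status": detail.get("TrainingJobStatus"),
        # Assumed field: present only if the event mirrors the describe response.
        "rules": {
            rule.get("RuleName"): rule.get("RuleStatus")
            for rule in detail.get("RuleMonitoringStatuses", [])
        },
    }
    print(summary)
    return summary
```

From here the handler could publish the summary to a notification channel or stop the job when a rule reports a problem; those steps are left out because they depend on your setup.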

#### Logs
To get access to the logs for your rule evaluation, go to the CloudWatch console and see the logs under `/aws/sagemaker/TrainingJobs/<job name from RuleMonitoringStatuses above in camelCase>`