# Debugging SageMaker Training Jobs with Tornasole

## Overview

Tornasole is an upcoming AWS service designed to be a debugger for machine learning models. It lets you go beyond just looking at scalars like losses and accuracies during training and gives you full visibility into all tensors 'flowing through the graph' during training or inference.

Using Tornasole is a two-step process:

### Saving tensors

This library, intended to be used with Tornasole, helps you save tensors from a running training job (an MXNet job in this example). It lets you collect the tensors you want, at the frequency you want, and save them for analysis.

### Analysis

Analyzing the saved tensors requires the `tornasole_rules` package. Refer to that package's documentation for details on installation and analysis. To provide an end-to-end flow, this example also includes a few analysis commands; they require `tornasole_rules` to be installed.

This example walks through installing the components required to emit tensors from a SageMaker training job, and then applying a rule over those tensors to monitor the live status of the job.

## Setup

As a first step, we'll install the tools required to emit (save) tensors and to apply rules that analyze them.

In [None]:
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz .
!pip install sagemaker-1.35.2.dev0.tar.gz

In [None]:
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole.json .
!aws configure add-model --service-model sagemaker-tornasole.json --service-name sagemaker

Now that setup is complete, we're ready to start a training job with debugging enabled.

## Training with Script Mode

We'll train an MXNet Gluon model on the FashionMNIST dataset, using the SageMaker MXNet container in Script Mode.

In [None]:
import boto3
import sagemaker
from sagemaker.mxnet import MXNet

### Inputs

Configure the inputs for the training job.

In [None]:
entry_point_script = '../scripts/mnist_gluon_basic_hook_demo.py'
docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:latest'

#### Parameters

Next, we'll call the MXNet Estimator to kick off a training job. The new Estimator parameters to look out for are:

##### `debug` (bool)
This indicates whether debugging should be enabled for the training job. Setting it to `True` makes Tornasole available for use with the job.

##### `tornasole_hook_config_json` (str)
This is the stringified JSON configuration needed to instantiate the Tornasole hook in your training script.
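
Because this parameter expects a string rather than a dictionary, one way to produce it is to build the configuration as a Python `dict` and serialize it with `json.dumps`. The sketch below shows only the stringification step. The keys `save_interval` and `include_collections` are hypothetical placeholders, not documented Tornasole hook options, so check the hook documentation for the actual schema.

```python
import json

# Hypothetical hook configuration: these keys are placeholders for
# illustration only, not the documented Tornasole hook schema.
hook_config = {
    "save_interval": 30,
    "include_collections": ["gradients"],
}

# The estimator parameter takes a string, so serialize the dict to JSON.
tornasole_hook_config_json = json.dumps(hook_config)

# Round-tripping through json.loads recovers the original dictionary.
assert json.loads(tornasole_hook_config_json) == hook_config
```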

##### `rules_specification` (list[*dict*])
This is a list of Python dictionaries, where each `dict` has the following form:
```
{
    "RuleName": <str> # Name of the class implementing the Tornasole Rule interface. (required)
    "SourceS3Uri": <str> # S3 URI of the rule script containing the class named in 'RuleName'. If left empty, SageMaker looks for the class among the first-party rules provided by Amazon; otherwise it looks for the class in this script.
    "InstanceType": <str> # The ML instance type on which the rule evaluation should run
    "VolumeSizeInGB": <int> # The volume size for storing runtime artifacts from the rule evaluation
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and invoke the rule
        <str>: <str>
    }
}
```
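
To make the schema above concrete, here is a minimal sketch of a helper that assembles one `rules_specification` entry. `make_rule_spec` is a hypothetical convenience function, not part of the SageMaker SDK. It also assumes that "left empty" for `SourceS3Uri` can be expressed by omitting the key, which the schema does not state explicitly.

```python
def make_rule_spec(rule_name, instance_type, volume_size_gb,
                   runtime_config=None, source_s3_uri=None):
    """Build one rules_specification entry (hypothetical helper)."""
    if not rule_name:
        raise ValueError("RuleName is required")
    spec = {
        "RuleName": rule_name,
        "InstanceType": instance_type,
        "VolumeSizeInGB": int(volume_size_gb),
        # The schema maps <str> to <str>, so coerce keys and values to strings.
        "RuntimeConfigurations": {
            str(k): str(v) for k, v in (runtime_config or {}).items()
        },
    }
    # Assumption: omit SourceS3Uri for first-party rules, set it for custom rules.
    if source_s3_uri:
        spec["SourceS3Uri"] = source_s3_uri
    return spec


# Reproduces the entry used with the estimator in this example.
rule = make_rule_spec("VanishingGradient", "ml.c5.4xlarge", 10, {"end-step": 5})
```

The resulting list, `[rule]`, has the shape expected by the `rules_specification` parameter.
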
#### Storage
By default, the tensors are saved to the S3 output location of the training job, under the folder **`/tensors-<job name>`**. This keeps tensors from different training jobs on separate paths so that rules can be evaluated correctly. Reusing the same path for different training jobs will cause rules to be evaluated incorrectly.

If you don't provide an output path, SageMaker will create one for you as
**`s3://sagemaker-<region>-<account_id>/`**
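
As a rough illustration of the naming described above, the sketch below composes the default output location and the tensor folder for a job. Both helper functions are hypothetical, the region, account ID, and job name are placeholders, and the exact layout SageMaker uses between the output path and the `tensors-<job name>` folder is an assumption based only on the description in this section.

```python
def default_output_path(region, account_id):
    """Default S3 output location, following the pattern described above."""
    return "s3://sagemaker-{}-{}/".format(region, account_id)


def tensor_prefix(output_path, job_name):
    """Assumed tensor location: <output path>/tensors-<job name>."""
    return "{}/tensors-{}".format(output_path.rstrip("/"), job_name)


# Placeholder region, account ID, and job name for illustration.
out = default_output_path("us-west-2", "123456789012")
print(tensor_prefix(out, "mxnet-trsl-test-nb-example"))
```

Because the job name is part of the prefix, two jobs never share a tensor folder unless you override the path yourself.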

#### Estimator
Here is how we instantiate the estimator:

In [None]:
hyperparameters = {'tornasole_path': '/opt/ml/output/tensors', 'random_seed': True, 'num_steps': 6}
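
In Script Mode, SageMaker passes these hyperparameters to the entry point script as command-line arguments. The sketch below shows one way a training script might read them with `argparse`. It is not the content of `mnist_gluon_basic_hook_demo.py`; the `str2bool` and `parse_args` helpers and the default values are assumptions made for illustration.

```python
import argparse


def str2bool(value):
    """Interpret command-line strings such as 'True' or 'false' as booleans."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--tornasole_path", type=str, default="/opt/ml/output/tensors")
    parser.add_argument("--random_seed", type=str2bool, default=False)
    parser.add_argument("--num_steps", type=int, default=6)
    # parse_known_args tolerates extra arguments that may also reach the script.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments corresponding to the hyperparameters dictionary.
args = parse_args(["--tornasole_path", "/opt/ml/output/tensors",
                   "--random_seed", "True", "--num_steps", "6"])
print(args.tornasole_path, args.random_seed, args.num_steps)
```

Boolean values arrive as strings on the command line, which is why `random_seed` gets an explicit converter instead of `type=bool`.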

In [None]:
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet-trsl-test-nb',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  hyperparameters=hyperparameters,
                  framework_version='1.4.1',
                  debug=True,
                  py_version='py3',
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          "RuntimeConfigurations": {
                              "end-step": "5"
                          }
                      }
                  ])

To kick off the job, we call the `fit()` method on the MXNet estimator.

In [None]:
estimator.fit()

## Result

As a result of the above command, SageMaker starts two training jobs for you: the first produces the tensors to be analyzed, and the second evaluates the rule you requested in `rules_specification`.

You'll notice that while the training job completes, the weight update ratio blows up from step 233 onwards. As a result, the rule execution job that was started for this training job fails.

### Training Job
You can go to the console and find the training job whose name starts with **mxnet-trsl-test-nb** or, optionally, make a list call and get the job ARN from there.
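
The list call can be sketched with the boto3 SageMaker client's `list_training_jobs` operation. The two helper functions below are hypothetical, and the prefix is taken from the `base_job_name` passed to the estimator. The filtering logic works on the response dictionary, so it can be tried without AWS credentials.

```python
def find_jobs_by_prefix(list_response, prefix):
    """Return (name, ARN) pairs for jobs whose name starts with `prefix`."""
    return [
        (job["TrainingJobName"], job["TrainingJobArn"])
        for job in list_response.get("TrainingJobSummaries", [])
        if job["TrainingJobName"].startswith(prefix)
    ]


def list_recent_jobs(name_contains, max_results=10):
    """Call ListTrainingJobs; boto3 is imported lazily because it needs AWS credentials."""
    import boto3
    client = boto3.client("sagemaker")
    return client.list_training_jobs(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=max_results,
    )


# In the notebook, with credentials configured:
# find_jobs_by_prefix(list_recent_jobs("mxnet-trsl-test-nb"), "mxnet-trsl-test-nb")
```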

### Accessing the Rule Execution Job
To find the rule execution job that SageMaker started for you, go to the SageMaker console and, under Training Jobs, look for the job name starting with 'WeightUpdateRatio'. Alternatively, make a Describe API call on the parent training job and read the job name from the `RuleMonitoringStatus` blob. The failure reason reported for the rule execution job looks like this:
```
Failure reason
ClientError: RuleEvaluationConditionMet: Rule evaluation resulted in the condition being met
Traceback (most recent call last):
  File "train.py", line 214, in execute
    exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 82, in invoke_rule
    raise e
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 77, in invoke_rule
    rule_obj.invoke(step)
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule.py", line 103, in invoke
    raise RuleEvaluationConditionMet
tornasole.exceptions.RuleEvaluationConditionMet: Rule evaluation resulted in the condition being met
```
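
The Describe call on the parent training job can be sketched with boto3's `describe_training_job`. Both helpers below are hypothetical. The `RuleMonitoringStatus` field is assumed to appear in the response only when the preview service model from the Setup section is installed, and its internal structure is not documented here, so the sketch returns the blob as-is.

```python
def rule_monitoring_status(describe_response):
    """Return the RuleMonitoringStatus blob, or None if it is absent."""
    return describe_response.get("RuleMonitoringStatus")


def describe_parent_job(job_name):
    """Call DescribeTrainingJob; boto3 is imported lazily because it needs AWS credentials."""
    import boto3
    client = boto3.client("sagemaker")
    return client.describe_training_job(TrainingJobName=job_name)


# In the notebook, with credentials configured:
# rule_monitoring_status(describe_parent_job("<training job name>"))
```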

In [None]:
estimator.describe_rule_execution_jobs()

In [None]:
entry_point_script = '../scripts/mnist_gluon_vg_demo.py'
bad_hyperparameters = {'tornasole_path': '/opt/ml/output/tensors', 'random_seed': True, 'num_steps': 33, 'tornasole_frequency': 30}

In [None]:
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet-trsl-test-nb',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  hyperparameters=bad_hyperparameters,
                  framework_version='1.4.1',
                  debug=True,
                  py_version='py3',
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          "RuntimeConfigurations": {
                              "start-step": "1",
                              "end-step": "33"
                          }
                      }
                  ])

In [None]:
estimator.fit()

In [None]:
estimator.describe_rule_execution_jobs()