# Debugging SageMaker Training Jobs with Tornasole

## Overview

Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. 
It lets you go beyond just looking at scalars like losses and accuracies during training and gives 
you full visibility into all tensors 'flowing through the graph' during training.

Using Tornasole is a two-step process: saving tensors and analysis. Let's look at each of them closely.

### Saving tensors

Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library that lets you capture these tensors and save them for analysis.

### Analysis

Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. Broadly, 
a Rule is Python code that detects certain conditions during training. Conditions that a data scientist training a deep learning model may care about include gradients getting too large or too small, overfitting, and so on.
Tornasole comes pre-packaged with certain rules, and users can write their own rules using the Tornasole APIs.
You can also analyze raw tensor data outside of the Rules construct, for example in a SageMaker notebook, using Tornasole's full set of APIs. 
Please refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more details about analysis.
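
To make the idea of a condition on tensors concrete, here is a minimal sketch of a vanishing-gradient style check in plain Python. It does not use the Tornasole Rule API: the function names, the threshold and the gradient values are all hypothetical, and gradients are modelled as plain lists of floats keyed by step.

```python
# Illustrative stand-in only: this is NOT the Tornasole Rule API.
# Gradients are modelled as plain lists of floats keyed by training step.

def mean_abs(values):
    """Return the mean absolute value of a non-empty sequence of floats."""
    return sum(abs(v) for v in values) / len(values)


def steps_with_vanishing_gradients(gradients_by_step, threshold=1e-7):
    """Return the sorted steps whose mean absolute gradient is below threshold."""
    return sorted(
        step
        for step, grads in gradients_by_step.items()
        if mean_abs(grads) < threshold
    )


# Hypothetical gradient values for three saved steps.
example_gradients = {
    0: [0.12, -0.30, 0.05],
    10: [1e-4, -2e-4, 5e-5],
    20: [1e-9, -3e-9, 2e-9],
}
flagged_steps = steps_with_vanishing_gradients(example_gradients)
```

A real rule would read the tensors saved by the training job rather than an in-memory dictionary; the developer guide linked above describes the actual interface.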

This example guides you through installation of the required components for emitting tensors in a 
SageMaker training job and applying a rule over the tensors to monitor the live status of the job.

## Setup

As a first step, we'll install the tools required to emit (save) tensors and to apply rules that analyze them.

In [None]:
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz .
!pip install sagemaker-1.35.2.dev0.tar.gz
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole.json .
!aws configure add-model --service-model sagemaker-tornasole.json --service-name sagemaker

Now that we've completed the setup, we're ready to start a training job with debugging enabled.

## Enable Tornasole in the training script

Integrating Tornasole into the training job can be accomplished by following the steps below.

### Import the tornasole_hook package
Import the TornasoleHook class along with other helper classes in your training script as shown below.

```
from tornasole.mxnet.hook import TornasoleHook
from tornasole.mxnet import SaveConfig, Collection
```

### Instantiate and initialize tornasole hook

**NOTE: In order to enable Tornasole functionality while running the script in SageMaker, the hook must be initialized with 'out_dir = /opt/ml/output/tensors'.**

```
# Create a SaveConfig that instructs the engine to log graph tensors every 10 steps.
save_config = SaveConfig(save_interval=10)
# Output directory that Tornasole requires when the script runs in SageMaker.
tornasole_path = '/opt/ml/output/tensors'
# Create a hook that logs tensors of weights, biases and gradients while training the model.
hook = TornasoleHook(out_dir=tornasole_path, save_config=save_config)
```
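
To get a feel for how `save_interval` thins out the saved steps, the sketch below lists which step numbers would be captured. It assumes a step is saved whenever `step % save_interval == 0`, which is an assumption about the hook's behaviour rather than something this notebook confirms, and `saved_steps` is a hypothetical helper, not part of Tornasole.

```python
def saved_steps(num_steps, save_interval):
    """List the steps saved under the ASSUMED rule 'step % save_interval == 0'."""
    if save_interval <= 0:
        raise ValueError("save_interval must be a positive integer")
    return [step for step in range(num_steps) if step % save_interval == 0]


# With 35 training steps and save_interval=10, four steps would be saved.
steps = saved_steps(num_steps=35, save_interval=10)
```

A smaller interval gives rules more steps to inspect at the cost of more data written to the output directory.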

### Register the Tornasole hook with the model before training starts

**NOTE: The Tornasole hook can only be registered to Gluon non-hybrid models.**

After creating or loading the desired model, users can register the hook with the model as shown below.

```
net = create_gluon_model()
# Apply the hook to the model, i.e. instruct the engine to recognize the hook
# configuration and enable the mode in which the engine logs graph tensors.
hook.register_hook(net)
```

#### Set the mode
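Set the mode you are running the job in. This helps you group steps by mode, 
for easier analysis. 
If you do not set a mode, steps are saved under the `default` mode.
The snippet below assumes that `ts` is the alias under which the Tornasole module exposing `modes` was imported in your script.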

```
hook.set_mode(ts.modes.TRAIN)
```

## Start SageMaker training with Tornasole enabled

We'll train an MXNet Gluon model on the FashionMNIST dataset and collect the tensors, using the SageMaker MXNet container in Script Mode. In the first example, we schedule the SageMaker training job and enable the 'VanishingGradient' rule at the same time. In this example, training proceeds without showing the vanishing gradient issue.

In [None]:
import boto3
import sagemaker
from sagemaker.mxnet import MXNet

### Inputs

Configure the inputs for the training job. The 'docker_image_name' points to the Docker image that contains the pre-installed Tornasole binaries. 
The 'entry_point_script' points to the MXNet training script that has the TornasoleHook integrated.
The 'hyperparameters' are the parameters that will be passed to the training script. Please note that the **tornasole_path** parameter is set to be **/opt/ml/output/tensors**. This is **mandatory** when running the training script with SageMaker and Tornasole. 

In [None]:
docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:latest'

In [None]:
entry_point_script = '../scripts/mnist_gluon_basic_hook_demo.py'
hyperparameters = {'tornasole_path': '/opt/ml/output/tensors', 'random_seed': True, 'num_steps': 6}

#### Storage
The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, 
under the folder **`/tensors-<job name>`**. This ensures that the tensors from one training job are not 
accidentally overwritten by those of another. Rule evaluation requires the tensor paths to be kept 
separate in order to work correctly.

If you don't provide an S3 output path to the estimator, SageMaker creates one for you as:
**`s3://sagemaker-<region>-<account_id>/`**


This path is used to create a Tornasole Trial taken by Rules (see below). 
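
As a rough illustration of the naming scheme described above, the sketch below assembles the tensors location for a job. The helper `tensors_s3_path` is hypothetical, and the exact layout under the output path is an assumption based only on the description in this section.

```python
def tensors_s3_path(job_name, region, account_id, output_path=None):
    """Build the ASSUMED S3 location of the tensors saved for a training job."""
    if output_path is None:
        # Default output path that SageMaker creates when none is provided.
        output_path = "s3://sagemaker-{}-{}/".format(region, account_id)
    # Tensors are described as living under a 'tensors-<job name>' folder.
    return "{}/tensors-{}".format(output_path.rstrip("/"), job_name)


# Hypothetical job name, region and account id.
example_path = tensors_s3_path("my-job", "us-west-2", "123456789012")
```

A path of this shape is what you would pass as a trial location when a rule needs to look at another job.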

#### New Parameters
The new parameters in the SageMaker Estimator to look out for are:

##### `debug` (bool)
This indicates that debugging should be enabled for the training job. 
Setting it to `True` makes Tornasole available for use with the job.

##### `rules_specification` (list[*dict*])
This is a list of python dictionaries, where each `dict` is of the following form:
```
{
    "RuleName": <str>,        # The name of the class implementing the Tornasole Rule interface. (required)
    "SourceS3Uri": <str>,     # S3 URI of the rule script containing the class in 'RuleName'.
                              # If left empty, the class is looked up among the First Party rules
                              # already provided to you by Amazon.
                              # If set, SageMaker looks for the rule class in the script.
    "InstanceType": <str>,    # The ml instance type on which the rule evaluation should run
    "VolumeSizeInGB": <int>,  # The volume size to store the runtime artifacts from the rule evaluation
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and
        # parameters regarding invocation of the rule (start-step and end-step).
        # This can be any parameter taken by the rule.
        <str>: <str>
    }
}
```
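
Since a malformed entry is only discovered once the job is submitted, it can help to check each dictionary locally first. The following is a minimal sketch of such a check, written only from the field descriptions above; `validate_rule_spec` is a hypothetical helper and not part of the SageMaker SDK.

```python
def validate_rule_spec(spec):
    """Return a list of problems found in one rules_specification entry."""
    problems = []
    name = spec.get("RuleName")
    if not isinstance(name, str) or not name:
        problems.append("RuleName is required and must be a non-empty string")
    # Optional fields and the types described for them.
    expected_types = (("SourceS3Uri", str), ("InstanceType", str), ("VolumeSizeInGB", int))
    for key, expected in expected_types:
        if key in spec and not isinstance(spec[key], expected):
            problems.append("{} must be of type {}".format(key, expected.__name__))
    runtime = spec.get("RuntimeConfigurations", {})
    if not isinstance(runtime, dict):
        problems.append("RuntimeConfigurations must be a dict")
    else:
        for key, value in runtime.items():
            if not (isinstance(key, str) and isinstance(value, str)):
                problems.append("RuntimeConfigurations[{!r}] must map a str to a str".format(key))
    return problems


good_spec = {
    "RuleName": "VanishingGradient",
    "InstanceType": "ml.c5.4xlarge",
    "VolumeSizeInGB": 10,
    "RuntimeConfigurations": {"end-step": "5"},
}
bad_spec = {"InstanceType": "ml.c5.4xlarge", "RuntimeConfigurations": {"end-step": 5}}
good_problems = validate_rule_spec(good_spec)
bad_problems = validate_rule_spec(bad_spec)
```

Here `bad_spec` is flagged twice: once for the missing `RuleName` and once for the integer `end-step` value.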


### Rules
Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the job.
They can be used to assert certain conditions during training and to raise CloudWatch Events based on them, which you can
then process in any way you like. 

A Trial in Tornasole's context
refers to a training job. It is identified by the path where the saved tensors for the job are stored. 
A rule takes a `base_trial`, which refers to the job whose run invokes the rule execution.
A rule can optionally look at other jobs as well, passed using the argument `other_trials`. 

Tornasole comes with a set of first party rules (1P rules).
You can also write your own rules, looking at these 1P rules for inspiration. 
Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more.
 
Here we will talk about how to use SageMaker to evaluate these rules on the training jobs.

##### 1P Rule 
If you want to use a 1P rule, specify the RuleName field with the name of the 1P rule, 
and the rule will be applied automatically. You can pass any parameters accepted by the 
rule as part of the RuntimeConfigurations dictionary. The arguments `base_trial` (and `other_trials` if 
taken by the rule) can be passed as the S3 path where the tensors for 
the trial are stored in the RuntimeConfigurations dictionary above.

Here's an example of a more complex configuration for the SimilarAcrossRuns rule (which accepts another trial and a regex pattern), 
where we ask for the rule to be invoked for the steps between 10 and 100.

``` 
rules_specification = [
    {
      "RuleName": "SimilarAcrossRuns",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "other_trials": "s3://sagemaker-<region>-<account_id>/past-job",
         "include_regex": ".*",
         "start-step": "10",
         "end-step": "100"
       }
    }
]
```

##### Custom rule
In this case you need to define a custom rule class which inherits from the `tornasole.rules.Rule` class.
You need to provide SageMaker with the S3 location of the file that defines your custom rule classes as the value of the field `SourceS3Uri`.
Again, you can pass any arguments taken by this rule through the RuntimeConfigurations dictionary. 
Note that custom rules can only have arguments that expect a string as the value, except for the two arguments 
specifying trials to the Rule. Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more.

Here's an example:
```
rules_specification = [
    {
      "RuleName": "CustomRule",
      "SourceS3Uri": "s3://weiyou-tornasole-test/rule-script/custom_rule.py",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "threshold" : "0.5"
       }
    }
]
```
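
Because the values in RuntimeConfigurations are strings, numeric settings such as a threshold or step bounds have to be converted before they are passed. The sketch below shows one way to assemble an entry while doing that conversion; `make_rule_spec` is a hypothetical helper, its defaults simply mirror the examples above, and the bucket name is a placeholder.

```python
def make_rule_spec(rule_name, runtime_configurations=None, source_s3_uri=None,
                   instance_type="ml.c5.4xlarge", volume_size_in_gb=10):
    """Build one rules_specification entry, converting runtime values to strings."""
    spec = {
        "RuleName": rule_name,
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_size_in_gb,
        # Every key and value is stringified to match the <str>: <str> mapping.
        "RuntimeConfigurations": {
            str(key): str(value)
            for key, value in (runtime_configurations or {}).items()
        },
    }
    if source_s3_uri:
        # Only custom rules need to point at a rule script.
        spec["SourceS3Uri"] = source_s3_uri
    return spec


custom_spec = make_rule_spec(
    "CustomRule",
    {"threshold": 0.5, "start-step": 10},
    source_s3_uri="s3://<your-bucket>/rule-script/custom_rule.py",
)
```

The rule class then receives these values as strings and is responsible for converting them back, for example with `float(threshold)`.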

### Estimator
Now we'll call the SageMaker MXNet Estimator to kick off a training job along with the VanishingGradient rule to monitor the job.

This first job uses the standard `hyperparameters` defined above. Later in this notebook, a second job is launched 
with a different training script and a `bad_hyperparameters` dictionary, so that the same rule can be evaluated 
on a run that is intended to misbehave.


In [None]:
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet-trsl-test-nb',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  hyperparameters=hyperparameters,
                  framework_version='1.4.1',
                  debug=True,
                  py_version='py3',
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          "RuntimeConfigurations": {
                              "end-step": "5"
                          }
                      }
                  ])

To kick off the job, we call the `fit()` method on the MXNet estimator.

In [None]:
estimator.fit()

## Result

As a result of the above command, SageMaker starts two training jobs for you: the first is the job that produces the tensors to be analyzed, and the second evaluates the rule you requested in `rules_specification`.

Although the training job itself completes, the weight update ratio blows up from step 233 onwards, so the rule execution job that was started for this training job fails.

### Training Job
You can go to the console to find the training job whose name starts with **mxnet-trsl-test-nb** (the `base_job_name` passed to the estimator) or, optionally, make a list call and get the job ARN from there. 

### Accessing the Rule Execution Job
To get the rule execution job that SageMaker started for you, go to the SageMaker console and, under Training Jobs, find the job whose name starts with 'WeightUpdateRatio'. Optionally, you can make a Describe API call on the parent training job and get the job name from the `RuleMonitoringStatus` blob. A rule execution job whose condition was met reports a failure reason like the following:
```
Failure reason
ClientError: RuleEvaluationConditionMet: Rule evaluation resulted in the condition being met
Traceback (most recent call last):
  File "train.py", line 214, in execute
    exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 82, in invoke_rule
    raise e
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 77, in invoke_rule
    rule_obj.invoke(step)
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule.py", line 103, in invoke
    raise RuleEvaluationConditionMet
tornasole.exceptions.RuleEvaluationConditionMet: Rule evaluation resulted in the condition being met
```
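
The failure reason above suggests a simple control flow: the rule is invoked step by step, and an exception is raised as soon as its condition is met. The sketch below mimics that flow with local stand-ins. None of these classes are the real Tornasole implementation, `ThresholdRule` is hypothetical, and treating `end_step` as inclusive is an assumption.

```python
class RuleEvaluationConditionMet(Exception):
    """Local stand-in for tornasole.exceptions.RuleEvaluationConditionMet."""


class ThresholdRule:
    """Hypothetical rule whose condition is met when a step's value exceeds a threshold."""

    def __init__(self, values_by_step, threshold):
        self.values_by_step = values_by_step
        self.threshold = threshold

    def invoke(self, step):
        # Steps with no recorded value are treated as passing.
        value = self.values_by_step.get(step)
        if value is not None and value > self.threshold:
            raise RuleEvaluationConditionMet(
                "Rule evaluation resulted in the condition being met at step {}".format(step)
            )


def invoke_rule(rule, start_step, end_step):
    """Invoke the rule on every step from start_step to end_step (assumed inclusive)."""
    for step in range(start_step, end_step + 1):
        rule.invoke(step)


rule = ThresholdRule({1: 0.2, 2: 0.4, 3: 7.5}, threshold=1.0)
try:
    invoke_rule(rule, start_step=1, end_step=3)
    outcome = "condition not met"
except RuleEvaluationConditionMet as error:
    outcome = str(error)
```

In this sketch the exception propagates out of `invoke_rule`, which is consistent with the rule execution job being reported as failed when the condition is met.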

In [None]:
estimator.describe_rule_execution_jobs()

In [None]:
entry_point_script = '../scripts/mnist_gluon_vg_demo.py'
bad_hyperparameters = {'tornasole_path': '/opt/ml/output/tensors', 'random_seed': True, 'num_steps': 33, 'tornasole_frequency': 30}

In [None]:
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet-trsl-test-nb',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  hyperparameters=bad_hyperparameters,
                  framework_version='1.4.1',
                  debug=True,
                  py_version='py3',
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          "RuntimeConfigurations": {
                              "start-step" : "1",
                              "end-step": "33"
                          }
                      }
                  ])

In [None]:
estimator.fit()

In [None]:
estimator.describe_rule_execution_jobs()