# Debugging SageMaker Training Jobs with Tornasole

## Overview

Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs. 
It lets you go beyond scalars such as losses and accuracies and gives 
you full visibility into all tensors 'flowing through the graph' during training. Tornasole helps you monitor training in near real time using rules, and it alerts you when a rule detects an inconsistency in the training flow.

Using Tornasole is a two-step process: saving tensors and analysis. Let's look at each of them closely.

### Saving tensors

Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library that allows you to capture these tensors and save them for analysis.

### Analysis

Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. At a high level, 
a rule is Python code that detects certain conditions during training. Conditions that a data scientist training a deep learning model may care about include gradients getting too large or too small, overfitting, and so on.
Tornasole comes pre-packaged with certain rules. Users can also write their own rules using the Tornasole APIs.
You can also analyze raw tensor data outside of the Rules construct, for example in a SageMaker notebook, using Tornasole's full set of APIs. 
Refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more details about analysis.
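
To make the idea of a rule concrete, here is a minimal, library-free sketch of the kind of check a vanishing-gradient rule might perform. It does not use the Tornasole APIs; the function names, the threshold value, and the sample tensor names and values are all illustrative assumptions.

```python
# Illustrative only: this is not the Tornasole Rule API.
def mean_abs(values):
    """Mean absolute value of a flat list of numbers."""
    return sum(abs(v) for v in values) / len(values)


def gradients_vanishing(gradients_by_name, threshold=1e-7):
    """Return the names of gradient tensors whose mean magnitude is below threshold.

    gradients_by_name maps a (hypothetical) tensor name to a flat list of values.
    """
    return sorted(
        name
        for name, values in gradients_by_name.items()
        if mean_abs(values) < threshold
    )


# Made-up gradient values for a single training step.
step_gradients = {
    "dense0_weight_grad": [0.02, -0.01, 0.03],
    "dense1_weight_grad": [1e-9, -2e-9, 5e-10],
}
flagged = gradients_vanishing(step_gradients)
print(flagged)
```

A real rule would read the gradients from the saved tensors of a training job rather than from an in-memory dictionary.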

This example guides you through installation of the required components for emitting tensors in a 
SageMaker training job and applying a rule over the tensors to monitor the live status of the job.

## Setup

As a first step, we'll install the tools required to emit tensors (saving tensors) and to apply rules that analyze them.

In [1]:
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz .
!pip install sagemaker-1.35.2.dev0.tar.gz
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole.json .
!aws configure add-model --service-model sagemaker-tornasole.json --service-name sagemaker

Now that we've completed the setup, we're ready to start a training job with debugging enabled.

## Enable Tornasole in the training script

Integrating Tornasole into the training job takes the following steps.

### Import the tornasole_hook package
Import the TornasoleHook class along with other helper classes in your training script as shown below.

```
from tornasole.mxnet.hook import TornasoleHook
from tornasole.mxnet import SaveConfig, Collection
```

### Instantiate and initialize tornasole hook

```
# Create a SaveConfig that instructs the engine to log graph tensors every 10 steps.
save_config = SaveConfig(save_interval=10)
# Create a hook that logs tensors of weights, biases and gradients while training the model.
hook = TornasoleHook(save_config=save_config)
```
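
As a rough mental model of `save_interval`, the sketch below lists the steps at which tensors would be written if the hook saves whenever the step number is a multiple of the interval. That policy and the helper name `steps_saved` are assumptions made for illustration; check the developer guide for the exact semantics.

```python
def steps_saved(num_steps, save_interval):
    """Steps in [0, num_steps) that are multiples of save_interval (assumed policy)."""
    return [step for step in range(num_steps) if step % save_interval == 0]


saved = steps_saved(num_steps=35, save_interval=10)
print(saved)
```

Under this assumption, a rule can only evaluate the steps that were actually saved, so the save interval and the range of steps a rule looks at should be chosen together.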

### Register the Tornasole hook with the model before training starts

**NOTE: The Tornasole hook can only be registered with non-hybrid Gluon models.**

After creating or loading the desired model, register the hook with the model as shown below.

```
net = create_gluon_model()
# Apply the hook to the model (i.e. instruct the engine to recognize the hook
# configuration and enable the mode in which it logs graph tensors).
hook.register_hook(net)
```

#### Set the mode
Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to distinguish the different phases of a job.
Set the mode your job is currently running in, and set it again every time the mode changes. This lets you group steps by mode for easier analysis. Setting the mode is optional but recommended. If you do not specify it, Tornasole saves all steps under a `GLOBAL` mode. 
```
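# `ts` is assumed to be an alias for the Tornasole MXNet package
# (for example, `import tornasole.mxnet as ts`); adjust it to match your imports.
hook.set_mode(ts.modes.TRAIN)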
```

Refer to [DeveloperGuide_MXNet.md](../../DeveloperGuide_MXNet.md) for more details on the APIs Tornasole provides to help you save tensors.


## SageMaker with Tornasole

In this notebook we'll train an MXNet Gluon model on the FashionMNIST dataset with Tornasole enabled and monitor the training jobs with Tornasole's Rules. This will be done using the SageMaker MXNet 1.4.1 container with Script Mode. Note that Tornasole currently only works with Python 3, so be sure to set `py_version='py3'` when creating the SageMaker Estimator below.

#### Storage
The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-<job name>`**. This ensures that the tensors from one training job are not accidentally overwritten by those from another. Rule evaluation requires the tensor paths to be kept separate in order to work correctly.

If you don't provide an S3 output path to the estimator, SageMaker creates one for you as: **`s3://sagemaker-<region>-<account_id>/`**

This path is used to create a Tornasole Trial taken by Rules (see below).
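
The tensors location can be derived from the output path and the job name. The helper below is a sketch of the convention described above; `tensors_s3_path` is a hypothetical name, and the bucket and job name in the usage line are placeholders.

```python
def tensors_s3_path(output_path, job_name):
    """Join the job's S3 output path with the 'tensors-<job name>' folder."""
    return "{}/tensors-{}".format(output_path.rstrip("/"), job_name)


# Placeholder bucket and job name, for illustration only.
path = tensors_s3_path("s3://my-bucket/output/", "mxnet-trsl-test-nb-0001")
print(path)
```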

#### New Parameters 
The new parameters in the SageMaker Estimator to look out for are:

- `debug`: (bool)
This indicates that debugging should be enabled for the training job. 
Setting this to `True` makes Tornasole available for use with the job.

- `rules_specification`: (list[*dict*])
You can specify any number of rules to monitor your SageMaker training job. This parameter takes a list of python dictionaries, one for each rule you want to enable. Each `dict` is of the following form:
```
{
    "RuleName": <str>       
        # The name of the class implementing the Tornasole Rule interface. (required)

    "SourceS3Uri": <str>    
        # S3 URI of the rule script containing the class in 'RuleName'. 
        # This is not required if you want to use one of the First Party rules provided to you by Amazon. 
        # In such a case you can leave it empty or not pass it. If you want to run a custom rule 
        # defined by you, you will need to define the custom rule class in a python 
        # file and provide it to SageMaker as a S3 URI. 
        # SageMaker will fetch this file and try to look for the rule class 
        # identified by RuleName in this file.
    
    "InstanceType": <str>   
        # The ML instance type which should be used to run the rule evaluation job
        
    "VolumeSizeInGB": <int> 
        # The volume size to store the runtime artifacts from the rule evaluation 
        
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and
        # parameters regarding invocation of the rule (start-step and end-step).
        # This can be any parameter taken by the rule. Every value here needs to be a string. 
        # So when you write custom rules, ensure that you can parse each argument from a string.
        # PARAMS CAN BE
        # STANDARD PARAMS FOR RULE EXECUTION
        # "start-step": <str>
        # "end-step": <str>
        # "other-trials-paths": <str> (';' separated list of s3 paths as a string)
        # ANY OTHER PARAMETER TAKEN BY THE RULE
        # "parameter" : <str>
        <str>: <str>
    }
}
```
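
Because every value in `RuntimeConfigurations` must be a string, it can help to check a specification locally before submitting a job. The validator below is a small sketch based only on the field descriptions above; `validate_rule_spec` is a hypothetical helper, not part of the SageMaker SDK, and it checks just two of the constraints.

```python
def validate_rule_spec(spec):
    """Return a list of problems found in one rules_specification entry."""
    problems = []
    rule_name = spec.get("RuleName")
    if not isinstance(rule_name, str) or not rule_name:
        problems.append("RuleName is required and must be a non-empty string")
    runtime = spec.get("RuntimeConfigurations", {})
    for key, value in runtime.items():
        if not isinstance(value, str):
            problems.append(
                "RuntimeConfigurations[{!r}] must be a string, got {}".format(
                    key, type(value).__name__
                )
            )
    return problems


good = {"RuleName": "VanishingGradient", "RuntimeConfigurations": {"end-step": "5"}}
bad = {"RuntimeConfigurations": {"end-step": 5}}
print(validate_rule_spec(good))
print(validate_rule_spec(bad))
```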

### Inputs
A quick reminder in case you are not familiar with script mode in SageMaker: you can pass the command-line arguments taken by your training script through a hyperparameter dictionary, which is passed to the SageMaker Estimator class. You can see this in the examples below.
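
On the script side, those hyperparameters arrive as `--name value` command-line arguments. The sketch below shows one way a training script might parse them with `argparse`; the argument names mirror the hyperparameters used later in this notebook, while the defaults and the `str_to_bool` helper are assumptions for illustration.

```python
import argparse


def str_to_bool(text):
    """Interpret 'True', 'true' or '1' as True; anything else as False."""
    return str(text).lower() in ("true", "1")


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--random_seed", type=str_to_bool, default=False)
    parser.add_argument("--num_steps", type=int, default=100)
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Roughly what the script would receive for
# hyperparameters = {'random_seed': True, 'num_steps': 6}
args = parse_args(["--random_seed", "True", "--num_steps", "6"])
print(args.random_seed, args.num_steps)
```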

### Rules
Rules are the means by which Tornasole regularly executes a piece of code on different steps of the job.
They can be used to assert certain conditions during training and to raise CloudWatch Events based on them, which you can
process in any way you like. 

A Trial in Tornasole's context refers to a training job. It is identified by the path where the saved tensors for the job are stored. A rule takes a `base_trial` which refers to the job whose run invokes the rule execution. A rule can optionally look at other jobs as well, passed using the argument `other_trials`. 

Tornasole comes with a set of **First Party rules** (1P rules).
You can also write your own rules looking at these 1P rules for inspiration. 
Refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more on the APIs you can use to write your own rules as well as descriptions of the 1P rules that we provide. 
 
Here we will talk about how to use SageMaker to evaluate these rules on the training jobs.


##### 1P Rule 
If you want to use a 1P rule, specify the RuleName field with the 1P rule's name, and the rule will be applied automatically. You can pass any parameters accepted by the rule as part of the RuntimeConfigurations dictionary. The argument `base_trial` is set automatically by SageMaker when executing the rule. The parameter `other_trials` (if taken by the rule) can be set by passing `other-trials-paths` in the RuntimeConfigurations dictionary. The value for this argument should be a `;`-separated list of the S3 output paths where the tensors for those trials are stored.

Here's an example of a more complex configuration for the SimilarAcrossRuns rule (which accepts one other trial and a regex pattern), where we ask for the rule to be invoked for the steps between 10 and 100.

``` 
rules_specification = [ 
    {
      "RuleName": "SimilarAcrossRuns",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "other-trials-paths": "s3://sagemaker-<region>-<account_id>/past-job",
         "include_regex": ".*",
         "start-step": "10",
         "end-step": "100"
       }
    }
]
```
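
When a rule compares against several past jobs, the `other-trials-paths` value has to be assembled into a single `;`-separated string. The helpers below are a minimal sketch of that encoding and its inverse; the function names and the example paths are hypothetical.

```python
def join_trial_paths(paths):
    """Encode a list of S3 paths as a single ';'-separated string."""
    for path in paths:
        if ";" in path:
            raise ValueError("S3 path must not contain ';': {}".format(path))
    return ";".join(paths)


def split_trial_paths(value):
    """Decode the ';'-separated string back into a list, dropping empty items."""
    return [path for path in value.split(";") if path]


# Placeholder paths, for illustration only.
past_jobs = ["s3://my-bucket/past-job-1", "s3://my-bucket/past-job-2"]
encoded = join_trial_paths(past_jobs)
print(encoded)
```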

##### Custom rule
In this case you need to define a custom rule class that inherits from the `tornasole.rules.Rule` class.
You need to provide SageMaker with the S3 location of the file that defines your custom rule classes as the value of the field `SourceS3Uri`. Again, you can pass any arguments taken by this rule through the RuntimeConfigurations dictionary. Note that, apart from the two arguments that specify trials, a custom rule can only take arguments whose values are strings. Refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more.

Here's an example:
```
rules_specification = [
    {
      "RuleName": "CustomRule",
      "SourceS3Uri": "s3://weiyou-tornasole-test/rule-script/custom_rule.py",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "threshold" : "0.5"
       }
    }
]
```
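
Since `"threshold"` arrives as the string `"0.5"`, the rule's constructor has to convert it itself. The class below is a stand-in that only illustrates this conversion: it does not inherit from `tornasole.rules.Rule`, and the class name, the `condition_met` method, and its logic are hypothetical.

```python
class ThresholdRuleSketch:
    """Stand-in for a custom rule; a real one would inherit from tornasole.rules.Rule."""

    def __init__(self, base_trial, threshold="0.5"):
        self.base_trial = base_trial
        # RuntimeConfigurations values arrive as strings, so convert explicitly.
        self.threshold = float(threshold)

    def condition_met(self, value):
        """Hypothetical check: True when the observed value exceeds the threshold."""
        return value > self.threshold


runtime_configurations = {"threshold": "0.5"}
rule = ThresholdRuleSketch(base_trial=None, **runtime_configurations)
print(rule.threshold, rule.condition_met(0.7))
```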

## Training MXNet models in SageMaker with Tornasole
Now let us see how to train a model in SageMaker using the SageMaker Estimator with Tornasole enabled, along with a rule to monitor the job. First, let us import the required libraries and set the links to docker images that we will use.

### Docker Images with Tornasole
We have built SageMaker MXNet containers with Tornasole. You can use these images from ECR in SageMaker. Here are the links to the images. Please change the region below to the region you want your jobs to run in.

In [None]:
import sagemaker
from sagemaker.mxnet import MXNet

REGION='us-west-2'
TAG='latest'

docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:{}'.format(REGION, TAG)

### Configuring the inputs for the training job

Now we'll call the SageMaker MXNet Estimator to kick off a training job along with the VanishingGradient rule to monitor the job.

The `entry_point_script` points to the MXNet training script that has the TornasoleHook integrated.

The `hyperparameters` are the parameters that will be passed to the training script.



In [None]:
entry_point_script = '../scripts/mnist_gluon_basic_hook_demo.py'
hyperparameters = {'random_seed' : True,  'num_steps': 6}

In [None]:
estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet-trsl-test-nb',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  hyperparameters=hyperparameters,
                  framework_version='1.4.1',
                  debug=True,
                  py_version='py3',
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          "RuntimeConfigurations": {
                              "end-step": "5"
                          }
                      }
                  ])

By setting `wait=False` when invoking the `fit()` method, we just submit the job to run in the background.
**NOTE: This is a fire-and-forget call; in the background, SageMaker will start 1 training job and 1 rule job
for you.**
Continue with this notebook to see the status of the training job and the rule job.

In [None]:
estimator.fit(wait=False)


### Check the status of the Rule Execution Job

To get the rule execution job that SageMaker started for you, run the command below. It polls the rule job and shows you the `RuleName`, `RuleStatus`, `FailureReason` (if any), and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. 

When the status of the training job or the rule execution job changes (e.g. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted. SageMaker also creates a CloudWatch event rule that monitors the status change of the rule execution job. You can add targets (Lambda function, SNS) for the CloudWatch Event rule to process the events.
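
If you attach a Lambda function as a target, the handler could summarize each event along the following lines. This is a sketch under assumptions: the field names `TrainingJobName`, `TrainingJobStatus` and `FailureReason` follow the shape of SageMaker training job state change events, but whether rule execution job events carry exactly these fields is not confirmed here, and the sample event is made up.

```python
def lambda_handler(event, context):
    """Summarize a job status-change event; the field names are assumptions."""
    detail = event.get("detail", {})
    reason = detail.get("FailureReason") or ""
    return {
        "job": detail.get("TrainingJobName", "unknown"),
        "status": detail.get("TrainingJobStatus", "unknown"),
        "reason": reason,
        # The rule job reports this reason when a rule condition is met.
        "rule_triggered": "RuleEvaluationConditionMet" in reason,
    }


# Made-up event, for illustration only.
sample_event = {
    "detail": {
        "TrainingJobName": "example-rule-job",
        "TrainingJobStatus": "Failed",
        "FailureReason": "ClientError: RuleEvaluationConditionMet",
    }
}
result = lambda_handler(sample_event, None)
print(result)
```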


The next cell will wait and continuously report the status of the Rule execution job. You can stop this cell and proceed to the next cell if you want to look at the logs of the rule execution job.

In [None]:
estimator.describe_rule_execution_jobs()

### Check the logs of the Rule Execution Job

When the rule jobs are completed, you can run the following to show the logs of a particular rule job, using the RuleExecutionJobName from the output of the previous cell.


In [None]:
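from sagemaker.estimator import Estimator
# As written, this attaches to the latest training job and streams its logs; to see a
# rule job's logs, pass the RuleExecutionJobName from the previous cell instead.
rule_execution_job = Estimator.attach(estimator.latest_training_job.name)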

## Example demonstrating the Vanishing Gradient issue

You can create the estimator with the following *entry_point_script* and *bad_hyperparameters* and start a new training job. You will see that the VanishingGradient rule is triggered.

In [None]:
entry_point_script = '../scripts/mnist_gluon_vg_demo.py'
bad_hyperparameters = {'random_seed' : True,  'num_steps': 33, 'tornasole_frequency' : 30}

In [None]:
vg_estimator = MXNet(role=sagemaker.get_execution_role(),
                     base_job_name='mxnet-trsl-test-nb',
                     train_instance_count=1,
                     train_instance_type='ml.m4.xlarge',
                     image_name=docker_image_name,
                     entry_point=entry_point_script,
                     hyperparameters=bad_hyperparameters,
                     framework_version='1.4.1',
                     debug=True,
                     py_version='py3',
                     rules_specification=[
                         {
                             "RuleName": "VanishingGradient",
                             "InstanceType": "ml.c5.4xlarge",
                             "VolumeSizeInGB": 10,
                             "RuntimeConfigurations": {
                                 "start-step": "1",
                                 "end-step": "33"
                             }
                         }
                     ])

In [None]:
vg_estimator.fit(wait=False)

In [None]:
vg_estimator.describe_rule_execution_jobs()