# Debugging SageMaker Training Jobs with Tornasole

## Overview

Tornasole is a new capability of Amazon SageMaker that allows debugging machine learning training. 
It lets you go beyond just looking at scalars like losses and accuracies during training and gives 
you full visibility into all tensors 'flowing through the graph' during training.

Using Tornasole is a two-step process: saving tensors and analysis. Let's look at each of them closely.

### Saving tensors

Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library that lets you capture these tensors and save them for analysis. Tornasole is highly customizable, so you can save the tensors you want at different frequencies. Refer to [DeveloperGuide_PyTorch](../../DeveloperGuide_PyTorch.md) for details on how to save the tensors you want.

### Analysis

Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. At a very broad level, 
a Rule is Python code used to detect certain conditions during training. Some of the conditions that a data scientist training a deep learning model may care about are gradients getting too large or too small, overfitting, and so on.
Tornasole comes pre-packaged with certain rules. You can also write your own rules using the Tornasole APIs.
You can also analyze raw tensor data outside of the Rules construct, say in a SageMaker notebook, using Tornasole's full set of APIs. 
Refer to [DeveloperGuide_Rules](../../../rules/DeveloperGuide_Rules.md) for more details about analysis.
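
To make the idea of a rule concrete, here is a minimal conceptual sketch of the kind of condition a rule might check. It does not use the Tornasole APIs: the function name `vanishing_gradient_check`, the tensor names, and the threshold value are all assumptions made for illustration.

```python
from typing import Dict, List


def vanishing_gradient_check(
    gradients: Dict[str, List[float]], threshold: float = 1e-7
) -> List[str]:
    """Return the names of gradient tensors whose mean absolute value is below threshold."""
    flagged = []
    for name, values in gradients.items():
        if not values:
            continue  # nothing to evaluate for an empty tensor
        mean_abs = sum(abs(v) for v in values) / len(values)
        if mean_abs < threshold:
            flagged.append(name)
    return flagged


# Hypothetical gradient values captured at a single training step.
step_gradients = {
    "gradient/fc1.weight": [0.02, -0.01, 0.03],
    "gradient/fc2.weight": [1e-9, -2e-9, 0.0],
}
flagged = vanishing_gradient_check(step_gradients)
```

A real rule would read these values from the saved tensors of a trial rather than from an in-memory dictionary.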

This example guides you through installation of the required components for emitting tensors in a 
SageMaker training job and applying a rule over the tensors to monitor the live status of the job. 


## Setup

As a first step, we'll install the tools required to emit (save) tensors and to apply rules that analyze them.

In [None]:
!aws s3 sync s3://tornasole-external-preview-use1/ ~/tornasole-preview
!pip -q install ~/tornasole-preview/sdk/sagemaker-tornasole-latest.tar.gz
!aws configure add-model --service-model file://`echo ~/tornasole-preview/sdk/sagemaker-tornasole.json` --service-name sagemaker

Now that we've completed the setup, we're ready to spin off a training job with debugging enabled. 

## Enable Tornasole in the training script

### Import the tornasole_hook package
Import the TornasoleHook class along with other helper classes in your training script as shown below

```
from tornasole.pytorch import SaveConfig, TornasoleHook
```

### Instantiate and initialize tornasole hook

```
# Create a SaveConfig that instructs the engine to log graph tensors every 10 steps.
save_config = SaveConfig(save_interval=10)

# Create a hook that logs tensors of weights, biases and gradients while training the model.
hook = TornasoleHook(save_config=save_config)
```
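
As a rough illustration of what `save_interval=10` means, the sketch below lists which steps would be saved. It assumes steps are counted from 0 and that a step is saved when its number is divisible by the interval. The helper `steps_saved` is hypothetical and is not part of Tornasole.

```python
from typing import List


def steps_saved(total_steps: int, save_interval: int) -> List[int]:
    """List the step numbers that would be saved under the stated assumption."""
    return [step for step in range(total_steps) if step % save_interval == 0]


saved = steps_saved(total_steps=35, save_interval=10)
```

A smaller interval gives finer-grained visibility at the cost of more data written per job.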

For additional details on TornasoleHook, SaveConfig and Collection, please refer to the [API documentation](api.md).

### Register the Tornasole hook with the model before training starts


After creating or loading your desired model, you can register the hook with the model as shown below.

```
net = create_model()
# Apply hook to the model
# and enable mode in which engine will log graph tensors
hook.register_hook(net)
```

#### Set the mode
Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate the different phases of a job. Set the mode your job is currently running in, and set it again every time the mode changes. This lets you group steps by mode for easier analysis. Setting the mode is optional but recommended. If you do not set it, Tornasole saves all steps under a GLOBAL mode.


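```
# `ts` is assumed to be the Tornasole PyTorch package imported under an alias,
# for example `import tornasole.pytorch as ts`; check the developer guide for the exact import.
hook.set_mode(ts.modes.TRAIN)
```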

Refer to [DeveloperGuide_PyTorch.md](../../DeveloperGuide_PyTorch.md) for more details on the APIs Tornasole provides to help you save the tensors in different forms at the frequency you want.

#### Note
Tornasole currently only works for single process training. We will support distributed training very soon. 

## Start Sagemaker training with Tornasole enabled

We'll train a simple PyTorch model using the script [simple.py](../scripts/simple.py).
This is done using the SageMaker PyTorch 1.1.0 container in Script Mode.


In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch


### Inputs
Configure the inputs for the training job. The command-line arguments taken by the script
can be passed using the hyperparameters dictionary below.


In [None]:
entry_point_script = '../scripts/simple.py'
docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'
hyperparameters = {'epochs': 2, 'lr': 0.01, 'momentum': 0.9, 'tornasole-frequency': 3, 'steps': 10, 'hook-type': 'saveall', 'random-seed': True}


#### Storage
The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, 
under the folder **`/tensors-<job name>`**. This ensures that the tensors from one training job are not 
accidentally overwritten by those from another. Rule evaluation requires separate 
tensor paths to work correctly.

If you don't provide an S3 output path to the estimator, SageMaker creates one for you as:
**`s3://sagemaker-<region>-<account_id>/`**


This path is used to create a Tornasole Trial taken by Rules (see below). 
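
The path layout described above can be sketched as follows. The helper names `default_output_path` and `tensors_s3_path` are hypothetical, and the region, account ID and job name in the example are placeholders, not values from a real job.

```python
def default_output_path(region: str, account_id: str) -> str:
    """Default S3 output path SageMaker creates when none is provided."""
    return f"s3://sagemaker-{region}-{account_id}/"


def tensors_s3_path(output_path: str, job_name: str) -> str:
    """Folder under the output path where the tensors of a job are stored by default."""
    return f"{output_path.rstrip('/')}/tensors-{job_name}"


# Placeholder values for illustration only.
output_path = default_output_path("us-west-2", "123456789012")
trial_path = tensors_s3_path(output_path, "my-training-job")
```

Because the job name is part of the folder name, two jobs sharing one output path still write to separate tensor folders.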

#### New Parameters
The new parameters in the SageMaker Estimator to look out for are:

##### `debug` (bool)
This indicates that debugging should be enabled for the training job. 
Setting this to `True` makes Tornasole available for use with the job.

##### `rules_specification` (list[*dict*])
This is a list of python dictionaries, where each `dict` is of the following form:
```
{
    "RuleName": <str>,       # Name of the class implementing the Tornasole Rule interface. (required)
    "SourceS3Uri": <str>,    # S3 URI of the rule script containing the class named in 'RuleName'.
                             # If left empty, SageMaker looks for the class among the first party rules
                             # provided by Amazon. Otherwise, it looks for the rule class in this script.
    "InstanceType": <str>,   # The ml instance type on which the rule evaluation should run
    "VolumeSizeInGB": <int>, # The volume size for storing runtime artifacts from the rule evaluation
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and
        # parameters regarding invocation of the rule (start-step and end-step).
        # This can be any parameter taken by the rule.
        <str>: <str>
    }
}
```
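
Since the schema above is easy to get subtly wrong (for example, by passing an integer where a string is expected), a small local check before submitting the job may help. This is a sketch only: `validate_rule_spec` is a hypothetical helper, not part of the SageMaker SDK. It also assumes that only `RuleName` is required, since that is the only field marked as such above.

```python
from typing import List

REQUIRED_KEYS = {"RuleName"}
OPTIONAL_KEYS = {"SourceS3Uri", "InstanceType", "VolumeSizeInGB", "RuntimeConfigurations"}


def validate_rule_spec(spec: dict) -> List[str]:
    """Return a list of human-readable problems found in one rule specification."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(spec)):
        problems.append(f"missing required key: {key}")
    for key in sorted(set(spec) - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unknown key: {key}")
    if "VolumeSizeInGB" in spec and not isinstance(spec["VolumeSizeInGB"], int):
        problems.append("VolumeSizeInGB must be an int")
    # Every runtime configuration entry must map a string to a string.
    for name, value in spec.get("RuntimeConfigurations", {}).items():
        if not isinstance(name, str) or not isinstance(value, str):
            problems.append(f"RuntimeConfigurations[{name!r}] must map str to str")
    return problems


good_spec = {
    "RuleName": "VanishingGradient",
    "InstanceType": "ml.c5.4xlarge",
    "VolumeSizeInGB": 10,
    "RuntimeConfigurations": {"start-step": "1", "end-step": "50"},
}
bad_spec = {
    "InstanceType": "ml.c5.4xlarge",
    "RuntimeConfigurations": {"start-step": 1},
}
good_problems = validate_rule_spec(good_spec)
bad_problems = validate_rule_spec(bad_spec)
```

Here `good_problems` is empty, while `bad_problems` reports the missing `RuleName` and the non-string `start-step` value.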

### Rules
Rules are the means by which Tornasole regularly executes a piece of code on different steps of the job.
They can be used to assert certain conditions during training and raise CloudWatch Events based on them, which you can
process in any way you like. 

A Trial in Tornasole's context
refers to a training job. It is identified by the path where the saved tensors for the job are stored. 
A rule takes a `base_trial` which refers to the job whose run invokes the rule execution.
A rule can optionally look at other jobs as well, passed using the argument `other_trials`. 

Tornasole comes with a set of first party rules (1P rules).
You can also write your own rules looking at these 1P rules for inspiration. 
Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more.
 
Here we will talk about how to use SageMaker to evaluate these rules on the training jobs.
##### 1P Rule 
If you want to use a 1P rule, specify the RuleName field with the 1P rule name, 
and the rule will be automatically applied. You can pass any parameters accepted by the 
rule as part of the RuntimeConfigurations dictionary. The arguments `base_trial` (and `other_trials` if 
taken by the rule) can be passed as the S3 path where the tensors for 
the trial are stored in the RuntimeConfigurations dictionary above.

Here's an example of a more complex configuration for the SimilarAcrossRuns rule (which accepts another trial and a regex pattern), 
where we ask for the rule to be invoked for the steps between 10 and 100.

``` 
rules_specification = [
    {
      "RuleName": "SimilarAcrossRuns",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "other_trials": "s3://sagemaker-<region>-<account_id>/past-job",
         "include_regex": ".*",
         "start-step": "10",
         "end-step": "100"
       }
    }
]
```

##### Custom rule
In this case you need to define a custom rule class that inherits from the `tornasole.rules.Rule` class.
You need to provide SageMaker with the S3 location of the file that defines your custom rule classes as the value of the field `SourceS3Uri`.
Again, you can pass any arguments taken by this rule through the RuntimeConfigurations dictionary. 
Note that custom rules can only have arguments that expect a string value, except for the two arguments 
specifying trials to the Rule. Refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md) for more.

Here's an example:
```
rules_specification = [
    {
      "RuleName": "CustomRule",
      "SourceS3Uri": "s3://weiyou-tornasole-test/rule-script/custom_rule.py",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "threshold" : "0.5"
       }
    }
]
```
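
Because RuntimeConfigurations values arrive as strings, a custom rule typically has to convert them itself. The sketch below shows that conversion only. The local `Rule` class is a stand-in for `tornasole.rules.Rule` so that the snippet runs on its own, and its constructor signature, the `condition_met` helper, and the placeholder trial path are assumptions, not the real interface. See the developer guide for the actual methods a rule must implement.

```python
class Rule:
    """Local stand-in for tornasole.rules.Rule (hypothetical, for illustration only)."""

    def __init__(self, base_trial, other_trials=None):
        self.base_trial = base_trial
        self.other_trials = other_trials


class CustomRule(Rule):
    def __init__(self, base_trial, threshold="0.5"):
        super().__init__(base_trial)
        # The value is passed as a string, so convert it explicitly.
        self.threshold = float(threshold)

    def condition_met(self, value: float) -> bool:
        """Hypothetical helper: True when the observed value exceeds the threshold."""
        return value > self.threshold


# The trial path below is a placeholder, not a real location.
rule = CustomRule(base_trial="s3://<bucket>/tensors-<job name>", threshold="0.5")
```

Converting in the constructor keeps the rest of the rule free of string handling and fails early if a value cannot be parsed.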

### Estimator
Now we'll call the SageMaker PyTorch Estimator to kick off a training job along with a rule to monitor the job.

For the purposes of this demonstration, let us use the simple.py script with the hyperparameters dictionary above.
These good hyperparameters do not produce vanishing gradients, so you will see that the rule does not fire.


### Training Example Without Vanishing Gradients 

In [1]:
sagemaker_execution_role = sagemaker.get_execution_role()
#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'
estimator = PyTorch(role=sagemaker_execution_role,
                  base_job_name='pytorch-good-example',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  framework_version='1.1.0',
                  hyperparameters=hyperparameters,
                  py_version='py3',
                  debug=True,
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          "RuntimeConfigurations": {
                              "start-step": "1",
                              "end-step": "50"
                          }
                      }
                  ])

In [None]:
estimator.fit()

## Result

As a result of the above command, SageMaker starts two training jobs for you: the first produces the tensors to be analyzed, and the second evaluates the rule you specified in `rules_specification`.

### Check the status of the Rule Execution Job
To get the rule execution job that SageMaker started for you, run the command below. It shows you the `RuleName`, `RuleStatus`, `FailureReason` (if any), and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. You can check the CloudWatch logs under `/aws/sagemaker/TrainingJobs` using the `RuleExecutionJobArn`.

In [None]:
estimator.describe_rule_execution_jobs()

### Receive CloudWatch Events for Your Jobs
When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus CloudWatch events are emitted: https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html. You can configure a CloudWatch event rule to receive and process these events by setting up a target (Lambda function, SNS). 
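
As one possible target, a Lambda function could summarize each status change. The handler below is a minimal sketch that assumes the standard SageMaker training job state change event shape (`source`, `detail-type`, and a `detail` map with `TrainingJobName`, `TrainingJobStatus` and `FailureReason`). The sample event is made up for illustration, and treating rule execution jobs as regular training job events follows the description above rather than a confirmed contract.

```python
def lambda_handler(event, context):
    """Summarize a SageMaker training job state change event."""
    if event.get("source") != "aws.sagemaker":
        return None
    if event.get("detail-type") != "SageMaker Training Job State Change":
        return None
    detail = event.get("detail", {})
    summary = {
        "job_name": detail.get("TrainingJobName"),
        "status": detail.get("TrainingJobStatus"),
        "failure_reason": detail.get("FailureReason"),
    }
    print(summary)  # shows up in the function's CloudWatch logs
    return summary


# Made-up event for local testing only.
sample_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Training Job State Change",
    "detail": {
        "TrainingJobName": "example-rule-job",
        "TrainingJobStatus": "Failed",
        "FailureReason": "RuleEvaluationConditionMet",
    },
}
result = lambda_handler(sample_event, None)
```

From here the handler could, for instance, stop the training job or send a notification when the failure reason indicates that a rule condition was met.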


### Training Example With Vanishing Gradients 

Now let us change the hyperparameters dictionary to the bad set of hyperparameters below, which produces vanishing gradients.


In [None]:
entry_point_script = '../scripts/simple.py'
docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'
bad_hyperparameters = {'epochs': 2, 'lr': 1.0, 'momentum': 0.9, 'tornasole-frequency': 3, 'steps': 10, 'hook-type': 'saveall', 'random-seed': True}


In [None]:
sagemaker_execution_role = sagemaker.get_execution_role()
#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'
estimator = PyTorch(role=sagemaker_execution_role,
                  base_job_name='pytorch-bad-example',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  framework_version='1.1.0',
                  hyperparameters=bad_hyperparameters,
                  py_version='py3',
                  debug=True,
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          "RuntimeConfigurations": {
                              "start-step": "1",
                              "end-step": "10"
                          }
                      }
                  ])

In [None]:
estimator.fit()

In [None]:
estimator.describe_rule_execution_jobs()