# Debugging SageMaker Training Jobs with Tornasole 
## Writing Custom Rules

## Overview
Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs. 
It lets you go beyond scalars such as losses and accuracies and gives 
you full visibility into all tensors 'flowing through the graph' during training. Tornasole helps you monitor training in near real time using rules, and alerts you once it detects an inconsistency in the training flow. 

Using Tornasole is a two-step process: saving tensors and analysis. Let's look at each of them closely.

### Saving tensors
Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis. Tornasole is highly customizable and can save the tensors you want at different frequencies. Refer to [DeveloperGuide_TensorFlow](../../DeveloperGuide_TF.md) for details on how to choose which tensors to save.

### Analysis

Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. At a high level, 
a rule is Python code that detects certain conditions during training. Conditions that a data scientist training a deep learning model may care about include gradients getting too large or too small, overfitting, and so on.
Tornasole comes pre-packaged with certain rules. Users can also write their own rules using the Tornasole APIs.
You can also analyze raw tensor data outside of the Rules construct, for example in a SageMaker notebook, using Tornasole's full set of APIs. 
Please refer to the [Developer Guide for Rules](../../../../rules/DeveloperGuide_Rules.md) for more details about analysis.

This example guides you through installation of the required components for emitting tensors in a 
SageMaker training job and applying a rule over the tensors to monitor the live status of the job. 


## Setup

As a first step, we install the tools required to emit (save) tensors and to apply rules that analyze them. This step is needed only for this private beta. Once it completes, we are ready to use Tornasole.

In [1]:
! aws s3 sync s3://tornasole-external-preview-use1/sdk/ ~/SageMaker/tornasole-preview-sdk/
! chmod +x ~/SageMaker/tornasole-preview-sdk/installer.sh && ~/SageMaker/tornasole-preview-sdk/installer.sh

Installing requirements...
You are using pip version 10.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Installation completed!



## Using custom Tornasole rules with SageMaker 

This notebook assumes that you have gone through at least one notebook demonstrating how to train models
in SageMaker with Tornasole using your framework of choice. That notebook demonstrates the 
changes you need to make in your training script to enable Tornasole, how to start a training job 
along with a rule execution job, and how to look at the status of these jobs.

In this notebook we focus on how to write a custom Tornasole rule and how to 
execute it in SageMaker. To make this notebook runnable, we use a TensorFlow script as the training job.
Whatever framework or script you use, rule behavior is similar. 

### Start training with a custom rule

#### Configuring the inputs for the training job
Set the docker image to the SageMaker TensorFlow container that we have built with Tornasole pre-installed, for the region you are in. 

In [3]:
import sagemaker
import boto3
from sagemaker.tensorflow import TensorFlow

REGION = boto3.Session().region_name
TAG = 'latest'

docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:{}'.format(REGION, TAG)


Let us now set `entry_point_script` to the simple TensorFlow training script that has TornasoleHook integrated.
The `hyperparameters` below are passed to the training script as command line arguments in SageMaker's script mode.

In [5]:
entry_point_script = '../../frameworks/tensorflow/examples/scripts/simple.py'
hyperparameters = { 'steps': 1000000, 'tornasole_frequency': 50 }

#### Configuring custom rule
We have written an example custom rule `CustomGradientRule`, available [here](../scripts/my_custom_rule.py). We need to upload this to a bucket in the same region where we want to run the job. We have chosen a default bucket below. Please change it to the bucket you want. We will now create this bucket if it does not exist, and upload this file. 
We will then specify this path when starting the job as `SourceS3Uri`.

In [17]:
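import os

ACCOUNT_ID = boto3.client('sts').get_caller_identity().get('Account')
BUCKET = f'tornasole-resources-{ACCOUNT_ID}-{REGION}'

CUSTOM_RULE_PATH = '../scripts/my_custom_rule.py'

# S3 key under which the rule script is uploaded
PREFIX = os.path.join('rules', os.path.basename(CUSTOM_RULE_PATH))

s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET)
# creation_date is None when the bucket is not found among this account's buckets
if not bucket.creation_date:
    if REGION == 'us-east-1':
        # us-east-1 is the default location and rejects an explicit LocationConstraint
        s3.create_bucket(Bucket=BUCKET)
    else:
        s3.create_bucket(Bucket=BUCKET, CreateBucketConfiguration={'LocationConstraint': REGION})
# Upload the rule script and close the file handle afterwards
with open(CUSTOM_RULE_PATH, 'rb') as rule_file:
    s3.Object(BUCKET, PREFIX).put(Body=rule_file)
SOURCE_S3_URI = f's3://{BUCKET}/{PREFIX}'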

Keep in mind that for SageMaker to be able to evaluate your rule, the rule class **will need** to have a signature conforming to the spec defined by Tornasole. 

The custom rule that we have written takes the arguments `self`, `base_trial` and `threshold`. 
To initialize a custom rule class, you need to pass values for everything except `self` and `base_trial`. 
This is done by putting the parameters and their values as a string-to-string map in `RuntimeConfigurations`, inside the `rules_specification` parameter of the SageMaker Estimator.

After running this example, we will look at these concepts in more detail later in this notebook.

In [18]:
estimator = TensorFlow(role=sagemaker.get_execution_role(),
                  base_job_name='tensorflow-custom-rule-tornasole',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  hyperparameters=hyperparameters,
                  framework_version='1.13.1',
                  debug=True,
                  py_version='py3',
                  rules_specification=[
                      {
                          "RuleName": "CustomGradientRule",
                          "SourceS3Uri": SOURCE_S3_URI,
                          "InstanceType": "ml.c5.4xlarge",
                          "RuntimeConfigurations": {
                              "threshold" : "0.5"
                          }
                      }
                  ])

To kick off the job, we call the `fit()` method on the SageMaker TensorFlow estimator.

In [19]:
# setting wait as True will cause the logs to be streamed in the notebook directly,
# in order to proceed to further cells you'll need to stop cell execution. So, 
# we set wait to False for demonstration purposes.
estimator.fit(wait=False)

### Result
As a result of the above command, SageMaker starts two jobs for you: the training job, which produces the tensors to be analyzed, and a rule execution job, which evaluates the rule you specified in `rules_specification`.
#### Check the status of the Rule Execution Job
To get the rule execution job that SageMaker started for you, run the command below. It shows the `RuleName`, `RuleStatus`, and `FailureReason` and, if the rule job has started, the `RuleJobName` and `RuleExecutionJobArn`. 
If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. 
You can check the CloudWatch log stream `/aws/sagemaker/TrainingJobs` with `RuleExecutionJobArn`.

Depending on how your tensors are emitted and how your custom rule reacts to the script, your rule evaluation job will either fail or succeed. 
You can get the rule evaluation statuses of the jobs through the following mechanism. This function continues to poll until the rule execution jobs end. To proceed with the notebook, please stop the cell after `RuleStatus` changes to `InProgress`. At this point, you should see `RuleExecutionJobName`. It is needed to execute the next cell of code, where we attach to the rule execution job to see its logs.

In [23]:
rule_description = estimator.describe_rule_execution_jobs()

Wait to get status for Rule Execution Jobs...
RuleName: CustomGradientRule
RuleStatus: RuleExecutionError
FailureReason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met
Traceback (most recent call last):
  File "train.py", line 214, in execute
    exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 84, in invoke_rule
    raise e
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 79, in invoke_rule
    rule_obj.invoke(step)
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule.py", line 56, in invoke
    raise RuleEvaluationConditionMet(self.rule_name, step)
tornasole.exceptions.RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met



#### Check the logs of the Rule Execution Job
If you want to access the logs of a particular rule job, you can do the following. First, you need to get the rule job name (`RuleExecutionJobArn` field from the training job description). Note that this is only available after the rule job reaches the Started stage. Hence the next cell waits until the job name is available.

Now we can attach to this job to see its logs.

In [24]:
from sagemaker.estimator import Estimator
rule_job_name = rule_description[0]['RuleExecutionJobName']
exploding_tensor = Estimator.attach(rule_job_name)

2019-08-29 22:40:05 Starting - Preparing the instances for training
2019-08-29 22:40:05 Downloading - Downloading input data
2019-08-29 22:40:05 Training - Training image download completed. Training in progress.
2019-08-29 22:40:05 Uploading - Uploading generated training model
2019-08-29 22:40:05 Failed - Training job failed
[2019-08-29 22:39:24.434 ip-10-0-174-72.us-west-2.compute.internal:1 INFO s3_trial.py:27] Loading trial base-trial at path s3://sagemaker-us-west-2-072677473360/tensors-tensorflow-custom-rule-tornasole-2019-08-29-22-32-10-697
[2019-08-29 22:39:56.413 ip-10-0-174-72.us-west-2.compute.internal:1 INFO rule_invoker.py:76] Started execution of rule CustomGradientRule at step 0
Exception during rule execution: Customer Error: RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met
Traceback (most recent call last):
  File "train.py", line 214, in execute
    exec(_SYMBOLIC_INVOKE

UnexpectedStatusException: Error for Training job CustomGradientRule-75136a67d770aaea32adbb833fb2ee4f: Failed. Reason: ClientError: RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met
Traceback (most recent call last):
  File "train.py", line 214, in execute
    exec(_SYMBOLIC_INVOKE_RULE.format(self.start_step, self.end_step), globals(), exec_local)
  File "<string>", line 2, in <module>
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 84, in invoke_rule
    raise e
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule_invoker.py", line 79, in invoke_rule
    rule_obj.invoke(step)
  File "/usr/local/lib/python3.7/site-packages/tornasole/rules/rule.py", line 56, in invoke
    raise RuleEvaluationConditionMet(self.rule_name, step)
tornasole.exceptions.RuleEvaluationConditionMet: Evaluation of the rule CustomGradientRule at step 0 resulted in the condition being met



## Tornasole Rules Explained in depth
Let us now walk through some of Tornasole's concepts which will be helpful to understand how rules are executed in SageMaker
and how custom rules work. 

### Trial
A Trial in Tornasole's context refers to a training job. 
It is identified by the path where the saved tensors for the job are stored. 

### Rules
Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the job.
They can be used to assert certain conditions during training and to raise CloudWatch Events based on them, which you can
then process in any way you like. 

These are defined by the class `tornasole.rules.Rule`. A rule takes a `base_trial` which refers to the job whose run invokes the rule execution. 
A rule can optionally look at other jobs as well, passed using the argument `other_trials`.

Tornasole comes with a set of **First Party rules** (1P rules).
You can also write your own rules, using these 1P rules for inspiration.
Refer to the [Developer Guide for Rules](../../DeveloperGuide_Rules.md) for more on the 
APIs you can use to write your own rules, as well as descriptions of the 1P rules that we provide.

### Storage
The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-<job name>`**. 
This ensures that the tensors from one training job are not accidentally overwritten by those from another. 
Rule evaluation requires each job's tensors to live under a separate path in order to work correctly.
If you don't provide an S3 output path to the estimator, SageMaker creates one for you as: **`s3://sagemaker-<region>-<account_id>/`**
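
As a rough illustration of the layout described above, the sketch below assembles the default tensors location from a region, an account ID and a job name. The helper `default_tensors_path` is hypothetical (it is not part of Tornasole or the SageMaker SDK) and only mirrors the naming pattern stated in this section; the account ID and job name are placeholders.

```python
def default_tensors_path(region, account_id, job_name, output_path=None):
    """Return the S3 prefix where a job's tensors would be stored by default.

    Hypothetical helper: it only reproduces the naming pattern described
    in this section and does not talk to S3 or SageMaker.
    """
    if output_path is None:
        # Default output location used when no S3 output path is given
        output_path = f"s3://sagemaker-{region}-{account_id}/"
    return output_path.rstrip("/") + f"/tensors-{job_name}"


path = default_tensors_path("us-west-2", "123456789012", "my-training-job")
print(path)
```

Because the job name is part of the prefix, two jobs that share an output path still write their tensors to different folders.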

### Using Tornasole Rules in SageMaker 
Here we will talk about how to use SageMaker to evaluate these rules on the training jobs. 
The new parameters in the SageMaker Estimator to look out for are:

- `debug`: (bool)
This indicates that debugging should be enabled for the training job. 
Setting this to `True` makes Tornasole available for use with the job.

- `rules_specification`: (list[*dict*])
You can specify any number of rules to monitor your SageMaker training job. 
This parameter takes a list of Python dictionaries, one for each rule you want to enable. 
Each `dict` is of the following form:
```
{
    "RuleName": <str>       
        # The name of the class implementing the Tornasole Rule interface. (required)

    "SourceS3Uri": <str>    
        # S3 URI of the rule script containing the class in 'RuleName'. 
        # This is not required if you want to use one of the
        # First Party rules provided to you by Amazon. 
        # In such a case you can leave it empty or not pass it. If you want to run a custom rule 
        # defined by you, you will need to define the custom rule class in a python 
        # file and provide it to SageMaker as a S3 URI. 
        # SageMaker will fetch this file and try to look for the rule class 
        # identified by RuleName in this file.
    
    "InstanceType": <str>   
        # The ML instance type which should be used to run the rule evaluation job
        
    "VolumeSizeInGB": <int> 
        # The volume size to store the runtime artifacts from the rule evaluation 
        
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and
        # parameters regarding invocation of the rule (start-step and end-step)
        # This can be any parameter taken by the rule. 
        # Every value here needs to be a string. 
        # So when you write custom rules, ensure that you can parse each argument from a string.
        #
        # PARAMS CAN BE
        #
        # STANDARD PARAMS FOR RULE EXECUTION
        # "start-step": <str>
        # "end-step": <str>
        # "other-trials-paths": <str> (';' separated list of s3 paths as a string)
        # "logging-level": <str> (can be one of "CRITICAL", "FATAL", "ERROR", 
        #                         "WARNING", "WARN", "DEBUG", "NOTSET")
        #
        # ANY PARAMETER TAKEN BY THE RULE other than `base_trial` and `other_trials` 
        # "parameter" : "value"
        # <str>: <str>
    }
}
```
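
Since every value in `RuntimeConfigurations` must be a string, it can help to check a specification locally before starting a job. The following is a minimal sketch of such a check, based only on the field descriptions above. The function `validate_rule_spec` is hypothetical and not part of the SageMaker SDK, which may apply different or additional validation.

```python
def validate_rule_spec(spec):
    """Return a list of problems found in one rules_specification entry.

    Hypothetical local check based on the field descriptions above.
    """
    problems = []
    rule_name = spec.get("RuleName")
    if not isinstance(rule_name, str) or not rule_name:
        problems.append("RuleName is required and must be a non-empty string")
    for key in ("SourceS3Uri", "InstanceType"):
        if key in spec and not isinstance(spec[key], str):
            problems.append(f"{key} must be a string")
    if "VolumeSizeInGB" in spec and not isinstance(spec["VolumeSizeInGB"], int):
        problems.append("VolumeSizeInGB must be an int")
    for name, value in spec.get("RuntimeConfigurations", {}).items():
        if not isinstance(name, str) or not isinstance(value, str):
            problems.append(
                f"RuntimeConfigurations[{name!r}] must map a string to a string"
            )
    return problems


good_spec = {
    "RuleName": "CustomGradientRule",
    "InstanceType": "ml.c5.4xlarge",
    "RuntimeConfigurations": {"threshold": "0.5"},
}
bad_spec = {
    "RuleName": "CustomGradientRule",
    "RuntimeConfigurations": {"threshold": 0.5},  # float instead of string
}
print(validate_rule_spec(good_spec))
print(validate_rule_spec(bad_spec))
```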


### CloudWatch Event Integration for Rules
When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted.  

You can configure a CloudWatch event rule to receive and process these events by setting up a target (Lambda function, SNS) as follows:

- Configure the [SageMaker TrainingJobStatus CW event](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#sagemaker_event_types) to include rule job statuses associated with the training job
- Configure the CW event to be emitted when a RuleStatus changes
- Create a CloudWatch event rule that monitors the training job that the customer started
- Set a target (Lambda function, SQS) for the CloudWatch event rule that processes the event and triggers an alarm for the customer based on the RuleStatus. 

Refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) for more details. 
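
To make the target step more concrete, here is a minimal sketch of a Lambda-style handler for a training job state change event. It assumes the standard SageMaker event shape (`source` of `aws.sagemaker`, `detail-type` of `SageMaker Training Job State Change`) and that the event `detail` carries `TrainingJobName`, `TrainingJobStatus` and, for failed jobs, `FailureReason`. How rule statuses appear in events during this beta is not covered here, so the handler only inspects the failure reason of the rule execution job. The sample event is made up for illustration.

```python
import json

# Event pattern matching SageMaker training job state changes
EVENT_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
}


def handler(event, context=None):
    """Summarize a training job state change event.

    Assumption: a rule execution job whose condition was met fails with a
    FailureReason containing 'RuleEvaluationConditionMet'.
    """
    detail = event.get("detail", {})
    status = detail.get("TrainingJobStatus", "<unknown>")
    reason = detail.get("FailureReason", "")
    return {
        "job": detail.get("TrainingJobName", "<unknown>"),
        "status": status,
        "rule_condition_met": status == "Failed"
        and "RuleEvaluationConditionMet" in reason,
    }


# Made-up sample event for local testing
sample_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Training Job State Change",
    "detail": {
        "TrainingJobName": "CustomGradientRule-example",
        "TrainingJobStatus": "Failed",
        "FailureReason": "ClientError: RuleEvaluationConditionMet: ...",
    },
}
print(json.dumps(handler(sample_event)))
```

In a real deployment the handler would typically forward the summary to a notification channel instead of returning it.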

### Writing a custom rule

Implementing a custom rule involves implementing the Rule interface that Tornasole provides.
Let us go through the exercise of writing a rule which checks whether gradients are very high.

#### Constructor
Creating a rule involves first inheriting from the base Rule class Tornasole provides: `tornasole.rules.Rule`

Every rule is required to take the argument `base_trial`, which represents the Trial object for the job whose execution 
invokes this rule. In addition, you might want to pass `other_trials`, which represents a
list of Trial objects for other jobs, if you want your custom rule to look at other jobs for comparison. 
For this rule we do not need to look at any other trials, so we set `other_trials` to None.

```python
from tornasole.rules import Rule

class CustomGradientRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial, other_trials=None)
        self.threshold = float(threshold)
```

Please note that apart from `base_trial` and `other_trials` (if required), we require all 
arguments of the rule constructor to take a string as value. You can parse each string into the type
that you want. This means that if you want to pass
a list of strings, you might want to pass them as a comma-separated string. This restriction is
enforced so that you can create and invoke rules from JSON using SageMaker's APIs.   
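
The sketch below illustrates this string-parsing convention without depending on Tornasole. `ExampleRuleArgs` is a stand-in for a rule constructor, and `parse_str_list` and `parse_bool` are hypothetical helpers; a real rule would inherit from `tornasole.rules.Rule` and also take `base_trial`.

```python
def parse_str_list(value, sep=","):
    """Split a separator-delimited string into stripped, non-empty items."""
    return [item.strip() for item in value.split(sep) if item.strip()]


def parse_bool(value):
    """Interpret common textual spellings of a true value."""
    return value.strip().lower() in ("true", "1", "yes")


class ExampleRuleArgs:
    """Stand-in for a rule constructor whose arguments all arrive as strings."""

    def __init__(self, threshold="10.0", collections="gradients", verbose="false"):
        self.threshold = float(threshold)                # "0.5" -> 0.5
        self.collections = parse_str_list(collections)   # "a, b" -> ["a", "b"]
        self.verbose = parse_bool(verbose)               # "true" -> True


args = ExampleRuleArgs(threshold="0.5", collections="gradients, weights")
print(args.threshold, args.collections, args.verbose)
```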

#### Function to invoke at a given step
When a rule is executed, it is invoked at each step. We need to now define what to do when the rule is invoked at a given step, `step`.
In this function you can implement the core logic of what you want to do with your selection of tensors. If your custom rule 
has access to other trials, you can access tensors from other trials as well.

This function should return a boolean value `True` or `False`. When `True` is returned,
SageMaker will raise the exception `RuleEvaluationConditionMet`. This will also create a CloudWatch Event which can be used to configure your chosen action. 

The invoke function for `CustomGradientRule`, which checks whether tensors have large gradients, could look like this:
```python
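    def invoke_at_step(self, step):
        # tensors_in_collection returns tensor names, so look up each Tensor
        # object by name (assumes the Trial.tensor lookup described in the
        # Developer Guide for Rules)
        for tname in self.base_trial.tensors_in_collection('gradients'):
            tensor = self.base_trial.tensor(tname)
            abs_mean = tensor.reduction_value(step, 'mean', abs=True)
            if abs_mean > self.threshold:
                return True
        return False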
```
Here, we access the names of the tensors in the `gradients` collection by using the method `tensors_in_collection`. 
You can see the full API that Trial provides to get tensors in our [Developer Guide For Rules](../../DeveloperGuide_Rules.md).
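
To see the thresholding logic in isolation, the sketch below runs the same check against in-memory stand-ins instead of a real trial. `FakeTensor` is a hypothetical class that only imitates the `reduction_value(step, 'mean', abs=True)` call used above; the tensor names and values are made up.

```python
import math
from statistics import mean


class FakeTensor:
    """In-memory stand-in for a Tornasole Tensor (illustration only)."""

    def __init__(self, name, values_by_step):
        self.name = name
        self.values_by_step = values_by_step  # {step: [float, ...]}

    def reduction_value(self, step, reduction_name, abs=False):
        if reduction_name != "mean":
            raise ValueError("this sketch only supports the 'mean' reduction")
        values = self.values_by_step[step]
        if abs:
            # math.fabs is used because the parameter shadows the built-in abs
            values = [math.fabs(v) for v in values]
        return mean(values)


def gradients_too_large(tensors, step, threshold):
    """Same condition as invoke_at_step: any mean absolute value above threshold."""
    return any(
        t.reduction_value(step, "mean", abs=True) > threshold for t in tensors
    )


tensors = [
    FakeTensor("gradients/dense/kernel", {0: [0.1, -0.2, 0.3], 1: [4.0, -6.0, 5.0]}),
    FakeTensor("gradients/dense/bias", {0: [0.05, -0.05], 1: [0.2, -0.1]}),
]
print(gradients_too_large(tensors, step=0, threshold=0.5))  # False
print(gradients_too_large(tensors, step=1, threshold=0.5))  # True
```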

#### Optional: RequiredTensors
RequiredTensors is an optional construct that allows Tornasole to bulk-fetch all tensors that you need to 
execute the rule. This makes rule invocation faster, because tensor values are not fetched from S3 one by one. 

##### RequiredTensors API 
This is a class whose object is provided as a member of the rule class, so you can access it as `self.req_tensors`. 
Its full API is described in our [Developer Guide For Rules](../../DeveloperGuide_Rules.md). 
In short, it has the following methods:
```python
# Add name of required tensor for a particular trial at given steps 
self.req_tensors.add(name=tname, steps=[step_num], trial=None, should_match_regex=False)

# If required tensors were added inside `set_required_tensors`, during rule invocation it is
# automatically used to fetch all tensors at once by calling `req_tensors.fetch()`
# If required tensors were added elsewhere, or later, you can call the `req_tensors.fetch()` method 
# yourself to fetch all tensors at once.
self.req_tensors.fetch()

# This method returns the names of the required tensors for a given trial
self.req_tensors.get_names(trial=None)

# This method returns the steps for which the tensor is required to execute the rule at this step.
self.req_tensors.get_tensor_steps(trial=None)

# This method returns the list of required tensors for a given trial as `Tensor` objects
self.req_tensors.get(trial=None)
``` 

##### Declare required tensors
To use this construct, you need to implement a method which lets Tornasole know what tensors you are interested in for invocation at a given step. 
This is the `set_required_tensors` method.

```python
def set_required_tensors(self, step):
    for tname in self.base_trial.tensors_in_collection('gradients'):
        self.req_tensors.add(tname, steps=[step])
```
##### Accessing required tensors
Since we defined required tensors in the `set_required_tensors` method, these will have been
pre-fetched when invoking the rule at a given step. You can continue to access the tensors as before.

If you do not want to determine again which tensors to process, you can also just call
`self.req_tensors.get()` to get them. In that case, the function would look like this:  

```python
def invoke_at_step(self, step):
    for tensor in self.req_tensors.get():
        abs_mean = tensor.reduction_value(step, 'mean', abs=True)
        if abs_mean > self.threshold:
            return True
    return False
```

### Executing the custom rule

You now need to provide SageMaker with the S3 location of the file that defines your custom rule classes, as the value of the field `SourceS3Uri`. 

From above, our rule constructor takes the arguments `base_trial` and `threshold`. The `base_trial` argument will automatically be passed by SageMaker Rule Executor. The other arguments need to be passed through the RuntimeConfigurations dictionary as a mapping from string to string. 

If the custom rule took `other_trials`, which represents a list of Trial objects for other jobs that the rule is interested in, these can be supplied through the argument `other-trials-paths`, which needs to be in the form `s3_path_other_trial_1;s3_path_other_trial_2`.
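
As a small illustration of that format, the sketch below builds and splits the `;`-separated value. The helpers `join_trial_paths` and `split_trial_paths` are hypothetical, and the S3 paths are placeholders.

```python
def join_trial_paths(paths):
    """Join S3 paths into the ';'-separated form used by other-trials-paths."""
    for path in paths:
        if ";" in path:
            raise ValueError(f"path must not contain ';': {path!r}")
    return ";".join(paths)


def split_trial_paths(value):
    """Inverse of join_trial_paths; ignores empty segments."""
    return [path for path in value.split(";") if path]


# Placeholder S3 paths for two other trials
other_trials = [
    "s3://my-bucket/tensors-job-a",
    "s3://my-bucket/tensors-job-b",
]
runtime_configurations = {
    "threshold": "20.0",
    "other-trials-paths": join_trial_paths(other_trials),
}
print(runtime_configurations["other-trials-paths"])
```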

Note that the custom rules can only have arguments which expect a string as the value except the two arguments specifying trials to the Rule (`base_trial` and `other_trials`). 

Here's an example:
```
rules_specification = [
    {
        "RuleName": "CustomGradientRule",
        "SourceS3Uri": "s3://tornasole-external-preview-use1/rules/scripts/my_custom_rule.py",
        "InstanceType": "ml.c5.4xlarge",
        "VolumeSizeInGB": 10,
        "RuntimeConfigurations": {
            "threshold" : "20.0"
        }
    }
]
```