# Debugging SageMaker Training Jobs with Tornasole

## Overview

Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs. It lets you go beyond scalars such as loss and accuracy and gives you full visibility into all tensors 'flowing through the graph' during training or inference.

Using Tornasole is a two-step process:

### Saving tensors

Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole provides a library that lets you capture these tensors and save them for analysis.

### Analysis

Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. Broadly, a Rule is a piece of analysis code that you write to compare tensors across the steps of a training job and to analyze them within each step.
You can also analyze raw tensor data outside of the Rules construct using our analysis APIs. Please refer to [DeveloperGuide_Rules.md](../../../rules/DeveloperGuide_Rules.md).

Analyzing the saved tensors requires the `tornasole.rules` package.
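
To make the idea of a rule concrete, here is a toy sketch in plain Python. It deliberately does not use `tornasole.rules`; the function name, the `threshold` value, and the dictionary layout are all assumptions made for illustration. It only shows the general shape of a rule: look at the tensors saved at each step and flag the steps that meet a condition.

```python
from statistics import fmean


def vanishing_gradient_steps(gradients_by_step, threshold=1e-7):
    """Return the steps whose mean absolute gradient falls below `threshold`.

    `gradients_by_step` maps a step number to a flat list of gradient values.
    This is an illustrative stand-in, not the Tornasole rule implementation.
    """
    flagged = []
    for step in sorted(gradients_by_step):
        mean_abs = fmean(abs(v) for v in gradients_by_step[step])
        if mean_abs < threshold:
            flagged.append(step)
    return flagged


# Toy data: gradients shrink towards zero as training proceeds.
saved = {
    0: [0.5, -0.25, 0.125],
    10: [1e-3, -2e-3, 1e-3],
    20: [1e-9, -1e-9, 2e-9],
}
print(vanishing_gradient_steps(saved))  # [20]
```

A real rule would read the tensors that the training job saved rather than a hand-built dictionary.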

This example guides you through installing the components required to emit tensors in a SageMaker training job and applying a rule over those tensors to monitor the live status of the job.

## Setup

As a first step, we'll install the tools required to emit (save) tensors and to apply rules that analyze them.

In [1]:
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz .
!pip -q install sagemaker-1.35.2.dev0.tar.gz
!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole.json .
!aws configure add-model --service-model sagemaker-tornasole.json --service-name sagemaker

download: s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz to ./sagemaker-1.35.2.dev0.tar.gz
You are using pip version 10.0.1, however version 19.2.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
download: s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole.json to ./sagemaker-tornasole.json

Expecting value: line 1 column 1 (char 0)


Now that we've completed the setup, we're ready to launch a training job with debugging enabled.

## Training with Script Mode

We'll train a TensorFlow model for sentiment analysis, using the SageMaker TensorFlow 1.14 container in Script Mode.

In [2]:
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow

### Inputs

Configure the inputs for the training job.

In [3]:
entry_point_script = 'scripts/simple.py'
docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:latest'
hyperparameters = {'epochs': 1, 'batch_size': 128}

#### Parameters
Now we'll call the TensorFlow Estimator to kick off a training job. The new Estimator parameters to look out for are:

##### `debug` (bool)
This indicates whether debugging should be enabled for the training job. Setting it to `True` makes Tornasole available for use with the job.

##### `rules_specification` (list[*dict*])
This is a list of Python dictionaries, where each `dict` has the following form:
```
{
    "RuleName": <str> # The name of the class implementing the Tornasole Rule interface. (required)
    "RuleEvaluatorImage": <str> # The ECR location of rule evaluator image. Not required if first-party rule is used.
    "SourceS3Uri": <str> # S3 URI of the rule script containing the class named in 'RuleName'. If left empty, SageMaker looks for the class among the first-party rules provided by Amazon; otherwise it looks for the class in the script.
    "InstanceType": <str> # The ml instance type in which the rule evaluation should run
    "VolumeSizeInGB": <int> # The volume size to store the runtime artifacts from the rule evaluation
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and invoke the rule
        <str>: <str>
    }
}
```
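
As a rough illustration of the shape described above, the following sketch checks one `rules_specification` entry before it is passed to the Estimator. The helper `check_rule_spec` is hypothetical and not part of the SageMaker SDK; it only encodes what the form above states (one required key, a fixed set of optional keys, and string-to-string runtime configurations).

```python
REQUIRED_KEYS = {"RuleName"}
OPTIONAL_KEYS = {
    "RuleEvaluatorImage",
    "SourceS3Uri",
    "InstanceType",
    "VolumeSizeInGB",
    "RuntimeConfigurations",
}


def check_rule_spec(spec):
    """Return a list of problems found in one rules_specification entry.

    Hypothetical helper for illustration; an empty list means no problems.
    """
    problems = []
    for key in sorted(REQUIRED_KEYS - set(spec)):
        problems.append(f"missing required key: {key}")
    for key in sorted(set(spec) - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unknown key: {key}")
    if not isinstance(spec.get("VolumeSizeInGB", 0), int):
        problems.append("VolumeSizeInGB must be an int")
    runtime = spec.get("RuntimeConfigurations", {})
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in runtime.items()):
        problems.append("RuntimeConfigurations must map str to str")
    return problems


spec = {
    "RuleName": "WeightUpdateRatio",
    "InstanceType": "ml.c5.4xlarge",
    "VolumeSizeInGB": 10,
    "RuntimeConfigurations": {"start-step": "1", "end-step": "50"},
}
print(check_rule_spec(spec))  # []
```

Note that the runtime configuration values are strings (`"1"`, `"50"`), as the `<str>: <str>` mapping above requires.
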
#### Storage
By default, the tensors are stored in the S3 output path of the training job, under the folder **`/tensors-<job name>`**. This ensures that the tensors from one training job are not accidentally overwritten by those from another. Rule evaluation requires each job's tensors to be kept under a separate path in order to be evaluated correctly.

If you don't provide an S3 output path to the estimator, SageMaker creates one for you as:
**`s3://sagemaker-<region>-<account_id>/`**
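
The path layout described above can be sketched as two small string helpers. Both function names are hypothetical, and the exact way SageMaker joins the output path and the `tensors-<job name>` folder is assumed here to be a plain concatenation; the region, account ID, and job name are placeholders.

```python
def default_output_path(region, account_id):
    """Default S3 output path SageMaker creates when none is given."""
    return f"s3://sagemaker-{region}-{account_id}/"


def tensors_path(output_path, job_name):
    """Folder under the output path where a job's tensors are assumed to be stored."""
    return f"{output_path.rstrip('/')}/tensors-{job_name}"


output_path = default_output_path("us-west-2", "123456789012")  # placeholder values
print(tensors_path(output_path, "my-job"))
# s3://sagemaker-us-west-2-123456789012/tensors-my-job
```

Because the job name is part of the folder name, two jobs that share an output path still write their tensors to different locations.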

See how we instantiate the estimator below.

### Rule

There are two ways to apply rules.
1. Use a first-party (1P) rule. Set `RuleName` to the name of the 1P rule and it will be applied automatically. Here we are using **`WeightUpdateRatio`**. Leave `SourceS3Uri` empty when using a 1P rule.
2. Use a custom rule script and specify the S3 location of the script in `SourceS3Uri`.
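
The two options differ only in whether `SourceS3Uri` is present. A minimal sketch of both forms follows; the helper functions, the class name `CheckGrads`, and the bucket `my-bucket` are hypothetical and only illustrate the dictionary shapes.

```python
def first_party_rule(rule_name, instance_type="ml.c5.4xlarge", volume_gb=10):
    """Build a spec for a 1P rule: no SourceS3Uri, so the rule is looked up by name."""
    return {
        "RuleName": rule_name,
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_gb,
    }


def custom_rule(rule_name, source_s3_uri, evaluator_image=None, **kwargs):
    """Build a spec for a custom rule whose class lives in a script on S3."""
    if not source_s3_uri.startswith("s3://"):
        raise ValueError("SourceS3Uri must be an S3 URI")
    spec = first_party_rule(rule_name, **kwargs)
    spec["SourceS3Uri"] = source_s3_uri
    if evaluator_image is not None:
        # Only set when a custom rule evaluator image is used.
        spec["RuleEvaluatorImage"] = evaluator_image
    return spec


rules_specification = [
    first_party_rule("WeightUpdateRatio"),
    custom_rule("CheckGrads", "s3://my-bucket/rule-script/check_grads.py"),
]
```

Either kind of entry goes into the same `rules_specification` list passed to the Estimator.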

### Estimator

In [4]:
estimator = TensorFlow(role=sagemaker.get_execution_role(),
                  base_job_name='tensorflow-simple-3',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=docker_image_name,
                  entry_point=entry_point_script,
                  framework_version='1.4.1',
                  py_version='py3',
                  script_mode=True,
                  model_dir='/opt/ml/model',
                  #hyperparameters=hyperparameters,
                  debug=True,
                  train_max_run=1800,
                  rules_specification=[
                      {
                          "RuleName": "WeightUpdateRatio",
                          # "SourceS3Uri": "s3://weiyou-tornasole-test/rule-script/check_grads.py",
                          "InstanceType": "ml.c5.4xlarge",
                          "VolumeSizeInGB": 10,
                          #"RuntimeConfigurations": {
                          #    "start-step": "1",
                          #    "end-step": "50"
                          #}
                      }
                  ])

In [5]:
estimator.fit()

2019-08-20 21:34:30 Starting - Starting the training job...
2019-08-20 21:34:32 Starting - Launching requested ML instances......
2019-08-20 21:35:35 Starting - Preparing the instances for training...
2019-08-20 21:36:19 Downloading - Downloading input data
2019-08-20 21:36:19 Training - Downloading the training image......
2019-08-20 21:37:29 Uploading - Uploading generated training model
2019-08-20 21:37:29 Completed - Training job completed

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
2019-08-20 21:37:20,619 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2019-08-20 21:37:20,625 sagemaker-containers INFO     No GPUs detected (normal if no gpus i

## Result

As a result of the above command, SageMaker launches two training jobs for you: the first produces the tensors to be analyzed, and the second evaluates the rule you specified in `rules_specification`.

### Check the status of the Rule Execution Job
To find the rule execution job that SageMaker started for you, run the command below. It shows the `RuleName`, `RuleStatus`, `FailureReason` (if any), and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job fails with a client error and `FailureReason: RuleEvaluationConditionMet`. You can look up the job's logs in the CloudWatch log group `/aws/sagemaker/TrainingJobs` using the `RuleExecutionJobArn`.

In [6]:
estimator.describe_rule_execution_jobs()

RuleName: WeightUpdateRatio
RuleStatus: NotStarted
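
Once the rule execution job has started and a `RuleExecutionJobArn` is reported, its logs can be looked up with boto3. The sketch below assumes the ARN has the usual training-job form `arn:aws:sagemaker:<region>:<account>:training-job/<name>` and that log stream names begin with the job name; both helper functions are hypothetical.

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def job_name_from_arn(arn):
    """Extract the job name from a training-job ARN (assumed ARN layout)."""
    parts = arn.split(":", 5)
    if len(parts) != 6:
        raise ValueError(f"not an ARN: {arn!r}")
    kind, _, name = parts[5].partition("/")
    if kind != "training-job" or not name:
        raise ValueError(f"not a training-job ARN: {arn!r}")
    return name


def recent_log_messages(job_name, limit=20):
    """Fetch up to `limit` log messages per stream for the given job."""
    import boto3  # imported here so the ARN helper works without AWS access

    logs = boto3.client("logs")
    streams = logs.describe_log_streams(
        logGroupName=LOG_GROUP, logStreamNamePrefix=job_name
    )["logStreams"]
    messages = []
    for stream in streams:
        response = logs.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            limit=limit,
        )
        messages.extend(event["message"] for event in response["events"])
    return messages
```

For example, `recent_log_messages(job_name_from_arn(arn))` would return the latest log lines of the rule execution job, given valid AWS credentials.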


### Receive CloudWatch Events for Your Jobs
When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus CloudWatch events are emitted; see https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html. You can configure a CloudWatch Events rule to receive and process these events by setting up a target (such as a Lambda function or an SNS topic).
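
A CloudWatch Events rule of this kind could be created with boto3 roughly as follows. This is a sketch under assumptions: the helper names, the rule name, and the choice of statuses are illustrative, and the target ARN must point to a Lambda function or SNS topic that you have already created and that allows CloudWatch Events to invoke or publish to it.

```python
import json


def training_job_event_pattern(statuses=("Completed", "Failed", "Stopped")):
    """Event pattern matching SageMaker training job state changes."""
    return json.dumps(
        {
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker Training Job State Change"],
            "detail": {"TrainingJobStatus": list(statuses)},
        }
    )


def create_status_rule(rule_name, target_arn, statuses=("Completed", "Failed", "Stopped")):
    """Create the rule and attach a single target (Lambda function or SNS topic)."""
    import boto3  # imported here so the pattern helper works without AWS access

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=training_job_event_pattern(statuses),
        State="ENABLED",
    )
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": target_arn}])
```

Narrowing `statuses` to `("Failed",)` is one way to be notified only when a job fails, which includes a rule execution job that failed because its condition was met.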
