# Amazon SageMaker - Debugging With Custom Rules
[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed platform to build, train, and deploy machine learning models quickly. Amazon SageMaker Debugger lets you debug machine learning models during training by detecting problems with the models in near real time.

## How does Amazon SageMaker Debugger work?

Amazon SageMaker Debugger lets you go beyond just looking at scalars like losses and accuracies during training. It gives you full visibility into all tensors flowing through the graph during training. Furthermore, it helps you monitor your training in near real time using rules. It also provides alerts once it has detected an inconsistency in training flow.

### Concepts
* **Tensors**: These are the artifacts that define the state of the training job at any particular instant in its lifecycle.
* **Debug Hook**: The hook is the construct with which Amazon SageMaker Debugger looks into the training process and captures the requested tensors at the desired step intervals.
* **Debugging Rule**: A logical construct, implemented as Python code, which analyzes the tensors captured by the hook and reports anomalies, if any.
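
As a rough illustration of the hook concept (not the `smdebug` implementation), the sketch below uses a hypothetical `ToyHook` class that records named values only at a fixed step interval. The class name, its `save_interval` argument, and the `on_step` method are assumptions made up for this example.

```python
class ToyHook:
    """Hypothetical stand-in for a debug hook: keeps values every `save_interval` steps."""

    def __init__(self, save_interval):
        self.save_interval = save_interval
        self.saved = {}  # step -> {tensor_name: value}

    def on_step(self, step, tensors):
        # Capture the requested tensors only at the desired step interval.
        if step % self.save_interval == 0:
            self.saved[step] = dict(tensors)


hook = ToyHook(save_interval=2)
for step in range(5):
    loss = 1.0 / (step + 1)  # made-up scalar standing in for a real tensor
    hook.on_step(step, {"loss": loss})

print(sorted(hook.saved))  # [0, 2, 4]
```

A rule would then read the saved values for each captured step and decide whether something looks wrong.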

With these concepts in mind, let's walk through the overall flow that the Debugger uses to orchestrate debugging.


### Saving tensors during training

The tensors captured by the debug hook are stored in an S3 location specified by you. There are two ways you can configure the Debugger to save tensors:

#### With no changes to your training script
If you use one of the SageMaker-provided [Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) for TensorFlow 1.15, you don't need to make any changes to your training script for the tensors to be stored. SageMaker Debugger uses the configuration you provide through the SageMaker SDK's TensorFlow `Estimator` when creating your job to save the tensors in the fashion you specify. You can review the script we are going to use at [src/mnist_zerocodechange.py](src/mnist_zerocodechange.py). Note that this is an untouched TensorFlow script which uses the Estimator interface. SageMaker Debugger supports only the TF.Keras, Estimator, and MonitoredSession interfaces. A full description of support is available at [SMDebug with TensorFlow](https://github.com/awslabs/sagemaker-debugger/tree/master/docs/tensorflow.md).

#### Orchestrating your script to store tensors
For other containers, you need to make a couple of lines of changes to your training script. The Debugger exposes a library, `smdebug`, which allows you to capture these tensors and save them for analysis. It is highly customizable and lets you save the specific tensors you want at different frequencies and with other configurations. Refer to the [DeveloperGuide](https://github.com/awslabs/sagemaker-debugger/tree/master/docs) for details on how to use the Debugger library with your choice of framework in your training script. An example of a script orchestrated this way is at [src/mnist_byoc.py](src/mnist_byoc.py). You also need to ensure that your container has the `smdebug` library installed.

### Analysis of tensors

Once the tensors are saved, the Debugger can be configured to run debugging ***Rules*** on them. At a high level, a rule is Python code used to detect certain conditions during training. Conditions that a data scientist training an algorithm may care about include gradients getting too large or too small, overfitting, and so on. SageMaker Debugger comes pre-packaged with certain first-party (1P) rules. Users can also write their own rules using the SageMaker Debugger APIs. You can also analyze raw tensor data outside of the Rules construct, for example in a SageMaker notebook, using SageMaker Debugger's full set of APIs.


## Training TensorFlow models with Amazon SageMaker Debugger

### Amazon SageMaker TensorFlow as a framework

Train a TensorFlow model in this notebook with Amazon SageMaker Debugger enabled and monitor the training jobs with rules. This is done using the Amazon SageMaker [TensorFlow 1.15.0](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) container as a framework.

In [1]:
import boto3
import os
import sagemaker
from sagemaker.tensorflow import TensorFlow

Import the libraries needed for the demo of Amazon SageMaker Debugger.

In [7]:
from sagemaker.debugger import Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig, rule_configs

Now define the entry point for the training script.

In [10]:
# define the entrypoint script
entrypoint_script='src/mnist_zerocodechange.py'

### Setting up the Estimator

Now it's time to set up our TensorFlow estimator. The estimator has new parameters that enable debugging of your training job through Amazon SageMaker Debugger. These new parameters are explained below.

* **debugger_hook_config**: This new parameter accepts a local path where you wish your tensors to be written and the S3 URI where you wish your tensors to be uploaded. It also accepts CollectionConfigurations, which specify which tensors will be saved from the training job.
* **rules**: This new parameter accepts a list of rules you wish to evaluate against the tensors output by this training job.

SageMaker Debugger supports two types of rules:
* **Amazon SageMaker Rules**: These are rules curated by the Amazon SageMaker team and you can choose to evaluate them against your training job.
* **Custom Rules**: You can optionally choose to write your own rule as a Python source file and have it evaluated against your training job. To have SageMaker Debugger evaluate this rule, you must provide the S3 location of the rule source and the evaluator image.
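
For orientation, the settings that `debugger_hook_config` carries correspond roughly to the `DebugHookConfig` structure of the SageMaker `CreateTrainingJob` API. The sketch below writes that structure as a plain Python dictionary; the bucket name, local path, and `save_interval` value are placeholder assumptions, not values used in this notebook.

```python
import json

# Placeholder values: the bucket and paths below are hypothetical.
debug_hook_config = {
    "S3OutputPath": "s3://my-example-bucket/debugger-tensors",  # where tensors are uploaded
    "LocalPath": "/opt/ml/output/tensors",  # where tensors are written in the container
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            # Collection parameter values are passed as strings.
            "CollectionParameters": {"save_interval": "10"},
        }
    ],
}

print(json.dumps(debug_hook_config, indent=2))
```

The estimator in this notebook does not set `debugger_hook_config`; the collection needed by the custom rule is requested through the rule definition instead.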
 
#### Using your own custom rule
 
This example shows how to have your own custom rule logic evaluated against your training job.

##### **Summary of what the custom rule evaluates**
For demonstration purposes, we provide here a rule that tracks whether gradients are too large. The custom rule looks at the tensors in the collection "gradients" saved during training and computes the absolute value of the gradients at each step of the training. If the mean of the absolute values of the gradients of any tensor at a step is greater than a specified threshold, the rule is marked as 'triggering'. Let us look at how to structure the rule source.

Any custom rule logic you want to be evaluated should extend the `Rule` interface provided by Amazon SageMaker Debugger.

```python
from smdebug.rules.rule import Rule

class CustomGradientRule(Rule):
```

Now implement the class methods for the rule. Doing this allows Amazon SageMaker to understand the intent of the rule and evaluate it against your training tensors.

##### **Rule class constructor**

In order for Amazon SageMaker to instantiate your rule, your rule class constructor must conform to the following signature.
```python
    def __init__(self, base_trial, other_trials, <other parameters>)
```
`base_trial (Trial)`: This defines the primary [Trial](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#trial) that your rule is anchored to. This is an object of class type `Trial`.

`other_trials (list[Trial])`: *(Optional)* This defines a list of 'other' trials you want your rule to look at. This is useful in the scenarios when you're comparing tensors from the base_trial to tensors from some other trials. 

`<other parameters>`: This is similar to `**kwargs` where you can pass in however many string parameters in your constructor signature. Note that SageMaker would only be able to support supplying string types for these values at runtime (see how, later).
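
Because runtime values arrive as strings, a constructor typically converts them before use. The following is a minimal sketch, assuming a `threshold` parameter like the one used later in this example; the stand-in class `ThresholdRuleSketch` is hypothetical and deliberately does not subclass the real `Rule`, so that it runs without `smdebug` installed.

```python
class ThresholdRuleSketch:
    """Stand-in for a custom rule class; a real rule would extend smdebug's Rule."""

    def __init__(self, base_trial, other_trials=None, threshold="10.0"):
        self.base_trial = base_trial
        self.other_trials = other_trials or []
        # Runtime values are supplied as strings, so convert explicitly.
        self.threshold = float(threshold)


rule = ThresholdRuleSketch(base_trial=None, threshold="20.0")
print(rule.threshold)  # 20.0
```

Converting in the constructor keeps the comparison logic in the rule free of string handling.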

##### `invoke_at_step()`:

This defines the logic to be invoked for each step. Essentially, this is where you decide whether the rule should trigger or not. In this case, you're concerned about the gradients getting too large. So, get the tensor reduction "mean" of the absolute values for each step and check whether its value is larger than a threshold.

```python
    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(collection="gradients"):
            t = self.base_trial.tensor(tname)
            abs_mean = t.reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False
```

See the [documentation](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md) to learn more about structuring your rules and other related concepts.
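
To try the threshold check locally before submitting a rule job, the same loop can be exercised against small stand-in objects. The sketch below is an assumption-laden illustration: `FakeTensor`, `FakeTrial`, and `gradients_too_large` are hypothetical names that imitate only the three calls used above (`tensor_names`, `tensor`, `reduction_value`), and the gradient values are made up.

```python
import math
from statistics import fmean


class FakeTensor:
    """Hypothetical stand-in exposing only reduction_value(step, 'mean', abs=...)."""

    def __init__(self, values_by_step):
        self._values_by_step = values_by_step

    def reduction_value(self, step, reduction_name, abs=False):
        if reduction_name != "mean":
            raise ValueError("this stand-in only supports the 'mean' reduction")
        values = self._values_by_step[step]
        if abs:
            values = [math.fabs(v) for v in values]
        return fmean(values)


class FakeTrial:
    """Hypothetical stand-in holding tensors grouped by collection name."""

    def __init__(self, tensors_by_collection):
        self._collections = tensors_by_collection

    def tensor_names(self, collection):
        return list(self._collections.get(collection, {}))

    def tensor(self, tname):
        for tensors in self._collections.values():
            if tname in tensors:
                return tensors[tname]
        raise KeyError(tname)


def gradients_too_large(trial, step, threshold):
    # Same decision logic as invoke_at_step, written as a plain function.
    for tname in trial.tensor_names(collection="gradients"):
        abs_mean = trial.tensor(tname).reduction_value(step, "mean", abs=True)
        if abs_mean > threshold:
            return True
    return False


grads = FakeTensor({0: [0.5, -1.5], 1: [-30.0, 50.0]})
trial = FakeTrial({"gradients": {"dense/kernel_grad": grads}})
step_results = {s: gradients_too_large(trial, s, threshold=20.0) for s in (0, 1)}
print(step_results)  # {0: False, 1: True}
```

Step 0 has a mean absolute gradient of 1.0 and does not trigger; step 1 has 40.0 and does.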

In [20]:
custom_rule = Rule.custom(name='MyCustomRule', # used to identify the rule
                          image_uri='490809245908.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest', # rule evaluator container image
                          instance_type='ml.t3.medium', # instance type to run the rule evaluation on
                          source='rules/my_custom_rule.py', # path to the rule source file
                          rule_to_invoke='CustomGradientRule', # name of the class to invoke in the rule source file
                          volume_size_in_gb=30, # EBS volume size required to be attached to the rule evaluation instance
                          collections_to_save=[CollectionConfig("gradients")], 
                          # collections to be analyzed by the rule. since this is a first party collection we fetch it as above
                          rule_parameters={
                              "threshold": "20.0" # this will be used to initialize 'threshold' param in your constructor
                          })


estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='smdebugger-custom-rule-demo-mnist-tensorflow',
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    entry_point=entrypoint_script,
    framework_version='1.15',
    py_version='py3',
    train_max_run=3600,
    script_mode=True,
    ## New parameter
    rules = [custom_rule]
)

Before you proceed and create the training job, review the parameters passed to `Rule.custom` above:

* `name`: This is used to identify this particular rule among the suite of rules you specified to be evaluated.
* `image_uri`: This is the image of the container that understands your custom rule sources and evaluates them against the collections you save in the training job. The list of open-source SageMaker rule evaluator images is published in the SageMaker Debugger documentation.
* `instance_type`: The type of instance you want to run the rule evaluation on.
* `source`: This is the local path or the S3 URI of your rule source file.
* `rule_to_invoke`: This specifies the particular Rule class implementation in your source file which you want to be evaluated. SageMaker supports only one rule to be evaluated at a time in a rule job. Your source file can have multiple Rule class implementations, though.
* `volume_size_in_gb`: The size, in GB, of the EBS volume attached to the rule evaluation instance.
* `collections_to_save`: This specifies which collections are necessary for this rule.
* `rule_parameters`: This provides the runtime values of the parameters in your constructor. You can still choose to pass in other values which may be necessary for your rule to be evaluated. Any value in this map is available as an environment variable and can be accessed by your rule script using `$<rule_parameter_key>`.

You can read more about custom rule evaluation in Amazon SageMaker in this [documentation](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md).


Let's call `fit()` on our estimator to start the training job and the parallel custom rule evaluation.

In [21]:
# After calling fit, Amazon SageMaker spins off one training job and one rule job for you.
# The rule evaluation status is visible in the training logs
# at regular intervals

estimator.fit()

{'AlgorithmSpecification': {'TrainingInputMode': 'File', 'TrainingImage': '072677473360.dkr.ecr.us-west-2.amazonaws.com/beta-tensorflow-training:1.15.0-py3-cpu-with-horovod-build', 'EnableSageMakerMetricsTimeSeries': True}, 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-072677473360/'}, 'TrainingJobName': 'smdebugger-demo-mnist-tensorflow-2019-12-01-22-26-35-210', 'StoppingCondition': {'MaxRuntimeInSeconds': 3600}, 'ResourceConfig': {'InstanceCount': 1, 'InstanceType': 'ml.m4.xlarge', 'VolumeSizeInGB': 30}, 'RoleArn': 'arn:aws:iam::072677473360:role/service-role/AmazonSageMaker-ExecutionRole-20190917T111877', 'HyperParameters': {'sagemaker_submit_directory': '"s3://sagemaker-us-west-2-072677473360/smdebugger-demo-mnist-tensorflow-2019-12-01-22-26-35-210/source/sourcedir.tar.gz"', 'sagemaker_program': '"mnist_zerocodechange.py"', 'sagemaker_enable_cloudwatch_metrics': 'false', 'sagemaker_container_log_level': '20', 'sagemaker_job_name': '"smdebugger-demo-mnist-tensorflow

## Result 

As a result of calling `fit()`, Amazon SageMaker Debugger kicked off a rule evaluation job for our custom gradient logic in parallel with the training job; the rule job monitored the tensors output by the training job. As you can see in the summary below, there was no step in the training at which the mean absolute gradient exceeded the threshold, so the rule reported `NoIssuesFound`.

In [22]:
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)

In [23]:
description['DebugRuleEvaluationStatuses']

[{'RuleConfigurationName': 'MyCustomRule',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:072677473360:processing-job/smdebugger-demo-mnist-tens-mycustomrule-3ddb0ffc',
  'RuleEvaluationStatus': 'NoIssuesFound',
  'LastModifiedTime': datetime.datetime(2019, 12, 1, 22, 35, 26, 471000, tzinfo=tzlocal())}]
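
When several rules are attached to a job, it can help to reduce this list to a name-to-status mapping. The helper below is a small sketch; `summarize_rule_statuses` is a hypothetical name, and the sample dictionary is a trimmed copy of the response shown above rather than a live API call.

```python
def summarize_rule_statuses(description):
    """Map each rule configuration name to its evaluation status."""
    return {
        entry["RuleConfigurationName"]: entry["RuleEvaluationStatus"]
        for entry in description.get("DebugRuleEvaluationStatuses", [])
    }


# Trimmed copy of the describe_training_job response shown above.
sample_description = {
    "DebugRuleEvaluationStatuses": [
        {
            "RuleConfigurationName": "MyCustomRule",
            "RuleEvaluationStatus": "NoIssuesFound",
        }
    ]
}

statuses = summarize_rule_statuses(sample_description)
triggered = [name for name, status in statuses.items() if status == "IssuesFound"]
print(statuses)   # {'MyCustomRule': 'NoIssuesFound'}
print(triggered)  # []
```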

Running this kind of analysis through Amazon SageMaker Debugger in parallel with the training job is beneficial: issues are detected while training is still running. You can react to the status transitions of the rule evaluations by configuring Amazon CloudWatch events.
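
One way to react to such transitions is a small handler invoked by a CloudWatch Events rule. The sketch below assumes the event's `detail` field mirrors the `describe_training_job` response shown earlier (including `TrainingJobName` and `DebugRuleEvaluationStatuses`). The function name `job_to_stop` and the sample event are hypothetical, and the function only decides what to do rather than calling any AWS API.

```python
def job_to_stop(event):
    """Return the training job name if any rule reports IssuesFound, else None."""
    detail = event.get("detail", {})
    for status in detail.get("DebugRuleEvaluationStatuses", []):
        if status.get("RuleEvaluationStatus") == "IssuesFound":
            return detail.get("TrainingJobName")
    return None


# Hypothetical event payload, shaped after the response shown earlier.
sample_event = {
    "detail": {
        "TrainingJobName": "example-training-job",
        "DebugRuleEvaluationStatuses": [
            {"RuleConfigurationName": "MyCustomRule", "RuleEvaluationStatus": "IssuesFound"}
        ],
    }
}

print(job_to_stop(sample_event))  # example-training-job
```

A real handler could pass the returned name to the SageMaker `stop_training_job` API to end a run whose gradients have grown too large.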