# Debugging SageMaker Training Jobs with Tornasole

## Overview

Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs. 
It lets you go beyond scalars such as losses and accuracies and gives 
you full visibility into all tensors 'flowing through the graph' during training. Tornasole helps you monitor training in near real time using rules, and alerts you once it detects an inconsistency in the training flow. 

Using Tornasole is a two-step process: saving tensors and analysis. Let's look at each of them closely.

### Saving tensors
Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis. Tornasole is highly customizable, so you can save the tensors you want at different frequencies. Refer to [DeveloperGuide_TensorFlow](../../DeveloperGuide_TF.md) for details on how to choose which tensors to save.

### Analysis

Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. At a high level, 
a rule is Python code that detects certain conditions during training. Conditions that a data scientist training a deep learning model may care about include gradients getting too large or too small, overfitting, and so on.
Tornasole comes pre-packaged with certain rules. Users can also write their own rules using the Tornasole APIs.
You can also analyze raw tensor data outside of the Rules construct, for example in a SageMaker notebook, using Tornasole's full set of APIs. 
Refer to [DeveloperGuide_Rules](../../../../rules/DeveloperGuide_Rules.md) for more details about analysis.
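
To make the idea of a rule concrete, here is a minimal sketch in plain Python. It does not use the Tornasole APIs: the function name `find_exploding_tensors`, the dictionary layout, and the `limit` threshold are assumptions chosen only for illustration. It flags any tensor that contains a NaN, an infinity, or a value whose magnitude exceeds the limit.

```python
import math


def find_exploding_tensors(tensors, limit=1e6):
    """Return the names of tensors holding NaN, inf, or values above `limit`.

    `tensors` maps a tensor name to a flat list of floats. This layout is a
    stand-in for illustration, not the format Tornasole saves.
    """
    flagged = []
    for name, values in tensors.items():
        if any(not math.isfinite(v) or abs(v) > limit for v in values):
            flagged.append(name)
    return sorted(flagged)


# Made-up values for a single training step.
step_tensors = {
    "dense/kernel": [0.12, -0.40, 0.07],
    "gradients/dense/kernel": [3.1e9, float("nan"), -2.0],
}
print(find_exploding_tensors(step_tensors))
```

A real rule would read the tensors through the Tornasole APIs described in the developer guide instead of a hand-built dictionary.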

This example guides you through installation of the required components for emitting tensors in a 
SageMaker training job and applying a rule over the tensors to monitor the live status of the job. 


## Setup

As a first step, we'll install the tools required to emit tensors (saving tensors) and to apply rules that analyze them. This step is only needed for the purposes of this private beta. Once it is done, we will be ready to use Tornasole.

In [None]:
!aws s3 sync s3://tornasole-external-preview-use1/ ~/tornasole-preview
!pip -q install ~/tornasole-preview/sdk/sagemaker-tornasole-latest.tar.gz
!aws configure add-model --service-model file://`echo ~/tornasole-preview/sdk/sagemaker-tornasole.json` --service-name sagemaker

## Run a TensorFlow training job with Tornasole in SageMaker

We will use a simple training script, [simple.py](../scripts/simple.py). This script is designed to produce an exploding gradient problem. Tornasole is hooked into the training script: we create a hook and pass it to the monitored session as shown below. See [simple.py](../scripts/simple.py) for the full script.

```python
# Wrap the optimizer with the Tornasole optimizer
optimizer = ts.TornasoleOptimizer(optimizer)

# Tornasole will save all the tensors. Note: TornasoleHook is highly configurable.
# Other options are covered in DeveloperGuide_TensorFlow.md (../../DeveloperGuide_TF.md).
hook = ts.TornasoleHook(out_dir=args.tornasole_path,
                        save_all=True)
# Pass the hook to the hooks parameter of the monitored session
sess = tf.train.MonitoredSession(hooks=[hook])
```

We will train simple.py in SageMaker and attach an ExplodingTensor rule to the training job. Rules are essentially Python code that analyzes the tensors saved by Tornasole. ExplodingTensor is one of the first-party (1P) rules provided by SageMaker.
While training is running, Tornasole captures tensors as specified in the configuration, and the ExplodingTensor rule job looks for an exploding tensor. The rule emits a CloudWatch event if it finds an exploding tensor problem while the training job is running.

## Training tensorflow models in SageMaker with Tornasole
Now let us see how to run the simple.py training script in SageMaker using the SageMaker Estimator API with Tornasole enabled, along with an ExplodingTensor rule to monitor the training job in real time. First, let us import the required libraries and set the links to the Docker images that we will use.

### Docker Images with Tornasole
We have built SageMaker TensorFlow containers with Tornasole. You can pull them from ECR when running jobs in SageMaker. Here are the links to the images. 
**Please change the region below to the region in which you want your jobs to run.**

In [31]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

######## NOTE :::: Change the region to be one where this notebook is running ########
REGION='us-east-2'


TAG='latest'

gpu_docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-tf-1.13.1-gpu:{}'.format(REGION, TAG)
cpu_docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:{}'.format(REGION, TAG)

### Start training a simple Session based example
For the purposes of this demonstration, let us use a simple script that produces NaNs during training, and monitor the job with the ExplodingTensor rule, which checks for this condition. Let us create a dictionary of bad hyperparameters that sets a bad learning rate and scale, both of which are parameters taken by the script.

In [32]:
simple_entry_point_script = '../scripts/simple.py'
simple_hyperparameters = { 'steps': 10000, 'tornasole_frequency': 50 }

# copy dict
bad_simple_hyperparameters = dict(simple_hyperparameters)
## These parameters are consumed by simple.py to produce an exploding tensor problem
bad_simple_hyperparameters.update({ 'lr': 100, 'scale': 100000000000})
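
To see why a learning rate this large can end in non-finite values, consider a toy problem that is unrelated to the contents of simple.py: plain gradient descent on `f(w) = w**2`. The function name `run_gradient_descent` and the choice of objective are assumptions made only for this sketch. With a small learning rate the weight shrinks toward zero. With `lr=100` each update multiplies the weight by a large factor until it leaves the floating point range.

```python
import math


def run_gradient_descent(lr, steps=200, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent.

    Returns (final_w, step), where step is the first step at which w became
    non-finite, or None if w stayed finite for all steps.
    """
    w = w0
    for step in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad       # standard gradient descent update
        if not math.isfinite(w):
            return w, step
    return w, None


good_w, good_step = run_gradient_descent(lr=0.1)
bad_w, bad_step = run_gradient_descent(lr=100.0)
print("lr=0.1   ->", good_w, good_step)
print("lr=100.0 ->", bad_w, bad_step)
```

This is the kind of condition a rule such as ExplodingTensor is meant to catch while the job is still running.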

In [33]:
sagemaker_simple_estimator = TensorFlow(role=sagemaker.get_execution_role(),
                       base_job_name='tornasole-simple-demo',
                       train_instance_count=1,
                       train_instance_type='ml.m4.xlarge',
                       image_name=cpu_docker_image_name,
                       entry_point=simple_entry_point_script,
                       framework_version='1.13.1',
                       py_version='py3',
                       script_mode=True,
                       hyperparameters=bad_simple_hyperparameters,
                       train_max_run=1800,
                       
                    ## This is a Tornasole-specific parameter. debug=True means the rules listed in rules_specification
                    ## will run as rule jobs. Here we run the ExplodingTensor 1P rule on an ml.c5.4xlarge instance.
                       debug=True,
                       rules_specification=[
                           {
                              "RuleName": "ExplodingTensor",
                              "InstanceType": "ml.c5.4xlarge",
                              
                           }
                      ])

In [34]:
# By setting wait=False, we just submit the job to run in the background

sagemaker_simple_estimator.fit(wait=False)
# NOTE: The call above is fire-and-forget. In the background, SageMaker will spin off 1 training job and 1 rule job
# for you. Follow the rest of this notebook to see the status of the training job and the rule job.

### Result
As a result of the above command, SageMaker will spin off 1 training job and 1 rule job for you. The first is the job that produces the tensors to be analyzed, and the second analyzes those tensors to check whether there are any exploding tensors during training.

### Check status of Training Job using describe_training_job api

In [36]:
# Below command will give the status of training job
# Note: In the output of below command you will see DebugConfig parameter 

sagemaker_simple_estimator.sagemaker_session.sagemaker_client.describe_training_job(
                TrainingJobName=sagemaker_simple_estimator.latest_training_job.name
            )

{'TrainingJobName': 'tornasole-simple-demo-2019-08-24-04-07-54-159',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-2:072677473360:training-job/tornasole-simple-demo-2019-08-24-04-07-54-159',
 'TrainingJobStatus': 'InProgress',
 'SecondaryStatus': 'Starting',
 'HyperParameters': {'lr': '100',
  'model_dir': '"s3://sagemaker-us-east-2-072677473360/tornasole-simple-demo-2019-08-24-04-07-54-159/model"',
  'sagemaker_container_log_level': '20',
  'sagemaker_enable_cloudwatch_metrics': 'false',
  'sagemaker_job_name': '"tornasole-simple-demo-2019-08-24-04-07-54-159"',
  'sagemaker_program': '"simple.py"',
  'sagemaker_region': '"us-east-2"',
  'sagemaker_submit_directory': '"s3://sagemaker-us-east-2-072677473360/tornasole-simple-demo-2019-08-24-04-07-54-159/source/sourcedir.tar.gz"',
  'scale': '100000000000',
  'steps': '10000',
  'tornasole_frequency': '50'},
 'AlgorithmSpecification': {'TrainingImage': '072677473360.dkr.ecr.us-east-2.amazonaws.com/tornasole-preprod-tf-1.13.1-cpu:latest',


**Once your training job has started (see `TrainingJobStatus` in the above response), SageMaker will spin up a rule job to run the ExplodingTensor rule.**
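
Because `fit(wait=False)` returns immediately, you may prefer to poll the job status in a loop rather than re-running the cell by hand. The following is a minimal sketch. The helper name `wait_for_training_job` and its parameters are assumptions made for this example; it relies only on the `TrainingJobStatus` field shown in the response above.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120,
                          sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status.

    `client` is any object exposing describe_training_job(TrainingJobName=...),
    such as a boto3 SageMaker client. Returns the last status seen, which may
    still be non-terminal if max_polls is exhausted.
    """
    status = None
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    return status
```

In this notebook the client would be `sagemaker_simple_estimator.sagemaker_session.sagemaker_client` and the job name `sagemaker_simple_estimator.latest_training_job.name`.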

### Check the status of the Rule Execution Job
To get the rule execution job that SageMaker started for you, run the command below. It shows you the `RuleName`, `RuleStatus`, `FailureReason` (if any), and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. You can check the CloudWatch log stream `/aws/sagemaker/TrainingJobs` with `RuleExecutionJobArn`.

Once the rule execution job starts, you will see that it identifies the exploding tensor situation in the training job, raises the `RuleEvaluationConditionMet` exception, and ends the job.

You can go back and change the hyperparameters passed to the estimator to `simple_hyperparameters` and start a new training job. You will see that the ExplodingTensor rule is not fired in that case as no tensors go to `nan` with the default good hyperparameters.

In [37]:
sagemaker_simple_estimator.describe_rule_execution_jobs()

Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
RuleName: ExplodingTensor
RuleStatus: NotStarted
Wait to get status for Rule Execution Jobs...
Rule

### Receive CloudWatch Event For your Jobs
When the status of a training job or rule execution job changes (for example, to starting or failed), TrainingJobStatus [CloudWatch events](https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html) are emitted. You can configure a CloudWatch event rule to receive and process these events by setting up a target (such as a Lambda function or an SNS topic). 
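
As a sketch of what such an event rule could match on, the snippet below builds an event pattern for SageMaker training job state changes. The helper name `training_job_event_pattern` is an assumption for this example. Whether rule execution jobs in this private beta emit events under the same `detail-type` is not confirmed here, so treat the pattern as a starting point.

```python
import json


def training_job_event_pattern(statuses):
    """Build a CloudWatch Events pattern matching the given TrainingJobStatus values."""
    return {
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": list(statuses)},
    }


pattern = training_job_event_pattern(["Failed", "Stopped"])
print(json.dumps(pattern, indent=2))
```

The resulting JSON string can be supplied as the `EventPattern` argument of the boto3 `events` client's `put_rule` call, with a Lambda function or SNS topic attached through `put_targets`.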

## Describing training job to get Tornasole-specific parameters
#### Tornasole-specific parameters in response
**DebugConfig** parameter has details about the Tornasole-related configuration. Key parameters to look for:  
*S3OutputPath*: The path where the output tensors from Tornasole are saved.  
*RuleConfig*: The rule configuration that was passed when creating the training job. Here you should be able to see details of the rules that ran for the training job.


In [None]:
sagemaker_simple_estimator.sagemaker_session.sagemaker_client.describe_training_job(
                TrainingJobName=sagemaker_simple_estimator.latest_training_job.name
            )
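
If you only need the Tornasole-specific fields, a small helper can pull them out of the response. This is a sketch under an assumption: it expects `S3OutputPath` and `RuleConfig` to sit directly under `DebugConfig`, which the description above suggests but does not spell out. The helper name `summarize_debug_config` and the sample response are hypothetical.

```python
def summarize_debug_config(response):
    """Extract the Tornasole-related fields from a describe_training_job response.

    Missing keys yield None instead of raising, since the exact layout of
    DebugConfig is an assumption in this sketch.
    """
    debug_config = response.get("DebugConfig", {})
    return {
        "S3OutputPath": debug_config.get("S3OutputPath"),
        "RuleConfig": debug_config.get("RuleConfig"),
    }


# Hypothetical response fragment, for illustration only.
sample_response = {
    "TrainingJobName": "example-job",
    "DebugConfig": {
        "S3OutputPath": "s3://example-bucket/tensors-example-job",
        "RuleConfig": [{"RuleName": "ExplodingTensor"}],
    },
}
print(summarize_debug_config(sample_response))
```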

## Describing Rule job to get rule execution details and results
##### key parameters
**RuleName**: Rule which ran for the job  
**RuleStatus**: The status of the rule job  
**RuleExecutionJobArn**: arn of rule job  

**NOTE: In the response, you can see details about the CloudWatch event that was emitted due to rule success or failure.**  
> Created CW event Rule: arn:aws:events:us-east-2:072677473360:rule/RuleEvaluationConditionMetRule-tornasole-simple-demo-VanishingGr  
Please monitor the rule job statuses by going to CloudWatch->Events->Rule->Monitoring

In the above example, we saw how an ExplodingTensor rule analyzed the tensors while training was running and produced an alert in the form of a CloudWatch event.  

We have 2 more real-life examples in the final sections of this notebook. Before moving further, let's take a more detailed look at Tornasole.

## Enabling Tornasole in the training script

The first step to using Tornasole is to save tensors from the training job. The containers we provide in SageMaker come with the Tornasole library installed, which you need to use to enable Tornasole in your training script. We currently support two interfaces for training in TensorFlow: `tf.Session` and `tf.Estimator`. 

Please note: **Keras** support is a work in progress. Please stay tuned! We will also support **Eager** mode in the future. Tornasole currently works only for single-process training. We will support distributed training very soon. 

### TF Session based training
When training using this interface you need to create a [MonitoredSession](https://www.tensorflow.org/api_docs/python/tf/train/MonitoredSession) to use for the job which is configured with TornasoleHook, a construct Tornasole exposes to save tensors from the job. Here's how you will need to modify your training script.

First, you need to import `tornasole.tensorflow`. 
```
import tornasole.tensorflow as ts 
```
Then create the TornasoleHook by specifying what you want to save and when you want to save them.
```
hook = ts.TornasoleHook(include_collections=['weights','gradients'],
                        save_config=ts.SaveConfig(save_interval=50))
```
Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate out different modes of the jobs.
Set the mode you are running in your job. Every time the mode changes in your job, please set the current mode. This helps you group steps by mode, for easier analysis. Setting the mode is optional but recommended. If you do not specify this, Tornasole saves all steps under a `GLOBAL` mode. 
```
hook.set_mode(ts.modes.TRAIN)
```
Wrap your optimizer with TornasoleOptimizer so that Tornasole can identify your gradients and automatically provide these tensors as part of the `gradients` collection. Use this new optimizer to minimize your loss during training.
```
optimizer = ts.TornasoleOptimizer(optimizer)
```
Create a monitored session with the above hook, and use this for executing your TensorFlow job.
```
sess = tf.train.MonitoredSession(hooks=[hook])
```

We have an example script which shows the above: [scripts/simple.py](../scripts/simple.py). This is the script used in the training job earlier in this notebook.

Refer to [DeveloperGuide_TensorFlow.md](../../DeveloperGuide_TF.md) for more details on the APIs Tornasole provides to help you save tensors.

### TF Estimator based training
When training using this interface you need to pass TornasoleHook, a construct Tornasole exposes to save tensors, to the train, predict or evaluate functions of your [TensorFlow Estimator](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/estimator/Estimator?hl=en). Here's how you will need to modify your training script.

First, you need to import `tornasole.tensorflow`. 
```
import tornasole.tensorflow as ts 
```
Then create the TornasoleHook by specifying what you want to save and when you want to save them.
```
hook = ts.TornasoleHook(include_collections=['weights','gradients'],
                        save_config=ts.SaveConfig(save_interval=50))
```
Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate out different modes of the jobs.
Set the mode you are running in your job. Every time the mode changes in your job, please set the current mode. This helps you group steps by mode, for easier analysis. Setting the mode is optional but recommended. If you do not specify this, Tornasole saves all steps under a `GLOBAL` mode. 
```
hook.set_mode(ts.modes.TRAIN)
```
Wrap your optimizer with TornasoleOptimizer so that Tornasole can identify your gradients and automatically provide these tensors as part of the `gradients` collection. Use this new optimizer to minimize your loss during training.
```
opt = ts.TornasoleOptimizer(opt)
```
Now pass this hook to the estimator object's train, predict or evaluate methods, whichever ones you want to monitor.
```
classifier = tf.estimator.Estimator(...)

classifier.train(input_fn, hooks=[hook])
classifier.predict(input_fn, hooks=[hook])
classifier.evaluate(input_fn, hooks=[hook])
```

Refer to our example scripts for [MNIST](../scripts/mnist.py) or [ResNet50 for ImageNet](../scripts/train_imagenet_resnet_hvd.py) for examples of using Tornasole with the Estimator interface. We will show you how to run these examples in SageMaker below.

Refer to [DeveloperGuide_TensorFlow.md](../../DeveloperGuide_TF.md) for more details on the APIs Tornasole provides to help you save tensors.

## SageMaker with Tornasole

We'll train a few TensorFlow models in this notebook with Tornasole enabled and monitor the training jobs with Tornasole's Rules. This will be done using SageMaker TensorFlow 1.13.1 Container in Script Mode. Note that Tornasole currently only works with python3, so be sure to set `py_version='py3'` when creating SageMaker Estimator below.

#### Storage
The tensors saved by Tornasole are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-<job name>`**. This ensures that tensors from one training job are not accidentally overwritten by those from another. Rule evaluation requires the tensor paths to be kept separate in order to work correctly.

If you don't provide an S3 output path to the estimator, SageMaker creates one for you as: **`s3://sagemaker-<region>-<account_id>/`**

This path is used to create a Tornasole Trial taken by Rules (see below).
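
The two path conventions above can be written out as follows. This is a sketch: the helper names are hypothetical, the account id is a placeholder, and the assumption that `tensors-<job name>` is appended directly to the output path is this example's reading of the description above. Confirm the actual location from `S3OutputPath` in the `describe_training_job` response.

```python
def default_output_path(region, account_id):
    """Default S3 output path SageMaker creates when none is given."""
    return "s3://sagemaker-{}-{}/".format(region, account_id)


def tensors_path(output_path, job_name):
    """Assumed location of the tensors saved by Tornasole for one job."""
    return "{}/tensors-{}".format(output_path.rstrip("/"), job_name)


# Placeholder account id and job name, for illustration only.
output_path = default_output_path("us-east-2", "123456789012")
print(tensors_path(output_path, "tornasole-simple-demo-example"))
```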

#### New Parameters 
The new parameters in the SageMaker Estimator to look out for are:

- `debug`: (bool)
This indicates that debugging should be enabled for the training job. 
Setting this to `True` makes Tornasole available for use with the job.

- `rules_specification`: (list[*dict*])
You can specify any number of rules to monitor your SageMaker training job. This parameter takes a list of python dictionaries, one for each rule you want to enable. Each `dict` is of the following form:
```
{
    "RuleName": <str>       
        # The name of the class implementing the Tornasole Rule interface. (required)

    "SourceS3Uri": <str>    
        # S3 URI of the rule script containing the class in 'RuleName'. 
        # This is not required if you want to use one of the First Party rules provided to you by Amazon. 
        # In such a case you can leave it empty or not pass it. If you want to run a custom rule 
        # defined by you, you will need to define the custom rule class in a python 
        # file and provide it to SageMaker as a S3 URI. 
        # SageMaker will fetch this file and try to look for the rule class 
        # identified by RuleName in this file.
    
    "InstanceType": <str>   
        # The ML instance type which should be used to run the rule evaluation job
        
    "VolumeSizeInGB": <int> 
        # The volume size to store the runtime artifacts from the rule evaluation 
        
    "RuntimeConfigurations": {
        # Map defining the parameters required to instantiate the Rule class and
        # parameters regarding invocation of the rule (start-step and end-step)
        # This can be any parameter taken by the rule. Every value here needs to be a string. 
        # So when you write custom rules, ensure that you can parse each argument from a string.
        # PARAMS CAN BE
        # STANDARD PARAMS FOR RULE EXECUTION
        # "start-step": <str>
        # "end-step": <str>
        # "other-trials-paths": <str> (';' separated list of s3 paths as a string)
        # ANY OTHER PARAMETER TAKEN BY THE RULE
        # "parameter" : <str>
        <str>: <str>
    }
}
```
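
Since every value in `RuntimeConfigurations` must be a string and `RuleName` is required, it can help to check a specification locally before submitting the job. The validator below is a minimal sketch. The function name `validate_rule_spec` is an assumption, and it only checks the two constraints stated above, not everything SageMaker may enforce.

```python
def validate_rule_spec(spec):
    """Return a list of problems found in one rules_specification entry."""
    problems = []
    rule_name = spec.get("RuleName")
    if not isinstance(rule_name, str) or not rule_name:
        problems.append("RuleName is required and must be a non-empty string")
    for key, value in spec.get("RuntimeConfigurations", {}).items():
        if not isinstance(value, str):
            problems.append(
                "RuntimeConfigurations[{!r}] must be a string, got {}".format(
                    key, type(value).__name__))
    return problems


good_spec = {"RuleName": "ExplodingTensor", "InstanceType": "ml.c5.4xlarge"}
bad_spec = {"RuntimeConfigurations": {"start-step": 10}}
print(validate_rule_spec(good_spec))
print(validate_rule_spec(bad_spec))
```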

### Inputs
Just a quick reminder if you are not familiar with script mode in SageMaker. You can pass command line arguments taken by your training script with a hyperparameter dictionary which gets passed to the SageMaker Estimator class. You can see this in the examples below.

### Rules
Rules are the medium by which Tornasole executes a certain piece of code regularly on different steps of the job.
They can be used to assert certain conditions during training and raise CloudWatch events based on them, which you can
process in any way you like. 

Tornasole comes with a set of **First Party rules** (1P rules).
You can also write your own rules, looking at these 1P rules for inspiration. 
Refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more on the APIs you can use to write your own rules, as well as descriptions of the 1P rules that we provide. 
 
Here we will talk about how to use SageMaker to evaluate these rules on the training jobs.


##### 1P Rule 
If you want to use a 1P rule, specify the `RuleName` field with the 1P rule name, and the rule will be automatically applied. You can pass any parameters accepted by the rule as part of the `RuntimeConfigurations` dictionary. Rule constructors take a trial as a parameter.  
A Trial in Tornasole's context refers to a training job. It is identified by the path where the saved tensors for the job are stored.  
A rule takes a `base_trial`, which refers to the job whose run invokes the rule execution. 

**Note:** A rule can be written to compare & analyze tensors across training jobs. A rule which needs to compare tensors across trials can be run by passing the argument `other_trials`. The argument `base_trial` will automatically be set by SageMaker when executing the rule. The parameter `other_trials` (if taken by the rule) can be passed by passing `other-trials-paths` in the RuntimeConfigurations dictionary. The value for this argument should be `;` separated list of S3 output paths where the tensors for those trials are stored.

Here's an example of a complex configuration for the SimilarAcrossRuns rule (which accepts one other trial and a regex pattern), where we ask for the rule to be invoked for the steps between 10 and 100.

``` 
rules_specification = [ 
    {
      "RuleName": "SimilarAcrossRuns",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "other-trials-paths": "s3://sagemaker-<region>-<account_id>/past-job",
         "include_regex": ".*",
         "start-step": "10",
         "end-step": "100"
       }
    }
]
```
The list of 1P rules and their details can be found in the *First party rules* section of [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md).  
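
Because `other-trials-paths` is a single `;`-separated string and all other values must be strings too, a small helper can assemble the `RuntimeConfigurations` dictionary from native Python values. This is a sketch and the helper name `build_runtime_configurations` is hypothetical. The keys `other-trials-paths`, `start-step`, and `end-step` are the ones named above.

```python
def build_runtime_configurations(other_trial_paths=(), start_step=None,
                                 end_step=None, **rule_params):
    """Build a RuntimeConfigurations dict in which every value is a string."""
    config = {key: str(value) for key, value in rule_params.items()}
    if other_trial_paths:
        # Multiple trials are passed as one ';'-separated string.
        config["other-trials-paths"] = ";".join(other_trial_paths)
    if start_step is not None:
        config["start-step"] = str(start_step)
    if end_step is not None:
        config["end-step"] = str(end_step)
    return config


runtime_config = build_runtime_configurations(
    other_trial_paths=["s3://example-bucket/past-job-1",
                       "s3://example-bucket/past-job-2"],
    start_step=10,
    end_step=100,
    include_regex=".*",
)
print(runtime_config)
```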


##### Custom rule
In this case you need to define a custom rule class which inherits from the `tornasole.rules.Rule` class.
You need to provide SageMaker with the S3 location of the file which defines your custom rule classes as the value for the field `SourceS3Uri`. Again, you can pass any arguments taken by this rule through the RuntimeConfigurations dictionary. Note that custom rules can only have arguments which expect a string as the value, except for the two arguments specifying trials to the rule. Refer to the section *Writing a rule* in [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more details.

Here's an example:
```
rules_specification = [
    {
      "RuleName": "CustomRule",
      "SourceS3Uri": "s3://weiyou-tornasole-test/rule-script/custom_rule.py",
      "InstanceType": "ml.c5.4xlarge",
      "VolumeSizeInGB": 10,
      "RuntimeConfigurations": {
         "threshold" : "0.5"
       }
    }
]
```
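
Since the `threshold` above arrives as the string `"0.5"`, the rule's constructor has to convert it. The sketch below shows only that string-parsing pattern. The `Rule` class here is a stand-in defined locally so the example runs without Tornasole, and `exceeds_threshold` is a hypothetical helper. The methods a real rule must implement are described in the *Writing a rule* section of the developer guide and are not reproduced here.

```python
class Rule:
    """Local stand-in for tornasole.rules.Rule (assumption, for illustration)."""

    def __init__(self, base_trial):
        self.base_trial = base_trial


class CustomRule(Rule):
    def __init__(self, base_trial, threshold="0.5"):
        super().__init__(base_trial)
        # RuntimeConfigurations values arrive as strings, so convert explicitly.
        self.threshold = float(threshold)

    def exceeds_threshold(self, value):
        # Hypothetical helper, not part of the Tornasole rule interface.
        return abs(value) > self.threshold


rule = CustomRule(base_trial=None, threshold="0.5")
print(rule.threshold, rule.exceeds_threshold(-0.75))
```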



## Train ResNet50 on ImageNet with Tornasole
Now let us run a more complicated example and train ResNet50 on a GPU instance. The script, which uses the TensorFlow Estimator interface, is available [here](../scripts/train_imagenet_resnet_hvd.py). It supports various modes of using Tornasole. Please refer to [this document](../../sm_resnet50.md), which summarizes the changes made to this script to save weights, gradients, activations of certain layers, and so on. You can also save large layers as reductions instead of saving the full tensor. Full details of Tornasole's APIs to save tensors are available in [DeveloperGuide_TensorFlow](../../DeveloperGuide_TF.md).

The below hyperparameters initialize the weights of the model badly (to a small constant). This results in training proceeding badly with many gradients vanishing. We can monitor the situation using the VanishingGradient rule.

In [15]:
resnet_script = '../scripts/train_imagenet_resnet_hvd.py'
bad_resnet_hyperparameters = {
    'enable_tornasole': True,
    'tornasole_save_gradients': True,
    'tornasole_step_interval' : 100,
    'num_epochs': 1,
    'constant_initializer': 0.01
}

In [16]:
sagemaker_resnet_estimator = TensorFlow(role=sagemaker.get_execution_role(),
                  base_job_name='tornasole-demo-resnet',
                  train_instance_count=1,
                  train_instance_type='ml.p3.2xlarge',
                  image_name=gpu_docker_image_name,
                  entry_point=resnet_script,
                  framework_version='1.13.1',
                  py_version='py3',
                  script_mode=True,
                  hyperparameters=bad_resnet_hyperparameters,
                  debug=True,
                  train_max_run=1800,
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                      }
                  ])

In [18]:
sagemaker_resnet_estimator.fit(wait=False)


In [None]:
## Note: wait=False above makes the fit call fire-and-forget. SageMaker will run the training job and rule job in the
## background. 
## To see the status of the training job:
sagemaker_resnet_estimator.sagemaker_session.sagemaker_client.describe_training_job(
                TrainingJobName=sagemaker_resnet_estimator.latest_training_job.name
            )


In [None]:
## To check status of rule execution job
sagemaker_resnet_estimator.describe_rule_execution_jobs()

## Train MNIST with Estimator interface
If you do not want to use GPUs at this point, but want to run a slightly more complicated script than the simple example you saw above, you can train a model on CPU on the MNIST dataset as below. Let us monitor for VanishingGradient in this job. **We do not expect this rule to be fired for the below hyperparameters.**

In [2]:
mnist_script = '../scripts/mnist.py'
mnist_hyperparameters = {'num_epochs': 5}

In [3]:
sagemaker_mnist_estimator = TensorFlow(role=sagemaker.get_execution_role(),
                  base_job_name='tornasole-demo-mnist',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  image_name=cpu_docker_image_name,
                  entry_point=mnist_script,
                  framework_version='1.13.1',
                  py_version='py3',
                  script_mode=True,
                  hyperparameters=mnist_hyperparameters,
                  debug=True,
                  train_max_run=1800,
                  rules_specification=[
                      {
                          "RuleName": "VanishingGradient",
                          "InstanceType": "ml.c5.4xlarge",
                      }
                  ])

In [4]:
sagemaker_mnist_estimator.fit(wait=False)

sagemaker_mnist_estimator.sagemaker_session.sagemaker_client.describe_training_job(
                TrainingJobName=sagemaker_mnist_estimator.latest_training_job.name
            )

In [None]:
sagemaker_mnist_estimator.describe_rule_execution_jobs()

In [None]:
sagemaker_mnist_estimator.attach(sagemaker_mnist_estimator.latest_training_job.name)