# Amazon SageMaker - TensorFlow 2.x
[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a machine learning platform to build, train, and host state-of-the-art ML and AI models. [Amazon SageMaker Debugger](https://github.com/awslabs/sagemaker-debugger) lets you debug machine learning models during training and identifies problems with the models in real time.

Experimental support for TF 2.x was initially introduced in v0.7.1 of the Debugger's smdebug library. A full description of the support is available at [Amazon SageMaker Debugger with TensorFlow](https://github.com/awslabs/sagemaker-debugger/tree/master/docs/tensorflow.md).

With the smdebug v0.9.0 release, this support was extended to cover the TF 2.x [model_to_estimator](https://www.tensorflow.org/api_docs/python/tf/keras/estimator/model_to_estimator) and Estimator APIs.

In this notebook, you will learn how to run a training job with the TF 2.x [model_to_estimator](https://www.tensorflow.org/api_docs/python/tf/keras/estimator/model_to_estimator) API and use the Debugger built-in rules to watch for training anomalies.

## Training TensorFlow Keras models with Amazon SageMaker Debugger

### Amazon SageMaker TensorFlow as a framework

We will train a TensorFlow Keras model in this notebook with Amazon SageMaker Debugger, and monitor the training job with the Debugger built-in rules. The training job will run on a pre-built [AWS Deep Learning Container](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html) with TensorFlow 2.1.0 and smdebug 0.9.0 installed.


## Setup

Execute the following code cell for a one-time smdebug setup to get your notebook kernel ready for a full experience of using the Debugger features. This smdebug library provides tools to perform interactive analysis throughout the notebook.


In [None]:
! pip install smdebug

Import the AWS boto3 Python SDK, the SageMaker Python SDK, the SageMaker TensorFlow class, and other Python utility libraries.

In [2]:
import boto3
import os
import sagemaker
from sagemaker.tensorflow import TensorFlow

Import the SageMaker Debugger classes for configuring hooks and rules.

In [3]:
from sagemaker.debugger import Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig, rule_configs

Now define the entry point for the training script.

Since our demo training script `tf_keras_to_estimator.py` uses the `tensorflow-datasets` package that is not available in the pre-built container, we need to install it as a requirement for the training job. To add the installation step, we simply add the package in a `requirements.txt` file. You can add a list of required libraries to run your own training script in the same way.

The `launcher.sh` script is provided along with this notebook, which will install the `tensorflow-datasets` package and initiate the training script. For more information about installing third-party libraries, see [Use third-party libraries with SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#use-third-party-libraries).

In [4]:
# define the entrypoint script
entrypoint_script='launcher.sh'

### Setting up the SageMaker TensorFlow Estimator

Now it's time to set up the SageMaker TensorFlow estimator. We will add the following Debugger-specific parameters to the estimator to enable debugging of the demo training script.

- **debugger_hook_config**: This parameter accepts a local path where the tensors are written and the S3 location where they are stored. Debugger uploads the tensors transparently during execution.
- **rules**: This parameter accepts a list of rules to evaluate against the output tensors and the training behavior while the training job is running. There are two types of Debugger rules:
    - **Built-in Rules**: Rules curated by Amazon SageMaker Debugger that you can opt in to when evaluating your training job.
    - **Custom Rules**: Rules that you write yourself as a Python source file and use to evaluate your training job. For Amazon SageMaker Debugger to evaluate a custom rule, you have to provide the S3 location of the rule source and the evaluator image. For more information about how to create and use Debugger custom rules, see [Create Debugger Custom Rules for Training Job Analysis](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-custom-rules.html).
 
### Using Amazon SageMaker Rules
 
In this notebook, we will demonstrate how to use SageMaker built-in rules to evaluate the training script. You can find the list of SageMaker rules and the configurations best suited for using them [here](https://github.com/awslabs/sagemaker-debugger-rulesconfig).

The rules we are using in this notebook are **VanishingGradient** and **LossNotDecreasing**. As the names suggest, the rules will attempt to evaluate if there are vanishing gradients in the tensors captured by the debugging hook during training and also if the loss is not decreasing.

In [5]:
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()), 
    Rule.sagemaker(rule_configs.loss_not_decreasing())
]
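
To build intuition for what these two rules look for, here is a simplified, self-contained sketch. It is not the Debugger's implementation: the function names, the threshold, the window size, and the assumption of non-negative loss values are all chosen for illustration, whereas the real rules operate on the tensors saved by the debugging hook.

```python
from statistics import mean


def gradients_look_vanishing(gradient_values, threshold=1e-7):
    """Return True if the mean absolute gradient is below `threshold`.

    Illustrative only; the default threshold is an assumed value.
    """
    return mean(abs(g) for g in gradient_values) < threshold


def loss_not_decreasing(losses, window=3, min_relative_drop=0.01):
    """Return True if the loss dropped by less than `min_relative_drop`
    (as a fraction) over the last `window` steps.

    Assumes non-negative loss values.
    """
    if len(losses) <= window:
        return False  # not enough history to judge
    previous, current = losses[-window - 1], losses[-1]
    return current > previous * (1 - min_relative_drop)
```

A check like `loss_not_decreasing` trades sensitivity for noise tolerance: a larger `window` ignores short plateaus but reacts later to a stalled run.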

The training script expects an argument `out_dir`, which specifies where SageMaker Debugger should save the tensors. It is passed to the script through the `hyperparameters` argument of the Estimator API.

In [6]:
hyperparameters = {"out_dir": "/opt/ml/output/tensors"}
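
In script mode, SageMaker passes each hyperparameter to the entry point as a command-line argument (here, `--out_dir /opt/ml/output/tensors`). The sketch below shows one way a training script might read it. The `parse_args` helper and its default value are assumptions for illustration and are not taken from `tf_keras_to_estimator.py`; because the entry point in this notebook is `launcher.sh`, that script presumably forwards its arguments to the Python script.

```python
import argparse


def parse_args(argv=None):
    """Parse the `out_dir` hyperparameter from command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--out_dir", type=str, default="/opt/ml/output/tensors")
    # parse_known_args tolerates additional arguments that may also be passed,
    # such as --model_dir, without raising an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--out_dir", "/opt/ml/output/tensors"])
print(args.out_dir)
```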

Let us now create the estimator and call `fit()` on our estimator to start the training job and rule evaluation job in parallel.

In [7]:
estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='smdebug-tf2-model-to-estimator',
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
    entry_point=entrypoint_script,
    source_dir="src",
    framework_version='2.1.0',
    train_max_run=3600,
    script_mode=True,
    py_version='py3',
    hyperparameters=hyperparameters,
    ## New parameter
    rules = rules
)

# After calling fit, Amazon SageMaker starts one training job
# and one rule evaluation job for each rule in `rules`.
# The rule evaluation status is visible in the training logs
# at regular intervals.

estimator.fit(wait=False)

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


## Result 

As a result of calling `fit(wait=False)`, the training job was kicked off in the background, and Amazon SageMaker Debugger kicked off rule evaluation jobs for the built-in **VanishingGradient** and **LossNotDecreasing** rules in parallel with it. You can review the status of the rule jobs as follows.

In [8]:
import time
status = estimator.latest_training_job.rule_job_summary()
while status[0]['RuleEvaluationStatus'] == 'InProgress':
    status = estimator.latest_training_job.rule_job_summary()
    print(status)
    time.sleep(10)
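
The loop above polls only the first rule in the summary. A variant that waits for every rule and gives up after a timeout could look like the following sketch. `wait_for_rules`, its parameters, and the `fetch_summary` callable are hypothetical names; in this notebook you would pass `estimator.latest_training_job.rule_job_summary` as `fetch_summary`.

```python
import time


def wait_for_rules(fetch_summary, poll_seconds=10, timeout_seconds=3600):
    """Poll until no rule reports 'InProgress', or raise after the timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        summary = fetch_summary()
        # Map each rule name to its current evaluation status.
        statuses = {
            rule["RuleConfigurationName"]: rule["RuleEvaluationStatus"]
            for rule in summary
        }
        if "InProgress" not in statuses.values():
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Rules still in progress: {statuses}")
        time.sleep(poll_seconds)
```

Taking a callable rather than the estimator itself keeps the helper easy to exercise with canned responses.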
    

Once the rule jobs start, the summary will include `RuleEvaluationJobArn` values. We can see the logs for each rule job in CloudWatch. To do that, we will use the following utility functions to get the CloudWatch links to the rule job logs.

In [9]:
def _get_rule_job_name(training_job_name, rule_configuration_name, rule_job_arn):
    """Helper function to get the rule job name with correct casing"""
    return "{}-{}-{}".format(
        training_job_name[:26], rule_configuration_name[:26], rule_job_arn[-8:]
    )
    
def _get_cw_url_for_rule_job(rule_job_name, region):
    return "https://{}.console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix".format(region, region, rule_job_name)


def get_rule_jobs_cw_urls(estimator):
    training_job = estimator.latest_training_job
    training_job_name = training_job.describe()["TrainingJobName"]
    rule_eval_statuses = training_job.describe()["DebugRuleEvaluationStatuses"]
    
    result={}
    for status in rule_eval_statuses:
        if status.get("RuleEvaluationJobArn", None) is not None:
            rule_job_name = _get_rule_job_name(training_job_name, status["RuleConfigurationName"], status["RuleEvaluationJobArn"])
            result[status["RuleConfigurationName"]] = _get_cw_url_for_rule_job(rule_job_name, boto3.Session().region_name)
    return result

get_rule_jobs_cw_urls(estimator)

{'VanishingGradient': 'https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=smdebug-tf2-model-to-estim-VanishingGradient-a81f7777;streamFilter=typeLogStreamPrefix',
 'LossNotDecreasing': 'https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=smdebug-tf2-model-to-estim-LossNotDecreasing-85e1058d;streamFilter=typeLogStreamPrefix'}
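
The prefix in these URLs follows from the truncation in `_get_rule_job_name`: the first 26 characters of the training job name, the first 26 characters of the rule configuration name, and the last 8 characters of the rule job ARN. The standalone sketch below reproduces that logic. `rule_job_prefix` is a hypothetical name, and the ARN is a made-up placeholder whose last eight characters match the output above.

```python
def rule_job_prefix(training_job_name, rule_name, rule_job_arn):
    """Mirror the notebook helper: truncate each part and join with hyphens."""
    return "-".join([training_job_name[:26], rule_name[:26], rule_job_arn[-8:]])


# Hypothetical ARN for illustration; only its last 8 characters matter here.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:processing-job/example-a81f7777"

prefix = rule_job_prefix(
    "smdebug-tf2-model-to-estimator-2020-06-25-20-02-23-993",
    "VanishingGradient",
    example_arn,
)
print(prefix)  # smdebug-tf2-model-to-estim-VanishingGradient-a81f7777
```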

## Data Analysis - Interactive Exploration

Now that we have run a training job and looked at the automated Debugger analysis through its rules, let's look at another aspect of Amazon SageMaker Debugger: interactive exploration of the saved tensors, either in real time or after the job has finished. Here we focus on after-the-fact analysis of the above job. We import the `smdebug` library and use it to create a trial that picks up the most recent training job. The following code cells show you how to fetch the path to the Debugger artifacts of the latest training job and access the saved tensors by collection name.

In [10]:
from smdebug.trials import create_trial
trial = create_trial(estimator.latest_job_debugger_artifacts_path())

[2020-06-25 20:19:48.381 ip-172-16-189-249:1818 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-920076894685/smdebug-tf2-model-to-estimator-2020-06-25-20-02-23-993/debug-output


The `smdebug` Trial class provides tools to look up the saved tensors by name. Here the names are auto-assigned by TensorFlow. In other frameworks the tensor names may differ, so we have to use the appropriate names, or a regex that matches them, to select tensors such as weights, biases, gradients, inputs, or outputs.
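
As a rough illustration of regex-based selection, the sketch below filters a list of tensor names with Python's `re` module. The names in the list are invented for the example and are not taken from this training job, and `filter_names` is a hypothetical helper. smdebug's `tensor_names` also accepts a `regex` argument that serves the same purpose.

```python
import re


def filter_names(names, pattern):
    """Return the names that contain a match for the regex `pattern`."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


# Invented tensor names, for illustration only.
names = ["dense/kernel:0", "dense/bias:0", "gradients/dense/MatMul_grad", "loss"]
print(filter_names(names, r"gradients/"))
```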

For simple examples of fetching the tensors in the following code cells, we print the total number of tensors saved for losses, weights, and gradients.

In [11]:
len(trial.tensor_names())

[2020-06-25 20:19:50.662 ip-172-16-189-249:1818 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2020-06-25 20:19:51.685 ip-172-16-189-249:1818 INFO trial.py:210] Loaded all steps


480

In [15]:
len(trial.tensor_names(collection="losses"))

50

In [16]:
len(trial.tensor_names(collection="weights"))

160

In [17]:
len(trial.tensor_names(collection="gradients"))

214
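
Beyond counting tensor names, a trial can report the steps at which a tensor was saved and its value at each step. The sketch below collects those values for every tensor in a collection. It relies on `trial.tensor(name).steps()` and `trial.tensor(name).value(step)` from the smdebug trial API; `collect_values` is a hypothetical helper and is not part of smdebug.

```python
def collect_values(trial, collection):
    """Map each tensor name in `collection` to a list of (step, value) pairs."""
    history = {}
    for name in trial.tensor_names(collection=collection):
        tensor = trial.tensor(name)
        # steps() lists the steps at which this tensor was saved.
        history[name] = [(step, tensor.value(step)) for step in tensor.steps()]
    return history
```

With the trial created earlier in this notebook, `collect_values(trial, "losses")` would return the saved loss values keyed by tensor name, which can then be plotted or inspected step by step.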