# Debugging SageMaker Training Jobs In Real Time with Tornasole

## Overview

Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs. It lets you go beyond scalars such as loss and accuracy and gives you full visibility into all tensors 'flowing through the graph' during training. Tornasole also helps you monitor training in near real time: rules run against the saved tensors and alert you once they detect an inconsistency in the training flow.

Using Tornasole is a two-step process: saving tensors and analysis. Let's look at each step closely.

### Saving tensors

Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis.

### Analysis

There are two ways to get to tensors and run analysis on them. One way is to use a concept called ***Rules***. Please refer to [DeveloperGuide_Rules.md](../../../../rules/DeveloperGuide_Rules.md) for more details about the rules-based approach to analysis. The focus of this notebook is the other way of analysis: **Manual**.

Manual analysis is what you use when no rule is available to detect the type of issue you are running into and you need to get to the raw tensors in order to understand what data is travelling through your model during training and, hopefully, root-cause a problem or two with your training job.

Manual analysis is powered by the Tornasole API, a framework that lets you retrieve tensors and scalars (i.e. debugging data) saved during a training job with a few lines of code. One of its most powerful features is real-time access to data: you can get tensors and scalars ***while your training job is running***.

This example guides you through installing the components required for emitting tensors in a SageMaker training job and using the Tornasole API to access those tensors while training is running. We will use a small Gluon CNN model and train it on the FashionMNIST dataset. While the job is running, we will retrieve the activations of the first convolutional layer every 100 batches and visualize them. We will also visualize the weights of that layer after the job is done.

## Setup

As a first step, we'll install the tools that allow emission of tensors (saving tensors) and provide access to the Tornasole API to retrieve them.

In [None]:
!aws s3 sync s3://tornasole-external-preview-use1/ ./smdebug
!pip install ./smdebug/sdk/ts-binaries/tornasole_mxnet/py3/latest/tornasole-0.3.4-py2.py3-none-any.whl --user
!pip -q install ./smdebug/sdk/sagemaker-tornasole-latest.tar.gz
!aws configure add-model --service-model file:///home/ec2-user/SageMaker/smdebug/sdk/sagemaker-smdebug.json --service-name sagemaker

## Training MXNet models in SageMaker with Tornasole

We'll train a small MXNet CNN model on the FashionMNIST dataset in this notebook with Tornasole enabled. This will be done using the SageMaker MXNet 1.4.1 container in Script Mode. Note that Tornasole currently only works with Python 3, so be sure to set `py_version='py3'` when creating the SageMaker Estimator.

Let us first train with a simple training script, `mnist_gluon_realtime_visualize_demo.py`, with Tornasole enabled in SageMaker using the SageMaker Estimator API. In this example, for simplicity's sake, Tornasole will capture all tensors, as specified in its configuration, every 100 steps (1 step is 1 batch). While the training job is running, we will use the Tornasole API to access saved tensors in real time and visualize them. We will rely on Tornasole to download a fresh set of tensors every time we query for them.

## Enable Tornasole in the training script

Integrating Tornasole into the training job can be accomplished by following the steps below.

### Import the hook package
Import the `SessionHook` class along with other helper classes in your training script, as shown below.

```
from smdebug.mxnet import SessionHook
from smdebug import SaveConfig, modes
```

### Instantiate and initialize hook

```
    # Create SaveConfig object that instructs engine to log graph tensors every 100 steps (1 step == 1 batch).
    save_config = SaveConfig(save_interval=100)
    # Create a hook that logs ***all*** tensors while training the model.
    hook = SessionHook(save_config=save_config, save_all=True)
```
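To make the effect of `save_interval=100` concrete, the sketch below lists which steps would be captured over a short run. It is plain Python and does not call smdebug. The helper name `steps_to_save` is hypothetical, and the sketch assumes a step is saved whenever its index is a multiple of the interval, so that step 0 is included.

```python
def steps_to_save(total_steps, save_interval):
    """Return the step indices captured when saving every `save_interval` steps.

    Assumption for illustration only: a step is saved when
    step % save_interval == 0, so step 0 is always included.
    """
    if save_interval <= 0:
        raise ValueError("save_interval must be positive")
    return [step for step in range(total_steps) if step % save_interval == 0]


# A 60,000-sample dataset at batch size 256 yields 235 batches per epoch.
saved = steps_to_save(total_steps=235, save_interval=100)
print(saved)
```

Under that assumption, a smaller interval produces more saved steps and therefore more data to upload and inspect, which is the trade-off to keep in mind when choosing the value.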

### Register the Tornasole hook with the model before training starts

<span style='color:red'>*NOTE: The hook can only be registered to Gluon non-hybrid models.*</span>

After creating or loading the desired model, you can register the Tornasole hook with the model as shown below.

```
# Create a Gluon Model.
net = create_gluon_model()

# Create a hook for logging all tensors.
hook = create_hook()

# Apply the hook to the model (i.e. instruct the engine to recognize the hook
# configuration and enable the mode in which it logs graph tensors).
hook.register_hook(net)
```

#### Set the mode
Tornasole has the concept of modes (TRAIN, EVAL, PREDICT) to separate the different phases of a job.
Set the mode your job is currently running in, and set it again every time the mode changes. This lets you group steps by mode for easier analysis. Setting the mode is optional but recommended. If you do not set it, Tornasole saves all steps under a `GLOBAL` mode.
```
hook.set_mode(modes.TRAIN)
```
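The grouping that modes provide can be pictured with a small stand-alone sketch. It does not use smdebug: the `Mode` enum and the `StepRecorder` class are hypothetical stand-ins that show how steps recorded before any mode is set fall under `GLOBAL`, while steps recorded after a mode switch are kept separately.

```python
from collections import defaultdict
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for the modes described above."""
    TRAIN = "train"
    EVAL = "eval"
    PREDICT = "predict"
    GLOBAL = "global"


class StepRecorder:
    """Toy recorder that files each saved step under the current mode."""

    def __init__(self):
        self.mode = Mode.GLOBAL  # used when no mode has been set
        self.steps_by_mode = defaultdict(list)

    def set_mode(self, mode):
        self.mode = mode

    def record_step(self, step):
        self.steps_by_mode[self.mode].append(step)


recorder = StepRecorder()
recorder.record_step(0)          # no mode set yet, filed under GLOBAL
recorder.set_mode(Mode.TRAIN)
recorder.record_step(0)
recorder.record_step(100)
recorder.set_mode(Mode.EVAL)
recorder.record_step(0)

print({mode.name: steps for mode, steps in recorder.steps_by_mode.items()})
```

Keeping a separate list per mode is what later allows an analysis to ask for training steps only and ignore validation steps.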

Refer to [DeveloperGuide_MXNet.md](../../DeveloperGuide_MXNet.md) for more details on the APIs Tornasole provides to help you save tensors.

### Docker Images with Tornasole

We have built SageMaker MXNet containers with smdebug installed. You can use them from ECR in SageMaker. The cell below builds the image names; please use the image from the region in which you want your jobs to run.

In [None]:
%load_ext autoreload
%autoreload 2
import sagemaker
import boto3
import os
from sagemaker.mxnet import MXNet
from smdebug.mxnet import modes

# Below changes the region to be one where this notebook is running
TAG='latest'
REGION = boto3.Session().region_name
os.environ['AWS_REGION'] = REGION

cpu_docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-cpu:{}'.format(REGION, TAG)
# gpu_docker_image_name = '072677473360.dkr.ecr.{}.amazonaws.com/tornasole-preprod-mxnet-1.4.1-gpu:{}'.format(REGION, TAG)

### Configuring the inputs for the training job

Now we'll call the SageMaker MXNet Estimator to kick off a training job with Tornasole functionality enabled.

The *entry_point_script* points to the MXNet training script that has the SessionHook integrated.

The *hyperparameters* are the parameters that will be passed to the training script.

In [None]:
entry_point_script = '../scripts/mnist_gluon_realtime_visualize_demo.py'
hyperparameters = {'batch-size': 256, 'learning_rate': 0.1, 'epochs': 10}
base_job_name = 'mxnet-TS-realtime-analysis'

In [None]:
sagemaker_simple_estimator = MXNet(role=sagemaker.get_execution_role(),
                                base_job_name=base_job_name,
                                train_instance_count=1,
                                train_instance_type='ml.m4.xlarge',
                                image_name=cpu_docker_image_name,
                                entry_point=entry_point_script,
                                hyperparameters=hyperparameters,
                                framework_version='1.4.1',
                                py_version='py3',
                                # The following parameter is necessary to instruct SageMaker
                                # that debugging data generated by Tornasole needs to be
                                # uploaded to an S3 bucket in your account. This way we can
                                # access it while the training job is running.
                                debug=True)

In [None]:
# This is a fire and forget event. By setting wait=False, we just submit the job to run in the background.
# SageMaker will spin off one training job and release control to next cells in the notebook.
# Please follow this notebook to see status of the training job.
sagemaker_simple_estimator.fit(wait=False)


### Result

As a result of the above command, SageMaker will spin off one training job for you, and it will produce the tensors to be analyzed. The job runs in the background, so you can continue with the rest of the notebook without waiting for it to complete. Because of this asynchronous nature, we need to monitor the job's status so that we don't request debugging tensors too early. Tensors are only produced during the training phase of a SageMaker training job, so let's wait until that phase begins.
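The waiting pattern described here can be sketched independently of SageMaker. The helper below is a minimal illustration, not part of any AWS SDK: `wait_for_status` and its parameters are hypothetical, and the status sequence is simulated rather than fetched from a service. It polls a callable until the status is either the one we want or a terminal one, and gives up after a timeout.

```python
import time


def wait_for_status(describe, ready, terminal, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll `describe()` until its status is in `ready` or `terminal`.

    Returns the last status seen. Raises TimeoutError if neither set is
    reached within `timeout_seconds`.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()
        if status in ready or status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("last status: {}".format(status))
        time.sleep(poll_seconds)


# Simulated sequence of statuses, standing in for a real service call.
fake_statuses = iter(["Starting", "Downloading", "Training"])
final = wait_for_status(
    describe=lambda: next(fake_statuses),
    ready={"Training"},
    terminal={"Completed", "Failed", "Stopped"},
    poll_seconds=0.01,
)
print(final)
```

Treating failure states as exit conditions matters in practice: a loop that only waits for the "ready" status would never return if the job fails before training begins.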

### Checking on the training job status

We can check the status of the training job by running the following code. It checks the status of the SageMaker training job every five seconds. Once the job has started its training cycle, control is released to the next cells in the notebook.

In [None]:
# A helper method first, to render status updates on a single line
import time
import sys
from time import gmtime, strftime

def print_same_line(s):
    sys.stdout.write('\r{}: {}'.format(strftime('%X', gmtime()), s))
    sys.stdout.flush()

In [None]:
# Below command will give the status of training job
# Note: In the output of below command you will see DebugConfig parameter 
import time

job_name = sagemaker_simple_estimator.latest_training_job.name
print('Training job name: ' + job_name)

client = sagemaker_simple_estimator.sagemaker_session.sagemaker_client

description = client.describe_training_job(TrainingJobName=job_name)

# Poll until the training phase begins or the job leaves the 'InProgress' state,
# so that a failed or stopped job cannot keep this loop running forever.
while description['TrainingJobStatus'] == 'InProgress' and description['SecondaryStatus'] != 'Training':
    description = client.describe_training_job(TrainingJobName=job_name)
    primary_status = description['TrainingJobStatus']
    secondary_status = description['SecondaryStatus']
    print_same_line('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]'.format(primary_status, secondary_status))
    time.sleep(5)

# uncomment next line to see full details of training job 
# client.describe_training_job(TrainingJobName=job_name)

### Retrieving and Analyzing tensors

Before getting to analysis, here are some notes on the concepts used in Tornasole that help with analysis.
- ***Trial*** - the centerpiece of the Tornasole API when it comes to accessing tensors. It is a top-level abstraction that represents a single run of a training job. All tensors emitted by a training job are associated with its Trial.
- ***Step*** - the next level of abstraction. In Tornasole, a step represents a single batch of a training job. Each trial has multiple steps, and each tensor is associated with multiple steps, having a particular value at each of them.
- ***Tensor*** - an object that represents an actual tensor saved during a training job. *Note* - it could be a scalar as well.
- ***Mode*** - a DL engine does forward and backward passes during a training job, and the tensors generated by the model during those passes are saved. In addition to training itself, the engine also uses forward passes for the validation phase, and tensors generated during those passes are saved too. Tornasole introduces the concept of Mode to differentiate between the tensors of each phase.
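How these four concepts relate can be shown with a toy in-memory model. This is only a sketch for building intuition: `ToyTrial` and `ToyTensor` are hypothetical classes, not the Tornasole API, and modes are represented by plain strings.

```python
class ToyTensor:
    """Holds one value per (mode, step) pair for a single named tensor."""

    def __init__(self, name):
        self.name = name
        self._values = {}  # mode -> {step: value}

    def save(self, mode, step, value):
        self._values.setdefault(mode, {})[step] = value

    def steps(self, mode):
        """Steps at which this tensor was saved in the given mode."""
        return sorted(self._values.get(mode, {}))

    def value(self, step, mode):
        return self._values[mode][step]


class ToyTrial:
    """Represents one training run and owns every tensor saved by it."""

    def __init__(self):
        self._tensors = {}

    def tensor(self, name):
        if name not in self._tensors:
            self._tensors[name] = ToyTensor(name)
        return self._tensors[name]

    def tensor_names(self):
        return sorted(self._tensors)


toy_trial = ToyTrial()
loss = toy_trial.tensor("loss")
loss.save("TRAIN", 0, 2.30)    # a scalar is stored like any other tensor
loss.save("TRAIN", 100, 0.85)
loss.save("EVAL", 0, 0.90)

print(toy_trial.tensor_names(), loss.steps("TRAIN"), loss.value(100, "TRAIN"))
```

The same tensor name has independent step lists for training and evaluation, which is why the retrieval code in this notebook always passes a mode alongside the step number.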

For more details on the concepts above, as well as on the Tornasole API in general (including examples), please refer to [Rules API](../../docs/rules/readme.md).

Below, you can find several methods to help with retrieving and plotting tensors. In *get_data* we use the concepts described above to retrieve data. It expects `steps_range` to contain one or more steps (batches) for which we want tensors. Please note that we retrieve only tensors saved by the main training loop and exclude tensors saved during validation (`mode=modes.TRAIN` takes care of that). The two other methods are helpers to plot tensors.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def get_data(trial, tname, batch_index, steps_range):
    tensor = trial.tensor(tname)
    vals = []
    for s in steps_range:
        val = tensor.value(step_num=s, mode=modes.TRAIN)[batch_index][0]
        vals.append(val)
    return vals

def create_plots(steps_range):
    fig, axs = plt.subplots(nrows=1, ncols=len(steps_range), constrained_layout=True, figsize=(2*len(steps_range), 2),
                            subplot_kw={'xticks': [], 'yticks': []})
    return fig, axs

def plot_tensors(trial, layer, batch_index, steps_range):
    if len(steps_range) > 0:    
        fig, axs = create_plots(steps_range)
        vals = get_data(trial, layer, batch_index, steps_range)

        for ax, image, step in zip(axs.flat if isinstance(axs, np.ndarray) else np.array([axs]), vals, steps_range):
            ax.imshow(image, cmap='gray')
            ax.set_title(str(step))
        plt.show()

Now that we have methods to get data and plot it, let's get to it. The goal of the next block is to instantiate a ***Trial***, the central access point for all Tornasole API calls to get tensors. We do that by inspecting the currently running training job and extracting the necessary parameters from its debug config to tell Tornasole where the data we are looking for is located. A couple of notes here:
- Tensors are stored in your own S3 bucket, which you can navigate to and manually inspect if desired.
- You might notice a slight delay before the trial object is created (last line in the cell). This is normal: Tornasole monitors the corresponding bucket and waits until tensors appear in it. The delay is caused by the less-than-instantaneous upload of tensors from the training container to your S3 bucket.

In [None]:
import os
from urllib.parse import urlparse
import smdebug.trials
from smdebug.trials import S3Trial
import logging

description = client.describe_training_job(TrainingJobName=job_name)
s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"]
parse_result = urlparse(s3_output_path)
bucket_name = parse_result.netloc
prefix_name = parse_result.path.strip('/')

logging.getLogger("tornasole").setLevel(logging.INFO)

# this is where we create a Trial object that allows access to saved tensors
trial = S3Trial(base_job_name, bucket_name, prefix_name)

In [None]:
# feel free to inspect all tensors logged by uncommenting below line
# trial.tensor_names()

### Visualize tensors of a running training job
Now to the final part of our example. Below we wait until Tornasole has downloaded an initial chunk of tensors for us to look at. Once that first chunk is ready, we keep fetching new chunks every 5 seconds and plot their tensors one under another.
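The bookkeeping behind "plot only the new chunk" can be tried in isolation first. The sketch below is plain Python with simulated data: `new_steps` is a hypothetical helper, and the `snapshots` list stands in for what successive polls might return while a job is running.

```python
def new_steps(available, rendered):
    """Return steps in `available` that are not yet in `rendered`, ascending."""
    return sorted(set(available) - set(rendered))


# Simulated snapshots of the steps visible at four successive polls.
snapshots = [[0], [0, 100], [0, 100], [0, 100, 200, 300]]

rendered = []
batches = []
for available in snapshots:
    fresh = new_steps(available, rendered)
    if fresh:
        batches.append(fresh)  # in the notebook, this is where plotting happens
    rendered.extend(fresh)

print(batches)
```

Sorting the difference keeps the plots in training order, which a bare set operation does not guarantee.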

In [None]:
# Below we select the very first tensor from every batch.
# Feel free to modify this and select another tensor from the batch.
batch_index = 0

# Name of the tensor to retrieve data for.
# The variable is called `layer` because this tensor happens to be the output of the first convolutional layer.
layer = 'conv0_output0'

steps = []
while len(steps) == 0:
    # trial.steps returns all steps that have been downloaded by Tornasole to date.
    # It doesn't represent all steps that will be available once the training job is complete;
    # it is a snapshot of the current state of the system. If you call it after the training job
    # is done, you will get all available steps.
    steps = trial.steps(mode=modes.TRAIN)
    print_same_line('Waiting for tensors to become available...')
    time.sleep(3)
print('\nDone')

print('Getting tensors and plotting...')
rendered_steps = []
# trial.training_ended is a way to monitor the state of the training job as seen by smdebug.
# When SageMaker completes the training job, the trial becomes aware of it.
while not trial.training_ended():
    steps = trial.steps(mode=modes.TRAIN)
    # steps that are available but have not been rendered yet, in ascending order
    steps_to_render = sorted(set(steps) - set(rendered_steps))
    # plot only tensors from the newer chunk
    plot_tensors(trial, layer, batch_index, steps_to_render)
    rendered_steps.extend(steps_to_render)
    time.sleep(5)
print('\nDone')

### Additional visualizations

Now that we have finished plotting tensors during the training job run, let's plot some more. This time we get all of them at once, as the training job has finished and Tornasole is aware of all tensors emitted by it. Let's visualize the tensors representing the weights of the first convolutional layer (i.e. its kernels). By inspecting each row of plotted tensors from left to right, you can see how each kernel was "learning" its values.
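The helper `get_data` indexes each value as `[index][0]`, so when it is applied to a weight tensor the index selects a kernel rather than a sample in the batch. The sketch below shows that indexing on a toy nested list. It assumes the weights are laid out as (out_channels, in_channels, kernel_height, kernel_width), which is an assumption about the layer rather than something this notebook states, and `kernel_plane` is a hypothetical helper.

```python
# Toy weight value with the assumed layout
# (out_channels, in_channels, kernel_height, kernel_width) = (2, 1, 3, 3).
toy_weights = [
    [[[1, 0, -1], [1, 0, -1], [1, 0, -1]]],    # kernel 0
    [[[1, 1, 1], [0, 0, 0], [-1, -1, -1]]],    # kernel 1
]


def kernel_plane(weight_value, kernel_index, in_channel=0):
    """Return the 2D slice for one kernel and one input channel."""
    return weight_value[kernel_index][in_channel]


num_kernels = len(toy_weights)  # plays the role of shape[0]
planes = [kernel_plane(toy_weights, i) for i in range(num_kernels)]
for index, plane in enumerate(planes):
    print(index, plane)
```

With a single-channel input such as a grayscale image, the first input channel is the only one, so each kernel reduces to one 2D image per saved step.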

In [None]:
# Let's visualize weights of the first convolutional layer.
layer = 'conv0_weight'

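weight_tensor = trial.tensor(layer)
weight_steps = weight_tensor.steps(mode=modes.TRAIN)
# The first dimension of the weight tensor is iterated, producing one row of plots per index.
num_kernels = weight_tensor.value(step_num=weight_steps[0], mode=modes.TRAIN).shape[0]
for i in range(num_kernels):
    plot_tensors(trial, layer, i, weight_steps)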