# Visualizing Tensors 

### Overview

SageMaker Debugger is a new capability of Amazon SageMaker that allows debugging machine learning models. 
It lets you go beyond just looking at scalars like losses and accuracies during training and gives 
you full visibility into all the tensors 'flowing through the graph' during training. SageMaker Debugger helps you monitor your training in near real time using rules, and alerts you once it detects an inconsistency in the training flow.

Using SageMaker Debugger is a two-step process: saving tensors and analysis. In this notebook we will run an MXNet training job and configure SageMaker Debugger to store all tensors from this job. Afterwards we will visualize those tensors in our notebook.


### Dependencies
Before we begin, let us install the plotly library if it is not already present in the environment.
If the cell below installs the library for the first time, you'll have to restart the kernel and come back to the notebook.

In [None]:
! pip install plotly

### Configure and run the training job

Now we'll call the SageMaker MXNet Estimator to kick off a training job along with the VanishingGradient rule to monitor the job.

The variable `entry_point_script` points to the MXNet training script that has the SageMaker DebuggerHook integrated.


In [None]:
entry_point_script = '../scripts/mnist_gluon_save_all_demo.py'

In [None]:
import boto3
import os
import sagemaker
from sagemaker.mxnet import MXNet


REGION='us-west-2'
TAG='latest'

estimator = MXNet(role=sagemaker.get_execution_role(),
                  base_job_name='mxnet-trsl-test-nb',
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  entry_point=entry_point_script,
                  framework_version='1.4.1',
                  debug=True,
                  py_version='py3')

Start the training job:

In [None]:
estimator.fit()

### Get S3 location of tensors

We can retrieve the description of the training job, which includes its status, by running the following cell:

In [None]:
job_name = estimator.latest_training_job.name

client = estimator.sagemaker_session.sagemaker_client

description = client.describe_training_job(TrainingJobName=job_name)

print('downloading tensors from training job: ', job_name)
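
The dictionary returned by `describe_training_job` carries the job status under the `TrainingJobStatus` key. The following is a minimal sketch of how one might test for a finished job. It uses a hand-written stand-in dictionary so that it runs without AWS credentials; the helper `is_finished` and the job name are hypothetical and not part of the notebook.

```python
# Terminal values of TrainingJobStatus reported by the SageMaker API.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def is_finished(description):
    """Return True if the training job has reached a terminal state."""
    return description.get("TrainingJobStatus") in TERMINAL_STATES


# Stand-in for the response of client.describe_training_job(...)
example_description = {
    "TrainingJobName": "example-job",  # hypothetical job name
    "TrainingJobStatus": "InProgress",
}

print("Job finished:", is_finished(example_description))
```

In the notebook itself, the same check could be applied to the real `description` dictionary.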

We can retrieve the S3 location of the tensors by accessing the dictionary `description`:

In [None]:
path = description['DebugConfig']['DebugHookConfig']['S3OutputPath']

print('Tensors are stored in: ', path)
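
The lookup above raises a `KeyError` if the response does not have exactly this layout. As a defensive sketch, the hypothetical helper `find_s3_output_path` below also checks for a top-level `DebugHookConfig` key (an assumption about other API versions, not something this notebook confirms) and returns `None` when neither layout is present.

```python
def find_s3_output_path(description):
    """Return the S3 output path of the saved tensors, or None if absent."""
    # Layout used in this notebook: description['DebugConfig']['DebugHookConfig']
    hook_config = description.get("DebugConfig", {}).get("DebugHookConfig")
    if hook_config is None:
        # Assumed alternative layout: 'DebugHookConfig' at the top level
        hook_config = description.get("DebugHookConfig", {})
    return hook_config.get("S3OutputPath")


# Hand-written stand-in for a describe_training_job response
example = {
    "DebugConfig": {"DebugHookConfig": {"S3OutputPath": "s3://example-bucket/tensors"}}
}
print(find_s3_output_path(example))
```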

### Download tensors from S3

Now we will download the tensors from S3, so that we can visualize them in our notebook.

In [None]:
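# Use the last component of the S3 path as the local folder name;
# rstrip guards against an empty name when the path ends with "/"
folder_name = path.rstrip("/").split("/")[-1]
print('Downloading tensors into folder: ', folder_name)
os.system("aws s3 cp --recursive " + path + " " + folder_name)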
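
Deriving a local folder name from an S3 URI is sensitive to trailing slashes. The sketch below shows one way to split a URI into bucket, prefix and final path component using only the standard library. The helper `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, prefix, last_component)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI: " + uri)
    prefix = parsed.path.strip("/")
    # Last path component, falling back to the bucket name when the prefix is empty
    folder = prefix.split("/")[-1] if prefix else parsed.netloc
    return parsed.netloc, prefix, folder


bucket, prefix, folder = split_s3_uri("s3://example-bucket/jobs/example-job/")
print(bucket, prefix, folder)
```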

### Visualize
The main purpose of the `TensorPlot` class is to visualize the tensors in your network, for example to identify dead or saturated activations or to inspect the feature maps of the network.

To use `TensorPlot`, you need to supply the `regex` argument, which selects the tensors you are interested in. For example, if you are interested in activation outputs, supply the regex `.*relu|.*tanh|.*sigmoid`.
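
To get a feel for what such a regex selects, the sketch below applies it to a handful of made-up tensor names with Python's `re` module. The names are hypothetical, and matching from the start of the name with `re.match` is an assumption about how the pattern is applied.

```python
import re

# Hypothetical tensor names, for illustration only
tensor_names = [
    "conv0_relu_output_0",
    "dense0_tanh_output_0",
    "conv0_weight",
    "dense1_sigmoid_output_0",
    "softmaxcrossentropyloss0_output_0",
]

pattern = re.compile(r".*relu|.*tanh|.*sigmoid")

# re.match anchors at the start of each name (assumed matching behaviour)
selected = [name for name in tensor_names if pattern.match(name)]
print(selected)
```

A narrower pattern such as `.*relu_output` selects fewer tensors.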

Another important argument is `batch_sample_id`, which selects the sample within the batch to display. For example, given an input tensor of size (batch_size, channel, width, height), `batch_sample_id = n` will display the slice (n, channel, width, height). If you set `batch_sample_id = -1`, the tensors will be summed over the batch dimension (i.e., `np.sum(tensor, axis=0)`). If `batch_sample_id` is None, each sample will be plotted as a separate layer in the figure.
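
The three modes described above can be sketched in NumPy as follows. This is an illustration of the described behaviour, not the code used inside `TensorPlot`; the helper `select_batch` is hypothetical.

```python
import numpy as np

# Example activation tensor of shape (batch_size, channel, width, height)
tensor = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)


def select_batch(tensor, batch_sample_id):
    """Mimic the three described modes (a sketch, not TensorPlot's code)."""
    if batch_sample_id is None:
        return tensor                    # keep every sample
    if batch_sample_id == -1:
        return np.sum(tensor, axis=0)    # sum over the batch dimension
    return tensor[batch_sample_id]       # pick a single sample


print(select_batch(tensor, 0).shape)     # one sample
print(select_batch(tensor, -1).shape)    # summed over the batch
print(select_batch(tensor, None).shape)  # all samples
```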

Here are some interesting use cases:

1) If you want to determine dead or saturated activations, for instance ReLUs that always output zero, then you should sum over the batch dimension (`batch_sample_id=-1`). The sum indicates which parts of the network are inactive across a batch.

2) If you are interested in the feature maps for the first image in the batch, then you should provide `batch_sample_id=0`. This can be helpful if your model is not performing well for a certain set of samples and you want to understand which activations are leading to mispredictions.
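
The first use case can be illustrated numerically. The sketch below builds synthetic ReLU outputs, deliberately zeroes one channel, and then uses the sum over the batch dimension to find units that never fired. All data here is synthetic and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ReLU outputs of shape (batch_size, channel, width, height)
activations = np.maximum(rng.normal(size=(8, 4, 5, 5)), 0.0)
activations[:, 2] = 0.0  # make channel 2 "dead" on purpose

# Sum over the batch dimension, as described for batch_sample_id=-1
summed = np.sum(activations, axis=0)

# ReLU outputs are non-negative, so a zero sum means the unit
# was inactive for every sample in the batch
dead_mask = summed == 0
dead_fraction_per_channel = dead_mask.mean(axis=(1, 2))
print(dead_fraction_per_channel)
```

A channel with a dead fraction of 1.0 produced no activation for any sample in the batch.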

An example visualization of layer outputs:
![](tensorplot.gif)


`TensorPlot` normalizes tensor values to the range 0 to 1, which means colorscales are the same across layers. Blue indicates values close to 0 and yellow indicates values close to 1. This class has been designed to plot convolutional networks that take 2D images as input and predict classes or produce output images. You can use it for other types of networks like RNNs, but you may have to adjust the class, as it currently ignores tensors that have more than 4 dimensions.
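
As an illustration of why normalization makes colorscales comparable, the sketch below applies per-tensor min-max scaling to two layers with very different value ranges. Min-max scaling is an assumption made for this example; the text does not state which normalization scheme `TensorPlot` uses, and the helper `normalize` is hypothetical.

```python
import numpy as np


def normalize(tensor):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = tensor.min(), tensor.max()
    if high == low:
        # Constant tensor: avoid division by zero
        return np.zeros_like(tensor, dtype=np.float64)
    return (tensor - low) / (high - low)


layer_a = np.array([[0.0, 5.0], [10.0, 2.5]])
layer_b = np.array([[-1.0, 0.0], [1.0, 3.0]])

for layer in (layer_a, layer_b):
    scaled = normalize(layer)
    print(scaled.min(), scaled.max())
```

After scaling, both layers span the same interval, so a single colorscale can be shared between them.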

Let's plot the ReLU output activations for the given MNIST training example.

In [1]:
import tensor_plot 

visualization = tensor_plot.TensorPlot(
    regex=".*relu_output",
    path=folder_name,
    steps=10,
    batch_sample_id=0,
    color_channel=1,
    title="Relu outputs",
    label=".*sequential0_input_0",
    prediction=".*sequential0_output_0"
)

[2019-11-13 10:52:31.483 186590da3c0d.ant.amazon.com:3874 INFO local_trial.py:35] Loading trial  at path /tmp/mxnet3/
[2019-11-13 10:52:31.516 186590da3c0d.ant.amazon.com:3874 INFO trial.py:191] Training has ended, will refresh one final time in 1 sec.
[2019-11-13 10:52:32.519 186590da3c0d.ant.amazon.com:3874 INFO trial.py:203] Loaded all steps


Plotting too many layers can crash the notebook. If you encounter performance or out-of-memory issues, either reduce the number of layers to plot by changing the `regex`, or run this notebook in JupyterLab instead of Jupyter.

In [2]:
visualization.fig.show(renderer="iframe")