#### Specify the training job name and the region name
The training job name and the region name were generated at **Step 3** of the `aws_sagemaker_profiler_example_*.ipynb` notebook from which you started the training job.

In [1]:
training_job_name = 'profiler-gpu-2020-06-23-22-29-05-483'
region = 'us-east-2'

# Amazon SageMaker Profiler Dashboard

This notebook creates the dashboard for the provided SageMaker training job.

# 1. Install Dependencies


###  `smdebug`

The first thing you will need to do is install the private beta versions of the SageMaker Python SDK and the `smdebug` library. This will enable you to call a private version of the API that allows you to create SageMaker training jobs with the profiler enabled.

In [None]:
! which pip
! pip --version
! pip uninstall smdebug --yes
! pip install ../sdk/smdebug-0.9.3b20201015-py2.py3-none-any.whl


Install Bokeh, pinned to version 2.1.1:

In [None]:
! pip uninstall -y bokeh
! pip install bokeh==2.1.1

### Read profiler data: system metrics and framework metrics

Once the training job is running, SageMaker collects system and framework metrics. The following code cell waits for the system and framework metrics to become available in S3. Once they are available, you can query and plot them.

In [None]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()
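
Conceptually, waiting for the metrics amounts to polling until a condition holds. The following is a minimal sketch of that pattern, not the `smdebug` implementation: `wait_until`, `data_is_available`, and the timing parameters are hypothetical names chosen for illustration.

```python
import time


def wait_until(predicate, poll_interval_s=1.0, timeout_s=60.0, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


# Hypothetical stand-in for "are the metric files in S3 yet?":
# it answers False twice and then True.
_answers = iter([False, False, True])


def data_is_available():
    return next(_answers)


available = wait_until(data_is_available, poll_interval_s=0.01, timeout_s=5.0)
print("profiling data available:", available)
```

A timeout keeps the notebook from blocking forever if the training job never emits profiler data.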

## Summary view - GPU and CPU usage 

This notebook provides dashboards to aggregate and visualize the profiler data in real time.

`MetricsHistogram` computes a histogram of GPU and CPU utilization values. Bins range from 0 to 100. Good system utilization means that the center of the distribution lies between 80 and 90. For example:
<table><tr>
<td> <img src="images/histogram1.png" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="images/histogram2.png" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

The first image shows a good utilization pattern. The second indicates high fluctuations, because the distribution has spikes at 0 and 100. In multi-GPU training, dissimilar distributions of GPU utilization values across GPUs indicate an issue with workload distribution.
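
To make the two patterns concrete, here is a minimal sketch in plain Python that bins utilization values between 0 and 100 and labels the result. It is not how `MetricsHistogram` is implemented; the function names, the 10-bin layout, and the 25 % thresholds are assumptions made for illustration.

```python
def utilization_histogram(values, num_bins=10):
    """Count utilization values (0-100 percent) into equal-width bins."""
    width = 100 / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that a value of exactly 100 lands in the last bin.
        index = min(int(v // width), num_bins - 1)
        counts[index] += 1
    return counts


def describe(counts):
    """Label a 10-bin histogram as 'good', 'fluctuating', or 'other'.

    The thresholds are illustrative assumptions, not smdebug rules.
    """
    total = sum(counts)
    low_share = counts[0] / total    # share of samples in the lowest bin
    high_share = counts[-1] / total  # share of samples in the highest bin
    peak_bin = counts.index(max(counts))
    if low_share > 0.25 and high_share > 0.25:
        return "fluctuating"
    if peak_bin == 8:  # the bin covering 80 to 90 percent
        return "good"
    return "other"


# Synthetic samples: one steady GPU, one that jumps between idle and saturated.
steady = [82, 85, 88, 84, 86, 79, 91, 87, 83, 85]
spiky = [0, 100, 2, 98, 0, 100, 1, 99, 50, 100]

print(describe(utilization_histogram(steady)))
print(describe(utilization_histogram(spiky)))
```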

The following cell plots one histogram per metric. To plot only specific metrics, define the lists `select_dimensions` and `select_events`. A dimension can be CPUUtilization, GPUUtilization, GPUMemoryUtilization, or IOPS. If no event is specified, then for CPU utilization a histogram is plotted for each core and for total CPU usage. For GPU, utilization and memory are visualized for each GPU. For IOPS, the I/O wait time per CPU is plotted. If `select_events` is specified, only metrics whose names match an entry in `select_events` are shown. If neither `select_dimensions` nor `select_events` is specified, all available metrics are visualized. You can also specify a start and end time.

In [None]:
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

metrics_histogram = MetricsHistogram(system_metrics_reader)
metrics_histogram.plot(starttime=0, 
                       endtime=system_metrics_reader.get_timestamp_of_latest_available_file(), 
                       select_dimensions=["CPU", "GPU"],
                       select_events=["total"])

## Step durations over time

SageMaker Debugger records the duration of each step, which is the time spent in one forward and backward pass. The following code cell plots step durations (y-axis) over the duration of the training job (x-axis). Typically, the step duration should be very similar across the training run. Significant outliers indicate a bottleneck. `StepTimelineChart` helps to identify whether such outliers occur at regular intervals. The following image shows an example in which the step duration is mostly about 200 to 250 ms, but at every 10th step a spike occurs where the step duration is significantly higher (600 to 800 ms).  ![](images/step_duration.png)

In [None]:
tj.wait_for_framework_profiling_data_to_be_available()

In [21]:
from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart

framework_metrics_reader = tj.get_framework_metrics_reader()
framework_metrics_reader.refresh_event_file_list()

view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)
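
The check that the chart supports visually, namely whether slow steps recur at a fixed interval, can also be sketched numerically. This is a minimal illustration on synthetic data, not part of `smdebug`; `find_outlier_steps`, `regular_interval`, and the factor of 2 over the median are assumptions.

```python
from statistics import median


def find_outlier_steps(durations_ms, factor=2.0):
    """Return indices of steps whose duration exceeds `factor` times the median."""
    threshold = factor * median(durations_ms)
    return [i for i, d in enumerate(durations_ms) if d > threshold]


def regular_interval(indices):
    """Return the common gap between consecutive indices, or None if the gaps differ."""
    gaps = {b - a for a, b in zip(indices, indices[1:])}
    return gaps.pop() if len(gaps) == 1 else None


# Synthetic data shaped like the example above: 200 to 240 ms per step,
# with a 700 ms spike at every 10th step.
durations = [700.0 if step % 10 == 9 else 200.0 + (step % 5) * 10 for step in range(40)]

outliers = find_outlier_steps(durations)
print("outlier steps:", outliers)
print("interval between outliers:", regular_interval(outliers))
```

A constant interval between outliers hints at periodic work such as checkpointing or evaluation, whereas irregular outliers point more toward contention or input stalls.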

## Outliers in step duration

`StepHistogram` creates a histogram of step duration values. Significant outliers indicate a bottleneck. In contrast to `StepTimelineChart`, it makes clusters of step duration values easier to identify. As a simple example, the time spent in the training phase (forward and backward pass) will likely differ from the time spent in the validation phase (forward pass only), so we would expect at least two clusters.
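
As a rough illustration of the two-cluster expectation, the sketch below splits a list of step durations at the largest gap between sorted values. This is a deliberately simple heuristic on synthetic numbers, not what `StepHistogram` does internally; `split_at_largest_gap` is a hypothetical helper and needs at least two values.

```python
def split_at_largest_gap(durations_ms):
    """Split durations into two clusters at the largest gap between sorted values."""
    ordered = sorted(durations_ms)
    # Index i marks the gap between ordered[i] and ordered[i + 1].
    gap_index = max(range(len(ordered) - 1), key=lambda i: ordered[i + 1] - ordered[i])
    return ordered[: gap_index + 1], ordered[gap_index + 1:]


# Synthetic durations in ms: validation steps (forward pass only) are
# faster than training steps (forward and backward pass).
durations = [218, 221, 75, 219, 80, 217, 78, 222, 220, 77]

fast, slow = split_at_largest_gap(durations)
print("fast cluster:", fast)
print("slow cluster:", slow)
```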

In [4]:
from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram
tj.wait_for_framework_profiling_data_to_be_available()
framework_metrics_reader = tj.get_framework_metrics_reader()

framework_metrics_reader.refresh_event_file_list()

step_histogram = StepHistogram(framework_metrics_reader)
step_histogram.plot(starttime=step_histogram.last_timestamp - 5 * 1000 * 1000, endtime=step_histogram.last_timestamp, show_workers=True)
# select metrics can be given as regex, for example ["Forward-node", "Step", "Backward\(post-forward\)-node"]




 Profiler data from framework is available
Found recorded framework annotations. Latest available timestamp microsseconds_since_epoch is:1592956321784417 , human_readable_timestamp in utc: 2020-06-23T16:52:01:784417
StepHistogram created, last_timestamp found:1592956321784417
stephistogram getting events from 1592956316784417 to 1592956321784417
Total events fetched:711
Select metrics:['Step:ModeKeys', 'Forward-node', 'Backward\\(post-forward\\)-node']
Available_metrics: ['Step:ModeKeys.TRAIN-nodeid:21-algo-1', 'Step:ModeKeys.EVAL-nodeid:21-algo-1', 'DataIterator-nodeid:21-algo-1']
Filtered metrics:['Step:ModeKeys.TRAIN-nodeid:21-algo-1', 'Step:ModeKeys.EVAL-nodeid:21-algo-1']
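
The log lines above show selection patterns being applied to the available metric names. A minimal sketch of such regex filtering follows, assuming `re.search` semantics (a pattern may match anywhere in the name); whether `smdebug` matches this way is an assumption, and `filter_metrics` is a hypothetical helper.

```python
import re

# Patterns and names copied from the log output above.
select_metrics = ["Step:ModeKeys", "Forward-node", r"Backward\(post-forward\)-node"]
available_metrics = [
    "Step:ModeKeys.TRAIN-nodeid:21-algo-1",
    "Step:ModeKeys.EVAL-nodeid:21-algo-1",
    "DataIterator-nodeid:21-algo-1",
]


def filter_metrics(available, patterns):
    """Keep every metric name matched by at least one regular expression."""
    compiled = [re.compile(p) for p in patterns]
    return [name for name in available if any(c.search(name) for c in compiled)]


filtered_metrics = filter_metrics(available_metrics, select_metrics)
print(filtered_metrics)
```

The parentheses in `Backward\(post-forward\)-node` are escaped because unescaped parentheses form a regex group and would not match the literal characters in the metric name.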


## Timeline charts 

The following class creates timeline charts for utilization per CPU core and GPU. It shows the last 1000 datapoints, and the charts are updated by the last code cell at the end of this notebook. Once updated, you can inspect earlier datapoints by zooming out of the chart.

In [23]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

framework_metrics_reader.refresh_event_file_list()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts  = TimelineCharts(system_metrics_reader, 
                                       framework_metrics_reader,
                                       select_dimensions=["CPU", "GPU"],
                                       select_events=["total"])

You can use the BoxSelectTool to make a selection in the timeline chart.

<img src='images/boxselect.png' width="840" height="180" border="10" />


The following code cell identifies which time annotations have been recorded in the training job for the selected timerange:

In [24]:
# Note: replace the index range below with the range selected in the chart above
view_timeline_charts.find_time_annotations([700,710]) 

Selected timerange: 1592955857.345275 to 1592955859.3462992
Spent 0.26735199999999987 ms (cumulative time) in dataset::GetNext
Spent 0.221752 ms (cumulative time) in Step:ModeKeys.TRAIN_17687
Spent 0.218265 ms (cumulative time) in Step:ModeKeys.TRAIN_17688
Spent 0.217526 ms (cumulative time) in Step:ModeKeys.TRAIN_17689
Spent 0.215613 ms (cumulative time) in Step:ModeKeys.TRAIN_17690
Spent 0.217347 ms (cumulative time) in Step:ModeKeys.TRAIN_17691
Spent 0.219217 ms (cumulative time) in Step:ModeKeys.TRAIN_17692
Spent 0.217934 ms (cumulative time) in Step:ModeKeys.TRAIN_17693
Spent 0.217318 ms (cumulative time) in Step:ModeKeys.TRAIN_17694
Spent 0.218663 ms (cumulative time) in Step:ModeKeys.TRAIN_17695
Spent 0.219072 ms (cumulative time) in Step:ModeKeys.TRAIN_17696
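
The "cumulative time" in this output is a per-annotation sum over the selected range. A minimal sketch of that aggregation on made-up events follows; the tuple layout, the microsecond timestamps, and `cumulative_time_per_annotation` are assumptions for illustration, not the `smdebug` data model.

```python
from collections import defaultdict


def cumulative_time_per_annotation(events, start, end):
    """Sum durations per annotation name for events that start inside [start, end]."""
    totals = defaultdict(int)
    for name, event_start, event_end in events:
        if start <= event_start <= end:
            totals[name] += event_end - event_start
    return dict(totals)


# Hypothetical events as (name, start_us, end_us) tuples, in microseconds.
events = [
    ("dataset::GetNext", 0, 100_000),
    ("Step:ModeKeys.TRAIN_1", 100_000, 300_000),
    ("dataset::GetNext", 300_000, 500_000),
    ("Step:ModeKeys.TRAIN_2", 500_000, 750_000),
    ("dataset::GetNext", 2_500_000, 3_000_000),  # starts after the selected range
]

totals = cumulative_time_per_annotation(events, start=0, end=2_000_000)
for name, total_us in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"Spent {total_us} us (cumulative time) in {name}")
```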


The following cell creates a detailed view of framework metrics for the selected time range. To avoid running out of memory, it plots only the first 1000 datapoints.

In [None]:
# Note: replace the index range below with the range selected in the chart above
view_timeline_charts.plot_detailed_profiler_data([700,710]) 

## Heatmap 

The following code cell creates a heatmap in which each row corresponds to one metric (CPU core and GPU utilization) and the x-axis is the duration of the training job. It makes CPU bottlenecks easier to spot, for example when GPU utilization is low while utilization of one or more CPU cores is high.

For instance, the following example shows the heatmap of a training job that used 4 GPUs and 8 CPU cores. The first 4 rows show GPU utilization, and the remaining rows show the utilization of the CPU cores. Yellow indicates maximum utilization; purple means that utilization was 0. The GPUs have frequent stalled cycles where utilization drops to 0 while, at the same time, utilization of the CPU cores is at its maximum. This is a clear indication of a CPU bottleneck where the GPUs are waiting for data to arrive. Such a bottleneck can be caused, for instance, by pre-processing that is too compute-heavy.

![](images/heatmap.png)
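
The pattern described above (GPUs idle while CPU cores are saturated) can also be checked programmatically. The sketch below works on synthetic utilization rows; `find_cpu_bottleneck_samples` and the 10 % and 90 % thresholds are illustrative assumptions, not `smdebug` functionality.

```python
def find_cpu_bottleneck_samples(gpu_rows, cpu_rows, gpu_idle=10, cpu_busy=90):
    """Return sample indices where every GPU is near idle while some CPU core is near its maximum."""
    num_samples = len(gpu_rows[0])
    stalled = []
    for t in range(num_samples):
        all_gpus_idle = all(row[t] <= gpu_idle for row in gpu_rows)
        any_core_busy = any(row[t] >= cpu_busy for row in cpu_rows)
        if all_gpus_idle and any_core_busy:
            stalled.append(t)
    return stalled


# Synthetic utilization in percent: one row per device, one column per sample.
gpu_rows = [
    [95, 0, 92, 0, 94, 90],
    [93, 2, 90, 1, 95, 91],
]
cpu_rows = [
    [40, 100, 45, 99, 42, 50],
    [35, 98, 38, 100, 40, 30],
]

stalled_samples = find_cpu_bottleneck_samples(gpu_rows, cpu_rows)
print("samples that look CPU-bound:", stalled_samples)
```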

In [25]:
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap

system_metrics_reader.refresh_event_file_list()
view_heatmap = Heatmap(system_metrics_reader, plot_height=450)

## Run loop to fetch latest profiler data and update charts

In [26]:
from bokeh.io import push_notebook
from smdebug.profiler.utils import us_since_epoch_to_human_readable_time
import time 

last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file() 
description = tj.describe_training_job()

while description['TrainingJobStatus'] == "InProgress":
    system_metrics_reader.refresh_event_file_list()
    framework_metrics_reader.refresh_event_file_list()
    current_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()  
    description = tj.describe_training_job()
    
    if current_timestamp > last_timestamp:  
        
        print("New data available, updating dashboards. Current timestamp is", 
              us_since_epoch_to_human_readable_time(current_timestamp))
        
        view_timeline_charts.update_data(current_timestamp)
        push_notebook(handle=view_timeline_charts.target)

        view_step_timeline_chart.update_data(current_timestamp)
        push_notebook(handle=view_step_timeline_chart.target)

        view_heatmap.update_data(current_timestamp)
        push_notebook(handle=view_heatmap.target) 

        metrics_histogram.update_data(current_timestamp)
        push_notebook(handle=metrics_histogram.target)

        step_histogram.update_data(current_timestamp)
        push_notebook(handle=step_histogram.target)

        last_timestamp = current_timestamp
    time.sleep(10)