# AWS SageMaker Profiler Example

This notebook will walk you through creating a training job with the profiler feature enabled.

# 1. Install Dependencies


### SageMaker Python SDK and `smdebug`

The first thing you will need to do is install the private beta versions of the SageMaker Python SDK and the `smdebug` library. This will enable you to call a private version of the API that allows you to create SageMaker training jobs with the profiler enabled.

In [None]:
! pip install ../sdk/sagemaker-1.60.3.dev0.tar.gz -q
! pip install ../sdk/smdebug-0.8.0b20200622-py3-none-any.whl

# The following command will enable the SDK to use new profiler configs in the API
! aws configure add-model --service-model file://../sdk/sagemaker-2017-07-24.normal.json --service-name sagemaker

If you run this notebook in JupyterLab rather than Jupyter, you need to install the JupyterLab extensions 
`@jupyter-widgets/jupyterlab-manager` and `@bokeh/jupyter_bokeh`. We provide a [SageMaker Lifecycle configuration](lifecycle_config/on_start.sh) that automatically installs these extensions when your notebook instance starts. See [this blog](https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/) for how to create and attach a Lifecycle configuration to your notebook instance.

# 2. Create A Training Job With Profiling Enabled

You will use the standard [SageMaker Estimator API for TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) to create training jobs. To create a training job with the profiler enabled, create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the `Estimator`.

### Define an Estimator

We define some hyperparameters: the number of epochs, the batch size, and whether data augmentation is enabled. Increasing the batch size leads to higher system utilization but may introduce a CPU bottleneck, because data preprocessing and augmentation are compute-heavy and a larger batch size means more data has to be preprocessed in the same amount of time. Alternatively, you can disable `data_augmentation` to see the impact on system utilization.

For demonstration purposes, we choose a set of data augmentation techniques that heavily increases CPU usage, leading to GPU starvation.

In [None]:
hyperparameters = {'epoch': 10, 
                   'batch_size': 64,
                   'data_augmentation': True}
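
In script mode, SageMaker passes each entry of `hyperparameters` to the entry-point script as a command-line argument (for example `--batch_size 64`), and the values arrive as strings. The contents of `demo/train.py` are not shown in this notebook, so the following is only a minimal sketch of how such a script might read these values. The names `str_to_bool` and `parse_hyperparameters` are hypothetical.

```python
import argparse


def str_to_bool(value):
    # Hyperparameters arrive as strings, so "True"/"False" must be parsed explicitly.
    return str(value).lower() in ("true", "1", "yes")


def parse_hyperparameters(argv=None):
    # Hypothetical parser mirroring the keys of the hyperparameters dict above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--data_augmentation", type=str_to_bool, default=True)
    # parse_known_args ignores any additional arguments passed to the script.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(
    ["--epoch", "10", "--batch_size", "64", "--data_augmentation", "False"]
)
print(args.epoch, args.batch_size, args.data_augmentation)
```

Parsing the boolean explicitly matters here: with `type=bool`, the string `"False"` would evaluate to `True`.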

In [None]:
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.profiler import ProfilerConfig 

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    image_name='385479125792.dkr.ecr.us-east-2.amazonaws.com/profiler-gpu:latest',
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    entry_point='train.py',
    source_dir='demo',
    framework_version='2.2.0',
    py_version='py37',
    profiler_config=ProfilerConfig(profiling_interval_millis=500),
    script_mode=True,
    hyperparameters=hyperparameters
)

### Start training job

In [None]:
estimator.fit(wait=False)

# 3. Read Profiler Data

### Get the S3 path where profiler data is stored

In [None]:
path = estimator.latest_job_profiler_artifacts_path()
path

### Read profiler data: system metrics

Once the training job is running, SageMaker collects system and framework metrics. The following code cell waits for the system metrics to become available in S3. Once they are available, you can query and plot them.

In [None]:
from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader
import time

system_metrics_reader = S3SystemMetricsReader(path)

sagemaker_client = boto3.client('sagemaker')
training_job_name = estimator.latest_training_job.name
print(f"Training job name: {training_job_name}")

training_job_status = ''
training_job_secondary_status = ''
# Poll S3 until the first system metrics file has been written
while system_metrics_reader.get_timestamp_of_latest_available_file() == 0:
    system_metrics_reader.refresh_event_file_list()
    # Fetch the current job status so it can be shown while waiting
    description = sagemaker_client.describe_training_job(
        TrainingJobName=training_job_name
    )
    if 'TrainingJobStatus' in description:
        training_job_status = f"TrainingJobStatus: {description['TrainingJobStatus']}"
    if 'SecondaryStatus' in description:
        training_job_secondary_status = f"TrainingJobSecondaryStatus: {description['SecondaryStatus']}"

    print(f"Profiler data from system not available yet. {training_job_status}. {training_job_secondary_status}.")
    time.sleep(20)

print("\n\nProfiler data from system is available")
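
The cell above polls at a fixed interval for as long as no data is found. A generic helper that also gives up after a timeout could look roughly like the sketch below. `wait_until` is a hypothetical helper and not part of `smdebug` or the SageMaker SDK; the default timeout and poll interval are arbitrary assumptions.

```python
import time


def wait_until(predicate, timeout_seconds=600, poll_seconds=20):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Stand-in predicate for illustration: succeeds on the third call.
attempts = []


def data_ready():
    attempts.append(1)
    return len(attempts) >= 3


found = wait_until(data_ready, timeout_seconds=5, poll_seconds=0)
print(found, len(attempts))
```

In this notebook the predicate would refresh the reader's file list and check whether `get_timestamp_of_latest_available_file()` is non-zero.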


Helper function to convert timestamps into UTC:

In [None]:
from datetime import datetime

def timestamp_to_utc(timestamp):
    utc_dt = datetime.utcfromtimestamp(timestamp)
    return utc_dt.strftime('%Y-%m-%d %H:%M:%S')
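
Later cells divide the reader's file timestamps by 1000000 before formatting them, which means those values are in microseconds. `datetime.utcfromtimestamp` is deprecated as of Python 3.12, so on newer interpreters a timezone-aware variant may be preferable. The following is a sketch of that alternative; the name `timestamp_us_to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def timestamp_us_to_utc(timestamp_us):
    # Convert microseconds since the Unix epoch to a formatted UTC string.
    seconds = timestamp_us / 1_000_000
    utc_dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return utc_dt.strftime("%Y-%m-%d %H:%M:%S")


print(timestamp_us_to_utc(1593000000000000))  # 2020-06-24 12:00:00
```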

Now that the data is available, we can query and inspect it. We get the latest available timestamp and query all events within that time range:

In [None]:
system_metrics_reader.refresh_event_file_list()
last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()
events = system_metrics_reader.get_events(0, last_timestamp) 

print("Found", len(events), "recorded system metric events. Latest recorded event:",  
      timestamp_to_utc(last_timestamp/1000000))


We can iterate over the list of recorded events. Let's look at the first event.

In [None]:
print("Event name:", events[0].name, 
      "\nTimestamp:", timestamp_to_utc(events[0].timestamp), 
      "\nValue:", events[0].value)

## Summary view - GPU and CPU usage 

This notebook provides dashboards to aggregate and visualize the profiler data in real-time. 

`MetricsHistogram` computes a histogram of GPU and CPU utilization values. Bins are between 0 and 100. Good system utilization means that the center of the distribution should be between 80 and 90. For example:
<table><tr>
<td> <img src="images/histogram1.png" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="images/histogram2.png" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

The first image shows a good utilization pattern. The second indicates high fluctuation, because the distribution has spikes at 0 and 100. In multi-GPU training, dissimilar distributions of utilization values across GPUs indicate an issue with workload distribution.
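
To make the two patterns concrete, here is a small sketch that bins utilization values and measures how much of the distribution sits at the two extremes. It is not how `MetricsHistogram` in `utils` is implemented; the function names, the bin count, and the sample values are assumptions chosen for illustration.

```python
def utilization_histogram(values, num_bins=10):
    # Count utilization values (0-100) into equally wide bins.
    counts = [0] * num_bins
    width = 100 / num_bins
    for value in values:
        # A value of exactly 100 is placed in the last bin.
        index = min(int(value // width), num_bins - 1)
        counts[index] += 1
    return counts


def edge_fraction(counts):
    # Share of samples in the lowest and highest bin. A large share
    # suggests fluctuation between idle and saturated.
    total = sum(counts)
    return (counts[0] + counts[-1]) / total if total else 0.0


steady = [82, 85, 88, 90, 84, 86, 83, 87]
fluctuating = [0, 100, 0, 99, 2, 100, 1, 98]

print(utilization_histogram(steady), edge_fraction(utilization_histogram(steady)))
print(utilization_histogram(fluctuating), edge_fraction(utilization_histogram(fluctuating)))
```

For the steady sample most of the mass falls in the 80 to 90 bin. For the fluctuating sample all of it falls into the first and last bins.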

In [None]:
from utils import MetricsHistogram  

system_metrics_reader.refresh_event_file_list()
metrics_histogram = MetricsHistogram(system_metrics_reader)

### Read profiler data: framework annotations

In [None]:
from smdebug.profiler.algorithm_metrics_reader import S3AlgorithmMetricsReader

framework_metrics_reader = S3AlgorithmMetricsReader(path)

events = []
while framework_metrics_reader.get_timestamp_of_latest_available_file() == 0 or len(events) == 0:
    framework_metrics_reader.refresh_event_file_list()
    last_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()
    events = framework_metrics_reader.get_events(0, last_timestamp)
    if len(events) > 0:
        # Data has arrived: skip the "not available" message and the extra wait
        break

    print("Profiler data from framework not available yet")
    time.sleep(20)

print("\n\nProfiler data from framework is available")

The following code cell retrieves all recorded events from Amazon S3.

In [None]:
framework_metrics_reader.refresh_event_file_list()
last_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()
events = framework_metrics_reader.get_events(0, last_timestamp) 

print("Found", len(events), "recorded framework annotations. Latest event recorded ",  
      timestamp_to_utc(last_timestamp/1000000))


As before, we can inspect the recorded events. Since we are reading framework metrics, each event now has a start and an end time.

In [None]:
print("Event name:", events[0].event_name, 
      "\nStart time:", timestamp_to_utc(events[0].start_time/1000000000), 
      "\nEnd time:", timestamp_to_utc(events[0].end_time/1000000000), 
      "\nDuration:", events[0].duration, "nanoseconds")


## Step durations over time

SageMaker Debugger records the duration of each step, which is the time spent in one forward and backward pass. The following code cell plots step durations (y-axis) over the duration of the training job (x-axis). Typically, we would expect the step duration to be very similar across the training run. Significant outliers indicate a bottleneck. `StepTimelineChart` helps to identify whether such outliers occur at regular intervals. The following image shows an example where the step duration is mostly about 200 to 250 ms, but every 10th step a spike occurs where the step duration is significantly higher (600-800 ms).  ![](images/step_duration.png)
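
The same check can be sketched numerically: flag steps that take much longer than the median and test whether they are evenly spaced. This is an illustration only and independent of `StepTimelineChart`. The function names, the threshold of twice the median, and the synthetic durations are assumptions.

```python
from statistics import median


def find_outlier_steps(durations_ms, factor=2.0):
    # Flag steps whose duration exceeds factor * median duration.
    threshold = factor * median(durations_ms)
    return [i for i, d in enumerate(durations_ms) if d > threshold]


def common_interval(indices):
    # Return the gap between consecutive outliers if it is constant, else None.
    gaps = {b - a for a, b in zip(indices, indices[1:])}
    return gaps.pop() if len(gaps) == 1 else None


# Synthetic data: about 220 ms per step, with a 700 ms spike every 10th step.
durations = [700 if step % 10 == 9 else 220 for step in range(40)]
outliers = find_outlier_steps(durations)
print(outliers, common_interval(outliers))
```

A constant gap between outliers points to something periodic in the training loop, such as checkpointing or a validation pass.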

In [None]:
from utils import StepTimelineChart

framework_metrics_reader.refresh_event_file_list()
view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)

## Outliers in step duration

`StepHistogram` creates a histogram of step duration values. Significant outliers indicate a bottleneck. In contrast to `StepTimelineChart`, it makes it easier to identify clusters of step duration values. As a simple example, the time spent per step during the training phase (forward and backward pass) will likely differ from the time spent during the validation phase (forward pass only), so we would expect at least two clusters.
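
As a rough illustration of what "two clusters" means, the sketch below splits a list of step durations into two groups with a simple one-dimensional 2-means procedure. It is not part of `StepHistogram`. The function name and the sample durations are hypothetical.

```python
def two_means(values, iterations=20):
    # Simple 1-D k-means with k=2, initialised at the minimum and maximum.
    low, high = min(values), max(values)
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        if not high_group:
            # All values are closest to one centre: no second cluster exists.
            break
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return low, high


# Hypothetical durations in ms: training steps near 200, validation steps near 80.
durations = [200, 210, 220, 205, 80, 85, 90, 75]
print(two_means(durations))  # (82.5, 208.75)
```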

In [None]:
from utils import StepHistogram

framework_metrics_reader.refresh_event_file_list()
step_histogram = StepHistogram(framework_metrics_reader)

## Timeline charts 

The following class creates timeline charts for utilization per CPU core and GPU. It shows the last 1000 datapoints, and the charts are updated by the last code cell at the end of this notebook. Once updated, you can inspect previous datapoints by zooming out of the chart.
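
How `TimelineCharts` keeps only the most recent datapoints is not shown in this notebook. One common way to hold a fixed-size window of recent samples is a bounded `collections.deque`, sketched here under that assumption:

```python
from collections import deque

# Keep only the 1000 most recent samples; older ones are dropped automatically.
window = deque(maxlen=1000)
for sample in range(1500):
    window.append(sample)

print(len(window), window[0], window[-1])  # 1000 500 1499
```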

In [None]:
from utils import TimelineCharts

view_timeline_charts  = TimelineCharts(system_metrics_reader, framework_metrics_reader)

You can use the BoxSelectTool to make a selection in the timeline chart.

<img src='images/boxselect.png' width="840" height="180" border="10" />


The following code cell identifies which time annotations were recorded in the training job for the selected time range:

In [None]:
view_timeline_charts.find_time_annotations([]) 

## Heatmap 

The following code cell creates a heatmap where each row corresponds to one metric (utilization of a CPU core or a GPU) and the x-axis is the duration of the training job. It makes it easier to spot CPU bottlenecks, for example when utilization on the GPUs is low while utilization of one or more CPU cores is high.

For instance, the following example shows the heatmap of a training job that used 4 GPUs and 8 CPU cores. The first 4 rows show GPU utilization, and the remaining rows show utilization of the CPU cores. Yellow indicates maximum utilization; purple means that utilization was 0. The GPUs have frequent stalled cycles where utilization drops to 0 while, at the same time, utilization on the CPU cores is at its maximum. This is a clear indication of a CPU bottleneck where the GPUs are waiting for data to arrive. Such a bottleneck can be caused, for instance, by overly compute-heavy preprocessing.

![](images/heatmap.png)
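
The pattern described above (GPUs idle while at least one CPU core is saturated) can also be checked programmatically. The following sketch works on plain lists and is independent of the `Heatmap` class. The function name, the thresholds of 10 and 90 percent, and the sample values are assumptions for illustration.

```python
def find_cpu_bottleneck_samples(gpu_util, cpu_util, gpu_low=10, cpu_high=90):
    """Return time-step indices where all GPUs are idle and a CPU core is saturated.

    gpu_util and cpu_util are lists of rows: one row per GPU or CPU core,
    one column per time step (the same layout as the heatmap).
    """
    num_steps = len(gpu_util[0])
    flagged = []
    for t in range(num_steps):
        gpus_idle = all(row[t] <= gpu_low for row in gpu_util)
        core_saturated = any(row[t] >= cpu_high for row in cpu_util)
        if gpus_idle and core_saturated:
            flagged.append(t)
    return flagged


# Hypothetical utilization values: 2 GPUs and 2 CPU cores over 4 time steps.
gpu_util = [[95, 0, 92, 3], [94, 2, 90, 0]]
cpu_util = [[60, 100, 55, 99], [58, 97, 50, 100]]
print(find_cpu_bottleneck_samples(gpu_util, cpu_util))  # [1, 3]
```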

In [None]:
from utils import Heatmap

view_heatmap = Heatmap(system_metrics_reader)

## Run loop to fetch latest profiler data and update charts

In [None]:
job_name = estimator.latest_training_job.name
print('Training job name: {}'.format(job_name))

client = estimator.sagemaker_session.sagemaker_client

description = client.describe_training_job(TrainingJobName=job_name)


In [None]:
from bokeh.io import push_notebook
import time 

last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file() 


while description['TrainingJobStatus'] == "InProgress":
    system_metrics_reader.refresh_event_file_list()
    framework_metrics_reader.refresh_event_file_list()
    current_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()  
    description = client.describe_training_job(TrainingJobName=job_name)
    
    if current_timestamp > last_timestamp:  
        
        print("New data available, updating dashboards. Current timestamp is", 
              timestamp_to_utc(current_timestamp/1000000))
        
        view_timeline_charts.update_data(current_timestamp)
        push_notebook(handle=view_timeline_charts.target)

        view_step_timeline_chart.update_data(current_timestamp)
        push_notebook(handle=view_step_timeline_chart.target)

        view_heatmap.update_data(current_timestamp)
        push_notebook(handle=view_heatmap.target) 

        metrics_histogram.update_data(current_timestamp)
        push_notebook(handle=metrics_histogram.target)

        step_histogram.update_data(current_timestamp)
        push_notebook(handle=step_histogram.target)

        last_timestamp = current_timestamp
    time.sleep(10)
