# AWS SageMaker Profiler Example

This notebook will walk you through creating a training job with the profiler feature enabled.

# 1. Install Dependencies


### SageMaker Python SDK and `smdebug`

The first thing you will need to do is install the private beta versions of the SageMaker Python SDK and the `smdebug` library. This will enable you to call a private version of the API that allows you to create SageMaker training jobs with the profiler enabled.

In [None]:
! pip install ../sdk/sagemaker-1.60.3.dev0.tar.gz -q
! pip install ../sdk/smdebug-0.8.0b20200622-py3-none-any.whl

# The following command will enable the SDK to use new profiler configs in the API
! aws configure add-model --service-model file://../sdk/sagemaker-2017-07-24.normal.json --service-name sagemaker

If you run this notebook in JupyterLab rather than the classic Jupyter Notebook, you need to install the JupyterLab extensions `@jupyter-widgets/jupyterlab-manager` and `@bokeh/jupyter_bokeh`. We provide a [SageMaker Lifecycle configuration](lifecycle_config/on_start.sh) that automatically installs these extensions when your notebook instance starts. See [this blog](https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/) for how to create a Lifecycle configuration and attach it to your notebook instance.

# 2. Create A Training Job With Profiling Enabled

You will use the standard [SageMaker Estimator API for TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) to create training jobs. To create a training job with the profiler enabled, create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the estimator.

### Define an Estimator

We define a few hyperparameters: the number of epochs, the batch size, and whether data augmentation is enabled. Increasing the batch size raises system utilization but may introduce a CPU bottleneck, because data preprocessing and augmentation are compute-heavy and a larger batch means more data must be preprocessed for each training step. Alternatively, you can disable `data_augmentation` to see its impact on system utilization.

For demonstration purposes, we choose a set of data augmentation techniques that heavily increases CPU usage, leading to GPU starvation.

In [None]:
hyperparameters = {'epoch': 10, 
                   'batch_size': 64,
                   'data_augmentation': True}
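
In script mode, SageMaker passes each hyperparameter to the entry point as a command-line argument, so the training script has to parse them itself. The demo's `train.py` is not shown in this notebook; the sketch below is a hypothetical illustration of how such a script could read the three hyperparameters above. The names `str2bool` and `parse_hyperparameters` are assumptions made for this example, not part of the demo code.

```python
import argparse


def str2bool(value):
    """Convert the string form of a boolean hyperparameter to a bool.

    Hyperparameters arrive as strings, and bool("False") is True in Python,
    so the conversion has to be explicit.
    """
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters used in this notebook (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--data_augmentation", type=str2bool, default=True)
    # Ignore any extra arguments that the training environment may add.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Calling `parse_hyperparameters(["--epoch", "10", "--batch_size", "64", "--data_augmentation", "True"])` would return a namespace whose fields mirror the dictionary defined above.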

### Set the region where this notebook is running

In [None]:
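import os
import boto3

# Use AWS_REGION if it is set; otherwise fall back to the default boto3 session's region.
# If neither is available, set the region by hand, e.g. region = 'us-east-1'.
region = os.environ.get('AWS_REGION') or boto3.Session().region_name
image_name = f'385479125792.dkr.ecr.{region}.amazonaws.com/profiler-gpu:latest'
print(f"image being used is {image_name}")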


In [None]:
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.profiler import ProfilerConfig 

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    image_name=image_name,
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    entry_point='train.py',
    source_dir='demo',
    framework_version='2.2.0',
    py_version='py37',
    profiler_config=ProfilerConfig(profiling_interval_millis=500),
    script_mode=True,
    hyperparameters=hyperparameters
)

### Start training job

In [None]:
estimator.fit(wait=False)
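
Because `fit(wait=False)` returns immediately, the job keeps running in the background. If you want to check on it from the notebook, one option is to poll the job status through the SageMaker `describe_training_job` API. The following is a minimal sketch, not part of the original example; the helper name `wait_for_training_job` and its polling parameters are assumptions chosen for illustration.

```python
import time

# Statuses after which a training job will not change state again.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll a training job until it reaches a terminal state (hypothetical helper).

    `client` is expected to behave like boto3.client("sagemaker").
    Returns the last observed status, which may still be non-terminal
    if `max_polls` is exhausted.
    """
    status = None
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("sagemaker", region_name=region)
# print(wait_for_training_job(client, estimator.latest_training_job.name))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object, without calling AWS.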

# 3. Analyse Profiling Data

Run the notebook `../profiler/profiler_generic_dashboard.ipynb` with the `training_job_name` and `region` values printed below.


In [None]:
training_job_name = estimator.latest_training_job.name
print(f"Training job name: {training_job_name}, region: {region}")
# The f prefix is required so that the placeholders are filled in.
print(f"Run notebook ../profiler/profiler_generic_dashboard.ipynb with parameters training_job_name: {training_job_name} and region: {region}")