# Amazon SageMaker EagleEye TensorFlow Training Job Example

This notebook will walk you through creating a TensorFlow training job with the SageMaker EagleEye feature enabled.

# 1. Install Dependencies


### SageMaker Python SDK

Install the private beta version of the SageMaker Python SDK library. This enables you to call a private version of the SageMaker EagleEye API that allows you to create SageMaker training jobs with the profiler enabled.

In [None]:
! pip install ../sdk/sagemaker-1.60.3.dev0.tar.gz -q

# The following command will enable the SDK to use new profiler configs in the API
! aws configure add-model --service-model file://../sdk/sagemaker-2017-07-24.normal.json --service-name sagemaker

If you run this notebook in JupyterLab rather than the classic Jupyter Notebook, you need to install the JupyterLab extensions `@jupyter-widgets/jupyterlab-manager` and `@bokeh/jupyter_bokeh`. We provide a [SageMaker Lifecycle configuration](lifecycle_config/on_start.sh) that automatically installs these extensions when your notebook instance starts. See [this blog](https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/) for how to create a Lifecycle configuration and attach it to your notebook instance.

# 2. Create a Training Job with Profiling Enabled<a class="anchor" id="option-1"></a>

You will use the standard [SageMaker Estimator API for TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) to create training jobs. To enable profiling, create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the `TensorFlow` estimator.

### Define hyperparameters

Define hyperparameters such as the number of epochs, the batch size, and whether to apply data augmentation. Increasing the batch size can raise system utilization, but it may also cause a CPU bottleneck, because preprocessing a large batch with augmentation is computationally heavy. You can set `data_augmentation` to `False` to see the impact on system utilization.

For demonstration purposes, the following hyperparameters are chosen to increase CPU usage, which leads to GPU starvation.

In [None]:
hyperparameters = {'epoch': 5, 
                   'batch_size': 64,
                   'data_augmentation': True}
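
In script mode, SageMaker passes hyperparameters to the entry point as command-line arguments. The `tf-hvd-train.py` script is not shown in this notebook, so the following is only a minimal sketch of how such a script could read the three values above. The helper names `str_to_bool` and `parse_hyperparameters` are hypothetical, and the sketch assumes the boolean arrives as the string `True` or `False`.

```python
import argparse


def str_to_bool(value):
    """Interpret strings such as 'True' or 'False' (any case) as booleans."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters defined in this notebook from a list of CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--data_augmentation", type=str_to_bool, default=True)
    # parse_known_args ignores any additional arguments the script may receive.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments that would correspond to the dictionary above.
args = parse_hyperparameters(
    ["--epoch", "5", "--batch_size", "64", "--data_augmentation", "True"]
)
print(args.epoch, args.batch_size, args.data_augmentation)
```

Parsing the boolean explicitly matters because `bool("False")` is `True` in Python, so a plain `type=bool` would silently keep augmentation enabled.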

### Set `region` and the `image_name` for your training job with the TensorFlow framework

In [None]:
import boto3

session = boto3.session.Session()
region = session.region_name

# EagleEye beta version base image for TensorFlow
image_name = f'385479125792.dkr.ecr.{region}.amazonaws.com/profiler-gpu:latest'
print(f"image being used is {image_name}")
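
`session.region_name` is `None` when no default region is configured, in which case the f-string above would silently produce an invalid image URI. As a sketch, a small helper (hypothetical name `ecr_image_uri`) can build the URI and fail early instead. The region `us-east-1` below is only an example value.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build an ECR image URI, raising if the region is missing."""
    if not region:
        raise ValueError(
            "No AWS region configured; set one with `aws configure` "
            "or the AWS_DEFAULT_REGION environment variable."
        )
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Example with an explicit region; in the notebook you would pass `region`.
example_image_name = ecr_image_uri("385479125792", "us-east-1", "profiler-gpu")
print(example_image_name)
```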


### Set a profiler configuration

In [None]:
from sagemaker.profiler import ProfilerConfig 

profiling_parameters = {
    "ProfilerEnabled": str(True),
    "GeneralMetricsConfig": "{\"StartStep\": \"2\", \"NumSteps\": \"2\"}"
}
profiler_config = ProfilerConfig(
    profiling_interval_millis=500,
    profiling_parameters=profiling_parameters
)
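
The `GeneralMetricsConfig` value is a JSON document encoded as a string. Hand-escaping the quotes works, but it is easy to get wrong once the configuration grows. A minimal alternative, using only the standard library, is to build a dictionary and serialize it with `json.dumps`. The variable name `general_metrics_config` is introduced here for illustration.

```python
import json

# Same settings as above, written as a regular dictionary.
general_metrics_config = {"StartStep": "2", "NumSteps": "2"}

profiling_parameters = {
    "ProfilerEnabled": str(True),
    # json.dumps produces the same string as the hand-escaped version.
    "GeneralMetricsConfig": json.dumps(general_metrics_config),
}
print(profiling_parameters["GeneralMetricsConfig"])
```

The resulting dictionary can be passed to `ProfilerConfig` exactly as in the cell above.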

In [None]:
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow

# This parameter tells SageMaker how to configure and run Horovod.
# If you use a larger instance with more than 4 GPUs per node, change the processes_per_host parameter accordingly.
distributions = {
                    "mpi": {
                        "enabled": True,
                        "processes_per_host": 4,
                        "custom_mpi_options": "-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
                    }
                }
job_name="multi-node-multi-gpu-tf-hvd"
instance_count=2
entry_script='tf-hvd-train.py'

"""
# Uncomment this block if you want to run a single-node, multi-GPU Horovod training job.

## This parameter tells SageMaker how to configure and run Horovod. If you want to use more than 4 GPUs per node, change the processes_per_host parameter accordingly.
distributions = {
                    "mpi": {
                        "enabled": True,
                        "processes_per_host": 4,
                        "custom_mpi_options": "-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none",
                    }
                }
job_name="single-node-multi-gpu-tf-hvd"
instance_count=1
entry_script='tf-hvd-train.py'
"""

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name=job_name,
    image_name=image_name,
    train_instance_count=instance_count,
    train_instance_type='ml.p3.8xlarge',
    entry_point=entry_script,
    source_dir='demo',
    framework_version='2.2.0',
    py_version='py37',
    profiler_config=profiler_config,
    script_mode=True,
    hyperparameters=hyperparameters,
    distributions=distributions,
)
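
The value of `processes_per_host` should match the number of GPUs on the chosen instance type, and the total number of Horovod workers is that value multiplied by the instance count. The sketch below makes this relationship explicit. The lookup table and the helper names `build_mpi_distribution` and `total_workers` are illustrative and not part of the SageMaker SDK; the table only covers the P3 sizes with 1, 4, and 8 GPUs.

```python
# GPUs per instance for a few P3 sizes (illustrative, extend as needed).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def build_mpi_distribution(instance_type, custom_mpi_options=""):
    """Return an MPI distribution dict with one process per GPU."""
    processes_per_host = GPUS_PER_INSTANCE[instance_type]
    return {
        "mpi": {
            "enabled": True,
            "processes_per_host": processes_per_host,
            "custom_mpi_options": custom_mpi_options,
        }
    }


def total_workers(distribution, instance_count):
    """Total number of Horovod processes across all instances."""
    return distribution["mpi"]["processes_per_host"] * instance_count


example_distribution = build_mpi_distribution("ml.p3.8xlarge")
print(total_workers(example_distribution, instance_count=2))
```

With the settings used in this notebook (two `ml.p3.8xlarge` instances), the job runs eight Horovod processes in total.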

### Start training job

Calling `estimator.fit()` with `wait=False` starts the training job in the background, so you can proceed to the dashboard or analysis notebooks while it runs.

In [None]:
estimator.fit(wait=False)
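
Because `wait=False` returns immediately, the notebook does not report when the job ends. One way to check on it is to poll the `DescribeTrainingJob` API. The following is a sketch, not part of the original workflow: the helper name `wait_for_training_job` and its default intervals are assumptions. It accepts any client object that offers `describe_training_job`, such as one created with `boto3.client("sagemaker")`.

```python
import time

# Statuses after which the job will not change state any more.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=30, max_polls=120):
    """Poll the training job until it reaches a terminal state.

    Returns the last observed status, which may be non-terminal
    if max_polls is exhausted first.
    """
    status = None
    for _ in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status
```

In this notebook you could call it as `wait_for_training_job(boto3.client("sagemaker"), estimator.latest_training_job.name)`.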

# 3. Retrieve the Training Job Name to Analyze Profiling Data

Copy the outputs of the following cell (`training_job_name` and `region`) into the analysis notebooks `eagleeye_generic_dashboard.ipynb`, `analyze_performance_bottlenecks.ipynb`, and `eagleeye_interactive_analysis.ipynb` to run them.

In [None]:
import boto3

session = boto3.session.Session()
region = session.region_name

training_job_name = estimator.latest_training_job.name
print(f"Training job name: {training_job_name}")
print(f"Region: {region}")

# 4. Run SageMaker EagleEye rules

The following cell runs a profiler rule processing container on a separate instance, in parallel with the training job. EagleEye fetches the system and framework metrics and analyzes the data for potential performance issues.

You can run the rule container at any time while the training job is in progress or after the job has finished.

In [None]:
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

profiler_rule_image=f'385479125792.dkr.ecr.{region}.amazonaws.com/sagemaker-profiler-rules-container:latest'

processor = Processor(
            role=sagemaker.get_execution_role(),
            image_uri=profiler_rule_image,
            instance_count=1,
            instance_type='ml.r5.4xlarge',
            env={'S3_PATH': estimator.latest_job_profiler_artifacts_path()}
        )
processor.run([], 
              [ProcessingOutput(output_name='profiler-analysis', 
                                source='/opt/ml/processing/outputs', 
                                destination=estimator.latest_job_profiler_artifacts_path())],
              wait=False, logs=False
             ) 
              

Once the processing job has finished, you will find an HTML report `plot-viz-rule.html` and a notebook `profiler-report.ipynb` in your S3 bucket. Each rule also writes a JSON file that is used to generate the final report; you can find those files in your S3 bucket under `profiler-reports`.

In [None]:
print(f"You will find the profiler report in {estimator.latest_job_profiler_artifacts_path()}/plot_viz_rule.html after the training has finished")
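
To check which report files have been written so far, you can list the objects under the artifacts path. The sketch below assumes that `estimator.latest_job_profiler_artifacts_path()` returns a URI of the form `s3://bucket/prefix`. The helper names `split_s3_uri` and `list_report_keys` are hypothetical. The client argument is expected to behave like `boto3.client("s3")`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_report_keys(s3_client, artifacts_uri, suffixes=(".html", ".ipynb", ".json")):
    """Return keys under the artifacts path that look like report files."""
    bucket, prefix = split_s3_uri(artifacts_uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffixes):
                keys.append(obj["Key"])
    return keys
```

In this notebook you could call it as `list_report_keys(boto3.client("s3"), estimator.latest_job_profiler_artifacts_path())`. An empty result means the processing job has not written any report files yet.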