# Amazon SageMaker EagleEye PyTorch Training Job Example

This notebook will walk you through creating a PyTorch training job with the SageMaker EagleEye feature enabled.

# 1. Install Dependencies


### SageMaker Python SDK

Install the private beta version of the SageMaker Python SDK library. This enables you to call a private version of the SageMaker EagleEye API that allows you to create SageMaker training jobs with the profiler enabled.

In [None]:
! pip install ../sdk/sagemaker-1.60.3.dev0.tar.gz -q

# The following command will enable the SDK to use new profiler configs in the API
! aws configure add-model --service-model file://../sdk/sagemaker-2017-07-24.normal.json --service-name sagemaker

If you run this notebook in JupyterLab rather than Jupyter, you need to install the JupyterLab extensions `@jupyter-widgets/jupyterlab-manager` and `@bokeh/jupyter_bokeh`. We provide a [SageMaker Lifecycle configuration](lifecycle_config/on_start.sh) that installs these extensions automatically when your notebook instance starts. See [this blog post](https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/) for how to create and attach a Lifecycle configuration to your notebook instance.

# 2. Create a Training Job with Profiling Enabled

You will use the standard [SageMaker Estimator API for PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) to create training jobs. To enable profiling, create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the `PyTorch` estimator.

### Set a profiler configuration

In [None]:
from sagemaker.profiler import ProfilerConfig 

profiler_config = ProfilerConfig(
    profiling_interval_millis=500,
    profiling_parameters={
        "ProfilerEnabled": str(True),
        "GeneralMetricsConfig": "{\"StartStep\": \"2\", \"NumSteps\": \"2\"}"
    }
)
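
The `GeneralMetricsConfig` value above is a JSON object written as a hand-escaped string. As a minimal sketch, the same string can be built from a dictionary with the standard-library `json` module, which avoids escaping mistakes. The variable names `general_metrics` and `general_metrics_json` are illustrative and not part of the SDK.

```python
import json

# Same keys and string values as in the profiler configuration above.
general_metrics = {"StartStep": "2", "NumSteps": "2"}

# json.dumps produces the JSON text that would otherwise be escaped by hand.
general_metrics_json = json.dumps(general_metrics)

# Dictionary in the same shape as the profiling_parameters argument above.
profiling_parameters = {
    "ProfilerEnabled": str(True),
    "GeneralMetricsConfig": general_metrics_json,
}
print(profiling_parameters)
```

The resulting dictionary could then be passed as `profiling_parameters` to `ProfilerConfig` in place of the literal shown above.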

### Set region where this notebook is running

In [None]:
import boto3

session = boto3.session.Session()
region = session.region_name

### Set parameters for the PyTorch estimator

In [None]:
# EagleEye beta version base image for PyTorch
image_name = f'385479125792.dkr.ecr.{region}.amazonaws.com/profiler-gpu:pt_1.5.1_dataloader'

# Default training job
hyperparameters = {"batch_size":512, "epochs":10}
entry_point_script = 'train_pt.py'
base_job_name="pt-train-pt"
instance_count = 1


# Training job with configurable dataloader parameters, such as workers and pin_memory
"""
hyperparameters = {"batch_size":2048, "gpu":True, "pin_memory":True, "workers":4, "epoch":5}
hyperparameters = {"batch_size":2048, "gpu":True, "pin_memory":True, "workers":0, "epoch":5, "model":"resnext101_32x8d"}
hyperparameters = {"batch_size":2048, "gpu":True, "pin_memory":False, "workers":0, "epoch":5, "model":"resnext101_32x8d"}
entry_point_script = 'pytorch_res50_cifar10_dataloader.py'
base_job_name="pt-dataloader-singlegpu"
"""


# Uncomment for Horovod-based single-node, multi-GPU training
"""
hyperparameters = {"script":"pt_res50_cifar10_horovod_dataloader.py", "model":"resnext101_32x8d", "batch_size":2048, "epoch":5}
entry_point_script = 'horovod_test_launcher.py'
base_job_name = "pt-horvod-multigpu-resnet101"
"""

# Uncomment for distributed-API-based single-node, multi-GPU training
"""
hyperparameters = {"training_script":"pt_res50_cifar10_distributed.py", "nproc_per_node":4, "nnodes":1}
entry_point_script = 'distributed_launch.py'
instance_count = 1
base_job_name = "pt-distributed-singlenode"
"""


# Uncomment for distributed-API-based multi-node, multi-GPU training
"""
hyperparameters = {"training_script":"pt_res50_cifar10_distributed.py", "nproc_per_node":4, "nnodes":2}
entry_point_script = 'distributed_launch.py'
instance_count = 2
base_job_name = "pt-distributed-multinode"
"""

print(f"image being used is {image_name}")
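
The cell above switches between configurations by uncommenting string blocks. As a hedged alternative sketch, the variants can be kept in a lookup table and chosen by name. `TRAINING_CONFIGS` and `select_config` are hypothetical names introduced here; the values are copied from the cell above, and only three of the variants are shown.

```python
# Hypothetical lookup table; values are copied from the parameter cell above.
TRAINING_CONFIGS = {
    "default": {
        "hyperparameters": {"batch_size": 512, "epochs": 10},
        "entry_point_script": "train_pt.py",
        "base_job_name": "pt-train-pt",
        "instance_count": 1,
    },
    "distributed-singlenode": {
        "hyperparameters": {
            "training_script": "pt_res50_cifar10_distributed.py",
            "nproc_per_node": 4,
            "nnodes": 1,
        },
        "entry_point_script": "distributed_launch.py",
        "base_job_name": "pt-distributed-singlenode",
        "instance_count": 1,
    },
    "distributed-multinode": {
        "hyperparameters": {
            "training_script": "pt_res50_cifar10_distributed.py",
            "nproc_per_node": 4,
            "nnodes": 2,
        },
        "entry_point_script": "distributed_launch.py",
        "base_job_name": "pt-distributed-multinode",
        "instance_count": 2,
    },
}


def select_config(name):
    """Return the named configuration, or raise ValueError listing valid names."""
    try:
        return TRAINING_CONFIGS[name]
    except KeyError:
        valid = ", ".join(sorted(TRAINING_CONFIGS))
        raise ValueError(f"Unknown config {name!r}; choose one of: {valid}") from None


# Pick one variant and unpack it into the variables the estimator cell expects.
config = select_config("distributed-multinode")
hyperparameters = config["hyperparameters"]
entry_point_script = config["entry_point_script"]
base_job_name = config["base_job_name"]
instance_count = config["instance_count"]
print(base_job_name, instance_count)
```

Keeping `nnodes` and `instance_count` in the same entry makes it harder for the two values to drift apart when switching between the single-node and multi-node variants.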

### Define PyTorch estimator

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    image_name=image_name,
    train_instance_count=instance_count,
    train_instance_type='ml.p3.8xlarge',
    source_dir='demo',
    entry_point=entry_point_script,
    framework_version='1.5.0',
    hyperparameters=hyperparameters,
    base_job_name=base_job_name,    
    profiler_config=profiler_config)

### Start training job

Calling `estimator.fit()` with the `wait=False` argument starts the training job in the background, so you can proceed to run the dashboard or analysis notebooks while it trains.

In [None]:
estimator.fit(wait=False)
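
Because `wait=False` returns immediately, the job status has to be checked separately. The following is a minimal sketch of a polling helper, assuming a client object that exposes the SageMaker `describe_training_job` call (for example, one created with `boto3.client("sagemaker")`). The function name `wait_for_training_job` and its parameters are hypothetical.

```python
import time

# Terminal values of the TrainingJobStatus field returned by DescribeTrainingJob.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll the job until it reaches a terminal status or max_polls is exhausted.

    Returns the last status seen, which may be non-terminal if polling ran out.
    """
    status = None
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    return status
```

In this notebook, it could be called as `wait_for_training_job(boto3.client("sagemaker"), estimator.latest_training_job.name)`. The client is passed in rather than created inside the function so that the helper can be exercised with a stub object.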

# 3. Retrieve the Training Job Name to Analyze Profiling Data

Copy outputs of the following cell (`training_job_name` and `region`) to run the analysis notebooks `eagleeye_generic_dashboard.ipynb`, `analyze_performance_bottlenecks.ipynb`, and `eagleeye_interactive_analysis.ipynb`.

In [None]:
import boto3

session = boto3.session.Session()
region = session.region_name

training_job_name = estimator.latest_training_job.name
print(f"Training job name: {training_job_name}")
print(f"Region: {region}")

# 4. Run SageMaker EagleEye Rules

The following cell runs a profiler rule processing container on a separate instance, in parallel with the training job. EagleEye fetches the system and framework metrics and analyzes them for potential performance issues.

You can run the rule container at any time while the training job is in progress or after the job has finished.

In [None]:
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

profiler_rule_image=f'385479125792.dkr.ecr.{region}.amazonaws.com/sagemaker-profiler-rules-container:latest'

processor = Processor(
            role=sagemaker.get_execution_role(),
            image_uri=profiler_rule_image,
            instance_count=1,
            instance_type='ml.r5.4xlarge',
            env={'S3_PATH': estimator.latest_job_profiler_artifacts_path()}
        )
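# No inputs are needed; the rule container reads the profiler data from S3_PATH.
processor.run(
    inputs=[],
    outputs=[
        ProcessingOutput(
            output_name='profiler-analysis',
            source='/opt/ml/processing/outputs',
            destination=estimator.latest_job_profiler_artifacts_path(),
        )
    ],
    wait=False,
    logs=False,
)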

Once the processing job has finished, you will find an HTML report `plot-viz-rule.html` and a notebook `profiler-report.ipynb` in your S3 bucket. Each rule also creates a JSON file that is used to generate the final report; you can find those files in your S3 bucket under `profiler-reports`.

In [None]:
print(f"You will find the profiler report in {estimator.latest_job_profiler_artifacts_path()}/plot_viz_rule.html after the training has finished")
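
To download the report, the artifacts path has to be split into a bucket and a key prefix. The sketch below assumes that `estimator.latest_job_profiler_artifacts_path()` returns an `s3://bucket/prefix` URI, which the cells above suggest but do not state. `split_s3_uri` is a hypothetical helper, and the URI in the example is made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an s3://bucket/key URI into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI used only for illustration.
bucket, prefix = split_s3_uri("s3://example-bucket/example-job/profiler-output")
print(bucket, prefix)
```

With the bucket and prefix in hand, the report key can be formed by appending the report file name to the prefix, and the object fetched with `boto3.client("s3").download_file(bucket, key, local_path)`.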