![](display_images/remars_logo.png)

# Train a 3D object detector

Welcome to the training notebook! In the previous notebook you learned about the [A2D2 Dataset](https://www.a2d2.audi/a2d2/en.html), how to visualize it, and how to launch an Amazon SageMaker Ground Truth Labeling Job. 

In this notebook we will walk through how to train a 3D object detection model using Amazon SageMaker. We will:
- Build a custom container
- Set up FSx for Lustre as a data source
- Set up SageMaker Experiments
- Launch a distributed training job on Amazon SageMaker
- Review training job profiling metrics 

Training a 3D object detection model requires a specialized toolset. Point cloud data tends to be sparse and spread out, so the operations that 2D vision models rely on cannot be applied to it out of the box. 3D object detection models typically handle point clouds either with specialized [sparse convolutions](https://arxiv.org/pdf/1711.10275.pdf) or by [voxelizing (generating uniform 3D boxes)](https://arxiv.org/pdf/1711.06396.pdf) the input. [MMDetection3D](https://github.com/open-mmlab/mmdetection3d) (MM3D for short) implements a variety of 3D object detection and segmentation models, which makes model training much easier! We are going to use a model called [3DSSD](https://arxiv.org/pdf/2002.10187.pdf). If you are familiar with the 2D version of SSD, you will not notice many similarities: 3DSSD is indeed a single-shot detector, but its feature generation is quite different. Read the paper to learn more about it!
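To make the voxelization idea concrete, here is a minimal pure-Python sketch that groups points into a uniform 3D grid. The function name `voxelize`, the 0.5 m voxel size, and the sample points are illustrative assumptions; this is not how MMDetection3D implements voxelization.

```python
import math
from collections import defaultdict

def voxelize(points, voxel_size=0.5):
    """Group (x, y, z) points by the index of the voxel that contains them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)

# hypothetical sample points (metres)
points = [(0.1, 0.2, 0.0), (0.3, 0.4, 0.1), (5.2, -1.3, 0.7)]
voxels = voxelize(points)
print(len(voxels), "occupied voxels for", len(points), "points")
```

Because point clouds are sparse, most voxels in the grid stay empty, which is why the sketch stores only occupied voxels in a dictionary.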

We will start by installing [SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) and cloning MM3D into our local filesystem.

Note: **Please use conda_pytorch_p38 as the kernel for this notebook**

In [None]:
!pip install sagemaker-experiments
!cd container_training && git clone https://github.com/open-mmlab/mmdetection3d.git
!pip install botocore==1.24.42
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

If you want to install MM3D in your local environment, you can uncomment and run the following commands. We aren't going to install MM3D in our kernel because the process takes ~20 minutes and can be complicated by CUDA dependencies. Instead, we will build MM3D in our Docker image, where we can explicitly control the dependencies.

In [None]:
## If installing MM3D in your local environment, make sure the CUDA version you use matches the CUDA version PyTorch and MMCV are built with.

# %%time
# !pip install -U torch==1.8.1 torchvision==0.9.1
# !export MKL_SERVICE_FORCE_INTEL=1 && pip install mmcv-full==1.3.13 -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.8.1/index.html
# !pip install mmdet==2.17.0
# !pip install mmsegmentation==0.18.0
# !cd container_training && git clone https://github.com/iprivit/mmdetection3d.git
# !export MKL_SERVICE_FORCE_INTEL=1 && cd container_training/mmdetection3d && pip install -v -e .

# import IPython
# IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel

### Initialize clients and import libraries

Let's import a few libraries and initialize some [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) clients.

In [2]:
%pylab inline
import os
import sys
import json
import time
import yaml
import tarfile
import boto3
import sagemaker
import multiprocessing
import numpy as np
import pandas as pd
from datetime import datetime
from glob import glob
from tqdm import tqdm
from PIL import Image
from matplotlib import patches
from sagemaker.pytorch.estimator import PyTorch
from tqdm.contrib.concurrent import process_map
import torch
import torchvision
from sagemaker.debugger import ProfilerConfig, FrameworkProfile, DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig, Rule, ProfilerRule, rule_configs

def timestamp_to_utc(timestamp):
    utc_dt = datetime.utcfromtimestamp(timestamp)
    return utc_dt.strftime('%Y-%m-%d %H:%M:%S')

# set device for PyTorch to use, if on a GPU instance use cuda
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# initialize clients to make boto3 calls 
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
b3sess = boto3.Session()
fsx_client = boto3.client('fsx')
cfn_client = boto3.client('cloudformation')
sm = b3sess.client('sagemaker')
region = b3sess.region_name

# set the S3 bucket you'll use
bucket = sagemaker_session.default_bucket() 
prefix_output = 'training_res'

# Use cloudformation to describe the stack we've built to grab resource names 
stack_res = cfn_client.describe_stack_resources(
    StackName='threedee',
)
resource_dict = {}
for stack in stack_res['StackResources']:
    resource_dict[stack['ResourceType']] = stack['PhysicalResourceId']
    
# grab subnets and security groups so we can run our training in our VPC
subnets = [resource_dict['AWS::EC2::Subnet']]
security_group_ids = [resource_dict['AWS::EC2::SecurityGroup']]

Populating the interactive namespace from numpy and matplotlib
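One caveat with the `resource_dict` built above: because it is keyed by `ResourceType`, a stack that contains several resources of the same type (for example two subnets) keeps only the last one. The sketch below keeps all of them. It uses a hypothetical, hand-written response in the shape returned by `describe_stack_resources`; the IDs are made up.

```python
from collections import defaultdict

# hypothetical response fragment; real IDs come from CloudFormation
stack_res = {
    "StackResources": [
        {"ResourceType": "AWS::EC2::Subnet", "PhysicalResourceId": "subnet-aaaa"},
        {"ResourceType": "AWS::EC2::Subnet", "PhysicalResourceId": "subnet-bbbb"},
        {"ResourceType": "AWS::EC2::SecurityGroup", "PhysicalResourceId": "sg-cccc"},
    ]
}

# collect every physical ID per resource type instead of overwriting
resources_by_type = defaultdict(list)
for resource in stack_res["StackResources"]:
    resources_by_type[resource["ResourceType"]].append(resource["PhysicalResourceId"])

subnets = resources_by_type["AWS::EC2::Subnet"]
security_group_ids = resources_by_type["AWS::EC2::SecurityGroup"]
print(subnets, security_group_ids)
```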


### View Dockerfile

Since MMDetection3D has rather complex dependencies, the easiest way to install it in our environment is by using [Docker](https://www.docker.com/resources/what-container). Docker allows us to create self-contained images with all of the dependencies necessary to run MMDetection3D. Let's take a look at the Dockerfile we are going to use to build our image.

In [3]:
# view our dockerfile
!pygmentize -l docker docker/Dockerfile

# # Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

LABEL authors="privisaa@amazon.com"

ENV TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5 8.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
ENV FORCE_CUDA="1"
# ENV CUDA_HOME="/usr/local/cuda/"

# RUN apt-key del 7fa2af80

# RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

## Build Our Docker Container

Now that we have taken a look at our Dockerfile, let's build our container and push it to [Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/). We will build our container, create a repository for it in ECR, and push our image to ECR with one simple command. This will allow us to later use that container when we run our training job.

In [4]:
IMAGE_NAME = 'mmdet3d-sagemaker-pt181' 
account = boto3.client('sts').get_caller_identity()['Account']

# # if in MLR401 your container will already be built for you; you can use the following commands to pull it down!

! docker pull public.ecr.aws/k2j9l5n0/mmdet3d
! docker tag public.ecr.aws/k2j9l5n0/mmdet3d {account}.dkr.ecr.us-east-1.amazonaws.com/{IMAGE_NAME}
! aws ecr get-login --no-include-email | bash
! aws ecr create-repository --region {region} --repository-name {IMAGE_NAME}
! docker push {account}.dkr.ecr.us-east-1.amazonaws.com/{IMAGE_NAME}

# # if running on your own, uncomment the lines below:
# !aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# !bash ./build_and_push.sh {region} {IMAGE_NAME} latest docker 
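The image reference used for tagging and pushing follows the ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` (for the standard commercial regions). Here is a small sketch of a helper that builds it from the notebook's variables so that the region is not hard-coded. The function name `ecr_image_uri` and the sample account ID are hypothetical.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build a private ECR image URI for the given account and region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# hypothetical account ID; in the notebook use `account`, `region` and `IMAGE_NAME`
print(ecr_image_uri("123456789012", "us-east-1", "mmdet3d-sagemaker-pt181"))
```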

Let's unpack our ground truth labels. DO NOT SKIP THIS STEP! Otherwise your model will not have any labels to train with.

In [5]:
# unpack the ground truth labels

# !aws s3 cp s3://aws-tc-largeobjects/DEV-AWS-MO-Nvidia/a2d2_gt_database.tar.gz ../fsx/a2d2/a2d2_gt_database.tar.gz
!tar -xzf /home/ec2-user/SageMaker/fsx/a2d2/a2d2_gt_database.tar.gz -C ../fsx/a2d2/camera_lidar_semantic_bboxes/
!cp a2d2/a2d2*.pkl ../fsx/a2d2/camera_lidar_semantic_bboxes/

### Pre-process point clouds

One pre-processing step we need to take care of is converting our LiDAR point clouds into the bin files our [DataLoader](https://pytorch.org/docs/stable/data.html) expects. A2D2 stores point cloud data in [NPZ](https://numpy.org/doc/stable/reference/generated/numpy.savez.html) files, which are compressed NumPy archives. We will use the `convert_lidar` function defined below and parallelize its execution with the [process_map](https://tqdm.github.io/docs/contrib.concurrent/) utility.

In [None]:
from pprint import pprint  # only used when debug=True

def convert_lidar(lidar_path, debug = False):

    """
    azimuth     -
    row         - 2d coordinate of LiDAR point in image space, y (1208)
    col         - 2d coordinate of LiDAR point in image space, x (1920)
    lidar_id    - lidar sensor id (5)
    depth       -
    reflectance - reflectance measurement
    points      - 3D point measurement
    timestamp   -
    distance    -
    """

    lidar = np.load(lidar_path)
    xyz   = lidar['points'     ].astype(dtype = np.float32)
    i     = lidar['reflectance'].astype(dtype = np.float32).reshape(-1, 1)
    xyzi  = np.concatenate((xyz, i), axis = 1) # [x y z] + [i]

    if  debug:

        pprint(xyz)
        pprint(i)
        pprint(xyzi)

        pprint(np.asarray(np.unique(lidar['lidar_id'], return_counts = True), dtype = int).T)
    
    path  = lidar_path.replace('npz', 'bin')
    xyzi.ravel().tofile(path) # flatten
#     print(path)

roots = glob('../fsx/a2d2/camera_lidar_semantic_bboxes/2*')
for root in tqdm(roots):
    paths = glob(f'{root}/lidar/cam_front_center/*npz')
    process_map(convert_lidar, paths, max_workers = multiprocessing.cpu_count())
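Each `.bin` file written by `convert_lidar` is a flat sequence of float32 values laid out as `x, y, z, intensity` per point. The sketch below writes and reads back such a file with only the standard library to illustrate the layout. The helper names and sample values are hypothetical. With NumPy, `np.fromfile(path, dtype=np.float32).reshape(-1, 4)` should give the same rows.

```python
import os
import tempfile
from array import array

def write_points(path, rows):
    """Write (x, y, z, intensity) rows as flat float32 values."""
    flat = array("f", [value for row in rows for value in row])
    with open(path, "wb") as f:
        flat.tofile(f)

def read_points(path, num_features=4):
    """Read a flat float32 file back into rows of `num_features` values."""
    flat = array("f")
    with open(path, "rb") as f:
        flat.frombytes(f.read())
    return [tuple(flat[i:i + num_features]) for i in range(0, len(flat), num_features)]

# hypothetical sample points, chosen to be exactly representable in float32
rows = [(1.0, 2.0, 0.5, 10.0), (-3.25, 4.0, 1.5, 42.0)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    write_points(path, rows)
    size_bytes = os.path.getsize(path)
    restored = read_points(path)

print(size_bytes, restored)
```

A quick sanity check on a converted file is that its size in bytes is a multiple of 16 (4 values of 4 bytes per point).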


## Create metric definitions

Since we aren't training on the same instance our notebook is hosted on, we need a way to capture our performance metrics. SageMaker lets users collect metrics from the output logs of their training jobs. In our case we are going to capture three of the loss components reported by our 3DSSD model (corner, vote, and direction classification loss) as well as the total loss, the learning rate, and the gradient norm. Each definition below specifies the name of the metric and the regex used to extract it from the logs.

In [7]:
# define metrics

# raw strings avoid invalid escape sequence warnings for \s
metric_definitions=[{
        "Name": "loss",
        "Regex": r".*loss:\s([0-9\.]+)\s*"
    },
    {
        "Name": "corner_loss",
        "Regex": r".*corner_loss:\s([0-9\.]+)\s*"
    },
    {
        "Name": "vote_loss",
        "Regex": r".*vote_loss:\s([0-9\.]+)\s*"
    },
    {
        "Name": "dir_class_loss",
        "Regex": r".*dir_class_loss:\s([0-9\.]+)\s*"
    },
    {
        "Name": "lr",  
        "Regex": r".*lr:\s([0-9\.]+)\s*"
    },
    {
        "Name": "grad_norm",  
        "Regex": r".*grad_norm:\s([0-9\.]+)\s*"
    }
]
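Before launching a job, it can help to try the patterns against a sample log line. The line below is hypothetical (written in the style of an MMDetection training log, not copied from a real run), and the sketch assumes SageMaker applies each pattern like `re.search` and reports the first capture group. Under that assumption, the `lr` pattern would stop at the `e` of a value printed in scientific notation; `lr_sci` is one possible variant that also captures the exponent.

```python
import re

# hypothetical log line in the style of an MMDetection training log
line = ("Epoch [1][50/1000]  lr: 2.000e-03, corner_loss: 1.2000, "
        "vote_loss: 0.3000, dir_class_loss: 0.4000, loss: 2.5000, grad_norm: 10.1234")

patterns = {
    "loss": r".*loss:\s([0-9\.]+)\s*",
    "corner_loss": r".*corner_loss:\s([0-9\.]+)\s*",
    "lr": r".*lr:\s([0-9\.]+)\s*",
    # assumed variant: optionally capture an exponent such as e-03
    "lr_sci": r".*lr:\s([0-9\.]+(?:e[-+]?[0-9]+)?)",
}

captured = {name: re.search(pattern, line).group(1) for name, pattern in patterns.items()}
for name, value in captured.items():
    print(f"{name}: {value}")
```

Note that the `loss` pattern relies on the greedy `.*` to reach the last `loss:` on the line, so it picks up the total loss only when that field is printed after the individual loss components.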

## SageMaker Experiments 

Now that we have specified our training metrics above, we are going to need a way to organize and compare our training runs. [Amazon SageMaker Experiments](https://aws.amazon.com/blogs/aws/amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings/) lets you organize, track, compare, and evaluate machine learning experiments and model versions. We can add experiment tracking to our training jobs using a couple of simple hooks. There is a small amount of setup required before we can hook it into our estimators. We are first going to create our experiment, and within our experiment create a trial for our new training job.

In [8]:
# create MM3D experiment

from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
from smexperiments.search_expression import Filter, Operator, SearchExpression

mm3d_experiment = Experiment.create(
    experiment_name=f"mm3d-a2d2-demo-{int(time.time())}", 
    description="MMDetection3D training on the A2D2 dataset", 
    sagemaker_boto_client=sm)
print(mm3d_experiment,'\n')

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f1f3eafdb50>,experiment_name='mm3d-a2d2-demo-1637711766',description='MMDetection3D training on the A2D2 dataset',tags=None,experiment_arn='arn:aws:sagemaker:us-east-1:427894311213:experiment/mm3d-a2d2-demo-1637711766',response_metadata={'RequestId': '6b6af75d-acae-4833-8f18-862a4a30d7fe', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '6b6af75d-acae-4833-8f18-862a4a30d7fe', 'content-type': 'application/x-amz-json-1.1', 'content-length': '97', 'date': 'Tue, 23 Nov 2021 23:56:05 GMT'}, 'RetryAttempts': 0}) 



### Set up FSx for Lustre

We created an [FSx for Lustre filesystem](https://aws.amazon.com/fsx/lustre/) when we launched our initial [CloudFormation](https://aws.amazon.com/cloudformation/) stack. FSx for Lustre is a high-performance file system that provides fast, scalable storage. It's ideal for tasks that require high data throughput, like distributed training.

In order to mount it to our SageMaker Training instance, we will need to create a FileSystemInput object. This will tell SageMaker how and where to mount the file system.

In [5]:
# Configure FSx Input for your SageMaker Training job
from sagemaker.inputs import FileSystemInput
username = 'AWS'

file_system_id= resource_dict['AWS::FSx::FileSystem']  # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'
file_system_directory_path= f"/{fsx_client.describe_file_systems()['FileSystems'][0]['LustreConfiguration']['MountName']}/a2d2"  # NOTE: '/fsx/' will be the root mount path. Example: '/fsx/mask_rcnn/PyTorch'
file_system_access_mode='ro'
file_system_type='FSxLustre'
train_fs = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)
# if using the above FSx Input then use the following data channel config
# data_channels = {'train': train_fs}
# if using local mode gpu training use the following data channel instead:
data_channels =  {'train': 'file://../fsx/a2d2'} 

# SageMaker Training

![](display_images/sagemaker_how_it_works.png)

[Image source](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html)

## Define the Estimator

In SageMaker, training jobs are created by initializing an estimator class, where we define our training container, entry point, hyperparameters, and instance type, along with a few other settings. We then launch the training job on the instance or instances we specify.

We first define a set of hyperparameters that we will pass to our estimator. When we launch our training job, these hyperparameters, along with any source directory we define, will be packaged up and uploaded to our training instance running our Docker image.

We then define our profiler configuration. SageMaker Debugger lets data scientists debug, monitor, and profile training jobs in real time! In this notebook we will focus specifically on profiling. SageMaker Debugger's profiling feature allows us to collect both system and framework level information about our training job, ranging from CPU/GPU utilization to detailed descriptions of the most time-consuming operations. When we set up our profiling configuration, we tell our estimator how often to record this information.

For our specific training job, MM3D has a wide variety of model architectures with pretrained weights that we can use as a starting point. In our hyperparameters we can define the configuration file that tells the MM3D framework what model architecture we want to use. In this case we are using a 3DSSD model. You can experiment with different models, but make sure to look at how they ingest data first.

Our training script is set up for distributed training, so if you are running in your own account you can launch the job on one of AWS's multi-GPU instances. By default, the cells below use SageMaker local mode on the notebook instance.

In [None]:
config = '3dssd/3dssd_4x4_a2d2-3d-car.py'
launcher = 'none' # if using distributed training set to pytorch, otherwise if using single GPU, set to none

with open(f"container_training/mmdetection3d/configs/{config.split('/')[0]}/metafile.yml", 'r') as f:
    cfg_meta = yaml.safe_load(f)
    
model_path = cfg_meta['Models'][0]['Weights']
!wget {model_path} -O container_training/{model_path.split('/')[-1]}

In [None]:
# run training job

# create experiment trial
trial_name = f"mm3d-demo-training-job-{int(time.time())}"
mm3d_trial = Trial.create(
    trial_name=trial_name, 
    experiment_name=mm3d_experiment.experiment_name,
    sagemaker_boto_client=sm,
)
print(mm3d_trial)

account = boto3.client('sts').get_caller_identity()['Account']
image_uri = f'{account}.dkr.ecr.us-east-1.amazonaws.com/{IMAGE_NAME}'

# pick our instance type
instance_type = 'local_gpu' # set to use local mode, but if running in your own account, try running one of the below larger instances:
if instance_type in ['ml.p3.8xlarge', 'ml.p3.16xlarge', 'ml.p3dn.24xlarge', 'ml.g4dn.12xlarge']:
    distributed = 1
else:
    distributed = 0

# set our hyperparameters
hyperparameters = {
    'config': f'/mmdetection3d/configs/{config}', 
    "work-dir":'/opt/ml/model/',
    'launcher':launcher,
    'load-path':f"/opt/ml/code/{model_path.split('/')[-1]}",
    "distributed":distributed,
    "epochs":1,
    "batch-size":6,
    "instance-count":1
}

# set up our SageMaker Debugger Profiler configuration to monitor our resource utilization
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=1000,
)

# set up our estimator
estimator = PyTorch(
                      role=role,
                      instance_count=1,
                      instance_type= instance_type,
                      entry_point='train.py',
                      source_dir='container_training',
                      image_uri=image_uri,
                      volume_size=225,
                      output_path=f"s3://{bucket}/{prefix_output}",
                      base_job_name=f"{config.split('/')[0]}-{launcher}-{instance_type.replace('.','-')}", 
                      profiler_config=profiler_config,
                      enable_cloudwatch_metrics=True,
                      hyperparameters=hyperparameters,
                      metric_definitions=metric_definitions,
                      subnets=subnets,
                      security_group_ids=security_group_ids,
#                       distribution={  # if running distributed training, uncomment this argument
#                         "mpi":{"enabled":True}
#                       },
                   )
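SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments of the form `--name value`. The repository's `train.py` is not shown in this notebook, so the following is only a sketch of how such a script might parse them. The parser, its defaults, and the type choices are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse hyperparameters passed as --name value pairs (hypothetical parser)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True)
    parser.add_argument("--work-dir", type=str, default="/opt/ml/model/")
    parser.add_argument("--launcher", type=str, default="none")
    parser.add_argument("--load-path", type=str, default=None)
    parser.add_argument("--distributed", type=int, default=0)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=6)
    parser.add_argument("--instance-count", type=int, default=1)
    return parser.parse_args(argv)

# simulate the arguments a training job might receive
args = parse_args(["--config", "/mmdetection3d/configs/3dssd/3dssd_4x4_a2d2-3d-car.py",
                   "--epochs", "1", "--batch-size", "6"])
print(args.config, args.epochs, args.batch_size, args.work_dir)
```

`argparse` converts dashes in option names to underscores, so `--batch-size` is read as `args.batch_size`.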

Now that we have defined our estimator, we can launch our training job. We supply a few arguments, including:
- `inputs`: tells SageMaker where to find our training data.
- `experiment_config`: specifies the experiment configuration for SageMaker Experiments.
- `wait`: controls whether the notebook cell blocks until the training job finishes. In this case we set it to `True` so that we can view the log output of our training job in our notebook.

If you receive this error: `CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.`, this is because we are running our training inside of a specific [availability zone or AZ](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) in our [virtual private cloud or VPC](https://aws.amazon.com/vpc/). Capacity is constantly being replenished, so wait a minute or two and retry launching your training job.

In [None]:
estimator.fit(inputs=data_channels, 
              wait=True,
              experiment_config={
            "ExperimentName": mm3d_experiment.experiment_name,
            "TrialName": mm3d_trial.trial_name,
            "TrialComponentDisplayName": f"Training-{instance_type.replace('.','-')}"},
                  )
training_job_name = estimator.latest_training_job.name
print('Training job name:', training_job_name)
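If capacity errors are frequent, the retry can be automated. The helper below is a generic sketch: `retry_on_capacity_error` and its parameters are hypothetical, and matching on the text `CapacityError` in the exception message is an assumption about how the failure surfaces in your SDK version.

```python
import time

def retry_on_capacity_error(launch, attempts=3, wait_seconds=90, sleep=time.sleep):
    """Call `launch()` and retry when the error message mentions CapacityError."""
    for attempt in range(1, attempts + 1):
        try:
            return launch()
        except Exception as err:
            # re-raise anything that is not a capacity problem, or the final failure
            if "CapacityError" not in str(err) or attempt == attempts:
                raise
            print(f"Attempt {attempt} hit a capacity error, retrying in {wait_seconds}s")
            sleep(wait_seconds)

# stand-in for the `estimator.fit` call: fails once, then succeeds
calls = {"count": 0}

def fake_launch():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("CapacityError: Unable to provision requested ML compute capacity.")
    return "started"

result = retry_on_capacity_error(fake_launch, sleep=lambda seconds: None)
print(result, "after", calls["count"], "calls")
```

In the notebook, `launch` could be a lambda that wraps the `estimator.fit` call from the cell above.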

To run training locally on your instance inside the Docker image we pulled from ECR, run the output of the following print commands in a terminal. The GIF below demonstrates how to do this.

In [None]:
# to run the same training job locally on your SageMaker instance, run the following commands in a terminal:

print(f'docker run -it --gpus all -v /home/ec2-user/SageMaker/fsx/a2d2:/opt/ml/input/data/train {account}.dkr.ecr.us-east-1.amazonaws.com/mmdet3d-sagemaker-pt181 bash')
print(f'cd /opt/ml/code && python train.py --config /mmdetection3d/configs/3dssd/3dssd_4x4_a2d2-3d-car.py --batch-size 8 --epochs 1') # # takes about 45 minutes
print('cp /opt/ml/code/work_dirs/3dssd_4x4_a2d2-3d-car/latest.pth /opt/ml/input/data/train/model.pth') 
print('exit')
print('cd /home/ec2-user/SageMaker/end-2-end-3d-ml && tar -czvf model.tar.gz ../fsx/a2d2/model.pth') # -z gzips the archive to match the .tar.gz name; model will be deposited in end-2-end-3d-ml folder

![](display_images/local_train.gif)

All of the information that Debugger gathers is stored in Amazon S3. The AWS CLI call below checks whether profiler information has been saved to our training job's folder.

In [None]:
! aws s3 ls s3://{bucket}/{prefix_output}/{training_job_name}/profiler-output/

### Download our model object

We will use this later when we deploy our model as an endpoint.

In [16]:
!aws s3 cp s3://{bucket}/{prefix_output}/{training_job_name}/output/model.tar.gz .

download: s3://sagemaker-us-east-1-427894311213/training_res/3dssd-pytorch-ml-g4dn-12xlarge-2021-11-23-23-56-46-899/output/model.tar.gz to ./model.tar.gz



<span style="color:red;font-size:22.0pt">  IF YOU RAN THE TRAINING JOB USING LOCAL MODE THE FOLLOWING BLOCKS WILL NOT WORK.</span>

### Find system metrics

Once our outputs have been processed, we can import our system and framework data into our notebook and visualize them! The following block will check for the availability of our profiling data.

In [17]:
from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader

sagemaker_client = boto3.client('sagemaker')
output_path = f's3://{bucket}/{prefix_output}/{training_job_name}/profiler-output'
print(f'output path: {output_path}')
print(f'Training job name: {training_job_name}')

system_metrics_reader = S3SystemMetricsReader(output_path)

training_job_status = ''
training_job_secondary_status = ''
while system_metrics_reader.get_timestamp_of_latest_available_file() == 0:
    system_metrics_reader.refresh_event_file_list()
    client = sagemaker_client.describe_training_job(
        TrainingJobName=training_job_name
    )
    if 'TrainingJobStatus' in client:
        training_job_status = f"TrainingJobStatus: {client['TrainingJobStatus']}"
    if 'SecondaryStatus' in client:
        training_job_secondary_status = f"TrainingJobSecondaryStatus: {client['SecondaryStatus']}"
        
    print(f"Profiler data from system not available yet. {training_job_status}. {training_job_secondary_status}.")
    time.sleep(20)

print('\n\nProfiler data from system is available')

[2021-11-24 00:18:36.422 ip-172-16-55-87.ec2.internal:22501 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
output path: s3://sagemaker-us-east-1-427894311213/training_res/3dssd-pytorch-ml-g4dn-12xlarge-2021-11-23-23-56-46-899/profiler-output
Training job name: 3dssd-pytorch-ml-g4dn-12xlarge-2021-11-23-23-56-46-899


Profiler data from system is available


## Visualize Data in Notebook

Now that we have verified our profiler data is available, let's plot some system metrics in our notebook. One easy thing to check for is if you are fully utilizing your GPU memory. If it seems low, we might be able to increase our batch size! 

In [24]:
# create system plots

from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts  = TimelineCharts(system_metrics_reader, 
                                       framework_metrics_reader=None,
                                       select_dimensions=["CPU", "GPU"], 
                                       select_events=["total"] # if you want to look specifically at gpu0 and gpu1, replace total with a list
                                      )

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-427894311213/training_res', 'ProfilingIntervalInMilliseconds': 1000}
s3 path:s3://sagemaker-us-east-1-427894311213/training_res/3dssd-pytorch-ml-g4dn-12xlarge-2021-11-23-23-56-46-899/profiler-output


Profiler data from system is available
select events:['total']
select dimensions:['CPU', 'GPU']
filtered_events:{'total'}
filtered_dimensions:{'GPUMemoryUtilization-nodeid:algo-1', 'GPUUtilization-nodeid:algo-1', 'CPUUtilization-nodeid:algo-1'}


### Create a heatmap

This heatmap shows similar system utilization metrics, but in a different visual form.

In [25]:
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap

view_heatmap = Heatmap(
    system_metrics_reader,
#     framework_metrics_reader, # can add back in
    select_dimensions=["CPU", "GPU", "I/O"], # optional
    select_events=["total"],                 # optional
    plot_height=450
)

select events:['total']
select dimensions:['CPU', 'GPU', 'I/O']
filtered_events:{'total'}
filtered_dimensions:{'I/OWaitPercentage', 'GPUUtilization', 'GPUMemoryUtilization', 'CPUUtilization'}


## Access system level metrics

We can look at our system level metrics in depth by using our system metric reader and pulling the data.

In [26]:
# get system metrics

system_metrics_reader.refresh_event_file_list()
last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()
events = system_metrics_reader.get_events(0, last_timestamp) 

print("Found", len(events), "recorded system metric events. Latest recorded event:",  
      timestamp_to_utc(last_timestamp/1000000))

Found 119061 recorded system metric events. Latest recorded event: 2021-11-24 00:17:00
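The cell above treats the value returned by `get_timestamp_of_latest_available_file()` as microseconds since the Unix epoch, which is why it divides by 1,000,000 before formatting. Here is a minimal sketch of the same conversion using a timezone-aware `datetime`. The function name `micros_to_utc` and the sample value are illustrative.

```python
from datetime import datetime, timezone

def micros_to_utc(timestamp_us):
    """Format a microsecond Unix timestamp as a UTC string."""
    dt = datetime.fromtimestamp(timestamp_us / 1_000_000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

# illustrative value: 1637712000 seconds since the epoch, expressed in microseconds
print(micros_to_utc(1_637_712_000_000_000))  # 2021-11-24 00:00:00
```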


### Create system level metric dataframe

We can create a dataframe with all of our system metrics for further exploration.

In [27]:
dimensions = []
names = []
node_ids = []
timestamps = []
types = []
values = []
for event in events:
    dimensions.append(event.dimension)
    names.append(event.name)
    node_ids.append(event.node_id)
    timestamps.append(event.timestamp)
    types.append(event.type)
    values.append(event.value)
    
system_df = pd.DataFrame.from_dict({
    "dimension":dimensions,
    "name":names,
    "node_id":node_ids,
    "timestamp":timestamps,
    "type":types,
    "value":values
})

system_df.head()

| | dimension | name | node_id | timestamp | type | value |
|---|---|---|---|---|---|---|
| 0 | GPUMemoryUtilization | gpu2 | algo-1 | 1637712000.0 | gpu | 0.0 |
| 1 | GPUUtilization | gpu0 | algo-1 | 1637712000.0 | gpu | 0.0 |
| 2 | GPUMemoryUtilization | gpu0 | algo-1 | 1637712000.0 | gpu | 0.0 |
| 3 | GPUUtilization | gpu1 | algo-1 | 1637712000.0 | gpu | 0.0 |
| 4 | GPUMemoryUtilization | gpu1 | algo-1 | 1637712000.0 | gpu | 0.0 |


In [28]:
system_df.groupby(by='dimension').sum()

| dimension | timestamp | value |
|---|---|---|
| | 6965191000000.0 | 485122500000000.0 |
| Algorithm | 3478501000000.0 | 2308.69 |
| CPUUtilization | 83562640000000.0 | 523685.5 |
| GPUMemoryUtilization | 6970104000000.0 | 189135.0 |
| GPUUtilization | 6970104000000.0 | 234829.0 |
| I/OWaitPercentage | 83562640000000.0 | 34198.1 |
| Platform | 3478501000000.0 | 11161420000.0 |
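Summing timestamps and utilization percentages is of limited use; an average per dimension is usually easier to interpret. With pandas this would be `system_df.groupby('dimension')['value'].mean()`. The pure-Python sketch below shows the same aggregation on a few hypothetical records so that it runs without a training job.

```python
from collections import defaultdict
from statistics import mean

# hypothetical events in the same shape as the rows of system_df
events = [
    {"dimension": "GPUUtilization", "name": "gpu0", "value": 80.0},
    {"dimension": "GPUUtilization", "name": "gpu1", "value": 60.0},
    {"dimension": "GPUMemoryUtilization", "name": "gpu0", "value": 35.0},
    {"dimension": "GPUMemoryUtilization", "name": "gpu1", "value": 45.0},
    {"dimension": "CPUUtilization", "name": "cpu0", "value": 20.0},
]

# group the values by dimension, then average each group
values_by_dimension = defaultdict(list)
for event in events:
    values_by_dimension[event["dimension"]].append(event["value"])

mean_by_dimension = {dim: mean(vals) for dim, vals in values_by_dimension.items()}
for dim, avg in mean_by_dimension.items():
    print(f"{dim}: {avg:.1f}")
```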


## View rule output

Profiler also generates an HTML report and a Jupyter notebook that go over the different rules your job triggered. These rules give you hints about where your job could improve.

In [None]:
rule_output_path = f"s3://{bucket}/{prefix_output}/{training_job_name}/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

!aws s3 cp --recursive {rule_output_path} .

## Get experiment results

Once our training job is complete, we want to evaluate the results of our training run. An easy way to do so is by using SageMaker Experiments. In addition to the UI in SageMaker Studio, SageMaker Experiments allows users to import their results as a dataframe so they can easily evaluate their training runs. The code below finds the trial component created by our training job and associates it with the trial we created above.

In [34]:
from datetime import timezone
from smexperiments.search_expression import Filter, Operator, SearchExpression

# get the trial components derived from the training jobs

creation_time = estimator.latest_training_job.describe()['CreationTime']
creation_time = creation_time.astimezone(timezone.utc)
creation_time = creation_time.strftime("%Y-%m-%dT%H:%M:%SZ")

created_after_filter = Filter(
    name="CreationTime",
    operator=Operator.GREATER_THAN_OR_EQUAL,
    value=str(creation_time),
)

# the trial component name contains the training job name
source_arn_filter = Filter(
    name="TrialComponentName", operator=Operator.CONTAINS, value=training_job_name
)
source_type_filter = Filter(
    name="Source.SourceType", operator=Operator.EQUALS, value="SageMakerTrainingJob"
)

search_expression = SearchExpression(
    filters=[created_after_filter, source_arn_filter, source_type_filter]
)

# search iterates over every page of results by default
trial_component_search_results = list(
    TrialComponent.search(search_expression=search_expression, sagemaker_boto_client=sm)
)
print(f"Found {len(trial_component_search_results)} trial components.")
trial_component_search_results

# associate the trial components with the trial
for tc in trial_component_search_results:
    print(f"Associating trial component {tc.trial_component_name} with trial {mm3d_trial.trial_name}.")
    mm3d_trial.add_trial_component(tc.trial_component_name)
    # sleep to avoid throttling
    time.sleep(0.5)

Found 1 trial components.
Associating trial component 3dssd-pytorch-ml-g4dn-12xlarge-2021-11-23-23-56-46-899-aws-training-job with trial mm3d-demo-training-job-1637711804.


### View experiments DataFrame

Once we have associated our experiment trials, we can import them as a DataFrame. We can incorporate search expressions to narrow down training runs with specific attributes and sort our trials by specified metrics. The experiments will track all of your set hyperparameters, making it easier to evaluate the effects of changing them. 

In [35]:
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session,  
    experiment_name=mm3d_experiment.experiment_name,
#     search_expression=search_expression,
    sort_by="metrics.loss.min",  # sort by a metric we actually defined above
    sort_order="Ascending",
#     metric_names=['loss'],
)

trial_df = trial_component_analytics.dataframe()
trial_df

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,batch-size,config,distributed,...,lr - Last,lr - Count,train - MediaType,train - Value,SageMaker.DebugHookOutput - MediaType,SageMaker.DebugHookOutput - Value,SageMaker.ModelArtifact - MediaType,SageMaker.ModelArtifact - Value,Trials,Experiments
0,3dssd-pytorch-ml-g4dn-12xlarge-2021-11-23-23-5...,Training-ml-g4dn-12xlarge,arn:aws:sagemaker:us-east-1:427894311213:train...,427894311213.dkr.ecr.us-east-1.amazonaws.com/m...,1.0,ml.g4dn.12xlarge,300.0,6.0,"""/mmdetection3d/configs/3dssd/3dssd_4x4_a2d2-3...",1.0,...,2.0,30,,/qqq4tbmv/a2d2,,s3://sagemaker-us-east-1-427894311213/training...,,s3://sagemaker-us-east-1-427894311213/training...,[mm3d-demo-training-job-1637711804],[mm3d-a2d2-demo-1637711766]


## Conclusion

In this notebook we set up SageMaker Training with the MMDetection3D repository and FSx for Lustre as a data source. We trained our model on a multi-GPU instance and downloaded our model object. In the next notebook we will deploy the model we trained as an asynchronous SageMaker endpoint!