# Distributed data parallel MaskRCNN training with PyTorch and SMDataParallel

SMDataParallel is a new capability in Amazon SageMaker to train deep learning models faster and cheaper. SMDataParallel is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet.

This notebook example shows how to use SMDataParallel with PyTorch (version 1.8.0) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a MaskRCNN model on the [COCO 2017 dataset](https://cocodataset.org/#home) using an [Amazon FSx for Lustre file system](https://aws.amazon.com/fsx/lustre/) as the data source.

The outline of steps is as follows:

1. Stage COCO 2017 dataset in [Amazon S3](https://aws.amazon.com/s3/)
2. Create Amazon FSx Lustre file system and import data into the file system from S3
3. Build Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)
4. Configure data input channels for SageMaker
5. Configure hyper-parameters
6. Define training metrics
7. Define training job, set distribution strategy to SMDataParallel and start training

**NOTE:** With a large training dataset, we recommend using [Amazon FSx](https://aws.amazon.com/fsx/) as the input file system for the SageMaker training job. FSx file input significantly cuts training startup time on SageMaker because it avoids downloading the training data each time you start the training job (as happens with S3 input), and it provides good data read throughput.


**NOTE:** This example requires SageMaker Python SDK v2.X.

## Amazon SageMaker Initialization

Initialize the notebook instance. Get the AWS region and the SageMaker execution role.

The IAM role ARN is used to give training and hosting access to your data. See [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create these roles. If more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s). Because this example uses FSx, make sure to attach the `FSx Access` permission to this IAM role.

In [None]:
%%time
import sys

! {sys.executable} -m pip install --upgrade sagemaker
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
import boto3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
fsx_client = boto3.client("fsx")

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role:{role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account:{account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region:{region}")

subnets = [
    "<SUBNET_ID>"
]  # this will be used for FSx and will be where the training job runs. Example: subnet-0f9XXXX
security_group_ids = ["<SECURITY_GROUP_ID>"]  # Example: sg-03ZZZZZZ

## Stage COCO

The following bash script downloads the data from COCO, decompresses it, and uploads it to S3. This notebook uses FSx as the file source, so the data goes into a folder that you will set up to sync directly to FSx. The script takes about 20 minutes.

In [None]:
%%time
!bash ./upload_coco2017_to_s3.sh {bucket} fsx_sync/train-coco/coco

## Prepare SageMaker Training Images

1. SageMaker by default uses the latest [Amazon Deep Learning Container Images (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) PyTorch training image. In this step, we use it as a base image and install the additional dependencies required for training the MaskRCNN model.
2. The GitHub repository https://github.com/HerringForks/DeepLearningExamples.git provides a PyTorch-SMDataParallel MaskRCNN training script. We install it on the training image.

### Build and Push Docker Image to ECR

Run the commands below to build the Docker image and push it to ECR.

In [None]:
image = "<ADD NAME OF REPO>"  # Example: mask-rcnn-smdataparallel-sagemaker
tag = "<ADD TAG FOR IMAGE>"  # Example: pt1.8

In [None]:
!pygmentize ./Dockerfile

In [None]:
!pygmentize ./build_and_push.sh

In [None]:
%%time
! chmod +x build_and_push.sh; bash build_and_push.sh {region} {image} {tag}

## Preparing FSx Input for SageMaker

1. Download and prepare your training dataset on S3.
2. Follow the steps listed here to create an FSx file system linked to the S3 bucket that holds your training data: https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-fs-linked-data-repo.html. Make sure to add an endpoint to your VPC that allows S3 access.
3. Follow the steps listed here to configure your SageMaker training job to use FSx: https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/

### Important Caveats

1. You need to use the same `subnet`, `vpc`, and `security group` used with FSx when launching the SageMaker notebook instance. The same configuration will be used by your SageMaker training job.
2. Make sure you set appropriate inbound/outbound rules in the `security group`. Specifically, the ports listed here must be open for SageMaker to access the FSx file system in the training job: https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html
3. Make sure the `SageMaker IAM Role` used to launch this SageMaker training job has access to `AmazonFSx`.

You can also create an FSx file system automatically with the following command:

In [None]:
# use boto3 to create FSx

fsx_response = fsx_client.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,
    StorageType="SSD",
    SubnetIds=subnets,
    SecurityGroupIds=security_group_ids,
    Tags=[
        {"Key": "Name", "Value": "COCO-storage"},
    ],
    LustreConfiguration={
        "WeeklyMaintenanceStartTime": "7:03:00",
        "ImportPath": f"s3://{bucket}/fsx_sync/",  # where FSx imports data from in S3; can be the entire bucket or a specific folder
        "ImportedFileChunkSize": 1024,
        "DeploymentType": "PERSISTENT_1",  # alternatives: "SCRATCH_1", "SCRATCH_2"; PERSISTENT means the storage in FSx is persistent, SCRATCH means it is temporary
        "AutoImportPolicy": "NEW",  # alternatives: "NONE", "NEW_CHANGED"; controls which S3 changes (new and/or changed objects) are imported into FSx automatically
        "PerUnitStorageThroughput": 200,  # specific to PERSISTENT storage; not required for SCRATCH
    },
)

fsx_response
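
`create_file_system` returns while the file system is still being created, so a training job launched immediately afterwards may not be able to mount it. The following is a minimal sketch, not part of the original example, that polls `describe_file_systems` until the `Lifecycle` field reports `AVAILABLE`. The helper name `wait_for_file_system`, the polling interval, and the timeout are illustrative assumptions; the stub client exists only so the sketch can run without AWS access.

```python
import time


def wait_for_file_system(fsx_client, file_system_id, poll_seconds=30, timeout_seconds=1800):
    """Poll FSx until the file system is AVAILABLE (hypothetical helper)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = fsx_client.describe_file_systems(FileSystemIds=[file_system_id])
        description = response["FileSystems"][0]
        lifecycle = description["Lifecycle"]
        if lifecycle == "AVAILABLE":
            return description
        if lifecycle in ("FAILED", "DELETING"):
            raise RuntimeError(f"File system {file_system_id} is in state {lifecycle}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"File system {file_system_id} still {lifecycle} after timeout")
        time.sleep(poll_seconds)


class StubFsxClient:
    """Stand-in for boto3.client("fsx") that replays a fixed list of states."""

    def __init__(self, states):
        self._states = list(states)

    def describe_file_systems(self, FileSystemIds):
        state = self._states.pop(0)
        return {"FileSystems": [{"FileSystemId": FileSystemIds[0], "Lifecycle": state}]}


# Offline demonstration: two polls report CREATING, the third reports AVAILABLE.
stub = StubFsxClient(["CREATING", "CREATING", "AVAILABLE"])
description = wait_for_file_system(stub, "fs-0bYYYYYY", poll_seconds=0)
print(description["Lifecycle"])
```

In the notebook itself, the call would take the real client and ID instead of the stub: `wait_for_file_system(fsx_client, fsx_response["FileSystem"]["FileSystemId"])`.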

## Download model

If the training job runs in a VPC that restricts outside internet access, we need to download the base model ahead of time.

In [None]:
!wget https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/MSRA/R-50.pkl
!aws s3 cp R-50.pkl s3://{bucket}/pretrained_weights/R-50.pkl

## Setup training metrics

To get more information on your training job, define some algorithm metrics. SageMaker scrapes the logs from the training job, renders the metrics in the training job console, and stores them in SageMaker Experiments. The metrics defined here are fairly standard; you only need to supply a regex that finds each one. Feel free to define your own!

In [None]:
metric_definitions = [
    # Raw strings keep the regex escapes (\s, \.) intact and avoid invalid-escape warnings.
    {"Name": "loss", "Regex": r".*loss:\s([0-9\.]+)\s*"},
    {"Name": "loss_classifier", "Regex": r".*loss_cls:\s([0-9\.]+)\s*"},
    {"Name": "loss_box_reg", "Regex": r".*loss_box_reg:\s([0-9\.]+)\s*"},
    {"Name": "loss_mask", "Regex": r".*loss_mask:\s([0-9\.]+)\s*"},
    {"Name": "loss_objectness", "Regex": r".*loss_objectness:\s([0-9\.]+)\s*"},
    {"Name": "loss_rpn_box_reg", "Regex": r".*loss_rpn_box_reg:\s([0-9\.]+)\s*"},
    {"Name": "overall_training_speed", "Regex": r".*Overall training speed:\s([0-9\.]+)\s*"},
    {"Name": "lr", "Regex": r".*lr:\s([0-9\.]+)\s*"},
    {"Name": "iter", "Regex": r".*iter:\s([0-9\.]+)\s*"},
    {"Name": "avg iter/s", "Regex": r".*avg iter/s:\s([0-9\.]+)\s*"},
]
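
Before launching a long job, it can help to check the regexes against a log line locally. The sketch below copies a subset of the definitions above (written here as raw strings) and applies them with Python's `re` module. The helper `extract_metrics` and the sample log line are made up for illustration; the real training log format may differ, and SageMaker's own matching may not behave identically to `re.search`.

```python
import re

# A subset of the metric definitions from this notebook.
sample_definitions = [
    {"Name": "loss", "Regex": r".*loss:\s([0-9\.]+)\s*"},
    {"Name": "loss_classifier", "Regex": r".*loss_cls:\s([0-9\.]+)\s*"},
    {"Name": "lr", "Regex": r".*lr:\s([0-9\.]+)\s*"},
    {"Name": "iter", "Regex": r".*iter:\s([0-9\.]+)\s*"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            # The first capture group holds the numeric value.
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical log line, invented for this check.
sample_line = "eta: 1:02:03  iter: 20  loss: 1.2345 (1.5000)  loss_cls: 0.4000  lr: 0.004000"
metrics = extract_metrics(sample_line, sample_definitions)
print(metrics)
```

Note that each regex expects exactly one whitespace character between the colon and the number, so a log line with different spacing would not match.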

## SageMaker PyTorch Estimator function options

In the following code block, you can update the estimator to use a different instance type, instance count, and distribution strategy. You also pass in the training script as the entry point.

**Instance types**

SMDataParallel supports model training on SageMaker with the following instance types only:
1. ml.p3.16xlarge
1. ml.p3dn.24xlarge [Recommended]
1. ml.p4d.24xlarge [Recommended]

**Instance count**

To get the best performance and the most out of SMDataParallel, you should use at least 2 instances, but you can also use 1 for testing this example.

**Distribution strategy**

To use DDP mode, update the `distribution` argument so that it enables `smdistributed` `dataparallel`.
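
Each of the instance types listed above has 8 GPUs, so the total number of GPUs (and, assuming one training process per GPU, the number of workers) is the instance count times 8. The sketch below is only a sanity check; the names `GPUS_PER_INSTANCE` and `total_gpus` are illustrative and not part of the SageMaker SDK.

```python
# GPUs per instance for the instance types listed above.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def total_gpus(instance_type, instance_count):
    """Return the total GPU count for a training cluster (hypothetical helper)."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"Unsupported instance type for this example: {instance_type}")
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return GPUS_PER_INSTANCE[instance_type] * instance_count


world_size = total_gpus("ml.p3dn.24xlarge", 2)
print(world_size)  # 16
```

Two `ml.p3dn.24xlarge` instances give 16 GPUs, which lines up with the `16GPU` in the config file name used later in this notebook.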

In [None]:
import os
import time
from sagemaker.pytorch import PyTorch

In [None]:
instance_type = "ml.p3dn.24xlarge"  # Other supported instance type: ml.p3.16xlarge, ml.p4d.24xlarge
instance_count = 2  # You can use 2, 4, 8 etc.
docker_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"  # YOUR_ECR_IMAGE_BUILT_WITH_ABOVE_DOCKER_FILE
username = "AWS"
job_name = f"pytorch-smdataparallel-mrcnn-fsx-{int(time.time())}"  # Name of the SageMaker training job. The timestamp suffix keeps it unique and makes the job easy to find in the SageMaker training job console.
file_system_id = fsx_response["FileSystem"][
    "FileSystemId"
]  # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'
config_file = "e2e_mask_rcnn_R_50_FPN_1x_16GPU_4bs.yaml"

In [None]:
hyperparameters = {
    "config-file": config_file,
    "skip-test": "",
    "seed": 987,
    "dtype": "float16",
    "spot_ckpt": f"s3://{bucket}/pretrained_weights/R-50.pkl",  # this is where our script will look for existing weights to initialize the model backbone from.
}

In [None]:
estimator = PyTorch(
    entry_point="train_pytorch_smdataparallel_maskrcnn.py",
    role=role,
    image_uri=docker_image,
    source_dir=".",
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version="1.8.0",
    py_version="py36",
    sagemaker_session=sagemaker_session,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    subnets=subnets,
    security_group_ids=security_group_ids,
    debugger_hook_config=False,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

In [None]:
# Configure FSx Input for your SageMaker Training job

from sagemaker.inputs import FileSystemInput

file_system_directory_path = "YOUR_MOUNT_PATH_FOR_TRAINING_DATA"  # NOTE: '/fsx/' will be the root mount path. Example: '/fsx/mask_rcnn/PyTorch'
file_system_access_mode = "ro"
file_system_type = "FSxLustre"
train_fs = FileSystemInput(
    file_system_id=file_system_id,
    file_system_type=file_system_type,
    directory_path=file_system_directory_path,
    file_system_access_mode=file_system_access_mode,
)
data_channels = {"train": train_fs}
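
The `directory_path` has to point at the training data inside the file system. As a rough sketch, assuming (a) the path starts with the file system's mount name and (b) objects under the S3 `ImportPath` prefix appear at the file system root with that prefix stripped, a path can be derived from the S3 prefix used when staging COCO. The helper name and the mount name `fsx` are hypothetical; the actual mount name is reported under `LustreConfiguration["MountName"]` by `describe_file_systems`, and the layout your training script expects may differ.

```python
def fsx_path_for_s3_prefix(mount_name, import_prefix, s3_prefix):
    """Map an S3 key prefix to a path on the linked FSx file system.

    Assumes keys under import_prefix appear at the file system root.
    """
    imp = import_prefix.strip("/")
    key = s3_prefix.strip("/")
    if imp and key != imp and not key.startswith(imp + "/"):
        raise ValueError(f"{s3_prefix!r} is not under the import prefix {import_prefix!r}")
    # Drop the import prefix, then join the remainder onto the mount name.
    relative = key[len(imp):].strip("/")
    parts = [mount_name.strip("/"), relative]
    return "/" + "/".join(part for part in parts if part)


# "fsx" is a placeholder mount name; look up the real one for your file system.
example_path = fsx_path_for_s3_prefix("fsx", "fsx_sync/", "fsx_sync/train-coco/coco")
print(example_path)  # /fsx/train-coco/coco
```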

In [None]:
# Submit SageMaker training job
estimator.fit(inputs=data_channels, job_name=job_name)

## Cleanup

In [None]:
# delete FSx
fsx_client.delete_file_system(FileSystemId=fsx_response["FileSystem"]["FileSystemId"])