# Use Amazon SageMaker Distributed Model Parallel to Launch a Mask-RCNN Training Job with Model Parallelization

SageMaker's [distributed model parallel library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. This library automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.

Use this notebook to configure SageMaker's model parallel library to train a model using PyTorch (version 1.6.0) and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk).

In this notebook, you will use a Mask-RCNN example training script with the SageMaker model parallel library.
The example script is based on [Nvidia ML-perf Examples](https://mlcommons.org/en/news/mlperf-training-v07/) and requires you to download the [COCO datasets](https://cocodataset.org/#download) and upload them to Amazon Simple Storage Service (Amazon S3) as explained in the instructions below. This is a large dataset, and so depending on your connection speed, this process can take hours to complete. 

This notebook takes the Mask-RCNN code from the [repo](https://github.com/karakusc/training_results_v0.7) on the branch `smp-mrcnn`, which is based on the ml-perf implementation. The following are important files that are included with this notebook:

* `utils/launch_sm.py`: This is the entry point script that is passed to the PyTorch estimator in the notebook instructions. It contains most of the hyperparameter settings. Many of the hyperparameters defined in this file can be configured and passed directly from this notebook.

* `utils/train_scripts/maskrcnn/maskrcnn_benchmark`: This folder contains the model definition for the Mask-RCNN model and various utilities.

* `utils/train_scripts/maskrcnn/tools`: This folder contains the Python scripts that start training. This demo uses `train_mlperf.py`, which is launched by `utils/launch_sm.py`.

* `utils/train_scripts/maskrcnn/configs`: This folder contains various configurations for Mask-RCNN. This demo uses `e2e_mask_rcnn_R_50_FPN_1x.yaml`, which is based on FPN with a ResNet-50 backbone.

### Additional Resources
If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SageMaker's model parallel library and using SageMaker with PyTorch.

* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

* To learn about the model parallelism library's API, see [Distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel.html#).

* To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).

* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).


### Prerequisites 
If you create the notebook instance with `stack-sm.sh`, you can proceed to the next section.

If you are doing everything manually, you must satisfy the following requirements:

1. You must create an S3 bucket or an FSx/EFS file system to store the input data used for training. This data source must be located in the same AWS Region that you use to launch your training job, which is the AWS Region you use to run this notebook.

2. If you do not use this notebook to download the dataset and upload it to your S3 bucket, you must have the training dataset that you downloaded from the [COCO dataset](https://cocodataset.org/#download) stored in the S3 bucket described in 1. You can use the `utils/download_data.sh` script to download the data. 

**Important**: You should only run the `utils/download_data.sh` script in this notebook instance if you have enough storage. The COCO dataset is about 30 GB. By default, notebook instances have 5 GB of storage. You can check the **Volume Size** of your notebook instance by selecting the instance name in the **Notebook instances** section of the Amazon SageMaker console. To learn how to increase your notebook storage, see [Update a Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-update.html).
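Before running the download script, it can help to check the free space programmatically. The following is a minimal sketch using only the Python standard library. The helper name `has_enough_space` and the 2x headroom factor are illustrative assumptions, not part of this notebook's utilities.

```python
import shutil

COCO_SIZE_GB = 30  # approximate dataset size stated in this notebook


def has_enough_space(path, required_gb):
    """Return True if the file system that holds `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


# Assumption: keeping archives and extracted files side by side needs roughly
# twice the dataset size. Adjust the factor to match your own workflow.
print("Enough space for COCO:", has_enough_space(".", 2 * COCO_SIZE_GB))
```

If the check prints `False`, increase the notebook volume size before downloading, or download the dataset on another machine.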

If you download the dataset locally, you need to upload it to an S3 bucket or to an EFS or FSx file system.
   - For S3: 
        - To learn how to use the console to upload a dataset to S3, see [How do I upload files and folders to an S3 bucket?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/upload-objects.html)
        - To learn how to use the AWS CLI to upload a dataset to S3, see [Using high-level (s3) commands with the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html). For example, after configuring the AWS CLI, the following command uploads all files from the local working directory, `./`, to the bucket `bucket-name`: `$ aws s3 cp ./ s3://bucket-name --recursive`
   - For FSx, check [Getting Started with Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started.html).
   - For EFS, check [Getting Started with Amazon Elastic File System](https://docs.aws.amazon.com/efs/latest/ug/getting-started.html).
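For the S3 option, the upload can also be scripted from Python. The sketch below walks a local directory and uploads each file through a client object that exposes `upload_file(Filename, Bucket, Key)`, as the boto3 S3 client does. The helper names `iter_upload_pairs` and `upload_directory` are hypothetical, and the directory, bucket, and prefix in the usage comment are placeholders.

```python
import os


def iter_upload_pairs(local_dir, key_prefix=""):
    """Yield (local_path, s3_key) for every file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            # S3 keys always use forward slashes, regardless of the local OS.
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            key = f"{key_prefix.rstrip('/')}/{rel}" if key_prefix else rel
            yield local_path, key


def upload_directory(s3_client, local_dir, bucket, key_prefix=""):
    """Upload every file under local_dir and return the number of files uploaded."""
    count = 0
    for local_path, key in iter_upload_pairs(local_dir, key_prefix):
        s3_client.upload_file(local_path, bucket, key)
        count += 1
    return count


# Usage (requires boto3 and AWS credentials; the names below are placeholders):
#   import boto3
#   upload_directory(boto3.client("s3"), "./coco", "bucket-name",
#                    "mask-rcnn/sagemaker/input/train")
```

For a dataset of this size, the AWS CLI is usually the simpler choice. A script like this is mainly useful when the upload is one step in a larger Python workflow.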


## Amazon SageMaker Initialization

Upgrade the SageMaker Python SDK to the latest version.

***Important: Restart the kernel after this step.***

In [None]:
import sagemaker
original_version = sagemaker.__version__
%pip install --upgrade sagemaker

Initialize the notebook instance. Get the AWS Region and the SageMaker execution role Amazon Resource Name (ARN).

In [None]:
%%time
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.pytorch import PyTorch
import boto3

role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role
print(f'SageMaker Execution Role:{role}')

client = boto3.client('sts')
account = client.get_caller_identity()['Account']
print(f'AWS account:{account}')

session = boto3.session.Session()
region = session.region_name
print(f'AWS region:{region}')
sagemaker_session = sagemaker.session.Session(boto_session=session)

## Build Your Own Container

The Mask-RCNN implementation uses custom ops to improve performance. Because building the custom ops and installing the other dependencies is time-consuming, this step is kept separate from copying the training script into the container. In the following cells, you first build a base image, which is slow but only needs to run once. Then you build the actual training image, which copies the training script into the container. `Dockerfile.mrcnn_sm_base` builds the custom ops and installs the other dependencies, and `Dockerfile` copies the training script.

### Build the base image

In [None]:
# Choose your PyTorch version
# Check whether your version is supported by DLC at: https://github.com/aws/deep-learning-containers/blob/master/available_images.md
from account_mapping import DLC_account_mapping
pt_version = "1.6.0"
dlc_account = DLC_account_mapping[region]
if 'cn' in region:
    base_image = f"{dlc_account}.dkr.ecr.{region}.amazonaws.com.cn/pytorch-training:{pt_version}-gpu-py36-cu110-ubuntu18.04"
else:
    base_image = f"{dlc_account}.dkr.ecr.{region}.amazonaws.com/pytorch-training:{pt_version}-gpu-py36-cu110-ubuntu18.04"
print(f"The base DLC image is {base_image}")

`build_base_image.sh` builds the base container for training. This process builds the custom ops, installs the other dependencies, and takes about 30 minutes. **This docker build only needs to be run once.**

In [None]:
!cat utils/build_base_image.sh

In [None]:
!cat utils/Dockerfile.mrcnn_sm_base

In [None]:
!cd utils; chmod +x build_base_image.sh; ./build_base_image.sh {region} {dlc_account} {base_image}

In [None]:
mrcnn_base_image = "mrcnn_base_image"

### Build the real training image

The Mask-RCNN code location is `train_scripts/maskrcnn/`. The following cells will build the training image using these files and upload the image to [Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/). 

**Anytime you make code changes, you must re-run the following code blocks to build the container and upload the updated container to ECR.** This build process is fast because it only copies the training script into container.

Specify your ECR image repo and tag

In [None]:
image = "pt-mrcnn-model-parallel" # use your image repo
tag = "latest" # use your image tag

Build and push the training image

In [None]:
!cat utils/build_and_push.sh

In [None]:
%%time
! cd utils; chmod +x build_and_push.sh; ./build_and_push.sh {region} {image} {tag} {mrcnn_base_image}

Update the training image

In [None]:
if 'cn' in region:
    training_image = f"{account}.dkr.ecr.{region}.amazonaws.com.cn/{image}:{tag}"
else:
    training_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"
print(f'Training image: {training_image}')
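The base image cell and the training image cell both repeat the same region check to pick the registry domain. As a sketch, the logic could be factored into one helper. The function name `ecr_image_uri` is hypothetical, and it uses `region.startswith("cn-")` rather than the substring test in the cells above, which is an assumed equivalent for the China Regions.

```python
def ecr_image_uri(account, region, repository, tag):
    """Build an ECR image URI, using the .cn domain for AWS China Regions."""
    # Assumption: China Regions are exactly those whose names start with "cn-".
    domain = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{domain}/{repository}:{tag}"


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2", "pt-mrcnn-model-parallel", "latest"))
```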

## Prepare data for training


### Stage COCO 2017 dataset in Amazon S3

We use [COCO 2017 dataset](http://cocodataset.org/#home) for training. Use the following code cells to download the COCO 2017 training and validation dataset to this notebook instance, extract the files from the dataset archives, and upload the extracted files to your Amazon [S3 bucket](https://docs.aws.amazon.com/en_pv/AmazonS3/latest/gsg/CreatingABucket.html) with the prefix ```mask-rcnn/sagemaker/input/train```. The ```prepare-s3-bucket.sh``` script executes this step.

If you've already uploaded the dataset to an S3 bucket, use this section to define the S3 URI where the data is stored. This URI is used to define the data channel for training.

In the following code cell, set `s3_bucket` to the bucket where you want to upload the data, or where you have already uploaded the dataset.

In [None]:
s3_bucket = "your-s3-bucket-name"  # replace with the name of your S3 bucket
s3_output_location = f's3://{s3_bucket}/output'

If you already have the COCO 2017 data ready in the `s3_bucket`, you can skip the next two steps and run the code block below to define `s3train`.

In [None]:
!cat ./utils/prepare-s3-bucket.sh

In [None]:
%%time
!cd utils; chmod +x prepare-s3-bucket.sh; ./prepare-s3-bucket.sh {s3_bucket}

Specify the location of training data using `s3train` in the following block. If you used this notebook to upload the data to S3, you do not need to modify the following cell. If you did not use this notebook to upload the data to S3, modify `s3train` to point to the location of the training data in S3.

In [None]:
prefix = "mask-rcnn/sagemaker" # prefix in your S3 bucket
s3train = f's3://{s3_bucket}/{prefix}/input/train' # Specify your S3 path if you already have the data
print(f'your training data should be stored in: {s3train}')

### Define the data channel
In this step, you define an Amazon SageMaker training data channel. The training data channel identifies where your training data is located. This notebook provides three methods for data input: [Amazon S3](https://aws.amazon.com/s3/), [Amazon FSx](https://aws.amazon.com/fsx/), and [Amazon EFS](https://aws.amazon.com/efs/).
- S3: When S3 is used as the data input channel, SageMaker downloads all the data at the specified S3 location to the training instance during the `Downloading` phase. After the download, the data is stored locally on the instance. The download can be time-consuming, but it only happens once per training job.
- FSx/EFS: When FSx/EFS is used as the data input channel, SageMaker doesn't download any data. Instead, SageMaker mounts the FSx/EFS volume on the training instance, so the data stays remote but behaves like a local volume.

Using FSx/EFS during development saves data-downloading time. However, S3 has better performance and is recommended for production.

In [None]:
# Pick your mode from "S3", "FSx", "EFS"
mode = 

#### Use S3 as data input

In [None]:
if mode == "S3":
    train = sagemaker.session.TrainingInput(s3train, distribution='FullyReplicated', 
                                            s3_data_type='S3Prefix')

    data_channels = {'train': train}

    # For S3 this is not required
    security_group_ids = None
    subnets = None

#### Using AWS FSx/EFS as data input

When you use FSx/EFS input, SageMaker mounts the file system on the training instance, so it behaves like a local volume. However, because the data is not local, using FSx/EFS may impact performance.

For FSx/EFS, you need to provide the subnet and `security_group_id` of the file system. You can find the subnet and `security_group_id` in the output of `stack-sm.sh` if you used it to create this notebook instance. Otherwise, you need to make sure that your VPC follows this [guidance](https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html#train-vpc-nat) so that SageMaker can access it.

Prepare the data into the FSx/EFS system. If you have the FSx/EFS ready, you can skip this step. 

**Important: If you want to prepare the EFS file system here, you need to create the notebook instance using `stack-sm.sh`.**

In [None]:
%%time
if mode == "FSx":
    #usage: ./stack-fsx.sh <aws-region> <s3-import-path> <fsx-capacity> <subnet-id> <security-group-id>
    !./stack-fsx.sh {region} {s3train} 3600 'subnet-xxxxxxxxxxxx' 'sg-xxxxxxxxxxxx'
elif mode == "EFS":
    # This requires that the EFS volume is already mounted in the notebook instance
    !./prepare-efs.sh {s3_bucket}

Specify FSx/EFS as the input channel.

In [None]:
if mode == "FSx" or mode == "EFS":
    from sagemaker.inputs import FileSystemInput

    # Specify your file system ID
    file_system_id = 

    # Specify your mounting path, must start with Mount name if using FSx
    directory_path = '/fsx/mask-rcnn/sagemaker/input/train' if mode == "FSx" else "/mask-rcnn/sagemaker/input/train"

    file_system_type = 'FSxLustre' if mode == "FSx" else "EFS"

    training = FileSystemInput(
        file_system_id=file_system_id,
        file_system_type=file_system_type,
        directory_path=directory_path,
        file_system_access_mode='ro',
    )
    data_channels = {'train': training}

    # Specify the security_group_id and subnets of your FSx/EFS system
    # This is required as the input argument for the estimator when using FSx/EFS
    # You can find them in the output of ./stack-sm.sh script you used to create this notebook instance.  
    # Specify only one subnet
    security_group_ids = []  # ['sg-xxxxxxxx']
    subnets = []  # ['subnet-xxxxxxx']

## Define SageMaker Training Job

Next, you will use the SageMaker Estimator API to define a SageMaker Training Job. You will use a [`PyTorch` estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances. To learn more about using the SageMaker distributed model parallel library with the SageMaker Python SDK, see the library's documentation on [launching a Training Job with the SageMaker Python SDK](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-sm-sdk).

### Set your parameters dictionary for SMP and set custom MPI options

With the parameters dictionary you can configure: the number of microbatches, number of partitions, whether to use data parallelism with DDP, the pipelining strategy, the placement strategy and other Mask-RCNN specific hyperparameters. 

If you set `ddp` to `True` in the following code block, you must ensure that the total number of GPUs available (GPUs per instance multiplied by the number of instances) is divisible by `partitions`. You can set `instance_type`, which determines the number of GPUs per instance, and the number of instances (`instance_count`) in the estimator code block later in this notebook. The result of the division is inferred to be the number of model replicas used for DDP (the data parallelism degree).
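The divisibility rule can be checked with a few lines of arithmetic before launching a job. This is a minimal sketch. The function name `data_parallel_degree` is hypothetical, and the GPU count per instance must be looked up for the instance type you choose (an ml.p3.16xlarge has 8 GPUs).

```python
def data_parallel_degree(gpus_per_instance, instance_count, partitions):
    """Return the number of model replicas implied by the GPU count and `partitions`."""
    if partitions <= 0:
        raise ValueError("partitions must be a positive integer")
    total_gpus = gpus_per_instance * instance_count
    if total_gpus % partitions != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split evenly into groups of {partitions}"
        )
    return total_gpus // partitions


# One ml.p3.16xlarge (8 GPUs) with partitions=2 gives 4 model replicas.
print(data_parallel_degree(gpus_per_instance=8, instance_count=1, partitions=2))  # 4
```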

In [None]:
mpi_options = "-verbose --mca btl_vader_single_copy_mechanism none --mca orte_base_help_aggregate 0 -x SMP_D2D_GPU_BUFFER_SIZE_BYTES=1073741824"
smp_parameters = {
    "partitions": 2,
    "microbatches": 2,
    "memory_weight": 1.0,
    "ddp": True
}

metric_definitions = [{"Name": "base_metric", "Regex": "<><><><><><>"}]

hyperparameters = {
    'BASE_LR' : 0.008,
    'MAX_ITER' : 45000,
    'WARMUP_FACTOR' : 0.0001,
    'WARMUP_ITERS' : 100,
    'TRAIN_IMS_PER_BATCH' : 32,
    'TEST_IMS_PER_BATCH' : 32,
    'WEIGHT_DECAY': 5e-4,
    'OPTIMIZER' : 'NovoGrad',
    'LR_SCHEDULE' : 'COSINE',
    'BETA1' : 0.9,
    'BETA2' : 0.4
}
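The `Regex` value in `metric_definitions` above is a placeholder. SageMaker applies each regex to the training log output and records the value captured by the group. The sketch below shows how such a pattern can be tested locally with Python's `re` module. The log line format and the metric name `train_loss` are assumptions for illustration, because the exact format printed by `train_mlperf.py` is not shown in this notebook.

```python
import re

# Hypothetical metric definition: the capture group holds the metric value.
example_metric_definitions = [{"Name": "train_loss", "Regex": r"loss: ([0-9.]+)"}]

# Hypothetical log line, not actual output from the training script.
sample_log_line = "iter: 100  loss: 0.7321  lr: 0.008000"


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches the line."""
    values = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            values[definition["Name"]] = float(match.group(1))
    return values


print(extract_metrics(sample_log_line, example_metric_definitions))
```

Once a pattern matches the lines your training script prints, replace the placeholder regex with it.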

### Instantiate PyTorch Estimator with SageMaker's Model Parallel Library Enabled

The following code block defines a PyTorch estimator using the Amazon SageMaker Python SDK, with SageMaker's model parallel library enabled. As needed, update the following parameters and refer to the *Update the Type and Number of EC2 Instances Used* section below for more information.
* `base_job_name` 
* `instance_type` 
* `instance_count`
* `volume_size`
* `processes_per_host`

#### Update the Type and Number of EC2 Instances Used

The instance type and number of instances you specify in `instance_type` and `instance_count` respectively will determine the number of GPUs Amazon SageMaker uses during training. Explicitly, `instance_type` will determine the number of GPUs on a single instance and that number will be multiplied by `instance_count`. 

You must specify values for `instance_type` and `instance_count` so that the total number of GPUs available for training is equal to `partitions` in `config` of `smp.init` in your training script or, when `ddp` is `True`, is divisible by `partitions`.


Additionally, in the `mpi` section of `distribution`, you must specify the number of processes MPI should launch on each host using `processes_per_host`. In SageMaker, a host is a single [Amazon EC2 ml instance](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html). The SageMaker Python SDK maintains a one-to-one mapping between processes and GPUs across model and data parallelism. This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process. To learn more, see the documentation on using SageMaker distributed model parallel with [PyTorch 1.7.1, 1.6.0](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script-pt.html#model-parallel-customize-training-script-pt-16).

**Important**: `processes_per_host` must be less than or equal to the number of GPUs per instance, and typically equals the number of GPUs per instance. For example, if you use one instance with 4-way model parallelism and 2-way data parallelism, then `processes_per_host` should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs, such as an ml.p3.16xlarge.
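The worked example above can be expressed as a small check. This is a sketch under the assumption that processes are spread evenly across instances. The function name `required_processes_per_host` is hypothetical.

```python
def required_processes_per_host(model_parallel_degree, data_parallel_degree,
                                gpus_per_instance, instance_count=1):
    """Return processes_per_host, or raise if the instance has too few GPUs."""
    total_processes = model_parallel_degree * data_parallel_degree
    if total_processes % instance_count != 0:
        raise ValueError("processes cannot be spread evenly across instances")
    per_host = total_processes // instance_count
    if per_host > gpus_per_instance:
        raise ValueError(
            f"{per_host} processes per host exceed {gpus_per_instance} GPUs per instance"
        )
    return per_host


# 4-way model parallelism x 2-way data parallelism on one 8-GPU instance.
print(required_processes_per_host(4, 2, gpus_per_instance=8))  # 8
```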

See [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) for SageMaker supported instances and cost information. To look up the number of GPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/). Use the section **Accelerated Computing** to see general purpose GPU instances. Note that an ml.p3.2xlarge has the same number of GPUs as a p3.2xlarge.

#### Update your Volume Size

The volume size you specify in `volume_size` must be larger than your input data size (about 30 GB).


In [None]:
pytorch_estimator = PyTorch("launch_sm.py",
                            source_dir="utils",
                            role=role,
                            instance_type="ml.p3.16xlarge",
                            volume_size=200,
                            instance_count=1,
                            sagemaker_session=sagemaker_session,
                            image_uri=training_image,
                            distribution={
                                "smdistributed": {
                                    "modelparallel": {
                                        "enabled": True,
                                        "parameters": smp_parameters
                                    }
                                },
                                "mpi": {
                                    "enabled": True,
                                    "processes_per_host": 8,
                                    "custom_mpi_options": mpi_options,
                                }
                            },
                            output_path=s3_output_location,
                            hyperparameters=hyperparameters,
                            metric_definitions=metric_definitions,
                            security_group_ids = security_group_ids,
                            subnets = subnets,
                            base_job_name="mask-rcnn-model-parallel-demo")

Finally, you will use the estimator to launch the SageMaker training job.


In [None]:
pytorch_estimator.fit(data_channels)

## Monitor Your Training Job
You can monitor the status of the training job using the Amazon SageMaker console and you can access the training logs from [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). You can use CloudWatch to track SageMaker GPU and memory utilization during training and inference. To view the metrics and logs that SageMaker writes to CloudWatch, see **Processing Job, Training Job, Batch Transform Job, and Endpoint Instance Metrics** in [Monitor Amazon SageMaker with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).

If you are a new user of CloudWatch, see [Getting Started with Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/GettingStarted.html). 

For additional information on monitoring and analyzing Amazon SageMaker training jobs, see [Monitor and Analyze Training Jobs Using Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).
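The job status can also be polled from Python instead of the console. The sketch below assumes a client object that exposes `describe_training_job(TrainingJobName=...)`, as the boto3 SageMaker client does, and reads the `TrainingJobStatus` field of the response. The function name `wait_for_training_job` and the polling interval are illustrative choices.

```python
import time

# Statuses after which the job no longer changes state.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, max_polls=None):
    """Poll the job until it reaches a terminal state or max_polls is exhausted."""
    polls = 0
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return status  # still running; the caller decides what to do next
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   job_name = pytorch_estimator.latest_training_job.name
#   print(wait_for_training_job(boto3.client("sagemaker"), job_name))
```

Because `pytorch_estimator.fit(data_channels)` blocks until the job finishes by default, this kind of polling is mainly useful from a separate session or script.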

## Deploying Models with Amazon SageMaker
After you build and train your models, you can deploy them to get predictions in one of two ways:

- To set up a persistent endpoint to get predictions from your models, use SageMaker hosting services. For an overview on deploying a single model or multiple models with SageMaker hosting services, see [Deploy a Model on SageMaker Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html#how-it-works-hosting).
- To get predictions for an entire dataset, use SageMaker batch transform. For an overview on deploying a model with SageMaker batch transform, see [Get Inferences for an Entire Dataset with Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html).

To learn more about deploying models for inference using SageMaker, see [Deploy Models for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html). 