# Distributed Data Parallel RoBERTa Training with PyTorch and SageMaker distributed

[Amazon SageMaker's distributed library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) can be used to train deep learning models faster and cheaper. The [data parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) feature in this library (`smdistributed.dataparallel`) is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet.

This notebook demonstrates how to use `smdistributed.dataparallel` with PyTorch (version 1.9.0) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a [fairseq RoBERTa model](https://github.com/HerringForks/fairseq/tree/use_herring) on [the WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/), using an [Amazon FSx for Lustre file system](https://aws.amazon.com/fsx/lustre/) as the data source.

The outline of steps is as follows:

1. Stage the dataset in [Amazon S3](https://aws.amazon.com/s3/)
2. Create an Amazon FSx for Lustre file system and import the data into it from S3
3. Build Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)
4. Configure data input channels for SageMaker
5. Configure hyperparameters
6. Define training metrics
7. Define training job, set distribution strategy to SMDataParallel and start training

**NOTE:** For large training datasets, we recommend using [Amazon FSx](https://aws.amazon.com/fsx/) as the input file system for the SageMaker training job. FSx input significantly cuts training start-up time on SageMaker because, unlike S3 input, the training data is not downloaded each time you start a training job. It also provides good data read throughput.


**NOTE:** This example requires SageMaker Python SDK v2.X.

## Amazon SageMaker Initialization

Initialize the notebook instance. Get the AWS Region and a SageMaker execution role.

### SageMaker role

The following code cell defines `role` which is the IAM role ARN used to create and run SageMaker training and hosting jobs. This is the same IAM role used to create this SageMaker Notebook instance. 

`role` must have permission to create a SageMaker training job and host a model. For granular policies you can use to grant these permissions, see [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). If you do not require fine-grained permissions for this demo, you can use the IAM managed policy AmazonSageMakerFullAccess.

Because this example reads its training data from FSx, make sure to also attach FSx access permissions to this IAM role.

In [None]:
%%time
! python3 -m pip install --upgrade sagemaker
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
import boto3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
role_name = role.split("/")[-1]  # the role name is the last path segment of the ARN
print('------------------------------------------------')
print(f"SageMaker Execution Role:{role}")
print(f"The name of the Execution role: {role_name}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account:{account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region:{region}")

To verify that the role above has required permissions:

1. Go to the IAM console: https://console.aws.amazon.com/iam/home.
2. Select **Roles**.
3. Enter the role name in the search box to search for that role. 
4. Select the role.
5. Use the **Permissions** tab to verify this role has required permissions attached.
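
The role name to enter in the search box in step 3 is the last path segment of the role ARN. As a minimal sketch, a small helper can extract it. The function name `role_name_from_arn` and the example ARN below are hypothetical and used only for illustration.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Return the role name, i.e. the text after the last '/' in an IAM role ARN."""
    if ":role/" not in role_arn:
        raise ValueError(f"Not an IAM role ARN: {role_arn!r}")
    return role_arn.rsplit("/", 1)[-1]


# Hypothetical ARN, for illustration only
example_arn = "arn:aws:iam::123456789012:role/service-role/ExampleSageMakerRole"
print(role_name_from_arn(example_arn))
```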

## Prepare SageMaker Training Images

1. SageMaker by default uses the latest [Amazon Deep Learning Container Images (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) PyTorch training image. In this step, we use the DLC PyTorch 1.9 image as the base image and install additional dependencies required for training the RoBERTa model.
2. We have forked the Facebook [Fairseq repository](https://github.com/pytorch/fairseq) to [HerringForks/fairseq](https://github.com/HerringForks/fairseq/tree/use_herring) so that we can make custom adaptations to the training script to make it work with `smdistributed.dataparallel`. Please refer to [this commit](https://github.com/HerringForks/fairseq/commit/173a6bf51cb251a787bac1b7620f753405071ed4) to see the actual code changes. We will use this fork to build the training docker image.

In [None]:
# Name the training image
image = "smddp-reborta"
tag = "pt1.9"

In [None]:
# Show the Docker file
!pygmentize ./Dockerfile

In [None]:
# This is the script to build the training image and push it to ECR
!pygmentize ./build_and_push.sh

In [None]:
# Login for DLC ECR account
!aws ecr get-login-password --region {region} | docker login \
  --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com

In [None]:
# Build and push the training image
! chmod +x build_and_push.sh; bash build_and_push.sh {region} {image} {tag}
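
The pushed image is later referenced by its full ECR URI, which this notebook assembles from the account ID, Region, image name, and tag. The sketch below shows that naming scheme in isolation. The helper name `ecr_image_uri` and the example values are assumptions, and the format shown applies to standard commercial Regions.

```python
def ecr_image_uri(account: str, region: str, image: str, tag: str) -> str:
    """Build the ECR URI for an image in a private repository."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"


# Hypothetical account ID and image name, for illustration only
print(ecr_image_uri("123456789012", "us-west-2", "my-training-image", "pt1.9"))
```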

## Preparing FSx Input for SageMaker

1. Download and prepare your training dataset on S3. One example dataset is [WikiText-103](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). Instructions for preparing the WikiText-103 dataset can be found [here](https://github.com/HerringForks/fairseq/blob/master/examples/roberta/README.pretraining.md).
2. Follow these [steps](https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-fs-linked-data-repo.html) to create an FSx file system linked to the S3 bucket that holds your training data. Make sure to add an endpoint to your VPC that allows S3 access.
3. Follow these [steps](https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/) to configure your SageMaker training job to use FSx.
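
Step 2 can also be scripted instead of done in the console. The following is only a sketch of the request that the boto3 FSx client's `create_file_system` call accepts for a Lustre file system linked to an S3 prefix. The helper name `lustre_request`, the example IDs, and the 1200 GiB capacity are assumptions, so check them against your own setup before use.

```python
def lustre_request(s3_import_path, subnet_id, security_group_ids, storage_gib=1200):
    """Assemble keyword arguments for boto3's fsx.create_file_system (Lustre linked to S3)."""
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": storage_gib,  # GiB; assumed value, adjust to your dataset
        "SubnetIds": [subnet_id],
        "SecurityGroupIds": list(security_group_ids),
        "LustreConfiguration": {"ImportPath": s3_import_path},
    }


# Hypothetical bucket and resource IDs, for illustration only
request = lustre_request("s3://my-bucket/wikitext-103", "subnet-0123", ["sg-0123"])
print(request)
# To actually create the file system (requires AWS credentials):
# import boto3; boto3.client("fsx").create_file_system(**request)
```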

### Important Caveats

1. When launching the SageMaker notebook instance, use the same `subnet`, `vpc`, and `security group` that you used for FSx. The SageMaker training job uses the same configuration.
2. Make sure you set [appropriate inbound/outbound rules](https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html) in the `security group`. Specifically, the ports listed in that guide must be open for SageMaker to access the FSx file system in the training job.
3. Make sure the SageMaker IAM role used to launch this training job has access to Amazon FSx.

## Preparing Training Script
To start training, SageMaker requires a single Python script as the entry point in each process/GPU. This script can either be the training script itself, or it can call other executables in the training Docker container. In this example, we use an entry point script that demonstrates the second case. Remember to fill in the dataset directory in the entry point script. This is the path to the dataset on your FSx file system.

In [None]:
# Show the entry point script
!pygmentize ./entry_point.py
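
To illustrate the "entry point calls another executable" pattern, here is a minimal sketch. It is not the `entry_point.py` shipped with this example. The helper `build_command` and the arguments shown are assumptions; `SM_CHANNEL_TRAIN` is the environment variable SageMaker sets to the mount location of the `train` input channel.

```python
import os


def build_command(data_dir, extra_args):
    """Assemble the command line an entry point could hand to a subprocess."""
    return ["fairseq-train", data_dir, *extra_args]


# Fall back to the default channel location when running outside SageMaker
channel_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
command = build_command(channel_dir, ["--task", "masked_lm"])
print(" ".join(command))
# A real entry point would then launch it, for example:
# subprocess.run(command, check=True)
```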

## SageMaker PyTorch Estimator function options

In the following code block, you can update the estimator function to use a different instance type, instance count, and distribution strategy. You will also need to pass in your entry point script from above.

**Instance types**

SMDataParallel supports model training on SageMaker with the following instance types only. For best performance, it is recommended you use an instance type that supports Amazon Elastic Fabric Adapter (ml.p3dn.24xlarge and ml.p4d.24xlarge).

1. ml.p3.16xlarge
1. ml.p3dn.24xlarge [Recommended]
1. ml.p4d.24xlarge [Recommended]

**Instance count**

To get the best performance and the most out of SMDataParallel, you should use at least 2 instances, but you can also use 1 for testing this example.
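
Changing the instance count also changes the global batch size, because each GPU runs its own data parallel worker. The sketch below works through the arithmetic, assuming the `max-sentences` hyperparameter configured later in this notebook is a per-GPU batch size and that each supported instance type has 8 GPUs. The helper name `global_batch` is hypothetical.

```python
# GPUs per instance for the supported instance types
GPUS_PER_INSTANCE = {"ml.p3.16xlarge": 8, "ml.p3dn.24xlarge": 8, "ml.p4d.24xlarge": 8}


def global_batch(instance_type, instance_count, per_gpu_batch, update_freq=1):
    """Sequences consumed per optimizer update across all data parallel workers."""
    workers = GPUS_PER_INSTANCE[instance_type] * instance_count
    return workers * per_gpu_batch * update_freq


one_node = global_batch("ml.p4d.24xlarge", 1, per_gpu_batch=16)
two_nodes = global_batch("ml.p4d.24xlarge", 2, per_gpu_batch=16)
print(one_node, two_nodes)  # 128 256
print(one_node * 512)       # 65536: upper bound on tokens per update at 512 tokens per sample
```

When you scale out, you may want to revisit the learning rate or `update-freq` so that the effective batch size stays where you intend it.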

**Distribution strategy**

Note that to use DDP mode, you need to set the `distribution` argument of the estimator so that it enables `smdistributed.dataparallel`.
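
As a minimal sketch, the distribution setting can be produced by a helper that also rejects instance types outside the supported list above. The helper name `smddp_distribution` is hypothetical; the returned dictionary is the value this notebook passes to the estimator.

```python
SUPPORTED_INSTANCE_TYPES = ("ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4d.24xlarge")


def smddp_distribution(instance_type):
    """Return the estimator `distribution` value, failing early on unsupported instances."""
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        raise ValueError(f"{instance_type} is not in the supported list: {SUPPORTED_INSTANCE_TYPES}")
    return {"smdistributed": {"dataparallel": {"enabled": True}}}


print(smddp_distribution("ml.p4d.24xlarge"))
```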

In [None]:
import os
from sagemaker.pytorch import PyTorch

In [None]:
instance_type = 'ml.p4d.24xlarge' # Other supported instance type: ml.p3.16xlarge, ml.p3dn.24xlarge
instance_count = 1  # You can use 2, 4, 8 etc.
docker_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"  # YOUR_ECR_IMAGE_BUILT_WITH_ABOVE_DOCKER_FILE
username = "AWS"
subnets = ["<subnet-id>"]  # Should be same as Subnet used for FSx. Example: subnet-0f9XXXX
security_group_ids = [
    "<security-group-id>"
]  # Should be same as Security group used for FSx. sg-03ZZZZZZ
file_system_id = "<fsx-id>"  # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'
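
Before launching a job, it can help to confirm that none of the placeholders above were left unfilled. The following sketch checks only the resource ID prefixes seen in the examples (`subnet-`, `sg-`, `fs-`). The helper name `invalid_ids` is hypothetical, and the check does not verify that the resources exist.

```python
EXPECTED_PREFIX = {"subnet": "subnet-", "security_group": "sg-", "file_system": "fs-"}


def invalid_ids(ids):
    """Return the keys whose value does not start with the expected resource prefix."""
    return [kind for kind, value in ids.items() if not value.startswith(EXPECTED_PREFIX[kind])]


# The subnet placeholder was deliberately left unfilled here
problems = invalid_ids(
    {"subnet": "<subnet-id>", "security_group": "sg-03ZZZZZZ", "file_system": "fs-0bYYYYYY"}
)
print(problems)  # ['subnet']
```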

In [None]:
# Configure FSx Input for your SageMaker Training job
from sagemaker.inputs import FileSystemInput

file_system_directory_path = "<fsx-mount-name>"  # This is the mount name of the fsx
file_system_access_mode = "ro"
file_system_type = "FSxLustre"
train_fs = FileSystemInput(
    file_system_id=file_system_id,
    file_system_type=file_system_type,
    directory_path=file_system_directory_path,
    file_system_access_mode=file_system_access_mode,
)
data_channels = {"train": train_fs} 
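
The `directory_path` value is easy to get wrong. As a sketch, and assuming the path takes the form `/<mount-name>` optionally followed by a subdirectory, a helper can normalize it. The helper name `fsx_directory_path` and the example mount name are hypothetical.

```python
def fsx_directory_path(mount_name, subdir=""):
    """Join an FSx mount name and an optional subdirectory into an absolute path."""
    mount = mount_name.strip("/")
    if not mount:
        raise ValueError("mount_name must not be empty")
    parts = [mount] + ([subdir.strip("/")] if subdir.strip("/") else [])
    return "/" + "/".join(parts)


# Hypothetical mount name and dataset folder, for illustration only
print(fsx_directory_path("abcdefgh"))
print(fsx_directory_path("/abcdefgh/", "wikitext-103"))
```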

In [None]:
# Configure hyperparameters
# RoBERTa Large scaled to ~1.3B parameters via the encoder-* overrides below
hyperparameters = {
    'fp16': '',
    'task': 'masked_lm',
    'criterion': 'masked_lm',
    'arch': 'roberta_large',
    'sample-break-mode': 'complete',
    'tokens-per-sample': 512,
    'optimizer': 'adam',
    'adam-eps': 1e-6,
    'clip-norm': 0.0,
    'lr-scheduler': 'polynomial_decay',
    'lr': 0.0001,
    'warmup-updates': 10000,
    'total-num-update': 125000,
    'dropout': 0.1,
    'attention-dropout': 0.1,
    'weight-decay': 0.01,
    'max-sentences': 8,
    'update-freq': 1,
    'max-update': 125000,
    'log-format': 'simple',
    'log-interval': 10,
    'encoder-layers': 24,
    'encoder-embed-dim': 2048,
    'encoder-ffn-embed-dim': 8192,
    'memory-efficient-fp16': '',
    'distributed-no-spawn' : '',
}
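
The "1.3B" figure can be sanity-checked with a rough count. The sketch below counts only the attention and feed-forward weight matrices plus the token embedding table, ignoring biases, layer norms, and position embeddings. The vocabulary size of 50265 is an assumption, and `approx_params` is a hypothetical helper.

```python
def approx_params(layers, embed_dim, ffn_dim, vocab_size=50265):
    """Rough Transformer encoder parameter count (weights only)."""
    attention = 4 * embed_dim * embed_dim   # query, key, value, and output projections
    feed_forward = 2 * embed_dim * ffn_dim  # the two linear layers of the FFN block
    embeddings = vocab_size * embed_dim     # token embedding table
    return layers * (attention + feed_forward) + embeddings


print(approx_params(24, 2048, 8192))  # 1310902272, about 1.3B
print(approx_params(24, 1024, 4096))  # 353461248, about 350M (default roberta_large sizes)
```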

In [None]:
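# Configure hyperparameters
# RoBERTa Large 350M parameters
# NOTE: this cell reassigns `hyperparameters`, replacing the 1.3B configuration above.
# Run only the cell for the model size you want (the last one run wins).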
hyperparameters = {
    'fp16': '',
    'task': 'masked_lm',
    'criterion': 'masked_lm',
    'arch': 'roberta_large',
    'sample-break-mode': 'complete',
    'tokens-per-sample': 512,
    'optimizer': 'adam',
    'adam-eps': 1e-6,
    'clip-norm': 0.0,
    'lr-scheduler': 'polynomial_decay',
    'lr': 0.0001,
    'warmup-updates': 10000,
    'total-num-update': 125000,
    'dropout': 0.1,
    'attention-dropout': 0.1,
    'weight-decay': 0.01,
    'max-sentences': 16,
    'update-freq': 1,
    'max-update': 125000,
    'log-format': 'simple',
    'log-interval': 10,
    'distributed-no-spawn' : ''
}
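
The `lr`, `warmup-updates`, and `total-num-update` settings together define the learning rate curve. The sketch below reproduces a polynomial decay schedule with linear warmup, assuming a decay power of 1.0 and an end learning rate of 0. Those two defaults and the helper name `learning_rate` are assumptions, not values read from this notebook.

```python
def learning_rate(step, peak_lr=1e-4, warmup=10000, total=125000, end_lr=0.0, power=1.0):
    """Learning rate at a given update: linear warmup, then polynomial decay."""
    if warmup > 0 and step <= warmup:
        return peak_lr * step / warmup
    if step >= total:
        return end_lr
    remaining = 1 - (step - warmup) / (total - warmup)
    return (peak_lr - end_lr) * remaining ** power + end_lr


for step in (0, 5000, 10000, 67500, 125000):
    print(step, learning_rate(step))
```

With these settings the rate climbs to 1e-4 over the first 10,000 updates, falls to half of that at update 67,500, and reaches 0 at update 125,000.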


In [None]:
job_name = "pytorch-smdataparallel-roberta-testrun-new-16"  # Name of the SageMaker training job; use it to find the job in the SageMaker Training jobs console. Job names must be unique, so change it before re-running.
estimator = PyTorch(
    entry_point="entry_point.py",
    role=role,
    image_uri=docker_image,
    source_dir=".",
    instance_count=instance_count,
    instance_type=instance_type,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    subnets=subnets,
    security_group_ids=security_group_ids,
    debugger_hook_config=False,
    # Training using SMDataParallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
# Submit SageMaker training job
estimator.fit(inputs=data_channels, job_name=job_name)
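
Because training job names must be unique within an account and Region, re-running the cell above with the same `job_name` fails. One way around this, shown here only as a sketch, is to append a timestamp to a fixed prefix. The helper name `unique_job_name` is hypothetical; the pattern and the 63-character limit follow the SageMaker naming rules for training jobs.

```python
import re
import time

JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def unique_job_name(prefix):
    """Append a UTC timestamp to the prefix and keep the result within 63 characters."""
    suffix = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    name = f"{prefix[:63 - len(suffix) - 1]}-{suffix}"
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid training job name: {name!r}")
    return name


name = unique_job_name("pytorch-smdataparallel-roberta")
print(name)
```

The generated name can then be passed as `job_name` to `estimator.fit`.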