# Use Amazon SageMaker Distributed Model Parallel to Launch a BERT Training Job with Model Parallelization

SMP (SageMaker Distributed Model Parallel) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.

Use this notebook to configure SMP to train a model using PyTorch (version 1.6.0) and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk).

In this notebook, you will use a BERT example training script with SMP.
The example script is based on [Nvidia Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT). It requires you to download the datasets and upload them to Amazon S3, as described in the instructions below.

### Additional Resources
If you are a new user of Amazon SageMaker, you may find the following helpful to understand how SageMaker uses Docker to train custom models.
* To learn more about using Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).
* To learn more about using Docker to train your own models with Amazon SageMaker, see [Example Notebooks: Use Your Own Algorithm or Model](https://docs.aws.amazon.com/sagemaker/latest/dg/adv-bring-own-examples.html).
* To see another example of distributed training using Amazon SageMaker, see [Distributed TensorFlow training using Amazon SageMaker](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/distributed_tensorflow_mask_rcnn).


### Prerequisites 

* An S3 bucket to store the input data to be used for training.
* The input data you use for training must be in an Amazon S3 bucket in the same AWS Region as this notebook instance.
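
One way to check the Region prerequisite is to compare the bucket's Region with the Region of this notebook. The sketch below assumes the boto3 `get_bucket_location` call, which reports `None` for buckets in `us-east-1`. The helper names `normalize_bucket_region` and `bucket_in_region` are illustrative and are not part of the original notebook.

```python
def normalize_bucket_region(location_constraint):
    """Map the LocationConstraint value returned by S3 to a Region name."""
    # S3 reports None for us-east-1 and the legacy value "EU" for eu-west-1.
    if location_constraint is None:
        return "us-east-1"
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint


def bucket_in_region(bucket_name, expected_region):
    """Return True if the bucket is in expected_region (requires AWS credentials)."""
    import boto3  # imported lazily so the pure helper above works without boto3

    s3 = boto3.client("s3")
    constraint = s3.get_bucket_location(Bucket=bucket_name)["LocationConstraint"]
    return normalize_bucket_region(constraint) == expected_region
```

For example, `bucket_in_region("your-bucket", region)` could be called once `region` has been set in the initialization cell.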

## Amazon SageMaker Initialization

Initialize the notebook instance. Get the AWS Region and the SageMaker execution role.

In [None]:
%%time
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.pytorch import PyTorch
import boto3

role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role
print(f'SageMaker Execution Role:{role}')

client = boto3.client('sts')
account = client.get_caller_identity()['Account']
print(f'AWS account:{account}')

session = boto3.session.Session()
region = session.region_name
print(f'AWS region:{region}')
sagemaker_session = sagemaker.session.Session(boto_session=session)
import sys
print(sys.path)

## Prepare/Identify your Training Data in Amazon S3

If you don't already have the BERT dataset in an S3 bucket, see the instructions in the [Nvidia BERT Example](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md) to download the dataset and upload it to an S3 bucket.

Use the following cell to specify the Amazon S3 bucket and prefix that contain your training data. Both variables are required, because later cells build the training data location from `s3_bucket` and `prefix`. For example, if your training data is in s3://your-bucket/training, enter your-bucket for s3_bucket and training for prefix. Note that your output data will be stored in the same bucket, under the "output" prefix.

In [None]:
s3_bucket = '<ADD BUCKET>'
prefix = '<ADD PREFIX>'
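
If you only have the full S3 URI of your training data, the bucket and prefix can be derived from it. This is a minimal sketch using only the Python standard library. `split_s3_uri` and `key_prefix` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Drop leading and trailing slashes so the prefix can be joined cleanly later.
    return parsed.netloc, parsed.path.strip("/")


bucket, key_prefix = split_s3_uri("s3://your-bucket/training")
print(bucket, key_prefix)
```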

## Define SageMaker Data Channels

In this step, you define the Amazon SageMaker training data channel.

In [None]:
s3train = f's3://{s3_bucket}/{prefix}'
# TrainingInput is defined in sagemaker.inputs in version 2 of the SageMaker Python SDK
train = sagemaker.inputs.TrainingInput(s3train, distribution='FullyReplicated',
                                       s3_data_type='S3Prefix')

data_channels = {'train': train}

Required: Set your output data path:

In [None]:
s3_output_location = f's3://{s3_bucket}/output'
print(f'your output data will be stored in: s3://{s3_bucket}/output')

## Define SageMaker Training Job

Next, you will use the SageMaker Estimator API to define a SageMaker Training Job. You will use the [`PyTorch` estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances.

You must update the following:
* `instance_count`
* `instance_type`
* `volume_size`

See the following sub-sections for more details. 

### Update the Type and Number of EC2 Instances Used

The values you specify in `instance_type` and `instance_count` determine the number of GPUs Amazon SageMaker uses during training. Specifically, `instance_type` determines the number of GPUs on a single instance, and that number is multiplied by `instance_count`.

You must specify values for `instance_type` and `instance_count` so that the total number of GPUs available for training is equal to `partitions` in `config` of `smp.init` in your training script.

If you set `ddp` to `True`, the total number of GPUs available only needs to be divisible by `partitions`. The result of the division is inferred to be the number of model replicas (the data parallelism degree).

See [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) for SageMaker supported instances and cost information. To look up the number of GPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/). Use the section **Accelerated Computing** to see general purpose GPU instances. Note that an ml.p3.2xlarge has the same number of GPUs as a p3.2xlarge.
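
The GPU arithmetic described in this sub-section can be sketched as follows. The `GPUS_PER_INSTANCE` table lists only a few common instance types, and `data_parallel_degree` is a hypothetical helper name. Neither is part of SMP or the SageMaker Python SDK.

```python
# GPUs per instance for a few common SageMaker GPU instance types.
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def data_parallel_degree(instance_type, instance_count, partitions, ddp):
    """Return the number of model replicas implied by the cluster size."""
    total_gpus = GPUS_PER_INSTANCE[instance_type] * instance_count
    if ddp:
        # With ddp enabled, the GPUs must divide evenly into model replicas.
        if total_gpus % partitions != 0:
            raise ValueError(
                f"{total_gpus} GPUs are not divisible by {partitions} partitions"
            )
        return total_gpus // partitions
    # Without ddp, a single model replica must use every GPU.
    if total_gpus != partitions:
        raise ValueError(
            f"{total_gpus} GPUs must equal {partitions} partitions when ddp is off"
        )
    return 1


# One ml.p3.16xlarge (8 GPUs) with 2 partitions and ddp enabled gives 4 replicas.
print(data_parallel_degree("ml.p3.16xlarge", 1, partitions=2, ddp=True))
```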

### Update your Volume Size

The volume size you specify in `volume_size` must be larger than your input data size.
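
As a rough way to pick a value, you can total the size of your input objects and add some headroom. The sketch below assumes a list of entries shaped like the `Contents` items returned by the boto3 `list_objects_v2` call (each with a `Size` in bytes). The helper names and the 1.5x headroom factor are arbitrary assumptions, not a SageMaker recommendation.

```python
import math


def total_size_bytes(objects):
    """Sum the Size field of entries shaped like list_objects_v2 'Contents' items."""
    return sum(obj["Size"] for obj in objects)


def suggested_volume_size_gb(input_bytes, headroom=1.5):
    """Return a volume size in GB that exceeds the input data size."""
    # Round up, and never suggest less than 1 GB.
    return max(1, math.ceil(input_bytes * headroom / 1024**3))


# Two hypothetical objects of 50 GB each.
example_objects = [
    {"Key": "training/part-0", "Size": 50 * 1024**3},
    {"Key": "training/part-1", "Size": 50 * 1024**3},
]
print(suggested_volume_size_gb(total_size_bytes(example_objects)))
```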

### Set your parameters dictionary for SMP and set custom mpioptions

With the parameters dictionary you can configure: the number of microbatches, number of partitions, whether to use data parallelism with ddp, the pipelining strategy, the placement strategy and other BERT specific hyperparameters. 

In [None]:
mpioptions = "-verbose --mca orte_base_help_aggregate 0 "
mpioptions += "--mca btl_vader_single_copy_mechanism none"
parameters = {
    "optimize": "speed",
    "microbatches": 12,
    "partitions": 2,
    "ddp": True,
    "pipeline": "interleaved",
    "overlapping_allreduce": True,
    "placement_strategy": "cluster",
    "memory_weight": 0.3,
}
timeout = 60 * 60
metric_definitions = [{"Name": "base_metric", "Regex": "<><><><><><>"}]

hyperparameters = {"input_dir": "/opt/ml/input/data/train",
                   "output_dir": "./checkpoints", 
                   "config_file": "bert_config.json", 
                   "bert_model": "bert-large-uncased", 
                   "train_batch_size": 48, 
                   "max_seq_length": 128,
                   "max_predictions_per_seq": 20,
                   "max_steps": 7038,
                   "warmup_proportion": 0.2843,
                   "num_steps_per_checkpoint": 200,
                   "learning_rate": 6e-3,
                   "seed": 12439,
                   "steps_this_run": 500,
                   "allreduce_post_accumulation": 1,
                   "allreduce_post_accumulation_fp16": 1,
                   "do_train": 1,
                   "use_sequential": 1,
                   "skip_checkpoint": 1,
                   "smp": 1,
                   "apply_optimizer": 1,
                   "json-summary": "./dllogger.json"}
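
Before launching a job, it can help to check that the batch size and the number of microbatches fit together. The sketch below assumes that SMP splits each batch of `train_batch_size` samples into `microbatches` equal pieces, so the batch size should be evenly divisible by the number of microbatches. `microbatch_size` is a hypothetical helper, not an SMP function.

```python
def microbatch_size(train_batch_size, microbatches):
    """Return the number of samples per microbatch, or raise if uneven."""
    if train_batch_size % microbatches != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible "
            f"by microbatches={microbatches}"
        )
    return train_batch_size // microbatches


# With the values used in this notebook: 48 samples over 12 microbatches.
print(microbatch_size(train_batch_size=48, microbatches=12))
```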

### Instantiate PyTorch Estimator with SMP enabled

In [None]:
pytorch_estimator = PyTorch("sagemaker_smp_pretrain.py",
                            role=role,
                            instance_type="ml.p3.16xlarge",
                            volume_size=200,
                            instance_count=1,
                            sagemaker_session=sagemaker_session,
                            py_version="py3",
                            framework_version='1.6.0',
                            distribution={
                                "smdistributed": {
                                    "modelparallel": {
                                        "enabled": True,
                                        "parameters": parameters
                                    }
                                },
                                "mpi": {
                                    "enabled": True,
                                    "processes_per_host": 8,
                                    "custom_mpi_options": mpioptions,
                                }
                            },
                            source_dir='bert_example',
                            output_path=s3_output_location,
                            max_run=timeout,
                            hyperparameters=hyperparameters,
                            metric_definitions=metric_definitions)
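
The `distribution` argument is a plain nested dictionary, so it can also be assembled by a small helper and inspected before the estimator is created. This is a sketch only. `build_smp_distribution` is a hypothetical name, and it assumes the MPI option key `processes_per_host` used by the SageMaker Python SDK, typically set to the number of GPUs on each instance.

```python
def build_smp_distribution(parameters, processes_per_host, custom_mpi_options=""):
    """Assemble a distribution dictionary that enables SMP on top of MPI."""
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                # Copy so later edits to the caller's dict do not leak in.
                "parameters": dict(parameters),
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": processes_per_host,
            "custom_mpi_options": custom_mpi_options,
        },
    }


example = build_smp_distribution({"partitions": 2, "ddp": True}, processes_per_host=8)
print(example["mpi"]["processes_per_host"])
```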

Finally, you will use the estimator to launch the SageMaker training job.

In [None]:
pytorch_estimator.fit(data_channels, logs=True)