# Use Amazon SageMaker Distributed Model Parallel to Launch a BERT Training Job with Model Parallelization

SageMaker distributed model parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.

Use this notebook to configure SMP to train a model using PyTorch (version 1.6.0) and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk).

In this notebook, you will use a BERT example training script with SMP.
The example script is based on [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) and requires you to download the datasets and upload them to Amazon Simple Storage Service (Amazon S3) as explained in the instructions below. The dataset is large, so depending on your connection speed, this process can take hours to complete.

This notebook depends on the following files. You can find all files in the [bert directory](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/pytorch/model_parallel/bert) in the model parallel section of the Amazon SageMaker Examples notebooks repo.

* `bert_example/sagemaker_smp_pretrain.py`: The entry point script that is passed to the PyTorch estimator in the notebook instructions. This script is responsible for end-to-end training of the BERT model with SMP. It has additional comments wherever the SMP API is used.

* `bert_example/modeling.py`: Contains the model definition for the BERT model.

* `bert_example/bert_config.json`: Allows for additional configuration of the model and is used by `modeling.py`. Additional configuration includes dropout probabilities, pooler and encoder sizes, the number of hidden layers in the encoder, and the size of the intermediate layers in the encoder.

* `bert_example/schedulers.py`: Contains definitions for the learning rate schedulers used in end-to-end training of the BERT model (`bert_example/sagemaker_smp_pretrain.py`).

* `bert_example/utils.py`: Contains helper utility functions used in end-to-end training of the BERT model (`bert_example/sagemaker_smp_pretrain.py`).

* `bert_example/file_utils.py`: Contains file utility functions used in the model definition (`bert_example/modeling.py`).


### Additional Resources
If you are a new user of Amazon SageMaker, you may find the following resources helpful for learning more about SMP and using SageMaker with PyTorch.

* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

* To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).

* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).


### Prerequisites 

1. You must create an S3 bucket to store the input data to be used for training. This bucket must be located in the same AWS Region you use to launch your training job, which is the AWS Region you use to run this notebook. To learn how, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) in the Amazon S3 documentation.

2. You must download the dataset that you use for training from [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) and upload it to the S3 bucket you created. To learn more about the datasets and the scripts provided to download and preprocess them, see [Getting the data](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#getting-the-data) in the NVIDIA Deep Learning Examples repo README. You can also use the [Quick Start Guide](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#quick-start-guide) to learn how to download the dataset. The repository consists of three datasets. Optionally, you can use the `wiki_only` parameter to download only the Wikipedia dataset.

## Amazon SageMaker Initialization

Upgrade the SageMaker SDK to the latest version.
NOTE: This step may require a kernel restart.

In [1]:
import sagemaker
original_version = sagemaker.__version__
%pip install --upgrade sagemaker

Collecting sagemaker
  Downloading sagemaker-2.23.1.tar.gz (400 kB)
     |████████████████████████████████| 400 kB 7.8 MB/s eta 0:00:01
Collecting smdebug_rulesconfig==1.0.1
  Downloading smdebug_rulesconfig-1.0.1-py2.py3-none-any.whl (20 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.23.1-py2.py3-none-any.whl size=559547 sha256=6274f6fa8840ba1f32775764529946ada47d35f46fbea623d4a3ba01c8d4e4c5
  Stored in directory: /home/ubuntu/.cache/pip/wheels/f6/ea/42/c6241b7aef8d2f4cbe4af5672ecb3889f95fc3df8c599239a4
Successfully built sagemaker
Installing collected packages: smdebug-rulesconfig, sagemaker
  Attempting uninstall: smdebug-rulesconfig
    Found existing installation: smdebug-rulesconfig 1.0.0
    Uninstalling smdebug-rulesconfig-1.0.0:
      Successfully uninstalled smdebug-rulesconfig-1.0.0
  Attempting uninstall: sagemaker
    Found existing installation: sag

Initialize the notebook instance. Get the AWS Region, the AWS account ID, and the SageMaker execution role Amazon Resource Name (ARN).

In [1]:

%%time
import sys

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.pytorch import PyTorch

# SageMaker execution role; alternatively, provide a pre-existing role ARN.
role = get_execution_role()
print(f'SageMaker Execution Role:{role}')

# AWS account ID of the caller.
client = boto3.client('sts')
account = client.get_caller_identity()['Account']
print(f'AWS account:{account}')

# AWS Region and the SageMaker session used for the training job.
session = boto3.session.Session()
region = session.region_name
print(f'AWS region:{region}')
sagemaker_session = sagemaker.session.Session(boto_session=session)

print(sys.path)




Couldn't call 'get_role' to get Role ARN from role name RL to get Role path.


SageMaker Execution Role:arn:aws:iam::688520471316:role/RL
AWS account:688520471316
AWS region:us-west-2
['', '/home/ubuntu/anaconda3/envs/python3/lib/python36.zip', '/home/ubuntu/anaconda3/envs/python3/lib/python3.6', '/home/ubuntu/anaconda3/envs/python3/lib/python3.6/lib-dynload', '/home/ubuntu/.local/lib/python3.6/site-packages', '/home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages', '/home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/IPython/extensions', '/home/ubuntu/.ipython']
CPU times: user 932 ms, sys: 95.6 ms, total: 1.03 s
Wall time: 2.22 s


In [5]:
#from smdistributed.modelparallel.torch.optimizers import FusedLAMB

## Prepare/Identify your Training Data in Amazon S3

If you don't already have the BERT dataset in an S3 bucket, see the instructions in the [NVIDIA BERT Example](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md) to download the dataset and upload it to an S3 bucket. See the prerequisites at the beginning of this notebook for more information.

Edit the following cell to specify the Amazon S3 bucket and prefix that contain your training data. For example, if your training data is in `s3://your-bucket/training`, enter `'your-bucket'` for `s3_bucket` and `'training'` for `prefix`. Note that your output data will be stored in the same bucket, under the `output/` prefix.

In [2]:
s3_bucket = 'sagemaker-us-west-2-688520471316'
prefix = 'data/bert/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en_abstract'
#prefix = '<ADD PREFIX>'
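
As a small illustration of the bucket/prefix split described above, the following sketch derives both values from a full S3 URI. The helper name `split_s3_uri` is hypothetical and not part of the SageMaker SDK; it uses only the Python standard library.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Drop leading and trailing slashes so the prefix can be re-joined
    # later as f"s3://{bucket}/{prefix}".
    return parsed.netloc, parsed.path.strip("/")


# Illustrative URI taken from the example in the text.
example_bucket, example_prefix = split_s3_uri("s3://your-bucket/training")
print(example_bucket, example_prefix)
```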

## Define SageMaker Data Channels

In this step, you define the Amazon SageMaker training data channel and the output data path. The training data channel identifies where your training data is located in S3.

In [14]:
s3train = f's3://{s3_bucket}/{prefix}'
train = sagemaker.session.TrainingInput(s3train, distribution='FullyReplicated', 
                                        s3_data_type='S3Prefix')

data_channels = {'train': train}
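
In the default File input mode, SageMaker makes each channel available inside the training container under `/opt/ml/input/data/<channel name>`, which is why the `input_dir` hyperparameter later in this notebook points at `/opt/ml/input/data/train`. A minimal sketch of that mapping follows; `channel_dir` is a hypothetical helper used only for illustration.

```python
import posixpath

# Base directory where SageMaker places input channels in the container
# (File input mode).
INPUT_DATA_ROOT = "/opt/ml/input/data"


def channel_dir(channel_name):
    """Return the in-container directory for a given channel name."""
    return posixpath.join(INPUT_DATA_ROOT, channel_name)


# The channel name is the key used in the data_channels dictionary.
print(channel_dir("train"))
```

If you rename the channel key, the `input_dir` value passed to the training script would need to change to match.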


Set your output data path. This is where model artifacts are stored. 

In [13]:
s3_output_location = f's3://{s3_bucket}/output'
print(f'your output data will be stored in: s3://{s3_bucket}/output/bert')

your output data will be stored in: s3://sagemaker-us-west-2-688520471316/output/bert


## Define SageMaker Training Job

Next, you will use the SageMaker Estimator API to define a SageMaker training job. You will use a [`PyTorch` estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances.

You must update the following:
* `instance_count`
* `instance_type`
* `volume_size`

See the following sub-sections for more details. 

### Update the Type and Number of EC2 Instances Used

The values you specify for `instance_type` and `instance_count` determine the number of GPUs Amazon SageMaker uses during training. Specifically, `instance_type` determines the number of GPUs on a single instance, and that number is multiplied by `instance_count`.

You must specify values for `instance_type` and `instance_count` so that the total number of GPUs available for training is equal to `partitions` in `config` of `smp.init` in your training script. 

If you set `ddp` to `True`, you must ensure that the total number of GPUs available is divisible by `partitions`. The result of the division is inferred to be the number of model replicas (the data parallelism degree). For example, this notebook uses one `ml.p3.16xlarge` instance (8 GPUs) with `partitions` set to 2, which gives 4 model replicas.
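
The GPU arithmetic described in this section can be sketched as follows. This is an illustrative check, not part of the SMP library: `GPUS_PER_INSTANCE` and `model_replicas` are hypothetical names, and the table covers only a few P3 instance types.

```python
# GPUs per instance for a few P3 instance types (from the EC2 instance type
# pages); extend this table for other types you plan to use.
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def model_replicas(instance_type, instance_count, partitions, ddp):
    """Return the number of model replicas implied by the cluster shape."""
    total_gpus = GPUS_PER_INSTANCE[instance_type] * instance_count
    if not ddp:
        # Without data parallelism, the GPU count must equal the partitions.
        if total_gpus != partitions:
            raise ValueError(f"{total_gpus} GPUs but {partitions} partitions")
        return 1
    if total_gpus % partitions != 0:
        raise ValueError(
            f"{total_gpus} GPUs is not divisible by {partitions} partitions"
        )
    return total_gpus // partitions


# One ml.p3.16xlarge with 2 partitions and ddp enabled, as in this notebook.
print(model_replicas("ml.p3.16xlarge", 1, 2, ddp=True))
```

Running a check like this before launching a job can catch a mismatched configuration without waiting for instances to start.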

See [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) for SageMaker supported instances and cost information. To look up the number of GPUs for each instance type, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/). Use the section **Accelerated Computing** to see general purpose GPU instances. Note that an ml.p3.2xlarge has the same number of GPUs as a p3.2xlarge.

### Update your Volume Size

The volume size you specify in `volume_size` must be larger than your input data size.
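
One way to pick a value is to measure the input data under your S3 prefix and add some headroom. The sketch below is an assumption-laden illustration: `total_prefix_bytes` and `suggested_volume_size_gb` are hypothetical helpers, the 1.5x headroom factor is an arbitrary choice, and the S3 listing requires AWS credentials with permission to list the bucket.

```python
import math


def total_prefix_bytes(bucket, prefix):
    """Sum object sizes under s3://bucket/prefix (needs AWS credentials)."""
    import boto3  # imported lazily so the arithmetic below runs without AWS access

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        total += sum(obj["Size"] for obj in page.get("Contents", []))
    return total


def suggested_volume_size_gb(data_bytes, headroom=1.5):
    """Round the data size, scaled by a headroom factor, up to whole GiB."""
    return math.ceil(data_bytes * headroom / 1024 ** 3)


# Pure arithmetic example: 100 GiB of input data with the assumed headroom.
print(suggested_volume_size_gb(100 * 1024 ** 3))
```

With real data you would pass `total_prefix_bytes(s3_bucket, prefix)` as `data_bytes` and use the result for `volume_size`.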

### Set your parameters dictionary for SMP and set custom MPI options

With the parameters dictionary you can configure the number of microbatches, the number of partitions, whether to use data parallelism with `ddp`, the pipelining strategy, the placement strategy, and other BERT-specific hyperparameters.
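
To make the relationship between batch size and microbatches concrete: the configuration in this notebook sets `train_batch_size` to 48 and `microbatches` to 12, so each microbatch would hold 48 / 12 = 4 samples. The sketch below assumes the batch size must divide evenly by the number of microbatches (check the SMP documentation for your library version); `samples_per_microbatch` is a hypothetical helper.

```python
def samples_per_microbatch(batch_size, microbatches):
    """Return how many samples land in each microbatch of a pipelined batch."""
    if batch_size % microbatches != 0:
        # Assumption: SMP expects an even split of the batch.
        raise ValueError(
            f"batch size {batch_size} is not divisible by {microbatches} microbatches"
        )
    return batch_size // microbatches


# Values taken from this notebook's hyperparameters and SMP parameters.
print(samples_per_microbatch(48, 12))
```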

In [15]:
mpi_options = "-verbose --mca orte_base_help_aggregate 0 "
smp_parameters = {
    "optimize": "speed",
    "microbatches": 12,
    "partitions": 2,
    "ddp": True,
    "pipeline": "interleaved",
    "overlapping_allreduce": True,
    "placement_strategy": "cluster",
    "memory_weight": 0.3,
}
timeout = 60 * 60
metric_definitions = [{"Name": "base_metric", "Regex": "<><><><><><>"}]

hyperparameters = {"input_dir": "/opt/ml/input/data/train",
                   "output_dir": "./checkpoints", 
                   "config_file": "bert_config.json", 
                   "bert_model": "bert-large-uncased", 
                   "train_batch_size": 48, 
                   "max_seq_length": 128,
                   "max_predictions_per_seq": 20,
                   "max_steps": 7038,
                   "warmup_proportion": 0.2843,
                   "num_steps_per_checkpoint": 200,
                   "learning_rate": 6e-3,
                   "seed": 12439,
                   "steps_this_run": 500,
                   "allreduce_post_accumulation": 1,
                   "allreduce_post_accumulation_fp16": 1,
                   "do_train": 1,
                   "use_sequential": 1,
                   "skip_checkpoint": 1,
                   "smp": 1,
                   "apply_optimizer": 1}

### Instantiate PyTorch Estimator with SMP enabled

In [9]:
from sagemaker.local import LocalSession

local_session = LocalSession()
local_session.config = {
    'local' : {
        'local_mode':True
    }
}

In [16]:
pytorch_estimator = PyTorch("sagemaker_smp_pretrain.py",
                            role=role,
                            instance_type="ml.p3.16xlarge",
                            volume_size=200,
                            instance_count=1,
                            sagemaker_session=sagemaker_session,
                            py_version="py36",
                            framework_version='1.6.0',
                            distribution={
                                "smdistributed": {
                                    "modelparallel": {
                                        "enabled": True,
                                        "parameters": smp_parameters
                                    }
                                },
                                "mpi": {
                                    "enabled": True,
                                    "processes_per_host": 8,
                                    "custom_mpi_options": mpi_options,
                                }
                            },
                            source_dir='bert_example',
                            output_path=s3_output_location,
                            max_run=timeout,
                            hyperparameters=hyperparameters,
                            metric_definitions=metric_definitions)

Finally, you will use the estimator to launch the SageMaker training job.

In [17]:
# Launch the training job and stream its logs to the notebook.
pytorch_estimator.fit(data_channels, logs=True)

2020-12-30 00:48:36 Starting - Starting the training job...
2020-12-30 00:49:00 Starting - Launching requested ML instancesProfilerReport-1609289316: InProgress
.........
2020-12-30 00:50:35 Starting - Preparing the instances for training.........
2020-12-30 00:52:07 Downloading - Downloading input data
2020-12-30 00:52:07 Training - Downloading the training image..................
2020-12-30 00:55:06 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-12-30 00:55:07,379 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2020-12-30 00:55:07,459 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-12-30 00:55:10,528 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-12-30 00:55:10,995 sagemaker-tr

[1,6]<stdout>:device: cuda:6 n_gpu: 1, mp_rank: 0, rank: 6, distributed training: False, 16-bits training: 0
[1,7]<stdout>:device: cuda:7 n_gpu: 1, mp_rank: 1, rank: 7, distributed training: False, 16-bits training: 0
[1,5]<stdout>:device: cuda:5 n_gpu: 1, mp_rank: 1, rank: 5, distributed training: False, 16-bits training: 0
[1,4]<stdout>:device: cuda:4 n_gpu: 1, mp_rank: 0, rank: 4, distributed training: False, 16-bits training: 0
[1,2]<stdout>:device: cuda:2 n_gpu: 1, mp_rank: 0, rank: 2, distributed training: False, 16-bits training: 0
[1,1]<stdout>:device: cuda:1 n_gpu: 1, mp_rank: 1, rank: 1, distributed training: False, 16-bits training: 0
[1,3]<stdout>:device: cuda:3 n_gpu: 1, mp_rank: 1, rank: 3, distributed training: False, 16-bits training: 0
[1,0]<stdout>:device: cuda:0 n_gpu: 1, mp_rank: 0, rank: 0, distributed training: False, 16-bits training: 0
[1,0]<stdout>:[2020-12-30 00:55:25.693: W smdistrib

[1,0]<stdout>:Loss: 11.483488082885742
[1,0]<stdout>:Loss: 11.354116439819336
[1,0]<stdout>:Loss: 11.344836235046387
[1,0]<stdout>:Loss: 11.271285057067871
[1,0]<stdout>:Loss: 11.034238815307617
[1,0]<stdout>:Loss: 11.065356254577637
[1,0]<stdout>:Loss: 10.933156967163086
[1,0]<stdout>:Loss: 10.887144088745117
[1,0]<stdout>:Loss: 10.813663482666016
[1,0]<stdout>:Loss: 10.789870262145996
[1,0]<stdout>:Loss: 10.63941764831543
[1,0]<stdout>:Loss: 10.468856811523438
[1,0]<stdout>:Loss: 10.36077880859375
[1,0]<stdout>:Loss: 10.404821395874023
[1,0]<stdout>:Loss: 10.454989433288574
[1,0]<stdout>:Loss: 10.200750350952148
[1,0]<stdout>:Loss: 10.270511627197266
[1,0]<stdout>:Loss: 10.147346496582031
[1,0]<stdout>:Loss: 10.087596893310547
[1,0]<stdout>:Loss: 10.141414642333984
[1,0]<stdout>:Loss: 9.986273765563965

[1,0]<stdout>:Loss: 2.1427998542785645
[1,0]<stdout>:Loss: 2.0572926998138428
[1,0]<stdout>:Loss: 1.0085976123809814
[1,0]<stdout>:Loss: 0.8136067390441895
[1,0]<stdout>:Loss: 0.9908639192581177
[1,0]<stdout>:Loss: 0.7030866146087646
[1,0]<stdout>:Loss: 1.5312809944152832
[1,0]<stdout>:Loss: 0.6484192609786987
[1,0]<stdout>:Loss: 0.49187231063842773
[1,0]<stdout>:Loss: 1.779942512512207
[1,0]<stdout>:Loss: 0.5889333486557007
[1,0]<stdout>:Loss: 1.6668152809143066
[1,0]<stdout>:Loss: 0.5027334094047546
[1,0]<stdout>:Loss: 1.5741355419158936
[1,0]<stdout>:Loss: 0.3856635093688965
[1,0]<stdout>:Loss: 0.3613825738430023
[1,0]<stdout>:Loss: 0.2412063479423523
[1,0]<stdout>:Loss: 0.197327122092247
[1,0]<stdout>:Loss: 0.1985853910446167
[1,0]<stdout>:Loss: 0.15104126930236816
[1,0]<stdout>:Loss: 0.1478540152311

[1,0]<stdout>:Loss: 0.14584039151668549
[1,0]<stdout>:Loss: 0.4935380220413208
[1,0]<stdout>:Loss: 0.03370758146047592
[1,0]<stdout>:Loss: 0.014607151970267296
[1,0]<stdout>:Loss: 0.20980136096477509
[1,0]<stdout>:Loss: 0.1187763661146164
[1,0]<stdout>:Loss: 0.1161094456911087
[1,0]<stdout>:Loss: 0.19437524676322937
[1,0]<stdout>:Loss: 0.05651412904262543
[1,0]<stdout>:Loss: 0.1647794544696808
[1,0]<stdout>:Loss: 0.10710610449314117
[1,0]<stdout>:Loss: 0.14509719610214233
[1,0]<stdout>:Loss: 0.14240017533302307
[1,0]<stdout>:Loss: 0.07780162990093231
[1,0]<stdout>:Loss: 0.05311121046543121
[1,0]<stdout>:Loss: 0.05894014984369278
[1,0]<stdout>:Loss: 0.1920362412929535
[1,0]<stdout>:Loss: 0.060521967709064484
[1,0]<stdout>:Loss: 0.03312932699918747
[1,0]<stdout>:Loss: 0.03796634078025818
[1,0]<stdout>:Loss


2020-12-30 01:08:14 Uploading - Uploading generated training model
2020-12-30 01:08:14 Completed - Training job completed
ProfilerReport-1609289316: IssuesFound
Training seconds: 967
Billable seconds: 967
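
The training log prints lines of the form `Loss: <value>`, while the `metric_definitions` entry in this notebook uses a placeholder regex. As a hedged sketch, a metric definition along the following lines might capture the loss. The metric name `train_loss` and the regex are illustrative assumptions, checked here only against sample lines copied from the log above using Python's `re` module.

```python
import re

# Hypothetical metric definition; "train_loss" is an illustrative name.
LOSS_REGEX = r"Loss: ([0-9]+(?:\.[0-9]+)?)"
example_metric_definitions = [{"Name": "train_loss", "Regex": LOSS_REGEX}]

# Sample lines copied from the training log, plus one non-loss line.
sample_log = [
    "[1,0]<stdout>:Loss: 11.483488082885742",
    "[1,0]<stdout>:Loss: 0.014607151970267296",
    "[1,0]<stdout>:device: cuda:0 n_gpu: 1, mp_rank: 0, rank: 0",
]

losses = []
for line in sample_log:
    # Search anywhere in the line, since each line carries an MPI rank prefix.
    match = re.search(LOSS_REGEX, line)
    if match:
        losses.append(float(match.group(1)))

print(losses)
```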
