# Distributed data parallel BERT training with TensorFlow2 and SMDataParallel

SMDataParallel is a new capability in Amazon SageMaker to train deep learning models faster and cheaper. SMDataParallel is a distributed data parallel training framework for TensorFlow, PyTorch, and MXNet.

This notebook example shows how to use SMDataParallel with TensorFlow (version 2.3.1) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a BERT model using an [Amazon FSx for Lustre file system](https://aws.amazon.com/fsx/lustre/) as the data source.

The outline of steps is as follows:

1. Stage the dataset in [Amazon S3](https://aws.amazon.com/s3/). The original dataset for BERT pretraining consists of text passages from BooksCorpus (800M words) (Zhu et al. 2015) and English Wikipedia (2,500M words). Follow NVIDIA's original guidelines to prepare the training data in hdf5 format: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#getting-the-data
2. Create an Amazon FSx for Lustre file system and import the data into it from S3
3. Build Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)
4. Configure data input channels for SageMaker
5. Configure hyperparameters
6. Define training metrics
7. Define training job, set distribution strategy to SMDataParallel and start training

**NOTE:** With a large training dataset, we recommend using [Amazon FSx](https://aws.amazon.com/fsx/) as the input file system for the SageMaker training job. FSx file input significantly cuts down training start-up time on SageMaker because it avoids downloading the training data each time you start the training job (as happens with S3 input) and provides good data read throughput.


**NOTE:** This example requires SageMaker Python SDK v2.X.

## Amazon SageMaker Initialization

Initialize the notebook instance. Get the AWS region and the SageMaker execution role.

The IAM role ARN is used to give training and hosting access to your data. See [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create these roles. Note that if more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s). As described above, this example uses FSx, so make sure to attach the `FSx Access` permission to this IAM role.
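The role substitution described above can be sketched as follows. The helper name `resolve_role_arn` is hypothetical, the ARN pattern is deliberately loose and assumes the standard `aws` partition, and the fallback simply calls `get_execution_role()` as the notebook does. Treat it as an illustration of validating an explicit role ARN string, not as part of the SageMaker SDK.

```python
import re
from typing import Optional

# Loose pattern for an IAM role ARN (standard "aws" partition assumed); illustration only.
_ROLE_ARN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def resolve_role_arn(explicit_arn: Optional[str] = None) -> str:
    """Return the explicit role ARN if given, else the notebook's execution role."""
    if explicit_arn is not None:
        if not _ROLE_ARN.match(explicit_arn):
            raise ValueError(f"Not an IAM role ARN: {explicit_arn!r}")
        return explicit_arn
    # Imported lazily so the explicit-ARN path works without the SageMaker SDK installed.
    from sagemaker import get_execution_role
    return get_execution_role()
```

Calling `resolve_role_arn()` with no argument keeps the notebook's default behavior; passing a full ARN string overrides it for training or hosting.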

In [17]:
%%time
! python3 -m pip install --upgrade sagemaker

Requirement already up-to-date: sagemaker in /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages (2.19.0)
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow_p36/bin/python3 -m pip install --upgrade pip' command.
SageMaker Execution Role:arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881
AWS account:835319576252
AWS region:us-east-1
CPU times: user 122 ms, sys: 12.2 ms, total: 134 ms
Wall time: 1.87 s


In [18]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
import boto3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role
print(f'SageMaker Execution Role:{role}')

client = boto3.client('sts')
account = client.get_caller_identity()['Account']
print(f'AWS account:{account}')

session = boto3.session.Session()
region = session.region_name
print(f'AWS region:{region}')

SageMaker Execution Role:arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881
AWS account:835319576252
AWS region:us-east-1


## Prepare SageMaker Training Images

1. By default, SageMaker uses the latest [Amazon Deep Learning Container Images (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) TensorFlow training image. In this step, we use it as a base image and install additional dependencies required for training the BERT model.
2. The GitHub repository https://github.com/HerringForks/DeepLearningExamples.git provides a TensorFlow2-SMDataParallel BERT training script for your use. This repository will be cloned in the training image for running the model training.

### Build and Push Docker Image to ECR

Run the command below to build the Docker image and push it to ECR.

In [19]:
image = "tf2-smdataparallel-bert-sagemaker"  # Example: tf2-smdataparallel-bert-sagemaker
tag = "latest"   # Example: latest 

In [20]:
!pygmentize ./Dockerfile

ARG region

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04

RUN 	pip --no-cache-dir --no-cache install \
        scikit-learn==0.23.1 \
        wandb==0.9.1 \
        tensorflow-addons \
        colorama==0.4.3 \
        pandas \
        apache_beam \
        pyarrow==0.16 \
        git+https://github.com/HerringForks/transformers.git@master \
        git+https://github.com/huggingface/nlp.git@703b761
        


In [21]:
!pygmentize ./build_and_push.sh

#!/usr/bin/env bash
# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.
# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
# set region

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

if [ "$#" -eq 3 ]; then
    region=$1
    image=$2
    tag=$3

In [22]:
%%time
! chmod +x build_and_push.sh; bash build_and_push.sh {region} {image} {tag}

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  12.35MB
Step 1/3 : ARG region
Step 2/3 : FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04
 ---> 73f448953d3a
Step 3/3 : RUN 	pip --no-cache-dir --no-cache install         scikit-learn==0.23.1         wandb==0.9.1         tensorflow-addons         colorama==0.4.3         pandas         apache_beam         pyarrow==0.16         git+https://github.com/HerringForks/transformers.git@master         git+https://github.com/huggingface/nlp.git@703b761
 ---> Using cache
 ---> 24901ecc9de0
Successfully built 24901ecc9de0
Successfully tagged tf2-smdataparallel-bert-sagemaker:latest
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
The push refers to repository [835319576252.dkr.ecr.us-east-1.amazonaws.com/tf2-smdataparallel-bert-sagemaker]

c96185e7: Preparing

## Preparing FSx Input for SageMaker

1. Download and prepare your training dataset on S3.
2. Follow the steps listed here to create an FSx file system linked to the S3 bucket that holds your training data: https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-fs-linked-data-repo.html. Make sure to add an endpoint to your VPC allowing S3 access.
3. Follow the steps listed here to configure your SageMaker training job to use FSx: https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/

### Important Caveats

1. You need to use the same `subnet`, `vpc`, and `security group` used with FSx when launching the SageMaker notebook instance. The same configuration will be used by your SageMaker training job.
2. Make sure you set appropriate inbound/outbound rules in the `security group`. Specifically, opening the ports listed in the following guide is necessary for SageMaker to access the FSx file system in the training job: https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html
3. Make sure the `SageMaker IAM Role` used to launch this SageMaker training job has access to `AmazonFSx`.
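The security group caveat can be checked programmatically. The sketch below is only an illustration: `allows_tcp_port` is a hypothetical helper that inspects a list shaped like the `IpPermissions` field returned by the EC2 `describe_security_groups` call, and the sample rule is made up. It looks at protocol and port range only and ignores which sources are allowed. Port 988 is used as the example because it is the standard Lustre port; the guide linked above lists the full set of ports to open.

```python
from typing import Any, Dict, List


def allows_tcp_port(ip_permissions: List[Dict[str, Any]], port: int) -> bool:
    """Return True if any rule in an EC2 `IpPermissions` list covers the given TCP port."""
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp" and rule.get("FromPort", -1) <= port <= rule.get("ToPort", -1):
            return True
    return False


# Hypothetical rules, shaped like
# describe_security_groups()["SecurityGroups"][0]["IpPermissions"]
example_rules = [{"IpProtocol": "tcp", "FromPort": 988, "ToPort": 988}]
print(allows_tcp_port(example_rules, 988))
print(allows_tcp_port(example_rules, 22))
```

In a real account, the rule list would come from the security group attached to both the FSx file system and the notebook instance.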

## SageMaker TensorFlow Estimator function options

In the following code block, you can update the estimator function to use a different instance type, instance count, and distribution strategy. You're also passing in the training script you reviewed in the previous cell.

**Instance types**

SMDataParallel supports model training on SageMaker with the following instance types only:
1. ml.p3.16xlarge
1. ml.p3dn.24xlarge [Recommended]
1. ml.p4d.24xlarge [Recommended]

**Instance count**

To get the best performance and the most out of SMDataParallel, you should use at least 2 instances, but you can also use 1 for testing this example.

**Distribution strategy**

Note that to use DDP mode, you update the `distribution` strategy and set it to use `smdistributed dataparallel`.
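As a minimal sketch of the three options above, the helpers below build the `distribution` dictionary and compute the number of GPU workers. The function names and the `GPUS_PER_INSTANCE` table are hypothetical conveniences, not part of the SageMaker SDK; the table relies on each of the three listed instance types having 8 GPUs.

```python
# Instance types listed above as supported by SMDataParallel; each has 8 GPUs.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def smdataparallel_distribution(instance_type: str) -> dict:
    """Build the `distribution` argument, rejecting instance types not listed above."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"{instance_type} is not in the supported list")
    return {"smdistributed": {"dataparallel": {"enabled": True}}}


def world_size(instance_type: str, instance_count: int) -> int:
    """Total number of GPU workers across all instances."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count


print(smdataparallel_distribution("ml.p3dn.24xlarge"))
print(world_size("ml.p3dn.24xlarge", 2))
```

The dictionary returned here has the same shape as the `distribution` argument passed to the estimator later in this notebook.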

### Training script

The GitHub repository https://github.com/HerringForks/deep-learning-models.git provides a reference TensorFlow-SMDataParallel BERT training script for your use. Clone the repository.

In [23]:
# Clone herring forks repository for reference implementation BERT with TensorFlow2-SMDataParallel
!rm -rf deep-learning-models
!git clone --recursive https://github.com/HerringForks/deep-learning-models.git

Cloning into 'deep-learning-models'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 1764 (delta 20), reused 18 (delta 8), pack-reused 1722
Receiving objects: 100% (1764/1764), 4.76 MiB | 59.42 MiB/s, done.
Resolving deltas: 100% (834/834), done.


In [24]:
import boto3
import sagemaker
sm = boto3.client('sagemaker')

In [25]:
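# The index [3] is specific to the account used in this example; list order is not guaranteed,
# so select the entry that corresponds to your own notebook instance.
notebook_instance_name = sm.list_notebook_instances()['NotebookInstances'][3]['NotebookInstanceName']
print(notebook_instance_name)

# Stop if the wrong notebook instance was picked ('dsoaws' is this example's instance name).
if notebook_instance_name != 'dsoaws':
    print('****** ERROR:  MUST FIND THE CORRECT NOTEBOOK ******')
    exit()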

dsoaws


In [26]:
notebook_instance = sm.describe_notebook_instance(NotebookInstanceName=notebook_instance_name)
notebook_instance

{'NotebookInstanceArn': 'arn:aws:sagemaker:us-east-1:835319576252:notebook-instance/dsoaws',
 'NotebookInstanceName': 'dsoaws',
 'NotebookInstanceStatus': 'InService',
 'Url': 'dsoaws.notebook.us-east-1.sagemaker.aws',
 'InstanceType': 'ml.c5.2xlarge',
 'SubnetId': 'subnet-0b8d836c',
 'SecurityGroups': ['sg-5383e807'],
 'RoleArn': 'arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881',
 'NetworkInterfaceId': 'eni-0d92cf87d27f516fc',
 'LastModifiedTime': datetime.datetime(2020, 11, 28, 4, 35, 41, 853000, tzinfo=tzlocal()),
 'CreationTime': datetime.datetime(2020, 2, 24, 23, 6, 31, 851000, tzinfo=tzlocal()),
 'DirectInternetAccess': 'Enabled',
 'VolumeSizeInGB': 2000,
 'RootAccess': 'Enabled',
 'ResponseMetadata': {'RequestId': 'eb8ddbd7-d4c7-403a-9d57-dce93069c1d9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'eb8ddbd7-d4c7-403a-9d57-dce93069c1d9',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '594',
   'dat

In [27]:
security_group_id = notebook_instance['SecurityGroups'][0]
print(security_group_id)

sg-5383e807


In [28]:
subnet_id = notebook_instance['SubnetId']
print(subnet_id)

subnet-0b8d836c


In [29]:
from sagemaker.tensorflow import TensorFlow

In [30]:
print(account)
print(region)
print(image)
print(tag)

835319576252
us-east-1
tf2-smdataparallel-bert-sagemaker
latest


In [31]:
instance_type = "ml.p3dn.24xlarge" # Other supported instance type: ml.p3.16xlarge, ml.p4d.24xlarge
instance_count = 2 # You can use 2, 4, 8 etc.
docker_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}" # YOUR_ECR_IMAGE_BUILT_WITH_ABOVE_DOCKER_FILE
username = 'AWS'
subnets = [subnet_id] # Should be same as Subnet used for FSx. Example: subnet-0f9XXXX
security_group_ids = [security_group_id] # Should be same as Security group used for FSx. sg-03ZZZZZZ
job_name = 'smdataparallel-bert-tf2-fsx-2p3dn' # This job name is used as a prefix for the SageMaker training job. It makes it easy to find your training job in the SageMaker Training jobs console.



In [None]:
# TODO:  Copy data to FSx/S3

In [32]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.1.3-py3-none-any.whl (153 kB)
     |████████████████████████████████| 153 kB 10.7 MB/s eta 0:00:01
Collecting pyarrow>=0.17.1
  Downloading pyarrow-2.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.7 MB)
     |████████████████████████████████| 17.7 MB 29.0 MB/s eta 0:00:01
Collecting dill
  Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 22.9 MB/s  eta 0:00:01
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.8-py3-none-any.whl (19 kB)
Collecting xxhash
  Downloading xxhash-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (242 kB)
     |████████████████████████████████| 242 kB 112.7 MB/s eta 0:00:01
Collecting multiprocess
  Downloading multiprocess-0.70.11.1-py36-none-any.whl (101 kB)
     |████████████████████████████████| 101 kB 22.0 MB/s eta 0:00:01
Installing collected packages: pyarrow, dill, dataclasses, xxhash, multiprocess, datasets
Successful

In [33]:
# For loading datasets
from datasets import list_datasets, load_dataset

# To see all available dataset names
print(list_datasets()) 

# To load a dataset
wiki = load_dataset("wikipedia", "20200501.en", split='train')


['aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'amttl', 'anli', 'aqua_rat', 'arcd', 'arsentd_lev', 'art', 'arxiv_dataset', 'aslg_pc12', 'asnq', 'asset', 'autshumato', 'bible_para', 'big_patent', 'billsum', 'biomrc', 'blended_skill_talk', 'blimp', 'blog_authorship_corpus', 'bookcorpus', 'bookcorpusopen', 'boolq', 'break_data', 'bsd_ja_en', 'c3', 'c4', 'cail2018', 'capes', 'cawac', 'cdsc', 'cdt', 'cfq', 'chr_en', 'circa', 'civil_comments', 'clinc_oos', 'clue', 'cmrc2018', 'cnn_dailymail', 'coached_conv_pref', 'coarse_discourse', 'codah', 'code_search_net', 'com_qa', 'common_gen', 'commonsense_qa', 'compguesswhat', 'conceptnet5', 'conll2000', 'conll2002', 'conll2003', 'conv_ai', 'coqa', 'cornell_movie_dialog', 'cos_e', 'cosmos_qa', 'covid_qa_castorini', 'covid_qa_deepset', 'craigslist_bargains', 'crd3', 'crime_and_punish', 'crows_pairs', 'cs_restaurants', 'csv', 'curiosity_

OSError: Not enough disk space. Needed: 34.06 GiB (download: 16.99 GiB, generated: 17.07 GiB, post-processed: Unknown size)
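The failure above can be anticipated by checking free disk space before starting the download. This is a minimal sketch using only the standard library; `has_free_space` is a hypothetical helper, and the 34.06 GiB figure is taken from the error message above.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path: str, needed_gib: float) -> bool:
    """Return True if the filesystem containing `path` has at least `needed_gib` GiB free."""
    return shutil.disk_usage(path).free >= needed_gib * GIB


# 34.06 GiB is the requirement reported in the error above.
if not has_free_space(".", 34.06):
    print("Not enough free space here; point the download at a larger volume.")
```

If the check fails, one option is to pass a `cache_dir` on a larger volume to `load_dataset`, or to stage the data directly on S3 or FSx as noted in the TODO cell.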

In [None]:
file_system_id = '<FSX_ID>' # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'

In [14]:
SM_DATA_ROOT = '/opt/ml/input/data/train'

hyperparameters={
    "train_dir": '/'.join([SM_DATA_ROOT, 'tfrecords/train/max_seq_len_128_max_predictions_per_seq_20_masked_lm_prob_15']),
    "val_dir": '/'.join([SM_DATA_ROOT, 'tfrecords/validation/max_seq_len_128_max_predictions_per_seq_20_masked_lm_prob_15']), 
    "log_dir": '/'.join([SM_DATA_ROOT, 'checkpoints/bert/logs']), 
    "checkpoint_dir": '/'.join([SM_DATA_ROOT, 'checkpoints/bert']), 
    "load_from": "scratch", 
    "model_type": "bert", 
    "model_size": "large", 
    "per_gpu_batch_size": 64, 
    "max_seq_length": 128,
    "max_predictions_per_seq": 20, 
    "optimizer": "lamb", 
    "learning_rate": 0.005, 
    "end_learning_rate": 0.0003, 
    "hidden_dropout_prob": 0.1, 
    "attention_probs_dropout_prob": 0.1,
    "gradient_accumulation_steps": 1,
    "learning_rate_decay_power": 0.5, 
    "warmup_steps": 2812, 
    "total_steps": 2000, 
    "log_frequency": 10,
    "run_name" : job_name,
    "squad_frequency": 0
    }
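To make the effect of these hyperparameters concrete, the sketch below computes the effective global batch size and flags a warmup period longer than the run. Both helper names are hypothetical. The sketch assumes 8 GPUs per `ml.p3dn.24xlarge` instance and that the global batch is the product of per-GPU batch size, GPU count, and gradient accumulation steps. How the training script itself treats `warmup_steps` relative to `total_steps` is not stated in this notebook, so the second check is only a sanity prompt.

```python
def global_batch_size(per_gpu_batch_size: int, gpus_per_instance: int,
                      instance_count: int, gradient_accumulation_steps: int = 1) -> int:
    """Samples consumed per optimizer step across all data-parallel workers."""
    return (per_gpu_batch_size * gpus_per_instance
            * instance_count * gradient_accumulation_steps)


def warmup_exceeds_total(warmup_steps: int, total_steps: int) -> bool:
    """True when the warmup period is longer than the whole run."""
    return warmup_steps > total_steps


# Values from the hyperparameters above, with 2 instances of 8 GPUs each (assumed).
print(global_batch_size(64, 8, 2, 1))
print(warmup_exceeds_total(2812, 2000))
```

With the values above, `warmup_steps` (2812) is larger than `total_steps` (2000), so it is worth confirming that this is intended before launching a long job.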

In [None]:
estimator = TensorFlow(entry_point='albert/run_pretraining.py',
                        role=role,
                        image_uri=docker_image,
                        source_dir='deep-learning-models/models/nlp',
                        framework_version='2.3.1',
                        py_version='py3',
                        instance_count=instance_count,
                        instance_type=instance_type,
                        sagemaker_session=sagemaker_session,
                        subnets=subnets,
                        hyperparameters=hyperparameters,
                        security_group_ids=security_group_ids,
                        debugger_hook_config=False,
                        # Training using SMDataParallel Distributed Training Framework
                        distribution={'smdistributed':{
                                        'dataparallel':{
                                                'enabled': True
                                            }
                                        }
                                      }
                      )

## Configure FSx Input for the SageMaker Training Job

In [None]:
from sagemaker.inputs import FileSystemInput

# YOUR_MOUNT_PATH_FOR_TRAINING_DATA. NOTE: '/fsx/' will be the root mount path. Example: '/fsx/albert'
file_system_directory_path='/fsx/' 
file_system_access_mode='rw'
file_system_type='FSxLustre'

train_fs = FileSystemInput(file_system_id=file_system_id,
                                    file_system_type=file_system_type,
                                    directory_path=file_system_directory_path,
                                    file_system_access_mode=file_system_access_mode)
data_channels = {'train': train_fs}
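The channel name used as the key in `data_channels` determines where the data appears inside the training container: SageMaker exposes each input channel under `/opt/ml/input/data/<channel_name>`, which is why `SM_DATA_ROOT` above ends in `train`. The sketch below, with hypothetical helper names, shows how the hyperparameter paths relate to the channel name.

```python
import posixpath

# Directory under which SageMaker exposes input channels inside the training container.
SM_INPUT_ROOT = "/opt/ml/input/data"


def channel_dir(channel_name: str) -> str:
    """Container path for a named input channel."""
    return posixpath.join(SM_INPUT_ROOT, channel_name)


def path_in_channel(channel_name: str, relative_path: str) -> str:
    """Container path of a file or directory relative to the channel's mount point."""
    return posixpath.join(channel_dir(channel_name), relative_path)


print(channel_dir("train"))
print(path_in_channel("train", "checkpoints/bert"))
```

If you rename the channel, the paths passed in `hyperparameters` need to change to match.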

In [None]:
# Submit SageMaker training job
estimator.fit(inputs=data_channels, job_name=job_name)