# Train EleutherAI GPT-J with PyTorch 1.10 and Pipeline Parallelism Using the SageMaker Model Parallelism Library

**Please run this notebook with Data Science-> Python 3 Kernel on SageMaker Studio Notebook**

This notebook walks you through training [EleutherAI's](https://www.eleuther.ai/) [GPT-J](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) model with SageMaker's model parallelism library.
EleutherAI released GPT-J 6B as an open-source alternative to [OpenAI's GPT-3](https://openai.com/blog/gpt-3-apps/). [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) is the 6-billion-parameter successor to EleutherAI's GPT-Neo family, a family of transformer-based language models for text generation built on the GPT architecture.

EleutherAI's primary goal is to train a model that is equivalent in size to GPT-3 and make it available to the public under an open license.
Over the last few months, GPT-J has attracted a lot of interest from researchers, data scientists, and software developers, but fine-tuning it has remained challenging.

The weights of the 6-billion-parameter model occupy about 24 GB in float32. Loading the model in float32 requires at least twice that much CPU RAM: 1x for the initial weights and another 1x to load the checkpoint. Training needs more memory still, because gradients, optimizer states, and activations are stored in addition to the parameters, so actual usage can be significantly higher than 48 GB. For example, with the Adam optimizer and FP32 training, parameters, gradients, and optimizer states together can take 96 GB or more, and the activation memory footprint can exceed that, so total memory usage can easily be larger than 200 GB.
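The figures above can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes 4 bytes per FP32 value, one gradient per parameter, and two Adam states per parameter; the function name `training_state_gb` is hypothetical, and activations are deliberately left out because they depend on batch size and sequence length.

```python
# Rough memory estimate for FP32 training with Adam (activations excluded).
BYTES_PER_FP32 = 4
GB = 1e9  # decimal gigabytes, matching the rounded figures in the text


def training_state_gb(num_params: float) -> dict:
    """Approximate size in GB of each persistent group of training tensors."""
    weights = num_params * BYTES_PER_FP32 / GB
    return {
        "weights": weights,
        "gradients": weights,        # one gradient value per parameter
        "adam_states": 2 * weights,  # first and second moment estimates
    }


estimate = training_state_gb(6e9)
total = sum(estimate.values())
for name, size in estimate.items():
    print(f"{name:>12}: {size:.0f} GB")
print(f"{'total':>12}: {total:.0f} GB (activations not included)")
```

This gives 24 GB for the weights and 96 GB for weights, gradients, and optimizer states combined, far more than the 16 GB available on a single V100 GPU.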

In this notebook, you will learn how to fine-tune GPT-J using Amazon SageMaker and Hugging Face on NVIDIA GPU instances.

This notebook depends on the following files and folders:

1. `train_gptj_smp_script.py`: The entry point script that is passed to the PyTorch estimator in this notebook. It is responsible for end-to-end training of the GPT-J model with SMP and has additional comments wherever the SMP API is used.
2. `fp16`: A folder used for 16-bit float training; it contains an fp16 optimizer and various fp16 utilities.
3. `learning_rates.py`: Contains the functions for the learning rate schedule.
4. `requirements.txt`: Installs the dependencies, such as the right version of Hugging Face Transformers.
5. `preprocess.py`: Downloads and preprocesses the sst2/glue dataset.
6. `args.py`: A collection of argument definitions for training, data, and SageMaker model parallelism.
7. `smp_trainer.py`: Defines the SageMaker Model Parallel Trainer class.


## SageMaker Distributed Training 

SageMaker provides distributed training libraries for data parallelism and model parallelism. The libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker, and improve training speed and throughput.

### Approaches

![SageMaker Distributed Training Approaches](img/TypesOfDistributedTraining.png)


### SageMaker Model Parallel

Model parallelism is the process of splitting a model up between multiple devices or nodes (such as GPU-equipped instances) and creating an efficient pipeline to train the model across these devices to maximize GPU utilization.

Increasing deep learning model size (layers and parameters) can result in better accuracy. However, there is a limit to the maximum model size you can fit in a single GPU. When training deep learning models, GPU memory limitations can be a bottleneck in the following ways:

1. They can limit the size of the model you train. Given that larger models tend to achieve higher accuracy, this directly translates to trained model accuracy.

2. They can limit the batch size you train with, leading to lower GPU utilization and slower training.

To overcome the limitations associated with training a model on a single GPU, you can use model parallelism to distribute and train your model on multiple computing devices.

### Core features of SageMaker Model Parallel 

1. [Automated Model Splitting](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): When you use SageMaker's model parallel library, you can take advantage of automated model splitting, also referred to as automated model partitioning. The library uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance. You can configure the automated partitioning algorithm to optimize for speed or memory.

2. [Pipeline Execution Schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): A core feature of SageMaker's distributed model parallel library is pipelined execution, which determines the order in which computations are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism, by having the GPUs compute simultaneously on different data samples, and to overcome the performance loss due to sequential computation.

Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one-by-one and follow an execution schedule defined by the library runtime. A microbatch is a smaller subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by which device for every time slot.
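To make the idea of microbatches and time slots concrete, here is a minimal sketch in plain Python. It is an illustration only: both helper functions are hypothetical, the schedule covers forward passes only, and the schedule that the SageMaker library actually uses (which also interleaves backward passes) is decided by the library runtime.

```python
from typing import List, Sequence, Tuple


def split_into_microbatches(mini_batch: Sequence, num_microbatches: int) -> List[Sequence]:
    """Split a mini-batch into equally sized, contiguous microbatches."""
    if len(mini_batch) % num_microbatches != 0:
        raise ValueError("mini-batch size must be divisible by the number of microbatches")
    size = len(mini_batch) // num_microbatches
    return [mini_batch[i * size:(i + 1) * size] for i in range(num_microbatches)]


def simple_forward_schedule(num_devices: int, num_microbatches: int) -> List[List[Tuple[int, int]]]:
    """For each time slot, list the (device, microbatch) pairs that run in a naive forward-only pipeline."""
    slots = []
    for t in range(num_devices + num_microbatches - 1):
        # Device d works on microbatch t - d, once that microbatch has passed through devices 0..d-1.
        slots.append([(d, t - d) for d in range(num_devices) if 0 <= t - d < num_microbatches])
    return slots


mini_batch = list(range(8))  # eight sample indices standing in for real data
microbatches = split_into_microbatches(mini_batch, num_microbatches=4)
print(microbatches)

for t, work in enumerate(simple_forward_schedule(num_devices=2, num_microbatches=4)):
    print(f"slot {t}: {work}")
```

From the second time slot onward both devices are busy at once, which is the parallelism that pipelining is meant to recover.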

In addition to its [core features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html), the SageMaker distributed model parallel library offers [memory-saving features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html) for training deep learning models with PyTorch: [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html), [optimizer state sharding](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html), [activation checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html), and [activation offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html). 

### SageMaker Model Parallel configuration

Please refer to all the [configuration parameters](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) related to SageMaker Distributed Training.

As we are going to use PyTorch and Hugging Face for training GPT-J, it is important to understand all the SageMaker Distributed configuration parameters specific to PyTorch [here](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).

#### Important

`processes_per_host` must not be greater than the number of GPUs per instance, and it is typically equal to the number of GPUs per instance.

For example, if you use one instance with 4-way pipeline parallelism and 2-way data parallelism, then `processes_per_host` should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs, such as an ml.p3.16xlarge.
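The rule above can be written as a small check to run before launching a job. This is a sketch under stated assumptions: `check_processes_per_host` and the `GPUS_PER_INSTANCE` table are hypothetical helpers, not part of the SageMaker SDK, and the table lists only the instance type used in this notebook.

```python
# GPU counts per instance type; only the instance used in this notebook is listed.
GPUS_PER_INSTANCE = {"ml.p3.16xlarge": 8}


def check_processes_per_host(pipeline_degree, data_degree, instance_type, instance_count=1):
    """Return processes_per_host for the requested degrees, or raise if they cannot fit."""
    total_processes = pipeline_degree * data_degree
    if total_processes % instance_count != 0:
        raise ValueError("total processes must divide evenly across instances")
    per_host = total_processes // instance_count
    gpus = GPUS_PER_INSTANCE[instance_type]
    if per_host > gpus:
        raise ValueError(f"{per_host} processes per host exceed the {gpus} GPUs on {instance_type}")
    return per_host


# 4-way pipeline parallelism x 2-way data parallelism on a single instance
print(check_processes_per_host(4, 2, "ml.p3.16xlarge"))
```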

The following image illustrates how 4-way data parallelism and 2-way pipeline parallelism are distributed across 8 GPUs: the model is partitioned across 2 GPUs, and each partition is placed on 4 GPUs.

It is also important to understand how the [ranking mechanism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-ranking-mechanism.html) of model parallelism works with tensor parallelism. This is extended from the Ranking Basics for Core Features of the SageMaker Model Parallel Library.

![SageMaker Distributed Training Approaches](img/SMP-Pipeline-Parallel-DDP.png)


#### Additional Resources
If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with PyTorch.

1. To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

2. To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).

3. To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

#### Amazon SageMaker Initialization
Run the following cells to import the SageMaker modules and retrieve information about your current SageMaker work environment, such as your AWS account ID, the AWS Region, and the ARN of your Amazon SageMaker execution role.

Upgrade SageMaker SDK to the latest version.

NOTE: This step might require a kernel restart.

In [47]:
!pip install -r requirements.txt


In [48]:
!pip show urllib3

Name: urllib3
Version: 1.26.7
Summary: HTTP library with thread-safe connection pooling, file post, and more.
Home-page: https://urllib3.readthedocs.io/
Author: Andrey Petrov
Author-email: andrey.petrov@shazow.net
License: MIT
Location: /opt/conda/lib/python3.7/site-packages
Requires: 
Required-by: botocore, requests, responses


In [49]:
!pip install --upgrade sagemaker


In [71]:
import botocore
import boto3
import sagemaker
import transformers
import pandas as pd
from sagemaker.local import LocalSession

sagemaker_session = LocalSession()
sagemaker_session.config = {"local": {"local_code": True}}

print(f"sagemaker: {sagemaker.__version__}")
print(f"transformers: {transformers.__version__}")

sagemaker: 2.87.0
transformers: 4.18.0


In [72]:
%%time
import os

import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
from sagemaker.huggingface import HuggingFace
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import boto3

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role: {role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

sm_boto_client = boto3.client("sagemaker")
sagemaker_session = sagemaker.session.Session(boto_session=session)

# get default bucket
default_bucket = sagemaker_session.default_bucket()
print()
print("Default bucket for this session: ", default_bucket)

SageMaker Execution Role: arn:aws:iam::232838030412:role/service-role/AmazonSageMaker-ExecutionRole-20211204T182243
AWS account: 232838030412
AWS region: us-west-2

Default bucket for this session:  sagemaker-us-west-2-232838030412
CPU times: user 135 ms, sys: 15.7 ms, total: 151 ms
Wall time: 678 ms


In [73]:
s3_output_bucket = f"s3://sagemaker-{region}-{account}/smp-model-parallel-outputdir/"

## Training Dataset

The training script fine-tunes GPT-J on the `sst2` dataset. 

#### DataLoader 

The DataLoader and Sampler are defined in `smp_trainer.py`.

## Setup Hyperparameters
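The hyperparameters below use 16-way pipeline parallelism. On a 4-node ml.p3.16xlarge cluster (8 GPUs per instance, 32 GPUs in total), this gives 2-way data parallelism.
Note that the training job cell later in this notebook sets `instance_count = 2`, which provides 16 GPUs and therefore a data parallelism degree of 1. Setting `ddp=True` enables PyTorch's Distributed Data Parallel (DDP).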

`(pipeline parallelism degree) x (data parallelism degree) = total number of GPUs`
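The relation above can be turned around to derive the data parallelism degree from the cluster size. The sketch below uses a hypothetical helper, `data_parallel_degree`, and assumes 8 GPUs per ml.p3.16xlarge instance.

```python
def data_parallel_degree(instance_count, gpus_per_instance, pipeline_degree):
    """Derive the data parallelism degree from the total GPU count and the pipeline degree."""
    total_gpus = instance_count * gpus_per_instance
    if total_gpus % pipeline_degree != 0:
        raise ValueError("total number of GPUs must be divisible by the pipeline parallelism degree")
    return total_gpus // pipeline_degree


for nodes in (2, 4):
    dp = data_parallel_degree(nodes, gpus_per_instance=8, pipeline_degree=16)
    print(f"{nodes} x ml.p3.16xlarge -> {nodes * 8} GPUs -> data parallelism degree {dp}")
```

With 16-way pipeline parallelism, two instances leave a single model replica, while four instances allow two replicas to train in parallel.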

In [74]:
model_name_or_path = "EleutherAI/gpt-j-6B"
ddp = True

In [75]:
load_from_s3 = False
if load_from_s3:
    %store -r model_location
    model_name_or_path = model_location

In [76]:
hyperparameters = {
    "dataset_name": "glue",
    "dataset_config_name": "sst2",
    "do_train": True,
    "do_eval": False,
    "per_device_train_batch_size": 2,
    "output_dir": "./temp",
    "model_name_or_path": model_name_or_path,
    "load_from_s3": load_from_s3,
    "max_steps": 100,
    "seed": 12345,
    "lr": 2.0e-4,
    "lr_decay_iters": 125000,
    "min_lr": 0.00001,
    "warmup": 0.01,
    "shard_optimizer_state": 1,
    "activation_checkpointing": 1,
    "activation_strategy": "each",
    "optimize": "memory",
    "ddp": ddp,
    "cache_dir": "/tmp",
    "save_final_full_model": 0,
}

In [77]:
model_config = "gpt-j-6B"

if model_config == "gpt-j-6B":
    model_params = {
        "tensor_parallel_degree": 1,
        "pipeline_parallel_degree": 16,
        "prescaled_batch": 0,  # was 1
    }
    
for k, v in model_params.items():
    hyperparameters[k] = v

In [78]:
process_per_host = 8

In [79]:
mpioptions = "-x NCCL_DEBUG=WARN -x SMDEBUG_LOG_LEVEL=ERROR "
mpioptions += "-x SMP_NCCL_THROTTLE_LIMIT=1 "
mpioptions += "-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1"


mpi = {
    "enabled": True,
    "processes_per_host": process_per_host,
    "custom_mpi_options": mpioptions,
}

smdistributed = {
    "modelparallel": {
        "enabled": True,
        "parameters": {
            "ddp": hyperparameters["ddp"],
            "microbatches": 2,
            # partitions is a required param in the current SM SDK so it needs to be passed,
            # these two map to the same config
            "partitions": hyperparameters["pipeline_parallel_degree"],
            "shard_optimizer_state": hyperparameters["shard_optimizer_state"] > 0,
            "prescaled_batch": hyperparameters["prescaled_batch"] > 0,
            "optimize": hyperparameters["optimize"],
            "auto_partition": True,
            "default_partition": 0,
            "offload_activations": True,
            "active_microbatches": 2,
        },
    }
}


distribution = {"mpi": mpi, "smdistributed": smdistributed}

## Setup SageMaker Training Job

In [80]:
import os
from sagemaker.pytorch import PyTorch
import datetime

instance_type = "ml.p3.16xlarge"
volume_size = 900
instance_count = 2

# cur_time = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
# base_job_name = f"smp-{instance_type}-smp-gptj-{cur_time}".replace(".", "-").replace(
#     "/", "-"
# )
# print(base_job_name)

In [81]:
from time import gmtime, strftime

# Specify your experiment name
experiment_name = "smp-gptj-model-parallel"
# Specify your trial name
trial_name = f'{experiment_name}-trial1' 

all_experiment_names = [exp.experiment_name for exp in Experiment.list()]
# Load the experiment if it exists, otherwise create 
if experiment_name not in all_experiment_names:
    experiment = Experiment.create(experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client)
else:
    experiment = Experiment.load(experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client)

# Create the trial
trial = Trial.create(
        trial_name="smp-{}-{}".format(trial_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime())),
        experiment_name=experiment.experiment_name,
        sagemaker_boto_client=sm_boto_client,
    )

In [82]:
machine_str = instance_type.split('.')[1] + instance_type.split('.')[2][:3]
pp_degree = hyperparameters['pipeline_parallel_degree']
tp_degree = hyperparameters['tensor_parallel_degree']
base_job_name = f'smp-{model_config}-{machine_str}-tp{tp_degree}-pp{pp_degree}-bs{hyperparameters["per_device_train_batch_size"]}'

In [83]:
kwargs = {}

# smp_estimator = HuggingFace(
#         entry_point="train_gptj_smp_script.py",
#         source_dir=os.getcwd(),
#         role=role,
#         instance_type=instance_type,
#         volume_size=volume_size,
#         instance_count=instance_count,
#         sagemaker_session=sagemaker_session,
#         distribution=distribution,
#         pytorch_version='1.10.2',
#         transformers_version='4.17.0',
#         py_version='py38',
#         output_path=s3_output_bucket,
# #         checkpoint_s3_uri=checkpoint_s3_uri if not use_fsx else None,
# #         checkpoint_local_path=hyperparameters['checkpoint-dir'] if use_fsx else None,
# #         metric_definitions=metric_definitions,
#         hyperparameters=hyperparameters,
#         debugger_hook_config=False,
#         disable_profiler=True,
#         base_job_name=base_job_name,
#         **kwargs
#     )

In [84]:
smp_estimator = PyTorch(
    entry_point="train_gptj_smp_script.py",
    source_dir=os.getcwd(),
    role=role,
    instance_type=instance_type,
    volume_size=volume_size,
    instance_count=instance_count,
    distribution=distribution,
    framework_version="1.10",
    py_version="py38",
    hyperparameters=hyperparameters,
    debugger_hook_config=False,
    disable_profiler=True,
    base_job_name=base_job_name,
)

In [85]:
smp_estimator.fit(experiment_config={
                    "ExperimentName": experiment.experiment_name,
                    "TrialName": trial.trial_name,
                    "TrialComponentDisplayName": "Training",
                  },
                  logs=True)

INFO:sagemaker:Creating training-job with name: smp-gpt-j-6B-p316x-tp1-pp16-bs2-2022-04-21-08-40-49-493


2022-04-21 08:40:51 Starting - Starting the training job......
2022-04-21 08:41:47 Starting - Preparing the instances for training............
2022-04-21 08:43:40 Downloading - Downloading input data
2022-04-21 08:43:40 Training - Downloading the training image.......................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-04-21 08:47:34,734 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-04-21 08:47:34,809 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-04-21 08:47:34,816 sagemaker_pytorch_container.training INFO     Invoking user training script.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-04-21 08:47:35,022 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorc

UnexpectedStatusException: Error for Training job smp-gpt-j-6B-p316x-tp1-pp16-bs2-2022-04-21-08-40-49-493: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "[1,14]<stderr>:RuntimeError[1,14]<stderr>:: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 6; 15.78 GiB total capacity; 14.05 GiB already allocated; 55.75 MiB free; 14.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
 [1,14]<stderr>: [1,14]<stderr>:During handling of the above exception, another exception occurred: [1,14]<stderr>: [1,14]<stderr>:Traceback (most recent call last): [1,14]<stderr>:  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main [1,14]<stderr>:    return _run_code(code, main_globals, None, [1,14]<stderr>:  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code [1,14]<stderr>:    exec(code, run_globals) [1,14]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/mpi4py/__main__.py", line 7, in <module> [1,14]<stderr>:    main() [1,14]<std

#### Specify SageMaker Model Parallel Hyperparameters

The following cell shows an alternative that constructs a Hugging Face estimator with the same parameters defined above; it is left commented out. To see how the SageMaker model parallelism modules and functions are applied to the script, see the `train_gptj_smp_script.py` file.

In [None]:
# from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# smp_estimator = HuggingFace(
#     entry_point="train_gptj_smp_script.py",
#     source_dir=os.getcwd(),
#     role=role,
#     transformers_version="4.17.0",
#     instance_type=instance_type,
#     volume_size=volume_size,
#     instance_count=instance_count,
#     distribution=distribution,
#     pytorch_version="1.10.2",
#     py_version="py38",
#     hyperparameters=hyperparameters,
#     debugger_hook_config=False,
#     disable_profiler=True,
#     base_job_name=base_job_name,
# )

In [None]:
# smp_estimator.fit(logs=True)

## Accessing the Training Logs

You can access the training logs from [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). Make sure to look at the logs of **algo-1** because that is the main node whose output stream will have the training job logs.

You can use CloudWatch to track SageMaker GPU and memory utilization during training and inference. To view the metrics and logs that SageMaker writes to CloudWatch, see [SageMaker Jobs and Endpoint Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs) in the Amazon SageMaker Developer Guide.

If you are a new user of CloudWatch, see [Getting Started with Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/GettingStarted.html). 

For additional information on monitoring and analyzing Amazon SageMaker training jobs, see [Monitor and Analyze Training Jobs Using Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).
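When browsing the logs programmatically, it can help to filter the log streams down to the ones written by **algo-1**. The sketch below assumes that training job logs live in the `/aws/sagemaker/TrainingJobs` log group and that stream names follow the pattern `<job-name>/algo-<n>-<timestamp>`; verify both against your account. The helper `main_node_streams` and the sample stream names are hypothetical.

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"  # assumed log group for SageMaker training jobs


def main_node_streams(stream_names, job_name):
    """Keep only the log streams written by algo-1 for the given training job."""
    # The trailing dash prevents matches on algo-10, algo-11, and so on.
    prefix = f"{job_name}/algo-1-"
    return [name for name in stream_names if name.startswith(prefix)]


# Hypothetical stream names. In practice they could be listed with
# boto3.client("logs").describe_log_streams(logGroupName=LOG_GROUP, logStreamNamePrefix=job_name).
streams = [
    "my-job/algo-1-1650530000",
    "my-job/algo-2-1650530000",
    "other-job/algo-1-1650530001",
]
print(main_node_streams(streams, "my-job"))
```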

## Deploying Trained Model for Inference

In most cases, a trained model can be deployed on a single device for inference because inference requires far less memory than training. You can use the SMP API to create a single, unified model after training: the [smp.DistributedModel.save_model()](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html#smp.DistributedModel.save_model) method for TensorFlow, and the [smp.save()](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#apis-for-saving-and-loading) function for PyTorch.

After you build and train your models, you can deploy them to get predictions in one of two ways:

* To set up a persistent endpoint to get predictions from your models, use SageMaker hosting services. For an overview on deploying a single model or multiple models with SageMaker hosting services, see [Deploy a Model on SageMaker Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html#how-it-works-hosting).
* To get predictions for an entire dataset, use SageMaker batch transform. For an overview on deploying a model with SageMaker Batch Transform, see [Get Inferences for an Entire Dataset with Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html).

To learn more about deploying models for inference using SageMaker, see [Deploy Models for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html). 
