# Train GPT2 with HuggingFace Trainer + the SageMaker Model Parallelism Library

This notebook walks you through how to use the Hugging Face Transformers `Trainer` with the SageMaker model parallelism (SMP) library to train a GPT-2 model. You'll learn how to train the model with tensor parallelism on the `wikitext` text dataset from Hugging Face.

The GPT-2 model was proposed by OpenAI in the paper [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). The original GPT-2 is a large transformer-based language model with 1.5 billion parameters. In this notebook, you can experiment with the model parameters to achieve different model sizes. This notebook uses the [Hugging Face Transformers GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html) implementation with SageMaker model parallel integration.

This notebook requires the following prerequisites:
- `run_clm.py`: This is an entry point script, which is the example training script for the SageMaker Hugging Face estimator. This script is responsible for end-to-end training of the GPT-2 model.
- `requirements.txt`: This file lists additional Python library dependencies that SageMaker will automatically install. This needs to be in the same directory as your entry point script. 

**Note**: To run this example training job, you must be in `us-west-2`. The container image used is located in this region. If your AWS Region is different from `us-west-2`, you must make sure you change the region code throughout this notebook.

### Additional Resources
If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with PyTorch.

- To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

- To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).

- To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

## Install and Upgrade Libraries

The SageMaker model parallelism library's tensor parallelism feature requires the SageMaker Python SDK and the SageMaker Experiments library. Run the following cell to install or upgrade the libraries.

**Note:** To finish applying the changes, you must restart the kernel.

In [None]:
# # run once, restart kernel, then comment out this cell
# # update sagemaker to the latest 2.x version
# ! pip3 install -qU pip
# ! pip3 install -qU "sagemaker>=2,<3"
# ! pip3 install -qU sagemaker-experiments

# import IPython
# IPython.Application.instance().kernel.do_shutdown(True)

Import the SageMaker Python SDK and check that it was upgraded to the latest version.

In [None]:
import sagemaker

print(sagemaker.__version__)
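
If you prefer a programmatic check over reading the printed version, the following minimal sketch tests whether a version string falls in the `>=2,<3` range requested by the install cell. The helper name `is_sagemaker_v2` is hypothetical and not part of the SageMaker SDK.

```python
def is_sagemaker_v2(version_string: str) -> bool:
    """Return True when the major version is 2 (that is, >=2 and <3)."""
    major = int(version_string.split(".")[0])
    return major == 2


# In the notebook, you would pass sagemaker.__version__ instead of a literal.
print(is_sagemaker_v2("2.100.0"))
```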

## Amazon SageMaker Initialization

Throughout this example, you'll use a training script of the GPT-2 model and a text dataset.

Run the following cell to import SageMaker modules and retrieve information of your current SageMaker work environment: your AWS account ID, the AWS Region you are using to run the notebook, and the ARN of your Amazon SageMaker execution role.

In [None]:
%%time
import os

from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import boto3

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role:{role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account:{account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region:{region}")

sm_boto_client = boto3.client("sagemaker")
sagemaker_session = sagemaker.session.Session(boto_session=session)

# get default bucket
default_bucket = sagemaker_session.default_bucket()
print()
print("Default bucket for this session: ", default_bucket)

You also need to specify an Amazon S3 bucket to store the output data such as training artifacts. The following cell sets up the default S3 bucket paired with the current SageMaker session. You can also modify this as needed.

In [None]:
s3_output_bucket = f"s3://{default_bucket}/output/"
print(f"Your output data will be stored in: {s3_output_bucket}")

## Set Up Hyperparameters, Metric Definitions, and MPI Options
The following `hyperparameters` dictionary passes arguments to the training script (`run_clm.py`) and sets the model parallel configuration when creating the training job.

Note that the `run_clm.py` file is currently modified to work with SageMaker. If you want to run your own script, you'll need to add the relevant lines as seen in `run_clm.py`. You can find them quickly by searching for `SageMaker Support`.

You can also add custom MPI flags. By default, we have `--mca btl_vader_single_copy_mechanism none` to remove unnecessary logs.

The notebook also adds a base metric definition for uploading training metrics to SageMaker Experiments, and you can add custom metric definitions as well. The MPI options and metric definitions are set in a later cell, before the estimator is created.

In [None]:
save_steps = 60  # Set the interval for saving checkpoints
max_steps = 100  # Set the total number of steps you want to run

hyperparameters = {
    "output_dir": "/opt/ml/checkpoints",
    "overwrite_output_dir": "",
    "learning_rate": 0.0002,
    "do_train": "",
    "save_steps": save_steps,
    "max_steps": max_steps,
    "eval_steps": 20,
    "evaluation_strategy": "steps",
    "model_type": "gpt2",
    "tokenizer_name": "gpt2",
    "optim": "adamw_torch",
    "dataloader_drop_last": True,
}
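
To see what `save_steps`, `eval_steps`, and `max_steps` imply for a run, the sketch below lists the steps at which a checkpoint or an evaluation would be triggered. It assumes the trainer fires on every step that is a multiple of the interval, which is how the `steps` strategy is generally understood to behave. The helper `trigger_steps` is hypothetical.

```python
from typing import List


def trigger_steps(interval: int, max_steps: int) -> List[int]:
    """Return the steps in 1..max_steps that are multiples of `interval`."""
    return list(range(interval, max_steps + 1, interval))


# Same values as the hyperparameters cell above
save_steps, eval_steps, max_steps = 60, 20, 100

checkpoint_at = trigger_steps(save_steps, max_steps)
evaluate_at = trigger_steps(eval_steps, max_steps)
print("checkpoints at steps:", checkpoint_at)
print("evaluations at steps:", evaluate_at)
```

Under this assumption, a 100-step run with `save_steps = 60` writes a single intermediate checkpoint, so lower `save_steps` if you want more restart points.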

## Specify a HuggingFace Dataset

In this step, you specify the dataset from Hugging Face that you want to train on. Here we use the `wikitext` dataset. Note that larger datasets will take longer to download and process.

In [None]:
# You can use any dataset available from Hugging Face
# Modify these parameters as needed
dataset_name = "wikitext"

if dataset_name == "wikitext":
    # 0.37 MB download, 1.1 GB generated
    hyperparameters["dataset_name"] = "wikitext"  # primary dataset name
    hyperparameters["dataset_config_name"] = "wikitext-2-raw-v1"  # set config for your data subset
else:
    raise RuntimeError("Unknown HuggingFace dataset")

Set the model configuration below. Choose one of `gpt2-small`, `gpt2-xl`, `gpt2-5b`, or define your own. If you want to start from the smallest model, specify `gpt2-small`. The other larger models require `p4d` instances with more GPU memory.

You can also specify different training parameters here, such as batch size, tensor parallelism, data parallelism, and fp16. These affect whether your model fits on your instance configuration.

For more information on these parameters and how to use them, please visit [SageMaker Distributed Training](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html).

Note: You may need to adjust parameters such as `tp_degree` and `pp_degree` if you choose to train a model of a different size.

In [None]:
model_config = "gpt2-small"  # ['gpt2-small', 'gpt2-xl', 'gpt2-5b']

if model_config == "gpt2-small":
    # 100M parameters
    hyperparameters["per_device_train_batch_size"] = 2
    hyperparameters["per_device_eval_batch_size"] = 2
    tp_degree = 4
    pp_degree = 1
    microbatches = 1
    fp16 = True
    hyperparameters["fp16"] = fp16
    prescaled_batch = False
    shard_optimizer_state = False
elif model_config == "gpt2-xl":
    # 1.5B parameters
    # Requires p4d
    hyperparameters[
        "config_overrides"
    ] = "n_embd=1536,n_layer=48,n_head=24"  # note: last param must not have trailing ','
    hyperparameters["per_device_train_batch_size"] = 2
    hyperparameters["per_device_eval_batch_size"] = 4
    tp_degree = 8
    pp_degree = 1
    microbatches = 1
    fp16 = True
    hyperparameters["fp16"] = fp16
    prescaled_batch = True
    shard_optimizer_state = True
elif model_config == "gpt2-5b":
    # 4.5B parameters
    # Requires p4d
    hyperparameters["config_overrides"] = "n_embd=3080,n_layer=40,n_head=40"
    hyperparameters["per_device_train_batch_size"] = 2
    hyperparameters["per_device_eval_batch_size"] = 4
    tp_degree = 8
    pp_degree = 2
    microbatches = 2
    fp16 = True
    hyperparameters["fp16"] = fp16
    prescaled_batch = True
    shard_optimizer_state = True
else:
    raise RuntimeError("Unknown model config")
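
When you define your own `config_overrides`, a rough parameter count helps you judge which instance type you need. The sketch below parses an override string and applies the common approximation of `12 * n_layer * n_embd^2` weights for the transformer blocks plus the token and position embeddings. It ignores biases and layer norms, so treat the result as an estimate only. The helper names are hypothetical, and the defaults are the standard GPT-2 small configuration values.

```python
def parse_overrides(overrides: str) -> dict:
    """Turn 'n_embd=1536,n_layer=48' into {'n_embd': 1536, 'n_layer': 48}."""
    pairs = (item.split("=") for item in overrides.split(","))
    return {key.strip(): int(value) for key, value in pairs}


def estimate_gpt2_params(n_embd=768, n_layer=12, n_head=12,
                         vocab_size=50257, n_positions=1024) -> int:
    """Approximate parameter count, ignoring biases and layer norms."""
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    transformer_blocks = 12 * n_layer * n_embd ** 2
    embeddings = (vocab_size + n_positions) * n_embd
    return transformer_blocks + embeddings


configs = {
    "gpt2-small": "",  # no overrides: uses the defaults above
    "gpt2-xl": "n_embd=1536,n_layer=48,n_head=24",
    "gpt2-5b": "n_embd=3080,n_layer=40,n_head=40",
}
for name, overrides in configs.items():
    kwargs = parse_overrides(overrides) if overrides else {}
    params = estimate_gpt2_params(**kwargs)
    print(f"{name}: ~{params / 1e9:.2f}B parameters")
```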

## Set Up SageMaker Studio Experiment
Create or load a [SageMaker Experiment](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) for the example training job. This will create an experiment trial object in SageMaker Studio.

In [None]:
from time import gmtime, strftime

# Specify your experiment name
experiment_name = "gpt2-hf-trainer"
# Specify your trial name
trial_name = f"{experiment_name}-trial"

all_experiment_names = [exp.experiment_name for exp in Experiment.list()]
# Load the experiment if it exists, otherwise create
if experiment_name not in all_experiment_names:
    experiment = Experiment.create(
        experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client
    )
else:
    experiment = Experiment.load(
        experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client
    )

# Create the trial
trial = Trial.create(
    # trial_name already starts with the experiment name, so only append the timestamp
    trial_name="{}-{}".format(trial_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime())),
    experiment_name=experiment.experiment_name,
    sagemaker_boto_client=sm_boto_client,
)

## Specify Essential Parameters for a SageMaker Training Job

Next, you use the [SageMaker Estimator API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define a SageMaker training job. The following parameters set the number of EC2 instances, the instance type, the size of the volume attached to the instances, and the base name of the training job.

* `instance_count`
* `instance_type`
* `volume_size`
* `base_job_name`

### Update the Type and Number of EC2 Instances to Use

The instance type and the number of instances you specify for the `instance_type` and `instance_count` parameters, respectively, determine the total number of GPUs (world size).

$$ \text{(world size)} = \text{(the number of GPUs on a single instance)}\times\text{(the number of instances)}$$

In [None]:
# Set the instance_type here
# Note: to run models bigger than gpt2-small, please use p4d.24xlarge instances
instance_type = "ml.p3.16xlarge"  # ['ml.p3.16xlarge', 'ml.p4d.24xlarge']


# Set to the number of instances you want to use
# gpt2-small needs >= 2 p3 instances
# gpt2-xl needs >= 1 p4d instance
# gpt2-5b needs >= 2 p4d instances
instance_count = 2

# set to the number of GPUs on that instance
# ml.p3.16xlarge and ml.p4d.24xlarge have 8 GPUs each
processes_per_host = 8
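
The world-size formula above also determines how many data-parallel replicas a job gets once the tensor and pipeline parallel degrees are fixed. The following sketch assumes that the world size must be a whole multiple of `tp_degree * pp_degree` and that the quotient is the data-parallel degree. The helper `data_parallel_degree` is hypothetical and only illustrates the arithmetic.

```python
def data_parallel_degree(instance_count: int, gpus_per_instance: int,
                         tp_degree: int, pp_degree: int) -> int:
    """Return the number of model replicas that fit in the world size."""
    world_size = instance_count * gpus_per_instance
    model_parallel_size = tp_degree * pp_degree
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"tp_degree * pp_degree = {model_parallel_size}"
        )
    return world_size // model_parallel_size


# gpt2-small on 2 x ml.p3.16xlarge: 16 GPUs, tp=4, pp=1
print(data_parallel_degree(2, 8, tp_degree=4, pp_degree=1))
# gpt2-5b on 2 x ml.p4d.24xlarge: 16 GPUs, tp=8, pp=2
print(data_parallel_degree(2, 8, tp_degree=8, pp_degree=2))
```

A check like this before launching a job can catch a mismatch between `instance_count`, `processes_per_host`, and the parallelism degrees without waiting for the training containers to start.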

To look up the number of GPUs of different instance types, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/). Use the section **Accelerated Computing** to see general purpose GPU instances. Note that, for example, a given instance type `p4d.24xlarge` has a corresponding instance type `ml.p4d.24xlarge` in SageMaker.
For SageMaker supported `ml` instances and cost information, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/). 

### Attach an EBS Volume to the Training Instance
The volume size you specify in `volume_size` must be larger than your input data size. In this example, the volume size is set to 500GB.

In [None]:
volume_size = 500

### Specify a Base Job Name

In [None]:
SM_HP_MP_PARAMETERS = {
    "microbatches": microbatches,
    "optimize": "speed",
    "pipeline": "interleaved",
    "placement_strategy": "cluster",
    "tensor_parallel_degree": tp_degree,
    "partitions": pp_degree,
    "prescaled_batch": prescaled_batch,
    "shard_optimizer_state": shard_optimizer_state,
    "fp16": fp16,
}

machine_str = instance_type.split(".")[1] + instance_type.split(".")[2][:3]

base_job_name = f'smp-hf-trainer-{model_config}-{machine_str}-tp{tp_degree}-pp{pp_degree}-bs{hyperparameters["per_device_train_batch_size"]}'
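
SageMaker training job names allow only alphanumeric characters and hyphens and are limited to 63 characters, and the SDK appends a timestamp suffix to `base_job_name`. The sketch below rebuilds the name the same way as the cell above and validates it against that pattern. The function `build_base_job_name` is a hypothetical wrapper, and the regular expression reflects the naming constraint as we understand it from the SageMaker API documentation.

```python
import re

# Alphanumeric characters separated by optional hyphens, at most 63 characters
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def build_base_job_name(model_config: str, instance_type: str,
                        tp_degree: int, pp_degree: int, batch_size: int) -> str:
    """Compose the base job name from the model and instance settings."""
    _, family, size = instance_type.split(".")  # e.g. "ml", "p3", "16xlarge"
    machine_str = family + size[:3]             # e.g. "p316x"
    return (
        f"smp-hf-trainer-{model_config}-{machine_str}"
        f"-tp{tp_degree}-pp{pp_degree}-bs{batch_size}"
    )


name = build_base_job_name("gpt2-small", "ml.p3.16xlarge", 4, 1, 2)
print(name, "valid" if JOB_NAME_PATTERN.match(name) else "invalid")
```

Keep the base name short enough to leave room for the timestamp suffix, and avoid underscores, which the pattern rejects.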

In [None]:
mpioptions = "-x NCCL_DEBUG=WARN -x SMDEBUG_LOG_LEVEL=ERROR "
if instance_type in ["ml.p3dn.24xlarge", "ml.p4d.24xlarge"]:
    mpioptions += "-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 "
if SM_HP_MP_PARAMETERS["partitions"] > 1:
    mpioptions += "-x SMP_ENABLE_CROSS_NODE_D2D=1 "

metric_definitions = [
    {"Name": "base_metric", "Regex": "<><><><><><>"}
]  # Add your custom metric definitions
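
As an illustration of a custom metric definition, the sketch below defines two regular expressions and checks them locally against sample log lines before you pass them to the estimator. The metric names and the sample lines are assumptions: they presume the trainer prints its logs as Python dictionaries such as `{'loss': ...}`, so confirm the format in your own job logs first.

```python
import re
from typing import Optional

# Hypothetical metric definitions; SageMaker captures the first regex group
example_metric_definitions = [
    {"Name": "train_loss", "Regex": r"'loss': ([0-9.]+)"},
    {"Name": "eval_loss", "Regex": r"'eval_loss': ([0-9.]+)"},
]


def extract_metric(regex: str, log_line: str) -> Optional[float]:
    """Return the captured metric value, or None if the line does not match."""
    match = re.search(regex, log_line)
    return float(match.group(1)) if match else None


# Assumed log format, for illustration only
train_line = "{'loss': 3.2154, 'learning_rate': 0.00016, 'epoch': 0.5}"
eval_line = "{'eval_loss': 3.1021, 'eval_runtime': 12.4, 'epoch': 0.5}"

for definition in example_metric_definitions:
    for line in (train_line, eval_line):
        print(definition["Name"], extract_metric(definition["Regex"], line))
```

Including the opening quote in `'loss'` keeps the training-loss pattern from also matching `'eval_loss'`.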

### Resume Training from a Previous Checkpoint

Here, you can choose to resume training from a previous checkpoint saved with HuggingFace Trainer.
Simply set `resume_from_checkpoint` to `True` and specify the bucket in which the checkpoint is stored. For convenience, we use the same bucket to load checkpoints and save output artifacts. You can also customize and set your own.

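Note: The checkpoint path (`checkpoint_s3_uri`) is derived from the model configuration and the number of instances, so it is not unique per job.
Change it for each run that should not share checkpoints with an earlier one.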

In [None]:
resume_from_checkpoint = False

# We label our job with the model configuration and the number of nodes
job_name = f"{model_config}_nodes-{instance_count}"
# Here, we use the same bucket for both checkpoints and outputs
checkpoint_bucket = s3_output_bucket
# If you want to resume training, set checkpoint_s3_uri to the same checkpoint_s3_uri path as a previous job.
# s3_output_bucket ends with "/", so strip it to avoid a double slash in the URI
checkpoint_s3_uri = f"{checkpoint_bucket.rstrip('/')}/{job_name}/checkpoints"

# The previous checkpoint to load must have the same model config.
if resume_from_checkpoint:
    # the checkpoint step you want to resume training from
    # here, we set it to the first checkpoint saved, but you can set it to any
    checkpoint_step = save_steps
    checkpoint_dir = f"/opt/ml/checkpoints/checkpoint-{checkpoint_step}"
    hyperparameters["resume_from_checkpoint"] = checkpoint_dir
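
If you want each run to write to its own checkpoint location, one option is to add a run identifier to the path. The sketch below is a minimal example of that idea. The helper names, the `run_id` argument, and the bucket name `my-bucket` are hypothetical. To resume a run, pass the same `run_id` that the earlier job used.

```python
from typing import Optional


def make_checkpoint_s3_uri(bucket_uri: str, job_name: str,
                           run_id: Optional[str] = None) -> str:
    """Join the path parts, optionally adding a per-run identifier."""
    parts = [bucket_uri.rstrip("/"), job_name]
    if run_id:
        parts.append(run_id)
    parts.append("checkpoints")
    return "/".join(parts)


def local_checkpoint_dir(step: int) -> str:
    """Path inside the training container where a given checkpoint is stored."""
    return f"/opt/ml/checkpoints/checkpoint-{step}"


shared_uri = make_checkpoint_s3_uri("s3://my-bucket/output/", "gpt2-small_nodes-2")
unique_uri = make_checkpoint_s3_uri("s3://my-bucket/output/", "gpt2-small_nodes-2", "run-01")
print(shared_uri)
print(unique_uri)
print(local_checkpoint_dir(60))
```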

### Create a SageMaker HuggingFace Estimator

The following cell constructs a `HuggingFace` estimator using the parameters defined above. To see how the SageMaker tensor parallelism modules and functions are applied to the script, see the `run_clm.py` file and the private preview documentation. 

In [None]:
kwargs = {}

smp_estimator = HuggingFace(
    entry_point="run_clm.py",
    source_dir=os.getcwd(),  # copies your current working directory to S3 for SageMaker
    role=role,
    instance_type=instance_type,
    volume_size=volume_size,
    instance_count=instance_count,
    sagemaker_session=sagemaker_session,
    distribution={
        "mpi": {
            "enabled": True,
            "processes_per_host": processes_per_host,
            "custom_mpi_options": mpioptions,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "ddp": True,
                    "tensor_parallel_degree": SM_HP_MP_PARAMETERS["tensor_parallel_degree"],
                    # `partitions` is required by the current SageMaker SDK, so it must be passed;
                    # it maps to the same setting as `pipeline_parallel_degree`
                    "partitions": SM_HP_MP_PARAMETERS["partitions"],
                    "microbatches": SM_HP_MP_PARAMETERS["microbatches"],
                    "shard_optimizer_state": SM_HP_MP_PARAMETERS["shard_optimizer_state"],
                    "prescaled_batch": SM_HP_MP_PARAMETERS["prescaled_batch"],
                    "fp16": SM_HP_MP_PARAMETERS["fp16"],
                    "optimize": SM_HP_MP_PARAMETERS["optimize"],
                    "auto_partition": True,
                    "default_partition": 0,
                },
            }
        },
    },
    py_version="py38",
    output_path=s3_output_bucket,
    checkpoint_s3_uri=checkpoint_s3_uri,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker",
    debugger_hook_config=False,
    disable_profiler=True,
    base_job_name=base_job_name,
    **kwargs,
)

Finally, run the estimator to launch the SageMaker training job for the GPT-2 model.

In [None]:
smp_estimator.fit(
    experiment_config={
        "ExperimentName": experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    logs=True,
)

## Accessing the Training Logs

You can access the training logs using [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). Make sure to look at the logs of **algo-1**, which is the main node whose output stream has the entire training job logs.
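
You can also read the **algo-1** log stream from a notebook with `boto3`. The following is a minimal sketch, assuming that training logs are written to the `/aws/sagemaker/TrainingJobs` log group with stream names that start with `<job name>/algo-1`. The helper names are hypothetical, and the fetch function needs AWS credentials with permission to read CloudWatch Logs.

```python
def algo1_log_query(training_job_name: str, limit: int = 50) -> dict:
    """Build the filter_log_events arguments for the main node's log stream."""
    return {
        "logGroupName": "/aws/sagemaker/TrainingJobs",
        "logStreamNamePrefix": f"{training_job_name}/algo-1",
        "limit": limit,
    }


def fetch_algo1_logs(training_job_name: str, limit: int = 50) -> list:
    """Return up to `limit` log messages from algo-1 (calls AWS)."""
    import boto3  # imported here so the query builder works without boto3

    logs_client = boto3.client("logs")
    response = logs_client.filter_log_events(**algo1_log_query(training_job_name, limit))
    return [event["message"] for event in response["events"]]


# Example (requires a real job), using the estimator from this notebook:
# for message in fetch_algo1_logs(smp_estimator.latest_training_job.name):
#     print(message)
print(algo1_log_query("my-training-job"))
```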

You can use CloudWatch to track SageMaker GPU and memory utilization during training and inference. To view the metrics and logs that SageMaker writes to CloudWatch, see **Processing Job, Training Job, Batch Transform Job, and Endpoint Instance Metrics** in [Monitor Amazon SageMaker with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).

If you are a new user of Amazon CloudWatch, see [Getting Started with Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/GettingStarted.html).

For additional information about monitoring and analyzing Amazon SageMaker training jobs, see [Monitor and Analyze Training Jobs Using Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).