# Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Multi-Node Multi-GPU Training


1. [Introduction](#Introduction)  
2. [Development Environment](#Development-Environment)
    1. [Installation](#Installation)  
    2. [SageMaker environment](#SageMaker-environment) 
3. [SageMaker Training Job](#SageMaker-Training-Job)  
    1. [Training with Optimized TensorFlow](#Training-with-Optimized-TensorFlow)   
    2. [Analysis](#Analysis)

## SageMaker Training Compiler Overview

SageMaker Training Compiler is a capability of SageMaker that applies hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes deep learning (DL) models to accelerate training by using SageMaker machine learning (ML) GPU instances more efficiently. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training. 

SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler-enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to speed up your training job on SageMaker ML instances for accelerated computing. 

For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*.

## Introduction

In this demo, you'll use Hugging Face's `transformers` and `datasets` libraries with Amazon SageMaker Training Compiler to train the `gpt-2` model on the `Stanford Sentiment Treebank v2 (SST2)` dataset. To get started, you need to set up the environment with a few prerequisite steps for permissions, configurations, and so on. 

**NOTE:** You can run this demo in SageMaker Studio, on SageMaker notebook instances, or on your local machine with the AWS CLI set up. If you use SageMaker Studio or SageMaker notebook instances, make sure you choose one of the TensorFlow-based kernels, `Python 3 (TensorFlow x.y Python 3.x CPU Optimized)` or `conda_tensorflow_p36` respectively.

**NOTE:** This notebook uses two `ml.p3dn.24xlarge` instances that have multiple GPUs. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). 

## Development Environment 

### Installation

This example notebook requires the **SageMaker Python SDK v2.70.0** and **transformers v4.11.0**.
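As a quick sanity check after the upgrade cells run, the version requirement above can be verified programmatically. The following is a minimal sketch that assumes the stated versions are minimums (the notebook upgrades to the latest release, so this is an interpretation, not something the notebook states). The helper names `parse_version`, `meets_minimum`, and `outdated_packages` are hypothetical.

```python
import re

# Assumption: the versions named in this notebook are treated as minimums.
MINIMUMS = {"sagemaker": "2.70.0", "transformers": "4.11.0"}


def parse_version(text):
    """Convert '2.70.0' or '4.11.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def outdated_packages(installed_versions):
    """Return the names of packages that are missing or older than MINIMUMS."""
    return [
        name
        for name, minimum in MINIMUMS.items()
        if not meets_minimum(installed_versions.get(name, "0"), minimum)
    ]
```

In the notebook, you could call `outdated_packages({"sagemaker": sagemaker.__version__, "transformers": transformers.__version__})` after the import cell; an empty list means both packages satisfy the assumed minimums.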

In [None]:
!pip install sagemaker botocore boto3 awscli --upgrade

In [None]:
!pip install transformers --upgrade

In [None]:
import botocore
import boto3
import sagemaker
import transformers
import pandas as pd

print(f"sagemaker: {sagemaker.__version__}")
print(f"transformers: {transformers.__version__}")

### SageMaker environment 

In [None]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## SageMaker Training Job

To create a SageMaker training job, we use a `HuggingFace` estimator. Using the estimator, you can define which training script SageMaker should use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.

When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `HuggingFace` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.
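Inside the container, the estimator's `hyperparameters` reach the training script as command-line arguments. The sketch below shows one way a script could read them. It is not the actual `train_with_keras.py` from this example: the argument names mirror the hyperparameter keys used later in this notebook, while the defaults, the `--output_dir` argument, and the `parse_args` helper are assumptions made for illustration. `SM_MODEL_DIR` is an environment variable that SageMaker sets inside training containers.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed by SageMaker as --key value pairs."""
    parser = argparse.ArgumentParser()
    # Names mirror the hyperparameters dict in this notebook; defaults are hypothetical.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--seq", type=int, default=128)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--model_name", type=str, default="roberta-base")
    # Hypothetical argument: where to save the model, defaulting to SageMaker's model directory.
    parser.add_argument(
        "--output_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    # parse_known_args ignores any extra arguments the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Using `parse_known_args` rather than `parse_args` is a defensive choice so that unexpected arguments do not abort the job at startup.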

In the following section, you learn how to set up an optimized SageMaker `HuggingFace` estimator with the compiler.

### Training Setup

In [None]:
# Here we configure the training job. Please configure the appropriate options below:

EPOCHS = 100

MODEL_NAME = "roberta-base"
SEQ_LEN = 512

# For more information about the options, please look into the training scripts

# SageMaker Training Compiler currently only supports training on GPU
# Select Instance type for training
INSTANCE_TYPE = "ml.p3dn.24xlarge"  # ml.p3.8xlarge is easily available. However, ml.p3.16xlarge provides better performance.
NUM_INSTANCES = 2

### Training with Optimized TensorFlow

The batch size below is the largest we could fit into the memory of an NVIDIA V100 GPU. If you change the model, instance type, sequence length, or other parameters, you need to experiment to find the largest batch size that fits into GPU memory. We also use automatic mixed precision (AMP) for faster training.
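The search for the largest batch size can be organized as a doubling phase followed by a binary search. The sketch below assumes a hypothetical callable `fits(batch_size)` that returns `True` when a short trial run succeeds without running out of GPU memory (for example, by catching `tf.errors.ResourceExhaustedError` around a few training steps), and it assumes that memory use grows monotonically with batch size. Neither the helper nor the search procedure comes from this notebook.

```python
def largest_fitting_batch_size(fits, limit=1024):
    """Return the largest batch size in [1, limit] for which fits() is True, or 0 if none fits.

    Assumes fits() is monotonic: if a batch size fails, every larger one fails too.
    """
    if not fits(1):
        return 0
    good = 1
    # Phase 1: keep doubling while the doubled size still fits and stays within the limit.
    while good * 2 <= limit and fits(good * 2):
        good *= 2
    # 'bad' is either a size known to fail or the first size beyond the limit.
    bad = min(good * 2, limit + 1)
    # Phase 2: binary search between the last success and the first failure.
    while bad - good > 1:
        mid = (good + bad) // 2
        if fits(mid):
            good = mid
        else:
            bad = mid
    return good
```

Each call to `fits` is expensive in practice because it launches a trial run, so the doubling phase keeps the number of probes logarithmic in the final batch size.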

Note: We recommend turning off SageMaker Debugger's profiling and debugging tools when you use compilation to avoid additional overhead.

In [None]:
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# hyperparameters are passed to the training entrypoint as arguments
hyperparameters = {
    "epochs": EPOCHS,
    "train_batch_size": 16,
    "seq": SEQ_LEN,
    "learning_rate": 5e-5,
    "model_name": MODEL_NAME,
}

# configure the training job
optimized_estimator = HuggingFace(
    entry_point="train_with_keras.py",
    compiler_config=TrainingCompilerConfig(),  # We are enabling SageMaker Training Compiler here !
    source_dir="./scripts",
    instance_type=INSTANCE_TYPE,
    instance_count=NUM_INSTANCES,
    role=role,
    volume_size=500,
    py_version="py37",
    transformers_version="4.11.0",
    tensorflow_version="2.5.1",
    hyperparameters=hyperparameters,
    disable_profiler=True,  # Disable SageMaker Profiler to avoid overhead during benchmarking
    debugger_hook_config=False,  # Disable SageMaker Debugger to avoid overhead during benchmarking
)

# start the training job
optimized_estimator.fit(wait=False)
optimized_estimator.latest_training_job.name

### Wait for training jobs to complete


In [None]:
optimized_estimator = HuggingFace.attach(optimized_estimator.latest_training_job.name)
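If you prefer to poll the job status explicitly, a small loop around `describe_training_job` is enough. The sketch below is an illustration rather than part of this example: `wait_for_training_job` is a hypothetical helper, and the `describe` and `sleep` parameters exist only so the loop can be exercised without AWS access.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll until the job reaches a terminal status and return that status.

    describe: a callable with the shape of the boto3 SageMaker client's
              describe_training_job (keyword argument TrainingJobName).
    """
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

In the notebook, you could pass `boto3.client("sagemaker").describe_training_job` as `describe` and `optimized_estimator.latest_training_job.name` as `job_name`.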

## Analysis

**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. For example:

```python
estimator = HuggingFace.attach("your_huggingface_training_job_name")
```

### Load logs of the training job *with* SageMaker Training Compiler

In [None]:
%%capture optimized

# access the logs of the optimized training job
optimized_estimator.sagemaker_session.logs_for_job(optimized_estimator.latest_training_job.name)

### Create helper functions for analysis

In [None]:
from ast import literal_eval
from collections import defaultdict
from matplotlib import pyplot as plt


def _summarize(captured):
    final = []
    for line in captured.stdout.replace("#010", "").replace("]", "\n").replace("-", "").split("\n"):
        cleaned = line.strip()
        if (
            cleaned.startswith("ETA")
            or "*" * 5 in cleaned
            or "ms/step" in cleaned
            or "Epoch" in cleaned
            or ("INFO" in cleaned and "=" in cleaned)
        ):
            final.append(cleaned)
    return final


def make_sense(string):
    # Return the parsed Python literal, or None if the text is not a valid literal
    try:
        return literal_eval(string)
    except Exception:
        return None


def summarize(summary):
    final = {"train": {}, "eval": {}}
    phase = "train"
    final[phase]["1"] = {"convergence": [], "ms/step": -1}
    epoch = "1"
    for line in summary:
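        if "Epoch" in line:
            # Take the text between "Epoch" and "/" so multi-digit epochs parse correctly
            epoch = line.split("Epoch", 1)[1].split("/", 1)[0].strip()
            final[phase][epoch] = {"convergence": [], "ms/step": -1}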
        elif line.startswith("ETA"):
            try:
                extract = line[line.index("loss:") : line.index("loss:") + 49]
                values = [i for i in extract.split(" ") if make_sense(i)]
                loss, acc = values
                final[phase][epoch]["convergence"] += [{"loss": loss, "acc": acc}]
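            except (ValueError, KeyError):
                # Skip progress lines that do not contain exactly one loss and one accuracy value
                pass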
        elif "ms/step" in line:
            avg_step_latency = make_sense(
                [i for i in line.split(" ") if "ms/step" in i][0].replace("ms/step", "")
            )
            final[phase][epoch]["ms/step"] = avg_step_latency
        elif "*" in line:
            # A banner line marks the switch from training to evaluation output
            phase = "eval"
            epoch = 0
            final[phase][0] = {"convergence": [], "ms/step": -1}
    return final
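To make the log format these helpers expect more concrete, here is a standalone sketch that pulls the metrics and step latency out of a single Keras progress line with regular expressions. It is an alternative illustration, not the notebook's parsing logic; `parse_progress_line` and the sample line in the usage note are hypothetical.

```python
import re

# Matches "name: number" pairs such as "loss: 0.3526" or "accuracy: 0.8479".
METRIC_PATTERN = re.compile(r"(\w+): ([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")
# Matches the average step latency, for example "121ms/step".
STEP_PATTERN = re.compile(r"(\d+)ms/step")


def parse_progress_line(line):
    """Extract metrics and the per-step latency from one Keras progress line."""
    metrics = {
        name: float(value)
        for name, value in METRIC_PATTERN.findall(line)
        if name != "ETA"  # "ETA: 1:33" is a time estimate, not a metric
    }
    step = STEP_PATTERN.search(line)
    return {"metrics": metrics, "ms_per_step": int(step.group(1)) if step else None}
```

For a hypothetical end-of-epoch line such as `782/782 [==============================] - 95s 121ms/step - loss: 0.3526 - accuracy: 0.8479`, the function returns the loss and accuracy as floats together with a latency of 121 ms per step.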

### Convergence of Training Loss

SageMaker Training Compiler does not affect the model convergence behavior. Here, we see the decrease in training loss with SageMaker Training Compiler.


In [None]:
import pandas as pd

o = summarize(_summarize(optimized))
o_df = pd.DataFrame(o["train"]["1"]["convergence"])["loss"]
o_df = o_df.astype("float")

In [None]:
o_df.plot()
plt.title("Optimized Loss")

### Total Billable Time

Finally, the decrease in total training time results in a decrease in the billable seconds from SageMaker.

In [None]:
def BillableTimeInSeconds(name):
    describe_training_job = (
        optimized_estimator.sagemaker_session.sagemaker_client.describe_training_job
    )
    details = describe_training_job(TrainingJobName=name)
    return details["BillableTimeInSeconds"]

In [None]:
BillableTimeInSeconds(optimized_estimator.latest_training_job.name)
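Because this job runs on more than one instance, it can help to convert the reported value into total instance-seconds. The sketch below assumes, based on the `DescribeTrainingJob` API reference as we understand it, that `BillableTimeInSeconds` is wall-clock time that must be multiplied by the instance count for distributed training; check the current API reference before relying on this. The helper name is hypothetical.

```python
def total_billed_instance_seconds(details):
    """Multiply the billable wall-clock seconds by the number of training instances.

    details: a dict shaped like the response of describe_training_job.
    """
    billable = details["BillableTimeInSeconds"]
    instance_count = details["ResourceConfig"]["InstanceCount"]
    return billable * instance_count
```

You could pass the `details` dictionary returned by `describe_training_job` directly to this function.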

## Clean up

Stop all training jobs launched if the jobs are still running.

In [None]:
import boto3

sm = boto3.client("sagemaker")


def stop_training_job(name):
    status = sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
    if status == "InProgress":
        sm.stop_training_job(TrainingJobName=name)


stop_training_job(optimized_estimator.latest_training_job.name)

For instructions on cleaning up other resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*.