# SageMaker Training Compiler - Finding Max Batch Size for Model Training

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
3. [Finding max batch size](#Finding-max-batch-size)
    1. [Model and instance type specifications](#Model-and-instance-type-specifications)  
    2. [Finding max batch size for SageMaker Training Compiler with Hugging Face and PyTorch](#Finding-max-batch-size-for-SageMaker-Training-Compiler-with-Hugging-Face-and-PyTorch)
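    3. [Configure a SageMaker HuggingFace estimator with the SageMaker Training Compiler configuration and the script](#Configure-a-SageMaker-HuggingFace-estimator-with-the-SageMaker-Training-Compiler-configuration-and-the-script)
    4. [Wait for the training job to complete](#Wait-for-the-training-job-to-complete)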
4. [Results](#Results)  
    1. [Load logs for training jobs](#Load-logs-for-training-jobs)  
5. [Clean up](#Clean-up) 
6. [Conclusion](#Conclusion) 

## Introduction

The SageMaker Training Compiler allows AWS customers to train deep learning models faster on scalable GPU instances managed by SageMaker. The memory optimizations made by SageMaker Training Compiler typically allow for your training job to fit more data into GPU memory. By increasing the batch size as much as possible in your training job, you can speed up your training jobs even further.

For example, for a PyTorch fine-tuning job (Sequence_Len=512, Automatic Mixed Precision (AMP)) with a GPT-2 model from Hugging Face, the maximum batch size that can fit on an ml.p3.2xlarge instance increased from 6 to 20 with the Training Compiler enabled. A list of model examples and maximum batch sizes is available in the SageMaker Training Compiler documentation under "Tested Models": https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html

The goal of this notebook is to show how to find the max batch size for a particular model and instance type. The example finds the max batch size for a `gpt2` model running on an `ml.p3.8xlarge` instance. You can customize this notebook to fit your use case, and use the resulting max batch size as the batch size parameter in your full training job.

In this demo, you'll use Hugging Face's `transformers` and `datasets` libraries with Amazon SageMaker Training Compiler to train the `gpt2` model on the Stanford Sentiment Treebank v2 (SST2) dataset. By using this notebook you will be downloading SST2 from https://huggingface.co/datasets/sst2, where you can check the dataset information and terms. To get started, you need to set up the environment with a few prerequisite steps for permissions and configuration.

The notebook uses the Hugging Face training scripts (`run_mlm.py` and `run_clm.py`) and a helper script (`find_max_batch_size.py`) that iteratively searches for the maximum batch size that fits on a given GPU instance.
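
To illustrate what an iterative search of this kind can look like, here is a minimal sketch using binary search. It is not the contents of `find_max_batch_size.py`; the names `find_max_batch_size`, `fits`, and `fake_fits` are hypothetical, and the sketch assumes that if a batch size fits in GPU memory, every smaller batch size fits as well.

```python
from typing import Callable


def find_max_batch_size(fits: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest batch size in [low, high] for which fits() is True.

    Assumes monotonicity: if a batch size fits, every smaller one fits too.
    Returns 0 if even `low` does not fit.
    """
    best = 0
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid  # mid fits, so try larger batch sizes
            low = mid + 1
        else:
            high = mid - 1  # mid does not fit, so try smaller batch sizes
    return best


# Hypothetical stand-in for "run a few training steps and check for an
# out-of-memory error". The threshold of 20 is arbitrary.
def fake_fits(batch_size: int) -> bool:
    return batch_size <= 20


print(find_max_batch_size(fake_fits, low=1, high=128))
```

In a real search, `fits` would launch a short training run at the candidate batch size and report whether it finished without running out of memory, so a binary search keeps the number of trial runs logarithmic in the size of the search range.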

This notebook runs `run_clm.py` by default, as shown in the following sections. If you want to test with your own training script, update the following:
- The `find_max_batch_size.py` script - In lines 23 to 28 of the script, specify the directory path and the file name of your training script.
- `hyperparameters` - In the [Model and instance type specifications](#Model-and-instance-type-specifications) section, modify the hyperparameters to match what your training script requires.

## Development Environment and Permissions

### Installation

This example notebook requires the **SageMaker Python SDK v2.70.0** and **transformers v4.11.0**.

In [None]:
!pip install --force-reinstall sagemaker==2.70.0

In [None]:
!pip install transformers==4.11.0

In [None]:
import botocore
import boto3
import sagemaker
import transformers
import pandas as pd

print(f"sagemaker: {sagemaker.__version__}")
print(f"transformers: {transformers.__version__}")

### Development environment

In [None]:
import sagemaker

sess = sagemaker.Session()

# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Finding max batch size

### Model and instance type specifications

This notebook uses a Hugging Face training script to demonstrate how to find the max batch size that fits in memory. If you're using a customized training script, update the `find_max_batch_size.py` script and `hyperparameters` accordingly. Below, we specify the model whose max batch size we want to find.

In [None]:
LANGUAGE_MODELING_LOSS = "clm"

MODEL_NAME = "gpt2"
TOKENIZER_NAME = "gpt2"
MODEL_CONFIG = "model_name_or_path"

INSTANCE_TYPE = "ml.p3.8xlarge"

# hyperparameters are passed to the training entrypoint as arguments
hyperparameters = {
    "training_script": f"run_{LANGUAGE_MODELING_LOSS}.py",
    MODEL_CONFIG: MODEL_NAME,
    "tokenizer_name": TOKENIZER_NAME,
    "fp16": True,
    "sequence_len": 512,
    "per_device_train_batch_size_min": 1,
    "per_device_train_batch_size_max": 128,
}
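
Because SageMaker passes each entry of `hyperparameters` to the entry point as a `--name value` command-line argument, a wrapper script can read the search bounds with `argparse` and forward everything else to the training script. The following is a sketch under that assumption, not the actual parsing code in `find_max_batch_size.py`; the function name `parse_search_args` is hypothetical.

```python
import argparse


def parse_search_args(argv):
    """Split search-related arguments from those meant for the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_script", type=str, default="run_clm.py")
    parser.add_argument("--per_device_train_batch_size_min", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size_max", type=int, default=128)
    # parse_known_args returns unrecognized flags separately, so they can be
    # forwarded unchanged to the wrapped training script.
    return parser.parse_known_args(argv)


# Example argument list mirroring the hyperparameters dictionary in this notebook
example_argv = [
    "--training_script", "run_clm.py",
    "--model_name_or_path", "gpt2",
    "--tokenizer_name", "gpt2",
    "--fp16", "True",
    "--sequence_len", "512",
    "--per_device_train_batch_size_min", "1",
    "--per_device_train_batch_size_max", "128",
]
args, passthrough = parse_search_args(example_argv)
print(args.per_device_train_batch_size_min, args.per_device_train_batch_size_max)
print(passthrough)
```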

### Finding max batch size for SageMaker Training Compiler with Hugging Face and PyTorch

In [None]:
# This prints the training script for reference
# The script iteratively tests different batch sizes
!pygmentize ./scripts/find_max_batch_size.py

### Configure a SageMaker HuggingFace estimator with the SageMaker Training Compiler configuration and the script

In [None]:
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# configure the training job
optimized_estimator = HuggingFace(
    entry_point="find_max_batch_size.py",  # Wrapper around training script that finds the maximum batch size
    compiler_config=TrainingCompilerConfig(),  # Enables SageMaker Training Compiler
    source_dir="./scripts",
    instance_type=INSTANCE_TYPE,
    instance_count=1,
    role=role,
    volume_size=100,
    py_version="py38",
    transformers_version="4.11.0",
    pytorch_version="1.9.0",
    hyperparameters=hyperparameters,
    disable_profiler=True,  # Disabling SageMaker Profiler to avoid overheads during benchmarking
    debugger_hook_config=False,  # Disabling SageMaker Debugger to avoid overheads during benchmarking
)

# start the training job
optimized_estimator.fit(wait=False)
optimized_estimator.latest_training_job.name

### Wait for the training job to complete

In [None]:
waiter = optimized_estimator.sagemaker_session.sagemaker_client.get_waiter(
    "training_job_completed_or_stopped"
)
waiter.wait(TrainingJobName=optimized_estimator.latest_training_job.name)
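
Once the wait returns, it can be useful to confirm how the job ended before reading its logs. The helper below is a small sketch; the function name `summarize_training_job` is hypothetical, and it relies only on the `TrainingJobStatus` and `FailureReason` fields of the `describe_training_job` response.

```python
def summarize_training_job(sm_client, job_name):
    """Return (status, failure_reason) for a SageMaker training job.

    `sm_client` is a boto3 SageMaker client; failure_reason is None unless
    the response contains a FailureReason field.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return desc["TrainingJobStatus"], desc.get("FailureReason")
```

In this notebook, it could be called as `summarize_training_job(optimized_estimator.sagemaker_session.sagemaker_client, optimized_estimator.latest_training_job.name)`.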

## Results

### Load logs for training jobs

In [None]:
%%capture optimized

# access the logs of the optimized training job
optimized_estimator.sagemaker_session.logs_for_job(optimized_estimator.latest_training_job.name)

In [None]:
# Print the log lines that report the max batch size

for line in optimized.stdout.split("\n"):
    if ("result" in line and "max_batch_size" in line) or "Total max batch" in line:
        print(line)

## Clean up

Stop the training job launched by this notebook if it is still running.

In [None]:
import boto3

sm = boto3.client("sagemaker")


def stop_training_job(name):
    status = sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
    if status == "InProgress":
        sm.stop_training_job(TrainingJobName=name)


stop_training_job(optimized_estimator.latest_training_job.name)

For instructions on cleaning up other resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*.

## Conclusion

SageMaker Training Compiler improves the efficiency of your training job, typically by decreasing its memory footprint. In this notebook, you found the largest `batch_size` that can fit in memory with Training Compiler's optimizations. Increasing the `batch_size` can decrease the time needed to train a model, reducing cost and enabling faster iteration.

Remember to adjust the learning rate when you change `batch_size`, to minimize differences in convergence behavior during training. For more information, see https://arxiv.org/abs/1706.02677
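
One common starting point is the linear scaling rule described in the paper linked above: when the batch size is multiplied by k, multiply the learning rate by k. The sketch below applies that rule; the function name `scale_learning_rate` and the example values are illustrative assumptions. The paper studies SGD, so for other optimizers treat the result as an initial guess to validate.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate in proportion to the change in batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Illustrative values only: a learning rate tuned for batch size 6,
# rescaled for batch size 20.
print(scale_learning_rate(5e-5, base_batch_size=6, new_batch_size=20))
```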