# SageMaker Training Compiler - Finding Max Batch Size for Model Training

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
3. [Finding max batch size](#Finding-max-batch-size)
    1. [Model and instance type specifications](#Model-and-instance-type-specifications)  
    2. [Finding max batch size for SageMaker Training Compiler with Hugging Face and TensorFlow](#Finding-max-batch-size-for-SageMaker-Training-Compiler-with-Hugging-Face-and-TensorFlow)
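    3. [Configure a SageMaker Hugging Face estimator with the SageMaker Training Compiler configuration and the script](#Configure-a-SageMaker-Hugging-Face-estimator-with-the-SageMaker-Training-Compiler-configuration-and-the-script)
    4. [Wait for the training job to complete](#Wait-for-the-training-job-to-complete)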
4. [Results](#Results)  
    1. [Load logs for training jobs](#Load-logs-for-training-jobs)  
5. [Clean up](#Clean-up) 
6. [Conclusion](#Conclusion) 

## Introduction

SageMaker Training Compiler allows AWS customers to train deep learning models faster on scalable GPU instances managed by SageMaker. Its memory optimizations typically allow your training job to fit more data into GPU memory. By increasing the batch size as much as possible, you can speed up your training jobs even further.

For example, for a TensorFlow fine-tuning job (Sequence_Len=512, Automatic Mixed Precision (AMP)) with a GPT-2 model from Hugging Face, the maximum batch size that fits on an `ml.p3.2xlarge` instance increased from 6 to 20 with the Training Compiler enabled. A list of model examples and maximum batch sizes is available in the SageMaker Training Compiler documentation under "Tested Models": https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html

The goal of this notebook is to show how you can find the max batch size for a particular model and instance type. As an example, we find the max batch size for a GPT-2 model running on an `ml.p3.8xlarge` instance. This takes about 1 hour. You can customize this notebook to fit your use case, and use the resulting max batch size as the value of the batch size parameter in your full training job.

The notebook uses the Hugging Face training scripts (`run_mlm.py` and `run_clm.py`) and a wrapper script (`find_max_batch_size.py`) to iteratively search for the maximum batch size for a given GPU instance.

This notebook runs `run_clm.py` by default, as shown in the following sections. If you want to test with your own training script, you need to update the following:
- The `find_max_batch_size.py` script - In lines 23 to 28 of the script, specify the correct directory path and file name of your training script.
- `hyperparameters` - In the [Model and instance type specifications](#Model-and-instance-type-specifications) section, modify the hyperparameters to match what your training script requires.

## Development Environment and Permissions

### Installation

This example notebook requires the **SageMaker Python SDK v2.87.0** and **transformers v4.17.0**.

In [2]:
!pip install --force-reinstall sagemaker==2.87.0

In [3]:
!pip install transformers==4.17.0

In [4]:
import botocore
import boto3
import sagemaker
import transformers
import pandas as pd

print(f"sagemaker: {sagemaker.__version__}")
print(f"transformers: {transformers.__version__}")

sagemaker: 2.87.0
transformers: 4.17.0


### Development environment

In [5]:
import sagemaker

sess = sagemaker.Session()

# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::875423407011:role/Admin
sagemaker bucket: sagemaker-us-west-2-875423407011
sagemaker session region: us-west-2


## Finding max batch size

### Model and instance type specifications

This notebook uses a Hugging Face training script to demonstrate how to find the max batch size that fits in memory. If you are using a customized training script, update the `find_max_batch_size.py` script and `hyperparameters` accordingly. Below, we specify the model for which to find the max batch size.

In [6]:
LANGUAGE_MODELING_LOSS = "clm"

MODEL_NAME = "gpt2"
TOKENIZER_NAME = "gpt2"
MODEL_CONFIG = "model_name_or_path"

INSTANCE_TYPE = "ml.p3.8xlarge"

# hyperparameters are passed to the training entrypoint as arguments
hyperparameters = {
    "training_script": f"run_{LANGUAGE_MODELING_LOSS}.py",
    MODEL_CONFIG: MODEL_NAME,
    "tokenizer_name": TOKENIZER_NAME,
    "fp16": True,
    "sequence_len": 512,
    "per_device_train_batch_size_min": 1,
    "per_device_train_batch_size_max": 128,
}
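
SageMaker passes each entry of `hyperparameters` to the entry point as a `--key value` command-line argument. The following is a minimal sketch, not the actual contents of `find_max_batch_size.py`, of how a wrapper script could read those values with `argparse`. The argument names mirror the dictionary keys above; the `str_to_bool` and `to_argv` helpers and the default values are assumptions made for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    # Hyperparameters arrive as strings, so "True"/"False" must be parsed explicitly.
    return value.strip().lower() in ("true", "1", "yes")


def parse_args(argv):
    # Hypothetical parser: names mirror the hyperparameter keys in the notebook.
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_script", type=str, default="run_clm.py")
    parser.add_argument("--model_name_or_path", type=str, default="gpt2")
    parser.add_argument("--tokenizer_name", type=str, default="gpt2")
    parser.add_argument("--fp16", type=str_to_bool, default=True)
    parser.add_argument("--sequence_len", type=int, default=512)
    parser.add_argument("--per_device_train_batch_size_min", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size_max", type=int, default=128)
    # Ignore any extra arguments the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


def to_argv(hyperparameters):
    # Illustrative helper: flatten a dict into ["--key", "value", ...].
    argv = []
    for key, value in hyperparameters.items():
        argv += [f"--{key}", str(value)]
    return argv


example = {
    "training_script": "run_clm.py",
    "model_name_or_path": "gpt2",
    "tokenizer_name": "gpt2",
    "fp16": True,
    "sequence_len": 512,
    "per_device_train_batch_size_min": 1,
    "per_device_train_batch_size_max": 128,
}
args = parse_args(to_argv(example))
print(args.training_script, args.fp16, args.per_device_train_batch_size_max)
```

Parsing booleans through a helper avoids the common pitfall where `type=bool` treats any non-empty string, including `"False"`, as true.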

### Finding max batch size for SageMaker Training Compiler with Hugging Face and TensorFlow

We use the wrapper script below to find the maximum batch size. The wrapper assumes that CUDA throws an out-of-memory error if the batch size is too large. It then uses binary search to find the largest batch size that does not cause the training script to fail. In the interest of time, it checks only the first 10 training steps for memory overflow.
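
The binary search described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `find_max_batch_size.py`: the function name `find_max_batch_size`, the `fits` callback, and the simulated memory limit are all hypothetical. In a real wrapper, `fits` would launch the training script for a few steps and report whether it finished without an out-of-memory error.

```python
from typing import Callable, Tuple


def find_max_batch_size(
    fits: Callable[[int], bool], low: int, high: int
) -> Tuple[int, int]:
    """Return (largest batch size in [low, high] that fits, number of trials).

    The batch size is 0 if no value in the range fits. Assumes that if a
    batch size fits in memory, every smaller batch size fits as well.
    """
    best = 0
    trials = 0
    while low <= high:
        mid = (low + high) // 2
        trials += 1
        if fits(mid):
            best = mid      # mid works, so look for something larger
            low = mid + 1
        else:
            high = mid - 1  # mid ran out of memory, so look lower
    return best, trials


def simulated_fits(batch_size: int) -> bool:
    # Stand-in for a real trial run: pretend anything above 20 runs out of memory.
    return batch_size <= 20


best, trials = find_max_batch_size(simulated_fits, 1, 128)
print(f"max batch size: {best}, trials: {trials}")
```

Because the search halves the range on every trial, a range of 1 to 128 needs only a handful of trial runs rather than 128. In this sketch, a result of 0 means that no candidate in the range passed the check.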

In [7]:
# This prints the training script for reference
# The script iteratively tests different batch sizes
!pygmentize ./scripts/find_max_batch_size.py

import argparse
import os, subprocess, time

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    # please update parameters if using a customized training script
    # model configs
    parser.add_argument("--language_modeling_loss", type=str, default="clm", help="select either use training script run_mlm or run_clm")
    parser.add_argument("--model_name_or_path", type=str, default="gpt2", help="HF model name

### Configure a SageMaker Hugging Face estimator with the SageMaker Training Compiler configuration and the script

In [9]:
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# configure the training job
optimized_estimator = HuggingFace(
    entry_point="find_max_batch_size.py",  # Wrapper around training script that finds the maximum batch size
    compiler_config=TrainingCompilerConfig(),  # We are enabling SageMaker Training Compiler here !
    source_dir="./scripts",
    instance_type=INSTANCE_TYPE,
    instance_count=1,
    role=role,
    volume_size=100,
    py_version="py38",
    transformers_version="4.17.0",
    tensorflow_version="2.6.3",
    hyperparameters=hyperparameters,
    disable_profiler=True,  # Disabling SageMaker Profiler to avoid overheads during benchmarking
    debugger_hook_config=False,  # Disabling SageMaker Debugger to avoid overheads during benchmarking
)

# start the training job
optimized_estimator.fit(wait=False)
optimized_estimator.latest_training_job.name

'huggingface-tensorflow-trcomp-training-2022-04-27-17-15-34-114'

### Wait for the training job to complete

In [10]:
waiter = optimized_estimator.sagemaker_session.sagemaker_client.get_waiter(
    "training_job_completed_or_stopped"
)
waiter.wait(TrainingJobName=optimized_estimator.latest_training_job.name)
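
Once the waiter returns, it can be useful to check how the job ended before reading the logs. The helper below is a minimal sketch that works on the dictionary returned by the SageMaker `describe_training_job` API; the function name `summarize_training_job` and the sample dictionaries are hypothetical, while the keys `TrainingJobStatus`, `FailureReason`, and `TrainingTimeInSeconds` are fields of that API response.

```python
def summarize_training_job(description: dict) -> str:
    """Build a one-line summary from a describe_training_job response."""
    name = description.get("TrainingJobName", "<unknown>")
    status = description["TrainingJobStatus"]
    if status == "Failed":
        reason = description.get("FailureReason", "no failure reason reported")
        return f"{name}: Failed ({reason})"
    seconds = description.get("TrainingTimeInSeconds")
    if seconds is not None:
        return f"{name}: {status} after {seconds} s of training"
    return f"{name}: {status}"


# Illustrative input only; a real response comes from
# sagemaker_client.describe_training_job(TrainingJobName=...).
sample = {
    "TrainingJobName": "example-job",
    "TrainingJobStatus": "Completed",
    "TrainingTimeInSeconds": 120,
}
print(summarize_training_job(sample))
```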

## Results

### Load logs for training jobs

In [11]:
%%capture optimized

# access the logs of the optimized training job
optimized_estimator.sagemaker_session.logs_for_job(optimized_estimator.latest_training_job.name)

In [12]:
# Print the max batch size below

for line in optimized.stdout.split("\n"):
    if ("result" in line and "max_batch_size" in line) or "Total max batch" in line:
        print(line)

Total max batch found in 25.9129 seconds, 7 iterations
[result]: model: gpt2, max_batch_size between 1 and 128 is 0
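
To reuse the result programmatically, for example as the batch size of a follow-up training job, the `[result]` line can be parsed with a regular expression. This is a sketch that assumes the log line keeps the format shown in the output above; the names `RESULT_PATTERN` and `parse_result` are hypothetical.

```python
import re
from typing import Optional

# Matches lines such as:
# [result]: model: gpt2, max_batch_size between 1 and 128 is 0
RESULT_PATTERN = re.compile(
    r"\[result\]: model: (?P<model>\S+), "
    r"max_batch_size between (?P<low>\d+) and (?P<high>\d+) is (?P<best>\d+)"
)


def parse_result(line: str) -> Optional[dict]:
    """Extract the search range and result from a log line, or return None."""
    match = RESULT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "model": match["model"],
        "low": int(match["low"]),
        "high": int(match["high"]),
        "max_batch_size": int(match["best"]),
    }


line = "[result]: model: gpt2, max_batch_size between 1 and 128 is 0"
print(parse_result(line))
```

Using `search` rather than `match` lets the pattern skip any color codes or timestamps that surround the message in the captured logs.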


## Clean up

Stop all training jobs launched if the jobs are still running.

In [None]:
import boto3

sm = boto3.client("sagemaker")


def stop_training_job(name):
    status = sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
    if status == "InProgress":
        sm.stop_training_job(TrainingJobName=name)


stop_training_job(optimized_estimator.latest_training_job.name)

For instructions on cleaning up other resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*.

## Conclusion

SageMaker Training Compiler improves the efficiency of your training job by typically decreasing the memory footprint of the job. In this notebook, you found the largest `batch_size` that can fit in memory with Training Compiler's optimizations. Increasing the `batch_size` can decrease the time needed to train a model, reducing cost and enabling faster iteration.

Remember to adjust the learning rate when you change `batch_size`, to minimize differences in convergence behavior during training. For more information on how learning rate is connected to batch size, see [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
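
As a starting point, the linear scaling rule described in that paper multiplies the learning rate by the same factor as the batch size. The sketch below applies it to the batch sizes 6 and 20 from the introduction. The function name `scale_learning_rate` and the base learning rate of `5e-5` are assumptions for illustration. The rule was proposed for SGD together with a warmup phase, so with other optimizers treat the result as an initial guess to validate rather than a guarantee.

```python
def scale_learning_rate(
    base_lr: float, base_batch_size: int, new_batch_size: int
) -> float:
    """Linear scaling rule: scale the learning rate by new_batch / base_batch."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Hypothetical base learning rate tuned for a batch size of 6.
new_lr = scale_learning_rate(base_lr=5e-5, base_batch_size=6, new_batch_size=20)
print(f"scaled learning rate: {new_lr:.2e}")
```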