# Compile and Train a Vision Transformer Model on the MNIST Dataset using Multi Node Distributed Training

1. [Introduction](#Introduction)  
2. [Development Environment](#Development-Environment)
    1. [Installation](#Installation)  
    2. [SageMaker environment](#SageMaker-environment)
3. [SageMaker Training Job](#SageMaker-Training-Job)  
    1. [Training Setup](#Training-Setup)  
    2. [Training with Native PyTorch](#Training-with-Native-PyTorch)  
    3. [Training with Optimized PyTorch](#Training-with-Optimized-PyTorch)  
    4. [Analysis](#Analysis)  


## SageMaker Training Compiler Overview

SageMaker Training Compiler is a capability of SageMaker that applies hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes deep learning (DL) models to accelerate training by using SageMaker machine learning (ML) GPU instances more efficiently. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training.

SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler-enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to speed up your training job on SageMaker ML instances for accelerated computing.

For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*.

## Introduction

In this demo, you'll use Hugging Face's `transformers` and `datasets` libraries with Amazon SageMaker Training Compiler to train a Vision Transformer (`vit`) model on the `mnist` dataset. To get started, set up the environment with a few prerequisite steps for permissions and configuration.

**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, `Python 3 (PyTorch x.y Python 3.x CPU Optimized)` or `conda_pytorch_p36` respectively.

**NOTE:** The training cells in this notebook set `INSTANCE_TYPE = "ml.p4d.24xlarge"`, an instance type with 8 GPUs. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

## Development Environment 

### Installation

This example notebook requires the **SageMaker Python SDK v2.70.0** and **transformers v4.11.0**.
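One way to confirm that installed versions meet these minimums is to compare version tuples. The following is a minimal sketch, not part of the original notebook: `parse_version` and `meets_minimum` are hypothetical helper names, and the parser only handles plain dotted numeric versions (anything after the first non-numeric piece is ignored).

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '2.103.0' into a tuple of ints.

    Parsing stops at the first piece that does not start with a digit,
    so '4.21.0.dev0' becomes (4, 21, 0).
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """Return True if `installed` is at least `required`."""
    return parse_version(installed) >= parse_version(required)


# Versions taken from the requirement above and the output printed later in this notebook.
print(meets_minimum("2.103.0", "2.70.0"))
print(meets_minimum("4.21.1", "4.11.0"))
```

In a notebook you could pass `sagemaker.__version__` and `transformers.__version__` as the `installed` argument.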

In [1]:
!pip install sagemaker botocore boto3 awscli --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting botocore
  Downloading botocore-1.27.52-py3-none-any.whl (9.0 MB)
Collecting boto3
  Downloading boto3-1.24.52-py3-none-any.whl (132 kB)
Collecting awscli
  Downloading awscli-1.25.52-py3-none-any.whl (3.9 MB)
Installing collected packages: botocore, boto3, awscli
  Attempting uninstall: botocore
    Found existing installation: botocore 1.24.19
    Uninstalling botocore-1.24.19:
      Successfully uninstalled botocore-1.24.19
  Attempting uninstall: boto3
    Found existing installation: boto3

In [2]:
!pip install transformers datasets --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)

In [1]:
import botocore
import boto3
import sagemaker
import transformers
import pandas as pd

print(f"sagemaker: {sagemaker.__version__}")
print(f"transformers: {transformers.__version__}")

sagemaker: 2.103.0
transformers: 4.21.1


Copy and run the following code if you need to upgrade `ipywidgets` for the `datasets` library, then restart the kernel. This is only needed when preprocessing is done in the notebook.

```python
%%capture
import IPython
!conda install -c conda-forge ipywidgets -y
# has to restart kernel for the updates to be applied
IPython.Application.instance().kernel.do_shutdown(True) 
```

### SageMaker environment 

In [2]:
import sagemaker

sess = sagemaker.Session()

# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::875423407011:role/SageMakerRole
sagemaker bucket: sagemaker-us-west-2-875423407011
sagemaker session region: us-west-2


## SageMaker Training Job

To create a SageMaker training job, we use a `HuggingFace` estimator. Using the estimator, you can define which training script SageMaker should use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.

When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `HuggingFace` Deep Learning Container, uploads your training script, and, if you pass input channels to `fit()`, downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.
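Inside the container, the entries of the estimator's `hyperparameters` dictionary reach the entry-point script as command-line arguments of the form `--key value`. The following is a minimal sketch of how a script could read them with `argparse`. It is not the `run_mim.py` script used in this notebook; `parse_training_args` is a hypothetical helper, and only a few of the hyperparameters are shown.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Parse a few hyperparameters passed as `--key value` arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_type", type=str, default="vit")
    parser.add_argument("--dataset_name", type=str, default="mnist")
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    # SM_MODEL_DIR is set inside SageMaker training containers; fall back
    # to the conventional path when running elsewhere.
    parser.add_argument(
        "--output_dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Ignore any arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments a training job might pass to the script.
args = parse_training_args(
    ["--model_type", "vit", "--learning_rate", "0.00062", "--num_train_epochs", "1"]
)
print(args.model_type, args.learning_rate, args.num_train_epochs)
```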

In the following section, you learn how to set up two versions of the SageMaker `HuggingFace` estimator, a native one without the compiler and an optimized one with the compiler.

### Training Setup

Set the number of training epochs, the instance type, and the number of GPUs per instance. Both estimators in this notebook use `run_mim.py` from the `scripts` directory as the entry point.

In [3]:
EPOCHS = 1

# SageMaker Training Compiler currently only supports training on GPU
# Select Instance type for training
INSTANCE_TYPE = "ml.p4d.24xlarge"
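# Number of GPUs per instance, used to scale the learning rate in later cells.
# Note: ml.p4d.24xlarge has 8 GPUs; adjust this value to match your instance type.
NUM_GPUS = 1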

### Training with Native PyTorch

The `PER_DEVICE_BATCH_SIZE` in the following code cell is intended to be the largest batch size that fits into GPU memory on the selected `INSTANCE_TYPE`. If you change the model, instance type, or other parameters, you need to experiment to find the largest batch size that fits into GPU memory.
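One way to organize these experiments is to try candidate batch sizes in increasing order and keep the last one that trains without running out of memory. The sketch below illustrates the idea under stated assumptions: `largest_fitting_batch_size` is a hypothetical helper, `fits_in_memory` stands in for launching a short training job and checking that it completes, and the memory limit in the usage example is made up for illustration.

```python
from typing import Callable, Iterable, Optional


def largest_fitting_batch_size(
    candidates: Iterable[int],
    fits_in_memory: Callable[[int], bool],
) -> Optional[int]:
    """Return the largest candidate that fits, or None if none do.

    Candidates are tried in increasing order; the search stops at the
    first failure because larger batches are assumed to fail as well.
    """
    best = None
    for batch_size in sorted(candidates):
        if not fits_in_memory(batch_size):
            break
        best = batch_size
    return best


# Candidate batch sizes 200, 208, ..., 272.
candidates = range(200, 280, 8)

# Made-up limit used only to exercise the helper; not a measured result.
pretend_limit = 250
print(largest_fitting_batch_size(candidates, lambda b: b <= pretend_limit))
```

In practice each call to `fits_in_memory` is expensive, so it can be worth launching the candidate jobs in parallel and inspecting which ones completed.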

In [5]:
from sagemaker.huggingface import HuggingFace

kwargs = dict(
    source_dir="scripts",
    instance_type=INSTANCE_TYPE,
    role=role,
    py_version="py38",
    disable_profiler=True,
    debugger_hook_config=False,
    volume_size=60,
)

PER_DEVICE_BATCH_SIZE=248
cluster_size=1


In [7]:
from sagemaker.huggingface import HuggingFace


# The original LR was set for a batch of 8. Here we are scaling learning rate with batch size.
GLOBAL_BATCH_SIZE = PER_DEVICE_BATCH_SIZE * NUM_GPUS * cluster_size
LEARNING_RATE = float("2e-5") / 8 * GLOBAL_BATCH_SIZE

# configure the training job
huggingface_estimator = HuggingFace(
    image_uri="669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-training:1.11.0-transformers4.21.1-gpu-py38-cu113-ubuntu20.04-pr-1824-2022-08-08-10-57-02",
    instance_count=cluster_size,
    entry_point='run_mim.py',
    hyperparameters={
        'model_type': 'vit',
        'dataset_name': 'mnist',
        'output_dir': '/opt/ml/model',
        'overwrite_output_dir': True,
        'remove_unused_columns': 'False',
        'label_names' : 'bool_masked_pos',
        'do_train': True,
        'do_eval': False,
        'learning_rate': LEARNING_RATE,
        'weight_decay': 0.05,
        'num_train_epochs': EPOCHS,
        'per_device_train_batch_size': PER_DEVICE_BATCH_SIZE,
        'per_device_eval_batch_size': PER_DEVICE_BATCH_SIZE,
        'logging_strategy': 'epoch',
        'evaluation_strategy': 'no',
        'save_strategy': 'no',
        'save_total_limit': 3,
    },
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
    **kwargs,
)

# start the training job without blocking the notebook; no input channels are passed to fit()
huggingface_estimator.fit(wait=False)

# The name of the training job.
print(huggingface_estimator.latest_training_job.name)

200 pr-huggingface-pytorch-training-2022-08-23-21-10-36-273
208 pr-huggingface-pytorch-training-2022-08-23-21-10-36-995
216 pr-huggingface-pytorch-training-2022-08-23-21-10-40-499
224 pr-huggingface-pytorch-training-2022-08-23-21-10-41-041
232 pr-huggingface-pytorch-training-2022-08-23-21-10-44-361
240 pr-huggingface-pytorch-training-2022-08-23-21-10-45-961
248 pr-huggingface-pytorch-training-2022-08-23-21-10-47-342
256 pr-huggingface-pytorch-training-2022-08-23-21-10-49-527
264 pr-huggingface-pytorch-training-2022-08-23-21-10-53-981
272 pr-huggingface-pytorch-training-2022-08-23-21-10-54-513
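Because `fit(wait=False)` returns as soon as the job is submitted, you need to check the job status separately before analyzing results. The following is a minimal polling sketch: `wait_for_job` is a hypothetical helper, and `describe` is any callable that returns a dictionary shaped like the response of the SageMaker `DescribeTrainingJob` API. The fake responses in the demo are made up so that the block runs without AWS credentials.

```python
import time
from typing import Callable, Dict

# Statuses after which a training job no longer changes state.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(
    job_name: str,
    describe: Callable[[str], Dict],
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
) -> Dict:
    """Poll `describe(job_name)` until the job reaches a terminal state."""
    for _ in range(max_polls):
        description = describe(job_name)
        if description["TrainingJobStatus"] in TERMINAL_STATES:
            return description
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Demo with made-up responses instead of a real AWS call.
_fake_responses = iter(
    [
        {"TrainingJobStatus": "InProgress"},
        {"TrainingJobStatus": "Completed", "BillableTimeInSeconds": 100},
    ]
)
result = wait_for_job("example-job", lambda name: next(_fake_responses), poll_seconds=0)
print(result["TrainingJobStatus"])

# With boto3, `describe` could be written as:
#   sm = boto3.client("sagemaker")
#   describe = lambda name: sm.describe_training_job(TrainingJobName=name)
```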


### Training with Optimized PyTorch

Compilation through Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. Note that if you want to change the batch size, you must adjust the learning rate appropriately.
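The learning-rate adjustment used in this notebook is linear scaling: the rate grows in proportion to the global batch size, starting from a base rate of `2e-5` for a batch of 8. The sketch below restates that arithmetic as a function; `scaled_learning_rate` is a hypothetical helper name, and the formula mirrors the `LEARNING_RATE` expression in the training cells.

```python
def scaled_learning_rate(
    base_lr: float,
    base_batch_size: int,
    per_device_batch_size: int,
    num_gpus: int,
    num_nodes: int,
) -> float:
    """Scale the learning rate linearly with the global batch size."""
    global_batch_size = per_device_batch_size * num_gpus * num_nodes
    return base_lr / base_batch_size * global_batch_size


# Values used in this notebook: base LR 2e-5 for batch 8, batch 248, 1 GPU, 1 node.
lr = scaled_learning_rate(2e-5, 8, 248, 1, 1)
print(lr)  # approximately 6.2e-4

# Doubling the per-device batch size doubles the learning rate.
print(scaled_learning_rate(2e-5, 8, 496, 1, 1) / lr)
```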

**Note:** We recommend turning off SageMaker Debugger's profiling and debugging tools when you use compilation to avoid additional overhead.

In [6]:
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Replace the SDK's client-side validation of the compiler configuration with a no-op
TrainingCompilerConfig.validate = lambda *args, **kwargs: None

NEW_PER_DEVICE_BATCH_SIZE = 248
cluster_size = 1

# The original LR was set for a batch of 8. Here we are scaling learning rate with batch size.
GLOBAL_BATCH_SIZE = NEW_PER_DEVICE_BATCH_SIZE * NUM_GPUS * cluster_size
LEARNING_RATE = float("2e-5") / 8 * GLOBAL_BATCH_SIZE

# configure the training job
optimized_estimator = HuggingFace(
    image_uri="669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-trcomp-training:1.11.0-transformers4.21.1-gpu-py38-cu113-ubuntu20.04-pr-2032-2022-08-19-18-27-39",
    compiler_config=TrainingCompilerConfig(),
    instance_count=cluster_size,
    entry_point='run_mim.py',
    hyperparameters={
        'model_type': 'vit',
        'dataset_name': 'mnist',
        'output_dir': '/opt/ml/model',
        'overwrite_output_dir': True,
        'remove_unused_columns': 'False',
        'label_names' : 'bool_masked_pos',
        'do_train': True,
        'do_eval': False,
        'learning_rate': LEARNING_RATE,
        'weight_decay': 0.05,
        'num_train_epochs': EPOCHS,
        'per_device_train_batch_size': NEW_PER_DEVICE_BATCH_SIZE,
        'per_device_eval_batch_size': NEW_PER_DEVICE_BATCH_SIZE,
        'logging_strategy': 'epoch',
        'evaluation_strategy': 'no',
        'save_strategy': 'no',
        'save_total_limit': 3,
        'sagemaker_pytorch_xla_multi_worker_enabled': True,
    },
    **kwargs,
)

# start the training job without blocking the notebook; no input channels are passed to fit()
optimized_estimator.fit(wait=False)

# The name of the training job.
print(optimized_estimator.latest_training_job.name)

248 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-40-712
256 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-41-485
264 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-44-498
272 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-46-143
280 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-46-682
288 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-51-186
296 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-52-597
304 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-53-330
312 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-56-047
320 pr-huggingface-pytorch-trcomp-training-2022-08-23-21-45-56-676
