# SageMaker Training Toolkit Support for neuron_parallel_compile

This is a step-by-step guide to using ahead-of-time compilation with the neuron_parallel_compile utility to speed up SageMaker training jobs running on Amazon EC2 Trn1 (AWS Trainium) instances by up to 10x.

ML frameworks such as PyTorch and TensorFlow rely on compilers that take high-level descriptions of machine learning models, often in the form of computational graphs, and translate them into lower-level representations that execute efficiently on specific hardware architectures (e.g. GPUs, TPUs). The optimizations applied during this translation may include parallelization, vectorization, and other techniques that make better use of the available hardware resources. Compiling models ahead of time can result in training jobs running up to 10x faster.

Trn1 (AWS Trainium) EC2 instances use the Neuron SDK, the software stack that includes the Neuron hardware driver, user tools, framework integration, and compiler. Before you can train a model on Trn1 instances, the Neuron compiler must complete a compilation step that converts the model from its standard ML framework-level representation to a Neuron Executable File Format (NEFF) binary. The Neuron compiler accepts machine learning models in various formats (TensorFlow, MXNet, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. As a benchmark, compiling Llama2 7B for 4k sequences takes approximately 3 minutes with a parallel compilation across 16 trn1.2xlarge nodes, and around 5 minutes across 4 trn1.2xlarge nodes.


The Neuron compiler supports two compilation methods:

* just-in-time (JIT) compilation (default)
* ahead-of-time compilation with neuron_parallel_compile

PyTorch Neuron defaults to just-in-time (JIT) compilation of graphs during execution: at every step a graph is traced, and if the traced graph differs from those of previous executions, it is compiled by the Neuron compiler. JIT compilation can help speed up the developer workflow. However, with JIT, graphs are compiled sequentially, which can lead to much longer compilation times than with neuron_parallel_compile.

To reduce this compilation time during execution, the neuron_parallel_compile utility is provided as part of the PyTorch Neuron installation. The utility extracts graphs from a trial run of your script, pre-compiles them in parallel, and populates the Neuron Cache (on disk or at an AWS S3 URL) with the compiled graphs. The pre-compilation run should be limited to a few training steps (e.g. <100), enough for the utility to explore all code branches in your training script and extract the different graphs needed for full execution. Once neuron_parallel_compile finishes compiling all graphs, it copies the compilation results into the Neuron Cache directory (which can be a specified S3 location). You can then point subsequent training runs at that Neuron Cache location so that the precompiled graphs are used and recompilation is avoided.
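
The difference between compiling on a cache miss and populating the cache up front can be illustrated with a small toy model. This is a conceptual sketch only: `fake_compile`, `GraphCache`, and `precompile` are hypothetical names invented for the illustration, the "compiler" just builds a string, and none of this reflects how the Neuron SDK is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_compile(graph_signature):
    """Stand-in for a compiler call; returns a pretend binary name."""
    return f"{graph_signature}.neff"


class GraphCache:
    """Toy cache keyed by a graph signature (hypothetical, for illustration)."""

    def __init__(self):
        self.entries = {}
        self.compilations = 0  # counts compilations triggered on a cache miss

    def get_or_compile(self, graph_signature):
        # JIT-style behaviour: compile only when the traced graph is new.
        if graph_signature not in self.entries:
            self.entries[graph_signature] = fake_compile(graph_signature)
            self.compilations += 1
        return self.entries[graph_signature]


def precompile(cache, graph_signatures, workers=4):
    """Ahead-of-time style: compile every distinct graph in parallel."""
    unique = sorted(set(graph_signatures))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for signature, binary in zip(unique, pool.map(fake_compile, unique)):
            cache.entries[signature] = binary


# Graphs observed during a short trial run (made-up signatures).
trial_run = ["forward_backward", "optimizer_step", "forward_backward", "optimizer_step"]

cache = GraphCache()
precompile(cache, trial_run)

# The "real" run replays the same graphs many times and never misses the cache.
for signature in trial_run * 10:
    cache.get_or_compile(signature)
```

In this toy model the full run triggers no compilation at all, because every graph it needs was already placed in the cache by the short trial run.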
 
With recent versions of the Neuron Deep Learning Containers (DLCs), you can use the neuron_parallel_compile utility in SageMaker training jobs by setting the `RUN_NEURON_PARALLEL_COMPILE = "1"` environment variable in the SageMaker Estimator class.

### Steps to enable this workflow:

1. Ahead-of-time compilation SageMaker training run:<br>
    1. The first training run uses ahead-of-time compilation, enabled by setting the RUN_NEURON_PARALLEL_COMPILE = "1" environment variable in the SageMaker Estimator class. Note that the values the training script prints while running under neuron_parallel_compile are placeholders and should be disregarded (e.g. loss_value=0).<br>
    
    2. You also specify an S3 URL location for your Neuron Persistent Cache files through the NEURON_COMPILE_CACHE_URL environment variable in the SageMaker Estimator class. The Neuron SDK checks the specified S3 location for available Neuron Persistent Cache files and uploads new Neuron Persistent Cache files once the training job is complete.<br>
    
    3. To keep the compilation run short, set the max_steps hyperparameter to a small value (e.g. <100 steps). Choose the smallest number of steps that still lets the neuron_parallel_compile utility explore and extract all the graphs needed for full execution. In most cases 100 steps is enough.<br>
    
2. Subsequent SageMaker training runs that use the precompiled Neuron Persistent Cache files located in S3:<br>

    1. Run subsequent training jobs without the RUN_NEURON_PARALLEL_COMPILE = "1" environment variable, but still pass in NEURON_COMPILE_CACHE_URL. The Neuron SDK checks the specified S3 URL location and downloads the Neuron Persistent Cache, which contains all of the precompiled graphs for these runs.
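
The two phases above differ only in whether RUN_NEURON_PARALLEL_COMPILE is set, so the Estimator `environment` dictionary can be built from one place. The following is a minimal sketch: the helper name `neuron_environment` and the bucket name are hypothetical, while the three environment variable names are the ones used later in this notebook.

```python
def neuron_environment(cache_url, precompile):
    """Build the Estimator `environment` dict for one of the two phases.

    `neuron_environment` is a hypothetical helper for illustration, not part
    of the SageMaker or Neuron SDKs.
    """
    env = {
        "FI_EFA_FORK_SAFE": "1",
        "NEURON_COMPILE_CACHE_URL": cache_url,
    }
    if precompile:
        # Only the first (ahead-of-time compilation) run sets this flag.
        env["RUN_NEURON_PARALLEL_COMPILE"] = "1"
    return env


# Hypothetical bucket name; the notebook derives the real one from the session.
cache_url = "s3://my-example-bucket/s3-neuron-cache"

compile_env = neuron_environment(cache_url, precompile=True)   # phase 1
train_env = neuron_environment(cache_url, precompile=False)    # phase 2
```

Both dictionaries point at the same cache URL, which is what lets the second run pick up the graphs compiled by the first.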



## Installation

_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't already have it installed._

In [None]:
!pip install "sagemaker>=2.48.0"  --upgrade

## Permissions

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Set the parameters for the training job

In [None]:
model_name = "roberta-base"
env_var_options = ""
num_workers = 2
task_name = "mrpc"
batch_size = 8
max_seq_length = 128
learning_rate = 2e-05
num_train_epochs = 1
model_base_name = model_name
max_train_samples = 128

## Creating an Estimator and start a training job

In this example we use the run_glue.py training script from the Hugging Face Transformers library, available in its GitHub repository [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py).

In [None]:
from sagemaker.pytorch import PyTorch

# hyperparameters, which are passed into the training job
hyperparameters={
    'model_name_or_path': model_name,
    'task_name': task_name,
    'do_train': True,
    # 'do_eval': True,
    'max_seq_length': max_seq_length,
    'per_device_train_batch_size': batch_size,
    'learning_rate': learning_rate,
    'max_train_samples': max_train_samples,
    'num_train_epochs': num_train_epochs,
    'max_steps': 200,
    'output_dir': '/opt/ml/model',
}

# distribution configuration: launch the training script with torch_distributed
distribution={"torch_distributed": {"enabled": True} }

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/aws-neuron/aws-neuron-samples.git',
              'branch': 'master',
             }


# instance configurations
instance_type='ml.trn1.32xlarge'
instance_count=2
volume_size=450

# metric definition to extract the results
# raw strings keep "\D" as a regex token instead of a Python string escape
metric_definitions=[
     {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
     {"Name": "train_samples_per_second", "Regex": r"train_samples_per_second.*=\D*(.*?)$"},
     {"Name": "epoch", "Regex": r"epoch.*=\D*(.*?)$"},
     {"Name": "f1", "Regex": r"f1.*=\D*(.*?)$"},
     {"Name": "exact_match", "Regex": r"exact_match.*=\D*(.*?)$"}]
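
Before launching a job, it can be useful to check locally what these patterns capture. The sketch below applies three of the regexes to sample log lines with Python's `re` module. The sample lines are assumptions about what the training script prints, not captured output, and SageMaker's own metric extraction may differ in detail from `re.search`.

```python
import re

# Same patterns as in metric_definitions, written as raw strings.
metric_regexes = {
    "train_runtime": r"train_runtime.*=\D*(.*?)$",
    "train_samples_per_second": r"train_samples_per_second.*=\D*(.*?)$",
    "epoch": r"epoch.*=\D*(.*?)$",
}

# Assumed log format, for illustration only.
sample_log_lines = [
    "  epoch                    =        1.0",
    "  train_runtime            = 0:01:23.45",
    "  train_samples_per_second =      12.19",
]


def extract_metrics(lines, regexes):
    """Return {metric name: captured text} for every pattern that matches."""
    found = {}
    for line in lines:
        for name, pattern in regexes.items():
            match = re.search(pattern, line)
            if match:
                found[name] = match.group(1)
    return found


extracted = extract_metrics(sample_log_lines, metric_regexes)
print(extracted)
```

Each pattern skips any non-digit characters after the `=` sign and captures the rest of the line, so the captured value starts at the first digit.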

In [None]:
# Specify the neuronx container image
training_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.12.0-ubuntu20.04"

In [None]:
# Use ahead-of-time compilation by setting the RUN_NEURON_PARALLEL_COMPILE = "1" environment variable.
# Note that the values printed by the training script when running under neuron_parallel_compile
# are placeholder values and should be disregarded (e.g. loss_value=0).

pytorch_estimator = PyTorch(
    entry_point='run_glue.py', #training script file
    source_dir='./torch-neuronx/training/sagemaker_examples/neuron_parallel_compile', #source directory for your training script 
    git_config=git_config, #specifies the Git repository where your training script is stored
    metric_definitions=metric_definitions, #A list of dictionaries that defines the metric(s) used to evaluate the training jobs.
    instance_type=instance_type, #type of EC2 instance to use for training
    instance_count=instance_count, #number of Amazon EC2 instances to use for training
    volume_size=volume_size, #size in GB of the storage volume to use for storing input and output data during training
    role=role, #An AWS IAM role to be used by Amazon SageMaker training jobs and APIs 
    transformers_version='4.27.3', #specifies version of the transformer library to be used
    pytorch_version='1.13.1', #specifies version of the PyTorch library to be used
    py_version='py39', #specifies version of Python to be used
    distribution=distribution, #distribution configuration (torch_distributed) defined above
    image_uri=training_image, #specifies the Docker image to use for training
    environment={
        "RUN_NEURON_PARALLEL_COMPILE": "1", # runs neuron precompile step if equal to "1"
        "FI_EFA_FORK_SAFE": "1", #Older Linux (<5.15) kernels require environment variable FI_EFA_FORK_SAFE to be set to 1 for the libfabric to operate correctly.
        "NEURON_COMPILE_CACHE_URL": f"s3://{sess.default_bucket()}/s3-neuron-cache" #specifies s3 path location to store neuron persistent cache
    },
    hyperparameters = hyperparameters) #hyperparameters to be passed into the training script

In [None]:
# starting the train job
pytorch_estimator.fit()
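
After the compilation job finishes, you may want to confirm that the cache location now contains objects before starting the full training run. The sketch below is one way to do that with boto3. The helper names `split_s3_url` and `list_cache_keys` and the bucket name are hypothetical, and `list_cache_keys` needs AWS credentials with read access to the bucket, so it is defined here but not called.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_cache_keys(url):
    """List object keys under the cache URL (requires AWS credentials)."""
    import boto3  # imported lazily so split_s3_url works without boto3

    bucket, prefix = split_s3_url(url)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no objects match the prefix.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Hypothetical bucket name; in the notebook this would be the
# NEURON_COMPILE_CACHE_URL value passed to the Estimator.
bucket, prefix = split_s3_url("s3://my-example-bucket/s3-neuron-cache")
```

An empty result from `list_cache_keys` after the first job would suggest that the cache was not uploaded and that the next run will fall back to compiling during training.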

## Rerun the Training Job Without neuron_parallel_compile

In [None]:
# Subsequent training run without setting the RUN_NEURON_PARALLEL_COMPILE = "1" environment variable
pytorch_estimator = PyTorch(
    entry_point='run_glue.py', #training script file
    source_dir='./torch-neuronx/training/sagemaker_examples/neuron_parallel_compile', #source directory for your training script 
    git_config=git_config, #specifies the Git repository where your training script is stored
    metric_definitions=metric_definitions, #A list of dictionaries that defines the metric(s) used to evaluate the training jobs.
    instance_type=instance_type, #type of EC2 instance to use for training
    instance_count=instance_count, #number of Amazon EC2 instances to use for training
    volume_size=volume_size, #size in GB of the storage volume to use for storing input and output data during training
    role=role, #An AWS IAM role to be used by Amazon SageMaker training jobs and APIs 
    transformers_version='4.27.3', #specifies version of the transformer library to be used
    pytorch_version='1.13.1', #specifies version of the PyTorch library to be used
    py_version='py39', #specifies version of Python to be used
    distribution=distribution, #distribution configuration (torch_distributed) defined above
    image_uri=training_image, #specifies the Docker image to use for training
    environment={
        "FI_EFA_FORK_SAFE": "1", #Older Linux (<5.15) kernels require environment variable FI_EFA_FORK_SAFE to be set to 1 for the libfabric to operate correctly.
        "NEURON_COMPILE_CACHE_URL": f"s3://{sess.default_bucket()}/s3-neuron-cache" #specifies s3 path location to store neuron persistent cache
    },
    hyperparameters = hyperparameters) #hyperparameters to be passed into the training script

In [None]:
pytorch_estimator.fit()