## Predict Training Time and SageMaker Training Instance RAM, CPU and GPU resource consumption for a Custom Script

This notebook walks through how to use the `canary_training` library to generate projections of training time and of RAM, CPU, and GPU usage (collectively referred to here as "resource consumption").

To summarize briefly, the `canary_training` library runs many small training jobs on small percentages of the data (generally 1, 2, and 3 percent). Based on the statistics gathered with the SageMaker Profiler, it then extrapolates the resource consumption of the complete training job.
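
The extrapolation idea can be illustrated with a minimal sketch. This notebook does not describe the exact model that `canary_training` fits, so the straight-line least-squares fit below is an assumption, and the measurements are hypothetical values chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical canary measurements: fraction of data -> training time in seconds.
fractions = [0.01, 0.01, 0.02, 0.02, 0.03, 0.03]
seconds = [100.0, 104.0, 150.0, 154.0, 200.0, 204.0]

slope, intercept = fit_line(fractions, seconds)

# Evaluate the fitted line at 100% of the data (fraction = 1.0).
projected_full_run = slope * 1.0 + intercept
print(f"Projected training time on 100% of the data: {projected_full_run:.0f} s")
```

The intercept captures fixed overhead such as container start-up, which is one reason to fit a line rather than simply multiplying the 1% run time by 100.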

**Note**: If you are using a SageMaker Notebook Instance, please use the `conda_python3` kernel. If you are using SageMaker Studio, please use the `Python 3 (Data Science)` kernel.

In [2]:
import sagemaker
from sagemaker.pytorch import PyTorch
import pandas
import logging
logger = logging.getLogger('log')
#set logs if not done already
if not logger.handlers:
    logger.setLevel(logging.INFO)
    

This notebook relies on the `canary_training` package, which will be used for generating extrapolations.

In [None]:
# In SageMaker Studio
!pip install ~/canary_training/Canary_Training/canary_training/
# In a SageMaker Notebook Instance (make sure the path points to the canary_training directory)
#!pip install /home/ec2-user/SageMaker/canary_training/Canary_Training/canary_training
from canary_training import *

## Set up the Canary Job estimator and parameters
Before using canary_training to generate predictions of resource consumption, we need to define a few things.

1. A standard SageMaker estimator which defines our model.
2. The instance(s) that we want to test.
3. The number of data points on which to base the predictions.

In this example, we will try to predict resource consumption (i.e. CPU, RAM, and training time) when training on a `ml.p2.xlarge`.

This example follows the [blog post](https://aws.amazon.com/blogs/machine-learning/fine-tuning-a-pytorch-bert-model-and-deploying-it-with-amazon-elastic-inference-on-amazon-sagemaker/) and associated GitHub resources to fine-tune a BERT model using a custom script with PyTorch. The [associated dataset](https://nyu-mll.github.io/CoLA/) is 3 MB, which we partitioned into 200 CSV files.

In this notebook, we use the SageMaker PyTorch estimator with a custom training script to generate an ML model.

**Note**: The dataset used for the ML model is located here: `s3://aws-hcls-ml/public_assets_support_materials/canary_training_data/cola_data/train_dir`.

First, we set the canary training configuration and options. We will train on 1%, 2%, and 3% of the data, in triplicate.
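
As a small sketch, the triplicate list of percentages can also be built programmatically instead of being typed out. The names `percentages` and `replicates` are illustrative and not part of the `canary_training` API; the result is the same list that the next cell assigns literally.

```python
percentages = [0.01, 0.02, 0.03]  # fractions of the full dataset
replicates = 3                    # run each fraction three times

# Repeat each fraction `replicates` times, keeping equal fractions adjacent.
training_percentages = [p for p in percentages for _ in range(replicates)]
print(training_percentages)
```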

In [4]:
import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from time import gmtime, strftime
import random

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
output_bucket = sagemaker_session.default_bucket()

instance_types=['ml.p2.xlarge'] #test on a GPU instance

#set canary training parameters and inputs
output_s3_location=f"s3://{output_bucket}/bert_output_data"
#create a random local temporary directory which will be copied to s3
#if this directory already exists, you can point to it instead
random_number=random.randint(10000000, 99999999)
the_temp_dir=f"canary-training-temp-dir-{random_number}"

training_percentages=[.01,.01,.01,.02,.02,.02,.03,.03,.03] #run each percentage in triplicate to increase statistical confidence

In [5]:
print(output_bucket)

sagemaker-us-east-1-111918798052


Now we set the standard SageMaker Estimator parameters. Because this is just a test, we use the same data for both the `training` and `test` channels.

In [6]:
data_location='s3://aws-hcls-ml/public_assets_support_materials/canary_training_data/cola_data/train_dir' #location of input data for training

# construct a SageMaker PyTorch estimator that runs the custom training script
estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.4.0",
    py_version="py3",
    instance_count=1,  
    instance_type="None", #placeholder for now; it will be filled in later by the canary training script
    output_path=output_s3_location,
    hyperparameters={
        "epochs": 50,
        "num_labels": 2,
        "backend": "gloo",
    },
    disable_profiler=False, # keep the SageMaker Profiler enabled; its statistics drive the extrapolation
)



## Set up canary training jobs

We will set up the canary training by:
1. Creating samples of the underlying data.
2. Creating manifest files that will be used for these smaller training jobs.
3. Copying the manifest files to S3.
4. Building SageMaker estimators that will be used for these smaller training jobs.
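
Steps 1 and 2 can be sketched as follows. This is not the library's implementation, which is not shown in this notebook. It only illustrates the manifest layout that SageMaker accepts for `ManifestFile` inputs: a JSON array whose first element is a `prefix` object, followed by keys relative to that prefix. The bucket name, file names, and the helper `build_sample_manifest` are hypothetical.

```python
import json
import random

def build_sample_manifest(prefix, relative_keys, fraction, seed=0):
    """Return a SageMaker-style manifest listing a random sample of the files."""
    # Always keep at least one file, even for very small fractions.
    n_files = max(1, round(len(relative_keys) * fraction))
    rng = random.Random(seed)
    sampled = sorted(rng.sample(relative_keys, n_files))
    return [{"prefix": prefix}] + sampled

# Hypothetical file names standing in for the 200 CSV partitions.
keys = [f"part-{i:03d}.csv" for i in range(200)]
manifest = build_sample_manifest("s3://example-bucket/train_dir/", keys, fraction=0.01)
print(json.dumps(manifest))
```

Because sampling works on whole files, the fraction of data actually used is the number of sampled files divided by the number of partitions, which can differ from the requested percentage.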

In [7]:
ct=CanaryTraining(data_location=data_location,output_s3_location=output_s3_location,
                           the_temp_dir=the_temp_dir,instance_types=instance_types,estimator=estimator,training_percentages=training_percentages)

ct.prepare_canary_training_data()

aws s3 cp --recursive canary-training-temp-dir-81739571 s3://sagemaker-us-east-1-111918798052/bert_output_data/canary-training-temp-dir-81739571/


## Kick off canary training jobs
Now that we have the list of estimators, let's kick off the canary training jobs.
**Note**: By default, the `canary_training` library kicks off all of the jobs in parallel. For this example, that means 9 jobs will run on `ml.p2.xlarge` instances at the same time. If your account does not support this many jobs of that instance type (and you cannot request an increase), you can run the jobs serially.

If you run the jobs in parallel, the total time taken is about 20 minutes. If you run them one after another, it takes about 2 hours.

In [None]:
#kick off in parallel
ct.kick_off_canary_training_jobs(wait=True) #set wait equal to True, since we are using GPU instances.

# Wait until the jobs are finished before continuing to the next section
Before continuing, please make sure that all the jobs kicked off for canary training are finished. You can see these jobs in the `SageMaker Training` console.
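
As an alternative to checking the console, job status can be polled programmatically. The sketch below assumes you know the names of the canary training jobs. The helper `wait_for_jobs` is hypothetical and takes the describe call as a parameter so that it can be exercised without AWS access; with boto3, that call would be the SageMaker client's `describe_training_job`.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_jobs(job_names, describe, poll_seconds=60):
    """Block until every job reaches a terminal status; return the final statuses."""
    statuses = {}
    pending = set(job_names)
    while pending:
        for name in sorted(pending):
            status = describe(TrainingJobName=name)["TrainingJobStatus"]
            if status in TERMINAL_STATUSES:
                statuses[name] = status
        pending -= set(statuses)
        if pending:
            time.sleep(poll_seconds)
    return statuses

# With boto3, `describe` would be the SageMaker client's method:
#   import boto3
#   describe = boto3.client("sagemaker").describe_training_job
#   wait_for_jobs(["my-canary-job-1", "my-canary-job-2"], describe)  # hypothetical job names
```

Jobs that end as `Failed` or `Stopped` are returned as well, so you can decide whether to rerun them before extrapolating.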

# Gather Statistics and Perform Extrapolations

In the next section, we gather statistics about the training jobs and use them to **extrapolate** resource consumption for the entire training job. We will do three things:

1. Extract relevant information about CPU, RAM, GPU, and training time from the training jobs and the SageMaker Profiler.
2. Report the extrapolated CPU, RAM, and GPU usage, training time, and cost.
3. Report the raw CPU, RAM, and GPU usage and training time for the canary training jobs themselves, so that you can make an informed decision based on this detailed information.

In [13]:
#submitted_jobs_information
predicted_resource_usage_df,raw_actual_resource_usage_df=ct.get_predicted_resource_consumption()

In [14]:
predicted_resource_usage_df.head()

| | Projected_CPUUtilization | Projected_MemoryUsedPercent | Projected_TrainingTimeInSeconds | Projected_GPUUtilization | Projected_GPUMemoryUtilization | price | Projected_TotalCost |
|---|---|---|---|---|---|---|---|
| ml.p2.xlarge | 27.5 | 5.159929 | 21504.810337 | 110.0 | 59.401159 | 0.0003125 | 6.72025 |
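
The cost figure can be checked by hand. Assuming the `price` column is expressed in USD per second (an inference from the numbers, since the notebook does not state the unit), multiplying it by the projected training time reproduces `Projected_TotalCost`:

```python
projected_seconds = 21504.810337   # Projected_TrainingTimeInSeconds
price_per_second = 0.0003125       # `price` column, assumed to be USD per second

projected_cost = projected_seconds * price_per_second
price_per_hour = price_per_second * 3600
projected_hours = projected_seconds / 3600

print(f"Projected total cost: {projected_cost:.5f}")
print(f"Implied hourly price: {price_per_hour:.3f}")
print(f"Projected duration:   {projected_hours:.2f} hours")
```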


Now report the raw info from the canary jobs. 

**Note**: The `PercentageDataTrainedOn` column does not exactly match 1, 2, and 3 percent, because those percentages do not divide evenly into the number of data partitions (200).

# Inspect Canary Training Job Results
You can inspect the underlying data for the canary training results. This is the data that was used to create the forecasts. While the forecasts may be useful, we strongly encourage data scientists to inspect the raw results as well. Note that `CPUUtilization`, `MemoryUsedPercent`, `GPUUtilization`, and `GPUMemoryUtilization` are all p99 values.

In [12]:
raw_actual_resource_usage_df.head()

| | TrainingJobStatus | TrainingTimeInSeconds | InstanceType | ManifestLocation | job_name | PercentageDataTrainedOn | CPUUtilization | I/OWaitPercentage | MemoryUsedPercent | GPUUtilization | GPUMemoryUtilization |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Completed | 348 | ml.p2.xlarge | s3://sagemaker-us-east-1-111918798052/bert_out... | canary-training--job-2022-03-22-14-51-04-03417... | 0.01 | 25.0 | 100.0 | 4.72 | 100.0 | 55.0 |
| 1 | Completed | 429 | ml.p2.xlarge | s3://sagemaker-us-east-1-111918798052/bert_out... | canary-training--job-2022-03-22-15-01-05-05249... | 0.01 | 25.0 | 100.0 | 4.7 | 100.0 | 54.0 |
| 2 | Stopped | 1 | ml.p2.xlarge | s3://sagemaker-us-east-1-111918798052/bert_out... | canary-training--job-2022-03-22-15-10-56-07184... | 0.01 | 25.0 | 100.0 | 4.7 | 100.0 | 54.0 |
| 3 | Completed | 509 | ml.p2.xlarge | s3://sagemaker-us-east-1-111918798052/bert_out... | canary-training--job-2022-03-22-15-13-08-05118... | 0.02 | 25.0 | 98.0 | 4.6938 | 100.0 | 54.0 |
| 4 | Completed | 504 | ml.p2.xlarge | s3://sagemaker-us-east-1-111918798052/bert_out... | canary-training--job-2022-03-22-15-23-58-05594... | 0.02 | 25.0 | 97.96 | 4.69 | 100.0 | 54.0 |
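
One way to inspect these rows is to set aside jobs that did not complete and compare the mean training time per data fraction. The sketch below copies the status, percentage, and training time values from the five rows shown above. Whether `canary_training` itself excludes stopped jobs from its forecast is not stated in this notebook, so treat the filtering as an assumption.

```python
from collections import defaultdict

# (TrainingJobStatus, PercentageDataTrainedOn, TrainingTimeInSeconds) from the rows above.
rows = [
    ("Completed", 0.01, 348),
    ("Completed", 0.01, 429),
    ("Stopped", 0.01, 1),
    ("Completed", 0.02, 509),
    ("Completed", 0.02, 504),
]

times_by_fraction = defaultdict(list)
for status, fraction, seconds in rows:
    if status == "Completed":  # skip the stopped job, which ran for only 1 second
        times_by_fraction[fraction].append(seconds)

# Mean training time in seconds for each data fraction.
mean_seconds = {f: sum(t) / len(t) for f, t in times_by_fraction.items()}
print(mean_seconds)
```

With the DataFrame itself, the equivalent would be `raw_actual_resource_usage_df[raw_actual_resource_usage_df["TrainingJobStatus"] == "Completed"].groupby("PercentageDataTrainedOn")["TrainingTimeInSeconds"].mean()`.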
