# Experiment Overview

How do you design and run an experiment to benchmark a HuggingFace AutoModel for NLP with SageMaker? Let's find out!

## Table of Contents:

1. [Generate Design](#Generate-Design)
2. [Install/Import Required Packages](#Install/Import-Required-Packages)
3. [Load Experimental Design](#Load-Experimental-Design)
4. [Add Hyperparameters and Customization to Experiment](#Add-Hyperparameters-and-Customization-to-Experiment)
5. [Export design to individual csv files](#Export-design-to-individual-csv-files)
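6. [Upload Split and Tokenized Dataset to S3 According to Experimental Design](#Upload-Split-and-Tokenized-Dataset-to-S3-According-to-Experimental-Design)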
7. [Execute Experiments](#Execute-Experiments)

### Design Automation:
Steps 3 through 5 can equivalently be run with the command "make experiment" in the terminal, which uses the Makefile and the make_experiment.py script. That script is identical to the notebook code in sections 3 - 5 below. Once you are happy with your design and exploration, you can modify make_experiment.py to run these steps automatically with a single command.

## 1. Generate Design

First, a .csv file containing the experimental design must be uploaded to data/raw. (See "experimental_design.csv" for an example of this).
An experiment can be designed in many ways using many types of software. For this experiment, the JMP Custom Design Tool was used to create an initial experimental design. The data/external folder contains screenshots from JMP regarding the design quality. The design is limited to 4 nodes due to account level resource restrictions.

The file experimental_design.csv in data/raw can be replaced with any experimental design of the reader's choosing.
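For readers without JMP, a design with the same factor and response columns can be sketched in plain Python. The snippet below builds a full-factorial design over the two factors used in this experiment (num_nodes and dataset_size). This is an assumption made for illustration: it is not the custom design that JMP produced, and the helper names and output path are hypothetical.

```python
import csv
import itertools
import random

# Factor levels taken from the example design in this notebook
NUM_NODES = [1, 2, 4]
DATASET_SIZES = [100_000, 600_000, 1_000_000]
# Response columns stay empty until the runs have been executed
RESPONSES = ["train_time", "f1", "billable_seconds", "cost"]


def full_factorial(seed=0):
    """Return every num_nodes x dataset_size combination in randomized run order."""
    runs = [
        {"num_nodes": n, "dataset_size": s, **{r: "" for r in RESPONSES}}
        for n, s in itertools.product(NUM_NODES, DATASET_SIZES)
    ]
    random.Random(seed).shuffle(runs)  # randomize the run order
    return runs


def write_design(runs, path):
    """Write the design to a CSV file (hypothetical helper, not part of the repo)."""
    fieldnames = ["num_nodes", "dataset_size", *RESPONSES]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(runs)


design = full_factorial()
print(len(design), "runs in the full-factorial design")
```

Calling `write_design(design, "my_design.csv")` would save the design to a file of your choosing. A full factorial needs more runs than an optimal design of the kind JMP generates, so it is a reasonable starting point only when the run budget allows it.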

## 2. Install/Import Required Packages

In [1]:
# if running in SageMaker Notebooks
# !pip install -q datasets
# !pip install -q datasets[s3]
# !pip install -q transformers

In [None]:
# if running in SageMaker Notebooks - uncomment and run below to load environment variables from .env
# %load_ext dotenv
# %dotenv

In [19]:
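import json  # used later when writing result files
import os  # used later to read environment variables

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sagemaker
import datasets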

# import custom run experiment functions
from run_experiment import *

## 3. Load Experimental Design

Load the design from the data/raw folder. If you have changed the design or its file name, update the file name in the cell below. This design represents the minimum information required to see our trends of interest.

In [20]:
# load raw experimental design designed previously
exp_design = pd.read_csv('../data/raw/experimental_design.csv')
display(exp_design)

Unnamed: 0,num_nodes,dataset_size,train_time,f1,billable_seconds,cost
0,4,600000,,,,
1,4,100000,,,,
2,4,1000000,,,,
3,2,1000000,,,,
4,1,100000,,,,
5,4,600000,,,,
6,1,1000000,,,,
7,1,100000,,,,
8,4,1000000,,,,
9,1,600000,,,,


In [21]:
# confounding pattern previously calculated
conf_pattern = pd.read_csv('../data/raw/confounding_pattern.csv')
sns.heatmap(conf_pattern.iloc[:,1:], cmap = 'Greens')
plt.title("Factors measured in the experiment \nhave low to no inherent correlation");
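As a quick cross-check of the heatmap's message, the correlation between the two main factors can be computed directly from the design table shown earlier. This is only a sketch: it covers the plain Pearson correlation between num_nodes and dataset_size, whereas the contents of confounding_pattern.csv (which may include interaction terms) are not reproduced here.

```python
import math

# Factor columns copied from the design table displayed above
num_nodes = [4, 4, 4, 2, 1, 4, 1, 1, 4, 1]
dataset_size = [600000, 100000, 1000000, 1000000, 100000,
                600000, 1000000, 100000, 1000000, 600000]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r = pearson(num_nodes, dataset_size)
print(f"correlation(num_nodes, dataset_size) = {r:.2f}")
```

A value close to zero means the two factors vary largely independently across runs, so their effects on the responses can be estimated separately.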

## 4. Add Hyperparameters and Customization to Experiment

In the next section, we add all the information our training jobs will need to know to run all experiments.

### 4a) Calculate the epochs required to compare training with different numbers of samples:

To compare training jobs with different numbers of samples (also referred to as "dataset sizes" here), we need to control the number of steps (weight updates) made during training for each model. These steps represent the learning opportunities for the model. If one model is given more learning opportunities than another, it will likely outperform, and the comparison of their metrics won't be objective.

<b>For example, take the case of training a model on a 1 GPU instance, i.e. an ml.p3.2xlarge, for 3 epochs, with a batch size of 100 samples:</b>
* Dataset Size: 100,000 samples, batch size = 100, steps in each epoch = 1,000 (100,000/100)
    * 3 Epochs: 1,000 steps per epoch * 3 epochs = 3,000 steps 
* Dataset Size: 1,000,000 samples, batch size = 100, steps in each epoch = 10,000 (1,000,000/100)
    * 3 Epochs: 10,000 steps per epoch * 3 epochs = 30,000 steps 
    
With this discrepancy in steps, the results across dataset sizes cannot be compared apples to apples in an experiment, because the second case takes 10x as many steps as the first.
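The arithmetic in the example above can be reproduced with a few lines of Python. The function name below is hypothetical, and the numbers are the illustrative ones from the example (batch size 100, 3 epochs), not the settings used in the actual experiment.

```python
def total_steps(num_samples, batch_size, num_epochs):
    """Total weight updates: steps per epoch multiplied by the number of epochs."""
    steps_per_epoch = num_samples // batch_size
    return steps_per_epoch * num_epochs


small = total_steps(num_samples=100_000, batch_size=100, num_epochs=3)
large = total_steps(num_samples=1_000_000, batch_size=100, num_epochs=3)
print("steps (100k samples):", small)
print("steps (1M samples):  ", large)

# Epochs the smaller dataset would need to reach the same number of steps
epochs_to_match = large / (100_000 // 100)
print("epochs needed on 100k samples to match:", epochs_to_match)
```

With the epoch count held at 3, the larger dataset receives ten times the weight updates. Matching the step count instead requires the smaller dataset to train for ten times as many epochs.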

<b>Therefore, the number of steps should be fixed whenever we want to compare the results of different dataset sizes. The epochs are adjusted to achieve this.</b>

Rearranging the equation to solve for the number of epochs ...

* num_steps_per_epoch = num_samples/batch_size
* num_epochs = num_steps/num_steps_per_epoch

Here's an example calculation below.

The parameters selected in the present experiment were the best obtainable in view of time and resource constraints - however they can no doubt be improved or tuned to the needs of the deep learning practitioner.

In [22]:
# For the case of 1 GPU - num steps selected via guess and check to get whole numbered epochs
num_steps = 84375 
batch_size = 32*1 # for 1 GPU, change value to adjust for number of GPUs used
num_samples = exp_design['dataset_size'].values*0.9 # 90% of each size is used for training

num_steps_per_epoch = num_samples/batch_size

# calculate the number of epochs to be passed as hyperparameters for each experiment
num_epochs = num_steps/num_steps_per_epoch
print("\nTraining dataset sample sizes:", num_samples)
print("\nRequired epochs for different sample sizes on 1 GPU:", num_epochs)


Training dataset sample sizes: [540000.  90000. 900000. 900000.  90000. 540000. 900000.  90000. 900000.
 540000.  90000.  90000. 900000.  90000. 540000.]

Required epochs for different sample sizes on 1 GPU: [ 5. 30.  3.  3. 30.  5.  3. 30.  3.  5. 30. 30.  3. 30.  5.]


### 4b) Select the global batch sizes, learning rates, and number of steps for different compute power:

<b>Hyperparameter Selection:</b>

* Aim was to fix as many parameters as possible, and to collect interpretable data with minimal time spent on troubleshooting
* Parameters Selected:
    * per_device_batch_size: 32 (HuggingFace suggested default)
    * learning_rate: 5e-5 (HuggingFace suggested default)
    * global_batch_size: varied based on number of GPUs in use
    * num_epochs: 3 (1M samples), 5 (600k samples), 30 (100k samples)
    * num_steps: calculated (see above)
        * based on global batch size and epochs for the experiment 
        * constant across dataset sizes at a given number of GPUs and global batch size
        * num_steps = num_epochs x num_samples/global_batch_size

NOTE: it is common practice in deep learning to vary the learning rate with the global batch size. For the sake of having another experimental control, the learning rate was kept constant during experimentation. However, to improve the performance of the models, the learning rate can be adjusted proportionally with global batch size to the desired level of fine-tuning. Additionally, the experiment could be re-designed to fix the global batch size and vary the per device batch size, or keep the number of steps constant across all runs. The deep learning practitioner can select the setup that works best for their needs, and generate the experimental design accordingly. 
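To make the note above concrete, here is a minimal sketch of two heuristics that are commonly used to scale the learning rate with the global batch size: linear scaling and square-root scaling. The function and constant names are hypothetical, and neither rule is guaranteed to be the right choice for a given model, so treat the outputs as starting points for tuning.

```python
import math

BASE_LR = 5e-5        # learning rate used with the base batch size
BASE_BATCH_SIZE = 32  # batch size on a single GPU


def scale_lr(global_batch_size, rule="linear"):
    """Scale BASE_LR according to the ratio of global to base batch size."""
    ratio = global_batch_size / BASE_BATCH_SIZE
    if rule == "linear":
        return BASE_LR * ratio
    if rule == "sqrt":
        return BASE_LR * math.sqrt(ratio)
    raise ValueError(f"unknown scaling rule: {rule}")


linear_lr = scale_lr(1024, rule="linear")
sqrt_lr = scale_lr(1024, rule="sqrt")
print(f"linear scaling: {linear_lr:.2e}")
print(f"sqrt scaling:   {sqrt_lr:.2e}")
```

For a global batch size of 1024, the square-root rule gives a value that rounds to 2.83e-4, the same learning rate used for two of the additional runs appended later in this notebook. Whether that is how the value was originally chosen is an assumption.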

Additional info we require to set up our experiment:

* automodel_name: or "checkpoint", used to automatically configure both tokenization and modelling
* dataset_name: name to load from the huggingface datasets hub
* epochs: calculated above
* num_parameters_tuned: custom to each model, params for distilbert below
* s3_bucket: name of your S3 bucket, read from the BUCKET_NAME environment variable

Let's add this info into the experimental design and save it in the interim folder. The table will be completed by inserting the final values from the experiment execution.

In [23]:
s3_bucket = os.getenv("BUCKET_NAME")
dataset_name = os.getenv("HF_DATASET")
model_name = os.getenv("HF_MODEL")
tunable_params = os.getenv("TUNED_PARAMS")
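Note that `os.getenv` returns `None` when a variable is missing and always returns strings otherwise, so an unset or mistyped variable would silently propagate into the design. A small guard such as the one below can catch that early. The helper `require_env` and the variable `DEMO_TUNED_PARAMS` are hypothetical and exist only for this illustration.

```python
import os


def require_env(name, cast=str):
    """Return the environment variable `name`, converted with `cast`, or fail loudly."""
    value = os.getenv(name)
    if value is None or value == "":
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file."
        )
    return cast(value)


# Demonstration with a variable set in-process (hypothetical name)
os.environ["DEMO_TUNED_PARAMS"] = "66955010"
demo_value = require_env("DEMO_TUNED_PARAMS", cast=int)
print(demo_value, type(demo_value).__name__)
```

In the notebook, the same pattern could be applied to BUCKET_NAME, HF_DATASET, HF_MODEL, and TUNED_PARAMS, with `cast=int` for the parameter count.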

In [24]:
exp_design.insert(loc = 0, column = 'dataset_name', value = dataset_name)
exp_design.insert(loc = 1, column = 'automodel_name', value = model_name)
exp_design.insert(loc = 2, column = 'num_parameters_tuned', value = tunable_params) # constant for this model
exp_design.insert(loc = 3, column = 's3_bucket', value = s3_bucket) # constant for this experiment
exp_design.insert(loc = 4, column = 'per_device_train_batch_size', value = batch_size) # calculated above
exp_design.insert(loc = 5, column = 'learning_rate', value = 5e-5)
exp_design.insert(loc = 6, column = 'epochs', value = num_epochs) # calculated above


We can add even more info to the design now, by mapping the number of nodes to the types of instances needed. Additionally, we should state the number of GPUs, EBS volume required (more for smaller instances), the price per hour based on the instance type, and whether or not parallelism is enabled.

In [25]:
# map the num_nodes column to specific factor levels for experimentation
instance_mapper = {1:'ml.p3.2xlarge', 2:'ml.p3.16xlarge', 4:'ml.p3.16xlarge'}
gpu_mapper = {1:1, 2:16, 4:32}
parallel_enabled_mapper = {1:False, 2:True, 4:True}
EBS_volume_mapper = {'ml.p3.2xlarge':1024, 'ml.p3.8xlarge':1024, 'ml.p3.16xlarge':30} # leave default for 16xlarges, add more storage for small instances
price_mapper = {"ml.p3.2xlarge": 3.825, "ml.p3.8xlarge":14.688, "ml.p3.16xlarge":28.152} # hourly instance pricing from SageMaker website
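The GPU counts in `gpu_mapper` follow from the number of GPUs on each instance type multiplied by the number of nodes. The sketch below derives them instead of hard-coding them, which may help if you change the instance types. The per-instance GPU counts (1, 4, and 8 for the three p3 sizes) are taken from the public instance specifications; the function names are hypothetical.

```python
# GPUs available on a single instance of each type
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def total_gpus(instance_type, num_nodes):
    """Total GPUs across all nodes of a training cluster."""
    return GPUS_PER_INSTANCE[instance_type] * num_nodes


def global_batch_size(instance_type, num_nodes, per_device_batch_size=32):
    """Global batch size: per-device batch size times the total number of GPUs."""
    return total_gpus(instance_type, num_nodes) * per_device_batch_size


for instance_type, nodes in [("ml.p3.2xlarge", 1), ("ml.p3.16xlarge", 2), ("ml.p3.16xlarge", 4)]:
    print(instance_type, nodes, total_gpus(instance_type, nodes),
          global_batch_size(instance_type, nodes))
```

These values agree with the `gpu_mapper` entries above (1, 16, and 32 GPUs for 1, 2, and 4 nodes).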


In [26]:
exp_design.insert(loc = 7, column = 'instance_type', value = exp_design['num_nodes'].map(instance_mapper))
exp_design.insert(loc = 8, column = 'num_gpus', value = exp_design['num_nodes'].map(gpu_mapper))
exp_design.insert(loc = 9, column = 'global_batch_size', value = exp_design['num_gpus']*exp_design['per_device_train_batch_size']) 
exp_design.insert(loc = 10, column = 'num_steps', value = np.rint((exp_design['epochs'] * exp_design['dataset_size']*0.9)/exp_design['global_batch_size'])) # note - in the future num steps could be held constant across exps instead
exp_design.insert(loc = 11, column = 'hourly_price', value = exp_design['instance_type'].map(price_mapper))
exp_design.insert(loc = 12, column = 'volume_size', value = exp_design['instance_type'].map(EBS_volume_mapper)) 
exp_design.insert(loc = 13, column = 'parallel_enabled', value = exp_design['num_nodes'].map(parallel_enabled_mapper)) 

Let's view the completed design.

In [27]:
display(exp_design)

Unnamed: 0,dataset_name,automodel_name,num_parameters_tuned,s3_bucket,per_device_train_batch_size,learning_rate,epochs,instance_type,num_gpus,global_batch_size,num_steps,hourly_price,volume_size,parallel_enabled,num_nodes,dataset_size,train_time,f1,billable_seconds,cost
0,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.16xlarge,32,1024,2637.0,28.152,30,True,4,600000,,,,
1,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.16xlarge,32,1024,2637.0,28.152,30,True,4,100000,,,,
2,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,32,1024,2637.0,28.152,30,True,4,1000000,,,,
3,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,16,512,5273.0,28.152,30,True,2,1000000,,,,
4,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.2xlarge,1,32,84375.0,3.825,1024,False,1,100000,,,,
5,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.16xlarge,32,1024,2637.0,28.152,30,True,4,600000,,,,
6,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.2xlarge,1,32,84375.0,3.825,1024,False,1,1000000,,,,
7,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.2xlarge,1,32,84375.0,3.825,1024,False,1,100000,,,,
8,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,32,1024,2637.0,28.152,30,True,4,1000000,,,,
9,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.2xlarge,1,32,84375.0,3.825,1024,False,1,600000,,,,


In [28]:
# save completed design to file
exp_design.to_csv('../data/interim/experimental_design.csv', index_label = 'run_id')

Now that we have the format of our ideal design, we can add further experiments on top of the minimum set generated in JMP.

Below, we add additional centrepoints to the design to collect more data, and append them to the file we just exported.

In [29]:
from csv import writer

# if any additional data points desired, add additional rows to csv with custom params
cp1 = [15,dataset_name,model_name,tunable_params,s3_bucket,32,5e-05,5.0,'ml.p3.8xlarge',4,128,21094,14.688,1024,False,1,600000]
cp2 = [16,dataset_name,model_name,tunable_params,s3_bucket,16,5e-05,5.0,'ml.p3.16xlarge',8,128,10547,28.152,1024,True,1,600000] # pd batch size adjusted for cuda error
cp3 = [17,dataset_name,model_name,tunable_params,s3_bucket,32,5e-05,30.0,'ml.p3.16xlarge',8,256,10547,28.152,1024,True,1,100000]
cp4 = [18,dataset_name,model_name,tunable_params,s3_bucket,32,5e-05,3.0,'ml.p3.16xlarge',8,256,10547,28.152,1024,True,1,1000000]
cp5 = [19,dataset_name,model_name,tunable_params,s3_bucket,32,5e-05,30.0,'ml.p3.8xlarge',4,128,21094,14.688,1024,False,1,100000]
cp6 = [20,dataset_name,model_name,tunable_params,s3_bucket,32,5e-05,3.0,'ml.p3.8xlarge',4,128,21094,14.688,1024,False,1,1000000]
cp7 = [21,dataset_name,model_name,tunable_params,s3_bucket,32,2.83e-4,30.0,'ml.p3.16xlarge',32,1024,2637,28.152,30,True,4,100000] 
cp8 = [22,dataset_name,model_name,tunable_params,s3_bucket,32,2.83e-4,3.0,'ml.p3.16xlarge',32,1024,2637,28.152,30,True,4,1000000] 
cp9 = [23,dataset_name,model_name,tunable_params,s3_bucket,32,5e-05,30.0,'ml.p3.16xlarge',8,256,10547,28.152,1024,False,1,100000]
cp10 = [24,dataset_name,model_name,tunable_params,s3_bucket,32,5e-05,3.0,'ml.p3.16xlarge',8,256,10547,28.152,1024,False,1,1000000]

# Open the existing CSV file in append mode and add the extra rows.
# newline='' is the setting recommended by the csv module for files passed to csv.writer.
with open('../data/interim/experimental_design.csv', 'a', newline='') as f_object:
    writer_object = writer(f_object)

    # additional centrepoints
    writer_object.writerows([cp1, cp2, cp3, cp4, cp5, cp6])

    # increased learn rate to compare with lr controlled version
    writer_object.writerows([cp7, cp8, cp9, cp10])

# the with-block closes the file automatically, so no explicit close() is needed

exp_design = pd.read_csv('../data/interim/experimental_design.csv')
display(exp_design)


Unnamed: 0,run_id,dataset_name,automodel_name,num_parameters_tuned,s3_bucket,per_device_train_batch_size,learning_rate,epochs,instance_type,num_gpus,...,num_steps,hourly_price,volume_size,parallel_enabled,num_nodes,dataset_size,train_time,f1,billable_seconds,cost
0,0,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,600000,,,,
1,1,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,100000,,,,
2,2,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,1000000,,,,
3,3,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,16,...,5273.0,28.152,30,True,2,1000000,,,,
4,4,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,100000,,,,
5,5,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,600000,,,,
6,6,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,1000000,,,,
7,7,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,100000,,,,
8,8,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,1000000,,,,
9,9,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,600000,,,,


## 5. Export design to individual csv files

Ideally, this would run in one loop. However, to mitigate capacity planning restrictions, experiment runs will be executed individually.

In [30]:
# export desired runs into individual csvs
for ix, val in exp_design.iterrows():
    exp_design.loc[exp_design.index == ix].to_csv(f'../data/interim/run{ix}_experimental_design.csv', index_label = 'run_id')

## 6. Upload Split and Tokenized Dataset to S3 According to Experimental Design

On first setup of your experiment, you can prepare and upload your data to S3 in one go by executing the wrangle_datasets.py script. The script:

* loads your data from the HuggingFace Hub
* tokenizes the data
* determines how many dataset sample sizes are in your experimental design (dataset_size)
* splits the data into train-val-test splits (90% train, 10% val; in the case of amazon_polarity, the test split is separated already)
* organizes directories for the files in S3 so that they can be automatically called by the SageMaker training scripts later
* uploads the data to the correct directory

The script can be executed most conveniently by running the below cell. Fair warning - for large datasets this can take some time (the cell below has an estimated 1 hour and 10 minutes run time for the default experiment). Fortunately - you will only have to do it once for your whole experiment! So grab a coffee and a book. :-)

If you are using the default amazon-polarity dataset this repo came with, the wrangle_datasets.py script will work out of the box.
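The split sizes and S3 layout described above can be sketched as follows. The path pattern mirrors the paths printed in the script output further down; the function `plan_dataset` is hypothetical and is not part of wrangle_datasets.py, so treat it as an illustration of the layout rather than the script's implementation.

```python
def plan_dataset(bucket, dataset_name, dataset_size, train_fraction=0.9):
    """Return split sizes and S3 paths for one dataset size in the design."""
    n_train = round(dataset_size * train_fraction)
    n_val = dataset_size - n_train
    prefix = f"s3://{bucket}/datasets/{dataset_name}"
    return {
        "n_train": n_train,
        "n_val": n_val,
        "train_path": f"{prefix}/{dataset_size}/train",
        "val_path": f"{prefix}/{dataset_size}/val",
        # the test split is shared by all dataset sizes
        "test_path": f"{prefix}/test",
    }


plan = plan_dataset("distilbert-benchmarking", "amazon_polarity", 600000)
for key, value in plan.items():
    print(f"{key}: {value}")
```

Because each dataset size gets its own train and val directory, a training job only needs the bucket, dataset name, and dataset size from its design row to locate its inputs.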


### If you are using a different HuggingFace Hub dataset than amazon-polarity, you will have to customize a few things for this to work. 
This feature has not been tested with other datasets from the HuggingFace Hub. However, it should work given a few adjustments.

* Change the HF_DATASET environment variable to the name of the HuggingFace Hub dataset you desire.
* In wrangle_datasets.py, adjust the name of the column containing the core text data you wish to tokenize. The column name is 'content' for amazon_polarity.
* Adjust the train-val-test splitting code to suit the needs of your dataset. amazon_polarity comes with a train and a test split, so by default the script splits the train split 90-10 and uses the same test set for all cases.
* If your new dataset is for multi-class classification, be sure to customize num_labels in your SageMaker training script (src/modes/train_model.py) as well.







In [14]:
%run ../src/data/wrangle_datasets.py

Configuration settings:
sagemaker role arn: arn:aws:iam::513667113968:role/service-role/AmazonSageMaker-ExecutionRole-20210610T144190
sagemaker bucket: distilbert-benchmarking
sagemaker session region: us-east-1
Settings for this experiment:


Unnamed: 0,run_id,dataset_name,automodel_name,num_parameters_tuned,s3_bucket,per_device_train_batch_size,learning_rate,epochs,instance_type,num_gpus,...,num_steps,hourly_price,volume_size,parallel_enabled,num_nodes,dataset_size,train_time,f1,billable_seconds,cost
0,0,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,600000,,,,
1,1,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,100000,,,,
2,2,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,1000000,,,,
3,3,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,16,...,5273.0,28.152,30,True,2,1000000,,,,
4,4,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,100000,,,,
5,5,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,600000,,,,
6,6,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,1000000,,,,
7,7,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,100000,,,,
8,8,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,3.0,ml.p3.16xlarge,32,...,2637.0,28.152,30,True,4,1000000,,,,
9,9,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,5.0,ml.p3.2xlarge,1,...,84375.0,3.825,1024,False,1,600000,,,,


None
Dataset sizes: [ 600000  100000 1000000]
Downloading tokenizer.
Splitting and tokenizing data.


Reusing dataset amazon_polarity (/Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1)
Loading cached split indices for dataset at /Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1/cache-d5c7a05fe0dc6a8d.arrow and /Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1/cache-2a3aff018382a4ed.arrow
100%|██████████| 540/540 [02:31<00:00,  3.57ba/s]
100%|██████████| 60/60 [00:14<00:00,  4.05ba/s]


Uploading data to S3 for size: 600000
Training/Val dataset paths for dataset size 600000:
s3://distilbert-benchmarking/datasets/amazon_polarity/600000/train
s3://distilbert-benchmarking/datasets/amazon_polarity/600000/val


Reusing dataset amazon_polarity (/Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1)
Loading cached split indices for dataset at /Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1/cache-23260bf369fdf7f1.arrow and /Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1/cache-839e7d50b8dfd116.arrow
100%|██████████| 90/90 [00:21<00:00,  4.17ba/s]
100%|██████████| 10/10 [00:02<00:00,  4.08ba/s]


Uploading data to S3 for size: 100000
Training/Val dataset paths for dataset size 100000:
s3://distilbert-benchmarking/datasets/amazon_polarity/100000/train
s3://distilbert-benchmarking/datasets/amazon_polarity/100000/val


Reusing dataset amazon_polarity (/Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1)
Loading cached split indices for dataset at /Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1/cache-ded99e5319a56d55.arrow and /Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1/cache-b4c98bb64b50c282.arrow
100%|██████████| 900/900 [05:07<00:00,  2.93ba/s]
100%|██████████| 100/100 [00:34<00:00,  2.86ba/s]


Uploading data to S3 for size: 1000000
Training/Val dataset paths for dataset size 1000000:
s3://distilbert-benchmarking/datasets/amazon_polarity/1000000/train
s3://distilbert-benchmarking/datasets/amazon_polarity/1000000/val


Reusing dataset amazon_polarity (/Users/samstu/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/ac31acedf6cda6bc2aa50d448f48bbad69a3dd8efc607d2ff1a9e65c2476b4c1)
100%|██████████| 400/400 [01:36<00:00,  4.16ba/s]


Test dataset path:
s3://distilbert-benchmarking/datasets/amazon_polarity/test


## 7. Execute Experiments

Ideally, we would be able to execute all experiments in one loop, in the randomized order of the experimental design from JMP. However, any capacity planning restrictions will cause the loop to terminate.

Hence, experimental runs are executed individually below, one at a time, by calling the run_experiment function.

Alternatively, if you are able to open multiple SageMaker notebooks and clone your customized version of this repo to them, you can parallelize the running of experiments, and simply commit the result files back to your shared git branch as they come in. Once you have run all your desired experiments, you can analyze results by following the instructions in the analyze_results notebook.
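If you still prefer a loop, one option is to catch failures per run so that a single capacity error does not stop the remaining runs. The sketch below assumes only that the runner accepts an `exp_design_path` keyword argument, as run_experiment does in the cell below. The exact exception raised on a capacity error is not known here, so the handler is deliberately broad, and `fake_runner` is a stand-in used only for demonstration.

```python
def run_all(design_paths, runner):
    """Call `runner` on each design file, collecting failures instead of stopping."""
    succeeded, failed = [], []
    for path in design_paths:
        try:
            runner(exp_design_path=path)
            succeeded.append(path)
        except Exception as err:  # broad on purpose: the capacity error type is an unknown
            failed.append((path, repr(err)))
    return succeeded, failed


def fake_runner(exp_design_path):
    """Stand-in for run_experiment that fails for one run, for demonstration only."""
    if "run1_" in exp_design_path:
        raise RuntimeError("simulated capacity error")


paths = [f"../data/interim/run{i}_experimental_design.csv" for i in range(3)]
succeeded, failed = run_all(paths, fake_runner)
print("succeeded:", succeeded)
print("failed:", failed)
```

In the notebook, `run_all(paths, run_experiment)` would launch the real jobs, and the `failed` list can be retried once capacity is available again.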

In [15]:
# for example, executing a replicate of run 13
run_experiment(exp_design_path = '../data/interim/run13_experimental_design.csv')

Configuration settings:
sagemaker role arn: arn:aws:iam::513667113968:role/service-role/AmazonSageMaker-ExecutionRole-20210610T144190
sagemaker bucket: distilbert-benchmarking
sagemaker session region: us-east-1
Settings for this experiment:


Unnamed: 0,run_id,run_id.1,dataset_name,automodel_name,num_parameters_tuned,s3_bucket,per_device_train_batch_size,learning_rate,epochs,instance_type,...,num_steps,hourly_price,volume_size,parallel_enabled,num_nodes,dataset_size,train_time,f1,billable_seconds,cost
0,13,13,amazon_polarity,distilbert-base-uncased,66955010,distilbert-benchmarking,32,5e-05,30.0,ml.p3.16xlarge,...,5273.0,28.152,30,True,2,100000,,,,


None
Dataset sizes: [100000]
Preparing to initiate experiment w/ run number: 13
Starting Training Job.
Train input path: s3://distilbert-benchmarking/datasets/amazon_polarity/100000/train
2021-08-20 18:47:01 Starting - Starting the training job...
2021-08-20 18:47:27 Starting - Launching requested ML instancesProfilerReport-1629485220: InProgress
.........
2021-08-20 18:49:05 Starting - Preparing the instances for training............
2021-08-20 18:51:07 Downloading - Downloading input data...
2021-08-20 18:51:42 Training - Downloading the training image.........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-20 18:53:10,310 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-08-20 18:53:10,388 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
bash: cannot set terminal process group (-1): Inap

## Manual Result Generation

If one of the training jobs was executed in a different SageMaker notebook, and you want to generate its results in the notebook you're in, you can use the code below to generate the results of any successfully executed training job manually.

Simply pass the training job name, and the run number according to your design, to generate the results for the experiment in the notebook you are in.

In [18]:
# MANUAL LOOKUP
# after training job execution, get response values
import json  # needed for json.dumps below; harmless if already imported
run_job_name = "experiment-run13-20-18-47-00"
run_number = 13

metrics_session = sagemaker.session.Session() # use to get metrics after training

print("Training job finished. Fetching metrics.")
train_metrics = sagemaker.TrainingJobAnalytics(run_job_name).dataframe()
run_f1 = train_metrics['value'].values[0]
print("Run {} \nF1: {}".format(run_number, run_f1))

# get train time and billable seconds
job_description = metrics_session.describe_training_job(run_job_name)
run_train_time = job_description['TrainingTimeInSeconds']
run_bill_time = job_description['BillableTimeInSeconds']

print("Train Time:", run_train_time, "Bill Time:", run_bill_time)

# write results to file
run_results = {"run_number":run_number, "job_name":run_job_name, "training_time":run_train_time, "bill_time":run_bill_time, "mean_f1":run_f1}

with open('../data/interim/run{}_results.txt'.format(run_number), 'w') as convert_file:
    convert_file.write(json.dumps(run_results))

print("Experiment {} complete. Results written to ..data/interim folder.".format(run_number))

Training job finished. Fetching metrics.
Run 13 
F1: 0.9195691016660961
Train Time: 5497 Bill Time: 5497
Experiment 13 complete. Results written to ..data/interim folder.


Note: the results committed to this repo for run 13 are those of the original run from the first experiment, not the results of the demonstration re-run shown above.
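The cost column of the design can be estimated from the billable seconds and the hourly instance price. The sketch below assumes that billable time is reported per instance and must therefore be multiplied by the number of nodes, and that the hourly prices in `price_mapper` are per instance. Whether the cost values in the analysis were computed exactly this way is an assumption, and the function name is hypothetical.

```python
def training_cost(billable_seconds, hourly_price, num_nodes):
    """Estimated cost in dollars: billable hours x price per instance-hour x instances."""
    return billable_seconds * num_nodes * hourly_price / 3600


# Values for the run 13 re-run shown above: 5497 billable seconds on
# 2 nodes of ml.p3.16xlarge at 28.152 per hour
cost_run13 = training_cost(billable_seconds=5497, hourly_price=28.152, num_nodes=2)
print(f"Estimated cost for run 13: ${cost_run13:.2f}")
```

Applying the same function to each row of the completed design gives a cost estimate that can be compared across instance types and dataset sizes.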