# Accelerate finetuning of GPT2 model for Language Modeling task using ONNX Runtime Training
This notebook walks through using ONNX Runtime Training in Azure Machine Learning to fine-tune [GPT2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) models. The example fine-tunes the GPT2 PyTorch model maintained at https://github.com/huggingface/transformers.
Specifically, we showcase fine-tuning the [pretrained GPT2-medium](https://huggingface.co/transformers/pretrained_models.html) model, which has 345M parameters, using ONNX Runtime (ORT).

Steps:
- Initialize an AzureML workspace
- Register a datastore to use preprocessed data for training
- Create an AzureML experiment
- Provision a compute target
- Create a PyTorch Estimator
- Configure and Run

Prerequisites
If you are using an Azure Machine Learning [Compute Instance](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-instance), you are all set. Otherwise, set up your environment by installing the AzureML Python SDK before running this notebook. If you haven't already established a connection to the AzureML Workspace, work through the [How to use Estimator in Azure ML](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/how-to-use-estimator/how-to-use-estimator.ipynb) notebook first.

Refer to instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/huggingface-gpt2/README.md before running the steps below.

### Check SDK installation

In [182]:
import os
import requests
import sys
import re

# AzureML libraries
import azureml.core
from azureml.core import Experiment, Workspace, Datastore, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.container_registry import ContainerRegistry
from azureml.core.runconfig import MpiConfiguration, RunConfiguration, DEFAULT_GPU_IMAGE
from azureml.train.dnn import PyTorch
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails

from azure.common.client_factory import get_client_from_cli_profile
from azure.mgmt.containerregistry import ContainerRegistryManagementClient

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.34.0
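The printed version can also be checked programmatically before continuing. The following is a minimal sketch, assuming a hypothetical minimum of 1.34.0 (the version this notebook was run with, not a documented requirement) and version strings that start with numeric `major.minor.patch` components. The helper names are illustrative and are not part of the AzureML SDK.

```python
def version_tuple(version: str) -> tuple:
    """Convert a dotted version string such as '1.34.0' into a tuple of ints.

    Only the first three components are used, so a suffix like '.post1' is ignored.
    """
    return tuple(int(part) for part in version.split(".")[:3])


def meets_minimum(installed: str, minimum: str = "1.34.0") -> bool:
    """Return True when `installed` is at least `minimum` (assumed threshold)."""
    return version_tuple(installed) >= version_tuple(minimum)


# Comparing tuples of ints avoids the string-comparison trap where "1.9.0" > "1.34.0".
print(meets_minimum("1.34.0"))
print(meets_minimum("1.9.0"))
```

In the notebook, the call would be `meets_minimum(azureml.core.VERSION)`.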


### AzureML Workspace setup

In [183]:
# Create or retrieve Azure machine learning workspace
# see https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py
ws = Workspace.get(name="demo", subscription_id='', resource_group='demo')

# Print workspace attributes
print('Workspace name: ' + ws.name, 
      'Workspace region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: demo
Workspace region: westus2
Subscription id: 47c81f7b-f720-4f17-9116-69d540091679
Resource group: demo
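The `Workspace.get` call above leaves `subscription_id` blank. One way to avoid hard-coding workspace details in the notebook is to read them from environment variables. This is a sketch only: the variable names `AML_WORKSPACE_NAME`, `AML_SUBSCRIPTION_ID`, and `AML_RESOURCE_GROUP` and the helper `workspace_kwargs` are assumptions made for illustration, not SDK conventions.

```python
import os


def workspace_kwargs(env=None):
    """Build keyword arguments for Workspace.get from environment variables.

    The variable names below are hypothetical; adapt them to your setup.
    """
    env = os.environ if env is None else env
    required = {
        "name": "AML_WORKSPACE_NAME",
        "subscription_id": "AML_SUBSCRIPTION_ID",
        "resource_group": "AML_RESOURCE_GROUP",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in required.items()}


# Demonstration with an explicit mapping instead of the real environment.
example = workspace_kwargs({
    "AML_WORKSPACE_NAME": "demo",
    "AML_SUBSCRIPTION_ID": "<azure-subscription-id>",
    "AML_RESOURCE_GROUP": "demo",
})
print(example)
```

With the variables set, the workspace could then be retrieved with `ws = Workspace.get(**workspace_kwargs())`.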


### Register Datastore
Before running the step below, data prepared using the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/huggingface-gpt2/README.md should be transferred to an Azure Blob container referenced in the `Datastore` registration step. Refer to the documentation at https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data for details on using data in Azure ML experiments.

In [184]:
# Create a datastore from blob storage containing training data.
# Consult README.md for instructions on downloading and uploading the training data.
#ds = Datastore.register_azure_blob_container(workspace=ws, 
#                                             datastore_name='wikitext',
#                                             account_name='demo1879244313', 
#                                             account_key='',
#                                             container_name='tokenfiles')

In [219]:
ds = Datastore.get(workspace=ws, datastore_name='gpt_wikitext')
# Print datastore attributes
print('Datastore name: ' + ds.name, 
      'Container name: ' + ds.container_name, 
      'Datastore type: ' + ds.datastore_type, 
      'Workspace name: ' + ds.workspace.name, sep = '\n')

Datastore name: gpt_wikitext
Container name: wikitext
Datastore type: AzureBlob
Workspace name: demo


In [187]:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset

train_data = Dataset.get_by_name(name='wikitext_train', workspace=ws)
valid_data = Dataset.get_by_name(name='wikitext_valid', workspace=ws)

print(train_data.name)
print(valid_data.name)


wikitext_train
wikitext_valid


### Create AzureML Compute Cluster
This recipe is supported on Azure Machine Learning using 16 x Standard_NC24rs_v3 or 8 x Standard_ND40rs_v2 VMs. The next step creates an AzureML Compute cluster of Standard_ND40rs_v2 GPU VMs with the specified name if one doesn't already exist in your workspace.

In [188]:
# Create GPU cluster
#gpu_cluster_name = "ortgptfinetune" 
gpu_cluster_name = "cassieb1" 
try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_ND40rs_v2', min_nodes=0, max_nodes=8)
    gpu_compute_target = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_compute_target.wait_for_completion(show_output=True)

Found existing compute target.


### Create Estimator
Notes before running the following step:
* Update the following step to replace both occurrences of `<blob-path-to-training-data>` with the actual path in the datastore to the training data.
* If you followed the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/huggingface-gpt2/README.md to prepare data, make sure that the data and any other files that are not code or config are moved out of the `workspace` directory. Data files should have been moved to a `Datastore` for use in training.
* Replace `<tagged-onnxruntime-gpt-container>` with the tag of the built Docker image pushed to a container registry. Similarly, replace `<azure-subscription-id>` and `<container-registry-resource-group>` with the container registry's subscription ID and resource group.


| VM SKU             | GPU memory (per GPU) | gpu_count | ORT_batch_size |
| ------------------ |:--------------------:|:---------:|:--------------:|
| Standard_ND40rs_v2 | 32 GB                | 8         | 4              |
| Standard_NC24rs_v3 | 16 GB                | 4         | 1              |
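The table can be kept as a small lookup so that the process count and batch size stay consistent with the chosen VM SKU. The sketch below copies the values from the table; the dictionary and helper names are illustrative and not part of the AzureML SDK.

```python
# Values copied from the table above; names are illustrative only.
SKU_SETTINGS = {
    "Standard_ND40rs_v2": {"gpu_memory_gb": 32, "gpu_count": 8, "ort_batch_size": 4},
    "Standard_NC24rs_v3": {"gpu_memory_gb": 16, "gpu_count": 4, "ort_batch_size": 1},
}


def settings_for(vm_size):
    """Return the tested settings for a VM SKU, or raise if the SKU is not in the table."""
    try:
        return SKU_SETTINGS[vm_size]
    except KeyError:
        raise ValueError(f"No tested settings for VM size {vm_size!r}") from None


settings = settings_for("Standard_NC24rs_v3")
print(settings["gpu_count"], settings["ort_batch_size"])
```

The `gpu_count` value would feed `process_count_per_node` in the MPI configuration, and `ort_batch_size` would feed `--per_gpu_train_batch_size`.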



In [189]:
# This directory should contain run_language_modeling.py after the files are copied over per the instructions at https://github.com/microsoft/onnxruntime-training-examples/blob/master/huggingface-gpt2/README.md
#project_folder = 'orttrainer/huggingface-gpt2/transformers/examples'
project_folder = '.'

# Set the MPI configuration.
# process_count_per_node should equal the GPU count of the VM SKU
# (4 on Standard_NC24rs_v3, 8 on Standard_ND40rs_v2; see the table above).
mpi_distr_config = MpiConfiguration(process_count_per_node=4, node_count=1)

experiment = Experiment(ws,'onnxruntime-gpt2')

import uuid
output_id = uuid.uuid1().hex

output_dir = f'/output/{experiment.name}/{output_id}/'
print(output_dir)



/output/onnxruntime-gpt2/b579802c1b1111ecbe6b000d3af6b150/


In [280]:
# Define the script parameters.
# To run training with PyTorch instead of ORT, remove the --ort_trainer flag.
# To run evaluation using PyTorch instead of ORT, use the --do_eval_in_torch flag.
script_params = [
    '--model_type', 'gpt2-medium', 
    '--model_name_or_path', 'gpt2-medium', 
    '--tokenizer_name' , 'gpt2-medium', 
    '--config_name' , 'gpt2-medium', 
    '--do_eval' , '', 
    '--do_train', '', 
    '--path', '/home/azureuser/cloudfiles/data/dataset/train_data_txt/',
    '--train_file' ,'train.txt',
    '--validation_file' , 'valid.txt',
    '--output_dir' , output_dir, 
    '--per_gpu_train_batch_size' , '4', 
    '--per_gpu_eval_batch_size' , '4', 
    '--gradient_accumulation_steps' , '4',
    '--block_size' , '1024', 
    '--weight_decay' , '0.01', 
    '--overwrite_output_dir' , '', 
    '--num_train_epochs' , '5',
    '--ort_trainer' , ''
    ]
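The number of sequences that contribute to each optimizer step follows from the parameters above together with the MPI configuration (`process_count_per_node=4`, `node_count=1`). The arithmetic below is a sketch that assumes one process per GPU and plain data-parallel training, which the source does not state explicitly. The helper name is illustrative.

```python
def effective_batch_size(per_gpu_batch_size, gradient_accumulation_steps,
                         processes_per_node, node_count):
    """Sequences per optimizer step under data-parallel training (assumed setup)."""
    return (per_gpu_batch_size * gradient_accumulation_steps
            * processes_per_node * node_count)


# Values taken from script_params and the MPI configuration in this notebook.
sequences_per_step = effective_batch_size(
    per_gpu_batch_size=4,
    gradient_accumulation_steps=4,
    processes_per_node=4,
    node_count=1,
)
block_size = 1024  # --block_size: maximum tokens per sequence

print(sequences_per_step)               # 4 * 4 * 4 * 1 = 64
print(sequences_per_step * block_size)  # upper bound on tokens per step: 65536
```

Changing the VM SKU or node count changes this product, so the learning-rate schedule may need to be revisited when scaling out.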

In [281]:
import os
# List the files in the mounted path
print(os.listdir("/home/azureuser/cloudfiles/data/dataset/train_data_txt/"))
ds

['train.txt', 'valid.txt']


{
  "name": "gpt_wikitext",
  "container_name": "wikitext",
  "account_name": "demo1879244313",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [276]:
os.path.abspath('/home/azureuser/cloudfiles/data/dataset/train_data_txt/')

'/home/azureuser/cloudfiles/data/dataset/train_data_txt'
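Listing the directory by eye works for two files. A small check can make the same verification explicit before a job is submitted. This is a sketch: the helper `missing_files` is hypothetical, and the temporary directory only stands in for the mounted data path.

```python
import os
import tempfile


def missing_files(data_dir, required=("train.txt", "valid.txt")):
    """Return the required file names that are not present in data_dir."""
    present = set(os.listdir(data_dir))
    return [name for name in required if name not in present]


# Demonstration on a temporary directory that contains only train.txt.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "train.txt"), "w").close()
    print(missing_files(tmp))
```

In the notebook, the call would be `missing_files("/home/azureuser/cloudfiles/data/dataset/train_data_txt/")`, and an empty list would mean both files are in place.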

In [282]:
from azureml.core import Environment
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration

docker_config = DockerConfiguration(use_docker=True)
# Environment registered in the AzureML workspace, built from a custom Docker image
onnxruntime_gpu_env = Environment.get(workspace=ws, name="onnxruntime-gpt")



In [285]:
script_run_config = ScriptRunConfig(
                      source_directory=project_folder,
                      script='run_language_modeling.py',
                      arguments = script_params,
                      #compute
                      compute_target=gpu_compute_target,
                      # custom docker image
                      environment=onnxruntime_gpu_env,
                      #mpi
                      distributed_job_config=mpi_distr_config,
                      docker_runtime_config=docker_config
                      )

### Run AzureML experiment

In [286]:
experiment.submit(script_run_config)


| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| onnxruntime-gpt2 | onnxruntime-gpt2_1632268750_1b1bca3e | azureml.scriptrun | Preparing | Link to Azure Machine Learning studio | Link to Documentation |
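The submitted run is still in the `Preparing` state in the output above. The SDK's `Run.wait_for_completion(show_output=True)` blocks until the run finishes. Alternatively, the status can be polled. The sketch below assumes the run object returned by `experiment.submit(...)` has been kept in a variable and relies only on its `get_status()` method. `FakeRun` is a hypothetical stand-in so the example runs without a workspace, and the set of terminal states is an assumption based on the commonly reported run statuses.

```python
import time

# Assumed terminal statuses for an AzureML run.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(run, poll_seconds=30, max_polls=None, sleep=time.sleep):
    """Poll run.get_status() until a terminal state or until max_polls is reached."""
    polls = 0
    while True:
        status = run.get_status()
        if status in TERMINAL_STATES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return status  # give up and report the last observed status
        sleep(poll_seconds)


class FakeRun:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)


fake = FakeRun(["Preparing", "Running", "Completed"])
print(wait_for_terminal_status(fake, sleep=lambda seconds: None))
```

With a real run, the call would look like `run = experiment.submit(script_run_config)` followed by `wait_for_terminal_status(run)`.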
