## Instruction Fine-Tune GPT-Neo

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and transformer architecture, set the standard for accuracy on NLP tasks. Within just a few years, state-of-the-art NLP models have grown more than 500x in size: OpenAI’s 175-billion-parameter GPT-3 and the similarly sized open-source BLOOM 176B have raised the bar on NLP accuracy. This growth is driven by a simple, empirically demonstrated relationship between model size and accuracy: larger models tend to perform better. With easy access through model zoos such as Hugging Face and improved accuracy on NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, training and deploying them can be a challenge because of their size.

In this notebook, we explore how to fine-tune a large language model, GPT-Neo, on SageMaker using the SageMaker distributed model parallel library.
SageMaker provides distributed training libraries and supports various distributed training options for deep learning tasks such as computer vision (CV) and natural language processing (NLP). With SageMaker’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. To train the GPT-Neo model, we use sharded data parallelism (SDP), a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across the GPUs in a data parallel group.
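
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with an Adam-style optimizer, for which a commonly used rule of thumb is about 16 bytes of training state per parameter (2 for fp16 weights, 2 for fp16 gradients, and 12 for fp32 master weights plus two optimizer moments). The function name `training_state_gb` and the byte count are illustrative assumptions, not values taken from the SageMaker library, and activations and temporary buffers are ignored.

```python
def training_state_gb(num_params, shard_degree=1, bytes_per_param=16):
    """Approximate per-GPU memory (GB) for parameters, gradients, and optimizer state."""
    total_bytes = num_params * bytes_per_param
    # With sharding, each GPU in the group holds only 1/shard_degree of the state.
    return total_bytes / shard_degree / 1e9


gpt_neo_params = 2.7e9  # approximate parameter count of GPT-Neo 2.7B

unsharded = training_state_gb(gpt_neo_params)
sharded = training_state_gb(gpt_neo_params, shard_degree=4)

print(f"replicated on every GPU: {unsharded:.1f} GB")
print(f"sharded across 4 GPUs:   {sharded:.1f} GB per GPU")
```

Under these assumptions the full training state is roughly 43 GB, more than a single 24 GB GPU can hold, while a four-way shard needs roughly 11 GB per GPU before activations are counted.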

## License agreement
 - Review the license at https://github.com/EleutherAI/gpt-neox/blob/main/LICENSE before using the model.
 - This is a sample notebook and is not intended for production use. See the license at https://github.com/aws/mit-0.

#### Let's begin by installing the SageMaker SDK and importing libraries

In [None]:
! pip install -U sagemaker

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch

In [None]:
sess = sagemaker.Session()
# SageMaker session bucket: used for uploading data, models, and logs
# SageMaker creates this bucket automatically if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

#### Data Preparation

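To run the training job, we use the `tatsu-lab/alpaca` instruction dataset from Hugging Face Datasets. To keep the job short, we take the first 5,000 records for training and the next 2,000 for validation, and write both splits to CSV files.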

In [None]:
! pip install datasets

In [None]:
from datasets import load_dataset

instruction_data = load_dataset('tatsu-lab/alpaca')

In [None]:
import pandas as pd

instructionDF = pd.DataFrame(instruction_data["train"])

In [None]:
train_df = instructionDF.iloc[:5000]
valid_df = instructionDF.iloc[5000:7000]

train_df.to_csv("train.csv",index=False)
valid_df.to_csv("valid.csv",index=False)
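
Each record in this dataset carries an `instruction`, an optional `input`, and an `output`. The sketch below shows one way such a record could be flattened into a single training string. The helper name `build_prompt` and the template wording are hypothetical; the actual `scripts/train.py` may format records differently.

```python
def build_prompt(row):
    """Turn one instruction record into a single training string (hypothetical template)."""
    instruction = row["instruction"].strip()
    response = row["output"].strip()
    # Empty inputs come back from a CSV as NaN (a float), so only keep real strings.
    context = row.get("input")
    context = context.strip() if isinstance(context, str) else ""
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{response}"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


# Made-up record for illustration; real rows come from train.csv.
example = {
    "instruction": "Translate the word to French.",
    "input": "cat",
    "output": "chat",
}
print(build_prompt(example))
```

Records without an `input` simply omit the middle section, so the model sees the same layout for both kinds of example.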

#### Upload the training and validation data to S3

In [None]:
train_data_url = sess.upload_data(
    path="train.csv",
    key_prefix="alpaca",
)

valid_data_url = sess.upload_data(
    path="valid.csv",
    key_prefix="alpaca",
)

In [None]:
print(f"training file path {train_data_url}")
print(f"validation file path {valid_data_url}")

### Train Model

Now we are ready to run the training job using a SageMaker estimator. The SageMaker PyTorch estimator requires a training script to run a model training job. Below is the script for fine-tuning a pretrained Hugging Face GPT-Neo model on the dataset we just uploaded to S3.

In [None]:
!pygmentize ./scripts/train.py

In [None]:
hyperparameters = {}
SM_DATA_DIR = "/opt/ml/input/data" 

hyperparameters["model_name_or_path"] = "EleutherAI/gpt-neo-2.7B"
hyperparameters["checkpoint_dir"] = "/opt/ml/checkpoints"
hyperparameters["train_file"] = f"{SM_DATA_DIR}/train/train.csv"
hyperparameters["validation_file"] = f"{SM_DATA_DIR}/valid/valid.csv"
hyperparameters["per_device_train_batch_size"] = 1
hyperparameters["per_device_eval_batch_size"] = 1
hyperparameters["block_size"] = 2048
hyperparameters["num_train_epochs"] = 2
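
SageMaker passes each entry of the `hyperparameters` dictionary to the entry point script as a `--key value` command-line argument. The following is a minimal sketch of how a training script could read them with `argparse`. The function `parse_args` and its defaults are illustrative assumptions; the real `scripts/train.py` may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that arrive as --key value arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str, required=True)
    parser.add_argument("--checkpoint_dir", type=str, default="/opt/ml/checkpoints")
    parser.add_argument("--train_file", type=str, required=True)
    parser.add_argument("--validation_file", type=str, required=True)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=1)
    parser.add_argument("--block_size", type=int, default=2048)
    parser.add_argument("--num_train_epochs", type=int, default=2)
    # parse_known_args tolerates extra arguments that a launcher may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the argument list that the hyperparameters above would produce.
args = parse_args([
    "--model_name_or_path", "EleutherAI/gpt-neo-2.7B",
    "--train_file", "/opt/ml/input/data/train/train.csv",
    "--validation_file", "/opt/ml/input/data/valid/valid.csv",
    "--block_size", "2048",
])
print(args.model_name_or_path, args.block_size, args.num_train_epochs)
```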

##### Store model files as checkpoints for easy deployment


In [None]:

checkpoint_dir = "/opt/ml/checkpoints"
checkpoint_s3_path = "s3://" + sess.default_bucket() + "/gptneo-checkpoints"
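
SageMaker syncs files written to the local checkpoint directory to the S3 checkpoint location during training, and copies existing S3 checkpoints back into that directory when a job starts. A training script can use this to resume work. The sketch below assumes checkpoints are saved in subfolders named `checkpoint-<step>`; that naming convention and the helper `latest_checkpoint` are assumptions for illustration, and the actual training script may organize its files differently.

```python
import os
import re
import tempfile


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered 'checkpoint-<step>' folder, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        path = os.path.join(checkpoint_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstrate on a temporary folder instead of /opt/ml/checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1500, 1000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    resume_from = latest_checkpoint(tmp)
    print(os.path.basename(resume_from))
```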

#### Setup params for Sharded Data Parallel (SDP)

In [None]:
smp_options = {
    "enabled": True,
    "parameters": {                         # Required
        "pipeline_parallel_degree": 1,      # Required
        "ddp": True,
        "ddp_dist_backend": "auto",
        # parameters for sharded data parallelism
        "sharded_data_parallel_degree": 4,  # Add this to activate sharded data parallelism
        "partitions": 1,
        "offload_activations": True,
        "fp16": True,
        "skip_tracing": True,
    },
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 4               # Required
}
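
These settings have to agree with each other: `processes_per_host` sets how many processes are launched on each instance, and the sharding degree sets how many of them share one copy of the training state. Below is a minimal sanity check, assuming that the pipeline degree should divide the total process count and that the sharding degree should divide the resulting data parallel size. This rule is illustrative and is not a statement of the library's exact validation logic. The function `check_sharding_config` and the example dictionaries are hypothetical.

```python
def check_sharding_config(smp_options, mpi_options, instance_count=1):
    """Return the total process count after basic consistency checks (assumed rules)."""
    params = smp_options["parameters"]
    world_size = mpi_options["processes_per_host"] * instance_count
    pipeline = params.get("pipeline_parallel_degree", 1)
    sharding = params.get("sharded_data_parallel_degree", 1)
    if world_size % pipeline != 0:
        raise ValueError("pipeline_parallel_degree must divide the number of processes")
    data_parallel_size = world_size // pipeline
    if data_parallel_size % sharding != 0:
        raise ValueError("sharded_data_parallel_degree must divide the data parallel size")
    return world_size


# Example values mirroring the configuration in this notebook.
example_smp_options = {
    "enabled": True,
    "parameters": {"pipeline_parallel_degree": 1, "sharded_data_parallel_degree": 4},
}
example_mpi_options = {"enabled": True, "processes_per_host": 4}

print(check_sharding_config(example_smp_options, example_mpi_options))
```

With one instance and four processes per host, a sharding degree of 4 means all four GPUs form a single sharding group.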

#### Start the training job
We use an ml.g5.12xlarge instance, which has four GPUs, to shard the model states and run the training.

In [None]:

base_job_name = "gpt-neo-instruction-fine-tuning"
estimator = PyTorch(
    base_job_name=base_job_name,
    source_dir="./scripts",
    entry_point="train.py",
    role=role,
    framework_version="2.0.0",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    hyperparameters=hyperparameters,
    checkpoint_local_path=checkpoint_dir,
    checkpoint_s3_uri=checkpoint_s3_path,
    disable_profiler=True,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    }

)

In [None]:
estimator.fit({"train":train_data_url,"valid":valid_data_url})
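
The channel names passed to `fit` determine where the data appears inside the training container: each channel is made available under `/opt/ml/input/data/<channel_name>`. This is why the `train_file` and `validation_file` hyperparameters point at the `train` and `valid` subfolders. The sketch below illustrates the mapping, assuming the default File input mode and a single file per channel. The helper `local_channel_path` and the bucket name are hypothetical.

```python
import posixpath

SM_DATA_DIR = "/opt/ml/input/data"


def local_channel_path(channel_name, s3_uri):
    """Path inside the training container where a single-file channel is expected."""
    filename = posixpath.basename(s3_uri)
    return posixpath.join(SM_DATA_DIR, channel_name, filename)


# Placeholder bucket name for illustration only.
inputs = {
    "train": "s3://example-bucket/alpaca/train.csv",
    "valid": "s3://example-bucket/alpaca/valid.csv",
}
for channel, uri in inputs.items():
    print(channel, "->", local_channel_path(channel, uri))
```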

#### Store the checkpoint path to reuse in the deploy notebook

In [None]:
%store checkpoint_s3_path