# Pre-training and Continuous Pre-training for Llama2 70B on Amazon SageMaker using AWS Trainium instances

<div class="alert alert-block alert-info"> 

<b>NOTE:</b> This notebook can be used for both pretraining and continuous pretraining of large language models.

Pretraining: Pretraining involves training all parameters of a model on a large dataset to improve the model's performance and teach it more about the problem domain.

Continuous Pretraining: Continuous pretraining refers to the process of periodically updating the model's parameters with new data during training. This allows the model to adapt and improve over time as it encounters new examples and scenarios.
</div>

This example provides a walkthrough of pretraining and continuous pretraining for the [Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b-hf) model using [NeuronX](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html) Distributed on AWS Trainium instances.

NeuronX Distributed is a package that supports different distributed training and inference mechanisms for Neuron devices. It provides XLA-friendly implementations of some of the more popular distributed training and inference techniques. The library can be installed via pip.

Running continuous pre-training requires that you have first executed `Notebook 1` to obtain the path of the `pre-trained weights`. These weights serve as the checkpoint, or starting point, for the continual pretraining process. Continued pretraining takes a pre-trained model (in this case, the pre-trained weights) and trains it further on additional data to improve its knowledge, capabilities, and generalization across tasks and specific domains.

## Prerequisites

---
This Jupyter Notebook can be run on an `ml.t3.medium` instance. However, to execute the training job that prepares the pre-trained weights for the continuous pre-training process, you may need to request a quota increase. The number of instances to request depends on how quickly you want the training job to complete. The range is between **8** and **32** instances.

To request a quota increase, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
   - `ml.trn1.32xlarge` for training job usage
   - `ml.trn1.32xlarge` for training warm pool usage
   - `Maximum number of instances per training job`

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon SageMaker service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.
</div>
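
As a rough illustration of step 3, the applied quota values can also be read programmatically. The sketch below assumes the Service Quotas API is reached through a boto3 `service-quotas` client and that quota names contain the instance type as listed above. `find_sagemaker_quotas` and `FakeQuotaClient` are hypothetical names introduced here, and the stub values are invented so that the example runs without AWS credentials.

```python
def find_sagemaker_quotas(client, name_fragment):
    """Return {quota name: value} for SageMaker quotas whose name contains name_fragment."""
    matches = {}
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        for quota in page["Quotas"]:
            if name_fragment in quota["QuotaName"]:
                matches[quota["QuotaName"]] = quota["Value"]
    return matches


class FakeQuotaClient:
    """Hypothetical stand-in for boto3.client("service-quotas"); values are invented."""

    def get_paginator(self, operation_name):
        assert operation_name == "list_service_quotas"
        return self

    def paginate(self, ServiceCode):
        # Two pages, to exercise the pagination loop
        yield {"Quotas": [
            {"QuotaName": "ml.trn1.32xlarge for training job usage", "Value": 8.0},
            {"QuotaName": "ml.t3.medium for notebook instance usage", "Value": 20.0},
        ]}
        yield {"Quotas": [
            {"QuotaName": "ml.trn1.32xlarge for training warm pool usage", "Value": 0.0},
        ]}


demo = find_sagemaker_quotas(FakeQuotaClient(), "ml.trn1.32xlarge")
for name, value in demo.items():
    print(f"{name}: {value:.0f}")
```

With real credentials, passing `boto3.client("service-quotas")` in place of the stub should return the quotas applied to your account in the client's region.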

## Contents

---
The example has the following main sections:

1. [Install required packages](#Install-required-packages)
2. [Download tokenizer and model](#Download-tokenizer-and-model)
3. [Download training dataset](#Download-training-dataset)
4. [Tokenize the data using Llama2 tokenizer](#Tokenize-the-data-using-Llama2-tokenizer)
5. [Upload data to S3](#Upload-data-to-S3)
6. [Run the training job](#Run-the-training-job)
7. [Terminate the warmpool](#Terminate-the-warmpool)

## Requirements
---

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose ml.t3.medium.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select Kernel "<span style="color:green;">Python 3 (ipykernel)</span>".

- For <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank">Amazon SageMaker Studio Classic</a>, select Image "<span style="color:green;">Base Python 3.0</span>" and Kernel "<span style="color:green;">Python 3</span>".

</div>

To run this notebook, install the following dependencies:

In [None]:
!pip install -U sagemaker==2.218.0 boto3==1.34.97 botocore==1.34.97 --force --quiet
!pip install transformers==4.40.1 datasets[s3]==2.19.0 --quiet

<div class="alert alert-block alert-warning"> 

<b>IMPORTANT:</b> You may need to restart the kernel before continuing so that the libraries installed above are recognized by the remainder of the code. Ignore this message if the libraries were previously installed in `Notebook 1`.
</div>

## Setup
---

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- To enable the **continuous pretraining hyperparameters** for Llama2 70B, set **is_continuous_pretraining** to "<span style="color:green;">True</span>"

- To enable the **full pretraining hyperparameters** for Llama2 70B, set **is_continuous_pretraining** to "<span style="color:green;">False</span>"

</div>

In [None]:
is_continuous_pretraining = False

### Download tokenizer and model

Replace the placeholder below with your own [access token](https://huggingface.co/docs/hub/en/security-tokens) so that the tokenizer can be downloaded.

In [None]:
from huggingface_hub.hf_api import HfFolder
access_token = "hf_xxxx"
HfFolder.save_token(access_token)

In [None]:
from transformers import AutoTokenizer
tokenizer_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
block_size = 4096

### Download training dataset

For this example, we download the [wikicorpus data](https://huggingface.co/datasets/wikicorpus) from Hugging Face Datasets.

In [None]:
from datasets import load_dataset
from itertools import chain
import os

# Dataset
dataset_name = "wikicorpus"
dataset_config_name = "raw_en"

# Create cache directory
save_path = "data/wikicorpus_llama2_7B_tokenized_4k"
save_path = os.path.expanduser(save_path)

if not os.path.exists(save_path):
    os.makedirs(save_path)

# Download wikicorpus data
raw_datasets = load_dataset(
    dataset_name,
    dataset_config_name,
    cache_dir=save_path,
    trust_remote_code=True
)

column_names = raw_datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]

### Tokenize the data using Llama2 tokenizer

Tokenize the training dataset with the Llama 2 tokenizer, group the tokens into fixed-size blocks of `block_size`, and then upload the result to S3 for use during training.

In [None]:
# Tokenize training dataset
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])


tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
)

if block_size > tokenizer.model_max_length:
    print("block_size > tokenizer.model_max_length")
block_size = min(block_size, tokenizer.model_max_length)


# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, and if the total_length < block_size  we exclude this batch and return an empty dict.
    # We could add padding if the model supported it instead of this drop, you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    load_from_cache_file=True,
    desc=f"Grouping texts in chunks of {block_size}",
)

# Final training dataset
train_dataset = lm_datasets["train"]
print(len(train_dataset))
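
To make the chunking step easier to follow, here is a minimal sketch of the same concatenate-and-split logic applied to a toy batch. It uses made-up token ids and a small block size of 4 instead of 4096; `group_into_blocks` and `toy_batch` are illustrative names and are not part of the notebook.

```python
from itertools import chain


def group_into_blocks(examples, block_size):
    """Concatenate every column, then split it into equal blocks of block_size."""
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a complete block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs
    result["labels"] = result["input_ids"].copy()
    return result


# Three "documents" with 3 + 4 + 3 = 10 tokens in total (ids are invented)
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}

blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Document boundaries are ignored when the blocks are formed, and the last two tokens (9 and 10) are discarded because they do not fill a whole block.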

### Upload data to S3

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region_name = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region_name}")

In [None]:
# save training dataset to s3
training_input_path = f's3://{sess.default_bucket()}/neuronx_distributed/data'
print(f"uploading training dataset to: {training_input_path}")
train_dataset.save_to_disk(training_input_path)

print(f"uploaded data to: {training_input_path}")

At this point we should have the data uploaded to S3 and ready to kick start the training job.

### Run the training job

For the training job we will use Trn1.32xlarge instances. Each Trn1.32xlarge instance has 32 NeuronCores, and we use tensor parallelism and pipeline parallelism to shard the model across NeuronCores for training. The cell below provides the basic settings for pretraining Llama 2 70B on Trn1.

*Note: Change the number of instances within the cluster to increase or decrease the job execution time. The range is between **8** and **32** `Trn1.32xlarge instances`*.

In [None]:
# Number of neuron cores per instance
PROCESSES_PER_NODE = 32

# Number of instances within the cluster, change this if you want to tweak the instance_count parameter
WORLD_SIZE = 8

# Global batch size
GBS = 512

# Input sequence length
SEQ_LEN = 4096

# Pipeline parallel degree
PP_DEGREE = 8

# Tensor parallel degree
TP_DEGREE = 8

# Data parallel size
DP = ((PROCESSES_PER_NODE * WORLD_SIZE / TP_DEGREE / PP_DEGREE))

# Batch size per model replica
BS = ((GBS / DP))

# Number microbatches for pipeline execution. Setting same as BS so each microbatch contains a single datasample
NUM_MICROBATCHES = BS

# Number of total steps for which to train model. This number should be adjusted to the step number when the loss function is approaching convergence.
MAX_STEPS = 1500

# Timeout in seconds for training. After this amount of time Amazon SageMaker terminates the job regardless of its current status.
MAX_RUN = 2 * (24 * 60 * 60)
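
The relationship between these settings can be checked with a short worked example. The sketch below repeats the arithmetic from the cell above using integer division and adds divisibility checks; `derive_batch_layout` is a hypothetical helper and the tokens-per-step figure is an extra derived value, neither of which appears in the notebook.

```python
def derive_batch_layout(processes_per_node, num_instances, tp_degree, pp_degree,
                        global_batch_size, seq_len):
    """Derive data-parallel size and per-replica batch size from the cluster shape."""
    total_cores = processes_per_node * num_instances
    cores_per_replica = tp_degree * pp_degree
    if total_cores % cores_per_replica != 0:
        raise ValueError("total cores must be divisible by TP_DEGREE * PP_DEGREE")
    dp = total_cores // cores_per_replica
    if global_batch_size % dp != 0:
        raise ValueError("global batch size must be divisible by the data parallel size")
    batch_per_replica = global_batch_size // dp
    return {
        "data_parallel_size": dp,
        "batch_per_replica": batch_per_replica,
        # One sample per microbatch, mirroring NUM_MICROBATCHES = BS above
        "num_microbatches": batch_per_replica,
        "tokens_per_step": global_batch_size * seq_len,
    }


layout_8 = derive_batch_layout(32, 8, 8, 8, 512, 4096)
layout_32 = derive_batch_layout(32, 32, 8, 8, 512, 4096)
print(layout_8)
print(layout_32)
```

With 8 instances there are 256 cores and 64 cores per model replica, so the data parallel size is 4 and each replica processes 128 samples per step. With 32 instances the data parallel size grows to 16 and the per-replica batch shrinks to 32, while the global batch stays at 512 sequences (2,097,152 tokens) per step.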

Hyperparameters for pre-training Llama2 70B model

In [None]:
hyperparameters = {}
hyperparameters["train_batch_size"] = int(BS)
hyperparameters["use_meta_device_init"] = 1
hyperparameters["training_dir"] = "/opt/ml/input/data/train" # path where sagemaker uploads the training data
hyperparameters["training_config"] = "config.json" # config file containing the Llama 70B configuration; change this to adjust the number of parameters.
hyperparameters["max_steps"] = MAX_STEPS
hyperparameters["seq_len"] = SEQ_LEN
hyperparameters["pipeline_parallel_size"] = PP_DEGREE
hyperparameters["tensor_parallel_size"] = TP_DEGREE
hyperparameters["num_microbatches"] = int(NUM_MICROBATCHES)
hyperparameters["lr"] = 0.00015
hyperparameters["min_lr"] = 1e-05
hyperparameters["beta1"] = 0.9
hyperparameters["beta2"] = 0.95
hyperparameters["weight_decay"] = 0.1
hyperparameters["warmup_steps"] = 2000
hyperparameters["constant_steps"] = 0
hyperparameters["use_zero1_optimizer"] = 1
hyperparameters["tb_dir"] = "/opt/ml/checkpoints/tensorboard" # The tensorboard logs will be stored here and eventually pushed to S3.
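
To get a feel for how `lr`, `min_lr`, `warmup_steps`, and `constant_steps` might interact, the sketch below evaluates one common schedule: linear warmup to the peak rate, an optional constant phase, then cosine decay to the minimum rate. This schedule shape is an assumption made for illustration. The schedule actually implemented in `run_llama_nxd.py` is not shown in this notebook and may differ, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, peak_lr, min_lr, warmup_steps, constant_steps, max_steps):
    """Learning rate under an ASSUMED linear-warmup, constant, cosine-decay schedule."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 to peak_lr
        return peak_lr * step / warmup_steps
    if step < warmup_steps + constant_steps:
        return peak_lr
    decay_steps = max_steps - warmup_steps - constant_steps
    if decay_steps <= 0:
        return peak_lr
    progress = min(1.0, (step - warmup_steps - constant_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Values taken from the hyperparameters cell above
for step in (0, 500, 1000, 1500):
    print(step, lr_at_step(step, 0.00015, 1e-05, 2000, 0, 1500))
```

Under this assumed schedule, the values above (`warmup_steps` of 2000 with `MAX_STEPS` of 1500) mean the run ends during warmup at roughly 75% of the peak learning rate, so the two settings are worth reviewing together when changing the step count.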

Hyperparameters for continually pre-training Llama2 70B model

In [None]:
if is_continuous_pretraining:
    hyperparameters["checkpoint_dir"] = "/opt/ml/checkpoints/checkpts"
    hyperparameters["checkpoint_freq"] = 10
    hyperparameters["num_kept_checkpoint"] = 1
    hyperparameters["use_zero1_optimizer"] = 1
    hyperparameters["save_load_xser"] = 0
    hyperparameters["pretrained_weight_dir"] = "/opt/ml/checkpoints/llama70b_weights"

In [None]:
# Docker image for training models on AWS Trainium
docker_image = f"763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.0-ubuntu20.04"

For more details about Neuron Docker images:
- [AWS Neuron Deep Learning Containers](https://github.com/aws-neuron/deep-learning-containers/tree/main0)
- [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

In [None]:
# Retrieve the checkpoint s3 uri in Notebook 1
%store -r checkpoint_s3_uri 

In [None]:
if 'checkpoint_s3_uri' not in vars():
    # Nothing was restored from Notebook 1, so fall back to a default checkpoint directory
    # that will contain the weights and other relevant data for the trained model.
    print("The variable checkpoint_s3_uri was not restored from Notebook 1. If you are running the continuous pretraining process, set checkpoint_s3_uri to the value used in Notebook 1. Otherwise, the default location below is used.")
    checkpoint_s3_uri = "s3://" + sagemaker_session_bucket + "/neuronx_llama_experiment"
    # Use store magic to save the checkpoint s3 directory to use in subsequent notebooks.
    %store checkpoint_s3_uri 
print(checkpoint_s3_uri)

In [None]:
import time

# Define Training Job Name
job_name = f'llama-neuron-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
checkpoint_dir = '/opt/ml/checkpoints'
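
SageMaker training job names are limited to 63 characters and may contain only alphanumeric characters and non-leading, non-trailing hyphens. The sketch below checks a name built the same way as `job_name` against that pattern; `is_valid_job_name` is a hypothetical helper. Because the name is passed as `base_job_name`, the SDK may append its own suffix, so it helps to leave some headroom below the limit.

```python
import re
import time

# Pattern documented for TrainingJobName in the CreateTrainingJob API
_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def is_valid_job_name(name):
    """Return True if name satisfies the SageMaker training job name constraints."""
    return _JOB_NAME_PATTERN.fullmatch(name) is not None


candidate = f'llama-neuron-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
print(candidate, len(candidate), is_valid_job_name(candidate))
```

Names containing underscores or dots, such as `llama_neuron`, would be rejected by this check.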

In [None]:
# Define neuron cache directory
cache_dir = "/opt/ml/checkpoints/neuron_cache"

In [None]:
# Environment variables to be set for use during training job 
env = {}
env['FI_PROVIDER'] = 'efa'
env['NCCL_PROTO'] = 'simple'
env['FI_EFA_USE_DEVICE_RDMA'] = '1'
env['RDMAV_FORK_SAFE'] = '1'
env['FI_EFA_FORK_SAFE'] = '1'
env['NEURON_FUSE_SOFTMAX'] = '1'
env['MALLOC_ARENA_MAX'] = '128'
env['XLA_DOWNCAST_BF16'] = '1'
env['NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS'] = '5'
env['NCCL_SOCKET_IFNAME'] = '^lo,docker'
env['NEURON_CC_FLAGS'] = "--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=" + cache_dir

[PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) for training a job on Amazon SageMaker:

In [None]:
from sagemaker.pytorch import PyTorch

# Handle end-to-end Amazon SageMaker training and deployment tasks.
pt_estimator = PyTorch(
    entry_point='run_llama_nxd.py',
    source_dir='./scripts',
    instance_type="ml.trn1.32xlarge",
    image_uri=docker_image,
    instance_count=WORLD_SIZE,
    max_run=MAX_RUN,
    hyperparameters=hyperparameters,
    role=role,
    base_job_name=job_name,
    environment=env,
    input_mode="FastFile",
    disable_output_compression=True,
    keep_alive_period_in_seconds=600, # this is added to enable warm pool capability
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=checkpoint_dir,
    distribution={"torch_distributed": {"enabled": True}} # enable torchrun 
)

In [None]:
# Start training job
pt_estimator.fit({"train": training_input_path})

## Terminate the warmpool

Execute the cell below to terminate the warm pool if you no longer need it.

In [None]:
sess.update_training_job(pt_estimator.latest_training_job.job_name, resource_config={"KeepAlivePeriodInSeconds":0})

In this example, we looked at how to pretrain and continually pretrain the Llama2 70B model using Amazon SageMaker training jobs on AWS Trainium instances.

# Thank You!