# (OPTIONAL): Convert pre-trained weights with tensor parallelism for Continuous Pre-training

<div class="alert alert-block alert-warning"> 

<b>NOTE: This notebook is optional.</b> Run it only if you want to experiment with the **continuous pretraining** process on Neuronx. To proceed with the full pretraining process for Llama2 70B on Neuronx instead, skip this notebook and go directly to `Notebook 2`.
    
**Continuous pretraining is a technique where we take a pre-trained model and continue training it on additional data to further improve its performance.**
</div>

Before starting the continuous pre-training process, we need to download the pre-trained weights for the [Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b-hf) model from Hugging Face. This notebook combines two parallelism techniques: [Pipeline Parallelism and Tensor Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-ranking-mechanism.html). Using these techniques, we convert the pre-trained weights into the .pt (PyTorch) weights format, which is used for the continuous pre-training process in `Notebook 2`.

Pipeline Parallelism divides a deep neural network into multiple stages, each made up of a group of consecutive layers, and executes each stage on a different device, such as a GPU or an AWS Trainium accelerator. Distributing the workload this way lets several devices share the model. Tensor Parallelism, on the other hand, splits individual tensors (multidimensional arrays) of the neural network across multiple devices. This technique is particularly useful for models with large tensors that cannot fit into the memory of a single device.
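To make the tensor-splitting idea concrete, here is a minimal pure-Python sketch of a column-parallel linear layer. The helper names `matmul` and `split_columns` and the toy sizes are hypothetical and are not part of any Neuron or SageMaker API; real frameworks perform the same split on device tensors and use collective communication to gather the results.

```python
# Illustrative only: a tiny column-parallel linear layer in pure Python.
# `matmul` and `split_columns` are hypothetical helpers, not Neuron APIs.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, tp_size):
    """Split the columns of w into tp_size equal shards, one per device."""
    cols = len(w[0])
    assert cols % tp_size == 0, "column count must be divisible by tp_size"
    width = cols // tp_size
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(tp_size)]

x = [1.0, 2.0]                      # input activation
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]          # full weight matrix (2 x 4)

shards = split_columns(w, tp_size=2)            # each "device" holds a 2 x 2 shard
partial_outputs = [matmul(x, s) for s in shards]
# Concatenating the per-device outputs plays the role of an all-gather.
gathered = [v for part in partial_outputs for v in part]

full = matmul(x, w)                 # reference result from the unsplit matrix
print(gathered == full)
```

Each shard holds only half of the weight matrix, yet the concatenated outputs match the result of the unsplit layer.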

By combining Pipeline Parallelism and Tensor Parallelism, we can handle the large size of the Llama 2 70B model and convert its pre-trained weights into a format (.pt) that the continuous pre-training process can load.
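A rough back-of-the-envelope calculation shows why the model has to be split. The sketch below assumes a nominal 70 billion parameters and 16-bit weights (2 bytes per parameter), and it assumes the weights divide evenly across shards, which is only an approximation. The tensor-parallel and pipeline-parallel sizes of 8 are the values this notebook sets later in its hyperparameters.

```python
# Back-of-the-envelope sizing. All constants are assumptions except tp/pp sizes.
num_params = 70e9        # nominal parameter count of a 70B model (assumption)
bytes_per_param = 2      # assumes 16-bit weights such as bfloat16 (assumption)
tp_size, pp_size = 8, 8  # values used for the conversion job in this notebook

total_gb = num_params * bytes_per_param / 1e9
num_shards = tp_size * pp_size
per_shard_gb = total_gb / num_shards  # assumes an even split across shards

print(f"full weights: ~{total_gb:.0f} GB")
print(f"shards: {num_shards}, ~{per_shard_gb:.2f} GB each")
```

Under these assumptions the full weights are far larger than a single accelerator's memory, while each of the tensor-parallel by pipeline-parallel shards is only a few gigabytes.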

## Prerequisites

---
This Jupyter Notebook can be run on an `ml.t3.medium` instance. However, to execute the training job that prepares the pre-trained weights for the continuous pre-training process, you may need to request a quota increase. The number of instances to request depends on how quickly you want the training job to complete. The range is between **8** and **32** instances.

To request a quota increase, follow these steps:

1. Navigate to the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
2. Choose Amazon SageMaker.
3. Review your default quota for the following resources:
   - `ml.trn1.32xlarge` for training job usage
   - `ml.trn1.32xlarge` for training warm pool usage
   - `Maximum number of instances per training job`

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.
</div>

## Contents

---
The example has the following main sections:

1. [Requirements](#Requirements)
2. [Setup](#Setup)
3. [Training job parameters](#Training-job-parameters)
4. [Run training job](#Run-training-job)

## Requirements
---

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose ml.t3.medium.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b> For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select kernel "<span style="color:green;">Base Python 3.0</span>"

</div>

To run this notebook, you need to install the following dependencies:

In [None]:
!pip install -U sagemaker boto3 --quiet

## Setup
---

In [None]:
import sagemaker

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region_name = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region_name}")

In [None]:
# Define checkpoint directory that will contain the weights and other relevant data for the trained model
checkpoint_s3_uri = "s3://" + sagemaker_session_bucket + "/neuronx_llama_experiment"
print(checkpoint_s3_uri)

In [None]:
# Use store magic to save the checkpoint s3 directory to use in subsequent notebooks.
%store checkpoint_s3_uri

In [None]:
# Docker image for training models on AWS Trainium
docker_image = f"763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.0-ubuntu20.04"

For more details about Neuron Docker images, see:
- [AWS Neuron Deep Learning Containers](https://github.com/aws-neuron/deep-learning-containers/tree/main)
- [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

Replace the placeholder below with your own Hugging Face [access token](https://huggingface.co/docs/hub/en/security-tokens) so that the job can download the model weights:

In [None]:
access_token = "hf_xxxx"
model_name = "meta-llama/Llama-2-70b-chat-hf"

## Training job parameters
---

Hyperparameters for converting the pre-trained weights of the Llama2 70B model:

In [None]:
hyperparameters = {}
hyperparameters["access_token"] = access_token
hyperparameters["model_name"] = model_name
hyperparameters["tp_size"] = 8
hyperparameters["pp_size"] = 8

In [None]:
# Use the SageMaker S3 checkpoint mechanism since we need read/write access to the paths.
hyperparameters["output_dir"] = "/opt/ml/checkpoints/llama70b_weights"
hyperparameters["checkpoint-dir"] = '/opt/ml/checkpoints'
hyperparameters["n_layers"] = 80
hyperparameters["convert_from_full_model"] = ""
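As a sanity check on the values above, the sketch below shows one plausible way 80 transformer layers could be divided into 8 pipeline stages. The helper `layers_per_stage` is hypothetical, and the even, contiguous assignment is an assumption: the actual `convert_checkpoints.py` script may partition layers differently.

```python
# Hypothetical helper: the real convert_checkpoints.py may assign layers differently.
def layers_per_stage(n_layers, pp_size):
    """Return contiguous layer index ranges, one per pipeline stage."""
    if n_layers % pp_size != 0:
        raise ValueError("this sketch assumes n_layers is divisible by pp_size")
    per_stage = n_layers // pp_size
    return [range(s * per_stage, (s + 1) * per_stage) for s in range(pp_size)]

# Values taken from the hyperparameters defined in this notebook.
stages = layers_per_stage(n_layers=80, pp_size=8)
for stage_id, layer_range in enumerate(stages):
    print(f"stage {stage_id}: layers {layer_range.start}-{layer_range.stop - 1}")
```

With these values each stage would hold 10 consecutive layers, and each stage's tensors would in turn be split across `tp_size` tensor-parallel ranks.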

## Run training job
---

[PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) for running a job on Amazon SageMaker:

In [None]:
from sagemaker.pytorch import PyTorch

# Handle end-to-end Amazon SageMaker training and deployment tasks.
# NOTE: Multinode with torchrun is a work in progress. Use a single node.
estimator = PyTorch(
    base_job_name="neuronx-llama-download-model-weights",
    source_dir="./scripts",
    entry_point="convert_checkpoints.py",
    role=role,
    image_uri=docker_image,
    instance_count=1,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sess,
    volume_size=1024,
    hyperparameters=hyperparameters,
    debugger_hook_config=False,
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=hyperparameters["checkpoint-dir"],
    disable_output_compression=True,
    keep_alive_period_in_seconds=600
)

In [None]:
# Start SageMaker job
estimator.fit()

# Thank You!