## Convert pre-trained weights with tensor parallelism for Fine-tuning

Before we start the fine-tuning process, we need to download the pre-trained weights for the [Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b-hf) model. In this notebook, we use a combination of two parallelism techniques: [Pipeline Parallelism and Tensor Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-ranking-mechanism.html). The pre-trained weights are converted into `.pt` (PyTorch) weight files using the tensor-parallel and pipeline-parallel sizes configured below. The converted weights are then used for fine-tuning the model in `Notebook 2`.

Pipeline Parallelism is a technique that divides a deep neural network into multiple stages, each made up of one or more consecutive layers, where each stage is executed on a different device (e.g., GPU). This approach distributes both the memory footprint and the workload of the model across multiple devices.

Tensor Parallelism, on the other hand, splits the tensors (multidimensional arrays) of the neural network across multiple devices. This technique is particularly useful for models with large tensors that cannot fit into the memory of a single device.

By combining these two parallelism techniques, we can handle the large size of the Llama 2 70B model and convert its pre-trained weights into `.pt` files for the fine-tuning process in `Notebook 2`.
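
The two ideas above can be illustrated with a toy sketch in plain Python. It is not the conversion script used later in this notebook: the helper names `split_columns` and `assign_layers` are hypothetical, the matrix is tiny, and real frameworks may split along rows or columns depending on the layer.

```python
def split_columns(matrix, tp_size):
    """Tensor parallelism (toy): split a matrix column-wise into tp_size shards."""
    n_cols = len(matrix[0])
    assert n_cols % tp_size == 0, "columns must divide evenly across ranks"
    width = n_cols // tp_size
    return [
        [row[rank * width:(rank + 1) * width] for row in matrix]
        for rank in range(tp_size)
    ]


def assign_layers(n_layers, pp_size):
    """Pipeline parallelism (toy): give each stage a contiguous block of layers."""
    assert n_layers % pp_size == 0, "layers must divide evenly across stages"
    per_stage = n_layers // pp_size
    return [
        list(range(stage * per_stage, (stage + 1) * per_stage))
        for stage in range(pp_size)
    ]


weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]

shards = split_columns(weight, tp_size=2)      # one shard per tensor-parallel rank
stages = assign_layers(n_layers=8, pp_size=4)  # one layer block per pipeline stage

print(shards)
print(stages)
```

Each tensor-parallel rank holds only its slice of the matrix, and each pipeline stage holds only its block of layers, so no single device needs the full model in memory.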

### Contents

The example has the following main sections:

- [Install required packages](#Install-required-packages)
- [Download and prepare pre-trained weights for fine-tuning](#Download-and-prepare-pre-trained-weights-for-fine-tuning)

### Instance type quota increase

Complete the following steps:

- Open the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).
- Choose Amazon SageMaker.
- Choose the service quota.
- Choose Request quota increase.

**Notes**: *To make sure that you have enough quotas to support your usage requirements, it's a best practice to monitor and manage your service quotas. Requests for Amazon EC2 service quota increases are subject to review by AWS engineering teams. Also, service quota increase requests aren't immediately processed when you submit a request. After your request is processed, you receive an email notification.*

*This Jupyter notebook can be run on an `ml.t3.medium` instance. However, the job that saves the pre-trained weights as .pt weight files uses an `ml.trn1.32xlarge` instance.*

*Before you run this notebook, you'll need to request a quota increase of 32 from Amazon SageMaker for the following resources:*

1. *ml.trn1.32xlarge instance type for training job usage*
2. *ml.trn1.32xlarge instance type for training warm pool usage*
3. *Maximum number of instances per training job*
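
Current quota values can also be inspected programmatically. The sketch below assumes the Service Quotas API (`list_service_quotas` with `ServiceCode="sagemaker"`) and matches quotas by a substring of their name. The helper names and the sample data are hypothetical, and the exact quota names in your account may differ.

```python
def find_quotas(quotas, instance_type):
    """Return {quota name: value} for quotas whose name mentions instance_type."""
    return {
        q["QuotaName"]: q["Value"]
        for q in quotas
        if instance_type in q["QuotaName"]
    }


def fetch_sagemaker_quotas(region_name):
    """Fetch all SageMaker quotas. Needs AWS credentials, so it is not called here."""
    import boto3  # imported lazily so find_quotas works without boto3 installed

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas


# Hypothetical sample in the shape returned by list_service_quotas.
sample = [
    {"QuotaName": "ml.trn1.32xlarge for training job usage", "Value": 0.0},
    {"QuotaName": "ml.trn1.32xlarge for training warm pool usage", "Value": 0.0},
    {"QuotaName": "ml.t3.medium for notebook instance usage", "Value": 20.0},
]

matching = find_quotas(sample, "ml.trn1.32xlarge")
print(matching)
```

With credentials configured, `find_quotas(fetch_sagemaker_quotas(region_name), "ml.trn1.32xlarge")` would apply the same filter to the live quota list.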

### Install required packages

In [None]:
!pip install -U sagemaker boto3 --quiet

### Download and prepare pre-trained weights for fine-tuning

In [None]:
import sagemaker 

sess = sagemaker.Session()
# SageMaker session bucket, used for uploading data, models, and logs
# SageMaker creates this bucket automatically if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region_name = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region_name}")

Update the [access token](https://huggingface.co/docs/hub/en/security-tokens) below so that the job can download the model weights.

In [None]:
access_token = "hf_xxxx"
model_name = "meta-llama/Llama-2-70b-chat-hf"
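
As an alternative to pasting the token into the notebook, it can be read from an environment variable. This is a minimal sketch, assuming the token has been exported beforehand. The variable name `HF_TOKEN` and the helper `load_hf_token` are illustrative choices, not something this notebook requires.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face access token from the environment.

    Raises a clear error instead of silently passing an empty token to the job.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token")
    return token
```

The cell above could then use `access_token = load_hf_token()` instead of a hard-coded string.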

Hyperparameters for converting the pre-trained weights of the Llama 2 70B model:

In [None]:
hyperparameters = {}
hyperparameters["access_token"] = access_token
hyperparameters["model_name"] = model_name
hyperparameters["tp_size"] = 8
hyperparameters["pp_size"] = 8

In [None]:
# Use the sagemaker s3 checkpoints mechanism since we need read/write access to the paths.
hyperparameters["output_dir"] = "/opt/ml/checkpoints/llama70b_weights"
hyperparameters["checkpoint-dir"] = '/opt/ml/checkpoints'
hyperparameters["n_layers"] = 80
hyperparameters["convert_from_full_model"] = ""
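
A quick arithmetic check of these values can catch mismatches before a long-running job is launched. The sketch below assumes that layers are divided evenly across pipeline stages. Whether `convert_checkpoints.py` enforces this is not stated here, and the helper name `describe_partition` is hypothetical.

```python
def describe_partition(n_layers, tp_size, pp_size):
    """Summarize how a model would be partitioned for the given sizes."""
    if n_layers % pp_size != 0:
        raise ValueError(
            f"n_layers={n_layers} is not divisible by pp_size={pp_size}"
        )
    return {
        # Number of transformer layers held by each pipeline stage.
        "layers_per_stage": n_layers // pp_size,
        # Number of distinct (tensor rank, pipeline rank) combinations.
        "rank_combinations": tp_size * pp_size,
    }


plan = describe_partition(n_layers=80, tp_size=8, pp_size=8)
print(plan)
```

For the values used in this notebook, each of the 8 pipeline stages would hold 10 of the 80 layers, and there are 64 combinations of tensor-parallel and pipeline-parallel rank.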

In [None]:
# Docker image for training models on AWS Trainium
docker_image = f"763104351884.dkr.ecr.{region_name}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.17.0-ubuntu20.04"

For more details about Neuron Docker images:
- [AWS Neuron Deep Learning Containers](https://github.com/aws-neuron/deep-learning-containers/tree/main0)
- [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

In [None]:
# Define checkpoint directory that contains the weights and other relevant data for the trained model
checkpoint_s3_uri = "s3://" + sagemaker_session_bucket + "/neuronx_llama_experiment"

[PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) for running a job on Amazon SageMaker:

In [None]:
from sagemaker.pytorch import PyTorch

# Handle end-to-end Amazon SageMaker training and deployment tasks.
# NOTE: Multi-node training with torchrun is a work in progress. Use a single node.
estimator = PyTorch(
    base_job_name="neuronx-llama-download-model-weights",
    source_dir="./scripts",
    entry_point="convert_checkpoints.py",
    role=role,
    image_uri=docker_image,
    instance_count=1,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sess,
    volume_size=1024,
    hyperparameters=hyperparameters,
    debugger_hook_config=False,
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=hyperparameters["checkpoint-dir"],
    disable_output_compression=True,
    keep_alive_period_in_seconds=600
)

In [None]:
# Start SageMaker job
estimator.fit()
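
Once the job finishes, the contents of `checkpoint_local_path` should be available under `checkpoint_s3_uri`, so the converted weights are expected under its `llama70b_weights` prefix. The following sketch lists those objects with the S3 `list_objects_v2` paginator. The helper names are hypothetical, and the listing function needs AWS credentials, so it is defined but not called here.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_checkpoint_keys(uri):
    """Return all object keys stored under the given S3 URI."""
    import boto3  # imported lazily so split_s3_uri works without boto3 installed

    bucket, prefix = split_s3_uri(uri)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Example with a placeholder bucket name.
print(split_s3_uri("s3://example-bucket/neuronx_llama_experiment"))
```

In this notebook, `list_checkpoint_keys(checkpoint_s3_uri + "/llama70b_weights")` would show the files written by the conversion job.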