# 0: Download and Compile Llama 2 7b Chat weights for AWS Neuron (HuggingFace)

- Neuronx 2.15
- SageMaker Notebook Kernel: `conda_python3`
- SageMaker Notebook Instance Type: ml.m5d.large | ml.t3.large

In this notebook, we will prepare the Llama 2 Chat instruction-tuned large language model from Meta to run on [AWS Trainium (Trn1)](https://aws.amazon.com/ec2/instance-types/trn1/) accelerators. We will create a Python script that downloads the model from Hugging Face, then uses the `transformers-neuronx` package to transform and compile the model weights for Neuron. Then we will create and submit a SageMaker training job that executes the script on Trainium (Trn1) and uploads the model artifacts to S3.

Elastic Compute Cloud (EC2) Trn1 instances, powered by AWS Trainium accelerators, are purpose built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs) and latent diffusion models. Trn1 instances offer up to 50% cost-to-train savings over other comparable Amazon EC2 instances. You can use Trn1 instances to train 100B+ parameter DL and generative AI models across a broad set of applications, such as text summarization, code generation, question answering, image and video generation, recommendation, and fraud detection. Models are compiled for these accelerators with the [AWS Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/). This notebook uses the **2.14.1** version of the Neuron SDK.

## Runtime 

This notebook takes approximately 30 minutes to run after the prerequisites have been met.

## Contents

1. [Prerequisites](#prerequisites)
1. [Setup](#setup)
1. [Prepare and execute the training job](#prepare-and-execute-the-training-job)
1. [Save the model location](#save-the-model-location)

## Prerequisites

You will need to complete the following steps to get permission to download the Llama 2 pre-trained weights from Meta.

### Step 1 - Create a Hugging Face account

Go to (https://huggingface.co/join) and create a Hugging Face account if you don't have one, then log in to the Hugging Face Hub.

### Step 2 - Create an Access token

Follow the instructions from (https://huggingface.co/docs/hub/security-tokens) and create a new Access token. Copy the token.

### Step 3 - Meta approval to download weights

Follow the instructions at (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to request approval from Meta to download and use the weights. Approval can take some time. Once approved, you'll see a message like "Gated model You have been granted access to this model" at the top of the same page. Now you're ready to download and compile your model for Inferentia2.


## Setup

Let's start by installing and importing the required packages for this notebook. 

<div class="alert alert-block alert-warning"><b>Note:</b> Verify that the notebook kernel is `conda_python3`. Also, if you run into an issue where a module can't be imported after installation, restart the notebook kernel, then rerun the import notebook cell.</div>

In [None]:
%pip install --upgrade sagemaker --quiet
%pip install python-dotenv --quiet

In [None]:
import os
import json
import boto3
import logging
import sagemaker
from IPython.display import display
from dotenv import load_dotenv
from ipywidgets import widgets
from sagemaker.pytorch import PyTorch

***

Next, we will initialize the SageMaker session and create a working directory.

***


In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker_session.get_caller_identity_arn()
bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name

# Create the working directory
os.makedirs("build/train", exist_ok=True)

# Print the session detail
print(f"Sagemaker version: {sagemaker.__version__}")
print(f"Sagemaker role arn: {role}")
print(f"Sagemaker bucket: {bucket}")
print(f"Sagemaker region: {region}")

***

Downloading the Llama 2 model weights from Hugging Face requires a user access token. After running the next cell, an input text box will appear below the cell where you should enter the token. Once entered, select the cell below the text box and run it. This will write the token to a `.env` file on the notebook server and clear the input text box to prevent the token from being saved with the notebook.

If you have already completed this step in a previous run, you don't need to enter the token again. To change the token value, or if it was entered incorrectly, set the `force_overwrite` variable to `True` and the value in the file will be replaced.
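
The overwrite rule described here can be sketched as a small helper. This is a minimal illustration rather than the notebook's own cell: the function name `write_token_file` is hypothetical, and rejecting an empty token is an assumption added for the sketch.

```python
import os


def write_token_file(token, env_file=".env", force_overwrite=False):
    """Write HF_TOKEN to env_file and return True if the file was written.

    Hypothetical helper: the name and the empty-token check are
    assumptions made for this sketch.
    """
    if not token:
        raise ValueError("token is empty; enter the Hugging Face token first")
    if os.path.exists(env_file) and not force_overwrite:
        # Keep the existing file unless the caller explicitly asks to replace it.
        return False
    with open(env_file, "w") as file:
        file.write(f"HF_TOKEN={token}\n")
    return True
```

Calling `write_token_file(input_text.value, force_overwrite=True)` would replace a token that was entered incorrectly, mirroring the `force_overwrite` variable.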

***

In [None]:
input_text = widgets.Password(placeholder="Enter Token", description="Hugging Face Token:", disabled=False)
display(input_text)

In [None]:
force_overwrite = False
env_file = ".env"

# write the access token to the .env file
if not os.path.exists(env_file) or force_overwrite:
    print("Creating environment file")
    with open(env_file, "w") as file:
        file.write(f"HF_TOKEN={input_text.value}")
else:
    print("File already exists")
# clear the input value
input_text.value = ""

***

Load the environment file and test that the token has a value.
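
To make the check concrete, here is a simplified, standard-library sketch of reading a `KEY=VALUE` file and confirming that the token is present and non-empty. It is not how `python-dotenv` is implemented, and `read_env_file` and `require_token` are hypothetical names used only for illustration.

```python
def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    with open(path) as file:
        for line in file:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def require_token(values, name="HF_TOKEN"):
    """Return the token, failing when it is missing or empty."""
    token = values.get(name)
    if not token:
        raise ValueError(f"{name} is missing or empty")
    return token
```

Testing for a falsy value covers both a missing variable (`None`) and an empty string, which a comparison against `""` alone would not.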

***


In [None]:
load_dotenv()

# os.getenv returns None when the variable is unset, so test for a truthy value
assert os.getenv(
    "HF_TOKEN"
), "Go to your HF account and get an access token. Set HF_TOKEN in the .env file with the token value."
os.makedirs("build/train", exist_ok=True)

## Prepare and execute the training job

Next, we will create files needed by the SageMaker training job to prepare the model for Inferentia 2. 

- `requirements.txt` - Lists the packages needed by the `prepare_llama2.py` script
- `prepare_llama2.py` - The training job script that downloads the model from Hugging Face and compiles it for Neuron

Read through the `prepare_llama2.py` script to understand what it's doing.
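
One pattern worth noticing in the script is how it reads its settings. SageMaker passes each hyperparameter to the entry point as a `--name value` command-line argument and exposes the model output directory through the `SM_MODEL_DIR` environment variable. The sketch below mirrors a few of the script's arguments; `parse_job_args` is a hypothetical wrapper, and the `/opt/ml/model` fallback is an assumption added so the sketch also runs outside a training container.

```python
import argparse
import os


def parse_job_args(argv=None):
    """Parse a subset of the script's arguments (hypothetical wrapper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="meta-llama/Llama-2-7b-chat-hf")
    # Fallback value is an assumption for running outside SageMaker.
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--tp_degree", type=int, default=2)
    parser.add_argument("--n_positions", type=int, default=1024)
    # parse_known_args returns undeclared arguments instead of raising an error.
    return parser.parse_known_args(argv)


args, unknown = parse_job_args(["--tp_degree", "2", "--n_positions", "4096", "--extra_flag", "1"])
print(args.tp_degree, args.n_positions, unknown)
```

Using `parse_known_args` rather than `parse_args` means an argument the script does not declare is collected in `unknown` instead of stopping the job.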


In [None]:
%%writefile build/train/requirements.txt

-i https://pip.repos.neuron.amazonaws.com
torchserve==0.9.0
sentencepiece==0.1.99
transformers==4.34.1
neuronx-cc==2.11.0.34
torch-neuronx==1.13.1.1.12.0
transformers-neuronx==0.8.268
torchvision

In [None]:
%%writefile build/train/prepare_llama2.py

import os
import sys
import time
import torch
import shutil
import argparse
import traceback


from huggingface_hub import login
from transformers import LlamaForCausalLM, AutoTokenizer
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.llama.model import LlamaForSampling

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    
    parser.add_argument("--model_id", type=str, default="meta-llama/Llama-2-7b-chat-hf")    
    parser.add_argument("--hf_access_token", type=str, default=os.environ["HF_TOKEN"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--tp_degree", type=int, default=2)
    parser.add_argument("--n_positions", type=int, default=1024)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--dtype", type=str, default="bf16")

    args, _ = parser.parse_known_args()
    
    # Set Neuron environment variables
    os.environ["NEURON_COMPILE_CACHE_URL"]=os.path.join(args.model_dir, "neuron_cache")
    # Specifies the number of NeuronCores to be used at runtime and it should match the tensor parallelism (TP) degree specified for the model
    os.environ["NEURON_RT_NUM_CORES"] = str(args.tp_degree)
    # Enables compiler optimization on decoder-only LLM models.
    # -O1 - not optimized for performance
    # -O2 - default settings
    # -O3 - best performance
    os.environ["NEURON_CC_FLAGS"] = "-O3"

    # log into HuggingFace
    login(args.hf_access_token)

    print("Loading model...")
    t=time.time()

    # Loads model weights from HuggingFace
    model = LlamaForCausalLM.from_pretrained(args.model_id) 
    print(f"Elapsed: {time.time()-t}s")

    print("Splitting and saving...")
    t=time.time()

    save_pretrained_split(model, os.path.join(args.model_dir, "llama2-split")) 
    
    print(f"Elapsed: {time.time()-t}s, Done")    

    print("Saving tokenizer...")
    t=time.time()

    tokenizer = AutoTokenizer.from_pretrained(args.model_id) # Loads tokenizer from HuggingFace
    tokenizer.save_pretrained(args.model_dir) # Saves tokenizer to output model directory

    print(f"Elapsed: {time.time()-t}s, Done")

    kwargs = {
        "batch_size": args.batch_size,
        "amp": args.dtype,
        "tp_degree": args.tp_degree,
        "n_positions": args.n_positions,
        "unroll": None
    }

    print("Compiling model...")
    t=time.time()

    model = LlamaForSampling.from_pretrained(os.path.join(args.model_dir, "llama2-split"), **kwargs)
    model.to_neuron()
    # Save the compiled artifacts next to the split weights in the model directory
    model.save(os.path.join(args.model_dir, "neuron_artifacts"))

    print(f"Compilation time: {time.time()-t}")

***

Set the training job parameters:

- `tp_degree` - Sets the number of NeuronCores to be used during compilation; this needs to match the deployment target. Since we will be running this model on an inf2.xlarge instance type, which has 1 accelerator with 2 cores, `tp_degree` should be set to 2.
- `batch_size` - The batch size the model is compiled for.
- `sentence_length` - The maximum sequence length that this model might ever be used with. For Llama 2, the maximum token length is 4096.
- `instance_type` - The SageMaker instance type on which the training job will execute.
- `image_uri` - The ECR container image URI.
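
The `tp_degree` rule is simple arithmetic and can be checked before submitting the job. The sketch below assumes 2 NeuronCores per accelerator and a 4096-token limit, as stated in the list; `neuron_core_count` and `check_job_parameters` are hypothetical helpers, not part of the SageMaker or Neuron SDKs.

```python
LLAMA2_MAX_TOKENS = 4096  # maximum token length for Llama 2, per the list above


def neuron_core_count(accelerators, cores_per_accelerator=2):
    """Total NeuronCores on the target instance."""
    return accelerators * cores_per_accelerator


def check_job_parameters(tp_degree, sentence_length, accelerators):
    """Raise ValueError when the parameters break the rules described above."""
    cores = neuron_core_count(accelerators)
    if tp_degree != cores:
        raise ValueError(f"tp_degree={tp_degree} does not match {cores} NeuronCores")
    if not 0 < sentence_length <= LLAMA2_MAX_TOKENS:
        raise ValueError(f"sentence_length must be between 1 and {LLAMA2_MAX_TOKENS}")


# An inf2.xlarge target has 1 accelerator, so tp_degree should be 2.
check_job_parameters(tp_degree=2, sentence_length=4096, accelerators=1)
```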

***

In [None]:
tp_degree = 2  # set to the number of neuron cores available (2 * accelerators)
dtype = "f16"
batch_size = 1
sentence_length = 4096
instance_type = "ml.trn1.2xlarge"
image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.14.1-ubuntu20.04"
)

***

Create the SageMaker PyTorch estimator and run the job.

This step can take up to **25 minutes**. 

***

In [None]:
estimator = PyTorch(
    entry_point="prepare_llama2.py",  # Specify your train script
    source_dir="build/train",
    role=role,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type=instance_type,
    output_path=f"s3://{bucket}/output",
    disable_profiler=True,
    disable_output_compression=True,
    image_uri=image_uri,
    volume_size=128,
    environment={
        "HF_TOKEN": os.getenv("HF_TOKEN"),
        "FI_EFA_FORK_SAFE": "1",  # https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html
    },
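    # Pass every setting defined above so dtype and batch_size are not silently ignored
    hyperparameters={
        "model_id": "meta-llama/Llama-2-7b-chat-hf",
        "tp_degree": tp_degree,
        "n_positions": sentence_length,
        "batch_size": batch_size,
        "dtype": dtype,
    },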
)
estimator.framework_version = "1.13.1"  # workaround when using image_uri

estimator.fit()

## Save the model location

Write the training job model information to disk so it can be reused in the `03-deploy-model.ipynb` notebook.
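
Once written, the file can be read back in a later notebook. This is a minimal sketch, assuming `estimator.model_data` is either a plain S3 URI string or, because output compression is disabled, a dictionary that holds the URI under `S3DataSource` and `S3Uri`. `load_model_location` is a hypothetical helper, and the exact shape of the saved value should be confirmed against the output of the cell that follows.

```python
import json


def load_model_location(path="model_data.json"):
    """Return the S3 URI stored in the JSON file (hypothetical helper)."""
    with open(path) as file:
        model_data = json.load(file)
    if isinstance(model_data, str):
        # Compressed output: model_data is the URI of a model.tar.gz file.
        return model_data
    # Uncompressed output (assumed shape): URI nested under S3DataSource.
    return model_data["S3DataSource"]["S3Uri"]
```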

In [None]:
with open("model_data.json", "w") as file:
    file.write(json.dumps(estimator.model_data))

estimator.model_data

## Notebook complete

You've finished preparing the model for Inferentia. Please move on to the next notebook.
