<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel: Python 3 (ipykernel)</strong>
</div>

# Fine-tune Llama 3.1 with PyTorch FSDP and Q-Lora on Amazon SageMaker

This blog post explains how you can fine-tune a Llama 3.1 8b model using PyTorch FSDP and Q-Lora with the help of Hugging Face [TRL](https://huggingface.co/docs/trl/index), [Transformers](https://huggingface.co/docs/transformers/index), [peft](https://huggingface.co/docs/peft/index) & [datasets](https://huggingface.co/docs/datasets/index) on Amazon SageMaker. 

**This notebook is validated and optimized to run on `ml.p3.2xlarge` instances**

**FSDP + Q-Lora Background**

Hugging Face supports combining Q-Lora with PyTorch FSDP (Fully Sharded Data Parallel). Together, FSDP and Q-Lora let you fine-tune Llama-like architectures or Mixtral 8x7B. The core logic resides in Hugging Face PEFT; read more about it in the [PEFT documentation](https://huggingface.co/docs/peft/v0.10.0/en/accelerate/fsdp).

* [PyTorch FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data/model parallelism technique that shards the model across GPUs, reducing memory requirements and enabling larger models to be trained more efficiently.
* Q-LoRA is a fine-tuning method that combines quantization and low-rank adapters to reduce computational requirements and memory footprint.
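
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The layer size, LoRA rank, and parameter count are illustrative assumptions rather than values taken from the training script, and the figures cover weight storage only (no activations, gradients, or optimizer state).

```python
# Rough arithmetic for why low-rank adapters and 4-bit weights save memory.
# All numbers are illustrative assumptions, not measurements from this notebook.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

d = 4096   # hidden size of a hypothetical square projection layer
r = 16     # assumed LoRA rank

full = d * d
adapter = lora_params(d, d, r)
print(f"full layer: {full:,} params, LoRA adapter: {adapter:,} params "
      f"({adapter / full:.2%} of the layer)")

n = 8e9    # approximate parameter count of an 8B model
print(f"fp16 weights:  {weight_memory_gib(n, 16):.1f} GiB")
print(f"4-bit weights: {weight_memory_gib(n, 4):.1f} GiB")
```

Under these assumptions, the adapter trains well under 1% of the parameters of the layer it wraps, and the 4-bit weights take a quarter of the space of their fp16 counterparts.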

This blog post walks you through how to fine-tune open LLMs from Hugging Face using Amazon SageMaker.

## 1. Setup Development Environment

Our first step is to install the Hugging Face libraries we need on the client to correctly prepare our dataset and start our training and evaluation jobs. Skip this step if you have already run task 1.

In [None]:
%pip install -Uq py7zr==0.22.0
%pip install -Uq datasets==2.21.0
%pip install -Uq transformers==4.45.0
%pip install -Uq peft==0.12.0
%pip install -Uq s3fs==2024.9.0

In [None]:
%%bash
# temporary fix for a bug in SM Studio where fsspec-2023 is not properly updated
export SITE_PACKAGES_FOLDER=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
rm -rf $SITE_PACKAGES_FOLDER/fsspec-2023*

echo "fsspec-2023 bug fix run successfully"

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [None]:
import sagemaker
import boto3
from datasets import load_dataset
from sagemaker.pytorch import PyTorch
import matplotlib.pyplot as plt
from sagemaker.s3 import S3Downloader
import os
from sagemaker.model import Model


sess = sagemaker.Session()
sagemaker_session_bucket=None

if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

In [None]:
# HF dataset that we will be working with 
dataset_name="Samsung/samsum"

In [None]:
#This parameter will toggle between local mode (downloading from S3) and loading models from the HF Model Hub.
#In a workshop environment you will have a local model pre-downloaded. 
#Otherwise you will either download the model to S3 and leave this True, or set this to false and fill in the HuggingFace Model ID and Token if necessary.
USE_LOCAL_MODEL_FROM_S3 = True


if USE_LOCAL_MODEL_FROM_S3:
    os.environ['use_local']="true"
    #the default path set here is for workshop environments. 
    #If using this outside of a hosted workshop, you will need to set this to wherever you downloaded your model.
    #Ignore the model_id and hf_token fields, they are simply being cleared here to avoid conflicts with subsequent runs.
    os.environ['model_id']=""
    os.environ['hf_token']=""
    os.environ['base_model_s3_path']=f"s3://{sess.default_bucket()}/sagemaker/models/base/llama3_1_8b_instruct/"

else:
    os.environ['use_local']="false"
    # Model_ID - set this to the HuggingFace Model ID you want to load.
    os.environ['model_id']="meta-llama/Meta-Llama-3.1-8B-Instruct"    
    # HF_Token - use your HuggingFace Token here to access gated models. Llama-3.1-8B-Instruct is a gated model.
    os.environ['hf_token']="<your-hf-token>"
    #ignore this env variable for remote mode
    os.environ['base_model_s3_path']=""

In [None]:
# Uncomment this variable and set the value to your MLFlow Tracking Server ARN to activate MLFLow experiment tracking
# os.environ['mlflow_tracking_server_arn']="<mlflow-server-arn>"

## 2. Create and prepare the dataset

We will use SAMSum dataset which consists of approximately 16,000 messenger-like conversations designed for the task of abstractive summarization. Created by linguists fluent in English, the dataset captures real-life dialogue styles and includes informal, semi-formal, and formal conversations, often containing slang and emoticons. Each conversation is paired with a concise human-written summary that encapsulates the main points discussed. The dataset is divided into training, validation, and test splits, with 14,732 examples in the training set, 818 in validation, and 819 in the test set, making it a valuable resource for research in dialogue summarization.

In [None]:
# Convert dataset to summarization messages    
def create_summarization_prompts(data_point):
    full_prompt =f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
                    You are an AI assistant trained to summarize conversations. Provide a concise summary of the dialogue, capturing the key points and overall context.
                    <|eot_id|><|start_header_id|>user<|end_header_id|>
                    Summarize the following conversation:

                    {data_point["dialogue"]}
                    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
                    Here's a concise summary of the conversation:

                    {data_point["summary"]}
                    <|eot_id|>"""
    return {"prompt": full_prompt}

# Load dataset from the hub
dataset = load_dataset(dataset_name, trust_remote_code=True)

# Build a full prompt (system, user, and assistant turns) for each record and drop the original columns
columns_to_remove = list(dataset["train"].features)

dataset = dataset.map(
    create_summarization_prompts,
    remove_columns=columns_to_remove,
    batched=False
)

In [None]:
# Review dataset
dataset, dataset['train'][0]

After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
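
The upload cell derives the S3 console link from the key prefix of `input_path`. As a small sketch of how an S3 URI breaks down into bucket and prefix, the helper below (`split_s3_uri` is a hypothetical name, and `my-example-bucket` is a placeholder) shows the same split in a more explicit form.

```python
# Sketch: splitting an S3 URI into bucket and key prefix.
# "my-example-bucket" is a placeholder, not a real bucket.

def split_s3_uri(uri: str) -> tuple[str, str]:
    """Return (bucket, key_prefix) for a URI of the form s3://bucket/key/prefix."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix

input_path = "s3://my-example-bucket/datasets/llama3"
bucket, prefix = split_s3_uri(input_path)
print(bucket, prefix)

# The notebook's one-liner yields the same prefix:
assert input_path.split("/", 3)[-1] == prefix
```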

In [None]:
# save train_dataset to s3 using our SageMaker session
input_path = f's3://{sess.default_bucket()}/datasets/llama3'

# Save datasets to s3
# We will fine-tune with only 100 training records and 20 test records due to limited compute resources in the workshop
dataset["train"].select(range(100)).to_json(f"{input_path}/train/dataset.json", orient="records")
train_dataset_s3_path = f"{input_path}/train/dataset.json"
dataset["test"].select(range(20)).to_json(f"{input_path}/test/dataset.json", orient="records")
test_dataset_s3_path = f"{input_path}/test/dataset.json"

print(f"Training data uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)
print(f"\nYou can view the uploaded dataset in the console here: \nhttps://s3.console.aws.amazon.com/s3/buckets/{sess.default_bucket()}/?region={sess.boto_region_name}&prefix={input_path.split('/', 3)[-1]}/")

In [None]:
# YOU CAN USE THIS CONFIGURATION FOR A TRAINING RUN ON THE COMPLETE DATASET. 
# THIS WILL NOT WORK IN A WORKSHOP ENVIRONMENT DUE TO RESOURCE CONSTRAINTS (ml.p3.2xlarge).

# save train_dataset to s3 using our SageMaker session
#input_path = f's3://{sess.default_bucket()}/datasets/llama3'

# Save datasets to s3
# This variant writes the complete train and test splits instead of the reduced workshop subset
#dataset["train"].to_json(f"{input_path}/train/dataset.json", orient="records")
#train_dataset_s3_path = f"{input_path}/train/dataset.json"
#dataset["test"].to_json(f"{input_path}/test/dataset.json", orient="records")
#test_dataset_s3_path = f"{input_path}/test/dataset.json"

#print(f"Training data uploaded to:")
#print(train_dataset_s3_path)
#print(test_dataset_s3_path)
#print(f"https://s3.console.aws.amazon.com/s3/buckets/{sess.default_bucket()}/?region={sess.boto_region_name}&prefix={input_path.split('/', 3)[-1]}/")

### Measure input length

When passing a dataset to the LLM for fine-tuning, it's important to ensure that the inputs are all of a uniform length. To achieve this, we first visualize the distribution of the input lengths (or alternatively, directly find the maximum length). The plot below uses the character length of each prompt as a rough proxy for its token length. Based on these results, we identify the maximum input length and use padding to ensure all the inputs are of the same length.
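
As a minimal sketch of the measure-then-pad idea, the example below uses a whitespace split as a stand-in for the real tokenizer. That is an assumption made so the snippet runs without downloading a model, and actual token counts will differ. `toy_tokenize` and `pad_to_length` are hypothetical helper names.

```python
# Sketch: measure sequence lengths and pad to a common length.
# A whitespace split stands in for the real tokenizer (an assumption so the
# example runs without a model download); real token counts will differ.

def toy_tokenize(text: str) -> list[str]:
    return text.split()

def pad_to_length(tokens: list[str], length: int, pad_token: str = "<pad>") -> list[str]:
    """Right-pad (or truncate) a token list to exactly `length` tokens."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))

prompts = [
    "Summarize the following conversation: A: hi B: hello",
    "Summarize the following conversation: A: are you coming tonight? B: yes, at eight",
]
tokenized = [toy_tokenize(p) for p in prompts]
lengths = [len(t) for t in tokenized]
max_len = max(lengths)

padded = [pad_to_length(t, max_len) for t in tokenized]
print("lengths before padding:", lengths)
print("lengths after padding: ", [len(t) for t in padded])
```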

In [None]:
def plot_data_lengths(tokenized_train_dataset, tokenized_validation_dataset):
    lengths1 = [len(x["prompt"]) for x in tokenized_train_dataset]
    lengths2 = [len(x["prompt"]) for x in tokenized_validation_dataset]
    lengths = lengths1 + lengths2
    
    plt.figure(figsize=(10,6))
    plt.hist(lengths, bins=20, alpha=0.7, color="blue")
    plt.xlabel("Prompt length (characters)")
    plt.ylabel("Frequency")
    plt.title("Distribution of prompt lengths")
    plt.show()

In [None]:
plot_data_lengths(dataset["train"], dataset["test"])

## 3. Fine-tune Llama 3.1 on Amazon SageMaker

We are now ready to fine-tune our model. We will use the [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl`, a subclass of the `Trainer` from `transformers` that makes it straightforward to run supervised fine-tuning on open LLMs. We prepared a script, [launch_fsdp_qlora.py](../scripts/launch_fsdp_qlora.py), which loads the dataset from disk, prepares the model and tokenizer, and starts the training.

For configuration we use `TrlParser`, which allows us to provide hyperparameters in a YAML file. This file will be uploaded and provided to Amazon SageMaker in the same way as our datasets. Below is the config file for fine-tuning Llama 3.1 8B on an ml.p3.2xlarge instance with a 16 GB GPU. We save the config file as `args.yaml` and upload it to S3.
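
As a rough guide to how the batch-related settings in this config interact, the sketch below estimates the effective batch size and the number of optimizer steps. It assumes no sequence packing, a single GPU, and the 100-record training subset uploaded earlier, so treat the result as an approximation. `optimizer_steps` is a hypothetical helper, not part of the training script.

```python
import math

# Sketch: how batch settings translate into optimizer steps.
# Assumes no sequence packing and a single GPU; the record count of 100
# matches the training subset uploaded earlier in this notebook.

def optimizer_steps(num_records, per_device_batch_size, grad_accum_steps,
                    num_devices, num_epochs):
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_records / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs

effective_batch, total_steps = optimizer_steps(
    num_records=100,
    per_device_batch_size=1,   # per_device_train_batch_size
    grad_accum_steps=1,        # gradient_accumulation_steps
    num_devices=1,             # ml.p3.2xlarge has one GPU
    num_epochs=1,              # num_train_epochs
)
print(f"effective batch size: {effective_batch}, optimizer steps: {total_steps}")
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing per-step GPU memory, at the cost of fewer optimizer updates per epoch.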


In [None]:
%%bash

cat > ./args.yaml <<EOF
hf_token: "${hf_token}"                         # Use HF token to login into Hugging Face to access the Llama 3.1 8b model
model_id: "${model_id}"                         # Hugging Face model id
use_local: "${use_local}"

max_seq_length: 2048  #512 # 2048               # max sequence length for model and packing of the dataset
# sagemaker specific parameters
train_dataset_path: "/opt/ml/input/data/train/" # path to where SageMaker saves train dataset
test_dataset_path: "/opt/ml/input/data/test/"   # path to where SageMaker saves test dataset
base_model_s3_path: "/opt/ml/input/data/basemodel/"
#tokenizer_s3_path: "/opt/ml/input/data/tokenizer/"
ml_tracking_server_arn: "${mlflow_tracking_server_arn}"

output_dir: "/opt/ml/model/llama3.1/adapters/sum"         # path to where SageMaker will upload the model 
# training parameters
report_to: "mlflow"                    # report metrics to MLflow
learning_rate: 0.0002                  # learning rate 2e-4
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 1         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 1         # number of steps before performing a backward/update pass
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
eval_strategy: epoch                   # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: false                            # bfloat16 precision disabled
tf32: false                            # tf32 precision disabled
fp16: true                             # use fp16 mixed precision
gradient_checkpointing: true           # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload"   # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"

EOF

Let's upload the config file to S3.

In [None]:
from sagemaker.s3 import S3Uploader

# upload the model yaml file to s3
model_yaml = "args.yaml"
train_config_s3_path = S3Uploader.upload(local_path=model_yaml, desired_s3_uri=f"{input_path}/config")

print(f"Training config uploaded to:")
print(train_config_s3_path)

### Fine-tune LoRA adapter

The estimator below trains the model with QLoRA and saves the LoRA adapter to S3.

In [None]:
# Create SageMaker PyTorch Estimator

# define Training Job Name 
job_name = f'llama3-1-8b-finetune'

pytorch_estimator = PyTorch(
    entry_point= 'launch_fsdp_qlora.py',
    source_dir="./scripts",
    job_name=job_name,
    base_job_name=job_name,
    max_run=5800,
    role=role,
    framework_version="2.2.0",
    py_version="py310",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=sess,
    volume_size=50,
    disable_output_compression=True,
    keep_alive_period_in_seconds=1800,
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml" # path to TRL config which was uploaded to s3
    }
)

_Note: When using QLoRA, we only train adapters and not the full model. The [launch_fsdp_qlora.py](../scripts/launch_fsdp_qlora.py) script saves the adapter at the end of training to the Amazon SageMaker S3 bucket (sagemaker-<region name>-<account_id>)._

We can now start our training job with the `.fit()` method, passing the S3 paths of our datasets and config as input channels.
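
In the default File input mode, SageMaker makes each input channel available inside the training container under `/opt/ml/input/data/<channel_name>`, which is why the config refers to paths such as `/opt/ml/input/data/config/args.yaml`. The sketch below illustrates that mapping. The bucket name is a placeholder and `channel_mount_paths` is a hypothetical helper.

```python
# Sketch: where input channels appear inside the training container
# (default File input mode). The S3 URIs below are placeholders.

INPUT_ROOT = "/opt/ml/input/data"

def channel_mount_paths(channels: dict[str, str]) -> dict[str, str]:
    """Map channel names to the container directory where their data appears."""
    return {name: f"{INPUT_ROOT}/{name}" for name in channels}

channels = {
    "train": "s3://my-example-bucket/datasets/llama3/train/dataset.json",
    "test": "s3://my-example-bucket/datasets/llama3/test/dataset.json",
    "config": "s3://my-example-bucket/datasets/llama3/config/args.yaml",
}
for name, path in channel_mount_paths(channels).items():
    print(f"{name:>6} -> {path}")
```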

In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {
  'train': train_dataset_s3_path,
  'test': test_dataset_s3_path,
  'config': train_config_s3_path
  }

if os.environ["use_local"].lower() == "true":
    data.update({'basemodel':os.environ['base_model_s3_path']})    
 
# Check input channels configured 
data

In [None]:
# starting the train job with our uploaded datasets as input
pytorch_estimator.fit(data, wait=True)

In [None]:
# Find the job name of the last run, or browse the console
latest_run_job_name=pytorch_estimator.latest_training_job.job_name
latest_run_job_name

In [None]:
# Find S3 path for the last job that ran successfully. You can find this from the SageMaker console 

# *** Get a job name from the AWS console for the last training run or from the above cell
job_name = latest_run_job_name

def get_s3_path_from_job_name(job_name):
    # Create a Boto3 SageMaker client
    sagemaker_client = boto3.client('sagemaker')
    
    # Describe the training job
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    
    # Extract the model artifacts S3 path
    model_artifacts_s3_path = response['ModelArtifacts']['S3ModelArtifacts']
    
    # Extract the output path (this is the general output location)
    output_path = response['OutputDataConfig']['S3OutputPath']
    
    return model_artifacts_s3_path, output_path


model_artifacts, output_path = get_s3_path_from_job_name(job_name)

print(f"Model artifacts S3 path: {model_artifacts}")

In [None]:
# Point to the directory where we have the adapter saved 

adapter_dir_path=f"{model_artifacts}/llama3.1/adapters/sum/"

adapter_serving_dir_path=f"{model_artifacts}/llama3.1/"

print(f'\nAdapter S3 Dir path:{adapter_dir_path} \n')

print(f'\nServing S3 Dir path:{adapter_serving_dir_path} \n')

!aws s3 ls {adapter_dir_path}

In [None]:
# Assuming you already have this environment variable set
base_model_s3_path = os.environ['base_model_s3_path'] if os.environ['use_local'].lower() == 'true' else os.environ['model_id']

# Store the variables required for the next notebook 
%store base_model_s3_path
%store adapter_serving_dir_path


### Next Step - Use the `register_model_adapter.ipynb` notebook to register the adapter in the SageMaker Model Registry

=================

## Optional Step - Merge base model with fine-tuned adapter in fp16 and Test Inference 

The next estimator performs the following steps:
1. Load base model in fp16 precision
2. Convert adapter saved in previous step from fp32 to fp16
3. Merge the model
4. Run inference both on base model and merged model for comparison 
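
The merge in step 3 can be sketched as plain arithmetic: the low-rank update `B @ A`, scaled by `alpha / r`, is added to the base weight. The example below uses tiny hand-written matrices to stay dependency-free. It illustrates the underlying math only, under assumed values, and the actual merge script may rely on library helpers instead. `matmul` and `merge_lora` are hypothetical names.

```python
# Sketch of the arithmetic behind merging a LoRA adapter into a base weight:
#   W_merged = W + (alpha / r) * B @ A
# Tiny hand-written matrices keep the example dependency-free.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(w, a, b, alpha, r):
    scale = alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, 2 x 2
A = [[0.5, -0.5]]              # LoRA A, r x d_in with r = 1
B = [[1.0], [2.0]]             # LoRA B, d_out x r
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, inference needs only the single merged weight matrix, so the adapter adds no extra computation at serving time.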

In [None]:
# Create SageMaker PyTorch Estimator

# Define Training Job Name 
job_name = f'llama3-1-8b-merge-adapter'

pytorch_estimator_adapter = PyTorch(
    entry_point= 'merge_model_adapter.py',
    source_dir="./scripts",
    job_name=job_name,
    base_job_name=job_name,
    max_run=5800,
    role=role,
    framework_version="2.2.0",
    py_version="py310",
    instance_count=1,
    volume_size=50,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=sess,
    disable_output_compression=True,
    keep_alive_period_in_seconds=1800,
    hyperparameters={
        "model_id": os.environ['model_id'],  # Hugging Face model id
        "hf_token": os.environ['hf_token'],
        "dataset_name":dataset_name,
        "use_local": os.environ['use_local']
    }
)

In [None]:
!aws s3 ls {adapter_dir_path}

In [None]:
# define a data input dictionary with the adapter and test data s3 uris
data = {
  'adapter': adapter_dir_path,
  'testdata': test_dataset_s3_path 
  }

if(os.environ["use_local"].lower()=="true"):
    data.update({'basemodel':os.environ['base_model_s3_path']})

data

In [None]:
# start the merge job with the adapter and test data as input
pytorch_estimator_adapter.fit(data, wait=True)