<div class="alert alert-block alert-warning">
<b>Warning:</b> This module cannot be executed in Workshop Studio Accounts, and you will have to run this notebook in your own account.
</div>

## Fine-tuning Llama2 Pre-trained Models using Amazon Bedrock

Prompting large language models (LLMs) to generate accurate and relevant responses on natural language processing (NLP) tasks is a powerful way to build generative AI applications. You might begin with simple prompting techniques, such as zero-shot or few-shot prompting. Sooner or later, you may find that customizing LLMs for your specific use cases is an effective way to deliver better business outcomes. One option is Retrieval Augmented Generation (RAG), which steers an LLM with context information for your task and is particularly well suited to question answering. Another option is fine-tuning an LLM on custom data, which is well suited to tasks such as text summarization and classification. In this walkthrough, we fine-tune the Meta Llama2 13B pre-trained model on a text summarization task using Model Customization for Amazon Bedrock, which lets you optimize model performance for your specific use case without in-depth ML expertise.

### What are Llama2 Foundation Models?
Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is open-sourced and publicly available. It comes in a range of parameter sizes (7 billion, 13 billion, and 70 billion) as well as pre-trained and fine-tuned variations. According to Meta, the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. The tuned models are intended for assistant-like chat, whereas pre-trained models can be adapted for a variety of natural language generation tasks. Regardless of which version of the model a developer uses, [the responsible use guide](https://llama.meta.com/responsible-use-guide/) from Meta can assist in guiding additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations. Amazon Bedrock currently supports the Llama2 13 billion and 70 billion models, and customers need to check the End-User License Agreement (EULA) before planning to use them for commercial activities.

### What is Amazon Bedrock?
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.

To fine-tune the Llama2 13B pre-trained model with Model Customization for Amazon Bedrock, we will walk through data preparation, model fine-tuning, training process analysis, model deployment using [Provisioned Throughput](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html) provided by Amazon Bedrock, and evaluation using the [BERTScore](https://arxiv.org/abs/1904.09675) metric. We share the code for the key steps and include AWS console guidance. The source code is available in the code repository [meta-llama2-fine-tuning-text-summarization](https://github.com/aws-samples/amazon-bedrock-samples/tree/main/bedrock-fine-tuning/meta-llama2-text-summarization-blog).

> Because Amazon Bedrock makes fine-tuning LLMs easily achievable, you may also want to explore other foundation models, besides the Llama2 13B pre-trained model, for your specific use cases.


**Note**: Before following the fine-tuning process below, please run the [00_setup.ipynb](./00\_setup.ipynb) notebook to set up the necessary S3 bucket and IAM access role.

Set up environment variables for the default region and AWS profile.

> If you are running the notebook on an EC2 instance or in a SageMaker notebook environment that provides the necessary IAM permissions, you can skip the `AWS_PROFILE` setup. Otherwise, please set up an AWS profile in your notebook environment.

In [None]:
%env AWS_DEFAULT_REGION=us-east-1
%env AWS_PROFILE=

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
%store -r bucket_name
%store -r role_arn

### Data Preparation

In our example, we are using open sourced [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) dataset for a text summarization task. [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) is a large-scale dialogue summarization dataset, including 14,460 dialogues with corresponding manually-labelled summaries and topics.  

Before fine-tuning Llama2 models in Amazon Bedrock, we need to prepare a training dataset and an optional validation dataset, each as a JSONL file with one JSON object per line. Because we are fine-tuning a text-to-text model, each JSON line must contain a `prompt` field and a `completion` field. With the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) dataset, we will transform the `dialogue` attribute into the `prompt` field using a text summarization prompt template, and take the `summary` attribute as the `completion` field. (reference doc - [Prepare the datasets](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-prereq.html#model-customization-prepare))

Below is the example of “fine-tuning: text-to-text” dataset format

```json
{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}
...
```

And, below is an example for a text summarization task:

```json
{"prompt": "### Instruction\n\nSummarize the following conversation...", "completion": "Jessica tells Mr. White..."}
...
```
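
Before uploading a dataset, it can help to confirm that every line parses as JSON and carries both fields. The following is a minimal sketch using only the standard library; the helper name `find_invalid_jsonl_lines` is hypothetical and not part of the repository, and the checks shown are assumptions about what is worth validating rather than the service's own validation rules.

```python
import json

def find_invalid_jsonl_lines(lines):
    """Return (line_number, reason) pairs for lines that break the prompt/completion format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # Both fields must be present and contain non-empty strings.
        for field in ("prompt", "completion"):
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{field}'"))
    return problems

# Hypothetical sample lines: the second one lacks a completion.
sample = [
    '{"prompt": "Summarize ...", "completion": "A short summary."}',
    '{"prompt": "Summarize ..."}',
]
print(find_invalid_jsonl_lines(sample))
```
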
In general, the dataset size should be appropriate for the model size, and more training records tend to yield a better fine-tuned model. However, to simplify the data preparation, we will sample 1K records from the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) dataset. Please note that a model customization job for Llama2 fine-tuning supports up to 10K records for the training dataset and 1K for the validation dataset. Furthermore, we will split the data into training, validation and test sets for training and model evaluation.

> Splitting data into training, validation and test sets is an ML best practice. We use the training and validation datasets in the model customization job (for fine-tuning), which generates training process output that lets us analyze how well the fine-tuned model fits the training data; we then use the test dataset for evaluation. To learn more about the best practices, please refer to [Training, validation, and test data sets](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets)

Now let’s walk through the code for the data preparation process:

To set up a local data folder for storing the [DialogSum dataset](https://huggingface.co/datasets/knkarthick/dialogsum).

In [None]:
import boto3
import json

from pathlib import Path

# to create local data folder
data_folder = Path.cwd() / "data"
data_folder.mkdir(exist_ok=True)

source_file_path = data_folder / "dialogsum.train.jsonl"

To download the [DialogSum dataset file](https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.train.jsonl).

In [None]:
import urllib.request as request

source_training_data_uri = "https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.train.jsonl"
request.urlretrieve(source_training_data_uri, str(source_file_path))

To choose prompt, completion attributes and sample 1,000 records from the dataset.

In [None]:
import pandas as pd

df = pd.read_json(source_file_path, lines=True)

# renaming to "prompt" & "completion" to meet fine-tuning dataset format
df.rename(columns={"dialogue": "prompt", "summary": "completion"}, inplace=True)

# only keep the fine-tuning data fields
df = df[["prompt", "completion"]]

# only sample 1,000 records for our fine-tuning example
df = df.sample(n=1_000, random_state=42)

To split the data into training, validation and test datasets.

In [None]:
# split the dataset to be 80% training, 10% validation, 10% testing.
train_df = df.sample(frac=0.8, random_state=42)
validation_df = df.drop(train_df.index)  
test_df = validation_df.sample(frac=0.5, random_state=42)
validation_df = validation_df.drop(test_df.index)
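
As a sanity check on the split logic above, the sketch below repeats the same `sample` / `drop` pattern on stand-in data and reports the size of each split and any overlap between them. The stand-in frame and the helper name `split_report` are assumptions made for illustration, and the overlap check relies on the DataFrame index being unique.

```python
import pandas as pd

# Stand-in data with the same shape as the sampled DialogSum frame (1,000 rows).
demo_df = pd.DataFrame({
    "prompt": [f"dialogue {i}" for i in range(1000)],
    "completion": [f"summary {i}" for i in range(1000)],
})

# Same pattern as the notebook cell: 80% training, then halve the remainder.
demo_train = demo_df.sample(frac=0.8, random_state=42)
demo_holdout = demo_df.drop(demo_train.index)
demo_test = demo_holdout.sample(frac=0.5, random_state=42)
demo_validation = demo_holdout.drop(demo_test.index)

def split_report(train, validation, test):
    """Return split sizes and the number of row indices shared between splits."""
    train_idx, val_idx, test_idx = set(train.index), set(validation.index), set(test.index)
    overlap = (train_idx & val_idx) | (train_idx & test_idx) | (val_idx & test_idx)
    return {
        "train": len(train),
        "validation": len(validation),
        "test": len(test),
        "overlapping_rows": len(overlap),
    }

print(split_report(demo_train, demo_validation, demo_test))
```
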

To define the prompt template for the fine-tuning training and validation datasets. You can adapt the prompt style and tone to whatever works best for your specific use case.

In [None]:
llama2_fine_tuning_prompt_template = """
### Instruction
Summarize the following conversation.

### Context
{context}

### Answer
"""

We only apply the prompt template to training and validation dataset, which will be the input for model customization. We will leave the test dataset for model evaluation.

In [None]:
train_df['prompt'] = train_df['prompt'].apply(lambda x: llama2_fine_tuning_prompt_template.format(context=x))
validation_df['prompt'] = validation_df['prompt'].apply(lambda x: llama2_fine_tuning_prompt_template.format(context=x))
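
To see what a single training prompt looks like after the template is applied, the sketch below renders it for one dialogue. The template is repeated so the snippet runs on its own, and the dialogue is made up for illustration; it is not a record from DialogSum.

```python
# Copy of the template defined above, repeated so the snippet is self-contained.
template = """
### Instruction
Summarize the following conversation.

### Context
{context}

### Answer
"""

# Made-up dialogue used only for illustration.
example_dialogue = "#Person1#: Are you free on Friday?\n#Person2#: Yes, let's meet at noon."

rendered = template.format(context=example_dialogue)
print(rendered)
```

The rendered prompt ends with the `### Answer` header, so the completion continues directly after it. The same template is applied again at inference time when the fine-tuned model is invoked.
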

To upload the training and validation datasets to the S3 bucket, which will be the input for model customization. (For the implementation of the Python function `upload_data_to_s3`, see [utils.upload_data_to_s3()]().)

In [None]:
from scripts.utils import upload_data_to_s3

fine_tuning_data_type = "text-summarization"
prefix = "llama2"
train_data_uri = upload_data_to_s3(train_df, data_folder, f"{fine_tuning_data_type}-training-data.jsonl", bucket_name, prefix)
validation_data_uri = upload_data_to_s3(validation_df, data_folder, f"{fine_tuning_data_type}-validation-data.jsonl", bucket_name, prefix)
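
The implementation of `upload_data_to_s3` lives in the repository and is not shown here. As a rough idea of what such a helper might do, the sketch below writes a DataFrame as JSON Lines and uploads it with the boto3 S3 client. The function name `upload_jsonl_sketch`, its optional `s3_client` parameter and the `prefix/file_name` key layout are assumptions, not the repository's code.

```python
from pathlib import Path

def upload_jsonl_sketch(df, data_folder, file_name, bucket_name, prefix, s3_client=None):
    """Write a DataFrame as JSON Lines, upload it to S3 and return the s3:// URI."""
    local_path = Path(data_folder) / file_name
    # One JSON object per line, as the fine-tuning dataset format expects.
    df.to_json(local_path, orient="records", lines=True)

    if s3_client is None:
        import boto3  # imported lazily so the function can be exercised with a stub client
        s3_client = boto3.client("s3")

    key = f"{prefix}/{file_name}"  # assumed key layout
    s3_client.upload_file(str(local_path), bucket_name, key)
    return f"s3://{bucket_name}/{key}"
```
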

## Fine-tuning Llama2 13B pre-trained model

Fine-tuning an LLM is the process of providing training data to a model (a pre-trained or already fine-tuned foundation model) to improve its performance on specific use cases, which in turn creates a better user experience in generative AI applications. Custom models for Amazon Bedrock provides APIs that make fine-tuning LLMs easy and accessible: instead of writing your own training script, you only need to call the APIs with the selected base model, hyperparameters, and training and validation datasets.

> To find out more base models which can be fine-tuned with Amazon Bedrock, please check boto3 API document - [list_foundation_models()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock/client/list_foundation_models.html).
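
As a sketch of how that lookup might look in code, the helper below filters `list_foundation_models()` by customization type and collects the model IDs. The function name `fine_tunable_model_ids` is hypothetical; the client is passed in so the snippet does not need AWS credentials until it is actually called.

```python
def fine_tunable_model_ids(bedrock_client):
    """Return the IDs of foundation models that report fine-tuning support."""
    response = bedrock_client.list_foundation_models(byCustomizationType="FINE_TUNING")
    return [summary["modelId"] for summary in response.get("modelSummaries", [])]

# Usage (requires AWS credentials and Amazon Bedrock access):
# import boto3
# print(fine_tunable_model_ids(boto3.client("bedrock")))
```
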

Besides a high-quality dataset, choosing proper hyperparameters for the training process is key to achieving a high-performing fine-tuned model. Custom models for Amazon Bedrock supports three hyperparameters for Llama2: `epochCount`, `batchSize` and `learningRate`. For more details, please refer to the [Meta Llama 2 model customization hyperparameters document](https://docs.aws.amazon.com/bedrock/latest/userguide/cm-hp-meta-llama2.html). To find proper hyperparameters, it’s recommended to analyze model customization job results by plotting the training and validation metrics from the [output files](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-analyze.html). For Llama2 fine-tuning, Custom models for Amazon Bedrock provides ‘loss’ and ‘perplexity’ training metrics, plus the corresponding validation metrics when a validation dataset is provided in the model customization job.
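
For intuition about how the two metrics relate: perplexity is conventionally the exponential of the mean per-token cross-entropy loss. The sketch below assumes the loss is measured in nats and uses illustrative values, not real job output; the exact computation behind the reported metrics is not specified here, so treat this as background rather than a description of the service.

```python
import math

def perplexity_from_loss(loss):
    """Perplexity implied by a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)

# Illustrative loss values only: lower loss implies lower perplexity.
for loss in (2.0, 1.0, 0.5):
    print(f"loss={loss:.2f} -> perplexity={perplexity_from_loss(loss):.2f}")
```
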

Now, we will submit a model customization job to kick off fine-tuning of the Llama2 13B pre-trained model. There are two options to create a job: the Python boto3 SDK or the AWS console.

### Option 1: Using Python boto3 API

> In the code below, the model customization job uses an IAM role to access the training and validation datasets and to write its output to the target S3 folder. For more detail on setting up the IAM role, please refer to [Set up a service role with permissions to run a model customization job](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-iam-role.html).

In [None]:
bedrock = boto3.client(service_name="bedrock")

# the fine-tunable base model id in Bedrock
base_model_id='meta.llama2-13b-v1:0:4k'

custom_model_name = "llama2-13b-textsum-custom-model-000"
model_customization_job_name = f"{custom_model_name}-job"

fine_tuning_output_uri = f"s3://{bucket_name}/{prefix}/output"

# kick off model customization job
response = bedrock.create_model_customization_job(
    jobName=model_customization_job_name,
    customModelName=custom_model_name,
    roleArn=role_arn, # the IAM role for Amazon Bedrock to read training / validation dataset and output training process to S3 bucket(s)
    baseModelIdentifier=base_model_id,
    customizationType="FINE_TUNING",
    hyperParameters={
        "epochCount": "4",
        "batchSize": "1",
        "learningRate": "0.0004"
    },
    trainingDataConfig={
        "s3Uri": train_data_uri  # training dataset S3 uri
    },
    validationDataConfig={
        'validators': [
            {
                's3Uri': validation_data_uri  # validation dataset S3 uri
            },
        ]
    },
    outputDataConfig={
        "s3Uri": fine_tuning_output_uri  # output location for training process
    }
)
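
The job runs asynchronously, so it is useful to poll its status before moving on to the analysis. The sketch below is one way to do that with `get_model_customization_job`; the helper name `wait_for_customization_job` and the polling interval and limit are assumptions, and the client is passed in as a parameter.

```python
import time

def wait_for_customization_job(bedrock_client, job_name, poll_seconds=60, max_polls=720):
    """Poll a model customization job until it leaves the 'InProgress' state."""
    for _ in range(max_polls):
        job = bedrock_client.get_model_customization_job(jobIdentifier=job_name)
        status = job["status"]
        if status != "InProgress":
            return status  # for example "Completed", "Failed", "Stopping" or "Stopped"
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} was still in progress after {max_polls} polls")

# Usage with the names from the cell above:
# final_status = wait_for_customization_job(bedrock, model_customization_job_name)
```
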

### Option 2: Using AWS console

You may log in to your AWS account and navigate to **Amazon Bedrock**, then select **Custom models**.

![fine-tuning-in-console](./images/Bedrock-custom-models-fine-tuning.png)

To create a fine-tuning job, choose **Customize model**, then choose **Create Fine-tuning job**, and fill in fine-tuning job details:

![create-fine-tuning-job](./images/Bedrock-custom-models-create-fine-tuning-job.png)

### Analyze model customization job results

Once the job is completed, your fine-tuned model is stored securely by Amazon Bedrock. 

Before deploying / testing the model, it's highly recommended to analyze the training process to evaluate [overfitting](https://aws.amazon.com/what-is/overfitting/) or [underfitting](https://aws.amazon.com/what-is/overfitting/) risk. To do so, you may download the model customization output and analyze the training & evaluation process. (For more details - [Analyze job results](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-analyze.html))

The Amazon Bedrock custom models service doesn't visualize the training process. However, we can use `pandas` and `matplotlib` to plot the loss and perplexity results for the training and validation processes. The code below imports a custom Python function, `plot_training_process`, which accesses and plots the training process results. (For source code reference - [model_utils.plot_training_process()]())

In [None]:
from scripts.model_utils import (
    plot_training_process, 
)

training_job_name = model_customization_job_name
plot_training_process(training_job_name, include_metrics=['loss', 'perplexity'])

Below is an example of **overfitting**. When fine-tuning a Llama2 pre-trained model for 10 epochs, the model fits the training data too closely and will not generalize well to other text summarization inputs: in the diagram, the training loss keeps decreasing after epoch 4, while the validation loss increases. To avoid **overfitting**, we may set the hyperparameter `epochCount` to 4, or try a different `learningRate`. For more details on model customization, please refer to these [guidelines](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-guidelines.html).

![overfitting](./images/Bedrock-fine-tuning-analysis-overfitting.png)
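
The same reading of the plot can be done numerically: pick the epoch where the validation loss is lowest and treat later epochs as likely overfitting. The sketch below uses made-up per-epoch losses shaped like the example above; they are not the values from the actual job, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(validation_losses)
    return validation_losses.index(lowest) + 1

# Illustrative numbers only: training loss keeps falling, validation loss turns upward.
training_loss = [1.90, 1.40, 1.10, 0.90, 0.70, 0.55, 0.42, 0.33, 0.26, 0.21]
validation_loss = [1.85, 1.50, 1.30, 1.22, 1.27, 1.35, 1.46, 1.58, 1.71, 1.83]

epoch = best_epoch(validation_loss)
print(f"Validation loss is lowest at epoch {epoch}; later epochs likely overfit.")
```
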


## Custom model(s) deployment using Provisioned Throughput

Custom model deployment is managed with [Provisioned Throughput](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html), which provides a level of throughput at a fixed cost. The cost varies depending on the model that you use and the level of commitment you choose. There are three types of commitment: `1-month`, `6-month`, or `no commitment` (hourly pricing). For demo purposes, we will use the 'no commitment' option to deploy the fine-tuned Llama2 13B model. Please remember to delete the [Provisioned Throughput](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html) resource when it's no longer in use.

Now, we will purchase [Provisioned Throughput](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html) for deploying the fine-tuned Llama2 13B model. There are two options to purchase it: the Python boto3 SDK or the AWS console.

### Option 1: Using Python boto3 API

> ‘commitmentDuration’ is omitted from the API call in order to create a ‘no commitment’ Provisioned Throughput.


In [None]:
response = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName="pt-llama2-custom-model",
    modelId=custom_model_name, # the custom model name used in the fine-tuning job
    # the commitmentDuration parameter is omitted to select the 'no commitment' option.
)
pt_arn = response['provisionedModelArn']
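
A newly purchased Provisioned Throughput takes some time to become usable, and the hourly charge continues until it is deleted. The sketch below shows one way to wait for it to leave the `Creating` state and to release it once evaluation is finished. The helper names and polling defaults are assumptions; the client is passed in as a parameter.

```python
import time

def wait_until_created(bedrock_client, provisioned_model_arn, poll_seconds=60, max_polls=60):
    """Poll a Provisioned Throughput until it is no longer in the 'Creating' state."""
    for _ in range(max_polls):
        details = bedrock_client.get_provisioned_model_throughput(
            provisionedModelId=provisioned_model_arn
        )
        if details["status"] != "Creating":
            return details["status"]  # typically "InService" on success or "Failed"
        time.sleep(poll_seconds)
    raise TimeoutError("Provisioned Throughput was still being created")

def release_provisioned_throughput(bedrock_client, provisioned_model_arn):
    """Delete the Provisioned Throughput so that hourly charges stop."""
    bedrock_client.delete_provisioned_model_throughput(
        provisionedModelId=provisioned_model_arn
    )

# Usage with the names from the cell above:
# wait_until_created(bedrock, pt_arn)
# ... run the evaluation ...
# release_provisioned_throughput(bedrock, pt_arn)
```
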

### Option 2: Using AWS console

You may log in to your AWS account and navigate to **Amazon Bedrock**, then select **Provisioned Throughput**.

![bedrock-custom-model-pt](./images/Bedrock-custom-models-pt-purchase.png)

Then choose **Purchase Provisioned Throughput** and provide the details:

![bedrock-custom-model-pt-creation](./images/Bedrock-custom-models-pt-creation.png)

Now that we have created the provisioned throughput, we can use the test dataset to evaluate the model.

## LLMs model evaluation

Model evaluation is a critical step in understanding how well a (fine-tuned) foundation model fits your use case. We will focus on the ‘Accuracy’ metric, which evaluates how closely the model output matches the reference summary (i.e. the ground truth). In general, there are two types of evaluation: human evaluation and algorithmic evaluation. Algorithmic evaluation can use lexical metrics, e.g. [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) (Recall-Oriented Understudy for Gisting Evaluation), or semantic similarity metrics, e.g. [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore).

For the text summarization task, we will measure ‘Accuracy’ using the F1 measure of the [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore) metric. [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore) suits the task because it has been shown to correlate with human judgement in sentence-level and system-level evaluation.

Now, let’s walk through the code for evaluating the fine-tuned Llama2 13B model. We will then do the same for the Llama2 13B Chat base model provided by Amazon Bedrock. Last, we will compare the F1 measure of the [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore) metric between the two.

To make Amazon Bedrock model inferencing code reusable, we prepare a Python function `completion()`:

In [None]:
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

def completion(
        model_id:str, 
        prompt:str, 
        temperature:float=0.5, 
        top_p:float=0.9,
        max_tokens:int=256):
    body = {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_gen_len": max_tokens
    }

    response = bedrock_runtime.invoke_model(
        contentType='application/json',
        accept='application/json',
        modelId=model_id, # on-demand base model ID or Provisioned Throughput ARN
        body=json.dumps(body)
    )
    response_body = json.loads(response["body"].read())
    generation = response_body["generation"]
    return generation

Next, we will choose a random record from test dataset and use it to compare the model output between the fine-tuned model and Llama2 13B Chat base model.

In [None]:
row = test_df.sample(n=1, random_state=23).iloc[0]

dialogue = row['prompt']
reference_summary = row['completion']

print("-" * 20)
print(dialogue)
print("-" * 20)
print(reference_summary)

To get model output from the fine-tuned model using the defined prompt template:

In [None]:

prompt = llama2_fine_tuning_prompt_template.format(context=dialogue)

# pt_arn is the Provisioned Throughput ARN for the fine-tuned model
ft_model_input = completion(pt_arn, prompt)

print(ft_model_input)

Next, we get the model output from the Llama2 13B Chat base model using on-demand invocation. Please note that Llama2 prompt tags are used to improve the output:

In [None]:
llama2_chat_specific_prompt_template = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant, which is good at reading dialogues and generate concise summary.

<</SYS>>

Please read the below dialogue and provide a concise summary:

[/INST]
Dialogue:
{context}

Summary:
"""

prompt = llama2_chat_specific_prompt_template.format(context=dialogue)
llama2_13b_chat_base_model_id = "meta.llama2-13b-chat-v1"
llama2_chat_model_output = completion(llama2_13b_chat_base_model_id, prompt)

print(llama2_chat_model_output)

Having read the output from both models, we now evaluate them using the F1 measure of the BERTScore metric:

In [None]:
import evaluate as hf_evaluate

hf_bertscore = hf_evaluate.load("bertscore")

def get_bert_score(reference_output: str, model_output: str, lang: str="en", model_type: str="roberta-large") -> dict:
    result = hf_bertscore.compute(
        predictions=[model_output],
        references=[reference_output], 
        lang=lang, 
        model_type=model_type
    )
    return result

fine_tuned_bertscore = get_bert_score(reference_summary, ft_model_input)
print(f"Model output from fine-tuned Llama2 13B model - BERTScore: {fine_tuned_bertscore['f1'][0]}")

chat_bertscore = get_bert_score(reference_summary, llama2_chat_model_output)
print(f"Model output from Llama2 13B Chat model - BERTScore: {chat_bertscore['f1'][0]}")

The comparison result above indicates that the fine-tuned Llama2 13B model performs better than the Llama2 13B Chat model on text summarization for the selected record (BERTScore `0.9293031692504883` vs `0.8647099137306213`).

> **Tip**: The prompt template we used for invoking the Llama2 13B Chat model may not be optimal, so further prompt engineering may yield better results. Besides BERTScore, you may also consider other metrics (e.g. ROUGE or BLEU) for your use cases.
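
A single record is a narrow basis for comparison, so one natural extension is to average the F1 score over the whole test set. The sketch below is a generic helper for that; the name `mean_f1` and its callable parameters are hypothetical, and the commented usage shows how the notebook's `completion()` and `get_bert_score()` could be plugged in. Note that each record triggers one model invocation, which has a cost.

```python
from statistics import mean

def mean_f1(records, generate, score_f1):
    """Average an F1-style score over (dialogue, reference_summary) pairs.

    generate(dialogue) returns the model output for a dialogue;
    score_f1(reference, output) returns a single float score.
    """
    scores = [score_f1(reference, generate(dialogue)) for dialogue, reference in records]
    return mean(scores)

# Usage with the notebook's objects (assumes they are defined as above):
# records = list(zip(test_df["prompt"], test_df["completion"]))
# ft_mean = mean_f1(
#     records,
#     lambda d: completion(pt_arn, llama2_fine_tuning_prompt_template.format(context=d)),
#     lambda ref, out: get_bert_score(ref, out)["f1"][0],
# )
```
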

## Conclusion

In this walkthrough, we fine-tuned the Llama2 13B pre-trained model on a text summarization task and covered the key steps: data preparation, model customization, training process analysis, and custom model deployment in Amazon Bedrock. At the end, we evaluated the fine-tuned model using a semantic similarity metric, BERTScore, which provides a quantitative measurement on the test dataset. Overall, you can leave the heavy-lifting engineering effort to Custom Models for Amazon Bedrock and focus on data preparation, training process analysis and hyperparameter tuning to achieve a better fine-tuned model, followed by model evaluation before moving your fine-tuned model to production.

## Reference

* [Amazon Bedrock Developer Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html)