# Fine-tune and Evaluate Meta Llama2 13B model provided by Amazon Bedrock: End-to-End

In this notebook we demonstrate using the Boto3 SDK to fine-tune and provision the [Llama2 13B](https://ai.meta.com/llama/get-started/) model in Bedrock. You can also do this through the Bedrock Console.

<div class="alert alert-block alert-warning">
<b>Warning:</b> This module cannot be executed in Workshop Studio Accounts, and you will have to run this notebook in your own account.
</div>

### A Summarization Use Case
In this notebook, we build an end-to-end workflow for fine-tuning and evaluating Foundation Models (FMs) in Amazon Bedrock. We choose [Meta Llama 2 13B](https://ai.meta.com/llama/) as the FM to customize through fine-tuning. We then create provisioned throughput for the fine-tuned model, test invocation of the provisioned model, and finally evaluate the fine-tuned model using [fmeval](https://github.com/aws/fmeval) on summarization accuracy metrics: METEOR, ROUGE, and BERTScore. These metrics are defined in the `Evaluate the Provisioned Custom Model` section below.

> *This notebook should work well with the **`Data Science 3.0`**, **`Python 3`**, and **`ml.c5.2xlarge`** kernel in SageMaker Studio*

## Prerequisites

 - Make sure you have executed the `00_setup.ipynb` notebook.
 - Make sure you are using the same kernel and instance as the `00_setup.ipynb` notebook.

<div class="alert alert-block alert-warning">
<b>Warning:</b> This notebook trains the foundation model, and there is a cost associated with fine-tuning. We also create provisioned throughput to test the fine-tuned model. Please make sure to delete the provisioned throughput as described in the last section of the notebook; otherwise you will be charged for it even if you are not using it.
</div>

## Setup

In [None]:
## Fetch variables stored by the `00_setup.ipynb` notebook.
%store -r role_arn
%store -r s3_train_uri
%store -r s3_validation_uri
%store -r s3_test_uri
%store -r bucket_name

In [None]:
import pprint
pprint.pp(role_arn)
pprint.pp(s3_train_uri)
pprint.pp(s3_validation_uri)
pprint.pp(s3_test_uri)
pprint.pp(bucket_name)

In [None]:
# Install the fmeval package for foundation model evaluation
!rm -Rf ~/.cache/pip/*
!pip install -qU fmeval
!pip install -qU ipywidgets

In [None]:
# restart kernel for packages to take effect
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import warnings
warnings.filterwarnings('ignore')
import json
import os
import sys
import boto3

In [None]:
session = boto3.session.Session()
region = session.region_name
sts_client = boto3.client('sts')
s3_client = boto3.client('s3')
aws_account_id = sts_client.get_caller_identity()["Account"]
bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

In [None]:
test_file_name = "test-cnn-10.jsonl"
data_folder = "fine-tuning-datasets"

## Create the fine-tuning job
<div class="alert alert-block alert-info">
<b>Note:</b> The fine-tuning job takes around 60 minutes to complete with 5K records.</div>

Meta Llama2 customization hyperparameters:
- `epochCount`: The number of epochs, that is, passes through the entire training dataset. It takes an integer value in the range 1-10, with a default of 2.
- `batchSize`: The number of samples processed before the model parameters are updated. It takes an integer value in the range 1-64, with a default of 1.
- `learningRate`: The rate at which model parameters are updated after each batch. It takes a float value between 0.0 and 1.0, with a default of 1.00E-5.
- `learningRateWarmupSteps`: The number of iterations over which the learning rate is gradually increased to the specified rate. It takes an integer value between 0 and 250, with a default of 5.

For guidance on setting hyperparameters, refer to the guidelines provided [here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-guidelines.html).
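
Before submitting a job, it can help to check the hyperparameters against the ranges listed above. The following is a minimal sketch under stated assumptions: `HYPERPARAMETER_RANGES` and `validate_hyper_parameters` are hypothetical helpers (not part of Boto3 or Bedrock), the key names follow the `hyper_parameters` dictionary used in this notebook, and the ranges are copied from the list above rather than from the service itself.

```python
# Ranges copied from the hyperparameter list in this notebook; treat them as
# the notebook's stated limits, not as an authoritative API contract.
HYPERPARAMETER_RANGES = {
    "epochCount": (1, 10),
    "batchSize": (1, 64),
    "learningRate": (0.0, 1.0),
    "learningRateWarmupSteps": (0, 250),
}


def validate_hyper_parameters(hyper_parameters):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, raw_value in hyper_parameters.items():
        if name not in HYPERPARAMETER_RANGES:
            problems.append(f"{name}: unknown hyperparameter")
            continue
        low, high = HYPERPARAMETER_RANGES[name]
        try:
            # Values are passed to the job as strings, so parse them first.
            value = float(raw_value)
        except (TypeError, ValueError):
            problems.append(f"{name}: {raw_value!r} is not a number")
            continue
        if not low <= value <= high:
            problems.append(f"{name}: {raw_value} is outside [{low}, {high}]")
    return problems


example = {"epochCount": "2", "batchSize": "1", "learningRate": "0.00005"}
print(validate_hyper_parameters(example))  # an empty list means no problems found
```

A check like this only catches values outside the documented ranges; the service may still reject a job for other reasons.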

In [None]:
from datetime import datetime
ts = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")


# Select the foundation model you want to customize (model IDs are returned in the "modelId" field of bedrock.list_foundation_models())
base_model_id = "meta.llama2-13b-v1:0:4k"

# Select the customization type from "FINE_TUNING" or "CONTINUED_PRE_TRAINING". 
customization_type = "FINE_TUNING"

# Specify the roleArn for your customization job
customization_role = role_arn

# Create a customization job name
customization_job_name = f"llama2-finetune-sm-test-model-{ts}"

# Create a customized model name for your fine-tuned Llama2 model
custom_model_name = f"llama2-finetune-{ts}"

# Define the hyperparameters for fine-tuning Llama2 model
hyper_parameters = {
        "epochCount": "2",
        "batchSize": "1",
        "learningRate": "0.00005",
    }

# Specify your data path for training, validation(optional) and output
training_data_config = {"s3Uri": s3_train_uri}

# Validation data is optional. If you have no validation dataset, remove this
# section and the validationDataConfig argument in the call below.
validation_data_config = {
        "validators": [{
            # "name": "validation",
            "s3Uri": s3_validation_uri
        }]
    }

output_data_config = {"s3Uri": f's3://{bucket_name}/outputs/output-{custom_model_name}'}

# Create the customization job
bedrock.create_model_customization_job(
    customizationType=customization_type,
    jobName=customization_job_name,
    customModelName=custom_model_name,
    roleArn=customization_role,
    baseModelIdentifier=base_model_id,
    hyperParameters=hyper_parameters,
    trainingDataConfig=training_data_config,
    validationDataConfig=validation_data_config,
    outputDataConfig=output_data_config
)

## Check Customization Job Status

In [None]:
import time
fine_tune_job = bedrock.get_model_customization_job(jobIdentifier=customization_job_name)["status"]
print(fine_tune_job)

while fine_tune_job == "InProgress":
    time.sleep(60)
    fine_tune_job = bedrock.get_model_customization_job(jobIdentifier=customization_job_name)["status"]
    print(fine_tune_job)
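
The loop above waits for as long as the job reports `InProgress`. A bounded variant is sketched below, assuming a hypothetical helper `wait_while` that takes any zero-argument status function; a fake status sequence stands in for the Bedrock call so the sketch runs without AWS access.

```python
import time


def wait_while(get_status, in_progress_status, poll_seconds=60,
               timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll get_status() until it stops returning in_progress_status.

    Raises TimeoutError if the status has not changed within timeout_seconds.
    """
    waited = 0
    status = get_status()
    while status == in_progress_status:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Stand-in for the Bedrock call so the sketch runs without AWS access.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_while(lambda: next(fake_statuses), "InProgress",
                          poll_seconds=0)
print(final_status)  # prints: Completed
```

In this notebook, the status function could be `lambda: bedrock.get_model_customization_job(jobIdentifier=customization_job_name)["status"]`. It is worth checking that the returned status is `Completed` before continuing, because a job can also end in a failed or stopped state.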

## Retrieve Custom Model
Once the customization job is finished, you can check your existing custom model(s) and retrieve the modelArn of your fine-tuned Llama2 model.

In [None]:
# You can list your custom models using the command below
bedrock.list_custom_models()

In [None]:
# retrieve the modelArn of the fine-tuned model
custom_model_id = bedrock.get_custom_model(modelIdentifier=custom_model_name)['modelArn']

## Create Provisioned Throughput
<div class="alert alert-block alert-info">
<b>Note:</b> Creating provisioned throughput takes around 20-30 minutes to complete.</div>

You need to create provisioned throughput to be able to evaluate the model performance. You can do so through the [console](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-cap-console.html) or with the following API call.

In [None]:
# Create the provisioned throughput and retrieve the provisioned model ARN
provisioned_model_id = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    # create a name for your provisioned throughput model
    provisionedModelName='test-model-v1-001',
    modelId=custom_model_id
)['provisionedModelArn']

In [None]:
# check provisioned throughput job status
import time
status_provisioning = bedrock.get_provisioned_model_throughput(provisionedModelId=provisioned_model_id)['status']
while status_provisioning == 'Creating':
    time.sleep(60)
    status_provisioning = bedrock.get_provisioned_model_throughput(provisionedModelId=provisioned_model_id)['status']
    print(status_provisioning)

## Invoke the Provisioned Custom Model
Invoke the provisioned custom model. Note that you need to complete the previous step (creating provisioned throughput) before proceeding.

You can replace `test_prompt` below with prompts that are more similar to your fine-tuning dataset, which helps to check whether the fine-tuned model is performing as expected.

In [None]:
# Provide the prompt text 
test_file_path = f'{data_folder}/{test_file_name}'
with open(test_file_path) as f:
    lines = f.read().splitlines()

In [None]:
test_prompt = json.loads(lines[0])['prompt']
reference_summary = json.loads(lines[0])['completion']
print(test_prompt)
print()
print(reference_summary)

Construct the model input in the format required by the Llama2 model, following the instructions [here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html).
Pay attention to the "Model invocation request body field" section.

In [None]:
body = json.dumps({
    "prompt": test_prompt,
    # specify the parameters as needed
    "max_gen_len": 200,
    "temperature": 0.4,
    "top_p": 0.3,
})

# provide the modelId of the provisioned custom model
modelId = provisioned_model_id
accept = 'application/json'
contentType = 'application/json'

# invoke the provisioned custom model
response = bedrock_runtime.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)

response_body = json.loads(response.get('body').read())
print(response_body)
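
The printed response body is a dictionary. As a rough sketch of how to pull out the generated text, the example below parses a made-up response with the fields `generation`, `prompt_token_count`, `generation_token_count`, and `stop_reason`; the sample text and token counts are illustrative only, and `extract_generation` is a hypothetical helper.

```python
import json

# Illustrative response body; the text and token counts are made up.
sample_response_body = json.loads(
    '{"generation": " A short summary.", "prompt_token_count": 120, '
    '"generation_token_count": 5, "stop_reason": "stop"}'
)


def extract_generation(response_body):
    """Return the generated text and whether generation hit the length limit."""
    text = response_body.get("generation", "").strip()
    # A stop_reason of "length" means the output was cut off at max_gen_len.
    truncated = response_body.get("stop_reason") == "length"
    return text, truncated


summary, was_truncated = extract_generation(sample_response_body)
print(summary)
print("truncated:", was_truncated)
```

If the summary is reported as truncated, raising `max_gen_len` in the request body is the first thing to try.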

## Evaluate the Provisioned Custom Model
We will use the [fmeval](https://github.com/aws/fmeval) as the framework to create an evaluation workflow for our fine-tuned model.

FMEval is a library to evaluate Large Language Models (LLMs) and select the best LLM for your use case. The library can help evaluate LLMs for the following tasks:

- Open-ended generation - the production of natural language as a response to general prompts that do not have a pre-defined structure.
- Text summarization - summarizing the most important parts of a text, shortening a text while preserving its meaning.
- Question Answering - the generation of a relevant and accurate response to a question.
- Classification - assigning a category, such as a label or score, to text based on its content.

For our dataset we will use the `Text summarization` metrics, namely `METEOR`, `ROUGE`, and `BERTScore`.

- `ROUGE`: compares an automatically produced summary or translation against one or more human-produced reference summaries or translations. ROUGE scores range between 0 and 1, with higher scores indicating higher similarity between the automatically produced summary and the reference. [Wikipedia link](https://en.wikipedia.org/wiki/ROUGE_(metric))
- `BERTScore`: calculates the similarity between a summary and reference texts based on the outputs of BERT (Bidirectional Encoder Representations from Transformers), a powerful language model. [Medium article link](https://haticeozbolat17.medium.com/bertscore-and-rouge-two-metrics-for-evaluating-text-summarization-systems-6337b1d98917)
- `METEOR`: computes a score that represents the semantic alignment and similarity between the reference (original content) and the candidate (summary) sentences. It takes into account both exact word matches and similar words that preserve the same meaning. [MDPI documentation](https://www.mdpi.com/2673-2688/4/4/49#:~:text=METEOR)
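
To make the idea of overlap-based scoring concrete, here is a simplified sketch of a ROUGE-1 style F1 score: the unigram overlap between a candidate and a reference. It is not the implementation used by `fmeval`; it assumes plain whitespace tokenization and lowercasing, with no stemming, and `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased texts."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((candidate_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


# 5 of the 6 tokens overlap on each side, so the score is about 0.83.
print(rouge1_f1("the cat sat on the mat", "the cat is on the mat"))
```

Scores of this kind reward shared wording, which is why the evaluation also reports BERTScore and METEOR, which give credit for similar meaning expressed in different words.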


### Evaluation Dataset

In [None]:
import glob

# Check that the dataset file to be used by the evaluation is present
if not glob.glob("./fine-tuning-datasets/test-cnn-10.jsonl"):
    print("ERROR - please make sure the file ./fine-tuning-datasets/test-cnn-10.jsonl exists.")

### Model Evaluation Setup

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

### Data Config Setup

Below, we create a DataConfig for the local dataset file, `test-cnn-10.jsonl`.
- `dataset_name` is just an identifier for your own reference
- `dataset_uri` is either a local path to a file or an S3 URI
- `dataset_mime_type` is the MIME type of the dataset. Currently, JSON and JSON Lines are supported.
- `model_input_location` and `target_output_location` are JMESPath queries used to find the model inputs and target outputs within the dataset. The values that you specify here depend on the structure of the dataset itself. In `test-cnn-10.jsonl`, each record stores the model input under "prompt" and the target output under "completion".
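
To illustrate what the two location settings select, the sketch below reads two made-up JSON Lines records shaped like the fine-tuning data and looks up the same keys. It assumes that, for a plain top-level key, a JMESPath query behaves like a dictionary lookup; it does not call `fmeval` or a JMESPath library.

```python
import json

# Two made-up records in the same shape as the fine-tuning data:
# one JSON object per line with "prompt" and "completion" keys.
jsonl_text = (
    '{"prompt": "Summarize: first article text", "completion": "first summary"}\n'
    '{"prompt": "Summarize: second article text", "completion": "second summary"}\n'
)

model_input_location = "prompt"
target_output_location = "completion"

pairs = []
for line in jsonl_text.splitlines():
    record = json.loads(line)
    # For a plain top-level key, the query resolves like a dictionary lookup.
    pairs.append((record[model_input_location], record[target_output_location]))

for model_input, target_output in pairs:
    print(model_input, "->", target_output)
```

If your dataset nests these fields, the location values would need to be JMESPath expressions that reach into the nested structure.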

In [None]:
path = "fine-tuning-datasets"
config = DataConfig(
    dataset_name=test_file_name,
    dataset_uri=test_file_path,
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="prompt",
    target_output_location="completion"
)

### Config Bedrock Model Runner

In [None]:
bedrock_model_runner = BedrockModelRunner(
    model_id=provisioned_model_id,
    # 'generation' is the response field that holds the text generated by the fine-tuned, provisioned model
    # (see "Model invocation response body field" at https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html)
    output='generation',
    # content_template is the request body format of the Llama2 model
    content_template='{"prompt": $prompt, "max_gen_len": 50, "temperature": 0.1, "top_p": 0.3}',
)
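
The `content_template` above leaves `$prompt` unquoted. The sketch below shows one way such a template can be turned into a valid JSON request body, using Python's `string.Template` and a JSON-encoded prompt. This illustrates the idea only and is not a description of `fmeval` internals; `build_request_body` is a hypothetical helper.

```python
import json
from string import Template

content_template = '{"prompt": $prompt, "max_gen_len": 50, "temperature": 0.1, "top_p": 0.3}'


def build_request_body(template, prompt):
    """Fill $prompt with the JSON-encoded prompt and return the request body."""
    # json.dumps adds the surrounding quotes and escapes quotes and newlines,
    # which is why the template has no quotes around $prompt.
    return Template(template).substitute(prompt=json.dumps(prompt))


body = build_request_body(content_template, 'Summarize:\n"Hello" world')
parsed = json.loads(body)  # parsing succeeds, so the body is valid JSON
print(parsed["prompt"])
print(parsed["max_gen_len"])
```

Encoding the prompt before substitution keeps the body valid even when the prompt contains quotes or line breaks, as article text usually does.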

### Create Model Evaluation Job

In this section, we create an evaluation job using the `SummarizationAccuracy` class from the `fmeval` package, which reports `METEOR`, `ROUGE`, and `BERTScore`.
Please note that this is a sample notebook in which we fine-tuned the model with 5K records for 2 epochs at a learning rate of `0.00005`. When you fine-tune on your own dataset with a different number of records and epochs, you may see different results from those in this notebook.

In [None]:
# prompt_template is a template for the prompt. To see how the prompt affects the performance of your fine-tuned model, you can experiment with
# this parameter. E.g. for an Anthropic Claude model, this could be prompt_template_txt="Human: $feature\n\nAssistant:\n"
# The value "$feature" alone is a placeholder for when you have nothing to add on top of the "prompt" field in your fine-tuning data.
prompt_template_txt = "$feature"

# call the SummarizationAccuracy class to create the evaluation job with METEOR, ROUGE, and BERT scores
eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config, prompt_template=prompt_template_txt, save=True)

In [None]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

In [None]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd

data = []

# We obtain the path to the results file from "output_path" in the cell above
with open(f"/tmp/eval_results/summarization_accuracy_test-cnn-10.jsonl.jsonl", "r") as file:
    for line in file:
        data.append(json.loads(line))
df = pd.DataFrame(data)
# Each record's "scores" list has one entry per metric; index 2 selects the third entry.
df['eval_algo'] = df['scores'].apply(lambda x: x[2]['name'])
df['eval_score'] = df['scores'].apply(lambda x: x[2]['value'])
df[0:10]
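
The cell above picks one metric by its position in the `scores` list. An alternative is to group by metric name and average each metric across all records, sketched below in plain Python. The records are made up and only mirror the `name`/`value` structure read above; the metric names and `mean_score_by_name` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up records shaped like the lines of the results file: each record has
# a "scores" list of {"name": ..., "value": ...} entries. Values are illustrative.
records = [
    {"scores": [{"name": "meteor", "value": 0.25},
                {"name": "rouge", "value": 0.5},
                {"name": "bertscore", "value": 0.75}]},
    {"scores": [{"name": "meteor", "value": 0.75},
                {"name": "rouge", "value": 0.5},
                {"name": "bertscore", "value": 0.25}]},
]


def mean_score_by_name(records):
    """Average each named score across all records."""
    values_by_name = defaultdict(list)
    for record in records:
        for score in record["scores"]:
            values_by_name[score["name"]].append(score["value"])
    return {name: mean(values) for name, values in values_by_name.items()}


print(mean_score_by_name(records))
```

Looking scores up by name avoids depending on the order in which the metrics appear in each record.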

## Clean up
<div class="alert alert-block alert-warning">
<b>Warning:</b> Please make sure to delete the provisioned throughput, as costs are incurred while it is left running, even if you are not using it.
</div>

In [None]:
# delete the provisioned throughput
bedrock.delete_provisioned_model_throughput(provisionedModelId=provisioned_model_id)