# Model deployment of Fine-Tuned Llama 3.1 8B in Amazon Bedrock

---

In this demo notebook, we demonstrate how to deploy the fine-tuned model from the notebook [01_llama-3.1-8b-qlora-sft.ipynb](./01_llama-3.1-8b-qlora-sft.ipynb) in Amazon Bedrock by using the [Custom Model Import](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html) functionality.

---

JupyterLab Instance Type: ml.t3.medium

Install the required libraries, including the Hugging Face libraries, and restart the kernel.

In [None]:
%pip install -r requirements.txt

In [None]:
%pip install -q -U boto3
%pip install -q -U botocore
%pip install -q -U Levenshtein
%pip install -q -U scikit-learn==1.5.1

## Unzip generated model.tar.gz

Amazon Bedrock Custom Model Import provides two different import options:
1. From S3 bucket
2. From an Amazon SageMaker Model

In the following cells, we upload to S3 the uncompressed model generated in the notebook [01_llama-3.1-8b-qlora-sft.ipynb](./01_llama-3.1-8b-qlora-sft.ipynb).

### Download model

In [None]:
import boto3
import sagemaker

In [None]:
s3_client = boto3.client("s3")
sagemaker_session = sagemaker.Session()

In [None]:
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

bucket_name = sagemaker_session.default_bucket()
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}-auto"

In [None]:
def get_last_job_name(job_name_prefix):
    import boto3
    sagemaker_client = boto3.client('sagemaker')
    
    search_response = sagemaker_client.search(
        Resource='TrainingJob',
        SearchExpression={
            'Filters': [
                {
                    'Name': 'TrainingJobName',
                    'Operator': 'Contains',
                    'Value': job_name_prefix
                },
                {
                    'Name': 'TrainingJobStatus',
                    'Operator': 'Equals',
                    'Value': "Completed"
                }
            ]
        },
        SortBy='CreationTime',
        SortOrder='Descending',
        MaxResults=1)
    
    return search_response['Results'][0]['TrainingJob']['TrainingJobName']

In [None]:
job_name = get_last_job_name(job_prefix)

job_name
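`get_last_job_name` indexes `Results[0]` directly, so it raises an `IndexError` when no completed job matches the prefix. A minimal sketch of a more explicit lookup is shown below. It works on a response dictionary shaped like the one the function above reads; `extract_latest_job_name` and the sample job name are hypothetical.

```python
def extract_latest_job_name(search_response):
    """Return the first training job name, or fail with a readable message."""
    results = search_response.get("Results", [])
    if not results:
        raise LookupError("No completed training job matched the prefix")
    return results[0]["TrainingJob"]["TrainingJobName"]


# Hypothetical response with the same shape the function above reads.
sample_response = {
    "Results": [
        {"TrainingJob": {"TrainingJobName": "train-Meta-Llama-3-1-8B-Instruct-auto-example"}}
    ]
}

latest = extract_latest_job_name(sample_response)
print(latest)
```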

Download the fine-tuned PEFT model

In [None]:
s3_client.download_file(bucket_name, f"{job_name}/{job_name}/output/model.tar.gz", "model.tar.gz")

In [None]:
! rm -rf ./model && mkdir -p ./model && tar -xf model.tar.gz -C ./model

Upload uncompressed content

In [None]:
import boto3
import os
import sagemaker

In [None]:
s3 = boto3.resource('s3')
sagemaker_session = sagemaker.Session()

In [None]:
bucket_name = sagemaker_session.default_bucket()

In [None]:
for root, dirs, files in os.walk('./model'):
    for file in files:
        file_path = os.path.join(root, file)
        print(f'Sending {file_path} to S3...')
        s3.Bucket(bucket_name).upload_file(
            file_path,
            f"{job_name}/{job_name}/model/{file}"
        )
        print(f'{file_path} sent successfully to S3')
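The loop above builds each S3 key from the file name only, so files in subdirectories of `./model` would all land at the same level. If the archive contains nested folders, one option is to derive the key from the path relative to the model directory. The sketch below assumes that layout is wanted; `s3_key_for` and the example paths are hypothetical.

```python
import os
import posixpath


def s3_key_for(local_path, local_root, key_prefix):
    """Build an S3 key that keeps the file's path relative to local_root."""
    relative = os.path.relpath(local_path, local_root)
    # S3 keys use forward slashes regardless of the local OS separator.
    parts = relative.split(os.sep)
    return posixpath.join(key_prefix, *parts)


# Hypothetical nested file inside the extracted model directory.
key = s3_key_for(
    os.path.join("model", "tokenizer", "vocab.json"),
    "model",
    "job/job/model",
)
print(key)
```

The returned key could then be passed as the second argument of `upload_file` in place of the f-string used above.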

Remove resources

In [None]:
! rm -rf model.tar.gz ./model

## Deploy the Fine-Tuned model

Next, follow the steps in the Custom Model Import documentation linked below to import this model. The screenshots show the console settings used in this example.

https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html

![Model Import](./images/bedrock-llama31-import.png)

![Model Import Job](./images/bedrock-llama31-job-name.png)

![Model location](./images/bedrock-llama31-model-settings.png)

![Bedrock role](./images/bedrock-llama31-role.png)
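The console steps shown in the screenshots can also be scripted with the `create_model_import_job` operation of the `bedrock` boto3 client. The sketch below is a minimal version under stated assumptions: the job name, model name, role ARN, and S3 URI are placeholders, and `build_import_job_request` and `start_import_job` are hypothetical helper names. The role must allow Amazon Bedrock to read the S3 prefix.

```python
def build_import_job_request(job_name, model_name, role_arn, s3_uri):
    """Assemble keyword arguments for bedrock.create_model_import_job."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }


def start_import_job(request):
    """Submit the job. Requires AWS credentials and Bedrock permissions."""
    import boto3  # imported lazily so the builder above also works offline

    # Control-plane client ("bedrock"), not the "bedrock-runtime" client.
    bedrock = boto3.client("bedrock")
    response = bedrock.create_model_import_job(**request)
    return response["jobArn"]


# Placeholder values for illustration only.
request = build_import_job_request(
    job_name="llama31-import-job",
    model_name="llama31-8b-fine-tuned",
    role_arn="arn:aws:iam::111122223333:role/BedrockImportRole",
    s3_uri="s3://my-bucket/my-job/my-job/model/",
)
print(request["modelDataSource"])
```

Calling `start_import_job(request)` submits the job. Once it completes, the imported model ARN is the value to use for `fine_tuned_model_id` later in this notebook.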

## Model Evaluation - Fine-tuned model vs. Base model

We are going to evaluate the fine-tuned model and the base model on two metrics:
* BLEU Score
* Accuracy score with Levenshtein distance

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of machine-translated text by measuring its n-gram overlap with a reference text. Here it compares each generated answer against the target answer.
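To make the idea concrete, here is a simplified pure-Python sketch of a BLEU-style score: clipped n-gram precision combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only and not the NLTK implementation used later in this notebook. It uses unigrams and bigrams (`max_n=2`), whereas NLTK's `sentence_bleu` defaults to 4-grams. The sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def simple_bleu(reference, hypothesis, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize hypotheses that are shorter than the reference.
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_mean)


reference = "the engine may stall".split()
identical_score = simple_bleu(reference, reference)
partial_score = simple_bleu(reference, "the engine stalls".split())
print(identical_score, partial_score)
```

An identical answer scores 1, and an answer sharing no n-gram of some order with the reference scores 0. This is why short answers can receive a BLEU of 0 even when they look close to the target.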


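Normalized Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn a generated answer into the target answer, divided by the length of the longer string. It is used here as an accuracy measure of how close each generated answer is to the target.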
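The following pure-Python sketch shows how the edit distance and the normalized score can be computed without any third-party package. It is for illustration only, since the notebook itself relies on the `Levenshtein` library. The function names are hypothetical, and the score follows the same `1 - distance / max_len` convention as the accuracy cell further down.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Return 1 - distance / longer length, so 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


distance = levenshtein_distance("kitten", "sitting")
similarity = normalized_similarity("kitten", "sitting")
print(distance, similarity)
```

With this convention a score of 1 means the two strings are identical and a score of 0 means every character differs.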

### Define Amazon Bedrock client

In [None]:
import boto3

In [None]:
bedrock_client = boto3.client('bedrock-runtime')

In [None]:
fine_tuned_model_id = "<IMPORTED_MODEL_ARN>"
bedrock_model_id = "meta.llama3-70b-instruct-v1:0"

### Create an evaluation dataset

In [None]:
import pandas as pd

df = pd.read_csv("./sample_dataset.csv")

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1, random_state=42)
train, valid = train_test_split(train, test_size=10, random_state=42)

print("Number of validation elements: ", len(valid))
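The two calls above mix a fractional `test_size` (0.1) with an absolute one (10 rows). The arithmetic can be sketched as follows, assuming scikit-learn rounds a fractional test size up and assigns the remaining rows to the training split. `split_sizes` is a hypothetical helper and the row count is a made-up example, not the size of `sample_dataset.csv`.

```python
import math


def split_sizes(n_rows, test_fraction=0.1, n_validation=10):
    """Return (train, validation, test) sizes for the two-step split."""
    n_test = math.ceil(test_fraction * n_rows)
    n_remaining = n_rows - n_test
    if n_validation >= n_remaining:
        raise ValueError("Not enough rows left for a validation split")
    return n_remaining - n_validation, n_validation, n_test


# Hypothetical dataset of 1000 rows.
sizes = split_sizes(1000)
print(sizes)
```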

### Answer generation

In [None]:
import json
import time

evaluation_set = []

for index, row in valid.iterrows():
    print("Example ", index)

    ## Generate response with the fine-tuned model

    prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>These are the information related to the defect:\nManufacturer: {row['MFGNAME']}\nComponent: {row['COMPNAME']}\nDescription of the defect: {row['DESC_DEFECT']}\n\n\nWhat are the consequences of defect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

    body = {
        "prompt": prompt,
        "max_gen_len": 512,
        "temperature": 0.1,
        "top_p": 0.9,
    }

    start_time = time.time()

    response = bedrock_client.invoke_model(
        modelId=fine_tuned_model_id,
        body=json.dumps(body)
    )

    end_time = time.time()

    print(f"Generated response with fine-tuned model: {end_time - start_time:.6f} seconds")

    response_fine_tuned = json.loads(response["body"].read())

    response_fine_tuned = response_fine_tuned["generation"]

    print(response_fine_tuned)

    ## Generate response with the base model

    prompt = f"""
        These are the information related to the defect:
        Manufacturer: {row['MFGNAME']}
        Component: {row['COMPNAME']}
        Description of a defect:
        {row['DESC_DEFECT']}
    """

    messages = [
        {
            "role": "user",
            "content": [{"text": prompt},
                        {"text": "What are the consequences?"}]
        }
    ]

    start_time = time.time()

    response = bedrock_client.converse(
        modelId=bedrock_model_id,
        messages=messages,
        inferenceConfig={
            "temperature": 0.2,
            "topP": 0.9,
            "maxTokens": 200
        }
    )

    end_time = time.time()

    print(f"Generated response with base model: {end_time - start_time:.6f} seconds")

    response_base = response['output']['message']["content"][0]["text"]
    print(response_base)

    evaluation_set.append({
        "index": index,
        "target_answer": row["CONEQUENCE_DEFECT"],
        "fine_tuned_answer": response_fine_tuned,
        "base_answer": response_base
    })

    print("******************")

with open("llama_32_1b_evaluation_dataset.json", "w") as f:
    json.dump(evaluation_set, f, indent=4)
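The loop above calls `invoke_model` once per row without any retry. A call to an imported model may fail transiently, for example while the model is still being loaded, so a small retry wrapper with exponential backoff can help. The sketch below is generic: `call_with_retry` and its parameters are hypothetical, and which exception type to retry on is an assumption to check against the error you actually observe.

```python
import time


def call_with_retry(call, retryable=(Exception,), attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a model call that fails twice before succeeding.
calls = {"count": 0}


def flaky_invoke():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("model not ready")
    return {"generation": "ok"}


delays = []  # record the waits instead of sleeping
result = call_with_retry(flaky_invoke, retryable=(RuntimeError,), sleep=delays.append)
print(result, delays)
```

In the loop, the call could be wrapped as `call_with_retry(lambda: bedrock_client.invoke_model(modelId=fine_tuned_model_id, body=json.dumps(body)))`. Note that the retries would then be included in the measured latency.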

### BLEU Score evaluation

In [None]:
import nltk
import re

def clean_array(string):
    filtered_words = []
    
    for element in string:
        cleaned_word = re.sub(r'[^a-zA-Z]', '', element)
        if cleaned_word:
            filtered_words.append(cleaned_word)
    
    return filtered_words

def calculate_score(index, reference, hp_1, hp_2):
    reference_split = clean_array(reference.split(" "))
    
    hp_1_split = clean_array(hp_1.split(" "))
    hp_2_split = clean_array(hp_2.split(" "))
    
    BLEUscore_hp_1 = nltk.translate.bleu_score.sentence_bleu([reference_split], hp_1_split)
    BLEUscore_hp_2 = nltk.translate.bleu_score.sentence_bleu([reference_split], hp_2_split)
    print("Example ", index)
    print("Fine-tuned score: ", BLEUscore_hp_1)
    print("Base score: ", BLEUscore_hp_2)

    print("******************")

    return BLEUscore_hp_1, BLEUscore_hp_2

In [None]:
import json
import pandas as pd

# Reload the saved answers so this cell also works after a kernel restart
with open('llama_32_1b_evaluation_dataset.json', 'r') as file:
    evaluation_set = json.load(file)

data = []

for el in evaluation_set:
    BLEUscore_fine_tuned, BLEUscore_base = calculate_score(
        el["index"],
        el["target_answer"],
        el["fine_tuned_answer"],
        el["base_answer"])
    
    data.append([el["index"], BLEUscore_fine_tuned, BLEUscore_base])

df = pd.DataFrame(data, columns=["index", "Fine-tuned score", "Base score"])

df["Fine-tuned score"] = df["Fine-tuned score"].astype(float)
df["Base score"] = df["Base score"].astype(float)

df.to_csv("./llama_32_1b_bleu_scores.csv", index=False)

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("llama_32_1b_bleu_scores.csv")

data1 = df['Fine-tuned score']
data2 = df['Base score']

combined_data = pd.DataFrame({
    'Fine-tuned model scores': data1,
    'Base model scores': data2
})

plt.figure(figsize=(12, 6))
sns.boxplot(data=combined_data)
plt.xlabel('Models', fontsize=12)
plt.ylabel('Score')
plt.title('Distribution of Scores: Fine-tuned vs Base Model')

plt.savefig('./images/llama_32_1b_bleu_scores.png')

plt.show()

#### BLEU Score Table

##### Example 1

Training Arguments:
* `epochs`: 1
* `per_device_train_batch_size`: 2
* `per_device_test_batch_size`: 2
* `gradient_accumulation_steps`: 2
* `gradient_checkpointing`: True

These results were obtained by fine-tuning on 6000 rows in total: 3000 rows of the dataset were duplicated so that the training data covers the terminology of both `CONEQUENCE_DEFECT` and `CORRECTIVE_ACTION`.

Total time for fine-tuning:
* `ml.g5.12xlarge`: ~42 minutes on 4 GPUs

Evaluation is performed on 10 rows extracted from the original dataset and not contained in the dataset used for the fine-tuning.

The BLEU score is computed with the fine-tuned model hosted on Amazon SageMaker, on an `ml.g5.8xlarge` instance, and the base model in Amazon Bedrock.

Base model: `Llama 3.1 8B Instruct`

![BLEU Scores Table](./images/llama_32_1b_bleu_scores_table.png)

##### BLEU Scores graph

![BLEU Scores graph](./images/llama_32_1b_bleu_scores.png)

By comparing the scores in the "Fine-tuned Score" and "Base Score" columns, we can assess the performance improvement (or degradation) achieved by fine-tuning the model on the specific task or domain.

The analysis suggests that in most cases the fine-tuned model outperforms the base model. The fine-tuned model also appears to be more consistent in its performance.
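A box plot shows the distributions, but a few summary numbers make the comparison easier to state. The sketch below computes the mean of each column, the mean per-example difference, and the share of examples where the fine-tuned model scores higher. `compare_scores` is a hypothetical helper and the scores are made up, not results from this notebook.

```python
from statistics import mean


def compare_scores(fine_tuned, base):
    """Summarize paired scores for the fine-tuned and base models."""
    if not fine_tuned or len(fine_tuned) != len(base):
        raise ValueError("Both score lists must be non-empty and of equal length")
    differences = [f - b for f, b in zip(fine_tuned, base)]
    return {
        "mean_fine_tuned": mean(fine_tuned),
        "mean_base": mean(base),
        "mean_difference": mean(differences),
        # Share of examples where the fine-tuned model scores strictly higher.
        "win_rate": sum(d > 0 for d in differences) / len(differences),
    }


# Made-up scores for illustration only.
summary = compare_scores([0.5, 0.25, 0.75, 0.5], [0.25, 0.25, 0.5, 0.75])
print(summary)
```

With the data frame built above, the same function could be called as `compare_scores(df["Fine-tuned score"].tolist(), df["Base score"].tolist())`.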

Possible improvements:
* Example repetition: provide similar examples to further improve the vocabulary of the fine-tuned model
* Increase the number of epochs

***

Base model: `Llama 3 70B Instruct`

![BLEU Scores Table](./images/llama_32_1b_bleu_scores_table_70.png)

##### BLEU Scores graph

![BLEU Scores graph](./images/llama_32_1b_bleu_scores_70.png)

By comparing the scores in the "Fine-tuned Score" and "Base Score" columns, we can assess the performance improvement (or degradation) achieved by fine-tuning the model on the specific task or domain.

The analysis suggests that in most cases the fine-tuned model outperforms the base model. The fine-tuned model also appears to be more consistent in its performance.

Possible improvements:
* Example repetition: provide similar examples to further improve the vocabulary of the fine-tuned model
* Increase the number of epochs

***

### Accuracy evaluation

In [None]:
import Levenshtein

def levenshtein_similarity(str1, str2):
    """Return 1 - (edit distance / longer length): 1.0 means identical strings."""
    distance = Levenshtein.distance(str1, str2)
    max_len = max(len(str1), len(str2))
    # Two empty strings are treated as identical
    similarity = 1 - (distance / max_len) if max_len > 0 else 1
    return similarity

In [None]:
import json
import pandas as pd

# Reload the saved answers so this cell also works after a kernel restart
with open('llama_32_1b_evaluation_dataset.json', 'r') as file:
    evaluation_set = json.load(file)

data = []

for el in evaluation_set:
    print("Example ", el["index"])
    score_fine_tuned = levenshtein_similarity(el["fine_tuned_answer"], el["target_answer"])
    print("Fine-tuned score: ", score_fine_tuned)
    score_base = levenshtein_similarity(el["base_answer"], el["target_answer"])
    print("Base score: ", score_base)
    print("******************")

    data.append([el["index"], score_fine_tuned, score_base])

df = pd.DataFrame(data, columns=["index", "Fine-tuned score", "Base score"])

df["Fine-tuned score"] = df["Fine-tuned score"].astype(float)
df["Base score"] = df["Base score"].astype(float)

df.to_csv("./llama_32_1b_levenshtein_scores.csv", index=False)

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("llama_32_1b_levenshtein_scores.csv")

data1 = df['Fine-tuned score']
data2 = df['Base score']

combined_data = pd.DataFrame({
    'Fine-tuned model scores': data1,
    'Base model scores': data2
})

plt.figure(figsize=(12, 6))
sns.boxplot(data=combined_data)
plt.xlabel('Models', fontsize=12)
plt.ylabel('Score')
plt.title('Distribution of Scores: Fine-tuned vs Base Model')

plt.savefig('./images/llama_32_1b_levenshtein_scores.png')

plt.show()

#### Normalized Levenshtein Score Table

##### Example 1

Training Arguments:
* `epochs`: 1
* `per_device_train_batch_size`: 2
* `per_device_test_batch_size`: 1
* `gradient_accumulation_steps`: 2
* `gradient_checkpointing`: True

These results were obtained by fine-tuning on 6000 rows in total: 3000 rows of the dataset were duplicated so that the training data covers the terminology of both `CONEQUENCE_DEFECT` and `CORRECTIVE_ACTION`.

Total time for fine-tuning:
* `ml.g5.12xlarge`: ~39 minutes on 4 GPUs

Evaluation is performed on 10 rows extracted from the original dataset and not contained in the dataset used for the fine-tuning.

The Levenshtein score is computed with the fine-tuned model hosted on Amazon SageMaker, on an `ml.g5.8xlarge` instance, and the base model in Amazon Bedrock.

Base model: `Llama 3.1 8B Instruct`

![Levenshtein Scores Table](./images/llama_32_1b_levenshtein_scores_table.png)

##### Levenshtein Scores graph

![Levenshtein Scores graph](./images/llama_32_1b_levenshtein_scores.png)

By comparing the scores in the "Fine-tuned Score" and "Base Score" columns, we can assess the performance improvement (or degradation) achieved by fine-tuning the model on the specific task or domain.

The analysis suggests that the fine-tuned model clearly outperforms the base model across almost all examples. This indicates that the fine-tuning process has been effective in improving the model's accuracy for this specific task.

The score returned by `levenshtein_similarity` is `1 - distance / max_len`, so it ranges from 0 to 1 and a value closer to 1 means the answer is closer to the target. Under this definition, the more accurate model is the one whose scores are closer to 1.

Possible improvements:
* Example repetition: provide similar examples to further improve the vocabulary of the fine-tuned model
* Increase the number of epochs

***

Base model: `Llama 3 70B Instruct`

![Levenshtein Scores Table](./images/llama_32_1b_levenshtein_scores_table_70.png)

##### Levenshtein Scores graph

![Levenshtein Scores graph](./images/llama_32_1b_levenshtein_scores_70.png)

By comparing the scores in the "Fine-tuned Score" and "Base Score" columns, we can assess the performance improvement (or degradation) achieved by fine-tuning the model on the specific task or domain.

The analysis suggests that the fine-tuned model clearly outperforms the base model across almost all examples. This indicates that the fine-tuning process has been effective in improving the model's accuracy for this specific task.

The score returned by `levenshtein_similarity` is `1 - distance / max_len`, so it ranges from 0 to 1 and a value closer to 1 means the answer is closer to the target. Under this definition, the more accurate model is the one whose scores are closer to 1.

Possible improvements:
* Example repetition: provide similar examples to further improve the vocabulary of the fine-tuned model
* Increase the number of epochs
