# Amazon Bedrock Multimodal Workshop
## Multimodal Embeddings Model Customization
This notebook is an end-to-end example of fine-tuning an Amazon Titan Multimodal Embeddings model to adapt it to your domain.

In this case we are going to train an embeddings model with sign language images. 

## 1. Import needed libraries
Let's begin by importing all the libraries and initializing all the clients needed. 

In [None]:
!pip -q install opensearch-py requests_aws4auth

In [1]:
import os
import time
import json
import boto3
import base64
import shutil
import random
import logging
import datetime
from PIL import Image
import concurrent.futures
from utils import resize_image
from ipywidgets import Dropdown
from requests_aws4auth import AWS4Auth
from opensearchpy import OpenSearch, RequestsHttpConnection
from aoss_utils import createEncryptionPolicy, createNetworkPolicy, createAccessPolicy, createCollection, waitForCollectionCreation

# Boto3 clients
s3_client = boto3.client('s3')
iam_client = boto3.client('iam')
sts_client = boto3.client('sts')
bedrock_client = boto3.client('bedrock')
opensearch_client = boto3.client('opensearchserverless')
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# Account and region info
session = boto3.session.Session()
region = session.region_name
account_id = sts_client.get_caller_identity()["Account"]
identity_arn = session.client('sts').get_caller_identity()['Arn']
print(identity_arn)

arn:aws:sts::947565228676:assumed-role/SageMakerRole/SageMaker


## 2. Dataset preparation

For the training dataset, create a .jsonl file with one JSON object per line. Each line contains an `image-ref` attribute and a `caption` attribute, similar to the SageMaker Augmented Manifest format. A validation dataset is required. Auto-captioning is not currently supported.

```
   {"image-ref": "s3://bucket-1/folder1/0001.png", "caption": "some text"}
   {"image-ref": "s3://bucket-1/folder1/0002.png", "caption": "some text"}
   {"image-ref": "s3://bucket-1/folder1/0003.png", "caption": "some text"}
```  

The Amazon S3 paths need to be in the same folders where you have provided permissions for Amazon Bedrock to access the data by attaching an IAM policy to your Amazon Bedrock service role.
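
Before uploading a manifest, it can help to check each line against the format described above. The following is a minimal sketch, not part of the original workshop; `validate_manifest_line` and `REQUIRED_KEYS` are hypothetical names, and the checks cover only what this section states (both attributes present, `image-ref` pointing at S3).

```python
import json

# Attributes every record must carry, per the format shown above.
REQUIRED_KEYS = ("image-ref", "caption")


def validate_manifest_line(line):
    """Return a list of problems found in one JSON Lines record (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")

    image_ref = record.get("image-ref")
    if isinstance(image_ref, str) and image_ref.strip() and not image_ref.startswith("s3://"):
        problems.append("'image-ref' is not an s3:// URI")
    return problems


good = '{"image-ref": "s3://bucket-1/folder1/0001.png", "caption": "some text"}'
bad = '{"image-ref": "folder1/0001.png"}'
print(validate_manifest_line(good))
print(validate_manifest_line(bad))
```

Running the check over every line of the file before starting a job surfaces formatting problems locally rather than after the job has been submitted.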

### 2.1 Create an Amazon S3 bucket
Create a bucket where your input/output data will be stored.

If you already have a bucket created, replace the name in the next cell and skip the following cell. 

In [2]:
s3_bucket_name = "amazonbr-fine-tune-embeddings-{}-{}".format(account_id, region)
s3_bucket_path = "s3://{}".format(s3_bucket_name)

In [3]:
try:
    if region != 'us-east-1':
        s3_client.create_bucket(
            Bucket=s3_bucket_name,     
            CreateBucketConfiguration={
                'LocationConstraint': region
            },
        )
    else:
        s3_client.create_bucket(Bucket=s3_bucket_name)
    print("AWS Bucket: {}".format(s3_bucket_name))
except Exception as err:
    print("ERROR: {}".format(err))

s3_bucket_path = "s3://{}".format(s3_bucket_name)
print("S3 bucket path: {}".format(s3_bucket_path))

AWS Bucket: amazonbr-fine-tune-embeddings-947565228676-us-east-1
S3 bucket path: s3://amazonbr-fine-tune-embeddings-947565228676-us-east-1


### 2.2 Prepare the dataset
In this section we will iterate over our training images to prepare the .jsonl file with each image's location and caption.

The images will also be uploaded to Amazon S3.

In [4]:
training_images_folder = 'training_images'
validation_images_folder = 'validation_images'
test_images_folder = 'test_images'
training_output_file = 'training-output.jsonl'
validation_output_file = 'validation-output.jsonl'
s3_folder_training = 'hand-signs/training'
s3_folder_validation = 'hand-signs/validation'
json_objects = []

### Prepare the dataset
Let's begin by downloading the dataset and dividing it into training and validation folders.

The dataset we will be using is the [American Sign Language Letters](https://public.roboflow.com/object-detection/american-sign-language-letters?ref=blog.roboflow.com) by David Lee.


In [5]:
!mkdir dataset
!curl -L "https://public.roboflow.com/ds/yGoz92zYzk?key=cNAlYH46dk" > dataset/roboflow.zip;
!unzip -q dataset/roboflow.zip; 
!rm README.roboflow.txt
!rm README.dataset.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   901  100   901    0     0   3142      0 --:--:-- --:--:-- --:--:--  3150
100 21.9M  100 21.9M    0     0  24.1M      0 --:--:-- --:--:-- --:--:-- 63.7M


In [6]:
parent_dir_1 = "train"
parent_dir_2 = "valid"
if not os.path.exists(training_images_folder):
    os.makedirs(training_images_folder)
if not os.path.exists(validation_images_folder):
    os.makedirs(validation_images_folder)
if not os.path.exists(test_images_folder):
    os.makedirs(test_images_folder)


# Function to handle duplicate filenames
def get_unique_filename(dest_dir, filename):
    name, ext = os.path.splitext(filename)
    counter = 1
    new_name = name + ext
    while os.path.exists(os.path.join(dest_dir, new_name)):
        new_name = f"{name}_{counter}{ext}"
        counter += 1
    return new_name

# Iterate over the parent directories
for parent_dir in [parent_dir_1, parent_dir_2]:
    for letter_dir in os.listdir(parent_dir):
        letter_path = os.path.join(parent_dir, letter_dir)
        if os.path.isdir(letter_path):
            # Get a list of files in the subfolder
            files = [f for f in os.listdir(letter_path) if os.path.isfile(os.path.join(letter_path, f))]

            # Shuffle the list of files randomly
            random.shuffle(files)

            # Calculate the split index for 80/20 split
            split_index = int(0.8 * len(files))

            # Iterate over the files and move them to the appropriate destination
            for i, filename in enumerate(files):
                src_file = os.path.join(letter_path, filename)
                name_parts = filename.split("_")
                new_name = name_parts[0] + os.path.splitext(filename)[1]

                if i < split_index:
                    # Move to training_images
                    dest_dir = training_images_folder
                else:
                    # Move to validation_images
                    dest_dir = validation_images_folder

                # Handle duplicate filenames
                new_name = get_unique_filename(dest_dir, new_name)
                dest_file = os.path.join(dest_dir, new_name)
                shutil.move(src_file, dest_file)

# Set the directory paths
src_dir = "test"
dst_dir = "test_images"

# Loop through each subfolder in the source directory
for subfolder in os.listdir(src_dir):
    subfolder_path = os.path.join(src_dir, subfolder)
    if os.path.isdir(subfolder_path):
        # Loop through each file in the subfolder
        for filename in os.listdir(subfolder_path):
            file_path = os.path.join(subfolder_path, filename)
            if os.path.isfile(file_path):
                # Separate the base name from the extension
                name_parts = os.path.splitext(filename)
                # Keep the part of the base name before the first '_' plus the extension
                new_name = name_parts[0].split('_')[0] + name_parts[1]
                # Construct the new file path
                new_file_path = os.path.join(dst_dir, new_name)
                # Move the file to the destination directory with the new name
                shutil.move(file_path, new_file_path)
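
The 80/20 split above uses an unseeded `random.shuffle`, so each run produces a different partition. If a repeatable split is useful, one option is sketched below. `split_files` is a hypothetical helper, and the seed value is an arbitrary assumption.

```python
import random


def split_files(files, train_fraction=0.8, seed=42):
    """Shuffle a copy of `files` with a fixed seed and split it into (train, validation)."""
    # Sort first so the result does not depend on the directory listing order.
    shuffled = sorted(files)
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]


example_files = [f"A{i}.jpg" for i in range(10)]
train, valid = split_files(example_files)
print(len(train), len(valid))  # 8 2
```

The same seed always yields the same partition, which makes it easier to compare fine-tuning runs against each other.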

In [None]:
!rm -R train/
!rm -R valid/
!rm -R test/
!rm -R dataset/

#### Upload dataset to Amazon S3

In [33]:
def create_dataset(s3_folder, image_folder, output_file):
    json_objects = []
    for filename in os.listdir(image_folder):
        if filename.endswith('.png') or filename.endswith('.jpg'):
            # Construct the image reference
            image_ref = os.path.join('s3://{}/{}'.format(s3_bucket_name, s3_folder), filename)
            # Extract the first letter of the filename to create the caption
            first_letter = filename[0]
            caption = "The letter {} in sign language".format(first_letter)
            # Create the JSON object
            json_obj = {"image-ref": image_ref, "caption": caption}
            # Append the JSON object to the list
            json_objects.append(json_obj)
            # Path to the current image file
            file_path = os.path.join(image_folder, filename)
            # Resize the image if it exceeds 2048x2048
            resize_image(file_path)
            # Upload the image to S3
            s3_client.upload_file(file_path,
                                  s3_bucket_name,
                                  os.path.join(s3_folder, filename))

    # Write the JSON objects to the output file in JSON Lines format
    with open(output_file, 'w') as f:
        for obj in json_objects:
            f.write(json.dumps(obj) + '\n')

    print("Output file '{}' created successfully.".format(output_file))
    s3_client.upload_file(output_file, s3_bucket_name, output_file)

In [35]:
create_dataset(s3_folder_training, training_images_folder, training_output_file)

Output file 'training-output.jsonl' created successfully.


In [36]:
create_dataset(s3_folder_validation, validation_images_folder, validation_output_file)

Output file 'validation-output.jsonl' created successfully.


## 3. Model fine-tuning

### 3.1 Fine tune job preparation - Creating role and policies requirements

We will now prepare the necessary role for the fine-tune job. That includes creating the policies required to run customization jobs with Amazon Bedrock.

#### Create Trust relationship
This JSON object defines the trust relationship that allows the Bedrock service to assume a role that lets it talk to other required AWS services. The conditions restrict role assumption to a specific account ID and to a specific component of the Bedrock service (model customization jobs).

In [9]:
role_name = "AmazonBedrockFineTuning-Multimodal-Embeddings-test"
s3_bedrock_ft_access_policy="AmazonBedrockFT-Multimodal-S3-test"
customization_role = f"arn:aws:iam::{account_id}:role/{role_name}"

In [None]:
# Trust relationship that lets the Bedrock service assume the role. The conditions
# restrict role assumption to this account ID and to model customization jobs.
ROLE_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Principal": {{
                "Service": "bedrock.amazonaws.com"
            }},
            "Action": "sts:AssumeRole",
            "Condition": {{
                "StringEquals": {{
                    "aws:SourceAccount": "{account_id}"
                }},
                "ArnEquals": {{
                    "aws:SourceArn": "arn:aws:bedrock:{region}:{account_id}:model-customization-job/*"
                }}
            }}
        }}
    ]
}}
"""
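
The trust policy above is written as an f-string with doubled braces. As an alternative, the same document can be built from a Python dict and serialized with `json.dumps`, which avoids the brace escaping. This is a sketch only: `build_trust_policy` is a hypothetical helper name and the account ID in the example is a placeholder.

```python
import json


def build_trust_policy(account_id, region):
    """Return the Bedrock trust policy as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Only this account may trigger the role assumption.
                    "StringEquals": {"aws:SourceAccount": account_id},
                    # Only model customization jobs in this region may use it.
                    "ArnEquals": {
                        "aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:model-customization-job/*"
                    },
                },
            }
        ],
    }
    return json.dumps(policy, indent=4)


# Placeholder account ID and region for illustration.
trust_doc = build_trust_policy("111122223333", "us-east-1")
print(trust_doc)
```

The returned string can be passed as `AssumeRolePolicyDocument` in the same way as `ROLE_DOC`.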

#### Create S3 access policy

This JSON object defines the permissions of the role that Bedrock will assume. It grants access to the S3 bucket we created to hold the fine-tuning datasets and allows certain bucket and object operations.


In [None]:
ACCESS_POLICY_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetBucketAcl",
                "s3:GetBucketNotification",
                "s3:ListBucket",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::{s3_bucket_name}",
                "arn:aws:s3:::{s3_bucket_name}/*"
            ]
        }}
    ]
}}"""

#### Create IAM role and attach policies

Let's now create the IAM role with the trust policy defined above and attach the S3 policy to it.

In [None]:
response = iam_client.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=ROLE_DOC,
    Description="Role for Bedrock to access S3 for finetuning",
)

role_arn = response["Role"]["Arn"]
response = iam_client.create_policy(
    PolicyName=s3_bedrock_ft_access_policy,
    PolicyDocument=ACCESS_POLICY_DOC,
)
policy_arn = response["Policy"]["Arn"]
iam_client.attach_role_policy(
    RoleName=role_name,
    PolicyArn=policy_arn,
)
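
Re-running the cell above fails once the role exists, because `create_role` raises an "entity already exists" error. A possible way to make the step repeatable is sketched below. `ensure_role` is a hypothetical helper; it takes the IAM client as a parameter and relies on the client's `EntityAlreadyExistsException` and `get_role` call.

```python
def ensure_role(iam_client, role_name, trust_policy_json, description=""):
    """Create the role, or return the existing role's ARN if it already exists."""
    try:
        response = iam_client.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=trust_policy_json,
            Description=description,
        )
    except iam_client.exceptions.EntityAlreadyExistsException:
        # The role was created by an earlier run: look it up instead.
        response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Usage in this notebook (assumes iam_client, role_name and ROLE_DOC from earlier cells):
# role_arn = ensure_role(iam_client, role_name, ROLE_DOC,
#                        "Role for Bedrock to access S3 for finetuning")
```

Note that this sketch does not update the trust policy of a role that already exists, so a stale role from an earlier experiment would be reused as-is.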

### 3.2 Create Fine-tuning job

<div class="alert alert-block alert-info">
    <b>Note:</b> The fine-tuning job will take around 1 hour to complete.
</div>

Now that we have all the requirements in place, let's create the fine-tuning job with the Titan Multimodal Embeddings model.

To do so, we need to set the model **hyperparameters** `epochCount`, `batchSize` and `learningRate` and provide the paths to the training and validation data.

In [None]:
ts = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
base_model_id = "amazon.titan-embed-image-v1:0"
customization_type = "FINE_TUNING"
customization_role = role_arn
customization_job_name = f"image-emb-ft-{ts}"
custom_model_name = f"image-emb-ft-{ts}"
hyper_parameters = {
    "epochCount": "auto",
    "batchSize": "256",
    "learningRate": "0.00001",
}
s3_train_uri = s3_bucket_path + "/" + training_output_file
s3_validation_uri = s3_bucket_path + "/" + validation_output_file
training_data_config = {"s3Uri": s3_train_uri}
validation_data_config = {
        'validators': [
            {
                's3Uri': s3_validation_uri
            },
        ]
    }

output_data_config = {"s3Uri": f's3://{s3_bucket_name}/outputs/output-{custom_model_name}'}

# Create the customization job
bedrock_client.create_model_customization_job(
    customizationType=customization_type,
    jobName=customization_job_name,
    customModelName=custom_model_name,
    roleArn=customization_role,
    baseModelIdentifier=base_model_id,
    hyperParameters=hyper_parameters,
    trainingDataConfig=training_data_config,
    validationDataConfig=validation_data_config,
    outputDataConfig=output_data_config
)

#### Waiting until customization job is completed
Once the customization job is finished, you can check your existing custom model(s) and retrieve the modelArn of your fine-tuned model.

In [None]:
status = bedrock_client.list_model_customization_jobs(
    nameContains=customization_job_name
)["modelCustomizationJobSummaries"][0]["status"]
while status == 'InProgress':
    time.sleep(50)
    status = bedrock_client.list_model_customization_jobs(
        nameContains=customization_job_name
    )["modelCustomizationJobSummaries"][0]["status"]
    print(status)
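
The loop above polls until the status is no longer `InProgress`, with no upper bound on the wait. A small generic helper with a timeout is sketched below. `wait_for_status` is a hypothetical name, and the default interval and timeout are arbitrary assumptions rather than recommended values.

```python
import time


def wait_for_status(fetch_status, pending=("InProgress",), interval_seconds=60,
                    timeout_seconds=4 * 3600, sleep=time.sleep):
    """Poll fetch_status() until it leaves the pending states or the timeout elapses."""
    waited = 0
    status = fetch_status()
    while status in pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status!r} after {waited} seconds")
        sleep(interval_seconds)
        waited += interval_seconds
        status = fetch_status()
    return status


# Demo with a stand-in for the Bedrock call, so the sketch runs without AWS access.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_status(lambda: next(fake_statuses), interval_seconds=0)
print(final_status)  # Completed

# In this notebook, fetch_status could wrap the same call used above:
# wait_for_status(lambda: bedrock_client.list_model_customization_jobs(
#     nameContains=customization_job_name
# )["modelCustomizationJobSummaries"][0]["status"])
```

Returning the final status lets the caller distinguish a completed job from a failed or stopped one before moving on to provisioning.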

### 3.3 Provision the fine-tuned model
You can use the fine-tuned model by purchasing Provisioned Throughput or with Amazon Bedrock Batch Inference. In this notebook we will provision the model.

In [11]:
customization_jobs = {}
dropdown_vals = []
for cj in bedrock_client.list_model_customization_jobs()["modelCustomizationJobSummaries"]:
    if cj["status"] == "Completed":
        customization_jobs[cj["customModelName"]] = cj
        dropdown_vals.append(cj["customModelName"] + " - creationTime: " + cj["creationTime"].strftime("%Y-%m-%d %H:%M:%S"))

# display the model-ids in a dropdown to select a model for inference.
model_dropdown = Dropdown(
    options=dropdown_vals,
    value=dropdown_vals[0],
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(model_dropdown)

Dropdown(description='Select a model', layout=Layout(width='max-content'), options=('image-emb-ft-2024-03-26-0…

In [12]:
selected_model = model_dropdown.value.split(" - creationTime: ")[0]
custom_model_name, custom_model_arn = selected_model, customization_jobs[selected_model]["customModelArn"]
custom_model_name, custom_model_arn

('image-emb-ft-2024-03-26-09-58-05',
 'arn:aws:bedrock:us-east-1:947565228676:custom-model/amazon.titan-embed-image-v1:0/wrmjvnf7a922')


#### Create Provisioned Model Throughput
**Note:** Creating provisioned throughput will take around 10 minutes to complete.

You will need to create provisioned throughput to be able to evaluate the model's performance. You can do so through the console or with the following API call.


In [13]:
# Create the provision throughput job and retrieve the provisioned model id
provisioned_model_id = bedrock_client.create_provisioned_model_throughput(
    modelUnits=1,
    # create a name for your provisioned throughput model
    provisionedModelName=custom_model_name,
    modelId=custom_model_arn
)['provisionedModelArn']

In [14]:
%%time
# check provisioned throughput job status
import time
status_provisioning = bedrock_client.get_provisioned_model_throughput(provisionedModelId = provisioned_model_id)['status'] 
while status_provisioning == 'Creating':
    time.sleep(60)
    status_provisioning = bedrock_client.get_provisioned_model_throughput(provisionedModelId=provisioned_model_id)['status']
    print(status_provisioning)

Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
InService
CPU times: user 67.4 ms, sys: 16.1 ms, total: 83.5 ms
Wall time: 11min 2s


## 4. Create two indexes with the base and fine-tuned model embeddings
Now that we have fine-tuned our model, we are going to test it against the base model. To achieve this, we will create two Amazon OpenSearch Serverless indexes and populate them with hand sign language image embeddings, each one using its respective embeddings model. Then we will query each index to retrieve images related to alphabet letters in hand sign language.

### Create the vector indexes using Amazon OpenSearch Serverless

#### Create an Amazon OpenSearch Serverless Collection

In [15]:
client = boto3.client('opensearchserverless')
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, "aoss", session_token=credentials.token)
collection_name = "sign-language-collection"

In [16]:
createEncryptionPolicy(client, collection_name)
createNetworkPolicy(client, collection_name)
createAccessPolicy(client, collection_name, identity_arn)
createCollection(client, collection_name)
hostname, collection_id = waitForCollectionCreation(client, collection_name)


Encryption policy created:
{'securityPolicyDetail': {'createdDate': 1712314742191, 'description': 'Encryption policy for sign-language-collection collection', 'lastModifiedDate': 1712314742191, 'name': 'sign-language-collection-policy', 'policy': {'Rules': [{'Resource': ['collection/sign-language-collection*'], 'ResourceType': 'collection'}], 'AWSOwnedKey': True}, 'policyVersion': 'MTcxMjMxNDc0MjE5MV8x', 'type': 'encryption'}, 'ResponseMetadata': {'RequestId': '33f2bc35-f678-4d22-88c4-a7aa34e439cf', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '33f2bc35-f678-4d22-88c4-a7aa34e439cf', 'date': 'Fri, 05 Apr 2024 10:59:02 GMT', 'content-type': 'application/x-amz-json-1.0', 'content-length': '383', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

Network policy created:
{'securityPolicyDetail': {'createdDate': 1712314742289, 'description': 'Network policy for sign-language-collection collection', 'lastModifiedDate': 1712314742289, 'name': 'sign-language-collection-policy', '

In [27]:
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, "aoss", session_token=credentials.token)
OSSclient = OpenSearch(
    hosts=[{'host': hostname, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=900
)

#### Create the vector indexes

Before creating the indexes you will need to define the output vector size for Amazon Titan Multimodal Embeddings.

Available sizes – 1,024 (default), 384, 256
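
Because the index dimension and the model's output length must match, a small guard can catch an unsupported value before any index is created. This is a sketch under the sizes listed above; `check_embedding_length` and `SUPPORTED_EMBEDDING_LENGTHS` are hypothetical names.

```python
# Output sizes listed above for Amazon Titan Multimodal Embeddings.
SUPPORTED_EMBEDDING_LENGTHS = (256, 384, 1024)


def check_embedding_length(length):
    """Return `length` unchanged if it is a supported output size, else raise ValueError."""
    if length not in SUPPORTED_EMBEDDING_LENGTHS:
        raise ValueError(
            f"outputEmbeddingLength must be one of {SUPPORTED_EMBEDDING_LENGTHS}, got {length!r}"
        )
    return length


print(check_embedding_length(1024))  # 1024
```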

In [18]:
outputEmbeddingLength = 1024

In [19]:
def create_index(index, outputEmbeddingLength):
    if not OSSclient.indices.exists(index):
        settings = {
            "settings": {
                "index": {
                    "knn": True,
                }
            },
            "mappings": {
                "properties": {
                    "letter": {"type": "text"},
                    "description": {"type": "text"},
                    "createtime": {"type": "text"},
                    "image_path":{"type": "text"},
                    "vector_field": {
                        "type": "knn_vector",
                        "dimension": outputEmbeddingLength,
                    },
                }
            },
        }
        res = OSSclient.indices.create(index, body=settings)
        print(res)

In [20]:
base_index_name = "sign-language-collection-base-{}".format(outputEmbeddingLength)
finetuned_index_name = "sign-language-collection-finetuned-{}".format(outputEmbeddingLength)
base_modelId = "amazon.titan-embed-image-v1"
finetuned_modelId = provisioned_model_id

In [22]:
create_index(base_index_name, outputEmbeddingLength)
create_index(finetuned_index_name, outputEmbeddingLength)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'sign-language-collection-finetuned-1024'}


### Populate the vector indexes
Let's populate both indexes with our test images.

In [23]:
def get_embeddings_of_image(image, modelId, outputEmbeddingLength = outputEmbeddingLength):
    with open(image, "rb") as image_file:
        imageEncoded = base64.b64encode(image_file.read()).decode('utf8')

    body = json.dumps(
    {
            "inputImage": imageEncoded,
            #"inputText":text,
            "embeddingConfig": { 
                "outputEmbeddingLength": outputEmbeddingLength
            }
        }
    )

    response = bedrock_runtime.invoke_model(
        body=body, 
        modelId=modelId, 
        accept="application/json", 
        contentType="application/json"       
    )

    vector = json.loads(response['body'].read().decode('utf8'))
    return vector


def create_dataset_list(folder_path):
    dataset_list = []
    
    # Iterate over all files in the folder
    for filename in os.listdir(folder_path):
        # Check if the file is an image (you can modify the extension check as needed)
        if filename.endswith('.jpg') or filename.endswith('.png'):
            # Construct the full file path
            file_path = os.path.join(folder_path, filename)
            dataset_list.append(file_path)
    
    return dataset_list
    
def process_batch(batch, index, modelId, outputEmbeddingLength):
    start_time = datetime.datetime.now()
    bulk_data = ""
    for entry in batch:        
        letter = entry.split('/')[-1][0]
        text = "The letter {} in sign language".format(letter)
        vector = get_embeddings_of_image(entry, modelId, outputEmbeddingLength)
        dt = datetime.datetime.now().isoformat()
        doc = {
            "vector_field" : vector["embedding"],
            "createtime": dt,
            "letter": letter,
            "description": text,
            "image_path": entry
        }
        bulk_entry = "{{\"index\": {{\"_index\": \"{}\"}}}}\n{}\n".format(index, json.dumps(doc))
        bulk_data += bulk_entry
    end_time = datetime.datetime.now()
    processing_time = (end_time - start_time).total_seconds() * 1000  # Convert to milliseconds
    print("Processed {} records in {} ms".format(len(batch), processing_time))
    response = OSSclient.bulk(bulk_data)
    if (response["errors"] is False):
        print("Sent {} records in {} ms".format(len(response["items"]), response["took"]))
    else:
        print("Error found")


def populate_vector_database(folder_path, index, modelId, outputEmbeddingLength, batch_size=25):
    dataset_list = create_dataset_list(folder_path)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Split the dataset into batches
        batches = [dataset_list[i:i+batch_size] for i in range(0, len(dataset_list), batch_size)]

        # Map the process_batch function to each batch in the dataset using multiple threads
        futures = [executor.submit(process_batch, batch, index, modelId, outputEmbeddingLength) for batch in batches]

        # Wait for all threads to complete
        concurrent.futures.wait(futures)
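
The bulk request that `process_batch` assembles is newline-delimited JSON: one action line followed by one document line per record. The sketch below builds the same payload with `json.dumps` instead of string formatting and extracts per-item errors from a bulk response. `build_bulk_body` and `failed_items` are hypothetical helpers, and the response shape assumed here is the standard bulk response with an `items` list.

```python
import json


def build_bulk_body(index, docs):
    """Return an NDJSON bulk payload that indexes each doc into `index`."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # document line
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


def failed_items(bulk_response):
    """Return the error objects of the items that failed in a bulk response."""
    failures = []
    for item in bulk_response.get("items", []):
        result = item.get("index", {})
        if "error" in result:
            failures.append(result["error"])
    return failures


body = build_bulk_body("demo-index", [{"letter": "A"}, {"letter": "B"}])
print(body)
```

Printing the output of `failed_items(response)` in the error branch gives more to work with than the generic "Error found" message.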

In [28]:
populate_vector_database(test_images_folder, base_index_name, base_modelId, outputEmbeddingLength)

Processed 25 records in 3384.6099999999997 ms
Processed 25 records in 3779.57 ms
Sent 25 records in 459 ms
Sent 25 records in 377 ms
Processed 22 records in 4985.093 ms
Sent 22 records in 307 ms


In [29]:
populate_vector_database(test_images_folder, finetuned_index_name, finetuned_modelId, outputEmbeddingLength)

Processed 22 records in 6004.804 ms
Sent 22 records in 294 ms
Processed 25 records in 6815.2 ms
Sent 25 records in 306 ms
Processed 25 records in 7781.734 ms
Sent 25 records in 303 ms


Wait a couple of minutes for all the data to be accessible on OpenSearch Serverless.

## 5. Compare results
Now that we have populated both indexes, let's compare the query results using the base embeddings and the fine-tuned embeddings.

### Text Search

In [33]:
def get_embedding_for_text(text, modelId, outputEmbeddingLength):
    body = json.dumps(
        {"inputText": text, 
         "embeddingConfig": { 
                "outputEmbeddingLength": outputEmbeddingLength
            }
        }
    )

    response = bedrock_runtime.invoke_model(
        body=body, 
        modelId=modelId, 
        accept="application/json", 
        contentType="application/json"       
    )

    vector_json = json.loads(response['body'].read().decode('utf8'))

    return vector_json, text

def query_the_database_with_text(text, index, modelId, outputEmbeddingLength, k):
    o_vector_json, o_text = get_embedding_for_text(text, modelId, outputEmbeddingLength)
    query = {
      'query': {
        'bool': {
            "must": [
                {
                    "knn":{
                       'vector_field':{
                           "vector":o_vector_json["embedding"],
                           "k": k
                       } 
                    }
                }
            ]
        }
      }
    }

    response = OSSclient.search(
        body = query,
        index = index
    )

    return response
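
Conceptually, the k-NN query above ranks the stored image vectors by their distance to the query text vector and returns the closest `k`. The brute-force sketch below illustrates that idea on toy two-dimensional vectors. It is an illustration only: it assumes Euclidean distance, whereas the actual service uses approximate search and a distance function determined by the index configuration. `top_k_nearest` is a hypothetical name.

```python
import math


def top_k_nearest(query, vectors, k):
    """Return the k labels whose vectors are closest to `query` (Euclidean distance)."""
    ranked = sorted(vectors.items(), key=lambda item: math.dist(query, item[1]))
    return [label for label, _ in ranked[:k]]


# Toy vectors standing in for image embeddings.
toy_vectors = {"A": [0.0, 1.0], "B": [1.0, 0.0], "C": [0.9, 0.1]}
nearest = top_k_nearest([1.0, 0.0], toy_vectors, k=2)
print(nearest)  # ['B', 'C']
```

Fine-tuning aims to move the text embedding for a letter closer to the image embeddings of that letter, which is what changes the ranking in the comparison that follows.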

In [34]:
results_text_base = query_the_database_with_text("The letter B", base_index_name, base_modelId,  outputEmbeddingLength, k=10)
results_text_base["hits"]["hits"][0]["_source"]["description"]

'The letter H in sign language'

In [35]:
results_text_finetuned = query_the_database_with_text("The letter B", finetuned_index_name, finetuned_modelId, outputEmbeddingLength, k=10)
results_text_finetuned["hits"]["hits"][0]["_source"]["description"]

'The letter B in sign language'

#### Full comparison

In [36]:
import pandas as pd

def compare_alphabet_results(base_index_name, base_modelId, finetuned_index_name, finetuned_modelId, outputEmbeddingLength):
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    results = []

    for letter in alphabet:
        query = f"The letter {letter} in sign language"
        results_text_base = query_the_database_with_text(query, base_index_name, base_modelId, outputEmbeddingLength, k=5)
        base_description = results_text_base["hits"]["hits"][0]["_source"]["description"]

        results_text_finetuned = query_the_database_with_text(query, finetuned_index_name, finetuned_modelId, outputEmbeddingLength, k=5)
        finetuned_description = results_text_finetuned["hits"]["hits"][0]["_source"]["description"]

        results.append({
            'Letter': letter,
            'Base Result': base_description,
            'Fine-tuned Result': finetuned_description
        })

    df = pd.DataFrame(results)
    return df

# Call the function with your parameters
df = compare_alphabet_results(base_index_name, base_modelId, finetuned_index_name, finetuned_modelId, outputEmbeddingLength)
print(df)

   Letter                    Base Result              Fine-tuned Result
0       A  The letter Q in sign language  The letter S in sign language
1       B  The letter F in sign language  The letter B in sign language
2       C  The letter F in sign language  The letter C in sign language
3       D  The letter Q in sign language  The letter D in sign language
4       E  The letter F in sign language  The letter N in sign language
5       F  The letter F in sign language  The letter F in sign language
6       G  The letter Q in sign language  The letter G in sign language
7       H  The letter V in sign language  The letter H in sign language
8       I  The letter V in sign language  The letter I in sign language
9       J  The letter Q in sign language  The letter J in sign language
10      K  The letter V in sign language  The letter K in sign language
11      L  The letter Q in sign language  The letter Y in sign language
12      M  The letter Q in sign language  The letter T in sign l

In [37]:
def calculate_precision(df):
    # Each query contributes exactly one top result, so a miss counts as one
    # false positive and precision here equals top-1 accuracy over the alphabet.
    base_tp = 0
    base_fp = 0
    finetuned_tp = 0
    finetuned_fp = 0

    for _, row in df.iterrows():
        expected_result = f"The letter {row['Letter']} in sign language"

        if row['Base Result'] == expected_result:
            base_tp += 1
        else:
            base_fp += 1

        if row['Fine-tuned Result'] == expected_result:
            finetuned_tp += 1
        else:
            finetuned_fp += 1

    base_precision = base_tp / (base_tp + base_fp) if (base_tp + base_fp) > 0 else 0
    finetuned_precision = finetuned_tp / (finetuned_tp + finetuned_fp) if (finetuned_tp + finetuned_fp) > 0 else 0

    return base_precision, finetuned_precision

# Call the function with your DataFrame
base_precision, finetuned_precision = calculate_precision(df)

print("Base Model:")
print(f"Precision: {base_precision:.2f}")

print("\nFine-tuned Model:")
print(f"Precision: {finetuned_precision:.2f}")


Base Model:
Precision: 0.08

Fine-tuned Model:
Precision: 0.77
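
The comparison above requests `k=5` results per query but scores only the first hit. To see whether the correct letter appears anywhere in the returned results, a hit rate at `k` can be computed from the search responses. The sketch below assumes the response structure used above (`hits.hits[]._source.letter`); `letter_rank` and `hit_rate_at_k` are hypothetical helpers, and the sample response is made up for illustration.

```python
def letter_rank(response, expected_letter):
    """Return the 1-based rank of the first hit whose `letter` matches, or None."""
    for rank, hit in enumerate(response["hits"]["hits"], start=1):
        if hit["_source"].get("letter") == expected_letter:
            return rank
    return None


def hit_rate_at_k(ranks, k):
    """Fraction of queries whose correct letter appears within the top k results."""
    if not ranks:
        return 0.0
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Made-up response with the same shape as an OpenSearch search result.
sample_response = {"hits": {"hits": [
    {"_source": {"letter": "H"}},
    {"_source": {"letter": "B"}},
]}}
rank_b = letter_rank(sample_response, "B")
print(rank_b)                               # 2
print(hit_rate_at_k([1, 2, None, 4], k=2))  # 0.5
```

Collecting one rank per letter for each index and comparing `hit_rate_at_k` at `k=1` and `k=5` shows how often the right letter is retrieved at all, not only how often it is ranked first.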


## 6. Clean-up
To avoid unnecessary costs, let's now delete the OpenSearch Serverless collection and its policies, and the provisioned throughput model.

In [38]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def delete_security_policy(client, collection_name, policy_type):
    try:
        response = client.delete_security_policy(
            name=f'{collection_name}-policy',
            type=policy_type
        )
        logging.info(f'Successfully deleted {policy_type} security policy for {collection_name}')
    except Exception as e:
        logging.error(f'Error deleting {policy_type} security policy for {collection_name}: {e}')

def delete_access_policy(client, collection_name):
    try:
        response = client.delete_access_policy(
            name=f'{collection_name}-policy',
            type='data'
        )
        logging.info(f'Successfully deleted access policy for {collection_name}')
    except Exception as e:
        logging.error(f'Error deleting access policy for {collection_name}: {e}')

def delete_collection(client, collection_id):
    try:
        response = client.delete_collection(
            id=collection_id
        )
        logging.info(f'Successfully deleted collection with ID {collection_id}')
    except Exception as e:
        logging.error(f'Error deleting collection with ID {collection_id}: {e}')

def delete_provisioned_model_throughput(bedrock_client, provisioned_model_id):
    try:
        response = bedrock_client.delete_provisioned_model_throughput(
            provisionedModelId=provisioned_model_id
        )
        logging.info(f'Successfully deleted provisioned model throughput for ID {provisioned_model_id}')
    except Exception as e:
        logging.error(f'Error deleting provisioned model throughput for ID {provisioned_model_id}: {e}')

delete_security_policy(client, collection_name, 'encryption')
delete_security_policy(client, collection_name, 'network')
delete_access_policy(client, collection_name)
delete_collection(client, collection_id)
delete_provisioned_model_throughput(bedrock_client, provisioned_model_id)

2024-04-05 13:52:41,799 - INFO - Successfully deleted encryption security policy for sign-language-collection
2024-04-05 13:52:41,885 - INFO - Successfully deleted network security policy for sign-language-collection
2024-04-05 13:52:41,979 - INFO - Successfully deleted access policy for sign-language-collection
2024-04-05 13:52:42,070 - INFO - Successfully deleted collection with ID 2pfquksf2w8l8t7cffjc
2024-04-05 13:52:43,200 - INFO - Successfully deleted provisioned model throughput for ID arn:aws:bedrock:us-east-1:947565228676:provisioned-model/d41cl5bj2h4x
