# Amazon Bedrock Multimodal Workshop
## Image clustering with Multimodal Embeddings

In this Jupyter notebook, we explore how multimodal embeddings can be used to cluster images into groups. Our goal is to classify images into three categories: kitchen, bedroom, and bathroom. To do so, we use a vector database together with Amazon Titan Multimodal Embeddings in Amazon Bedrock. We also use Amazon Bedrock Batch Inference to generate embeddings for a larger number of images.


#### Preview SDK for Batch Inference
At the time of writing, Batch Inference for Amazon Bedrock is still in public preview. To complete this notebook, we download and install the boto3 and botocore versions that include the preview features.

In [None]:
from utils import process_zip

In [None]:
process_zip("https://d2eo22ngex1n9g.cloudfront.net/Documentation/SDK/bedrock-python-sdk-reinvent.zip")

In [None]:
!pip uninstall botocore boto3 -qy  
!pip install -q preview-sdk/botocore-1.32.4-py3-none-any.whl
!pip install -q preview-sdk/boto3-1.29.4-py3-none-any.whl

### Install and import needed libraries
For this notebook to run correctly, we will need to install, import and initialize the necessary libraries and clients. 

In [None]:
!pip install -q pinecone-client

In [None]:
import os
import time
import json
import boto3
import base64
import datetime
from PIL import Image
from utils import resize_image, process_zip
from pinecone import Pinecone, PodSpec

# Boto3 clients
s3_client = boto3.client('s3')
iam_client = boto3.client('iam')
sts_client = boto3.client('sts')
bedrock_client = boto3.client('bedrock')
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# Account and region info
session = boto3.session.Session()
region = session.region_name
account_id = sts_client.get_caller_identity()["Account"]

### Amazon Bedrock Titan Multimodal Embeddings (TMME)
In this section of the notebook we will create the functions needed to retrieve embeddings using Amazon Bedrock TMME. 

#### Define output embedding length
Titan Multimodal Embeddings can produce embeddings in three vector sizes: 1,024 (default), 384, or 256.

In [None]:
outputEmbeddingLength = 1024 # Define output vector size – 1,024 (default), 384, 256

#### Image embeddings
This function will transform an image into an embeddings vector using TMME. 

In [None]:
def get_embeddings_of_image(image, outputEmbeddingLength = outputEmbeddingLength):
    with open(image, "rb") as image_file:
        imageEncoded = base64.b64encode(image_file.read()).decode('utf8')

    body = json.dumps(
        {
            "inputImage": imageEncoded,
            "embeddingConfig": { 
                "outputEmbeddingLength": outputEmbeddingLength
            }
        }
    )

    response = bedrock_runtime.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json"
    )

    vector = json.loads(response['body'].read().decode('utf8'))
    return vector
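
The request and response handling in `get_embeddings_of_image` can be separated from the network call, which makes it easier to check on its own. The following is a minimal sketch under stated assumptions: the helper names `build_image_request_body` and `parse_embedding_response` are hypothetical and not part of the workshop's `utils` module, and the allowed sizes are simply the three values listed in this notebook.

```python
import base64
import json

# Vector sizes listed in this notebook for Titan Multimodal Embeddings
SUPPORTED_EMBEDDING_LENGTHS = (1024, 384, 256)


def build_image_request_body(image_bytes, output_embedding_length=1024):
    """Return the JSON request body for an image-only embeddings call (hypothetical helper)."""
    if output_embedding_length not in SUPPORTED_EMBEDDING_LENGTHS:
        raise ValueError(
            f"outputEmbeddingLength must be one of {SUPPORTED_EMBEDDING_LENGTHS}, "
            f"got {output_embedding_length}"
        )
    return json.dumps(
        {
            # The image is sent as a base64-encoded string
            "inputImage": base64.b64encode(image_bytes).decode("utf8"),
            "embeddingConfig": {"outputEmbeddingLength": output_embedding_length},
        }
    )


def parse_embedding_response(raw_body):
    """Extract the embedding list from an already-decoded response body (hypothetical helper)."""
    return json.loads(raw_body)["embedding"]
```

Validating the size before the call fails fast on a typo such as 512 instead of waiting for the service to reject the request.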

### Pinecone Vector Database

[Pinecone](https://www.pinecone.io) is a vector database that allows you to store and retrieve high-dimensional vectors efficiently. In this notebook, we will be using Pinecone to store the image embeddings generated from our image data.

Vector databases like Pinecone are particularly useful for similarity search tasks, where you want to find the closest matches to a given query vector. By storing our image embeddings in Pinecone, we can easily perform tasks like image similarity search, clustering, and recommendation systems.
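
To make the idea of "closest match" concrete, here is a small pure-Python sketch of a cosine-similarity lookup. It only illustrates the concept and is not how Pinecone is implemented; the three-dimensional vectors and the function names are made up for the example, whereas real embeddings in this notebook have 256 to 1,024 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_label(query, labelled_vectors):
    """Return the label whose vector has the highest cosine similarity to the query."""
    return max(
        labelled_vectors,
        key=lambda label: cosine_similarity(query, labelled_vectors[label]),
    )


# Toy reference vectors, one per room type (illustrative values only)
reference = {
    "kitchen": [1.0, 0.0, 0.0],
    "bedroom": [0.0, 1.0, 0.0],
    "bathroom": [0.0, 0.0, 1.0],
}

print(nearest_label([0.9, 0.1, 0.2], reference))  # prints "kitchen"
```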

Before we can start using Pinecone, we need to set up a Pinecone account and create an index. An index in Pinecone is a collection of vectors that can be queried and updated.



### Store embeddings in the vector database
Now that we know how to get embeddings for our content, we will store them in a vector database so that we can query them later.

To complete this section, create a free Pinecone account, which includes a free index for testing, and retrieve your API key.

#### Create the vector database

In [None]:
pinecone =  Pinecone(api_key="#INSERT YOUR API KEY")

In [None]:
index_name = "house-rooms"

In [None]:
pinecone.create_index(index_name,
        dimension=outputEmbeddingLength,
        metric='cosine',
        spec=PodSpec(environment="gcp-starter"))

In [None]:
pinecone.describe_index(index_name)

### Store images into the vector database -- Index 3 images (clustering groups)

After creating the Pinecone index, the next step is to generate embeddings for your images and then upload (ingest) those embeddings into the index. The process will involve resizing the images to the maximum height and width supported by the Amazon Titan Multimodal Embeddings model. Additionally, it will create associated metadata for each vector, which includes the type of room the image represents.

The process can be broken down into the following steps:

1. **Image Resizing**: Before generating embeddings, you'll need to resize your images to the maximum dimensions supported by the Amazon Titan Multimodal Embeddings model. This step ensures that the model can process the images correctly and generate accurate embeddings.

2. **Embedding Generation**: After resizing the images, you'll use the Amazon Titan Multimodal Embeddings model to generate embeddings for each image. These embeddings are high-dimensional vector representations that capture the visual features and semantics of the images.

3. **Metadata Creation**: Along with the embeddings, you'll create associated metadata for each image. This metadata includes the type of room the image represents, such as "kitchen," "bedroom," or "bathroom," and the image path, so the file can be retrieved later.

4. **Data Preparation**: Before ingesting the embeddings and metadata into the Pinecone index, you'll need to prepare your data. This involves creating a list or generator that yields tuples of (image_id, vector_embedding, metadata) for each image.

5. **Ingestion into Pinecone Index**: Finally, you'll use the `index.upsert()` method from the Pinecone Python client to ingest the embeddings and associated metadata into the index. This method takes a list of (image_id, vector_embedding, metadata) tuples as input, where the metadata holds the image path and room type. You can specify the batch size for ingestion to optimize performance.

After completing this process, your Pinecone index will be populated with the image embeddings and their associated metadata, including the room types. You can then query the index with the embedding of a new image, retrieve the closest stored vector, and read the room type from its metadata.
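
The metadata and data preparation steps above can be sketched without touching Pinecone or Bedrock. This is a minimal illustration, assuming file names follow the `<room type>_<anything>.jpg` pattern that the indexing code in this notebook relies on. The helper names and the sample file name `kitchen_1.jpg` are hypothetical.

```python
import os


def room_type_from_filename(file_name):
    """Take the text before the first underscore as the room type."""
    stem = os.path.splitext(file_name)[0]
    return stem.split("_")[0]


def build_upsert_record(image_path, embedding):
    """Build an (id, values, metadata) tuple for one image."""
    file_name = os.path.basename(image_path)
    vector_id = os.path.splitext(file_name)[0]
    metadata = {"path": image_path, "type": room_type_from_filename(file_name)}
    return (vector_id, embedding, metadata)


def chunked(records, batch_size):
    """Yield successive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


record = build_upsert_record("base-images/kitchen_1.jpg", [0.1, 0.2])
print(record[0], record[2]["type"])  # prints "kitchen_1 kitchen"
```

Each slice produced by `chunked` could then be passed to a single `index.upsert()` call instead of upserting one vector at a time.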

In [None]:
index = pinecone.Index(index_name)

In [None]:
def send_info_to_vectordb(vector, image, image_name, room_type, index):
    # Each record is an (id, values, metadata) tuple
    index.upsert([
        (image_name, vector["embedding"], {"path": image, "type": room_type})
    ])

In [None]:
images_folder = "base-images"
for image_name in os.listdir(images_folder):
    if image_name.endswith(".jpg"):
        image_path = os.path.join(images_folder, image_name)
        imagename_without_extension = os.path.splitext(image_name)[0]
        # File names start with the room type, followed by an underscore
        room_type = imagename_without_extension.split("_")[0]
        # Resize images that exceed 2048 pixels in either dimension
        with Image.open(image_path) as image:
            too_large = image.width > 2048 or image.height > 2048
        if too_large:
            resize_image(image_path)
        print("Indexing:", image_path)
        vector = get_embeddings_of_image(image_path)
        send_info_to_vectordb(vector, image_path, imagename_without_extension, room_type, index)
# Pause to give the index time to make the new vectors available for queries
time.sleep(30)
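
The `resize_image` helper comes from the workshop's `utils` module and its implementation is not shown in this notebook. As a rough sketch of the size calculation such a helper might perform, assuming it scales the image down proportionally so that neither side exceeds 2048 pixels (the function name `fit_within` is hypothetical):

```python
def fit_within(width, height, max_side=2048):
    """Return (new_width, new_height) that fits within max_side, keeping the aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough, leave the size unchanged
        return width, height
    # Integer arithmetic guarantees the result never exceeds max_side
    new_width = max(1, width * max_side // longest)
    new_height = max(1, height * max_side // longest)
    return new_width, new_height


print(fit_within(4096, 2048))  # prints "(2048, 1024)"
```

With Pillow, the resulting tuple could be passed to `Image.resize` before saving the file again.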

### Compare a sample image to retrieve room type

This section of the notebook demonstrates how to use multimodal embeddings and Pinecone to classify images into different room types.

#### Image query
With this function we will first transform our image query into an embeddings vector, which we will then use to query the vector database. 

In [None]:
def query_the_database_with_image(image):
    vector = get_embeddings_of_image(image)["embedding"]
    results = index.query(
        vector=vector,
        top_k=1,
        include_metadata=True,
        include_values=True
    )
    return results

#### Cluster the test images
This function classifies each test image by looking up its closest match in the vector database.

In [None]:
def test_images_classification():
    images_folder = "test-images"
    for image_name in os.listdir(images_folder):
        if image_name.endswith(".jpg"):
            image_path = os.path.join(images_folder, image_name)
            image = Image.open(image_path)
            if (image.height > 2048 or image.width > 2048):
                resize_image(image_path)
            results = query_the_database_with_image(image_path)
            print("Photo: {} is type {}".format(image_path, results["matches"][0]["metadata"]["type"]))

In [None]:
test_images_classification()

### Create embeddings at large scale with Amazon Bedrock Batch Inference

With batch inference, you can submit a large number of inference requests asynchronously as a single job that reads its input from, and writes its output to, an S3 bucket. This makes it more efficient to run model inference on large datasets. In this section you will prepare the dataset and run an Amazon Bedrock batch inference job.

#### Create an Amazon S3 bucket
Create a bucket where your input and output data will be stored.

If you already have a bucket created, replace the name in the next cell and skip the following cell. 

In [None]:
bucket_name = "amazonbr-batch-embeddings-{}-{}".format(account_id, region)
s3_bucket_path = "s3://{}".format(bucket_name)

In [None]:
try:
    if region != 'us-east-1':
        s3_client.create_bucket(
            Bucket=bucket_name,     
            CreateBucketConfiguration={
                'LocationConstraint': region
            },
        )
    else:
        s3_client.create_bucket(Bucket=bucket_name)
    print("AWS Bucket: {}".format(bucket_name))
except Exception as err:
    print("ERROR: {}".format(err))

s3_bucket_path = "s3://{}".format(bucket_name)
print("S3 bucket path: {}".format(s3_bucket_path))

### Batch inference preparation - Creating role and policies requirements

We will now prepare the necessary role for the batch inference job. That includes creating the policies required to run model invocation jobs with Amazon Bedrock.

#### Create Trust relationship
This JSON object defines the trust relationship that allows the Bedrock service to assume a role giving it access to the other AWS services it needs. The conditions restrict role assumption to a specific account ID and to a specific component of the Bedrock service (model invocation jobs).
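
A trust policy of this shape could also be assembled from a Python dictionary and serialized with `json.dumps`, which avoids hand-escaping braces in a template string. The sketch below is one possible way to do that. The function name `build_trust_policy` is hypothetical, and the account ID used in the example is a placeholder.

```python
import json


def build_trust_policy(account_id, region):
    """Return a trust policy document that lets Bedrock model invocation jobs assume the role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Only requests originating from this account ...
                    "StringEquals": {"aws:SourceAccount": account_id},
                    # ... and from model invocation jobs in this region
                    "ArnEquals": {
                        "aws:SourceArn": (
                            f"arn:aws:bedrock:{region}:{account_id}:model-invocation-job/*"
                        )
                    },
                },
            }
        ],
    }
    return json.dumps(policy, indent=4)


# Placeholder account ID, for illustration only
print(build_trust_policy("123456789012", "us-east-1"))
```

Because the document is produced by `json.dumps`, it is always valid JSON, and it can be passed as `AssumeRolePolicyDocument` in the same way as a hand-written string.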

In [None]:
role_name = "AmazonBedrockModelInvocation-batch-embeddings-3"
s3_bedrock_ft_access_policy="AmazonBedrock-batch-embeddings-3"
# Build the model ARN for the session's region rather than hard-coding us-east-1
embeddings_model_arn = f"arn:aws:bedrock:{region}::foundation-model/amazon.titan-embed-image-v1"

In [None]:
ROLE_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Principal": {{
                "Service": "bedrock.amazonaws.com"
            }},
            "Action": "sts:AssumeRole",
            "Condition": {{
                "StringEquals": {{
                    "aws:SourceAccount": "{account_id}"
                }},
                "ArnEquals": {{
                    "aws:SourceArn": "arn:aws:bedrock:{region}:{account_id}:model-invocation-job/*"
                }}
            }}
        }}
    ]
}}
"""

### Create S3 access policy

This JSON object defines the permissions of the role that Bedrock will assume. It grants access to the S3 bucket we created, which holds the batch input and output data, and allows the required bucket and object operations.


In [None]:
ACCESS_POLICY_DOC = f"""{{
    "Version": "2012-10-17",
    "Statement": [
        {{
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetBucketAcl",
                "s3:GetBucketNotification",
                "s3:ListBucket",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::{bucket_name}",
                "arn:aws:s3:::{bucket_name}/*"
            ]
        }}
    ]
}}"""

### Create IAM role and attach policies

Let's now create the IAM role with the trust policy defined above and attach the S3 access policy to it.

In [None]:
response = iam_client.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=ROLE_DOC,
    Description="Role for Bedrock to access S3 for model invocation",
)

In [None]:
role_arn = response["Role"]["Arn"]
response = iam_client.create_policy(
    PolicyName=s3_bedrock_ft_access_policy,
    PolicyDocument=ACCESS_POLICY_DOC,
)
policy_arn = response["Policy"]["Arn"]
iam_client.attach_role_policy(
    RoleName=role_name,
    PolicyArn=policy_arn,
)

### Configure the model invocation job
#### Create the input dataset

In [None]:
folder_path = "batch-images"
input_key = "input.jsonl"
output_path = "validation/output/"

def image_to_base64(image_path):
    with open(image_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode('utf-8')

data = []

# Supported image file extensions
image_extensions = ['.jpg', '.jpeg', '.png', '.bmp', '.gif']

# Iterate over files in the folder
for i, file_name in enumerate(os.listdir(folder_path)):
    file_path = os.path.join(folder_path, file_name)

    _, extension = os.path.splitext(file_path)
    if extension.lower() in image_extensions:
        image = Image.open(file_path)
        if image.height > 2048 or image.width > 2048:
            resize_image(file_path)

        model_input = {
            "inputImage": image_to_base64(file_path),
            "embeddingConfig": {
                "outputEmbeddingLength": outputEmbeddingLength 
            }
        }

        data.append({'recordId': file_name, 'modelInput': model_input})

In [None]:
print("Input data items are:", len(data))

#### Process data and output to new lines
The model invocation job requires the input data to be in JSONL format and stored in Amazon S3.

In [None]:
output_data = ""
for row in data:
    output_data += json.dumps(row) + "\n"

s3_client.put_object(Body=output_data, Bucket=bucket_name, Key=input_key)

#### Define data configuration and launch job
Now that the input data is prepared and uploaded to Amazon S3, we can launch the invocation job.

In [None]:
inputDataConfig=({
    "s3InputDataConfig": {
        "s3InputFormat": "JSONL",
        "s3Uri": "{}/{}".format(s3_bucket_path, input_key)
    }
})

outputDataConfig=({
    "s3OutputDataConfig": {
        "s3Uri": "{}/{}".format(s3_bucket_path, output_path)
    }
})
date_time = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

response = bedrock_client.create_model_invocation_job(
    roleArn=role_arn,
    modelId=embeddings_model_arn,
    jobName=f"my-batch-job-test-{date_time}",
    inputDataConfig=inputDataConfig,
    outputDataConfig=outputDataConfig
)

jobArn = response.get('jobArn')

In [None]:
%%time
# Poll every 30 seconds until the job finishes, fails, or is being stopped
status = bedrock_client.get_model_invocation_job(jobIdentifier=jobArn)['status']
while status not in ["Completed", "Failed", "Stopping", "Stopped"]:
    print(status)
    time.sleep(30)
    status = bedrock_client.get_model_invocation_job(jobIdentifier=jobArn)['status']
print(status)
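
A polling loop like the one above runs until the job reaches one of the listed states, however long that takes. A variant with a timeout could look like the following sketch. The helper name `wait_for_terminal_status` and its parameters are hypothetical. It takes the status lookup as a callable so that it does not depend on any particular client.

```python
import time


def wait_for_terminal_status(
    get_status,
    terminal_statuses,
    poll_seconds=30,
    timeout_seconds=3600,
    sleep=time.sleep,
):
    """Call get_status() until it returns a terminal status or the timeout is reached."""
    waited = 0
    status = get_status()
    while status not in terminal_statuses:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status
```

In this notebook it might be called as `wait_for_terminal_status(lambda: bedrock_client.get_model_invocation_job(jobIdentifier=jobArn)["status"], {"Completed", "Failed", "Stopped"})`. The `sleep` parameter exists only so the wait can be replaced in tests.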

#### Retrieve the embeddings 
Once the batch job is complete, we can download the output file and extract the embeddings.

In [None]:
job_id = jobArn.split("/")[-1]
images_file = "input.jsonl.out"
s3_client.download_file(bucket_name, "{}{}/{}".format(output_path, job_id, images_file), images_file)

In [None]:
record_embedding_list = []
with open('input.jsonl.out', 'r') as file:
    for line in file:
        data = json.loads(line)
        record_id = data['recordId']
        embedding = data['modelOutput']['embedding']
        record_embedding = {'recordId': record_id, 'embedding': embedding}
        record_embedding_list.append(record_embedding)
print("The embedding list contains {} records.".format(len(record_embedding_list)))
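
The parsing loop above expects every output line to contain `modelOutput`. If some records in a job fail, that key may be missing and the loop would raise a `KeyError`. The sketch below separates usable embeddings from records without one. It assumes, without having confirmed the exact output schema, that a failed record carries an `error` field in place of `modelOutput`. The function name `split_batch_output` is hypothetical.

```python
import json


def split_batch_output(lines):
    """Split JSONL output lines into (embeddings, failures)."""
    embeddings, failures = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        output = record.get("modelOutput")
        if output and "embedding" in output:
            embeddings.append(
                {"recordId": record["recordId"], "embedding": output["embedding"]}
            )
        else:
            # Assumed shape: failed records expose an "error" field
            failures.append(
                {"recordId": record.get("recordId"), "error": record.get("error")}
            )
    return embeddings, failures
```

It can be used on an open file handle, for example `split_batch_output(open("input.jsonl.out"))`, since iterating over a file yields its lines.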

#### Classify the embeddings
Now that we have the embeddings from our batch job, we can query the vector database with them to retrieve the room type of each picture.

In [None]:
def query_the_database_with_vector(vector):
    results = index.query(
        vector=vector,
        top_k=1,
        include_metadata=True,
        include_values=True
    )
    return results


def test_batch_job_classification(record_embedding_list):
    for record in record_embedding_list:
        vector = record['embedding']
        image_id = record['recordId']

        results = query_the_database_with_vector(vector)
        photo_type = results["matches"][0]["metadata"]["type"]
        print(f"Photo: {image_id} is type {photo_type}")

In [None]:
test_batch_job_classification(record_embedding_list)

### Clean Up
In this section we will delete the resources that could otherwise incur unnecessary costs.

#### Delete the Amazon S3 bucket

In [None]:
# Delete all objects in the bucket
try:
    response = s3_client.list_objects_v2(Bucket=bucket_name)
    if 'Contents' in response:
        for obj in response['Contents']:
            s3_client.delete_object(Bucket=bucket_name, Key=obj['Key'])
        print(f"All objects in {bucket_name} have been deleted.")
except Exception as e:
    print(f"Error deleting objects from {bucket_name}: {e}")

# Delete the bucket
try:
    response = s3_client.delete_bucket(Bucket=bucket_name)
    print(f"Bucket {bucket_name} has been deleted.")
except Exception as e:
    print(f"Error deleting bucket {bucket_name}: {e}")
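
A single `list_objects_v2` call returns at most 1,000 keys, so the cell above may leave objects behind in a larger bucket, and `delete_bucket` would then fail. A paginated variant could look like the following sketch. The function name `empty_bucket` is hypothetical. It takes the S3 client as an argument and uses the boto3 `list_objects_v2` paginator together with `delete_objects`.

```python
def empty_bucket(s3_client, bucket_name):
    """Delete every object in the bucket, one page at a time, and return the count."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        # Pages for an empty bucket have no "Contents" key
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

In this notebook it would be called as `empty_bucket(s3_client, bucket_name)` before deleting the bucket itself. Note that this sketch does not handle versioned buckets.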

#### Delete the Pinecone Index

In [None]:
pinecone.delete_index(index_name)