# Using Amazon Bedrock's Titan Multimodal Embeddings to create a gallery of document images for classification
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> In SageMaker <b>Studio Classic</b>, you will need a Jupyter kernel with Python 3.9 or above to run this notebook. If you are in Amazon SageMaker Studio, you can use the `SageMaker Distribution 1.4` image.
    You can ignore any errors or warnings during the `pip install` step.
</div>

<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> You will need model access to <b>Titan Multimodal Embeddings Generation 1 (G1)</b> to be able to run this notebook. Verify if you have access to the model by going to <a href="https://console.aws.amazon.com/bedrock" target="_blank">Amazon Bedrock console</a> > left menu "Model access". The "Access status" for Titan Multimodal Embeddings G1 must be in "Access granted" status in green. If you do not have access, then click "Edit" button on the top right > select the model checkbox > click "Save changes" button at the bottom. You should have access to the model within a few moments.
</div>

In this notebook we will show how you can use the Titan Multimodal Embeddings model to create embeddings of various document types from their images. We will then store these embeddings in an in-memory vector database that will serve as our gallery.

Next, we will take a random sample of similar documents, create embeddings for them, and run a similarity search against our in-memory vector database to find the single closest match. We will then use the matched known document in our gallery to identify and classify each randomly selected document.

Let's start by installing our dependencies. Boto3, along with Botocore, provides the AWS SDK for Python that we use for the API calls. FAISS is the open source vector DB we will use.

In [None]:
%pip install -q -U botocore boto3 faiss-cpu

For this demonstration we will copy some sample documents (Bank Statements, Closing Disclosures, Invoices, Social Security cards and W4s) into a local folder. We will create embeddings from these documents and store them in the vector DB that serves as our gallery.

In [None]:
# copy the zip containing sample files from an S3 location to our local directory
!curl https://idp-assets-wwso.s3.us-east-2.amazonaws.com/workshop-data/docClassificationSamples.zip --output docClassificationSamples.zip

Unpack the zip file containing our sample and testing documents into a local directory.

In [None]:
import shutil
EXTRACTDIR: str = "classification-embedding-samples"

shutil.unpack_archive("./docClassificationSamples.zip", extract_dir=EXTRACTDIR)

---
## Vector database
Now we will create our in-memory vector database. For this demonstration we will use the open source FAISS, a library for efficient similarity search and clustering of dense vectors. See https://github.com/facebookresearch/faiss/wiki
We will first check whether the vector DB has already been created and saved to disk by a previous run. If so, we load it; otherwise we create a new instance. The index dimension of 1024 matches the default length of the embedding vectors returned by Titan Multimodal Embeddings.

In [None]:
import os
import faiss

INDEX_NAME: str = "faissGallery.index"  # Name of our vector DB index

# check to see if FAISS index is written to disk 
# from previous run and load into memory
if os.path.isfile(INDEX_NAME):
    indexIDMap=faiss.read_index(INDEX_NAME)
else:
    index = faiss.IndexFlatL2(1024)
    indexIDMap = faiss.IndexIDMap(index)

---
# Create embeddings from images and store
To generate embeddings from the known document images we just copied, we will use Amazon Bedrock. Bedrock is a fully managed service that makes foundation models from leading AI startups and Amazon available via an API. For our purposes here we will use Amazon Titan Multimodal Embeddings Generation 1 (G1).

In [None]:
import boto3

bedrock = boto3.client(
     service_name='bedrock-runtime',
     region_name='us-west-2'
)

Here we will create a function that we will call repeatedly, once for each document in our sample, to fetch embeddings from Titan. Titan Multimodal Embeddings G1 is multimodal, meaning we can create embeddings from text, an image, or both combined. For this demonstration we will only create embeddings from images, which we will store in our vector DB to build the gallery.

In [None]:
import json, numpy as np


def getEmbeddings(inputImageB64):
    request_body = {}
    request_body["inputText"] = None  # not using any text
    request_body["inputImage"] = inputImageB64
    body = json.dumps(request_body)
    response = bedrock.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json")
    response_body = json.loads(response.get("body").read())
    return np.array([response_body.get("embedding")]).astype(np.float32)
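
As a sketch of how the request body could be built for the different input combinations, the helper below assembles the JSON and leaves out whichever input is not supplied. The function name `build_titan_request` is hypothetical. The `embeddingConfig.outputEmbeddingLength` field reflects my understanding of the Titan Multimodal Embeddings G1 request schema (with 256, 384 and 1024 as accepted lengths); treat it as an assumption and check the current Bedrock documentation before relying on it.

```python
import json
from typing import Any, Dict, Optional


def build_titan_request(image_b64: Optional[str] = None,
                        text: Optional[str] = None,
                        output_length: int = 1024) -> str:
    """Return a JSON request body with only the inputs that were provided."""
    if image_b64 is None and text is None:
        raise ValueError("provide an image, text, or both")
    # Assumed schema field; the value must match the FAISS index dimension.
    body: Dict[str, Any] = {
        "embeddingConfig": {"outputEmbeddingLength": output_length}
    }
    if image_b64 is not None:
        body["inputImage"] = image_b64
    if text is not None:
        body["inputText"] = text
    return json.dumps(body)


image_only = build_titan_request(image_b64="aGVsbG8=")
image_and_text = build_titan_request(image_b64="aGVsbG8=", text="bank statement")
```

If a shorter embedding length were chosen, the FAISS index would need to be created with the same dimension instead of 1024.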

Titan Multimodal Embeddings has an image size constraint of 2048px by 2048px. We will use this function to perform an image resize, if needed, before we send the image to Titan.

In [None]:
from io import BytesIO
from PIL import Image

MAX_IMAGE_HEIGHT: int = 2048
MAX_IMAGE_WIDTH: int = 2048


def resizeandGetByteData(imageFile):
    image = Image.open(imageFile)
    # image.size is (width, height); shrink if either side exceeds its limit
    if image.size[0] > MAX_IMAGE_WIDTH or image.size[1] > MAX_IMAGE_HEIGHT:
        # thumbnail() resizes in place and preserves the aspect ratio
        image.thumbnail((MAX_IMAGE_WIDTH, MAX_IMAGE_HEIGHT))
    with BytesIO() as output:
        image.save(output, 'png')
        bytes_data = output.getvalue()
    return bytes_data
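
The arithmetic behind a per-side size limit can be sketched without any imaging library. The helper `fit_within` below is a hypothetical name, not part of the notebook: it returns the largest size that fits inside the limit while keeping the aspect ratio. It also shows why each side needs its own check: a 4096 by 1000 image has fewer total pixels than 2048 by 2048, yet its width is twice the limit.

```python
from typing import Tuple


def fit_within(width: int, height: int,
               max_width: int = 2048, max_height: int = 2048) -> Tuple[int, int]:
    """Scale (width, height) down to fit the limits, keeping the aspect ratio."""
    if width <= max_width and height <= max_height:
        return width, height  # already small enough, leave untouched
    # Use the tighter of the two ratios so both sides end up within bounds.
    scale = min(max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


wide = fit_within(4096, 1000)
tall = fit_within(1000, 4096)
small = fit_within(800, 600)
```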

This function orchestrates reading each file from disk, base64-encoding the bytes, and calling the functions above to send the image to Titan.

In [None]:
import base64
import os


# enumerate over classified documents,
# create embeddings of each and store in vector DB
def getDocumentsandIndex(directory, classID):
    for fileName in os.listdir(directory):
        doc = f"{directory}/{fileName}"
        if os.path.isfile(doc):
            with open(doc, 'rb') as f:
                if (fileName.endswith('.png') or
                        fileName.endswith('.jpeg') or
                        fileName.endswith('.jpg') or
                        fileName.endswith('.tif')):
                    bytes_data = resizeandGetByteData(f)
                    input_image_base64 = base64.b64encode(bytes_data).decode('utf8')
                    embeddings = getEmbeddings(input_image_base64)
                    print(f"Adding file {directory}/{fileName} to Index.")
                    indexIDMap.add_with_ids(embeddings, classID)
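
The file extension check is the part of this loop most likely to be reused elsewhere. As an optional sketch, it could live in one small helper. The names `SUPPORTED_EXTENSIONS` and `is_supported_image` are hypothetical, and unlike the check in the cell above this version lowercases the name first, so `SCAN.PNG` would also be accepted.

```python
SUPPORTED_EXTENSIONS = (".png", ".jpeg", ".jpg", ".tif")


def is_supported_image(file_name: str) -> bool:
    """True when the file name ends with one of the image extensions used here."""
    # str.endswith accepts a tuple, so one call covers every extension.
    return file_name.lower().endswith(SUPPORTED_EXTENSIONS)


checks = {name: is_supported_image(name)
          for name in ["invoice1.png", "W4.JPG", "notes.txt", "scan.tif"]}
```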

We will now enumerate over our document samples, create the embeddings, and save them into our vector DB.
For each known document type we will store an integer in the vector DB as a piece of metadata associated with the document.

0. Closing Disclosure
1. Invoices
2. Social Security Cards
3. W4
4. Bank Statement


<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> To execute this notebook with your own sample documents, simply create a folder for each document type under sampleGallery and copy your documents into their respective folder. Modify the DOC_CLASSES array with your additional document classes and add the additional method calls to getDocumentsandIndex with your newly created folder and class.
</div>

In [None]:
DOC_CLASSES: list[str] = ["Closing Disclosure", "Invoices", "Social Security Card", "W4", "Bank Statement", "Email"]

if indexIDMap.ntotal == 0:  # populate our Vector DB if it is empty on first run
    getDocumentsandIndex(f"{EXTRACTDIR}/sampleGallery/ClosingDisclosure", DOC_CLASSES.index("Closing Disclosure"))
    getDocumentsandIndex(f"{EXTRACTDIR}/sampleGallery/Invoices", DOC_CLASSES.index("Invoices"))
    getDocumentsandIndex(f"{EXTRACTDIR}/sampleGallery/SSCards", DOC_CLASSES.index("Social Security Card"))
    getDocumentsandIndex(f"{EXTRACTDIR}/sampleGallery/W4", DOC_CLASSES.index("W4"))
    getDocumentsandIndex(f"{EXTRACTDIR}/sampleGallery/BankStatements", DOC_CLASSES.index("Bank Statement"))

print(f"A total of {indexIDMap.ntotal} image embeddings are stored in the vector DB")
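
When more document classes are added, the repeated calls can become tedious to maintain. One possible alternative, shown here as a sketch, is to describe the folders in a mapping and derive the path and class ID pairs from it. The names `GALLERY_FOLDERS` and `plan_gallery_jobs` are hypothetical; the folder names are the ones used in the cell above.

```python
from typing import Dict, List, Tuple

DOC_CLASSES: List[str] = ["Closing Disclosure", "Invoices", "Social Security Card",
                          "W4", "Bank Statement", "Email"]

# Folder under sampleGallery -> class name in DOC_CLASSES
GALLERY_FOLDERS: Dict[str, str] = {
    "ClosingDisclosure": "Closing Disclosure",
    "Invoices": "Invoices",
    "SSCards": "Social Security Card",
    "W4": "W4",
    "BankStatements": "Bank Statement",
}


def plan_gallery_jobs(extract_dir: str, doc_classes: List[str],
                      folder_to_class: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (directory, class ID) pairs, one per gallery folder."""
    jobs = []
    for folder, class_name in folder_to_class.items():
        if class_name not in doc_classes:
            raise ValueError(f"unknown document class: {class_name}")
        jobs.append((f"{extract_dir}/sampleGallery/{folder}",
                     doc_classes.index(class_name)))
    return jobs


jobs = plan_gallery_jobs("classification-embedding-samples", DOC_CLASSES, GALLERY_FOLDERS)
```

Each pair could then be passed to `getDocumentsandIndex(directory, classID)` in a loop, so adding a class means adding one entry to the mapping.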

Let's save the FAISS vector DB to disk. If we rerun this notebook, the DB will be read from disk and reused. If you would like to create a new vector DB, either delete this one from disk or change the index name "faissGallery.index" (the INDEX_NAME variable) in the cell near the beginning.

In [None]:
# Save our in-memory index to disk.
faiss.write_index(indexIDMap, INDEX_NAME)

---
# Test by doing a similarity search in the vector DB
Now that our vector DB is populated with embeddings from the sample document images, our virtual gallery is ready. We will now enumerate over some additional sample documents in our testing folder. For each one we will create embeddings, run a similarity search against our vector DB to find the single closest match, and display the result.
The cell below also prints the distance between each test image and its closest gallery match. Because the index is a FAISS `IndexFlatL2`, this value is the squared Euclidean (L2) distance, with 0 indicating an exact match. This value could be leveraged so that, should it exceed a certain threshold, the image is surfaced to a human for verification.

In [None]:
testingDirectory = f"{EXTRACTDIR}/testGallery"
for fileName in os.listdir(testingDirectory):
    if os.path.isfile(f"{testingDirectory}/{fileName}"):
        with open(f"{testingDirectory}/{fileName}", "rb") as f:
            if fileName.endswith('.png') or fileName.endswith('.jpeg') or fileName.endswith('.jpg') or fileName.endswith('.tif'):
                bytes_data = resizeandGetByteData(f)
                input_image_base64 = base64.b64encode(bytes_data).decode('utf8') 
                embeddings = getEmbeddings(input_image_base64)
                distances = indexIDMap.search(embeddings, k=1)
                print(f"File Name:- {fileName} ---- Document Class:- {DOC_CLASSES[distances[1][0][0]]}")
                # See the distance between the search image and the retrieved image
                print (f"-- Vector Euclidean Distance L2 :- {distances[0][0][0]}")
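
The idea of surfacing uncertain matches to a human can be sketched as a small routing function. Everything named here is hypothetical: `classify_or_flag` is not part of the notebook, and the `REVIEW_THRESHOLD` value is a placeholder that would need to be tuned on your own documents. Keep in mind that a flat L2 FAISS index reports squared distances, so the threshold should be chosen on that scale.

```python
from typing import List

DOC_CLASSES: List[str] = ["Closing Disclosure", "Invoices", "Social Security Card",
                          "W4", "Bank Statement", "Email"]

REVIEW_THRESHOLD: float = 0.5  # placeholder value, not derived from real data


def classify_or_flag(distance: float, class_id: int, doc_classes: List[str],
                     threshold: float = REVIEW_THRESHOLD) -> str:
    """Return the class name, or a review marker when the match is too distant."""
    if distance > threshold:
        return f"NEEDS REVIEW (closest: {doc_classes[class_id]})"
    return doc_classes[class_id]


confident = classify_or_flag(0.12, 1, DOC_CLASSES)
uncertain = classify_or_flag(0.87, 1, DOC_CLASSES)
```

In the loop above, the two inputs would come from the search result: `distances[0][0][0]` for the distance and `distances[1][0][0]` for the class ID.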

As you can see from the results above, the images were correctly identified with the exception of the email. No email image was initially included in our vector DB gallery, so the closest match found was an Invoice document. We can improve our results going forward by adding the embeddings of an email image to our gallery.

In [None]:
getDocumentsandIndex(f"{EXTRACTDIR}/sampleGallery/Emails", DOC_CLASSES.index("Email"))
print(f"A total of {indexIDMap.ntotal} image embeddings are stored in the vector DB")

Now that we have added an email image to our DB, let's rerun our tests.


In [None]:
testingDirectory = f"{EXTRACTDIR}/testGallery"
for fileName in os.listdir(testingDirectory):
    if os.path.isfile(f"{testingDirectory}/{fileName}"):
        with open(f"{testingDirectory}/{fileName}", "rb") as f:
            if fileName.endswith('.png') or fileName.endswith('.jpeg') or fileName.endswith('.jpg') or fileName.endswith('.tif'):
                bytes_data = resizeandGetByteData(f)
                input_image_base64 = base64.b64encode(bytes_data).decode('utf8') 
                embeddings = getEmbeddings(input_image_base64)
                distances = indexIDMap.search(embeddings, k=1)
                print(f"File Name:- {fileName} ---- Document Class:- {DOC_CLASSES[distances[1][0][0]]}")
                # If you would like to see Vector Euclidean L2 distance between
                # image document and what was found in the gallery,
                # uncomment next line
                # print (f"-- Vector Euclidean Distance L2 :- {distances[0][0][0]}")

---
# Clean Up
This notebook does not create any resources or S3 buckets that need cleanup. If you are running this notebook from a newly created SageMaker Jupyter notebook environment, you can stop the instance to avoid any recurring charges.

# Conclusion
As demonstrated here, by using Amazon Titan Multimodal Embeddings we are able to create a gallery of images that we can then use for similarity search to help identify the type of document we have in hand. One of the advantages of using a vector DB containing embeddings is that we can quickly add a new set of embeddings from new documents to our gallery in near real-time to further refine our similarity search.

During our testing we found this solution works very well when documents have certain unique visual characteristics. Consider, for example, invoices received from various vendors. Each vendor's invoice will typically have a certain look and feel, such as the way the line items are laid out or a vendor logo placed somewhere on the invoice. Image embeddings will achieve good results with this type of vendor classification. Results might be more challenging when documents are dense with text from top to bottom. In this scenario it might be better to use an NLP service such as Amazon Comprehend to classify on text, or to use a few-shot prompt with an LLM.