# Data ingestion

***This notebook works best with the `conda_python3` kernel on an `ml.t3.large` instance***.

---

In this notebook we download the images of the slide deck that we uploaded to Amazon S3 in the [1_data_prep.ipynb](./1_data_prep) notebook, convert them into embeddings, and then ingest these embeddings into a vector database, namely [Amazon OpenSearch Service Serverless](https://aws.amazon.com/opensearch-service/features/serverless/).

1. We use the [Amazon Titan Multimodal Embeddings](https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-titan-multimodal-embeddings-model-bedrock/) model to convert the images into embeddings.

1. The embeddings are then ingested into OpenSearch Service Serverless using the [Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html) pipeline. Uploading the embeddings to an S3 bucket triggers the OpenSearch Ingestion pipeline, which ingests the data into an OpenSearch Serverless index.

1. The OpenSearch Service Serverless Collection is created via the AWS CloudFormation stack for this blog post.


## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
import os
import glob
import json
import time
import boto3
import codecs
import base64
import logging
import botocore
import numpy as np
import globals as g
from typing import List
from pathlib import Path
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from utils import upload_to_s3, get_cfn_outputs, download_image_files_from_s3

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

## Step 2. Download the image files from S3 and convert to Base64

Now we download the image files from the S3 bucket. Once downloaded, these files are [Base64](https://en.wikipedia.org/wiki/Base64)-encoded so that the image data can be sent as text in the JSON request used to create the embeddings.
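
As a small illustration of what the encoding does, the sketch below round-trips a few arbitrary bytes through Base64. The sample bytes are made up for the example and are not a real image; the point is only that the encoding is lossless and yields ASCII text roughly one third larger than the input.

```python
import base64
import math

# Hypothetical sample payload standing in for the bytes of an image file.
raw = bytes(range(10))

# Encode to Base64 and turn the resulting bytes into an ASCII string.
encoded = base64.b64encode(raw).decode("utf-8")

# Decoding restores the original bytes exactly.
decoded = base64.b64decode(encoded)

# Every 3 input bytes become 4 output characters (including '=' padding).
expected_len = 4 * math.ceil(len(raw) / 3)
```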

In [None]:
# download the images from S3; these are converted to embeddings in Step 3
image_files: List = download_image_files_from_s3(g.BUCKET_NAME, g.BUCKET_IMG_PREFIX, g.IMAGE_DIR, g.IMAGE_FILE_EXTN)
logger.info(f"downloaded {len(image_files)} files from s3")

Convert the jpg files into Base64 and write each result to a `.b64` file.

In [None]:
def encode_image_to_base64(image_file_path: str) -> str:
    """Base64-encode an image file, write the result to a .b64 file and return that file's path."""
    with open(image_file_path, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode('utf8')
    # the output file keeps the image's base name, with a .b64 extension
    b64_image_path = os.path.join(g.B64_ENCODED_IMAGES_DIR, f"{Path(image_file_path).stem}.b64")
    with open(b64_image_path, "wb") as b64_image_file:
        b64_image_file.write(bytes(b64_image, 'utf-8'))
    return b64_image_path

In [None]:
os.makedirs(g.B64_ENCODED_IMAGES_DIR, exist_ok=True)
file_list: List = glob.glob(os.path.join(g.IMAGE_DIR, f"*{g.IMAGE_FILE_EXTN}"))
logger.info(f"there are {len(file_list)} files in the {g.IMAGE_DIR} directory for conversion to base64")

# convert each file to base64 and store the base64 in a new file
b64_image_file_list = list(map(encode_image_to_base64, file_list))
logger.info(f"base64 conversion done, there are {len(b64_image_file_list)} base64 encoded files")

## Step 3. Get embeddings for the base64 encoded images

Now we are ready to use the Amazon Titan Multimodal Embeddings model via Amazon Bedrock to convert the Base64 version of the images into embeddings. We store these embeddings in a single JSON file, which is then uploaded to S3.

Note that the embeddings for all the images are stored in a single file so that they can be ingested into the vector database in a single PUT operation (one bulk ingest call is more efficient than one ingest call per image).
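
Because everything ends up in one file, a single malformed record can affect the whole ingest. The sketch below is one possible sanity check to run before writing the file. The field names `image_path` and `vector_embedding` match the records built later in this step, while `validate_records`, the sample bucket name and the zero vectors are hypothetical placeholders used only for illustration.

```python
import json
from typing import Any, Dict, List

# Output dimension of the embeddings as used for the index in this notebook.
EXPECTED_DIM = 1024


def validate_records(records: List[Dict[str, Any]], expected_dim: int = EXPECTED_DIM) -> List[str]:
    """Return a list of problems found; an empty list means every record looks fine."""
    problems = []
    for i, record in enumerate(records):
        vector = record.get("vector_embedding")
        if not isinstance(vector, list):
            problems.append(f"record {i}: vector_embedding is missing or not a list")
        elif len(vector) != expected_dim:
            problems.append(f"record {i}: expected {expected_dim} dimensions, got {len(vector)}")
        if not record.get("image_path"):
            problems.append(f"record {i}: image_path is missing")
    return problems


# Placeholder records with zero vectors (not real embeddings).
sample_records = [
    {"image_path": "s3://example-bucket/img/slide_1.jpg", "vector_embedding": [0.0] * EXPECTED_DIM},
    {"image_path": "s3://example-bucket/img/slide_2.jpg", "vector_embedding": [0.0] * 3},
]
problems = validate_records(sample_records)

# All records are serialized as one JSON array, i.e. one file and one upload.
payload = json.dumps(sample_records)
```

In the notebook, the same check could be applied to `embeddings_list` just before it is written to disk.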

In [None]:
def get_multimodal_embeddings(bedrock: botocore.client.BaseClient, image: str) -> np.ndarray:
    """Return the embeddings for a base64 encoded image as a (1, N) float32 array, or None on failure."""
    body = json.dumps(dict(inputImage=image))
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.FMC_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response.get("body").read())
        embeddings = np.array([response_body.get("embedding")]).astype(np.float32)
    except Exception as e:
        # log only the first few characters of the image data to keep the message short
        logger.error(f"exception while creating embeddings, image(truncated)={image[:10]}, exception={e}")
        embeddings = None

    return embeddings

In [None]:
embeddings_list = []
bedrock = boto3.client(service_name="bedrock-runtime", region_name=g.AWS_REGION, endpoint_url=g.FMC_URL)
file_list: List = glob.glob(os.path.join(g.B64_ENCODED_IMAGES_DIR, "*.b64"))
logger.info(f"there are {len(file_list)} files to convert to embeddings")
for image_file_path in file_list:
    logger.info(f"going to convert {image_file_path} into embeddings")
    
    # read the file, MAX image size supported is 2048 * 2048 pixels
    with open(image_file_path, "rb") as image_file:
        input_image_b64 = image_file.read().decode('utf-8')
    
    # make a call to Bedrock to get the embeddings corresponding to
    # this image's base64 data
    st = time.perf_counter()
    embeddings = get_multimodal_embeddings(bedrock, input_image_b64)
    if embeddings is None:
        logger.error(f"error creating multimodal embeddings for {os.path.basename(image_file_path)}")
        continue
    latency = time.perf_counter() - st
    logger.info(f"successfully converted {image_file_path} to embeddings in {latency:.2f} seconds")
    
    # build the JSON record to ingest for this image; it includes the metadata as well,
    # which can be used later as part of hybrid search from the vector db
    data = {
        "image_path": f"s3://{g.BUCKET_NAME}/{g.BUCKET_IMG_PREFIX}/{Path(image_file_path).stem}{g.IMAGE_FILE_EXTN}",
        "metadata": {
          "slide_filename": g.SLIDE_DECK,
          "model_id": g.FMC_MODEL_ID,
          "slide_description": ""
        },
        "vector_embedding": embeddings[0].tolist()
      }
    
    embeddings_list.append(data)
    logger.info(f"appended json data corresponding to {image_file_path}")

In [None]:
fpath: str = f"{Path(g.SLIDE_DECK).stem}.json"
# use a context manager so that the file is flushed and closed before it is uploaded
with open(fpath, 'w', encoding='utf-8') as f:
    json.dump(embeddings_list, f,
              separators=(',', ':'),
              sort_keys=True,
              indent=4)
logger.info(f"saved multimodal embeddings for all images in {fpath}")

## Step 4. Create the OpenSearch Service Serverless index

**This step is only required until AWS CloudFormation supports creating an OpenSearch Service Serverless index**.

Get the OpenSearch Service Serverless collection endpoint and the index name from the CloudFormation stack outputs.
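
The OpenSearch client expects a bare host name rather than a full URL, so the scheme has to be removed from the endpoint. As an alternative to splitting the string by hand, the sketch below uses the standard library's `urllib.parse`. The helper name `host_from_endpoint` and the endpoint value are hypothetical and only illustrate the parsing.

```python
from urllib.parse import urlparse


def host_from_endpoint(endpoint: str) -> str:
    """Return the host part of an endpoint URL, without scheme, port or path."""
    parsed = urlparse(endpoint)
    if not parsed.hostname:
        # urlparse finds no host when the scheme separator '//' is missing
        raise ValueError(f"no host found in endpoint: {endpoint!r}")
    return parsed.hostname


# Made-up endpoint value, used only to show the result of the parsing.
example_host = host_from_endpoint("https://abc123.us-east-1.aoss.amazonaws.com")
```

Unlike a plain split on `//`, this variant also drops a trailing path or port and fails loudly when the value is not a URL.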

In [None]:
outputs = get_cfn_outputs(g.CFN_STACK_NAME)
host = outputs['MultimodalCollectionEndpoint'].split('//')[1]
index_name = outputs['OpenSearchIndexName']
logger.info(f"opensearchhost={host}, index={index_name}")

We use the OpenSearch client to create an index.

In [None]:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

The structure of the index is important. Note the following about the index.

1. The index is a [k-NN](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html) index so that we can do a vector similarity search in this index.

1. The vector dimension is 1024 which corresponds to the output dimension of the Amazon Titan Multimodal Embeddings model that we are using.

1. The index uses the [`Hierarchical Navigable Small World (HNSW)`](https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/) algorithm for similarity search.
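
To make the purpose of this structure concrete, the sketch below builds the kind of k-NN query body that could be run against the `vector_embedding` field once the index holds data. The helper `build_knn_query` and the zero query vector are hypothetical; in practice the query vector would be the embedding of the search input, produced by the same model.

```python
from typing import Any, Dict, List


def build_knn_query(query_vector: List[float], k: int = 3) -> Dict[str, Any]:
    """Build a k-NN search body against the `vector_embedding` field."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "size": k,
        # leave the (large) vector out of the returned hits
        "_source": {"excludes": ["vector_embedding"]},
        "query": {
            "knn": {
                "vector_embedding": {"vector": query_vector, "k": k}
            }
        },
    }


# Placeholder query vector with the same dimension as the index (1024).
query_body = build_knn_query([0.0] * 1024, k=2)
```

With the client created above, such a body could then be passed to `os_client.search(index=index_name, body=query_body)`.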


In [None]:
index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_path": {
        "type": "text"
      },
      "metadata": {
        "properties": {
          "slide_filename": {
            "type": "text"
          },
          "model_id": {
            "type": "text"
          },
          "slide_description": {
            "type": "text"
          }
        }
      }
    }
  }
}
"""

# If the index already exists, the create call raises an exception; we log it and carry on.
index_body = json.loads(index_body)
try:
    response = os_client.indices.create(index=index_name, body=index_body)
    logger.info(f"response received for the create index -> {response}")
except Exception as e:
    logger.error(f"error in creating index={index_name}, exception={e}")

## Step 5. Upload the embeddings file to S3

We are now ready to ingest the embeddings file, which contains the multimodal embeddings for all the slides in our slide deck, into OpenSearch Service Serverless.

We do this by uploading the file to the designated S3 bucket (see the CloudFormation template [`template.yaml`](../template.yaml)). The upload triggers a run of the OpenSearch Ingestion pipeline, which ingests the data into the OpenSearch Service Serverless index.

In [None]:
upload_to_s3(f"{Path(g.SLIDE_DECK).stem}.json", g.BUCKET_NAME, g.BUCKET_EMB_PREFIX)
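
The ingestion runs asynchronously after the upload, so the documents do not appear in the index immediately. The sketch below is one possible way to wait for them by polling the document count. It assumes that the client exposes a `count(index=...)` call returning a dictionary with a `count` key, as `opensearch-py` does, and that the collection permits this operation. The helper name `wait_for_documents` and its timeout values are hypothetical.

```python
import time
from typing import Any, Callable


def wait_for_documents(client: Any, index: str, expected: int,
                       timeout_s: float = 300.0, poll_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll the index document count until it reaches `expected` or the timeout expires.

    Returns the last observed count, which may be lower than `expected` on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        count = client.count(index=index)["count"]
        if count >= expected or time.monotonic() >= deadline:
            return count
        # wait before asking again so that we do not flood the endpoint
        sleep(poll_s)
```

In this notebook it could be called as `wait_for_documents(os_client, index_name, expected=len(embeddings_list))`.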