# Data ingestion

***This notebook works best with the `conda_python3` kernel on an `ml.t3.large` instance***.

---

In this notebook we download the images corresponding to each slide deck in the [sample dataset](../qa.jsonl), convert them into embeddings, and then ingest these embeddings into a vector database, namely [Amazon OpenSearch Service Serverless](https://aws.amazon.com/opensearch-service/features/serverless/).

1. We use the [Amazon Titan Multimodal Embeddings](https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-titan-multimodal-embeddings-model-bedrock/) model to convert the images into embeddings.

1. The embeddings are then ingested into OpenSearch Service Serverless using an [Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html) pipeline. The embeddings are uploaded to an S3 bucket, which triggers the OpenSearch Ingestion pipeline to ingest the data into an OpenSearch Serverless index.

1. The OpenSearch Service Serverless Collection is created via the AWS CloudFormation stack for this blog post.


## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [9]:
import os
import glob
import json
import time
import boto3
import codecs
import base64
import logging
import botocore
import jsonlines
import numpy as np
import pandas as pd 
import globals as g
import requests as req
from typing import List
from pathlib import Path
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from utils import upload_to_s3, get_cfn_outputs, get_bucket_name, download_image_from_url, encode_image_to_base64

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [10]:
bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=g.TITAN_URL)
bucket_name: str = get_bucket_name(g.CFN_STACK_NAME)

## Step 2. Create the OpenSearch Service Serverless index

**This step is only required until AWS CloudFormation supports creating an OpenSearch Service Serverless index**.

Get the OpenSearch Service Serverless collection endpoint and the index name from the CloudFormation stack outputs.

In [11]:
outputs = get_cfn_outputs(g.CFN_STACK_NAME)
host = outputs['MultimodalCollectionEndpoint'].split('//')[1]
index_name = outputs['OpenSearchIndexName']
logger.info(f"opensearchhost={host}, index={index_name}")

[2024-07-17 18:22:48,790] p30904 {725157277.py:5} INFO - opensearchhost=7uiiz7d87b3q8u2kfmtd.us-east-1.aoss.amazonaws.com, index=blog3slides-app1
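
The cell above strips the scheme from the collection endpoint with `split('//')[1]`. As a slightly more defensive alternative, the host could be extracted with the standard library's URL parser. This is only a sketch: the helper name `endpoint_to_host` and the example endpoint value are hypothetical and not part of this notebook's `utils` module.

```python
from urllib.parse import urlparse


def endpoint_to_host(endpoint: str) -> str:
    """Return the bare host name from a collection endpoint URL."""
    # Prefix scheme-less input with '//' so urlparse treats it as a network location.
    parsed = urlparse(endpoint if "//" in endpoint else f"//{endpoint}")
    if not parsed.hostname:
        raise ValueError(f"no host found in endpoint: {endpoint!r}")
    return parsed.hostname


# Hypothetical endpoint value, for illustration only.
example_host = endpoint_to_host("https://example-collection.us-east-1.aoss.amazonaws.com")
```

Unlike the plain `split`, this version also accepts an endpoint that is already a bare host name and drops an explicit port if one is present.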


We use the OpenSearch client to create an index.

In [12]:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

[2024-07-17 18:22:53,763] p30904 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


The structure of the index is important. Note the following about the index.

1. The index is a [k-NN](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html) index so that we can do a vector similarity search in this index.

1. The vector dimension is 1,024 which corresponds to the output dimension of the Amazon Titan Multimodal Embeddings model that we are using.

1. The index uses the [`Hierarchical Navigable Small World (HNSW)`](https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/) algorithm for similarity search.
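
To make the first two points concrete, here is a minimal sketch of the kind of k-NN query body such an index is meant to serve. The helper name `build_knn_query` is hypothetical, the field name `vector_embedding` is the one used in the mapping created in this step, and the all-zeros vector is a placeholder rather than a real embedding.

```python
from typing import Any, Dict, List


def build_knn_query(query_vector: List[float], k: int = 3, dimension: int = 1024) -> Dict[str, Any]:
    """Build a k-NN search body for the `vector_embedding` field."""
    # The query vector must have the same dimension as the indexed vectors.
    if len(query_vector) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(query_vector)}")
    return {
        "size": k,
        # Return only the stored fields, not the large vector itself.
        "_source": ["image_url", "metadata"],
        "query": {"knn": {"vector_embedding": {"vector": query_vector, "k": k}}},
    }


# Placeholder vector, for illustration only.
query = build_knn_query([0.0] * 1024, k=2)
```

Assuming an index populated with embeddings, a body like this could be passed to the client as `os_client.search(index=index_name, body=query)`.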


In [14]:
index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_url": {
        "type": "text"
      },
       "metadata": { 
        "properties" :
          {
            "deck_name" : {
              "type" : "text"
            },
            "deck_url" : {
              "type" : "text"
            }
          }
      }
    }
  }
}
"""

# If the index already exists, the create call raises an exception; that is expected and safe to ignore.
index_body = json.loads(index_body)
try:
    response = os_client.indices.create(index_name, body=index_body)
    logger.info(f"response received for the create index -> {response}")
except Exception as e:
    logger.error(f"error in creating index={index_name}, exception={e}")

[2024-07-17 18:23:09,969] p30904 {base.py:259} INFO - PUT https://7uiiz7d87b3q8u2kfmtd.us-east-1.aoss.amazonaws.com:443/blog3slides-app1 [status:200 request:0.558s]
[2024-07-17 18:23:09,971] p30904 {3614055036.py:70} INFO - response received for the create index -> {'acknowledged': True, 'shards_acknowledged': True, 'index': 'blog3slides-app1'}
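
The cell above relies on catching the exception raised when the index already exists. An alternative is to check for the index first, so that a re-run does not log an error. The sketch below assumes the collection supports the index-exists check; `ensure_index` is a hypothetical helper, and the `_FakeClient` classes are stand-ins used only to exercise it without a live collection.

```python
from typing import Any, Dict


def ensure_index(client: Any, name: str, body: Dict[str, Any]) -> bool:
    """Create the index only if it is missing; return True when it was created."""
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=body)
    return True


class _FakeIndices:
    """In-memory stand-in for `os_client.indices` (illustration only)."""

    def __init__(self) -> None:
        self.created: Dict[str, Any] = {}

    def exists(self, index: str) -> bool:
        return index in self.created

    def create(self, index: str, body: Any = None) -> Dict[str, Any]:
        self.created[index] = body
        return {"acknowledged": True, "index": index}


class _FakeClient:
    def __init__(self) -> None:
        self.indices = _FakeIndices()


fake_client = _FakeClient()
first_call = ensure_index(fake_client, "demo-index", {"settings": {"index.knn": True}})
second_call = ensure_index(fake_client, "demo-index", {"settings": {"index.knn": True}})
```

With the real client, the call would be `ensure_index(os_client, index_name, index_body)`.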


## Step 3. Download images locally and get embeddings for each image

In [15]:
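def get_multimodal_embeddings(bedrock: botocore.client, image: str) -> np.ndarray:
    """Return a (1, N) float32 array with the embedding of a base64-encoded image, or None on failure."""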
    body = json.dumps(dict(inputImage=image))
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.FMC_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response.get("body").read())
        embeddings = np.array([response_body.get("embedding")]).astype(np.float32)
    except Exception as e:
        logger.error(f"exception while getting embeddings for image(truncated)={image[:10]}, exception={e}")
        embeddings = None

    return embeddings
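
The function above sends only `inputImage` in the request body and reads the `embedding` key from the response. The sketch below separates those two steps so they can be checked without calling Bedrock. The helper names are hypothetical. The `inputText` and `embeddingConfig.outputEmbeddingLength` fields are, to my understanding, optional parts of the Titan Multimodal Embeddings request format; treat them as assumptions and check the model documentation before relying on them.

```python
import json
from typing import List, Optional


def build_titan_request(image_b64: Optional[str] = None,
                        text: Optional[str] = None,
                        output_length: int = 1024) -> str:
    """Serialize a request body with an image, a text, or both."""
    if image_b64 is None and text is None:
        raise ValueError("provide image_b64, text, or both")
    # Assumed optional field; 1024 matches the index dimension used in this notebook.
    payload = {"embeddingConfig": {"outputEmbeddingLength": output_length}}
    if image_b64 is not None:
        payload["inputImage"] = image_b64
    if text is not None:
        payload["inputText"] = text
    return json.dumps(payload)


def parse_titan_response(raw_body: bytes) -> List[float]:
    """Extract the embedding vector from a raw response body."""
    embedding = json.loads(raw_body).get("embedding")
    if embedding is None:
        raise ValueError("response has no 'embedding' key")
    return [float(x) for x in embedding]


# Placeholder inputs, for illustration only (not a real image or model response).
request_body = build_titan_request(image_b64="aGVsbG8=")
vector = parse_titan_response(b'{"embedding": [0.1, 0.2]}')
```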

In [None]:
os.makedirs(g.IMAGE_DIR, exist_ok=True)
os.makedirs(g.B64_ENCODED_IMAGES_DIR, exist_ok=True)
os.makedirs(g.EMBEDDINGS_DIR, exist_ok=True)

cols = ['url']
with jsonlines.open('qa.jsonl') as f:
    for line in f.iter():
        embeddings_list = []
        deck_name = line['deck_name']
        deck_url = line['deck_url']
        img_df = pd.DataFrame(line['image_urls'], columns=cols)
        for ind, row in img_df.iterrows():
            img_url = row['url']
            img_path = download_image_from_url(img_url, g.IMAGE_DIR)
            if img_path != "":
                b64_img_path = encode_image_to_base64(img_path)

                logger.info(f"going to convert {img_url} into embeddings")

                # read the base64-encoded file; the maximum supported image size is 2048 x 2048 pixels
                with open(b64_img_path, "rb") as image_file:
                    input_image_b64 = image_file.read().decode('utf-8')

                # make a call to Bedrock to get the embeddings corresponding to
                # this image's base64 data
                st = time.perf_counter()
                embeddings = get_multimodal_embeddings(bedrock, input_image_b64)
                if embeddings is None:
                    logger.error(f"error creating multimodal embeddings for {img_url}")
                    continue
                latency = time.perf_counter() - st
                logger.info(f"successfully converted {img_url} to embeddings in {latency:.2f} seconds")

                # convert the data we want to ingest for this image into JSON; this includes the metadata as well,
                # which can be used later as part of a hybrid search against the vector db
                data = {
                    "image_url": img_url,
                    "metadata": {
                      "deck_name": deck_name,
                      "deck_url": deck_url
                    },
                    "vector_embedding": embeddings[0].tolist()
                  }

                embeddings_list.append(data)
                logger.info(f"appended json data corresponding to {img_url}")
                time.sleep(1)
        
        if len(embeddings_list) > 0:
            fpath: str = f"{g.EMBEDDINGS_DIR}/{deck_name}.json"
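            # use a context manager so the file is flushed and closed after each deck
            with codecs.open(fpath, 'w', encoding='utf-8') as out_file:
                json.dump(embeddings_list, out_file,
                          separators=(',', ':'),
                          sort_keys=True,
                          indent=4)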
            logger.info(f"saved multimodal embeddings for deck {deck_name} in {fpath}")
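
The loop above notes a maximum supported image size of 2048 x 2048 pixels but does not check it. A minimal sketch of how target dimensions could be computed for an oversized image while preserving its aspect ratio is shown below. The helper `fit_within` is hypothetical and only does the arithmetic; the actual resize would be done with an image library.

```python
from typing import Tuple

MAX_SIDE = 2048  # limit taken from the comment in the cell above


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Return dimensions that fit in max_side x max_side, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Images already within the limit are left unchanged.
    if width <= max_side and height <= max_side:
        return width, height
    # Scale the longer side down to max_side; integer math avoids rounding above the limit.
    if width >= height:
        return max_side, max(1, height * max_side // width)
    return max(1, width * max_side // height), max_side


landscape = fit_within(4096, 2048)
small = fit_within(1000, 500)
portrait = fit_within(1000, 5000)
```

If Pillow is available, the result could be passed to `Image.resize` before the image is base64-encoded.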


In [None]:
embedding_file_list: List = glob.glob(os.path.join(g.EMBEDDINGS_DIR, "*.json"))
logger.info(f"there are {len(embedding_file_list)} embedding files")
for embedding_file_path in embedding_file_list:
    upload_to_s3(embedding_file_path, bucket_name, g.BUCKET_EMB_PREFIX)
    logger.info(f"uploaded embeddings for deck {os.path.basename(embedding_file_path)} to S3 bucket {bucket_name}")
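
The `upload_to_s3` helper comes from this notebook's `utils` module and is not shown here. As a rough sketch of what such a helper might do, the code below builds an object key from a prefix and the file name and then calls the S3 client's `upload_file` method. The function names and the example prefix are hypothetical, and the real helper may behave differently.

```python
import os
from typing import Any


def build_s3_key(local_path: str, prefix: str) -> str:
    """Join a key prefix and the file's base name into an S3 object key."""
    file_name = os.path.basename(local_path)
    clean_prefix = prefix.strip("/")
    return f"{clean_prefix}/{file_name}" if clean_prefix else file_name


def upload_file_to_s3(s3_client: Any, local_path: str, bucket: str, prefix: str) -> str:
    """Upload one file under the given prefix and return the object key."""
    key = build_s3_key(local_path, prefix)
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return key


# Hypothetical path and prefix, for illustration only.
example_key = build_s3_key("embeddings/deck-a.json", "multimodal/embeddings/")
```

With a real client this would be called as `upload_file_to_s3(boto3.client("s3"), embedding_file_path, bucket_name, g.BUCKET_EMB_PREFIX)`.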