# Data ingestion

***This notebook works best with the `conda_python3` kernel on an `ml.t3.xlarge` instance***.

---

In this notebook, we download the images for the slide deck that we uploaded to Amazon S3 in the [1_data_prep.ipynb](./1_data_prep) notebook, convert them into embeddings, and then ingest these embeddings into a vector database, [Amazon OpenSearch Service Serverless](https://aws.amazon.com/opensearch-service/features/serverless/).

1. We use the [Amazon Titan Multimodal Embeddings](https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-titan-multimodal-embeddings-model-bedrock/) model to convert the images into embeddings.

1. The embeddings are then ingested into OpenSearch Service Serverless using the [Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html) pipeline. The embeddings are uploaded to an S3 bucket, which triggers the OpenSearch Ingestion pipeline to ingest the data into an OpenSearch Serverless index.

1. The OpenSearch Service Serverless Collection is created via the AWS CloudFormation stack for this blog post.
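
To make the first item concrete, here is a minimal sketch of how a helper such as `get_multimodal_embeddings` (defined in `utils` and not shown in this notebook) might call the model. The model ID and the `inputText`, `inputImage`, `embeddingConfig`, and `embedding` field names follow the public Amazon Bedrock documentation for Titan Multimodal Embeddings; the function names `build_titan_request` and `embed` are hypothetical, and the real helper may work differently.

```python
import base64
import json
from typing import List, Optional

# Assumption: model ID as listed in the Bedrock documentation.
TITAN_MULTIMODAL_MODEL_ID = "amazon.titan-embed-image-v1"


def build_titan_request(text: Optional[str] = None,
                        image_bytes: Optional[bytes] = None,
                        output_length: int = 1024) -> str:
    """Build the JSON request body for a text and/or image embedding."""
    if text is None and image_bytes is None:
        raise ValueError("provide text, image_bytes, or both")
    body = {"embeddingConfig": {"outputEmbeddingLength": output_length}}
    if text is not None:
        body["inputText"] = text
    if image_bytes is not None:
        # Images are sent as base64-encoded strings.
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps(body)


def embed(bedrock_client,
          text: Optional[str] = None,
          image_bytes: Optional[bytes] = None,
          output_length: int = 1024) -> List[float]:
    """Invoke the model through a bedrock-runtime client and return the vector."""
    response = bedrock_client.invoke_model(
        modelId=TITAN_MULTIMODAL_MODEL_ID,
        body=build_titan_request(text, image_bytes, output_length),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```

Whatever output length the helper produces must match the `dimension` of the `knn_vector` field in the index mapping, otherwise ingestion into the index will be rejected.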


## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [1]:
import os
import time
import glob
import json
import boto3
import codecs
import base64
import logging
import requests
import botocore
import sagemaker
import numpy as np
import globals as g
from typing import List
from pathlib import Path
from requests_auth_aws_sigv4 import AWSSigV4
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from utils import upload_to_s3, get_cfn_outputs, get_bucket_name, download_image_files_from_s3, get_multimodal_embeddings

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
bucket_name: str = get_bucket_name(g.CFN_STACK_NAME)
s3 = boto3.client('s3')

In [3]:
sagemaker_session = sagemaker.Session()
sm_client = sagemaker_session.sagemaker_client
sm_runtime_client = sagemaker_session.sagemaker_runtime_client



In [4]:
bedrock = boto3.client(service_name="bedrock-runtime", region_name=g.AWS_REGION, endpoint_url=g.FMC_URL)

In [5]:
# Read the collection endpoint from the CloudFormation stack outputs.
# TODO: read the pipeline endpoint and index name from the CFN outputs as well.
outputs = get_cfn_outputs(g.CFN_STACK_NAME)
host = outputs['MultimodalCollectionEndpoint'].split('//')[1]
# index_name = outputs['OpenSearchIndexName']
# logger.info(f"opensearchhost={host}, index={index_name}")

# Hard-coded for now; replace these with the values for your own pipeline and index.
osi_endpoint = "https://test-pipeline-4-5tdu5j2gt7yf6kzpuqqt74jcxm.us-east-1.osis.amazonaws.com/data/ingest"
index_name = "blog2index"
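
The `get_cfn_outputs` helper lives in `utils` and is not shown here. As a rough sketch of what such a helper might do, the functions below flatten a CloudFormation `describe_stacks` response into a dictionary and extract the host name from an endpoint URL. The names `outputs_to_dict` and `endpoint_host` are hypothetical; only the response shape (`Stacks`, `Outputs`, `OutputKey`, `OutputValue`) comes from the CloudFormation API.

```python
from typing import Dict
from urllib.parse import urlparse


def outputs_to_dict(describe_stacks_response: dict) -> Dict[str, str]:
    """Map OutputKey -> OutputValue for the first stack in the response."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return {}
    # A stack without outputs has no "Outputs" key at all.
    return {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}


def endpoint_host(endpoint_url: str) -> str:
    """Return the host part of a URL such as https://<id>.<region>.aoss.amazonaws.com."""
    return urlparse(endpoint_url).netloc


# Usage in the notebook would look roughly like this (not executed here):
#   cfn = boto3.client("cloudformation")
#   outputs = outputs_to_dict(cfn.describe_stacks(StackName=g.CFN_STACK_NAME))
#   host = endpoint_host(outputs["MultimodalCollectionEndpoint"])
```

Parsing the URL with `urlparse` instead of splitting on `//` also drops any trailing path, which the OpenSearch client does not expect in `host`.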

In [6]:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

[2024-02-16 22:57:47,718] p8907 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [7]:
index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_path": {
        "type": "text"
      },
      "slide_text": {
        "type": "text"
      },
      "slide_number": {
        "type": "text"
      },
      "metadata": {
        "properties": {
          "filename": {
            "type": "text"
          },
          "desc": {
            "type": "text"
          }
        }
      }
    }
  }
}
"""

# On re-runs the index already exists and create() raises resource_already_exists_exception; that is expected and safe to ignore.
index_body = json.loads(index_body)
try:
    response = os_client.indices.create(index=index_name, body=index_body)
    logger.info(f"response received for the create index -> {response}")
except Exception as e:
    logger.error(f"error in creating index={index_name}, exception={e}")

[2024-02-16 22:57:49,010] p8907 {952156850.py:48} ERROR - error in creating index=blog2index, exception=RequestError(400, 'resource_already_exists_exception', 'OpenSearch exception [type=resource_already_exists_exception, reason=index [blog2index/POwjtI0Bgl6llOBo6eMi] already exists]- server : [envoy]')
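
The ERROR line above is expected when the notebook is re-run, but it is easy to mistake for a real failure. A quieter alternative, sketched here on the assumption that the collection allows the `indices.exists` call, is to check for the index before creating it. `create_index_if_missing` is a hypothetical helper, not part of `utils`.

```python
def create_index_if_missing(client, index_name: str, body: dict) -> bool:
    """Create the index only when it is absent.

    `client` is expected to be an opensearch-py `OpenSearch` instance such as
    `os_client`. Returns True if the index was created, False if it existed.
    """
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, body=body)
    return True


# Usage in the notebook would look roughly like this (not executed here):
#   created = create_index_if_missing(os_client, index_name, index_body)
#   logger.info(f"index={index_name}, created={created}")
```

The check and the create are two separate requests, so a concurrent writer could still create the index in between; for a single notebook session that race is unlikely to matter.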


In [8]:
llava13b_endpoint = Path(g.ENDPOINT_FILENAME).read_text()
logger.info(f"llava13b endpoint {llava13b_endpoint}")

[2024-02-16 22:57:50,793] p8907 {4145819150.py:2} INFO - llava13b endpoint llava-djl-2024-02-08-22-47-21-804-12xl-endpoint


In [9]:
# prompt = "Describe the image in detail, include titles and numbers."

prompt = """
Describe the slide you see in detail. Describe any images, tables, and charts you see in detail with accurate numbers. Be descriptive about all figures/charts, including column/row names. Your response should be extremely detailed and data oriented. Only give the data/numbers that are clearly visible in the image and DO NOT mention anything if it is not in the image or if it is blurry. Be completely accurate.
"""

print(prompt)


Describe the slide you see in detail. Describe any images, tables, and charts you see in detail with accurate numbers. Be descriptive about all figures/charts, including column/row names. Your response should be extremely detailed and data oriented. Only give the data/numbers that are clearly visible in the image and DO NOT mention anything if it is not in the image or if it is blurry. Be completely accurate.



In [11]:
# `payload` is assigned inside the ingestion loop (cell In [10]); run this cell after that loop.
print(payload)

b'{"text": ["\\nDescribe the slide you see in detail. Describe any images, tables, and charts you see in detail with accurate numbers. Be descriptive about all figures/charts, including column/row names. Your response should be extremely detailed and data oriented. Only give the data/numbers that are clearly visible in the image and DO NOT mention anything if it is not in the image or if it is blurry. Be completely accurate.\\n"], "input_image_s3": "s3://multimodal-bucket-563851014557/multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_11.jpg"}'
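
The ingestion loop in cell `In [10]` numbers slides with a counter that follows the order in which S3 lists the objects. S3 returns keys in lexicographic order, so a key ending in `_image_10.jpg` is listed before one ending in `_image_2.jpg`, and the counter can drift from the real slide number. The sketch below derives the number from the file name instead. It assumes every key ends in `_<number>.<ext>`, as the sample key above does; `slide_number_from_key` and `sort_keys_by_slide` are hypothetical helpers.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, List

# Digits at the very end of the file name (extension already removed).
_TRAILING_NUMBER = re.compile(r"(\d+)$")


def slide_number_from_key(key: str) -> int:
    """Extract the trailing integer from an S3 key such as 'p/img/deck_image_11.jpg'."""
    stem = PurePosixPath(key).stem
    match = _TRAILING_NUMBER.search(stem)
    if match is None:
        raise ValueError(f"no trailing slide number in key: {key}")
    return int(match.group(1))


def sort_keys_by_slide(keys: Iterable[str]) -> List[str]:
    """Return the keys ordered by slide number rather than lexicographically."""
    return sorted(keys, key=slide_number_from_key)
```

With these helpers, the loop could collect all keys first, sort them, and use `slide_number_from_key(obj_key)` as the `slide_number` field.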


In [10]:
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=g.BUCKET_IMG_PREFIX)

# slide_number follows the S3 listing order, which is lexicographic by key.
slide_number = 1
for page in pages:
    # 'Contents' is absent when a page holds no objects under the prefix.
    for obj in page.get('Contents', []):
        obj_key = obj["Key"]
        # File name without its extension, stored as metadata.
        obj_name = Path(obj_key).stem

        # Ask the LLaVA endpoint for a text description of the slide image.
        input_image_s3 = f"s3://{bucket_name}/{obj_key}"
        payload = bytes(json.dumps(
                {
                    "text": [prompt],
                    "input_image_s3": input_image_s3,
                }
            ), 'utf-8')
        response = sm_runtime_client.invoke_endpoint(
            EndpointName=llava13b_endpoint,
            ContentType="application/json",
            Body=payload
        )

        img_desc = response['Body'].read().decode("utf-8").replace('"', "'")

        # Embed the description and assemble the document for the index.
        embeddings = get_multimodal_embeddings(bedrock, img_desc)

        data = json.dumps([{
            "image_path": input_image_s3,
            "slide_text": img_desc,
            "slide_number": slide_number,
            "metadata": {
                "filename": obj_name,
                "desc": ""
            },
            "vector_embedding": embeddings[0].tolist()
        }])

        # POST the document to the OpenSearch Ingestion pipeline, signed with SigV4.
        logger.info("Ingesting data into pipeline")
        r = requests.request(
            method='POST',
            url=osi_endpoint,
            data=data,
            auth=AWSSigV4('osis'))

        logger.info(f"Response: {slide_number} - {r.text}")
        slide_number += 1


[2024-02-16 22:58:16,971] p8907 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-02-16 22:58:17,195] p8907 {3235372340.py:44} INFO - Ingesting data into pipeline
[2024-02-16 22:58:17,196] p8907 {3235372340.py:45} INFO - Response: 1 - 200 OK
[2024-02-16 22:58:33,066] p8907 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-02-16 22:58:33,225] p8907 {3235372340.py:44} INFO - Ingesting data into pipeline
[2024-02-16 22:58:33,226] p8907 {3235372340.py:45} INFO - Response: 2 - 200 OK


ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llava-djl-2024-02-08-22-47-21-804-12xl-endpoint in account 563851014557 for more information.
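
The run above stopped on the third slide because the endpoint invocation timed out, and the loop has no error handling, so one slow response aborts the whole ingestion. If the timeouts are transient, one option is to retry the call with exponential backoff. The sketch below is a generic wrapper; `call_with_retries` and its parameters are hypothetical, and the delays are illustrative rather than tuned for this endpoint.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                      max_attempts: int = 3,
                      base_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2*base_delay, ...
    The last failure is re-raised once max_attempts is reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage in the ingestion loop would look roughly like this (not executed here):
#   response = call_with_retries(
#       lambda: sm_runtime_client.invoke_endpoint(
#           EndpointName=llava13b_endpoint,
#           ContentType="application/json",
#           Body=payload),
#       retry_on=(sm_runtime_client.exceptions.ModelError,))
```

Retries only help with intermittent slowness. If the model consistently needs longer than the endpoint allows for a given slide, the CloudWatch latency metrics mentioned in the error message are the place to look.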