# Data ingestion

***This notebook works best with the `conda_python3` kernel on an `ml.t3.xlarge` instance.***

---

In this notebook we download the images of the slide deck that we uploaded to Amazon S3 in the [1_data_prep.ipynb](./1_data_prep) notebook, convert them into embeddings, and then ingest these embeddings into a vector database, [Amazon OpenSearch Service Serverless](https://aws.amazon.com/opensearch-service/features/serverless/).

1. We use the [Amazon Titan Multimodal Embeddings](https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-titan-multimodal-embeddings-model-bedrock/) model to convert the images into embeddings.

1. The embeddings are then ingested into OpenSearch Service Serverless using the [Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html) pipeline. The embeddings are uploaded into an S3 bucket and that triggers the OpenSearch Ingestion pipeline which ingests the data into an OpenSearch Serverless index.

1. The OpenSearch Service Serverless Collection is created via the AWS CloudFormation stack for this blog post.


## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [14]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting git+https://github.com/haotian-liu/LLaVA.git@v1.1.1 (from -r requirements.txt (line 2))
  Cloning https://github.com/haotian-liu/LLaVA.git (to revision v1.1.1) to /tmp/pip-req-build-x_bb2oix
  Running command git clone --filter=blob:none --quiet https://github.com/haotian-liu/LLaVA.git /tmp/pip-req-build-x_bb2oix
  Running command git checkout -q 1619889c712e347be1cb4f78ec66e7cf414ac1a6
  Resolved https://github.com/haotian-liu/LLaVA.git to commit 1619889c712e347be1cb4f78ec66e7cf414ac1a6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done


In [15]:
import os
import time
import glob
import json
import boto3
import codecs
import base64
import logging
import requests
import botocore
import sagemaker
import numpy as np
import globals as g
from typing import List
from pathlib import Path
from requests_auth_aws_sigv4 import AWSSigV4
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from utils import upload_to_s3, get_cfn_outputs, get_bucket_name, download_image_files_from_s3, get_multimodal_embeddings

logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [16]:
bucket_name: str = get_bucket_name(g.CFN_STACK_NAME)
s3 = boto3.client('s3')

In [17]:
sagemaker_session = sagemaker.Session()
sm_client = sagemaker_session.sagemaker_client
sm_runtime_client = sagemaker_session.sagemaker_runtime_client

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [18]:
bedrock = boto3.client(service_name="bedrock-runtime", region_name=g.AWS_REGION, endpoint_url=g.FMC_URL)
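
The `get_multimodal_embeddings` helper is imported from `utils` and its body is not shown in this notebook. As a rough sketch of what an embeddings call through the Bedrock runtime client can look like, the function below sends a piece of text to a Titan embeddings model with `invoke_model`. The name `embed_text` and the `model_id` argument are assumptions made for illustration and are not the helper's actual implementation; the `inputText` request key and the `embedding` response key follow the Titan embeddings request format.

```python
import json


def embed_text(client, model_id: str, text: str) -> list:
    """Return the embedding for `text` (hypothetical helper, not utils.get_multimodal_embeddings)."""
    # Titan embeddings models expect the text under the "inputText" key
    body = json.dumps({"inputText": text})
    response = client.invoke_model(
        modelId=model_id,
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    # response["body"] is a stream holding a JSON document with an "embedding" list
    result = json.loads(response["body"].read())
    return result["embedding"]
```

With the client created above, a call might look like `embed_text(bedrock, "amazon.titan-embed-image-v1", "text of a slide")`, assuming that model is enabled in the account and Region.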

In [19]:
# Read the collection endpoint, index name and pipeline endpoint from the CloudFormation stack outputs
outputs = get_cfn_outputs(g.CFN_STACK_NAME)
host = outputs['MultimodalCollectionEndpoint'].split('//')[1]
index_name = outputs['OpenSearchIndexName']
osi_endpoint = f"https://{outputs['OpenSearchPipelineEndpoint']}/data/ingest"
logger.info(f"opensearch pipeline endpoint={osi_endpoint}")
logger.info(f"opensearchhost={host}, index={index_name}")

# osi_endpoint = "https://test-pipeline-4-5tdu5j2gt7yf6kzpuqqt74jcxm.us-east-1.osis.amazonaws.com/data/ingest"
# index_name = "blog2index-mp"

[2024-03-04 20:14:19,458] p22686 {3151558163.py:6} INFO - opensearch pipeline endpoint=https://multimodalpipeline-blog2-mo23af2cyfdrwecz7la3mpofau.us-east-1.osis.amazonaws.com/data/ingest
[2024-03-04 20:14:19,459] p22686 {3151558163.py:7} INFO - opensearchhost=idd4kkct85f04ubds03d.us-east-1.aoss.amazonaws.com, index=slides
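
The cell above derives `host` by splitting the collection endpoint on `//`, which assumes the value always carries a scheme. A slightly more defensive sketch, using only the standard library, is shown below; `endpoint_host` is a hypothetical name and is not part of `utils`.

```python
from urllib.parse import urlparse


def endpoint_host(endpoint: str) -> str:
    """Return the host name of an endpoint given with or without a scheme (hypothetical helper)."""
    # urlparse only recognises the host when a '//' separator is present,
    # so add one for bare host names
    parsed = urlparse(endpoint if "//" in endpoint else f"//{endpoint}")
    # .hostname drops any port and path; fall back to an empty string if nothing parsed
    return parsed.hostname or ""


print(endpoint_host("https://idd4kkct85f04ubds03d.us-east-1.aoss.amazonaws.com"))
```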


In [20]:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

[2024-03-04 20:14:19,488] p22686 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [21]:
index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_path": {
        "type": "text"
      },
      "slide_text": {
        "type": "text"
      },
      "slide_number": {
        "type": "text"
      },
      "metadata": {
        "properties": {
          "filename": {
            "type": "text"
          },
          "desc": {
            "type": "text"
          }
        }
      }
    }
  }
}
"""

# If the index already exists, create() raises resource_already_exists_exception; that is expected and safe to ignore.
index_body = json.loads(index_body)
try:
    response = os_client.indices.create(index_name, body=index_body)
    logger.info(f"response received for the create index -> {response}")
except Exception as e:
    logger.error(f"error in creating index={index_name}, exception={e}")

[2024-03-04 20:14:19,926] p22686 {952156850.py:48} ERROR - error in creating index=slides, exception=RequestError(400, 'resource_already_exists_exception', 'OpenSearch exception [type=resource_already_exists_exception, reason=index [slides/lKUaC44Bs5VOQH-mWEse] already exists]- server : [envoy]')
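
The `resource_already_exists_exception` above is harmless, but it is logged at ERROR level on every re-run. One way to avoid that, sketched here under the assumption that the collection's data access policy allows the notebook role to check for the index, is to test for existence first. `create_index_if_missing` is a hypothetical helper name; `indices.exists` and `indices.create` are the `opensearch-py` client calls.

```python
def create_index_if_missing(client, index_name: str, body: dict) -> bool:
    """Create the index unless it already exists. Returns True if it was created."""
    # indices.exists returns a boolean for a single index name
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, body=body)
    return True
```

In this notebook it could be called as `create_index_if_missing(os_client, index_name, index_body)` once `index_body` has been parsed with `json.loads`.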


In [25]:
llava13b_endpoint = Path(g.ENDPOINT_FILENAME).read_text()
logger.info(f"llava13b endpoint {llava13b_endpoint}")

[2024-03-04 22:26:21,574] p22686 {4145819150.py:2} INFO - llava13b endpoint llava-djl-2024-03-04-22-06-53-656-endpoint


In [23]:
# prompt = "Describe the image in detail, include titles and numbers."

prompt = """
Describe the slide you see in detail. Describe any images, tables, and charts you see in detail with accurate numbers. Be descriptive about all figures/charts including column/row names. Your response should be extremely detailed and data oriented. Only give the data/numbers that are clearly visible in the image and DO NOT mention anything if it is not in the image or if it is blurry. Be completely accurate.
"""

print(prompt)


Describe the slide you see in detail. Describe any images, tables, and charts you see in detail with accurate numbers. Be descriptive about all figures/charts including column/row names. Your response should be extremely detailed and data oriented. Only give the data/numbers that are clearly visible in the image and DO NOT mention anything if it is not in the image or if it is blurry. Be completely accurate.



In [26]:
# List the slide images under the image prefix (paginated, since each listing returns at most 1,000 keys)
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=g.BUCKET_IMG_PREFIX)

slide_number = 1
for page in pages:
    # 'Contents' is absent when a page has no matching objects
    for obj in page.get('Contents', []):
        obj_key = obj["Key"]
        # File name without its directory or extension
        obj_name = Path(obj_key).stem

        input_image_s3 = f"s3://{bucket_name}/{obj_key}"
        payload = json.dumps(
            {
                "text": [prompt],
                "input_image_s3": input_image_s3,
            }
        ).encode('utf-8')
        # Ask the LLaVA endpoint to describe the slide
        response = sm_runtime_client.invoke_endpoint(
            EndpointName=llava13b_endpoint,
            ContentType="application/json",
            Body=payload
        )

        # Swap double quotes for single quotes in the generated description
        img_desc = response['Body'].read().decode("utf-8").replace('"', "'")
        print(slide_number)
        print(obj_key)
        print(img_desc)

        # Embed the text description of the slide
        embeddings = get_multimodal_embeddings(bedrock, img_desc)

        data = json.dumps([{
            "image_path": input_image_s3,
            "slide_text": img_desc,
            "slide_number": slide_number,
            "metadata": {
                "filename": obj_name,
                "desc": ""
            },
            "vector_embedding": embeddings[0].tolist()
        }])

        # POST the record to the OpenSearch Ingestion pipeline, signed with SigV4
        r = requests.post(url=osi_endpoint, data=data, auth=AWSSigV4('osis'))

        logger.info("Ingesting data into pipeline")
        logger.info(f"Response: {slide_number} - {r.text}")
        slide_number += 1
        # NOTE: this break stops after the first image of each page, which is handy
        # for a quick test run. Remove it to ingest every slide.
        break

1
multimodal/img/CMP301_TrainDeploy_E1_20230607_SPEdited_image_1.jpg
The image features a blue and purple background with a white and blue logo for the '2023 Summit.' The logo is positioned towards the left side of the image. The background is predominantly blue, with a few purple accents.

There is a table in the image, which is located towards the right side. The table has a few rows and columns, but the content is not clearly visible. The focus of the image is on the logo and the background, which create a visually appealing and professional presentation.</s>


[2024-03-04 22:26:50,129] p22686 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-03-04 22:26:51,701] p22686 {2777291136.py:48} INFO - Ingesting data into pipeline
[2024-03-04 22:26:51,702] p22686 {2777291136.py:49} INFO - Response: 1 - 200 OK
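
Two details of the run above are worth guarding against before ingesting a full deck: the generated description ends with the `</s>` end-of-sequence marker, and the vector must have the same length as the `dimension` declared in the index mapping (1536). The following is a minimal sketch of both checks. The helper names `clean_description` and `build_record` are hypothetical, and stripping `</s>` is an optional choice; the notebook ingests the text as returned by the endpoint.

```python
EOS_TOKEN = "</s>"     # end-of-sequence marker seen at the end of the sample description
EXPECTED_DIM = 1536    # must match "dimension" in the index mapping


def clean_description(text: str) -> str:
    """Trim whitespace and a trailing end-of-sequence marker from a model response."""
    text = text.strip()
    if text.endswith(EOS_TOKEN):
        text = text[: -len(EOS_TOKEN)].rstrip()
    return text


def build_record(image_path: str, slide_text: str, slide_number: int,
                 filename: str, embedding: list) -> dict:
    """Build one document in the shape used by the ingestion loop, validating the vector length."""
    if len(embedding) != EXPECTED_DIM:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, index expects {EXPECTED_DIM}"
        )
    return {
        "image_path": image_path,
        "slide_text": slide_text,
        "slide_number": slide_number,
        "metadata": {"filename": filename, "desc": ""},
        "vector_embedding": list(embedding),
    }
```

Failing early on a length mismatch gives a clearer error in the notebook than a rejected document further down the pipeline.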
