# Data ingestion

***This notebook works best with the `conda_python3` on the `ml.t3.xlarge` instance***.

---

In this notebook we download the images and text files corresponding to the `pdf file/slide deck` that we uploaded into Amazon S3 in the [1_data_prep.ipynb](./1_data_prep) notebook, get text description from `images` and `text files`, convert them into embeddings and then ingest these embeddings into a vector database i.e. [Amazon OpenSearch Service Serverless](https://aws.amazon.com/opensearch-service/features/serverless/).

1. We use the [Anthropic’s Claude 3 Sonnet foundation model](https://aws.amazon.com/about-aws/whats-new/2024/03/anthropics-claude-3-sonnet-model-amazon-bedrock/) available on Bedrock to convert image to text.

1. We use the text extracted from each pdf page as is and convert them into embeddings using [Amazon Titan Text Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html) and stored in a `text` index. Each image file is first described using `Claude Sonnet` then the embeddings of the text description of that image is stored in an `image index`.

1. We use an `entities` field in the `index body metadata` to store entities from both images and texts in their respective `image and text indexes`. The entities from images are extracted using `Claude Sonnet` and entities from texts extracted files using `nltk`. The purpose of extracting these entities is to later use them as a `prefilter` to get only the related documents to any user question.

1. We use `Ray` for running Bedrock inference concurrently in an asynchronous manner.

1. The embeddings are then ingested into OpenSearch Service Serverless using the [Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html) pipeline. We ingest the embeddings into an OpenSearch Serverless index via the OpenSearch Ingestion API.

1. The OpenSearch Service Serverless Collection is created via the AWS CloudFormation stack for this blog post.


## Step 1. Setup

Install the required Python packages and import the relevant files.

In [1]:
# install the requirements before running this notebook
import sys
!{sys.executable} -m pip install -r requirements.txt



In [2]:
# import the libraries that are needed to run this notebook
import os
import re
import csv
import ray
import time
import glob
import json
import yaml
import time
import nltk
import boto3
import base64
import logging
import requests
import botocore
import sagemaker
import numpy as np
import opensearchpy
import globals as g
from pathlib import Path
from nltk.tree import Tree
from nltk.tag import pos_tag
from typing import List, Dict
from nltk.chunk import ne_chunk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, word_tokenize, punkt
from requests_auth_aws_sigv4 import AWSSigV4
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from utils import get_cfn_outputs, get_bucket_name, download_image_files_from_s3, get_text_embedding, load_and_merge_configs

  from scipy.stats import fisher_exact


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
# set a logger
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [4]:
if ray.is_initialized():
    ray.shutdown()
ray.init()

2024-09-04 20:17:57,220	INFO worker.py:1752 -- Started a local Ray instance.


0,1
Python version:,3.10.14
Ray version:,2.10.0


[36m(async_process_image_data pid=928)[0m [2024-09-04 20:18:00,199] p928 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=928)[0m [2024-09-04 20:18:00,262] p928 {4021940080.py:12} INFO - going to convert img/b64_images/ml-best-practices-healthcare-life-sciences_page_2.b64 into embeddings


[36m(async_process_image_data pid=929)[0m file_path: img/b64_images/ml-best-practices-healthcare-life-sciences_page_1.b64, image description (prefiltered with entities extracted): The image does not contain any specific named entities like person names, organizations, locations, or dates. However, it does include the following data/metric entities and custom entities related to the healthcare and life sciences domain:
[36m(async_process_image_data pid=929)[0m 
[36m(async_process_image_data pid=929)[0m Data/Metric Entities:
[36m(async_process_image_data pid=929)[0m - Machine Learning Best Practices
[36m(async_process_image_data pid=929)[0m 
[36m(async_process_image_data pid=929)[0m Custom Entities:
[36m(async_process_image_data pid=929)[0m - Healthcare
[36m(async_process_image_data pid=929)[0m - Life Sciences
[36m(async_process_image_data pid=929)[0m 
[36m(async_process_image_data pid=929)[0m The image is the cover page or title slide of what appears to be an AWS whi

[36m(async_process_image_data pid=929)[0m [2024-09-04 20:18:17,141] p929 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)[0m
[36m(async_process_image_data pid=927)[0m [2024-09-04 20:18:00,384] p927 {4021940080.py:12} INFO - going to convert img/b64_images/ml-best-practices-healthcare-life-sciences_page_3.b64 into embeddings[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=929)[0m [2024-09-04 20:18:17,224] p929 {4021940080.py:51} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=929)[0m [2024-09-04 20:18:17,224] p929 {4021940080.py:52} INFO - image desc: 200 OK


[36m(async_process_image_data pid=928)[0m file_path: img/b64_images/ml-best-practices-healthcare-life-sciences_page_2.b64, image description (prefiltered with entities extracted): The image does not contain any specific named entities like people's names, organizations, locations or dates. However, there are a few potential data entities or custom entities present in the text:
[36m(async_process_image_data pid=928)[0m 1. 'Machine Learning Best Practices in Healthcare and Life Sciences'
[36m(async_process_image_data pid=928)[0m 2. 'AWS Whitepaper'
[36m(async_process_image_data pid=928)[0m 3. 'Amazon Web Services, Inc.'
[36m(async_process_image_data pid=928)[0m 4. Potential custom entities related to legal/trademark terminology: 'trademarks', 'trade dress', 'product or service', 'customers', 'disparages or discredits'
[36m(async_process_image_data pid=928)[0m The text appears to be legal/copyright notice related to an AWS (Amazon Web Services) whitepaper on machine learning b

[36m(async_process_image_data pid=927)[0m [2024-09-04 20:18:24,052] p927 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=928)[0m [2024-09-04 20:18:18,817] p928 {4021940080.py:51} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=928)[0m [2024-09-04 20:18:18,817] p928 {4021940080.py:52} INFO - image desc: 200 OK
[36m(async_process_image_data pid=927)[0m [2024-09-04 20:18:24,087] p927 {4021940080.py:51} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=927)[0m [2024-09-04 20:18:24,088] p927 {4021940080.py:52} INFO - image desc: 200 OK


[36m(async_process_image_data pid=927)[0m file_path: img/b64_images/ml-best-practices-healthcare-life-sciences_page_3.b64, image description (prefiltered with entities extracted): Based on the image provided, which appears to be a table of contents from a whitepaper on machine learning best practices in healthcare and life sciences, here are the relevant entities I can identify:
[36m(async_process_image_data pid=927)[0m Named Entities:
[36m(async_process_image_data pid=927)[0m - AWS (likely referring to Amazon Web Services)
[36m(async_process_image_data pid=927)[0m Data Entities:
[36m(async_process_image_data pid=927)[0m - Machine learning
[36m(async_process_image_data pid=927)[0m - Life sciences
[36m(async_process_image_data pid=927)[0m - Benefits of machine learning
[36m(async_process_image_data pid=927)[0m - Life sciences at AWS
[36m(async_process_image_data pid=927)[0m - Current regulatory situation
[36m(async_process_image_data pid=927)[0m - AI/ML enabled GxP w

In [5]:
CONFIG_FILE_PATH = "config.yaml"

In [6]:
# load the merged config file - user config file, and parent config file
config = load_and_merge_configs(g.CONFIG_SUBSET_FILE, g.FULL_CONFIG_FILE)
logger.info(f"config file -> {json.dumps(config, indent=2)}")

[2024-09-04 20:17:58,775] p596 {1465141196.py:3} INFO - config file -> {
  "aws": {
    "cfn_stack_name": "multimodal-blog4-stack",
    "os_service": "aoss"
  },
  "dir_info": {
    "source_dir": "data",
    "metrics_dir_name": "metrics",
    "img_path": "images",
    "txt_path": "text_files",
    "extracted_data": "extracted_data",
    "json_img_dir": "img_json_dir",
    "json_txt_dir": "text_json_dir",
    "manually_saved_images_path": "manually_saved_imgs",
    "prompt_dir": "prompt_templates",
    "image_description_prompt": "image_description_prompt.txt",
    "search_in_images_template": "retrieve_answer_from_images_prompt.txt",
    "search_in_text_template": "retrieve_answer_from_texts_prompt.txt",
    "extract_entities_from_user_question": "extract_question_entities_prompt.txt",
    "final_combined_llm_response_prompt": "final_combined_response_prompt_template.txt",
    "final_llm_as_a_judge_summary_analysis": "claude_final_summary_analysis_prompt.txt",
    "extract_image_entiti

In [7]:
region: str = boto3.Session().region_name
claude_model_id: str = config['model_info']['inference_model_info'].get('model_id')
endpoint_url: str = g.BEDROCK_EP_URL.format(region=region)
bedrock = boto3.client(service_name="bedrock-runtime", region_name=region, endpoint_url=endpoint_url)

In [8]:
bucket_name: str = get_bucket_name(config['aws']['cfn_stack_name'])
logger.info(f"Bucket name being used to store extracted images and texts from data: {bucket_name}")
s3 = boto3.client('s3')

[2024-09-04 20:17:58,924] p596 {292065259.py:2} INFO - Bucket name being used to store extracted images and texts from data: multimodal-rag-poc-bucket-759878486861-us-east-1


In [9]:
sagemaker_session = sagemaker.Session()
sm_client = sagemaker_session.sagemaker_client
sm_runtime_client = sagemaker_session.sagemaker_runtime_client

In [10]:
outputs = get_cfn_outputs(config['aws']['cfn_stack_name'])
host = outputs['MultimodalCollectionEndpoint'].split('//')[1]
text_index_name = outputs['OpenSearchTextIndexName']
img_index_name = outputs['OpenSearchImgIndexName']
logger.info(f"opensearchhost={host}, text index={text_index_name}, image index={img_index_name}")
osi_text_endpoint = f"https://{outputs['OpenSearchPipelineTextEndpoint']}/data/ingest"
osi_img_endpoint = f"https://{outputs['OpenSearchPipelineImgEndpoint']}/data/ingest"

[2024-09-04 20:17:59,414] p596 {2042488222.py:5} INFO - opensearchhost=mdqhpj5uy4izw0n0czf9.us-east-1.aoss.amazonaws.com, text index=texts, image index=images


#### We use the OpenSearch client to create an index.
---
For the purpose of segregation and ease of understanding, we are initializing two opensearch clients (for each image and text index). You can create/use just one index too.

In [11]:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, region, g.OS_SERVICE)

# Represents the OSI client for images
img_os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

# Represents the OSI client for texts
text_os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

[2024-09-04 20:17:59,448] p596 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


#### Index Body
---
Given below is the index body that is stored in the opensearch service. It contains information about:

1. **File path**: The path of the text or image file in the index

1. **File text**: The texts extracted from the pdf files (for the text index) or the image descriptions for images that are stored in the image index

1. **Page number**: Represents the page number that the content is stemming from

1. **Metadata**: This field within the index body contains information about the name of the file and entities. Entities represent names of organizations, people, and other important within the pdf text/image that is extracted and stored as metadata for future prefilter purposes to only get relevant documents during the process of search for relevant documents.

In [12]:
index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "file_path": {
        "type": "text"
      },
      "file_text": {
        "type": "text"
      },
      "page_number": {
        "type": "text"
      },
      "metadata": {
        "properties": {
          "filename": {
            "type": "text"
          },
          "entities": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

"""

# We would get an index already exists exception if the index already exists, and that is fine
index_body = json.loads(index_body)
try:
    # Check if the image index exists
    if not img_os_client.indices.exists(img_index_name):
        img_response = img_os_client.indices.create(img_index_name, body=index_body)
        logger.info(f"Response received for the create index for images -> {img_response}")
    else:
        logger.info(f"The image index '{img_index_name}' already exists.")

    # Check if the text index exists
    if not text_os_client.indices.exists(text_index_name):
        txt_response = text_os_client.indices.create(text_index_name, body=index_body)
        logger.info(f"Response received for the create index for texts -> {txt_response}")
    else:
        logger.info(f"The text index '{text_index_name}' already exists.")
except Exception as e:
    logger.error(f"Error in creating index, exception: {e}")


[2024-09-04 20:17:59,526] p596 {base.py:258} INFO - HEAD https://mdqhpj5uy4izw0n0czf9.us-east-1.aoss.amazonaws.com:443/images [status:200 request:0.068s]
[2024-09-04 20:17:59,527] p596 {720552406.py:50} INFO - The image index 'images' already exists.
[2024-09-04 20:17:59,590] p596 {base.py:258} INFO - HEAD https://mdqhpj5uy4izw0n0czf9.us-east-1.aoss.amazonaws.com:443/texts [status:200 request:0.062s]
[2024-09-04 20:17:59,591] p596 {720552406.py:57} INFO - The text index 'texts' already exists.


### Check if the the index created has a `knn`/vector field count before the embedding process

In [13]:
try: 
    # Fetch the existing mapping for the text index
    text_mapping = text_os_client.indices.get_mapping(index=text_index_name)
    img_mapping = img_os_client.indices.get_mapping(index=img_index_name)
    text_vector_embedding_mapping = text_mapping[text_index_name]['mappings']['properties'].get('vector_embedding', {})
    img_vector_embedding_mapping = img_mapping[img_index_name]['mappings']['properties'].get('vector_embedding', {})

    if text_vector_embedding_mapping.get('type') == 'knn_vector':
        logger.info(f"The vector_embedding type is found: {text_vector_embedding_mapping.get('type')} -> {text_mapping}")
    else:
        raise ValueError(f"The vector_embedding type is not 'knn_vector', found: {text_vector_embedding_mapping.get('type')}")

    if img_vector_embedding_mapping.get('type') == 'knn_vector':
        logger.info(f"The vector_embedding type is found: {img_vector_embedding_mapping.get('type')} -> {img_mapping}")
    else:
        raise ValueError(f"The vector_embedding type is not 'knn_vector', found: {img_vector_embedding_mapping.get('type')}")
except Exception as e:
    logger.error(f"Error in fetching the index vector field mapping, exception: {e}")

[2024-09-04 20:17:59,636] p596 {base.py:258} INFO - GET https://mdqhpj5uy4izw0n0czf9.us-east-1.aoss.amazonaws.com:443/texts/_mapping [status:200 request:0.034s]
[2024-09-04 20:17:59,671] p596 {base.py:258} INFO - GET https://mdqhpj5uy4izw0n0czf9.us-east-1.aoss.amazonaws.com:443/images/_mapping [status:200 request:0.034s]
[2024-09-04 20:17:59,673] p596 {2002334925.py:18} ERROR - Error in fetching the index vector field mapping, exception: The vector_embedding type is not 'knn_vector', found: float


## Step 2. Download the images files from S3 and convert to Base64

Now we download the image files from the S3 bucket into the `local directory`. Once downloaded these files are converted into [Base64](https://en.wikipedia.org/wiki/Base64) encoding so that we can create embeddings from the images.

In [14]:
# download the images from s3 into a local directory to convert into base64 images
os.makedirs(g.LOCAL_IMAGE_DIR, exist_ok=True)
os.makedirs(g.LOCAL_TEXT_DIR, exist_ok=True)

try:
    image_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_IMG_PREFIX, g.LOCAL_IMAGE_DIR, g.IMAGE_FILE_EXTN)
    text_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_TEXT_PREFIX, g.LOCAL_TEXT_DIR, g.TEXT_FILE_EXTN)
    logger.info(f"downloaded {len(image_files) + len(text_files)} files from s3")
except Exception as e:
    logger.error(f"Cannot download the images files from S3 into the local directory: {e}")

[2024-09-04 20:17:59,728] p596 {186159442.py:10} ERROR - Cannot download the images files from S3 into the local directory: 'Contents'


#### Convert jpg files fetched from `S3` into `Base64`

In [15]:
def encode_image_to_base64(image_file_path: str) -> str:
    with open(image_file_path, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode('utf8')
        b64_image_path = os.path.join(g.B64_ENCODED_IMAGES_DIR, f"{Path(image_file_path).stem}.b64")
        with open(b64_image_path, "wb") as b64_image_file:
            b64_image_file.write(bytes(b64_image, 'utf-8'))
    return b64_image_path

## Step 3. Get embeddings for the base64 encoded images

Now we are ready to use Amazon Bedrock via the  Anthropic’s Claude 3 Sonnet foundation model and Amazon Titan Text Embeddings model to convert the base64 version of the images into embeddings. We ingest embeddings into the pipeline using the [requests](https://pypi.org/project/requests/) HTTP library

You must sign all HTTP requests to the pipeline using [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html).

In [16]:
def get_img_desc(image_file_path: str, prompt: str) -> str:
    """
    This function uses a base64 file path of an image, and then uses ClaudeV3 Sonnet to 
    describe the image
    """
    bedrock = boto3.client(service_name="bedrock-runtime", region_name=region, endpoint_url=endpoint_url)
    # read the file, MAX image size supported is 2048 * 2048 pixels
    with open(image_file_path, "rb") as image_file:
        input_image_b64 = image_file.read().decode('utf-8')

    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": input_image_b64
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        }
    )

    response = bedrock.invoke_model(
        modelId=claude_model_id,
        body=body
    )

    resp_body = json.loads(response['body'].read().decode("utf-8"))
    resp_text = resp_body['content'][0]['text'].replace('"', "'")
    return resp_text

### Use the image files downloaded from S3, and convert them into `Base64`

In [17]:
os.makedirs(g.B64_ENCODED_IMAGES_DIR, exist_ok=True)
try:
    file_list: List = glob.glob(os.path.join(g.LOCAL_IMAGE_DIR, f"*{g.IMAGE_FILE_EXTN}"))
    logger.info(f"there are {len(file_list)} pdf image files in the {g.IMAGE_DIR} directory for conversion to base64")
except Exception as e:
    logger.error(f"Could not list any {g.IMAGE_FILE_EXTN} files from {g.IMAGE_DIR}: {e}")

# convert each file to base64 and store the base64 in a new file
b64_image_file_list = list(map(encode_image_to_base64, file_list))
logger.info(f"base64 conversion done, there are {len(b64_image_file_list)} base64 encoded files")

[2024-09-04 20:17:59,754] p596 {147230456.py:4} INFO - there are 3 pdf image files in the img directory for conversion to base64
[2024-09-04 20:17:59,759] p596 {147230456.py:10} INFO - base64 conversion done, there are 3 base64 encoded files


### Get Image Descriptions
---

This part of the notebook uses an `image_description_prompt` to describe the images.

In [18]:
# this is the prompt to get the description of each image stored from the pdf file
image_description_prompt_fpath: str = os.path.join(config['dir_info']['prompt_dir'], config['dir_info']['image_description_prompt'])
image_desc_prompt: str = Path(image_description_prompt_fpath).read_text()
print(image_desc_prompt)

Human: Please provide a detailed description of the image. Describe the overall layout and design of the image. Identify and describe any tables, charts, or other visual elements present, including the specific data or information contained within them. Provide as much detail as possible about the content and format of the image. Your response should be extremely detailed and data oriented. Give the description for all four portions of the image, the upper right, upper left, lower right and lower left and include all key points data in each if possible. Be completely accurate.

Assistant: 


### Hybrid Search: Extract `Entities` from the images for further `prefiltering` tasks
---

The purpose of using Hybrid search is to optimize the RAG workflow in retrieving the right image description for specific questions. Some images (full or split in different parts), might not contain the information that is being asked by the question, because of the surrounding embeddings in the vector DB and might fetch the wrong image if it has a similar structure, so Hybrid search helps optimizing that. In this case, we will extract the entities of an image description (including the file name to be precise), then extract the entities of the question being asked, to get the most accurate response possible. `Entities` will help match the question to the correct and most relevant documents in the vector index where the answer can searched for in another sub step.

In [19]:
# prompt is used to extract entities from an image
entity_extraction_prompt_fpath: str = os.path.join(config['dir_info']['prompt_dir'], config['dir_info']['extract_image_entities_template'])
entity_extraction_prompt: str = Path(entity_extraction_prompt_fpath).read_text()
print(entity_extraction_prompt)

Human: Please provide a detailed description of the entities present in the image. Entities, are specific pieces of information or objects within a text that carry particular significance. These can be real-world entities like names of people, places, organizations, or dates. Refer to the types of entities: Named entities: These include names of people, organizations, locations, and dates. You can have specific identifiers within this, such as person names or person occupations.

Custom entities: These are entities specific to a particular application or domain, such as product names, medical terms, or technical jargon.

Temporal entities: These are entities related to time, such as dates, times, and durations.

Product entities: Names of products might be grouped together into product entities.

Data entities: Names of the data and metrics present. This includes names of metrics in charts, graphs and tables, and throughout the image.

Now based on the image, create a list of these ent

### Part 1: Loop through b64 images to 1/get image desc from Claude3, 2/get embedding from Titan text. Call OSI pipeline API to ingest embedding.

In [20]:
def get_img_txt_embeddings(bedrock: botocore.client, prompt_data: str) -> np.ndarray:
    body = json.dumps({
        "inputText": prompt_data,
    })
    try:
        response = bedrock.invoke_model(
            body=body, modelId=config['model_info']['embeddings_model_info'].get('model_id'), 
            accept="application/json", contentType="application/json"
        )
        response_body = json.loads(response['body'].read())
        embedding = response_body.get('embedding')
    except Exception as e:
        logger.error(f"exception={e}")
        embedding = None
    return embedding

In [21]:
# function to get the image description and store the embeddings of that text in the image index
def process_image_data(i: int, 
                       file_path: str, 
                       osi_endpoint, 
                       total: int, 
                       bucket_info: int) -> Dict:
    bedrock = boto3.client(service_name="bedrock-runtime", region_name=region, endpoint_url=endpoint_url)
    json_data: Optional[Dict] = None
    # name of the images that are saved (either split in 4 ways or saved as a single page)
    image_name: Optional[str] = None
    try:
        logger.info(f"going to convert {file_path} into embeddings")
        # first, get the entities from the image to prefilter the image description with the entities
        entities_extracted = get_img_desc(file_path, entity_extraction_prompt)
        # get the image description and prepend the image description with the entities extracted from the image
        content_description = entities_extracted + get_img_desc(file_path, image_desc_prompt)
        print(f"file_path: {file_path}, image description (prefiltered with entities extracted): {content_description}")
        embedding = get_img_txt_embeddings(bedrock, content_description)
        input_image_s3: str = f"s3://{bucket_name}/{bucket_info['img_prefix']}/{Path(file_path).stem}{bucket_info['image_file_extn']}"
        obj_name: str = f"{Path(file_path).stem}{bucket_info['image_file_extn']}"
        # data format for POSTING it to the osi_endpoint
        data = json.dumps([{
            "file_path": input_image_s3,
            "file_text": content_description,
            "page_number": re.search(r"page_(\d+)_?", obj_name).group(1),
            "metadata": {
                "filename": obj_name,
                "entities": entities_extracted
            },
            "vector_embedding": embedding
        }])
        # json data format for local files that are saved
        json_data = {
            "file_type": bucket_info['image_file_extn'],
            "file_name": obj_name,
            "text": content_description,
            "entities": entities_extracted,
            "page_number": re.search(r"page_(\d+)_?", obj_name).group(1)
            }
        # save the information (image description, entities, file type, name, and page number)
        # locally in a json file
        image_dir: str = config['dir_info']['json_img_dir']
        os.makedirs(image_dir, exist_ok=True)
        fpath = os.path.join(image_dir, f"{Path(file_path).stem}.json")
        Path(fpath).write_text(json.dumps(json_data, default=str, indent=2))
        r = requests.request(
            method='POST', 
            url=osi_endpoint, 
            data=data,
            auth=AWSSigV4('osis'))
        logger.info("Ingesting data into pipeline")
        logger.info(f"image desc: {r.text}")
    except Exception as e:
        logger.error(f"Error processing image {file_path}: {e}")
        json_data: Optional[Dict] = None
    return json_data

In [22]:
@ray.remote
def async_process_image_data(i: int, file_path: str, osi_endpoint, total: int, bucket_info: Dict):
    logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
    logger = logging.getLogger(__name__)
    return process_image_data(i, file_path, osi_endpoint, total, bucket_info)

In [23]:
import pandas as pd
img_entity_df = pd.DataFrame(columns=['file_name', 'text', 'entities'])
txt_entity_df = pd.DataFrame(columns=['file_name', 'text', 'entities'])

In [24]:
# count the number of images that throw an error while being saved into the index
erroneous_page_count: int = 0
n: int = config['inference_info']['parallel_inference_count']
image_chunks = [b64_image_file_list[i:i + n] for i in range(0, len(b64_image_file_list), n)]
bucket_info: Dict = {
    'img_prefix': g.BUCKET_IMG_PREFIX,
    'image_file_extn': g.IMAGE_FILE_EXTN
}
for chunk_index, image_chunk in enumerate(image_chunks):
    try:
        st = time.perf_counter()
        logger.info(f"------ getting text description for chunk {chunk_index}/{len(image_chunks)} -----")
        # Iterate over each file path in the chunk and process it individually
        logger.info(f"getting inference for list {chunk_index+1}/{len(image_chunks)}, size of list={len(image_chunk)} ")
        results = ray.get([async_process_image_data.remote(index, file_path, osi_img_endpoint, len(image_chunk), bucket_info) for index, file_path in enumerate(image_chunk)])
        elapsed_time = time.perf_counter() - st
        logger.info(f"------ completed chunk={chunk_index}/{len(image_chunks)} completed in {elapsed_time} ------ ")
        logger.info(f"The results are: {results}")
        # Assuming img_entity_df is already initialized with the correct columns
        # Iterate over the list of JSON objects and append to the DataFrame
        for json_obj in results:
            new_data = {
                'file_name': json_obj['file_name'],
                'text': json_obj['text'],
                'entities': json_obj['entities']
            }
            img_entity_df.loc[len(img_entity_df)] = new_data
        #print(img_entity_df)
        # Append the new data to the existing DataFrame
        #print(f"img_entity_df is: {img_entity_df}")
        print("Done with img results!")
    except Exception as e:
        logger.error(f"Error processing chunk {chunk_index}: {e}")
        erroneous_page_count += len(image_chunk)

logger.info(f"Number of erroneous pdf pages that are not processed: {erroneous_page_count}")

[2024-09-04 20:17:59,836] p596 {1846079779.py:12} INFO - ------ getting text description for chunk 0/1 -----
[2024-09-04 20:17:59,837] p596 {1846079779.py:14} INFO - getting inference for list 1/1, size of list=3 
[2024-09-04 20:18:24,093] p596 {1846079779.py:17} INFO - ------ completed chunk=0/1 completed in 24.257364138000412 ------ 
[2024-09-04 20:18:24,097] p596 {1846079779.py:18} INFO - The results are: [{'file_type': '.jpg', 'file_name': 'ml-best-practices-healthcare-life-sciences_page_3.jpg', 'text': "Based on the image provided, which appears to be a table of contents from a whitepaper on machine learning best practices in healthcare and life sciences, here are the relevant entities I can identify:\n\nNamed Entities:\n- AWS (likely referring to Amazon Web Services)\n\nData Entities:\n- Machine learning\n- Life sciences\n- Benefits of machine learning\n- Life sciences at AWS\n- Current regulatory situation\n- AI/ML enabled GxP workloads\n- GXP-compliant machine learning environm

Done with img results!


### Part 2: Loop through text files to 1/get embedding from Titan text, 2/extract the text entities using `nltk`. Call OSI pipeline API to ingest embedding.

In [25]:
# Get a list of all text files 
pdf_txt_file_list = os.listdir(g.LOCAL_TEXT_DIR)

# Get absolute file paths by joining the directory path with each file name
pdf_txt_file_list = [os.path.abspath(os.path.join(g.LOCAL_TEXT_DIR, file)) for file in pdf_txt_file_list]
logger.info(f"Number of text files from the PDF local directory to process: {len(pdf_txt_file_list)}")

[2024-09-04 20:18:24,120] p596 {3290322578.py:6} INFO - Number of text files from the PDF local directory to process: 3


#### Entities extraction from PDF texts using [NLTK]('https://www.nltk.org/')
---

NLTK is a leading platform for building Python programs to work with human language data. We use `NLTK` to extract entities from the text files that are extracted from each `PDF page`, and use that as a prepend onto the extracted file to be sent to the `OSI endpoint`.

In [26]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker') 
nltk.download('words')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/ec2-user/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [27]:
def get_continuous_chunks(text):
    """
    This function uses nltk to get the entities from texts that are extracted from pdf files
    """
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        if current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue
    return continuous_chunk

In [28]:
def process_text_data(txt_file: str, txt_page_index: int):
    with open(txt_file, 'r') as file:
        extracted_pdf_text = file.read()
    # Extract entities from text using nltk
    entities = get_continuous_chunks(extracted_pdf_text)
    # Convert the entities list to string 
    entities_str = ", ".join(entities)
    logger.info(f"entities extracted from {txt_file}: {entities_str}")
    embedding = get_text_embedding(bedrock, extracted_pdf_text)
    input_text_s3 = f"s3://{bucket_name}/{g.BUCKET_TEXT_PREFIX}/{Path(txt_file).stem}{g.TEXT_FILE_EXTN}"
    obj_name = f"{Path(txt_file).stem}{g.TEXT_FILE_EXTN}"
    # data format that is used to POST to the osi endpoint
    data = json.dumps([{
        "file_path": input_text_s3,
        "file_text": extracted_pdf_text,
        "page_number": txt_page_index,
        "metadata": {
            "filename": obj_name,
            "entities": entities_str
        },
        "vector_embedding": embedding
    }])
    # json data format that is saved in a local directory
    json_data = {
        "file_type": g.TEXT_FILE_EXTN,
        "file_name": Path(txt_file).stem,
        "text": extracted_pdf_text, 
        "page_number": re.search(r"text_(\d+)_?", obj_name).group(1),
        "entities": entities_str  
    } 
    os.makedirs(config['dir_info']['json_txt_dir'], exist_ok=True)
    fpath = os.path.join(config['dir_info']['json_txt_dir'], f"{Path(txt_file).stem}.json")
    print(f"json_file_path: {fpath}")
    Path(fpath).write_text(json.dumps(json_data, default=str, indent=2))
    r = requests.request(
        method='POST',
        url=osi_text_endpoint,
        data=data,
        auth=AWSSigV4('osis'))

    logger.info("Ingesting data into pipeline")
    logger.info(f"Response: {txt_page_index} - {r.text}")
    return json_data

In [29]:
txt_page_index: int = 1
os.makedirs(config['dir_info']['json_txt_dir'], exist_ok=True)
for txt_file in pdf_txt_file_list:
    logger.info(f"going to convert {txt_file} into embeddings")
    json_obj = process_text_data(txt_file, txt_page_index)
    #print(json_obj)
    new_data = {
        'file_name': json_obj['file_name'],
        'text': json_obj['text'],
        'entities': json_obj['entities']
    }
    txt_entity_df.loc[len(txt_entity_df)] = new_data
    txt_page_index += 1

#print(f"txt_entity_df is: {txt_entity_df}")
print("Done with txt results!")

[2024-09-04 20:18:24,266] p596 {3417922390.py:4} INFO - going to convert /home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog4-PDF-TitanEmbeddings/notebooks/multimodal/local_txts/ml-best-practices-healthcare-life-sciences_text_1.txt into embeddings
[2024-09-04 20:18:24,533] p596 {651913043.py:8} INFO - entities extracted from /home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog4-PDF-TitanEmbeddings/notebooks/multimodal/local_txts/ml-best-practices-healthcare-life-sciences_text_1.txt: AWS Whitepaper Machine, Healthcare, Life Sciences, Amazon, Inc.
[2024-09-04 20:18:24,675] p596 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-09-04 20:18:24,760] p596 {651913043.py:41} INFO - Ingesting data into pipeline
[2024-09-04 20:18:24,762] p596 {651913043.py:42} INFO - Response: 1 - 200 OK
[2024-09-04 20:18:24,766] p596 {3417922390.py:4} INFO - going to convert /home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog4-PDF-Tita

json_file_path: text_json_dir/ml-best-practices-healthcare-life-sciences_text_1.json


[2024-09-04 20:18:24,934] p596 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-09-04 20:18:25,000] p596 {651913043.py:41} INFO - Ingesting data into pipeline
[2024-09-04 20:18:25,002] p596 {651913043.py:42} INFO - Response: 2 - 200 OK
[2024-09-04 20:18:25,017] p596 {3417922390.py:4} INFO - going to convert /home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog4-PDF-TitanEmbeddings/notebooks/multimodal/local_txts/ml-best-practices-healthcare-life-sciences_text_2.txt into embeddings
[2024-09-04 20:18:25,055] p596 {651913043.py:8} INFO - entities extracted from /home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog4-PDF-TitanEmbeddings/notebooks/multimodal/local_txts/ml-best-practices-healthcare-life-sciences_text_2.txt: Machine Learning Best, Healthcare, Life Sciences, AWS Whitepaper Machine, Healthcare Life Sciences, AWS Whitepaper, Amazon, Inc., Amazon Amazon, Amazon Amazon Amazon


json_file_path: text_json_dir/ml-best-practices-healthcare-life-sciences_text_3.json


[2024-09-04 20:18:25,154] p596 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-09-04 20:18:25,247] p596 {651913043.py:41} INFO - Ingesting data into pipeline
[2024-09-04 20:18:25,248] p596 {651913043.py:42} INFO - Response: 3 - 200 OK


json_file_path: text_json_dir/ml-best-practices-healthcare-life-sciences_text_2.json
txt_entity_df is:                                            file_name  \
0  ml-best-practices-healthcare-life-sciences_text_1   
1  ml-best-practices-healthcare-life-sciences_text_3   
2  ml-best-practices-healthcare-life-sciences_text_2   

                                                text  \
0  AWS Whitepaper\nMachine Learning Best Practice...   
1  Machine Learning Best Practices in Healthcare ...   
2  Machine Learning Best Practices in Healthcare ...   

                                            entities  
0  AWS Whitepaper Machine, Healthcare, Life Scien...  
1  Machine Learning Best, Healthcare, Life Scienc...  
2  Machine Learning Best, Healthcare, Life Scienc...  
Done with txt results!


In [30]:
current_directory = os.getcwd()

# Define the file name for the CSV
csv_file_name = "img_entity_data.csv"

# Create the full file path
csv_file_path = os.path.join(current_directory, csv_file_name)
display(img_entity_df)
# Save the DataFrame as a CSV file
img_entity_df.to_csv(csv_file_path, index=False)

print(f"Img CSV file has been saved at: {csv_file_path}")

Unnamed: 0,file_name,text,entities
0,ml-best-practices-healthcare-life-sciences_pag...,"Based on the image provided, which appears to ...","Based on the image provided, which appears to ..."
1,ml-best-practices-healthcare-life-sciences_pag...,The image does not contain any specific named ...,The image does not contain any specific named ...
2,ml-best-practices-healthcare-life-sciences_pag...,The image does not contain any specific named ...,The image does not contain any specific named ...


Img CSV file has been saved at: /home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog4-PDF-TitanEmbeddings/notebooks/img_entity_data.csv


In [31]:
# Define the file name for the CSV
txt_csv_file_name = "txt_entity_data.csv"

# Create the full file path
txt_csv_file_path = os.path.join(current_directory, txt_csv_file_name)
display(txt_entity_df)
# Save the DataFrame as a CSV file
txt_entity_df.to_csv(txt_csv_file_path, index=False)

print(f"Txt CSV file has been saved at: {txt_csv_file_path}")

Unnamed: 0,file_name,text,entities
0,ml-best-practices-healthcare-life-sciences_text_1,AWS Whitepaper\nMachine Learning Best Practice...,"AWS Whitepaper Machine, Healthcare, Life Scien..."
1,ml-best-practices-healthcare-life-sciences_text_3,Machine Learning Best Practices in Healthcare ...,"Machine Learning Best, Healthcare, Life Scienc..."
2,ml-best-practices-healthcare-life-sciences_text_2,Machine Learning Best Practices in Healthcare ...,"Machine Learning Best, Healthcare, Life Scienc..."


Txt CSV file has been saved at: /home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog4-PDF-TitanEmbeddings/notebooks/txt_entity_data.csv
