# Data ingestion

***This notebook works best with the `conda_python3` kernel on an `ml.t3.xlarge` instance.***

---

In this notebook we download the images of the slide deck that we uploaded to Amazon S3 in the [1_data_prep.ipynb](./1_data_prep) notebook, convert them into embeddings, and then ingest these embeddings into a vector database, [Amazon OpenSearch Service Serverless](https://aws.amazon.com/opensearch-service/features/serverless/).

1. We use [Anthropic’s Claude 3 Sonnet foundation model](https://aws.amazon.com/about-aws/whats-new/2024/03/anthropics-claude-3-sonnet-model-amazon-bedrock/), available on Amazon Bedrock, to convert each image into a text description.

1. We then use the [Amazon Titan Text Embeddings](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html) model to convert the text into embeddings.

1. The embeddings are then ingested into an OpenSearch Service Serverless index through an [Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html) pipeline, using the OpenSearch Ingestion API.

1. The OpenSearch Service Serverless collection is created by the AWS CloudFormation stack for this blog post.
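
The first two steps above boil down to two JSON request bodies sent to Amazon Bedrock. The sketch below only builds those bodies, so it runs offline. The function names, the prompt, the `image/jpeg` media type, and the `max_tokens` default are illustrative assumptions; the model IDs are the publicly documented Bedrock IDs and may differ from the values this notebook reads from `globals.py`.

```python
import base64
import json

# Publicly documented Bedrock model IDs; the notebook's own values are defined
# in globals.py (not shown here), so treat these as assumptions.
CLAUDE_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
TITAN_MODEL_ID = "amazon.titan-embed-text-v1"


def build_claude_image_request(image_bytes: bytes, prompt: str, max_tokens: int = 1000) -> str:
    """Return a JSON body asking Claude 3 to describe one image (hypothetical helper)."""
    b64_image = base64.b64encode(image_bytes).decode("utf-8")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    # The image goes first, followed by the text instruction.
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # assumed image format
                            "data": b64_image,
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }
    return json.dumps(body)


def build_titan_embedding_request(text: str) -> str:
    """Return a JSON body asking Titan Text Embeddings to embed `text` (hypothetical helper)."""
    return json.dumps({"inputText": text})


if __name__ == "__main__":
    claude_body = build_claude_image_request(b"abc", "Describe this slide.")
    print(json.loads(claude_body)["messages"][0]["content"][1]["text"])
    print(build_titan_embedding_request("A slide about quarterly revenue."))
```

Each body would be passed as the `body` argument of `invoke_model` on a `bedrock-runtime` client, together with the matching model ID.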


## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt



In [2]:
# import the libraries that are needed to run this notebook
import os
import re
import ray
import time
import glob
import json
import yaml
import boto3
import codecs
import base64
import logging
import requests
import botocore
import sagemaker
import numpy as np
import globals as g
from pathlib import Path
from typing import List, Dict
from requests_auth_aws_sigv4 import AWSSigV4
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from utils import get_cfn_outputs, get_bucket_name, download_image_files_from_s3, get_text_embedding

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
# set a logger
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [4]:
bedrock = boto3.client(service_name="bedrock-runtime", region_name=g.AWS_REGION, endpoint_url=g.TITAN_URL)
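
As a rough illustration of how a `bedrock-runtime` client like this one can be used to obtain an embedding, here is a minimal sketch. It is not the implementation of `get_text_embedding` from `utils` (that code is not shown in this notebook); `embed_text` and `FakeBedrockClient` are hypothetical names, and the fake client stands in for the real service so the example runs without AWS credentials.

```python
import io
import json
from typing import Any, List


def embed_text(client: Any, text: str, model_id: str = "amazon.titan-embed-text-v1") -> List[float]:
    """Call invoke_model with a Titan embeddings request and return the vector."""
    response = client.invoke_model(
        body=json.dumps({"inputText": text}),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream; read it and parse the JSON payload.
    payload = json.loads(response["body"].read())
    return payload["embedding"]


class FakeBedrockClient:
    """Offline stand-in for the bedrock-runtime client, for illustration only."""

    def invoke_model(self, body, modelId, accept, contentType):
        text = json.loads(body)["inputText"]
        # Made-up two-dimensional "embedding" so the flow can be exercised locally.
        fake = {"embedding": [float(len(text)), 0.0], "inputTextTokenCount": len(text.split())}
        return {"body": io.BytesIO(json.dumps(fake).encode("utf-8"))}


if __name__ == "__main__":
    print(embed_text(FakeBedrockClient(), "hello"))  # prints [5.0, 0.0]
```

With real credentials, passing the `bedrock` client created above in place of `FakeBedrockClient()` should return the model's actual embedding vector.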

In [5]:
module_path=os.getcwd()
g.__path__=module_path

In [6]:
if ray.is_initialized():
    ray.shutdown()
# ray.init(runtime_env={"working_dir": "./"})
ray.init()
# ray.init(num_cpus=40)

2024-05-28 16:17:38,814	INFO worker.py:1752 -- Started a local Ray instance.


Python version: 3.10.14
Ray version: 2.10.0


[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:17:55,316] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:17:55,375] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Microsoft_rec_page_4.b64 into embeddings


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/Microsoft_rec_page_4.b64, image description (prefiltered with entities extracted): Based on the image, here are the key entities I can identify:
[36m(async_process_image_data pid=5186)[0m 
[36m(async_process_image_data pid=5186)[0m Named Entities:
[36m(async_process_image_data pid=5186)[0m - Organizations: Microsoft, OpenAI, Azure, ChatGPT, ZeniMax Media, Activision Blizzard, Alphabet/Google, Amazon, Oracle, Netflix
[36m(async_process_image_data pid=5186)[0m - People: Amy Hood, Satya Nadella
[36m(async_process_image_data pid=5186)[0m - Products/Services: Xbox series, Game Pass Ultimate, Azure Cloud Services, Dynamics 365, GitHub Copilot
[36m(async_process_image_data pid=5186)[0m 
[36m(async_process_image_data pid=5186)[0m Custom Entities:
[36m(async_process_image_data pid=5186)[0m - OpenAI technologies
[36m(async_process_image_data pid=5186)[0m - AI assistants
[36m(async_process_image_data pid=518

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:18:40,457] p5186 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:18:40,458] p5186 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:18:40,385] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)[0m
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:17:55,423] p5190 {500037063.py:14} INFO - going to convert img/b64_images/APPLE_rec_page_1.b64 into embeddings[32m [repeated 3x across cluster][0m


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/Intel_rec_page_3.b64, image description (prefiltered with entities extracted): Based on the image, here is a list of entities present:
[36m(async_process_image_data pid=5191)[0m - Companies/Organizations: Argus, Intel Corp, AMD, ASML, INTC, TXN, AMAT, LRCX, KLAC
[36m(async_process_image_data pid=5191)[0m - Industry Sectors: Peer & Industry Analysis
[36m(async_process_image_data pid=5191)[0m - Financial/Investment Terms: P/E, Price/Sales, PEG, Net Margin, 1-yr EPS Growth, Argus Rating, 5 Year Growth, Debt/Capital
[36m(async_process_image_data pid=5191)[0m - Dates: Jan 26, 2024
[36m(async_process_image_data pid=5191)[0m - Time Periods: 5-yr Growth Rate (%), Current FY P/E, FY (likely Fiscal Year)
[36m(async_process_image_data pid=5191)[0m Product Entities: 
[36m(async_process_image_data pid=5191)[0m - PriceBook (likely referencing a product category)
[36m(async_process_image_data pid=5191)[0m Location 

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:18:46,466] p5187 {500037063.py:58} INFO - Ingesting data into pipeline[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:18:46,466] p5187 {500037063.py:59} INFO - image desc: 200 OK[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:18:46,428] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:18:44,681] p5191 {500037063.py:14} INFO - going to convert img/b64_images/Amazon_rec_page_2.b64 into embeddings[32m [repeated 2x across cluster][0m


[36m(async_process_image_data pid=5190)[0m - Argus (organization)
[36m(async_process_image_data pid=5190)[0m - Apple Inc. (organization)
[36m(async_process_image_data pid=5190)[0m - U.S. Department of Justice (DoJ) (organization)
[36m(async_process_image_data pid=5190)[0m - Alphabet (organization)
[36m(async_process_image_data pid=5190)[0m - Gemini (likely referring to an AI language model)
[36m(async_process_image_data pid=5190)[0m - March 22, 2024
[36m(async_process_image_data pid=5190)[0m - 12-month
[36m(async_process_image_data pid=5190)[0m - 1 Year EPS Growth Forecast
[36m(async_process_image_data pid=5190)[0m - 3 Year EPS Growth Forecast
[36m(async_process_image_data pid=5190)[0m - Smartphones
[36m(async_process_image_data pid=5190)[0m - Tablets
[36m(async_process_image_data pid=5190)[0m - PCs
[36m(async_process_image_data pid=5190)[0m - Software
[36m(async_process_image_data pid=5190)[0m - Peripherals
[36m(async_process_image_data pid=5190)[0m - Mac

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:19:23,009] p5186 {500037063.py:58} INFO - Ingesting data into pipeline[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:19:23,009] p5186 {500037063.py:59} INFO - image desc: 200 OK[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:19:22,968] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:18:50,434] p5190 {500037063.py:14} INFO - going to convert img/b64_images/Microsoft_rec_page_3.b64 into embeddings[32m [repeated 2x across cluster][0m


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/Amazon_rec_page_2.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5191)[0m - Amazon.com Inc.
[36m(async_process_image_data pid=5191)[0m - AWS (Amazon Web Services)
[36m(async_process_image_data pid=5191)[0m - Andy Jassy (Amazon CEO)
[36m(async_process_image_data pid=5191)[0m Organizations:
[36m(async_process_image_data pid=5191)[0m - Amazon
[36m(async_process_image_data pid=5191)[0m - Argus (the company that created this report)
[36m(async_process_image_data pid=5191)[0m Financial Metrics:
[36m(async_process_image_data pid=5191)[0m - Revenue
[36m(async_process_image_data pid=5191)[0m - Operating Income
[36m(async_process_image_data pid=5191)[0m - Net Income
[36m(async_process_image_data pid=5191)[0m - GAAP EPS
[36m(async_process_image_data pid=5191)[0m - Cash Flow from Operations

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:19:24,599] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:19:24,599] p5191 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:19:39,547] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:19:24,607] p5191 {500037063.py:14} INFO - going to convert img/b64_images/AMD_rec_page_6.b64 into embeddings[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:19:39,587] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:19:39,588] p5190 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5187)[0m file_path: img/b64_images/tesla_rec_page_4.b64, image description (prefiltered with entities extracted): Based on the image, here is a list of relevant entities I can identify:
[36m(async_process_image_data pid=5187)[0m - Tesla Inc. (Organization)
[36m(async_process_image_data pid=5187)[0m - Elon Musk (Person name)
[36m(async_process_image_data pid=5187)[0m - GigaFactory No. 2 (Location)
[36m(async_process_image_data pid=5187)[0m - Buffalo, New York (Location)
[36m(async_process_image_data pid=5187)[0m - SpaceX (Organization)
[36m(async_process_image_data pid=5187)[0m - SolarCity (Organization)
[36m(async_process_image_data pid=5187)[0m Temporal Entities:
[36m(async_process_image_data pid=5187)[0m - 2024 (Year)
[36m(async_process_image_data pid=5187)[0m - 2025 (Year)
[36m(async_process_image_data pid=5187)[0m - 4Q23 (Quarter and Year)
[36m(async_process_image_data pid=5187)[0m - June 29, 2010 (Date)
[36m(async_process_

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:19:50,757] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:19:40,806] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:19:40,807] p5187 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:19:50,816] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:19:50,819] p5191 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5186)[0m - Argus Research
[36m(async_process_image_data pid=5186)[0m - Economist Harold Dorsey
[36m(async_process_image_data pid=5186)[0m - Argus Investors' Counsel, Inc.
[36m(async_process_image_data pid=5186)[0m - The Argus Research Group
[36m(async_process_image_data pid=5186)[0m - Argus Investors' Counsel
[36m(async_process_image_data pid=5186)[0m - Morningstar
[36m(async_process_image_data pid=5186)[0m Organization Entities:
[36m(async_process_image_data pid=5186)[0m - Argus Research Co. (ARC)
[36m(async_process_image_data pid=5186)[0m - Argus Investors' Counsel, Inc. (AIC)
[36m(async_process_image_data pid=5186)[0m - The Argus Research Group
[36m(async_process_image_data pid=5186)[0m - Argus Research Co.
[36m(async_process_image_data pid=5186)[0m Location Entities:
[36m(async_process_image_data pid=5186)[0m - New York
[36m(async_process_image_data pid=5186)[0m - Stamford, Connecticut
[36m(async_process_image_data pid=

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:20:10,902] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[36m(async_process_image_data pid=5186)[0m json_file_path: pdf_img_json_dir/Intel_rec_page_7.json


[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:20:10,943] p5186 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:20:10,943] p5186 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:20:10,975] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Intel_rec_page_5.b64 into embeddings


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/tesla_rec_page_5.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5191)[0m Named Entities:
[36m(async_process_image_data pid=5191)[0m - Organizations: Argus Research, Argus Investors' Counsel, Inc. (AIC), Argus Research Co. (ARC), The Argus Research Group, Morningstar
[36m(async_process_image_data pid=5191)[0m Custom Entities:
[36m(async_process_image_data pid=5191)[0m - Argus' Valuation Analysis model
[36m(async_process_image_data pid=5191)[0m - The ARGUS RESEARCH RATING SYSTEM
[36m(async_process_image_data pid=5191)[0m - ARC's core equity strategy and UIT model portfolio products
[36m(async_process_image_data pid=5191)[0m Temporal Entities:
[36m(async_process_image_data pid=5191)[0m - Jan 25, 2024 (Report creation date)
[36m(async_process_image_data pid=5191)[0m The image does not app

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:20:40,090] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:20:40,134] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:20:40,134] p5191 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:20:10,976] p5190 {500037063.py:14} INFO - going to convert img/b64_images/Intel_rec_page_6.b64 into embeddings[32m [repeated 3x across cluster][0m


[36m(async_process_image_data pid=5187)[0m - Organizations: Argus Research, Argus Investors' Counsel, Inc. (AIC), The Argus Research Group, Morningstar
[36m(async_process_image_data pid=5187)[0m - Date: Feb 1, 2024 (mentioned as the report creation date)
[36m(async_process_image_data pid=5187)[0m The image does not appear to contain any specific person names, product names, or location entities beyond the organization names listed above. It provides an overview of Argus Research's methodology and disclaimers related to their investment research reports and ratings.The image appears to be a section from an investment research report or analysis provided by Argus Research Company. The overall layout and design have a professional and structured appearance with a maroon color header and black text on a white background.
[36m(async_process_image_data pid=5187)[0m At the top, there is the Argus logo, and the section heading reads 'METHODOLOGY & DISCLAIMERS'. The report creation date

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:20:45,593] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:20:40,143] p5191 {500037063.py:14} INFO - going to convert img/b64_images/APPLE_rec_page_6.b64 into embeddings
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:20:45,626] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:20:45,626] p5187 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:20:45,644] p5187 {500037063.py:14} INFO - going to convert img/b64_images/Intel_rec_page_4.b64 into embeddings


[36m(async_process_image_data pid=5190)[0m file_path: img/b64_images/Intel_rec_page_6.b64, image description (prefiltered with entities extracted): The image contains information from an analyst report on Intel Corporation (INTC), a major semiconductor company. Here are the key entities I can identify:
[36m(async_process_image_data pid=5190)[0m Named Entities:
[36m(async_process_image_data pid=5190)[0m - Intel Corp. (company name)
[36m(async_process_image_data pid=5190)[0m - INTC (stock ticker symbol)
[36m(async_process_image_data pid=5190)[0m Temporal Entities:
[36m(async_process_image_data pid=5190)[0m - 2024 (year for non-GAAP EPS forecast)
[36m(async_process_image_data pid=5190)[0m - 2025 (year for non-GAAP EPS projection)
[36m(async_process_image_data pid=5190)[0m - 2019-2023 (5-year period for average P/E calculation)
[36m(async_process_image_data pid=5190)[0m - January 26 (date mentioned)
[36m(async_process_image_data pid=5190)[0m Financial/Numerical Entities

[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:20:59,561] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:20:59,631] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:20:59,631] p5190 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:20:59,657] p5190 {500037063.py:14} INFO - going to convert img/b64_images/Microsoft_rec_page_2.b64 into embeddings


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/APPLE_rec_page_6.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5191)[0m - Organizations: Argus, Argus Investors' Counsel, Inc. (AIC), Argus Research Co. (ARC), The Argus Research Group, Morningstar
[36m(async_process_image_data pid=5191)[0m - Locations: U.S. Securities and Exchange Commission
[36m(async_process_image_data pid=5191)[0m Custom Entities:
[36m(async_process_image_data pid=5191)[0m - Rating system: BUY, HOLD, SELL
[36m(async_process_image_data pid=5191)[0m - Analysis types: Industry Analysis, Growth Analysis, Financial Strength Analysis, Management Assessment, Risk Analysis, Valuation Analysis
[36m(async_process_image_data pid=5191)[0m - Benchmark index: S&P 500
[36m(async_process_image_data pid=5191)[0m - Date: Mar 22, 2024
[36m(async_process_image_data pid=5191)[0m I did 

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:21:08,008] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:21:08,071] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:21:08,072] p5191 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:21:08,103] p5191 {500037063.py:14} INFO - going to convert img/b64_images/AMD_rec_page_3.b64 into embeddings


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/Intel_rec_page_5.b64, image description (prefiltered with entities extracted): Based on the image, here is a list of entities present:
[36m(async_process_image_data pid=5186)[0m Named Entities:
[36m(async_process_image_data pid=5186)[0m - Organizations: Intel Corp., Argus Research Company, NVM, SK Hynix, GAAP, VMware, Medronic, Altera, Mobileye, Nvidia
[36m(async_process_image_data pid=5186)[0m - Person Names: Pat Geisinger, Andy Bryant
[36m(async_process_image_data pid=5186)[0m Temporal Entities:
[36m(async_process_image_data pid=5186)[0m - Dates: Jan 26, 2024, 2021, 2020, February 2023, March 2023, 2022
[36m(async_process_image_data pid=5186)[0m Product Entities:
[36m(async_process_image_data pid=5186)[0m - Product/Service Names: Non-GAAP, PC
[36m(async_process_image_data pid=5186)[0m Financial Entities:
[36m(async_process_image_data pid=5186)[0m - Financial Figures: $30 billion, $100 billion, $1

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:21:25,982] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:21:26,017] p5186 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:21:26,017] p5186 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:21:26,030] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Intel_rec_page_2.b64 into embeddings


[36m(async_process_image_data pid=5187)[0m file_path: img/b64_images/Intel_rec_page_4.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5187)[0m - People: None
[36m(async_process_image_data pid=5187)[0m - Organizations: Intel Corp, Amazon AWS, Graviton, Mobilye, Gelsinger, Israel's government
[36m(async_process_image_data pid=5187)[0m - Locations: China, Gaza, Kiryat Gat (Israel), New Mexico
[36m(async_process_image_data pid=5187)[0m Custom Entities:
[36m(async_process_image_data pid=5187)[0m - Intel 4 process technology
[36m(async_process_image_data pid=5187)[0m - Core Ultra Mobile processor family
[36m(async_process_image_data pid=5187)[0m - Gaudi AI Accelerator
[36m(async_process_image_data pid=5187)[0m - Habana AI training processor
[36m(async_process_image_data pid=5187)[0m - Data Center and AI Group (DCAI)
[36m(async_process_image_data pid=5187)[0

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:21:41,038] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:21:41,038] p5187 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:21:41,001] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:21:41,048] p5187 {500037063.py:14} INFO - going to convert img/b64_images/APPLE_rec_page_4.b64 into embeddings


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/AMD_rec_page_3.b64, image description (prefiltered with entities extracted): Based on the image, here are the entities I can identify:
[36m(async_process_image_data pid=5191)[0m Named Entities:
[36m(async_process_image_data pid=5191)[0m - Advanced Micro Devices Inc.
[36m(async_process_image_data pid=5191)[0m - AMD
[36m(async_process_image_data pid=5191)[0m - AMD Corp.
[36m(async_process_image_data pid=5191)[0m - Texas Instruments Inc. (TXN)
[36m(async_process_image_data pid=5191)[0m - Applied Materials Inc. (AMAT)
[36m(async_process_image_data pid=5191)[0m - Lam Research Corp. (LRCX)
[36m(async_process_image_data pid=5191)[0m - Analog Devices Inc. (ADI)
[36m(async_process_image_data pid=5191)[0m - KLA Corp. (KLAC)
[36m(async_process_image_data pid=5191)[0m Ticker Symbols (which can be considered Named Entities):
[36m(async_process_image_data pid=5191)[0m - AMD
[36m(async_process_image_data pid

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:21:43,796] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[36m(async_process_image_data pid=5190)[0m file_path: img/b64_images/Microsoft_rec_page_2.b64, image description (prefiltered with entities extracted): Based on the image, here are the key entities present:
[36m(async_process_image_data pid=5190)[0m - Microsoft Corp (Organization)
[36m(async_process_image_data pid=5190)[0m - Azure (Product/Service)
[36m(async_process_image_data pid=5190)[0m - Activision Blizzard (Organization)
[36m(async_process_image_data pid=5190)[0m - December 31, 2023 (Date)
[36m(async_process_image_data pid=5190)[0m - Fiscal 2Q24 (Time Period)
[36m(async_process_image_data pid=5190)[0m - October 13, 2023 (Date)
[36m(async_process_image_data pid=5190)[0m - January 2024 (Month and Year)
[36m(async_process_image_data pid=5190)[0m Financial/Numeric Entities:
[36m(async_process_image_data pid=5190)[0m - $62 billion (Revenue)
[36m(async_process_image_data pid=5190)[0m - $0.15 (EPS beat consensus)
[36m(async_process_image_data pid=5190)[0m - 2% (S

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:21:43,833] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:21:43,833] p5191 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:21:48,685] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:21:48,685] p5190 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/Intel_rec_page_2.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5186)[0m Named Entities:
[36m(async_process_image_data pid=5186)[0m - Organizations: Intel Corp, INTC, Argus, Morningstar
[36m(async_process_image_data pid=5186)[0m - Products/Services: AI Everywhere, NVM, SK Hynix, GAAP, Non-GAAP
[36m(async_process_image_data pid=5186)[0m Temporal Entities:
[36m(async_process_image_data pid=5186)[0m - Dates: Jan 26, 2024, 2023, 2022, 2021, 2020, 2025, 2024, 2023, 1Q24, 2022
[36m(async_process_image_data pid=5186)[0m - Durations: year-over-year, sequentially
[36m(async_process_image_data pid=5186)[0m - Revenue figures: $15.4 billion, $15.1 billion, $0.54 per diluted share, $0.41 in 3Q23
[36m(async_process_image_data pid=5186)[0m - Percentage figures: 3%, 34%, 17%, 45%, 10%, 9%, 8%, 49%, 33%

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:22:25,981] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:22:26,029] p5186 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:22:26,029] p5186 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5187)[0m - Apple Inc. (Organization)
[36m(async_process_image_data pid=5187)[0m - Google (Organization)
[36m(async_process_image_data pid=5187)[0m - Gemini (Product)
[36m(async_process_image_data pid=5187)[0m - Qualcomm (Organization)
[36m(async_process_image_data pid=5187)[0m - Snapdragon (Product)
[36m(async_process_image_data pid=5187)[0m - Intel (Organization)
[36m(async_process_image_data pid=5187)[0m - Gauld (Likely a misspelling of a person or organization name)
[36m(async_process_image_data pid=5187)[0m - Mar 22, 2024 (Date)
[36m(async_process_image_data pid=5187)[0m - 1Q24 (Likely referring to First Quarter of 2024)
[36m(async_process_image_data pid=5187)[0m - 4Q23 (Fourth Quarter of 2023)
[36m(async_process_image_data pid=5187)[0m - FY16 (Likely Fiscal Year 2016)
[36m(async_process_image_data pid=5187)[0m - FY23 (Fiscal Year 2023) 
[36m(async_process_image_data pid=5187)[0m - FY22 (Fiscal Year 2022)
[36m(async_proce

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:22:43,585] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:22:43,585] p5187 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:22:43,532] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:22:43,616] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Cisco_rec_page_2.b64 into embeddings


[36m(async_process_image_data pid=5187)[0m json_file_path: pdf_img_json_dir/APPLE_rec_page_4.json
[36m(async_process_image_data pid=5190)[0m file_path: img/b64_images/Cisco_rec_page_6.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5190)[0m Named Entities:
[36m(async_process_image_data pid=5190)[0m - Argus Research (Organization)
[36m(async_process_image_data pid=5190)[0m - Economist Harold Dorsey (Person)
[36m(async_process_image_data pid=5190)[0m - S&P 500 (Index)
[36m(async_process_image_data pid=5190)[0m Temporal Entities:
[36m(async_process_image_data pid=5190)[0m - 1934 (Year when Argus Research was founded)
[36m(async_process_image_data pid=5190)[0m - 12-month period (Mentioned for rating determination)
[36m(async_process_image_data pid=5190)[0m Custom Entities:
[36m(async_process_image_data pid=5190)[0m - Industry Analysis
[36m(async_process_

[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:23:28,064] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:23:28,098] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:23:28,098] p5190 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:22:43,609] p5190 {500037063.py:14} INFO - going to convert img/b64_images/Cisco_rec_page_6.b64 into embeddings[32m [repeated 3x across cluster][0m


[36m(async_process_image_data pid=5191)[0m - Advanced Micro Devices Inc. (AMD) - Organization
[36m(async_process_image_data pid=5191)[0m - North American - Location
[36m(async_process_image_data pid=5191)[0m - Fortune 2000 companies - Organization category
[36m(async_process_image_data pid=5191)[0m - Instinct MI300, Instinct MI300X - Product names
[36m(async_process_image_data pid=5191)[0m - EPYC (high-performance computing/customers) - Product
[36m(async_process_image_data pid=5191)[0m - Ryzen notebooks and desktop CPUs - Product
[36m(async_process_image_data pid=5191)[0m - Ryzen 8000 Series Mobile processors - Product
[36m(async_process_image_data pid=5191)[0m - AMD RDNA 3 integrated graphics - Product
[36m(async_process_image_data pid=5191)[0m - AMD PRO technologies - Product
[36m(async_process_image_data pid=5191)[0m - Zen 4 - Product
[36m(async_process_image_data pid=5191)[0m - Ryzen CPUs - Product
[36m(async_process_image_data pid=5191)[0m - Intel - Organi

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:23:30,099] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:23:30,101] p5191 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5186)[0m - Organizations: Argus, Cisco Systems Inc., Gartner Robbins
[36m(async_process_image_data pid=5186)[0m - Person Occupation: CEO
[36m(async_process_image_data pid=5186)[0m Numeric Entities:
[36m(async_process_image_data pid=5186)[0m - Dates: Feb 16, 2024, 2/15/24
[36m(async_process_image_data pid=5186)[0m - Revenue/Financial Figures: $12.6-$12.8 billion, $12.70 billion, $0.87, $0.24, $0.8-$3.6 billion range, $2.8 billion
[36m(async_process_image_data pid=5186)[0m - Percentages: 14% (multiple instances), 25%, 37%, 7%, 6%, 3%
[36m(async_process_image_data pid=5186)[0m Product Entities:
[36m(async_process_image_data pid=5186)[0m - Products: Cisco AI and secure data organizations, Ethernet AI fabric, GPU-enabled infrastructure
[36m(async_process_image_data pid=5186)[0m The image contains financial analysis and commentary on Cisco Systems Inc. from a report created by Argus Research Company. It discusses Cisco's recent performance,

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:23:51,358] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 3x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:23:32,743] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Boeing_rec_page_5.b64 into embeddings[32m [repeated 3x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:23:51,432] p5187 {500037063.py:58} INFO - Ingesting data into pipeline[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:23:51,432] p5187 {500037063.py:59} INFO - image desc: 200 OK[32m [repeated 2x across cluster][0m


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/Boeing_rec_page_4.b64, image description (prefiltered with entities extracted): Based on the image, here are the key entities I can identify:
[36m(async_process_image_data pid=5191)[0m - Organizations: Boeing Co, Argus Research Company
[36m(async_process_image_data pid=5191)[0m - Locations: Chicago
[36m(async_process_image_data pid=5191)[0m - Dates: February 2 (BUY-rated BA closed at $209.38, down $0.43), 2026-2027
[36m(async_process_image_data pid=5191)[0m Product Entities:
[36m(async_process_image_data pid=5191)[0m - Aircraft and related products: commercial jetliners, military aircraft, rotorcraft, electronic and defense systems, missiles, satellites, launch vehicles, advanced information and communication systems
[36m(async_process_image_data pid=5191)[0m Financial Entities:
[36m(async_process_image_data pid=5191)[0m - Stock tickers: BA (Boeing Co NYSE ticker)
[36m(async_process_image_data pid=519

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:24:04,928] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:23:51,452] p5187 {500037063.py:14} INFO - going to convert img/b64_images/AMD_rec_page_1.b64 into embeddings
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:24:04,960] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:24:04,960] p5191 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:24:04,970] p5191 {500037063.py:14} INFO - going to convert img/b64_images/APPLE_rec_page_3.b64 into embeddings


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/Boeing_rec_page_5.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5186)[0m Named Entities:
[36m(async_process_image_data pid=5186)[0m - Argus
[36m(async_process_image_data pid=5186)[0m - Economist Harold Dorsey
[36m(async_process_image_data pid=5186)[0m - The Argus Research Group
[36m(async_process_image_data pid=5186)[0m - The Argus Investors' Counsel, Inc. (AIC)
[36m(async_process_image_data pid=5186)[0m - U.S. Securities and Exchange Commission
[36m(async_process_image_data pid=5186)[0m - Argus Research Co. (ARC)
[36m(async_process_image_data pid=5186)[0m - Morningstar
[36m(async_process_image_data pid=5186)[0m Organization Entities:
[36m(async_process_image_data pid=5186)[0m - Argus Research
[36m(async_process_image_data pid=5186)[0m - Argus Investors' Counsel, Inc.
[36m(async

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:24:05,650] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[36m(async_process_image_data pid=5190)[0m file_path: img/b64_images/Amazon_rec_page_4.b64, image description (prefiltered with entities extracted): Based on the image, here are the entities I can identify:
[36m(async_process_image_data pid=5190)[0m - Amazon.com Inc. (Organization)
[36m(async_process_image_data pid=5190)[0m - AWS (Subsidiary/Service of Amazon)
[36m(async_process_image_data pid=5190)[0m - Anthropic (Organization)
[36m(async_process_image_data pid=5190)[0m Temporal Entities:
[36m(async_process_image_data pid=5190)[0m - 2023 (Year)
[36m(async_process_image_data pid=5190)[0m - 4Q23 (Fourth Quarter of 2023)
[36m(async_process_image_data pid=5190)[0m - 2022 (Year)
[36m(async_process_image_data pid=5190)[0m - 2021 (Year)
[36m(async_process_image_data pid=5190)[0m - 1Q24 (First Quarter of 2024)
[36m(async_process_image_data pid=5190)[0m Product/Service Entities:
[36m(async_process_image_data pid=5190)[0m - AWS (Amazon Web Services)
[36m(async_process_i

[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:24:42,124] p5190 {500037063.py:58} INFO - Ingesting data into pipeline[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:24:42,124] p5190 {500037063.py:59} INFO - image desc: 200 OK[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:24:05,729] p5186 {500037063.py:14} INFO - going to convert img/b64_images/tesla_rec_page_2.b64 into embeddings
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:24:42,086] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/tesla_rec_page_2.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5186)[0m Named Entities:
[36m(async_process_image_data pid=5186)[0m - Tesla Inc. (Company)
[36m(async_process_image_data pid=5186)[0m - Cybertrucks (Tesla's product)
[36m(async_process_image_data pid=5186)[0m - Shanghai (Location)
[36m(async_process_image_data pid=5186)[0m - Model 3, Model S and Model Y, Model X (Tesla's vehicle models)
[36m(async_process_image_data pid=5186)[0m - Megapack (Tesla's new energy product)
[36m(async_process_image_data pid=5186)[0m - Jan 26, 2024 (Date)
[36m(async_process_image_data pid=5186)[0m - 4Q23 (Time period - 4th quarter of 2023)
[36m(async_process_image_data pid=5186)[0m - 4Q22 (Time period - 4th quarter of 2022) 
[36m(async_process_image_data pid=5186)[0m Product Entities:
[36m(as

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:24:43,965] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[36m(async_process_image_data pid=5187)[0m file_path: img/b64_images/AMD_rec_page_1.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can discern:
[36m(async_process_image_data pid=5187)[0m - Advanced Micro Devices, Inc. (company name)
[36m(async_process_image_data pid=5187)[0m - AMD (company abbreviation)
[36m(async_process_image_data pid=5187)[0m - Intel (company name)
[36m(async_process_image_data pid=5187)[0m - ATI (acquired company name)
[36m(async_process_image_data pid=5187)[0m - Xilinx (acquired company name)
[36m(async_process_image_data pid=5187)[0m - Texas Instruments (company name)
[36m(async_process_image_data pid=5187)[0m - Microchip (company name)
[36m(async_process_image_data pid=5187)[0m - Microsoft (Xbox) (company and product name)
[36m(async_process_image_data pid=5187)[0m - Sony (PS) (company and product name)
[36m(async_process_image_data pid=5187)[0m - January 31, 2024 (report d

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:24:51,126] p5187 {500037063.py:58} INFO - Ingesting data into pipeline[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:24:51,126] p5187 {500037063.py:59} INFO - image desc: 200 OK[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:24:51,087] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/APPLE_rec_page_3.b64, image description (prefiltered with entities extracted): Based on the image, here is a list of entities present:
[36m(async_process_image_data pid=5191)[0m Named Entities:
[36m(async_process_image_data pid=5191)[0m - Apple Inc.
[36m(async_process_image_data pid=5191)[0m - IBM (International Business Machines)
[36m(async_process_image_data pid=5191)[0m - HPE (Hewlett Packard Enterprise Co)
[36m(async_process_image_data pid=5191)[0m Organizations:
[36m(async_process_image_data pid=5191)[0m - AAPL (Ticker symbol for Apple Inc.)
[36m(async_process_image_data pid=5191)[0m - IBM (Ticker symbol for International Business Machines)
[36m(async_process_image_data pid=5191)[0m - HPE (Ticker symbol for Hewlett Packard Enterprise Co)
[36m(async_process_image_data pid=5191)[0m The image does not contain any specific person names, locations, or dates. However, it does provide some quantitati

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:24:56,257] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Boeing_rec_page_3.b64 into embeddings
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:24:56,207] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:24:56,237] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:24:56,237] p5191 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5191)[0m json_file_path: pdf_img_json_dir/APPLE_rec_page_3.json
[36m(async_process_image_data pid=5190)[0m file_path: img/b64_images/AMD_rec_page_2.b64, image description (prefiltered with entities extracted): Based on the image, here is a list of relevant entities present:
[36m(async_process_image_data pid=5190)[0m - Advanced Micro Devices Inc. (organization)
[36m(async_process_image_data pid=5190)[0m - Intel (organization)
[36m(async_process_image_data pid=5190)[0m - Nvidia (organization)
[36m(async_process_image_data pid=5190)[0m - AMD (organization abbreviation)
[36m(async_process_image_data pid=5190)[0m - Pat Gelsinger (person name, CEO of Intel)
[36m(async_process_image_data pid=5190)[0m - Instinct MI300 series accelerator (product)
[36m(async_process_image_data pid=5190)[0m - Ryzen 8040 series CPUs (product)
[36m(async_process_image_data pid=5190)[0m - AI platform strategy (technology/service)
[36m(async_process_image_data p

[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:24:56,255] p5190 {500037063.py:14} INFO - going to convert img/b64_images/AMD_rec_page_2.b64 into embeddings[32m [repeated 3x across cluster][0m
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:25:47,370] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:25:47,437] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:25:47,437] p5190 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/AMD_rec_page_5.b64, image description (prefiltered with entities extracted): Based on the image, here are the notable entities present:
[36m(async_process_image_data pid=5191)[0m Named Entities:
[36m(async_process_image_data pid=5191)[0m - Advanced Micro Devices, Inc. (Organization)
[36m(async_process_image_data pid=5191)[0m - AMD (Organization)
[36m(async_process_image_data pid=5191)[0m - Xilinx (Organization)
[36m(async_process_image_data pid=5191)[0m - 1Q24 (Period/Quarter)
[36m(async_process_image_data pid=5191)[0m - 2023, 2022, 2021, 2020 (Years)
[36m(async_process_image_data pid=5191)[0m - QLogic, Conexant, Marvell (Organizations)
[36m(async_process_image_data pid=5191)[0m Person Names:
[36m(async_process_image_data pid=5191)[0m - Rick Bergman
[36m(async_process_image_data pid=5191)[0m - Devinder Kumar
[36m(async_process_image_data pid=5191)[0m - 1Q24, 2023, 2022, 2021, 2020 (referring to

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:25:58,731] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:25:47,453] p5190 {500037063.py:14} INFO - going to convert img/b64_images/Amazon_rec_page_6.b64 into embeddings
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:25:58,765] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:25:58,765] p5191 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:25:58,780] p5191 {500037063.py:14} INFO - going to convert img/b64_images/APPLE_rec_page_5.b64 into embeddings


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/Boeing_rec_page_3.b64, image description (prefiltered with entities extracted): Based on the image, here are the entities I can identify:
[36m(async_process_image_data pid=5186)[0m - Organizations: Boeing Co, Argus, Northrop Grumman Corp., Lockheed Martin Corp., General Dynamics Corp.
[36m(async_process_image_data pid=5186)[0m - Person Name: David Calhoun (Boeing's CEO and president), Brian West (Boeing's CFO)
[36m(async_process_image_data pid=5186)[0m Custom Entities:
[36m(async_process_image_data pid=5186)[0m - Aircraft: 737 MAX
[36m(async_process_image_data pid=5186)[0m - Defense programs: Space and Security programs
[36m(async_process_image_data pid=5186)[0m Temporal Entities:
[36m(async_process_image_data pid=5186)[0m - Dates: Feb 5, 2024 (report creation date)
[36m(async_process_image_data pid=5186)[0m - Financial Years: 2024, 2025
[36m(async_process_image_data pid=5186)[0m Product Entities:


[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:26:01,151] p5186 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:26:01,151] p5186 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5187)[0m file_path: img/b64_images/Boeing_rec_page_1.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5187)[0m - Boeing Co (Organization)
[36m(async_process_image_data pid=5187)[0m - NYSE: BA (Stock ticker symbol)
[36m(async_process_image_data pid=5187)[0m - Dow Jones Industrial Average (Stock index)
[36m(async_process_image_data pid=5187)[0m - S&P 500 (Stock index)
[36m(async_process_image_data pid=5187)[0m - John Eade (Person name)
[36m(async_process_image_data pid=5187)[0m - February 2, 2024 (Date)
[36m(async_process_image_data pid=5187)[0m - Twelve Month Rating (Time period)
[36m(async_process_image_data pid=5187)[0m - Five Year Rating (Time period)
[36m(async_process_image_data pid=5187)[0m - ARGUS RATING: BUY (Investment recommendation)
[36m(async_process_image_data pid=5187)[0m - HOLD (Investment recommendation)

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:26:19,094] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:26:01,159] p5186 {500037063.py:14} INFO - going to convert img/b64_images/APPLE_rec_page_2.b64 into embeddings
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:26:19,163] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:26:19,163] p5187 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:26:19,175] p5187 {500037063.py:14} INFO - going to convert img/b64_images/Cisco_rec_page_3.b64 into embeddings


[36m(async_process_image_data pid=5190)[0m file_path: img/b64_images/Amazon_rec_page_6.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities present:
[36m(async_process_image_data pid=5190)[0m - Argus
[36m(async_process_image_data pid=5190)[0m - Argus Research Co. (ARC)
[36m(async_process_image_data pid=5190)[0m - Argus Investors' Counsel, Inc. (AIC)
[36m(async_process_image_data pid=5190)[0m - The Argus Research Group
[36m(async_process_image_data pid=5190)[0m - The Argus Research Co.
[36m(async_process_image_data pid=5190)[0m - Morningstar
[36m(async_process_image_data pid=5190)[0m Organizations:
[36m(async_process_image_data pid=5190)[0m - Argus Research
[36m(async_process_image_data pid=5190)[0m - Argus Investors' Counsel, Inc.
[36m(async_process_image_data pid=5190)[0m - The Argus Research Group
[36m(async_process_image_data pid=5190)[0m - Morningstar
[36m(async_process_image_data pid=5190)[0m Loc

[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:26:32,024] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:26:32,055] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:26:32,056] p5190 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:26:32,092] p5190 {500037063.py:14} INFO - going to convert img/b64_images/Amazon_rec_page_5.b64 into embeddings


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/APPLE_rec_page_2.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5186)[0m Named Entities:
[36m(async_process_image_data pid=5186)[0m - Apple Inc. (Organization)
[36m(async_process_image_data pid=5186)[0m - Jonathan Kanter (Person - U.S. Assistant Attorney General)
[36m(async_process_image_data pid=5186)[0m - Department of Justice (Organization)
[36m(async_process_image_data pid=5186)[0m - AT&T (Organization)
[36m(async_process_image_data pid=5186)[0m - 25 years ago
[36m(async_process_image_data pid=5186)[0m - 2024
[36m(async_process_image_data pid=5186)[0m - 2023
[36m(async_process_image_data pid=5186)[0m - 2022
[36m(async_process_image_data pid=5186)[0m - 2021
[36m(async_process_image_data pid=5186)[0m - 2020
[36m(async_process_image_data pid=5186)[0m - 2019
[36m(async_process_

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:26:47,214] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:26:47,249] p5186 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:26:47,250] p5186 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:26:47,266] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Cisco_rec_page_1.b64 into embeddings


[36m(async_process_image_data pid=5191)[0m - People: Timothy Cook, Steve Jobs, Jeff Williams, James Wilson, Phil Schiller, Jon Ive, SVP
[36m(async_process_image_data pid=5191)[0m - Organizations: Apple Inc., Argus Research Company, DoJ (Department of Justice)
[36m(async_process_image_data pid=5191)[0m Product Entities:
[36m(async_process_image_data pid=5191)[0m - iPhone, iPad, Apple Watch, AirPods, Beats headphones, Apple TV+, Apple Arcade, Apple Music, Apple Pay
[36m(async_process_image_data pid=5191)[0m Location Entities:
[36m(async_process_image_data pid=5191)[0m - None specifically mentioned
[36m(async_process_image_data pid=5191)[0m Temporal Entities:
[36m(async_process_image_data pid=5191)[0m - FY24 (Fiscal Year 2024), FY25 (Fiscal Year 2025), March 22, 2024
[36m(async_process_image_data pid=5191)[0m Custom Entities:
[36m(async_process_image_data pid=5191)[0m - CEO, COO, chief technology officer, two-year forward relative P/E, trailing multiple, buy-rated, ble

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:26:49,493] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:26:49,494] p5191 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5187)[0m - Cisco Systems Inc.
[36m(async_process_image_data pid=5187)[0m - Juniper Networks Inc.
[36m(async_process_image_data pid=5187)[0m Organization Names:
[36m(async_process_image_data pid=5187)[0m - Cisco
[36m(async_process_image_data pid=5187)[0m - Argus Research Company
[36m(async_process_image_data pid=5187)[0m Metrics/Technical Terms:
[36m(async_process_image_data pid=5187)[0m - Networking hardware
[36m(async_process_image_data pid=5187)[0m - Software
[36m(async_process_image_data pid=5187)[0m - AI infrastructure
[36m(async_process_image_data pid=5187)[0m - Cloud-based
[36m(async_process_image_data pid=5187)[0m - On-premises
[36m(async_process_image_data pid=5187)[0m - Recurring revenue
[36m(async_process_image_data pid=5187)[0m - Product revenue
[36m(async_process_image_data pid=5187)[0m - RPO (Remaining performance obligations)
[36m(async_process_image_data pid=5187)[0m - Services RPO
[36m(async_process_image_

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:27:07,067] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:27:07,103] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:27:07,103] p5187 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5190)[0m - People: Jeff Bezos, Andy Jassy, Brian Olsavsky, Jeffrey Wilke, Adam Selipsky
[36m(async_process_image_data pid=5190)[0m - Organization: Amazon.com Inc. (NASDAQ: AMZN), Amazon Web Services (AWS), Whole Foods
[36m(async_process_image_data pid=5190)[0m Product Entities:
[36m(async_process_image_data pid=5190)[0m - Amazon Prime, Amazon Web Services, Amazon Prime Video, Kindle, Echo, Dot (digital voice assistants)
[36m(async_process_image_data pid=5190)[0m Custom Entities:
[36m(async_process_image_data pid=5190)[0m - Cloud computing, infrastructure-as-a-service, e-commerce, online retail, brick & mortar rivals
[36m(async_process_image_data pid=5190)[0m Location Entities: 
[36m(async_process_image_data pid=5190)[0m - U.S. (United States)
[36m(async_process_image_data pid=5190)[0m The image appears to be an analyst report discussing Amazon's business performance, management changes, product offerings, competitive position, and val

[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:27:26,029] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:27:26,064] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:27:26,064] p5190 {500037063.py:59} INFO - image desc: 200 OK


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/Cisco_rec_page_1.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5186)[0m Named Entities:
[36m(async_process_image_data pid=5186)[0m - Cisco Systems Inc. (Organization)
[36m(async_process_image_data pid=5186)[0m - Argus (Organization)
[36m(async_process_image_data pid=5186)[0m Temporal Entities:
[36m(async_process_image_data pid=5186)[0m - February 16, 2024 (Date)
[36m(async_process_image_data pid=5186)[0m - Fiscal 2024 (Year)
[36m(async_process_image_data pid=5186)[0m - 3Q24 (Quarter)
[36m(async_process_image_data pid=5186)[0m - FY24 (Fiscal Year)
[36m(async_process_image_data pid=5186)[0m - 1-Year EPS Growth Forecast (Duration)
[36m(async_process_image_data pid=5186)[0m - 3-Year EPS Growth Forecast (Duration)
[36m(async_process_image_data pid=5186)[0m Financial/Product Entities:


[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:27:26,668] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:27:26,748] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Cisco_rec_page_5.b64 into embeddings


[36m(async_process_image_data pid=5191)[0m file_path: img/b64_images/Amazon_rec_page_3.b64, image description (prefiltered with entities extracted): Based on the image, here is a list of entities I can identify:
[36m(async_process_image_data pid=5191)[0m - Amazon.com Inc.
[36m(async_process_image_data pid=5191)[0m - Home Depot, Inc.
[36m(async_process_image_data pid=5191)[0m - Tiffany & Co. Holding Ltd.
[36m(async_process_image_data pid=5191)[0m - Lowe's Cos., Inc.
[36m(async_process_image_data pid=5191)[0m - Nike, Inc.
[36m(async_process_image_data pid=5191)[0m - TJX Companies, Inc.
[36m(async_process_image_data pid=5191)[0m - LululemonAthletica Inc.
[36m(async_process_image_data pid=5191)[0m - Dollar General Corp.
[36m(async_process_image_data pid=5191)[0m - AWS (Amazon Web Services)
[36m(async_process_image_data pid=5191)[0m - Q4, Q3, Q2, Q1 (referring to financial quarters)
[36m(async_process_image_data pid=5191)[0m - P/E (Price-to-Earnings ratio)
[36m(asyn

[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:28:13,506] p5191 {500037063.py:58} INFO - Ingesting data into pipeline[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:28:13,506] p5191 {500037063.py:59} INFO - image desc: 200 OK[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5191)[0m [2024-05-28 16:28:13,469] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:27:26,749] p5190 {500037063.py:14} INFO - going to convert img/b64_images/Microsoft_rec_page_1.b64 into embeddings[32m [repeated 3x across cluster][0m
[36m(async_process_image_data pid=5190)[0m [2024-05-28 16:28:13,910] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[36m(async_process_image_data pid=5186)[0m file_path: img/b64_images/Cisco_rec_page_5.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
[36m(async_process_image_data pid=5186)[0m - Organizations: Cisco Systems Inc., Argus Research Company
[36m(async_process_image_data pid=5186)[0m - Person Name: Chuck Robbins
[36m(async_process_image_data pid=5186)[0m Custom Entities:
[36m(async_process_image_data pid=5186)[0m - Software/Services: Splunk, CAD/CAM software, routing, switching, analytics, video, wireless, security, collaboration, data center
[36m(async_process_image_data pid=5186)[0m Temporal Entities:
[36m(async_process_image_data pid=5186)[0m - Dates: February 2022, February 2021, February 2020, March 2011, July 2015, 1997
[36m(async_process_image_data pid=5186)[0m The image does not appear to contain any specific product names, location entities, or other types of entities beyond those liste

[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:28:15,121] p5186 {500037063.py:58} INFO - Ingesting data into pipeline[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:28:15,122] p5186 {500037063.py:59} INFO - image desc: 200 OK[32m [repeated 2x across cluster][0m
[36m(async_process_image_data pid=5186)[0m [2024-05-28 16:28:15,133] p5186 {500037063.py:14} INFO - going to convert img/b64_images/Amazon_rec_page_1.b64 into embeddings[32m [repeated 3x across cluster][0m
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:28:25,041] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole[32m [repeated 2x across cluster][0m


[36m(async_process_image_data pid=5187)[0m file_path: img/b64_images/Microsoft_rec_page_6.b64, image description (prefiltered with entities extracted): The image contains the following entities:
[36m(async_process_image_data pid=5187)[0m - Microsoft Corp. (Organization name)
[36m(async_process_image_data pid=5187)[0m - MS Windows (Product name)
[36m(async_process_image_data pid=5187)[0m - MS Office (Product name)
[36m(async_process_image_data pid=5187)[0m - PCs (Product category)
[36m(async_process_image_data pid=5187)[0m - Windows Server (Product name)
[36m(async_process_image_data pid=5187)[0m - SQL Server (Product name)
[36m(async_process_image_data pid=5187)[0m - Dynamics CRM (Product name)
[36m(async_process_image_data pid=5187)[0m - SharePoint (Product name)
[36m(async_process_image_data pid=5187)[0m - Azure (Product name)
[36m(async_process_image_data pid=5187)[0m - Lync (Product name)
[36m(async_process_image_data pid=5187)[0m - Xbox (Product name)
[36m

[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:28:25,079] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:28:25,079] p5187 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:28:25,094] p5187 {500037063.py:14} INFO - going to convert img/b64_images/Microsoft_rec_page_6.b64 into embeddings
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:29:12,212] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:29:12,295] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:29:12,295] p5187 {500037063.py:59} INFO - image desc: 200 OK
[36m(async_process_image_data pid=5187)[0m [2024-05-28 16:29:12,306] p5187 {500037063.py:14} INFO - going to convert img/b64_images/AMD_rec_page

(async_process_image_data pid=5190) file_path: img/b64_images/tesla_rec_page_3.b64, image description (prefiltered with entities extracted): Based on the image, here is a list of entities I can identify:
(async_process_image_data pid=5190) Named Entities:
(async_process_image_data pid=5190) - Tesla Inc. (company name)
(async_process_image_data pid=5190) - General Motors Company (company name)
(async_process_image_data pid=5190) - Ford Motor Co. (company name)
(async_process_image_data pid=5190) - CarMax Inc. (company name)
(async_process_image_data pid=5190) - Harley-Davidson, Inc. (company name)
(async_process_image_data pid=5190) Product Entities:
(async_process_image_data pid=5190) - Model X (Tesla vehicle model)
(async_process_image_data pid=5190) - Model 3 (Tesla vehicle model)
(async_process_image_data pid=5190) - Gigafactory (Tesla factory)
(async_process_image_data pid=5190)

(async_process_image_data pid=5190) [2024-05-28 16:29:18,212] p5190 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
(async_process_image_data pid=5190) [2024-05-28 16:29:18,251] p5190 {500037063.py:58} INFO - Ingesting data into pipeline
(async_process_image_data pid=5190) [2024-05-28 16:29:18,251] p5190 {500037063.py:59} INFO - image desc: 200 OK


(async_process_image_data pid=5186) file_path: img/b64_images/Amazon_rec_page_1.b64, image description (prefiltered with entities extracted): Based on the image, here's a list of the key entities present:
(async_process_image_data pid=5186) - Amazon.com Inc. (company name)
(async_process_image_data pid=5186) - Amazon Web Services (AWS) (product/service name)
(async_process_image_data pid=5186) - Prime (product/service name)
(async_process_image_data pid=5186) - Kindle (product name)
(async_process_image_data pid=5186) - Alexa (product name)
(async_process_image_data pid=5186) - Jim Kelleher, CFA (person name and occupation)
(async_process_image_data pid=5186) - ARGUS (company or brand name)
(async_process_image_data pid=5186) Temporal Entities:
(async_process_image_data pid=5186) - February 2, 2024 (date)
(async_process_image_data pid=5186) - 4Q23 (quarter specified)
(async_process_i

(async_process_image_data pid=5186) [2024-05-28 16:29:29,122] p5186 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
(async_process_image_data pid=5186) [2024-05-28 16:29:29,160] p5186 {500037063.py:58} INFO - Ingesting data into pipeline
(async_process_image_data pid=5186) [2024-05-28 16:29:29,161] p5186 {500037063.py:59} INFO - image desc: 200 OK


(async_process_image_data pid=5191) file_path: img/b64_images/tesla_rec_page_1.b64, image description (prefiltered with entities extracted): Based on the image, here are the key entities present:
(async_process_image_data pid=5191) Named Entities:
(async_process_image_data pid=5191) - Tesla Inc. (Company)
(async_process_image_data pid=5191) - Argus (Company/Analyst Firm)
(async_process_image_data pid=5191) - Austin, Texas (Location)
(async_process_image_data pid=5191) - Palo Alto, California (Location)
(async_process_image_data pid=5191) Product Entities:
(async_process_image_data pid=5191) - Electric vehicles
(async_process_image_data pid=5191) - Energy generation and storage systems
(async_process_image_data pid=5191) - Model 3/Y platforms
(async_process_image_data pid=5191) - January 26, 2024 (Date)
(async_process_image_data pid=5191) - June 29, 2010 (Date)
(async_process

(async_process_image_data pid=5191) [2024-05-28 16:29:36,260] p5191 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
(async_process_image_data pid=5191) [2024-05-28 16:29:36,308] p5191 {500037063.py:58} INFO - Ingesting data into pipeline
(async_process_image_data pid=5191) [2024-05-28 16:29:36,308] p5191 {500037063.py:59} INFO - image desc: 200 OK


(async_process_image_data pid=5187) file_path: img/b64_images/AMD_rec_page_7.b64, image description (prefiltered with entities extracted): Based on the image, here are the relevant entities I can identify:
(async_process_image_data pid=5187) - Argus Research (Organization)
(async_process_image_data pid=5187) - Harold Dorsey (Person)
(async_process_image_data pid=5187) - Nasdaq: AMD (Stock Ticker)
(async_process_image_data pid=5187) Custom Entities:
(async_process_image_data pid=5187) - Methodology & Disclaimers (Report Section)
(async_process_image_data pid=5187) - Valuation Analysis model (Financial Model)
(async_process_image_data pid=5187) - ARGUS RESEARCH RATING SYSTEM (Rating System)
(async_process_image_data pid=5187) - Industry Analysis, Growth Analysis, Financial Strength Analysis, Management Assessment, Risk Analysis (Analysis Types)
(async_process_image_data pid=5187) Temporal Entity:

(async_process_image_data pid=5187) [2024-05-28 16:29:51,709] p5187 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole

(async_process_image_data pid=5187) json_file_path: pdf_img_json_dir/AMD_rec_page_7.json

(async_process_image_data pid=5187) [2024-05-28 16:29:51,743] p5187 {500037063.py:58} INFO - Ingesting data into pipeline
(async_process_image_data pid=5187) [2024-05-28 16:29:51,743] p5187 {500037063.py:59} INFO - image desc: 200 OK


In [7]:
# global constants
CONFIG_FILE_PATH = "config.yaml"

In [8]:
# read the config yaml file
fpath = CONFIG_FILE_PATH
with open(fpath, 'r') as yaml_in:
    config = yaml.safe_load(yaml_in)
logger.info(f"config read from {fpath} -> {json.dumps(config, indent=2)}")

[2024-05-28 16:17:39,978] p4855 {3034282685.py:5} INFO - config read from config.yaml -> {
  "app_name": "multi-modal-rag-bedrock",
  "aws": {
    "region": "us-east-1",
    "cfn_stack_name": "multi-modal-revised"
  },
  "pdf_dir_info": {
    "source_pdf_dir": "pdf_data",
    "pdf_img_path": "images",
    "pdf_txt_path": "text_files",
    "pdf_extracted_data": "pdf_extracted_data",
    "json_img_dir": "pdf_img_json_dir",
    "json_txt_dir": "pdf_text_json_dir",
    "bucket_prefix": "multimodal",
    "bucket_img_prefix": "img",
    "qna_dir": "question_answer_files"
  },
  "metrics_dir": {
    "dir_name": "metrics",
    "text_and_image_raw_content": "all_content_description.csv"
  },
  "page_split_imgs": {
    "manually_saved_images_provided": false,
    "horizontal_split": false,
    "vertical_split": false,
    "image_scale": 3
  },
  "content_info": {
    "content_type": "pdf",
    "pdf_file_url": null,
    "pdf_local_files": [
      "tesla_rec.pdf",
      "Microsoft_rec.pdf",
      

In [9]:
# endpoint_url=g.TITAN_URL
region: str = config['aws']['region']
endpoint_url: str = config['bedrock_model_info']['bedrock_ep_url'].format(region=region)
claude_model_id: str = config['bedrock_model_info']['claude_sonnet_model_id']

In [10]:
bucket_name: str = get_bucket_name(config['aws']['cfn_stack_name'])
logger.info(f"Bucket name being used to store extracted images and texts from data: {bucket_name}")
s3 = boto3.client('s3')

[2024-05-28 16:17:40,104] p4855 {292065259.py:2} INFO - Bucket name being used to store extracted images and texts from data: multimodal-blog2-bucket-121797993273-us-west-2


In [11]:
sagemaker_session = sagemaker.Session()
sm_client = sagemaker_session.sagemaker_client
sm_runtime_client = sagemaker_session.sagemaker_runtime_client

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [12]:
outputs = get_cfn_outputs(config['aws']['cfn_stack_name'])
host = outputs['MultimodalCollectionEndpoint'].split('//')[1]
text_index_name = outputs['OpenSearchTextIndexName']
img_index_name = outputs['OpenSearchImgIndexName']
logger.info(f"opensearchhost={host}, text index={text_index_name}, image index={img_index_name}")
osi_text_endpoint = f"https://{outputs['OpenSearchPipelineTextEndpoint']}/data/ingest"
osi_img_endpoint = f"https://{outputs['OpenSearchPipelineImgEndpoint']}/data/ingest"

[2024-05-28 16:17:40,303] p4855 {2042488222.py:5} INFO - opensearchhost=fcd9sl5hhtbyztxkt2h0.us-west-2.aoss.amazonaws.com, text index=texts, image index=images


We use the OpenSearch client to create an index.

In [13]:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

# OpenSearch client for the image index
img_os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

# OpenSearch client for the text index
text_os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

[2024-05-28 16:17:40,338] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [14]:
index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "file_path": {
        "type": "text"
      },
      "file_text": {
        "type": "text"
      },
      "page_number": {
        "type": "text"
      },
       "metadata": { 
        "properties" :
          {
            "filename" : {
              "type" : "text"
            },
            "entities":{
              "type": "text"
            }
          }
      }
    }
  }
}
"""

# Create each index only if it does not already exist; any error is logged rather than raised.
index_body = json.loads(index_body)
try:
    # Check if the image index exists
    if not img_os_client.indices.exists(img_index_name):
        img_response = img_os_client.indices.create(img_index_name, body=index_body)
        logger.info(f"response received for the create index for images -> {img_response}")
    else:
        logger.info(f"The image index '{img_index_name}' already exists.")

    # Check if the text index exists
    if not text_os_client.indices.exists(text_index_name):
        txt_response = text_os_client.indices.create(text_index_name, body=index_body)
        logger.info(f"response received for the create index for texts -> {txt_response}")
    else:
        logger.info(f"The text index '{text_index_name}' already exists.")
except Exception as e:
    logger.error(f"Error in creating index, exception: {e}")

[2024-05-28 16:17:40,703] p4855 {base.py:258} INFO - PUT https://fcd9sl5hhtbyztxkt2h0.us-west-2.aoss.amazonaws.com:443/images [status:200 request:0.304s]
[2024-05-28 16:17:40,705] p4855 {2964750796.py:48} INFO - response received for the create index for images -> {'acknowledged': True, 'shards_acknowledged': True, 'index': 'images'}
[2024-05-28 16:17:41,047] p4855 {base.py:258} INFO - PUT https://fcd9sl5hhtbyztxkt2h0.us-west-2.aoss.amazonaws.com:443/texts [status:200 request:0.292s]
[2024-05-28 16:17:41,048] p4855 {2964750796.py:55} INFO - response received for the create index for texts -> {'acknowledged': True, 'shards_acknowledged': True, 'index': 'texts'}
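
The index mapping above declares `vector_embedding` as a `knn_vector` with `dimension` 1536, so every vector sent for ingestion should have exactly that length. The following is a minimal sketch of a pre-ingestion check; the helper name `is_valid_embedding` and the constant `EXPECTED_DIMENSION` are hypothetical and are not part of this notebook or of any library.

```python
from typing import Optional, Sequence

# Assumption: this mirrors the "dimension" value in the index mapping.
EXPECTED_DIMENSION = 1536


def is_valid_embedding(embedding: Optional[Sequence[float]],
                       expected_dim: int = EXPECTED_DIMENSION) -> bool:
    """Return True only if the embedding has the expected length and holds numbers."""
    if embedding is None:
        # The embedding helper may return None when the model call fails.
        return False
    if len(embedding) != expected_dim:
        return False
    return all(isinstance(value, (int, float)) for value in embedding)
```

A caller could skip, or log, any page whose embedding fails this check instead of posting it to the pipeline.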


## Step 2. Download the image files from S3 and convert to Base64

Now we download the image files from the S3 bucket. Once downloaded, these files are converted into [Base64](https://en.wikipedia.org/wiki/Base64) encoding so that we can create embeddings from the images.

In [15]:
os.makedirs(g.PDF_IMAGE_DIR, exist_ok=True)
os.makedirs(g.PDF_TEXT_DIR, exist_ok=True)
if config['content_info']['content_type'] == 'pdf':
    # download images from S3, we would be converting these to embeddings
    image_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_IMG_PREFIX, g.PDF_IMAGE_DIR, g.IMAGE_FILE_EXTN)
    text_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_PDF_TEXT_PREFIX, g.PDF_TEXT_DIR, g.TEXT_FILE_EXTN)
    logger.info(f"downloaded {len(image_files) + len(text_files)} files from s3")
elif config['content_info']['content_type'] == 'slide_deck':
    # download images from S3, we would be converting these to embeddings
    image_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_IMG_PREFIX, g.IMAGE_DIR, g.IMAGE_FILE_EXTN)
    logger.info(f"downloaded {len(image_files)} from s3")
else:
    logger.error("Unsupported or missing content type. Must be either 'pdf' or 'slide_deck'")

[2024-05-28 16:17:41,224] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_1.jpg to pdf_img/AMD_rec_page_1.jpg
[2024-05-28 16:17:41,324] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_2.jpg to pdf_img/AMD_rec_page_2.jpg
[2024-05-28 16:17:41,447] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_3.jpg to pdf_img/AMD_rec_page_3.jpg
[2024-05-28 16:17:41,534] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_4.jpg to pdf_img/AMD_rec_page_4.jpg
[2024-05-28 16:17:41,646] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_5.jpg to pdf_img/AMD_rec_page_5.jpg
[2024-05-28 16:17:41,736] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img

#### Convert jpg files into Base64.

In [16]:
def encode_image_to_base64(image_file_path: str) -> str:
    with open(image_file_path, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode('utf8')
        b64_image_path = os.path.join(g.B64_ENCODED_IMAGES_DIR, f"{Path(image_file_path).stem}.b64")
        with open(b64_image_path, "wb") as b64_image_file:
            b64_image_file.write(bytes(b64_image, 'utf-8'))
    return b64_image_path
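
As a quick illustration of what the Base64 step does, the sketch below encodes a few bytes, decodes them again, and checks the size of the encoded text. The sample bytes are an arbitrary stand-in, not a real image, and the helper `base64_roundtrip` is hypothetical.

```python
import base64


def base64_roundtrip(raw_bytes: bytes) -> str:
    """Encode bytes to Base64 text and verify that decoding restores them."""
    encoded_text = base64.b64encode(raw_bytes).decode("utf-8")
    if base64.b64decode(encoded_text) != raw_bytes:
        raise ValueError("Base64 round trip did not restore the original bytes")
    return encoded_text


# Arbitrary stand-in bytes (20 bytes); a real run would read a .jpg file instead.
sample_bytes = b"\xff\xd8\xff\xe0" + b"not a real image"
encoded = base64_roundtrip(sample_bytes)

# Base64 emits 4 output characters for every 3 input bytes (rounded up),
# so the encoded text is roughly one third larger than the binary file.
expected_length = 4 * ((len(sample_bytes) + 2) // 3)
```

The size growth is worth keeping in mind when sending large images in a request body.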

## Step 3. Get embeddings for the base64 encoded images

Now we are ready to use Anthropic’s Claude 3 Sonnet foundation model and the Amazon Titan Text Embeddings model, both available through Amazon Bedrock, to convert the Base64 version of the images into embeddings. Claude 3 Sonnet first describes each image as text, and Titan Text Embeddings then converts that text into an embedding. We ingest the embeddings into the pipeline using the [requests](https://pypi.org/project/requests/) HTTP library.

You must sign all HTTP requests to the pipeline using [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html).
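
For background, the sketch below shows the key-derivation step at the heart of Signature Version 4, using only the standard library. It is illustrative only: the notebook relies on the `requests_auth_aws_sigv4` package to do the signing, the function name `derive_signing_key` is hypothetical, and the secret key shown is a placeholder.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive a SigV4 signing key by chaining HMACs over date, region, and service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Placeholder credentials; "osis" is the service name passed to AWSSigV4 in this notebook.
signing_key = derive_signing_key("EXAMPLE-SECRET-KEY", "20240528", "us-west-2", "osis")
```

Because the date, region, and service are all mixed into the key, a signature created for one service or region is not valid for another.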

In [17]:
def get_img_desc(image_file_path: str, prompt: str):
    # bedrock = boto3.client(service_name="bedrock-runtime", region_name=g.AWS_REGION, endpoint_url=g.TITAN_URL)
    bedrock = boto3.client(service_name="bedrock-runtime", region_name=region, endpoint_url=endpoint_url)
    # read the file, MAX image size supported is 2048 * 2048 pixels
    with open(image_file_path, "rb") as image_file:
        input_image_b64 = image_file.read().decode('utf-8')

    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": input_image_b64
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        }
    )

    response = bedrock.invoke_model(
        modelId=claude_model_id,
        body=body
    )

    resp_body = json.loads(response['body'].read().decode("utf-8"))
    resp_text = resp_body['content'][0]['text'].replace('"', "'")

    return resp_text

### Download image files from S3 

In [18]:
if config['content_info']['content_type'] == 'pdf':
    # this is for the pdf file images
    image_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_IMG_PREFIX, g.PDF_IMAGE_DIR, g.IMAGE_FILE_EXTN)
    logger.info(f"downloaded {len(image_files)} from s3")
elif config['content_info']['content_type'] == 'slide_deck':
    # download images from S3, we would be converting these to embeddings
    image_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_IMG_PREFIX, g.IMAGE_DIR, g.IMAGE_FILE_EXTN)
    logger.info(f"downloaded {len(image_files)} from s3")

[2024-05-28 16:17:49,225] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_1.jpg to pdf_img/AMD_rec_page_1.jpg
[2024-05-28 16:17:49,289] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_2.jpg to pdf_img/AMD_rec_page_2.jpg
[2024-05-28 16:17:49,348] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_3.jpg to pdf_img/AMD_rec_page_3.jpg
[2024-05-28 16:17:49,386] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_4.jpg to pdf_img/AMD_rec_page_4.jpg
[2024-05-28 16:17:49,436] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img/AMD_rec_page_5.jpg to pdf_img/AMD_rec_page_5.jpg
[2024-05-28 16:17:49,516] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/img

In [19]:
os.makedirs(g.B64_ENCODED_IMAGES_DIR, exist_ok=True)
if config['content_info']['content_type'] == 'pdf':
    file_list: List = glob.glob(os.path.join(g.PDF_IMAGE_DIR, f"*{g.IMAGE_FILE_EXTN}"))
    logger.info(f"there are {len(file_list)} pdf image files in the {g.PDF_IMAGE_DIR} directory for conversion to base64")
elif config['content_info']['content_type'] == 'slide_deck':
    file_list: List = glob.glob(os.path.join(g.IMAGE_DIR, f"*{g.IMAGE_FILE_EXTN}"))
    logger.info(f"there are {len(file_list)} files in the {g.IMAGE_DIR} directory for conversion to base64")

# convert each file to base64 and store the base64 in a new file
b64_image_file_list = list(map(encode_image_to_base64, file_list))
logger.info(f"base64 conversion done, there are {len(b64_image_file_list)} base64 encoded files")

[2024-05-28 16:17:52,017] p4855 {1612947034.py:4} INFO - there are 49 pdf image files in the pdf_img directory for conversion to base64
[2024-05-28 16:17:52,171] p4855 {1612947034.py:11} INFO - base64 conversion done, there are 49 base64 encoded files


### Download text files from S3 

In [20]:
if config['content_info']['content_type'] == 'pdf':
    # this is for the text files extracted from the pdf files
    text_files: List = download_image_files_from_s3(bucket_name, g.BUCKET_PDF_TEXT_PREFIX, g.PDF_TEXT_DIR, g.TEXT_FILE_EXTN)
    logger.info(f"downloaded {len(text_files)} text files from s3")
else:
    logger.error(f"No text files extracted from the content given")

[2024-05-28 16:17:52,257] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/pdf_text/AMD_rec_text_1.txt to multimodal/pdf_txt/AMD_rec_text_1.txt
[2024-05-28 16:17:52,332] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/pdf_text/AMD_rec_text_2.txt to multimodal/pdf_txt/AMD_rec_text_2.txt
[2024-05-28 16:17:52,381] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/pdf_text/AMD_rec_text_3.txt to multimodal/pdf_txt/AMD_rec_text_3.txt
[2024-05-28 16:17:52,426] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/pdf_text/AMD_rec_text_4.txt to multimodal/pdf_txt/AMD_rec_text_4.txt
[2024-05-28 16:17:52,516] p4855 {utils.py:86} INFO - downloaded multimodal-blog2-bucket-121797993273-us-west-2/multimodal/pdf_text/AMD_rec_text_5.txt to multimodal/pdf_txt/AMD_rec_text_5.txt
[2024-05-28 16:17:52,554] p4855 {utils.py:86}

In [21]:
prompt = """
Human: Please provide a detailed description of the image. Describe the overall layout and design of the image. Identify and describe any tables, charts, or other visual elements present, including the specific data or information contained within them. Your response should be extremely detailed and data oriented. Be completely accurate. Follow the instructions below while describing the image - 
  1. Describe and mention each section in the image, including the data below the sections of the image.
  2. Describe the image entirely, including all data on all the sides in the image
  3. Describe all of the numbers in the image, including the text associated to those numbers and data
  4. Describe the colors in the image if any
  5. Be completely detailed and data oriented and do not give a concise description. Give a detailed description. Be accurate and honest, do not make up an answer.
  Assistant:
  """

logger.info(f"prompt used to get image description: {prompt}")

[2024-05-28 16:17:54,751] p4855 {1968744061.py:11} INFO - prompt used to get image description: "
Human: Please provide a detailed description of the image. Describe the overall layout and design of the image. Identify and describe any tables, charts, or other visual elements present, including the specific data or information contained within them. Your response should be extremely detailed and data oriented. Be completely accurate. Follow the instructions below while describing the image - 
  1. Describe and mention each section in the image, including the data below the sections of the image.
  2. Describe the image entirely, including all data on all the sides in the image
  3. Describe all of the numbers in the image, including the text associated to those numbers and data
  4. Describe the colors in the image if any
  5. Be completely detailed and data oriented and do not give a concise description. Give a detailed description. Be accurate and honest, do not make up an answer.
  

### Hybrid Search: Extract Entities from the image, and prefilter the image description with those entities
---

We use hybrid search to help the RAG workflow retrieve the right image description for a given question. With vector similarity alone, the retrieved image (a full page or one part of a split page) might not contain the information the question asks for, because embeddings of unrelated content can sit close to it in the vector database. To reduce this, we extract the entities from each image description (including the file name, to be precise), then extract the entities from the question being asked, and use them to prefilter the image descriptions so that the response is as accurate as possible.
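
The idea can be sketched in a few lines of plain Python: first keep only the documents that share an entity with the question, then rank the survivors by vector similarity. This is a toy illustration, not the retrieval code used with OpenSearch; the function names are hypothetical, entities are assumed to be stored as lists, and the two-dimensional vectors are made up for readability.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_rank(question_entities: Set[str],
                question_vector: Sequence[float],
                documents: List[Dict]) -> List[Dict]:
    """Prefilter by shared entities, then sort the remaining documents by similarity."""
    wanted = {entity.lower() for entity in question_entities}
    candidates = [
        doc for doc in documents
        if wanted & {entity.lower() for entity in doc["entities"]}
    ]
    return sorted(candidates,
                  key=lambda doc: cosine_similarity(question_vector, doc["vector"]),
                  reverse=True)


# Made-up documents: file names echo the logs, vectors are invented.
documents = [
    {"file": "tesla_rec_page_1.jpg", "entities": ["Tesla Inc.", "Argus"], "vector": [0.9, 0.1]},
    {"file": "AMD_rec_page_7.jpg", "entities": ["Argus Research", "Nasdaq: AMD"], "vector": [1.0, 0.0]},
    {"file": "tesla_rec_page_3.jpg", "entities": ["Tesla Inc.", "Model 3"], "vector": [0.2, 0.8]},
]
ranked = hybrid_rank({"Tesla Inc."}, [1.0, 0.0], documents)
```

In this toy data the AMD page has the vector closest to the question, yet it is excluded because it shares no entity with the question, which is the effect the prefilter is meant to have.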

In [22]:
entity_extraction_prompt = """
Please provide a detailed description of the entities present in the image. Entities, are specific pieces of information or objects within a text that carry particular significance. These can be real-world entities like names of people, places, organizations, or dates. Refer to the types of entities: Named entities: These include names of people, organizations, locations, and dates. You can have specific identifiers within this, such as person names or person occupations.

Custom entities: These are entities specific to a particular application or domain, such as product names, medical terms, or technical jargon.

Temporal entities: These are entities related to time, such as dates, times, and durations.

Product entities: Names of products might be grouped together into product entities.

Location entities: These entities categorize or classify items based on location indicators, such as state codes.

Now based on the image, create a list of these entities. Your response should be accurate. Do not make up an answer.
"""

logger.info(f"prompt used to extract entities from the image: {entity_extraction_prompt}")

[2024-05-28 16:17:54,763] p4855 {3845316891.py:15} INFO - prompt used to extract entities from the image: 
Please provide a detailed description of the entities present in the image. Entities, are specific pieces of information or objects within a text that carry particular significance. These can be real-world entities like names of people, places, organizations, or dates. Refer to the types of entities: Named entities: These include names of people, organizations, locations, and dates. You can have specific identifiers within this, such as person names or person occupations.

Custom entities: These are entities specific to a particular application or domain, such as product names, medical terms, or technical jargon.

Temporal entities: These are entities related to time, such as dates, times, and durations.

Product entities: Names of products might be grouped together into product entities.

Location entities: These entities categorize or classify items based on location indicators,

### Part 1: Loop through the Base64 images to (1) get the image description from Claude 3 and (2) get the embedding from Titan Text, then call the OSI pipeline API to ingest the embedding

In [23]:
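from typing import Optional

def get_img_txt_embeddings(bedrock: botocore.client, prompt_data: str) -> Optional[List[float]]:
    # Returns the Titan embedding as a list of floats, or None if the Bedrock call fails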
    body = json.dumps({
        "inputText": prompt_data,
    })    
    try:
        response = bedrock.invoke_model(
            body=body, modelId=config['bedrock_model_info']['titan_model_id'], 
            accept=config['encoding_info']['accept_encoding'], contentType=config['encoding_info']['content_encoding']
        )
        response_body = json.loads(response['body'].read())
        embedding = response_body.get('embedding')
    except Exception as e:
        logger.error(f"exception={e}")
        embedding = None
    return embedding

In [24]:
# function to get the image description and store the embeddings of that text in the image index
def process_image_data(i: int, 
                       file_path: str, 
                       osi_endpoint, 
                       total: int) -> Dict:
    bedrock = boto3.client(service_name="bedrock-runtime", region_name=region, endpoint_url=endpoint_url)
    json_data: Optional[Dict] = None
    # name of the images that are saved (either split in 4 ways or saved as a single page)
    image_name: Optional[str] = None
    try:
        image_file_extn: str = config['content_info']['image_extn']
        bucket_img_prefix: str = os.path.join(config['pdf_dir_info']['bucket_prefix'], 
                                              config['pdf_dir_info']['bucket_img_prefix'])
        logger.info(f"going to convert {file_path} into embeddings")
        # first, get the entities from the image to prefilter the image description with the entities
        entities_extracted = get_img_desc(file_path, entity_extraction_prompt)
        # get the image description and prefilter the image description with the entities extracted from the image
        content_description = entities_extracted + get_img_desc(file_path, prompt)
        print(f"file_path: {file_path}, image description (prefiltered with entities extracted): {content_description}")
        # embedding = get_text_embedding(bedrock, content_description)
        embedding = get_img_txt_embeddings(bedrock, content_description)

        if config['content_info']['content_type'] == 'slide_deck':
            input_image_s3 = f"s3://{bucket_name}/{bucket_img_prefix}/{Path(file_path).stem}{image_file_extn}"
            obj_name = f"{Path(file_path).stem}{image_file_extn}"
        elif config['content_info']['content_type'] == 'pdf':
            input_image_s3 = f"s3://{bucket_name}/{bucket_img_prefix}/{Path(file_path).stem}{image_file_extn}"
            obj_name = f"{Path(file_path).stem}{image_file_extn}"

        data = json.dumps([{
            "file_path": input_image_s3,
            "file_text": content_description,
            "page_number": re.search(r"page_(\d+)_?", obj_name).group(1),
            "metadata": {
                "filename": obj_name,
                "entities": entities_extracted
            },
            "vector_embedding": embedding
        }])
        json_data = {
            "file_type": config['content_info']['image_extn'],
            "file_name": obj_name,
            "text": content_description,
            "entities": entities_extracted,
            "page_number": re.search(r"page_(\d+)_?", obj_name).group(1)
            # "page_number": re.search(r"_(\d+)_?", obj_name).group(1)
            }
        image_dir: str = config['pdf_dir_info']['json_img_dir']
        os.makedirs(image_dir, exist_ok=True)
        fpath = os.path.join(image_dir, f"{Path(file_path).stem}.json")
        print(f"json_file_path: {fpath}")
        Path(fpath).write_text(json.dumps(json_data, default=str, indent=2))
        r = requests.request(
            method='POST', 
            url=osi_endpoint, 
            data=data,
            auth=AWSSigV4('osis'))
        logger.info("Ingesting data into pipeline")
        logger.info(f"image desc: {r.text}")
    except Exception as e:
        logger.error(f"Error processing image {file_path}: {e}")
        json_data: Optional[Dict] = None
    return json_data
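
`process_image_data` derives `page_number` from the object name with the regular expression `page_(\d+)_?`. The sketch below isolates that step so its behavior is easy to see; the wrapper `extract_page_number` is hypothetical and, unlike the inline call above, returns `None` instead of raising when the name contains no page marker.

```python
import re
from typing import Optional

# Same pattern as in process_image_data: the digits that follow "page_".
PAGE_PATTERN = re.compile(r"page_(\d+)_?")


def extract_page_number(object_name: str) -> Optional[str]:
    """Return the page number embedded in an object name, or None if absent."""
    match = PAGE_PATTERN.search(object_name)
    return match.group(1) if match else None


page_of_image = extract_page_number("AMD_rec_page_7.jpg")   # "7"
page_of_text = extract_page_number("AMD_rec_text_1.txt")    # None, no "page_" marker
```

Note that the page number is returned as a string, which matches the `text` type declared for `page_number` in the index mapping.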

In [25]:
@ray.remote
def async_process_image_data(i: int, file_path: str, osi_endpoint, total: int):
    logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
    logger = logging.getLogger(__name__)
    return process_image_data(i, file_path, osi_endpoint, total)

In [26]:
import time
erroneous_page_count: int = 0
n: int = config['parallel_inference_count']
image_chunks = [b64_image_file_list[i:i + n] for i in range(0, len(b64_image_file_list), n)]
for chunk_index, image_chunk in enumerate(image_chunks):
    try:
        st = time.perf_counter()
        logger.info(f"------ getting text description for chunk {chunk_index}/{len(image_chunks)} -----")
        # Iterate over each file path in the chunk and process it individually
        logger.info(f"getting inference for list {chunk_index+1}/{len(image_chunks)}, size of list={len(image_chunk)} ")
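        results = ray.get([async_process_image_data.remote(index, file_path, osi_img_endpoint, len(image_chunk)) for index, file_path in enumerate(image_chunk)])
        # process_image_data returns None for a page that failed, so count those pages as errors too
        erroneous_page_count += sum(1 for result in results if result is None)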
        elapsed_time = time.perf_counter() - st
        logger.info(f"------ completed chunk={chunk_index}/{len(image_chunks)} completed in {elapsed_time} ------ ")
    except Exception as e:
        logger.error(f"Error processing chunk {chunk_index}: {e}")
        erroneous_page_count += len(image_chunk)

logger.info(f"Number of erroneous pdf pages that are not processed: {erroneous_page_count}")

[2024-05-28 16:17:54,816] p4855 {4068516838.py:8} INFO - ------ getting text description for chunk 0/5 -----
[2024-05-28 16:17:54,817] p4855 {4068516838.py:10} INFO - getting inference for list 1/5, size of list=10 
[2024-05-28 16:20:10,954] p4855 {4068516838.py:13} INFO - ------ completed chunk=0/5 completed in 136.13846380299947 ------ 
[2024-05-28 16:20:10,955] p4855 {4068516838.py:8} INFO - ------ getting text description for chunk 1/5 -----
[2024-05-28 16:20:10,957] p4855 {4068516838.py:10} INFO - getting inference for list 2/5, size of list=10 
[2024-05-28 16:22:43,590] p4855 {4068516838.py:13} INFO - ------ completed chunk=1/5 completed in 152.63477001400315 ------ 
[2024-05-28 16:22:43,591] p4855 {4068516838.py:8} INFO - ------ getting text description for chunk 2/5 -----
[2024-05-28 16:22:43,592] p4855 {4068516838.py:10} INFO - getting inference for list 3/5, size of list=10 
[2024-05-28 16:24:56,242] p4855 {4068516838.py:13} INFO - ------ completed chunk=2/5 completed in 132.
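
The loop above splits the list of Base64 files into chunks of `parallel_inference_count` items and processes one chunk at a time. The following sketch shows the chunking on its own, plus one way to count failed pages given that `process_image_data` returns `None` on error. The helper names and the file names are hypothetical; the sizes (49 files, chunks of 10) mirror the log output.

```python
from typing import Dict, List, Optional, Sequence


def split_into_chunks(items: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most chunk_size elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [list(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]


def count_failures(results: Sequence[Optional[Dict]]) -> int:
    """Count results that are None, i.e. pages that could not be processed."""
    return sum(1 for result in results if result is None)


# Hypothetical file names standing in for the 49 Base64 files.
files = [f"page_{number}.b64" for number in range(1, 50)]
chunks = split_into_chunks(files, 10)
chunk_sizes = [len(chunk) for chunk in chunks]   # four chunks of 10 and one of 9
```

Processing in fixed-size chunks bounds how many model calls run at the same time, at the cost of each chunk waiting for its slowest page.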

### Part 2: Loop through the text files to (1) get the text description from Claude 3 and (2) get the embedding from Titan Text, then call the OSI pipeline API to ingest the embedding

In [27]:
# Get a list of all files in the PDF text directory
pdf_txt_file_list = os.listdir(g.PDF_TEXT_DIR)

# Get relative file paths by joining directory path with each file name
pdf_txt_file_list = [os.path.join(g.PDF_TEXT_DIR, file) for file in pdf_txt_file_list]
print(pdf_txt_file_list)

['multimodal/pdf_txt/APPLE_rec_text_5.txt', 'multimodal/pdf_txt/Intel_rec_text_7.txt', 'multimodal/pdf_txt/Boeing_rec_text_2.txt', 'multimodal/pdf_txt/Boeing_rec_text_4.txt', 'multimodal/pdf_txt/Intel_rec_text_2.txt', 'multimodal/pdf_txt/Boeing_rec_text_5.txt', 'multimodal/pdf_txt/AMD_rec_text_5.txt', 'multimodal/pdf_txt/Amazon_rec_text_1.txt', 'multimodal/pdf_txt/AMD_rec_text_1.txt', 'multimodal/pdf_txt/Amazon_rec_text_3.txt', 'multimodal/pdf_txt/Cisco_rec_text_6.txt', 'multimodal/pdf_txt/Microsoft_rec_text_3.txt', 'multimodal/pdf_txt/Amazon_rec_text_5.txt', 'multimodal/pdf_txt/Microsoft_rec_text_1.txt', 'multimodal/pdf_txt/Amazon_rec_text_6.txt', 'multimodal/pdf_txt/Amazon_rec_text_4.txt', 'multimodal/pdf_txt/tesla_rec_text_2.txt', 'multimodal/pdf_txt/Intel_rec_text_6.txt', 'multimodal/pdf_txt/Intel_rec_text_1.txt', 'multimodal/pdf_txt/Cisco_rec_text_5.txt', 'multimodal/pdf_txt/Intel_rec_text_4.txt', 'multimodal/pdf_txt/Intel_rec_text_3.txt', 'multimodal/pdf_txt/Microsoft_rec_text_6.
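
As the output above shows, `os.listdir` returns the files in arbitrary order, and it returns every entry in the directory regardless of type. If a reproducible order matters (for example, so that `file_number` stays stable across runs), the listing could be filtered and sorted. This is a minimal sketch. The helper name `list_text_files` is hypothetical, and the `.txt` default is an assumption about the value of `g.TEXT_FILE_EXTN`.

```python
from pathlib import Path
from typing import List


def list_text_files(text_dir: str, extension: str = ".txt") -> List[str]:
    """Return the paths of regular files in text_dir with the given extension.

    The paths are sorted lexicographically so that repeated runs see the
    files in the same order.
    """
    directory = Path(text_dir)
    return sorted(
        str(path)
        for path in directory.iterdir()
        if path.is_file() and path.suffix == extension
    )
```

In the notebook this might be called as `list_text_files(g.PDF_TEXT_DIR, g.TEXT_FILE_EXTN)`. Note that lexicographic sorting places `text_10` before `text_2`, so a numeric sort key would be needed if page order is important.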

In [28]:
txt_page_index = 1
os.makedirs(g.JSON_TEXT_DIR, exist_ok=True)
for txt_file in pdf_txt_file_list:
    logger.info(f"going to convert {txt_file} into embeddings")
    with open(txt_file, 'r') as file:
        extracted_pdf_text = file.read()
    embedding = get_text_embedding(bedrock, extracted_pdf_text)
    # Build the S3 URI and object name of the source text file; adjust the prefix and extension to match your configuration
    input_text_s3 = f"s3://{bucket_name}/{g.BUCKET_PDF_TEXT_PREFIX}/{Path(txt_file).stem}{g.TEXT_FILE_EXTN}"
    obj_name = f"{Path(txt_file).stem}{g.TEXT_FILE_EXTN}"

    data = json.dumps([{
        "file_path": input_text_s3, 
        "file_text": extracted_pdf_text,
        "file_number": txt_page_index, 
        "metadata": {
            "filename": obj_name, 
            "desc": "" 
        }, 
        "vector_embedding": embedding
    }])
    json_data = {
        "file_type": g.TEXT_FILE_EXTN,
        "file_name": Path(txt_file).stem,
        "text": extracted_pdf_text, 
        "page_number": re.search(r"text_(\d+)_?", obj_name).group(1)
    }
    # Save the text and its metadata as a local JSON file (g.JSON_TEXT_DIR was created before the loop)
    fpath = os.path.join(g.JSON_TEXT_DIR, f"{Path(txt_file).stem}.json")
    print(f"json_file_path: {fpath}")
    Path(fpath).write_text(json.dumps(json_data, default=str, indent=2))
    r = requests.request(
        method='POST',
        url=osi_text_endpoint,
        data=data,
        auth=AWSSigV4('osis'))

    logger.info("Ingesting data into pipeline")
    logger.info(f"Response: {txt_page_index} - {r.text}")
    txt_page_index += 1

[2024-05-28 16:29:51,774] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/APPLE_rec_text_5.txt into embeddings
[2024-05-28 16:29:52,118] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:52,216] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:52,217] p4855 {790141951.py:39} INFO - Response: 1 - 200 OK
[2024-05-28 16:29:52,218] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Intel_rec_text_7.txt into embeddings


json_file_path: pdf_text_json_dir/APPLE_rec_text_5.json


[2024-05-28 16:29:52,408] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:52,449] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:52,450] p4855 {790141951.py:39} INFO - Response: 2 - 200 OK
[2024-05-28 16:29:52,451] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Boeing_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/Intel_rec_text_7.json


[2024-05-28 16:29:52,839] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:52,935] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:52,936] p4855 {790141951.py:39} INFO - Response: 3 - 200 OK
[2024-05-28 16:29:52,937] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Boeing_rec_text_4.txt into embeddings


json_file_path: pdf_text_json_dir/Boeing_rec_text_2.json


[2024-05-28 16:29:53,114] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:53,158] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:53,160] p4855 {790141951.py:39} INFO - Response: 4 - 200 OK
[2024-05-28 16:29:53,160] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Intel_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/Boeing_rec_text_4.json


[2024-05-28 16:29:53,584] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:53,625] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:53,626] p4855 {790141951.py:39} INFO - Response: 5 - 200 OK
[2024-05-28 16:29:53,627] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Boeing_rec_text_5.txt into embeddings


json_file_path: pdf_text_json_dir/Intel_rec_text_2.json


[2024-05-28 16:29:53,942] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:53,975] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:53,976] p4855 {790141951.py:39} INFO - Response: 6 - 200 OK
[2024-05-28 16:29:53,976] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/AMD_rec_text_5.txt into embeddings


json_file_path: pdf_text_json_dir/Boeing_rec_text_5.json


[2024-05-28 16:29:54,362] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:54,403] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:54,404] p4855 {790141951.py:39} INFO - Response: 7 - 200 OK
[2024-05-28 16:29:54,405] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Amazon_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/AMD_rec_text_5.json


[2024-05-28 16:29:54,754] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:54,786] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:54,788] p4855 {790141951.py:39} INFO - Response: 8 - 200 OK
[2024-05-28 16:29:54,789] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/AMD_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/Amazon_rec_text_1.json


[2024-05-28 16:29:55,166] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:55,208] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:55,209] p4855 {790141951.py:39} INFO - Response: 9 - 200 OK
[2024-05-28 16:29:55,210] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Amazon_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/AMD_rec_text_1.json


[2024-05-28 16:29:55,608] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:55,653] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:55,656] p4855 {790141951.py:39} INFO - Response: 10 - 200 OK
[2024-05-28 16:29:55,659] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Cisco_rec_text_6.txt into embeddings


json_file_path: pdf_text_json_dir/Amazon_rec_text_3.json


[2024-05-28 16:29:55,968] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:56,003] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:56,004] p4855 {790141951.py:39} INFO - Response: 11 - 200 OK
[2024-05-28 16:29:56,005] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Microsoft_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/Cisco_rec_text_6.json


[2024-05-28 16:29:56,266] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:56,307] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:56,309] p4855 {790141951.py:39} INFO - Response: 12 - 200 OK
[2024-05-28 16:29:56,310] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Amazon_rec_text_5.txt into embeddings


json_file_path: pdf_text_json_dir/Microsoft_rec_text_3.json


[2024-05-28 16:29:56,523] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:56,586] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:56,587] p4855 {790141951.py:39} INFO - Response: 13 - 200 OK
[2024-05-28 16:29:56,587] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Microsoft_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/Amazon_rec_text_5.json


[2024-05-28 16:29:56,976] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:57,014] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:57,016] p4855 {790141951.py:39} INFO - Response: 14 - 200 OK
[2024-05-28 16:29:57,017] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Amazon_rec_text_6.txt into embeddings


json_file_path: pdf_text_json_dir/Microsoft_rec_text_1.json


[2024-05-28 16:29:57,344] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:57,388] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:57,389] p4855 {790141951.py:39} INFO - Response: 15 - 200 OK
[2024-05-28 16:29:57,389] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Amazon_rec_text_4.txt into embeddings


json_file_path: pdf_text_json_dir/Amazon_rec_text_6.json


[2024-05-28 16:29:57,658] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:57,711] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:57,717] p4855 {790141951.py:39} INFO - Response: 16 - 200 OK
[2024-05-28 16:29:57,718] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/tesla_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/Amazon_rec_text_4.json


[2024-05-28 16:29:58,099] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:58,137] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:58,139] p4855 {790141951.py:39} INFO - Response: 17 - 200 OK
[2024-05-28 16:29:58,140] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Intel_rec_text_6.txt into embeddings


json_file_path: pdf_text_json_dir/tesla_rec_text_2.json


[2024-05-28 16:29:58,304] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:58,348] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:58,349] p4855 {790141951.py:39} INFO - Response: 18 - 200 OK
[2024-05-28 16:29:58,350] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Intel_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/Intel_rec_text_6.json


[2024-05-28 16:29:58,622] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:58,701] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:58,702] p4855 {790141951.py:39} INFO - Response: 19 - 200 OK
[2024-05-28 16:29:58,703] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Cisco_rec_text_5.txt into embeddings


json_file_path: pdf_text_json_dir/Intel_rec_text_1.json


[2024-05-28 16:29:58,878] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:58,923] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:58,925] p4855 {790141951.py:39} INFO - Response: 20 - 200 OK
[2024-05-28 16:29:58,926] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Intel_rec_text_4.txt into embeddings


json_file_path: pdf_text_json_dir/Cisco_rec_text_5.json


[2024-05-28 16:29:59,135] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:59,174] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:59,176] p4855 {790141951.py:39} INFO - Response: 21 - 200 OK
[2024-05-28 16:29:59,177] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Intel_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/Intel_rec_text_4.json


[2024-05-28 16:29:59,540] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:59,586] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:59,587] p4855 {790141951.py:39} INFO - Response: 22 - 200 OK
[2024-05-28 16:29:59,588] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Microsoft_rec_text_6.txt into embeddings


json_file_path: pdf_text_json_dir/Intel_rec_text_3.json


[2024-05-28 16:29:59,926] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:29:59,968] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:29:59,970] p4855 {790141951.py:39} INFO - Response: 23 - 200 OK
[2024-05-28 16:29:59,971] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Amazon_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/Microsoft_rec_text_6.json


[2024-05-28 16:30:00,230] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:00,285] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:00,286] p4855 {790141951.py:39} INFO - Response: 24 - 200 OK
[2024-05-28 16:30:00,287] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Boeing_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/Amazon_rec_text_2.json


[2024-05-28 16:30:00,573] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:00,619] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:00,622] p4855 {790141951.py:39} INFO - Response: 25 - 200 OK
[2024-05-28 16:30:00,624] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Cisco_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/Boeing_rec_text_1.json


[2024-05-28 16:30:00,881] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:00,922] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:00,924] p4855 {790141951.py:39} INFO - Response: 26 - 200 OK
[2024-05-28 16:30:00,926] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/AMD_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/Cisco_rec_text_3.json


[2024-05-28 16:30:01,302] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:01,354] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:01,359] p4855 {790141951.py:39} INFO - Response: 27 - 200 OK
[2024-05-28 16:30:01,360] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/tesla_rec_text_5.txt into embeddings


json_file_path: pdf_text_json_dir/AMD_rec_text_2.json


[2024-05-28 16:30:01,710] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:01,759] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:01,761] p4855 {790141951.py:39} INFO - Response: 28 - 200 OK
[2024-05-28 16:30:01,762] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Cisco_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/tesla_rec_text_5.json


[2024-05-28 16:30:02,050] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:02,113] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:02,115] p4855 {790141951.py:39} INFO - Response: 29 - 200 OK
[2024-05-28 16:30:02,116] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/AMD_rec_text_4.txt into embeddings


json_file_path: pdf_text_json_dir/Cisco_rec_text_2.json


[2024-05-28 16:30:02,364] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:02,403] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:02,404] p4855 {790141951.py:39} INFO - Response: 30 - 200 OK
[2024-05-28 16:30:02,405] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/AMD_rec_text_6.txt into embeddings


json_file_path: pdf_text_json_dir/AMD_rec_text_4.json


[2024-05-28 16:30:02,669] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:02,712] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:02,713] p4855 {790141951.py:39} INFO - Response: 31 - 200 OK
[2024-05-28 16:30:02,714] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Microsoft_rec_text_7.txt into embeddings


json_file_path: pdf_text_json_dir/AMD_rec_text_6.json


[2024-05-28 16:30:03,045] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:03,105] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:03,108] p4855 {790141951.py:39} INFO - Response: 32 - 200 OK
[2024-05-28 16:30:03,109] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/tesla_rec_text_4.txt into embeddings


json_file_path: pdf_text_json_dir/Microsoft_rec_text_7.json


[2024-05-28 16:30:03,441] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:03,476] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:03,477] p4855 {790141951.py:39} INFO - Response: 33 - 200 OK
[2024-05-28 16:30:03,478] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Boeing_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/tesla_rec_text_4.json
json_file_path: pdf_text_json_dir/Boeing_rec_text_3.json


[2024-05-28 16:30:04,053] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:04,111] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:04,112] p4855 {790141951.py:39} INFO - Response: 34 - 200 OK
[2024-05-28 16:30:04,113] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Microsoft_rec_text_4.txt into embeddings
[2024-05-28 16:30:04,460] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:04,508] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:04,509] p4855 {790141951.py:39} INFO - Response: 35 - 200 OK
[2024-05-28 16:30:04,510] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/AMD_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/Microsoft_rec_text_4.json


[2024-05-28 16:30:04,789] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:04,860] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:04,862] p4855 {790141951.py:39} INFO - Response: 36 - 200 OK
[2024-05-28 16:30:04,865] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/APPLE_rec_text_6.txt into embeddings


json_file_path: pdf_text_json_dir/AMD_rec_text_3.json


[2024-05-28 16:30:05,075] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:05,145] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:05,147] p4855 {790141951.py:39} INFO - Response: 37 - 200 OK
[2024-05-28 16:30:05,148] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Cisco_rec_text_4.txt into embeddings


json_file_path: pdf_text_json_dir/APPLE_rec_text_6.json


[2024-05-28 16:30:05,436] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:05,480] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:05,481] p4855 {790141951.py:39} INFO - Response: 38 - 200 OK
[2024-05-28 16:30:05,482] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Microsoft_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/Cisco_rec_text_4.json


[2024-05-28 16:30:05,854] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:05,898] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:05,899] p4855 {790141951.py:39} INFO - Response: 39 - 200 OK
[2024-05-28 16:30:05,900] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/tesla_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/Microsoft_rec_text_2.json


[2024-05-28 16:30:06,273] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:06,316] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:06,318] p4855 {790141951.py:39} INFO - Response: 40 - 200 OK
[2024-05-28 16:30:06,320] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/APPLE_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/tesla_rec_text_1.json


[2024-05-28 16:30:06,551] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:06,593] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:06,596] p4855 {790141951.py:39} INFO - Response: 41 - 200 OK
[2024-05-28 16:30:06,596] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/tesla_rec_text_3.txt into embeddings


json_file_path: pdf_text_json_dir/APPLE_rec_text_3.json


[2024-05-28 16:30:07,019] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:07,058] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:07,060] p4855 {790141951.py:39} INFO - Response: 42 - 200 OK
[2024-05-28 16:30:07,062] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Cisco_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/tesla_rec_text_3.json


[2024-05-28 16:30:07,468] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:07,528] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:07,530] p4855 {790141951.py:39} INFO - Response: 43 - 200 OK
[2024-05-28 16:30:07,531] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/AMD_rec_text_7.txt into embeddings


json_file_path: pdf_text_json_dir/Cisco_rec_text_1.json


[2024-05-28 16:30:07,730] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:07,766] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:07,767] p4855 {790141951.py:39} INFO - Response: 44 - 200 OK
[2024-05-28 16:30:07,768] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/APPLE_rec_text_1.txt into embeddings


json_file_path: pdf_text_json_dir/AMD_rec_text_7.json


[2024-05-28 16:30:08,018] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:08,062] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:08,063] p4855 {790141951.py:39} INFO - Response: 45 - 200 OK
[2024-05-28 16:30:08,064] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Microsoft_rec_text_5.txt into embeddings


json_file_path: pdf_text_json_dir/APPLE_rec_text_1.json


[2024-05-28 16:30:08,435] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:08,479] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:08,481] p4855 {790141951.py:39} INFO - Response: 46 - 200 OK
[2024-05-28 16:30:08,482] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/APPLE_rec_text_4.txt into embeddings


json_file_path: pdf_text_json_dir/Microsoft_rec_text_5.json


[2024-05-28 16:30:08,753] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:08,812] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:08,816] p4855 {790141951.py:39} INFO - Response: 47 - 200 OK
[2024-05-28 16:30:08,818] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/APPLE_rec_text_2.txt into embeddings


json_file_path: pdf_text_json_dir/APPLE_rec_text_4.json


[2024-05-28 16:30:09,221] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:09,266] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:09,267] p4855 {790141951.py:39} INFO - Response: 48 - 200 OK
[2024-05-28 16:30:09,268] p4855 {790141951.py:4} INFO - going to convert multimodal/pdf_txt/Intel_rec_text_5.txt into embeddings


json_file_path: pdf_text_json_dir/APPLE_rec_text_2.json


[2024-05-28 16:30:09,700] p4855 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[2024-05-28 16:30:09,760] p4855 {790141951.py:38} INFO - Ingesting data into pipeline
[2024-05-28 16:30:09,762] p4855 {790141951.py:39} INFO - Response: 49 - 200 OK


json_file_path: pdf_text_json_dir/Intel_rec_text_5.json
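
Every request in the run above returned `200 OK`, but the loop only logs the response text and would carry on silently if the pipeline rejected a document. Below is a minimal sketch of one way to check the status code and retry a failed request. The helper `post_with_retry` and its parameters are hypothetical and not part of this notebook. The request is passed in as a callable so that the sketch does not depend on a live endpoint.

```python
import time
from typing import Any, Callable


def post_with_retry(
    send: Callable[[], Any],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a 2xx response or attempts run out.

    send is any zero-argument callable returning an object with
    status_code and text attributes (such as requests.Response).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        if 200 <= response.status_code < 300:
            return response
        if attempt < max_attempts:
            # Linear backoff: wait a little longer after each failure.
            sleep(backoff_seconds * attempt)
    raise RuntimeError(
        f"ingestion failed after {max_attempts} attempts: "
        f"{response.status_code} {response.text}"
    )
```

In the loop above, `send` could be `lambda: requests.request(method='POST', url=osi_text_endpoint, data=data, auth=AWSSigV4('osis'))`, and a `RuntimeError` could be caught to record the file as failed rather than stopping the whole run.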
