# Multimodal PDF RAG

## Problem Statement
A RAG application that answers queries about an uploaded PDF containing text, images, and tables.

## Solution

1. Upload a PDF file containing text, images, and tables.
2. Load the PDF and separate its text, images, and tables.
3. Summarize the images and tables using a vision LLM.
4. Embed and store the text and the image and table summaries in a Chroma vector DB.
5. Create a retriever for the indexed vector DB.
6. Create a text chain that answers text-related questions using an LLM.
7. Create a full chain, combining a multimodal LLM with the text chain, that answers questions about both text and images.

### Load Environment variables

In [1]:
# Import Libraries
import os
from dotenv import load_dotenv

In [2]:
# Load
load_dotenv()

True
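
`load_dotenv()` returning `True` only indicates that at least one variable was loaded, not that the keys needed later are present. A minimal sketch of an explicit check follows. The variable name `GOOGLE_API_KEY` is an assumption (the notebook does not show the contents of `.env`), and `missing_env_vars` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# GOOGLE_API_KEY is an assumed name; adjust it to whatever your .env defines.
missing = missing_env_vars(["GOOGLE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Failing early here is usually easier to debug than an authentication error raised from inside a later model call.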

### Load PDF file

In [1]:
# Import libraries
from unstructured.partition.pdf import partition_pdf



In [4]:
# Set the PDF path
pdf_path = "./pdf-docs/rag_llm.pdf"

In [5]:
# Use unstructured to read the PDF
elements = partition_pdf(
    filename=pdf_path,
    strategy="hi_res",                             # High-resolution layout parser (best for complex PDFs)
    extract_images_in_pdf=True,                    # Extract images from the PDF
    extract_image_block_types=["Image", "Table"],  # Extract image blocks that are images or tables
    extract_image_block_to_payload=False,          # Don't embed image bytes directly into `el.metadata["image"]`
    extract_image_block_output_dir="./extracted_images"  # Store extracted images here
)



In [6]:
# Print the content types and preview
for el in elements:
    print(f"Type: {el.category}")
    print(f"Preview: {el.text[:100]}...")
    print("="* 80)

Type: Image
Preview: HIGH SCHOOL EDITION @® Journal of Student Research...
Type: Header
Preview: Volume 12 Issue 4 (2023)...
Type: Title
Preview: A Retrieval-Augmented Generation Based Large Language Model Benchmarked on a Novel Dataset...
Type: NarrativeText
Preview: Kieran Pichai...
Type: NarrativeText
Preview: Menlo School...
Type: Title
Preview: ABSTRACT...
Type: NarrativeText
Preview: The evolution of natural language processing has seen marked advancements, particularly with the adv...
Type: Title
Preview: Introduction...
Type: NarrativeText
Preview: The evolution of natural language processing models has seen signiﬁcant strides from rule-based appr...
Type: NarrativeText
Preview: Curiously, little attention has been devoted to dissecting the individual components of RAG and thei...
Type: NarrativeText
Preview: ISSN: 2167-1907...
Type: UncategorizedText
Preview: www.JSR.org/hs...
Type: UncategorizedText
Preview: 1...
Type: Image
Preview: HIGH SCHOOL EDITION @® Journal of Student 

In [7]:
# Print the content of images
for el in elements:
    if el.category == 'Image':
        print(f"Type: {el.category}")
        print(f"Metadata: {el.metadata.to_dict()}")
        print(f"Preview: {el.text}")
        print("="* 80)

Type: Image
Metadata: {'coordinates': {'points': ((np.float64(200.0), np.float64(80.00087805555565)), (np.float64(200.0), np.float64(164.9987113888888)), (np.float64(624.9891666666667), np.float64(164.9987113888888)), (np.float64(624.9891666666667), np.float64(80.00087805555565))), 'system': 'PixelSpace', 'layout_width': 1700, 'layout_height': 2200}, 'last_modified': '2025-08-01T09:08:08', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'image_path': './extracted_images/figure-1-1.jpg', 'file_directory': './pdf-docs', 'filename': 'rag_llm.pdf'}
Preview: HIGH SCHOOL EDITION @® Journal of Student Research
Type: Image
Metadata: {'coordinates': {'points': ((np.float64(200.0), np.float64(80.00087805555565)), (np.float64(200.0), np.float64(164.9987113888888)), (np.float64(624.9891666666667), np.float64(164.9987113888888)), (np.float64(624.9891666666667), np.float64(80.00087805555565))), 'system': 'PixelSpace', 'layout_width': 1700, 'layout_height': 2200}, 'last_modifie

In [8]:
# Print the content of tables
for el in elements:
    if el.category == 'Table':
        print(f"Type: {el.category}")
        print(f"Preview: {el.text}")
        print("="* 80)

Type: Table
Preview: Context LLM Embed for Similarity Score Yes No GPT Palm OpenAI Embedding Palm Embedding 1 x x x 2 x x x 3 x x x 4 x x x 5 x x x 6 x x x 7 x x x 8 x x x Score 0.75 0.92 0.93 0.88 0.997 0.996 0.91 0.897


In [9]:
# Group each element type
text_elements = [el for el in elements if el.category in ["NarrativeText", "Title", "List"] and el.text]
table_elements = [el for el in elements if el.category == "Table"]
image_elements = [el for el in elements if el.category == "Image"]
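
The grouping above relies on hard-coded category names, so it can help to count the categories the parser actually produced before filtering. The sketch below uses stand-in objects that only have a `category` attribute; `count_categories` is a hypothetical helper. Whether list entries are labelled `"ListItem"` rather than `"List"` is an assumption to verify against your installed `unstructured` version.

```python
from collections import Counter
from types import SimpleNamespace

def count_categories(elements):
    """Count how many elements fall into each category."""
    return Counter(el.category for el in elements)

# Stand-in objects for illustration; in the notebook, pass `elements` instead.
sample = [
    SimpleNamespace(category="Title"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="NarrativeText"),
]
for category, count in count_categories(sample).most_common():
    print(f"{category}: {count}")
```

Any category that shows up in the counts but not in the filter lists is silently left out of the index.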

#### Remove Duplicate Images

In [10]:
# Hash and Filter Unique Image Files
import os
import hashlib

def hash_image(file_path):
    """Generate a hash for an image file"""
    with open(file_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

# Directory where images are saved
image_dir = "./extracted_images"

# Track hashes and remove duplicates
hash_set = set()
unique_files = []

for file_name in os.listdir(image_dir):
    file_path = os.path.join(image_dir, file_name)
    if os.path.isfile(file_path):
        img_hash = hash_image(file_path)
        if img_hash not in hash_set:
            hash_set.add(img_hash)
            unique_files.append(file_path)
        else:
            os.remove(file_path)  # Delete duplicate file
            print(f"Removed duplicate image: {file_path}")



Removed duplicate image: ./extracted_images/figure-6-6.jpg
Removed duplicate image: ./extracted_images/figure-4-4.jpg
Removed duplicate image: ./extracted_images/figure-2-2.jpg
Removed duplicate image: ./extracted_images/figure-9-11.jpg
Removed duplicate image: ./extracted_images/figure-8-10.jpg
Removed duplicate image: ./extracted_images/figure-1-1.jpg
Removed duplicate image: ./extracted_images/figure-5-5.jpg
Removed duplicate image: ./extracted_images/figure-7-9.jpg
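
The cell above deletes files as it goes, and `os.listdir` returns entries in arbitrary order, so which copy of a duplicate survives is not deterministic. That is probably why `figure-3-3.jpg` is kept here while `figure-1-1.jpg` is removed. A non-destructive variant, sketched under the assumption that sorted file names are an acceptable tie-break, could look like this (`find_unique_images` is a hypothetical name):

```python
import hashlib
from pathlib import Path

def find_unique_images(image_dir):
    """Return (unique_paths, duplicate_paths) without deleting anything."""
    seen_hashes = set()
    unique, duplicates = [], []
    # Sorting makes the choice of the surviving copy reproducible.
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            duplicates.append(str(path))
        else:
            seen_hashes.add(digest)
            unique.append(str(path))
    return unique, duplicates
```

Because nothing is removed, the function can be re-run safely, and the `duplicates` list can be reviewed before any cleanup.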


In [11]:
# Filter image elements to only include unique image files
valid_image_names = {os.path.basename(path) for path in unique_files}

unique_image_elements = []
for el in elements:
    if el.category != "Image":
        continue  # only image elements are filtered here
    image_path = el.metadata.to_dict().get("image_path", "")
    if image_path and os.path.basename(image_path) in valid_image_names:
        unique_image_elements.append(el)

# Keep only the deduplicated image elements
image_elements = unique_image_elements

In [12]:
# Print the content of images
for el in image_elements:
    if el.category == 'Image':
        print(f"Type: {el.category}")
        print(f"Metadata: {el.metadata.to_dict()}")
        print(f"Preview: {el.text}")
        print("="* 80)

Type: Image
Metadata: {'coordinates': {'points': ((np.float64(200.0), np.float64(80.00087805555565)), (np.float64(200.0), np.float64(164.9987113888888)), (np.float64(624.9891666666667), np.float64(164.9987113888888)), (np.float64(624.9891666666667), np.float64(80.00087805555565))), 'system': 'PixelSpace', 'layout_width': 1700, 'layout_height': 2200}, 'last_modified': '2025-08-01T09:08:08', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 3, 'image_path': './extracted_images/figure-3-3.jpg', 'file_directory': './pdf-docs', 'filename': 'rag_llm.pdf'}
Preview: HIGH SCHOOL EDITION @® Journal of Student Research
Type: Image
Metadata: {'coordinates': {'points': ((np.float64(570.8333333333333), np.float64(389.16812472222216)), (np.float64(570.8333333333333), np.float64(947.3625458333333)), (np.float64(1129.0277544444443), np.float64(947.3625458333333)), (np.float64(1129.0277544444443), np.float64(389.16812472222216))), 'system': 'PixelSpace', 'layout_width': 1700, 'layout_h

### Summarize the tables and Images with Vision model

In [None]:
# Import libraries
import google.generativeai as genai
from PIL import Image


In [37]:
for model in genai.list_models():
    print(model.name, model.supported_generation_methods)

models/embedding-gecko-001 ['embedText', 'countTextTokens']
models/gemini-1.5-pro-latest ['generateContent', 'countTokens']
models/gemini-1.5-pro-002 ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-pro ['generateContent', 'countTokens']
models/gemini-1.5-flash-latest ['generateContent', 'countTokens']
models/gemini-1.5-flash ['generateContent', 'countTokens']
models/gemini-1.5-flash-002 ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-flash-8b ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-1.5-flash-8b-001 ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-1.5-flash-8b-latest ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-2.5-pro-preview-03-25 ['generateContent', 'countTokens', 'createCachedContent', 'batchGenerateContent']
models/gemini-2.5-flash-preview-05-20 ['generateContent', 'countTokens', 'createCachedContent', 'batchGenerateContent']
models/gemini-2.5-fl

In [24]:
# Encode Image
import base64

def encode_image(image_path):
    """Return the image file's contents as a base64-encoded string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
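
`encode_image` is not called anywhere else in this notebook. A base64 string is typically needed when an API expects the image inline, for example as a data URL inside a chat message. The sketch below shows that conversion; `image_to_data_url` is a hypothetical helper, and whether a given model client accepts data URLs is an assumption to check against its documentation.

```python
import base64
import mimetypes

def image_to_data_url(image_path):
    """Build a data URL (MIME type plus base64 payload) from an image file."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"  # fallback for unknown extensions
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```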

In [38]:
# load Vision Model
# vision_model = ChatGoogleGenerativeAI(model="gemini-pro-vision")
vision_model =  genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")

vision_model

genai.GenerativeModel(
    model_name='models/gemini-1.5-pro-latest',
    generation_config={},
    safety_settings={},
    tools=None,
    system_instruction=None,
    cached_content=None
)

In [39]:
# Define Prompt
prompt_text = """You are an assistant tasked with summarizing tables and images for retrieval.
        These summaries will be embedded and used to retrieve the raw image or raw table elements.
        Give a concise summary of the image or table that will be optimized for retrieval.
        """

In [None]:
# Summarize an image, or a table rendered as an image
def summarize_image(image_path):
    """Summarize an image (including tables) using Gemini Vision."""
    try:
        img = Image.open(image_path)
        response = vision_model.generate_content(
            [prompt_text, img]
        )
        return response.text
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return None
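
Since `summarize_image` returns `None` on any error, a transient failure such as a rate limit or timeout drops that image from the index. One option is to retry with exponential backoff. This is a sketch only: `retry_until_result` is a hypothetical helper, and the attempt count and delays are placeholder values.

```python
import time

def retry_until_result(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args) until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        result = func(*args)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    return None

# Hypothetical use with this notebook's function:
# summary = retry_until_result(summarize_image, img_path)
```

The `sleep` parameter exists so the waiting can be stubbed out when testing.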


In [41]:
# Apply to All Unique Image Files

# Store results
image_summaries = {}

for img_path in unique_files:
    print(f"Summarizing: {img_path}")
    summary = summarize_image(img_path)
    if summary:
        image_summaries[img_path] = summary
        print(f">>> Summary: {summary[:200]}...\n")
    else:
        print(f">>> Failed to summarize {img_path}\n")


Summarizing: ./extracted_images/figure-3-3.jpg
>>> Summary: Journal of Student Research, High School Edition...

Summarizing: ./extracted_images/figure-6-8.jpg
>>> Summary: Flowchart of a system processing user questions.  The system uses several API calls (SerpAPI for Google Search Results, Palm, and ChatGPT) and embedding methods (Palm Embed and OpenAI Embed) to find a...

Summarizing: ./extracted_images/figure-6-7.jpg
>>> Summary: Outputted answers combine: Google search results (SerpApi), existing Palm/OpenAI data, and primary data gathered from Amazon inhabitants....

Summarizing: ./extracted_images/table-7-1.jpg
>>> Summary: This table shows similarity scores for 8 different comparisons.  The comparisons vary by whether context was used (yes/no), which LLM was used (GPT/Palm), and which embedding model was used (OpenAI/Pa...



### Load Documents

In [42]:
# Import Libraries
from langchain_core.documents import Document

In [45]:
# Put the image summaries into docs
image_docs = [
    Document(page_content=summary, metadata={"source": path, "type": "image_or_table"})
    for path, summary in image_summaries.items()
]

In [46]:
image_docs

[Document(metadata={'source': './extracted_images/figure-3-3.jpg', 'type': 'image_or_table'}, page_content='Journal of Student Research, High School Edition'),
 Document(metadata={'source': './extracted_images/figure-6-8.jpg', 'type': 'image_or_table'}, page_content="Flowchart of a system processing user questions.  The system uses several API calls (SerpAPI for Google Search Results, Palm, and ChatGPT) and embedding methods (Palm Embed and OpenAI Embed) to find and process information relevant to the user's query.  It also scrapes a premade Q&A list.  All of this information is fed into a large language model to produce an outputted answer.\n"),
 Document(metadata={'source': './extracted_images/figure-6-7.jpg', 'type': 'image_or_table'}, page_content='Outputted answers combine: Google search results (SerpApi), existing Palm/OpenAI data, and primary data gathered from Amazon inhabitants.'),
 Document(metadata={'source': './extracted_images/table-7-1.jpg', 'type': 'image_or_table'}, pag

In [48]:
for el in text_elements:
    print(f"Category:{el.category}")
    print(f"Metadata:{el.metadata.to_dict()}")
    print(el.text)

Category:Title
Metadata:{'detection_class_prob': 0.5880249738693237, 'coordinates': {'points': ((np.float64(200.0), np.float64(207.7227020263672)), (np.float64(200.0), np.float64(339.69999999999976)), (np.float64(1411.8000000000002), np.float64(339.69999999999976)), (np.float64(1411.8000000000002), np.float64(207.7227020263672))), 'system': 'PixelSpace', 'layout_width': 1700, 'layout_height': 2200}, 'last_modified': '2025-08-01T09:08:08', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': './pdf-docs', 'filename': 'rag_llm.pdf', 'parent_id': '8727570aa228bd375e82f4dd71a6eb10'}
A Retrieval-Augmented Generation Based Large Language Model Benchmarked on a Novel Dataset
Category:NarrativeText
Metadata:{'detection_class_prob': 0.763323962688446, 'coordinates': {'points': ((np.float64(200.0), np.float64(387.167236328125)), (np.float64(200.0), np.float64(423.79999999999995)), (np.float64(404.93333333333334), np.float64(423.79999999999995)), (np.float64(40

In [51]:
# Put text elements into docs
text_docs = [
    Document(page_content=el.text, metadata={"source":f"{el.metadata.to_dict().get('file_directory', '')}/{el.metadata.to_dict().get('filename', '')}", "type":"text", "page_number": el.metadata.to_dict().get("page_number") })
    for el in text_elements
]
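
Storing every element as its own `Document` means short headings such as `'ABSTRACT'` or `'Menlo School'` become standalone vectors, and a query that matches a heading can retrieve the heading without the paragraph beneath it. One way to reduce this is to merge each title with the text that follows it. The sketch below works on plain `(category, text)` pairs and `group_by_title` is a hypothetical helper. `unstructured` also ships a `chunk_by_title` function in `unstructured.chunking.title`, which may be the better choice in practice.

```python
def group_by_title(items):
    """Group (category, text) pairs into sections that start at each Title."""
    sections = []
    current = []
    for category, text in items:
        if category == "Title" and current:
            # A new title closes the previous section.
            sections.append("\n".join(current))
            current = []
        current.append(text)
    if current:
        sections.append("\n".join(current))
    return sections

# Hypothetical use with this notebook's variables:
# sections = group_by_title((el.category, el.text) for el in text_elements)
```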

In [52]:
text_docs

[Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='A Retrieval-Augmented Generation Based Large Language Model Benchmarked on a Novel Dataset'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='Kieran Pichai'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='Menlo School'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='ABSTRACT'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='The evolution of natural language processing has seen marked advancements, particularly with the advent of models like BERT, Transformers, and GPT variants, with recent additions like GPT and Bard. This paper investigates the Retrieval-Augmented Generation (RAG) framework, providing insights into its modular design and the impact of i

In [53]:
# Merge the text documents with the image/table summary documents
combined_content = text_docs + image_docs
combined_content


[Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='A Retrieval-Augmented Generation Based Large Language Model Benchmarked on a Novel Dataset'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='Kieran Pichai'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='Menlo School'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='ABSTRACT'),
 Document(metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 1}, page_content='The evolution of natural language processing has seen marked advancements, particularly with the advent of models like BERT, Transformers, and GPT variants, with recent additions like GPT and Bard. This paper investigates the Retrieval-Augmented Generation (RAG) framework, providing insights into its modular design and the impact of i

### Store embeddings in chromaDB

In [1]:
# Import libraries
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

In [2]:
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
embedding



HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [3]:
# Directory to store Chroma DB
persist_directory = "./chroma_db"

In [58]:
# Store embedded documents
vectorstore = Chroma.from_documents(
    documents=combined_content,
    embedding=embedding,
    persist_directory=persist_directory
)
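
When no `ids` are passed, each document gets a fresh random ID, so re-running this cell against the same `persist_directory` adds the same content again. The retrieval test further down, where one heading comes back four times, is consistent with that. A sketch of deterministic IDs follows. `stable_doc_id` and `unique_by_id` are hypothetical helpers, and hashing the file name, page number, and text is an assumed choice of key.

```python
import hashlib
import os

def stable_doc_id(page_content, source="", page_number=None):
    """Derive a deterministic ID from the file name, page number, and text."""
    file_name = os.path.basename(source)
    key = "\x1f".join([file_name, str(page_number), page_content])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def unique_by_id(records):
    """Map ID -> (page_content, source, page_number), keeping the first of each."""
    unique = {}
    for page_content, source, page_number in records:
        doc_id = stable_doc_id(page_content, source, page_number)
        unique.setdefault(doc_id, (page_content, source, page_number))
    return unique

# Hypothetical use with this notebook's variables (not executed here):
# by_id = {
#     stable_doc_id(d.page_content, d.metadata.get("source", ""), d.metadata.get("page_number")): d
#     for d in combined_content
# }
# vectorstore = Chroma.from_documents(
#     documents=list(by_id.values()), ids=list(by_id.keys()),
#     embedding=embedding, persist_directory=persist_directory,
# )
```

With stable IDs, storing the same document twice should overwrite the existing entry instead of adding a copy. Only the file name is hashed, not the full path, so the same PDF uploaded from a temporary directory maps to the same IDs.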

### Create Retriever

In [4]:
vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [5]:
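# k must be passed via search_kwargs; a bare top_k argument is not applied
# (the four results shown below reflect the default k=4)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})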

In [6]:
# Test retriever
query = "What are the Potential Implications for LLMs?"
retrieved_docs = retriever.invoke(query)
retrieved_docs


[Document(id='43218715-be66-41a3-92e4-57363d9fcdd2', metadata={'page_number': 5, 'source': './pdf-docs/rag_llm.pdf', 'type': 'text'}, page_content='Potential Implications for LLMs'),
 Document(id='c92bd696-571c-4f39-b9c4-0c635e7e2ba3', metadata={'source': './pdf-docs/rag_llm.pdf', 'type': 'text', 'page_number': 5}, page_content='Potential Implications for LLMs'),
 Document(id='5bcd5e9a-a254-4793-a051-3096343ceb26', metadata={'source': '/tmp/gradio/05ca13cd17065c23018d59941e0a2747f0a67dc09b668272a6eacd28fae7a724/rag_llm.pdf', 'page_number': 5, 'type': 'text'}, page_content='Potential Implications for LLMs'),
 Document(id='83f71180-a792-4abf-8a7d-e02304687f7c', metadata={'page_number': 5, 'source': '/tmp/gradio/05ca13cd17065c23018d59941e0a2747f0a67dc09b668272a6eacd28fae7a724/rag_llm.pdf', 'type': 'text'}, page_content='Potential Implications for LLMs')]

### Create RAG Pipeline

In [65]:
# Import libraries
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

In [67]:
llm = ChatGoogleGenerativeAI(model="models/gemini-1.5-pro")
llm

ChatGoogleGenerativeAI(model='models/gemini-1.5-pro', google_api_key=SecretStr('**********'), client=<google.ai.generativelanguage_v1beta.services.generative_service.client.GenerativeServiceClient object at 0x7fbd54b07680>, default_metadata=())

In [None]:
# Define prompt
template = """You are a helpful assistant for question-answering tasks.
Your task is to answer the user's query using the provided context.
Your answer should be concise and to the point.

Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

In [70]:
# Define RAG chain
chain = ({"question": RunnablePassthrough(), "context": retriever}
         | prompt
         | llm
         | StrOutputParser())
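
In the chain above, the retriever's output (a list of `Document` objects) is placed into `{context}` as is, so the prompt receives the list's string representation, metadata included. A common pattern is to format the documents into plain text first. The sketch below uses a small stand-in class with the same two attributes as a LangChain `Document`. `Doc` and `format_docs` are hypothetical names, and the `[page N]` marker is an assumed convention.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join retrieved documents into one context string with page markers."""
    parts = []
    for doc in docs:
        page = doc.metadata.get("page_number", "?")
        parts.append(f"[page {page}] {doc.page_content}")
    return "\n\n".join(parts)

# Hypothetical use in the chain (not executed here):
# chain = ({"question": RunnablePassthrough(), "context": retriever | format_docs}
#          | prompt | llm | StrOutputParser())
```

Keeping the page number in the context also lets the model cite where an answer came from.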


In [71]:
response = chain.invoke("What are the Potential Implications for LLMs?")



In [72]:
response

'RAG could evolve into a more nuanced and adaptable framework customizable for specialized datasets and applications across diverse domains (legal, medical, historical, anthropological).'

In [73]:
chain.invoke("What is the ISSN of this document?")



'2167-1907'