# Multimodal RAG Using CLIP Embeddings

### Data Ingestion

1. Extract text and images from PDF
2. Split the text into chunks and embed each chunk using CLIP
3. Perform image embedding using CLIP
4. Store the embeddings in a FAISS vector store

In [24]:
# Import libraries
import pymupdf
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import io
import base64
import os
from dotenv import load_dotenv
import numpy as np
from langchain_community.vectorstores import FAISS


In [3]:
# Load environment variables
load_dotenv()

True

In [4]:
# Initialize CLIP model for unified embeddings
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [5]:
# Put the CLIP model in evaluation (inference) mode
clip_model.eval()

CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05,

#### Embedding Functions

In [6]:
# Define image embedding function
def embed_image(image_data):
    """Embed image using CLIP"""
    if isinstance(image_data, str):  # If path
        image = Image.open(image_data).convert("RGB")
    else:  # If PIL Image
        image = image_data
    
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)
        # Normalize embeddings to unit vector
        features = features / features.norm(dim=-1, keepdim=True)
        return features.squeeze().numpy()

# Define text embedding function    
def embed_text(text):
    """Embed text using CLIP."""
    inputs = clip_processor(
        text=text, 
        return_tensors="pt", 
        padding=True,
        truncation=True,
        max_length=77  # CLIP's max token length
    )
    with torch.no_grad():
        features = clip_model.get_text_features(**inputs)
        # Normalize embeddings
        features = features / features.norm(dim=-1, keepdim=True)
        return features.squeeze().numpy()
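
Both embedding functions divide by the vector norm, so every stored vector has unit length and a plain dot product between two embeddings equals their cosine similarity. A minimal pure-Python sketch of that property, using made-up 3-dimensional vectors instead of real 512-dimensional CLIP outputs (`l2_normalize`, `dot`, and `cosine_similarity` are illustrative helpers, not part of this notebook):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the same step the embed functions apply)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy stand-ins for a text embedding and an image embedding
text_vec = [3.0, 4.0, 0.0]
image_vec = [4.0, 3.0, 0.0]

unit_text = l2_normalize(text_vec)
unit_image = l2_normalize(image_vec)

# After normalization, the dot product matches the cosine similarity
print(dot(unit_text, unit_image), cosine_similarity(text_vec, image_vec))
```

This is why text and image vectors can share one index here: they have the same dimensionality and scale, so a single similarity measure ranks both.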

#### Load and Process PDF

In [7]:
pdf_path = "./pdf-docs/rag_llm.pdf"

In [9]:
doc = pymupdf.open(pdf_path)

In [13]:
# Storage for all documents and embeddings
all_docs = []
all_embeddings = []
image_data_store = {}  # Store actual image data for LLM

# Text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
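
To make `chunk_size=500` and `chunk_overlap=100` concrete, here is a simplified sketch of overlapping character windows. It is an assumption-laden illustration only: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and line breaks first, so its real chunks usually end on natural boundaries and can be shorter than 500 characters. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows where neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 1200 characters of filler text
sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_chunks(sample)

print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.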

In [14]:
doc

Document('./pdf-docs/rag_llm.pdf')

In [15]:
for i,page in enumerate(doc):
    ## process text
    text=page.get_text()
    if text.strip():
        ##create temporary document for splitting
        temp_doc = Document(page_content=text, metadata={"page": i, "type": "text"})
        text_chunks = splitter.split_documents([temp_doc])

        #Embed each chunk using CLIP
        for chunk in text_chunks:
            embedding = embed_text(chunk.page_content)
            all_embeddings.append(embedding)
            all_docs.append(chunk)



    # Process images. Three steps per image:
    #   1. Convert the PDF image to PIL format
    #   2. Store it as base64 for the multimodal LLM (sent later as a data URL)
    #   3. Create a CLIP embedding for retrieval

    for img_index, img in enumerate(page.get_images(full=True)):
        try:
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            
            # Convert to PIL Image
            pil_image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            
            # Create unique identifier
            image_id = f"page_{i}_img_{img_index}"
            
            # Store image as base64 for later use with the multimodal LLM
            buffered = io.BytesIO()
            pil_image.save(buffered, format="PNG")
            img_base64 = base64.b64encode(buffered.getvalue()).decode()
            image_data_store[image_id] = img_base64
            
            # Embed image using CLIP
            embedding = embed_image(pil_image)
            all_embeddings.append(embedding)
            
            # Create document for image
            image_doc = Document(
                page_content=f"[Image: {image_id}]",
                metadata={"page": i, "type": "image", "image_id": image_id}
            )
            all_docs.append(image_doc)
            
        except Exception as e:
            print(f"Error processing image {img_index} on page {i}: {e}")
            continue

doc.close()
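
The base64 step in the loop above exists because each retrieved image is later sent to the LLM as a `data:image/png;base64,...` URL. A small stdlib-only sketch of that encoding and its reverse, using the 8-byte PNG signature as a stand-in for real image bytes (`to_data_url` and `from_data_url` are hypothetical helpers introduced for illustration):

```python
import base64


def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data URL string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_url(data_url):
    """Recover the raw bytes from a base64 data URL."""
    _header, encoded = data_url.split(",", 1)
    return base64.b64decode(encoded)


# Stand-in payload: the fixed signature that starts every PNG file
png_signature = b"\x89PNG\r\n\x1a\n"

url = to_data_url(png_signature)
round_trip = from_data_url(url)
print(url)
```

Base64 text is roughly a third larger than the raw bytes, so keeping every image in `image_data_store` trades memory for not having to reopen the PDF at query time.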

In [16]:
all_docs

[Document(metadata={'page': 0, 'type': 'text'}, page_content='A Retrieval-Augmented Generation Based Large \nLanguage Model Benchmarked on a Novel Dataset \nKieran Pichai \nMenlo School \nABSTRACT \nThe evolution of natural language processing has seen marked advancements, particularly with the advent of models \nlike BERT, Transformers, and GPT variants, with recent additions like GPT and Bard. This paper investigates the \nRetrieval-Augmented Generation (RAG) framework, providing insights into its modular design and the impact of its'),
 Document(metadata={'page': 0, 'type': 'text'}, page_content='constituent modules on performance. Leveraging a unique dataset from Amazon Rainforest natives and biologists, our \nresearch demonstrates the signiﬁcance of preserving indigenous cultures and biodiversity. The experiment employs a \ncustomizable RAG methodology, allowing for the interchangeability of various components, such as the base language \nmodel and similarity score tools. Findings

In [19]:
# Create embedding array
embeddings_array = np.array(all_embeddings)
embeddings_array

array([[ 0.04805334, -0.00206825, -0.00703998, ...,  0.04569346,
         0.03057196,  0.00390264],
       [ 0.0174334 , -0.00340782,  0.00347105, ...,  0.04686106,
         0.00965531, -0.0386954 ],
       [ 0.03071707,  0.00183673, -0.03335669, ...,  0.00919674,
         0.03857673,  0.00054351],
       ...,
       [ 0.01572528, -0.01759314, -0.01674261, ..., -0.10801527,
        -0.01391945, -0.02823574],
       [ 0.03574401, -0.02967369, -0.03134512, ...,  0.01510686,
        -0.02042773, -0.04929977],
       [-0.02729444,  0.01050514,  0.01541131, ...,  0.05849979,
         0.02525214, -0.02767638]], shape=(84, 512), dtype=float32)

In [25]:
# Create custom FAISS index since we have precomputed embeddings
vector_store = FAISS.from_embeddings(
    text_embeddings=[(doc.page_content, emb) for doc, emb in zip(all_docs, embeddings_array)],
    embedding=None,  # We're using precomputed embeddings
    metadatas=[doc.metadata for doc in all_docs]
)

`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


### Data Retrieval
1. Define the LLM (OpenAI gpt-4.1)
2. Create the retriever
3. Perform a similarity search to retrieve context
4. Generate the answer

In [41]:
# Import Libraries
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableMap

In [27]:
llm = ChatOpenAI(model="gpt-4.1")
llm

ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x7677dc153620>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x7677d62b7500>, root_client=<openai.OpenAI object at 0x7677dc4aef30>, root_async_client=<openai.AsyncOpenAI object at 0x7677dc4bdb50>, model_name='gpt-4.1', model_kwargs={}, openai_api_key=SecretStr('**********'))

In [28]:
# Create Retriever function
def retrieve_multimodal(query, k=5):
    """Unified retrieval using CLIP embeddings for both text and images."""
    # Embed query using CLIP
    query_embedding = embed_text(query)
    
    # Search in unified vector store
    results = vector_store.similarity_search_by_vector(
        embedding=query_embedding,
        k=k
    )
    
    return results
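
Conceptually, `similarity_search_by_vector` returns the `k` stored vectors closest to the query vector. The sketch below does the same thing by brute force in pure Python. It assumes the vector store ranks by L2 distance (which I believe is the default for LangChain's FAISS wrapper, but treat that as an assumption) and shows that, for unit-length vectors, L2 ranking and dot-product ranking agree. All function names and the toy 2-dimensional vectors are illustrative.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_l2(query, vectors, k=5):
    """Indices of the k vectors with the smallest L2 distance to the query."""
    scored = [(squared_l2(query, vec), idx) for idx, vec in enumerate(vectors)]
    return [idx for _, idx in heapq.nsmallest(k, scored)]


def top_k_by_dot(query, vectors, k=5):
    """Indices of the k vectors with the largest dot product with the query."""
    scored = [(dot(query, vec), idx) for idx, vec in enumerate(vectors)]
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Toy unit vectors standing in for stored text and image embeddings
store = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
query = [0.8, 0.6]

nearest_l2 = top_k_by_l2(query, store, k=3)
nearest_dot = top_k_by_dot(query, store, k=3)
print(nearest_l2, nearest_dot)
```

The two rankings match because, for unit vectors, the squared distance is `2 - 2 * dot(a, b)`, so a smaller distance always means a larger cosine similarity.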

In [29]:
def create_multimodal_message(query, retrieved_docs):
    """Create a message with both text and images for GPT-4V."""
    content = []
    
    # Add the query
    content.append({
        "type": "text",
        "text": f"Question: {query}\n\nContext:\n"
    })
    
    # Separate text and image documents
    text_docs = [doc for doc in retrieved_docs if doc.metadata.get("type") == "text"]
    image_docs = [doc for doc in retrieved_docs if doc.metadata.get("type") == "image"]
    
    # Add text context
    if text_docs:
        text_context = "\n\n".join([
            f"[Page {doc.metadata['page']}]: {doc.page_content}"
            for doc in text_docs
        ])
        content.append({
            "type": "text",
            "text": f"Text excerpts:\n{text_context}\n"
        })
    
    # Add images
    for doc in image_docs:
        image_id = doc.metadata.get("image_id")
        if image_id and image_id in image_data_store:
            content.append({
                "type": "text",
                "text": f"\n[Image from page {doc.metadata['page']}]:\n"
            })
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_data_store[image_id]}"
                }
            })
    
    # Add instruction
    content.append({
        "type": "text",
        "text": "\n\nPlease answer the question based on the provided text and images."
    })
    
    return HumanMessage(content=content)

In [30]:
def multimodal_pdf_rag_pipeline(query):
    """Main pipeline for multimodal RAG."""
    # Retrieve relevant documents
    context_docs = retrieve_multimodal(query, k=5)
    
    # Create multimodal message
    message = create_multimodal_message(query, context_docs)
    
    # Get response from the LLM
    response = llm.invoke([message])
    
    # Print retrieved context info
    print(f"\nRetrieved {len(context_docs)} documents:")
    for doc in context_docs:
        doc_type = doc.metadata.get("type", "unknown")
        page = doc.metadata.get("page", "?")
        if doc_type == "text":
            preview = doc.page_content[:100] + "..." if len(doc.page_content) > 100 else doc.page_content
            print(f"  - Text from page {page}: {preview}")
        else:
            print(f"  - Image from page {page}")
    print("\n")
    
    return response.content

In [33]:
query = "What are the Potential Implications for LLMs?"
print(f"\nQuery: {query}")
print("-" * 50)
answer = multimodal_pdf_rag_pipeline(query)
print(f"Answer: {answer}")
print("=" * 70)


Query: What are the Potential Implications for LLMs?
--------------------------------------------------

Retrieved 5 documents:
  - Text from page 6: In the initial world of LLM, in order to incrementally increase its performance engineers of these m...
  - Text from page 1: information is conspicuously lacking in the vast repository of knowledge available on the internet. ...
  - Text from page 4: and contextual sensitivity. 
Moreover, the experiment is poised to challenge the prevailing approach...
  - Text from page 0: proven to be a challenging endeavor, it has become evident that Retrieval-Augmented Generation (RAG)...
  - Text from page 7: and the data collected speaks a lot to the importance of each aspect of an LLM. 
In terms of expecte...


Answer: Based on the provided excerpts, the **potential implications for Large Language Models (LLMs)** are as follows:

---

### 1. **Performance Plateau and the Limits of Scaling**
- The text notes that models like OpenAI’s GPT and Googl

In [45]:
def multimodal_pdf_rag_pipeline_with_chain(query):
    """Main pipeline for multimodal RAG."""
    # Create the RAG chain using Runnable components
    rag_chain = (
        RunnableLambda(lambda query: {"query": query, "docs": retrieve_multimodal(query, k= 5)})
        | RunnableLambda(lambda inputs: {
            "query": inputs["query"],
            "message": create_multimodal_message(inputs["query"], inputs["docs"])
        })
        | RunnableLambda(lambda inputs: llm.invoke([inputs["message"]]))
        | StrOutputParser()
    )
    
    # Get response from the LLM
    response = rag_chain.invoke(query)
    
    # Retrieve again (default k=5) only to display the context;
    # the chain above does not return the documents it used
    context_docs = retrieve_multimodal(query)
    print(f"\nRetrieved {len(context_docs)} documents:")
    for doc in context_docs:
        doc_type = doc.metadata.get("type", "unknown")
        page = doc.metadata.get("page", "?")
        if doc_type == "text":
            preview = doc.page_content[:100] + "..." if len(doc.page_content) > 100 else doc.page_content
            print(f"  - Text from page {page}: {preview}")
        else:
            print(f"  - Image from page {page}")
    print("\n")
    
    return response
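
The function above calls `retrieve_multimodal` twice: once inside the chain and once more to print the context. One possible alternative, sketched here with plain callables and stub components, is to retrieve once and return the documents alongside the answer. `run_rag` and the `fake_*` functions are hypothetical stand-ins for `retrieve_multimodal`, `create_multimodal_message`, and `llm.invoke`; they are not part of this notebook.

```python
def run_rag(query, retrieve, build_message, generate, k=5):
    """Run retrieve -> build message -> generate, keeping the docs for display."""
    docs = retrieve(query, k)             # single retrieval call
    message = build_message(query, docs)  # assemble the prompt from the same docs
    answer = generate(message)
    return {"answer": answer, "docs": docs}


# Stub components so the sketch runs without a vector store or an API key
calls = {"retrieve": 0}


def fake_retrieve(query, k):
    calls["retrieve"] += 1
    return [f"chunk {n} for {query!r}" for n in range(k)]


def fake_build_message(query, docs):
    return f"Question: {query}\nContext:\n" + "\n".join(docs)


def fake_generate(message):
    return f"(stub answer based on {message.count('chunk')} chunks)"


result = run_rag("What is RAG?", fake_retrieve, fake_build_message, fake_generate, k=3)
print(result["answer"])
print(len(result["docs"]), "documents, retrieved", calls["retrieve"], "time(s)")
```

Returning the documents with the answer also guarantees that the printed context is exactly what the model saw.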

In [46]:
query = "What are the Potential Implications for LLMs?"
print(f"\nQuery: {query}")
print("-" * 50)
answer = multimodal_pdf_rag_pipeline_with_chain(query)
print(f"Answer: {answer}")
print("=" * 70)


Query: What are the Potential Implications for LLMs?
--------------------------------------------------

Retrieved 5 documents:
  - Text from page 6: In the initial world of LLM, in order to incrementally increase its performance engineers of these m...
  - Text from page 1: information is conspicuously lacking in the vast repository of knowledge available on the internet. ...
  - Text from page 4: and contextual sensitivity. 
Moreover, the experiment is poised to challenge the prevailing approach...
  - Text from page 0: proven to be a challenging endeavor, it has become evident that Retrieval-Augmented Generation (RAG)...
  - Text from page 7: and the data collected speaks a lot to the importance of each aspect of an LLM. 
In terms of expecte...


Answer: Certainly! Based on your provided excerpts, here is a synthesis to answer:

**Question:** What are the Potential Implications for LLMs?

**Answer:**

The provided excerpts suggest several key potential implications for the developm

In [47]:
query = "Explain the Venn Diagram of Data Sources for RAG"
print(f"\nQuery: {query}")
print("-" * 50)
answer = multimodal_pdf_rag_pipeline_with_chain(query)
print(f"Answer: {answer}")
print("=" * 70)


Query: Explain the Venn Diagram of Data Sources for RAG
--------------------------------------------------

Retrieved 5 documents:
  - Text from page 0: proven to be a challenging endeavor, it has become evident that Retrieval-Augmented Generation (RAG)...
  - Text from page 2: larity as the basis for our similarity score, deﬁned by the formula: 
similarity(𝐯𝐯𝑠𝑠, 𝐯𝐯𝑡𝑡) =
𝐯𝐯𝑠𝑠 ...
  - Text from page 2: integrates a retriever model that sources relevant context and a generator model that synthesizes th...
  - Text from page 6: In the initial world of LLM, in order to incrementally increase its performance engineers of these m...
  - Text from page 1: information is conspicuously lacking in the vast repository of knowledge available on the internet. ...


Answer: Certainly! Let’s break down the Venn Diagram of Data Sources for **Retrieval-Augmented Generation (RAG)** based on the context provided.

---

### **RAG Overview**
From Page 2:  
- RAG *integrates a retriever model* (fetches rel

In [48]:
query = "How question of a user is answered using API calls?"
print(f"\nQuery: {query}")
print("-" * 50)
answer = multimodal_pdf_rag_pipeline_with_chain(query)
print(f"Answer: {answer}")
print("=" * 70)


Query: How question of a user is answered using API calls?
--------------------------------------------------

Retrieved 5 documents:
  - Text from page 6: In the initial world of LLM, in order to incrementally increase its performance engineers of these m...
  - Text from page 3: answer. 
The language model ℒ is ﬁne-tuned on 𝒟𝒟, optimizing the weights to minimize the loss functi...
  - Text from page 2: integrates a retriever model that sources relevant context and a generator model that synthesizes th...
  - Text from page 6: into further detail, the code when prompted with a user question compares the user question to the Q...
  - Text from page 0: proven to be a challenging endeavor, it has become evident that Retrieval-Augmented Generation (RAG)...


Answer: Certainly! Here’s how a user’s question is answered using API calls, based strictly on your provided text:

---

**Step-by-Step Process:**

1. **User submits a question:**  
   The user's question (query 𝑞) is sent, typically