<a href="https://colab.research.google.com/github/arutraj/.githubcl/blob/main/Building_a_Multimodal_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Multi-Modal Models with RAG**

## **Introduction**


In many documents, valuable information is captured in both text and images. Traditional Retrieval-Augmented Generation (RAG) systems often overlook the rich data encapsulated in images, leading to incomplete retrieval and synthesis processes. The emergence of multimodal Large Language Models (LLMs), like GPT-4V, offers an opportunity to enhance RAG systems by effectively incorporating images into the retrieval and generation workflow.

### **Goal**
To create a robust Multi-modal RAG system that integrates text and image data for comprehensive information retrieval and synthesis.

## **Different Ways to Implement a Multi-modal RAG System**

Many documents contain a mixture of content types, including text and images.

Yet, information captured in images is lost in most RAG applications.

With the emergence of multimodal LLMs, like [GPT-4V](https://openai.com/research/gpt-4v-system-card), it is worth considering how to utilize images in RAG:

`Option 1:`

* Use multimodal embeddings (such as [CLIP](https://openai.com/research/clip)) to embed images and text
* Retrieve both using similarity search
* Pass raw images and text chunks to a multimodal LLM for answer synthesis

`Option 2:`

* Use a multimodal LLM (such as [GPT-4V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images
* Embed and retrieve text
* Pass text chunks to an LLM for answer synthesis

`Option 3:`

* Use a multimodal LLM (such as [GPT-4V](https://openai.com/research/gpt-4v-system-card), [LLaVA](https://llava.hliu.cc/), or [FUYU-8b](https://www.adept.ai/blog/fuyu-8b)) to produce text summaries from images
* Embed and retrieve image summaries with a reference to the raw image
* Pass raw images and text chunks to a multimodal LLM for answer synthesis   


## **What is the Scope of this Notebook?**

---

This coding notebook highlights `Option 3`.

* We will use [Unstructured](https://unstructured.io/) to parse text and tables from documents (PDFs).
* We will upload image files that have already been extracted with an OCR tool. The extraction itself is outside the scope of this session, since the focus here is on multimodal RAG.
* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) with [Chroma](https://www.trychroma.com/) to store raw text and images along with their summaries for retrieval.
* We will use GPT-4V for both image summarization (for retrieval) and final answer synthesis from a joint review of images and texts (or tables).

---

- **Packages**

  In addition to the below pip packages, you will also need `poppler` ([installation instructions](https://pdf2image.readthedocs.io/en/latest/installation.html)) and `tesseract` ([installation instructions](https://tesseract-ocr.github.io/tessdoc/Installation.html)) in your system.
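
Before running the rest of the notebook, it can help to confirm that both system tools are visible on the `PATH`. The snippet below is a minimal sketch: it assumes poppler is exposed through its `pdftoppm` binary and Tesseract through `tesseract`, and the helper name `find_tools` is ours.

```python
import shutil


def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Assumption: poppler ships the `pdftoppm` binary; Tesseract ships `tesseract`.
for tool, location in find_tools(["pdftoppm", "tesseract"]).items():
    print(f"{tool}: {location or 'NOT FOUND'}")
```

A `NOT FOUND` entry means the tool still has to be installed by following the instructions linked above.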

In [1]:
!pip install langchain-openai


In [2]:
! pip install -U langchain-community


In [3]:
! pip install -U langchain-chroma


In [4]:
! pip install pillow pydantic lxml matplotlib chromadb tiktoken



In [5]:
!pip install pypdf



## **Data Loading**
  
Let's look at a [popular blog](https://cloudedjudgement.substack.com/p/clouded-judgement-111023) by Jamin Ball.

This is a great use-case because much of the information is captured in images (of tables or charts).

---

For our use case, we will upload the PDF file and the images separately so that we can focus on the multimodal RAG part. As mentioned earlier, in a real-world use case you would first extract the images from the PDF files with a licensed or open-source OCR tool (pytesseract, for example) and then proceed with the multimodal RAG part.
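
As a rough illustration of that extraction step, the sketch below pulls embedded images out of a PDF with `pypdf` (installed above) rather than a full OCR tool, so it is a swapped-in technique and not the method used to prepare this notebook's images. It assumes the charts are stored as embedded image objects, which is not true of every PDF, and pypdf typically needs Pillow for image access. The function names and any output directory are hypothetical.

```python
import os


def save_page_images(pages, out_dir):
    """Write every embedded image found on the given pages to out_dir.

    `pages` is any iterable of objects exposing `.images`, where each image
    has `.name` and `.data` (as pypdf page objects do).
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for page_number, page in enumerate(pages, start=1):
        for image in page.images:
            # Prefix with the page number so files sort in reading order.
            out_path = os.path.join(out_dir, f"p{page_number:03d}_{image.name}")
            with open(out_path, "wb") as fh:
                fh.write(image.data)
            saved.append(out_path)
    return saved


def extract_pdf_images(pdf_path, out_dir):
    """Extract embedded images from a PDF file using pypdf."""
    from pypdf import PdfReader  # imported lazily; requires `pip install pypdf`

    return save_page_images(PdfReader(pdf_path).pages, out_dir)
```

Calling `extract_pdf_images` on a PDF path writes one file per embedded image; pages that contain only vector graphics or scanned text yield nothing.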

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
from langchain_community.document_loaders import PyPDFLoader
# Ensure that the dataset is present at the specified path
loader = PyPDFLoader("/content/cj.pdf")

docs = loader.load()
tables = [] # Ignore w/ basic pdf loader
texts = [d.page_content for d in docs]

In [7]:
for i in texts:
  print(i)

11/14/23, 8:35 PM Clouded Judgement 11.10.23 - by Jamin Ball
https://cloudedjudgement.substack.com/p/clouded-judgement-111023 1/21Clouded Judgement 11.10.23 - OpenAI
Updates + Datadog Gives the All-Clear?
JAMIN BALL
NOV 10, 2023
2 Share
Every week I’ll provide updates on the latest trends in cloud so ware companies. Follow along to
stay up to date!
Open AI U pdates
OpenAI had their big developer day this week, and I wanted to call out two key announcements
(and trends): increasing context windows and decreasing costs.
When I think about the monetization of AI (and which “layers” monetize  rst) I’ve always
thought it would follow the below order, with each layer lagging the one that comes before it.
1. Raw silicon (chips like Nvidia bought in large quantities to build out infra to service
upcoming demand).
2. Model providers (OpenAI, Anthropic, etc as companies start building out AI).
35
Type your email... Subscribe
11/14/23, 8:35 PM Clouded Judgement 11.10.23 - by Jamin Ball
https://cl

In [8]:
len(texts)

21

## **Multi-vector retriever**

Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to index image (and / or text, table) summaries, but retrieve raw images (along with raw texts or tables).

### **Text and Table summaries**

We will use GPT-4 to produce text summaries.


Summaries are used to retrieve raw tables and / or raw chunks of text.
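
The summarization cells call the OpenAI API, so a key has to be available. One common pattern, sketched here, is to read it from the `OPENAI_API_KEY` environment variable (which `ChatOpenAI` also checks by default) and fail early with a clear message if it is missing. The helper name `require_api_key` is ours.

```python
import os


def require_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it or set os.environ[{env_var!r}] first."
        )
    return key
```

In Colab, one option is to set the variable once at the top of the session, for example with `getpass.getpass()`, so the key never appears in the notebook source.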

In [9]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [10]:
# Generate summaries of text elements
def generate_text_summaries(texts, tables, summarize_texts=False):
    """
    Summarize text elements
    texts: List of str
    tables: List of str
    summarize_texts: Bool to summarize texts
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Give a concise summary of the table or text that is well optimized for retrieval. Table or text: {element} """
    prompt = ChatPromptTemplate.from_template(prompt_text)

    # Text summary chain
    # ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable,
    # so set that variable before running this cell instead of hard-coding a key
    model = ChatOpenAI(temperature=0, model="gpt-4")
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []
    table_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts and summarize_texts:
        text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})
    elif texts:
        text_summaries = texts

    # Apply to tables if tables are provided
    if tables:
        table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

    return text_summaries, table_summaries


# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(
    texts, tables, summarize_texts=True
)


In [None]:
for i in text_summaries:
  print(i)


In [None]:
len(text_summaries)

21

### **Image summaries**

We will use [GPT-4V](https://openai.com/research/gpt-4v-system-card) to produce the image summaries.

The API docs [here](https://platform.openai.com/docs/guides/vision):

* We pass base64 encoded images
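
The notebook's image cell hard-codes `image/jpeg` in the data URL. If the images may also be PNGs, the MIME type can be inferred from the file's leading bytes before the URL is built. This is a minimal sketch with a hypothetical helper `to_data_url`; it covers only two formats.

```python
import base64

# Leading "magic" bytes for the two formats handled in this sketch.
_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def to_data_url(image_bytes):
    """Build a base64 data URL, picking the MIME type from the magic bytes."""
    for signature, mime in _SIGNATURES.items():
        if image_bytes.startswith(signature):
            encoded = base64.b64encode(image_bytes).decode("utf-8")
            return f"data:{mime};base64,{encoded}"
    raise ValueError("Unrecognized image format (only JPEG and PNG are handled here)")
```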

In [None]:
import base64
import os

from langchain_core.messages import HumanMessage


def encode_image(image_path):
    """Getting the base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

fpath = "/content/drive/MyDrive/Datasets/"
fname = "cj.pdf"

def image_summarize(img_base64, prompt):
    """Make image summary"""
    # ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable
    chat = ChatOpenAI(model="gpt-4o", max_tokens=1024)

    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content


def generate_img_summaries(path):
    """
    Generate summaries and base64 encoded strings for images
    path: Directory containing the extracted JPEG image files
    """

    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval."""

    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.lower().endswith((".jpg", ".jpeg")):
            img_path = os.path.join(path, img_file)
            base64_image = encode_image(img_path)
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(base64_image, prompt))

    return img_base64_list, image_summaries


# Image summaries
img_base64_list, image_summaries = generate_img_summaries(fpath)

In [None]:
image_summaries

['Line graph showing EV/NTM Revenue Multiples for high, mid, and low growth medians from Jan-2015 to Oct-2023. High growth median in blue, mid growth median in red, low growth median in orange. Peaks around Apr-21 for high growth, with values 11.8x (high), 7.4x (mid), and 3.9x (low) in Oct-2023. Labeled "Clouded Judgement @jaminball" and "Altimeter."',
 'Scatter plot titled "NTM Rev Growth vs NTM Rev Multiple" with data points representing various companies. The x-axis shows NTM revenue growth percentage, and the y-axis shows NTM revenue multiple. A dotted trend line with equation y = 29.462x + 1.5138 and R² = 0.3821 is included. Companies like MDB, PLTR, SNOW, and NET are labeled on the plot. Graph by Altimeter, attributed to @jaminball.',
 'Table of reported revenue and next quarter revenue for various companies including On24, Squarespace, Jamf, Kaltura, CS Disco, 2U, Olo, Alteryx, RingCentral, Klaviyo, Datadog, Amplitude, Hubspot, BigCommerce, Twilio, and Wix.com. Columns include a

### **Add to vectorstore**

Add raw docs and doc summaries to [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary):

* Store the raw texts, tables, and images in the `docstore`.
* Store the text summaries, table summaries, and image summaries in the `vectorstore` for efficient semantic retrieval.
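
The split between the two stores can be illustrated without any external service. The toy class below is not LangChain's implementation: it scores summaries by simple word overlap instead of embeddings, and all names are hypothetical. It only shows how a shared `doc_id` lets a search over summaries return the raw content.

```python
import re
import uuid


def _tokens(text):
    """Lowercase word set used for the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyMultiVectorStore:
    def __init__(self):
        self.summaries = []  # (doc_id, summary) pairs: stands in for the vectorstore
        self.docstore = {}   # doc_id -> raw content: stands in for the docstore

    def add(self, summaries, raw_contents):
        for summary, raw in zip(summaries, raw_contents):
            doc_id = str(uuid.uuid4())
            self.summaries.append((doc_id, summary))
            self.docstore[doc_id] = raw

    def search(self, query, k=1):
        """Rank summaries by word overlap with the query, return raw contents."""
        q = _tokens(query)
        ranked = sorted(
            self.summaries,
            key=lambda pair: len(q & _tokens(pair[1])),
            reverse=True,
        )
        return [self.docstore[doc_id] for doc_id, _ in ranked[:k]]


store = ToyMultiVectorStore()
store.add(
    ["Table of EV/NTM revenue multiples", "Line graph of growth medians"],
    ["<raw table image bytes>", "<raw line graph image bytes>"],
)
print(store.search("revenue multiples table"))  # ['<raw table image bytes>']
```

The search never looks at the raw content itself, which is why the quality of the summaries matters so much for retrieval.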

In [None]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings


def create_multi_vector_retriever(
    vectorstore, text_summaries, texts, table_summaries, tables, image_summaries, images
):
    """
    Create retriever that indexes summaries, but returns raw images or texts
    """

    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    # Add texts, tables, and images
    # Check that text_summaries is not empty before adding
    if text_summaries:
        add_documents(retriever, text_summaries, texts)
    # Check that table_summaries is not empty before adding
    if table_summaries:
        add_documents(retriever, table_summaries, tables)
    # Check that image_summaries is not empty before adding
    if image_summaries:
        add_documents(retriever, image_summaries, images)

    return retriever


# The vectorstore to use to index the summaries
# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
vectorstore = Chroma(
    collection_name="mm_rag_cj_blog", embedding_function=OpenAIEmbeddings()
)

# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore,
    text_summaries,
    texts,
    table_summaries,
    tables,
    image_summaries,
    img_base64_list,
)

## **RAG**

### **Build retriever**

We need to bin the retrieved doc(s) into the correct parts of the GPT-4V prompt template.

In [None]:
import io
import re

from IPython.display import HTML, display
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from PIL import Image


def plt_img_base64(img_base64):
    """Disply base64 encoded string as image"""
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))


def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None


def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xff\xd8\xff": "jpeg",
        b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        for sig in image_signatures:
            if header.startswith(sig):
                return True
        return False
    except Exception:
        return False


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def split_image_text_types(docs):
    """
    Split base64-encoded images and texts
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        if looks_like_base64(doc) and is_image_data(doc):
            doc = resize_base64_image(doc, size=(1300, 600))
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}


def img_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)

    # Adding the text for analysis
    text_message = {
        "type": "text",
        "text": (
            "You are financial analyst tasking with providing investment advice.\n"
            "You will be given a mixed of text, tables, and image(s) usually of charts or graphs.\n"
            "Use this information to provide investment advice related to the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
            "Text and / or tables:\n"
            f"{formatted_texts}"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]


def multi_modal_rag_chain(retriever):
    """
    Multi-modal RAG chain
    """

    # Multi-modal LLM
    # ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable
    model = ChatOpenAI(temperature=0, model="gpt-4o", max_tokens=1024)

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(split_image_text_types),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )

    return chain


# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)
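
One detail in the `is_image_data` helper above: the bytes `RIFF` identify a container format that WebP shares with WAV and AVI files, so that signature alone does not prove the payload is an image. A stricter test also looks for `WEBP` at byte offset 8. A minimal, self-contained sketch (the function name `is_webp` is ours):

```python
import base64


def is_webp(b64data):
    """Return True if the base64 payload starts with a RIFF header of type WEBP."""
    try:
        # 16 base64 characters decode to exactly the first 12 bytes.
        header = base64.b64decode(b64data[:16])
    except Exception:
        return False
    return header[:4] == b"RIFF" and header[8:12] == b"WEBP"
```

For the JPEG images used in this notebook the existing check is sufficient; the stricter test only matters if other RIFF-based files could end up in the docstore.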

### **Check**

Examine retrieval; we get back images that are relevant to our question.

In [None]:
# Check retrieval
query = "Give me company names that are interesting investments based on EV / NTM and NTM rev growth. Consider EV / NTM multiples vs historical?"
docs = retriever_multi_vector_img.invoke(query, limit=6)


len(docs)

4

In [None]:
# Check retrieval
query = "What are the EV / NTM and NTM rev growth for MongoDB, Cloudflare, and Datadog?"
docs = retriever_multi_vector_img.invoke(query, limit=6)

# We get 4 docs (the `limit` argument appears to have no effect; the vector store's default k=4 applies)
len(docs)

4

In [None]:
docs

['/9j/4AAQSkZJRgABAQAAAAAAAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL... (base64 output truncated)

In [None]:
# We get back relevant images
plt_img_base64(docs[0])

### **Sanity Check**

Why does this work? Let's look back at the image that we stored ...

In [None]:
plt_img_base64(img_base64_list[3])

... here is the corresponding summary, which we embedded and used in similarity search.

It's pretty reasonable that this image is indeed retrieved from our `query` based on its similarity to this summary.

In [None]:
image_summaries[3]

'Summary of image: Table comparing financial metrics of 10 companies (Snowflake, MongoDB, Palantir, Cloudflare, Datadog, CrowdStrike, Adobe, ServiceNow, Samsara, and Zscaler) including EV/NTM Rev, EV/2024 Rev, EV/NTM FCF, NTM Rev Growth, Gross Margin, Operating Margin, FCF Margin, and % in Top 10 Multiple LTM. Includes average and median values. Branding: Altimeter, Clouded Judgement, @jaminball.'

### **RAG**

Now let's run RAG and test the ability to synthesize an answer to our question.

In [None]:
query

'What are the EV / NTM and NTM rev growth for MongoDB, Cloudflare, and Datadog?'

In [None]:
# Run RAG chain
chain_multimodal_rag.invoke(query)

"Based on the provided data, here are the EV/NTM (Enterprise Value to Next Twelve Months Revenue) and NTM Revenue Growth for MongoDB, Cloudflare, and Datadog:\n\n1. **MongoDB:**\n   - EV/NTM Revenue: 14.6x\n   - NTM Revenue Growth: 17%\n\n2. **Cloudflare:**\n   - EV/NTM Revenue: 13.4x\n   - NTM Revenue Growth: 28%\n\n3. **Datadog:**\n   - EV/NTM Revenue: 13.1x\n   - NTM Revenue Growth: 19%\n\n### Investment Advice:\n\n1. **MongoDB:**\n   - **Pros:** MongoDB has a relatively high EV/NTM revenue multiple, indicating strong market confidence and valuation. The 17% NTM revenue growth is solid, suggesting steady growth.\n   - **Cons:** The growth rate is lower compared to Cloudflare and Datadog, which might indicate a slower expansion pace.\n   - **Recommendation:** MongoDB is a good investment for those looking for a stable, well-valued company with steady growth. However, it may not offer the highest growth potential compared to its peers.\n\n2. **Cloudflare:**\n   - **Pros:** Cloudflare 

Here is the trace where we can see what is passed to the LLM:

* Question 1 [Trace focused on investment advice](https://smith.langchain.com/public/d77b7b52-4128-4772-82a7-c56eb97e8b97/r)
* Question 2 [Trace focused on table extraction](https://smith.langchain.com/public/4624f086-1bd7-4284-9ca9-52fd7e7a4568/r)

For question 1, we can see that we pass 3 images along with a text chunk.

### **Considerations**

**Retrieval**

* Retrieval is performed based upon similarity to image summaries as well as text chunks.
* This requires some careful consideration because image retrieval can fail if there are competing text chunks.
* To mitigate this, I produce larger (4k token) text chunks and summarize them for retrieval.
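
The larger-chunk idea can be sketched as follows. This is an illustration only, since the cells in this notebook index one summary per PDF page: the function greedily merges consecutive page texts until a budget is reached, and it approximates a token as four characters (a rough rule of thumb for English text; `tiktoken` would give exact counts). Function and parameter names are hypothetical.

```python
def merge_into_chunks(page_texts, max_tokens=4000, chars_per_token=4):
    """Greedily join consecutive texts so each chunk stays near max_tokens."""
    budget = max_tokens * chars_per_token  # approximate budget in characters
    chunks, current, current_len = [], [], 0
    for text in page_texts:
        # Start a new chunk if adding this text would exceed the budget.
        if current and current_len + len(text) > budget:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A single text that is longer than the budget is kept whole as its own chunk, so very long pages would still need to be split separately.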

**Image Size**

* The quality of answer synthesis appears to be sensitive to image size, [as expected](https://platform.openai.com/docs/guides/vision).
* I'll do evals soon to test this more carefully.
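
One factor worth controlling when experimenting with image size is aspect ratio: `resize_base64_image` is called with a fixed `(1300, 600)` target, which stretches images whose proportions differ. A minimal sketch of computing a target size that fits inside a bounding box while preserving proportions (the helper `fit_within` is ours, and it never upscales):

```python
def fit_within(width, height, max_width=1300, max_height=600):
    """Return (w, h) scaled to fit the box, preserving aspect ratio, never upscaling."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The returned tuple can be passed as the `size` argument of `resize_base64_image` in place of the fixed value.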

## **Conclusion**

By integrating both text and image data using a multimodal LLM, our Multi-modal RAG system can provide richer, more comprehensive information retrieval and synthesis. In this notebook, the source PDF and images are loaded from Colab and Google Drive, summaries are indexed in Chroma, and the raw texts and images are returned from an in-memory docstore for answer synthesis.