# Multimodal RAG using a simple AI Agent with complex PDF files

Let's start by checking the Python environment and loading the environment variables we need.

In [1]:
import sys
print(sys.executable)

import pydantic
print(pydantic.__version__)

c:\Users\Amzar\Documents\dcp\llm-agentic-multimodal-rag\.venv\Scripts\python.exe
2.10.6


In [2]:
import os
from dotenv import load_dotenv

load_dotenv()
LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
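
`os.getenv` returns `None` when a variable is missing, and that tends to surface only later as an authentication error. A minimal sketch of a fail-fast lookup follows; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used for illustration and are not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Demo with a throwaway variable (hypothetical name, illustration only)
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook this could wrap the lookup of `LLAMA_CLOUD_API_KEY` right after `load_dotenv()`.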

In [3]:
# confirm the key was loaded without printing the secret itself
print("LLAMA_CLOUD_API_KEY loaded:", LLAMA_CLOUD_API_KEY is not None)

LLAMA_CLOUD_API_KEY loaded: True


## Setting up the model
Let's define the LLM that we'll use as part of the workflow.

In [4]:
MODEL = "llama3.2-vision:11b"
print(MODEL)

llama3.2-vision:11b


### Parsing raw PDFs with LlamaParse to get JSON structured output

In [5]:
# llama-parse is async-first; running its async code inside a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()
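
To see why the patch is needed, here is a small sketch using only the standard library. A notebook cell already runs inside an event loop, and calling `asyncio.run` from inside a running loop raises a `RuntimeError`. The function names `fetch_answer` and `notebook_like_cell` are made up for this illustration.

```python
import asyncio


async def fetch_answer() -> int:
    return 42


async def notebook_like_cell() -> str:
    # We are already inside a running event loop here, as in a Jupyter cell.
    coro = fetch_answer()
    try:
        asyncio.run(coro)  # nested entry into the event loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_like_cell())
print(message)
```

`nest_asyncio.apply()` patches the loop so that such nested calls are allowed, which is what lets the synchronous LlamaParse methods work from a notebook.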

In [9]:
import torch
print(torch.cuda.is_available())

False


In [None]:
from llama_parse import LlamaParse

not_from_cache = False
parser_txt = LlamaParse(verbose=True, invalidate_cache=not_from_cache, result_type="text")
parser_md = LlamaParse(verbose=True, invalidate_cache=not_from_cache, result_type="markdown")

In [None]:
# # pdf_file = "RatingScales_YBOCS_m.pdf" # replace PDF file of interest

# print(f"Parsing text...")
# docs_text = parser_txt.load_data(pdf_file)
# print(f"Parsing PDF file...")
# md_json_objs = parser_md.get_json_result(pdf_file)
# md_json_list = md_json_objs[0]["pages"]

Parsing text...
Started parsing the file under job_id c7a478ef-e4b2-4434-9b33-a6109c055fae
.Parsing PDF file...
Started parsing the file under job_id 39c87bd0-1019-4c2f-ad69-c6bf97dee52a


In [None]:
# Print the Markdown of one page of the JSON output as an example
# print(md_json_list[5]["md"])


# Current

# Intrusive (non-violent) images

# Intrusive nonsense sounds, words, or music

# Bothered by certain sounds/noises *

# Lucky/unlucky numbers

# Colors with special significance

# Superstitious fears

# Other

# SOMATIC OBSESSIONS

# Concern with illness or disease *

# Excessive concern with body part or aspect of appearance (e.g. dysmorphophobia) *

# Other

# CLEANING/WASHING COMPULSIONS

# Excessive or ritualized handwashing

# Excessive or ritualized showering, bathing, toothbrushing, grooming, or toilet routine.

# Involves cleaning of household items or other inanimate objects

# Other measures to prevent or remove contact with contaminants

# Other

# CHECKING COMPULSIONS

# Checking locks, stove, appliances, etc.

# Checking that did not/will not harm others

# Checking that did not/will not harm self

# Checking that nothing terrible did/will happen

# Checking that did not make mistake

# Checking tied to somatic obsessions

# Other

# REPEATING RITUALS

# Re-r

In [None]:
# ### Extract images as dicts from parser
# image_dicts = parser_md.get_images(md_json_objs, download_path="llm_images")
# # print one image dict as example
# print(image_dicts[12])

> Images for page 1: []
> Images for page 2: []
> Images for page 3: []
> Images for page 4: []
> Images for page 5: []
> Images for page 6: []
> Images for page 7: []
> Images for page 8: []
> Images for page 9: []
> Images for page 10: []
> Images for page 11: []
> Images for page 12: []
> Images for page 13: [{'name': 'img_p12_2.png', 'height': 2, 'width': 4, 'x': 36, 'y': 139.45996999999994, 'path': 'llm_images\\39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png', 'job_id': '39c87bd0-1019-4c2f-ad69-c6bf97dee52a', 'original_file_path': 'RatingScales_YBOCS_m.pdf', 'page_number': 13}, {'name': 'img_p12_2.png', 'height': 2, 'width': 4, 'x': 37.44, 'y': 139.45996999999994}, {'name': 'img_p12_2.png', 'height': 2, 'width': 4, 'x': 38.88, 'y': 139.45996999999994}, {'name': 'img_p12_2.png', 'height': 2, 'width': 4, 'x': 40.32, 'y': 139.45996999999994}, {'name': 'img_p12_2.png', 'height': 2, 'width': 4, 'x': 41.76, 'y': 139.45996999999994}, {'name': 'img_p12_2.png', 'height': 2, 'width': 4, 

In [None]:
# Parse every PDF in a folder (generalizes the single-file cells above)
import os
from tqdm import tqdm

pdf_folder = "data/documents"  # replace with your folder path

docs_text = []
md_json_objs = []
image_dicts = []

for file in tqdm(os.listdir(pdf_folder)):
    if file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder, file)
        print(f"Parsing text from {file}...")
        docs_text += parser_txt.load_data(pdf_path)
        print(f"Parsing PDF file {file}...")
        json_result = parser_md.get_json_result(pdf_path)  # list of per-file result dicts
        md_json_objs += json_result
        # get_images expects the list returned by get_json_result, not a single result dict
        image_dicts += parser_md.get_images(json_result, download_path="llm_images")
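
`os.listdir` returns entries in arbitrary order and `endswith(".pdf")` misses files named `.PDF`. A small sketch of a deterministic, case-insensitive listing is shown here; `list_pdfs` is a hypothetical helper and the file names in the demo are made up.

```python
from pathlib import Path
import tempfile


def list_pdfs(folder: str) -> list[Path]:
    """Return the PDF files in `folder`, sorted by path, matching .pdf case-insensitively."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.pdf", "a.PDF", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_pdfs(tmp)]

print(found)
```

A stable order matters here because the text pages and the Markdown pages are later matched purely by position.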

### Build Multimodal Index
In this section we build the multimodal index over the parsed documents.

We do this by creating text nodes from the documents whose metadata references the paths of the extracted images.

In this example we index the text nodes for retrieval. Each text node carries the parsed text along with the paths of any images extracted from its page.

#### Get Text Nodes

In [None]:
from pathlib import Path

'''
Create a dictionary which maps page numbers to image paths with the following format:

{
    1: [Path("path/to/image1"), Path("path/to/image2")],    
    2: [Path("path/to/image3"), Path("path/to/image4")],
}
'''
def create_image_index(image_dicts):
    image_index = {}

    for image_dict in image_dicts:
        page_number = image_dict["page_number"]
        image_path = Path(image_dict["path"])
        if page_number in image_index:
            image_index[page_number].append(image_path)
        else:
            image_index[page_number] = [image_path]

    return image_index
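
The `Image index` output further down lists the same image path many times for page 13. If repeated paths are not wanted, a variant along these lines keeps each path once per page. This is a sketch under assumptions: `create_unique_image_index` is a hypothetical name, and skipping entries that lack `page_number` or `path` is a design choice, not something the parser requires.

```python
from collections import defaultdict
from pathlib import Path


def create_unique_image_index(image_dicts):
    """Map page number -> list of unique image paths, in first-seen order."""
    seen = defaultdict(set)
    image_index = defaultdict(list)
    for image_dict in image_dicts:
        page_number = image_dict.get("page_number")
        raw_path = image_dict.get("path")
        if page_number is None or raw_path is None:
            continue  # skip entries without the fields we need
        if raw_path not in seen[page_number]:
            seen[page_number].add(raw_path)
            image_index[page_number].append(Path(raw_path))
    return dict(image_index)


# Made-up sample data for illustration
sample = [
    {"page_number": 13, "path": "llm_images/a.png"},
    {"page_number": 13, "path": "llm_images/a.png"},
    {"page_number": 14, "path": "llm_images/b.png"},
    {"name": "no_path.png"},
]
unique_index = create_unique_image_index(sample)
print(unique_index)
```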

In [None]:
# from copy import deepcopy

from llama_index.core.schema import TextNode

# attach image metadata to the text nodes
def get_text_nodes(docs, json_dicts=None, image_dicts=None):
    """Split docs into nodes, by separator."""
    nodes = []
    image_index = create_image_index(image_dicts) if image_dicts is not None else {}
    print("Image index: ", image_index)
    md_texts = [d["md"] for d in json_dicts] if json_dicts is not None else None

    doc_chunks = [c for d in docs for c in d.text.split("---")]
    page_num = 0
    chunk_index = 0
    while chunk_index < len(doc_chunks):
        page_num += 1
        chunk_metadata = {"page_num": page_num, "image_paths": []}
        if image_index.get(page_num):
            for image in image_index[page_num]:
                chunk_metadata["image_paths"].append(str(image))
        if md_texts is not None:
            chunk_metadata["parsed_text_markdown"] = md_texts[chunk_index]
        chunk_metadata["parsed_text"] = doc_chunks[chunk_index]
        node = TextNode(text=doc_chunks[chunk_index], metadata=chunk_metadata)
        nodes.append(node)
        chunk_index += 1
    return nodes
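
Note that `get_text_nodes` uses a single page counter across all documents, so with several PDFs `page_num` no longer lines up with each file's own `page_number` in the image index. The sketch below restarts numbering per document. It works on plain strings rather than LlamaIndex objects, `split_into_pages` is a hypothetical helper, and the `"\n---\n"` separator is an assumption about how the parsed text delimits pages.

```python
def split_into_pages(doc_text: str, separator: str = "\n---\n"):
    """Yield (page_number, page_text) pairs, numbering from 1 within one document."""
    for page_number, page_text in enumerate(doc_text.split(separator), start=1):
        yield page_number, page_text.strip()


# Made-up documents: file name -> parsed text
docs = {
    "a.pdf": "first page\n---\nsecond page",
    "b.pdf": "only page",
}

pages = [
    (name, page_number, page_text)
    for name, doc_text in docs.items()
    for page_number, page_text in split_into_pages(doc_text)
]
print(pages)
```

Keeping the file name next to the page number gives a key that stays unique across documents.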

In [None]:
# this will split the text into pages
# get_text_nodes expects one dict per page (each with an "md" key), so flatten the per-file results
md_json_list = [page for obj in md_json_objs for page in obj["pages"]]
text_nodes = get_text_nodes(docs_text, json_dicts=md_json_list, image_dicts=image_dicts)

Image index:  {13: [WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-ad69-c6bf97dee52a-img_p12_2.png'), WindowsPath('llm_images/39c87bd0-1019-4c2f-a

In [26]:
print(text_nodes[6].get_content(metadata_mode="all"))

page_num: 7
image_paths: []
parsed_text_markdown: # TARGET SYMPTOM LIST

# OBSESSIONS:

1.
2.
3.

# COMPULSIONS:

1.
2.
3.

# AVOIDANCE:

1.
2.
3.

v1.0 21 March 2014
parsed_text: 
Na me      Dat e

           TARGET SYMPTOM LIST

OBSESSIONS:

1.

2.

3.


COMPULSIONS:

1.

2.

3.


AVOIDANCE:

1.

2.

3.


v1.0 21 March 2014


Na me      Dat e

           TARGET SYMPTOM LIST

OBSESSIONS:

1.

2.

3.


COMPULSIONS:

1.

2.

3.


AVOIDANCE:

1.

2.

3.


v1.0 21 March 2014


#### Build Index
Once the text nodes are ready, we feed them into our vector store index abstraction, which indexes these nodes in a simple in-memory vector store.

In [20]:
# set BAAI/bge-small-en-v1.5 as vector store embedding model 
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

vector_store_embedding = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)



In [None]:
import os
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

index = None
if not os.path.exists("storage_nodes"):
    index = VectorStoreIndex(text_nodes, embed_model=vector_store_embedding)
    # save index to disk
    index.set_index_id("vector_index")
    index.storage_context.persist("./storage_nodes")
else:
    # rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir="storage_nodes")
    # load index
    index = load_index_from_storage(storage_context, index_id="vector_index", embed_model=vector_store_embedding)
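
Conceptually, an in-memory vector store keeps one embedding per node and returns the nodes whose embeddings are most similar to the query embedding. The toy sketch below shows that idea with hand-written 3-dimensional vectors and cosine similarity. It is not how `VectorStoreIndex` is implemented, and the vectors are made up rather than produced by `BAAI/bge-small-en-v1.5`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), node_id) for node_id, vec in stored.items()]
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored[:k]]


# Made-up embeddings for three pages
stored = {
    "page_1": [1.0, 0.0, 0.0],
    "page_2": [0.7, 0.7, 0.0],
    "page_3": [0.0, 0.0, 1.0],
}
result = top_k([1.0, 0.1, 0.0], stored, k=2)
print(result)
```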

### Build Multimodal Query Engine
We now use LlamaIndex abstractions to build a custom query engine. In contrast to a standard RAG query engine that will retrieve the text node and only put that into the prompt (response synthesis module), this custom query engine will also load the image document, and put both the text and image document into the response synthesis module.

In [22]:
# use llama3.2-vision:11b as the Ollama model and run a quick sanity check
from llama_index.llms.ollama import Ollama

llm_model=Ollama(model=MODEL, request_timeout=500)
response = llm_model.complete("What is the capital of France?")
print(response)

The capital of France is Paris.


In [23]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode
from llama_index.core.prompts import PromptTemplate
from llama_index.core.base.response.schema import Response
from typing import Optional


QA_PROMPT_TMPL = """\
Use the image(s) information first and foremost. ONLY use the text/markdown information provided in the context
below if you can't understand the image(s).

---------------------
Context: {context_str}
---------------------
Given the context information and no prior knowledge, answer the query. Explain where you got the answer
from, and if there's discrepancies, and your reasoning for the final answer.

Query: {query_str}
Answer: """

QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)

class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal Query Engine.

    Takes in a retriever to retrieve a set of document nodes.
    Also takes in a prompt template and multimodal model.

    """

    qa_prompt: PromptTemplate
    retriever: BaseRetriever
    multi_modal_llm: Ollama

    def __init__(self, qa_prompt: Optional[PromptTemplate] = None, **kwargs) -> None:
        """Initialize."""
        super().__init__(qa_prompt=qa_prompt or QA_PROMPT, **kwargs)

    def custom_query(self, query_str: str):
        # retrieve text nodes
        nodes = self.retriever.retrieve(query_str)
        # create ImageNode items from text nodes
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=image_path))
            for n in nodes for image_path in n.metadata.get("image_paths", [])
        ]

        # create context string from text nodes, dump into the prompt
        context_str = "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )
        fmt_prompt = self.qa_prompt.format(context_str=context_str, query_str=query_str)

        image_docs = [image_node.node for image_node in image_nodes]
        # synthesize an answer from formatted text and images
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=image_docs
        )
        return Response(
            response=str(llm_response),
            source_nodes=nodes,
            metadata={"text_nodes": nodes, "image_nodes": image_nodes},
        )
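
The core of `custom_query` is a small data transformation: join the retrieved text into one context string, and collect the image paths stored in each node's metadata. The sketch below shows that step with plain dictionaries in place of LlamaIndex nodes. The shortened prompt, the helper name `build_multimodal_inputs`, and the sample data are all assumptions for illustration.

```python
PROMPT = "Context: {context_str}\nQuery: {query_str}\nAnswer: "


def build_multimodal_inputs(retrieved, query_str):
    """Return (formatted prompt, unique image paths) for a list of retrieved nodes."""
    context_str = "\n\n".join(node["text"] for node in retrieved)
    image_paths = []
    for node in retrieved:
        for path in node["metadata"].get("image_paths", []):
            if path not in image_paths:  # keep each image once, in first-seen order
                image_paths.append(path)
    prompt = PROMPT.format(context_str=context_str, query_str=query_str)
    return prompt, image_paths


# Made-up retrieval result
retrieved = [
    {"text": "Page 13 text", "metadata": {"page_num": 13, "image_paths": ["llm_images/a.png"]}},
    {"text": "Page 7 text", "metadata": {"page_num": 7, "image_paths": []}},
    {"text": "Page 13 again", "metadata": {"page_num": 13, "image_paths": ["llm_images/a.png"]}},
]
prompt, image_paths = build_multimodal_inputs(retrieved, "What is shown?")
print(prompt)
print(image_paths)
```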


In [24]:
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=5), multi_modal_llm=llm_model
)

In [27]:
# run a query
response = query_engine.custom_query("What is the Y-BOCS scale designed for?")
print(response)

I got the answer from the "General Instructions" section of the Y-BOCS scale, which states: "This rating scale is designed to rate the severity and type of symptoms in patients with obsessive compulsive disorder (OCD)."

There are no discrepancies in this answer. The Y-BOCS scale is indeed designed to rate the severity and type of symptoms in patients with OCD.


### Building a Multimodal Agent

In [36]:
# use llama3.2:1b as the Ollama model for tool calling, since llama3.2-vision currently does not support it
# then run a sanity check to confirm it is working
from llama_index.llms.ollama import Ollama

llm_model_tool_calling=Ollama(model="llama3.2:1b")

In [37]:
response = llm_model_tool_calling.complete("Are you able to process image inputs?")
print(response)

I can process text-based input, but I'm not capable of directly processing images. However, I can guide you on how to upload an image and provide information related to it if that's what you're looking for. If you'd like to discuss something specific about an image or ask a question about an image, feel free to provide more details or context.


In [39]:
response = llm_model_tool_calling.complete("Are you able to do agentic tool-calling? If not, can the 3b version of llama3.2 do agentic tool-calling?")
print(response)

I'm a large language model, I don't have direct control over external tools or their capabilities. However, I can try to provide information and guidance on how to achieve agentic tool-calling using various approaches.

Llama3.2 is an open-source tool for creating and running agent-based models, which are simulations of complex systems that involve interacting agents with different goals and behaviors. While Llama3.2 does not have built-in support for agentic tool-calling in the classical sense (i.e., calling external tools to execute a specific action), there are some workarounds and alternatives:

1. **Using external command-line tools**: You can use external command-line tools like `gnuplot`, `matplotlib`, or even shell scripts to create custom plots, visualize data, or perform other actions that require interaction with the user. Llama3.2 can be used as a backend engine for these tools.
2. **Using API calls**: If you have access to an external tool or service that provides an API, 

In [None]:
## this currency converter seems unnecessary

import requests

from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import FunctionCallingAgentWorker

from llama_index.core.tools import FunctionTool
from pydantic import Field


def currency_converter(from_currency_code: str = Field(
        description="ISO 4217 code of the currency to convert from (e.g., USD, INR, EUR)"
    ), to_currency_code: str = Field(
        description="ISO 4217 code of the currency to convert to (e.g., USD, INR, EUR)"
    ), amount: float = Field(
        description="Currency amount to convert"
    )) -> float:
    """Convert `amount` from one currency to another using current exchange rates."""
    # free API for currency exchange rates; rates are quoted per 1 unit of the base currency
    api_url = f"https://api.vatcomply.com/rates?base={to_currency_code}"

    # a timeout keeps the agent from hanging if the API does not respond
    response = requests.get(api_url, timeout=30)
    data = response.json()

    if "error" in data:
        raise ValueError(data["error"])

    rates = data["rates"]
    # rates[from] = units of the source currency per 1 unit of the target currency
    conversion_factor = rates[from_currency_code]
    converted_amount = float(amount) / conversion_factor
    return converted_amount

# Tool for currency conversion
currency_converter_tool = FunctionTool.from_defaults(
    currency_converter,
    name="currency_converter_tool",
    description="Converts an amount from one currency to another based on the current exchange rate. "
    "Takes the currency amount, the code of the currency to convert from, and the code "
    "of the currency to convert to as input.",
)
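
The arithmetic inside `currency_converter` is easy to get backwards, so here is an offline sketch of the same step. Because the request uses the target currency as the base, each rate means "units of that currency per 1 unit of the target", and dividing by the source currency's rate gives the amount in the target currency. The helper name `convert` and the rate values are made up for illustration and are not real exchange rates.

```python
def convert(amount: float, from_code: str, rates: dict) -> float:
    """Convert `amount` of `from_code` into the base currency of `rates`.

    `rates` maps currency code -> units of that currency per 1 unit of the base.
    """
    return amount / rates[from_code]


# Made-up rates with EUR as the base: 1 EUR = 1.25 USD
rates_per_eur = {"USD": 1.25, "EUR": 1.0}

eur_amount = convert(100.0, "USD", rates_per_eur)
print(eur_amount)
```

With these sample rates, 100 USD divided by 1.25 USD per EUR gives 80 EUR.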

In [43]:
# Tool for querying the engine to retrieve contextual information around user query
query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="query_engine_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)

In [41]:
# Set up the agent that can call the currency conversion and query engine tools
agent = FunctionCallingAgentWorker.from_tools(
    [currency_converter_tool, query_engine_tool], llm=llm_model_tool_calling, verbose=True
).as_agent()
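
With `verbose=True`, the agent logs each call as a tool name plus JSON arguments, for example `query_engine_tool` with `{"input": ...}`. The sketch below illustrates that dispatch idea in plain Python. It is a conceptual picture only, not LlamaIndex's internal implementation, and `fake_query_engine`, `TOOLS`, and `dispatch` are hypothetical names.

```python
import json


def fake_query_engine(input: str) -> str:
    """Stand-in for a query engine tool; simply echoes the query."""
    return f"retrieved context for: {input}"


# Registry mapping tool names to Python callables
TOOLS = {"query_engine_tool": fake_query_engine}


def dispatch(tool_name: str, raw_args: str) -> str:
    """Look up a tool by name and call it with JSON-encoded keyword arguments."""
    if tool_name not in TOOLS:
        raise KeyError(f"Unknown tool: {tool_name}")
    kwargs = json.loads(raw_args)
    return TOOLS[tool_name](**kwargs)


output = dispatch("query_engine_tool", '{"input": "compulsions"}')
print(output)
```

The model's job is to pick the tool name and produce the arguments; the clearer the tool descriptions, the more reliable that choice tends to be, which matters for small models.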

In [42]:
query = (
    "What are compulsions?"
)
response = agent.query(query)
print(response)

Added user message to memory: What are compulsions?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "compulsions"}
=== Function Output ===
I found the answer in the context information, specifically in the sections labeled "7. INTERFERENCE DUE TO COMPULSIVE BEHAVIORS", "8. DISTRESS ASSOCIATED WITH COMPULSIVE BEHAVIOR", "9. RESISTANCE AGAINST COMPULSIONS", and "10. DEGREE OF CONTROL OVER COMPULSIVE BEHAVIOR".

These sections all relate to compulsive behaviors, and the answer is not explicitly stated. However, based on the context, I can infer that the answer is related to the questions and rating scales provided.

The closest answer I can find is in section 6, which is labeled "6. TIME SPENT PERFORMING COMPULSIVE BEHAVIORS". However, this section does not provide a direct answer to the query.

Therefore, I will not provide a specific answer to the query "compulsions". If you could provide more context or clarify the question, I would be happy to try and

TypeError: object of type 'NoneType' has no len()