# Query engine for Docling-parsed Markdown files

This notebook demonstrates retrieval-augmented question answering over documents with the Docling Markdown query engines, `DoclingChromaMdQueryEngine` and `DoclingMongoAtlasMdQueryEngine`. It shows how to set up an engine with Docling-parsed Markdown files and run natural language queries against the indexed data.

`DoclingChromaMdQueryEngine` pairs persistent ChromaDB vector storage with LlamaIndex for document retrieval. `DoclingMongoAtlasMdQueryEngine` is used the same way in this notebook, but stores its vectors in a MongoDB Atlas collection.
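
To make the retrieval step concrete, here is a toy sketch of the retrieve-then-ask pattern. It is not how the engine is implemented: the bag-of-words `embed` function, the `retrieve` helper, and the sample chunks are all assumptions made for illustration, standing in for a real embedding model and vector store.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:top_k]


# Hypothetical chunks standing in for sections of parsed Markdown files.
chunks = [
    "Research and development expenses increased during the fiscal year.",
    "The company leases office space in several countries.",
    "Foreign exchange rate risk is managed with forward contracts.",
]

question = "How much was spent on research and development?"
context = retrieve(question, chunks)

# The retrieved text is placed in the prompt that would be sent to the LLM.
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

A real engine replaces the word counts with dense embeddings and the sorted list with a vector database lookup, but the flow of embedding the question, ranking stored chunks, and passing the best matches to the model is the same idea.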

In [1]:
# %pip install llama-index-vector-stores-chroma==0.4.1
# %pip install llama-index==0.12.16

In [2]:
import os

import autogen

config_list = autogen.config_list_from_json(env_or_file="OAI_CONFIG_LIST")

assert len(config_list) > 0
print("models to use: ", [config["model"] for config in config_list])

# Put the OpenAI API key into the environment
os.environ["OPENAI_API_KEY"] = config_list[0]["api_key"]

models to use:  ['gpt-4']
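
The cell above reads a `model` and an `api_key` from each entry of `OAI_CONFIG_LIST`. As a rough sketch of the shape that implies, the snippet below writes a placeholder file to a temporary directory and reads it back with the standard library. The key value is a dummy, and the assumption that the file is a JSON list with one object per model is inferred from how the cell indexes it.

```python
import json
import os
import tempfile

# Placeholder entry; a real key should be kept out of version control.
example_config = [{"model": "gpt-4", "api_key": "sk-placeholder"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "OAI_CONFIG_LIST")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(example_config, f, indent=2)

    # Read the file back to check that every entry has the expected fields.
    with open(config_path, encoding="utf-8") as f:
        loaded = json.load(f)

missing = [entry for entry in loaded if "model" not in entry or "api_key" not in entry]
models = [entry["model"] for entry in loaded]
print("models to use: ", models)
```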


## Docling Chroma Query Engine

In [3]:
from autogen.agentchat.contrib.rag.docling_query_engine import DoclingChromaMdQueryEngine

query_engine = DoclingChromaMdQueryEngine(db_path="./tmp/chroma")



In [4]:
# Build the vector index from the documents in a directory
input_dir = "/root/ag2/test/agentchat/contrib/rag/pdf_parsed/"
query_engine.init_db(input_dir=input_dir)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Collection None was created in the database.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading docs from directory: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents are loaded successfully.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:VectorDB index was created with input documents
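
Before pointing `init_db` at a directory, it can help to check which Markdown files are actually there. The helper below is a hypothetical convenience, not part of the query engine, and it makes no claim about how `init_db` itself selects files. It is demonstrated on a throwaway directory with made-up file names.

```python
import tempfile
from pathlib import Path


def list_markdown_files(input_dir: str) -> list[str]:
    """Return sorted paths of the .md files directly inside input_dir."""
    directory = Path(input_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"Not a directory: {input_dir}")
    return sorted(str(path) for path in directory.glob("*.md"))


# Demonstration on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("report_a.md", "report_b.md", "notes.txt"):
        (Path(tmp_dir) / name).write_text("# placeholder\n", encoding="utf-8")
    found = [Path(p).name for p in list_markdown_files(tmp_dir)]

print(found)
```

The returned list has the same form as the `input_doc_paths` argument used later in this notebook, so it could be passed there when only a subset of a directory should be indexed.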


In [5]:
print(query_engine.get_collection_name())

docling-parsed-docs


In [6]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

NVIDIA has invested over $45.3 billion in research and development since its inception.


In [7]:
# Alternatively, initialize the index from an explicit list of files
input_docs = ["/root/ag2/test/agentchat/contrib/rag/pdf_parsed/nvidia_10k_2024.md"]
query_engine.init_db(input_doc_paths=input_docs)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Collection None was created in the database.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading input doc: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/nvidia_10k_2024.md
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents are loaded successfully.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:VectorDB index was created with input documents


In [8]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

NVIDIA has invested over $45.3 billion in research and development since its inception.


In [9]:
# Add another document to the existing index
new_docs = ["/root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md"]
query_engine.add_docs(new_doc_paths=new_docs)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading input doc: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md


In [10]:
question = "How much money did Toast earn in 2024"
answer = query_engine.query(question)
print(answer)

In 2024, Toast had a net income of $56 million for the three months ended September 30, and a net loss of $13 million for the nine months ended September 30.


## Docling Mongo Atlas Query Engine

In [None]:
from autogen.agentchat.contrib.rag.docling_query_engine import DoclingMongoAtlasMdQueryEngine

query_engine = DoclingMongoAtlasMdQueryEngine(
    connection_string="",  # fill in your MongoDB Atlas connection string
    database_name="test-vectordb",
    collection_name="test-collection",
)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Connected to MongoDB Atlas.
INFO:/root/miniconda3/envs/ag2/lib/python3.11/site-packages/llama_index/vector_stores/mongodb/index.py:Creating Search Index vector_index on test-collection
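
The connection string is left empty above. One way to avoid pasting credentials into the notebook is to read them from the environment, as sketched below. The variable name `MONGODB_URI` and the helper `get_connection_string` are assumptions chosen for this example, and the URI is a non-working placeholder.

```python
import os


def get_connection_string(env_var: str = "MONGODB_URI") -> str:
    """Read the connection string from the environment and fail early if unset."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return value


# Placeholder for demonstration only; a real Atlas URI would be exported
# in the shell before starting the notebook.
os.environ["MONGODB_URI"] = "mongodb+srv://user:password@example.invalid/"
connection_string = get_connection_string()
```

The resulting value can then be passed as the `connection_string` argument of `DoclingMongoAtlasMdQueryEngine`.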


In [12]:
input_dir = "/root/ag2/test/agentchat/contrib/rag/pdf_parsed/"
query_engine.init_db(input_dir=input_dir)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading docs from directory: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents are loaded successfully. Total docs loaded: 2
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Index created with 2 documents.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Vector index created with input documents.


In [13]:
print(query_engine.get_collection_name())

test-collection


In [14]:
question = "Quantitative and Qualitative Disclosures About Market Risk"
answer = query_engine.query(question)
print(answer)

The company is exposed to interest rate risk related to its fixed-rate investment portfolio and outstanding debt. A sensitivity analysis indicated that a 0.5% shift in the yield curve would change the fair value of investments by $93 million. The company has $9.7 billion in senior Notes, which are not affected by interest rate changes due to their fixed rate. 

Regarding foreign exchange rate risk, the company considers its direct exposure minimal as sales are in U.S. dollars and foreign currency forward contracts are used to offset exchange rate movements. A 10% strengthening of the U.S. dollar would have reduced accumulated other comprehensive income by $116 million and $112 million for the fiscal years ending January 28, 2024, and January 29, 2023, respectively. An adverse 10% foreign exchange rate change would have impacted income before taxes by $60 million and $36 million for the same periods, with these changes expected to be offset by corresponding changes in the value of hedge

In [15]:
new_docs = ["/root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md"]
query_engine.add_docs(new_doc_paths=new_docs)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading input doc: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Inserted a new document into the index.


In [16]:
question = "How much money did Toast earn in 2024"
answer = query_engine.query(question)
print(answer)

In 2024, Toast had a net income of $56 million for the three months ended September 30, and a net loss of $13 million for the nine months ended September 30.


In [None]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

NVIDIA has invested over $45.3 billion in research and development since its inception.
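
The cells above repeat the same three lines for every question. A small helper can run a list of questions through any object that has a `query` method. The sketch below is illustrative only: `ask_all` is a hypothetical helper, and `EchoEngine` is a stand-in so the code runs without a database or an API key.

```python
from typing import Protocol


class SupportsQuery(Protocol):
    """Anything with a query(question) method, like the engines above."""

    def query(self, question: str) -> object: ...


def ask_all(engine: SupportsQuery, questions: list[str]) -> dict[str, str]:
    """Run each question through the engine and collect the answers as strings."""
    return {question: str(engine.query(question)) for question in questions}


class EchoEngine:
    """Stand-in engine that returns a canned answer instead of calling an LLM."""

    def query(self, question: str) -> str:
        return f"(stub answer to: {question})"


answers = ask_all(
    EchoEngine(),
    [
        "How much money did Toast earn in 2024",
        "How much money did Nvidia spend in research and development",
    ],
)
for question, answer in answers.items():
    print(f"Q: {question}\nA: {answer}\n")
```

Replacing `EchoEngine()` with the `query_engine` from this notebook should give real answers, assuming its `query` method returns something that can be converted with `str`, as the `print(answer)` calls above suggest.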
