# Query engine for Docling-parsed Markdown files

This notebook demonstrates the `DoclingQueryEngine` for retrieval-augmented question answering over documents. It shows how to set up the engine with Docling-parsed Markdown files and run natural language queries against the indexed data.

The `DoclingQueryEngine` uses LlamaIndex for document retrieval, backed by either persistent ChromaDB vector storage or MongoDB Atlas.
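To make the retrieval step concrete, here is a toy sketch of similarity-based retrieval. It is not how the engine is implemented: it swaps in a bag-of-words vector for a learned embedding and a plain list for the vector store, and the function names and the two sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


# Made-up example chunks standing in for indexed document text.
chunks = [
    "Research and development expenses grew during the fiscal year.",
    "The company leases office space in several countries.",
]
best = retrieve("How much was spent on research and development?", chunks)
print(best[0])
```

In a retrieval-augmented setup, the retrieved chunks are then passed to the language model as context for answering the question.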

In [1]:
# %pip install llama-index-vector-stores-chroma==0.4.1
# %pip install llama-index==0.12.16

In [4]:
# %pip install sentence_transformers

Collecting tokenizers<0.22,>=0.21 (from transformers<5.0.0,>=4.41.0->sentence_transformers)
  Using cached tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.3
    Uninstalling tokenizers-0.20.3:
      Successfully uninstalled tokenizers-0.20.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2, but you have tokenizers 0.21.0 which is incompatible.
Successfully installed tokenizers-0.21.0

In [2]:
import os

import autogen

config_list = autogen.config_list_from_json(env_or_file="OAI_CONFIG_LIST")

assert len(config_list) > 0
print("models to use: ", [cfg["model"] for cfg in config_list])

# Put the OpenAI API key into the environment
os.environ["OPENAI_API_KEY"] = config_list[0]["api_key"]

models to use:  ['gpt-4']
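The cell above only relies on each configuration entry having a `model` and an `api_key` field. As a rough sketch of that shape, the snippet below parses a JSON list from a string rather than through `autogen`; the key value is a placeholder, and a real `OAI_CONFIG_LIST` may carry additional fields.

```python
import json

# Placeholder content; a real configuration would hold a real API key.
example_config = """
[
    {"model": "gpt-4", "api_key": "<your-api-key>"}
]
"""

config_list = json.loads(example_config)
models = [entry["model"] for entry in config_list]
print("models to use: ", models)
```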


In [3]:
from autogen.agentchat.contrib.rag.docling_query_engine import DoclingQueryEngine



## Docling Chroma Query Engine

In [4]:
query_engine = DoclingQueryEngine(
    db_type="chroma",
    db_path="./tmp/chroma",  # Directory where Chromadb will persist data.
)

INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.


In [5]:
# Initialize the engine by loading documents from a directory.
input_dir = "/root/ag2/test/agentchat/contrib/rag/pdf_parsed/"
query_engine.init_db(input_dir=input_dir)

# Display the collection name
print("Chroma Collection Name:", query_engine.get_collection_name())

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Chroma collection 'docling-parsed-docs' created or retrieved.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading docs from directory: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents loaded successfully.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Vector index created with input documents.


Chroma Collection Name: docling-parsed-docs
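Before pointing `init_db` at a directory, it can help to check which Markdown files the directory actually contains. The helper below is a hypothetical pre-check and not part of the engine; the notebook does not state which file types `init_db` picks up, so the `*.md` filter is an assumption, and the file names are invented for the demonstration.

```python
import tempfile
from pathlib import Path


def list_markdown_files(input_dir: str) -> list[str]:
    """Return sorted paths of the .md files directly inside input_dir."""
    return sorted(str(p) for p in Path(input_dir).glob("*.md"))


# Demonstrate on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report_a.md").write_text("# A")
    (Path(tmp) / "report_b.md").write_text("# B")
    (Path(tmp) / "notes.txt").write_text("not markdown")
    found = [Path(p).name for p in list_markdown_files(tmp)]

print(found)
```

The same list of paths could also be passed explicitly through `input_doc_paths` if only a subset of files should be indexed.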


In [6]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

Nvidia has invested over $45.3 billion in research and development since its inception.


In [7]:
# Initialize the engine from an explicit list of document paths instead of a directory.
input_docs = ["/root/ag2/test/agentchat/contrib/rag/pdf_parsed/nvidia_10k_2024.md"]
query_engine.init_db(input_doc_paths=input_docs)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Chroma collection 'docling-parsed-docs' created or retrieved.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading input doc: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/nvidia_10k_2024.md
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents loaded successfully.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Vector index created with input documents.


In [8]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

NVIDIA has invested over $45.3 billion in research and development since its inception.


In [9]:
# Add a new document to the existing index.
new_docs = ["/root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md"]
query_engine.add_docs(new_doc_paths=new_docs)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading input doc: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Inserted a new document into the index.


In [10]:
question = "How much money did Toast earn in 2024"
answer = query_engine.query(question)
print(answer)

In 2024, Toast reported a net income of $56 million for the three months ended September 30, and a net loss of $13 million for the nine months ended September 30.


## Docling Mongo Atlas Query Engine

In [None]:
query_engine = DoclingQueryEngine(
    db_type="mongodb",
    connection_string="",  # Fill in your MongoDB Atlas connection string.
    database_name="vector_db",
    collection_name="my_mongo_collection",
)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Connected to MongoDB Atlas.
INFO:/root/miniconda3/envs/ag2/lib/python3.11/site-packages/llama_index/vector_stores/mongodb/index.py:Creating Search Index vector_index on my_mongo_collection
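The connection string is left empty in the cell above. One way to avoid hard-coding credentials in the notebook is to read the string from an environment variable. The sketch below assumes a variable named `MONGODB_URI`, which is an illustrative choice and not something the engine requires; the value set here is a placeholder.

```python
import os


def get_connection_string(var_name: str = "MONGODB_URI") -> str:
    """Read the connection string from the environment, failing loudly if unset."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return value


# Placeholder value for demonstration only; normally set outside the notebook.
os.environ["MONGODB_URI"] = "mongodb+srv://<user>:<password>@<cluster-host>/"
connection_string = get_connection_string()
```

The resulting `connection_string` can then be passed as the `connection_string` argument when constructing the engine.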


In [12]:
# Initialize the engine by loading documents from a directory.
input_dir = "/root/ag2/test/agentchat/contrib/rag/pdf_parsed/"
query_engine.init_db(input_dir=input_dir)

# Display the collection name
print("MongoDB Collection Name:", query_engine.get_collection_name())

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading docs from directory: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Documents loaded successfully. Total docs loaded: 2
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Index created with 2 documents.
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Vector index created with input documents.


MongoDB Collection Name: my_mongo_collection


In [13]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print("Answer:", answer)

Answer: NVIDIA has invested over $45.3 billion in research and development since its inception.


In [14]:
# Add a new document to the existing index.
new_docs = ["/root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md"]
query_engine.add_docs(new_doc_paths=new_docs)

INFO:autogen.agentchat.contrib.rag.docling_query_engine:Loading input doc: /root/ag2/test/agentchat/contrib/rag/pdf_parsed/Toast_financial_report.md
INFO:autogen.agentchat.contrib.rag.docling_query_engine:Inserted a new document into the index.


In [15]:
question = "How much money did Toast earn in 2024"
answer = query_engine.query(question)
print(answer)

In 2024, Toast had a net income of $56 million for the three months ended September 30, and a net loss of $13 million for the nine months ended September 30.


In [16]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

NVIDIA has invested over $45.3 billion in research and development since its inception.
