# Query engine for Docling-parsed Markdown files

This notebook demonstrates the use of the `VectorChromaQueryEngine` for retrieval-augmented question answering over documents. It shows how to set up the engine with Docling-parsed Markdown files and run natural language queries against the indexed data. A second part repeats the workflow with the `MongoDBQueryEngine`.

The `VectorChromaQueryEngine` integrates persistent ChromaDB vector storage with LlamaIndex for efficient document retrieval.
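
To make the idea of vector retrieval concrete, here is a minimal conceptual sketch. It is not how `VectorChromaQueryEngine`, ChromaDB, or LlamaIndex are implemented: the bag-of-words `embed` function stands in for a real embedding model, and the two text chunks are made-up examples rather than content from the parsed reports. All function names here are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks ranked by similarity to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Made-up chunks for illustration only
chunks = [
    "research and development expenses increased during the fiscal year",
    "the company leases office space in several countries",
]
print(retrieve("how much was spent on research and development", chunks))
```

In a real engine, the retrieved chunks are then passed to an LLM together with the question to produce the final answer.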

In [None]:
%pip install llama-index-vector-stores-chroma==0.4.1
%pip install llama-index==0.12.16

In [None]:
%pip install sentence-transformers
%pip install llama-index-llms-langchain

In [None]:
import os

import autogen

config_list = autogen.config_list_from_json(env_or_file="../OAI_CONFIG_LIST")

assert len(config_list) > 0
print("models to use: ", [config["model"] for config in config_list])

# Put the OpenAI API key into the environment
os.environ["OPENAI_API_KEY"] = config_list[0]["api_key"]

In [None]:
from autogen.agents.experimental.document_agent.chroma_query_engine import VectorChromaQueryEngine

query_engine = VectorChromaQueryEngine(db_path="./tmp/chroma")

In [None]:
# Update to match your environment
input_dir = "../test/agents/experimental/document_agent/pdf_parsed/"

The next cell prints the default collection name in the vector store, which is where all documents will be ingested. When creating the `VectorChromaQueryEngine`, you can specify a `collection_name` to ingest into a different collection.

In [None]:
print(query_engine.get_collection_name())

Before ingesting anything, let's try to run a query against the empty vector store.

In [None]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

We can see that nothing could be found. Now let's ingest a document.
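
Because `input_dir` depends on your environment, it can help to confirm that the files exist before ingesting them. The helper below is a small sketch using only the standard library; `split_existing` is a hypothetical name and not part of the query engine API.

```python
from pathlib import Path


def split_existing(input_dir: str, file_names: list[str]) -> tuple[list[str], list[str]]:
    """Split file names into (full paths that exist, names that are missing)."""
    base = Path(input_dir)
    found, missing = [], []
    for name in file_names:
        path = base / name
        if path.is_file():
            found.append(str(path))
        else:
            missing.append(name)
    return found, missing


found, missing = split_existing(
    "../test/agents/experimental/document_agent/pdf_parsed/",
    ["nvidia_10k_2024.md"],
)
if missing:
    print("Missing files, check input_dir:", missing)
```

The `found` list can then be passed as `new_doc_paths` when adding documents.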

In [None]:
input_docs = [input_dir + "nvidia_10k_2024.md"]  # Update to match your environment
query_engine.add_docs(new_doc_paths=input_docs)

In [None]:
question = "How much money did Nvidia spend in research and development"
answer = query_engine.query(question)
print(answer)

Great, we got the data we needed. Now, let's add another document.

In [None]:
new_docs = [input_dir + "Toast_financial_report.md"]
query_engine.add_docs(new_doc_paths=new_docs)

Now query the same vector store again, this time about a different company.

In [None]:
question = "How much money did Toast earn in 2024"
answer = query_engine.query(question)
print(answer)

# MongoDB query engine for Docling-parsed Markdown files

In [None]:
import os

from chromadb.utils import embedding_functions
from langchain_openai import ChatOpenAI

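# Fill in your OpenAI API key; it is intentionally left blank here
os.environ["OPENAI_API_KEY"] = ""
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="",  # your OpenAI API key
    model_name="text-embedding-ada-002",
)

# Chat model passed to query() below to generate the answers
llm = ChatOpenAI()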
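
The empty strings above are placeholders for your own credentials. One way to avoid pasting a key into the notebook is to read it from an environment variable and fail early if it is missing. This is only a sketch; `require_env` is a hypothetical helper, not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell")
    return value


# Hypothetical usage: read the key once and reuse it wherever it is needed
# api_key = require_env("OPENAI_API_KEY")
```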

In [None]:
# Update to match your environment
input_dir = "/root/ag2/test/agents/experimental/document_agent/pdf_parsed/"

In [None]:
from autogen.agentchat.contrib.rag.mongodb_query_engine import MongoDBQueryEngine

query_engine = MongoDBQueryEngine(
    connection_string="",  # your MongoDB connection string
    embedding_function=openai_ef,
    database_name="vector_db_1",
)

In [None]:
query_engine.connect_db()
# On the first run this reports an error telling you to run init_db first.
# After init_db has been run, re-running this cell works.
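
The connect-then-initialize behavior described above can be wrapped in a small helper. This is a sketch under a stated assumption: that a failed `connect_db()` either raises an exception or returns `False`, which this notebook does not confirm. `connect_or_init` and `StubEngine` are hypothetical names; the stub exists only so the example runs without a database.

```python
def connect_or_init(engine, doc_paths):
    """Connect to an existing collection, or initialize one if connecting fails.

    Assumption: a failed connect_db() either raises or returns False.
    """
    try:
        ok = engine.connect_db()
    except Exception as exc:
        print(f"connect_db raised: {exc}")
        ok = False
    if ok is False:
        engine.init_db(new_doc_paths=doc_paths)
        return "initialized"
    return "connected"


class StubEngine:
    """Hypothetical stand-in used only to exercise the helper above."""

    def __init__(self, has_collection):
        self.has_collection = has_collection

    def connect_db(self):
        if not self.has_collection:
            raise RuntimeError("collection not found, run init_db first")
        return True

    def init_db(self, new_doc_paths):
        self.has_collection = True


engine = StubEngine(has_collection=False)
print(connect_or_init(engine, ["Toast_financial_report.md"]))  # first run
print(connect_or_init(engine, ["Toast_financial_report.md"]))  # later runs
```

Falling back to `init_db` automatically is convenient in a demo, but check what `init_db` does to an existing collection before relying on this pattern with real data.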

In [None]:
# Initialize with the Toast report; nvidia_10k_2024.md is added later with add_records
query_engine.init_db(new_doc_paths=[input_dir + "Toast_financial_report.md"])

In [None]:
question = "What is the trading symbol for Toast"
answer = query_engine.query(question, llm)
print(answer)

In [None]:
query_engine.add_records(new_doc_paths_or_urls=[input_dir + "nvidia_10k_2024.md"])

In [None]:
print(query_engine.query("How much money did Nvidia spend in research and development", llm))