# Document Search and Retrieval for Complex Documents Using RAG


<img src="arch.png" width=400px>

### Process document and create vectorstore

In [1]:
from rag_101.retriever import (
    load_pdf,
    split_text,
    load_embedding_model,
    load_reranker_model,
    generate_embeddings,
    rerank_docs,
)

Load documents and split into chunks

In [2]:
# Load two PDFs (the paths below point to 10-K filings).
# To use your own documents, update the paths.
files = ["10k_docs/10k.pdf", "10k_docs/amazon-10k.pdf"]
loaders = load_pdf(files=files)

# Split the loaded documents into chunks
documents = split_text(loaders=loaders, chunk_size=1000)

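To make `chunk_size=1000` concrete, the sketch below splits a string into fixed-size character windows with a small overlap. It is only an illustration: `chunk_text` and its `overlap` parameter are hypothetical, and `split_text` from `rag_101.retriever` may split differently (for example on separators, or by tokens rather than characters).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    Hypothetical helper, not the rag_101 implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2500
chunks = chunk_text(sample, chunk_size=1000, overlap=100)
print([len(c) for c in chunks])
```

Smaller chunks tend to give more precise matches at query time, while larger chunks keep more surrounding context; the right value depends on the documents.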


Generate embeddings and store in a vector database

In [3]:
# initialize models
embedding_model = load_embedding_model(model_name="BAAI/bge-large-en-v1.5")
reranker_model = load_reranker_model(reranker_model_name="BAAI/bge-reranker-large")

# generate embeddings and store in vector database
print("Generating embeddings... This might take some time.")
vectorstore = generate_embeddings(documents, embedding_model=embedding_model)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

Generating embeddings... This might take some time.
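Conceptually, `search_kwargs={"k": 10}` asks the retriever for the 10 chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea with cosine similarity over hand-made 3-dimensional vectors. `cosine_similarity`, `top_k`, and the toy data are hypothetical, and the vector store returned by `generate_embeddings` may use a different index or distance metric.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for chunk embeddings; real embeddings have many more dimensions.
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [1.0, 0.0, 0.0]

results = top_k(toy_query_vector, toy_doc_vectors, k=2)
for index, score in results:
    print(index, round(score, 3))
```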


### Query the input document

In [4]:
query = "What are the earnings of Microsoft?"

retrieved_documents = retriever.get_relevant_documents(query)
reranked_documents = rerank_docs(reranker_model, query, retrieved_documents)

print("\nUser query:", query)
print("--" * 50)
print("Retrieved content:")
print(reranked_documents[0][0].page_content)
print("--" * 50)
print("metadata:", reranked_documents[0][0].metadata)


User query: What is the DocLLM architecture ?
----------------------------------------------------------------------------------------------------
Retrieved content:


DocLLM is constructed upon the foundation of an auto-regressive transformer language model [4] following a causal decoder structure. It is composed of stacked transformer blocks, where each block contains a multi-head self-attention layer and a fully connected feed forward network. Standard language models are typically unimodal, accepting only a sequence of text tokens as input. In contrast, DocLLM is a multi-modal system that integrates lightweight visual information by utilizing the spatial positions and dimensions of text tokens obtained using OCR. Simply augmenting the text with bounding box information via additive positional encoding may not capture the intricate relationships between text semantics and spatial layout, especially for visually rich documents [10]. Consequently, we treat the spatial information abo
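The indexing `reranked_documents[0][0]` suggests that `rerank_docs` returns `(document, score)` pairs sorted best-first. A minimal sketch of that shape follows, assuming a toy word-overlap scorer in place of the reranker model; `toy_score` and `rerank` are hypothetical names, not the `rag_101` implementation.

```python
from typing import Callable, List, Sequence, Tuple


def toy_score(query: str, passage: str) -> float:
    """Fraction of query words that also occur in the passage.

    A stand-in for a cross-encoder, which would score the (query, passage)
    pair jointly with a neural model.
    """
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[str, str], float] = toy_score,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and sort best-first."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    "The dataset was collected from several countries.",
    "Answers are evaluated with metrics such as relevance and coverage.",
    "Embeddings are stored in a vector database.",
]
reranked = rerank("which metrics evaluate answers", candidates)

# Same access pattern as reranked_documents[0][0] in the cell above.
best_passage, best_score = reranked[0]
print(best_passage, best_score)
```

Retrieving a wider candidate set first and reranking it afterwards keeps the expensive pairwise scoring limited to a handful of chunks per query.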

### Run through some sample queries and observe the results

In [8]:
query1 = "What are the metrics used to evaluate the answers?"
query2 = "How many pdf data were collected from the USA?"
query3 = "What is the DocLLM architecture ?"
query4 = "Which countries were used to collect dataset?"
query5 = "Where was the agriculture dataset collected for the USA?"
query6 = "What was the answer generation process used in the paper?"
query7 = "how was the content and structure of available documents augmented?"

queries = [
    query1,
    query2,
    query3,
    query4,
    query5,
    query6,
    query7,
]

for i, query in enumerate(queries):
    print(f"Example {i+1}: Query->", query)
    print(".." * 50)
    print("Retrieved document:")

    # First stage: fetch the top-k chunks from the vector store
    retrieved_documents = retriever.get_relevant_documents(query)
    print("--" * 50)
    print(len(retrieved_documents), retrieved_documents)

    # Second stage: rerank the retrieved chunks against the query
    reranked_documents = rerank_docs(reranker_model, query, retrieved_documents)
    print("Reranked documents:")
    print("--" * 50)
    print(len(reranked_documents), reranked_documents)

    # Show the best chunk after reranking, with its metadata
    print("Reranked document:")
    print("--" * 50)
    print(reranked_documents[0][0].page_content)
    print("--" * 50)
    print("metadata:", reranked_documents[0][0].metadata)
    print("==" * 50, "\n\n")

Example 1: Query-> What are the metrics used to evaluate the answers?
....................................................................................................
Retrieved document:
----------------------------------------------------------------------------------------------------
10 [Document(page_content='\n\nIn the existing literature, it’s common to find a wealth of metrics designed to evaluate the quality of answers. These metrics play an indispensable role in ensuring that the answers provided by models are accurate, relevant, and effectively address the questions posed. However, there is a notable gap when it comes to metrics specifically designed to assess the quality of the questions themselves. Recognizing this lack, we have developed our metrics with a focus on evaluating question quality. Given the pivotal role that questions play in driving meaningful conversations and generating useful answers, it is equally important to ensure their quality as it is for the ans

Reranked documents:
----------------------------------------------------------------------------------------------------
10 [(Document(page_content='Details: We assess the level of detail in both the generated questions and answers by counting the number of tokens (words) in each. This metric provides insight into the depth and specificity of the content generated by the Q&A system. By employing these metrics, we can effectively evaluate and refine the Q&A generation process, ensuring that the generated content is informative, relevant, diverse, and grounded in the source material. This will ultimately lead to a more useful and effective Q&A generation system for the target audience.', metadata={'source': 'example_data/2401.08406.pdf'}), 0.13917649), (Document(page_content='4.2 Answer Evaluation\n\nEvaluating answers generated by LLMs is challenging due to their tendency of generating long, informative, and conversational answers. Traditional metrics are unsuitable to evaluate this typ