# RAG: Data Ingestion and Retrieval for Complex Documents


<img src="arch.png" width=500px>

## Process document and create vectorstore

In [1]:
from rag_101.retriever import (
    load_pdf,
    split_text,
    load_embedding_model,
    load_reranker_model,
    generate_embeddings,
    create_compression_retriever,
)

In [2]:
files = ["example_data/2401.08406.pdf", "example_data/2401.00908.pdf"]  # Replace with your own document(s)
loaders = load_pdf(files=files)
documents = split_text(loaders=loaders, chunk_size=1000)
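
The internals of `split_text` are not shown in this notebook, so the following is only a minimal sketch of what fixed-size chunking can look like, assuming `chunk_size` counts characters. The function name `split_into_chunks` and the `chunk_overlap` parameter are hypothetical and are not part of `rag_101`.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    Hypothetical helper for illustration; not the rag_101 implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = split_into_chunks(sample, chunk_size=1000, chunk_overlap=100)
print([len(p) for p in pieces])
```

Larger chunks keep more context per retrieved passage but make each embedding less specific; the overlap trades some index size for robustness at chunk boundaries.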

In [3]:
embedding_model = load_embedding_model(model_name="BAAI/bge-large-en-v1.5")
reranker_model = load_reranker_model(reranker_model_name="BAAI/bge-reranker-large")

print("Generating embeddings... This might take some time.")
vectorstore = generate_embeddings(documents, embedding_model=embedding_model)

compression_retriever = create_compression_retriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    reranker_model=reranker_model,
)

No sentence-transformers model found with name /teamspace/studios/this_studio/weights/sentence_transformers/BAAI_bge-reranker-large. Creating a new one with MEAN pooling.
Some weights of XLMRobertaModel were not initialized from the model checkpoint at /teamspace/studios/this_studio/weights/sentence_transformers/BAAI_bge-reranker-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Generating embeddings... This might take some time.
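
The cell above combines a vector-store retriever (`k=4`) with a reranker model. How `create_compression_retriever` wires them together is not shown here, so the sketch below only illustrates the general retrieve-then-rerank pattern, under the assumption that this is what happens. Both scoring functions are toy stand-ins: bag-of-words cosine similarity replaces the embedding model, and `term_coverage` replaces the reranker. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 4) -> list[str]:
    """Stage 1: cheap similarity search over the whole corpus, keep top k."""
    q = bag_of_words(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bag_of_words(doc)), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Stage 2: score each (query, candidate) pair jointly, keep top_n."""
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_n]


def term_coverage(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: fraction of query terms found in doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


corpus = [
    "coherence and relevance metrics score answers",
    "training used eight gpus",
    "documents were collected from the usda",
    "metrics metrics metrics",
    "the cat sat",
]
query = "which metrics score answers"
candidates = retrieve(query, corpus, k=4)
print(rerank(query, candidates, term_coverage, top_n=1))
```

The point of the two stages is cost: the first pass scores every chunk independently and cheaply, while the second pass runs a more expensive pairwise scorer on only the `k` survivors.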


## Query the input document

In [4]:
query = "What are the metrics used to evaluate the answers?"
compressed_docs = compression_retriever.get_relevant_documents(query)
print("User query:", query, "\n")
print("Retrieved content:\n", compressed_docs[0].page_content)
print("metadata:", compressed_docs[0].metadata)
print("-" * 50, "\n\n")

User query: What are the metrics used to evaluate the answers? 

Retrieved content:
 LLMs as judges for open-ended scenarios (Zhu et al., 2023). In this work, we have used the AzureML Model Evaluation (Microsoft, 2023), with the following metrics to compare the generated answers with ground truth:

Coherence: Comparison of coherence between ground truths and predictions given the context. The metric provides a score between one to five, where one means that the answer lacks coherence and five means the answer has perfect coherency. An example is provided in Table 6.

Relevance: Relevance measures how well the answer addresses the main aspects of the question based on the context. The metric rates from 1 to 5, where 5 means the answer has perfect relevance. An example is provided in Table 7.

Groundedness: The metric defines weather the answer follows logically from the information contained in the context or not and provides and integer score to determine how grounded the answer is. An

In [5]:
query = "What is the DocLLM architecture ?"
compressed_docs = compression_retriever.get_relevant_documents(query)
print("User query:", query, "\n")
print("Retrieved content:\n", compressed_docs[0].page_content)
print("metadata:", compressed_docs[0].metadata)
print("-" * 50, "\n\n")

User query: What is the DocLLM architecture ? 

Retrieved content:
 The maximum sequence length, or context length, is consistently set to 1,024 for both versions during the entire training process. The DocLLM-7B models are trained with 16-bit mixed precision on 8 24GB A10g GPUs using fully sharded data parallelism, implemented with the accelerate library.4 The DocLLM-1B model, on the other hand, is trained on a single 24GB A10g GPU.

4.3 Downstream Evaluation

Experimental settings. We investigate two experimental settings:

Same Datasets, Different Splits (SDDS): Following previous work in VRDU [34, 59, 33, 12, 31, 32], we first evaluate DocLLM on the unseen test split (or dev split when test split is unavailable) of each of the 16 datasets composing the instruction-tuning data. The motivation behind this very typical setting is to check how DocLLM performs when tasks and domains supposedly stay the same from train to test.
metadata: {'source': 'example_data/2401.00908.pdf'}
--------
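
The cells above print only `compressed_docs[0]`, although the retriever can return several passages. When a top hit looks off-topic, as it arguably does for the DocLLM architecture query, it can help to see every hit side by side. The following is a minimal sketch of such a helper; `format_hits` and the `Doc` stand-in class are hypothetical, and the helper assumes only that each result exposes `page_content` and `metadata`, the two attributes already used in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes the notebook reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_hits(docs, max_chars: int = 200) -> str:
    """Return one line per retrieved document: rank, source, and a short snippet."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        # Collapse newlines and repeated spaces so each hit fits on one line.
        snippet = " ".join(doc.page_content.split())
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(f"[{rank}] {source}: {snippet}")
    return "\n".join(lines)


hits = [
    Doc("4.3 Downstream Evaluation\n\nExperimental settings.", {"source": "example_data/2401.00908.pdf"}),
    Doc("Coherence: Comparison of coherence between ground truths and predictions."),
]
print(format_hits(hits, max_chars=60))
```

In the notebook itself, the same call would presumably be `print(format_hits(compressed_docs))`.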

In [6]:
# Run through some sample queries

query1 = "What are the metrics used to evaluate the answers?"
query2 = "How many pdf data were collected from the USA?"
query3 = "What is the DocLLM architecture ?"
query4 = "Which countries were used to collect dataset?"
query5 = "Where was the agriculture dataset collected for the USA?"
query6 = "how was the content and structure of available documents augmented?"  # same text as query8; not used below
query7 = "What was the answer generation process used in the paper?"
query8 = "how was the content and structure of available documents augmented?"

queries = [
    # query1,
    query2,
    # query3,
    query4,
    query5,
    query7,
    query8,
]

for i, q in enumerate(queries, start=1):
    print(f"Example {i}: Query->", q)
    print(".." * 50)
    print("Retrieved document:")
    compressed_docs = compression_retriever.get_relevant_documents(q)
    print(compressed_docs[0].page_content)
    print("metadata:", compressed_docs[0].metadata)
    print("==" * 50, "\n\n")

Example 1: Query-> How many pdf data were collected from the USA?
....................................................................................................
Retrieved document:
3.1 USA

We collected documents, handbooks, and reports publicly available online from the United States Department of Agriculture (USDA), state agriculture and consumer services agencies, and partners from the Land-Grant Institutions National Program. Available documents contain federal regulatory and policy information surrounding crop and livestock management, information on diseases and best practices, quality assurance and export regulations, details on assistance programs, as well as insurance and pricing guidelines. Collected data totals more than 23k PDF files with over 50M tokens, representing 44 states in the USA. We downloaded and preprocessed these files, extracting the textual information that could be used as input to the Q&A generation pipeline. To benchmark and evaluate the models, we e