# Document Search and Retrieval for complex documents using RAG


<img src="arch.png" width=400px>

### Process document and create vectorstore

In [1]:
from rag_101.retriever import (
    load_pdf,
    split_text,
    load_embedding_model,
    load_reranker_model,
    generate_embeddings,
    rerank_docs,
)

Load documents and split into chunks

In [2]:
# Load two 10-K filings (PDF)
# To use your own documents, update the paths below
files = ["10k_docs/10k.pdf", "10k_docs/amazon-10k.pdf"]
loaders = load_pdf(files=files)

# Split the loaded documents into chunks
documents = split_text(loaders=loaders, chunk_size=1000)

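The implementation of `split_text` is not shown in this notebook. As a rough illustration of what chunking does, here is a minimal sketch that cuts a string into fixed-size character windows with a small overlap. The function name `split_into_chunks`, the `overlap` parameter, and the choice of characters as the unit are all assumptions made for this example, not the behaviour of `rag_101`.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Hypothetical helper: split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 2500 characters, so the last chunk is shorter than chunk_size
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_into_chunks(sample, chunk_size=1000, overlap=100)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some text twice.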


Generate embeddings and store in a vector database

In [3]:
# initialize models
embedding_model = load_embedding_model(model_name="BAAI/bge-large-en-v1.5")
reranker_model = load_reranker_model(reranker_model_name="BAAI/bge-reranker-large")

# generate embeddings and store in vector database
print("Generating embeddings... This might take some time.")
vectorstore = generate_embeddings(documents, embedding_model=embedding_model)
# the retriever returns the top 10 chunks for each query
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

Generating embeddings... This might take some time.
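To make the retrieval step more concrete, here is a minimal brute-force sketch of top-k search by cosine similarity. It is only an illustration: the vector store created by `generate_embeddings` is not shown here and may use a different similarity measure or an approximate index. The names `cosine_similarity` and `top_k` and the two-dimensional toy vectors are assumptions for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], doc_vectors: list[list[float]], k: int = 10):
    """Hypothetical helper: return (index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


# Toy 2-D "embeddings"; real embedding vectors have many more dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, score in top_k([1.0, 0.1], toy_vectors, k=2):
    print(index, round(score, 3))
```

Comparing the query against every stored vector is fine for a toy example but scales linearly with the number of chunks, which is why vector databases typically build an index instead.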


### Query the input document

In [4]:
query = "What are the earning of Microsoft ?"

retrieved_documents = retriever.get_relevant_documents(query)
reranked_documents = rerank_docs(reranker_model, query, retrieved_documents)

print("\nUser query:", query)
print("--" * 50)
print("Retrieved content:")
# reranked_documents[0] is the top-ranked entry; its first element is the document
print(reranked_documents[0][0].page_content)
print("--" * 50)
print("metadata:", reranked_documents[0][0].metadata)


User query: What are the earning of Microsoft ?
----------------------------------------------------------------------------------------------------
Retrieved content:
Search and news advertising revenue increased $2.3 billion or 25%. Search and news advertising revenue excluding traffic acquisition costs increased 27% driven by higher revenue per search and search volume.

Gaming revenue increased $860 million or 6% on a strong prior year comparable that benefited from Xbox Series X|S launches and stay-at-home scenarios, driven by growth in Xbox hardware and Xbox content and services. Xbox hardware revenue increased 16% due to continued demand for Xbox Series X|S. Xbox content and services revenue increased 3% driven by growth in Xbox Game Pass subscriptions and first- party content, offset in part by a decline in third-party content.

Surface revenue increased $226 million or 3%.

Operating income increased $1.5 billion or 8%.

Gross margin increased $3.1 billion or 10% driven by gr
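The cell above indexes the reranker output as `reranked_documents[0][0]`, which suggests a list of tuples with the document first, ordered best match first. The sketch below mimics that shape under the assumption that the second element is a relevance score. The scoring function is a deliberately crude token-overlap stand-in, not the `BAAI/bge-reranker-large` model, and the names `rerank` and `token_overlap_score` as well as the candidate strings are made up for this example.

```python
from typing import Callable


def token_overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens that also appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)


def rerank(
    query: str, docs: list[str], score_fn: Callable[[str, str], float]
) -> list[tuple[str, float]]:
    """Hypothetical helper: score every (query, doc) pair and sort by score, best first."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy candidates standing in for the chunks returned by the retriever
candidates = [
    "Surface revenue increased",
    "AWS sales increased in 2019",
    "Gaming revenue increased",
]
reranked = rerank("AWS sales in 2019", candidates, token_overlap_score)
print(reranked[0][0])  # top-ranked candidate, accessed the same way as in the cell above
```

The design point is that the reranker scores each query and document together, so it can reorder the retriever's candidates; it is applied only to the small retrieved set because scoring every chunk this way would be too slow.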

### Run through some sample queries and observe the results

In [5]:
query1 = "How much were the AWS sales in 2019?"
query2 = "How does foreign exchange rate impact sales growth in the context provided?"
query3 = "How is revenue recognized for cloud services that include both on-premises software licenses and cloud services?"
query4 = "What factors are considered when determining the amount of revenue to recognize?"
query5 = "When is the Annual Meeting of Shareholders?"
query6 = "What are some examples of PC accessories from Microsoft?"
query7 = "What type of devices does Microsoft offer?"
query8 = "Net income of Amazon"

# query6 is defined above but is not part of this run
queries = [
    query1,
    query2,
    query3,
    query4,
    query5,
    query7,
    query8,
]

for i, query in enumerate(queries):
    print(f"Example {i+1}: Query->", query)
    print(".." * 50)
    print("Retrieved document:")

    retrieved_documents = retriever.get_relevant_documents(query)
    print("--" * 50)
    print(len(retrieved_documents), retrieved_documents)
    reranked_documents = rerank_docs(reranker_model, query, retrieved_documents)
    print("Reranked documents:")
    print("--" * 50)
    print(len(reranked_documents), reranked_documents)
    print("Reranked document:")
    print("--" * 50)
    print(reranked_documents[0][0].page_content)
    print("--" * 50)
    print("metadata:", reranked_documents[0][0].metadata)
    print("==" * 50, "\n\n")

Example 1: Query-> How much were the AWS sales in 2019?
....................................................................................................
Retrieved document:
----------------------------------------------------------------------------------------------------
10 [Document(page_content='AWS sales increased 37% in 2019, compared to the prior year. The sales growth primarily reflects increased customer usage, partially offset by pricing\n\nchanges. Pricing changes were driven largely by our continued efforts to reduce prices for our customers.\n\n24\n\nTable of Contents\n\nOperating Income (Loss)\n\nOperating income (loss) by segment is as follows (in millions):\n\nYear Ended December 31,\n\n2018\n\n2019\n\nOperating Income (Loss):\n\nNorth America\n\n$\n\n7,267 $\n\n7,033\n\nInternational\n\n(2,142)\n\n(1,693)\n\nAWS\n\n7,296\n\n9,201\n\nConsolidated\n\n$\n\n12,421\n\n$\n\n14,541\n\nOperating income was $12.4 billion and $14.5 billion for 2018 and 2019. We believe that 

Reranked documents:
----------------------------------------------------------------------------------------------------
10 [(Document(page_content='AWS\n\nThe AWS segment consists of amounts earned from global sales of compute, storage, database, and other service offerings for start-ups, enterprises,\n\ngovernment agencies, and academic institutions.\n\nInformation on reportable segments and reconciliation to consolidated net income (loss) is as follows (in millions):\n\nYear Ended December 31,\n\n2017\n\n2018\n\nNorth America\n\nNet sales\n\n$\n\n106,110 $\n\n141,366 $\n\nOperating expenses\n\n103,273\n\n134,099\n\nOperating income\n\n$\n\n2,837 $\n\n7,267 $\n\nInternational\n\nNet sales\n\n$\n\n54,297 $\n\n65,866 $\n\nOperating expenses\n\n57,359\n\n68,008\n\nOperating income (loss)\n\n$\n\n(3,062) $\n\n(2,142) $\n\nAWS\n\nNet sales\n\n$\n\n17,459 $\n\n25,655 $\n\nOperating expenses\n\n13,128\n\n18,359\n\nOperating income\n\n$\n\n4,331 $\n\n7,296 $\n\nConsolidated\n\nNet sales\n\n$