<a href="https://colab.research.google.com/github/anikch/RAG-LOTR-Long-Context-Reorder/blob/main/RAG_%2B_LOTR_%2B_Long_Context_Reorder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

"Lord of the Retrievers" (LOTR), also known as MergerRetriever, takes a list of retrievers and merges the results of their get_relevant_documents() calls into a single list. The merged list contains the documents relevant to the query, each ranked by the retriever that returned it.

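The merging step can be pictured as a round-robin interleave: the first hit from every retriever, then the second hit from every retriever, and so on. The sketch below is a plain-Python illustration of that pattern under this assumption, not the library's source; `interleave_results` and the sample strings are hypothetical names used only for this example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_results(result_lists: Sequence[Sequence[T]]) -> List[T]:
    """Round-robin merge: rank 0 from every list, then rank 1, and so on."""
    merged: List[T] = []
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            # Shorter lists simply run out and are skipped.
            if rank < len(results):
                merged.append(results[rank])
    return merged


# Hypothetical hits from two retrievers, each already sorted by its own relevance.
accounting_hits = ["acct-1", "acct-2", "acct-3"]
judgment_hits = ["court-1", "court-2"]
merged_hits = interleave_results([accounting_hits, judgment_hits])
print(merged_hits)
```

In this sketch no scores are compared across retrievers, so the first item is always the first retriever's top hit, whether or not it is relevant to the query.
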
Regardless of the underlying model architecture, performance degrades significantly once more than about ten retrieved documents are passed to the model. When a model has to sift through a long context to find the relevant passage, it often overlooks the documents provided. Research on this "lost in the middle" effect (https://arxiv.org/abs/2307.03172) shows that models struggle to use information located in the middle of a long context and perform best when the key information sits at the beginning or end of the input. Performance also declines as the input grows longer, even for models explicitly designed for long contexts. To mitigate this, LongContextReorder rearranges the retrieved documents so that the most relevant ones sit at the start and end of the list, which helps when a large number of top-k results is needed.
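
The rearrangement described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the input is already sorted from most to least relevant; `reorder_for_long_context` is a hypothetical helper name, and the integers stand in for documents.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(ranked: Sequence[T]) -> List[T]:
    """Place the strongest items at both ends and the weakest in the middle.

    `ranked` is assumed to be sorted from most to least relevant.
    """
    reordered: List[T] = []
    # Walk from least to most relevant, alternating between front and back,
    # so the last items placed (the most relevant) end up at the edges.
    for position, item in enumerate(reversed(ranked)):
        if position % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


ranks = [1, 2, 3, 4, 5]  # 1 = most relevant, 5 = least relevant
reordered_ranks = reorder_for_long_context(ranks)
print(reordered_ranks)  # [1, 3, 5, 4, 2]
```

The two most relevant items (1 and 2) land at the two ends of the list, and the least relevant item (5) lands in the middle, where the model is most likely to overlook it.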

In [14]:
!pip install langchain
!pip install chromadb
!pip install sentence_transformers
!pip install pypdf



In [15]:
import os
import chromadb
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers.merger_retriever import MergerRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceBgeEmbeddings

In [16]:
# Load the embedding model

embedding_model= "BAAI/bge-large-en"
model_kwargs= {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
hf_bge_em= HuggingFaceBgeEmbeddings(model_name= embedding_model, model_kwargs= model_kwargs, encode_kwargs= encode_kwargs)

In [17]:
text_splitter= RecursiveCharacterTextSplitter(chunk_size= 1000, chunk_overlap= 100)

In [18]:
# Load documents and create chunks

doc1= PyPDFLoader("/content/leac203.pdf").load()
doc1_chunks= text_splitter.split_documents(doc1)

doc2= PyPDFLoader("/content/murder-case-section-302-ipc-409077.pdf").load()
doc2_chunks= text_splitter.split_documents(doc2)

In [19]:
# Create vector stores and store vectors

vec_store1= Chroma.from_documents(doc1_chunks, hf_bge_em, collection_metadata= {"hnsw:space": "cosine"}, persist_directory= "vec_st1")
vec_store2= Chroma.from_documents(doc2_chunks, hf_bge_em, collection_metadata= {"hnsw:space": "cosine"}, persist_directory= "vec_st2")

In [20]:
# Load Vector Store
vec_st1= Chroma(persist_directory= "/content/vec_st1", embedding_function= hf_bge_em)
vec_st2= Chroma(persist_directory= "/content/vec_st2", embedding_function= hf_bge_em)

# Create MergerRetriever / Lord of the Retrievers

In [42]:
# Build one MMR retriever per vector store (top 5 chunks each),
# then merge the two retrievers into a single MergerRetriever
vec_st1_retriever= vec_store1.as_retriever(search_type= "mmr", search_kwargs={"k": 5, "include_metadata": True})
vec_st2_retriever= vec_store2.as_retriever(search_type= "mmr", search_kwargs={"k": 5, "include_metadata": True})
merged_retriver= MergerRetriever(retrievers= [vec_st1_retriever, vec_st2_retriever])
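
Both retrievers use search_type "mmr" (maximal marginal relevance), which trades relevance to the query against similarity to chunks already selected. The sketch below shows the general greedy selection in plain Python; it is not the vector store's implementation, and `mmr_select`, `cosine`, the toy vectors, and the 0.5 weighting are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query: Vector, docs: Sequence[Vector], k: int,
               lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick up to k document indices, balancing relevance and diversity."""
    selected: List[int] = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query, docs[i])
            # Penalise candidates that resemble something already picked.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


toy_query = [1.0, 0.3]
toy_docs = [
    [1.0, 0.0],   # index 0: relevant, but nearly parallel to index 1
    [0.99, 0.1],  # index 1: most similar to the query
    [0.6, 0.8],   # index 2: less similar to the query, but different from index 1
]
picked = mmr_select(toy_query, toy_docs, k=2)
print(picked)
```

With the weighting set to 1.0 the function reduces to plain top-k by similarity and would pick indices 1 and 0; with 0.5 the second pick switches to index 2 because index 0 is almost a duplicate of index 1.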


In [43]:
# Get merged context
for chunks in merged_retriver.get_relevant_documents("What was the judgement given by the High Court?"):
    print(chunks.page_content)

so received as share warrants.
Money received against share warrants’ to be disclosed as a separate line item
under ‘shareholder’s fund’.
Rationalised 2023-24
4.3It is further submitted that the High Court has rightly observed and
held that the incident occurred in a sudden fight in the heat of passion on
a sudden quarrel in the mehendi ceremony and that the weapon used
was “Phakadiyat” which is used as a firewood primarily where food is
being cooked and where in the heat of passion the accused picked up
the “Phakadiyat” and used the same and therefore the case would fall
under Fourth exception to Section 300 IPC.  It is therefore submitted that
the High Court has rightly altered the finding of murder to one of culpable
homicide  not  amounting  to  murder  and  has  rightly  converted  the
sentence from life imprisonment to ten years rigorous imprisonment.
4.4Making the above submissions, it is prayed to dismiss the present
appeal.
5.We have heard the learned counsel for the respectiv

In [44]:
# Inspect the full Document objects, including their metadata
relevant_docs= merged_retriver.get_relevant_documents('What was the judgement given by the High Court?')
relevant_docs

[Document(page_content='so received as share warrants.\nMoney received against share warrants’ to be disclosed as a separate line item\nunder ‘shareholder’s fund’.\nRationalised 2023-24', metadata={'page': 7, 'source': '/content/leac203.pdf'}),
 Document(page_content='4.3It is further submitted that the High Court has rightly observed and\nheld that the incident occurred in a sudden fight in the heat of passion on\na sudden quarrel in the mehendi ceremony and that the weapon used\nwas “Phakadiyat” which is used as a firewood primarily where food is\nbeing cooked and where in the heat of passion the accused picked up\nthe “Phakadiyat” and used the same and therefore the case would fall\nunder Fourth exception to Section 300 IPC.  It is therefore submitted that\nthe High Court has rightly altered the finding of murder to one of culpable\nhomicide  not  amounting  to  murder  and  has  rightly  converted  the\nsentence from life imprisonment to ten years rigorous imprisonment.\n4.4Making 

The top-ranked chunk is not relevant to the query: it comes from leac203.pdf and concerns share warrants, not the court judgment.
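
One quick way to see where each merged chunk comes from is to list the `source` metadata by rank. The sketch below is illustrative only: `StubDocument` is a hypothetical stand-in that exposes the same `page_content` and `metadata` attributes as the Document objects printed above, `sources_by_rank` is a hypothetical helper, and the second stub's source path is assumed from the PDF loaded earlier because the printed output is truncated.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class StubDocument:
    """Hypothetical stand-in with the same two attributes as a retrieved Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def sources_by_rank(docs: Iterable) -> List[Tuple[int, str]]:
    """Return (rank, source) pairs in retrieval order."""
    return [
        (rank, str(doc.metadata.get("source", "unknown")))
        for rank, doc in enumerate(docs)
    ]


# Stub values mirroring the first two chunks shown above (second source is assumed).
sample_docs = [
    StubDocument("so received as share warrants.",
                 {"page": 7, "source": "/content/leac203.pdf"}),
    StubDocument("4.3It is further submitted that the High Court has rightly observed and",
                 {"source": "/content/murder-case-section-302-ipc-409077.pdf"}),
]
ranked_sources = sources_by_rank(sample_docs)
for rank, source in ranked_sources:
    print(rank, source)
```

The same helper should work on the list returned by merged_retriver.get_relevant_documents(...), since it only reads the `metadata` attribute of each item.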

# Handle 'Lost in the Middle' using Long Context Reorder

In [47]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import (
    EmbeddingsRedundantFilter,
    LongContextReorder,
)

In [48]:
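# Drop near-duplicate chunks first, then reorder what remains so the most
# relevant chunks sit at the start and end of the context
redundant_filter= EmbeddingsRedundantFilter(embeddings= hf_bge_em)
reordering= LongContextReorder()
pipeline = DocumentCompressorPipeline(transformers=[redundant_filter, reordering])
compression_retriever_reordered= ContextualCompressionRetriever(
    base_compressor= pipeline, base_retriever= merged_retriver
)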
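
The redundancy filter in this pipeline removes chunks whose embeddings are almost identical to another chunk's. A simplified sketch of that idea follows; it keeps the first chunk of every near-duplicate group, which may differ from the library's exact tie-breaking, and `drop_near_duplicates`, the toy vectors, and the 0.95 threshold are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(vectors: Sequence[Vector], threshold: float = 0.95) -> List[int]:
    """Return the indices to keep, skipping any vector too similar to one already kept."""
    kept: List[int] = []
    for index, vector in enumerate(vectors):
        if all(cosine_similarity(vector, vectors[k]) <= threshold for k in kept):
            kept.append(index)
    return kept


toy_vectors = [
    [1.0, 0.0],    # kept
    [0.99, 0.05],  # nearly parallel to the first vector, dropped
    [0.0, 1.0],    # orthogonal to the first vector, kept
]
kept_indices = drop_near_duplicates(toy_vectors)
print(kept_indices)  # [0, 2]
```

Filtering before reordering means the positions at the start and end of the context are not spent on two copies of essentially the same chunk.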

In [50]:
for chunks in compression_retriever_reordered.get_relevant_documents("What was the judgement given by the High Court?"):
    print(chunks.page_content)

4.3It is further submitted that the High Court has rightly observed and
held that the incident occurred in a sudden fight in the heat of passion on
a sudden quarrel in the mehendi ceremony and that the weapon used
was “Phakadiyat” which is used as a firewood primarily where food is
being cooked and where in the heat of passion the accused picked up
the “Phakadiyat” and used the same and therefore the case would fall
under Fourth exception to Section 300 IPC.  It is therefore submitted that
the High Court has rightly altered the finding of murder to one of culpable
homicide  not  amounting  to  murder  and  has  rightly  converted  the
sentence from life imprisonment to ten years rigorous imprisonment.
4.4Making the above submissions, it is prayed to dismiss the present
appeal.
5.We have heard the learned counsel for the respective parties at
length.
At the outset, it is required to be noted that the trial Court convicted
side, a further statement of the accused under Section 313 Cr.P.C

After redundancy filtering and reordering, the top chunk comes from the court judgment and is relevant to the query.