
### <a id='toc1_1_1_'></a>[Reranking](#toc0_)

After the core retrieval step, one can add a reranking module to a RAG pipeline. While optional, this component is widely used in standard applications. The idea behind reranking is to reorder and/or filter the set of chunks returned by the retriever in order to refine the context further. This can be accomplished by various methods, some of which we will explore in this notebook.

**Table of contents**<a id='toc0_'></a>    
- [Setup](#toc2_)    
- [Simple Strategies](#toc3_)    
  - [Filter](#toc3_1_)    
  - [BM 25 reranking](#toc3_2_)    
- [Reranking models and chains](#toc4_)    
  - [BGE](#toc4_1_)    
  - [ColBert](#toc4_2_)    
- [To go the extra mile - LLM assisted reranking](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[Setup](#toc0_)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
from pathlib import Path

from dotenv import load_dotenv

os.chdir(Path.cwd().joinpath(".."))
print(Path.cwd())
load_dotenv(override=True)

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.schema import Document
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from ragatouille import RAGPretrainedModel
from rank_bm25 import BM25Okapi

from lib.models import embeddings, llm
from lib.prompts import RERANKER_AGENT_PROMPT
from lib.utils import (
    build_vector_store,
    load_documents,
    load_vector_store,
    split_documents_basic,
)

We load the vector store and perform a simple retrieval of 10 chunks, which will then be refined in our reranking step.

In [None]:
test_query = "How can cloud adoption help financial institutions"

BASE_CHUNK_SIZE = 512

# build vector_store
base_documents = split_documents_basic(load_documents("data/amf_training"), BASE_CHUNK_SIZE, include_linear_index=True)

build_vector_store(
    base_documents,
    embeddings,
    collection_name="amf_training",
    distance_function="cosine",
    erase_existing=False,
)

# Load Vector store / retriever
chroma_vector_store = load_vector_store(embeddings, "amf_training")
chroma_vector_store_retriever = chroma_vector_store.as_retriever()

In [None]:
retrieved_chunks = chroma_vector_store.similarity_search_with_score(test_query, k=10)

# <a id='toc3_'></a>[Simple Strategies](#toc0_)

Reranking as a general process can be split into two parts: a scoring mechanism and a rule-based filter. At a basic level, the scoring mechanism ranks the chunks by relevance, and the filter then keeps only the top k chunks (with k configurable).

## <a id='toc3_1_'></a>[Filter](#toc0_)

The most basic form of reranking uses the scores from the retrieval step to filter the chunks. One can implement a simple rule-based filter that removes the chunks that are not similar enough to the query. This minimizes the risk of feeding the LLM with context that is virtually useless. In the specific case where all retrieved chunks fall outside the accepted similarity zone, this can be used to trigger an automated response or a different pipeline and avoid hallucinations. In this example the scores returned with the chunks are cosine distances (lower means more similar), so we keep the chunks whose distance is below a 0.3 threshold.

In [None]:
def filter_chunks_on_scores(retrieved_chunks: list[tuple[Document, float]], threshold: float = 0.3) -> list[Document]:
    """Keeps only the chunks whose distance score is below the threshold.

    Args:
        retrieved_chunks (list[tuple[Document, float]]): list of retrieved chunks and associated distance scores
        threshold (float): maximum accepted distance. Defaults to 0.3.

    Returns:
        list[Document]: chunks that pass the filter, without their scores

    Raises:
        ValueError: if no chunk passes the filter
    """
    filtered_chunks = [chunk for chunk, score in retrieved_chunks if score < threshold]
    if not filtered_chunks:
        raise ValueError("no chunks fit the similarity threshold")
    return filtered_chunks
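
As a rough illustration of the fallback behaviour described above, the sketch below handles the empty-filter case by returning a canned answer instead of a context. The names `FALLBACK_ANSWER`, `keep_close_chunks` and `build_context_or_fallback` are hypothetical, the scores are made up, and plain strings stand in for `Document` objects so that the cell runs on its own.

```python
FALLBACK_ANSWER = "I could not find relevant information to answer this question."  # hypothetical canned reply


def keep_close_chunks(scored_chunks: list[tuple[str, float]], max_distance: float = 0.3) -> list[str]:
    """Keeps the chunks whose distance to the query is below max_distance."""
    return [chunk for chunk, distance in scored_chunks if distance < max_distance]


def build_context_or_fallback(scored_chunks: list[tuple[str, float]], max_distance: float = 0.3) -> tuple[bool, str]:
    """Returns (True, context) when some chunks pass the filter, else (False, fallback answer)."""
    kept = keep_close_chunks(scored_chunks, max_distance)
    if not kept:
        # Nothing relevant was retrieved: skip the llm call entirely
        return False, FALLBACK_ANSWER
    return True, "\n\n".join(kept)


toy_chunks = [("Cloud lowers infrastructure costs.", 0.21), ("Unrelated legal notice.", 0.62)]
print(build_context_or_fallback(toy_chunks))
print(build_context_or_fallback([("Unrelated legal notice.", 0.62)]))
```

Returning a flag rather than raising keeps the calling pipeline free to decide whether to answer with the fallback or to route the query elsewhere.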

From now on we will look at different scoring methods that are distinct from the similarity score returned by the embeddings. Once a score is attributed, we will then rank and filter the chunks based on the score with a simple custom function:

In [None]:
def extract_ranked_chunks(reranked_chunks: list[tuple[Document, float]], k: int = 5) -> list[Document]:
    """Uses reranked scores to rank the chunks and keep the top k.

    Args:
        reranked_chunks (list[tuple[Document, float]]): list of (Document, score) tuples from reranking
        k (int, optional): number of top chunks to keep. Defaults to 5.

    Returns:
        list[Document]: final chunks, best score first
    """
    # Sort the chunks by score, highest first
    ranked_chunks = sorted(reranked_chunks, key=lambda x: x[1], reverse=True)
    # Take only the top k
    ranked_chunks = ranked_chunks[:k]
    # Drop chunks with a null score and strip the score values from the list
    return [chunk for chunk, score in ranked_chunks if score > 0]
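
To make the ranking rule concrete, here is a toy run of the same sort-then-truncate logic. This is only a sketch: `top_k_by_score` is a hypothetical name, the scores are made up, and plain strings stand in for `Document` objects.

```python
def top_k_by_score(scored_items: list[tuple[str, float]], k: int = 5) -> list[str]:
    """Sorts by descending score, keeps the top k, then drops non-positive scores."""
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)[:k]
    return [item for item, score in ranked if score > 0]


toy_scores = [("chunk a", 0.4), ("chunk b", 0.0), ("chunk c", 2.1), ("chunk d", 1.3)]
print(top_k_by_score(toy_scores, k=3))
```

Since the null-score filter runs after truncation, fewer than k chunks may be returned.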

## <a id='toc3_2_'></a>[BM 25 reranking](#toc0_)

As mentioned previously, the basic premise of reranking relies on a score computation. Naturally, one can use many common NLP techniques to generate such scores. The BM25 model that we looked at briefly in the retrieval part can be used for this purpose. Here, we will use a more sequential approach and use the BM25 model to rerank the chunks.

BM25 works as a more refined TF-IDF scoring method. At a base level, it computes the frequency of the query terms within each document, so that documents where the query words are most frequent score better. However, it also incorporates corrections such as term-frequency saturation and document-length normalization to avoid common sources of noise. It thus matches on words rather than on semantic meaning.
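
To illustrate the scoring idea, here is a minimal pure-Python sketch of the textbook BM25 formula. It is not the rank_bm25 implementation: the function name `bm25_scores` is hypothetical, the idf uses the "+ 1" variant to stay positive, and `k1 = 1.5` and `b = 0.75` are simply common default values.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], corpus: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Scores every tokenized document of the corpus against the tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones are boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms weigh more (the "+ 1" keeps the idf positive)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # The term frequency saturates: repeating a word has diminishing returns
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "cloud adoption helps banks cut costs".split(" "),
    "the weather is nice today".split(" "),
    "cloud cloud cloud cloud cloud cloud".split(" "),
]
print(bm25_scores("cloud adoption".split(" "), toy_corpus))
```

In this toy corpus the first document, which matches both query terms once, outranks the third one, which repeats a single query term six times: this is the saturation effect at work.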

The model implementation from the rank_bm25 package is very simple, and does not include tokenization. For better performance, one can perform some refined tokenization and lemmatization, but for the sake of this example we will stay simple.
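
As a hint of what a slightly more refined tokenization could look like, the sketch below lowercases the text and strips punctuation with a regular expression. The helper name `simple_tokenize` is hypothetical, and lemmatization is left out since it would require an extra NLP library.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Lowercases the text and extracts runs of word characters, which drops punctuation."""
    return re.findall(r"\w+", text.lower())


print(simple_tokenize("How can Cloud adoption help financial institutions?"))
```

With a plain `split(" ")`, "Cloud" and "cloud" or "institutions?" and "institutions" would be treated as different words and never match.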

In [None]:
tokenized_corpus = [chunk.page_content.split(" ") for chunk in filter_chunks_on_scores(retrieved_chunks, 0.3)]

tokenized_query = test_query.split(" ")

We can now obtain the scores and keep the top 5 chunks.

In [None]:
bm25 = BM25Okapi(tokenized_corpus)

scores = bm25.get_scores(tokenized_query)
print(scores)

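# Pair each score with the chunk it was computed for (the filtered chunks, not the raw retrieval)
filtered_chunks = filter_chunks_on_scores(retrieved_chunks, 0.3)
reranked_chunks = extract_ranked_chunks(list(zip(filtered_chunks, scores)))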

reranked_chunks

# <a id='toc4_'></a>[Reranking models and chains](#toc0_)

For the following methods, we will use a more streamlined approach that incorporates LCEL chains. While the above BM25 showcase is great for understanding how a reranker works, it is much lighter to integrate retrieval and reranking together in a chain, using the Document Compressor object, which can wrap reranking models.

Our first step is to define the base retriever from the vector store, which we will then be able to use in a chain. Filtering by similarity score is integrated within langchain directly when searching with a threshold, the "similarity" here being 1 - distance, so we set a 0.7 boundary to mimic the 0.3 cosine distance filter.
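
The equivalence between the two thresholds can be checked with a small sketch. The helper names below are hypothetical and the distances are made up; only the "1 - distance" conversion comes from the explanation above.

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Converts a cosine distance into the "1 - distance" relevance score."""
    return 1.0 - distance


def passes_distance_filter(distance: float, max_distance: float = 0.3) -> bool:
    """Filter used earlier in the notebook: keep chunks close enough to the query."""
    return distance < max_distance


def passes_similarity_filter(distance: float, min_similarity: float = 0.7) -> bool:
    """Same filter expressed on the similarity scale."""
    return cosine_distance_to_similarity(distance) >= min_similarity


for toy_distance in (0.1, 0.25, 0.45, 0.9):
    print(toy_distance, passes_distance_filter(toy_distance), passes_similarity_filter(toy_distance))
```

The two filters agree on every value except possibly right at the boundary, where the strict and non-strict comparisons (and floating-point rounding) can differ.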

In [None]:
chroma_retriever = chroma_vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 20, "score_threshold": 0.7},
)

The idea behind reranking models is to account for the shortcomings of the embedding/retrieval process, and thus such models can take many different forms.

To understand how different models work, we will detail three different types of document retrieval models:

* Embedding models like ada-v2 are single-vector, no-interaction: the query and the documents are embedded separately, each as a single vector, and the similarity score is only computed at the end. This has the main advantage of computational efficiency, as the document embeddings can be pre-computed.

* Cross-encoder models like BGE or BERT are multi-vector, full-interaction: here every snippet of the documents is encoded together with the query and a score is output, which is then pooled. This provides a deeper analysis of the relationships between the document and the query, at the cost of computational performance.

* Finally, models like ColBert are multi-vector, late-interaction, and combine the best of both worlds: the query and document are encoded separately, but they are also broken down into multiple snippets, which are then passed through an interaction matrix whose output is pooled to create a score that reflects deeper relationships, while still allowing for pre-computation of document embeddings.
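
The difference between single-vector and late-interaction scoring can be sketched on toy vectors. This is an illustration under strong assumptions: the 2-dimensional "token embeddings" are made up, mean pooling is only a stand-in for how an embedding model produces one vector, and the late-interaction score uses the sum-of-maximum-similarities rule usually associated with ColBert. All function names are hypothetical.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: list[float], v: list[float]) -> float:
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Collapses several token vectors into a single vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def single_vector_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """No interaction: one vector per text, compared once at the end."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))


def late_interaction_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Late interaction: each query token keeps its best match among the document tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
toy_doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

print(late_interaction_score(toy_query, toy_doc_a), late_interaction_score(toy_query, toy_doc_b))
print(single_vector_score(toy_query, toy_doc_a), single_vector_score(toy_query, toy_doc_b))
```

In both scoring functions the document token vectors do not depend on the query, which is why they can be pre-computed. A cross encoder has no such shortcut, since it must process each query and document pair together.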

We will look at two of these models next.

## <a id='toc4_1_'></a>[BGE](#toc0_)

As mentioned, BGE is an example of a cross-encoder model, which is a transformer-based architecture. This model is available through HuggingFace and can be integrated in langchain as a document compressor.

We first load our BGE model and wrap it into a reranker object, specifying the top k documents to extract.

In [None]:
bge_reranker = CrossEncoderReranker(model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base"), top_n=5)

We then build our combined retriever with the base and reranker, and invoke it.

In [None]:
bge_augmented_retriever = ContextualCompressionRetriever(base_compressor=bge_reranker, base_retriever=chroma_retriever)

In [None]:
bge_augmented_retriever.invoke(test_query)

## <a id='toc4_2_'></a>[ColBert](#toc0_)

We now look at ColBert, a multi-vector late-interaction model, which can be used in a similar fashion. We need the ragatouille package, where the model lives, but the implementation as a compressor is the same.

We load and wrap the model.

In [None]:
colbert_reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0").as_langchain_document_compressor()

We build the chained retriever and invoke it.

In [None]:
colbert_augmented_retriever = ContextualCompressionRetriever(
    base_compressor=colbert_reranker, base_retriever=chroma_retriever
)

In [None]:
colbert_augmented_retriever.invoke(test_query)

# <a id='toc5_'></a>[To go the extra mile - LLM assisted reranking](#toc0_)

Finally, we will look at one more technique that does not use reranking-specific models, but instead an LLM agent to evaluate the rank of each chunk. The idea is pretty straightforward: the whole retrieved set of documents is fed to an llm, with instructions to rank them by relevance to the query.

Unfortunately, there is no method yet in langchain to implement an easy chain with any llm object. Some frameworks such as FlashRank or Cohere can be integrated as document compressors, but they rely on their own models or APIs rather than on an arbitrary llm. We wish to use our AzureOpenAI llms directly, so we create a custom chain with our own prompt to achieve the same result.

We will then create our pipeline using LCEL chains.

We first create a simple function for document formatting.

In [None]:
def format_documents_for_llm_reranking(documents: list[Document]) -> str:
    """Takes all documents from a list and assembles them into a single string with identifiers.

    Args:
        documents (list[Document]): list of LC documents

    Returns:
        str: multi-line string
    """
    # Each identifier precedes the content it refers to, so that the indices
    # returned by the llm map directly onto positions in the input list
    return "\n\n".join(f"Document {i} below:\n{document.page_content}" for i, document in enumerate(documents))

We now set up our llm reranking chain.

In [None]:
prompt = ChatPromptTemplate.from_template(RERANKER_AGENT_PROMPT)

llm_rerank_chain = (
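    {
        "context": RunnableLambda(lambda x: format_documents_for_llm_reranking(x["context"])),
        # The chain input is a dict, so we extract the query string instead of passing the whole dict through
        "query": RunnableLambda(lambda x: x["query"]),
    }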
    | prompt
    | llm
    | StrOutputParser()
)

Finally, we set up the overall chain, which incorporates the previous chain.

In [None]:
full_agent_reranking_chain = (
    # We retrieve the relevant chunks
    {
        "context": chroma_retriever,
        "query": RunnablePassthrough(),
    }
    # We get the ranked indices by the llm
    | RunnablePassthrough.assign(llm_output=llm_rerank_chain)
    # We match them with the documents
    # (int() tolerates surrounding whitespace, so we only split on the commas)
    | RunnableLambda(lambda x: [x["context"][int(j)] for j in x["llm_output"].split(",")])
)
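
The last step of the chain assumes that the llm answers with a clean comma-separated list of integers such as `2, 0, 1`. Since llm outputs are not always that disciplined, a more defensive parser can be sketched as follows. The helper name `parse_ranked_indices` is hypothetical, and the expected answer format is an assumption about what the prompt asks for.

```python
import re


def parse_ranked_indices(llm_output: str, n_documents: int) -> list[int]:
    """Extracts valid, unique document indices from the llm answer, in ranked order."""
    indices: list[int] = []
    for match in re.findall(r"\d+", llm_output):
        index = int(match)
        # Ignore indices that point outside the retrieved set, as well as repetitions
        if index < n_documents and index not in indices:
            indices.append(index)
    return indices


print(parse_ranked_indices("2, 0, 1", 3))
print(parse_ranked_indices("Ranking: 2,0 , 7, 2", 3))
```

Such a function could replace the inline list comprehension in the final `RunnableLambda`, at the cost of silently dropping malformed entries rather than raising an error.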

We can now invoke the full chain to get retrieval and reranking in one call.

In [None]:
full_agent_reranking_chain.invoke(test_query)[:5]