<a href="https://colab.research.google.com/github/duper203/RAG_Techniques_with_upstage/blob/main/upstage/15_reranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reranking Methods in RAG Systems






## Key Components

Reranking systems typically include the following components:


1. Initial Retriever: Often a vector store using embedding-based similarity search.
2. Reranking Model: Either of the following:
  * A Large Language Model (LLM) for scoring relevance
  * A Cross-Encoder model specifically trained for relevance assessment
3. Scoring Mechanism: A method to assign relevance scores to documents
4. Sorting and Selection Logic: To reorder documents based on new scores





## Method Details

1. Initial Retrieval: Fetch an initial set of potentially relevant documents.


2. Pair Creation: Form query-document pairs for each retrieved document.


3. Scoring:

  * LLM Method: Use prompts to ask the LLM to rate document relevance.
  * Cross-Encoder Method: Feed query-document pairs directly into the model.      

4. Score Interpretation: Parse and normalize the relevance scores.

5. Reordering: Sort documents based on their new relevance scores.

6. Selection: Choose the top K documents from the reordered list.
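
The six steps above can be sketched without any model at all. The snippet below is a minimal illustration, not the implementation used later in this notebook: `rerank` and `word_overlap` are hypothetical names, and the word-overlap scorer is only a stand-in for an LLM or cross-encoder.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each (query, document) pair, sort by score, and keep the top_k."""
    # Step 2: form one query-document pair per retrieved candidate
    pairs = [(query, doc) for doc in candidates]
    # Steps 3-4: score every pair (the scorer returns a float; higher means more relevant)
    scored = [(doc, score_fn(q, doc)) for q, doc in pairs]
    # Step 5: reorder by the new relevance score, highest first
    scored.sort(key=lambda item: item[1], reverse=True)
    # Step 6: keep only the top_k documents
    return scored[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy scorer (illustration only): fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Step 1 would normally be a vector-store search; here the candidates are hard-coded
candidates = [
    "Youth engagement drives climate action.",
    "Climate change reduces biodiversity in freshwater ecosystems.",
    "Biodiversity loss is linked to climate change.",
]
top = rerank("climate change biodiversity", candidates, word_overlap, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Because the scorer is passed in as a function, the same skeleton works whether the score comes from an LLM prompt or from a cross-encoder.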

In [None]:
! pip3 install -qU langchain-upstage langchain langchain-community faiss-cpu sentence_transformers pypdf

In [None]:
import os
from google.colab import userdata

from langchain_upstage import UpstageEmbeddings, ChatUpstage
from langchain_core.documents import Document
from typing import List, Dict, Any, Tuple
from langchain.chains import RetrievalQA
from langchain_core.retrievers import BaseRetriever
from sentence_transformers import CrossEncoder

from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")

## Define the document path

In [None]:
path = "data/Understanding_Climate_Change.pdf"

## Create a vector store

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using Upstage embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings and vector store
    embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore


def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents

In [None]:
vector_store = encode_pdf(path)

## Create a custom reranking function

In [None]:
import re

class RatingScore(BaseModel):
    relevance_score: float = Field(..., description="The relevance score of a document to a query.")

def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    """Score each document's relevance to the query with an LLM and return the top_n documents."""
    prompt_template = PromptTemplate(
        input_variables=["query", "doc"],
        template="""On a scale of 1-10, rate the relevance of the following document to the query. Consider the specific context and intent of the query, not just keyword matches.
        Query: {query}
        Document: {doc}
        Relevance Score:"""
    )
    llm = ChatUpstage(model='solar-pro')

    scored_docs = []
    for doc in docs:
        prompt = prompt_template.format(query=query, doc=doc.page_content)

        # Call the LLM with the generated prompt
        response = llm.invoke(prompt)
        score_text = response.content.strip()

        # Take the first number in the reply so that answers such as "8/10" still parse
        match = re.search(r"\d+(?:\.\d+)?", score_text)
        score = float(match.group()) if match else 0.0  # Default score if parsing fails
        scored_docs.append((doc, score))

    # Sort by score, highest first, and keep the top_n documents
    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked_docs[:top_n]]
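
An LLM answers a rating prompt in free text, so the score has to be parsed defensively before it can be compared across documents. The snippet below is a minimal sketch of the "parse and normalize" step, independent of any LLM client. The helper names `parse_score` and `normalize` are hypothetical, and the 1-10 range is taken from the prompt above.

```python
import re
from typing import Optional


def parse_score(reply: str, low: float = 1.0, high: float = 10.0) -> Optional[float]:
    """Return the first number in the reply, clamped to [low, high], or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(max(float(match.group()), low), high)


def normalize(score: float, low: float = 1.0, high: float = 10.0) -> float:
    """Map a score from [low, high] onto [0, 1]."""
    return (score - low) / (high - low)


for reply in ["8", "7/10", "Relevance Score: 9.5", "not relevant"]:
    score = parse_score(reply)
    print(repr(reply), "->", score, "->", None if score is None else round(normalize(score), 2))
```

Taking the first number is a simplification: a reply that restates the scale before giving the rating would be misread, so a stricter prompt or structured output is worth considering where that matters.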

## Example usage of the reranking function with a sample query relevant to the document

In [None]:
query = "What are the impacts of climate change on biodiversity?"
initial_docs = vector_store.similarity_search(query, k=15)
reranked_docs = rerank_documents(query, initial_docs)

# print first 3 initial documents
print("Top initial documents:")
for i, doc in enumerate(initial_docs[:3]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


# Print results
print(f"Query: {query}\n")
print("Top reranked documents:")
for i, doc in enumerate(reranked_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document

Top initial documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Document 2:
cultural perceptions.  
Youth Engagement  
Youth are vital stakeholders in climate action. Empowering young people through education, 
activism, and leadership opportunities can drive transformative c...

Document 3:
Freshwater Ecosystems  
Freshwater ecosystems, including rivers, lakes, and wetlands, are affected by changes in 
precipitation patterns, temperature, and water flow. These changes can lead to altered...
Query: What are the impacts of climate change on biodiversity?

Top reranked documents:

Document 1:
protection, and habitat creation.  
Climate -Resilient Conservation  
Conservation strategies must account for climate change impacts to be effective. This 
includes identifying climate refugia, areas...

Document 2:
Freshwat

## Create a custom retriever based on our reranker

In [None]:
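# Create a custom retriever class
class CustomRetriever(BaseRetriever, BaseModel):

    vectorstore: Any = Field(description="Vector store for initial retrieval")

    class Config:
        arbitrary_types_allowed = True

    # BaseRetriever subclasses implement this private hook; the public
    # invoke() and get_relevant_documents() entry points call it.
    def _get_relevant_documents(self, query: str, *, run_manager=None, num_docs: int = 2) -> List[Document]:
        # Retrieve a broad candidate set, then let the LLM reranker keep the best num_docs
        initial_docs = self.vectorstore.similarity_search(query, k=30)
        return rerank_documents(query, initial_docs, top_n=num_docs)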


# Create the custom retriever
custom_retriever = CustomRetriever(vectorstore=vector_store)

# Create an LLM for answering questions
llm = ChatUpstage(model="solar-pro")

# Create the RetrievalQA chain with the custom retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=custom_retriever,
    return_source_documents=True
)

## Example query

In [None]:
result = qa_chain.invoke({"query": query})

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document



Question: What are the impacts of climate change on biodiversity?
Answer: Climate change impacts biodiversity in various ways. It affects terrestrial ecosystems by 
shifting habitat ranges, changing species distributions, and impacting ecosystem functions. 
Forests, grasslands, and deserts are all affected. Freshwater ecosystems, including rivers, 
lakes, and wetlands, also experience changes such as altered water quality, habitat loss, and 
reduced biodiversity due to shifts in precipitation patterns, temperature, and water flow. 
Freshwater species, including fish and amphibians, are particularly at risk.

Relevant source documents:

Document 1:
Freshwater Ecosystems  
Freshwater ecosystems, including rivers, lakes, and wetlands, are affected by changes in 
precipitation patterns, temperature, and water flow. These changes can lead to altered...

Document 2:
cultural perceptions.  
Youth Engagement  
Youth are vital stakeholders in climate action. Empowering young people through edu

## Example that demonstrates why we should use reranking

In [None]:
chunks = [
    "The capital of France is great.",
    "The capital of France is huge.",
    "The capital of France is beautiful.",
    """Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower.
    I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.""",
    "I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city."
]
docs = [Document(page_content=sentence) for sentence in chunks]


def compare_rag_techniques(query: str, docs: List[Document] = docs) -> None:
    embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
    vectorstore = FAISS.from_documents(docs, embeddings)

    print("Comparison of Retrieval Techniques")
    print("==================================")
    print(f"Query: {query}\n")

    print("Baseline Retrieval Result:")
    baseline_docs = vectorstore.similarity_search(query, k=2)
    for i, doc in enumerate(baseline_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

    print("\nAdvanced Retrieval Result:")
    custom_retriever = CustomRetriever(vectorstore=vectorstore)
    advanced_docs = custom_retriever.invoke(query)
    for i, doc in enumerate(advanced_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)


query = "what is the capital of france?"
compare_rag_techniques(query, docs)

Comparison of Retrieval Techniques
Query: what is the capital of france?

Baseline Retrieval Result:

Document 1:
The capital of France is great.

Document 2:
The capital of France is huge.

Advanced Retrieval Result:



Document 1:
I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city.

Document 2:
The capital of France is beautiful.


# Method 2: Cross-Encoder Models

In [None]:
# The ms-marco cross-encoder checkpoint is public, so no Hugging Face token is needed to download it
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

class CrossEncoderRetriever(BaseRetriever, BaseModel):
    vectorstore: Any = Field(description="Vector store for initial retrieval")
    cross_encoder: Any = Field(description="Cross-encoder model for reranking")
    k: int = Field(default=5, description="Number of documents to retrieve initially")
    rerank_top_k: int = Field(default=3, description="Number of documents to return after reranking")

    class Config:
        arbitrary_types_allowed = True

    # BaseRetriever subclasses implement this private hook; the public
    # invoke() and get_relevant_documents() entry points call it.
    def _get_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
        # Initial retrieval
        initial_docs = self.vectorstore.similarity_search(query, k=self.k)

        # Prepare pairs for cross-encoder
        pairs = [[query, doc.page_content] for doc in initial_docs]

        # Get cross-encoder scores (one score per query-document pair)
        scores = self.cross_encoder.predict(pairs)

        # Sort documents by score, highest first
        scored_docs = sorted(zip(initial_docs, scores), key=lambda x: x[1], reverse=True)

        # Return top reranked documents
        return [doc for doc, _ in scored_docs[:self.rerank_top_k]]

    async def _aget_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
        raise NotImplementedError("Async retrieval not implemented")
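
The retriever above only needs the relative order of the cross-encoder scores. If an absolute cutoff is also wanted, the raw scores can be squashed into the range 0 to 1 first. The sketch below assumes the model returns unbounded logits rather than probabilities, which should be checked for the checkpoint in use. The helper `select_by_score`, the `min_prob` threshold, and the example logits are all made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_by_score(
    docs: Sequence[str],
    logits: Sequence[float],
    top_k: int = 3,
    min_prob: float = 0.5,
) -> List[Tuple[str, float]]:
    """Convert logits to probabilities, sort descending, drop weak matches, keep top_k."""
    scored = [(doc, sigmoid(float(logit))) for doc, logit in zip(docs, logits)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(doc, prob) for doc, prob in scored if prob >= min_prob][:top_k]


# Hypothetical logits standing in for the output of cross_encoder.predict(pairs)
docs = ["doc A", "doc B", "doc C"]
logits = [-4.3, 8.6, 0.2]
kept = select_by_score(docs, logits)
for doc, prob in kept:
    print(f"{prob:.3f}  {doc}")
```

The sigmoid is monotonic, so it never changes the ranking; it only makes a threshold such as `min_prob` easier to interpret.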



## Create an instance and showcase over an example

In [None]:
# Create the cross-encoder retriever
cross_encoder_retriever = CrossEncoderRetriever(
    vectorstore=vector_store,
    cross_encoder=cross_encoder,
    k=10,  # Retrieve 10 documents initially
    rerank_top_k=5  # Return top 5 after reranking
)

# Set up the LLM
llm = ChatUpstage(model='solar-pro')

# Create the RetrievalQA chain with the cross-encoder retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=cross_encoder_retriever,
    return_source_documents=True
)

# Example query
query = "What are the impacts of climate change on biodiversity?"
result = qa_chain.invoke({"query": query})

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


Question: What are the impacts of climate change on biodiversity?
Answer: Climate change has several impacts on biodiversity, including shifts in habitat ranges, changes in species distributions, and disruptions to ecosystem functions. These changes can lead to a loss of biodiversity and disrupt ecological balance. Climate change is causing shifts in plant and animal species composition in terrestrial ecosystems such as forests, grasslands, and deserts. Similarly, marine ecosystems are highly vulnerable, with rising sea temperatures, ocean acidification, and changing currents affecting marine biodiversity, from coral reefs to deep-sea habitats. Species migration and changes in reproductive cycles can disrupt marine food webs and fisheries.

Relevant source documents:

Document 1:
Climate change is altering terrestrial ecosystems by shifting habitat ranges, changing species 
distributions, and impacting ecosystem functions. Forests, grasslands, and deserts are 
experiencing shi...

Doc