# Fusion Retrieval in Document Search (Hybrid Search)
## Overview
This code implements a Fusion Retrieval system that combines vector-based similarity search with keyword-based BM25 retrieval. The approach aims to leverage the strengths of both methods to improve the overall quality and relevance of document retrieval.
## Motivation
Traditional retrieval methods often reply on either semantic understanding (vector-based) or keyword matching (BM25). Each appraoch has its strengths and weaknesses. Fusion retrieval aims to combine these methods to create a more robust and accurate retrieval system that can handle a wider range of queries effiectively.
## Key Components
- PDF processing and text chunking
- Vector store creation using FAISS and embeddings
- BM25 index creation for keyword-based retrieval
- Custom fusion retrieval function that combines both methods
## Benefits of this Approach
- Improved Retrieval Quality: By combing semantic and keyword-based search, the system can capture both conceptual similarity and exact keyword matches
- Flexibility: The `alpha` parameter allows for adjusting the balance between vector and keyword search based on sepcific use cases or query types
- Robustness: The combined appraoch can handle a wider range of queries effectively, mitgating weaknesses of individual methods
- Customizability: The sytem can be easily adapted to use different vector stores or keyword-based retrieval methods
## Conclusion
Fusion retrieval represents a powerful approach to document search that combines the strengths of semantic understanding and keyword matching. By leveraging both vector-based and BM25 retrieval methods, it offers a more comprehensive and flexible solutin for information retrieval tasks.

In [1]:
import os
from dotenv import load_dotenv

from langchain_openai.chat_models.azure import AzureChatOpenAI
load_dotenv()
openai_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
openai_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
openai_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
openai_api_version = os.getenv("AZURE_API_VERSION")

llm = AzureChatOpenAI(
    azure_deployment=openai_deployment,
    api_version="2024-10-01-preview",
    azure_endpoint=f"{openai_endpoint}openai/deployments/{openai_deployment}/chat/completions?api-version=2024-10-01-preview",
    temperature=0,
    logprobs=True,
)

In [2]:
from langchain.docstore.document import Document
from typing import List
from rank_bm25 import BM25Okapi
import numpy as np
path = "./data/Understanding_Climate_Change.pdf"

In [3]:
from helper_functions import PyPDFLoader, RecursiveCharacterTextSplitter, replace_t_with_space, FAISS
from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings
openai_embedding = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID")
def encode_pdf_and_get_split_documents(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """
    loader = PyPDFLoader(path)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )

    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    embeddings = AzureOpenAIEmbeddings(
        deployment=openai_embedding,
        model="text-embedding-ada-002",
        chunk_size=16
    )
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)
    return vectorstore, cleaned_texts

In [4]:
vectorstore, cleaned_texts = encode_pdf_and_get_split_documents(path)

In [5]:
def create_bm25_index(documents: List[Document]) -> BM25Okapi:
    """
    Create a BM25 index from the given documents.

    BM25 (Best Matching 25) is a ranking function used in information retrieval.
    It's based on the probabilistic retrieval framework and is an improvement over TF-IDF.

    Args:
    documents (List[Document]): List of documents to index.

    Returns:
    BM25Okapi: An index that can be used for BM25 scoring.
    """
    tokenized_documents = [doc.page_content.split() for doc in documents]
    return BM25Okapi(tokenized_documents)

In [6]:
bm25 = create_bm25_index(cleaned_texts)

In [9]:
def fusion_retrieval(vectorstore, bm25, query:str, k:int=5, alpha:float=0.5) -> List[Document]:
    """
    Perform fusion retrieval combining keyword-based (BM25) and vector-based search.

    Args:
    vectorstore (VectorStore): The vectorstore containing the documents.
    bm25 (BM25Okapi): Pre-computed BM25 index.
    query (str): The query string.
    k (int): The number of documents to retrieve.
    alpha (float): The weight for vector search scores (1-alpha will be the weight for BM25 scores).

    Returns:
    List[Document]: The top k documents based on the combined scores.
    """
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
    bm25_scores = bm25.get_scores(query.split())

    vector_results = vectorstore.similarity_search_with_score(query, k=len(all_docs))

    vector_scores = np.array([score for _, score in vector_results])
    vector_scores = 1 - (vector_scores - np.min(vector_scores)) / (np.max(vector_scores) - np.min(vector_scores))

    bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores))

    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores

    sorted_indices = np.argsort(combined_scores)[::-1]

    return [all_docs[i] for i in sorted_indices[:k]]

In [10]:
from helper_functions import show_context
query = "What are the impacts of climate change on the environment?"

# Perform fusion retrieval
top_docs = fusion_retrieval(vectorstore, bm25, query, k=5, alpha=0.5)
docs_content = [doc.page_content for doc in top_docs]
show_context(docs_content)

Context 1:
Securing land rights for indigenous and local communities is essential for climate justice. 
Recognizing and protecting these rights ensures that communities can manage their lands 
sustainably and resist exploitation. Legal frameworks and international agre ements must 
uphold the rights of indigenous peoples.  
Gender and Climate Change  
Gendered Impacts  
Climate change affects men and women differently, often exacerbating existing gender 
inequalities. Women, particularly in developing countries, are more likely to experience the 
adverse effects of climate change due to their roles in agriculture, water col lection, and 
caregiving. Addressing these gendered impacts requires targeted interventions.  
Women's Leadership  
Empowering women as leaders in climate action can drive transformative change. Women 
bring unique perspectives and solutions to climate challenges, often prioritizing community


Context 2:
This vision includes a healthy planet, thriving ecosystems, a