# Fusion Retrieval

Till now we only saw similarity based search **(vector based)**.

But apart from **similarity**, there exists a **keyword-based retrieval** approach **(BM25)** for documents.

A **Fusion Retrieval** approach combines both **vector-based similarity search** and **keyword-based search** to build a more robust and accurate retrieval system that can handle a wide range of queries.

---

## Key Components

- PDF processing and text chunking  
- Vector store creation using **FAISS**  
- BM25 index creation for keyword-based retrieval  
- Custom fusion retrieval function combining both retrievers  

---

## Workflow

1. Load the data  
2. Create text chunks  
3. Build a vector-based retriever (FAISS)  
4. Build a BM25 retriever  
5. Combine both retrieval results using a fusion strategy  
6. Generate a rich fused context  
7. Send the context to the RAG pipeline  

---

## Summary

Fusion Retrieval leverages:
- **Semantic similarity** from vector search
- **Exact keyword matching** from BM25

This hybrid approach improves recall, relevance, and robustness in real-world RAG systems.

---

#### **LLM used**

In [8]:
from langchain_ollama import ChatOllama 

llm = ChatOllama(
    model='llama3.2',
    temperature=0.2,
    verbose=True
)

print(llm.invoke("Hey How are you?"))

content="I'm just a language model, so I don't have emotions or feelings like humans do. However, I'm functioning properly and ready to help with any questions or tasks you may have! How can I assist you today?" additional_kwargs={} response_metadata={'model': 'llama3.2', 'created_at': '2025-12-21T18:04:06.996109Z', 'done': True, 'done_reason': 'stop', 'total_duration': 20374669459, 'load_duration': 3738547667, 'prompt_eval_count': 30, 'prompt_eval_duration': 11889181833, 'eval_count': 46, 'eval_duration': 3370796585, 'logprobs': None, 'model_name': 'llama3.2', 'model_provider': 'ollama'} id='lc_run--019b4214-ee2a-7780-bef5-73a5e3eaa80e-0' usage_metadata={'input_tokens': 30, 'output_tokens': 46, 'total_tokens': 76}


#### **Embedding Model**

In [9]:
from langchain_huggingface import HuggingFaceEmbeddings
import time 

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

text = "This is a test document."

start = time.time()
query_result = embedding_model.embed_query(text)
total_time = time.time() - start
# show only the first 100 characters of the stringified vector
print(f"Length of text embedding : {len(text)}")
print(f"Time taken to convert text to embedding : {total_time :.2f} sec")
print(str(query_result)[:100] + "...")


Length of text embedding : 24
Time taken to convert text to embedding : 0.08 sec
[-0.0383385606110096, 0.1234646886587143, -0.02864295430481434, 0.05365273356437683, 0.0088453618809...


#### **Load the documents**

In [10]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../data/Understanding_Climate_Change.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

print(f"Number of documents : {len(docs)}")

Number of documents : 33


#### **Creating the Chunks**

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

print(chunks[0])

page_content='Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human' metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2024-07-13T20:17:34+03:00', 'author': 'Nir', 'moddate': '2024-07-13T20:17:34+03:00', 'source': '../data/Understanding_Climate_Change.pdf', 'total_pages': 33, 'page': 0, 'page_label': '1'}


#### **Create a Vectorstore and Retriever**

In [12]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

index = faiss.IndexFlatL2(len(embedding_model.embed_query("hello world")))

vector_store = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

In [13]:
vector_store.add_documents(chunks)

['ac5728ab-394d-47e8-8f89-c51d3e1e4a5d',
 'd76d3a6d-f37b-499c-9e72-754ab7f7a815',
 'c3931050-90ae-4241-85c4-89c235fb2ef2',
 'c440aa93-dffa-4a59-95ee-4739d9e63dc1',
 '795f79fe-acde-4858-bc86-7839330979cf',
 '5116e961-03da-4540-85b3-5266c84e82ed',
 '65e1d8cd-c7ff-4029-add0-6be7963d4eae',
 '42a39083-5432-4f88-a8a2-b9e486a00846',
 '1509c49b-4015-4a4e-be7c-0f9474c5b766',
 'bd2e31c4-8f3d-4c78-935c-f740c5d477ce',
 '1940103b-b7b0-489b-af22-f4c1ca5d5ec6',
 'ab2bdd1b-7410-4254-a4ac-6026c5d4a89b',
 '440aa3a2-df54-4bbf-bf98-702a5d5526be',
 '852629d0-d2a9-4a45-9d18-dba105f77be7',
 '286d6efb-fe15-4167-8204-c775e628522c',
 'a92d3bb4-e240-4d5a-bef5-14fb251d7d92',
 'b30a4a3d-2fed-4f9c-b7ed-2c2b9a3d14ae',
 '90d29fe2-541f-4acb-83cd-9eed0cbcc98a',
 '27fc2ec9-1a39-4289-8379-cac869f8c209',
 'e98c1a42-7e6e-4240-9c4c-16b8ebdaf829',
 '68f4e5d5-b174-48fd-82a7-1e24a8387fb8',
 '138e37a3-75cf-4d75-90ea-7e24c5a8416f',
 '12caa46b-9bff-4d35-a956-0b7696bd5392',
 '58bdb6f7-9257-462b-ba4e-fb6b86e90125',
 'c2de34e3-d2ff-

#### **Create a bm25 index for retrieving documents by keywords**

In [15]:
from typing import List 
from langchain_core.documents import Document
from rank_bm25 import BM25Okapi

def create_bm25_index(documents : List[Document]):
    """
    Create a BM25 index from the given documents.

    BM25 (Best Matching 25) is a ranking function used in information retrieval.
    It's based on the probabilistic retrieval framework and is an improvement over TF-IDF.

    Args:
    documents (List[Document]): List of documents to index.

    Returns:
    BM25Okapi: An index that can be used for BM25 scoring.
    """
    tokenized_docs = [doc.page_content.split() for doc in documents]
    return BM25Okapi(tokenized_docs)

In [18]:
bm25 = create_bm25_index(chunks) # Create BM25 index from the cleaned texts (chunks)

#### **Define a function that retrieves both semantically and by keyword, normalizes the scores and gets the top k documents**

In [20]:
import numpy as np

def fusion_retrieval(bm25, vectorstore, query, k=5, alpha=0.5):
    """ 
    query (str): The query string.
    k (int): The number of documents to retrieve.
    alpha (float): The weight for vector search scores (1-alpha will be the weight for BM25 scores).
    """
    epsilon = 1e-8

    # Step 1: Get all documents from the vectorstore
    all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)

    # Step 2: Perform BM25 search
    bm25_scores = bm25.get_scores(query.split())

    # Step 3: Perform vector search
    vector_results = vectorstore.similarity_search_with_score(query, k=len(all_docs))
    
    # Step 4: Normalize scores
    vector_scores = np.array([score for _, score in vector_results])
    vector_scores = 1 - (vector_scores - np.min(vector_scores)) / (np.max(vector_scores) - np.min(vector_scores) + epsilon)

    bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) -  np.min(bm25_scores) + epsilon)

    # Step 5: Combine scores
    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores  

    # Step 6: Rank documents
    sorted_indices = np.argsort(combined_scores)[::-1]
    
    # Step 7: Return top k documents
    return [all_docs[i] for i in sorted_indices[:k]]

In [23]:
# Query
query = "What are the impacts of climate change on the environment?"

# Perform fusion retrieval
top_docs = fusion_retrieval(bm25, vector_store, query, k=5, alpha=0.5)
docs_content = [doc.page_content for doc in top_docs]


In [25]:
for content in docs_content:
    print('='*89)
    print(f"Doc : {content}")

Doc : Community Education 
Community education initiatives raise awareness and promote action at the local level. 
Workshops, seminars, and public events can engage diverse audiences and encourage
Doc : Taking Action 
Individual Actions 
Individuals can make a difference by adopting sustainable practices in their daily lives. This 
includes reducing energy consumption, minimizing waste, supporting renewable energy, and 
advocating for climate action. Collective individual actions contribute to broader systemic 
change. 
Community Initiatives
Doc : implementing projects, building capacity, and fostering sustainable development. Ensuring 
adequate and predictable funding is a key focus of international negotiations. 
Technology Transfer and Capacity Building 
International agreements emphasize the importance of technology transfer and capacity 
building to support climate action in developing countries. This includes sharing climate-
Doc : often use social marketing, education, and incen