<a href="https://colab.research.google.com/github/duper203/RAG_Techniques_with_upstage/blob/main/upstage/14_fusion_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fusion Retrieval in Document Search








## Key Components

1. PDF processing and text chunkin
2. Vector store creation using FAISS and OpenAI embeddings
3. BM25 index creation for keyword-based retrieval
4. Custom fusion retrieval function that combines both methods





## Method Details

1. Document Preprocessing


2. Vector Store Creation


3. BM25 Index Creation

  1) A BM25 index is created from the same text chunks used for the vector store.
  
  2) This allows for keyword-based retrieval alongside the vector-based method.

4. Fusion Retrieval Function

The `fusion_retrieval` function is the core of this implementation:

  * It takes a query and performs both vector-based and BM25-based retrieval.
  * Scores from both methods are normalized to a common scale.
  * A weighted combination of these scores is computed (controlled by the * alpha parameter).
  Documents are ranked based on the combined scores, and the top-k results are returned.



## Import libraries and set constants

In [3]:
! pip3 install -qU langchain-upstage langchain langchain-community faiss-cpu PyPDF2 rank_bm25

In [25]:
import os
from google.colab import userdata
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_upstage import UpstageEmbeddings
from langchain.document_loaders import PyPDFLoader

from typing import List
from rank_bm25 import BM25Okapi
import numpy as np

os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")


## Define document path


In [5]:
path = "data/Understanding_Climate_Change.pdf"

## Encode the pdf to vector store and return split document from the step before to create BM25 instance

In [12]:
def encode_pdf_and_get_split_documents(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings and vector store
    embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore, cleaned_texts

### functions

* `replace_t_with_space`
* `read_pdf_to_string`
* `process_documents`

In [None]:
def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents

In [13]:
import fitz
def read_pdf_to_string(path):
    """
    Read a PDF document from the specified path and return its content as a string.

    Args:
        path (str): The file path to the PDF document.

    Returns:
        str: The concatenated text content of all pages in the PDF document.

    The function uses the 'fitz' library (PyMuPDF) to open the PDF document, iterate over each page,
    extract the text content from each page, and append it to a single string.
    """
    # Open the PDF document located at the specified path
    doc = fitz.open(path)
    content = ""
    # Iterate over each page in the document
    for page_num in range(len(doc)):
        # Get the current page
        page = doc[page_num]
        # Extract the text content from the current page and append it to the content string
        content += page.get_text()
    return content


In [26]:
def process_documents(documents):
    """
    Processes a list of documents by splitting them into smaller chunks and creating a vector store.

    Args:
    - documents (list of str): A list of documents to be processed.

    Returns:
    - tuple: A tuple containing:
      - splits (list of str): The list of split document chunks.
      - vector_store (FAISS): A FAISS vector store created from the split document chunks and their embeddings.
    """
    embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(documents)
    vector_store = FAISS.from_documents(splits, embeddings)
    return splits, vector_store

## Create vectorstore and get the chunked documents

In [27]:
vectorstore, cleaned_texts = encode_pdf_and_get_split_documents(path)

## Create a bm25 index for retrieving documents by keywords

In [28]:
def create_bm25_index(documents: List[Document]) -> BM25Okapi:
    """
    Create a BM25 index from the given documents.

    BM25 (Best Matching 25) is a ranking function used in information retrieval.
    It's based on the probabilistic retrieval framework and is an improvement over TF-IDF.

    Args:
    documents (List[Document]): List of documents to index.

    Returns:
    BM25Okapi: An index that can be used for BM25 scoring.
    """
    # Tokenize each document by splitting on whitespace
    # This is a simple approach and could be improved with more sophisticated tokenization
    tokenized_docs = [doc.page_content.split() for doc in documents]
    return BM25Okapi(tokenized_docs)

In [29]:
bm25 = create_bm25_index(cleaned_texts) # Create BM25 index from the cleaned texts (chunks)


## Define a function that retrieves both semantically and by keyword, normalizes the scores and gets the top k documents

In [36]:
def fusion_retrieval(vectorstore, bm25, query: str, k: int = 5, alpha: float = 0.5) -> List[Document]:
    """
    Perform fusion retrieval combining keyword-based (BM25) and vector-based search.

    Args:
    vectorstore (VectorStore): The vectorstore containing the documents.
    bm25 (BM25Okapi): Pre-computed BM25 index.
    query (str): The query string.
    k (int): The number of documents to retrieve.
    alpha (float): The weight for vector search scores (1-alpha will be the weight for BM25 scores).

    Returns:
    List[Document]: The top k documents based on the combined scores.
    """
    # Step 1: Get all documents from the vectorstore
    all_docs = vectorstore.similarity_search(" ", k=vectorstore.index.ntotal)

    # Step 2: Perform BM25 search
    bm25_scores = bm25.get_scores(query.split())

    # Step 3: Perform vector search
    vector_results = vectorstore.similarity_search_with_score(query, k=len(all_docs))

    # Step 4: Normalize scores
    vector_scores = np.array([score for _, score in vector_results])
    vector_scores = 1 - (vector_scores - np.min(vector_scores)) / (np.max(vector_scores) - np.min(vector_scores))

    bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores))

    # Step 5: Combine scores
    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores

    # Step 6: Rank documents
    sorted_indices = np.argsort(combined_scores)[::-1]

    # Step 7: Return top k documents
    return [all_docs[i] for i in sorted_indices[:k]]

## Use Case example

In [41]:
# Query
query = "What are the impacts of climate change on the environment?"

# Perform fusion retrieval
top_docs = fusion_retrieval(vectorstore, bm25, query, k=5, alpha=0.5)
docs_content = [doc.page_content for doc in top_docs]
show_context(docs_content) # from helpers_function.py

Context 1:
water as a byproduct. Fuel cell vehicles (FCVs) offer a clean alternative to conventional 
vehicles, particularly for heavy -duty applications like trucks and buses. D eveloping a robust 
hydrogen infrastructure is essential for their success.  
Public Transportation Innovations  
Investments in efficient and reliable public transportation systems can reduce the number of 
private vehicles on the road, lowering emissions. Innovations include electric buses, light rail 
systems, and bike -sharing programs. Urban planning that prioritize s public transportation and 
non-motorized transit is key.  
Sustainable Agriculture and Land Use  
Precision Agriculture  
Precision agriculture uses technology to monitor and manage crop production more 
effectively. Techniques include GPS -guided equipment, soil sensors, and data analytics. 
These methods can optimize resource use, reduce emissions, and increase yields.  
Agroforestry


Context 2:
negative emissions. The captured CO2 can be