# 🔍 Hybrid Retrieval QA
Hybrid retrieval using OpenAI embeddings, FAISS, and BM25. The notebook compares three approaches:

* BM25 (keyword search): for exact term matches
* Vector embeddings (semantic search): for meaning-based similarity using OpenAI embeddings
* Hybrid search: combining both approaches for balanced retrieval

In [1]:
#!pip install langchain langchain-community langchain-core langchain-openai langchain-text-splitters faiss-cpu rank_bm25 python-dotenv pandas

In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever


from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

In [5]:
from dotenv import load_dotenv
load_dotenv()

True

In [6]:
docs = [
    # Keyword-heavy (BM25 should pick)
    "Diabetes causes high blood sugar because insulin is not produced correctly.Insulin moves glucose from the blood into cells, lowering blood sugar.High blood sugar levels occur when insulin cannot regulate glucose properly.",
    
    # Semantic-heavy (Vector should pick)
    "Beta-cell dysfunction prevents proper insulin secretion, disrupting glucose uptake.When cellular insulin signaling fails, tissues cannot absorb glucose, causing metabolic imbalance. Impaired glucose transporters lead to chronic hyperglycemia despite normal sugar intake.",
    
    # Keyword noise (BM25 false positive)
    "Eating a lot of sugar can increase blood sugar levels but does not cause diabetes. " ,

    # Second semantic passage (Vector should pick, BM25 should not)
    "Chronic metabolic disorders impair the pathways responsible for moving nutrients from the bloodstream into cells causing circulating energy molecules to remain elevated instead of being absorbed by tissues."

]


In [7]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
chunks = splitter.create_documents(docs)
chunks[:2]

[Document(metadata={}, page_content='Diabetes causes high blood sugar because insulin is not produced correctly.Insulin moves glucose from the blood into cells, lowering blood sugar.High blood sugar levels occur when insulin cannot regulate glucose properly.'),
 Document(metadata={}, page_content='Beta-cell dysfunction prevents proper insulin secretion, disrupting glucose uptake.When cellular insulin signaling fails, tissues cannot absorb glucose, causing metabolic imbalance. Impaired glucose transporters lead to chronic hyperglycemia despite normal sugar intake.')]

In [8]:
embeddings = OpenAIEmbeddings()

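Create a FAISS vector store and convert it into an MMR (maximal marginal relevance) retriever. For each query it fetches the `fetch_k` nearest chunks, then returns the `k` of them that best balance relevance against diversity, as weighted by `lambda_mult`.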

In [9]:
vectorstore = FAISS.from_documents(chunks, embeddings)
vector_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 2,              # number of chunks returned
        "fetch_k": 20,       # candidates fetched before MMR re-ranking
        "lambda_mult": 0.5,  # 0 = maximum diversity, 1 = pure relevance
    },
)
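
To make the MMR parameters concrete, here is a minimal pure-Python sketch of the selection loop. It is not the FAISS or LangChain implementation: `cosine`, `mmr_select`, and the toy vectors are hypothetical, and `doc_vecs` stands in for the `fetch_k` candidates. Each round picks the candidate with the best trade-off between similarity to the query and similarity to what has already been selected.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-D "embeddings" (made up for illustration).
query_vec = [1.0, 1.0]
doc_vecs = [
    [1.0, 0.10],  # 0: most relevant
    [1.0, 0.05],  # 1: near-duplicate of 0
    [0.0, 1.00],  # 2: less relevant, but covers a different direction
]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the more diverse vector; with `lambda_mult=1.0` the ranking falls back to pure relevance.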

<b>BM25</b> is a classic <b>keyword-based ranking</b> algorithm used in search engines to score documents against a query. It weights rare query terms more heavily, gives diminishing returns for repeated terms, and normalizes for document length.
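
The scoring idea can be sketched in a few lines of plain Python. This is an illustration of standard Okapi BM25 with a commonly used non-negative IDF, not the BM25L variant requested in the next cell and not the code the retriever runs. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Longer-than-average documents are penalised, shorter ones boosted.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates: repeats add less and less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

toy_docs = [
    "insulin lowers blood sugar",
    "blood pressure and heart rate",
    "cricket bat and ball",
]
scores = bm25_scores("blood sugar", toy_docs)
for doc, score in zip(toy_docs, scores):
    print(f"{score:.3f}  {doc}")
```

The first document matches both query terms and scores highest, the second matches only `blood`, and the third scores zero because it shares no term with the query.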

Build a BM25 keyword retriever that returns the top-2 most relevant text chunks using the BM25L scoring method.

In [10]:
keyword_retriever = BM25Retriever.from_documents(chunks, k=2, bm25_type="bm25l")


<b>hybrid_search:</b> This function performs hybrid search by splitting `top_k` between the two retrievers, taking the top vector-based results and the top keyword-based (BM25) results, and merging them into a single deduplicated list.
It balances semantic and keyword relevance. Because duplicates are dropped after the split, it can return fewer than `top_k` documents when both retrievers agree.

In [11]:
def hybrid_search(query, top_k=2):
    # Step 1: Get vector and keyword results
    vector_docs = vector_retriever.invoke(query)
    keyword_docs = keyword_retriever.invoke(query)

    # Step 2: Compute how many docs to take from each retriever
    k_vec = top_k // 2
    k_kw  = top_k - k_vec  # ensures total = top_k

    # Step 3: Select docs
    selected_vector = vector_docs[:k_vec]
    selected_keyword = keyword_docs[:k_kw]

    # Step 4: Merge and remove duplicates
    seen = set()
    merged = []
    
    for d in selected_vector + selected_keyword:
        if d.page_content not in seen:
            seen.add(d.page_content)
            merged.append(d)

    return merged
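
One consequence of this quota-then-dedupe design is that the result shrinks when both retrievers return the same document. The sketch below shows a possible variant that backfills from the remaining hits. It works on plain strings so it runs without any retriever; `merge_with_backfill` is a hypothetical helper, not part of the notebook.

```python
def merge_with_backfill(vector_hits, keyword_hits, top_k=2):
    """Split top_k between two ranked lists, dedupe, then backfill leftovers."""
    k_vec = top_k // 2
    k_kw = top_k - k_vec

    # Quota picks first, exactly as in hybrid_search.
    primary = vector_hits[:k_vec] + keyword_hits[:k_kw]
    # Remaining hits are only used if deduplication left a gap.
    leftovers = vector_hits[k_vec:] + keyword_hits[k_kw:]

    merged, seen = [], set()
    for hit in primary + leftovers:
        if hit not in seen and len(merged) < top_k:
            seen.add(hit)
            merged.append(hit)
    return merged

# Both retrievers rank "A" first: the gap is filled with the next vector hit.
print(merge_with_backfill(["A", "B"], ["A", "C"], top_k=2))  # ['A', 'B']
# No overlap: behaves like the plain quota merge.
print(merge_with_backfill(["A", "B"], ["C", "D"], top_k=2))  # ['A', 'C']
```

Backfilling from the vector list first is an arbitrary choice in this sketch; preferring keyword hits would be equally defensible.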

In [12]:
llm = ChatOpenAI(model="gpt-4o-mini")

#### LCEL RAG chain

In [13]:


prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
Use the context below to answer the question.

Context:
{context}

Question:
{question}

Answer:
""",
)

parser = StrOutputParser()

In [14]:
def bm25_rag(question):
    keyword_docs = keyword_retriever.invoke(question)  # use the argument, not the global `query`
    print(keyword_docs)
    context = "\n\n".join([d.page_content for d in keyword_docs])
    chain = prompt | llm | parser
    return chain.invoke({"context": context, "question": question})

In [15]:
def vector_rag(question):
    vector_docs = vector_retriever.invoke(question)  # use the argument, not the global `query`
    print(vector_docs)

    context = "\n\n".join([d.page_content for d in vector_docs])
    chain = prompt | llm | parser
    return chain.invoke({"context": context, "question": question})

In [16]:
def hybrid_rag(question):
    docs = hybrid_search(question)
    print(docs)
    context = "\n\n".join([d.page_content for d in docs])
    chain = prompt | llm | parser
    return chain.invoke({"context": context, "question": question})

In [17]:
query = "Why does diabetes lead to chronically high blood sugar?"

### Keyword Retrieval (BM25)

In [18]:
bm25_rag(query)

[Document(metadata={}, page_content='Eating a lot of sugar can increase blood sugar levels but does not cause diabetes.'), Document(metadata={}, page_content='Beta-cell dysfunction prevents proper insulin secretion, disrupting glucose uptake.When cellular insulin signaling fails, tissues cannot absorb glucose, causing metabolic imbalance. Impaired glucose transporters lead to chronic hyperglycemia despite normal sugar intake.')]


'Diabetes leads to chronically high blood sugar primarily due to beta-cell dysfunction and impaired insulin secretion. In diabetes, the pancreas struggles to produce sufficient insulin, which is necessary for glucose uptake by cells. When insulin signaling is disrupted, tissues become unable to absorb glucose effectively, leading to elevated blood sugar levels. Additionally, impaired glucose transporters further exacerbate the problem, resulting in chronic hyperglycemia even if sugar intake is normal. Thus, the combination of insufficient insulin production and defective glucose transport mechanisms contributes to the persistent high blood sugar characteristic of diabetes.'

### Vector Retrieval

In [19]:
vector_rag(query)

[Document(id='e3e6434a-92a4-4053-a48e-ec3d517b3057', metadata={}, page_content='Diabetes causes high blood sugar because insulin is not produced correctly.Insulin moves glucose from the blood into cells, lowering blood sugar.High blood sugar levels occur when insulin cannot regulate glucose properly.'), Document(id='71daab1d-3128-4f88-a57c-82f8d54c6394', metadata={}, page_content='Chronic metabolic disorders impair the pathways responsible for moving nutrients from the bloodstream into cells causing circulating energy molecules to remain elevated instead of being absorbed by tissues.')]


"Diabetes leads to chronically high blood sugar because insulin is either not produced correctly or the body's cells do not respond effectively to insulin. Insulin's primary role is to facilitate the transport of glucose from the bloodstream into cells. When insulin function is impaired, glucose remains in the bloodstream instead of being absorbed by tissues, resulting in elevated blood sugar levels that persist over time. Additionally, chronic metabolic disorders disrupt the pathways responsible for nutrient uptake, further contributing to the inability to regulate glucose properly."

### Hybrid Retrieval

In [20]:

hybrid_rag(query)

[Document(id='e3e6434a-92a4-4053-a48e-ec3d517b3057', metadata={}, page_content='Diabetes causes high blood sugar because insulin is not produced correctly.Insulin moves glucose from the blood into cells, lowering blood sugar.High blood sugar levels occur when insulin cannot regulate glucose properly.'), Document(metadata={}, page_content='Eating a lot of sugar can increase blood sugar levels but does not cause diabetes.')]


'Diabetes leads to chronically high blood sugar because insulin is either not produced correctly or the body does not respond effectively to insulin. This impairment prevents insulin from adequately moving glucose from the blood into cells, resulting in elevated blood sugar levels. Without proper regulation by insulin, glucose remains in the bloodstream, leading to sustained high blood sugar levels characteristic of diabetes.'

### Hybrid search using a sports product catalog dataset

In [21]:
import pandas as pd
import os
from langchain_core.documents import Document

In [22]:


# ------------------------------
# directory setup
# ------------------------------
try:
    current_dir = os.path.dirname(os.path.abspath(__file__))
except NameError:
    current_dir = os.getcwd()

    
dataset_path = current_dir + '/data/dataset1.csv'

print(dataset_path)
if not os.path.exists(dataset_path):
    raise FileNotFoundError(f"❌ dataset1.csv not found at {dataset_path}")

products_df = pd.read_csv(dataset_path, encoding="cp1252")
print("📦 Loaded dataset from CSV (first 5 rows):")
print(products_df.head())

print("📦 Loaded dataset with shape:", products_df.shape)
print("📊 Columns:", products_df.columns.tolist())

D:\OneDrive\Documents\CMI\TinyMaqiq\FDE_Training\Project_Data\Training_Session\Vector_Rag/data/dataset1.csv
📦 Loaded dataset from CSV (first 5 rows):
   category                                              title  \
0  Cricket   ITWOSERVICES CRICKET NET 100X10 CRICKET NET NY...   
1  Cricket   ITWOSERVICES CRICKET NET GROUND BOUNDARY NET 1...   
2  Cricket   VICTORY Medium Weight ( Pack of 1 ) Rubber Cri...   
3  Cricket   LYCAN Junior Cricket Bat Size 3 For Age Group ...   
4  Cricket   Star X Thrill Fox Heavy Duty First Grade HD Pl...   

   product_rating selling_price     mrp     seller_name  seller_rating  \
0             4.4         1,615  4000.0      I2SERVICES            4.4   
1             4.4           152   600.0      I2SERVICES            4.4   
2             3.7            59   199.0  VictoryOutlets            4.7   
3             3.9           249     NaN        sellguru            4.8   
4             4.0           349   749.0           STARX            4.5   

     

In [23]:
# Optional: clean missing titles
products_df["title"] = products_df["title"].fillna("")
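
The preview above shows `selling_price` stored as text with thousands separators (`1,615`) and `mrp` containing `NaN`. If the prices are ever needed as numbers, for example to filter results by price, a small normalizer along these lines could be used. `parse_price` is a hypothetical helper and the notebook itself keeps the raw values.

```python
import math

def parse_price(value):
    """Convert a price such as '1,615' or 4000.0 to a float; None if missing."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).replace(",", "").strip()
    return float(text) if text else None

print(parse_price("1,615"))        # 1615.0
print(parse_price(4000.0))         # 4000.0
print(parse_price(float("nan")))   # None
```

It could be applied column-wise with `products_df["selling_price"].map(parse_price)` before the metadata is built.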


### PREPARE DOCUMENTS FOR INDEXING

In [24]:

docs = []

for idx, row in products_df.iterrows():
    content = row["title"]  # only the title is indexed, by both retrievers
    
    metadata = {
        "row_id": int(idx),
        "category": row.get("category", ""),
        "selling_price": row.get("selling_price", ""),
        "product_rating": row.get("product_rating", ""),
        "mrp": row.get("mrp", ""),
        "seller_name": row.get("seller_name", ""),
        "seller_rating": row.get("seller_rating", ""),
    }

    docs.append(Document(page_content=content, metadata=metadata))

print(f"Prepared {len(docs)} documents for indexing.")

Prepared 65 documents for indexing.


### EMBEDDINGS 

In [25]:

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")



### CREATE VECTOR STORE INDEX

In [26]:
vectorstore = FAISS.from_documents(docs, embeddings)

In [27]:

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [28]:
keyword_retriever = BM25Retriever.from_documents(docs, k=5, bm25_type="bm25l")
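
A note on tokenization, since it affects the BM25 results for the cricket query later in this notebook. Assuming the retriever's default preprocessing simply splits on whitespace, matching is case-sensitive and punctuation stays attached to words, so a lowercase `cricket` in the query would not match `CRICKET` or `Cricket` in a title. The sketch below shows the effect with a hypothetical `normalize_tokens` helper; the title is taken from the dataset preview above.

```python
import re

def normalize_tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

title = "ITWOSERVICES CRICKET NET 100X10 CRICKET NET NYLON HDPE Cricket Net"
sample_query = "Show me cricket training nets for outdoor practice"

# Whitespace-only, case-sensitive tokens: no shared term at all.
naive_overlap = set(title.split()) & set(sample_query.split())
# Normalized tokens: the key term now matches.
normalized_overlap = set(normalize_tokens(title)) & set(normalize_tokens(sample_query))

print(naive_overlap)       # set()
print(normalized_overlap)  # {'cricket'}

# Assumption: the installed langchain_community version accepts a
# `preprocess_func` argument, in which case the helper could be passed as
#   BM25Retriever.from_documents(docs, k=5, preprocess_func=normalize_tokens)
```

Note that `nets` still does not match `net`; stemming would be a further step and is left out here.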

In [33]:
query = "Show me cricket training nets for outdoor practice"

#query="Nylon HDPE cricket net 100x10"

#query="options for Rubber cricket ball 110g pack"

#query="Long-lasting cricket practice net for outdoor training"

#query="A bat designed for children under 10 years old"

In [34]:
bm25_rag(query)

[Document(metadata={'row_id': 46, 'category': 'Cardio Equipment ', 'selling_price': '9,499', 'product_rating': 3.6, 'mrp': 21000.0, 'seller_name': 'Reach2Fitness', 'seller_rating': 4.4}, page_content='Adrenex by Flipkart Manual Treadmill for Exercise at Home Gym Running Machine for Cardio Weight Loss Treadmill'), Document(metadata={'row_id': 58, 'category': 'Home Gyms ', 'selling_price': '199', 'product_rating': 4.1, 'mrp': 999.0, 'seller_name': 'ADONYX', 'seller_rating': 4.8}, page_content='ADONYX Toning Tube With Door Anchor for All Type Resistance Bands Ideal for Men & Women Gym & Fitness Kit'), Document(metadata={'row_id': 44, 'category': 'Swimming ', 'selling_price': '299', 'product_rating': 4.0, 'mrp': 799.0, 'seller_name': 'Skylofts', 'seller_rating': 4.1}, page_content='Skylofts Soft Silicone Noise Reduction Ear Plugs for Sleeping, Meditation, Swimming adult and child, Reusable Earmuff for Travel Flights (Pack of 10) Ear Plug\xa0\xa0(Multicolor)'), Document(metadata={'row_id': 

"I'm sorry, but the provided context does not include any information about cricket training nets for outdoor practice."

In [35]:
vector_rag(query)

[Document(id='0b2da905-a871-4908-9201-e46e014ebb1a', metadata={'row_id': 1, 'category': 'Cricket ', 'selling_price': '152', 'product_rating': 4.4, 'mrp': 600.0, 'seller_name': 'I2SERVICES', 'seller_rating': 4.4}, page_content='ITWOSERVICES CRICKET NET GROUND BOUNDARY NET 10X10 FEET Cricket Net\xa0\xa0(Green)'), Document(id='ad1f9b9a-12b7-4dd8-b2cb-7d21ba42946f', metadata={'row_id': 0, 'category': 'Cricket ', 'selling_price': '1,615', 'product_rating': 4.4, 'mrp': 4000.0, 'seller_name': 'I2SERVICES', 'seller_rating': 4.4}, page_content='ITWOSERVICES CRICKET NET 100X10 CRICKET NET NYLON HDPE Cricket Net\xa0\xa0(Green)'), Document(id='0e3318de-c860-4855-b3dc-1eba567f0844', metadata={'row_id': 4, 'category': 'Cricket ', 'selling_price': '349', 'product_rating': 4.0, 'mrp': 749.0, 'seller_name': 'STARX', 'seller_rating': 4.5}, page_content='Star X Thrill Fox Heavy Duty First Grade HD Plastic Cricket Bat PVC/Plastic Cricket  Bat\xa0\xa0(1 kg)'), Document(id='22aa1d1f-0b10-424a-aa93-2bef70cfa

'Here are some cricket training nets suitable for outdoor practice:\n\n1. **ITWOSERVICES CRICKET NET GROUND BOUNDARY NET 10X10 FEET** - This net is ideal for creating a boundary during practice sessions.\n\n2. **ITWOSERVICES CRICKET NET 100X10 CRICKET NET NYLON HDPE** - A larger net suitable for comprehensive outdoor training sessions, made of durable nylon material.\n\nThese nets will help you set up a proper training environment for practicing cricket outdoors.'

In [36]:
hybrid_rag(query)

[Document(id='0b2da905-a871-4908-9201-e46e014ebb1a', metadata={'row_id': 1, 'category': 'Cricket ', 'selling_price': '152', 'product_rating': 4.4, 'mrp': 600.0, 'seller_name': 'I2SERVICES', 'seller_rating': 4.4}, page_content='ITWOSERVICES CRICKET NET GROUND BOUNDARY NET 10X10 FEET Cricket Net\xa0\xa0(Green)'), Document(metadata={'row_id': 46, 'category': 'Cardio Equipment ', 'selling_price': '9,499', 'product_rating': 3.6, 'mrp': 21000.0, 'seller_name': 'Reach2Fitness', 'seller_rating': 4.4}, page_content='Adrenex by Flipkart Manual Treadmill for Exercise at Home Gym Running Machine for Cardio Weight Loss Treadmill')]


'Here are some options for cricket training nets suitable for outdoor practice:\n\n1. **ITWOSERVICES Cricket Net Ground Boundary Net 10x10 Feet** - This net is designed for outdoor use and is perfect for cricket practice, providing a designated area for players to improve their batting and bowling skills.\n\n2. **Adrenex by Flipkart Manual Treadmill for Exercise at Home Gym** - Although primarily a treadmill, it emphasizes overall fitness which can complement cricket training.\n\nFor dedicated cricket training nets, check local sports retailers or online platforms for more options that meet your specific requirements, such as size and material, for outdoor practice.'