# **RAG --- Hybrid Search:**

## **Keyword Search (Traditional Method):**

This section ranks documents against a query using scikit-learn's `TfidfVectorizer` and cosine similarity.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

In [4]:
# Sample documents

documents = [
    "This is a list which containig sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings."
]

In [7]:
# Text-Preprocessing function:

def text_preprocessing(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    return text

In [8]:
processed_documents = [text_preprocessing(text=doc) for doc in documents]
processed_documents

['this is a list which containig sample documents',
 'keywords are important for keywordbased search',
 'document analysis involves extracting keywords',
 'keywordbased search relies on sparse embeddings']

In [9]:
# Pre-process the query

def get_query_and_preproces(query):
  return text_preprocessing(text=query)

In [10]:
preprocessed_query = get_query_and_preproces(query="Keyword-based Search")
preprocessed_query

'keywordbased search'
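
Note that deleting punctuation fuses `keyword-based` into the single token `keywordbased`, so it no longer matches a plain `keyword`. A minimal sketch of an alternative, assuming you would rather keep the two words separate; the helper name `text_preprocessing_keep_words` is hypothetical and not part of the notebook:

```python
import re

def text_preprocessing_keep_words(text):
    """Lowercase the text and replace punctuation with spaces (hypothetical variant)."""
    text = text.lower()
    # Replace each punctuation character with a space instead of deleting it
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the repeated whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

print(text_preprocessing_keep_words("Keyword-based Search"))
# keyword based search
```

Whichever variant is used, the same function has to be applied to both the documents and the query.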

In [11]:
# Initialize the TFIDF vectorizer:

vector = TfidfVectorizer()

In [13]:
# Apply vectorizer on processed documents:

X = vector.fit_transform(processed_documents)
X.toarray()

array([[0.        , 0.        , 0.37796447, 0.        , 0.37796447,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.37796447, 0.        , 0.        , 0.37796447, 0.        ,
        0.        , 0.37796447, 0.        , 0.        , 0.37796447,
        0.37796447],
       [0.        , 0.4533864 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.4533864 , 0.4533864 , 0.        ,
        0.        , 0.35745504, 0.35745504, 0.        , 0.        ,
        0.        , 0.        , 0.35745504, 0.        , 0.        ,
        0.        ],
       [0.46516193, 0.        , 0.        , 0.46516193, 0.        ,
        0.        , 0.46516193, 0.        , 0.        , 0.46516193,
        0.        , 0.        , 0.36673901, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.43671931, 0.        , 0.        , 0.       

In [15]:
# Transform the query with the fitted vectorizer to get the query embedding:

query_embedding = vector.transform([preprocessed_query])
query_embedding.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.        ]])

In [19]:
# Perform the Similarity Search:

similarities = cosine_similarity(X, Y=query_embedding)
similarities

array([[0.        ],
       [0.50551777],
       [0.        ],
       [0.48693426]])
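
The two non-zero scores above can be reproduced by hand, which shows where they come from. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows); the helper names are illustrative only.

```python
import math
import re
from collections import Counter

docs = [
    "this is a list which containig sample documents",
    "keywords are important for keywordbased search",
    "document analysis involves extracting keywords",
    "keywordbased search relies on sparse embeddings",
]
query = "keywordbased search"

def tokenize(text):
    # Tokens of two or more word characters (so "a" is dropped)
    return re.findall(r"\b\w\w+\b", text)

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(tokenize(doc)))
# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(text):
    # Raw term counts times idf, restricted to the fitted vocabulary
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(a, b):
    # Both vectors are already L2-normalized, so the dot product is the cosine
    return sum(w * b.get(t, 0.0) for t, w in a.items())

query_vec = tfidf(query)
scores = [cosine(tfidf(doc), query_vec) for doc in docs]
print([round(s, 4) for s in scores])
# [0.0, 0.5055, 0.0, 0.4869]
```

Documents 2 and 4 both contain the two query terms exactly once. Document 2 scores slightly higher because its other terms include `keywords`, which appears in two documents and therefore carries less weight in the normalization.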

In [24]:
# Get the Ranked Documents:

ind = np.argsort(similarities, axis=0)[::-1].flatten()
ranked_documents = [documents[i] for i in ind]
ranked_documents


['Keywords are important for keyword-based search.',
 'Keyword-based search relies on sparse embeddings.',
 'Document analysis involves extracting keywords.',
 'This is a list which containig sample documents.']
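
The ranked list above still contains the two documents whose similarity is 0. A minimal sketch of keeping only real matches, reusing the similarity values printed earlier; the `score > 0` cut-off is an assumption, and any threshold could be used instead:

```python
documents = [
    "This is a list which containig sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings.",
]
# Cosine similarities copied from the notebook output
similarities = [0.0, 0.50551777, 0.0, 0.48693426]

# Pair each score with its document and sort from best to worst
ranked = sorted(zip(similarities, documents), key=lambda pair: pair[0], reverse=True)
# Keep only documents that share at least one term with the query
hits = [(round(score, 4), doc) for score, doc in ranked if score > 0]

for score, doc in hits:
    print(score, doc)
```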

In [25]:
# Implement ranking by using cosine similarity:

# Document embeddings (as obtained after applying TF-IDF):
document_embeddings = np.array([
    [0.634, 0.234, 0.867, 0.042, 0.249],
    [0.123, 0.456, 0.789, 0.321, 0.654],
    [0.987, 0.654, 0.321, 0.123, 0.456]
])

# Query embedding (as obtained after applying TF-IDF):
query_embedding = np.array([[0.789, 0.321, 0.654, 0.987, 0.123]])

# Calculate cosine similarity between query and documents:
similarities = cosine_similarity(document_embeddings, query_embedding)

# Get the ranking based on cosine similarities (highest first):
ranked_indices = np.argsort(similarities, axis=0)[::-1].flatten()

# Output the Ranked documents:
for i, idx in enumerate(ranked_indices):
    print(f"Rank {i+1}: Document {idx+1}")

Rank 1: Document 1
Rank 2: Document 3
Rank 3: Document 2


## **Hybrid Search:**

In [26]:
# Install libraries:
!pip -q install sentence-transformers==2.2.2
!pip -q install langchain
!pip -q install chromadb
!pip -q install langchain_community
!pip -q install langchain_google_genai
!pip -q install numpy pandas
!pip -q install pypdf

In [27]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [31]:
# Load the PDF documents:

loader = PyPDFLoader(file_path="/content/k8_social_science_1.pdf")
documents = loader.load()
# documents

In [36]:
# Split the documents into overlapping chunks:

splitter = RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
chunks = splitter.split_documents(documents)

In [37]:
len(chunks)

217
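
To see what `chunk_size=1000` and `chunk_overlap=200` mean, here is a simplified sketch that cuts fixed character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, lines, and spaces, so real chunk boundaries differ); the helper `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Hypothetical helper: fixed-size character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# 2,500 characters of dummy text
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200)

print([len(c) for c in chunks])              # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])   # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.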

In [39]:
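# Import from langchain_community (installed above); the langchain.* paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings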

In [None]:
# Load the embedding model:

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [80]:
# Create a vector store:

persist_directory = 'DB'
vectorstore = Chroma.from_documents(chunks, embeddings,  persist_directory=persist_directory)

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

# Create a retriever from the vector DB that returns the top 25 chunks:
vectorstore_retreiver = vectordb.as_retriever(search_kwargs={"k": 25})

In [None]:
vectorstore_retreiver.get_relevant_documents("What is colonial?")

In [43]:
# Create a keyword search:

!pip install rank_bm25 -q

In [82]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(chunks, k=25)

In [None]:
keyword_retriever.get_relevant_documents("colonial")
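
For intuition about what a BM25 keyword retriever scores, here is a textbook Okapi BM25 sketch in plain Python. It is an illustration under assumptions: the `rank_bm25` package installed above may use a slightly different idf formula and default parameters, and `bm25_scores`, `k1=1.5`, and `b=0.75` are choices made here, not values taken from the notebook.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Textbook Okapi BM25 scores for one query (illustrative sketch)."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency of each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative idf variant: log(1 + (N - df + 0.5) / (df + 0.5))
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "this is a list which containig sample documents",
    "keywords are important for keywordbased search",
    "document analysis involves extracting keywords",
    "keywordbased search relies on sparse embeddings",
]
tokenized = [doc.split() for doc in docs]
scores = bm25_scores("keywords search".split(), tokenized)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
# [1, 2, 3, 0]
```

Document 2 matches both query terms and ranks first. Documents 3 and 4 each match one term with the same idf, and the shorter document 3 wins because of the length normalization.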

In [48]:
# Build a Hybrid Retriever:

ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retreiver, keyword_retriever], weights=[0.4, 0.6])

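**Mixing vector search and keyword search for hybrid search**<br>
`hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score`<br>
Here `alpha` is the weight of the dense (vector) score. The ensemble above uses weights `[0.4, 0.6]` for the vector and keyword retrievers, which corresponds to `alpha = 0.4`.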
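
A small worked example may make the blending concrete. The sketch below applies the formula to made-up scores for three chunks, after min-max normalization (an assumption made here, since raw BM25 and cosine scores live on different scales). It also shows weighted Reciprocal Rank Fusion, which is how LangChain's `EnsembleRetriever` is documented to combine ranked lists instead of adding raw scores; the constant `c=60` is assumed to match its default.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(dense, sparse, alpha):
    """hybrid = (1 - alpha) * sparse + alpha * dense, on normalized scores."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    return {
        d: (1 - alpha) * sparse_n.get(d, 0.0) + alpha * dense_n.get(d, 0.0)
        for d in docs
    }

def weighted_rrf(rankings, weights, c=60):
    """Weighted Reciprocal Rank Fusion over ranked lists of document ids."""
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + weight / (c + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Made-up scores for three chunks, for illustration only
dense = {"chunk_a": 0.82, "chunk_b": 0.75, "chunk_c": 0.40}
sparse = {"chunk_a": 1.2, "chunk_b": 6.3, "chunk_c": 3.1}

blended = hybrid_scores(dense, sparse, alpha=0.4)
print(sorted(blended, key=blended.get, reverse=True))
# ['chunk_b', 'chunk_a', 'chunk_c']

dense_rank = sorted(dense, key=dense.get, reverse=True)
sparse_rank = sorted(sparse, key=sparse.get, reverse=True)
print(weighted_rrf([dense_rank, sparse_rank], weights=[0.4, 0.6]))
# ['chunk_b', 'chunk_a', 'chunk_c']
```

In this example both methods agree: `chunk_b` is second in the dense ranking but first in the sparse one, and the larger keyword weight lifts it to the top.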

In [50]:
# Load llm:
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

GEMINI_API_KEY = userdata.get("GEMINI_API_KEY")


llm = ChatGoogleGenerativeAI(
    model="gemini-pro",
    google_api_key=GEMINI_API_KEY,
    temperature=0.2,
    max_output_tokens=1024,  # cap on the length of the generated answer
)

In [51]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

In [84]:
prompt = """
You are an AI-powered virtual assistant, your name is 'Neel', designed by EdzLearn Service Private Limited.
Your task is to help the students with their study purposes.
Use the following pieces of context to answer the question at the end in a detailed way.
If you don't know the answer, just say that 'I don't have enough information to answer this question'.

Whenever people ask a general question, you must answer it as well, like:
Question: Hi
Answer: Hello! How can I assist you with your studies today?

Question: What is your name?
Answer: I am Neel, your virtual assistant designed by EdzLearn Service Private Limited.

Context: `{context}`
Question: `{question}`
"""

prompt_template = PromptTemplate(
    template=prompt,
    input_variables=['context', 'question']
)

In [85]:
# Create the Normal LLM chain with the prompt template:
normal_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_retreiver,
    chain_type_kwargs={"prompt": prompt_template}
)


# Create the Hybrid LLM chain with the prompt template:
hybrid_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever,
    chain_type_kwargs={"prompt": prompt_template}
)

In [86]:
from IPython.display import display, Markdown

def generate_response(query, chain):
    # Run the chain and render the answer as Markdown in the notebook.
    # display() returns None, so there is nothing useful to return here.
    response = chain.invoke(query)
    display(Markdown(response['result']))

In [87]:
# normal chain
generate_response(query="Colonial", chain=normal_chain)

When the subjugation of one country by another leads to these kinds of political, economic, social and cultural changes, we refer to the process as colonisation.

In [88]:
# use hybrid chain:
generate_response(query="Colonial", chain=hybrid_chain)

Colonial refers to the process of subjugation of one country by another, leading to political, economic, social, and cultural changes. In the context of India, the British came to conquer the country and establish their rule, subjugating local nawabs and rajas. They established control over the economy and society, collected revenue to meet all their expenses, bought the goods they wanted at low prices, produced crops they needed for export, and brought about changes in values and tastes, customs, and practices.

In [89]:
# use normal chain:
generate_response(query="What is Colonial?", chain=normal_chain)

When the subjugation of one country by another leads to political, economic, social and cultural changes, we refer to the process as colonisation.

In [91]:
# use hybrid chain:
generate_response(query="What is the problem with the periodisation of Indian history that James Mill offers?", chain=hybrid_chain)

James Mill's periodisation of Indian history into Hindu, Muslim, and British periods is problematic because it:

1. **Oversimplifies the complexity of Indian history:** Indian history is not neatly divided into these three distinct periods. There were significant overlaps and continuities between these periods, and many different religious, cultural, and political groups coexisted and interacted throughout Indian history.

2. **Focuses on the ruling elites:** Mill's periodisation focuses primarily on the history of the ruling elites, particularly the Hindu, Muslim, and British rulers. It neglects the experiences and perspectives of the vast majority of the Indian population, who were not part of these ruling groups.

3. **Imposes a Western framework:** Mill's periodisation is based on a Western understanding of history, which divides history into distinct periods based on religious or political dominance. This framework may not be appropriate for understanding the complexities of Indian history, which has its own unique characteristics and patterns of development.

4. **Reinforces stereotypes:** Mill's periodisation reinforces stereotypes about Indian history, such as the idea that Hindu rule was a period of stagnation and decline, while Muslim rule was a period of progress and enlightenment. These stereotypes are not supported by historical evidence and can lead to a distorted understanding of Indian history.

In [92]:
# use normal chain:
generate_response(query="What is the problem with the periodisation of Indian history that James Mill offers?", chain=normal_chain)

James Mill's periodisation of Indian history into Hindu, Muslim, and British periods is problematic because it is based on the assumption that Indian history is a linear progression from one religious or political domination to another. This assumption ignores the fact that Indian history is much more complex and diverse, and that there have been many different periods of cultural and political change throughout its history. Additionally, Mill's periodisation privileges the experiences of the ruling classes and ignores the experiences of the majority of the population.