In [89]:
# install necessary packages
!pip install numpy scikit-learn sentence-transformers -q

In [90]:
documents = [
    "The T20 World Cup 2024 is in full swing, bringing excitement and drama to cricket fans worldwide.India's team, captained by Rohit Sharma, is preparing for a crucial match against Ireland, with standout player Jasprit Bumrah expected to play a pivotal role in their campaign.The tournament has already seen controversy, particularly concerning the pitch conditions at Nassau County International Cricket Stadium in New York, which came under fire after a low-scoring game between Sri Lanka and South Africa.",
    "The world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globally.In the UEFA Champions League, the semi-final matchups have been set, with defending champions Real Madrid set to face Manchester City, while Bayern Munich will take on Paris Saint-Germain.Both ties promise thrilling encounters, featuring some of the best talents in world football.",
    "As election season heats up, the latest developments reveal a highly competitive atmosphere across several key races.The presidential election has seen intense campaigning from all major candidates, with recent polls indicating a tight race.Incumbent President Jane Doe is seeking re-election on a platform of economic stability and healthcare reform, while her main rival, Senator John Smith, focuses on education and climate change initiatives.",
    "The AI revolution continues to transform industries and reshape the global economy.Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-driven diagnostics improving patient outcomes and reducing costs.Autonomous systems are becoming increasingly prevalent in logistics and transportation, enhancing efficiency and safety."
]

In [91]:
import re

def preprocessing(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

preprocessed_documents = [preprocessing(doc) for doc in documents]

for doc in preprocessed_documents:
    print(doc)

the t20 world cup 2024 is in full swing bringing excitement and drama to cricket fans worldwideindias team captained by rohit sharma is preparing for a crucial match against ireland with standout player jasprit bumrah expected to play a pivotal role in their campaignthe tournament has already seen controversy particularly concerning the pitch conditions at nassau county international cricket stadium in new york which came under fire after a lowscoring game between sri lanka and south africa
the world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globallyin the uefa champions league the semifinal matchups have been set with defending champions real madrid set to face manchester city while bayern munich will take on paris saintgermainboth ties promise thrilling encounters featuring some of the best talents in world football
as election season heats up the latest developments reveal a highly competitive atmosphere across several 
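
The printed output shows words fused across sentence boundaries ("worldwideindias", "campaignthe", "lowscoring"). This happens because the source strings have no space after each period and the regex deletes punctuation outright. Below is a minimal sketch of one alternative that replaces punctuation with a space and then collapses whitespace. The name `preprocessing_spaced` is hypothetical, and adopting it would change the vocabulary size and similarity scores shown in the later cells.

```python
import re

def preprocessing_spaced(text):
    """Lowercase, turn punctuation into spaces, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space instead of ""
    return re.sub(r"\s+", " ", text).strip()  # squeeze repeated whitespace

sample = "cricket fans worldwide.India's team plays a low-scoring game."
cleaned = preprocessing_spaced(sample)
print(cleaned)
# cricket fans worldwide india s team plays a low scoring game
```

The trade-off is that apostrophes and hyphens also become spaces, so "India's" turns into "india s". Whether that is acceptable depends on the tokenizer used downstream.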

In [92]:
test_query = "machine learning is a subset of artificial intelligence"

## Keyword Search

In [93]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [94]:
vectorizer = TfidfVectorizer()

In [95]:
sparse_vectors = vectorizer.fit_transform(preprocessed_documents)

In [97]:
len(sparse_vectors.toarray()[0])

183

In [112]:
test_query_sparse_vector = vectorizer.transform([test_query])

In [113]:
len(test_query_sparse_vector.toarray()[0])

183

In [114]:
keyword_similarities = cosine_similarity(sparse_vectors, test_query_sparse_vector)

keyword_similarities

array([[0.05537393],
       [0.11902777],
       [0.07839555],
       [0.17677653]])

In [115]:
ranked_indexes = np.argsort(keyword_similarities, axis=0)[::-1].flatten()

ranked_indexes

array([3, 1, 2, 0])

In [116]:
ranked_documents = [documents[i] for i in ranked_indexes]

for doc in ranked_documents:
    print(doc)

The AI revolution continues to transform industries and reshape the global economy.Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-driven diagnostics improving patient outcomes and reducing costs.Autonomous systems are becoming increasingly prevalent in logistics and transportation, enhancing efficiency and safety.
The world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globally.In the UEFA Champions League, the semi-final matchups have been set, with defending champions Real Madrid set to face Manchester City, while Bayern Munich will take on Paris Saint-Germain.Both ties promise thrilling encounters, featuring some of the best talents in world football.
As election season heats up, the latest developments reveal a highly competitive atmosphere across several key races.The presidential election has seen intense campaigning from all major candidates, with recent polls indica
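
The non-AI documents receive non-zero keyword scores even though they are unrelated to the query. The sketch below checks which terms the query shares with each document, using short excerpts of the preprocessed documents and the same token shape that `TfidfVectorizer` applies by default (runs of two or more word characters). The helper `tokens` and the `excerpts` dictionary are illustrative names, not part of the notebook.

```python
import re

def tokens(text):
    # Default TfidfVectorizer token shape: runs of 2+ word characters.
    return set(re.findall(r"\b\w\w+\b", text.lower()))

query = "machine learning is a subset of artificial intelligence"

# Short excerpts of the four preprocessed documents above.
excerpts = {
    "cricket": "the t20 world cup 2024 is in full swing",
    "football": "the world of football is buzzing with excitement",
    "election": "incumbent president jane doe is seeking reelection on a platform of economic stability",
    "ai": "significant advancements in artificial intelligence have led to breakthroughs in healthcare",
}

overlap = {name: sorted(tokens(query) & tokens(text)) for name, text in excerpts.items()}
for name, shared in overlap.items():
    print(name, shared)
```

Only the AI excerpt matches on content words; the others match on "is" and "of" alone. Passing `stop_words="english"` to `TfidfVectorizer` is one way to keep such function words from contributing, at the cost of changing the scores shown above.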

## Semantic Search

In [106]:
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

In [107]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')



In [108]:
dense_vectors = embedding_model.encode(preprocessed_documents)

In [109]:
len(dense_vectors[0])

384

In [110]:
test_query_dense_vector = embedding_model.encode([test_query])

In [111]:
len(test_query_dense_vector[0])

384

In [119]:
semantic_similarities = cosine_similarity(dense_vectors, test_query_dense_vector)

semantic_similarities

array([[0.0199223 ],
       [0.09100747],
       [0.04911966],
       [0.37954295]], dtype=float32)

In [120]:
ranked_indexes = np.argsort(semantic_similarities, axis=0)[::-1].flatten()

ranked_indexes

array([3, 1, 2, 0])

In [121]:
ranked_documents = [documents[i] for i in ranked_indexes]

for doc in ranked_documents:
    print(doc)

The AI revolution continues to transform industries and reshape the global economy.Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-driven diagnostics improving patient outcomes and reducing costs.Autonomous systems are becoming increasingly prevalent in logistics and transportation, enhancing efficiency and safety.
The world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globally.In the UEFA Champions League, the semi-final matchups have been set, with defending champions Real Madrid set to face Manchester City, while Bayern Munich will take on Paris Saint-Germain.Both ties promise thrilling encounters, featuring some of the best talents in world football.
As election season heats up, the latest developments reveal a highly competitive atmosphere across several key races.The presidential election has seen intense campaigning from all major candidates, with recent polls indica
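
Keyword and semantic scores live on different scales, so they cannot simply be added. One common option is to min-max normalize each list and take a weighted sum. The sketch below applies that to the two similarity arrays printed above; `min_max`, `alpha`, and the variable names are assumptions for illustration, and this is a score-level fusion rather than a rank-based one.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Values copied from the keyword and semantic similarity outputs above.
keyword_scores = [0.05537393, 0.11902777, 0.07839555, 0.17677653]
semantic_scores = [0.0199223, 0.09100747, 0.04911966, 0.37954295]

alpha = 0.5  # weight on the semantic score; 1 - alpha goes to the keyword score

hybrid_scores = [
    alpha * s + (1 - alpha) * k
    for s, k in zip(min_max(semantic_scores), min_max(keyword_scores))
]

# Document indexes ordered from most to least relevant.
hybrid_ranking = sorted(range(len(hybrid_scores)), key=hybrid_scores.__getitem__, reverse=True)
print(hybrid_ranking)  # [3, 1, 2, 0]
```

Because both methods already agree on the order for this query, the fused ranking is unchanged; the weighting only matters when the two rankings disagree.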

# **Hybrid Search RAG** using LangChain and OpenAI

In [132]:
!pip install pypdf -q
!pip install langchain -q
!pip install langchain_community -q
!pip install langchain_openai -q
!pip install langchain_chroma -q
!pip install rank_bm25 -q

In [123]:
# Import necessary libraries
import os
from google.colab import userdata

### Initialize OpenAI LLM

In [124]:
from langchain_openai import ChatOpenAI

# Set OpenAI API key
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# Initialize the ChatOpenAI model
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0
)

### Initialize Embedding Model

In [125]:
from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

### Load PDF Document

In [128]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/codeprolk.pdf")

docs = loader.load()

### Split Documents into Chunks

In [129]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=30)

chunks = splitter.split_documents(docs)

In [130]:
len(chunks)

33
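
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on a hierarchy of separators such as paragraphs, lines, and spaces before falling back to characters); the function `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the last window reached the end
            break
    return pieces

demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

With the notebook's settings (250 and 30), each chunk would start 220 characters after the previous one under this simplified scheme.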

### Create Semantic Search Retriever

In [135]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(chunks, embedding_model)

vectorstore_retreiver = vectorstore.as_retriever(search_kwargs={"k": 2})

In [136]:
vectorstore_retreiver

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x7b311539dc60>, search_kwargs={'k': 2})

### Create Keyword Search Retriever

In [139]:
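# BM25Retriever is implemented in langchain_community (requires rank_bm25)
from langchain_community.retrievers import BM25Retriever

keyword_retriever = BM25Retriever.from_documents(chunks)

# Return the top 2 chunks, matching the vector store retriever
keyword_retriever.k = 2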

In [140]:
keyword_retriever

BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x7b31145d9de0>, k=2)

### Create Hybrid Search Retriever

In [142]:
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers=[vectorstore_retreiver, keyword_retriever],
    weights=[0.5, 0.5],
)

In [143]:
ensemble_retriever

EnsembleRetriever(retrievers=[VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x7b311539dc60>, search_kwargs={'k': 2}), BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x7b31145d9de0>, k=2)], weights=[0.5, 0.5])

### Define Prompt Template

In [144]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Define a message template for the chatbot
message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

# Create a chat prompt template from the message
prompt = ChatPromptTemplate.from_messages([("human", message)])

### Create RAG Chain with Hybrid Search

In [145]:
chain = {
    "context": ensemble_retriever,
    "question": RunnablePassthrough()
    } | prompt | llm
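
In the chain above, the retriever output (a list of `Document` objects) is placed straight into `{context}`, so the prompt receives the string representation of that list, metadata included. A common refinement is to join only the page text first. The sketch below uses a hypothetical `FakeDocument` stand-in so it runs without LangChain or an API key; `format_docs` is likewise an illustrative name.

```python
from dataclasses import dataclass

@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str

def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

retrieved = [
    FakeDocument("The channel features tutorials, project"),
    FakeDocument("well-prepared for real-world challenges."),
]
context = format_docs(retrieved)
print(context)

# Assumed wiring in the notebook's chain (not executed here; needs the retriever and LLM):
# chain = {
#     "context": ensemble_retriever | format_docs,
#     "question": RunnablePassthrough(),
# } | prompt | llm
```

This keeps the prompt shorter and free of object syntax, which may help the model focus on the retrieved text.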

### Invoke RAG Chain with Example Questions

In [146]:
response = chain.invoke("what are the popular videos in codeprolk")

print(response.content)

The popular videos in CodePRO LK are tutorials, project demonstrations, and industry-related content that help learners prepare for real-world challenges.


In [147]:
# Compare what each retriever returns: keyword_retriever, vectorstore_retreiver, ensemble_retriever

In [148]:
for doc in keyword_retriever.invoke("what are the popular videos in codeprolk"):
    print(doc.page_content)
    print("---------------------")

appreciation and sharing how the videos have assisted them in their learning journ eys. 
Impact  
The CodePRO LK YouTube channel has played a significant role in democratizing tech
---------------------
industry, ensuring that learners are well -prepared for real -world challenges.  
Enhanced Learning Tools  
The platform plans to integrate more interactive and adaptive learning tools to personalize the
---------------------


In [149]:
for doc in vectorstore_retreiver.invoke("what are the popular videos in codeprolk"):
    print(doc.page_content)
    print("---------------------")

Overview  
The CodePRO LK YouTube Channel  is a crucial extension of the platform, providing a wealth 
of video content that complements the courses. The channel features tutorials, project
---------------------
appreciation and sharing how the videos have assisted them in their learning journ eys. 
Impact  
The CodePRO LK YouTube channel has played a significant role in democratizing tech
---------------------


In [150]:
for doc in ensemble_retriever.invoke("what are the popular videos in codeprolk"):
    print(doc.page_content)
    print("---------------------")

appreciation and sharing how the videos have assisted them in their learning journ eys. 
Impact  
The CodePRO LK YouTube channel has played a significant role in democratizing tech
---------------------
Overview  
The CodePRO LK YouTube Channel  is a crucial extension of the platform, providing a wealth 
of video content that complements the courses. The channel features tutorials, project
---------------------
industry, ensuring that learners are well -prepared for real -world challenges.  
Enhanced Learning Tools  
The platform plans to integrate more interactive and adaptive learning tools to personalize the
---------------------
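
The ensemble returns three chunks: the one both retrievers found comes first, followed by the chunks that only one retriever found. LangChain's documentation describes `EnsembleRetriever` as combining results with Reciprocal Rank Fusion. The sketch below is a simplified standalone version of weighted RRF, not the library's code; the function name `weighted_rrf`, the short labels standing in for the chunks, and the constant `c=60` are assumptions for illustration.

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each item earns weight / (rank + c) per list it appears in."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Labels are the first word of each chunk printed above, in retrieval order.
vector_hits = ["Overview", "appreciation"]
keyword_hits = ["appreciation", "industry"]

fused = weighted_rrf([vector_hits, keyword_hits], weights=[0.5, 0.5])
print(fused)  # ['appreciation', 'Overview', 'industry']
```

Under these assumptions the fused order matches the ensemble output above: a chunk ranked by both retrievers accumulates two contributions and outranks chunks that appear in only one list.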
