<a href="https://colab.research.google.com/github/codeREXus/langchain-learnings/blob/main/Retrieval_System_with_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RetrievalQA System with LangChain

##### This project builds a retrieval QA system using LangChain and the Google Gemini embedding model

## Steps:

#### > Import the necessary libraries.
#### > Set up the model by providing your API key as a Colab secret.
#### > Take the URL of the website as input and load the webpage.
#### > Split the page text into chunks and index them in a Chroma vector store.
#### > The last two cells are for testing: the first runs sample queries about the Taj Mahal, and the second accepts custom queries.

Requirements (install and restart the kernel/session)

In [None]:
!pip install -q langchain langchain_experimental langchain_community langchain_google_genai
!pip install -q langchainhub
!pip install -q pypdf chromadb
!pip install -q numpy==1.26.4
!pip install -q lxml_html_clean
!pip install newspaper3k

Import the Libraries

In [None]:
from langchain_core.documents import Document
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.chains import RetrievalQA
from google.colab import userdata



Set up the Embedding Model

In [None]:
embed_model = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=userdata.get('google_api')
)
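The cell above reads the key from Colab secrets, which only works inside Colab. As a minimal sketch for running the notebook elsewhere, the helper below tries the Colab secret first and then falls back to an environment variable. The function name `get_google_api_key` and the variable name `GOOGLE_API_KEY` are assumptions for illustration, not part of the original notebook.

```python
import os


def get_google_api_key(secret_name="google_api", env_var="GOOGLE_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment.

    `secret_name` matches the secret used in this notebook; `env_var` is an
    assumed fallback name for non-Colab sessions.
    """
    try:
        from google.colab import userdata  # importable only inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(secret_name)
            if key:
                return key
        except Exception:
            # Secret missing or notebook access not granted; try the environment.
            pass

    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: add a Colab secret '{secret_name}' or set {env_var}."
        )
    return key


# Possible usage with the embedding model defined above:
# embed_model = GoogleGenerativeAIEmbeddings(
#     model="models/embedding-001",
#     google_api_key=get_google_api_key(),
# )
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later, when the first embedding request is sent.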

Scrape the data from the website

In [None]:
import newspaper
from langchain_core.documents import Document

# The URL you want to scrape
url = input("Enter the URL: ")

try:
    # Create an Article object
    article = newspaper.Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Get the clean text
    page_content = article.text

    if not page_content:
        print("Failed to extract content. The page might be rendered with JavaScript or have an unusual structure.")
        documents = []
    else:
        # Create a LangChain Document
        # This allows you to plug it directly into the rest of your code
        metadata = {"source": url, "title": article.title}
        doc = Document(page_content=page_content, metadata=metadata)
        documents = [doc]
        print(f"Successfully extracted content from: {article.title}")

except Exception as e:
    print(f"An error occurred: {e}")
    documents = []


# `documents` is split into chunks and indexed in the following cells.

Enter the URL: https://en.wikipedia.org/wiki/Taj_Mahal
Successfully extracted content from: Taj Mahal
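The scraping cell passes whatever the user types straight to `newspaper.Article`. A small check before downloading can catch obvious typos early. The helper below is a sketch using only the standard library; the name `is_http_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse


def is_http_url(url):
    """Return True if `url` looks like an absolute http(s) URL with a host."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse can reject some malformed inputs, e.g. a broken IPv6 host
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://en.wikipedia.org/wiki/Taj_Mahal"))  # True
print(is_http_url("en.wikipedia.org/wiki/Taj_Mahal"))          # False: no scheme
```

This only checks the shape of the string. It does not confirm that the page exists or that its text can be extracted.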


In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(documents)
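To illustrate what `chunk_size=500` and `chunk_overlap=50` mean, here is a simplified sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences). The function name `sliding_chunks` is an assumption for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])          # [500, 500, 100]
print(pieces[0][-50:] == pieces[1][:50])  # True: 50 shared characters
```

A larger overlap reduces the chance of cutting an answer in half, at the cost of storing and embedding more duplicated text.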

In [None]:
vector_store = Chroma.from_documents(chunks, embed_model)
retreiver = vector_store.as_retriever(search_kwargs={'k': 3})
# search_kwargs k=3 tells the retriever to return the 3 most similar chunks

ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


By default, a search returns the 3 most similar chunks; pass a different `top_k` to `search_docs` to change the number of results.

In [None]:
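def search_docs(query, top_k=3):
    # Query the vector store directly so that top_k values above 3 are honoured;
    # a retriever built with k=3 can never return more than 3 documents.
    return vector_store.similarity_search(query, k=top_k)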
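To make the idea behind `k` and `top_k` concrete, the sketch below ranks a few toy vectors against a query vector and keeps the best `k`. It uses cosine similarity for illustration only; Chroma's actual distance metric and index are configured separately. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return the indices of the `k` vectors most similar to `query_vec`, best first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k_by_cosine([1.0, 0.0], toy_docs, k=2))  # [0, 2]
```

In the notebook, the embedding model produces the vectors and the vector store performs this ranking, typically with an approximate index rather than a full scan.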

In [None]:
test_queries = [
    "Taj Mahal",
    "who built it?",
    "Which river flows by?"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    results = search_docs(query)

    # Print the results
    print(f"Found {len(results)} relevant documents:")
    for i, doc in enumerate(results):
        print(f"\nResult {i+1}: {doc.page_content[:1500]}...")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")

In [None]:
queries = []
top_k = int(input("How many documents do you want per query? "))

while True:
    inp = input("Search the doc (or type 'exit' to finish): ")
    if inp.lower() == 'exit':
        break
    queries.append(inp) # Add the user's query to the list

print("\n--- Search Results ---")
for query in queries:
    print(f"\nQuery: {query}")
    result = search_docs(query, top_k)

    print(f"Found {len(result)} relevant documents:")
    for i, doc in enumerate(result):
        print(f"\nResult {i+1}: {doc.page_content[:1500]}...")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")
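The interactive cell calls `int(input(...))` directly, which raises `ValueError` on non-numeric input and accepts zero or negative values. One possible guard is sketched below; the name `parse_top_k`, the default of 3, and the cap of 20 are assumptions for illustration.

```python
def parse_top_k(raw, default=3, maximum=20):
    """Turn user input into a usable top_k, falling back to `default` when invalid."""
    try:
        value = int(raw.strip())
    except (ValueError, AttributeError):
        return default  # not a whole number (or not a string at all)
    if value < 1:
        return default  # zero or negative counts make no sense for a search
    return min(value, maximum)  # cap very large requests


print(parse_top_k("5"))    # 5
print(parse_top_k("abc"))  # 3
print(parse_top_k("100"))  # 20
```

It could be used as `top_k = parse_top_k(input("How many documents do you want per query? "))`.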