<a href="https://colab.research.google.com/github/arunmishrarut/RAG/blob/main/Untitled10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# 1. Install all required packages (run this cell first)
!pip install pypdf==5.6.0
!pip install PyMuPDF==1.26.1
!pip install langchain-community==0.3.25
!pip install rank_bm25==0.2.2
!pip install faiss-cpu==1.11.0
!pip install sentence-transformers

# 2. Import all libraries
import os
from langchain_community.document_loaders import PyPDFLoader  # loaders now live in langchain_community
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS  # vector stores now live in langchain_community

# 3. Download a sample PDF (climate change document)
os.makedirs('data', exist_ok=True)
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf

# 4. Helper function: Replace tabs with spaces in text chunks
def replace_t_with_space(list_of_documents):
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')
    return list_of_documents

# 5. Encode the PDF into a vector store using Hugging Face embeddings
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    # Load PDF and extract text
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Use Hugging Face embeddings (free)
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)
    return vectorstore

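# 6. Set path to the document
#    Note: this points to a PDF uploaded to the Colab session, not the sample
#    downloaded in step 3 (data/Understanding_Climate_Change.pdf).
path = "/content/ArunMishra_resume.pdf"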

# 7. Encode the document (this may take a minute the first time)
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

# 8. Create a retriever to search for relevant chunks
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

# 9. Function to retrieve and display context for a question
def retrieve_context_per_question(question, chunks_query_retriever):
    # invoke() replaces the deprecated get_relevant_documents()
    docs = chunks_query_retriever.invoke(question)
    context = [doc.page_content for doc in docs]
    return context

def show_context(context):
    for i, c in enumerate(context):
        print(f"Context {i + 1}:\n{c}\n\n")

# 10. Test the retriever with a sample question
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

# 11. (Optional) Try your own question!
# your_query = "Type your question here"
# context = retrieve_context_per_question(your_query, chunks_query_retriever)
# show_context(context)


--2025-07-01 23:07:51--  https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206372 (202K) [application/octet-stream]
Saving to: ‘data/Understanding_Climate_Change.pdf’


2025-07-01 23:07:51 (8.91 MB/s) - ‘data/Understanding_Climate_Change.pdf’ saved [206372/206372]

Context 1:
researchers, and Deaf educators, facilitating data engineering for sign-language datasets for the MVP.
Vedanta Resources PLC(Engineer - Data Scientist) May 2018 – Jul 2022
• Automated reagents dosing in froth flotation circuits usingRandom Forestregression models andstatistical modeling
in Python, increasing lead and zinc recovery by2.1% and 2.5% respectively, driving $5.9M additional an

In [None]:
# 10. Test the retriever with a sample question
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

# 11. (Optional) Try your own question!
your_query = "what is his qualification"
context = retrieve_context_per_question(your_query, chunks_query_retriever)
show_context(context)

Context 1:
researchers, and Deaf educators, facilitating data engineering for sign-language datasets for the MVP.
Vedanta Resources PLC(Engineer - Data Scientist) May 2018 – Jul 2022
• Automated reagents dosing in froth flotation circuits usingRandom Forestregression models andstatistical modeling
in Python, increasing lead and zinc recovery by2.1% and 2.5% respectively, driving $5.9M additional annual
revenue.
• Processed 1.5 million rows from 106 data sources usingSQLand implemented optimizations that reduced power consump-
tion by2%, saving $70,000 annually.
• Applied Kanban agile practices in collaboration with global teams across various time zones, ensuring timely and
effective project delivery.
• Defined and measured performance metricsfor Heavy Earth Moving Machines (HEMM) and Ball Mills andreduced
contract costfor underutilized HEMM equipment,saving $144,000 annually.
Tata Steel Limited(Intern - Data Analyst) Apr 2016 Jul 2016


Context 2:
contract costfor underutilized HEMM e
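
The retriever above was created with `search_kwargs={"k": 2}`, which is why two contexts come back for each question. The sketch below illustrates the general idea of top-k retrieval on toy data: score every chunk vector against the query vector and keep the `k` best. It is only an illustration under stated assumptions: `cosine_similarity`, `top_k`, and the toy vectors are hypothetical names and values, and cosine similarity is a stand-in, since the FAISS index used in the notebook may rely on a different distance metric and index structure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output)
toy_chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [1.0, 0.0, 0.0]

nearest = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print(nearest)
```

Note that a top-k search always returns `k` chunks, whether or not they are actually relevant. That is consistent with the climate-change question above returning resume text when the indexed document is a resume.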

In [None]:
!pip install pypdf==5.6.0  # Pure-Python PDF library (written entirely in Python)
!pip install PyMuPDF==1.26.1  # High-performance PDF library, but it has C dependencies
!pip install langchain-community==0.3.25  # Community-maintained third-party integrations for the LangChain ecosystem
!pip install rank_bm25==0.2.2  # Implements the BM25 algorithm, which ranks documents by relevance to a query; more occurrences of the query terms generally mean a higher score
!pip install faiss-cpu==1.11.0  # Library from Facebook (Meta) for finding vectors similar to a query, even in a large dataset
!pip install sentence-transformers  # Library that provides models for creating vector embeddings
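
The BM25 ranking mentioned in the `rank_bm25` comment above can be sketched in plain Python. This is a minimal illustration of the Okapi BM25 formula with a commonly used non-negative IDF variant. It is not the `rank_bm25` library's implementation, and the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical sentences, whitespace-tokenized)
corpus = [
    "greenhouse gases trap heat in the atmosphere",
    "the resume lists data science experience",
    "burning fossil fuels releases greenhouse gases",
]
tokenized = [doc.split() for doc in corpus]

scores = bm25_scores("greenhouse gases".split(), tokenized)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

A document that contains none of the query terms scores zero. When two documents contain the same query terms equally often, the shorter one ranks higher because of the length normalization.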

In [None]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings  # Defaults to sentence-transformers/all-mpnet-base-v2; a different model can be passed via model_name
from langchain_community.vectorstores import FAISS

In [None]:
path = "File_Name.pdf"

In [None]:
loader = PyPDFLoader(path)
documents = loader.load()

In [None]:
chunk_size = 1000
chunk_overlap = 200
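
To see what `chunk_size` and `chunk_overlap` control, here is a small sketch that splits a string into fixed-size character windows that share some characters with their neighbors. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one. With the values used in this notebook (1000 and 200), up to 200 characters would be repeated between consecutive chunks, so a sentence that straddles a boundary is less likely to be cut off in both.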
