# Custom RAG System (Gemini Flash)

This notebook builds a Retrieval-Augmented Generation (RAG) system
using:
- HuggingFace sentence embeddings
- Pinecone vector database
- Gemini Flash via the free API tier (the code uses the `models/gemini-flash-latest` alias)

This will later be compared with NotebookLM using the same PDFs and query.

In [5]:
import sys
print("Python version:", sys.version)

Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]


In [6]:
import os
from dotenv import load_dotenv

load_dotenv()

print("Gemini key loaded:", bool(os.getenv("GOOGLE_API_KEY")))
print("Pinecone key loaded:", bool(os.getenv("PINECONE_API_KEY")))


Gemini key loaded: True
Pinecone key loaded: True


In [7]:
import platform
import sys

print("OS:", platform.system())
print("Python executable:", sys.executable)
print("Python version:", sys.version)

OS: Windows
Python executable: c:\Users\GCV\dev\work\notebooklm-rag-comparison\rag-venv\Scripts\python.exe
Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]


In [8]:
%pip install -q langchain langchain-community pypdf

Note: you may need to restart the kernel to use updated packages.


In [9]:
import os
from langchain_community.document_loaders import PyPDFLoader

In [10]:
PDF_DIR = "data/pdfs"

assert os.path.exists(PDF_DIR), "PDF directory not found"
print("PDF directory found ✅")

PDF directory found ✅


In [11]:
# Sort for a reproducible order and match the extension case-insensitively
pdf_files = sorted(f for f in os.listdir(PDF_DIR) if f.lower().endswith(".pdf"))

print(f"Total PDFs found: {len(pdf_files)}")

for f in pdf_files:
    print("-", f)

Total PDFs found: 10
- leph101.pdf
- leph102.pdf
- leph103.pdf
- leph104.pdf
- leph105.pdf
- leph106.pdf
- leph107.pdf
- leph108.pdf
- leph1an.pdf
- leph1ps.pdf


In [12]:
documents = []

for pdf in pdf_files:
    pdf_path = os.path.join(PDF_DIR, pdf)
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    
    for d in docs:
        d.metadata["chapter"] = pdf
    
    documents.extend(docs)

print(f"Total pages loaded: {len(documents)}")

Total pages loaded: 236


In [13]:
sample_doc = documents[0]

print("Chapter:", sample_doc.metadata.get("chapter"))
print("Page:", sample_doc.metadata.get("page"))
print("\nText preview:\n")
print(sample_doc.page_content[:1000])

Chapter: leph101.pdf
Page: 0

Text preview:

Chapter One
ELECTRIC CHARGES
AND FIELDS
1.1  INTRODUCTION
All of us have the experience of seeing a spark or hearing a crackle when
we take off our synthetic clothes or sweater, particularly in dry weather.
Have you ever tried to find any explanation for this phenomenon? Another
common example of electric discharge is the lightning that we see in the
sky during thunderstorms. We also experience a sensation of an electric
shock either while opening the door of a car or holding the iron bar of a
bus after sliding from our seat. The reason for these experiences is
discharge of electric charges through our body, which were accumulated
due to rubbing of insulating surfaces. You might have also heard that
this is due to generation of static electricity. This is precisely the topic we
are going to discuss in this and the next chapter. Static means anything
that does not move or change with time. Electrostatics deals with
the study of forces, fields

In [14]:
empty_pages = [d for d in documents if len(d.page_content.strip()) < 50]
print(f"Empty or near-empty pages: {len(empty_pages)}")

Empty or near-empty pages: 2


In [15]:
from collections import Counter

chapter_counts = Counter(d.metadata["chapter"] for d in documents)

for chapter, count in chapter_counts.items():
    print(f"{chapter}: {count} pages")

leph101.pdf: 44 pages
leph102.pdf: 36 pages
leph103.pdf: 26 pages
leph104.pdf: 29 pages
leph105.pdf: 18 pages
leph106.pdf: 23 pages
leph107.pdf: 24 pages
leph108.pdf: 14 pages
leph1an.pdf: 6 pages
leph1ps.pdf: 16 pages


In [16]:
# Look at a few raw samples to understand noise patterns
for i in range(3):
    print(f"\n--- Sample {i+1} ---")
    print(documents[i].page_content[:800])


--- Sample 1 ---
Chapter One
ELECTRIC CHARGES
AND FIELDS
1.1  INTRODUCTION
All of us have the experience of seeing a spark or hearing a crackle when
we take off our synthetic clothes or sweater, particularly in dry weather.
Have you ever tried to find any explanation for this phenomenon? Another
common example of electric discharge is the lightning that we see in the
sky during thunderstorms. We also experience a sensation of an electric
shock either while opening the door of a car or holding the iron bar of a
bus after sliding from our seat. The reason for these experiences is
discharge of electric charges through our body, which were accumulated
due to rubbing of insulating surfaces. You might have also heard that
this is due to generation of static electricity. This is precisely the topic we
are going t

--- Sample 2 ---
2
Physics
elektron meaning amber. Many such pairs of materials were known which
on rubbing could attract light objects like straw, pith balls and bits of
papers.
I

In [17]:
import re

def clean_text(text: str) -> str:
    # Normalize newlines to spaces
    text = text.replace("\n", " ")
    
    # Remove multiple spaces
    text = re.sub(r"\s+", " ", text)
    
    # Rejoin words hyphenated across line breaks (e.g., electro- static → electrostatic)
    # without touching free-standing dashes such as "a - b"
    text = re.sub(r"(\w)-\s+(\w)", r"\1\2", text)
    
    # Remove common NCERT page artifacts (light touch)
    text = re.sub(r"\bPhysics\b", "", text)
    text = re.sub(r"\bCHAPTER\s+\d+\b", "", text, flags=re.IGNORECASE)
    
    return text.strip()
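
A hyphen-joining step can also swallow ordinary dashes if its pattern is too broad. The sketch below compares two illustrative patterns on two sample strings. The names `BROAD`, `NARROW`, `join_broad`, `join_narrow` and the sample strings are assumptions made for this example, not part of the notebook.

```python
import re

# Removes any hyphen that is followed by whitespace
BROAD = re.compile(r"-\s+")
# Only joins a hyphen that sits directly between two word characters
NARROW = re.compile(r"(\w)-\s+(\w)")

def join_broad(text: str) -> str:
    return BROAD.sub("", text)

def join_narrow(text: str) -> str:
    return NARROW.sub(r"\1\2", text)

broken_word = "electro- static field"      # word split by a line break
spaced_dash = "charge q - and its sign"    # free-standing dash

print(join_broad(broken_word))
print(join_narrow(broken_word))
print(join_broad(spaced_dash))
print(join_narrow(spaced_dash))
```

Both patterns repair the split word, but only the narrow one leaves the free-standing dash alone.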

In [18]:
cleaned_documents = []

for doc in documents:
    cleaned_text = clean_text(doc.page_content)
    
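    # Overwrite the text in place: metadata is preserved, but the raw text is
    # replaced, so `documents` and `cleaned_documents` share the same objects
    doc.page_content = cleaned_text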
    cleaned_documents.append(doc)

print(f"Cleaned documents count: {len(cleaned_documents)}")

Cleaned documents count: 236
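
Because the loop assigns to `doc.page_content`, the original pages are overwritten and no raw copy remains for a before/after comparison. A minimal sketch of a non-mutating alternative follows. `Page` is a hypothetical stand-in for a loaded page object, and `cleaned_copy` is an illustrative helper. With LangChain one would presumably build a new `Document` with the same two fields instead.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical stand-in for a loaded page: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def cleaned_copy(page: Page, cleaner) -> Page:
    # Build a new object so the original text stays available for comparison
    return Page(
        page_content=cleaner(page.page_content),
        metadata=copy.deepcopy(page.metadata),
    )

raw = Page(
    "Chapter One\nELECTRIC CHARGES\nAND FIELDS",
    {"chapter": "leph101.pdf", "page": 0},
)
# Illustrative cleaner: collapse all whitespace runs into single spaces
clean = cleaned_copy(raw, lambda t: " ".join(t.split()))

print(repr(raw.page_content))
print(repr(clean.page_content))
```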


In [19]:
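# Note: the previous cell cleaned each page in place, so documents[0] already
# holds cleaned text and both previews below print the same content
print("BEFORE CLEANING:\n")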
print(documents[0].page_content[:600])

print("\n" + "-"*80 + "\n")

print("AFTER CLEANING:\n")
print(cleaned_documents[0].page_content[:600])

BEFORE CLEANING:

Chapter One ELECTRIC CHARGES AND FIELDS 1.1 INTRODUCTION All of us have the experience of seeing a spark or hearing a crackle when we take off our synthetic clothes or sweater, particularly in dry weather. Have you ever tried to find any explanation for this phenomenon? Another common example of electric discharge is the lightning that we see in the sky during thunderstorms. We also experience a sensation of an electric shock either while opening the door of a car or holding the iron bar of a bus after sliding from our seat. The reason for these experiences is discharge of electric charges thr

--------------------------------------------------------------------------------

AFTER CLEANING:

Chapter One ELECTRIC CHARGES AND FIELDS 1.1 INTRODUCTION All of us have the experience of seeing a spark or hearing a crackle when we take off our synthetic clothes or sweater, particularly in dry weather. Have you ever tried to find any explanation for this phenomenon? Another co

In [20]:
short_docs = [d for d in cleaned_documents if len(d.page_content) < 100]

print(f"Very short documents after cleaning: {len(short_docs)}")

Very short documents after cleaning: 3


In [21]:
lengths = [len(d.page_content) for d in cleaned_documents]

print("Min length:", min(lengths))
print("Max length:", max(lengths))
print("Average length:", sum(lengths) // len(lengths))

Min length: 20
Max length: 3915
Average length: 2100
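
A minimum length of 20 characters means a few near-empty pages are still in the set and will become tiny chunks later. The notebook keeps them. One possible way to separate them out is sketched below, where `MIN_CHARS`, `split_by_length` and the sample strings are assumptions for illustration.

```python
MIN_CHARS = 100  # assumed cutoff, mirroring the "< 100" check used for short_docs

def split_by_length(texts, min_chars=MIN_CHARS):
    """Return (kept, dropped) lists based on stripped text length."""
    kept = [t for t in texts if len(t.strip()) >= min_chars]
    dropped = [t for t in texts if len(t.strip()) < min_chars]
    return kept, dropped

# Synthetic page texts standing in for real page content
pages = ["x" * 2100, "Answers", "y" * 3915, " " * 40]
kept, dropped = split_by_length(pages)

print("kept:", len(kept), "dropped:", len(dropped))
```

Inspecting `dropped` before discarding anything is a reasonable precaution, since a short page may still carry a heading or a formula.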


In [22]:
# From now on, we use cleaned_documents only
documents = cleaned_documents

In [23]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [24]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " ", ""]
)

In [25]:
chunks = text_splitter.split_documents(documents)

print(f"Total chunks created: {len(chunks)}")

Total chunks created: 660
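
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window chunker. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` applies, and `sliding_chunks` is a name invented for this sketch.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(25))  # 25 digits: 0123456789012...
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)

for c in chunks:
    print(c)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 3 characters of one chunk reappear as the first 3 of the next.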


In [26]:
sample_chunk = chunks[0]

print("Chunk metadata:")
print(sample_chunk.metadata)

print("\nChunk text preview:\n")
print(sample_chunk.page_content[:800])

Chunk metadata:
{'source': 'data/pdfs\\leph101.pdf', 'page': 0, 'chapter': 'leph101.pdf'}

Chunk text preview:

Chapter One ELECTRIC CHARGES AND FIELDS 1.1 INTRODUCTION All of us have the experience of seeing a spark or hearing a crackle when we take off our synthetic clothes or sweater, particularly in dry weather. Have you ever tried to find any explanation for this phenomenon? Another common example of electric discharge is the lightning that we see in the sky during thunderstorms. We also experience a sensation of an electric shock either while opening the door of a car or holding the iron bar of a bus after sliding from our seat. The reason for these experiences is discharge of electric charges through our body, which were accumulated due to rubbing of insulating surfaces. You might have also heard that this is due to generation of static electricity. This is precisely the topic we are going to


In [27]:
chunk_lengths = [len(c.page_content) for c in chunks]

print("Min chunk length:", min(chunk_lengths))
print("Max chunk length:", max(chunk_lengths))
print("Average chunk length:", sum(chunk_lengths) // len(chunk_lengths))

Min chunk length: 20
Max chunk length: 1000
Average chunk length: 809


In [28]:
print("Chunk 1 (end):")
print(chunks[0].page_content[-300:])

print("\nChunk 2 (start):")
print(chunks[1].page_content[:300])

Chunk 1 (end):
heard that this is due to generation of static electricity. This is precisely the topic we are going to discuss in this and the next chapter. Static means anything that does not move or change with time. Electrostatics deals with the study of forces, fields and potentials arising from static charges

Chunk 2 (start):
. Electrostatics deals with the study of forces, fields and potentials arising from static charges . 1.2 ELECTRIC CHARGE Historically the credit of discovery of the fact that amber rubbed with wool or silk cloth attracts light objects goes to Thales of Miletus, Greece, around 600 BC. The name electr
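
The overlap between neighbouring chunks can also be measured rather than read off by eye. The helper below is a small sketch, with `overlap_length` being an illustrative name. It returns the length of the longest suffix of one string that is also a prefix of the next, and the sample strings are shortened from the preview above.

```python
def overlap_length(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0

left = "change with time. Electrostatics deals with the study of forces"
right = ". Electrostatics deals with the study of forces . 1.2 ELECTRIC CHARGE"

n = overlap_length(left, right)
print("Shared characters:", n)
print("Shared text:", repr(right[:n]))
```

Applied to `chunks[i]` and `chunks[i + 1]`, this gives the effective overlap per boundary, which can be smaller than `chunk_overlap` when the splitter breaks on a separator.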


In [29]:
from collections import Counter

chunk_chapter_count = Counter(c.metadata["chapter"] for c in chunks)

for chapter, count in chunk_chapter_count.items():
    print(f"{chapter}: {count} chunks")

leph101.pdf: 127 chunks
leph102.pdf: 106 chunks
leph103.pdf: 75 chunks
leph104.pdf: 85 chunks
leph105.pdf: 50 chunks
leph106.pdf: 59 chunks
leph107.pdf: 61 chunks
leph108.pdf: 47 chunks
leph1an.pdf: 11 chunks
leph1ps.pdf: 39 chunks


In [30]:
# From now on, we only work with chunks
documents = chunks

In [31]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [32]:
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

Loading weights: 100%|██████████| 103/103 [00:00<00:00, 1032.21it/s, Materializing param=pooler.dense.weight]                             
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [33]:
text_sample = documents[0].page_content

vector = embedding_model.embed_query(text_sample)

print("Vector type:", type(vector))
print("Vector length:", len(vector))
print("First 5 values:", vector[:5])

Vector type: <class 'list'>
Vector length: 384
First 5 values: [-0.04597190022468567, -0.05844249203801155, 0.1506771743297577, 0.06692424416542053, 0.10947562009096146]


In [34]:
# embed_documents embeds a list of texts in a single batched call
batch_vectors = embedding_model.embed_documents(
    [doc.page_content for doc in documents[:5]]
)

print([len(v) for v in batch_vectors])

[384, 384, 384, 384, 384]


In [35]:
print("Sample metadata:")
for k, v in documents[0].metadata.items():
    print(f"{k}: {v}")

Sample metadata:
source: data/pdfs\leph101.pdf
page: 0
chapter: leph101.pdf


In [36]:
import os
from dotenv import load_dotenv

load_dotenv()

print("Pinecone key loaded:", bool(os.getenv("PINECONE_API_KEY")))
print("Pinecone env loaded:", bool(os.getenv("PINECONE_ENV")))

Pinecone key loaded: True
Pinecone env loaded: True


In [37]:
import os
from dotenv import load_dotenv
from pinecone import Pinecone

load_dotenv()

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

print("Pinecone key loaded:", bool(PINECONE_API_KEY))

Pinecone key loaded: True


In [38]:
pc = Pinecone(api_key=PINECONE_API_KEY)

In [39]:
from pinecone import ServerlessSpec

INDEX_NAME = "class12-physics-rag"

existing_indexes = [idx.name for idx in pc.list_indexes()]

if INDEX_NAME not in existing_indexes:
    pc.create_index(
        name=INDEX_NAME,
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"   # you can change if needed
        )
    )
    print("Index created:", INDEX_NAME)
else:
    print("Index already exists:", INDEX_NAME)

Index already exists: class12-physics-rag


In [40]:
index = pc.Index(INDEX_NAME)

In [41]:
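def prepare_vectors(docs, embedding_model):
    vectors = []
    for i, doc in enumerate(docs):
        vectors.append({
            "id": f"chunk-{i}",
            "values": embedding_model.embed_query(doc.page_content),
            # Store the chunk text as well: the prompt-building code later
            # reads it back from metadata["text"]
            "metadata": {**doc.metadata, "text": doc.page_content}
        })
    return vectors

vectors = prepare_vectors(documents, embedding_model)

# Upsert in small batches to keep each request well under Pinecone's size limit
BATCH_SIZE = 100
for start in range(0, len(vectors), BATCH_SIZE):
    index.upsert(vectors=vectors[start:start + BATCH_SIZE])

print(f"Upserted {len(vectors)} vectors into Pinecone")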


Upserted 660 vectors into Pinecone


In [42]:
query_vector = embedding_model.embed_query(
    "What is electric flux in physics?"
)

results = index.query(
    vector=query_vector,
    top_k=3,
    include_metadata=True
)

for match in results["matches"]:
    print("Score:", match["score"])
    print("Metadata:", match["metadata"])
    print("-" * 40)


Score: 0.54483366
Metadata: {'chapter': 'leph106.pdf', 'page': 2, 'source': 'data/pdfs\\leph106.pdf'}
----------------------------------------
Score: 0.543595314
Metadata: {'chapter': 'leph101.pdf', 'page': 21, 'source': 'data/pdfs\\leph101.pdf'}
----------------------------------------
Score: 0.517638206
Metadata: {'chapter': 'leph101.pdf', 'page': 20, 'source': 'data/pdfs\\leph101.pdf'}
----------------------------------------
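
The scores above are cosine similarities, since the index was created with `metric="cosine"`. The sketch below shows the same idea as a brute-force search over a handful of toy vectors. A hosted index does not necessarily scan every vector this way, and `cosine_similarity`, `top_k` and the three-dimensional vectors are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k(query, stored, k=3):
    """Score every stored vector against the query and return the k best."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in stored.items()]
    scored.sort(reverse=True)
    return scored[:k]

stored = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], stored, k=2)

for score, vec_id in results:
    print(f"{vec_id}: {score:.3f}")
```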


In [43]:
import google.generativeai as genai

In [44]:
import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = genai.GenerativeModel("models/gemini-flash-latest")
print("Gemini Flash ready ✅")

Gemini Flash ready ✅


In [45]:
def retrieve_context(query, top_k=5):
    query_vector = embedding_model.embed_query(query)
    
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )
    
    contexts = []
    for match in results["matches"]:
        ctx = {
            "score": match["score"],
            "metadata": match["metadata"]
        }
        contexts.append(ctx)
    
    return results, contexts


In [46]:
def build_prompt(query, retrieved_texts):
    context_text = ""
    for i, item in enumerate(retrieved_texts, start=1):
        md = item["metadata"]
        context_text += f"[Source {i} | Chapter: {md.get('chapter')} | Page: {md.get('page')}]\n"
        context_text += md.get("text", "")
        context_text += "\n\n"
    
    prompt = f"""
You are a physics tutor. Answer the question ONLY using the context provided.
If the answer is not present in the context, say "Not found in the provided material."

CONTEXT:
{context_text}

QUESTION:
{query}

ANSWER (clear, concise, Class 12 level):
"""
    return prompt
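
`build_prompt` reads the chunk text from the `text` key of each match's metadata, so the prompt only carries real context if that key was stored at upsert time. The sketch below formats the context the same way and also counts matches that arrive without text. `format_context` and both sample match lists are hypothetical, and the chunk text is a placeholder.

```python
def format_context(matches):
    """Return (context_text, missing), where missing counts matches without stored text."""
    parts = []
    missing = 0
    for i, item in enumerate(matches, start=1):
        md = item["metadata"]
        text = md.get("text", "")
        if not text:
            missing += 1
        header = f"[Source {i} | Chapter: {md.get('chapter')} | Page: {md.get('page')}]"
        parts.append(f"{header}\n{text}")
    return "\n\n".join(parts), missing

with_text = [
    {"score": 0.54, "metadata": {"chapter": "leph101.pdf", "page": 21,
                                 "text": "placeholder chunk text"}},
]
without_text = [
    {"score": 0.54, "metadata": {"chapter": "leph101.pdf", "page": 21}},
]

print(format_context(with_text))
print(format_context(without_text))
```

A non-zero `missing` count is a signal that the model would be answering from the source headers alone.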


In [47]:
import os
from dotenv import load_dotenv

load_dotenv(override=True)

key = os.getenv("GOOGLE_API_KEY")

print("Key exists:", bool(key))
if key:  # avoid a TypeError when the variable is not set
    print("Key starts with:", key[:6])
    print("Key length:", len(key))

Key exists: True
Key starts with: AIzaSy
Key length: 39


In [48]:
import google.generativeai as genai

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = genai.GenerativeModel("models/gemini-flash-latest")
response = model.generate_content("Say hello in one sentence.")

print(response.text)

Hello!


In [51]:
SYSTEM_PROMPT = """
You are an expert Class 12 Physics tutor strictly following the NCERT curriculum.

You must answer the question using ONLY the information present in the provided sources.
Do NOT use outside knowledge.
If any required information is missing from the sources, explicitly say:
"Not found in the provided material."

Your answer must be:
- Detailed
- Well-structured
- NCERT-aligned
- Divided into clear sections with headings

You MUST follow the structure and level of detail demonstrated in the example below.

====================
EXAMPLE (REFERENCE STYLE — DO NOT COPY VERBATIM)
====================

Question:
Explain electric flux and its physical significance.

Answer:

Electric Flux

Electric flux is a quantity defined to mathematically represent the flow of the electric field through a given surface. Although no physical substance flows in electrostatics, electric flux helps quantify how much electric field passes through a region of space.

Mathematical Definition

An area element is represented as a vector ΔS whose magnitude is the area and whose direction is normal to the surface. The electric flux Δφ through this area element placed in an electric field E is defined as the dot product of the electric field and the area vector:

Δφ = E · ΔS = EΔS cosθ

where θ is the angle between the electric field and the normal to the surface.

Physical Significance

1. Electric flux is proportional to the number of electric field lines passing through a surface.
2. It forms the basis of Gauss’s Law, which relates electric flux through a closed surface to the charge enclosed.
3. If the net electric flux through a closed surface is zero, it implies that the total charge enclosed within the surface is zero.

====================
END OF EXAMPLE
====================

Now, answer the given question by closely following the same structure, depth, and clarity as shown in the example above.

When answering, organize the response into the following sections
(only include sections supported by the sources):

1. Definition  
2. Mathematical Formulation  
3. Units (if mentioned in the sources)  
4. Physical Significance  
5. Related Laws or Principles (if present in the sources)  
6. Summary / Key Takeaways  

Formatting rules:
- Use clear headings.
- Use equations where appropriate.
- Maintain formal, NCERT-aligned language.
- Do NOT introduce concepts not explicitly supported by the sources.
"""


In [52]:
def rag_answer(query, top_k=5):
    # Embed query
    query_vector = embedding_model.embed_query(query)
    
    # Retrieve from Pinecone
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )
    
    # Build grounded context
    context_block = ""
    sources = []
    
    for i, match in enumerate(results["matches"], start=1):
        md = match["metadata"]
        text = md.get("text") or md.get("page_content", "")
        
        context_block += (
            f"[Source {i} | Chapter: {md.get('chapter')} | Page: {md.get('page')}]\n"
            f"{text}\n\n"
        )
        
        sources.append({
            "chapter": md.get("chapter"),
            "page": md.get("page"),
            "score": match["score"]
        })
    
    # Final prompt
    prompt = f"""
{SYSTEM_PROMPT}

SOURCES:
{context_block}

QUESTION:
{query}

ANSWER:
"""
    
    response = model.generate_content(prompt)
    return response.text, sources


In [53]:
query = "Explain electric flux and its physical significance."

answer, sources = rag_answer(query, top_k=5)

print("ANSWER:\n")
print(answer)

print("\n--- SOURCES USED ---")
for s in sources:
    print(
        f"Chapter: {s['chapter']} | "
        f"Page: {s['page']} | "
        f"Score: {s['score']:.3f}"
    )


ANSWER:

This answer is based exclusively on the information provided in the sources.

### 1. Definition

Electric flux ($\phi$) is a quantity used to represent the flow of the electric field through a given surface. Although no physical substance is flowing in electrostatics, electric flux helps mathematically quantify how much of the electric field penetrates or passes through a region of space.

For a small planar area element, the area itself is treated as a vector, $\Delta \mathbf{S}$. The magnitude of this vector is the area $\Delta S$, and its direction is defined to be along the outward normal to the area element.

The electric flux is also related conceptually to the visual representation of the electric field. The number of field lines crossing a unit area placed normal to the electric field lines is proportional to the magnitude of the electric field $\mathbf{E}$.

### 2. Mathematical Formulation

The electric flux ($\Delta \phi$) through a small planar area element $\Delta 