# Custom RAG System (Gemini Flash)

This notebook builds a Retrieval-Augmented Generation (RAG) system using:
- HuggingFace sentence embeddings
- Pinecone vector database
- Gemini 1.5 Flash (free API tier)

Its answers will later be compared with NotebookLM's, using the same PDFs and the same query.

In [1]:
import sys
print("Python version:", sys.version)

Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]


In [2]:
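import os
from dotenv import load_dotenv

# Read API keys from a local .env file into the process environment.
load_dotenv()

# Report only whether each key is set; never print the key itself.
print("Gemini key loaded:", bool(os.getenv("GOOGLE_API_KEY")))
print("Pinecone key loaded:", bool(os.getenv("PINECONE_API_KEY")))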


Gemini key loaded: True
Pinecone key loaded: True
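
The check above only prints a boolean, so a missing key would surface later as an API error. A minimal sketch of a fail-fast variant follows. The helper names `missing_keys` and `require_keys` are hypothetical and not part of the notebook; only the two variable names come from the cell above.

```python
import os

# Variable names taken from the cell above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "PINECONE_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


def require_keys(env=None, required=REQUIRED_KEYS):
    """Raise a clear error if any required key is missing (hypothetical helper)."""
    missing = missing_keys(env, required)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_keys()` right after `load_dotenv()` would stop the notebook at the setup cell instead of at the first Gemini or Pinecone request.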


In [3]:
import platform
import sys

print("OS:", platform.system())
print("Python executable:", sys.executable)
print("Python version:", sys.version)

OS: Windows
Python executable: c:\Users\GCV\dev\work\notebooklm-rag-comparison\rag-venv\Scripts\python.exe
Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]


In [4]:
%pip install -q langchain langchain-community pypdf

Note: you may need to restart the kernel to use updated packages.


In [6]:
import os
from langchain_community.document_loaders import PyPDFLoader

In [7]:
PDF_DIR = "data/pdfs"

assert os.path.exists(PDF_DIR), "PDF directory not found"
print("PDF directory found ✅")

PDF directory found ✅


In [8]:
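# Sort the listing so the load order is deterministic (os.listdir returns entries in arbitrary order).
pdf_files = sorted(f for f in os.listdir(PDF_DIR) if f.lower().endswith(".pdf"))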

print(f"Total PDFs found: {len(pdf_files)}")

for f in pdf_files:
    print("-", f)

Total PDFs found: 10
- leph101.pdf
- leph102.pdf
- leph103.pdf
- leph104.pdf
- leph105.pdf
- leph106.pdf
- leph107.pdf
- leph108.pdf
- leph1an.pdf
- leph1ps.pdf


In [9]:
documents = []

for pdf in pdf_files:
    pdf_path = os.path.join(PDF_DIR, pdf)
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()  # one Document per PDF page

    # Tag every page with its source file so retrieved text can be traced to a chapter.
    for d in docs:
        d.metadata["chapter"] = pdf

    documents.extend(docs)

print(f"Total pages loaded: {len(documents)}")

Total pages loaded: 236
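
The `chapter` metadata above stores the raw file name. If a numeric chapter is more convenient for filtering, it could be derived from the name. This is only a sketch: it assumes that names such as `leph101.pdf` end in a two-digit chapter number (consistent with `leph101.pdf` opening with "Chapter One"), and `chapter_number` is a hypothetical helper, not part of the notebook. Files that do not fit the pattern, such as `leph1an.pdf` and `leph1ps.pdf`, return `None`.

```python
import re

# Assumption: chapter files are named "leph1" + two-digit chapter number + ".pdf".
CHAPTER_PATTERN = re.compile(r"^leph1(\d{2})\.pdf$")


def chapter_number(filename):
    """Return the chapter number encoded in the file name, or None if absent."""
    match = CHAPTER_PATTERN.match(filename)
    return int(match.group(1)) if match else None


for name in ["leph101.pdf", "leph108.pdf", "leph1an.pdf"]:
    print(name, "->", chapter_number(name))
```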


In [10]:
sample_doc = documents[0]

print("Chapter:", sample_doc.metadata.get("chapter"))
print("Page:", sample_doc.metadata.get("page"))  # zero-based page index within the PDF
print("\nText preview:\n")
print(sample_doc.page_content[:1000])

Chapter: leph101.pdf
Page: 0

Text preview:

Chapter One
ELECTRIC CHARGES
AND FIELDS
1.1  INTRODUCTION
All of us have the experience of seeing a spark or hearing a crackle when
we take off our synthetic clothes or sweater, particularly in dry weather.
Have you ever tried to find any explanation for this phenomenon? Another
common example of electric discharge is the lightning that we see in the
sky during thunderstorms. We also experience a sensation of an electric
shock either while opening the door of a car or holding the iron bar of a
bus after sliding from our seat. The reason for these experiences is
discharge of electric charges through our body, which were accumulated
due to rubbing of insulating surfaces. You might have also heard that
this is due to generation of static electricity. This is precisely the topic we
are going to discuss in this and the next chapter. Static means anything
that does not move or change with time. Electrostatics deals with
the study of forces, fields

In [11]:
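# Flag pages with fewer than 50 characters after stripping whitespace (likely blank or image-only).
empty_pages = [d for d in documents if len(d.page_content.strip()) < 50]
print(f"Empty or near-empty pages: {len(empty_pages)}")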

Empty or near-empty pages: 2
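
The cell above only counts the near-empty pages. One option is to set them aside before embedding, as sketched below. The notebook does not say this is done, so treat it as an assumption. `Page` is a hypothetical stand-in for the loaded documents (only `page_content` is used), and `split_by_content` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is used."""
    page_content: str
    metadata: dict = field(default_factory=dict)


MIN_CHARS = 50  # same threshold as the check above


def split_by_content(pages, min_chars=MIN_CHARS):
    """Return (kept, dropped): pages at or above the threshold, and the rest."""
    kept, dropped = [], []
    for page in pages:
        target = kept if len(page.page_content.strip()) >= min_chars else dropped
        target.append(page)
    return kept, dropped


sample = [Page("x" * 120), Page("   \n"), Page("Fig. 1.2")]
kept, dropped = split_by_content(sample)
print(len(kept), len(dropped))  # 1 2
```

Returning the dropped pages as well makes it easy to inspect them before deciding whether the threshold is appropriate.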


In [12]:
from collections import Counter

chapter_counts = Counter(d.metadata["chapter"] for d in documents)

for chapter, count in chapter_counts.items():
    print(f"{chapter}: {count} pages")

leph101.pdf: 44 pages
leph102.pdf: 36 pages
leph103.pdf: 26 pages
leph104.pdf: 29 pages
leph105.pdf: 18 pages
leph106.pdf: 23 pages
leph107.pdf: 24 pages
leph108.pdf: 14 pages
leph1an.pdf: 6 pages
leph1ps.pdf: 16 pages
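
As a quick consistency check, the per-file counts above should add up to the 236 pages reported when the PDFs were loaded. The sketch below hard-codes the printed counts so it runs on its own; in the notebook, `chapter_counts` could be used directly.

```python
# Page counts copied from the output above.
page_counts = {
    "leph101.pdf": 44,
    "leph102.pdf": 36,
    "leph103.pdf": 26,
    "leph104.pdf": 29,
    "leph105.pdf": 18,
    "leph106.pdf": 23,
    "leph107.pdf": 24,
    "leph108.pdf": 14,
    "leph1an.pdf": 6,
    "leph1ps.pdf": 16,
}

total_pages = sum(page_counts.values())
print("Sum of per-file counts:", total_pages)  # 236

# Must match the "Total pages loaded" figure printed earlier.
assert total_pages == 236
```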
