# **Vector Databases & RAG**

Today, we focus on the "Storage" layer of the RAG (Retrieval-Augmented Generation) pipeline. You cannot feed a 500-page PDF into an LLM context window efficiently. Instead, we must break it down, convert it into mathematical representations (vectors), and store it in a database optimized for similarity search, not exact keyword matching.

Here is the roadmap to build your local Knowledge Base ingestion engine.

### Phase 1: Topic Breakdown

```text
L18: Vector Databases & Ingestion
├── Concept 1: Vector Database Architecture (ChromaDB)
│   ├── Dense Vectors vs. Sparse Vectors
│   ├── Indexing (HNSW) vs. Exact Search
│   ├── Intuition: High-dimensional space navigation
│   ├── Simpler Terms: A library organized by "meaning" rather than "alphabet"
│   └── Task: Initialize a persistent ChromaDB client
│
├── Concept 2: Document Loading (PDF Parsing)
│   ├── Unstructured Data Extraction
│   ├── Intuition: Turning binary PDF format into raw string data
│   └── Task: Extract raw text from a PDF file using a library (e.g., pypdf)
│
├── Concept 3: Text Splitting (RecursiveCharacterTextSplitter)
│   ├── Context Window Constraints
│   ├── Semantic boundary preservation
│   ├── Chunk Overlap intuition
│   └── Task: Implement the splitting logic with overlap
│
├── Concept 4: Embedding Generation (The Bridge)
│   ├── The role of the Embedding Model (e.g., OpenAI, HuggingFace)
│   ├── Input (Text) -> Output (List of Floats)
│   └── Task: Generate dummy or real embeddings for a text sample
│
├── Concept 5: Ingestion (Collections & Upsert)
│   ├── Collections (Tables equivalent)
│   ├── IDs, Embeddings, Documents, and Metadata association
│   └── Task: Insert chunks into the ChromaDB collection
│
├── Concept 6: Metadata Filtering
│   ├── Pre-filtering vs. Post-filtering
│   ├── Hybrid search basics
│   └── Task: Query the DB with a specific metadata constraint
│
└── Mini-Project: PDF Knowledge Base Builder
    └── Build a script that takes a PDF path, processes it, and makes it searchable.

```
---

## Concept 1: Vector Database Architecture (ChromaDB)

### Intuition

Traditional relational databases (SQL) are designed for precision. If you query `SELECT * FROM items WHERE color = 'red'`, the database checks for an exact string match. It either is "red" or it isn't.

However, language is messy. "Crimson", "Ruby", and "Scarlet" are all semantically similar to "Red", but an SQL query would miss them. Vector Databases are designed to solve this. Instead of storing data as just text or numbers, they store data as **Vectors** (long lists of floating-point numbers) in a multi-dimensional space.

In this space, concepts with similar meanings sit close to each other, as measured by a distance function. Searching involves finding the "Nearest Neighbors" to a query vector, rather than exact row matching.

### Mechanics: HNSW

If you have 1 million vectors, calculating the distance between your query and every single vector (linear scan) is too slow for production.
ChromaDB (and others like Weaviate/Pinecone) use **Approximate Nearest Neighbor (ANN)** algorithms. The most common is **HNSW (Hierarchical Navigable Small World)**.
   * **Structure:** It builds a multi-layer graph. Top layers act like express highways (long jumps across the data), while bottom layers allow for fine-grained navigation.
   * **Process:** The search starts at the top, zooms in on the general neighborhood, and descends until it finds the closest points in the bottom layer.
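
To make the baseline concrete, here is a minimal sketch of the exact linear scan that ANN indexes such as HNSW are designed to avoid. The three-dimensional vectors and the `exact_nearest` helper are made up for illustration; real embeddings have hundreds of dimensions, and this is not how ChromaDB searches internally.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest(query, vectors, k=2):
    """Linear scan: compare the query with every stored vector (O(N) per query)."""
    scored = sorted((l2_distance(query, vec), name) for name, vec in vectors.items())
    return [name for _, name in scored[:k]]

# Toy 3-dimensional "embeddings" (illustrative values only)
toy_vectors = {
    "crimson": [0.9, 0.1, 0.0],
    "scarlet": [0.8, 0.2, 0.1],
    "ocean": [0.0, 0.2, 0.9],
}

nearest = exact_nearest([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

The scan always returns the true closest vectors, but its cost grows with every vector you store. HNSW gives up that guarantee in exchange for visiting only a small part of the graph.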

### Simpler Explanation

Imagine a massive library organized by "Vibes" instead of the Dewey Decimal System.
   * **SQL** is like looking up a book by its exact ISBN number. You either find it or you don't.
   * **Vector Search** is like saying, "I want a book that feels like 'Harry Potter'". The librarian (the Algorithm) walks to the Fantasy section (express jump), then looks at the shelf next to Rowling (local search) and hands you "Percy Jackson". It’s not the exact same book, but it's the closest match in meaning.

### Trade-offs

* **Pros:** Enables semantic search (searching by meaning, not keywords).
* **Cons:** It is **Approximate**. There is a small chance the algorithm misses the absolute "closest" vector in favor of speed. It is also computationally heavier than simple keyword lookups.

### Context

In RAG (Retrieval-Augmented Generation), we use the Vector DB to find the 3-5 paragraphs from a PDF that are most relevant to the user's question, then send only those paragraphs to the LLM.

---

### Your Task

You need to set up the infrastructure for our local knowledge base.

**Requirements:**
   1. Import the `chromadb` library.
   2. Create a class named `VectorStore`.
   3. In the `__init__` method, initialize a **Persistent Client**. This ensures that if you restart the script, the data isn't lost.
      * *Hint:* You need to specify a path where the data will be saved.
   4. Create (or get) a **Collection** named `my_knowledge_base`. A Collection is roughly equivalent to a Table in SQL.
   
   **Inputs:** None (Hardcoded path for now is fine, e.g., `./chroma_db`)
   
   **Outputs:** Print the collection object or its name to confirm creation.



In [None]:
import chromadb
import pypdf

class VectorStore():
    def __init__(self):
        """
        Constructor contains path to the chroma db
        """
        self.client = chromadb.PersistentClient(path = r"Data/")

    def get_or_create_collection(self):
        """
        Create a new collection or gets the existing collection
        """
        collection = self.client.get_or_create_collection(name = "my_knowledge_base")
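        return collection

store = VectorStore()
print(store.get_or_create_collection().name)  # confirms the collection exists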

This setup ensures your data survives between runs, which is critical for the "Build Once, Query Many" pattern of RAG applications.


---

## **Concept 2: Document Loading (PDF Parsing)**

### Intuition

A PDF file is not a simple stream of text like a `.txt` file. It is a set of instructions for a printer (e.g., "draw character 'A' at coordinates x:50, y:100").

Because of this, extracting text from a PDF is often "messy." You lose structural elements like columns, headers, or tables unless you use specialized OCR or layout-aware parsers. For basic RAG, we simply need to extract the raw text content so we can process it.

### Mechanics

We use libraries (like `pypdf`, `PyMuPDF`, or `pdfplumber`) to iterate through the binary pages of the file and attempt to reconstruct the string of text contained within.

### Simpler Explanation

Imagine a PDF is a painting of a letter. You can't just copy-paste the paint. You need a tool that looks at the painting and writes down the words it sees into a notepad (String).

### Trade-offs
   * **Pros:** Allows us to unlock the vast amount of knowledge stored in business documents.
   * **Cons:** Formatting is often lost. "Page numbers", "headers", and "footers" get mixed in with the actual content, which can confuse the AI later if not cleaned (though for this exercise, we will stick to raw extraction).
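
If you later want to clean the extracted text, one possible first step is dropping lines that hold nothing but a page number. The sketch below is an assumption about what such noise looks like (`12` or `Page 12` alone on a line); the `strip_page_numbers` helper and its pattern are hypothetical and would need tuning for real documents.

```python
import re

# Hypothetical pattern: a line that contains only a page number, e.g. "12" or "Page 12"
PAGE_NUMBER_LINE = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)

def strip_page_numbers(raw_text):
    """Drop lines that contain nothing but a page number."""
    kept = [line for line in raw_text.splitlines() if not PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)

sample = "Quarterly Report\nRevenue grew in Q3.\nPage 4\nCosts were flat.\n5"
cleaned = strip_page_numbers(sample)
print(cleaned)
```

Note that this also removes legitimate lines consisting only of a number (for example, a table cell on its own line), so treat it as a starting point rather than a safe default.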

### Your Task

Add a method `load_pdf` to your `VectorStore` class (or keep it as a standalone helper function, your choice, but helper function is usually cleaner for ingestion scripts).

**Specifications:**
   1. Input: A file path string (e.g., `"sample.pdf"`).
   2. Logic:
      * Open the file in binary read mode (`rb`).
      * Use a PDF library to read the file (`pypdf` is the standard choice; `PyPDF2` is its older, deprecated name).
      * Iterate through every page.
      * Extract text from the page and append it to a single result string.
   3. Output: Return the full text of the PDF as one long string.

**Note:** If you don't have a specific PDF library installed, you might need to `pip install pypdf` first.


In [None]:
import chromadb
import pypdf

class VectorStore():
    def __init__(self):
        """
        Constructor contains path to the chroma db
        """
        self.client = chromadb.PersistentClient(path = r"Data/")

    def get_or_create_collection(self):
        """
        Create a new collection or gets the existing collection
        """
        collection = self.client.get_or_create_collection(name = "my_knowledge_base")
        return collection

    def load_pdf(self, file):
        """
        Loads all the text of the pdf into a string. Splits each page with a \n\n
        """
        texts = []
        reader = pypdf.PdfReader(file)
        for page in reader.pages:
            text = page.extract_text()
            texts.append(text)
        return "\n\n".join(texts)


## **Concept 3: Text Splitting (Recursive Strategies)**

### Intuition

You cannot feed a whole 50-page PDF into an LLM for two reasons:
   1. **Token Limits:** LLMs have a maximum context window.
   2. **Precision:** If you ask "What is the revenue?", you want the specific *paragraph* about revenue, not the entire annual report. Feeding the whole document dilutes the "signal" with too much "noise."

We must break the text into smaller **Chunks**.

### Mechanics: Recursive Character Splitting

We don't just chop the text every 1000 characters blindly. If we did, we might cut a sentence in half:
   * *Chunk 1:* "...the revenue was $5"
   * *Chunk 2:* "million."

This destroys the meaning.

**Recursive Splitting** tries to keep related text together:
   1. First, try splitting by **Paragraphs** (`\n\n`).
   2. If a paragraph is still too big, split it by **Lines** (`\n`).
   3. If a line is too big, split by **Spaces** (` `).
   4. Finally, force a split by characters.
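
The four steps above can be sketched in plain Python. This is a simplified illustration, not the implementation of any particular library: the `recursive_split` name is hypothetical, pieces are packed greedily up to `chunk_size`, and there is no overlap handling.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Step 4: no separators left, so force a split by characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep packing
            continue
        if current:
            chunks.append(current)  # flush what was packed so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

sample = "Revenue was $5 million.\n\nCosts stayed flat across all regions this year."
chunks = recursive_split(sample, chunk_size=30)
print(chunks)
```

The short first paragraph stays whole, so "$5 million" is never cut in half. Only the second paragraph, which exceeds 30 characters, falls through to the space separator.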

### The "Overlap"

We also add an **Overlap** (e.g., 100 characters). The end of Chunk 1 is repeated as the start of Chunk 2.
   * *Reason:* This ensures that if a concept spans across the cut, the context is preserved in at least one of the chunks.

### Your Task

For this exercise, we will implement a simplified **Sliding Window Splitter** manually to understand the math of "Overlap" (building a full recursive splitter is complex regex work).

**Specifications:**
   1. Add a method `split_text` to your class (or as a helper).
   2. **Inputs:** `text` (str), `chunk_size` (int, default 1000), `chunk_overlap` (int, default 200).
   3. **Logic:**
      * Create a loop that slices the text.
      * Start at index `0`.
      * Slice from `start` to `start + chunk_size`.
      * Move the start index forward by `chunk_size - chunk_overlap`.   
   4. **Output:** Return a list of text chunks (List[str]).


In [None]:
import chromadb
import pypdf

class VectorStore():
    def __init__(self):
        """
        Constructor contains path to the chroma db
        """
        self.client = chromadb.PersistentClient(path = r"Data/")

    def get_or_create_collection(self):
        """
        Create a new collection or gets the existing collection
        """
        collection = self.client.get_or_create_collection(name = "my_knowledge_base")
        return collection

    def load_pdf(self, file):
        """
        Loads all the text of the pdf into a string. Splits each page with a \n\n
        """
        texts = []
        reader = pypdf.PdfReader(file)
        for page in reader.pages:
            text = page.extract_text()
            texts.append(text)
        return "\n\n".join(texts)

    def split_text(self, text, chunk_size = 1000, chunk_overlap = 200):
        """
        Sliding Window Splitter manually to understand the math of "Overlap"
        """
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start: start + chunk_size])
            start += chunk_size - chunk_overlap
        return chunks
        

The `while` loop gives you precise control over the index, ensuring the overlap is calculated correctly (jumping forward by `chunk_size - overlap`).
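
To see the overlap arithmetic on a small input, here is a standalone sketch of the same sliding window with tiny numbers. The `sliding_window` name and the validation check are additions for illustration; the check matters because an overlap equal to or larger than the chunk size would make the step zero or negative, and the loop would never finish.

```python
def sliding_window(text, chunk_size=10, chunk_overlap=4):
    """Fixed-size chunks; each chunk starts `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        # A step of zero or less would never reach the end of the text
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = sliding_window("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz', 'yz']
```

Each chunk begins with the last four characters of the previous one (`ghij`, `mnop`, ...). The final chunk `yz` is already fully contained in the chunk before it, which is a side effect of this simple loop and harmless for learning purposes.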


---

## **Concept 4: Embedding Generation (The Bridge)**

### Intuition

Computers cannot understand the string `"Revenue increased"`. They only understand numbers.
An **Embedding Model** acts as a translator. It accepts text and outputs a fixed-length list of floating-point numbers (a vector).
   * **Input:** "Apple"
   * **Output:** `[0.12, -0.98, 0.05, ...]` (e.g., 384 dimensions)

Crucially, this translation is **semantic**.
   * The vector for "Apple" will be mathematically closer to "Banana" (both fruits) than to "Microsoft" (tech company).
   * However, "Apple" (the company) would be closer to "Microsoft" based on context. High-quality models capture this nuance.
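
"Mathematically closer" is usually measured with cosine similarity or a distance such as L2. The sketch below uses hand-made three-dimensional vectors, which are not real model output, purely to show how the comparison works.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors; the axes loosely stand for "fruit", "tech", "size"
apple_fruit = [0.9, 0.1, 0.2]
banana = [0.8, 0.0, 0.3]
microsoft = [0.1, 0.9, 0.7]

sim_fruit = cosine_similarity(apple_fruit, banana)
sim_tech = cosine_similarity(apple_fruit, microsoft)
print(round(sim_fruit, 3), round(sim_tech, 3))
```

With these invented values, the fruit pair scores much higher than the fruit/company pair. A real embedding model learns such geometry from data instead of having it written by hand.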

### Mechanics

We typically use pre-trained Transformers (like BERT or RoBERTa). We feed the text in, and instead of asking for a classification (Cat/Dog), we take the per-token vectors from the **last hidden layer** and pool them (commonly by averaging) into a single fixed-length vector. That vector *is* the embedding.

For this lesson, we will use ChromaDB's built-in default utility, which uses the `all-MiniLM-L6-v2` model (a very fast, lightweight model).

### Your Task

Before we ingest data, I want you to "see" a vector to demystify it.
   1. Import `embedding_functions` from `chromadb.utils`.
   2. Instantiate a `DefaultEmbeddingFunction`.
   3. Run this function on the string `"Hello world"` and print the **length** of the resulting vector (to see the dimensionality) and the **first 5 numbers**.

*Note: This might download a small model file on the first run.*

In [None]:
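from chromadb.utils.embedding_functions import DefaultEmbeddingFunction

embed = DefaultEmbeddingFunction()
text_embed = embed(["Hello world"])  # the function expects a list of documents

print(len(text_embed[0]))   # dimensionality of the vector
print(text_embed[0][:5])    # first 5 numbers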

That confirms your environment is set up correctly. The default model (`all-MiniLM-L6-v2`) produces 384-dimensional vectors. This means every piece of text you ingest becomes a point in a 384-dimensional coordinate system.

---

## **Concept 5: Ingestion (Collections & Upsert)**

### Intuition

Now we combine everything. Ingestion is the pipeline of:
**Raw PDF → Text String → Chunks → Vectors → Storage.**

### Mechanics

ChromaDB's `collection.add()` method handles the heavy lifting.
You provide:
   1. **Documents:** The list of text chunks (strings).
   2. **IDs:** A unique identifier for each chunk (e.g., "pdf1_chunk0", "pdf1_chunk1").
   3. **Embeddings:** (Optional) If you don't provide them, Chroma runs the default embedding function (from Concept 4) automatically on the documents.
   4. **Metadatas:** (Optional) Dictionaries carrying extra info (e.g., `{"source": "annual_report.pdf", "page": 10}`).

### Simpler Explanation

This is the "Data Entry" phase. We are taking the messy pile of paper (PDF), cutting it into index cards (Chunks), writing a summary number on the back (Vector), and filing them into the cabinet (Collection).

### Your Task

Update your `VectorStore` class by adding a method `ingest_pdf`.

**Specifications:**
   1. **Input:** `pdf_path` (str).
   2. **Workflow:**
      * Call `load_pdf` to get the raw text.
      * Call `split_text` to get the list of chunks.
      * **Generate IDs:** Create a list of unique IDs matching the number of chunks (e.g., using `f"id_{i}"` in a loop or comprehension).
      * **Add to DB:** Call `self.collection.add` with `documents` and `ids`.
      * *Note:* You might need to ensure `self.collection` is defined in your `__init__` or called via `get_or_create_collection` before adding.


In [None]:
import chromadb
import pypdf
import os

class VectorStore():
    def __init__(self):
        """
        Constructor contains path to the chroma db
        """
        self.client = chromadb.PersistentClient(path = r"Data/")
        self.collection = None

    def get_or_create_collection(self):
        """
        Create a new collection or gets the existing collection
        """
        collection = self.client.get_or_create_collection(name = "my_knowledge_base")
        self.collection = collection
        return collection

    def load_pdf(self, file):
        """
        Loads all the text of the pdf into a string. Splits each page with a \n\n
        """
        texts = []
        reader = pypdf.PdfReader(file)
        for page in reader.pages:
            text = page.extract_text()
            texts.append(text)
        return "\n\n".join(texts)

    def split_text(self, text, chunk_size = 1000, chunk_overlap = 200):
        """
        Sliding Window Splitter manually to understand the math of "Overlap"
        """
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start: start + chunk_size])
            start += chunk_size - chunk_overlap
        return chunks


    def ingest_pdf(self, pdf_path):
        """
        Full ingestion pipeline:
        - Load PDF
        - Split into chunks
        - Generate IDs
        - Add to ChromaDB
        """
        if not hasattr(self, "collection") or self.collection is None:
            self.collection = self.get_or_create_collection()
        
        all_texts = self.load_pdf(pdf_path)
        chunks = self.split_text(all_texts)
        filename = os.path.basename(pdf_path)
        ids = [f"F_{filename}_id_{i}" for i in range(len(chunks))]

        self.collection.add(documents = chunks, ids = ids)

**Alternative ID strategies:**
Instead of `F_{filename}_id_{i}`, you can use UUIDs or hashes.
   * UUIDs (`uuid.uuid4()`) are unique for all practical purposes across ingestions and are safest at scale, but they are not human-readable and make debugging harder.
   * Hashes (e.g., `hashlib.md5(chunk_text.encode()).hexdigest()`) produce deterministic IDs (same text → same ID), which helps avoid re-ingesting identical chunks, but they have a small collision risk, depend on chunk content staying identical, and give two identical chunks the same ID.
   * Filename + index (your current approach) is readable and great for learning/debugging, but requires care to avoid re-ingesting the same file twice.
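
The three ID strategies can be compared side by side with the standard library alone. The chunk texts and the `annual_report.pdf` file name below are placeholders for illustration.

```python
import hashlib
import uuid

chunks = ["Revenue increased in Q3.", "Costs stayed flat."]
filename = "annual_report.pdf"  # hypothetical file name

# 1. Random UUIDs: different on every run, so re-ingesting creates new IDs
uuid_ids = [str(uuid.uuid4()) for _ in chunks]

# 2. Content hashes: deterministic; hashlib needs bytes, so encode the text first
hash_ids = [hashlib.md5(chunk.encode("utf-8")).hexdigest() for chunk in chunks]

# 3. Filename + index: readable, but tied to the file name and chunk order
index_ids = [f"F_{filename}_id_{i}" for i in range(len(chunks))]

print(uuid_ids)
print(hash_ids)
print(index_ids)
```

Running the snippet twice prints the same `hash_ids` and `index_ids` but different `uuid_ids`, which is exactly the trade-off between deterministic and random identifiers.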

This is functional and robust. You have successfully built a pipeline that goes from Disk $\rightarrow$ Memory $\rightarrow$ Vector Database.


---

## **Concept 6: Metadata Filtering**

### Intuition

Imagine you ingest 100 different PDFs—HR policies, IT manuals, and Financial Reports.
If a user asks "What is the refund policy?", the Vector Search might return a snippet from the *IT manual* about "refunding software licenses" when the user actually wanted the *HR policy* on travel expenses.

**Metadata Filtering** allows you to narrow the search space *before* (or sometimes after) the vector comparison. You can say: "Only search for vectors WHERE `category == 'HR'`".

### Mechanics

When you add data to ChromaDB, you can attach a dictionary to every chunk through the `metadatas` argument (a list with one dictionary per chunk):
   `metadatas=[{"source": "hr_policy.pdf", "year": 2024}]`.

When you query, you pass a filter dictionary:
   `where={"source": "hr_policy.pdf"}`.
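
To build intuition for what a `where` filter does, here is a small in-memory stand-in that pre-filters by metadata and then ranks the survivors by distance. The `filtered_search` helper, the records, and the two-dimensional vectors are all invented for this sketch; it does not reflect how ChromaDB implements filtering internally.

```python
def filtered_search(records, query_vector, where, n_results=2):
    """Pre-filtering: keep records whose metadata matches `where`, then rank by squared L2 distance."""
    def matches(metadata):
        return all(metadata.get(key) == value for key, value in where.items())

    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    candidates = [r for r in records if matches(r["metadata"])]
    candidates.sort(key=lambda r: squared_l2(r["vector"], query_vector))
    return [r["text"] for r in candidates[:n_results]]

# Invented records with toy 2-dimensional vectors
records = [
    {"text": "Software license refunds", "vector": [0.9, 0.1], "metadata": {"source": "it_manual.pdf"}},
    {"text": "Travel expense refunds", "vector": [0.7, 0.3], "metadata": {"source": "hr_policy.pdf"}},
    {"text": "Parental leave rules", "vector": [0.1, 0.9], "metadata": {"source": "hr_policy.pdf"}},
]

hits = filtered_search(records, [1.0, 0.0], where={"source": "hr_policy.pdf"})
print(hits)
```

Without the filter, the IT manual entry would rank first because its vector is closest to the query. With the filter, it is never considered.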


#### Mechanics: Querying

To ask the database a question, you use the `.query()` method. It automatically converts your text into a vector using the same model used during ingestion.

**Syntax:**
   ```python
   results = collection.query(
       query_texts=["Your question here"],  # Must be a list, even for one question
       n_results=5,                         # How many matches to return
       where={"source": "filename.pdf"}     # Optional: Metadata filter
   )
   ```

**The Output:**
The `results` variable is a dictionary containing lists. Since you can pass multiple query texts, the results are lists of lists.
   * `results['documents'][0]` -> List of the matching text chunks.
   * `results['distances'][0]` -> List of distances, one per match. These are distances rather than similarity scores, so lower means closer (Chroma's default metric is squared L2 distance).
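
Because of the list-of-lists nesting, it helps to pair the fields up before reading them. The dictionary below is a hand-written stand-in with the same shape; the texts and distance values are invented, not real query output.

```python
# Hand-written stand-in with the same nesting as a query result (values are invented)
results = {
    "ids": [["F_Test.pdf_id_3", "F_Test.pdf_id_9"]],
    "documents": [["first matching chunk", "second matching chunk"]],
    "distances": [[0.42, 0.57]],
}

# Index 0 selects the results for the first (and here only) query text
rows = list(zip(results["ids"][0], results["documents"][0], results["distances"][0]))

for chunk_id, document, distance in rows:
    print(f"{distance:.2f}  {chunk_id}  {document}")
```

If you pass several query texts at once, loop over the outer lists as well: index 1 holds the matches for the second question, and so on.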

---

### Your Task
   1. **Modify `ingest_pdf`**: Update the `self.collection.add` call to include metadata.
      * Create a list of metadata dictionaries.
      * Each dictionary should look like: `{"source": filename}`.
      * The list must have the same length as `chunks` (one dict per chunk).
   
   
   2. **Add `query` method**: Create a method `query_db(query_text, n_results=5, filter_dict=None)`.
      * It calls `self.collection.query`.
      * **Inputs:** `query_texts=[query_text]`, `n_results=n_results`.
      * **Conditional:** If `filter_dict` is provided, pass it to the `where` parameter.

In [1]:
import chromadb
import pypdf
import os

class VectorStore():
    def __init__(self):
        """
        Constructor contains path to the chroma db
        """
        self.client = chromadb.PersistentClient(path = r"Data/")
        self.collection = None

    def get_or_create_collection(self):
        """
        Create a new collection or gets the existing collection
        """
        collection = self.client.get_or_create_collection(name = "my_knowledge_base")
        self.collection = collection
        return collection

    def load_pdf(self, file):
        """
        Loads all the text of the pdf into a string. Splits each page with a \n\n
        """
        texts = []
        reader = pypdf.PdfReader(file)
        for page in reader.pages:
            text = page.extract_text()
            texts.append(text)
        return "\n\n".join(texts)

    def split_text(self, text, chunk_size = 1000, chunk_overlap = 200):
        """
        Sliding Window Splitter manually to understand the math of "Overlap"
        """
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start: start + chunk_size])
            start += chunk_size - chunk_overlap
        return chunks


    def ingest_pdf(self, pdf_path):
        """
        Full ingestion pipeline:
        - Load PDF
        - Split into chunks
        - Generate IDs
        - Add to ChromaDB
        """
        if not hasattr(self, "collection") or self.collection is None:
            self.collection = self.get_or_create_collection()
        
        all_texts = self.load_pdf(pdf_path)
        chunks = self.split_text(all_texts)
        filename = os.path.basename(pdf_path)
        
        ids = [f"F_{filename}_id_{i}" for i in range(len(chunks))]

        metadatas = [{"source" : filename} for _ in range(len(chunks))]

        self.collection.add(documents = chunks, ids = ids, metadatas = metadatas)


    def query_db(self, query_text, n_results = 5, filter_dict = None):
        """
        It calls self.collection.query.
        Inputs: query_texts=[query_text], n_results=n_results.
        Conditional: If filter_dict is provided, pass it to the where parameter.
        """
        if not hasattr(self, "collection") or self.collection is None:
            self.collection = self.get_or_create_collection()
    
        query = {
            "query_texts" : [query_text],
            "n_results" : n_results
        }
    
        if filter_dict is not None:
            query["where"] = filter_dict
    
        return self.collection.query(**query)

Passing the dictionary directly, as in `self.collection.query(query)`, does not work. The `query()` method expects separate keyword arguments such as `query_texts=["..."]` and `n_results=5`; it does not accept a single dictionary as its first input.

Fix: either pass the arguments explicitly (e.g., `query(query_texts=..., n_results=...)`) or use Python's dictionary unpacking operator (`**`) to turn the dictionary keys into keyword arguments, as `query_db` does.
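
The difference between passing a dictionary and unpacking it is easy to see with a stand-in function. `fake_query` below is a hypothetical placeholder with keyword parameters, not Chroma's real method.

```python
def fake_query(query_texts=None, n_results=10, where=None):
    """Stand-in with keyword parameters (NOT Chroma's real query method)."""
    return {"query_texts": query_texts, "n_results": n_results, "where": where}

params = {"query_texts": ["What is RAG?"], "n_results": 5}

# With **, each key becomes a keyword argument
unpacked = fake_query(**params)

# Without **, the whole dictionary lands in the first parameter
positional = fake_query(params)

print(unpacked)
print(positional)
```

In the second call, `query_texts` receives the entire dictionary and `n_results` silently keeps its default, which is why the unpacked form is the one you want.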

---

You now have a fully functional Vector Store class that can:
   1. Persist data to disk.
   2. Parse PDFs.
   3. Split text with overlap.
   4. Ingest data with metadata.
   5. Query with filters.


---

## Implementation

**Objective:** Combine all your previous work into a single script that acts as a command-line tool. You will ingest a PDF and then ask questions about it.

In [3]:
if __name__ == "__main__":
    vdb = VectorStore()
    pdf = r"Data\Test.pdf"
    vdb.get_or_create_collection()
    # ingest_pdf loads and splits the PDF itself, so no separate load/split calls are needed
    vdb.ingest_pdf(pdf)
    result = vdb.query_db("What is RAG?")

In [4]:
result = vdb.query_db("What are the advantages of RAG?")
result

{'ids': [['F_Test.pdf_id_3',
   'F_Test.pdf_id_9',
   'F_Test.pdf_id_8',
   'F_Test.pdf_id_6',
   'F_Test.pdf_id_7']],
 'embeddings': None,
 'documents': [['tion is an architectural pattern that enhances text \ngeneration by retrieving relevant external information at inference time and \nincorporating it into the model’s response. \nIn simple terms: \nRAG = Retrieve relevant documents → Generate an answer using both the query and \nthe retrieved content \nInstead of asking a language model to answer a question from memory alone, a RAG \nsystem first looks up relevant information from a database, document store, or \nknowledge base. The retrieved information is then provided as context to the language \nmodel, which uses it to produce a more accurate and grounded answer. \n \n4. Core Components of a RAG System \nA typical RAG pipeline consists of several key components: \n4.1 Knowledge Source \nThis is the external data repository that the system retrieves from. It can include: \n• PDF

## **Mini-Project: The "Smart Librarian"**

**Objective:** Upgrade your `VectorStore` to a production-grade **Corpus Manager**. It must identify files by their *content* (not just filename), prevent duplicate ingestion, and allow for removal of obsolete documents.

**The Golden Rule:** The database state must never contain duplicate chunks for the same file content.

---

### Specifications

#### 1. The Hashing Engine

Implement a mechanism to fingerprint files.
   * **Tool:** `hashlib.md5`.
   * **Requirement:** Create a method `_get_file_hash(self, filepath)` that reads a file in binary mode and returns its hexadecimal digest string.
   * **Why?** If I rename `report.pdf` to `report_final.pdf`, the hash remains identical. The system should recognize they are the same file.
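
You can verify the rename claim without a real PDF. The sketch below writes placeholder bytes to a temporary file, copies it under a new name, and compares the digests; the `file_md5` helper name and the file contents are assumptions for illustration.

```python
import hashlib
import os
import shutil
import tempfile

def file_md5(filepath, block_size=4096):
    """Hash a file in fixed-size blocks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(filepath, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "report.pdf")
    renamed = os.path.join(workdir, "report_final.pdf")
    with open(original, "wb") as handle:
        handle.write(b"placeholder bytes standing in for a PDF")
    shutil.copyfile(original, renamed)
    same_hash = file_md5(original) == file_md5(renamed)

print(same_hash)  # True: the name changed, the bytes did not
```

The hash depends only on the bytes in the file, so even a one-byte edit produces a different fingerprint while any number of renames produces the same one.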

#### 2. The Duplicate Guard (Logic Upgrade)

Modify your `ingest_pdf` method to be "Idempotent" (safe to run multiple times).
   * **Step A:** Calculate the hash of the incoming PDF.
   * **Step B:** Query the database to check if this hash already exists.
      * *Hint:* You can use `self.collection.get(where={"file_hash": current_hash})`. Check if the returned list of IDs is empty or not.
   * **Step C:**
      * **If Exists:** Abort ingestion. Print: *"Duplicate detected. Skipping [filename]."*
      * **If New:** Proceed with loading, splitting, and adding.

* **Crucial:** When adding the chunks to Chroma, every chunk's metadata **must** now include `{"file_hash": current_hash, "source": filename}`.

#### 3. The "Un-Ingest" Feature

Add a `delete_file(self, filename)` method.
   * **Input:** The filename (string).
   * **Logic:** Remove all vector entries where the metadata `source` matches the filename.
   * **Tool:** `self.collection.delete(where={...})`.
   * **Output:** Print how many chunks were deleted (optional, but good for debugging).

---

### The Test Script (In `if __name__ == "__main__":`)

You must write a script that proves your logic works by attempting to fool it.

   1. **Clean Start:** Initialize the DB. (Optional: Clear it first if you know how, or just start with a fresh path).
   2. **Round 1 (First Ingestion):** Ingest `test.pdf`.
      * *Expectation:* "Ingesting test.pdf..."

   3. **Round 2 (The Duplicate Attempt):** Attempt to ingest `test.pdf` again immediately.
      * *Expectation:* "Duplicate detected. Skipping..."

   4. **Round 3 (The Renamed Trap):** Copy `test.pdf` to `test_copy.pdf` (same content, different name). Attempt to ingest `test_copy.pdf`.
      * *Expectation:* "Duplicate detected. Skipping..." (This proves you are checking Hash, not Filename).

   5. **Round 4 (Deletion):** Call `delete_file("test.pdf")`.
      * *Expectation:* Verify via a query or count that the data is gone.



In [6]:
import chromadb
import pypdf
import os
import hashlib
import shutil


class VectorStore():
    def __init__(self):
        """
        Constructor contains path to the chroma db
        """
        self.client = chromadb.PersistentClient(path=r"Data/")
        self.collection = None

    def get_or_create_collection(self):
        """
        Create a new collection or gets the existing collection
        """
        self.collection = self.client.get_or_create_collection(
            name="my_knowledge_base"
        )
        return self.collection

    def _get_file_hash(self, filepath):
        """
        Returns an MD5 hash of the file contents
        """
        hash_md5 = hashlib.md5()
        with open(filepath, "rb") as f:
            # Read in 4 KB blocks so large files are never loaded into memory at once
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()


    def load_pdf(self, file):
        """
        Loads all the text of the pdf into a string.
        """
        texts = []
        reader = pypdf.PdfReader(file)
        for page in reader.pages:
            text = page.extract_text()
            if text:
                texts.append(text)
        return "\n\n".join(texts)

    def split_text(self, text, chunk_size=1000, chunk_overlap=200):
        """
        Sliding Window Splitter
        """
        if not 0 <= chunk_overlap < chunk_size:
            # A step of zero or less would make the loop below run forever
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - chunk_overlap
        return chunks


    def ingest_pdf(self, pdf_path):
        """
        Idempotent ingestion pipeline with duplicate detection
        """
        if self.collection is None:
            self.get_or_create_collection()

        filename = os.path.basename(pdf_path)
        file_hash = self._get_file_hash(pdf_path)

        existing = self.collection.get(
            where={"file_hash": file_hash}
        )

        if existing["ids"]:
            print(f"Duplicate detected. Skipping {filename}.")
            return

        print(f"Ingesting {filename}...")

        all_texts = self.load_pdf(pdf_path)
        chunks = self.split_text(all_texts)

        ids = [f"{file_hash}_chunk_{i}" for i in range(len(chunks))]

        metadatas = [
            {
                "source": filename,
                "file_hash": file_hash
            }
            for _ in range(len(chunks))
        ]

        self.collection.add(
            documents=chunks,
            ids=ids,
            metadatas=metadatas
        )


    def delete_file(self, filename):
        """
        Deletes all chunks belonging to a given source filename
        """
        if self.collection is None:
            self.get_or_create_collection()

        # Count before delete (for verification)
        before = self.collection.get(where={"source": filename})
        count = len(before["ids"])

        self.collection.delete(where={"source": filename})

        print(f"Deleted {count} chunks for file '{filename}'.")


    def query_db(self, query_text, n_results=5, filter_dict=None):
        """
        Query the vector database
        """
        if self.collection is None:
            self.get_or_create_collection()

        query = {
            "query_texts": [query_text],
            "n_results": n_results
        }

        if filter_dict is not None:
            query["where"] = filter_dict

        return self.collection.query(**query)

if __name__ == "__main__":
    vs = VectorStore()

    # Optional: start clean (uncomment if needed)
    # vs.client.delete_collection("my_knowledge_base")

    # Paths
    original = r"Data\Test.pdf"
    renamed = r"Data\Test - Copy.pdf"

    # ------------------------------------------------
    # Round 1: First ingestion
    # ------------------------------------------------
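    vs.ingest_pdf(original)
    # Expect on a fresh store: Ingesting Test.pdf...

    # ------------------------------------------------
    # Round 2: Duplicate ingestion (same file)
    # ------------------------------------------------
    vs.ingest_pdf(original)
    # Expect: Duplicate detected. Skipping Test.pdf.

    # ------------------------------------------------
    # Round 3: Renamed duplicate (same content)
    # ------------------------------------------------
    shutil.copyfile(original, renamed)
    vs.ingest_pdf(renamed)
    # Expect: Duplicate detected. Skipping Test - Copy.pdf.

    # ------------------------------------------------
    # Round 4: Deletion
    # ------------------------------------------------
    # The stored "source" is the exact basename used at ingestion; the match is case-sensitive
    stored_name = os.path.basename(original)
    vs.delete_file(stored_name)

    # Optional verification
    remaining = vs.collection.get(where={"source": stored_name})
    print("Remaining chunks:", len(remaining["ids"]))


Duplicate detected. Skipping Test.pdf.
Duplicate detected. Skipping Test.pdf.
Duplicate detected. Skipping Test - Copy.pdf.
Deleted 0 chunks for file 'test.pdf'.
Remaining chunks: 0

Note: this output was recorded with the call `delete_file("test.pdf")` against a store that already held `Test.pdf` from an earlier run. That is why Round 1 reports a duplicate instead of "Ingesting Test.pdf...", and why 0 chunks were deleted: the stored `source` value is `Test.pdf`, and the metadata match is case-sensitive. Passing the exact stored name, as the script above does with `os.path.basename(original)`, targets the right chunks.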
