# 02. Data Ingestion & Indexing Pipeline

This notebook handles the ETL (Extract, Transform, Load) process for our Agentic RAG system.

**Steps:**
1.  **Load PDFs**: Read research papers from `data/papers`.
2.  **Preprocess**: Clean text and handle formatting artifacts.
3.  **Chunk**: Split text into semantic chunks suitable for retrieval.
4.  **Embed**: Convert chunks into vector embeddings.
5.  **Index**: Store embeddings in a local Vector Database (ChromaDB).

In [None]:
import os
import requests
from typing import List

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Configuration
DATA_DIR = "data/papers"
DB_DIR = "data/chroma_db"
EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

## 1. Helper: Download Sample Paper
If the directory is empty, let's download "Attention Is All You Need" as a test case.

In [None]:
def download_sample_paper():
    """Download a sample paper into DATA_DIR if the directory is empty."""
    url = "https://arxiv.org/pdf/1706.03762.pdf"
    target_path = os.path.join(DATA_DIR, "1706.03762v5.pdf")

    # Create the data directory if it does not exist yet
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.listdir(DATA_DIR):
        print(f"Downloading sample paper to {target_path}...")
        response = requests.get(url, timeout=60)
        # Fail loudly on HTTP errors instead of saving an error page as a PDF
        response.raise_for_status()
        with open(target_path, "wb") as f:
            f.write(response.content)
        print("Download complete.")
    else:
        print("Directory is not empty, skipping download.")

download_sample_paper()
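
A download can succeed at the HTTP level and still deliver something other than a PDF (for example an HTML rate-limit page). A minimal sketch of a sanity check, assuming it is enough to look for the `%PDF-` signature at the start of the payload; `is_pdf_bytes` is a hypothetical helper, not part of the notebook:

```python
def is_pdf_bytes(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature '%PDF-'."""
    return data.startswith(b"%PDF-")


# Toy payloads standing in for a real HTTP response body
good_payload = b"%PDF-1.5\n...binary content..."
bad_payload = b"<html><body>Too many requests</body></html>"

print(is_pdf_bytes(good_payload))
print(is_pdf_bytes(bad_payload))
```

In `download_sample_paper`, such a check could be applied to `response.content` before the file is written.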

## 2. Load Documents
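We use `PyPDFLoader` to extract text from every PDF file in the directory. The loader returns one `Document` per page, so the count printed below is a page count.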

In [None]:
def load_documents(directory: str) -> List[Document]:
    documents = []
    # Keep only PDFs (case-insensitive) and sort for a deterministic load order
    pdf_files = sorted(f for f in os.listdir(directory) if f.lower().endswith(".pdf"))
    for filename in pdf_files:
        file_path = os.path.join(directory, filename)
        print(f"Loading {filename}...")
        loader = PyPDFLoader(file_path)
        documents.extend(loader.load())
    # Report the number of PDFs actually loaded, not every file in the directory
    print(f"Loaded {len(documents)} pages from {len(pdf_files)} files.")
    return documents

raw_documents = load_documents(DATA_DIR)
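
The step list at the top mentions preprocessing, but no cell in this notebook performs it. A minimal sketch of what such a step could look like, assuming the main artifacts are words hyphenated across line breaks, hard line wraps inside paragraphs, and repeated whitespace; `clean_page_text` is a hypothetical helper:

```python
import re


def clean_page_text(text: str) -> str:
    """Remove common PDF extraction artifacts from one page of text."""
    # Rejoin words split across a line break: "atten-\ntion" -> "attention"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single newlines (hard wraps) with spaces, keep blank lines
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "atten-\ntion is  all\nyou need\n\n\nNext paragraph"
print(clean_page_text(sample))
```

The dehyphenation rule also merges genuine compounds that happen to break at a line end (such as "self-" followed by "attention"), so treat it as a heuristic. It could be applied to each loaded page by reassigning `doc.page_content` before chunking.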

## 3. Split Text (Chunking)
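We use `RecursiveCharacterTextSplitter`, which tries each separator in order (paragraph breaks, line breaks, spaces, then single characters) until the pieces fit the size limit. For research papers, keeping context is key, so we use a relatively large chunk size (1000 characters) with a 200-character overlap between neighbouring chunks.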

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(raw_documents)
print(f"Created {len(chunks)} chunks from original documents.")

# Preview the first chunk
if chunks:
    print("\n--- Chunk Preview ---")
    print(chunks[0].page_content[:500] + "...")
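
To make the effect of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works (that splitter prefers separator boundaries); it only illustrates how consecutive chunks share text when the window advances by `chunk_size - chunk_overlap`. `sliding_window_chunks` is a hypothetical helper:

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return windows


toy_text = "".join(str(i % 10) for i in range(2600))
toy_chunks = sliding_window_chunks(toy_text)
print(len(toy_chunks))
# The last 200 characters of one window open the next one
print(toy_chunks[0][-200:] == toy_chunks[1][:200])
```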

## 4. Load the Embedding Model
We use `sentence-transformers/all-mpnet-base-v2` via `HuggingFaceEmbeddings`. The model runs locally; the cell below uses the CPU, and a GPU can be selected through the `device` setting.

In [None]:
print("Loading embedding model... (this may take a moment the first time)")
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL,
    model_kwargs={'device': 'cpu'}, # Use 'cuda' if you have a GPU
    encode_kwargs={'normalize_embeddings': True}
)
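
The `normalize_embeddings` option scales every vector to unit length. A small sketch of why that is convenient, using toy 2-D vectors instead of real embeddings: for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance is `2 - 2 * cosine`, so ranking by either measure gives the same order. The helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])  # direction of (3, 4), length 1
b = l2_normalize([1.0, 0.0])

cosine = dot(a, b)
print(cosine, squared_l2(a, b), 2 - 2 * cosine)
```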

## 5. Vector Store Indexing (ChromaDB)
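We persist the database to disk so later notebooks can open the existing index instead of rebuilding it. Note that running the cell below again embeds and adds all chunks a second time.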

In [None]:
# Initialize Chroma and add documents
# If the DB already exists, this loads it and appends the chunks again,
# so re-running this cell creates duplicate entries
# To reset, delete the 'data/chroma_db' folder manually
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=DB_DIR
)

print(f"\nIndexed {len(chunks)} chunks into ChromaDB at '{DB_DIR}'.")
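
Because chunks are appended on every run, repeated indexing fills the store with duplicates. One common remedy is to derive a deterministic ID for each chunk from its source, page, and text. The sketch below shows only the ID generation; `stable_chunk_ids` is a hypothetical helper, and the `(source, page, text)` record layout is an assumption for illustration.

```python
import hashlib
from collections import Counter


def stable_chunk_ids(records):
    """Return one deterministic ID per (source, page, text) record.

    Identical records in the same batch get a numeric suffix so that
    every ID in the returned list is unique.
    """
    seen = Counter()
    ids = []
    for source, page, text in records:
        key = f"{source}|{page}|{text}".encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()[:32]
        ids.append(f"{digest}-{seen[digest]}")
        seen[digest] += 1
    return ids


# Toy records; real ones would be built from each chunk's metadata and page_content
records = [
    ("paper.pdf", 0, "Attention is all you need"),
    ("paper.pdf", 1, "Scaled dot-product attention"),
    ("paper.pdf", 0, "Attention is all you need"),  # repeated chunk
]
ids = stable_chunk_ids(records)
print(ids)
```

`Chroma.from_documents` accepts an `ids` argument, so a list like this could be passed alongside `chunks`. Whether an existing ID is overwritten or rejected depends on the installed `langchain_chroma` and `chromadb` versions, so check that behaviour before relying on it.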

## 6. Test Retrieval
Let's verify the index works by asking a simple question related to the papers.

In [None]:
query = "What is the main advantage of the Transformer model?"
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

results = retriever.invoke(query)

print(f"Query: {query}\n")
for i, doc in enumerate(results):
    print(f"[Result {i+1}] Source: {doc.metadata.get('source', 'Unknown')} | Page: {doc.metadata.get('page', 'Unknown')}")
    print(doc.page_content[:200] + "...\n")
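
Conceptually, the retriever with `k=3` returns the three stored chunks whose embeddings are closest to the query embedding. A brute-force sketch of that idea on toy unit vectors (Chroma itself uses an approximate index rather than a full scan; `top_k` is a hypothetical helper):

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors with the highest dot product to the query."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), index)
        for index, vec in enumerate(doc_vecs)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


toy_query = [1.0, 0.0]
toy_docs = [
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 0.0],   # identical direction
    [0.6, 0.8],   # partially aligned
    [-1.0, 0.0],  # opposite direction
]
best = top_k(toy_query, toy_docs, k=3)
print(best)
```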