# Part 1: Document Ingestion and Indexing for Medical RAG

This notebook covers the first stage of building our Retrieval-Augmented Generation (RAG) system:
1. Loading PDF documents from a specified folder.
2. Chunking the documents into smaller, manageable nodes.
3. Generating embeddings for these nodes.
4. Building a FAISS vector index and persisting it to disk.

In [None]:
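%pip install llama-index faiss-cpu sentence-transformers pypdf2
%pip install llama-index-embeddings-huggingface
# %pip install llama-index-readers-file # If SimpleDirectoryReader needs it explicitly
# %pip install llama-index-vector-stores-faiss # If indexing.py relies on llama-index's FaissVectorStore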

print("Dependencies installed. You may need to restart the kernel for changes to take effect.")

In [None]:
import os
import sys
from pathlib import Path

# Add src directory to Python path to import custom modules
# Assuming the notebook is in 'medical_rag_agent/notebooks/' and src is in 'medical_rag_agent/src/'
module_path = str(Path.cwd().parent / 'src')
if module_path not in sys.path:
    sys.path.append(module_path)

from document_processor import load_pdfs_from_folder, chunk_documents
from indexing import get_embedding_model, build_vector_index, load_vector_index

# Define paths
# The data_folder should point to 'medical_rag_agent/data/sample_medical_pdfs/' relative to the project root
# The notebook is in 'medical_rag_agent/notebooks/', so '../data/sample_medical_pdfs/'
data_folder = '../data/sample_medical_pdfs/'
# The vector_store_path should be relative to the project root as well.
# Let's save it in 'medical_rag_agent/vector_store_notebook/'
vector_store_notebook_path = '../vector_store_notebook' 

print(f"Data folder: {Path(data_folder).resolve()}")
print(f"Vector store path for this notebook: {Path(vector_store_notebook_path).resolve()}")
print(f"Python sys.path includes: {module_path}")

# Create vector store directory if it doesn't exist
os.makedirs(vector_store_notebook_path, exist_ok=True)

## 1. Load PDF Documents

We'll use the `load_pdfs_from_folder` function from our `document_processor` module.
This function expects a path to a folder containing PDF files.
We've prepared a sample PDF in `data/sample_medical_pdfs/`.
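
Before loading, it can help to confirm that the folder actually contains PDF files, since an empty or mistyped path is the most common reason for "No documents found". The helper below is a minimal sketch using only the standard library; `list_pdf_files` is a hypothetical name and is not part of `document_processor`. It only looks at files directly inside the folder, which may differ from how `load_pdfs_from_folder` scans it.

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root.resolve()}")
    # Compare the extension case-insensitively so 'REPORT.PDF' is included.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Example usage in this notebook (assumes `data_folder` is defined above):
# for pdf in list_pdf_files(data_folder):
#     print(pdf.name)
```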

In [None]:
documents = load_pdfs_from_folder(data_folder)
if documents:
    print(f"Successfully loaded {len(documents)} document(s).")
    for doc in documents:
        print(f"Document ID: {doc.doc_id}, File: {doc.metadata.get('file_name', 'N/A')}")
        # print(f"First 100 chars: {doc.text[:100]}...") # Optional: print snippet
else:
    print("No documents found or loaded. Please check the data_folder path and its contents.")

## 2. Chunk Documents

Next, we split the loaded documents into smaller chunks (nodes). Our `chunk_documents` function attempts to identify common medical/scientific section headers (e.g., 'Introduction', 'Methods', 'Results', 'Diagnosis', 'Treatment Plan') and chunks along them, so that each chunk stays within a single topic. If a section is too large, it is split further by sentence; if no headers are found, sentence-based splitting is applied to the whole document.
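
The strategy described above can be sketched in plain Python. This is an illustration only, not the actual `chunk_documents` implementation: the header list, the regular expressions, and the names `split_sentences` and `chunk_by_section` are assumptions, and the real function returns llama-index nodes rather than tuples.

```python
import re

# Hypothetical header list; the real function may recognise a different set.
SECTION_HEADERS = ["Introduction", "Methods", "Results", "Diagnosis", "Treatment Plan"]
HEADER_RE = re.compile(
    r"^[ \t]*(" + "|".join(re.escape(h) for h in SECTION_HEADERS) + r")[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sentences(text, max_chars):
    """Greedily pack whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole, not cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def chunk_by_section(text, max_chars=3000):
    """Return (section, chunk_text) pairs; section is None outside any header."""
    matches = list(HEADER_RE.finditer(text))
    if not matches:
        # No headers found: sentence-split the whole document.
        return [(None, c) for c in split_sentences(text, max_chars)]
    chunks = []
    preamble = text[: matches[0].start()].strip()
    if preamble:
        chunks.extend((None, c) for c in split_sentences(preamble, max_chars))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        if not body:
            continue
        name = match.group(1).title()
        if len(body) <= max_chars:
            chunks.append((name, body))
        else:
            # Section too large: fall back to sentence splitting inside it.
            chunks.extend((name, c) for c in split_sentences(body, max_chars))
    return chunks
```

Keeping the section name alongside each chunk mirrors the `section` metadata field printed in the next cell.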

In [None]:
if documents: # Proceed only if documents were loaded
    # chunk_documents is called with its defaults here. To override them, pass the parameters explicitly, e.g.:
    # nodes = chunk_documents(documents, default_chunk_size=512, default_chunk_overlap=100, max_chars_per_section_chunk=3000)
    nodes = chunk_documents(documents)
    print(f"Successfully chunked {len(documents)} document(s) into {len(nodes)} nodes.")
    if nodes:
        print(f"Example Node 1 ID: {nodes[0].id_}")
        print(f"Example Node 1 Text (first 100 chars): {nodes[0].get_content()[:100]}...")
        print(f"Example Node 1 Metadata (file_name): {nodes[0].metadata.get('file_name', 'N/A')}")
        print(f"Example Node 1 Metadata (section): {nodes[0].metadata.get('section', 'N/A')}")
        if len(nodes) > 1: # Print another node if available to see variety
            print(f"Example Node 2 ID: {nodes[1].id_}")
            print(f"Example Node 2 Text (first 100 chars): {nodes[1].get_content()[:100]}...")
            print(f"Example Node 2 Metadata (file_name): {nodes[1].metadata.get('file_name', 'N/A')}")
            print(f"Example Node 2 Metadata (section): {nodes[1].metadata.get('section', 'N/A')}")
else:
    print("Skipping chunking as no documents were loaded.")

## 3. Initialize Embedding Model

We'll use a pre-trained model from Hugging Face via the `HuggingFaceEmbedding` class (module `llama_index.embeddings.huggingface`, provided by the `llama-index-embeddings-huggingface` package).
The `get_embedding_model` function from `indexing.py` handles this.
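
The embedding dimension matters because the FAISS index must be created with the same dimension. Not every embedding class exposes it as an attribute, so one robust way to find it is to embed a short probe string and measure the result. The sketch below assumes only that the model offers `get_text_embedding`, the standard method on llama-index embedding models; `embedding_dimension` is a hypothetical helper name.

```python
def embedding_dimension(embed_model, probe_text="dimension probe"):
    """Return the length of the vector the model produces for a short text."""
    vector = embed_model.get_text_embedding(probe_text)
    return len(vector)


# Example usage in this notebook (assumes `embed_model` is initialised):
# print(f"Embedding dimension: {embedding_dimension(embed_model)}")
```

For `sentence-transformers/all-MiniLM-L6-v2`, the vectors have 384 dimensions.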

In [None]:
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed_model = get_embedding_model(model_name=embed_model_name)
print(f"Successfully initialized embedding model: {embed_model_name}")
if hasattr(embed_model, 'embed_dim'):
    print(f"Embedding dimension: {embed_model.embed_dim}")

## 4. Build and Save Vector Index

Now we combine the chunked nodes and the embedding model to build our FAISS vector index.
The `build_vector_index` function will store the index files in the `vector_store_notebook_path` directory.
This process can take a few moments depending on the number of nodes.
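
For orientation, here is one way such build and load functions could look using llama-index's FAISS integration. This is a sketch under stated assumptions, not the code in `indexing.py`: the function names, the `dim` parameter, and the choice of a flat L2 index are hypothetical, and it requires `faiss-cpu` and `llama-index-vector-stores-faiss`. The names of the files written to disk depend on the implementation.

```python
def build_index_sketch(nodes, embed_model, persist_dir, dim):
    """Embed `nodes`, store the vectors in a flat L2 FAISS index, and persist it."""
    # Imports live inside the function so the cell can be defined even when
    # the optional packages are not installed.
    import faiss
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.faiss import FaissVectorStore

    # `dim` must equal the embedding model's output dimension.
    faiss_index = faiss.IndexFlatL2(dim)  # exact (brute-force) L2 search
    vector_store = FaissVectorStore(faiss_index=faiss_index)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(
        nodes, storage_context=storage_context, embed_model=embed_model
    )
    index.storage_context.persist(persist_dir=persist_dir)
    return index


def load_index_sketch(persist_dir, embed_model):
    """Reload an index previously persisted by build_index_sketch."""
    from llama_index.core import StorageContext, load_index_from_storage
    from llama_index.vector_stores.faiss import FaissVectorStore

    vector_store = FaissVectorStore.from_persist_dir(persist_dir)
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store, persist_dir=persist_dir
    )
    return load_index_from_storage(storage_context, embed_model=embed_model)
```

A flat index compares the query against every stored vector, which is simple and exact and is usually adequate for a small document set like the sample data.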

In [None]:
if 'nodes' in locals() and nodes and embed_model: # Proceed only if nodes and model are available
    print("Building vector index. This may take a moment...")
    # Make sure the path used here is where you want the Colab notebook's index to be stored.
    index = build_vector_index(nodes, embed_model, storage_persist_dir=vector_store_notebook_path)
    print(f"Successfully built and saved the vector index to: {vector_store_notebook_path}")
    
    # Verify that files were created
    expected_faiss_file = Path(vector_store_notebook_path) / "vector_store.faiss"
    expected_docstore_file = Path(vector_store_notebook_path) / "docstore.json"
    if expected_faiss_file.exists() and expected_docstore_file.exists():
        print(f"Verified: FAISS file exists at {expected_faiss_file}")
        print(f"Verified: Docstore file exists at {expected_docstore_file}")
    else:
        print(f"Warning: Could not verify all index files in {vector_store_notebook_path}. Check build_vector_index implementation.")
        print(f"Contents of {vector_store_notebook_path}: {list(Path(vector_store_notebook_path).iterdir())}")

else:
    print("Skipping index building as nodes or embedding model are not available.")

## 5. Test Loading the Index (Optional)

To ensure persistence worked correctly, let's try loading the index back.
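
A loaded index can also be exercised without an LLM by using it as a retriever, which only needs the embedding model to encode the query. The helper below is a minimal sketch: `preview_retrieval` is a hypothetical name, and it assumes the index provides `as_retriever(similarity_top_k=...)` and that each result exposes `.score` and `.node.get_content()`, as llama-index retrieval results do. How to interpret the score depends on the type of FAISS index used.

```python
def preview_retrieval(index, query, top_k=3, max_chars=100):
    """Return (score, snippet) pairs for the top_k nodes retrieved for `query`."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    results = retriever.retrieve(query)
    return [(hit.score, hit.node.get_content()[:max_chars]) for hit in results]


# Example usage once `loaded_index` exists (the query string is illustrative):
# for score, snippet in preview_retrieval(loaded_index, "treatment plan"):
#     print(score, snippet)
```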

In [None]:
if 'index' in locals() and embed_model and Path(vector_store_notebook_path).exists():
    try:
        print(f"Attempting to load index from: {vector_store_notebook_path}")
        loaded_index = load_vector_index(storage_persist_dir=vector_store_notebook_path, embed_model=embed_model)
        print("Successfully loaded index from disk.")
        # You can try a sample query if you had an LLM configured, but for now, just loading is fine.
        # Example: query_engine = loaded_index.as_query_engine()
        # response = query_engine.query("What is RAG?")
        # print(response)
    except Exception as e:
        print(f"Error loading index: {e}")
        import traceback
        traceback.print_exc()
else:
    print("Skipping index loading test as prerequisites are not met.")

## Next Steps

With the index built and saved, the next stage involves setting up a query engine and then an LLM agent to interact with this indexed medical data.