<h2>Agentic RAG App Doc-store (Vector-store) and Embeddings</h2>

The plan: first establish a document/vector store and experiment with it to see how it works with the Groq LLM. Any information missing from the store will then be filled in using Tavily (internet search). Next, set up agents with specific roles using CrewAI. The functions for all of this are written in week 1, and the Streamlit front end is built in week 2.

<h3>Doc-store (Vector-store) and Embeddings</h3>

1. Your documents live locally in your project

2. You turn those documents into embeddings (numbers)

3. You store those numbers locally in FAISS

4. A user asks a question. The question is directed to the 'researcher' agent of CrewAI.

5. The researcher finds the most relevant document chunks using FAISS, or uses Tavily to run an internet search for the information.

6. You send those chunks to Groq. The other agents use Groq to fulfill their roles.

7. Groq writes the answer

Groq never sees embeddings.
FAISS never talks to Groq.
The CrewAI agents each use Groq in a different way to fulfill their role, e.g. find an answer, write an answer, critique an answer.
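
The division of labour above can be sketched in a few lines of plain Python. This is a toy illustration, not the project's implementation: `local_search`, `web_search`, and the `min_hits` cutoff are hypothetical stand-ins for the FAISS lookup, the Tavily search, and whatever rule the researcher agent ends up using to decide that the store has no answer. The point it shows is that only text chunks reach the prompt; embeddings stay on the retrieval side.

```python
from typing import Callable, List


def research(
    question: str,
    local_search: Callable[[str], List[str]],
    web_search: Callable[[str], List[str]],
    min_hits: int = 1,
) -> List[str]:
    """Return text chunks for the question, preferring the local store."""
    chunks = local_search(question)
    if len(chunks) >= min_hits:
        return chunks
    # Hypothetical fallback rule: too few local hits, so search the web instead.
    return web_search(question)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Only plain text goes to the LLM; it never sees vectors."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Stub back ends so the sketch runs without FAISS, Tavily, or Groq.
def fake_local(q: str) -> List[str]:
    return ["Diels-Alder chunk"] if "diels" in q.lower() else []


def fake_web(q: str) -> List[str]:
    return [f"web result for: {q}"]


local_chunks = research("Is the Diels-Alder reaction used?", fake_local, fake_web)
web_chunks = research("What is an unrelated topic?", fake_local, fake_web)
prompt = build_prompt("Is the Diels-Alder reaction used?", local_chunks)
```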


<h4>Creating the doc store (done using FAISS)</h4>

We need to create a vector store that can take in documents and store them. We need to be able to save the store so we don't have to keep feeding it the same documents between uses. We also want to be able to add new documents as we find them.

In [1]:
import langchain_core
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

from langchain_core.documents import Document

from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader,
    UnstructuredMarkdownLoader,
    DirectoryLoader,
)

from sentence_transformers import SentenceTransformer



In [27]:
import os
from typing import List

def scan_directory(directory_path: str) -> List[str]:
    """
    Scan a given directory and return the paths of all files inside.

    Args:
        directory_path (str): The path of the directory to scan.

    Returns:
        List[str]: A list of file paths.
    """
    file_paths = []
    for root, _, files in os.walk(directory_path):
        for file in files:
            file_paths.append(os.path.join(root, file))
    return file_paths
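
Since the upload function further down only handles `.pdf`, `.txt`, and `.md` files, it can be convenient to filter the scan results first. The helper below is a hypothetical addition (the name `filter_supported` and the `SUPPORTED_EXTENSIONS` constant are not part of the notebook), shown on a temporary directory so that it runs on its own.

```python
import os
import tempfile
from typing import Iterable, List

# Assumed to mirror the extensions the upload function accepts.
SUPPORTED_EXTENSIONS = (".pdf", ".txt", ".md")


def filter_supported(paths: Iterable[str]) -> List[str]:
    """Keep only paths with a supported extension (case-insensitive), sorted."""
    return sorted(p for p in paths if p.lower().endswith(SUPPORTED_EXTENSIONS))


with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty-ish files, one of them with an unsupported extension.
    for name in ("a.pdf", "notes.TXT", "image.png", "readme.md"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("x")
    # Same walk as scan_directory above.
    found = []
    for root, _, files in os.walk(tmp):
        found.extend(os.path.join(root, f) for f in files)
    kept = [os.path.basename(p) for p in filter_supported(found)]
```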

In [28]:
# `from __future__` imports must be the first statement in the cell
from __future__ import annotations

import hashlib
import os
from typing import List, Union, Optional

# wrapper around sentence_transformers to match LangChain embeddings interface
class SentenceTransformerEmbeddingsWrapper:
    def __init__(self, model_name: str = "BAAI/bge-small-en-v1.5", lazy: bool = True):
        # lazy=True avoids downloading the model at import/definition time.
        self.model_name = model_name
        self._model: Optional[SentenceTransformer] = None
        self.lazy = lazy
        if not self.lazy:
            self._ensure_model()

    def _ensure_model(self) -> None:
        if self._model is None:
            self._model = SentenceTransformer(self.model_name)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Return embeddings for a list of documents.
        This matches LangChain's `embed_documents` contract.
        """
        self._ensure_model()
        emb = self._model.encode(texts, show_progress_bar=False)
        return emb.tolist()

    def embed_query(self, text: str) -> List[float]:
        self._ensure_model()
        emb = self._model.encode([text], show_progress_bar=False)
        return emb[0].tolist()

    def __call__(self, texts: Union[str, List[str]]):
        # Support both single-query strings and lists of documents.
        if isinstance(texts, str):
            return self.embed_query(texts)
        if isinstance(texts, (list, tuple)):
            return self.embed_documents(list(texts))
        raise TypeError(f"Unsupported input type for embeddings: {type(texts)}")
    
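def create_faiss_store_from_documents(documents: List[Document], index_dir: str = "vectorstore", embedding_model: str = "BAAI/bge-small-en-v1.5"):
    """
    Build a FAISS vectorstore from a list of langchain Document objects and save it to disk.
    Each Document is first tagged with metadata["content_hash"] so that content-based
    deduplication also covers the documents from the initial ingestion.
    Returns the in-memory FAISS store.
    """
    os.makedirs(index_dir, exist_ok=True)

    # Ensure documents have a content_hash for idempotent ingestion
    for d in documents:
        # only operate on Document instances
        if not isinstance(d, Document):
            continue
        if not d.metadata.get("content_hash"):
            d.metadata["content_hash"] = hashlib.sha256((d.page_content or "").encode("utf-8")).hexdigest()
        # ensure a source field exists for traceability
        d.metadata.setdefault("source", "")

    embeddings = SentenceTransformerEmbeddingsWrapper(embedding_model)
    # Pass the embeddings wrapper object so FAISS can access its
    # `embed_documents` / `embed_query` methods as expected.
    store = FAISS.from_documents(documents, embeddings)
    store.save_local(index_dir)
    return store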

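def load_faiss_store(index_dir: str = "vectorstore", embedding_model: str = "BAAI/bge-small-en-v1.5"):
    """
    Load a previously saved FAISS vectorstore from disk.
    The embedding model must match the one used to build the index.
    """
    embeddings = SentenceTransformerEmbeddingsWrapper(embedding_model)
    # FAISS.load_local needs the embeddings object to embed queries.
    # The docstore is pickled, so loading requires this explicit opt-in;
    # only load indexes you created yourself.
    return FAISS.load_local(index_dir, embeddings, allow_dangerous_deserialization=True)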

def add_documents_and_save(store: FAISS, new_documents: List[Document], index_dir: str = "vectorstore"):
    """
    Add documents to an existing FAISS store and persist to disk.
    """
    store.add_documents(new_documents)
    store.save_local(index_dir)
    return store


def path_upload_document_to_vectorstore(
    document_paths: Union[str, List[str]],
    store: FAISS,
    index_dir: str = "vectorstore",
    dedup_mode: str = "content",  # "content" (recommended) or "source"
):
    """
    Upload one or more documents to an existing FAISS vectorstore and persist to disk,
    WITHOUT erasing existing contents, with deduplication.

    In content mode, each loaded Document is given a `content_hash` in its metadata
    so that later uploads can detect exact duplicates.

    Deduplication behavior:
      - dedup_mode="content": hashes each loaded Document.page_content and skips exact duplicates.
      - dedup_mode="source": skips if an existing Document has the same metadata["source"].

    Notes:
      - For PDFs, PyPDFLoader returns one Document per page; dedup will happen at page level.
      - This function assumes your loaders populate metadata["source"] (LangChain usually does).
    """
    if isinstance(document_paths, str):
        document_paths = [document_paths]
    
    def _content_hash(text: str) -> str:
        # stable content-based id
        return hashlib.sha256((text or "").encode("utf-8")).hexdigest()


    def _get_existing_hashes(store: FAISS) -> set[str]:
        """
        Extract content hashes from the existing LangChain FAISS docstore.
        We store hashes in Document.metadata["content_hash"] for idempotent ingestion.
        """
        existing = set()

        # LangChain FAISS keeps docs in an InMemoryDocstore at store.docstore._dict
        doc_dict = getattr(getattr(store, "docstore", None), "_dict", None)
        if isinstance(doc_dict, dict):
            for d in doc_dict.values():
                if isinstance(d, Document):
                    h = d.metadata.get("content_hash")
                    if h:
                        existing.add(h)
        return existing

    # Build the dedup index from the current store
    existing_hashes = _get_existing_hashes(store) if dedup_mode == "content" else set()

    existing_sources = set()
    if dedup_mode == "source":
        doc_dict = getattr(getattr(store, "docstore", None), "_dict", None)
        if isinstance(doc_dict, dict):
            for d in doc_dict.values():
                if isinstance(d, Document):
                    src = d.metadata.get("source")
                    if src:
                        existing_sources.add(src)

    total_added = 0
    total_skipped = 0

    for document_path in document_paths:
        # Determine the loader type based on file extension
        if document_path.lower().endswith(".pdf"):
            loader_cls = PyPDFLoader
        elif document_path.lower().endswith(".txt"):
            loader_cls = TextLoader
        elif document_path.lower().endswith(".md"):
            loader_cls = UnstructuredMarkdownLoader
        else:
            print(f"Unsupported file type for {document_path}. Supported types are: .pdf, .txt, .md")
            continue

        # Load the document(s)
        try:
            loader = loader_cls(document_path)
            loaded_docs = loader.load()
            print(f"Loaded {len(loaded_docs)} document(s) from {document_path}.")
        except Exception as e:
            print(f"Error loading document {document_path}: {e}")
            continue

        # Tag documents with dedup metadata and filter duplicates
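        new_docs: List[Document] = []
        # Snapshot of the sources already known before this file is processed
        sources_before_file = set(existing_sources)
        for d in loaded_docs:
            # Ensure source is set for traceability (helps source-based dedup & citations)
            d.metadata.setdefault("source", document_path)

            if dedup_mode == "source":
                src = d.metadata.get("source")
                # Compare against the snapshot: all pages of one PDF share a source,
                # so checking the live set would skip every page after the first.
                if src in sources_before_file:
                    total_skipped += 1
                    continue
                existing_sources.add(src)
                new_docs.append(d)
                continue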

            # content-based dedup (recommended)
            h = d.metadata.get("content_hash")
            if not h:
                h = _content_hash(d.page_content)
                d.metadata["content_hash"] = h

            if h in existing_hashes:
                total_skipped += 1
                continue

            existing_hashes.add(h)
            new_docs.append(d)

        if not new_docs:
            print(f"No new chunks/pages to add from {document_path} (all duplicates).")
            continue

        # Add to store and persist
        store.add_documents(new_docs)
        store.save_local(index_dir)
        total_added += len(new_docs)

        print(f"Added {len(new_docs)} new document(s) from {document_path} and saved to '{index_dir}'.")

    print(f"Done. Added: {total_added}, skipped as duplicates: {total_skipped}.")
    return store
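
The content-based deduplication above reduces to hashing each page's text and keeping a set of the hashes already seen. A stripped-down sketch of that idea, using plain strings instead of LangChain `Document` objects (the function name `dedup_by_content` is illustrative only):

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def content_hash(text: str) -> str:
    """Stable SHA-256 hex digest of the text."""
    return hashlib.sha256((text or "").encode("utf-8")).hexdigest()


def dedup_by_content(pages: Iterable[str], seen: Set[str]) -> Tuple[List[str], int]:
    """Return (new_pages, skipped_count) and record new hashes in `seen`."""
    new_pages: List[str] = []
    skipped = 0
    for page in pages:
        h = content_hash(page)
        if h in seen:
            skipped += 1
            continue
        seen.add(h)
        new_pages.append(page)
    return new_pages, skipped


seen_hashes: Set[str] = set()
first_added, first_skipped = dedup_by_content(["page one", "page two"], seen_hashes)
second_added, second_skipped = dedup_by_content(["page two", "page three"], seen_hashes)
```

Exact hashing only catches byte-identical text: a page that differs by a single whitespace character hashes differently and would be added again.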

In [None]:
# Build the initial store from a test document (init_document.pdf).
# Later additions go through path_upload_document_to_vectorstore() with the
# main documents in the 'documents' folder.
run = False  # set to True to (re)build the store from scratch
if run:
    initial_document_directory = "initial_document"
    loader = DirectoryLoader(initial_document_directory)
    docs = loader.load()

    # Create a FAISS vector store from the loaded documents
    store = create_faiss_store_from_documents(docs, index_dir="vectorstore")



Loading weights: 100%|██████████| 199/199 [00:00<00:00, 2507.41it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: BAAI/bge-small-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


In [30]:
# Update the FAISS store with all files under `main_documents_directory` (documents/). These make up the bulk of the vectorstore.
main_documents_directory = "documents"
file_paths = scan_directory(main_documents_directory)
store = path_upload_document_to_vectorstore(file_paths, store, index_dir="vectorstore", dedup_mode="content")
# Note: this notebook now computes and persists metadata['content_hash'] at initial ingestion.
# New indexes created after this change will include content hashes and enable immediate
# content-based deduplication. If you have an older index created before this change,
# rebuild it or run the upload path to add hashes before relying on content deduplication.

Loaded 15 document(s) from documents/c1cs15013h.pdf.
Added 15 new document(s) from documents/c1cs15013h.pdf and saved to 'vectorstore'.
Loaded 13 document(s) from documents/c5cs00105f.pdf.
Added 13 new document(s) from documents/c5cs00105f.pdf and saved to 'vectorstore'.
Loaded 65 document(s) from documents/photochemical-approaches-to-complex-chemotypes-applications-in-natural-product-synthesis.pdf.
Added 65 new document(s) from documents/photochemical-approaches-to-complex-chemotypes-applications-in-natural-product-synthesis.pdf and saved to 'vectorstore'.
Loaded 17 document(s) from documents/c9np00039a.pdf.
Added 17 new document(s) from documents/c9np00039a.pdf and saved to 'vectorstore'.
Loaded 42 document(s) from documents/c5ob00169b.pdf.
Added 42 new document(s) from documents/c5ob00169b.pdf and saved to 'vectorstore'.
Loaded 7 document(s) from documents/d2qo00043a.pdf.
Added 7 new document(s) from documents/d2qo00043a.pdf and saved to 'vectorstore'.
Loaded 48 document(s) from doc

In [None]:
# Re-run to test its duplicate-spotting capabilities
main_documents_directory = "documents"
file_paths = scan_directory(main_documents_directory)
store = path_upload_document_to_vectorstore(file_paths, store, index_dir="vectorstore", dedup_mode="content")

Loaded 15 document(s) from documents/c1cs15013h.pdf.
No new chunks/pages to add from documents/c1cs15013h.pdf (all duplicates).
Loaded 13 document(s) from documents/c5cs00105f.pdf.
No new chunks/pages to add from documents/c5cs00105f.pdf (all duplicates).
Loaded 65 document(s) from documents/photochemical-approaches-to-complex-chemotypes-applications-in-natural-product-synthesis.pdf.
No new chunks/pages to add from documents/photochemical-approaches-to-complex-chemotypes-applications-in-natural-product-synthesis.pdf (all duplicates).
Loaded 17 document(s) from documents/c9np00039a.pdf.
No new chunks/pages to add from documents/c9np00039a.pdf (all duplicates).
Loaded 42 document(s) from documents/c5ob00169b.pdf.
No new chunks/pages to add from documents/c5ob00169b.pdf (all duplicates).
Loaded 7 document(s) from documents/d2qo00043a.pdf.
No new chunks/pages to add from documents/d2qo00043a.pdf (all duplicates).
Loaded 48 document(s) from documents/natural-product-synthesis-using-multicom

In [32]:
# Working nicely!! It spots duplicates!

# Can we import a docstore for use between sessions?

In [41]:
def load_docstore_from_dir(index_dir: str = "vectorstore", embedding_model: str = "BAAI/bge-small-en-v1.5"):
    """
    Load a FAISS-backed docstore from disk and return (store, documents_list).
    """
    if not os.path.isdir(index_dir):
        raise FileNotFoundError(f"Index directory '{index_dir}' not found.")

    embeddings = SentenceTransformerEmbeddingsWrapper(embedding_model)
    try:
        store = FAISS.load_local(index_dir, embeddings, allow_dangerous_deserialization=True)
    except Exception as e:
        raise RuntimeError(f"Failed to load FAISS store from '{index_dir}': {e}")

    doc_dict = getattr(getattr(store, "docstore", None), "_dict", None) or {}
    docs = [d for d in doc_dict.values() if isinstance(d, Document)]

    print(f"Loaded FAISS store from '{index_dir}' with {len(docs)} document(s).")
    return store, docs

loaded_vstore, loaded_vstore_docs = load_docstore_from_dir()

`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


Loaded FAISS store from 'vectorstore' with 549 document(s).
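
Because a saved index is only meaningful with the embedding model that produced it, one option is to record the model name next to the index and check it on load. The sketch below is an assumption, not something the notebook does: the `manifest.json` file name and both helper functions are hypothetical.

```python
import json
import os
import tempfile

MANIFEST_NAME = "manifest.json"  # hypothetical file stored alongside the index


def write_manifest(index_dir: str, embedding_model: str) -> None:
    """Record which embedding model built the index in index_dir."""
    os.makedirs(index_dir, exist_ok=True)
    with open(os.path.join(index_dir, MANIFEST_NAME), "w", encoding="utf-8") as fh:
        json.dump({"embedding_model": embedding_model}, fh)


def check_manifest(index_dir: str, embedding_model: str) -> None:
    """Raise ValueError if the requested model differs from the recorded one."""
    with open(os.path.join(index_dir, MANIFEST_NAME), encoding="utf-8") as fh:
        recorded = json.load(fh)["embedding_model"]
    if recorded != embedding_model:
        raise ValueError(
            f"Index was built with '{recorded}', not '{embedding_model}'."
        )


with tempfile.TemporaryDirectory() as tmp:
    write_manifest(tmp, "BAAI/bge-small-en-v1.5")
    check_manifest(tmp, "BAAI/bge-small-en-v1.5")  # matching model: no error
    try:
        check_manifest(tmp, "all-MiniLM-L6-v2")
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

A check like this matters most when two models happen to produce vectors of the same size, since a mismatch would then not surface as a dimension error and the search results would simply be poor.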


In [42]:
# Re-run against the store reloaded from disk to test its duplicate-spotting capabilities
main_documents_directory = "documents"
file_paths = scan_directory(main_documents_directory)
# loaded_vstore is the FAISS store: load_docstore_from_dir() returned (store, docs), unpacked above
vstore = path_upload_document_to_vectorstore(file_paths, loaded_vstore, index_dir="vectorstore", dedup_mode="content")

Loaded 15 document(s) from documents/c1cs15013h.pdf.
No new chunks/pages to add from documents/c1cs15013h.pdf (all duplicates).
Loaded 13 document(s) from documents/c5cs00105f.pdf.
No new chunks/pages to add from documents/c5cs00105f.pdf (all duplicates).
Loaded 65 document(s) from documents/photochemical-approaches-to-complex-chemotypes-applications-in-natural-product-synthesis.pdf.
No new chunks/pages to add from documents/photochemical-approaches-to-complex-chemotypes-applications-in-natural-product-synthesis.pdf (all duplicates).
Loaded 17 document(s) from documents/c9np00039a.pdf.
No new chunks/pages to add from documents/c9np00039a.pdf (all duplicates).
Loaded 42 document(s) from documents/c5ob00169b.pdf.
No new chunks/pages to add from documents/c5ob00169b.pdf (all duplicates).
Loaded 7 document(s) from documents/d2qo00043a.pdf.
No new chunks/pages to add from documents/d2qo00043a.pdf (all duplicates).
Loaded 48 document(s) from documents/natural-product-synthesis-using-multicom

<h4>Querying the docstore</h4>

In [3]:
from docstore_functions import *

In [None]:
# Load the store (the embedding model must match the one used to build it)
store, document_list = load_docstore_from_dir()

query = "Is the diels-alder reaction used in natural product synthesis?"
k = 5

# Retrieve top-k chunks
docs = store.similarity_search(query, k=k)

print(f"Query: {query}\nTop {k} results:")
for i, d in enumerate(docs, start=1):
    print("\n" + "="*80)
    print(f"Result {i}")
    print("Source:", d.metadata.get("source"))
    print("Page:", d.metadata.get("page"))
    print("Hash:", d.metadata.get("content_hash"))
    print("-"*80)
    print(d.page_content[:800])

# Asked a question about the Diels-Alder reaction. It returned the document related to that...

`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


Loaded FAISS store from 'vectorstore' with 549 document(s).


Loading weights: 100%|██████████| 199/199 [00:00<00:00, 2347.05it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: BAAI/bge-small-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Query: Is the diels-alder reaction used in natural product synthesis?
Top 5 results:

Result 1
Source: documents/recent-advances-in-natural-product-synthesis-by-using-intramolecular-diels-alder-reactions.pdf
Page: 0
Hash: 33cc72e8d46b5c83716776c0d89f2f0b447fc02b683d379d5451a1b301a3f811
--------------------------------------------------------------------------------
Recent Advances in Natural Product Synthesis by Using Intramolecular
Diels−Alder Reactions†
Ken-ichi Takao, Ryosuke Munakata, and Kin-ichi Tadano*
Department of Applied Chemistry, Keio University, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
Received March 1, 2005
Contents
1. Introduction 4779
2. Terpenoids 4779
2.1. Sesquiterpenoids 4779
2.2. Diterpenoids 4782
2.3. Sesterterpenoids 4786
2.4. Steroids 4787
3. Alkaloids 4789
3.1. Amaryllidaceae Alkaloids 4789
3.2.
Stemona Alkaloids 4790
3.3. Tropolone Alkaloids 4791
3.4. Quinolizidine Alkaloids 4792
3.5. Indole Alkaloids 4793
3.6. Other Alkaloids 4794
4. Polyketides 4796
4.1.
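
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The toy below shows that step by brute force over hand-written 2-D vectors, assuming Euclidean (L2) distance as the measure of closeness. The vectors and labels are made up for illustration; a real index works on the model's high-dimensional embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (label, distance) pairs closest to the query, nearest first."""
    scored = [(label, l2_distance(query, vec)) for label, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Made-up 2-D "embeddings" for three documents.
toy_index = {
    "diels-alder review": (1.0, 0.0),
    "photochemistry review": (0.0, 1.0),
    "multicomponent reactions": (-1.0, 0.0),
}
results = top_k((0.9, 0.1), toy_index, k=2)
```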

In [None]:
# Testing with an LLM:

# Load the Groq API key:
from pathlib import Path
# Bespoke API key loader functions:
from load_keys import *
import os

p = Path("keys/groq.json")
groq_api_key = load_groq_key(p)
os.environ['GROQ_API_KEY'] = groq_api_key

from langchain_groq import ChatGroq

llm = ChatGroq(
    model="openai/gpt-oss-120b",
    temperature=0.2,
    api_key=os.environ["GROQ_API_KEY"],
)

def ask_FAISS_with_LLM(store, question: str, k: int = 6):
    docs = store.similarity_search(question, k=k)

    context = "\n\n".join(
        [f"[{i+1}] Source={d.metadata.get('source','')} Page={d.metadata.get('page', '')}\n{d.page_content}"
         for i, d in enumerate(docs)]
    )

    prompt = f"""Use ONLY the context to answer the question.
    If the answer is not in the context, say "I don't know based on the provided documents."

    Context:
    {context}

    Question: {question}
    Answer:"""

    resp = llm.invoke(prompt)
    return resp.content, docs

question = "Is the diels-alder reaction used in natural product synthesis?"
answer, retrieved = ask_FAISS_with_LLM(store, question, k=5)
print(answer)

Yes. The Diels‑Alder reaction (both intermolecular and intramolecular versions) is widely employed as a key step for constructing the complex carbocyclic frameworks of natural products. The cited reviews describe numerous total syntheses of terpenoids, alkaloids, polyketides and other natural products that rely on Diels‑Alder cycloadditions to generate multiple stereogenic centers in a single operation.
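
The context string in `ask_FAISS_with_LLM` grows with `k` and with page length, so at some point it may exceed what the model accepts. A possible guard, sketched here with a hypothetical `Chunk` dataclass standing in for a LangChain `Document` and an arbitrary character budget, is to stop adding chunks once the budget is used up:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Minimal stand-in for a retrieved document (hypothetical)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def build_context(chunks: List[Chunk], max_chars: int = 6000) -> str:
    """Number the chunks as in the notebook's prompt, stopping at max_chars."""
    parts: List[str] = []
    used = 0
    for i, c in enumerate(chunks, start=1):
        header = f"[{i}] Source={c.metadata.get('source', '')} Page={c.metadata.get('page', '')}"
        block = f"{header}\n{c.page_content}"
        if parts and used + len(block) > max_chars:
            break  # budget exhausted; the first chunk is always kept
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


chunks = [
    Chunk("a" * 50, {"source": "one.pdf", "page": 0}),
    Chunk("b" * 50, {"source": "two.pdf", "page": 3}),
]
full = build_context(chunks)
trimmed = build_context(chunks, max_chars=80)
```

Because retrieval returns the closest chunks first, cutting from the end drops the least relevant material.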


In [None]:
# Seems to work with an LLM

# Let's ask another question:
question = "What are some popular transformations used in natural product synthesis?"
answer, retrieved = ask_FAISS_with_LLM(store, question, k=5)
print(answer)

Several transformations are repeatedly highlighted as especially useful in the synthesis of natural products:

* **Enantioselective carbonyl allylation** – used on solid‑phase resin to give high‑yielding, highly enantioenriched lactone intermediates (see the solid‑phase allylation of an immobilised aldehyde with B‑allyl(diisopinocamphenyl)‑borane)【1†L14-L22】.  
* **Baeyer–Villiger oxidation** – a classic oxygen‑insertion reaction that converts cyclic ketones into lactones and is cited as a typical “molecular‑editing” step【4†L23-L26】.  
* **Ciamician–Dennstedt rearrangement** – a carbon‑insertion reaction that transforms pyrroles into 3‑chloropyridines, also mentioned as a molecular‑editing transformation【4†L27-L30】.  
* **Sonogashira Pd/Cu‑catalysed coupling** – employed to forge sp‑sp² C–C bonds in convergent syntheses of enynes and enediynes, e.g. in the total syntheses of calicheamicin γ₁ and dynemicin A【3†L9-L18】【3†L22-L27】.  
* **Diels–Alder reactions (including trans‑annular Diel