# 🗃️ Week 07-08 · Notebook 06: Embeddings & Vector Stores

**Objective:** Understand the core concepts of embeddings and vector stores, and learn how to use them to build the foundation of a retrieval system.

In the previous notebook, we built a pipeline to load, transform, and enrich documents. But how does a RAG system find the *right* document to answer a question? The answer lies in two key technologies: **Embeddings** and **Vector Stores**.

1.  **Embeddings:** These are numerical representations (vectors) of text. An embedding model converts a piece of text (like a question or a document chunk) into a high-dimensional vector. The key idea is that semantically similar pieces of text produce vectors that lie close to each other in the vector space.
2.  **Vector Stores (or Vector Databases):** These are specialized databases designed to store and efficiently search through millions of vectors. When a user asks a question, we first create an embedding of the question and then use the vector store to find the document vectors that are "closest" to the question vector.
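To make "close to each other" concrete, here is a minimal sketch using cosine similarity, which is one common way to measure closeness between vectors. The three-dimensional vectors are hand-made for illustration only; they do not come from a real embedding model, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (illustrative values, not model output)
toy_vectors = {
    "hydraulic press pressure limit": [0.9, 0.1, 0.0],
    "maximum PSI for the press": [0.8, 0.2, 0.1],
    "cafeteria lunch menu": [0.0, 0.1, 0.9],
}

query_text = "hydraulic press pressure limit"
query_vec = toy_vectors[query_text]

# Higher score = more similar to the query
for text, vec in toy_vectors.items():
    print(f"{cosine_similarity(query_vec, vec):.3f}  {text}")
```

The two press-related texts score close to 1, while the unrelated text scores close to 0. A vector store performs this kind of comparison at scale, using index structures so it does not have to scan every vector.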

In this notebook, we will take our sanitized documents, convert them into embeddings, and store them in a vector database. We will explore two popular in-memory vector stores: **FAISS** and **Chroma**.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

1.  **Understand Text Splitting:** Learn why and how to split large documents into smaller, more effective chunks for retrieval.
2.  **Generate Embeddings:** Use an embedding model from Hugging Face to convert text chunks into numerical vectors.
3.  **Index Documents in a Vector Store:** Store the document embeddings in two different in-memory vector stores: `FAISS` and `Chroma`.
4.  **Perform Similarity Search:** Use the vector stores to find the most relevant document chunks for a given user query.
5.  **Compare Vector Stores:** Understand the basic differences in setup and usage between FAISS and Chroma.

## 🧩 Scenario: Choosing the Right Vector Store for Manufacturing RAG

You are a data engineer supporting two manufacturing plants—one in Pune, India, and one in Monterrey, Mexico. Both plants want to deploy a RAG-based maintenance assistant, but they have different infrastructure preferences:

- **Pune** prefers an on-premises solution for data sovereignty and cost control (e.g., FAISS).
- **Monterrey** prefers a managed cloud solution for scalability and ease of maintenance (e.g., pgvector on Cloud SQL).

Leadership has asked you to run a fair, data-driven benchmark comparing the available vector store options. Your goal is to:
- Ingest the same set of maintenance documents into each store.
- Measure ingestion time, query latency, and operational complexity.
- Log all results for future audits.
- Make a recommendation based on the results.

## 1. Environment Setup

First, let's install the necessary libraries. We will need:
-   `langchain` and `langchain-community` for the core framework.
-   `sentence-transformers` to pull down the embedding model from Hugging Face.
-   `faiss-cpu` for the FAISS vector store. FAISS is a library from Facebook AI for efficient similarity search. `cpu` is specified for compatibility; a `gpu` version also exists.
-   `chromadb` for the Chroma vector store.

> ⚠️ **Kernel Restart**: After running the installation cell below, you may need to restart the kernel for the changes to take effect.

In [None]:
%pip install -qU langchain langchain-community sentence-transformers faiss-cpu chromadb

### Preparing the Documents

For this notebook, we'll create a sample list of `Document` objects, simulating the output from our previous ingestion pipeline. Each document contains `page_content` and `metadata`. Notice that some documents have long content that will need to be split.

In [None]:
from datetime import datetime, timezone

from langchain_core.documents import Document

# One timezone-aware UTC timestamp shared by all sample documents
# (datetime.utcnow() is deprecated as of Python 3.12)
ingested_at = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")

# Simulate the documents from the previous notebook
final_docs = [
    Document(
        page_content="Standard Operating Procedure: Hydraulic Press H-45. 1. Ensure all safety guards are in place before operation. 2. Perform daily maintenance checks as per the log. 3. The hydraulic fluid must be checked weekly. Any leaks should be reported immediately. The operating pressure should not exceed 5000 PSI.",
        metadata={
            "source": "sop_manuals/press_safety.pdf",
            "plant": "PNQ",
            "doc_type": "SOP",
            "ingested_at": ingested_at,
        },
    ),
    Document(
        page_content="ticket_id: TICKET-002\ntimestamp_utc: 2025-04-17T10:00:00Z\nissue_description: Pressure sensor fault\ntechnician_notes: Recalibrated sensor. Work completed by Technician [REDACTED].",
        metadata={
            "source": "TICKET-002",
            "plant": "PNQ",
            "doc_type": "MaintenanceLog",
            "ingested_at": ingested_at,
        },
    ),
    Document(
        page_content="ticket_id: TICKET-003\ntimestamp_utc: 2025-10-17T12:30:00Z\nissue_description: Emergency stop button stuck\ntechnician_notes: Replaced the button assembly. Work completed by Technician [REDACTED].",
        metadata={
            "source": "TICKET-003",
            "plant": "PNQ",
            "doc_type": "MaintenanceLog",
            "ingested_at": ingested_at,
        },
    ),
]

print(f"Prepared {len(final_docs)} sample documents.")

## 2. Splitting Documents into Chunks

LLMs have a limited context window, and retrieval is more effective when it points to a specific, relevant piece of information rather than a long, noisy document. Therefore, a critical step in the ingestion process is **splitting** large documents into smaller chunks.

LangChain provides various `TextSplitter` classes. A popular choice is the `RecursiveCharacterTextSplitter`, which tries to split text based on a prioritized list of characters (e.g., `\n\n`, `\n`, ` `) to keep related pieces of text together.

Key parameters for a text splitter:
-   `chunk_size`: The maximum size of a chunk (in characters).
-   `chunk_overlap`: The number of characters to overlap between adjacent chunks. This helps maintain context across the split.
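To see how these two parameters interact, here is a minimal sliding-window sketch. It is a deliberate simplification: `split_with_overlap` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to cut at separators, so its chunks will look different.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "The hydraulic fluid must be checked weekly. Any leaks should be reported immediately."
chunks = split_with_overlap(sample, chunk_size=40, chunk_overlap=10)

for i, chunk in enumerate(chunks, start=1):
    print(f"{i}: {chunk!r}")
```

Each chunk begins with the last 10 characters of the previous one, so a sentence that straddles a boundary still appears with some of its surrounding context in at least one chunk.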

We will split our documents into chunks of 200 characters with an overlap of 20 characters.

In [None]:
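# Text splitters live in the standalone `langchain-text-splitters` package,
# which is normally installed alongside the LangChain packages above
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,  # measure chunk size in characters
)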

# Split the documents into chunks
doc_chunks = text_splitter.split_documents(final_docs)

print(f"Original number of documents: {len(final_docs)}")
print(f"Number of document chunks after splitting: {len(doc_chunks)}")

print("\n--- Example of a Document Chunk ---")
# The long SOP document was split into multiple chunks
for i, chunk in enumerate(doc_chunks):
    if chunk.metadata['doc_type'] == 'SOP':
        print(f"Chunk {i+1} (from SOP):")
        print(f"Content: \"{chunk.page_content}\"")
        print(f"Metadata: {chunk.metadata}\n")

## 3. Generating Embeddings

Now that we have our text chunks, we need to convert them into vectors. We will use an `Embedding` model for this. LangChain integrates with many embedding providers (like OpenAI, Cohere, and Google), but for this example, we'll use a popular open-source model from Hugging Face.

The `HuggingFaceEmbeddings` class makes it easy to use models from the `sentence-transformers` library. We will use `all-MiniLM-L6-v2`, which is a small but effective model, perfect for getting started. When you initialize this class, it will download the model from Hugging Face Hub (if you don't have it cached).

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize the embedding model
# This will download the model from Hugging Face Hub on first run
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Let's test it on a single piece of text
sample_text = "What is the operating pressure for the hydraulic press?"
sample_embedding = embedding_model.embed_query(sample_text)

print("Successfully created an embedding for the sample text.")
print(f"Embedding dimension: {len(sample_embedding)}")
print(f"First 5 values of the vector: {sample_embedding[:5]}")

## 4. Indexing and Searching with Vector Stores

With our document chunks and embedding model ready, we can now populate our vector stores. The general process is:

1.  **Choose the Vector Store Class:** Pick the integration you want to use (e.g., `FAISS` or `Chroma`).
2.  **Provide Documents and Embeddings:** Pass the list of document chunks and the embedding model to that class's `from_documents` class method, which builds and returns a populated store. The vector store handles the rest:
    -   It iterates through each document chunk.
    -   It uses the provided `embedding_model` to convert the `page_content` into a vector.
    -   It stores the vector along with the document's `page_content` and `metadata`.
3.  **Perform a Search:** Use the `similarity_search` method to find relevant documents for a new query.
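The steps above can be sketched with a toy, pure-Python store. Everything here is an assumption made for illustration: `ToyVectorStore` and `toy_embed` are hypothetical names, the "embedding" is just a bag of word counts, and documents are plain dictionaries. Real stores such as FAISS and Chroma use proper embeddings and optimized indexes.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (vector, page_content, metadata)

    @classmethod
    def from_documents(cls, documents, embed):
        store = cls(embed)
        for doc in documents:  # embed each chunk and keep it with its metadata
            vector = embed(doc["page_content"])
            store.entries.append((vector, doc["page_content"], doc["metadata"]))
        return store

    def similarity_search(self, query, k=2):
        query_vec = self.embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[0]), reverse=True)
        return [{"page_content": c, "metadata": m} for _, c, m in ranked[:k]]


docs = [
    {"page_content": "The operating pressure should not exceed 5000 PSI.", "metadata": {"doc_type": "SOP"}},
    {"page_content": "Replaced the button assembly.", "metadata": {"doc_type": "MaintenanceLog"}},
    {"page_content": "Recalibrated sensor.", "metadata": {"doc_type": "MaintenanceLog"}},
]

store = ToyVectorStore.from_documents(docs, toy_embed)
results = store.similarity_search("What is the maximum operating pressure?", k=1)
print(results[0]["page_content"])
```

The shape of the API (a `from_documents` class method followed by `similarity_search`) mirrors what we use with the LangChain vector store integrations in the cells that follow.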

### Using FAISS
FAISS (Facebook AI Similarity Search) is a highly optimized library for vector search. It's very fast and runs entirely in memory, making it a great choice for rapid prototyping and smaller-scale applications.

In [None]:
from langchain_community.vectorstores import FAISS

# 1. Create the FAISS vector store from our document chunks
print("Creating FAISS vector store...")
faiss_vector_store = FAISS.from_documents(doc_chunks, embedding_model)
print("FAISS vector store created successfully.")

# 2. Define a query
query = "What is the maximum operating pressure for the hydraulic press?"

# 3. Perform a similarity search
print(f"\nSearching for documents similar to: '{query}'")
retrieved_docs_faiss = faiss_vector_store.similarity_search(query, k=2) # k is the number of documents to retrieve

# 4. Display the results
print("\n--- Top 2 Retrieved Documents (FAISS) ---")
for i, doc in enumerate(retrieved_docs_faiss):
    print(f"Result {i+1}:")
    print(f"  Content: \"{doc.page_content}\"")
    print(f"  Metadata: {doc.metadata}\n")

### Using Chroma
Chroma is another popular open-source vector store. It is easy to use in-memory, and it can also persist the database to disk, which is useful for saving your index between sessions. In addition, it offers more advanced features such as metadata filtering, which we will explore in a later notebook.

The process is nearly identical to FAISS.
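Because both integrations expose `from_documents` and `similarity_search`, the index-then-query flow can be written once and reused. The helper below is a hypothetical convenience function, not part of LangChain; the commented usage assumes `chromadb` is installed and that `doc_chunks`, `embedding_model`, and `query` from the earlier cells are in scope.

```python
def build_and_search(store_cls, chunks, embedding, query, k=2):
    """Index `chunks` with any LangChain-style vector store class, then query it."""
    store = store_cls.from_documents(chunks, embedding)
    return store.similarity_search(query, k=k)


# Intended usage in this notebook (left as comments because it needs the
# earlier cells and the `chromadb` package):
#
#   from langchain_community.vectorstores import Chroma
#   retrieved_docs_chroma = build_and_search(Chroma, doc_chunks, embedding_model, query)
#   for i, doc in enumerate(retrieved_docs_chroma):
#       print(f"Result {i+1}: {doc.page_content}")
```

Passing `FAISS` instead of `Chroma` as `store_cls` would run the same flow against FAISS, which is a simple way to compare the two on identical inputs.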

## 5. Setup: Generating Example Documents and Embeddings

To ensure a fair benchmark, we'll generate a synthetic set of maintenance documents. We'll use a lightweight Hugging Face embedding model to convert these documents into vectors. This setup simulates a real-world RAG deployment, where each document chunk is embedded and stored for fast retrieval.

We'll use the same set of documents and embeddings for all vector stores to ensure a level playing field.

## 6. Benchmarking Vector Stores

We'll now benchmark two popular vector stores: **Chroma** and **FAISS**. For each store, we'll measure:
- Ingestion time (how long it takes to index all documents)
- Query latency (how fast it can retrieve relevant documents)

We'll use the same set of embeddings for both stores. All results will be logged to MLflow for future audits and comparison.
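One way to capture the two measurements is a small timing harness built on `time.perf_counter`. This is a sketch under stated assumptions: `benchmark_store` and `_EchoStore` are hypothetical names, and the placeholder store exists only so the cell runs without any vector database installed.

```python
import time
from statistics import mean


def benchmark_store(build_store, queries, k=2):
    """Time one ingestion call and each query against the resulting store."""
    if not queries:
        raise ValueError("queries must contain at least one query")

    start = time.perf_counter()
    store = build_store()  # e.g. a lambda that calls SomeStore.from_documents(...)
    ingestion_seconds = time.perf_counter() - start

    latencies_ms = []
    for q in queries:
        t0 = time.perf_counter()
        store.similarity_search(q, k=k)
        latencies_ms.append((time.perf_counter() - t0) * 1000)

    return {
        "ingestion_seconds": ingestion_seconds,
        "mean_query_latency_ms": mean(latencies_ms),
        "max_query_latency_ms": max(latencies_ms),
    }


class _EchoStore:
    """Placeholder store so the sketch runs on its own."""

    def similarity_search(self, query, k=2):
        return []


metrics = benchmark_store(lambda: _EchoStore(), ["pressure limit", "stuck button"])
print(metrics)
```

In the benchmark itself, `build_store` could be something like `lambda: FAISS.from_documents(doc_chunks, embedding_model)`, and the returned dictionary of floats has the shape that `mlflow.log_metrics` accepts. Running several queries, rather than one, gives a more stable latency figure.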

### Benchmarking Chroma

Chroma is a fast, open-source vector database that can run in-memory or persist to disk. It's easy to use and great for prototyping. We'll measure ingestion and query latency, and log the results to MLflow.

### Generate Synthetic Maintenance Documents

We'll create a set of 200 synthetic documents—half are SOPs, half are maintenance logs. Each document will have a unique `doc_id` in its metadata. This simulates a realistic RAG knowledge base for a manufacturing plant.
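A minimal sketch of such a generator follows. The `DOC-0000` ID format, the machine names, and the text templates are all invented for illustration, and the documents are plain dictionaries so the cell has no dependencies.

```python
def make_synthetic_docs(n_docs=200):
    """Build alternating SOP / maintenance-log records with unique doc_ids."""
    docs = []
    for i in range(n_docs):
        doc_id = f"DOC-{i:04d}"  # hypothetical ID scheme
        machine = f"M-{i % 10}"  # hypothetical machine name
        if i % 2 == 0:
            doc_type = "SOP"
            content = (
                f"Standard Operating Procedure {doc_id}: inspect machine {machine} "
                "before each shift and record findings."
            )
        else:
            doc_type = "MaintenanceLog"
            content = (
                f"ticket_id: {doc_id}\n"
                f"issue_description: Fault reported on machine {machine}\n"
                "technician_notes: Inspected and resolved."
            )
        docs.append({"page_content": content, "metadata": {"doc_id": doc_id, "doc_type": doc_type}})
    return docs


synthetic_docs = make_synthetic_docs()
print(f"Generated {len(synthetic_docs)} documents.")
print(synthetic_docs[0])
```

Each dictionary has the same two keys as a LangChain `Document`, so it can be converted with `Document(page_content=..., metadata=...)` before being passed to a vector store.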

### Generate Embeddings for Each Document

We'll use the `all-MiniLM-L6-v2` model from Hugging Face to convert each document into a vector. This model is fast and provides good semantic similarity for English text. The same embeddings will be used for all vector stores to ensure a fair comparison.

### Install Required Libraries

We'll need the following libraries:
- `langchain` and `langchain-community` for vector store integrations
- `sentence-transformers` for the embedding model
- `mlflow` for experiment tracking
- `faiss-cpu` for FAISS (if not already installed)

> ⚠️ If running on Colab or a fresh environment, uncomment and run the cell below.

In [None]:
# %pip install -qU langchain langchain-community sentence-transformers mlflow faiss-cpu

> ⚠️ pgvector benchmarking requires a running Postgres instance with the extension enabled. Refer to `infrastructure/pgvector_setup.sql`.

## 📊 Evaluation Matrix
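| Criterion | Chroma | FAISS | pgvector |
| --- | --- | --- | --- |
| Ingestion time | | | |
| Query latency | | | |
| Operational complexity | | | |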

## 🧪 Lab Assignment
1. Run the pgvector benchmark using the Cloud SQL dev instance and capture metrics.
2. Evaluate recall by scoring retrieved results against the labeled maintenance Q&A set.
3. Log benchmarks to MLflow (`mlflow.log_metrics` and `mlflow.log_dict`).
4. Draft a recommendation memo for the CIO summarizing the trade-offs.
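For the recall step, one possible starting point is recall@k: the fraction of the labeled relevant documents that appear in the top `k` retrieved results. The sketch below assumes each labeled question comes with a list of relevant `doc_id` values; the IDs shown are made up, and `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found among the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Made-up example: retrieved IDs in ranked order, plus the labeled relevant IDs
labeled_results = [
    {"retrieved": ["DOC-0001", "DOC-0007", "DOC-0003"], "relevant": ["DOC-0001", "DOC-0003"]},
    {"retrieved": ["DOC-0010", "DOC-0011", "DOC-0012"], "relevant": ["DOC-0012", "DOC-0020"]},
]

scores = [recall_at_k(r["retrieved"], r["relevant"], k=2) for r in labeled_results]
mean_recall = sum(scores) / len(scores)
print(f"Recall@2 per question: {scores}")
print(f"Mean recall@2: {mean_recall:.2f}")
```

Computing the same metric with the same `k` for every store keeps the comparison consistent with the latency measurements.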

## ✅ Checklist
- [ ] Benchmarks executed for all stores
- [ ] Metrics logged
- [ ] Recommendation memo drafted
- [ ] Governance evidence archived

## 📚 References
- LangChain Vectorstore Docs
- pgvector Extension Guide
- Week 05 Data Storage Policy