# **Vector Search Fundamentals**

We are now transitioning from generating embeddings (which we will cover more deeply in M3) to **managing and searching** them at scale. This is the backbone of RAG and modern retrieval systems.

We will focus on the fundamental geometry of vector spaces and how to traverse them efficiently using the **FAISS** library.

Here is the breakdown for today's session.

### Phase 1: Topic Breakdown

```text
L17: Vector Search Fundamentals
├── Concept 1: Vector Spaces & High-Dimensional Data
│   ├── Embeddings as coordinates
│   ├── The Matrix (N vectors x D dimensions)
│   ├── Simple Explanation: Finding points in a massive hyper-cube
│   └── Task: Generate synthetic vector dataset (NumPy)
│
├── Concept 2: Distance Metrics
│   ├── Euclidean Distance (L2) - Physical distance
│   ├── Cosine Similarity - Angular distance
│   ├── Intuition: Magnitude vs. Orientation
│   └── Task: Manual calculation of metrics using NumPy
│
├── Concept 3: ANN Theory (Approximate Nearest Neighbors)
│   ├── The Scaling Problem (O(N) Complexity)
│   ├── IVF (Inverted File Index) - Partitioning/Clustering
│   ├── HNSW (Hierarchical Navigable Small World) - Graph Traversal
│   └── Task: Conceptual Check (Trade-offs)
│
├── Concept 4: FAISS Basics (The Tool)
│   ├── What is FAISS? (Facebook AI Similarity Search)
│   ├── The Index Object (IndexFlatL2)
│   ├── The Workflow: Index -> Add -> Search
│   └── Task: Implement "Brute Force" search in FAISS
│
└── Mini-Project: The Search Benchmark
    ├── Setup Large Dataset
    ├── Implement IndexFlatL2 (Exact)
    ├── Implement IndexIVFFlat (Approximate)
    ├── Train the Index (Clustering)
    └── Compare speed (latency) and recall

```

---


## **Concept 1: Vector Spaces & High-Dimensional Data**

### Intuition

In traditional databases, we search for exact matches (e.g., `SELECT * FROM users WHERE name = 'Alice'`). In AI, however, we often want to search by *meaning*. To do this, we convert complex data like text or images into lists of numbers called **vectors** (or embeddings).

Imagine a 2D graph. A point at `(2, 3)` is a vector. Now, imagine a graph with 128, 768, or even 1536 axes. This is a **high-dimensional vector space**. Every piece of data becomes a single point in this space. Data that is semantically similar (e.g., "Dog" and "Puppy") will be located physically close to each other in this space.

### Mechanics

We represent this data mathematically as a Matrix $X$ of size $N \times D$ :
   * **$N$**: The number of samples (vectors) in your database.
   * **$D$**: The dimensionality of each vector (determined by the embedding model; e.g., BERT-base produces 768-dimensional vectors).

In Python/NumPy, this is simply a 2D array. Crucially, most vector search libraries (including FAISS) are built around `float32` data. NumPy creates `float64` arrays by default, which doubles the memory footprint and, depending on the library version, may raise a type error or force a conversion when the array is passed to an index.
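
As a rough illustration of the memory point, the sketch below builds the same matrix in both precisions and compares their footprints. The sizes and variable names (`x64`, `x32`) are chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 10_000, 128                 # N vectors, D dimensions (illustrative sizes)
x64 = rng.random((n, d))           # NumPy produces float64 unless told otherwise
x32 = x64.astype(np.float32)       # cast once, before handing data to a search library

print(x64.dtype, x64.nbytes / 1e6, "MB")   # float64 10.24 MB
print(x32.dtype, x32.nbytes / 1e6, "MB")   # float32 5.12 MB
```

The cast halves the storage per value (8 bytes down to 4), which matters once $N$ reaches the millions.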

### Simpler Explanation

Imagine a massive library. Instead of organizing books by genre or title, we give every book a precise GPS coordinate in a multi-dimensional universe. If you want a book about "Cooking," you don't look up the word; you go to the "Cooking" coordinates, and you'll find all the relevant books clustered right there.

### Trade-offs
   * **Pros:** Allows semantic search (matching meaning, not just keywords).
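   * **Cons:** "The Curse of Dimensionality." As $D$ increases, the amount of data needed to generalize increases, distances between points become less discriminative (the nearest and farthest neighbors look increasingly alike), and every distance calculation costs $O(D)$ operations.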
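
One way to see the curse of dimensionality is to measure how much farther the farthest point is than the nearest one as $D$ grows. The following is a small sketch on uniform random data; the helper name `relative_contrast` and the sizes are assumptions for illustration, and real embeddings will behave less extremely than random points.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(farthest - nearest) / nearest distance from one random query to n random points."""
    points = rng.random((n, d), dtype=np.float32)
    query = rng.random(d, dtype=np.float32)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast = {d: relative_contrast(d) for d in (2, 16, 128, 1024)}
for d, c in contrast.items():
    print(f"d={d:5d}  relative contrast={c:.2f}")
```

The contrast shrinks sharply as `d` grows, which is why "nearest" becomes a weaker signal in very high dimensions.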

---

### Your Task

You need to simulate a dataset of embeddings to prepare for our search algorithms.

**Specifications:**
   1. Create a Python script using `numpy`.
   2. Define two constants: `nb` (number of database vectors) = 10,000 and `d` (dimension) = 128.
   3. Generate a matrix `xb` of shape `(nb, d)` filled with random numbers.
   4. **Critical:** Ensure the data type of the matrix is explicitly `float32`.
   5. Set a random seed so our results are reproducible.
   6. Print the shape and data type of `xb` to verify.


In [2]:
import numpy as np

np.random.seed(42)
nb = 10000
d = 128

xb = np.random.rand(nb, d).astype(np.float32)
print(xb.shape, xb.dtype)

(10000, 128) float32

## **Concept 2: Distance Metrics**

### Intuition

Once we have points in space, we need a ruler to measure how close they are. In AI, "closeness" implies similarity. If the distance between the "User Query" vector and a "Document" vector is small, that document is likely relevant.

### Mechanics

There are two primary ways to measure this in high-dimensional spaces:

1. **Euclidean Distance (L2):**
   * The straight-line distance between two points.
   * Formula: $d(x, y) = \sqrt{\sum_{i=1}^{D} (x_i - y_i)^2}$
   * **Behavior:** Sensitive to the magnitude (length) of vectors. For example, $[1, 1]$ and $[10, 10]$ are far apart under L2 even though they point in the same direction.

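2. **Inner Product (Dot Product) / Cosine Similarity:**
   * The inner product grows with both the alignment (angle) and the lengths of the two vectors; cosine similarity keeps only the angle.
   * Formula (Dot Product): $IP(x, y) = \sum_{i=1}^{D} x_i \cdot y_i$
   * Formula (Cosine Similarity): $\cos(x, y) = \frac{IP(x, y)}{\|x\| \, \|y\|}$
   * **Behavior:** If vectors are normalized (length = 1), Inner Product *is* Cosine Similarity. It then focuses on "direction" rather than "magnitude."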
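
To make the magnitude-versus-direction contrast concrete, here is a minimal sketch using two vectors that point the same way but differ in length. The helper names `l2_distance` and `cosine_similarity` are made up for this example.

```python
import numpy as np

def l2_distance(x, y):
    """Straight-line (Euclidean) distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    """Inner product divided by the product of the vector lengths."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 1.0], dtype=np.float32)
y = np.array([10.0, 10.0], dtype=np.float32)   # same direction, 10x the length

print("L2 distance:      ", l2_distance(x, y))        # large: the lengths differ
print("Cosine similarity:", cosine_similarity(x, y))  # about 1.0: the directions match

# After scaling both vectors to unit length, the plain inner product
# gives the same value as cosine similarity.
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)
print("IP of unit vectors:", np.dot(x_unit, y_unit))
```

This is why normalizing embeddings up front lets an inner-product index act as a cosine-similarity index.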



### Simpler Explanation
   * **Euclidean:** How much physical energy does it take to walk from point A to point B?
   * **Cosine:** Are we pointing at the same star? (It doesn't matter if you are standing on Earth or Mars; if you both point at the sun, the angle difference is low).

### Trade-offs
   * **L2:** Good when magnitude matters (e.g., pixel intensity in images).
   * **Inner Product:** Usually preferred for text/semantic search (NLP), typically on normalized vectors so that it behaves as cosine similarity and vector length (which can vary with document length) does not penalize the match. **FAISS** is highly optimized for both L2 and Inner Product.

---

### Your Task

You will implement these metrics manually using NumPy to understand the linear algebra behind the scenes.

**Specifications:**
   1. Generate a **query matrix** `xq` containing **5 vectors** of dimension `d` (use `np.random`).
   2. Ensure `xq` is `float32`.
   3. **Task A (L2 Distance):** Calculate the squared Euclidean distance between the **first vector** of `xq` and the **first vector** of `xb`. Use the formula: `sum((x - y) ** 2)`.
   4. **Task B (Dot Product):** Calculate the dot product between the **first vector** of `xq` and the **first vector** of `xb`.
   5. Print both results.

**Constraints:**
   * Do not use `scipy` or `sklearn`. Use standard NumPy math.
   * Do not perform matrix multiplication for the whole dataset yet; just do 1-to-1 vector comparison for this step.

In [3]:
import numpy as np

xq = np.random.rand(5, d).astype(np.float32)

x = xb[0]
y = xq[0]

dist = np.sum((x - y) ** 2)
dot_p = np.dot(x, y)

print(f"Distance = {dist}\nDot Product = {dot_p}")

Distance = 22.061784744262695
Dot Product = 29.73244857788086


The output shows a non-zero squared distance ($22.06$), confirming that `xq[0]` and `xb[0]` are distinct points in the space.

---

## **Concept 3: ANN Theory (Approximate Nearest Neighbors)**

### Intuition

In your previous task, you compared 1 query vector against 1 database vector. To find the "nearest neighbor" for a query, you'd have to compare it against **all** 10,000 vectors in `xb`.

If $N = 10{,}000$, that's fine.
If $N = 1$ billion (web scale), computing a billion distances for *every single user query* is far too slow for real-time use.

**ANN (Approximate Nearest Neighbors)** algorithms solve this by trading a tiny bit of accuracy (Recall) for massive speed (Latency). We accept that we might find the 2nd or 3rd closest match instead of the absolute 1st, in exchange for searching only a fraction of the data.
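
The exhaustive scan described above can be sketched in plain NumPy. This is only an illustration of the $O(N)$ work per query, not how FAISS implements it; the function name `brute_force_search` is hypothetical, and the arrays are regenerated here so the block runs on its own.

```python
import numpy as np

def brute_force_search(xb, xq, k):
    """Exact k-NN by squared L2: every query is compared to every database vector."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all pairs at once
    q_norms = np.sum(xq ** 2, axis=1, keepdims=True)   # shape (nq, 1)
    b_norms = np.sum(xb ** 2, axis=1)                  # shape (nb,)
    dists = q_norms - 2.0 * xq @ xb.T + b_norms        # shape (nq, nb)
    ids = np.argsort(dists, axis=1)[:, :k]             # k smallest per query
    return np.take_along_axis(dists, ids, axis=1), ids

rng = np.random.default_rng(42)
xb = rng.random((10_000, 128), dtype=np.float32)   # database vectors
xq = rng.random((5, 128), dtype=np.float32)        # query vectors

distances, ids = brute_force_search(xb, xq, k=3)
print(ids.shape)   # (5, 3): three neighbor ids for each of the five queries
```

Every query touches all `nb` rows, so the cost grows linearly with the database size.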

### Mechanics: Two Main Approaches

1. **IVF (Inverted File Index): Partitioning**
   * We divide the vector space into clusters (cells).
   * We compute the "centroid" (center point) of each cluster.
   * When a query comes in, we measure the distance to the *centroids*.
   * We identify the closest centroid (or the few closest, a tunable number) and search **only** the vectors inside those cells.
   * **Analogy:** Instead of checking every book in the library, check the catalog to find the "Cooking" section, then search only that shelf.


2. **HNSW (Hierarchical Navigable Small World): Graph Traversal**
   * Vectors are nodes in a graph connected to their neighbors.
   * It builds a multi-layer graph (like a highway system).
   * Top layers have long connections (fast travel across the map). Bottom layers have dense connections (local precision).
   * **Analogy:** Playing "Six Degrees of Kevin Bacon" to find a path from A to B.
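
The IVF steps above can be sketched as a toy implementation in NumPy. This is a simplified illustration under stated assumptions (a few rounds of plain k-means, one query at a time), not FAISS's actual code; the function names and the parameters `nlist` (number of cells) and `nprobe` (cells searched per query) are chosen for this sketch.

```python
import numpy as np

def sq_dists(a, b):
    """Squared L2 distances between every row of a and every row of b."""
    return np.sum(a ** 2, axis=1, keepdims=True) - 2.0 * a @ b.T + np.sum(b ** 2, axis=1)

def train_ivf(xb, nlist, n_iter=10, seed=0):
    """A few rounds of plain k-means; returns the centroids and each vector's cell id."""
    rng = np.random.default_rng(seed)
    centroids = xb[rng.choice(len(xb), nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmin(sq_dists(xb, centroids), axis=1)
        for c in range(nlist):
            members = xb[assign == c]
            if len(members) > 0:               # an empty cell keeps its old centroid
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(sq_dists(xb, centroids), axis=1)
    return centroids, assign

def ivf_search(q, xb, centroids, assign, nprobe, k):
    """Scan only the vectors in the nprobe cells whose centroids are closest to q."""
    cells = np.argsort(sq_dists(q[None, :], centroids)[0])[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, cells))
    d = np.sum((xb[candidates] - q) ** 2, axis=1)
    order = np.argsort(d)[:k]
    return candidates[order], len(candidates)

rng = np.random.default_rng(42)
xb = rng.random((5_000, 32), dtype=np.float32)
q = rng.random(32, dtype=np.float32)

centroids, assign = train_ivf(xb, nlist=50)
exact = np.argsort(np.sum((xb - q) ** 2, axis=1))[:5]      # ground truth by full scan
for nprobe in (1, 5, 50):
    ids, scanned = ivf_search(q, xb, centroids, assign, nprobe, k=5)
    overlap = len(set(ids.tolist()) & set(exact.tolist()))
    print(f"nprobe={nprobe:2d}  scanned={scanned:4d}  overlap with exact top-5: {overlap}/5")
```

Probing more cells scans more vectors and recovers more of the true neighbors; probing all 50 cells degenerates into the exact full scan.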



### Simpler Explanation

* **Brute Force (Flat):** Check every single straw in the haystack to find the needle. (100% accurate, slow.)
* **ANN (IVF/HNSW):** Use a metal detector to find the general area, then look there. (Usually close to, but not exactly, 100% accurate; much faster.)

### Trade-offs

* **Brute Force:** Recall = 100%. Latency = High. Memory = Low (raw vectors only, no extra index structures).
* **IVF:** Recall = Tunable (based on how many cells you probe). Latency = Low. Requires a "training" step (clustering) before vectors can be added.
* **HNSW:** Recall = High. Latency = Very Low. Memory = High (needs to store graph edges).
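
Recall in these trade-offs is usually measured by comparing an approximate index's results against the exact (brute-force) results for the same queries. A minimal sketch follows, with a hypothetical helper `recall_at_k` and hand-written id arrays standing in for real search output.

```python
import numpy as np

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k neighbors that the approximate search also returned."""
    hits = [len(set(e.tolist()) & set(a.tolist())) for e, a in zip(exact_ids, approx_ids)]
    return sum(hits) / (len(exact_ids) * len(exact_ids[0]))

# Two queries, k = 3. The approximate search misses one true neighbor (id 9) of query 0.
exact_ids = np.array([[4, 9, 1], [7, 2, 5]])
approx_ids = np.array([[4, 1, 8], [7, 2, 5]])

recall = recall_at_k(exact_ids, approx_ids)
print(f"recall@3 = {recall:.3f}")   # 5 of 6 true neighbors found
```

Order within the top-k is ignored here; only membership counts.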

---

### Your Task

This is a conceptual check to ensure you understand the architecture choices before we code.

**Scenario:**

1. **System A:** A FaceID unlock system for a secure building. The database has only 500 employees. Accuracy must be perfect.
2. **System B:** An e-commerce recommendation engine ("Similar products"). The database has 50 million items. Speed is critical; if the user waits >200ms, they leave.

**Question:** Which approach (Brute Force vs. ANN) would you choose for System A and System B?
