# Mini Vector Database

Many ML applications rely heavily on embeddings (numerical vector representations of the data).

Vector databases store embeddings and retrieve the ones most similar to a query embedding.

Example: Retrieval-Augmented Generation (RAG). The query is a question; the database finds the documents most similar to that question and returns them. The LLM then has the data it needs to answer the question.


This notebook implements two small but real vector databases:


### 1. **Brute-Force Vector Database**
- Stores vectors and metadata
- Uses cosine similarity to compare the query to **each** vector in the database (O(n) similarity computations per query)
- Simple and exact
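
The cosine similarity behind each comparison is the dot product of the two vectors divided by the product of their norms. As a worked instance, take the illustrative vectors $a = (0.6, 0, 0.4, 0)$ and $b = (1, 0, 0, 0)$:

$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{0.6}{\sqrt{0.6^2 + 0.4^2} \cdot 1} = \frac{0.6}{\sqrt{0.52}} \approx 0.832
$$

A value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and the length of either vector does not affect the score.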


### 2. **ANN Vector Database (Locality-Sensitive Hashing)**
- Uses random hyperplanes to hash vectors, grouping similar vectors into buckets
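- Searches only inside the bucket the query hashes to, so it compares against far fewer vectors than brute force
- Fast but approximate: a true nearest neighbor that lands in a different bucket is missed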
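
The hashing step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not necessarily the implementation used in this notebook: the names `lsh_key`, `planes`, `n_planes`, and the fixed seed are all hypothetical. Each random hyperplane contributes one bit, depending on which side of the plane the vector falls.

```python
import numpy as np

# Assumed sizes and seed, chosen only for illustration.
dim, n_planes = 4, 3
rng = np.random.default_rng(0)

# Each row is the normal vector of one random hyperplane through the origin.
planes = rng.normal(size=(n_planes, dim))

def lsh_key(vec, planes):
    """Return one 0/1 bit per hyperplane: 1 if vec lies on its non-negative side."""
    projections = planes @ np.asarray(vec, dtype=float)
    return tuple((projections >= 0).astype(int).tolist())

a = np.array([0.6, 0.0, 0.4, 0.0])
b = 3.0 * a  # same direction as a, different length

key_a = lsh_key(a, planes)
key_b = lsh_key(b, planes)      # identical to key_a: scaling does not change the signs
key_neg = lsh_key(-a, planes)   # opposite direction flips every bit
```

Vectors that point in similar directions tend to share most or all bits, so using the bit tuple as a dictionary key groups them into the same bucket. More hyperplanes give smaller buckets (faster search) at the cost of more missed neighbors.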

In [1]:
import numpy as np
from collections import defaultdict

In [2]:
class BruteForceVectorDB:
    def __init__(self):
        self.vectors = []
        self.data = []

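    def cos_sim(self, a, b):
        """Cosine similarity of a and b; defined as 0.0 if either is a zero vector."""
        a_norm = np.linalg.norm(a)
        b_norm = np.linalg.norm(b)
        # Guard against division by zero for all-zero vectors.
        if a_norm == 0 or b_norm == 0:
            return 0.0
        return np.dot(a, b) / (a_norm * b_norm)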

    def add(self, vec, meta):
        self.vectors.append(vec)
        self.data.append(meta)

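    def search(self, query_vec, k=5):
        """Return the k best (similarity, metadata) pairs, most similar first."""
        scores = []

        # Compare the query against every stored vector.
        for vec, meta in zip(self.vectors, self.data):
            sim = self.cos_sim(query_vec, vec)
            scores.append((sim, meta))

        # Sort by similarity only, highest first, then keep the top k.
        scores.sort(key=lambda x: x[0], reverse=True)
        return scores[:k]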


In [4]:
bfdb = BruteForceVectorDB()

# Add some example vectors
bfdb.add(np.array([1, 0, 0, 0]), "Dog")
bfdb.add(np.array([0, 1, 0, 0]), "Cat")
bfdb.add(np.array([0, 0.6, 0.4, 0]), "Lion")
bfdb.add(np.array([0, 0, 1, 0]), "Wild")

# Query vector, e.g. something like "Wolf"; "Dog" should be the top result
query = np.array([0.6, 0, 0.4, 0])

bfdb.search(query, k=2)

[(np.float64(0.8320502943378436), 'Dog'),
 (np.float64(0.5547001962252291), 'Wild')]
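
The per-vector loop in `search` can also be written as a single matrix operation, which is usually faster in NumPy for larger collections. The following is a minimal sketch of that idea, not part of the class above; the function name `top_k_cosine` and its return format of `(index, similarity)` pairs are assumptions made for illustration.

```python
import numpy as np

def top_k_cosine(matrix, query, k=5):
    """Return (row_index, cosine_similarity) for the k rows most similar to query."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)

    dots = matrix @ query  # one dot product per stored vector
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)

    # Zero vectors get similarity 0.0 instead of a division-by-zero warning.
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Same four vectors as the example above: Dog, Cat, Lion, Wild.
vectors = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0.6, 0.4, 0],
    [0, 0, 1, 0],
])
result = top_k_cosine(vectors, [0.6, 0, 0.4, 0], k=2)
```

Here `result` holds row indices 0 ("Dog") and 3 ("Wild"), matching the ranking returned by the loop-based search. The result is still exact; only the way the similarities are computed changes.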