# MiniVecDB: A Simple Vector Database in Python

MiniVecDB is a lightweight vector database implementation that demonstrates the fundamentals of vector search. It is built for educational purposes, to show how vector databases work under the hood.

## What's a Vector Database?

Vector databases store embeddings, numerical representations of content that capture meaning and relationships, and search them by similarity. They enable:

- **Semantic search**: Find content based on meaning, not just keywords
- **Recommendations**: Suggest similar items
- **Classification**: Group related items together

## Visual Guide to Vector Search

```
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│  Text to Vector Embedding                                     │
│                                                               │
│  "I love dogs"  ──────┐                                       │
│                       │                                       │
│                       ▼                                       │
│  ┌─────────────────────────────────────┐                      │
│  │      Gemini Embedding Model         │                      │
│  └─────────────────────────────────────┘                      │
│                       │                                       │
│                       ▼                                       │
│  [0.021, -0.108, 0.324, ..., -0.021]  ◄── 768-dimensional     │
│                                            vector             │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  VectorStorage                                                  │
│                                                                 │
│  ┌─────────────┐     ┌────────────────────────────────────────┐ │
│  │   Index     │     │            Vectors                     │ │
│  ├─────────────┤     ├────────────────────────────────────────┤ │
│  │      0      │ ──► │ [0.21, -0.11, 0.54, ..., 0.76]         │ │
│  │      1      │ ──► │ [0.11, 0.36, -0.42, ..., -0.21]        │ │
│  │      2      │ ──► │ [-0.33, 0.12, 0.91, ..., 0.05]         │ │
│  └─────────────┘     └────────────────────────────────────────┘ │
│                                                                 │
│  ┌─────────────┐     ┌────────────────────────────────────────┐ │
│  │   Index     │     │            Metadata                    │ │
│  ├─────────────┤     ├────────────────────────────────────────┤ │
│  │      0      │ ──► │ {"headline": "Scientists Discover...", │ │
│  │             │     │  "genre": "Science"}                   │ │
│  │      1      │ ──► │ {"headline": "Tech Company Launches..",│ │
│  │             │     │  "genre": "Technology"}                │ │
│  │      2      │ ──► │ {"headline": "Stock Market Hits...",   │ │
│  │             │     │  "genre": "Finance"}                   │ │
│  └─────────────┘     └────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

```
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  IVF Index - Inverted File Structure                           │
│                                                                │
│  Step 1: Cluster vectors using K-means                         │
│                                                                │
│    ●       ○                                                   │
│     ●   ○    ○                    ● Cluster 0                  │
│      ● ○                          ○ Cluster 1                  │
│     ●   ○                         ◆ Cluster 2                  │
│    ●       ○                                                   │
│        ◆ ◆ ◆                                                   │
│         ◆ ◆                                                    │
│                                                                │
│  Step 2: Create inverted lists                                 │
│                                                                │
│  ┌────────────┐    ┌───────────────────────────────────┐       │
│  │  Cluster   │    │  Vector IDs in cluster            │       │
│  ├────────────┤    ├───────────────────────────────────┤       │
│  │     0      │ ─► │  24, 42, 57, 123, 189             │       │
│  │     1      │ ─► │  7, 19, 76, 88, 152, 167          │       │
│  │     2      │ ─► │  31, 45, 103, 127                 │       │
│  └────────────┘    └───────────────────────────────────┘       │
│                                                                │
│  Step 3: During search, only examine vectors in closest        │
│          clusters (nprobe=2)                                   │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```


```
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  Semantic Search Workflow                                      │
│                                                                │
│  1. Convert query to embedding                                 │
│                                                                │
│  "Latest sports results"  ────► [0.12, -0.43, ..., 0.21]       │
│                                                                │
│  2. Find nearest clusters (nprobe=2)                           │
│                                                                │
│       ●                                                        │
│     ●   ●                                                      │
│      ●X●           ◆ ◆       X = Query Vector                  │
│     ●   ●           ◆ ◆      ● = Closest Cluster               │
│       ●              ◆       ○ = Other Cluster                 │
│                      ○ ○     ◆ = Other Cluster                 │
│                     ○   ○                                      │
│                      ○ ○                                       │
│                                                                │
│  3. Compare only with vectors in those clusters                │
│                                                                │
│  4. Return most similar results:                               │
│     - "Local Football Team Wins Championship"                  │
│     - "Marathon Runner Breaks World Record"                    │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

```
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  MiniVecDB Project Flow                                        │
│                                                                │
│  ┌───────────────┐     ┌───────────────┐    ┌───────────────┐  │
│  │   Text Data   │ ──► │   Embeddings  │ ─► │VectorStorage  │  │
│  └───────────────┘     └───────────────┘    └───────────────┘  │
│         │                      ▲                    │          │
│         │                      │                    ▼          │
│         │                      │              ┌───────────────┐│
│         │                      │              │  IVF Index    ││
│         ▼                      │              └───────────────┘│
│  ┌───────────────┐             │                    │          │
│  │ Search Query  │─────────────┘                    │          │
│  └───────────────┘                                  │          │
│         │                                           │          │
│         │                                           ▼          │
│         │                                    ┌───────────────┐ │
│         └───────────────────────────────────►│   Results     │ │
│                                              └───────────────┘ │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

## Features

- **Simple storage layer**: Store, retrieve, and manage vector embeddings with metadata
- **Fast search with IVF**: Efficient approximate nearest neighbor search
- **Gemini API integration**: Generate high-quality embeddings from text
- **Flexible distance metrics**: Choose among cosine similarity, Euclidean distance, and dot product
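
To make the three metrics concrete, here is a minimal NumPy sketch. The function names are illustrative and are not taken from the MiniVecDB source; the project's own implementations may differ in naming and in how they handle edge cases such as zero vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 = identical)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def dot_product(a, b):
    """Unnormalized similarity; grows with both alignment and length."""
    return float(np.dot(a, b))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 1.0])  # orthogonal to a

print(cosine_similarity(a, b), euclidean_distance(a, b), dot_product(a, b))
print(cosine_similarity(a, c))
```

Cosine similarity ignores vector length, Euclidean distance does not, and the dot product rewards longer vectors. Note that higher is better for the two similarities, while lower is better for the distance.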

## Project Structure

The project is organized into four main components:

1. **VectorStorage**: The foundation for storing and managing vectors and metadata
2. **IVF Index**: An efficient search index using clustering to accelerate queries
3. **Embedding Generation**: Utilities to create vector embeddings from text using Gemini
4. **Sample Application**: A news headline search demo

## Understanding the Notebooks

Each notebook in this project explores a different aspect of vector databases:

### `00_vector_storage.ipynb`

This notebook implements the core storage system for our MiniVecDB project. It introduces the concept of vector databases and demonstrates how to store, retrieve, and manage vector embeddings with metadata. The `VectorStorage` class handles basic operations like adding vectors, getting them by ID, deletion, and saving/loading the database to disk.

### `01_ivf_index.ipynb`

This notebook builds on the storage layer to implement efficient search using the Inverted File (IVF) Index algorithm. It explains the challenge of searching through large vector collections and demonstrates how clustering vectors can dramatically improve search speed. The notebook covers different distance functions, K-means clustering, and searching with different parameters.

### `02_gen_embeddings.ipynb`

This notebook sets up our connection to Google's Gemini API for creating text embeddings. It explains what embeddings are, how they capture semantic meaning, and why they're powerful for finding related content. The `get_embedding` function implemented here converts text to 768-dimensional vectors optimized for semantic similarity.

### `03_headline_embeddings.ipynb`

This notebook applies everything from the previous notebooks to build a complete workflow. It processes a dataset of news headlines, converts them to embeddings using Gemini (with rate limiting considerations), and stores them for use with our IVF index. This preprocessing step creates the vector database that powers semantic search demonstrations.

## Getting Started

### Prerequisites

- Python 3.9+
- NumPy
- FastCore
- Google Gemini API access

### Installation

```bash
git clone https://github.com/yourusername/minivecdb.git
cd minivecdb
pip install -e .
```

## Quick Example

```python
from minivecdb.vector_storage import VectorStorage
from minivecdb.ivf_index import IVFIndex
from minivecdb.gen_emb import get_embedding

# Create a storage for 768-dimensional vectors
storage = VectorStorage(768)

# Add some vectors with metadata
headline = "Scientists Discover New Species in Amazon"
vector = get_embedding(headline)
storage.add(vector, {"headline": headline, "category": "science"})

# Add more vectors...

# Create and build an index for faster search
index = IVFIndex(storage)
index.build()

# Search for similar content
query = "New biological discovery in rainforest"
query_vector = get_embedding(query)
results = index.search(query_vector, k=5)

# Display results
for vector, metadata in results:
    print(metadata["headline"])
```

Output:

```
Average list size: 1.0, Max: 1
Scientists Discover New Species in Amazon
```

## How It Works

### Vector Storage

The `VectorStorage` class provides the foundational layer:
- Stores vectors with unique IDs
- Associates metadata with each vector
- Persists data to disk using JSON
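
As a rough illustration of those three responsibilities, the sketch below shows one way such a layer could look. `TinyVectorStore` and its JSON layout are hypothetical stand-ins written for this README; they are not the project's `VectorStorage` class or its on-disk format.

```python
import json
import os
import tempfile

import numpy as np


class TinyVectorStore:
    """Illustrative stand-in, NOT MiniVecDB's VectorStorage."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}   # id -> np.ndarray of shape (dim,)
        self.metadata = {}  # id -> dict
        self._next_id = 0

    def add(self, vector, metadata=None):
        vector = np.asarray(vector, dtype=np.float32)
        if vector.shape != (self.dim,):
            raise ValueError(f"expected shape ({self.dim},), got {vector.shape}")
        vec_id = self._next_id
        self._next_id += 1
        self.vectors[vec_id] = vector
        self.metadata[vec_id] = metadata or {}
        return vec_id

    def get(self, vec_id):
        return self.vectors[vec_id], self.metadata[vec_id]

    def delete(self, vec_id):
        del self.vectors[vec_id]
        del self.metadata[vec_id]

    def save(self, path):
        # JSON cannot hold NumPy arrays, so vectors are stored as plain lists.
        payload = {
            "dim": self.dim,
            "next_id": self._next_id,
            "items": [
                {"id": i, "vector": v.tolist(), "metadata": self.metadata[i]}
                for i, v in self.vectors.items()
            ],
        }
        with open(path, "w", encoding="utf-8") as f:
            json.dump(payload, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        store = cls(payload["dim"])
        store._next_id = payload["next_id"]
        for item in payload["items"]:
            store.vectors[item["id"]] = np.asarray(item["vector"], dtype=np.float32)
            store.metadata[item["id"]] = item["metadata"]
        return store


store = TinyVectorStore(dim=3)
vid = store.add([0.1, 0.2, 0.3], {"headline": "Scientists Discover New Species in Amazon"})
path = os.path.join(tempfile.mkdtemp(), "store.json")
store.save(path)
restored = TinyVectorStore.load(path)
```

JSON keeps the file human-readable, which suits a teaching project, but it stores every float as text and so grows quickly with vector count.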

### IVF Indexing

The Inverted File (IVF) index accelerates search by:
1. Clustering similar vectors using K-means
2. Creating "inverted lists" mapping clusters to their vectors
3. During search, checking only the most promising clusters

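When the vectors are split into N clusters of roughly equal size and only `nprobe` of them are searched, a query is compared against about `nprobe`/N of the stored vectors (plus the N centroids) instead of all of them. The trade-off is that results are approximate: a true nearest neighbor that sits in an unprobed cluster is missed.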
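
The three steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the project's `IVFIndex`: it uses a hand-written Lloyd's K-means instead of the FastKMeans library, squared Euclidean distance only, and hypothetical names (`build_ivf`, `ivf_search`, `nlist`).

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest_centroid(X, centroids)
        for c in range(k):
            members = X[assign == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids

def build_ivf(X, nlist, seed=0):
    # Step 1: cluster the vectors. Step 2: one inverted list per cluster.
    centroids = kmeans(X, nlist, seed=seed)
    assign = nearest_centroid(X, centroids)
    lists = {c: np.flatnonzero(assign == c) for c in range(nlist)}
    return centroids, lists

def ivf_search(X, centroids, lists, query, k=5, nprobe=2):
    # Step 3: rank clusters by centroid distance, scan only the nearest ones.
    centroid_dist = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(centroid_dist)[:nprobe]
    candidates = np.concatenate([lists[int(c)] for c in probe])
    dist = ((X[candidates] - query) ** 2).sum(axis=1)
    order = np.argsort(dist)[:k]
    return candidates[order], dist[order]

# Toy data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

centroids, lists = build_ivf(X, nlist=3)
query = np.array([9.5, 10.2])
ids, dists = ivf_search(X, centroids, lists, query, k=5, nprobe=1)
print(ids, dists)
```

Setting `nprobe` equal to the number of clusters scans every list and reproduces brute-force search exactly; smaller values trade recall for speed.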

### Embedding Generation

We use Google's Gemini API to convert text into 768-dimensional vectors that capture semantic meaning. Similar concepts produce similar vectors, enabling semantic search.
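
Embedding a whole dataset means many API calls, so the headline notebook has to account for rate limits. The sketch below shows one common pattern, retrying with exponential backoff. It is an assumption about how this could be done, not the notebook's actual code: `embed_with_backoff` is a hypothetical helper, `embed_fn` stands in for the project's `get_embedding`, and a simulated failure replaces a real API error so the example runs offline.

```python
import time

def embed_with_backoff(texts, embed_fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Embed texts one at a time, retrying failed calls with exponential backoff."""
    vectors = []
    for text in texts:
        for attempt in range(max_retries):
            try:
                vectors.append(embed_fn(text))
                break
            except Exception:
                # Real code should catch only the client's rate-limit error.
                if attempt == max_retries - 1:
                    raise
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return vectors

# Fake embedding function that fails once, to simulate a rate-limit error.
calls = {"n": 0}

def flaky_embed(text):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated rate limit")
    return [float(len(text)), 0.0]

waits = []  # record the requested delays instead of actually sleeping
vectors = embed_with_backoff(["ab", "abc"], flaky_embed, sleep=waits.append)
print(vectors, waits)
```

Passing `sleep` as a parameter keeps the helper testable without real delays; in normal use the default `time.sleep` applies.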

## Limitations

This is an educational implementation with some limitations:
- Not optimized for very large datasets (millions of vectors)
- Basic persistence mechanism using JSON
- No automatic index updates (requires full rebuild)
- No concurrent access support

## License

MIT

## Acknowledgments

- The [FastKMeans](https://github.com/AnswerDotAI/fastkmeans) library for efficient clustering
- Google's Gemini API for high-quality embeddings