# Vector Storage

> This notebook explores the `VectorStorage` class, the foundation of our MiniVecDB project. It provides a simple way to store, retrieve, and manage vector embeddings along with their associated metadata.

## What is a vector database?

Think of a vector database as a smart filing cabinet that can find similar items. 
Each "document" (like text, images, or audio) gets converted into a long list of numbers (a vector).

Similar documents have similar vectors, so we can find related content by looking for vectors 
that are close to each other in this mathematical space.
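
To make "close to each other" concrete, here is a minimal sketch using cosine similarity, one common way to compare vectors. The three tiny vectors and their names are made up for illustration, and this notebook does not say which metric MiniVecDB uses, so treat the choice of metric as an assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean 'pointing the same way'."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings"; real ones usually have hundreds of dimensions.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9])

sim_close = cosine_similarity(cat, kitten)  # related items: high similarity
sim_far = cosine_similarity(cat, car)       # unrelated items: low similarity
print(sim_close > sim_far)
```

A similarity search amounts to computing a score like this between a query vector and the stored vectors, then keeping the best matches.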

Our `VectorStorage` class is the simplest form of this idea. It does not search yet; it stores vectors and lets us:
- Add new vectors with metadata (information about what the vector represents)
- Get vectors by their ID
- Delete vectors we don't need anymore
- Save and load our database

In [1]:
#| default_exp vector_storage

In [2]:
#| hide
from nbdev.showdoc import *

In [3]:
#| export
import numpy as np
import json
from typing import Dict, List, Any, Tuple, Union
import fastcore.all as fc

## The VectorStorage Class

The `VectorStorage` class is the foundation of MiniVecDB. It keeps everything in memory in plain Python dictionaries.

Each vector is stored with:
- A unique numeric ID
- The vector itself (a numpy array)
- Metadata (any JSON-serializable information about what the vector represents)

Think of it as a simple dictionary where the keys are IDs and the values are (vector, metadata) pairs.

The complete `VectorStorage` class appears below, with detailed explanations after the code. In brief, its methods are:

- **`__init__`**: Sets up storage for vectors and metadata
- **`add`**: Adds a new vector with metadata and returns its ID
- **`get`**: Retrieves a vector and its metadata by ID
- **`delete`**: Removes a vector and its metadata
- **`get_all`**: Returns every stored entry as an `(id, vector, metadata)` tuple
- **`save`**: Saves the database to a JSON file
- **`load`**: Loads a database from a JSON file

In [4]:
#| export
class VectorStorage:
    def __init__(self, dimension):
        fc.store_attr()
        self.vectors = {}  # id -> vector
        self.metadatas = {}  # id -> metadata
        self.next_id = 0
    
    def add(self, vector, metadata):
        fc.test_eq(self.dimension, len(vector))
        current_id = self.next_id
        self.vectors[current_id] = vector
        self.metadatas[current_id] = metadata
        self.next_id += 1
        return current_id
    
    def get(self, id): return (self.vectors[id], self.metadatas[id])
    
    def delete(self, id):
        if id not in self.vectors: return False
        self.vectors.pop(id)
        self.metadatas.pop(id)
        return True
    
    def get_all(self): return fc.L((key, self.vectors[key], self.metadatas[key]) for key in self.vectors)

    def save(self, filepath):
        data = {
            'dimension': self.dimension,
            'next_id': self.next_id,
            # np.asarray lets plain lists passed to add() be saved as well as arrays
            'vectors': {str(k): np.asarray(v).tolist() for k, v in self.vectors.items()},
            'metadatas': self.metadatas  # Metadata must already be JSON-serializable
        }
        fc.Path(filepath).write_text(fc.dumps(data))
    
    @classmethod
    def load(cls, filepath):
        data = fc.Path(filepath).read_json()
        
        storage = cls(data['dimension'])
        storage.next_id = data['next_id']
        
        storage.vectors = {int(k): np.array(v) for k, v in data['vectors'].items()}
        storage.metadatas = {int(k): v for k, v in data['metadatas'].items()}
        
        return storage

## Detailed Method Explanations

### The add Method

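The `add` method stores a new vector and its metadata. It:
1. Checks that `len(vector)` matches the storage's `dimension` (`fc.test_eq` raises an `AssertionError` on a mismatch)
2. Assigns the next unused integer ID, starting at 0; IDs are never reused, even after a deletion
3. Stores the vector and metadata under that ID
4. Returns the ID so you can reference this vector later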
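
The steps above can be sketched without the class at all. The block below is a stand-alone stand-in that mirrors the bookkeeping in `add` with module-level dictionaries; the names `dimension`, `vectors`, `metadatas`, and `next_id` are borrowed from the class, but this is an illustration, not the exported code.

```python
# Stand-alone stand-in for the bookkeeping in `add` (illustrative only).
dimension = 3
vectors, metadatas = {}, {}
next_id = 0

def add(vector, metadata):
    """Validate the length, store the entry under a fresh ID, and return that ID."""
    global next_id
    assert len(vector) == dimension, f"expected {dimension} values, got {len(vector)}"
    current_id = next_id
    vectors[current_id] = vector
    metadatas[current_id] = metadata
    next_id += 1
    return current_id

first = add([0.1, 0.2, 0.3], {"title": "Item 1"})
second = add([0.4, 0.5, 0.6], {"title": "Item 2"})

# Removing an entry does not rewind the counter, so its ID is never handed out again.
del vectors[second], metadatas[second]
third = add([0.7, 0.8, 0.9], {"title": "Item 3"})

# A vector of the wrong length is rejected before anything is stored.
try:
    add([1.0, 2.0], {"title": "too short"})
    rejected = False
except AssertionError:
    rejected = True

print(first, second, third, rejected)
```

Because the counter only moves forward, an ID you hold either points at the entry you added or at nothing; it never silently points at a different vector.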

### The get and delete Methods

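The `get` method returns a `(vector, metadata)` tuple for the given ID and raises a `KeyError` if the ID does not exist.

The `delete` method removes a vector and its metadata from storage. It returns `True` if the ID was found and removed, and `False` otherwise.

Note the asymmetry: `get` fails loudly on a missing ID, while `delete` reports it through its return value.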
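
As a rough illustration of how the two methods behave on a missing ID, here is a stand-alone sketch that copies their logic onto plain dictionaries. The sample data is made up, and the functions are stand-ins rather than the class methods themselves.

```python
# Made-up sample data in the same id -> value layout the class uses.
vectors = {0: [0.1, 0.2], 1: [0.3, 0.4]}
metadatas = {0: {"title": "Item 1"}, 1: {"title": "Item 2"}}

def get(id):
    """Return (vector, metadata); a missing ID surfaces as a KeyError."""
    return (vectors[id], metadatas[id])

def delete(id):
    """Remove an entry and report whether anything was actually removed."""
    if id not in vectors:
        return False
    vectors.pop(id)
    metadatas.pop(id)
    return True

removed_first_time = delete(1)   # the entry exists, so it is removed
removed_second_time = delete(1)  # already gone, so nothing happens

# Looking up the deleted ID raises, so callers that expect gaps should catch it.
try:
    get(1)
    found = True
except KeyError:
    found = False

print(removed_first_time, removed_second_time, found)
```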

### Saving and Loading Storage

One key feature of `VectorStorage` is persistence: the ability to save your database to disk and load it back later.

The `save` method converts each vector to a plain list and writes the dimension, the ID counter (`next_id`), the vectors, and the metadata to a single JSON file. The metadata must therefore be JSON-serializable.

The `load` method recreates a `VectorStorage` object from a saved file, converting the lists back into NumPy arrays and the string keys that JSON produces back into integer IDs.

This means you can build your vector database once and reuse it across multiple sessions or applications.
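
The two conversions involved in a JSON round trip can be seen in isolation. The sketch below uses the standard-library `json` module instead of the fastcore helpers the class calls, and the sample vectors are made up.

```python
import json
import numpy as np

# Made-up entries; ID 1 is missing on purpose, as if it had been deleted.
vectors = {0: np.array([0.1, 0.2, 0.3]), 2: np.array([0.4, 0.5, 0.6])}

# NumPy arrays are not JSON-serializable, so each one becomes a plain list first.
payload = json.dumps({"vectors": {str(k): v.tolist() for k, v in vectors.items()}})

# JSON object keys are always strings, so the IDs come back as "0" and "2".
raw = json.loads(payload)["vectors"]

# Undo both conversions: string keys -> integer IDs, lists -> NumPy arrays.
restored = {int(k): np.array(v) for k, v in raw.items()}

print(list(raw), list(restored))
```

Skipping the `int(k)` step would leave string keys in the loaded dictionaries, and lookups such as `get(0)` would then fail with a `KeyError`.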

In [5]:
storage = VectorStorage(dimension=768)

vector1 = np.random.randn(768)
vector2 = np.random.randn(768)
id1 = storage.add(vector1, {"title": "Item 1"})
id2 = storage.add(vector2, {"title": "Item 2"})

vector, metadata = storage.get(id1)

storage.delete(id2)

fc.Path("data").mkdir(exist_ok=True)  # make sure the target folder exists
storage.save("data/my_vectors.json")

loaded_storage = VectorStorage.load("data/my_vectors.json")
len(loaded_storage.get_all())

1

This example shows a complete workflow:
1. Create a new storage for 768-dimensional vectors
2. Add two vectors with simple metadata
3. Retrieve a vector and its metadata
4. Delete a vector
5. Save the storage to disk
6. Load the storage from disk

This is a very simple demonstration. In a real application, you might store thousands or millions of vectors representing documents, images, or other data.

## Next Steps

- Look at `01_ivf_index.ipynb` to learn how to make searching more efficient
- Look at `02_gen_embeddings.ipynb` to see how to create vectors from text using Gemini
- Look at `03_headline_embeddings.ipynb` for a complete example application

In [6]:
#| hide
import nbdev; nbdev.nbdev_export()