## Simple Vector Database

July 2023

A vector database is a specialized type of database designed to store and manage large collections of vectors. A vector is an ordered list of elements that can represent or encode data that needs two or more values to describe it.

Contrast this with a scalar, which needs only a single value.

When we say that a sailboat traveled 2 miles, we are describing its movement in a scalar way: it has a magnitude and nothing else.

But when we say that a sailboat traveled 2 miles downwind, 10 degrees east of north from its initial position, we are describing its movement as a vector: it has both magnitude and direction.
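
To make the sailboat example concrete, here is a minimal sketch that turns a magnitude and a direction into two components. It assumes the direction is a compass bearing measured clockwise from north, so "10 degrees" is read as 10 degrees east of due north. The function name `displacement` is made up for this illustration.

```python
import numpy as np

def displacement(distance_miles: float, bearing_degrees: float) -> np.ndarray:
    """Return the (east, north) components of a trip.

    Assumes a compass bearing: 0 degrees is north, 90 degrees is east.
    """
    bearing = np.radians(bearing_degrees)
    return np.array([distance_miles * np.sin(bearing),
                     distance_miles * np.cos(bearing)])

trip = displacement(2.0, 10.0)
print(trip)                  # two numbers: miles east, miles north
print(np.linalg.norm(trip))  # the magnitude recovered from the components
```

The scalar description keeps only the second printed value; the vector keeps both components, so the direction is not lost.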

Vectors are important in data science and machine learning because they let us represent complex data in a structured, numerical form that algorithms can work with directly.

So-called 'feature vectors' are used widely in machine learning because representing objects numerically is both effective and practical for many kinds of analysis. In particular, there are well-established techniques for comparing feature vectors, such as cosine similarity and Euclidean distance.
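
As a small illustration of the two comparison measures named above, the sketch below applies both to three made-up feature vectors. The vectors and the helper names `cosine` and `euclidean` are assumptions chosen for the example. Cosine similarity looks only at direction, while Euclidean distance is also sensitive to length, so the two measures can disagree about which vector is the closer match.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length
c = np.array([3.0, 2.0, 1.0])  # same length as a, different direction

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Higher means more similar; 1.0 means the same direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u: np.ndarray, v: np.ndarray) -> float:
    # Lower means more similar; 0.0 means identical vectors.
    return float(np.linalg.norm(u - v))

print(cosine(a, b), euclidean(a, b))  # cosine about 1.0, distance about 3.74
print(cosine(a, c), euclidean(a, c))  # cosine about 0.71, distance about 2.83
```

By cosine similarity `b` is the better match for `a`, but by Euclidean distance `c` is closer. Which measure is appropriate depends on whether vector length carries meaning in the data.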

An introduction -> https://machinelearningmastery.com/gentle-introduction-vectors-machine-learning/

Use of Feature Vectors -> https://brilliant.org/wiki/feature-vector/

In [27]:
import numpy as np

In [28]:
from collections import defaultdict

In [29]:
from typing import List, Tuple

In [30]:
def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)
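
One edge case is worth knowing about: if either input is all zeros, its norm is 0 and the division above produces `nan` (NumPy emits a runtime warning rather than raising an exception). A hedged variant is sketched below. The name `safe_cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np

def safe_cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine similarity that returns 0.0 when either vector has zero length.

    Returning 0.0 is a convention chosen for this sketch; raising an error
    would be an equally reasonable choice.
    """
    denominator = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denominator == 0.0:
        return 0.0
    return float(np.dot(v1, v2) / denominator)

print(safe_cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # 0.0 (zero vector)
print(safe_cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
print(safe_cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # 1.0 (same direction)
```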

In [35]:
class VectorDatabase:
    def __init__(self):
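        # A plain dict is enough here: defaultdict(np.ndarray) would call np.ndarray()
        # with no shape on a missing key, which raises a TypeError.
        self.vectors = {}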

    def insert(self, key: str, vector: np.ndarray) -> None:
        self.vectors[key] = vector

    def search_using_cosine(self, query_vector: np.ndarray, k: int) -> List[Tuple[str, float]]:
        similarities = [(key, cosine_similarity(query_vector, vector)) for key, vector in self.vectors.items()]
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:k]

    def search_using_euclidean(self, query_vector: np.ndarray, k: int) -> List[Tuple[str, float]]:
        # Euclidean distance: smaller means closer, so sort in ascending order.
        distances = [(key, np.linalg.norm(query_vector - vector)) for key, vector in self.vectors.items()]
        distances.sort(key=lambda x: x[1])
        return distances[:k]

    def retrieve(self, key: str) -> np.ndarray:
        # Returns None when the key is not in the database.
        return self.vectors.get(key, None)

In [36]:
# Initialize
vector_db = VectorDatabase()

In [37]:
# Insert vectors into the database
vector_db.insert("vector_1", np.array([0.1, 0.2, 0.3]))
vector_db.insert("vector_2", np.array([0.4, 0.5, 0.6]))
vector_db.insert("vector_3", np.array([0.7, 0.8, 0.9]))

In [38]:
# Search for similar vectors using cosine
query_vector = np.array([0.15, 0.25, 0.35])
similar_vectors = vector_db.search_using_cosine(query_vector, k=2)
print("Similar vectors:", similar_vectors)

Similar vectors: [('vector_1', 0.9974149030430577), ('vector_2', 0.9881950691041642)]

In [39]:
# Retrieve a specific vector by its key
retrieved_vector = vector_db.retrieve("vector_1")
print("Retrieved vector:", retrieved_vector)

Retrieved vector: [0.1 0.2 0.3]

In [40]:
# Search for the nearest vectors using Euclidean distance (smaller is closer)
query_vector = np.array([0.15, 0.25, 0.35])
similar_vectors = vector_db.search_using_euclidean(query_vector, k=2)
print("Similar vectors:", [(key, round(float(dist), 4)) for key, dist in similar_vectors])

Similar vectors: [('vector_1', 0.0866), ('vector_2', 0.433)]
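
The search methods in the class compare the query against one stored vector at a time inside a Python loop. For larger collections, the same brute-force ranking can be expressed as a single array operation. The sketch below is a standalone illustration under that assumption; `top_k_euclidean` is a hypothetical helper, not part of the `VectorDatabase` class, and it expects the stored vectors stacked as rows of one matrix.

```python
import numpy as np

def top_k_euclidean(keys, matrix, query, k):
    """Return the k (key, distance) pairs closest to query.

    matrix holds one stored vector per row, in the same order as keys.
    """
    # Broadcasting subtracts the query from every row at once.
    distances = np.linalg.norm(matrix - query, axis=1)
    # argsort is ascending, so the smallest distances come first.
    nearest = np.argsort(distances)[:k]
    return [(keys[i], float(distances[i])) for i in nearest]

keys = ["vector_1", "vector_2", "vector_3"]
matrix = np.array([[0.1, 0.2, 0.3],
                   [0.4, 0.5, 0.6],
                   [0.7, 0.8, 0.9]])
query = np.array([0.15, 0.25, 0.35])

result = top_k_euclidean(keys, matrix, query, k=2)
for key, distance in result:
    print(f"{key}: {distance:.4f}")
# vector_1: 0.0866
# vector_2: 0.4330
```

This is still an exhaustive scan over every stored vector; it only moves the loop into NumPy.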


