# vxdb Quickstart

This notebook walks you through every core feature of **vxdb** in under 5 minutes.

No API keys needed — we use small dummy vectors so you can run every cell immediately.

**What you'll learn:**
1. Create a database and collection
2. Insert vectors with metadata and documents
3. Vector similarity search
4. Filtered search (metadata operators)
5. Hybrid search (vector + keyword)
6. Pure keyword search
7. Update and delete operations
8. Multiple collections

In [1]:
!pip install vxdb -q

ERROR: Could not find a version that satisfies the requirement vxdb (from versions: none)
ERROR: No matching distribution found for vxdb

## Step 1: Create a Database and Collection

A **Database** holds multiple **Collections**. Each collection stores vectors of a fixed dimension with a chosen distance metric and index type.

Pass `path=` to persist data to disk (survives restarts). Omit it for a fast, ephemeral in-memory database.

| Parameter | Options | Default |
|-----------|---------|---------| 
| `path` | Any directory path, or `None` | `None` (in-memory) |
| `metric` | `"cosine"`, `"euclidean"`, `"dot"` | `"cosine"` |
| `index` | `"flat"` (exact brute-force), `"hnsw"` (approximate, faster for large datasets) | `"flat"` |

In [1]:
import vxdb

# Persistent — data survives restarts (recommended for real usage)
# db = vxdb.Database(path="./quickstart_data")

# In-memory — fast and ephemeral (used here so the notebook is self-contained)
db = vxdb.Database()

# dimension = number of floats per vector (must match your embedding model)
collection = db.create_collection(
    name="articles",
    dimension=4,        # small for demo — real models use 384, 768, 1536, etc.
    metric="cosine",    # cosine similarity (most common for text embeddings)
    index="flat",       # exact search — switch to "hnsw" for >100k vectors
)

print(f"Created: {collection}")
print(f"Collections in db: {db.list_collections()}")

Created: Collection(name='articles')
Collections in db: ['articles']
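
To make the `metric` options concrete, here is a minimal pure-Python sketch of the three measures applied to two of the demo vectors used later in this notebook. The function names are illustrative and not part of vxdb, and this notebook does not say how vxdb converts a dot product into a ranking score, so read these as the textbook definitions only.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector length as well as direction."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Raw inner product; larger means more aligned (and/or longer) vectors."""
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 0.0, 0.0, 0.0]
b = [0.5, 0.5, 0.0, 0.0]
print(f"cosine distance:    {cosine_distance(a, b):.4f}")
print(f"euclidean distance: {euclidean_distance(a, b):.4f}")
print(f"dot product:        {dot_product(a, b):.4f}")
```

Cosine ignores vector length, which is why it is a common default for text embeddings; Euclidean and dot product both change when a vector is rescaled.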


## Step 2: Insert Vectors

Use `upsert()` to insert or update vectors. It takes parallel lists, with one entry per vector in each list:
- **ids**: unique string identifiers
- **vectors**: lists of floats (each must match the collection's dimension)
- **metadata** *(optional)*: dicts of key-value pairs for filtering
- **documents** *(optional)*: raw text strings, required for hybrid and keyword search

In [2]:
# In a real app, these vectors come from an embedding model (see the other notebooks).
# Here we use small 4-d vectors to keep things readable.

collection.upsert(
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    vectors=[
        [1.0, 0.0, 0.0, 0.0],  # doc1: points in the "x" direction
        [0.0, 1.0, 0.0, 0.0],  # doc2: points in the "y" direction
        [0.9, 0.1, 0.0, 0.0],  # doc3: very similar to doc1
        [0.0, 0.0, 1.0, 0.0],  # doc4: points in the "z" direction
        [0.5, 0.5, 0.0, 0.0],  # doc5: between doc1 and doc2
    ],
    metadata=[
        {"title": "Intro to ML",    "category": "tech",      "year": 2024},
        {"title": "Cooking Pasta",   "category": "food",      "year": 2023},
        {"title": "Advanced ML",     "category": "tech",      "year": 2024},
        {"title": "Gardening 101",   "category": "lifestyle", "year": 2022},
        {"title": "Data Science",    "category": "tech",      "year": 2023},
    ],
    documents=[
        "Introduction to machine learning and neural networks",
        "How to cook perfect pasta with homemade sauce",
        "Advanced machine learning techniques and optimization",
        "A beginner's guide to gardening and growing vegetables",
        "Data science fundamentals with Python and statistics",
    ],
)

print(f"Inserted {collection.count()} vectors")

Inserted 5 vectors
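
Because `upsert()` receives several lists that must line up, a length or dimension mismatch is an easy mistake to make. The sketch below is a small pre-flight check you could run before calling vxdb. `validate_batch` is a hypothetical helper, not a vxdb function, and this notebook does not show how vxdb itself reports such errors.

```python
def validate_batch(ids, vectors, dimension, metadata=None, documents=None):
    """Raise ValueError if the lists for one upsert call are inconsistent."""
    # Assumption: duplicate ids inside a single batch are treated as a mistake.
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a batch")
    if len(vectors) != len(ids):
        raise ValueError(f"expected {len(ids)} vectors, got {len(vectors)}")
    # Optional lists must have one entry per id when they are supplied.
    for name, values in (("metadata", metadata), ("documents", documents)):
        if values is not None and len(values) != len(ids):
            raise ValueError(f"expected {len(ids)} {name} entries, got {len(values)}")
    # Every vector must have exactly `dimension` components.
    for vec_id, vec in zip(ids, vectors):
        if len(vec) != dimension:
            raise ValueError(f"{vec_id}: expected {dimension} floats, got {len(vec)}")

# A consistent batch passes silently.
validate_batch(["a", "b"], [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], dimension=4)

# A vector of the wrong size is rejected.
try:
    validate_batch(["c"], [[1.0, 0.0]], dimension=4)
except ValueError as err:
    print(f"rejected: {err}")
```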


## Step 3: Vector Similarity Search

Pass a query vector to get back the closest matches, ranked by distance. The `score` field in each result holds that distance, so with the cosine metric a lower score means a more similar vector.

In [3]:
query_vector = [1.0, 0.0, 0.0, 0.0]  # looking for docs similar to doc1

results = collection.query(vector=query_vector, top_k=3)

print("Vector Search — top 3 closest to [1, 0, 0, 0]:\n")
for r in results:
    print(f"  {r['id']:>5}  score={r['score']:.4f}  {r['metadata']['title']}")

Vector Search — top 3 closest to [1, 0, 0, 0]:

   doc1  score=0.0000  Intro to ML
   doc3  score=0.0061  Advanced ML
   doc5  score=0.2929  Data Science
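
The ranking above can be reproduced by hand. The sketch below assumes that `score` is cosine distance (1 minus cosine similarity) and that a `flat` index simply compares the query against every stored vector. Both are inferences from the Step 1 table and the printed scores, not a description of vxdb's internals, and `brute_force_top_k` is an illustrative name.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def brute_force_top_k(query, vectors, top_k):
    """Score every stored vector and return the top_k (id, distance) pairs, closest first."""
    scored = [(vec_id, cosine_distance(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

# The same five vectors inserted in Step 2.
stored = {
    "doc1": [1.0, 0.0, 0.0, 0.0],
    "doc2": [0.0, 1.0, 0.0, 0.0],
    "doc3": [0.9, 0.1, 0.0, 0.0],
    "doc4": [0.0, 0.0, 1.0, 0.0],
    "doc5": [0.5, 0.5, 0.0, 0.0],
}

for vec_id, dist in brute_force_top_k([1.0, 0.0, 0.0, 0.0], stored, top_k=3):
    print(f"{vec_id}  score={dist:.4f}")
```

Run as written, this prints doc1, doc3, and doc5 with scores 0.0000, 0.0061, and 0.2929, matching the output above. The cost grows linearly with the number of stored vectors, which is the trade-off an approximate index such as `hnsw` is meant to avoid.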


## Step 4: Filtered Search

Narrow results using metadata filters. vxdb supports 10 operators:

| Operator | Meaning |
|----------|---------|
| `$eq` | equals |
| `$ne` | not equals |
| `$gt` / `$gte` | greater than / greater-or-equal |
| `$lt` / `$lte` | less than / less-or-equal |
| `$in` / `$nin` | in list / not in list |
| `$and` / `$or` | combine conditions |

In [4]:
# Only return tech articles from 2024 or later
results = collection.query(
    vector=query_vector,
    top_k=5,
    filter={
        "$and": [
            {"category": {"$eq": "tech"}},
            {"year": {"$gte": 2024}},
        ]
    },
)

print("Filtered Search — tech articles, year >= 2024:\n")
for r in results:
    print(f"  {r['id']:>5}  score={r['score']:.4f}  {r['metadata']}")

Filtered Search — tech articles, year >= 2024:

   doc1  score=0.0000  {'year': 2024, 'title': 'Intro to ML', 'category': 'tech'}
   doc3  score=0.0061  {'year': 2024, 'category': 'tech', 'title': 'Advanced ML'}
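
To show what the operators in the table mean, here is a minimal pure-Python evaluator for the filter shape used above. It assumes filters follow the MongoDB-style `{field: {operator: value}}` layout seen in the example. `matches` is a hypothetical helper, and vxdb's handling of edge cases such as missing fields may differ.

```python
import operator

# Comparison operators from the table, mapped to plain Python functions.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, flt):
    """Return True if a metadata dict satisfies the filter (assumed semantics)."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # Assumption: a record that lacks the field never matches.
            if key not in metadata:
                return False
            for op, expected in condition.items():
                if not COMPARATORS[op](metadata[key], expected):
                    return False
    return True

# Metadata from Step 2 (titles omitted for brevity).
records = {
    "doc1": {"category": "tech", "year": 2024},
    "doc2": {"category": "food", "year": 2023},
    "doc3": {"category": "tech", "year": 2024},
    "doc4": {"category": "lifestyle", "year": 2022},
    "doc5": {"category": "tech", "year": 2023},
}

tech_2024 = {"$and": [{"category": {"$eq": "tech"}}, {"year": {"$gte": 2024}}]}
non_tech = {"$or": [{"category": {"$in": ["food", "lifestyle"]}}, {"year": {"$lt": 2023}}]}

print([doc_id for doc_id, meta in records.items() if matches(meta, tech_2024)])
print([doc_id for doc_id, meta in records.items() if matches(meta, non_tech)])
```

The first filter selects doc1 and doc3, the same pair returned by the filtered query above; the second selects doc2 and doc4.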


## Step 5: Hybrid Search (Vector + Keyword)

Hybrid search combines **vector similarity** with **BM25 keyword matching** using Reciprocal Rank Fusion (RRF).

This is useful when semantic similarity alone isn't enough — e.g., searching for specific terms, product codes, or proper nouns that embeddings may not capture well.

The `alpha` parameter controls the blend:
- `alpha=1.0` → pure vector search
- `alpha=0.5` → equal weight (default)
- `alpha=0.0` → pure keyword search

> **Note:** You must pass `documents=` during `upsert()` to use hybrid/keyword search.

In [5]:
results = collection.hybrid_query(
    vector=query_vector,
    query="machine learning",
    top_k=3,
    alpha=0.5,
)

print("Hybrid Search — vector ≈ [1,0,0,0] + keywords 'machine learning':\n")
for r in results:
    print(f"  {r['id']:>5}  score={r['score']:.4f}  {r['metadata']['title']}")

Hybrid Search — vector ≈ [1,0,0,0] + keywords 'machine learning':

   doc3  score=0.0163  Advanced ML
   doc1  score=0.0163  Intro to ML
   doc5  score=0.0079  Data Science
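
For intuition about how `alpha` blends the two rankings, the sketch below implements one common weighted form of Reciprocal Rank Fusion: each document earns `alpha / (k + rank)` from the vector ranking and `(1 - alpha) / (k + rank)` from the keyword ranking. The constant `k = 60`, the exact weighting, and the assumed keyword order are not stated in this notebook, so treat this as an illustration rather than vxdb's implementation. `weighted_rrf` is a hypothetical name.

```python
def weighted_rrf(vector_ranking, keyword_ranking, alpha=0.5, k=60):
    """Fuse two ranked id lists into (id, score) pairs, best first.

    k=60 is an assumed smoothing constant, a value often used with RRF.
    """
    scores = {}
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(keyword_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 - alpha) / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

vector_ranking = ["doc1", "doc3", "doc5"]  # order from the vector search in Step 3
keyword_ranking = ["doc3", "doc1"]         # assumed BM25 order for "machine learning"

for doc_id, score in weighted_rrf(vector_ranking, keyword_ranking):
    print(f"{doc_id}  score={score:.4f}")
```

Under these assumptions the fused scores are 0.0163 for both doc1 and doc3 and 0.0079 for doc5, the same values printed above. The two leading documents tie because each is ranked first in one list and second in the other, so their relative order is arbitrary. Setting `alpha=1.0` drops the keyword contribution entirely, and `alpha=0.0` drops the vector contribution.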


## Step 6: Pure Keyword Search

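Search by text only; no vector is needed. vxdb scores matches with BM25 internally. Only documents that contain at least one query term are returned, so the example yields a single hit even with `top_k=3`.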

In [6]:
results = collection.keyword_search(query="pasta sauce", top_k=3)

print("Keyword Search — 'pasta sauce':\n")
for r in results:
    print(f"  {r['id']:>5}  score={r['score']:.4f}  {r['metadata']['title']}")

Keyword Search — 'pasta sauce':

   doc2  score=2.6195  Cooking Pasta
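
The following sketch shows the idea behind BM25 scoring using the classic Okapi formula. The whitespace tokenizer and the parameters `k1=1.5` and `b=0.75` are assumptions; vxdb's tokenizer and parameters are not documented in this notebook, so the absolute scores here will not match the value printed above. `bm25_search` is an illustrative function, not the vxdb method.

```python
import math
from collections import Counter

def bm25_search(query, documents, top_k=3, k1=1.5, b=0.75):
    """Rank documents by Okapi BM25; documents with no matching term are skipped."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    results = []
    for doc_id, tokens in tokenized.items():
        term_counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized.
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        if score > 0.0:
            results.append((doc_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results[:top_k]

# The same documents inserted in Step 2.
documents = {
    "doc1": "Introduction to machine learning and neural networks",
    "doc2": "How to cook perfect pasta with homemade sauce",
    "doc3": "Advanced machine learning techniques and optimization",
    "doc4": "A beginner's guide to gardening and growing vegetables",
    "doc5": "Data science fundamentals with Python and statistics",
}

print([doc_id for doc_id, _ in bm25_search("pasta sauce", documents)])
print([doc_id for doc_id, _ in bm25_search("machine learning", documents)])
```

As in the vxdb output, only doc2 matches "pasta sauce". For "machine learning" the sketch ranks doc3 ahead of doc1 because doc3 is the shorter document.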


## Step 7: Update and Delete

- **Upsert with an existing ID** overwrites that vector, metadata, and document.
- **Delete** removes vectors by ID and returns which ones were actually found.

In [7]:
# Update: upsert with the same ID replaces everything
collection.upsert(
    ids=["doc1"],
    vectors=[[0.0, 0.0, 0.0, 1.0]],
    metadata=[{"title": "Intro to ML (revised)", "category": "tech", "year": 2025}],
)
print(f"After update: count = {collection.count()}")

# Delete by ID
deleted = collection.delete(ids=["doc4"])
print(f"Deleted doc4: {deleted}")
print(f"After delete: count = {collection.count()}")

After update: count = 5
Deleted doc4: [True]
After delete: count = 4
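
The semantics described above can be modeled with a plain dictionary keyed by ID. `TinyStore` below is a hypothetical stand-in used only to illustrate the behavior; it is not how vxdb stores data. Returning `False` for an unknown ID is an assumption about what "returns which ones were actually found" means, since the cell above only deletes an ID that exists.

```python
class TinyStore:
    """Dict-backed stand-in used only to illustrate upsert/delete semantics."""

    def __init__(self):
        self._records = {}

    def upsert(self, ids, vectors, metadata=None):
        for i, vec_id in enumerate(ids):
            # Assigning to an existing key replaces the whole record.
            self._records[vec_id] = {
                "vector": vectors[i],
                "metadata": metadata[i] if metadata else {},
            }

    def delete(self, ids):
        # One boolean per requested id: True if it existed and was removed.
        return [self._records.pop(vec_id, None) is not None for vec_id in ids]

    def count(self):
        return len(self._records)

store = TinyStore()
store.upsert(["doc1", "doc4"], [[1.0, 0.0], [0.0, 1.0]], [{"year": 2024}, {"year": 2022}])
store.upsert(["doc1"], [[0.5, 0.5]], [{"year": 2025}])  # same id: replaced, not added
print(store.count())                      # 2
print(store.delete(["doc4", "missing"]))  # [True, False]
print(store.count())                      # 1
```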


## Step 8: Multiple Collections

A single database can hold many collections — useful for organizing different data types (text, images, audio) or different embedding models.

In [8]:
db.create_collection("images", dimension=512, metric="cosine", index="hnsw")
db.create_collection("audio", dimension=256, metric="euclidean")

print(f"All collections: {sorted(db.list_collections())}")

db.delete_collection("audio")
print(f"After deleting 'audio': {sorted(db.list_collections())}")

All collections: ['articles', 'audio', 'images']
After deleting 'audio': ['articles', 'images']


## Next Steps

You now know all the core operations. Check out the other notebooks in this directory to see how to plug in **real embedding models**:

| Notebook | Embedding Source | API Key Needed? |
|----------|-----------------|-----------------|
| `openai_embeddings.ipynb` | OpenAI `text-embedding-3-small` | Yes |
| `sentence_transformers.ipynb` | Hugging Face (runs locally) | No |
| `langchain_integration.ipynb` | LangChain (any provider) | Depends |
| `cohere_embeddings.ipynb` | Cohere `embed-v4.0` | Yes |
| `hybrid_search.ipynb` | End-to-end hybrid search RAG pipeline | Depends |

In [11]:
db.list_collections()

['images', 'articles']