# OpenAI Embeddings + vxdb

This notebook shows how to generate embeddings with **OpenAI's API** and store/search them in vxdb.

**Model used:** `text-embedding-3-small` (1536 dimensions, $0.02 per 1M tokens)

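You can swap in `text-embedding-3-large` (3072 dims) for higher quality, or `text-embedding-ada-002` (1536 dims, legacy). If you switch models, update `EMBEDDING_DIM` in Step 1 to match, since the collection is created with that dimension.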
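
To get a feel for the price quoted above, here is a rough cost estimate. It is only a sketch: `estimate_embedding_cost` and the workload numbers are made up for illustration, and the price constant is copied from this notebook, so check current pricing before relying on it.

```python
# Price copied from the model line above; it may change over time.
PRICE_PER_MILLION_TOKENS_USD = 0.02  # text-embedding-3-small


def estimate_embedding_cost(total_tokens: int) -> float:
    """Return the approximate USD cost of embedding `total_tokens` tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


# Hypothetical workload: 10,000 chunks of roughly 500 tokens each.
print(f"${estimate_embedding_cost(10_000 * 500):.2f}")
```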

**Prerequisites:**
```bash
pip install vxdb openai
```

**You'll need:** an OpenAI API key → https://platform.openai.com/api-keys

In [None]:
!pip install vxdb openai -q

## Step 1: Set up the OpenAI client

Set your API key as an environment variable or paste it directly (not recommended for shared notebooks).
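
If the variable is missing, it helps to fail early with a clear message instead of waiting for the first API call. A minimal sketch, assuming the key lives in `OPENAI_API_KEY`; the helper name `resolve_api_key` is hypothetical and not part of the OpenAI SDK.

```python
import os
from collections.abc import Mapping


def resolve_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI key from `env`, or raise a clear error if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the notebook."
        )
    return key
```

The result can then be passed to the client, for example `OpenAI(api_key=resolve_api_key())`.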

In [None]:
import os
from openai import OpenAI

# Option 1: set env var before running — export OPENAI_API_KEY="sk-..."
# Option 2: pass directly (not recommended for shared notebooks)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536  # dimensions for text-embedding-3-small

## Step 2: Define an embedding helper

The OpenAI API accepts up to 2048 inputs per request, and each input must also fit within the model's token limit. We wrap the call in a simple function.
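
For collections larger than the per-request limit, the input list has to be split into chunks. The following is a sketch of one way to do that; `batched` and `embed_in_batches` are illustrative names, and `embed_fn` stands for any callable with the same signature as this step's `get_embeddings` helper.

```python
from collections.abc import Callable, Iterator

MAX_BATCH_SIZE = 2048  # per-request input limit mentioned in this step


def batched(texts: list[str], size: int = MAX_BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each holding at most `size` items."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    size: int = MAX_BATCH_SIZE,
) -> list[list[float]]:
    """Embed `texts` chunk by chunk, preserving the input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed_fn(batch))
    return vectors
```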

In [None]:
def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Call the OpenAI embeddings API and return a list of vectors."""
    response = client.embeddings.create(
        input=texts,
        model=EMBEDDING_MODEL,
    )
    # The API returns embeddings in the same order as the input
    return [item.embedding for item in response.data]


# Quick test
test_vec = get_embeddings(["hello world"])
print(f"Embedding dimension: {len(test_vec[0])}")
print(f"First 5 values: {test_vec[0][:5]}")

## Step 3: Prepare your data

We'll index a small collection of documents. In production, this could be paragraphs from PDFs, product descriptions, support tickets, etc.
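
As a rough illustration of the production case, longer text can be split into paragraphs and turned into records of the same shape used in this notebook. This is a sketch under simple assumptions (paragraphs separated by blank lines); `paragraphs_to_documents` is a hypothetical helper, not part of vxdb.

```python
def paragraphs_to_documents(raw_text: str, prefix: str, metadata: dict) -> list[dict]:
    """Split `raw_text` on blank lines and build one record per paragraph."""
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    return [
        # Copy the metadata so records do not share one mutable dict.
        {"id": f"{prefix}-{i}", "text": p, "metadata": dict(metadata)}
        for i, p in enumerate(paragraphs)
    ]


sample = "First paragraph.\n\nSecond paragraph."
for doc in paragraphs_to_documents(sample, "guide", {"topic": "ml"}):
    print(doc["id"], doc["text"])
```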

In [None]:
documents = [
    {
        "id": "ml-intro",
        "text": "Machine learning is a subset of artificial intelligence that enables systems to learn from data without being explicitly programmed.",
        "metadata": {"topic": "ml", "level": "beginner"},
    },
    {
        "id": "dl-intro",
        "text": "Deep learning uses neural networks with many layers to model complex patterns in large amounts of data.",
        "metadata": {"topic": "dl", "level": "beginner"},
    },
    {
        "id": "transformers",
        "text": "Transformer models use self-attention mechanisms to process sequences in parallel, revolutionizing NLP tasks like translation and summarization.",
        "metadata": {"topic": "nlp", "level": "intermediate"},
    },
    {
        "id": "rag",
        "text": "Retrieval-Augmented Generation combines a retrieval system with a generative model, grounding LLM outputs in real documents to reduce hallucination.",
        "metadata": {"topic": "rag", "level": "advanced"},
    },
    {
        "id": "vec-db",
        "text": "Vector databases store high-dimensional embeddings and support fast approximate nearest-neighbor search for semantic similarity.",
        "metadata": {"topic": "infrastructure", "level": "intermediate"},
    },
    {
        "id": "fine-tuning",
        "text": "Fine-tuning adapts a pre-trained model to a specific task by continuing training on a smaller, domain-specific dataset.",
        "metadata": {"topic": "ml", "level": "advanced"},
    },
]

print(f"Prepared {len(documents)} documents")

## Step 4: Embed and insert into vxdb

We embed all documents in a single batch call, then upsert into a vxdb collection.
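
Before upserting, it can be worth checking that the IDs and vectors line up and that every vector has the dimension the collection expects. A minimal sketch; `check_vectors` is an illustrative helper, and vxdb may well perform its own validation.

```python
def check_vectors(ids: list[str], vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if ids and vectors are inconsistent with each other."""
    if len(ids) != len(vectors):
        raise ValueError(f"{len(ids)} ids but {len(vectors)} vectors")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids found")
    for doc_id, vector in zip(ids, vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"{doc_id}: expected {expected_dim} dims, got {len(vector)}"
            )
```

A dimension mismatch here usually means the embedding model was changed without updating `EMBEDDING_DIM`.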

In [None]:
import vxdb

# Embed all texts in one batch call (one round trip instead of one per document)
texts = [doc["text"] for doc in documents]
vectors = get_embeddings(texts)

# Create database and collection
db = vxdb.Database()
collection = db.create_collection(
    name="openai_docs",
    dimension=EMBEDDING_DIM,
    metric="cosine",
    index="flat",
)

# Upsert: vectors + metadata + raw text for hybrid search
collection.upsert(
    ids=[doc["id"] for doc in documents],
    vectors=vectors,
    metadata=[doc["metadata"] for doc in documents],
    documents=texts,
)

print(f"Indexed {collection.count()} documents with {EMBEDDING_DIM}-dim OpenAI embeddings")

## Step 5: Semantic search

Embed the query with the same model, then search by vector similarity.
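
For intuition, this is what cosine similarity computes for two vectors, written in plain Python. It is a sketch of the metric itself, not of vxdb's internals; whether the `score` field returned by vxdb is a similarity or a distance is not stated in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.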

In [None]:
query = "How do I make my LLM more accurate with external knowledge?"
query_vector = get_embeddings([query])[0]

results = collection.query(vector=query_vector, top_k=3)

print(f"Query: '{query}'\n")
for r in results:
    print(f"  [{r['id']}]  score={r['score']:.4f}  topic={r['metadata']['topic']}")
    # Find the original text for display
    original = next(d["text"] for d in documents if d["id"] == r["id"])
    print(f"    → {original[:100]}...\n")

## Step 6: Filtered semantic search

Combine vector similarity with metadata constraints — e.g., only search beginner-level docs.
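
The filter used in this step is a nested dict with an `$eq` operator. The sketch below shows how such a filter could be evaluated against one metadata record. It is purely illustrative: `matches` is a hypothetical function that handles only `$eq`, and it says nothing about how vxdb evaluates filters or which other operators it supports.

```python
def matches(metadata: dict, filter_spec: dict) -> bool:
    """Return True if `metadata` satisfies every `$eq` condition in `filter_spec`."""
    for field, condition in filter_spec.items():
        if set(condition) != {"$eq"}:
            raise NotImplementedError("this sketch only handles $eq")
        if field not in metadata or metadata[field] != condition["$eq"]:
            return False
    return True


records = [
    {"topic": "ml", "level": "beginner"},
    {"topic": "rag", "level": "advanced"},
]
beginner_only = [m for m in records if matches(m, {"level": {"$eq": "beginner"}})]
print(beginner_only)
```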

In [None]:
results = collection.query(
    vector=query_vector,
    top_k=3,
    filter={"level": {"$eq": "beginner"}},
)

print("Filtered to beginner-level docs only:\n")
for r in results:
    print(f"  [{r['id']}]  score={r['score']:.4f}  {r['metadata']}")

## Step 7: Hybrid search

Combine vector similarity with keyword matching. Useful when results should favor documents that contain specific terms.
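
The `rrf_score` label printed by this step's cell suggests that the two result lists are merged with reciprocal rank fusion (RRF). The sketch below shows plain, unweighted RRF in Python. It is an assumption about the general technique, not vxdb's actual implementation, and it does not model how `alpha` weights the two lists; the constant `k = 60` is a commonly used default, not a value taken from vxdb.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["rag", "vec-db", "ml-intro"]   # hypothetical vector-search order
keyword_ranking = ["rag", "transformers"]        # hypothetical keyword-search order
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused[0][0])
```

A document ranked highly in both lists accumulates two contributions, which is why it rises to the top of the fused ranking.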

In [None]:
results = collection.hybrid_query(
    vector=query_vector,
    query="retrieval augmented generation hallucination",
    top_k=3,
    alpha=0.5,
)

print("Hybrid search (vector + keyword):\n")
for r in results:
    print(f"  [{r['id']}]  rrf_score={r['score']:.4f}  {r['metadata']}")

## Using vxdb's EmbeddingFunction interface

For cleaner code, you can wrap the OpenAI client in vxdb's pluggable `EmbeddingFunction` base class:

In [None]:
from vxdb.embedding import EmbeddingFunction


class OpenAIEmbedding(EmbeddingFunction):
    """Reusable wrapper around OpenAI's embedding API."""

    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = OpenAI()
        self.model = model

    def embed(self, texts: list[str]) -> list[list[float]]:
        response = self.client.embeddings.create(input=texts, model=self.model)
        return [item.embedding for item in response.data]


# Usage:
embedder = OpenAIEmbedding()
vecs = embedder.embed(["This is a test sentence."])
print(f"OpenAIEmbedding produced a {len(vecs[0])}-dim vector")
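
For tests that should not call the OpenAI API, a deterministic stand-in with the same `embed` signature can be handy. The sketch below is hypothetical: `FakeEmbedding` deliberately does not subclass vxdb's `EmbeddingFunction` so that it runs without vxdb installed, and its vectors carry no semantic meaning.

```python
import hashlib


class FakeEmbedding:
    """Deterministic stand-in with an embed() method; vectors are not semantic."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Reuse the 32 digest bytes cyclically to fill `dim` values in [0, 1].
            vectors.append([digest[i % len(digest)] / 255.0 for i in range(self.dim)])
        return vectors
```

The same text always maps to the same vector, which keeps test assertions stable across runs.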