# Sentence Transformers + vxdb (Local, Free, No API Key)

This notebook shows how to generate embeddings **locally** using [sentence-transformers](https://www.sbert.net/) and store them in vxdb.

**Why sentence-transformers?**
- Runs entirely on your machine: no API key, and no internet connection once the model has been downloaded
- Free and open-source
- Great for privacy-sensitive data
- Many pre-trained models on Hugging Face

**Model used:** `all-MiniLM-L6-v2` (384 dimensions, ~80 MB, very fast)

Other good options:

| Model | Dimensions | Approx. size | Quality |
|-------|------|------|---------|
| `all-MiniLM-L6-v2` | 384 | 80 MB | Good (fast) |
| `all-mpnet-base-v2` | 768 | 420 MB | Better |
| `BAAI/bge-small-en-v1.5` | 384 | 130 MB | Great |
| `BAAI/bge-large-en-v1.5` | 1024 | 1.3 GB | Best |

**Prerequisites:**
```bash
pip install vxdb sentence-transformers
```

In [None]:
!pip install vxdb sentence-transformers -q

## Step 1: Load the model

The model is downloaded once from Hugging Face (~80 MB) and cached locally.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

EMBEDDING_DIM = model.get_sentence_embedding_dimension()
print(f"Model loaded: all-MiniLM-L6-v2 ({EMBEDDING_DIM} dimensions)")

## Step 2: Prepare documents

We'll use a small knowledge base about programming languages.

In [None]:
documents = [
    {"id": "python",     "text": "Python is a high-level interpreted language known for its readability and vast ecosystem of libraries for data science and web development."},
    {"id": "rust",       "text": "Rust is a systems programming language focused on safety, speed, and concurrency without a garbage collector."},
    {"id": "javascript", "text": "JavaScript is the language of the web, running in browsers and on servers via Node.js, powering interactive user interfaces."},
    {"id": "go",         "text": "Go is a statically typed compiled language designed at Google for simplicity, reliability, and efficient concurrency with goroutines."},
    {"id": "cpp",        "text": "C++ is a powerful systems language offering low-level memory control and high performance for games, databases, and operating systems."},
    {"id": "typescript", "text": "TypeScript extends JavaScript with static types, catching errors at compile time while remaining compatible with the JS ecosystem."},
    {"id": "java",       "text": "Java is a class-based object-oriented language that runs on the JVM, widely used in enterprise applications and Android development."},
    {"id": "swift",      "text": "Swift is Apple's modern programming language for iOS and macOS development, combining safety with high performance."},
]

texts = [d["text"] for d in documents]
print(f"Prepared {len(documents)} documents")

## Step 3: Embed and index

`model.encode()` returns a NumPy array with one row per input text. Calling `.tolist()` converts it to plain Python lists of floats for vxdb.

In [None]:
import vxdb

# Encode all documents at once (batched for speed)
vectors = model.encode(texts).tolist()  # numpy → list[list[float]]

db = vxdb.Database()
collection = db.create_collection(
    name="languages",
    dimension=EMBEDDING_DIM,
    metric="cosine",
)

collection.upsert(
    ids=[d["id"] for d in documents],
    vectors=vectors,
    documents=texts,  # store the raw text to enable hybrid search
)

print(f"Indexed {collection.count()} documents ({EMBEDDING_DIM}-dim local embeddings)")
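
Before calling `upsert`, it can help to confirm that every vector has the dimension the collection was created with. The helper below is a minimal sketch in plain Python. `validate_batch` is a hypothetical name introduced here and is not part of vxdb. Treating duplicate ids as an error is an assumption made for this sketch, not a documented vxdb requirement.

```python
def validate_batch(ids, vectors, dimension):
    """Raise ValueError if the batch looks malformed; return None otherwise.

    Hypothetical helper for illustration only, not part of vxdb.
    """
    if len(ids) != len(vectors):
        raise ValueError(f"{len(ids)} ids but {len(vectors)} vectors")
    # Assumption: duplicate ids within one batch are treated as a mistake here.
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids in batch")
    for doc_id, vec in zip(ids, vectors):
        if len(vec) != dimension:
            raise ValueError(
                f"vector for {doc_id!r} has {len(vec)} dims, expected {dimension}"
            )


# Toy 3-dimensional batch standing in for the real 384-dimensional embeddings.
validate_batch(
    ids=["python", "rust"],
    vectors=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    dimension=3,
)
```

In this notebook the equivalent call would be `validate_batch([d["id"] for d in documents], vectors, EMBEDDING_DIM)`, placed just before `collection.upsert(...)`.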

## Step 4: Semantic search

In [None]:
queries = [
    "What language is best for building fast, safe backend services?",
    "I want to build a web app with a nice UI",
    "Which language should I use for data analysis?",
]

for query in queries:
    query_vec = model.encode([query]).tolist()[0]
    results = collection.query(vector=query_vec, top_k=3)

    print(f"Q: {query}")
    for r in results:
        print(f"   → {r['id']:>12}  score={r['score']:.4f}")
    print()
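
To make the idea behind `metric="cosine"` concrete, here is a minimal brute-force sketch of cosine ranking in plain Python. It is not how vxdb is implemented, and whether vxdb reports a similarity or a distance as `score` is not covered here. The function names and the toy 2-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def brute_force_top_k(query_vec, ids, vectors, top_k=3):
    """Score every stored vector against the query and keep the best top_k."""
    scored = [
        {"id": doc_id, "score": cosine_similarity(query_vec, vec)}
        for doc_id, vec in zip(ids, vectors)
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


# Toy data: three 2-dimensional "documents" and one query.
toy_ids = ["a", "b", "c"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_results = brute_force_top_k([1.0, 0.1], toy_ids, toy_vectors, top_k=2)

for r in toy_results:
    print(f"{r['id']}  score={r['score']:.4f}")
```

The query points almost along the first axis, so `a` ranks first and `c` second, while `b` is nearly orthogonal and drops out of the top 2.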

## Step 5: Hybrid search

Combine the semantic understanding of the embedding model with exact keyword matching.

In [None]:
query = "garbage collector memory safety"
query_vec = model.encode([query]).tolist()[0]

# Compare: vector-only vs hybrid
print("Vector-only search:")
for r in collection.query(vector=query_vec, top_k=3):
    print(f"   → {r['id']:>12}  score={r['score']:.4f}")

print("\nHybrid search (alpha=0.5):")
for r in collection.hybrid_query(vector=query_vec, query=query, top_k=3, alpha=0.5):
    print(f"   → {r['id']:>12}  score={r['score']:.4f}")

print("\nKeyword-only search:")
for r in collection.keyword_search(query=query, top_k=3):
    print(f"   → {r['id']:>12}  score={r['score']:.4f}")
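
One common way to combine a vector score with a keyword score is a weighted sum of normalised scores. The sketch below illustrates that idea only. vxdb's actual fusion method may differ. The sketch assumes that `alpha` weights the vector side, so `alpha=1.0` would be vector-only. `blend_scores`, `min_max_normalise`, and the toy scores are hypothetical and not taken from vxdb.

```python
def min_max_normalise(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend_scores(vector_scores, keyword_scores, alpha=0.5):
    """Weighted sum of normalised scores; a missing score counts as 0.

    Assumption for this sketch: alpha weights the vector score and
    (1 - alpha) weights the keyword score.
    """
    vec = min_max_normalise(vector_scores)
    kw = min_max_normalise(keyword_scores)
    blended = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1.0 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(blended.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores for illustration; these are not real vxdb output.
toy_vector_scores = {"rust": 0.60, "cpp": 0.55, "go": 0.40}
toy_keyword_scores = {"rust": 2.0, "java": 0.5}

ranked = blend_scores(toy_vector_scores, toy_keyword_scores, alpha=0.5)
for doc_id, score in ranked:
    print(f"{doc_id:>6}  blended={score:.3f}")
```

Normalising first matters because the two score types usually live on different scales, and a raw sum would let whichever scale is larger dominate.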

## Bonus: Pluggable EmbeddingFunction wrapper

A small reusable class that wraps the model behind vxdb's `EmbeddingFunction` interface:

In [None]:
from sentence_transformers import SentenceTransformer
from vxdb.embedding import EmbeddingFunction


class SentenceTransformerEmbedding(EmbeddingFunction):
    """Local embedding using any sentence-transformers model."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts).tolist()


# Usage:
embedder = SentenceTransformerEmbedding()
vecs = embedder.embed(["This runs entirely on your machine."])
print(f"Local embedding: {len(vecs[0])} dimensions, no API key needed!")
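
Because the wrapper only needs an `embed` method that maps a list of strings to a list of float vectors, a cheap stand-in with the same method shape can be handy in unit tests where downloading a model is undesirable. The class below is a hypothetical sketch using the hashing trick. Its vectors carry no semantic meaning. It deliberately does not subclass `EmbeddingFunction`, so that it runs without vxdb installed.

```python
import hashlib
import math


class HashingEmbedding:
    """Deterministic pseudo-embeddings via the hashing trick (test stand-in only)."""

    def __init__(self, dimension: int = 8):
        self.dimension = dimension

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [self._embed_one(text) for text in texts]

    def _embed_one(self, text: str) -> list[float]:
        vec = [0.0] * self.dimension
        for token in text.lower().split():
            # Map each token to a bucket using a stable hash.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimension
            vec[bucket] += 1.0
        # L2-normalise so cosine comparisons behave sensibly; empty text stays all zeros.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec


fake_embedder = HashingEmbedding(dimension=8)
fake_vecs = fake_embedder.embed(["rust is fast", "python is readable"])
print(f"{len(fake_vecs)} vectors of {len(fake_vecs[0])} dimensions")
```

The same text always produces the same vector, which keeps test assertions stable across runs.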