# Hybrid Search Deep Dive

This notebook explores **when and why** to use hybrid search, and how to tune the `alpha` parameter for best results.

**Hybrid search** combines two retrieval strategies:
1. **Vector search** — finds semantically similar documents (meaning-based)
2. **Keyword search (BM25)** — finds documents containing exact terms (word-based)

Results are merged via **Reciprocal Rank Fusion (RRF)**, which combines rankings from both systems without needing to normalize scores.
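
To make the fusion step concrete, here is a minimal pure-Python sketch of RRF. It illustrates the general technique, not vxdb's internal implementation, which this notebook does not show. The function name `reciprocal_rank_fusion`, the constant `k=60` (a commonly used default), and the example rankings are all assumptions made for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one list using RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_ranking = ["MBA-M4", "T14s-G6", "MBP-M4"]   # hypothetical vector-search order
keyword_ranking = ["MBP-M4", "MBA-M4", "XPS-16"]   # hypothetical BM25 order

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Only rank positions enter the formula, which is why cosine similarities and BM25 scores never need to be put on a common scale.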

**When to use hybrid search:**
- Specific terms matter (product codes, proper nouns, error messages)
- Users mix natural language with exact keywords
- You can't predict whether a given query will be semantic or keyword-driven, and want one default that handles both

**No API keys needed** — this notebook uses sentence-transformers (local, free).

```bash
pip install vxdb sentence-transformers
```

In [None]:
!pip install vxdb sentence-transformers -q

## Setup: Build a product catalog

We'll use a realistic scenario — a tech product catalog where users search with a mix of natural language and specific product names/codes.

In [None]:
from sentence_transformers import SentenceTransformer
import vxdb

model = SentenceTransformer("all-MiniLM-L6-v2")

products = [
    {"id": "MBP-M4",    "text": "MacBook Pro M4 Max — 16-inch laptop with Apple silicon, 48GB unified memory, ideal for video editing and machine learning workloads.",    "category": "laptop",    "price": 3499},
    {"id": "RTX-5090",   "text": "NVIDIA GeForce RTX 5090 — flagship GPU with 32GB GDDR7, ray tracing, and DLSS 4 for gaming and AI inference at 4K resolution.",          "category": "gpu",       "price": 1999},
    {"id": "T14s-G6",    "text": "ThinkPad T14s Gen 6 — ultralight business laptop with AMD Ryzen AI processor, 14-inch OLED display, and all-day battery life.",           "category": "laptop",    "price": 1429},
    {"id": "H100-SXM",   "text": "NVIDIA H100 SXM — datacenter GPU designed for large-scale AI training, transformer models, and high-performance computing clusters.",     "category": "gpu",       "price": 30000},
    {"id": "MBA-M4",     "text": "MacBook Air M4 — ultra-thin 13-inch laptop with fanless design, 18-hour battery, perfect for students and everyday productivity.",         "category": "laptop",    "price": 1099},
    {"id": "4090D",      "text": "NVIDIA GeForce RTX 4090 — previous generation flagship with 24GB GDDR6X, excellent for deep learning research and 4K gaming.",             "category": "gpu",       "price": 1599},
    {"id": "XPS-16",     "text": "Dell XPS 16 — premium 16-inch laptop with Intel Core Ultra, NVIDIA RTX 4070, and a stunning 4K OLED touchscreen for creative professionals.","category": "laptop",  "price": 2199},
    {"id": "A6000",      "text": "NVIDIA RTX A6000 — professional workstation GPU with 48GB GDDR6 for CAD, 3D rendering, scientific visualization, and AI development.",     "category": "gpu",       "price": 4650},
    {"id": "FW-16",      "text": "Framework Laptop 16 — modular, repairable laptop with swappable GPU module, mechanical keyboard, and open-source firmware.",                "category": "laptop",    "price": 1399},
    {"id": "RX-9070",    "text": "AMD Radeon RX 9070 XT — mid-range GPU with 16GB GDDR6 and FSR 4 upscaling, competitive performance for 1440p gaming.",                     "category": "gpu",       "price": 549},
]

texts = [p["text"] for p in products]
vectors = model.encode(texts).tolist()

db = vxdb.Database()
# dimension must match the embedding size: all-MiniLM-L6-v2 produces 384-dimensional vectors
collection = db.create_collection("products", dimension=384, metric="cosine")

collection.upsert(
    ids=[p["id"] for p in products],
    vectors=vectors,
    metadata=[{"category": p["category"], "price": p["price"]} for p in products],
    documents=texts,
)

print(f"Indexed {collection.count()} products")

## Experiment 1: When vector search alone falls short

Embeddings capture *meaning* but may miss specific product codes or model names. Let's see what happens when a user searches for a specific product.

In [None]:
def compare_search(query: str, top_k: int = 5):
    """Run all three search modes and compare results side by side."""
    q_vec = model.encode([query]).tolist()[0]

    print(f"Query: '{query}'\n")

    vec_results = collection.query(vector=q_vec, top_k=top_k)
    kw_results = collection.keyword_search(query=query, top_k=top_k)
    hybrid_results = collection.hybrid_query(vector=q_vec, query=query, top_k=top_k, alpha=0.5)

    print(f"{'Rank':<6} {'Vector Search':<20} {'Keyword Search':<20} {'Hybrid (α=0.5)':<20}")
    print("-" * 66)
    for i in range(top_k):
        v = vec_results[i]["id"] if i < len(vec_results) else "-"
        k = kw_results[i]["id"] if i < len(kw_results) else "-"
        h = hybrid_results[i]["id"] if i < len(hybrid_results) else "-"
        print(f"  {i+1:<4} {v:<20} {k:<20} {h:<20}")
    print()


# A user searching for a specific product code
compare_search("RTX 5090")

## Experiment 2: Natural language queries

For broad, meaning-based queries, vector search tends to do well. Let's see if hybrid still adds value.

In [None]:
compare_search("lightweight laptop for students with long battery life")
compare_search("best GPU for training large transformer models")

## Experiment 3: Tuning alpha

Let's sweep across different `alpha` values to see how the ranking changes for a mixed query, one that carries both semantic meaning and specific terms. Here `alpha=0.0` is pure keyword search and `alpha=1.0` is pure vector search.

In [None]:
query = "MacBook for machine learning"
q_vec = model.encode([query]).tolist()[0]

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]

print(f"Query: '{query}'\n")
header = f"{'Rank':<6}" + "".join(f"{'α=' + str(a):<15}" for a in alphas)
print(header)
print("-" * len(header))

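# Run each alpha once up front instead of re-querying for every table row
results_by_alpha = {
    a: collection.hybrid_query(vector=q_vec, query=query, top_k=5, alpha=a)
    for a in alphas
}

for rank in range(5):
    row = f"  {rank+1:<4}"
    for a in alphas:
        results = results_by_alpha[a]
        doc_id = results[rank]["id"] if rank < len(results) else "-"
        row += f"{doc_id:<15}"
    print(row)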

print("""
What to look for:
• α=0.0 (pure keyword): Favors docs containing the literal terms "MacBook" and "machine learning"
• α=1.0 (pure vector): Favors semantically similar docs (may miss exact term matches)
• α=0.5 (balanced):    Docs that match on both meaning AND keywords tend to rank higher
""")

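One plausible way for `alpha` to enter the fusion is as a weight on each ranker's reciprocal-rank contribution. The sketch below assumes that formulation: `weighted_rrf` is a hypothetical helper, `k=60` is an assumed constant, and the two rankings are made up. It is not confirmed to be how `hybrid_query` computes its scores, but it reproduces the endpoints of the sweep: `alpha=0.0` returns the keyword order and `alpha=1.0` returns the vector order.

```python
def weighted_rrf(vector_ranking, keyword_ranking, alpha=0.5, k=60):
    """Blend two ranked ID lists; alpha weights the vector side, 1 - alpha the keyword side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scores = {}
    for weight, ranking in ((alpha, vector_ranking), (1.0 - alpha, keyword_ranking)):
        if weight == 0.0:
            continue  # a zero-weight ranker contributes nothing, so leave its docs out
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest blended score first; ties broken by ID for a deterministic order
    return sorted(scores, key=lambda d: (-scores[d], d))

vec = ["MBP-M4", "H100-SXM", "4090D"]   # hypothetical vector-search order
kw = ["MBA-M4", "MBP-M4", "4090D"]      # hypothetical keyword-search order

for a in (0.0, 0.5, 1.0):
    print(a, weighted_rrf(vec, kw, alpha=a))
```

At `alpha=0.5` the documents that appear in both lists rise above those found by only one ranker, which is the behavior the sweep is meant to surface.
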
## Experiment 4: Hybrid + metadata filters

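Free-text queries can't enforce hard constraints such as a price ceiling, so pair search with metadata filters for precise retrieval. The cell below applies a category and price filter to vector search and contrasts it with an unfiltered keyword search.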

In [None]:
# Filtered vector search vs. unfiltered keyword search for a price-constrained GPU query
query = "best GPU for deep learning under 2000 dollars"
q_vec = model.encode([query]).tolist()[0]

# Vector + filter
print("Vector search (GPUs under $2000):")
for r in collection.query(
    vector=q_vec, top_k=5,
    filter={"$and": [{"category": {"$eq": "gpu"}}, {"price": {"$lte": 2000}}]}
):
    print(f"  → {r['id']:>10}  score={r['score']:.4f}  ${r['metadata']['price']}")

# Unfiltered keyword search, for comparison
print("\nKeyword search (all):")
for r in collection.keyword_search(query="deep learning GPU", top_k=5):
    print(f"  → {r['id']:>10}  score={r['score']:.4f}")
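
This notebook does not show whether `hybrid_query` accepts a `filter` argument, so one approach that does not depend on it is to over-fetch hybrid results and filter them in Python. The sketch below assumes hybrid results carry a `metadata` dict like the `query` results above. `post_filter` is a hypothetical helper, and the sample results are made up so the block runs without a database.

```python
def post_filter(results, category=None, max_price=None):
    """Keep results whose metadata satisfies the constraints, preserving rank order."""
    kept = []
    for r in results:
        meta = r.get("metadata", {})
        if category is not None and meta.get("category") != category:
            continue
        # A missing price is treated as failing any price ceiling
        if max_price is not None and meta.get("price", float("inf")) > max_price:
            continue
        kept.append(r)
    return kept

# Hypothetical hybrid results, shaped like the result dicts used in this notebook
hybrid_results = [
    {"id": "H100-SXM", "metadata": {"category": "gpu", "price": 30000}},
    {"id": "4090D",    "metadata": {"category": "gpu", "price": 1599}},
    {"id": "MBP-M4",   "metadata": {"category": "laptop", "price": 3499}},
    {"id": "RTX-5090", "metadata": {"category": "gpu", "price": 1999}},
]

cheap_gpus = post_filter(hybrid_results, category="gpu", max_price=2000)
print([r["id"] for r in cheap_gpus])
```

Post-filtering can leave fewer than `top_k` results, so request more candidates than you need (for example, `top_k=20` when you want 5) before filtering.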

## Summary: When to use what

| Search Mode | Best For | Example Query |
|------------|---------|---------------|
| **Vector only** (`query`) | Broad semantic questions | "lightweight laptop for students" |
| **Keyword only** (`keyword_search`) | Exact terms, codes, names | "RTX 5090" |
| **Hybrid** (`hybrid_query`) | Mixed intent, best default | "MacBook for machine learning" |
| **Filtered** (`query` + `filter`) | Constrained search | "GPU under $2000" |

**Rule of thumb:** Start with `hybrid_query(alpha=0.5)` as your default. If you find it's returning too many irrelevant keyword matches, increase `alpha`. If users complain about missing exact matches, decrease `alpha`.
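
If you have even a handful of queries with a known correct product, the rule of thumb can be checked with a small measurement instead of guesswork. The sketch below is one way to do that, under stated assumptions: `hit_rate_by_alpha` is a hypothetical helper, `search` is any callable returning a list of IDs, and `fake_search` is a stand-in with made-up behavior so the block runs without a database.

```python
def hit_rate_by_alpha(search, labeled_queries, alphas, top_k=5):
    """Return {alpha: fraction of queries whose expected ID appears in the top_k results}."""
    rates = {}
    for alpha in alphas:
        hits = sum(
            expected_id in search(query, alpha, top_k)
            for query, expected_id in labeled_queries
        )
        rates[alpha] = hits / len(labeled_queries)
    return rates

def fake_search(query, alpha, top_k):
    """Stand-in for a real search call: keyword hits vanish at alpha=1, vector hits at alpha=0."""
    keyword_hits = {"RTX 5090": ["RTX-5090"], "student laptop": []}
    vector_hits = {"RTX 5090": ["4090D"], "student laptop": ["MBA-M4"]}
    ids = []
    if alpha < 1.0:
        ids += keyword_hits[query]
    if alpha > 0.0:
        ids += vector_hits[query]
    return ids[:top_k]

labeled = [("RTX 5090", "RTX-5090"), ("student laptop", "MBA-M4")]
rates = hit_rate_by_alpha(fake_search, labeled, alphas=[0.0, 0.5, 1.0])
print(rates)
```

To run this against the collection built above, replace `fake_search` with a function that encodes the query, calls `collection.hybrid_query(vector=..., query=..., top_k=..., alpha=...)`, and returns the `"id"` of each result.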