# Cohere Embeddings + vxdb

This notebook shows how to generate embeddings with **Cohere's API** and store/search them in vxdb.

**Why Cohere?**
- `embed-v4.0` supports 100+ languages out of the box
- Separate `input_type` for documents vs queries (improves retrieval quality)
- Free trial API keys (capped at 1,000 API calls/month)

**Model used:** `embed-v4.0` at 1024 dimensions (the model also supports 256, 512, and 1536, which is its default)

| Model | Dims | Languages | Notes |
|-------|------|-----------|-------|
| `embed-v4.0` | 256 / 512 / 1024 / 1536 (default) | 100+ | Latest, best quality |
| `embed-english-v3.0` | 1024 | English | Slightly faster |
| `embed-multilingual-v3.0` | 1024 | 100+ | Previous gen |

**Prerequisites:**
```bash
pip install vxdb cohere
```

**You'll need:** a Cohere API key → https://dashboard.cohere.com/api-keys

In [None]:
!pip install vxdb cohere -q

## Step 1: Set up the Cohere client

In [None]:
import os
import cohere

# Read the key from the COHERE_API_KEY env var, or pass api_key="..." directly.
# ClientV2 is used because its embed() accepts output_dimension.
co = cohere.ClientV2(api_key=os.environ.get("COHERE_API_KEY"))

EMBEDDING_MODEL = "embed-v4.0"
EMBEDDING_DIM = 1024  # embed-v4.0 defaults to 1536, so 1024 is requested explicitly

## Step 2: Define embedding helpers

Cohere's API uses `input_type` to distinguish documents (being indexed) from queries (being searched).
This asymmetric encoding improves retrieval quality. Both helpers also pass `output_dimension` so the returned vectors match the collection's dimension.

In [None]:
def embed_documents(texts: list[str]) -> list[list[float]]:
    """Embed texts for indexing (input_type='search_document')."""
    response = co.embed(
        texts=texts,
        model=EMBEDDING_MODEL,
        input_type="search_document",
        embedding_types=["float"],
        output_dimension=EMBEDDING_DIM,
    )
    return [list(e) for e in response.embeddings.float_]


def embed_query(text: str) -> list[float]:
    """Embed a single query (input_type='search_query')."""
    response = co.embed(
        texts=[text],
        model=EMBEDDING_MODEL,
        input_type="search_query",
        embedding_types=["float"],
        output_dimension=EMBEDDING_DIM,
    )
    return list(response.embeddings.float_[0])


# Quick test
test_vec = embed_query("hello world")
print(f"Cohere embedding dimension: {len(test_vec)}")
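
Larger corpora usually have to be embedded in batches, because embedding endpoints cap the number of texts per request. Cohere's documentation lists 96 texts per call at the time of writing; treat that number as an assumption and check the current limits. The sketch below is a hypothetical `embed_in_batches` helper (not part of the Cohere SDK) that takes any function with the same shape as `embed_documents` and calls it chunk by chunk. A stand-in `fake_embed` is used so the example runs without an API key.

```python
from typing import Callable

# Assumed per-request cap; verify against Cohere's current documentation.
MAX_TEXTS_PER_CALL = 96


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = MAX_TEXTS_PER_CALL,
) -> list[list[float]]:
    """Call embed_fn on consecutive chunks of texts and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        vectors.extend(embed_fn(chunk))
    return vectors


# Stand-in embedder for illustration: one 2-d vector per text, no API call.
def fake_embed(chunk: list[str]) -> list[list[float]]:
    return [[float(len(t)), 1.0] for t in chunk]


demo_vectors = embed_in_batches([f"doc {i}" for i in range(200)], fake_embed)
print(len(demo_vectors))
```

In this notebook the equivalent call would be `embed_in_batches(texts, embed_documents)`; with only six documents a single request is enough.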

## Step 3: Index documents

In [None]:
import vxdb

documents = [
    {"id": "paris",   "text": "Paris is the capital of France, known for the Eiffel Tower, world-class museums, and its café culture.",          "lang": "en"},
    {"id": "tokyo",   "text": "Tokyo is Japan's bustling capital, blending ultramodern skyscrapers with historic temples and gardens.",             "lang": "en"},
    {"id": "berlin",  "text": "Berlin ist die Hauptstadt Deutschlands, bekannt für ihre Geschichte, Kultur und lebendige Kunstszene.",             "lang": "de"},
    {"id": "madrid",  "text": "Madrid es la capital de España, famosa por el Museo del Prado, el Parque del Retiro y su animada vida nocturna.",  "lang": "es"},
    {"id": "nyc",     "text": "New York City is the largest city in the US, home to Times Square, Central Park, and the Statue of Liberty.",       "lang": "en"},
    {"id": "saopaulo","text": "São Paulo é a maior cidade do Brasil, conhecida por sua diversidade cultural, gastronomia e vida noturna.",         "lang": "pt"},
]

texts = [d["text"] for d in documents]
vectors = embed_documents(texts)

# Use path= for persistent storage: vxdb.Database(path="./my_data")
db = vxdb.Database()
collection = db.create_collection("cities", dimension=EMBEDDING_DIM, metric="cosine")

collection.upsert(
    ids=[d["id"] for d in documents],
    vectors=vectors,
    metadata=[{"lang": d["lang"]} for d in documents],
    documents=texts,
)

print(f"Indexed {collection.count()} multilingual documents")
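
The collection above is created with `metric="cosine"`. As a reminder of what that metric measures, here is a minimal pure-Python sketch of the standard cosine similarity formula: the dot product of two vectors divided by the product of their lengths. It illustrates the formula only; how vxdb computes or reports its `score` internally is not shown in this notebook, and `cosine_similarity` is a name introduced here for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction but different length: length does not affect the result.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors share no direction.
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because only direction matters, two texts of very different length can still score as close neighbors if their embeddings point the same way.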

## Step 4: Cross-lingual search

Because Cohere's model is multilingual, you can query in one language and find results in another.

In [None]:
# Search in English — should find results across ALL languages
queries = [
    "famous museums and art galleries in European capitals",
    "biggest city in South America",
    "Japanese culture and temples",
]

for query in queries:
    q_vec = embed_query(query)
    results = collection.query(vector=q_vec, top_k=3)

    print(f"Q: {query}")
    for r in results:
        print(f"   → {r['id']:>10}  score={r['score']:.4f}  lang={r['metadata']['lang']}")
    print()
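
Conceptually, a `top_k` query ranks the stored vectors by similarity to the query vector and keeps the best `k`. The brute-force sketch below does exactly that on toy 2-d vectors: it normalizes each vector so that a plain dot product equals cosine similarity, scores every stored item, and keeps the highest scores. `brute_force_top_k` and `toy_index` are hypothetical names for illustration; this is not how vxdb is implemented.

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def brute_force_top_k(
    query: list[float],
    stored: dict[str, list[float]],
    k: int,
) -> list[tuple[str, float]]:
    """Score every stored vector against the query; return the k best (id, score) pairs."""
    q = normalize(query)
    scored = (
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in stored.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-d vectors standing in for real embeddings.
toy_index = {
    "east": [1.0, 0.0],
    "north": [0.0, 1.0],
    "northeast": [1.0, 1.0],
}

for doc_id, score in brute_force_top_k([1.0, 0.2], toy_index, k=2):
    print(f"{doc_id:>10}  score={score:.4f}")
```

On a collection as small as the six cities here, a check like this can serve as a sanity test against `collection.query`; if a database uses an approximate index, its ranking may differ slightly from the exhaustive one.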

## Step 5: Filtered + hybrid search

In [None]:
query = "capital city with good nightlife"
q_vec = embed_query(query)

# Filter to English-language docs only
print("Filtered to English only:")
for r in collection.query(vector=q_vec, top_k=3, filter={"lang": {"$eq": "en"}}):
    print(f"   → {r['id']:>10}  score={r['score']:.4f}")

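# Hybrid: vector + keyword. The keyword side presumably matches literal terms such as "capital";
# the vector side is what surfaces cross-lingual matches like "vida nocturna".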
print("\nHybrid search:")
for r in collection.hybrid_query(vector=q_vec, query="nightlife capital", top_k=3, alpha=0.5):
    print(f"   → {r['id']:>10}  score={r['score']:.4f}")
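
The `alpha=0.5` argument suggests a weighting between the vector score and the keyword score. The notebook does not say how vxdb fuses the two, so the sketch below shows one common scheme, a convex combination `alpha * vector_score + (1 - alpha) * keyword_score`, purely as an assumption. `blend_scores` and the toy scores are hypothetical, and which end of the `alpha` range favors the vector side in vxdb should be checked in its documentation.

```python
def blend_scores(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float,
) -> list[tuple[str, float]]:
    """Rank ids by alpha * vector_score + (1 - alpha) * keyword_score.

    Ids missing from one of the inputs get 0.0 for that component.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    ids = set(vector_scores) | set(keyword_scores)
    blended = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(blended.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up scores, both assumed to be scaled to the 0..1 range already.
toy_vector_scores = {"madrid": 0.80, "berlin": 0.70, "tokyo": 0.40}
toy_keyword_scores = {"madrid": 0.10, "tokyo": 0.90}

for alpha in (0.0, 0.5, 1.0):
    ranking = [doc_id for doc_id, _ in blend_scores(toy_vector_scores, toy_keyword_scores, alpha)]
    print(alpha, ranking)
```

Under this scheme the two score sets need comparable scales before blending, which is why the toy inputs are both kept between 0 and 1.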