# Embeddings and Vector Stores

In the previous notebook, we learned how to generate embeddings and calculate similarity manually. Real-world applications often deal with millions of documents. A linear search computes the similarity between the query and every stored vector, so the cost of each query grows in proportion to the size of the collection, which is too slow at that scale.

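**Vector Stores** (or Vector Databases) are specialized systems designed to store embeddings and perform efficient similarity search. They typically rely on approximate nearest-neighbor (ANN) indexes such as HNSW, which trade a small amount of accuracy for much faster lookups.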
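
To make the cost of the linear approach concrete, here is a minimal sketch of a brute-force search in plain Python. The function names and the 3-dimensional toy vectors are invented for illustration; real embeddings have hundreds of dimensions, and this is the baseline a vector store is meant to replace, not how one works internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def linear_search(query, vectors, k=2):
    """Score every stored vector against the query: O(N) work per query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Toy "embeddings" (made-up values, for illustration only)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

top = linear_search([1.0, 0.05, 0.0], toy_vectors, k=2)
print(top)  # indices of the two most similar vectors
```

Every query touches every vector, so doubling the collection doubles the work per query.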

In [None]:
%pip install chromadb sentence-transformers

## 1. Setting up ChromaDB

We will use **ChromaDB**, an open-source embedding database. It is easy to run locally.

In [None]:
import chromadb

# Create an ephemeral client (in-memory, data lost on restart)
# For persistence, use: client = chromadb.PersistentClient(path="./chroma_db")
client = chromadb.Client()

## 2. Creating a Collection

A collection in Chroma is similar to a table in a SQL database. It holds your documents, embeddings, and metadata.

In [None]:
# Create or get a collection
# Note: Chroma uses 'sentence-transformers/all-MiniLM-L6-v2' by default if no embedding function is specified.
collection = client.get_or_create_collection(name="my_knowledge_base")

## 3. Adding Documents

We can add raw text to the collection. Chroma will automatically generate embeddings for us using the default model.

In [None]:
documents = [
    "Python is a high-level, general-purpose programming language.",
    "ChromaDB is an open-source vector database for AI applications.",
    "Embeddings are dense vector representations of data.",
    "JavaScript is often used for web development.",
    "Docker is a platform for developing, shipping, and running applications."
]

metadatas = [
    {"category": "language"},
    {"category": "tool"},
    {"category": "concept"},
    {"category": "language"},
    {"category": "tool"}
]

ids = ["doc1", "doc2", "doc3", "doc4", "doc5"]

# Add data to the collection (each ID must be unique within the collection)
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

print(f"Collection count: {collection.count()}")

## 4. Semantic Search

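We can now query the collection using natural language. Chroma embeds the query text with the collection's embedding function and returns the `n_results` closest documents.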

In [None]:
results = collection.query(
    query_texts=["How to store vectors?"],
    n_results=2
)

print("Query: How to store vectors?")
# Each field holds one list per query text; [0] selects the results for our single query
for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0]), start=1):
    print(f"Result {i}: {doc} (Metadata: {meta})")
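
The result is a dictionary of nested lists, and by default it should also carry a `distances` field (smaller means closer). A small helper can reshape it into one record per hit. This is a sketch: `flatten_query_result` is a hypothetical helper, and `sample` is a hand-written stand-in with the same shape as a query result, with invented distances rather than real output.

```python
def flatten_query_result(result, query_index=0):
    """Turn the list-of-lists result for one query into one dict per hit."""
    source_keys = ("ids", "documents", "metadatas", "distances")
    record_keys = ("id", "document", "metadata", "distance")
    columns = [result[key][query_index] for key in source_keys]
    return [dict(zip(record_keys, row)) for row in zip(*columns)]

# Hand-written stand-in for a query result (distances are invented)
sample = {
    "ids": [["doc2", "doc3"]],
    "documents": [[
        "ChromaDB is an open-source vector database for AI applications.",
        "Embeddings are dense vector representations of data.",
    ]],
    "metadatas": [[{"category": "tool"}, {"category": "concept"}]],
    "distances": [[0.91, 1.12]],
}

rows = flatten_query_result(sample)
for hit in rows:
    print(f"{hit['id']}  distance={hit['distance']:.2f}  {hit['document']}")
```

With a live collection you would pass `results` from the cell above instead of `sample`.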

## 5. Filtering with Metadata

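A powerful feature of vector stores is the ability to combine semantic search with metadata filtering. The `where` argument restricts the search to documents whose metadata matches the filter, and only those documents are ranked by similarity.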

In [None]:
# Search for 'programming' but only in the 'tool' category (documents in the 'language' category are excluded by the filter)
results_filtered = collection.query(
    query_texts=["programming"],
    n_results=2,
    where={"category": "tool"}
)

print("\nQuery: 'programming' with filter category='tool'")
for i, (doc, meta) in enumerate(zip(results_filtered['documents'][0], results_filtered['metadatas'][0]), start=1):
    print(f"Result {i}: {doc} (Metadata: {meta})")
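
Besides plain equality, Chroma's `where` filters accept operator forms such as `$eq`, `$ne`, `$in`, `$nin`, `$and`, and `$or`. The tiny evaluator below is not Chroma's code. It is a hypothetical helper that mimics those semantics in plain Python, for metadata keys that are present, so that you can check what a filter expression would keep without touching the database.

```python
# Comparison operators supported by this sketch (a subset of Chroma's)
OPERATORS = {
    "$eq": lambda value, operand: value == operand,
    "$ne": lambda value, operand: value != operand,
    "$in": lambda value, operand: value in operand,
    "$nin": lambda value, operand: value not in operand,
}

def matches(metadata, where):
    """Return True if `metadata` satisfies a Chroma-style `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif isinstance(condition, dict):
            value = metadata.get(key)
            # An unknown operator raises KeyError instead of passing silently
            ok = all(OPERATORS[op](value, operand) for op, operand in condition.items())
        else:
            # {"category": "tool"} is shorthand for {"category": {"$eq": "tool"}}
            ok = metadata.get(key) == condition
        if not ok:
            return False
    return True

# Same metadata values as the five documents added earlier
sample_metadatas = [
    {"category": "language"},
    {"category": "tool"},
    {"category": "concept"},
    {"category": "language"},
    {"category": "tool"},
]

where = {"category": {"$in": ["tool", "concept"]}}
kept = [i for i, meta in enumerate(sample_metadatas) if matches(meta, where)]
print(kept)  # positions of the documents that pass the filter
```

The same dictionary can be passed as `where=` to `collection.query`, which then ranks only the documents that pass the filter.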