# ChromaDB Semantic Search

This notebook demonstrates how to use ChromaDB, an open-source vector database, to build a semantic search system.

## Learning Objectives

By the end of this notebook, you will be able to:
- Initialize ChromaDB clients and create collections
- Add documents with embeddings to vector databases
- Perform semantic searches using vector similarity
- Filter results using metadata and content queries
- Update and delete documents in collections

## What is ChromaDB?

ChromaDB is an open-source vector database designed for:
- Storing document embeddings efficiently
- Fast semantic similarity search
- Metadata filtering and hybrid queries
- Simple Python API for easy integration

## Setup: Install Required Libraries

In [None]:
import os
os.environ['UV_LINK_MODE'] = 'copy'

# Note: chromadb is not installed here; this assumes the environment already provides it
!uv pip install accelerate==1.6.0 sentence-transformers==4.0.2

print("✓ Required libraries installed successfully!")

In [None]:
import chromadb
from chromadb.utils import embedding_functions

print("✓ Libraries imported successfully!")

In [None]:
# Initialize ChromaDB client (in-memory)
print("Initializing ChromaDB client...")
client = chromadb.Client()

# Create embedding function
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

print("✓ ChromaDB client initialized!")
print("  Embedding model: all-MiniLM-L6-v2 (384 dimensions)")

## Create Collection

**Collections** group related documents in ChromaDB, much as tables group rows in a relational database.

In [None]:
# Create the collection, or reuse it if this cell is re-run
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=embedding_function
)

print("✓ Collection 'documents' created successfully!")

## Helper Function for Displaying Results

In [None]:
def display_results(results):
    """Display the results for the first query of a ChromaDB search in a readable format."""
    print("\nSearch Results:")
    print("=" * 80)
    for i, (doc, doc_id, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['ids'][0],
        results['metadatas'][0] if results['metadatas'] else [None] * len(results['ids'][0]),
        results['distances'][0]
    ), 1):
        print(f"\n{i}. Document: {doc}")
        print(f"   ID: {doc_id}")
        if metadata:
            print(f"   Metadata: {metadata}")
        print(f"   Distance: {distance:.4f} (lower = more similar)")
    print("\n" + "=" * 80)

print("✓ Helper function defined!")
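
The helper above indexes every field with `[0]` because query results are nested: each field holds one inner list per query text. The sketch below illustrates that shape with a hand-written dictionary. The IDs, texts, and distances are placeholders rather than output from a real collection, and `hits_for_query` is a hypothetical helper name.

```python
# Hand-written stand-in for a query result covering two query texts.
# All values are placeholders, not output from a real collection.
sample_results = {
    "ids": [["doc4", "doc8"], ["doc3"]],
    "documents": [
        ["placeholder text A", "placeholder text B"],
        ["placeholder text C"],
    ],
    "distances": [[0.9, 1.2], [0.7]],
}


def hits_for_query(results, query_index=0):
    """Return a list of (id, document, distance) tuples for one query."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ))


first_query_hits = hits_for_query(sample_results, 0)
second_query_hits = hits_for_query(sample_results, 1)
print(first_query_hits)
print(second_query_hits)
```

When a search passes a single query text, only index `0` is populated, which is the case `display_results` handles.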

## Basic Vector Operations

This section adds documents to the collection and runs a simple semantic search over them.

In [None]:
print("=" * 80)
print("BASIC VECTOR OPERATIONS")
print("=" * 80)

# Example documents
documents = [
    "The quick brown fox jumps over the lazy dog",
    "A man is walking his dog in the park",
    "The weather is sunny and warm today",
    "Artificial intelligence is transforming the technology landscape",
    "Vector databases are essential for semantic search applications",
    "Deep learning models require substantial computational resources",
    "The city skyline looks beautiful at sunset",
    "Machine learning algorithms find patterns in data"
]
ids = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"]

# Add documents
print(f"\nAdding {len(documents)} documents to collection...")
collection.add(documents=documents, ids=ids)
print(f"✓ Documents added! Collection count: {collection.count()}")

# Perform semantic search
query_text = "AI and technology trends"
print(f"\nSearching for: '{query_text}'")
results = collection.query(query_texts=[query_text], n_results=3)

display_results(results)
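
To make the "lower = more similar" reading of the distance concrete, here is a minimal sketch of nearest-neighbor ranking over toy vectors. It assumes a squared Euclidean (L2) distance; the metric a real collection uses depends on its configuration. The three-dimensional vectors and the names `squared_l2` and `top_k` are invented for illustration, and real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 3-dimensional "embeddings" (invented values for illustration only)
toy_vectors = {
    "tech_doc": [0.9, 0.1, 0.0],
    "pet_doc": [0.1, 0.9, 0.1],
    "weather_doc": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]


def top_k(query, vectors, k):
    """Return the k (name, distance) pairs closest to the query."""
    scored = [(squared_l2(query, vec), name) for name, vec in vectors.items()]
    return [(name, dist) for dist, name in sorted(scored)[:k]]


ranking = top_k(toy_query, toy_vectors, k=2)
for name, dist in ranking:
    print(f"{name}: {dist:.4f}")
```

A vector database performs the same ranking, but typically with an approximate index so that it does not have to compare the query against every stored vector.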

## Working with Metadata and Filtering

ChromaDB allows attaching metadata to documents and filtering searches based on this metadata.

In [None]:
print("=" * 80)
print("METADATA AND FILTERING")
print("=" * 80)

# Create a second collection for documents with metadata, or reuse it if this cell is re-run
filtered_docs_collection = client.get_or_create_collection(
    name="filtered_documents",
    embedding_function=embedding_function
)

# Metadata for documents
metadatas = [
    {"category": "animal", "length": "short", "year": 2021},
    {"category": "lifestyle", "length": "short", "year": 2022},
    {"category": "weather", "length": "short", "year": 2023},
    {"category": "technology", "length": "medium", "year": 2023},
    {"category": "technology", "length": "medium", "year": 2024},
    {"category": "technology", "length": "long", "year": 2024},
    {"category": "travel", "length": "short", "year": 2023},
    {"category": "technology", "length": "medium", "year": 2024}
]

print("\nAdding documents with metadata...")
filtered_docs_collection.add(documents=documents, ids=ids, metadatas=metadatas)
print(f"✓ Added {filtered_docs_collection.count()} documents with metadata")

## Simple Metadata Filtering

In [None]:
# Simple metadata filter
print("\n" + "─" * 80)
print("Filter: category='technology'")
print("─" * 80)
results = filtered_docs_collection.query(
    query_texts=["AI advancements"],
    n_results=3,
    where={"category": "technology"}
)
display_results(results)

## Complex Metadata Filtering

Logical operators such as `$and` combine several conditions into a single filter.

In [None]:
# Complex filter with AND logic
print("\n" + "─" * 80)
print("Complex Filter: category='technology' AND year=2024")
print("─" * 80)
results = filtered_docs_collection.query(
    query_texts=["AI advancements"],
    n_results=3,
    where={"$and": [
        {"category": {"$eq": "technology"}},
        {"year": {"$eq": 2024}}
    ]}
)
display_results(results)
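
The filter syntax used above can be read as a small expression language over each document's metadata. The sketch below is a toy evaluator written for illustration, not ChromaDB's implementation. It assumes the shorthand form `{"category": "technology"}` means equality, and that `$and` and `$or` combine sub-filters. `matches` and `COMPARATORS` are hypothetical names, and the metadata rows are a subset of the ones defined earlier.

```python
import operator

# Comparison operators supported by this toy evaluator
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$eq": 2024}}
            for op, expected in condition.items():
                if key not in metadata or not COMPARATORS[op](metadata[key], expected):
                    return False
        elif metadata.get(key) != condition:
            # Shorthand form, e.g. {"category": "technology"}
            return False
    return True


toy_metadatas = {
    "doc4": {"category": "technology", "year": 2023},
    "doc5": {"category": "technology", "year": 2024},
    "doc7": {"category": "travel", "year": 2023},
}

and_filter = {"$and": [
    {"category": {"$eq": "technology"}},
    {"year": {"$eq": 2024}},
]}
or_filter = {"$or": [{"category": "travel"}, {"year": {"$gte": 2024}}]}

and_ids = [i for i, m in toy_metadatas.items() if matches(m, and_filter)]
or_ids = [i for i, m in toy_metadatas.items() if matches(m, or_filter)]
shorthand_ids = [i for i, m in toy_metadatas.items()
                 if matches(m, {"category": "technology"})]
print(and_ids, or_ids, shorthand_ids)
```

Only documents that pass the filter are candidates for the similarity ranking, so a narrow filter can return fewer than `n_results` hits.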

## Content-Based Filtering

In [None]:
# Content-based filter
print("\n" + "─" * 80)
print("Content Filter: Documents containing 'Artificial intelligence'")
print("─" * 80)
results = filtered_docs_collection.query(
    query_texts=["AI advancements"],
    n_results=3,
    where_document={"$contains": "Artificial intelligence"}
)
display_results(results)
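
A `where_document` filter restricts candidates by their text rather than their metadata. The sketch below models `$contains` as a plain substring test, assuming the match is case-sensitive; it is worth verifying this against the ChromaDB version in use. `contains_filter` is a hypothetical helper, and the two documents are taken from the list defined earlier.

```python
toy_documents = {
    "doc4": "Artificial intelligence is transforming the technology landscape",
    "doc8": "Machine learning algorithms find patterns in data",
}


def contains_filter(documents, needle):
    """Return the IDs of documents whose text contains the exact substring."""
    return [doc_id for doc_id, text in documents.items() if needle in text]


exact_case = contains_filter(toy_documents, "Artificial intelligence")
lower_case = contains_filter(toy_documents, "artificial intelligence")
print(exact_case)
print(lower_case)
```

Under that assumption, a query for "AI advancements" would still rank `doc8` as semantically related, but the content filter removes it because the literal phrase is absent.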

## Document Management

ChromaDB provides methods for retrieving, updating, and deleting documents by ID.

In [None]:
print("=" * 80)
print("DOCUMENT MANAGEMENT")
print("=" * 80)

# Get document
print("\n1. GET document by ID:")
result = collection.get(ids=["doc1"])
print(f"   Original: {result['documents'][0]}")

# Update document
print("\n2. UPDATE document:")
collection.update(
    ids=["doc1"],
    documents=["The quick silver fox leaps over the sleepy hound"]
)
result = collection.get(ids=["doc1"])
print(f"   ✓ Updated: {result['documents'][0]}")

# Delete document
print("\n3. DELETE document:")
collection.delete(ids=["doc2"])
print(f"   ✓ Deleted doc2. Collection now has {collection.count()} documents")

print("\n" + "=" * 80)
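
The sketch below models the update and delete semantics with a dict-backed stand-in. It assumes that updating a document's text also recomputes its stored embedding, and it silently skips IDs that do not exist, which is a choice of this toy and not a statement about ChromaDB's behavior. `ToyCollection` and `fake_embed` are hypothetical names, and the "embedding" is a trivial two-number summary used only to show that the vector changes.

```python
def fake_embed(text):
    """Trivial stand-in for an embedding model: (length, space count)."""
    return [len(text), text.count(" ")]


class ToyCollection:
    """Dict-backed stand-in that mimics add/update/delete semantics only."""

    def __init__(self, embed):
        self._embed = embed
        self._docs = {}
        self._vectors = {}

    def add(self, ids, documents):
        for doc_id, text in zip(ids, documents):
            self._docs[doc_id] = text
            self._vectors[doc_id] = self._embed(text)

    def update(self, ids, documents):
        for doc_id, text in zip(ids, documents):
            if doc_id in self._docs:  # toy choice: skip unknown IDs
                self._docs[doc_id] = text
                self._vectors[doc_id] = self._embed(text)  # re-embed new text

    def delete(self, ids):
        for doc_id in ids:
            self._docs.pop(doc_id, None)
            self._vectors.pop(doc_id, None)

    def vector(self, doc_id):
        return self._vectors[doc_id]

    def count(self):
        return len(self._docs)


store = ToyCollection(fake_embed)
store.add(
    ids=["doc1", "doc2"],
    documents=[
        "The quick brown fox jumps over the lazy dog",
        "A man is walking his dog in the park",
    ],
)
vector_before = store.vector("doc1")
store.update(ids=["doc1"], documents=["The quick silver fox leaps over the sleepy hound"])
vector_after = store.vector("doc1")
store.delete(ids=["doc2"])
print(vector_before, vector_after, store.count())
```

The practical consequence is that an updated document is searched by its new text, so results for earlier queries may change after an update.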