# ChromaDB Scaling Strategies with ANN

This notebook explores scaling strategies using Approximate Nearest Neighbor (ANN) algorithms for production-scale vector databases.

## Learning Objectives

By the end of this notebook, you will be able to:
- Understand ANN algorithms and their trade-offs
- Configure HNSW parameters for different use cases
- Benchmark query performance across different configurations
- Choose optimal settings for speed vs. accuracy requirements
- Scale vector databases to handle thousands of documents efficiently

## Key Scaling Considerations

1. **Speed vs. Accuracy** - Trade-offs between query performance and result quality
2. **Resource Limitations** - Managing memory, CPU, and storage constraints
3. **Index Configuration** - Tuning HNSW parameters for optimal performance
4. **Production Requirements** - Meeting real-world SLA requirements

## What is HNSW?

**HNSW (Hierarchical Navigable Small World)** is an ANN algorithm that enables fast similarity search:

- **Approximate** - Trades exact results for speed (well-tuned indexes often reach roughly 95-99% recall)
- **Hierarchical** - Multi-layer graph structure for efficient navigation
- **Navigable** - Optimized path-finding through the vector space
- **Small World** - Few hops needed to reach any node
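
To make the "navigable" and "small world" ideas concrete, here is a minimal sketch of greedy graph search on a single layer. It is a simplified illustration, not ChromaDB's or hnswlib's implementation: real HNSW adds multiple layers and keeps a candidate list of size `ef`. The names `greedy_search`, `vectors`, and `neighbors` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_search(vectors, neighbors, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops when no neighbor is closer than the current node.
    Returns (node_id, number_of_hops).
    """
    current = entry
    current_dist = euclidean(vectors[current], query)
    hops = 0
    while True:
        best, best_dist = current, current_dist
        for candidate in neighbors[current]:
            dist = euclidean(vectors[candidate], query)
            if dist < best_dist:
                best, best_dist = candidate, dist
        if best == current:
            return current, hops
        current, current_dist = best, best_dist
        hops += 1


# Toy graph: 16 points on a line, each linked to its adjacent points,
# plus one long-range "shortcut" edge between nodes 0 and 8.
vectors = {i: (float(i),) for i in range(16)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 16] for i in range(16)}
neighbors[0].append(8)
neighbors[8].append(0)

found, hops = greedy_search(vectors, neighbors, entry=0, query=(10.2,))
print(found, hops)
```

The shortcut edge lets the search jump from node 0 to node 8 in one hop instead of stepping through every intermediate node. Long-range links play the same role in a small-world graph.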

**Why use ANN?**
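- Exact (brute-force) nearest neighbor search compares the query with every vector, so it costs O(n) per query - too slow for large datasets
- HNSW query time grows roughly logarithmically with n in practice, usually with only a small loss in recall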
- Essential for production systems with millions of vectors
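
For comparison, exact search can be sketched in a few lines of plain Python. This is an illustrative baseline with hypothetical names (`cosine_distance`, `exact_top_k`), not part of ChromaDB. It scores every stored vector for every query, which is where the O(n) cost comes from. The same function can later serve as ground truth when measuring recall.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Return the indices of the k vectors closest to the query.

    Every vector is scored, so the work grows linearly with len(vectors).
    """
    scored = [(cosine_distance(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()
    return [idx for _, idx in scored[:k]]


vectors = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (-1.0, 0.0)]
top2 = exact_top_k((0.9, 0.1), vectors, k=2)
print(top2)
```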

## Setup: Install Required Libraries

In [None]:
import os
os.environ['UV_LINK_MODE'] = 'copy'

!uv pip install accelerate==1.6.0 sentence-transformers==4.0.2

print("✓ Required libraries installed successfully!")

In [None]:
import chromadb
from chromadb.utils import embedding_functions
import time

# Initialize ChromaDB
client = chromadb.Client()
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

print("✓ ChromaDB initialized!")
print("  Embedding model: all-MiniLM-L6-v2")

## HNSW Parameter Configurations

We'll create three collections with different HNSW settings to compare performance:

### HNSW Parameters Explained

| Parameter | Description | Effect |
|-----------|-------------|--------|
| `hnsw:space` | Distance metric (`l2`, `ip`, or `cosine`) | How similarity is calculated |
| `hnsw:construction_ef` | Candidate list size while building the index (higher = better graph) | Index quality & construction time |
| `hnsw:search_ef` | Candidate list size while searching (higher = more accurate) | Recall & query latency |
| `hnsw:M` | Max connections per node | Memory usage & accuracy |

### Our Configurations

1. **Default** - ChromaDB's default `ef` and `M` values (balanced)
2. **High Accuracy** - Prioritizes quality (construction_ef=1000, search_ef=1250, M=36)
3. **Fast Search** - Prioritizes speed (construction_ef=80, search_ef=40, M=12)
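
One way to keep these configurations consistent is to build the metadata dictionaries from a small helper. The sketch below is an assumption for illustration: `hnsw_metadata` and `presets` are hypothetical names, not part of ChromaDB. It only assembles the `hnsw:*` keys listed in the table above and rejects metric names ChromaDB does not accept.

```python
VALID_SPACES = {"l2", "ip", "cosine"}  # metric names accepted for hnsw:space


def hnsw_metadata(space="cosine", construction_ef=None, search_ef=None, m=None):
    """Build a collection metadata dict.

    Parameters left as None are omitted, so ChromaDB's defaults apply to them.
    """
    if space not in VALID_SPACES:
        raise ValueError(f"space must be one of {sorted(VALID_SPACES)}, got {space!r}")
    metadata = {"hnsw:space": space}
    optional = {
        "hnsw:construction_ef": construction_ef,
        "hnsw:search_ef": search_ef,
        "hnsw:M": m,
    }
    for key, value in optional.items():
        if value is None:
            continue
        if value <= 0:
            raise ValueError(f"{key} must be positive, got {value}")
        metadata[key] = value
    return metadata


# The two tuned configurations used in this notebook
presets = {
    "high_accuracy": hnsw_metadata(construction_ef=1000, search_ef=1250, m=36),
    "fast_search": hnsw_metadata(construction_ef=80, search_ef=40, m=12),
}
print(presets["fast_search"])
```

A dictionary from `presets` can then be passed as the `metadata` argument of `client.create_collection`.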

In [None]:
# Create collections with different HNSW configurations
collections = {}

print("Creating collections with different HNSW configurations...\n")

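# 1. Default HNSW parameters
# ChromaDB's default distance metric is l2; cosine is set here so all three
# collections use the same metric and only the ef/M parameters differ.
collections["default"] = client.create_collection(
    name="default_index",
    embedding_function=embedding_function,
    metadata={"hnsw:space": "cosine"}
)
print("✓ Created 'default' collection (ChromaDB default ef/M values)")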

# 2. High accuracy configuration
collections["high_accuracy"] = client.create_collection(
    name="high_accuracy_index",
    embedding_function=embedding_function,
    metadata={
        "hnsw:space": "cosine",
        "hnsw:construction_ef": 1000,
        "hnsw:search_ef": 1250,
        "hnsw:M": 36
    }
)
print("✓ Created 'high_accuracy' collection (quality-optimized)")

# 3. Fast search configuration
collections["fast_search"] = client.create_collection(
    name="fast_search_index",
    embedding_function=embedding_function,
    metadata={
        "hnsw:space": "cosine",
        "hnsw:construction_ef": 80,
        "hnsw:search_ef": 40,
        "hnsw:M": 12
    }
)
print("✓ Created 'fast_search' collection (speed-optimized)")

### Generating Sample Documents

Now let's create some sample documents across different categories to populate our collections.

In [None]:
# Generate sample documents
num_docs = 10000
categories = ["technology", "science", "health", "business", "entertainment"]

print(f"Generating {num_docs:,} sample documents...")

documents = []
ids = []

for i in range(num_docs):
    category = categories[i % len(categories)]
    document = f"This is document {i} about {category} with additional text for uniqueness."
    documents.append(document)
    ids.append(f"doc_{i}")

print(f"✓ Generated {num_docs:,} documents across {len(categories)} categories")
print("\nSample documents:")
for i in range(3):
    print(f"  {i+1}. {documents[i]}")

### Adding Documents to Collections

Let's add the generated documents to all three collections.
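
Note that the timing loop in the next cell passes raw documents, so each collection embeds the same texts again and the measured time is dominated by the embedding model rather than by HNSW construction. A possible alternative, sketched below under stated assumptions, is to embed once (for example with `embeddings = embedding_function(documents)`) and pass the precomputed vectors in batches. The helper `add_in_batches` and the batch size of 1000 are hypothetical choices. It relies only on `collection.add` accepting `ids`, `documents`, and `embeddings`.

```python
def add_in_batches(collection, ids, documents, embeddings, batch_size=1000):
    """Add precomputed embeddings to a collection in fixed-size batches.

    `collection` is any object with a ChromaDB-style
    add(ids=..., documents=..., embeddings=...) method.
    Returns the number of batches sent.
    """
    if not (len(ids) == len(documents) == len(embeddings)):
        raise ValueError("ids, documents, and embeddings must have the same length")
    batches = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            embeddings=embeddings[start:end],
        )
        batches += 1
    return batches
```

Timing `add_in_batches` for each collection isolates index construction from embedding cost. Small batches also avoid oversized single `add` calls.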

In [None]:
print("=" * 80)
print("INDEXING BENCHMARK")
print("=" * 80)

print(f"\nAdding {num_docs:,} documents to each collection...\n")

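batch_size = 1000  # small batches avoid exceeding ChromaDB's per-call batch limit

for name, collection in collections.items():
    # Note: this timing includes embedding the documents, not just index construction
    start_time = time.perf_counter()
    for start in range(0, num_docs, batch_size):
        end = start + batch_size
        collection.add(documents=documents[start:end], ids=ids[start:end])
    elapsed_time = time.perf_counter() - start_time

    print(f"✓ {name:15} → {elapsed_time:6.2f}s ({num_docs/elapsed_time:6.0f} docs/sec)")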

print("\n" + "=" * 80)

### Benchmark Query Performance

Now let's evaluate how each configuration performs with a set of representative queries.
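
Mean latency can hide slow outliers, so it may help to report the median and a high percentile as well, and to discard a warm-up call. The helper below is a minimal sketch with hypothetical names (`time_calls`, `summarize`). In this notebook the callable would wrap something like `collection.query(query_texts=[query], n_results=10)`; a trivial stand-in is used here so the example runs on its own.

```python
import math
import statistics
import time


def time_calls(fn, repeats=5, warmup=1):
    """Call fn() `warmup` times untimed, then return `repeats` timings in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return median, nearest-rank 95th percentile, and max, in milliseconds."""
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered) * 1000,
        "p95_ms": ordered[p95_index] * 1000,
        "max_ms": ordered[-1] * 1000,
    }


# Stand-in workload; replace the lambda with a real query call
samples = time_calls(lambda: sum(range(1000)), repeats=5)
print(summarize(samples))
```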

In [None]:
print("=" * 80)
print("QUERY PERFORMANCE BENCHMARK")
print("=" * 80)

# Test queries
query_texts = [
    "Latest technology trends in artificial intelligence",
    "Scientific research on climate change",
    "Health benefits of regular exercise",
    "Business strategies for startups",
    "Entertainment news about recent movie releases"
]

num_trials = 5
results = {}

print(f"\nRunning {num_trials} trials per query across {len(query_texts)} queries...\n")

In [None]:
# Benchmark each configuration
for name, collection in collections.items():
    print(f"Testing {name}:")
    times = []
    
    for query in query_texts:
        query_times = []
        
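        for _ in range(num_trials):
            # perf_counter is a monotonic, high-resolution clock suited to benchmarking;
            # the measured time includes embedding the query text
            start_time = time.perf_counter()
            collection.query(query_texts=[query], n_results=10)
            query_times.append(time.perf_counter() - start_time)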
        
        avg_time = sum(query_times) / len(query_times)
        times.append(avg_time)
        print(f"  '{query[:35]}...' → {avg_time*1000:5.1f}ms")
    
    results[name] = {
        "mean": sum(times) / len(times),
        "min": min(times),
        "max": max(times),
        "times": times
    }
    print()

In [None]:
print("=" * 80)
print("PERFORMANCE SUMMARY")
print("=" * 80)

for name, metrics in results.items():
    mean_ms = metrics['mean'] * 1000
    min_ms = metrics['min'] * 1000
    max_ms = metrics['max'] * 1000
    print(f"\n{name:15}")
    print(f"  Mean: {mean_ms:5.1f}ms")
    print(f"  Min:  {min_ms:5.1f}ms")
    print(f"  Max:  {max_ms:5.1f}ms")

# Calculate relative performance
baseline = results['default']['mean']
print("\nRelative to default:")
for name in ['high_accuracy', 'fast_search']:
    ratio = results[name]['mean'] / baseline
    direction = "slower" if ratio > 1 else "faster"
    print(f"  {name:15} → {abs(1-ratio)*100:4.1f}% {direction}")

print("\n" + "=" * 80)

## Summary

We've explored ANN scaling strategies using HNSW configurations in ChromaDB.

### Key Takeaways

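1. **HNSW enables scale** - In this notebook's setup, queries over 10,000 documents typically complete in under about 15 ms; exact timings depend on hardware and include query embedding time
2. **Configuration matters** - Parameters shift the balance between speed and accuracy; this notebook measures only latency, so accuracy differences should be confirmed with a recall metric
3. **Trade-offs are real**:
   - **High accuracy** → Slower queries, higher expected recall
   - **Fast search** → Faster queries, somewhat lower expected recall
   - **Default** → Balanced for most use cases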

4. **Parameter guide**:
   - `construction_ef`: Higher = better index quality (longer build time)
   - `search_ef`: Higher = more accurate search (slower queries)
   - `M`: Higher = more connections (more memory, better accuracy)
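
To get a feel for the memory effect of `M`, a commonly quoted rule of thumb for hnswlib-style indexes is roughly `dim * 4 + M * 2 * 4` bytes per vector. The sketch below applies it as a back-of-the-envelope estimate only. It ignores upper graph layers, stored documents, and metadata, and `estimate_hnsw_bytes` is a hypothetical helper. The dimension of 384 matches the `all-MiniLM-L6-v2` embeddings used in this notebook.

```python
def estimate_hnsw_bytes(num_vectors, dim, m, bytes_per_float=4, bytes_per_link=4):
    """Rough index size: float32 vector data plus up to 2*M base-layer links per node."""
    vector_bytes = dim * bytes_per_float
    link_bytes = 2 * m * bytes_per_link
    return num_vectors * (vector_bytes + link_bytes)


for label, m in [("fast_search", 12), ("high_accuracy", 36)]:
    size_mb = estimate_hnsw_bytes(num_vectors=10_000, dim=384, m=m) / 1e6
    print(f"{label:15} M={m:2d} → ~{size_mb:.2f} MB")
```

Under this estimate the vector data dominates at 384 dimensions, so raising `M` from 12 to 36 adds only about 12% to the index size. The relative cost of `M` is larger for low-dimensional embeddings.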

### Configuration Recommendations

**Choose High Accuracy when:**
- Result quality is critical
- A latency budget of up to about 50 ms is acceptable
- Use cases: Medical search, legal research, critical recommendations

**Choose Fast Search when:**
- Speed is paramount
- Slight accuracy loss is acceptable
- Use cases: Real-time search, autocomplete, high-QPS services

**Choose Default when:**
- Balanced performance needed
- Standard production requirements
- Use cases: Most semantic search applications

### Production Best Practices

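1. **Benchmark with real data** - Synthetic data doesn't reflect production patterns; the templated documents used here are near-duplicates, so their embeddings cluster far more tightly than real text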
2. **Monitor accuracy** - Track recall@k metrics to ensure quality
3. **Start conservative** - Begin with default or high accuracy, optimize later
4. **Load test** - Verify performance under concurrent load
5. **Document choices** - Record configuration decisions and reasoning
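
As a starting point for the accuracy monitoring mentioned above, recall@k can be computed by comparing the IDs returned by an ANN query with the IDs from an exact search over the same data. The function names below are hypothetical, and the ID lists are made-up placeholders rather than measured results.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall_at_k(exact_results, approx_results, k):
    """Average recall@k over several queries (lists of ID lists, in query order)."""
    scores = [recall_at_k(e, a, k) for e, a in zip(exact_results, approx_results)]
    return sum(scores) / len(scores) if scores else 0.0


# Placeholder IDs for two queries: exact ground truth vs. ANN output
exact = [["doc_1", "doc_2", "doc_3", "doc_4"], ["doc_5", "doc_6", "doc_7", "doc_8"]]
approx = [["doc_1", "doc_3", "doc_9", "doc_4"], ["doc_5", "doc_6", "doc_7", "doc_8"]]
print(mean_recall_at_k(exact, approx, k=4))
```

In practice the approximate IDs would come from `collection.query(...)["ids"]`, and the exact IDs from a brute-force search over the same embeddings.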

### Next Steps

- Test with your domain-specific queries
- Experiment with different parameter combinations
- Implement recall metrics to measure accuracy
- Consider horizontal scaling for very large datasets (millions of vectors)