# ChromaDB Scaling with Caching

This notebook explores caching strategies for optimizing vector database performance at scale.

## Learning Objectives

By the end of this notebook, you will be able to:
- Implement LRU caching for query results
- Measure cache performance and hit rates
- Understand caching trade-offs and best practices
- Apply horizontal scaling strategies for production systems
- Optimize resource usage across memory, CPU, and network

## Why Implement Caching?

Caching is crucial for production vector database systems:

1. **Reduced latency** - Cache hits skip embedding computation and vector search, so they return almost immediately
2. **Lower costs** - Fewer GPU/CPU cycles spent on embeddings and similarity calculations
3. **Better scalability** - Handle more queries per second with the same resources
4. **Improved UX** - Very low latency for common queries (an in-process cache hit is just a dictionary lookup)

## Setup: Install Required Libraries

In [None]:
import os
os.environ['UV_LINK_MODE'] = 'copy'

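# Note: chromadb is imported in the next cell but not installed here;
# this cell assumes the environment already provides it.
!uv pip install accelerate==1.6.0 sentence-transformers==4.0.2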

print("✓ Required libraries installed successfully!")

In [None]:
import chromadb
from chromadb.utils import embedding_functions
import time
import random

print("✓ Libraries imported successfully!")

## LRU Cache Implementation

An **LRU (Least Recently Used)** cache keeps recently accessed items and, when full, evicts the entry that has gone longest without being accessed.

### How LRU Works

1. Track access order for all cached items
2. When cache is full, remove least recently used item
3. Move accessed items to "most recently used" position

In [None]:
class LRUCache:
    """Simple LRU cache for query results"""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.cache = {}
        self.usage_order = []
    
    def get(self, key):
        """Get item from cache, return None if not found"""
        if key in self.cache:
            # Move to end (most recently used)
            self.usage_order.remove(key)
            self.usage_order.append(key)
            return self.cache[key]
        return None
    
    def put(self, key, value):
        """Add item to cache, evict LRU if full"""
        if key in self.cache:
            self.cache[key] = value
            self.usage_order.remove(key)
            self.usage_order.append(key)
        else:
            if len(self.cache) >= self.capacity:
                # Evict least recently used
                lru_key = self.usage_order.pop(0)
                del self.cache[lru_key]
            
            self.cache[key] = value
            self.usage_order.append(key)
    
    def clear(self):
        self.cache = {}
        self.usage_order = []
    
    def __len__(self):
        return len(self.cache)

print("✓ LRUCache class defined!")
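
The list-based class above is easy to follow, but `list.remove` and `list.pop(0)` scan or shift the whole list, so each operation costs O(n) in the cache size. As a minimal sketch of the same three steps, the variant below uses `collections.OrderedDict` from the standard library, whose `move_to_end` and `popitem` run in constant time. The class name `OrderedDictLRUCache` and the capacity check are illustrative choices, not part of the notebook's original code.

```python
from collections import OrderedDict


class OrderedDictLRUCache:
    """LRU cache backed by OrderedDict; get and put are O(1)."""

    def __init__(self, capacity=100):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def put(self, key, value):
        """Insert or update a value, evicting the LRU entry if over capacity."""
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # drop the least recently used entry

    def clear(self):
        self.cache.clear()

    def __len__(self):
        return len(self.cache)


demo = OrderedDictLRUCache(capacity=2)
demo.put("a", 1)
demo.put("b", 2)
demo.get("a")      # "a" becomes most recently used
demo.put("c", 3)   # evicts "b", the least recently used key
print(list(demo.cache))  # ['a', 'c']
```

It exposes the same `get`, `put`, `clear`, and `len()` interface as `LRUCache`, so it could be swapped in for the experiments below.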

### Setting Up the Collection

Now let's create a collection and populate it with sample documents for our caching experiment.

In [None]:
# Initialize ChromaDB
client = chromadb.Client()
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create the collection (get_or_create avoids an error if this cell is re-run)
collection = client.get_or_create_collection(
    name="cache_test",
    embedding_function=embedding_function
)

print("✓ ChromaDB collection created!")

In [None]:
# Add sample documents
num_docs = 1000
documents = [f"Sample document {i} with content for testing caching" for i in range(num_docs)]
ids = [f"cache_doc_{i}" for i in range(num_docs)]

print(f"Adding {num_docs:,} documents...")
for i in range(0, num_docs, 100):
    collection.add(documents=documents[i:i+100], ids=ids[i:i+100])

print(f"✓ Added {num_docs:,} documents to collection")

### Cached Query Function

Let's implement a function that uses our cache to store and retrieve query results.

In [None]:
# Initialize cache
query_cache = LRUCache(capacity=50)

def cached_query(query_text, n_results=10, use_cache=True):
    """Query with optional caching"""
    cache_key = f"{query_text}:{n_results}"
    
    if use_cache:
        cached_result = query_cache.get(cache_key)
        if cached_result is not None:
            return cached_result, True  # Cache hit
    
    # Cache miss - perform actual query
    result = collection.query(query_texts=[query_text], n_results=n_results)
    
    if use_cache:
        query_cache.put(cache_key, result)
    
    return result, False  # Cache miss

print("✓ Cached query function defined!")
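
To track cache effectiveness outside a benchmark loop, the hit and miss counters can live next to the cache itself. The sketch below is one way to do that, assuming a plain unbounded dictionary for brevity (a real deployment would pair the counters with a bounded cache such as the LRU class above). `CountingCache` and `fake_search` are hypothetical names; `fake_search` stands in for `collection.query` so the example runs without a database.

```python
class CountingCache:
    """Wrap a query function with a dict cache and hit/miss counters."""

    def __init__(self, query_fn):
        self.query_fn = query_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def query(self, query_text, n_results=10):
        key = (query_text, n_results)  # tuples are hashable, no string formatting needed
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = self.query_fn(query_text, n_results)
        self.store[key] = result
        return result

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_search(query_text, n_results):
    """Hypothetical stand-in for collection.query; returns dummy ids."""
    return [f"{query_text}-{i}" for i in range(n_results)]


counted = CountingCache(fake_search)
for text in ["a", "b", "a", "a"]:
    counted.query(text, n_results=2)
print(counted.hits, counted.misses, counted.hit_rate)  # 2 2 0.5
```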

### Preparing Query Mix

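To simulate a realistic workload, we'll mix a small set of common queries that repeat in every round with queries sampled from a larger pool of 50. The pool queries are labeled "unique" below, although the same one can be drawn in more than one round.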

In [None]:
# Create realistic query mix
common_queries = [
    "document with content",
    "sample document",
    "testing caching",
    "various content"
]

unique_queries = [f"unique query {i}" for i in range(50)]

# Mix: every round adds all 4 common queries plus 5 queries sampled from the "unique" pool
mixed_queries = []
for _ in range(20):
    mixed_queries.extend(common_queries)  # 4 per round, 80 total
    mixed_queries.extend(random.sample(unique_queries, 5))  # 5 per round, 100 total (may repeat across rounds)

random.shuffle(mixed_queries)

print(f"✓ Generated {len(mixed_queries)} queries")
print(f"  Common queries: {len([q for q in mixed_queries if q in common_queries])}")
print(f"  Unique queries: {len([q for q in mixed_queries if q in unique_queries])}")

### Benchmark: No Cache vs. With Cache

Now let's measure the performance difference between running queries without a cache versus with a cache.

In [None]:
print("=" * 80)
print("BENCHMARK: NO CACHE")
print("=" * 80)

start_time = time.perf_counter()  # monotonic, high-resolution timer for benchmarks
for query in mixed_queries:
    cached_query(query, use_cache=False)
no_cache_time = time.perf_counter() - start_time

print(f"✓ Completed {len(mixed_queries)} queries without cache")
print(f"  Total time: {no_cache_time:.4f}s")
print(f"  Avg per query: {no_cache_time/len(mixed_queries)*1000:.2f}ms")

In [None]:
print("\n" + "=" * 80)
print("BENCHMARK: WITH CACHE")
print("=" * 80)

query_cache.clear()
hits = 0
start_time = time.perf_counter()  # monotonic, high-resolution timer for benchmarks

for query in mixed_queries:
    _, is_hit = cached_query(query, use_cache=True)
    if is_hit:
        hits += 1

with_cache_time = time.perf_counter() - start_time
hit_rate = hits / len(mixed_queries)

print(f"✓ Completed {len(mixed_queries)} queries with cache")
print(f"  Total time: {with_cache_time:.4f}s")
print(f"  Avg per query: {with_cache_time/len(mixed_queries)*1000:.2f}ms")
print(f"  Cache hits: {hits}/{len(mixed_queries)} ({hit_rate:.1%})")
print(f"  Cache size: {len(query_cache)}/{query_cache.capacity}")

In [None]:
print("\n" + "=" * 80)
print("PERFORMANCE COMPARISON")
print("=" * 80)

speedup = no_cache_time / with_cache_time
time_saved = no_cache_time - with_cache_time
percent_saved = (1 - with_cache_time/no_cache_time) * 100

print(f"\nWithout cache: {no_cache_time:.4f}s")
print(f"With cache:    {with_cache_time:.4f}s")
print(f"\nTime saved:    {time_saved:.4f}s ({percent_saved:.1f}%)")
print(f"Speedup:       {speedup:.2f}x faster")
print(f"Hit rate:      {hit_rate:.1%}")

print("\n" + "=" * 80)
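
The measured speedup is closely tied to the hit rate. If a miss costs `miss_ms` on average and a hit costs `hit_ms`, the average cached latency is `hit_rate * hit_ms + (1 - hit_rate) * miss_ms`, and the speedup is `miss_ms` divided by that. The helper below is a small sketch of this arithmetic. The function name and the latency figures are illustrative assumptions, not measurements from this notebook.

```python
def expected_speedup(hit_rate, miss_ms, hit_ms):
    """Estimate speedup from the average latencies of misses and hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_avg = hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
    if cached_avg == 0:
        return float("inf")  # every query hits and hits are free
    return miss_ms / cached_avg


# Illustrative numbers only, not measurements from this notebook.
print(f"{expected_speedup(0.75, miss_ms=20.0, hit_ms=0.0):.2f}x")  # 4.00x
print(f"{expected_speedup(0.50, miss_ms=20.0, hit_ms=0.0):.2f}x")  # 2.00x
```

When hits are nearly free, the fraction of time saved is approximately equal to the hit rate, which is why monitoring hit rate is a good proxy for cache benefit.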

## Advanced Scaling Strategies

### Horizontal Scaling Approaches

As your vector database grows beyond the capacity of a single machine, you'll need to implement horizontal scaling strategies. Here are some common approaches:

1. **Sharding** - Partitioning your vector space across multiple instances
   - **By ID range** - Deterministic but may lead to unbalanced shards
   - **By vector clustering** - Better search performance but more complex

2. **Replication** - Creating copies of your data across multiple instances
   - Improves read throughput and fault tolerance
   - Requires synchronization mechanisms for writes

3. **Hybrid approaches** - Combining sharding and replication
   - Example: a vector database cluster with data sharded across nodes and each shard replicated
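
As a rough sketch of the hybrid approach, the routing logic below shards by ID range and rotates reads across each shard's replicas. The range boundaries, node names, and function names are all hypothetical, and the code only decides where a request would go. It does not talk to any database.

```python
import bisect
import itertools

# Hypothetical layout: shard i holds document numbers below UPPER_BOUNDS[i]
# (and at or above the previous bound). Node names are made up.
UPPER_BOUNDS = [1000, 5000, 20000]
REPLICAS = [
    ["node-a1", "node-a2"],
    ["node-b1", "node-b2"],
    ["node-c1", "node-c2"],
]
_read_cycles = [itertools.cycle(nodes) for nodes in REPLICAS]


def shard_index(doc_number):
    """Map a document number to its shard by ID range."""
    if doc_number < 0 or doc_number >= UPPER_BOUNDS[-1]:
        raise ValueError(f"document number {doc_number} is outside all shard ranges")
    return bisect.bisect_right(UPPER_BOUNDS, doc_number)


def write_targets(doc_number):
    """Writes go to every replica of the owning shard."""
    return list(REPLICAS[shard_index(doc_number)])


def read_target(doc_number):
    """Reads rotate across the owning shard's replicas (round-robin)."""
    return next(_read_cycles[shard_index(doc_number)])


print(shard_index(999), shard_index(1000), shard_index(19999))  # 0 1 2
print(write_targets(42))                 # ['node-a1', 'node-a2']
print(read_target(42), read_target(42))  # node-a1 node-a2
```

Note that ID-range routing only helps lookups by ID. A similarity query has to be sent to every shard and the per-shard results merged, because the nearest neighbors could live anywhere.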

### Resource Management Best Practices

1. **Memory Optimization**
   - Use quantization to reduce vector size (e.g., 32-bit floats to 8-bit integers)
   - Implement disk-based storage for less frequently accessed vectors

2. **CPU Utilization**
   - Batch similar operations
   - Use asynchronous processing where possible

3. **Network Efficiency**
   - Minimize data transfer between components
   - Compress payloads when possible
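
To make the quantization idea concrete, here is a minimal sketch of symmetric scalar quantization in pure Python: each vector is scaled so its largest component maps to 127, and components are rounded to signed 8-bit integers. This illustrates the general technique only. It is not a description of how ChromaDB stores vectors, and the function names are hypothetical.

```python
from array import array


def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


vector = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(codes)  # [30, -127, 84, 0]
print(f"max reconstruction error: {max_error:.4f}")

# Storage per component: 1 byte as signed 8-bit vs. 4 bytes as 32-bit float
print(array("b", codes).itemsize, array("f", vector).itemsize)  # 1 4
```

The rounding error per component is at most half the scale, so the trade-off is roughly a 4x memory reduction in exchange for slightly less precise distances.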

### Real-world Implementation Considerations

1. **Monitoring and Observability**
   - Track latency, throughput, and error rates
   - Set up alerts for performance degradation

2. **Failure Handling**
   - Implement graceful degradation strategies
   - Consider fallback search methods

3. **Update Strategies**
   - Batch updates to reduce index rebuilding frequency
   - Consider incremental index updates

4. **Hybrid Search Approaches**
   - Combine vector search with keyword search for better results
   - Filter vectors based on metadata before computing distances
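
One way to read "graceful degradation" and "fallback search methods" together is: try vector search first, and if the backend raises, serve a cruder keyword match instead of an error. The sketch below assumes this interpretation. All function names are hypothetical, and `broken_vector_search` simply simulates an outage.

```python
def keyword_search(query_text, documents, n_results=3):
    """Fallback: rank documents by how many query words they contain."""
    words = set(query_text.lower().split())
    scored = []
    for doc in documents:
        overlap = len(words & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n_results]]


def search_with_fallback(query_text, vector_search, documents, n_results=3):
    """Try vector search first; degrade to keyword search if it raises."""
    try:
        return vector_search(query_text, n_results), "vector"
    except Exception:
        # In production, log the exception and emit a metric here.
        return keyword_search(query_text, documents, n_results), "keyword"


def broken_vector_search(query_text, n_results):
    """Hypothetical backend that simulates an outage."""
    raise ConnectionError("vector index unavailable")


docs = ["sample document one", "testing caching layer", "unrelated text"]
results, source = search_with_fallback("caching document", broken_vector_search, docs)
print(source, results)
# keyword ['sample document one', 'testing caching layer']
```

Returning the source label alongside the results lets callers and dashboards see how often the degraded path is being used.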

## Summary

We've explored caching strategies and scaling approaches for production vector databases.

### Key Takeaways

1. **Caching is highly effective** - Expect roughly 70-80% time savings for query mixes like the one above; savings track the hit rate, so measure on your own workload
2. **LRU works well** - Simple to implement, effective for temporal locality
3. **Cache sizing matters** - Balance memory usage vs. hit rate
4. **Monitor hit rates** - Track cache effectiveness in production

### Caching Best Practices

**When to use caching:**
- High query volume with repeated patterns
- Expensive embedding computations
- Read-heavy workloads
- Latency-sensitive applications

**Cache configuration:**
- **Size** - Monitor hit rates, increase if misses are frequent
- **TTL** - Expire stale results (important for frequently updated data)
- **Eviction** - LRU works well for most cases, consider LFU for skewed access patterns

**Production considerations:**
- Distributed caching (Redis, Memcached) for multi-instance deployments
- Cache warming for known popular queries
- Monitor cache memory usage and hit/miss rates
- Implement cache invalidation for data updates
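
The TTL and invalidation points above can be sketched in a few lines. The class below stores an expiry time with each entry, drops stale entries lazily on read, and offers a blunt `invalidate_all` to call after writes. The class name, method names, and injectable clock are illustrative assumptions; the clock parameter exists only so expiry can be demonstrated without sleeping.

```python
import time


class TTLCache:
    """Dict cache whose entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.store = {}     # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.store[key]  # lazily drop the stale entry
            return None
        return value

    def put(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl_seconds)

    def invalidate_all(self):
        """Call after writes to the collection so stale results are not served."""
        self.store.clear()


fake_now = [0.0]  # mutable fake clock for the demo
ttl_cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
ttl_cache.put("q", "result")
print(ttl_cache.get("q"))  # result
fake_now[0] = 61.0
print(ttl_cache.get("q"))  # None
```

Clearing everything on each write is the simplest correct policy. Finer-grained invalidation is possible but needs a way to tell which cached queries a given update could affect.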

### Horizontal Scaling Strategies

**1. Sharding** - Partition vector space across instances
- By ID range (simple, may be unbalanced)
- By vector clustering (better search, more complex)
- By metadata (e.g., category, tenant)

**2. Replication** - Copy data across instances
- Improves read throughput
- Provides fault tolerance
- Requires write synchronization

**3. Hybrid** - Combine sharding + replication
- Example: 3 shards, each replicated 2x = 6 instances
- Balance reads and writes
- Trade-off: complexity vs. performance

### Resource Optimization

**Memory:**
- Vector quantization (32-bit floats → 8-bit integers)
- Disk-backed storage for cold data
- Compressed vector formats

**CPU:**
- Batch similar operations
- Asynchronous processing
- Multi-threading for independent queries
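
For independent queries, a thread pool is often the simplest way to overlap waiting time. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library with a hypothetical `slow_query` that just sleeps, standing in for an I/O-bound call. Threads mainly help when queries wait on the network or run native code that releases the GIL, and whether a particular database client is safe to share across threads is an assumption to verify first.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_query(query_text):
    """Hypothetical stand-in for an I/O-bound query (e.g. a remote DB call)."""
    time.sleep(0.05)
    return f"results for {query_text}"


queries = [f"query {i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_query, queries))  # map preserves input order
elapsed = time.perf_counter() - start

print(results[0])  # results for query 0
print(f"{len(results)} queries in {elapsed:.2f}s")
```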

**Network:**
- Minimize data transfer
- Compress payloads
- Local caching at edge

### Production Checklist

✓ Implement monitoring (latency, throughput, errors)  
✓ Set up alerting for degradation  
✓ Plan graceful degradation strategies  
✓ Implement failover mechanisms  
✓ Test under load (concurrent queries)  
✓ Document scaling decisions  
✓ Plan for data growth  
✓ Implement backup/restore procedures

### Next Steps

- Implement distributed caching with Redis
- Set up multi-instance deployments
- Add comprehensive monitoring
- Test sharding strategies with your data
- Benchmark under production-like load