-
Notifications
You must be signed in to change notification settings - Fork 0
Optimization & Best Practices
- Introduction
- Project Structure
- Core Components
- Architecture Overview
- Detailed Component Analysis
- Dependency Analysis
- Performance Considerations
- Troubleshooting Guide
- Conclusion
- Appendices
This document provides comprehensive optimization and best practices for embedding systems in the repository. It covers BM25 tokenizer implementation for keyword-based search enhancement, hybrid search strategies combining dense and sparse vectors, embedding dimension optimization, batch sizing, provider selection, cost optimization, caching, performance benchmarking, quality assessment, model comparison, troubleshooting, storage optimization, retrieval tuning, and scalability for high-volume workloads.
The embedding optimization spans several modules:
- Embedding service orchestrates provider selection, batching, anomaly detection, and metrics.
- Provider implementations support OpenAI and TEI with retry logic and robust error handling.
- BM25 tokenizer enables keyword-based sparse vector search for hybrid retrieval.
- Qdrant integration supports hybrid search, vector management, and pagination.
- Redis cache accelerates search results and integrates with invalidation channels.
- Health checks and audit/logging provide observability and anomaly detection.
graph TB
subgraph "Embedding Layer"
ES["EmbeddingService<br/>service.ts"]
CFG["Embedding Config<br/>config.ts"]
PRV["Providers<br/>providers.ts"]
BM25["BM25 Tokenizer<br/>bm25-tokenizer.ts"]
AUD["Audit & Anomalies<br/>audit.ts"]
HM["Health Check<br/>health.ts"]
MET["Metrics<br/>embedding-metrics.ts"]
end
subgraph "Vector Store"
QCONN["Qdrant Connection<br/>search.ts"]
QVM["Vector Management<br/>qdrant-vector-management.ts"]
QINIT["Collection Init & BM25<br/>store-init.ts"]
end
subgraph "Caching"
RC["Redis Cache<br/>redis-cache.ts"]
end
ES --> PRV
ES --> CFG
ES --> MET
ES --> AUD
ES --> HM
ES --> BM25
ES --> QCONN
QCONN --> QVM
QCONN --> QINIT
QCONN --> RC
Diagram sources
- service.ts:38-284
- providers.ts:251-278
- config.ts:12-36
- bm25-tokenizer.ts:37-56
- audit.ts:94-157
- health.ts:16-119
- embedding-metrics.ts:11-47
- search.ts:11-82
- qdrant-vector-management.ts:13-114
- store-init.ts:130-148
- redis-cache.ts:21-211
Section sources
- service.ts:14-284
- providers.ts:1-280
- bm25-tokenizer.ts:1-57
- config.ts:1-40
- embedding-metrics.ts:1-51
- audit.ts:1-197
- health.ts:1-121
- search.ts:1-82
- qdrant-vector-management.ts:1-301
- store-init.ts:1-155
- redis-cache.ts:1-211
- EmbeddingService: Central orchestration for generating single and batch embeddings, selecting providers, validating dimensions, emitting metrics, and logging anomalies.
- Providers: OpenAI and TEI implementations with retry/backoff, error categorization, and audit logging.
- BM25 Tokenizer: Lightweight tokenizer producing sparse vectors for Qdrant BM25 search.
- Qdrant Hybrid Search: Dense similarity search combined with BM25 sparse retrieval and reciprocal rank fusion.
- Vector Management: Named vector creation, migration between dimensions, and safe recreation strategies.
- Redis Cache: Search result caching with TTL and invalidation patterns.
- Metrics and Audit: Embedding request counters, durations, vector sizes, batch sizes, and anomaly detection.
Section sources
- service.ts:38-284
- providers.ts:77-278
- bm25-tokenizer.ts:37-56
- search.ts:11-82
- qdrant-vector-management.ts:13-202
- redis-cache.ts:21-211
- embedding-metrics.ts:11-47
- audit.ts:94-157
The embedding pipeline integrates with Qdrant for hybrid retrieval and caches hot queries in Redis. Health checks and audit trails ensure reliability and observability.
sequenceDiagram
participant Client as "Client"
participant ES as "EmbeddingService"
participant PRV as "Providers"
participant Q as "Qdrant"
participant RC as "RedisCache"
Client->>ES : generateEmbedding(query)
ES->>PRV : postEmbeddings(query)
PRV-->>ES : embeddings[]
ES->>ES : validate dimension, norms, latency
ES-->>Client : embedding
Client->>RC : getSearchResult(query, limit)
alt cache miss
RC-->>Client : null
Client->>ES : generateEmbedding(query)
ES->>Q : hybrid search (dense + BM25 RRF)
Q-->>ES : results
ES-->>RC : cache results
ES-->>Client : results
else cache hit
RC-->>Client : cached results
end
Diagram sources
- service.ts:47-127
- providers.ts:251-278
- search.ts:11-82
- redis-cache.ts:36-70
- Dimension resolution: Resolved at first successful call and enforced on subsequent calls to prevent mismatches.
- Batch processing: Validates inputs, tracks batch size, and ensures consistent dimensions across vectors.
- Metrics: Tracks request counts, durations, vector sizes, and batch sizes with tenant/provider labels.
- Anomaly detection: Flags high latency, unusual vector norms, and dimension mismatches.
flowchart TD
Start(["generateEmbedding"]) --> Normalize["Normalize input text"]
Normalize --> EmptyCheck{"Empty?"}
EmptyCheck --> |Yes| ThrowEmpty["Throw error"]
EmptyCheck --> |No| CallProvider["postEmbeddings()"]
CallProvider --> Parse["Parse embeddings"]
Parse --> ValidateDim["Validate dimension"]
ValidateDim --> Anomaly["Detect anomalies"]
Anomaly --> |Critical| ThrowMismatch["Throw dimension mismatch"]
Anomaly --> |OK| EmitMetrics["Emit metrics & audit"]
EmitMetrics --> Return["Return embedding"]
Diagram sources
- service.ts:47-127
- audit.ts:94-157
- embedding-metrics.ts:11-47
Section sources
- service.ts:38-284
- audit.ts:94-157
- embedding-metrics.ts:11-47
- Provider selection: Explicit preference or auto-detection with fallback.
- Retry/backoff: Network transient errors and specific HTTP statuses retried with bounded attempts.
- Audit logging: Captures provider calls with status, dimensions, latency, and optional HTTP status/error messages.
flowchart TD
PStart(["postEmbeddings"]) --> Pref{"Provider pref?"}
Pref --> |openai| OA["postEmbeddingsOpenAI"]
Pref --> |tei| TEI["postEmbeddingsTEI"]
Pref --> |auto| AutoOA{"OpenAI configured?"}
AutoOA --> |Yes| TryOA["Try OpenAI"]
TryOA --> OAErr{"Error?"}
OAErr --> |Yes & TEI configured| TEI
OAErr --> |No| Done["Return embeddings"]
AutoOA --> |No| TEI
TEI --> Done
Diagram sources
- providers.ts:251-278
- providers.ts:77-175
- providers.ts:177-249
Section sources
- providers.ts:14-47
- providers.ts:77-249
- Tokenization: Lowercase, non-alphanumeric split, stop word removal, minimum token length.
- Indexing: FNV-1a hash modulo sparse dimension.
- Values: Sublinear term frequency 1 + log(count).
- Output: Indices and values arrays for Qdrant sparse vector search.
flowchart TD
TStart(["tokenizeToSparse"]) --> Lower["Lowercase"]
Lower --> Split["Split on non-alphanumeric"]
Split --> Filter["Remove stop words & short tokens"]
Filter --> Count["Count term frequencies"]
Count --> Hash["FNV-1a hash mod 30000"]
Hash --> TF["Compute 1 + log(count)"]
TF --> Emit["Emit {indices, values}"]
Diagram sources
- bm25-tokenizer.ts:37-56
Section sources
- bm25-tokenizer.ts:1-57
- Dense search: Vector similarity using the primary dense vector.
- BM25 sparse search: Keyword matching using BM25 sparse vectors.
- Fusion: Reciprocal rank fusion (RRF) combines results from multiple prefetch queries.
- Scoring: Boosts for title, activation patterns, labels, and tags.
sequenceDiagram
participant S as "Search"
participant ES as "EmbeddingService"
participant Q as "Qdrant"
participant BM25 as "BM25 Tokenizer"
S->>ES : generateEmbedding(query)
ES-->>S : query vector
S->>Q : queryVector (using primary dense)
S->>Q : BM25 query (sparse)
Q-->>S : dense results
Q-->>S : BM25 results
S->>S : RRF fusion with boosts
S-->>Client : ranked results
Diagram sources
- search.ts:11-82
- bm25-tokenizer.ts:37-56
Section sources
- search.ts:11-82
- store-init.ts:130-148
- Add named vectors: Attempts updateCollection; falls back to safe recreation preserving data.
- Migrate vector space: Scrolls points with old vector, re-embeds content, and upserts new vectors in batches.
- Remove vector: Recreates collection without the target vector and restores data.
flowchart TD
VMStart(["addVectorsToCollection"]) --> TryUpdate["Try updateCollection"]
TryUpdate --> |Success| Done["Done"]
TryUpdate --> |Fail| Gather["Scroll all points"]
Gather --> Recreate["Delete & recreate collection with merged vectors"]
Recreate --> Restore["Upsert points in batches"]
Restore --> Done
Diagram sources
- qdrant-vector-management.ts:13-114
- qdrant-vector-management.ts:126-202
- qdrant-vector-management.ts:208-301
Section sources
- qdrant-vector-management.ts:13-114
- qdrant-vector-management.ts:126-202
- qdrant-vector-management.ts:208-301
- Caching: Stores search results with TTL and distinguishes collapsed vs natural modes.
- Invalidation: Provides targeted invalidation for begin/activate caches and a pub/sub channel for broader invalidation.
- Integration: Used by higher-level search flows to accelerate repeated queries.
flowchart TD
RS(["getSearchResult"]) --> BuildKey["Build cache key"]
BuildKey --> KVGet["keyValueStore.getJson"]
KVGet --> Hit{"Cache hit?"}
Hit --> |Yes| IncHit["Increment hits"] --> ReturnHit["Return cached"]
Hit --> |No| IncMiss["Increment misses"] --> ReturnMiss["Return null"]
WS(["setSearchResult"]) --> BuildKey2["Build cache key"]
BuildKey2 --> Serialize["Serialize results"]
Serialize --> KVSet["keyValueStore.setJson(TTL)"]
KVSet --> Done["Done"]
Diagram sources
- redis-cache.ts:36-70
- redis-cache.ts:186-211
Section sources
- redis-cache.ts:21-211
- Health: Determines provider availability, handles rate limits and auth failures, and probes endpoint readiness.
- Audit: Emits structured logs for successes and errors, including dimensions, latency, and error messages.
- Metrics: Exposes counters and histograms for embedding requests, durations, errors, vector sizes, and batch sizes.
flowchart TD
HC(["runEmbeddingHealthCheck"]) --> Pref{"Provider pref?"}
Pref --> |openai| OAHC["Check OpenAI"]
Pref --> |tei| TEIHC["Check TEI"]
Pref --> |auto| TryOAHC["Try OpenAI"]
TryOAHC --> OAHC
OAHC --> OAStatus{"Operational?"}
OAStatus --> |Yes| ReportOK["Report healthy"]
OAStatus --> |No| CheckTEI["Try TEI if configured"]
CheckTEI --> TEIStatus{"Operational?"}
TEIStatus --> |Yes| ReportOK
TEIStatus --> |No| ReportFail["Report unhealthy"]
Diagram sources
- health.ts:16-119
- audit.ts:60-92
- embedding-metrics.ts:11-47
Section sources
- health.ts:16-119
- audit.ts:60-92
- embedding-metrics.ts:11-47
- EmbeddingService depends on provider implementations, config, metrics, audit, and health modules.
- Qdrant search depends on EmbeddingService for query vectors and BM25 tokenizer for sparse queries.
- Vector management utilities depend on Qdrant client APIs and collection metadata.
- Redis cache integrates with the key-value store abstraction and invalidation patterns.
graph LR
ES["EmbeddingService"] --> PRV["Providers"]
ES --> CFG["Config"]
ES --> MET["Metrics"]
ES --> AUD["Audit"]
ES --> HM["Health"]
ES --> BM25["BM25 Tokenizer"]
ES --> QSRCH["Qdrant Search"]
QSRCH --> QVM["Vector Management"]
QSRCH --> QINIT["Collection Init"]
QSRCH --> RC["Redis Cache"]
Diagram sources
- service.ts:38-284
- providers.ts:251-278
- config.ts:12-36
- embedding-metrics.ts:11-47
- audit.ts:94-157
- health.ts:16-119
- bm25-tokenizer.ts:37-56
- search.ts:11-82
- qdrant-vector-management.ts:13-114
- store-init.ts:130-148
- redis-cache.ts:21-211
Section sources
- service.ts:38-284
- providers.ts:251-278
- search.ts:11-82
- qdrant-vector-management.ts:13-114
- redis-cache.ts:21-211
- Embedding dimension optimization
- Resolve dimension at startup via a minimal embedding probe to cache and enforce consistent dimensions across requests.
- Ensure vector size metrics reflect float32 size (4 bytes per float) to estimate storage and network overhead.
- Batch size tuning
- Track batch sizes with histograms to identify optimal throughput vs latency trade-offs.
- Filter out empty inputs and compute character lengths for cost-aware batching.
- Provider selection strategies
- Prefer explicit provider configuration for predictable behavior; auto-detection with OpenAI preferred when both are configured.
- Use health checks to detect rate limits and authentication failures and adjust fallbacks accordingly.
- Hybrid search tuning
- Adjust prefetch limits per vector type and combine with RRF to balance precision and recall.
- Apply boost weights for title, activation patterns, labels, and tags to align with domain semantics.
- Caching and storage
- Use Redis cache for frequent queries with TTL; invalidate on memory updates to maintain freshness.
- For vector migrations, batch upserts and preserve payload to minimize downtime and memory pressure.
- Cost optimization
- Monitor embedding vector sizes and request counts; reduce dimensionality or query frequency where appropriate.
- Use BM25 sparse vectors to improve keyword recall without increasing dense vector costs.
- Benchmarking and quality assessment
- Maintain baselines for top scores and query performance; compare current results against historical baselines to detect regressions.
- Use anomaly detection for latency and vector norms to flag potential provider or model issues.
[No sources needed since this section provides general guidance]
Common issues and resolutions:
- Provider misconfiguration
- Symptoms: Authentication failures, rate limits, or missing endpoints.
- Actions: Verify API keys and model names; use health checks to confirm connectivity and status codes.
- Dimension mismatch
- Symptoms: Errors indicating unexpected embedding dimensions.
- Actions: Ensure a startup probe resolves the dimension; enforce dimension validation on all subsequent calls.
- Latency spikes and anomalies
- Symptoms: High-latency requests or unusual vector norms.
- Actions: Review audit logs and anomaly events; adjust provider selection or retry policies.
- Cache invalidation
- Symptoms: Stale search results after memory updates.
- Actions: Trigger invalidation for begin/activate caches and ensure pub/sub invalidation is enabled.
- Vector migration failures
- Symptoms: Partial migrations or inconsistencies after dimension changes.
- Actions: Use safe recreation strategy and batch upserts; monitor progress and handle partial failures gracefully.
Section sources
- health.ts:16-119
- audit.ts:94-157
- redis-cache.ts:186-211
- qdrant-vector-management.ts:13-114
This repository implements a robust, observable, and scalable embedding system with hybrid dense/sparse retrieval, provider flexibility, and strong operational controls. By leveraging BM25 tokenization, vector management utilities, Redis caching, and comprehensive metrics and audit logging, teams can optimize embedding quality, performance, and cost while maintaining reliability at scale.
[No sources needed since this section summarizes without analyzing specific files]
- Embedding quality assessment
- Compare vector norms and latencies over time; flag anomalies exceeding thresholds.
- Validate embedding dimensions across batches and reject inconsistent results.
- Model comparison
- Run health checks and measure latency distributions for different providers/models.
- Benchmark hybrid search top scores against historical baselines to detect regressions.
- Benchmarking methods
- Use histogram metrics for embedding durations and batch sizes to identify bottlenecks.
- Track embedding vector sizes to estimate storage and bandwidth needs.
Section sources
- audit.ts:94-157
- embedding-metrics.ts:11-47
- kairos-search-scores.test.ts:103-134