feat: Replace DuckDB VSS with sharded USearch HNSW indexes #146

Open
ofriw wants to merge 111 commits into main from duckdb-optimization

ofriw (Collaborator) commented Jan 9, 2026

Note: This PR was generated by an AI agent. If you'd like to talk with other humans, drop by our Discord!


This is a substantial architectural overhaul that replaces DuckDB's built-in VSS extension with a custom sharded USearch HNSW implementation. The core motivation: DuckDB's HNSW indexes don't scale well beyond ~100K vectors, and rebuilding them on every insert was becoming a bottleneck. The new architecture uses centroid-based shard routing to group semantically similar vectors, which enables parallel search across shards while maintaining locality.
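
As a rough illustration of centroid-based routing (not the code from shard_manager.py), the sketch below ranks shards by cosine similarity between the query and each shard centroid and keeps the top few. The `route_query` name, the dict-of-centroids layout, and the choice of cosine similarity are assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route_query(query, centroids, nprobe):
    """Return the ids of the `nprobe` shards whose centroids best match `query`.

    `centroids` maps a hypothetical shard id to its centroid vector.
    """
    ranked = sorted(centroids, key=lambda sid: cosine(query, centroids[sid]), reverse=True)
    return ranked[:nprobe]

centroids = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
selected = route_query([0.9, 0.1], centroids, nprobe=2)  # shards "a" and "b"
```

Only the selected shards would then be searched, which is what allows the per-shard searches to run in parallel.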

The test suite got a complete rewrite with synthetic vector generators that eliminate external API dependencies. SyntheticEmbeddingGenerator provides deterministic, reproducible vectors for testing clustering, split/merge cycles, and recall validation. The new ValidatingEmbeddingProvider intercepts all chunks at the embedding layer to enforce size constraints; this caught parser bugs where oversized chunks were slipping through. Relatedly, the chunk size constraint tests now cover all 25+ language parsers end-to-end.
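
One way to get deterministic vectors without calling an embedding API is to seed a random generator from a hash of the input text. The sketch below shows that idea only; `synthetic_vector` and its parameters are hypothetical and are not taken from SyntheticEmbeddingGenerator.

```python
import hashlib
import math
import random

def synthetic_vector(text, dims=8):
    """Derive a unit-length pseudo-embedding from `text` with no API call.

    The same text always yields the same vector because the RNG is seeded
    from a SHA-256 digest of the text.
    """
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

v1 = synthetic_vector("def foo(): pass")
v2 = synthetic_vector("def foo(): pass")  # identical to v1
v3 = synthetic_vector("class Bar: ...")   # different text, different vector
```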

Review focus areas:

  • shard_manager.py - The K-Means split logic (lines 580-680) deserves scrutiny; its cycle prevention must hold, because a split that produces a child below the merge threshold could be merged straight back and split again
  • test_sharding.py - 3000+ lines of new tests; spot-check the invariant validation tests (I1-I13) which define the correctness model
  • Multi-hop search changes removed an artificial "return first 5 if <5 results" fallback that was masking issues

Breaking changes: None at the API level. Internal DuckDB schema adds a shard_id column to embedding tables and a new vector_shards catalog table. Existing databases will auto-migrate on first access.

See sharding in action: Run the 1M vector stress test to watch splits/merges at production thresholds (split at 100K, merge at 10K):

uv run pytest tests/test_sharding.py::TestMillionVectorStress --run-slow -v -s

Takes ~15-20 minutes. Inserts 1M vectors in randomized ~1K batches, then deletes in batches, verifying invariants throughout.


Attached is an agent-optimized review file that you can feed back to your agent to apply the requested changes - AGENT_REVIEW.md

ofriw and others added 30 commits January 2, 2026 14:31
…ompaction

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tion

Adds storage stats, deferred HNSW index creation, and atomic swap with
lock file for crash recovery.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Encapsulates USearch index creation, viewing, clustering, multi-search,
quality measurement, and medoid computation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Provides ShardState dataclass and get_shard_state function to derive
metrics from DuckDB and USearch index files without stored counters.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Coordinates shard search, insert, delete, split, merge, and fix_pass
maintenance using LIRE-style convergence loop and centroid-based routing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tracks batch state during bulk indexing with configurable quality
check intervals and deferred/immediate modes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Connection manager no longer holds persistent connections - these are
now managed by the executor pattern. Simplifies WAL recovery by removing
VSS-dependent recovery paths.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All database operations now delegate to provider's executor for
thread-safe execution. Removes direct connection access from repositories.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaces DuckDB's HNSW index with custom sharded USearch index.
Adds vector_shards table, shard_id column to embeddings, and
integrates ShardManager for vector search operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removes create_vector_index, drop_vector_index, and create_deferred_indexes
from the protocol. Vector indexing is now handled by provider-specific
implementations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use ImportError instead of bare Exception, properly echo request IDs,
and log errors to debug file when CHUNKHOUND_DEBUG_FILE is set.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Vector indexes are now managed by sharding system via fix_pass.
Removes _manage_hnsw_indexes method and related logic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comprehensive test suite covering 13 invariants from spec section 11.5:
- Data integrity (I1-I6): single assignment, shard existence, counts
- Operational (I7-I10): fix pass idempotence, convergence, LIRE, NPA
- Search (I11-I13): no false negatives, tombstone exclusion, centroid filter

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
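
To make the data-integrity group concrete, here is a minimal sketch of what checks for single assignment, shard existence, and counts could look like. The function name and the input layout are hypothetical; the real tests in test_sharding.py may be structured quite differently.

```python
from collections import Counter

def check_data_integrity(assignments, shard_ids, index_sizes):
    """Return a list of violation messages (empty when all checks pass).

    assignments: list of (embedding_id, shard_id) rows (hypothetical layout)
    shard_ids:   set of shard ids known to the catalog
    index_sizes: shard id -> number of vectors held in that shard's index
    """
    problems = []
    # Single assignment: every embedding id appears in exactly one row.
    per_embedding = Counter(eid for eid, _ in assignments)
    for eid, n in per_embedding.items():
        if n > 1:
            problems.append(f"embedding {eid} assigned {n} times")
    # Shard existence: every referenced shard is in the catalog.
    for eid, sid in assignments:
        if sid not in shard_ids:
            problems.append(f"embedding {eid} references missing shard {sid}")
    # Counts: rows per shard match the size of that shard's index.
    per_shard = Counter(sid for _, sid in assignments)
    for sid in shard_ids:
        if per_shard.get(sid, 0) != index_sizes.get(sid, 0):
            problems.append(f"shard {sid} count mismatch")
    return problems

ok = check_data_integrity([(1, "s1"), (2, "s1"), (3, "s2")], {"s1", "s2"}, {"s1": 2, "s2": 1})
bad = check_data_integrity([(1, "s1"), (1, "s2")], {"s1"}, {"s1": 1})
```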
Adds create_vector_index/drop_vector_index stubs to FakeDatabaseProvider.
Updates integration tests to use new API and await async operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use index.keys[:] slice instead of iterating index.keys (O(n) via get_keys_in_slice vs O(n²) iterator)
- Rewrite get_medoid() to use sample-mean + HNSW search in O(sample + log n) time

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
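
The sample-mean approach to the medoid can be sketched as follows: average a bounded sample of vectors, then return the stored vector nearest to that mean. This is an illustration under stated assumptions, not the actual get_medoid() code; in particular, the nearest-neighbour step here is a linear scan standing in for a single HNSW search, and `approximate_medoid` is a hypothetical name.

```python
import math
import random

def approximate_medoid(vectors, sample_size=64, seed=0):
    """Return the key of the stored vector nearest to the mean of a random sample.

    `vectors` maps key -> vector. The final nearest-neighbour step is a linear
    scan here; against an HNSW index it would be one k=1 search.
    """
    rng = random.Random(seed)
    keys = list(vectors)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    dims = len(vectors[keys[0]])
    mean = [sum(vectors[k][d] for k in sample) / len(sample) for d in range(dims)]
    return min(keys, key=lambda k: math.dist(vectors[k], mean))

points = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [2.0, 2.0], 4: [1.1, 0.9]}
medoid_key = approximate_medoid(points)  # key 2 is closest to the mean
```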
- Always use K-means for splits (USearch clustering doesn't guarantee constraints)
- Add validation to skip splits that would create shards below merge_threshold
- Use efficient key slice access for index key iteration
- Change missing file log from warning to info (expected for new shards)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
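
The split validation described in this commit can be pictured as a size check on the proposed children. The sketch below assumes the split step yields one cluster label per vector; the function name and the requirement of at least two children are assumptions for illustration.

```python
from collections import Counter

def split_is_viable(labels, merge_threshold):
    """Return True only if every proposed child shard would hold at least
    `merge_threshold` vectors, so the split is not undone by the next merge.

    `labels` is the per-vector cluster assignment produced by the split step.
    """
    sizes = Counter(labels)
    # Assumption: a split into fewer than two children is not a split at all.
    return len(sizes) >= 2 and all(n >= merge_threshold for n in sizes.values())

balanced = split_is_viable([0] * 60 + [1] * 40, merge_threshold=10)  # True
lopsided = split_is_viable([0] * 95 + [1] * 5, merge_threshold=10)   # False
```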
Add compaction_utils.py with threshold-based compaction logic:
- get_storage_stats: DuckDB block statistics
- get_row_group_utilization: logical vs physical row counts
- should_compact: three-stage decision (free ratio, min size, utilization)
- compact_database: EXPORT/IMPORT/SWAP cycle

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
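
A three-stage decision of this kind might look like the sketch below. The stage order follows the commit message, but the direction of each comparison, the threshold values, and the parameter names are all assumptions; the function is named `should_compact_sketch` to keep it distinct from the real should_compact.

```python
def should_compact_sketch(free_blocks, total_blocks, block_size,
                          logical_rows, physical_rows,
                          min_free_ratio=0.3, min_bytes=50 * 1024 * 1024,
                          max_utilization=0.7):
    """Three-stage check; all thresholds are illustrative placeholders."""
    if total_blocks == 0:
        return False
    # Stage 1: is enough of the file free to be worth reclaiming?
    if free_blocks / total_blocks < min_free_ratio:
        return False
    # Stage 2: is the file large enough for compaction to matter?
    if total_blocks * block_size < min_bytes:
        return False
    # Stage 3: are row groups sparsely filled (many deleted rows)?
    utilization = logical_rows / physical_rows if physical_rows else 1.0
    return utilization < max_utilization

# Illustrative inputs: 1000 blocks of 262144 bytes, 40% row-group utilization.
worth_it = should_compact_sketch(500, 1000, 262144, 40_000, 100_000)
too_full = should_compact_sketch(10, 1000, 262144, 40_000, 100_000)
```

Ordering the cheap checks first means the utilization query only matters when the earlier stages have already passed.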
- Add --run-slow CLI flag and @slow marker in conftest.py
- Add generate_batch() method for bulk vector generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add TestMillionVectorStress: end-to-end stress test with production thresholds
- Add TestSplitMergeCycleProtection: verify splits create viable children
- Add batch_insert_embeddings_to_db helper and shard verification functions
- Rename TestNativeUSearchClustering to TestKMeansSplitClustering
- Add MockDBProvider.optimize() using shared compaction_utils

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add EMBEDDING_CHARS_PER_TOKEN (3) and LLM_CHARS_PER_TOKEN (4) constants.
All embedding providers now use 3 chars/token, all LLM providers use
estimate_tokens_llm() with 4 chars/token. Removes duplicate estimate_tokens_rough.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
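
A character-based estimate with these two constants could be as simple as the following. The constant names and `estimate_tokens_llm` come from the commit message; `estimate_tokens_embedding` is a hypothetical name, and rounding up is an assumption.

```python
import math

EMBEDDING_CHARS_PER_TOKEN = 3
LLM_CHARS_PER_TOKEN = 4

def estimate_tokens_embedding(text):
    """Hypothetical helper for embedding providers (3 chars per token)."""
    return math.ceil(len(text) / EMBEDDING_CHARS_PER_TOKEN)

def estimate_tokens_llm(text):
    """Name taken from the commit message; rounding up is an assumption."""
    return math.ceil(len(text) / LLM_CHARS_PER_TOKEN)

sample = "x" * 1200
embedding_tokens = estimate_tokens_embedding(sample)  # 400
llm_tokens = estimate_tokens_llm(sample)              # 300
```

The lower chars-per-token figure makes the embedding estimate the more conservative of the two for the same text.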
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ofriw and others added 5 commits January 19, 2026 18:19
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ction

- nprobe config ensures minimum shard exploration (auto: sqrt(shard_count))
- Radius-aware selection uses best-case similarity (centroid - radius)
- Split correction detects overlapping clusters and reassigns boundary vectors
- Tests for radius caching, nprobe guarantees, and split correction invariants

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
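
The nprobe default and radius-aware selection can be sketched together. This reads "best-case" in distance terms: a shard's best-case distance is its centroid distance minus its radius, floored at zero. That reading, the use of Euclidean distance, rounding sqrt up, and both function names are assumptions for illustration.

```python
import math

def auto_nprobe(shard_count, minimum=1):
    """Shards to probe: ceil(sqrt(shard_count)), at least `minimum`, at most shard_count."""
    if shard_count <= 0:
        return 0
    return min(shard_count, max(minimum, math.ceil(math.sqrt(shard_count))))

def rank_shards(query, shards):
    """Order shard ids by best-case distance to `query`.

    `shards` maps shard id -> (centroid, radius). A wide shard can outrank a
    tight one whose centroid is closer, because its members may reach nearer.
    """
    def best_case(sid):
        centroid, radius = shards[sid]
        return max(0.0, math.dist(query, centroid) - radius)
    return sorted(shards, key=best_case)

shards = {"tight": ([1.0, 0.0], 0.1), "wide": ([3.0, 0.0], 2.5)}
order = rank_shards([0.0, 0.0], shards)    # "wide" first: 0.5 vs 0.9
probes = order[:auto_nprobe(len(shards))]  # sqrt(2) rounds up to 2 shards
```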
@ofriw ofriw marked this pull request as ready for review January 21, 2026 12:44
ofriw and others added 20 commits January 21, 2026 14:58
# Conflicts:
#	chunkhound/parsers/svelte_parser.py
#	tests/unit/test_disk_usage_limit.py
…f copy

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Compute accurate radius via DuckDB full scan instead of sample-based
- Add incremental radius updates O(k) during insertion hot path
- Raise minimum nprobe from 1 to 2 for defense-in-depth
- Make K-means deterministic with evenly-spaced sampling + numpy assignment
- Build child indexes atomically before parent deletion to eliminate race window

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
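
An O(k) incremental radius update on insert can be sketched as a running maximum over the k new vectors. The sketch assumes the centroid stays fixed during the update and uses Euclidean distance; `update_radius` is a hypothetical name.

```python
import math

def update_radius(centroid, radius, new_vectors):
    """Grow a shard's radius to cover newly inserted vectors.

    Cost is one distance computation per new vector; existing vectors are not
    rescanned. The radius never shrinks here, so deletions would need a
    separate full recomputation.
    """
    for vec in new_vectors:
        radius = max(radius, math.dist(centroid, vec))
    return radius

grown = update_radius([0.0, 0.0], 1.0, [[0.5, 0.0], [3.0, 4.0]])  # 5.0
unchanged = update_radius([0.0, 0.0], 1.0, [[0.5, 0.0]])           # 1.0
```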
Track affected shards during batch deletion and trigger fix_pass when
shard falls below merge threshold. Defers rebuild until transaction
commits when inside a transaction.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add radius epsilon buffer for floating-point precision
- Add orthogonal centroid and controlled cluster generation
- Rewrite lifecycle tests with focused scenarios and invariant verification

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HNSW is approximate by design. Changed 100% recall assertions to 95%
threshold (industry standard). Added kmeans_random_state config for
reproducible tests. Removed mock-heavy sampling tests that were testing
implementation details rather than behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
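
For reference, a recall check against a 95% threshold compares the approximate result set with the exact top-k for the same query. The helper below is a generic sketch with a hypothetical name, not the assertion code used in the test suite.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)

exact = list(range(20))            # ground truth from a brute-force scan
approx = list(range(19)) + [99]    # approximate search missed one neighbour
recall = recall_at_k(approx, exact)
```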
Test was non-deterministic (missing kmeans_random_state) and redundant -
coverage already provided by test_split_at_threshold using shared fixture.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ofriw added a commit that referenced this pull request Mar 21, 2026
The DuckDB VSS extension segfaults when compacting HNSW indexes after
bulk deletions on Linux. Remove PRAGMA hnsw_compact_index calls and
the _compact_hnsw_indexes helper, keeping only CHECKPOINT for space
reclamation. The DuckDB HNSW index is being replaced entirely by #146.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Labels

enhancement New feature or request
