feat: Support Distributed Segmented BTREE Index Building #21
Merged
…e segment API
This change introduces parallel/distributed segmented BTREE index building in daft-lance. By building full independent index segments per worker and committing them atomically via the dataset's API, we resolve the issue where `describe_indices` failed on distributed indexes due to empty `index_details` metadata.
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
80b79e3 to bca48a3
Refactors the `type: ignore` comment in `daft_lance/lance_scalar_index.py` at line 101 to omit `arg-type`, as keyword arguments are now used and the warning is no longer emitted.
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
1c79c59 to c9a7f51
rchowell (Contributor) approved these changes on Jun 5, 2026:
Thanks for the contribution!
Summary
This PR implements Segmented BTREE Index support when building distributed scalar indexes with `daft-lance` using the new `segmented=True` parameter.
What's Changing
- … `SegmentedFragmentIndexHandler` that builds a fully formed independent `lance.Index` segment on each worker.
- … `commit_existing_index_segments()`.
- … `pylance` dependency constraint to `>=7.0.0` to support the required segment commitment APIs.
Why This is Better
- … `index_details`. Downstream tooling calling `describe_indices()` would crash. This change preserves proper protobuf-serialized `index_details` in the committed index segments.
- … `LanceOperation` crafting.
🚀 Architectural Benefits for Extremely Large Tables
Building indices on extremely large tables (billions of rows, hundreds of gigabytes to terabytes of data) runs into scaling and performance bottlenecks. Building the index as independent segments rather than as a single monolithic file has several advantages:
1. Zero Shuffling or Global Sorting Overhead
A BTree index is fundamentally a sorted data structure. Under the classic partitioned-and-merged approach, merging parallel partitions requires a global K-way sort merge across workers' outputs.
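To illustrate why independently sorted segments make the merge step unnecessary, here is a minimal plain-Python sketch. It does not use Lance or daft-lance: `build_segment`, `lookup_segmented`, and `lookup_merged` are hypothetical names, and sorted lists stand in for BTree segments. Each worker sorts only its own rows, and a lookup that probes every segment returns the same row ids as a lookup on a globally merged index.

```python
import bisect
import heapq


def build_segment(rows):
    """Sort one worker's (key, row_id) pairs locally; no other worker's data is needed."""
    ordered = sorted(rows)
    keys = [key for key, _ in ordered]
    row_ids = [row_id for _, row_id in ordered]
    return keys, row_ids


def lookup_segment(segment, value):
    """Binary-search one sorted segment for all row ids whose key equals `value`."""
    keys, row_ids = segment
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return row_ids[lo:hi]


def lookup_segmented(segments, value):
    """Segmented model: probe each independent segment and combine the hits."""
    hits = []
    for segment in segments:
        hits.extend(lookup_segment(segment, value))
    return sorted(hits)


def lookup_merged(segments, value):
    """Partitioned-and-merged model: a global K-way merge first, then one lookup."""
    merged = list(heapq.merge(*(zip(keys, row_ids) for keys, row_ids in segments)))
    keys = [key for key, _ in merged]
    row_ids = [row_id for _, row_id in merged]
    return sorted(lookup_segment((keys, row_ids), value))


# Three hypothetical workers, each holding unsorted (key, row_id) pairs.
worker_rows = [
    [(7, 0), (3, 1), (9, 2)],
    [(3, 3), (1, 4), (7, 5)],
    [(8, 6), (3, 7)],
]
segments = [build_segment(rows) for rows in worker_rows]
```

In this toy model the cost moves from build time to query time: nothing is merged while building, but a lookup visits every segment instead of a single merged structure.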
2. Elastic Horizontal Scalability (No Coordinator Bottlenecks)
In the partitioned-and-merged model, the merge phase runs entirely on the coordinator node.
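The contrast with a coordinator-side merge can be sketched as follows. This is an illustration under stated assumptions, not the daft-lance implementation: `SegmentDescriptor`, `build_segment_on_worker`, and `build_all` are hypothetical names, threads stand in for distributed workers, and every assigned fragment group is assumed to be non-empty. The point is that the coordinator only collects small per-segment descriptors and never touches the index data itself.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentDescriptor:
    """Small metadata record a worker returns; hypothetical, for illustration only."""
    fragment_ids: tuple
    num_rows: int
    min_key: int
    max_key: int


def build_segment_on_worker(fragment_ids, fragments):
    """Build one complete segment from the worker's own fragments."""
    keys = sorted(key for fid in fragment_ids for key in fragments[fid])
    # A real worker would write the finished segment to storage at this point.
    return SegmentDescriptor(tuple(fragment_ids), len(keys), keys[0], keys[-1])


def build_all(fragments, assignments, max_workers=4):
    """Coordinator: fan out the builds, then gather descriptors only."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda ids: build_segment_on_worker(ids, fragments), assignments)
        )


# Hypothetical dataset: fragment id -> key column values.
fragments = {0: [5, 1, 9], 1: [4, 4], 2: [8, 2], 3: [7]}
assignments = [[0, 1], [2, 3]]  # which fragments each worker indexes
descriptors = build_all(fragments, assignments)
```

The list of descriptors stands in for what would be handed to a single commit step, which in this PR is performed through `commit_existing_index_segments()`.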
3. Highly Cost-Effective Incremental Index Updates (Append-Friendly)
In real-world data lakes, datasets are rarely static; new data is appended constantly.
- … `commit_existing_index_segments`. Already-indexed data remains untouched, turning index maintenance costs from …
4. Resiliency & Fault-Tolerance
If a distributed job fails mid-way:
Test Plan
Ran pytest on the local suite:
- `test_lancedb_scalar_index.py`, verifying multiple segment generation, query correctness, index description, and fallback workflows.
🤖 Generated with Claude Code