MB-62182: Avoid re-training vector indexes during merge #2204

Merged: abhinavdangeti merged 5 commits into master from fastmerge on Mar 27, 2026

@Thejas-bhat (Member) commented Jun 17, 2025

  • The main purpose of this PR is to avoid unnecessarily re-training vector indexes during the merge process.
  • Going by the numbers, training needs roughly 156K vectors for a 1M-vector dataset, as per the recommendation: min_num_vectors_per_centroid * num_centroids = 39 * (4 * sqrt(1M)) = 156,000.
  • Data ingestion is now split into two phases. The first phase creates a centroid index using the Train() API, and the progress, in terms of the number of samples trained on, is recorded in bolt. The second phase is the normal indexing of data using the Batch() or Index() APIs.
  • Later, when vector indexes are merged, the merger uses the centroid index to merge the inverted lists (one per centroid) block-wise, without reconstructing the layout.
  • The feature can be enabled by passing a "vector_index_fast_merge": "true" key-value pair as part of kvconfig when creating or opening the index.
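
The sample-size arithmetic in the second bullet can be checked with a small Go sketch. The function names here are hypothetical and not part of bleve; the constant 39 and the 4 * sqrt(N) rule are taken from the description above, and rounding the centroid count to the nearest integer is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// minVectorsPerCentroid is the per-centroid minimum quoted in the PR
// description (39).
const minVectorsPerCentroid = 39

// numCentroids applies the 4 * sqrt(N) rule from the description.
// Rounding to the nearest integer is an assumption of this sketch.
func numCentroids(datasetSize int) int {
	return int(math.Round(4 * math.Sqrt(float64(datasetSize))))
}

// trainingSampleSize returns min_num_vectors_per_centroid * num_centroids.
func trainingSampleSize(datasetSize int) int {
	return minVectorsPerCentroid * numCentroids(datasetSize)
}

func main() {
	// 39 * 4 * sqrt(1,000,000) = 39 * 4000 = 156000
	fmt.Println(trainingSampleSize(1_000_000))
}
```

For a 1M-vector dataset this gives 4,000 centroids and 156,000 training vectors, matching the "roughly 156K" figure.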

@abhinavdangeti added this to the v2.6.0 milestone Jul 21, 2025
@Thejas-bhat changed the title from "WIP fast merge" to "[WIP] MB-62182: Avoid re-training vector indexes during merge" Jan 15, 2026
@Thejas-bhat force-pushed the fastmerge branch 3 times, most recently from c7a94a6 to 59a66df January 29, 2026 19:16
@Thejas-bhat marked this pull request as ready for review January 29, 2026 19:34
@Thejas-bhat changed the title from "[WIP] MB-62182: Avoid re-training vector indexes during merge" to "MB-62182: Avoid re-training vector indexes during merge" Jan 29, 2026
@Thejas-bhat moved this from Todo to In Progress in Fast Merge Jan 30, 2026
@Thejas-bhat force-pushed the fastmerge branch 4 times, most recently from 74072d3 to ce43814 February 6, 2026 00:24

@abhinavdangeti (Member) commented

@Thejas-bhat is this ready for review? Looks like it needs a rebase.

@Thejas-bhat force-pushed the fastmerge branch 3 times, most recently from 02394a9 to f4bae39 March 26, 2026 22:23

@coveralls commented

Coverage Status

coverage: 52.616% (-0.08%) from 52.693% when pulling 9d87598 on fastmerge into 6b72a24 on master.

@abhinavdangeti (Member) commented

@Thejas-bhat have you verified unit tests to work under -tags=vectors?

@Thejas-bhat (Member, Author) commented

> @Thejas-bhat have you verified unit tests to work under -tags=vectors?

yeah it passes on my local system

...
?   	github.com/blevesearch/bleve/v2/search/highlight/highlighter/ansi	[no test files]
?   	github.com/blevesearch/bleve/v2/search/highlight/highlighter/html	[no test files]
ok  	github.com/blevesearch/bleve/v2/search/highlight/highlighter/simple	(cached)
ok  	github.com/blevesearch/bleve/v2/search/query	(cached)
ok  	github.com/blevesearch/bleve/v2/search/scorer	(cached)
ok  	github.com/blevesearch/bleve/v2/search/searcher	5.670s
?   	github.com/blevesearch/bleve/v2/size	[no test files]
ok  	github.com/blevesearch/bleve/v2/test	20.841s
?   	github.com/blevesearch/bleve/v2/util	[no test files]

@abhinavdangeti merged commit 3d4b002 into master Mar 27, 2026
10 checks passed
@abhinavdangeti deleted the fastmerge branch March 27, 2026 18:47
@github-project-automation bot moved this from In Progress to Done in Fast Merge Mar 27, 2026
ns-codereview pushed a commit to couchbase/cbft that referenced this pull request Mar 28, 2026
          in a vector index

- First comes the sampling phase, where a faiss index is created
using the vectors randomly sampled from KV. The coarse quantizer of
this 'centroid index' is used in the later parts of the
index lifecycle.
- The data ingestion on all the bleveDests is blocked until this
centroid index is generated and streamed to all the partitions on that
node.
- After the sampling phase, the feeds are unlocked at the batch worker
level to ingest the data, which follows the existing codepath.
- blevesearch/bleve#2204

Change-Id: I1f7e35d6b519a4cf5a502a7246f407c33c9dfae5
Reviewed-on: https://review.couchbase.org/c/cbft/+/235240
Tested-by: <thejas.orkombu@couchbase.com>
Well-Formed: Build Bot <build@couchbase.com>
Reviewed-by: <thejas.orkombu@couchbase.com>
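
The commit message above describes ingestion being blocked until the centroid index has been generated, then unlocked at the batch worker level. A minimal Go sketch of that gating pattern follows. It is an illustration only: `trainingGate`, `markTrained`, and `runPipeline` are hypothetical names, not cbft or bleve APIs, and the actual training and streaming steps are reduced to a recorded event.

```go
package main

import (
	"fmt"
	"sync"
)

// trainingGate blocks batch workers until the centroid index is ready.
// Hypothetical type for illustration; not part of cbft or bleve.
type trainingGate struct {
	ready chan struct{}
	once  sync.Once
}

func newTrainingGate() *trainingGate {
	return &trainingGate{ready: make(chan struct{})}
}

// markTrained unblocks all waiting workers; safe to call more than once.
func (g *trainingGate) markTrained() { g.once.Do(func() { close(g.ready) }) }

// wait blocks until markTrained has been called.
func (g *trainingGate) wait() { <-g.ready }

// runPipeline starts one worker per batch, completes the (simulated)
// sampling phase, then opens the gate. It returns the recorded events.
func runPipeline(batches []string) []string {
	gate := newTrainingGate()
	var (
		mu     sync.Mutex
		events []string
		wg     sync.WaitGroup
	)
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}
	for _, b := range batches {
		wg.Add(1)
		go func(batch string) {
			defer wg.Done()
			gate.wait() // ingestion is blocked during the sampling phase
			record("ingest:" + batch)
		}(b)
	}
	record("trained") // stands in for building the centroid index
	gate.markTrained()
	wg.Wait()
	return events
}

func main() {
	events := runPipeline([]string{"a", "b", "c"})
	// The first event is always "trained"; ingest order may vary.
	fmt.Println(events[0], len(events))
}
```

Closing a channel releases every waiting worker at once, so no worker can record an ingest event before the training event is recorded.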