@sarahlwelton
Contributor

Per request, writing up new documentation to explain the architecture behind Vector Search indexes on FTS.

@sarahlwelton sarahlwelton marked this pull request as ready for review December 12, 2024 16:45

@abhinavdangeti abhinavdangeti left a comment

Nice work @sarahlwelton - I've some thoughts/suggestions here.

Vector Search specifically uses https://faiss.ai/index.html[FAISS^] indexes.
Any vectors inside your documents are indexed using FAISS, to create a new query vector that can be searched for similar vectors inside your Vector Search index.

Vector Search chooses the best https://github.com/facebookresearch/faiss/wiki/Faiss-indexes[FAISS index class^], or vector search algorithm, for your data, and automatically tunes parameters to provide a balance of recall and latency.

Should mention here the 3rd optimization we offer - memory_efficient as well?

Contributor Author

Good shout! Yes, let's add that in the next line.
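
For anyone reading along, here is a rough sketch of where that optimization setting would sit in a vector field mapping. It assumes the `vector_index_optimized_for` property accepts `recall`, `latency`, and `memory-efficient`; the helper name, field name, and dimension are made up for illustration, so check the index definition reference before relying on it.

```python
# Values assumed for the optimization setting; verify against the
# Search index definition reference before relying on them.
ALLOWED_OPTIMIZATIONS = {"recall", "latency", "memory-efficient"}


def vector_field_mapping(name, dims, similarity="dot_product", optimized_for="recall"):
    """Build a child-field mapping for a vector field (hypothetical helper)."""
    if optimized_for not in ALLOWED_OPTIMIZATIONS:
        raise ValueError(f"unknown optimization: {optimized_for!r}")
    return {
        "name": name,
        "type": "vector",
        "dims": dims,
        "similarity": similarity,
        "vector_index_optimized_for": optimized_for,
    }
```

Calling `vector_field_mapping("embedding", 1536, optimized_for="memory-efficient")` returns a dictionary that can be dumped to JSON and compared with a real index definition.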

Every cell has a centroid.
Every vector in the processed dataset is assigned to a cell that corresponds to its nearest centroid.

In an IVF index, a Vector Search first tries to find the cell that the query vector belongs to.

Hmm, this sounds a little misleading. Perhaps re-phrase it to "search tries to establish the centroid vector closest to the query vector", and then, based on the default nprobe (overridable) and the default max_codes (overridable), the search occurs over that many clusters/cells, starting from the identified closest centroid, to determine the top-k.

Thoughts?

Contributor Author

Sure - this was pulled from what Jon shared, so if there's a more accurate way to phrase it, I'm all for it.
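
To make the re-phrasing in this thread concrete, here is a minimal sketch of an IVF-style search in plain Python. It is not the FAISS implementation: `build_cells` and `ivf_search` are hypothetical names, distances are plain L2, and `nprobe` and `max_codes` are raw counts, not the percentages Search exposes.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_cells(vectors, centroids):
    """Assign every vector to the cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vec in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        cells[nearest].append(vec)
    return cells


def ivf_search(query, centroids, cells, k, nprobe, max_codes=None):
    """Scan the nprobe cells whose centroids are closest to the query,
    stopping once max_codes vectors have been compared (None = no cap)."""
    probe_order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates, scanned = [], 0
    for cell_id in probe_order[:nprobe]:
        for vec in cells[cell_id]:
            if max_codes is not None and scanned >= max_codes:
                break
            candidates.append((l2(query, vec), vec))
            scanned += 1
    candidates.sort(key=lambda c: c[0])
    return [vec for _, vec in candidates[:k]]
```

With centroids `(0, 0)` and `(10, 10)`, vectors `(4, 4)`, `(9, 9)`, `(1, 0)`, and query `(6, 6)`, probing one cell returns `(9, 9)` even though `(4, 4)` is closer; probing both cells finds `(4, 4)`. That is the non-exhaustive behaviour in miniature.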

In an IVF index, a Vector Search first tries to find the cell that the query vector belongs to.
After it knows the cell to search, Vector Search uses another algorithm to find out the exact vector that's closest to the query vector in that cell.

The result of an IVF index search can be less accurate, as the nearest vector to a query vector can be in a different cell than the chosen cell.

Let's omit this "can be less accurate" bit because it sounds alarming, especially because the user has no control over this. You can say that search over centroid indexes is non-exhaustive, and the reader can infer the rest.

Contributor Author

Valid. Changing to "IVF index searches are not exhaustive searches."

After it knows the cell to search, Vector Search uses another algorithm to find out the exact vector that's closest to the query vector in that cell.

The result of an IVF index search can be less accurate, as the nearest vector to a query vector can be in a different cell than the chosen cell.
You can increase accuracy by changing the *nprobe* parameter when you xref:fine-tune-vector-search.adoc[fine tune your Vector Search queries].

max_nprobe_pct (because the user has no idea about the number of centroids chosen for the index - depends on our batching and merging).

There's also max_codes_pct which defaults to 100 - meaning all vectors within the chosen centroid clusters will be scanned. They can reduce this to lower latency at the cost of recall.

Contributor Author

Can you clarify for me? Is it "max_nprobe_pct" and "max_codes_pct" or "ivf_nprobe_pct" and "ivf_max_codes_pct"?

They didn't let me write the docs for that feature, so I was relying on another writer - and they have ivf, not max.

Ah shoot, sorry yes they're ivf_nprobe_pct and ivf_max_codes_pct applicable to IVF indexes only.
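
For reference, a sketch of how these two settings might be attached to a request. The `params` object inside `knn` and the accepted range are assumptions on my part, not something confirmed in this thread, and `probes_for` is a purely illustrative helper showing why a percentage is friendlier than a raw nprobe count when the user cannot see how many centroids the index has.

```python
import math


def knn_request(field, vector, k, ivf_nprobe_pct=None, ivf_max_codes_pct=None):
    """Build a Search request body with optional IVF tuning (assumed shape)."""
    params = {}
    for name, value in (("ivf_nprobe_pct", ivf_nprobe_pct),
                        ("ivf_max_codes_pct", ivf_max_codes_pct)):
        if value is not None:
            # Assumed range: a percentage greater than 0 and at most 100.
            if not 0 < value <= 100:
                raise ValueError(f"{name} must be in (0, 100], got {value}")
            params[name] = value
    knn = {"field": field, "vector": vector, "k": k}
    if params:
        knn["params"] = params
    return {"query": {"match_none": {}}, "knn": [knn]}


def probes_for(pct, num_centroids):
    """Illustrative only: turn a percentage into a cell count (at least 1)."""
    return max(1, math.ceil(num_centroids * pct / 100))
```

Under this sketch, 10 percent of an index with 200 centroids means probing 20 cells, and lowering `ivf_max_codes_pct` below 100 caps how many vectors inside those cells get compared.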

Using https://grpc.io/[gRPC^], the coordinating node scatters the request to all other partitions for the Search or Vector Search index in the request across other nodes.
The coordinating node applies filters to the results received from the other partitions, and returns the final result set.
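
The scatter/gather step described above can be sketched roughly as follows. This is only an illustration: `search_fn` stands in for the gRPC call to a remote partition, threads stand in for network fan-out, and filtering is left out.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(partitions, search_fn, k):
    """Fan the request out to every partition, then merge the partial top-k lists.

    search_fn(partition, k) returns a list of (score, doc_id) tuples.
    """
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: search_fn(p, k), partitions))
    merged = (hit for hits in partials for hit in hits)
    # Keep the k highest-scoring hits across all partitions, best first.
    return heapq.nlargest(k, merged, key=lambda hit: hit[0])
```

Each partition only needs to return its own top k, because no hit outside a partition's local top k can appear in the global top k.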

Results are scored, and based on the xref:search:search-request-params.adoc#sort[Sort Object] provided in the Search request, returned in a list.

Should mention tf-idf and vector distance scores are summed during hybrid search and the user can influence the scoring with the boost setting.
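
A toy illustration of that point, assuming the final score is a plain sum of the boosted text score and the boosted vector score. The real scoring code may normalize or weight things differently, and the function names here are hypothetical.

```python
def hybrid_score(text_score, vector_score, text_boost=1.0, knn_boost=1.0):
    """Sum of the boosted tf-idf score and the boosted vector score (assumed formula)."""
    return text_boost * text_score + knn_boost * vector_score


def rank(hits, text_boost=1.0, knn_boost=1.0):
    """hits maps doc_id -> (text_score, vector_score); returns doc ids, best first."""
    scored = {
        doc: hybrid_score(t, v, text_boost, knn_boost)
        for doc, (t, v) in hits.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

With hits `{"a": (2.0, 0.1), "b": (0.5, 0.9)}`, the default boosts rank `a` first; raising `knn_boost` to 3.0 flips the order, which is the kind of influence the boost setting gives the user.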

Contributor

@Rebecca-Martinez007 Rebecca-Martinez007 left a comment

Great page Sarah 🚀
I made a few small suggestions based on my thoughts while reading. I left one of them as a comment to open a discussion.

{description}

A Vector Search index still relies on <<sync,>> and uses <<segments,>> to manage merging and persisting data to disk in your cluster.
All changes from DCP and the Data Service are introduced to a Search index in batches, which are further managed by segments.
Contributor

Suggested change
- All changes from DCP and the Data Service are introduced to a Search index in batches, which are further managed by segments.
+ All changes from Data Change Protocol (DCP) and the Data Service are introduced to a Search index in batches, which are further managed by segments.

Contributor Author

Database Change Protocol, as written below, but thanks :)

IVF index searches are not exhaustive searches.
You can increase accuracy by changing the `ivf_nprobe_pct` or `ivf_max_codes_pct` parameters when you xref:fine-tune-vector-search.adoc[fine tune your Vector Search queries].

Larger IVF indexes automatically train to learn the data distribution of your vectors, and the centroids of cells in your dataset.
Contributor

Writing as a comment because I might be wrong. I find that "train to learn" is hard to understand. Is it an industry term?

I was going to suggest changing "train to learn" to "adapt to", however, considering the following sentences, "train" might be the right term to use here.

Another suggestion: "Larger IVF indexes are automatically trained to learn the data distribution of your vectors, and the centroids of cells in your dataset."
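
On the terminology: "train" is the usual word here, since FAISS builds IVF cells by running k-means clustering over the vectors. A minimal sketch of that idea, with hypothetical names and fixed starting centroids so the result is repeatable. How Search samples vectors or how many iterations it runs is not covered in this PR.

```python
def sq_dist(a, b):
    """Squared Euclidean distance; enough for nearest-centroid comparisons."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(component) / len(points) for component in zip(*points))


def train_centroids(sample, centroids, iterations=10):
    """Lloyd's k-means: assign vectors to the nearest centroid, then move each
    centroid to the mean of its cell. Empty cells keep their old centroid."""
    for _ in range(iterations):
        cells = [[] for _ in centroids]
        for vec in sample:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))
            cells[nearest].append(vec)
        centroids = [mean(cell) if cell else old for cell, old in zip(cells, centroids)]
    return centroids
```

So "trained to learn the data distribution" boils down to the centroids drifting toward wherever the vectors actually cluster.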

sarahlwelton and others added 3 commits January 15, 2025 11:15
Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>
Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>
Contributor

@Rebecca-Martinez007 Rebecca-Martinez007 left a comment

LGTM 🚀

@sarahlwelton sarahlwelton merged commit a60131f into release/7.6 Jan 16, 2025
@sarahlwelton sarahlwelton deleted the DOC-12430-search-index-arch-vs-server branch January 16, 2025 18:33
sarahlwelton added a commit that referenced this pull request Apr 17, 2025
* [DOC-12430] Adding anchor to child-field-options-reference

First draft of vector-search-index-architecture

* [DOC-12430] Add entry to nav.adoc

* [DOC-12430] Elaboration on when each index type is used + other fixes

* [DOC-12430] Tying processing in with scoring.

* [DOC-12430] Addressing some comments from SME review

* Update modules/vector-search/pages/vector-search-index-architecture.adoc

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>

* Update modules/vector-search/pages/vector-search-index-architecture.adoc

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>

* [DOC-12430] Changes/suggestions from peer review

---------

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>
@simon-dew simon-dew added fts Full-Text Search 8.0 Couchbase Server 8.0 labels Sep 9, 2025
@sarahlwelton sarahlwelton removed the 8.0 Couchbase Server 8.0 label Sep 9, 2025