[DOC-12430] Vector Search Index Architecture #308

Merged

sarahlwelton merged 10 commits into release/7.6 from DOC-12430-search-index-arch-vs-server

Jan 16, 2025

Contributor

sarahlwelton commented Dec 12, 2024

Per request, writing up new documentation to explain the architecture behind Vector Search indexes on FTS.

sarahlwelton added 5 commits

December 11, 2024 12:00


          [DOC-12430] Adding anchor to child-field-options-reference

2ce50d7

First draft of vector-search-index-architecture


          [DOC-12430] Add entry to nav.adoc


          Merge branch 'release/7.6' into DOC-12430-search-index-arch-vs-server

8a63caf


          [DOC-12430] Elaboration on when each index type is used + other fixes

f43b526


          [DOC-12430] Tying processing in with scoring.

fa6403f

sarahlwelton marked this pull request as ready for review

December 12, 2024 16:45


          Merge remote-tracking branch 'origin/release/7.6' into DOC-12430-search-index-arch-vs-server

788ef26

abhinavdangeti suggested changes

abhinavdangeti left a comment

Nice work @sarahlwelton - I've some thoughts/suggestions here.

modules/vector-search/pages/vector-search-index-architecture.adoc

+              Vector Search specifically uses https://faiss.ai/index.html[FAISS^] indexes.
+              Any vectors inside your documents are indexed using FAISS, to create a new query vector that can be searched for similar vectors inside your Vector Search index.
+              Vector Search chooses the best https://github.com/facebookresearch/faiss/wiki/Faiss-indexes[FAISS index class^], or vector search algorithm, for your data, and automatically tunes parameters to provide a balance of recall and latency.

abhinavdangeti Dec 12, 2024

Should mention here the 3rd optimization we offer - memory_efficient as well?

Contributor Author

sarahlwelton Dec 12, 2024

Good shout! Yes, let's add that in the next line.

modules/vector-search/pages/vector-search-index-architecture.adoc Outdated

+              Every cell has a centroid.
+              Every vector in the processed dataset is assigned to a cell that corresponds to its nearest centroid.
+              In an IVF index, a Vector Search first tries to find the cell that the query vector belongs to.

abhinavdangeti Dec 12, 2024

Hmm, sounds a little misleading. Perhaps rephrase this to "search tries to establish a centroid vector closest to the query vector", and then, based on the default nprobe (overridable) and default max_codes (overridable), the search occurs over that many clusters/cells from the identified closest centroid to determine the top-k.

Thoughts?
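For illustration, the behavior described in this comment can be sketched as a toy IVF search in plain Python. This is a minimal sketch, not the FAISS or Search Service implementation: the function names (`build_ivf`, `ivf_search`), the absolute `nprobe` and `max_codes` arguments, and the sample data are all assumptions made for the example.

```python
def dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector id to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))
        cells[nearest].append(vid)
    return cells


def ivf_search(query, vectors, centroids, cells, k=1, nprobe=1, max_codes=None):
    """Scan only the nprobe cells whose centroids are closest to the query.

    max_codes (hypothetical, absolute count) caps how many vectors are scanned.
    """
    ranked_cells = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = []
    for c in ranked_cells[:nprobe]:
        for vid in cells[c]:
            if max_codes is not None and len(candidates) >= max_codes:
                break
            candidates.append(vid)
    # Exact distance comparison, but only over the scanned candidates.
    candidates.sort(key=lambda vid: dist(query, vectors[vid]))
    return candidates[:k]


# Hypothetical data: two cells, with the query sitting near the cell boundary.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(-1.0, 0.0), (1.0, 0.0), (5.5, 0.0), (11.0, 0.0)]
cells = build_ivf(vectors, centroids)
query = (4.5, 0.0)

one_cell = ivf_search(query, vectors, centroids, cells, k=1, nprobe=1)
two_cells = ivf_search(query, vectors, centroids, cells, k=1, nprobe=2)
print(one_cell, two_cells)
```

With one probed cell the search returns vector 1, because the true nearest neighbor (vector 2) lives in the neighboring cell. Probing both cells finds it, which is the sense in which the search is non-exhaustive.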

Contributor Author

sarahlwelton Dec 12, 2024

Sure - this was pulled from what Jon shared, so if there's a more accurate way to phrase it, I'm all for it.

modules/vector-search/pages/vector-search-index-architecture.adoc Outdated

+              In an IVF index, a Vector Search first tries to find the cell that the query vector belongs to.
+              After it knows the cell to search, Vector Search uses another algorithm to find out the exact vector that's closest to the query vector in that cell.
+              The result of an IVF index search can be less accurate, as the nearest vector to a query vector can be in a different cell than the chosen cell.

abhinavdangeti Dec 12, 2024

Let's omit this "can be less accurate" bit because it sounds alarming, especially because the user has no control over this. You can say that search over centroid indexes is non-exhaustive, and the reader can infer the rest.

Contributor Author

sarahlwelton Dec 12, 2024

Valid. Changing to "IVF index searches are not exhaustive searches."

modules/vector-search/pages/vector-search-index-architecture.adoc Outdated

+              After it knows the cell to search, Vector Search uses another algorithm to find out the exact vector that's closest to the query vector in that cell.
+              The result of an IVF index search can be less accurate, as the nearest vector to a query vector can be in a different cell than the chosen cell.
+              You can increase accuracy by changing the *nprobe* parameter when you xref:fine-tune-vector-search.adoc[fine tune your Vector Search queries].

abhinavdangeti Dec 12, 2024

max_nprobe_pct (because the user has no idea about the number of centroids chosen for the index - depends on our batching and merging).

There's also max_codes_pct, which defaults to 100, meaning all vectors within the chosen centroid clusters will be scanned. They can reduce this to lower latency at the cost of recall.

Contributor Author

sarahlwelton Dec 12, 2024

Can you clarify for me? Is it "max_nprobe_pct" and "max_codes_pct" or "ivf_nprobe_pct" and "ivf_max_codes_pct"?

They didn't let me write the docs for that feature, so I was relying on another writer - and they have ivf, not max.

abhinavdangeti Dec 12, 2024

Ah shoot, sorry yes they're ivf_nprobe_pct and ivf_max_codes_pct applicable to IVF indexes only.
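As a rough sketch of how a reader might pass these two parameters, the Python below builds a Search request body. Only the names `ivf_nprobe_pct` and `ivf_max_codes_pct` come from this thread. Their placement in a `params` object inside each `knn` entry, the helper `build_knn_request`, the field name `embedding`, and the sample values are assumptions for illustration, so check the fine-tuning page for the authoritative syntax.

```python
import json


def build_knn_request(field, vector, k=3, ivf_nprobe_pct=None, ivf_max_codes_pct=None):
    """Build a Search request body with optional IVF tuning parameters (hypothetical helper)."""
    params = {}
    if ivf_nprobe_pct is not None:
        params["ivf_nprobe_pct"] = ivf_nprobe_pct
    if ivf_max_codes_pct is not None:
        params["ivf_max_codes_pct"] = ivf_max_codes_pct

    knn = {"field": field, "vector": vector, "k": k}
    if params:
        # Assumed placement: tuning parameters nested under "params".
        knn["params"] = params
    return {"query": {"match_none": {}}, "knn": [knn]}


# Hypothetical field name and values, for illustration only.
request = build_knn_request(
    "embedding", [0.1, 0.2, 0.3], k=2, ivf_nprobe_pct=10, ivf_max_codes_pct=50
)
print(json.dumps(request, indent=2))
```

Expressing both values as percentages matches the point made above: the user does not know how many centroids the index ended up with, so an absolute count would be hard to choose.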

modules/vector-search/pages/vector-search-index-architecture.adoc

+              Using https://grpc.io/[gRPC^], the coordinating node scatters the request to all other partitions for the Search or Vector Search index in the request across other nodes.
+              The coordinating node applies filters to the results received from the other partitions, and returns the final result set.
+              Results are scored, and based on the xref:search:search-request-params.adoc#sort[Sort Object] provided in the Search request, returned in a list.

abhinavdangeti Dec 12, 2024

Should mention tf-idf and vector distance scores are summed during hybrid search and the user can influence the scoring with the boost setting.
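A small sketch of the scoring behavior described in this comment, in Python. The thread only states that the tf-idf and vector distance scores are summed and that boost influences the result. Treating boost as a simple multiplier on each score, along with the function names and sample scores, is an assumption made for illustration.

```python
def hybrid_score(text_score, vector_score, text_boost=1.0, vector_boost=1.0):
    """Sum the text (tf-idf) score and the vector score.

    Assumption: each boost acts as a plain multiplier on its score.
    """
    return text_boost * text_score + vector_boost * vector_score


def rank(hits, text_boost=1.0, vector_boost=1.0):
    """Return document ids ordered by descending hybrid score."""
    scored = {
        doc_id: hybrid_score(text, vector, text_boost, vector_boost)
        for doc_id, (text, vector) in hits.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical (text_score, vector_score) pairs for two documents.
hits = {"doc_a": (2.0, 0.25), "doc_b": (0.5, 1.0)}

default_order = rank(hits)
boosted_order = rank(hits, vector_boost=4.0)
print(default_order, boosted_order)
```

With equal boosts, `doc_a` ranks first on the strength of its text score. Boosting the vector side moves `doc_b` to the top, which is the kind of influence the boost setting gives the user.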


          [DOC-12430] Addressing some comments from SME review

e6d6e8f

Rebecca-Martinez007 requested changes

Contributor

Rebecca-Martinez007 left a comment

Great page Sarah 🚀
I made a few small suggestions based off my thoughts while reading. One of them I left as a comment and opened the discussion that way.

modules/vector-search/pages/vector-search-index-architecture.adoc Outdated

+              {description}
+              A Vector Search index still relies on <<sync,>> and uses <<segments,>> to manage merging and persisting data to disk in your cluster.
+              All changes from DCP and the Data Service are introduced to a Search index in batches, which are further managed by segments.

Contributor

Rebecca-Martinez007 Jan 14, 2025

Suggested change

-              All changes from DCP and the Data Service are introduced to a Search index in batches, which are further managed by segments.
+              All changes from Data Change Protocol (DCP) and the Data Service are introduced to a Search index in batches, which are further managed by segments.

Contributor Author

sarahlwelton Jan 15, 2025

Database Change Protocol, as written below, but thanks :)

modules/vector-search/pages/vector-search-index-architecture.adoc Outdated

+              IVF index searches are not exhaustive searches.
+              You can increase accuracy by changing the `max_nprobe_pct` parameter or `max_codes_pct` when you xref:fine-tune-vector-search.adoc[fine tune your Vector Search queries].
+              Larger IVF indexes automatically train to learn the data distribution of your vectors, and the centroids of cells in your dataset.

Contributor

Rebecca-Martinez007 Jan 14, 2025

Writing as a comment because I might be wrong. I find that "train to learn" is hard to understand. Is it an industry term?

I was going to suggest changing "train to learn" to "adapt to", however, considering the following sentences, "train" might be the right term to use here.

Another suggestion: "Larger IVF indexes are automatically trained to learn the data distribution of your vectors, and the centroids of cells in your dataset."

sarahlwelton and others added 3 commits

January 15, 2025 11:15


          Update modules/vector-search/pages/vector-search-index-architecture.adoc

625f294

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>


          Update modules/vector-search/pages/vector-search-index-architecture.adoc

903ce18

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>


          [DOC-12430] Changes/suggestions from peer review

be53aa7

sarahlwelton requested a review from Rebecca-Martinez007

January 15, 2025 16:19

Rebecca-Martinez007 approved these changes

Contributor

Rebecca-Martinez007 left a comment

LGTM 🚀

sarahlwelton merged commit a60131f into release/7.6

sarahlwelton deleted the DOC-12430-search-index-arch-vs-server branch

January 16, 2025 18:33

sarahlwelton added a commit that referenced this pull request


          [DOC-12430] Vector Search Index Architecture (#308)

8d885a2

* [DOC-12430] Adding anchor to child-field-options-reference

First draft of vector-search-index-architecture

* [DOC-12430] Add entry to nav.adoc

* [DOC-12430] Elaboration on when each index type is used + other fixes

* [DOC-12430] Tying processing in with scoring.

* [DOC-12430] Addressing some comments from SME review

* Update modules/vector-search/pages/vector-search-index-architecture.adoc

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>

* Update modules/vector-search/pages/vector-search-index-architecture.adoc

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>

* [DOC-12430] Changes/suggestions from peer review

---------

Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>

sarahlwelton added a commit that referenced this pull request


          [DOC-12430] Vector Search Index Architecture (#308)

26ebb7f

simon-dew added fts 8.0 labels

sarahlwelton removed the 8.0 label

Labels

fts