[DOC-12430] Vector Search Index Architecture #308
First draft of vector-search-index-architecture
…ch-index-arch-vs-server
abhinavdangeti
left a comment
Nice work @sarahlwelton - I've some thoughts/suggestions here.
Vector Search specifically uses https://faiss.ai/index.html[FAISS^] indexes.
Any vectors inside your documents are indexed using FAISS, to create a new query vector that can be searched for similar vectors inside your Vector Search index.

Vector Search chooses the best https://github.com/facebookresearch/faiss/wiki/Faiss-indexes[FAISS index class^], or vector search algorithm, for your data, and automatically tunes parameters to provide a balance of recall and latency.
Should mention here the 3rd optimization we offer - memory_efficient as well?
Good shout! Yes, let's add that in the next line.
Every cell has a centroid.
Every vector in the processed dataset is assigned to a cell that corresponds to its nearest centroid.

In an IVF index, a Vector Search first tries to find the cell that the query vector belongs to.
Hmm, this sounds a little misleading. Perhaps rephrase it to: "search tries to establish the centroid vector closest to the query vector", and then, based on the default nprobe (overridable) and the default max_codes (overridable), the search occurs over that many clusters/cells, starting from the identified closest centroid, to determine the top-k.
Thoughts?
Sure - this was pulled from what Jon shared, so if there's a more accurate way to phrase it, I'm all for it.
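To make the behavior discussed in this thread concrete, here is a minimal pure-Python sketch of an IVF-style search. It is a toy illustration, not the FAISS or Search Service implementation: the function names, the `nprobe` and `max_codes` arguments as absolute counts, and the sample data are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_cells(vectors, centroids):
    """Assign every vector ID to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(vec, centroids[c]))
        cells[nearest].append(vec_id)
    return cells


def ivf_search(query, vectors, centroids, cells, k, nprobe, max_codes=None):
    """Return the IDs of the top-k vectors found in the nprobe closest cells."""
    # Rank cells by how close their centroid is to the query vector.
    ranked = sorted(range(len(centroids)),
                    key=lambda c: squared_distance(query, centroids[c]))
    # Collect candidates from the nprobe closest cells only (non-exhaustive).
    candidates = []
    for cell in ranked[:nprobe]:
        candidates.extend(cells[cell])
    # Optionally cap how many candidate vectors are scanned (hypothetical max_codes).
    if max_codes is not None:
        candidates = candidates[:max_codes]
    candidates.sort(key=lambda i: squared_distance(query, vectors[i]))
    return candidates[:k]


# Hypothetical sample data: two cells on a line.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (9.0, 0.0)]
cells = build_cells(vectors, centroids)
query = (4.9, 0.0)

one_cell = ivf_search(query, vectors, centroids, cells, k=2, nprobe=1)
two_cells = ivf_search(query, vectors, centroids, cells, k=2, nprobe=2)
```

With `nprobe=1` the search only scans the cell of the closest centroid and misses vector 2, which sits just across the cell boundary. Probing both cells finds it, which is the sense in which an IVF search is non-exhaustive.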
In an IVF index, a Vector Search first tries to find the cell that the query vector belongs to.
After it knows the cell to search, Vector Search uses another algorithm to find out the exact vector that's closest to the query vector in that cell.

The result of an IVF index search can be less accurate, as the nearest vector to a query vector can be in a different cell than the chosen cell.
Let's omit this "can be less accurate" bit because it sounds alarming, especially because the user has no control over this. You can say that search over centroid indexes is non-exhaustive, and the reader can infer the rest.
Valid. Changing to "IVF index searches are not exhaustive searches."
After it knows the cell to search, Vector Search uses another algorithm to find out the exact vector that's closest to the query vector in that cell.

The result of an IVF index search can be less accurate, as the nearest vector to a query vector can be in a different cell than the chosen cell.
You can increase accuracy by changing the *nprobe* parameter when you xref:fine-tune-vector-search.adoc[fine tune your Vector Search queries].
max_nprobe_pct (because the user has no idea about the number of centroids chosen for the index - depends on our batching and merging).
There's also max_codes_pct, which defaults to 100, meaning all vectors within the chosen centroid clusters will be scanned. They can reduce this to lower latency at the cost of recall.
Can you clarify for me? Is it "max_nprobe_pct" and "max_codes_pct" or "ivf_nprobe_pct" and "ivf_max_codes_pct"?
They didn't let me write the docs for that feature, so I was relying on another writer - and they have ivf, not max.
Ah shoot, sorry. Yes, they're ivf_nprobe_pct and ivf_max_codes_pct, applicable to IVF indexes only.
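As a rough illustration of why these settings are percentages, the sketch below converts a percentage into an absolute count once the total is known. The helper name `pct_to_count`, the round-up rule, and the sample totals are assumptions for the example; the thread only establishes that users cannot know the centroid count and that ivf_max_codes_pct defaults to 100.

```python
import math


def pct_to_count(pct, total):
    """Convert a percentage setting into an absolute count (hypothetical rule: round up, minimum 1)."""
    if not 0 < pct <= 100:
        raise ValueError("pct must be greater than 0 and at most 100")
    return max(1, math.ceil(total * pct / 100))


# Hypothetical index with 200 centroids and 1000 vectors in the probed cells.
cells_to_probe = pct_to_count(5, 200)       # 5% of the centroids
vectors_to_scan = pct_to_count(100, 1000)   # the default of 100 scans every vector
```

Expressing the setting as a percentage keeps a query meaningful even when the number of centroids changes as the index is batched and merged.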
Using https://grpc.io/[gRPC^], the coordinating node scatters the request to all other partitions for the Search or Vector Search index in the request across other nodes.
The coordinating node applies filters to the results received from the other partitions, and returns the final result set.

Results are scored, and based on the xref:search:search-request-params.adoc#sort[Sort Object] provided in the Search request, returned in a list.
Should mention tf-idf and vector distance scores are summed during hybrid search and the user can influence the scoring with the boost setting.
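A small sketch of the summed scoring described in this comment follows. It assumes, purely for illustration, that a boost acts as a multiplier on each score before the two are added; the function name, the boost arguments, and the sample scores are hypothetical and not taken from the Search Service.

```python
def hybrid_score(text_score, vector_score, text_boost=1.0, vector_boost=1.0):
    """Sum a tf-idf style text score and a vector score, each scaled by its boost."""
    return text_boost * text_score + vector_boost * vector_score


def rank(docs, **boosts):
    """Return document IDs ordered from highest to lowest hybrid score."""
    return sorted(docs, key=lambda d: hybrid_score(*docs[d], **boosts), reverse=True)


# Hypothetical (text_score, vector_score) pairs for two documents.
docs = {"a": (2.0, 0.5), "b": (1.0, 0.9)}

default_order = rank(docs)                    # "a" wins on its text score
boosted_order = rank(docs, vector_boost=4.0)  # boosting the vector side favors "b"
```

The point of the example is only that changing a boost can reorder hybrid results, which is the behavior the page should mention.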
Rebecca-Martinez007
left a comment
Great page Sarah 🚀
I made a few small suggestions based off my thoughts while reading. One of them I left as a comment and opened the discussion that way.
modules/vector-search/pages/vector-search-index-architecture.adoc
{description}

A Vector Search index still relies on <<sync,>> and uses <<segments,>> to manage merging and persisting data to disk in your cluster.
All changes from DCP and the Data Service are introduced to a Search index in batches, which are further managed by segments.
- All changes from DCP and the Data Service are introduced to a Search index in batches, which are further managed by segments.
+ All changes from Data Change Protocol (DCP) and the Data Service are introduced to a Search index in batches, which are further managed by segments.
Database Change Protocol, as written below, but thanks :)
modules/vector-search/pages/vector-search-index-architecture.adoc
IVF index searches are not exhaustive searches.
You can increase accuracy by changing the `max_nprobe_pct` parameter or `max_codes_pct` when you xref:fine-tune-vector-search.adoc[fine tune your Vector Search queries].

Larger IVF indexes automatically train to learn the data distribution of your vectors, and the centroids of cells in your dataset.
Writing as a comment because I might be wrong. I find that "train to learn" is hard to understand. Is it an industry term?
I was going to suggest changing "train to learn" to "adapt to", however, considering the following sentences, "train" might be the right term to use here.
Another suggestion: "Larger IVF indexes are automatically trained to learn the data distribution of your vectors, and the centroids of cells in your dataset."
Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>
Rebecca-Martinez007
left a comment
LGTM 🚀
* [DOC-12430] Adding anchor to child-field-options-reference First draft of vector-search-index-architecture
* [DOC-12430] Add entry to nav.adoc
* [DOC-12430] Elaboration on when each index type is used + other fixes
* [DOC-12430] Tying processing in with scoring.
* [DOC-12430] Addressing some comments from SME review
* Update modules/vector-search/pages/vector-search-index-architecture.adoc Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>
* Update modules/vector-search/pages/vector-search-index-architecture.adoc Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>
* [DOC-12430] Changes/suggestions from peer review
--------- Co-authored-by: Rebecca Martinez <167447972+Rebecca-Martinez007@users.noreply.github.com>
Per request, writing up new documentation to explain the architecture behind Vector Search indexes on FTS.