New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making k and num_candidates optional for knn search #101209
Conversation
Documentation preview: |
int adjustedK = k != null ? k : size; | ||
int adjustedNumCands = numCands != null | ||
? numCands | ||
: Math.min(Math.max(NUM_CANDS_DEFAULT, (int) NUM_CANDS_MULTIPLICATIVE_FACTOR * adjustedK), NUM_CANDS_LIMIT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still TBD
List<DfsKnnResults> mergedResults = new ArrayList<>(source.knnSearch().size()); | ||
for (int i = 0; i < source.knnSearch().size(); i++) { | ||
Integer topK = source.knnSearch().get(i).k(); | ||
int size = source.size() == -1 ? DEFAULT_SIZE : source.size(); | ||
TopDocs mergedTopDocs = TopDocs.merge(topK != null ? topK : size, topDocsLists.get(i).toArray(new TopDocs[0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could handle this within the SearchSourceBuilder directly? When parsing (thus creating via an API), parsing the knn
section or size
section could set the k
for the knn
there.
Similarly if somebody calls SearchSourceBuilder#setSize(int)
that could propagate down to the knn
object if it exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking into it initially, but I wasn't impressed with a couple of approaches I've tried, as we were introducing dependencies between different component in the SearchSourceBuilder
- especially when there's no guaranty of order. On the other hand, as things are now, if someone calls SearchSourceBuilder#setSize
at a later point, then they would also have the responsibility of setting up & updating k
which is not right 🤔 (similarly for org.elasticsearch.search.DefaultSearchContext#size(int)
)
Will look into it a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the functionality to go through the SearchSourceBuilder
in 9092f37. Haven't set it in the SearchSourceBuilder#size
method, but rather when we parse the initial size
param (iff present).
Happy to update it as well, but was wondering which cases related to knn
search we'd capture (was looking at the usages but I'm not very familiar with all cases).
this.k = in.readOptionalVInt(); | ||
this.numCands = in.readOptionalVInt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you are ready for further review, etc. all these will have to be backwards compatible. Meaning looking at the in.getTransportVersion()
and verifying its at least the version where this change is introduced.
Similar logic down in writeTo(StreamOutput)
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads up @benwtrent ! Will update accordingly.
Hi @pmpailis, I've created a changelog YAML for you. |
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff!
...c/yamlRestTest/resources/rest-api-spec/test/search.vectors/150_knn_search_missing_params.yml
Outdated
Show resolved
Hide resolved
...c/yamlRestTest/resources/rest-api-spec/test/search.vectors/150_knn_search_missing_params.yml
Outdated
Show resolved
Hide resolved
@@ -193,6 +193,7 @@ static TransportVersion def(int id) { | |||
public static final TransportVersion ENRICH_ELASTICSEARCH_VERSION_REMOVED = def(8_560_00_0); | |||
public static final TransportVersion NODE_STATS_REQUEST_SIMPLIFIED = def(8_561_00_0); | |||
|
|||
public static final TransportVersion KNN_K_NUMCANDS_AS_OPTIONAL_PARAMS = def(8_993_00_0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please update it.
@@ -194,6 +196,7 @@ public void toSearchRequest(SearchRequestBuilder builder) { | |||
|
|||
// visible for testing | |||
static class KnnSearch { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not update this class anylonger. Its deprecated and folks shouldn't use the _knn_search
API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rolled back all changes to KnnSearch
in 1aa708b
server/src/main/java/org/elasticsearch/index/query/SearchExecutionContext.java
Outdated
Show resolved
Hide resolved
private void adjustKnnSizeParam() { | ||
if (size > 0) knnSearch.forEach(knn -> knn.requestSize(size)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this isn't being used at all within the KnnSearchBuilder. However, we can do better.
We should have a KnnSearchBuilder.Builder
that KnnSearchBuilder.fromXContent
returns. That can have setters for k
that gets adjusted after the last field is read when parsing the SearchSourceBuilder
.
This way we have a KnnSearchBuilder.Builder#build(int)
where int
is the request size (the default or user provided). And KnnSearchBuilder
will never have null
k
nor null num_candidates
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean - Nice suggestion!
Currently, org.elasticsearch.search.builder.SearchSourceBuilder#parseXContent
doesn't do much outside of the while
loop other than iterating through all objects. I guess we should add the call to the builder within this parsing section though, as only then the SearchSourceBuilder
would have been finalized parsing.
Do you think that we should hide all other public test usages/constructors of KnnSearchBuilder
behind the builder (except from the one accepting a StreamInput
), or just isolate this very specific use-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that we should hide all other public test usages/constructors of KnnSearchBuilder behind the builder (except from the one accepting a StreamInput), or just isolate this very specific use-case?
I think the ctor should be package private & we use a builder object for all final
object values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And KnnSearchBuilder will never have null k nor null num_candidates.
So, since the responsibility to fill out missing values will be moved over to the Builder
on the node that received the original request, we wouldn't need any transport versions or backwards compatibility as well, right? The values will have already been filled when sending over the objects, so the behavior right after builder#build
will be pretty much the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the KnnSearchBuilder doesn't have null values, it won't need bwc serialization as nothing changes there. But, the KnnVectorQueryBuilder probably still needs it as num_candidates
can be null when executed as knn-as-a-query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, knn-query would still need it. But we can take it out of knn-search. Getting on it : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @pmpailis are you blocked by me here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope @benwtrent ! Was just taking a look at toXContent
to understand under which circumstances this is invoked, and more specifically in the context of knn-search
. Will have an update for that shortly :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of things. Looking REALLY nice :)
public KnnSearchBuilder build(int size) { | ||
int requestSize = size < 0 ? DEFAULT_SIZE : size; | ||
int adjustedK = k == null ? requestSize : k; | ||
int adjustedNumCandidates = numCandidates == null | ||
? Math.round(Math.min(NUM_CANDS_LIMIT, NUM_CANDS_MULTIPLICATIVE_FACTOR * adjustedK)) | ||
: numCandidates; | ||
return new KnnSearchBuilder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
final int k; | ||
final int numCands; | ||
final Float similarity; | ||
int k; | ||
int numCands; | ||
Float similarity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can all still be final
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap - restored them in 390aff2
test/framework/src/main/java/org/elasticsearch/test/AbstractQueryVectorBuilderTestCase.java
Outdated
Show resolved
Hide resolved
test/framework/src/main/java/org/elasticsearch/test/AbstractQueryVectorBuilderTestCase.java
Outdated
Show resolved
Hide resolved
…eryVectorBuilderTestCase.java Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
…eryVectorBuilderTestCase.java Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
The aim of this PR is to make
num_candidates
andk
optional forknn
search, and provide instead some sensible defaults.As also defined in the docs,
k
defines how many results we will eventually report back from each shard, whilenum_candidates
is an exploration parameter and defines the size of the nearest neighbors queue we will compute on each shard.k
is used in two places:DfsPhase#executeKnnVectorQuery
) when creating theTopScoreDocCollector
for executing the knn querySearchPhaseController#mergeKnnResults
) in order to keep the topk
.num_cands
is used solely within theKnnSearchBuilder
when creating the appropriateKnnVectorQueryBuilder
object.The parameter
k
can default tosize
(orDEFAULT_SIZE
if not present) - while fornum_cands
things are a bit more interesting :) There are mainly 2 approaches that we could follow:k
(orsize
), finding a sweet-spot between recall / latency, ensuring that we will try to perform as little operations as possiblenumCands
(always>= k
though) - while for bigger ones we might want to allow for a few more steps to avoid local minima and try to find the true nearest neighbors. This could potentially give us more freedom in controlling (and reducing) the number of nodes we visit for each shard.In this PR, as a first step, we will look into the 1st approach though, and make
numCands
depend on the incoming search request, instead of the indexed data. Based on the discussion & benchmarking results in the related issue, we'll proceed to define it asnum_candidates=Math.min(1.5 * k, 10_000)
(where 10_000 is the limit fornum_candidates
)As there are some restrictions on the relationship between
numCands
andk
, if we are to use any default values, in this PR we change thek
andnumCands
variables fromfinal
and apply the same validation checks that are defined in the constructor to make sure that we either have a properly definedKnnSearchBuilder
instance, or fail with anIllegalArgumentException
. This is to also cover the scenario where a user might have providednumCands
andsize
values, wherenumCands < size
, so we want to raise an exception early on.In the future, should we decide to make
numCands
defined based on the graph size, we can move the implementation & assignment over toKnnVectorQueryBuilder#doToQuery
where we have info on the specific shard/graph size.Closes #97533