SOLR-16567: KnnQueryParser support for both pre-filters and post-filter #1245
solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java
I'm looking closer at KnnQParser; I've never reviewed it before. I see that it consumes the top-level request's fq params and it supplies them into the new KnnVectorQuery(...). I think this is a performance issue in that it ignores Solr's caching / interpretation of 'fq' into the well-known filter cache. I believe it ought to be using Solr's SolrIndexSearcher.getProcessedFilter method to get a combined Query of the parts, which is "cache-aware". Other features of Solr are similar (notably QueryElevationComponent & grouping) so you can get inspiration / understanding by looking at some callers of that method. Ultimately this should mean less code because getProcessedFilter is going to do most of the work, including understanding PostFilter.
A ramification of the lack of this use of getProcessedFilter is that typical filter queries are going to be processed twice -- once at the top level (perhaps retrieved from a cache), and again by this KNN stuff (not using a cache).
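To make the cost difference concrete, here is a minimal sketch of a cache-aware lookup versus re-evaluating a filter on every request. It is a toy model only: `ToyFilterCache`, `evaluate`, and `getOrEvaluate` are hypothetical names, and none of this is Solr's actual filter cache or `getProcessedFilter` implementation.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

/** Toy model of a filter cache; not Solr's actual filterCache implementation. */
class ToyFilterCache {
  private final Map<String, BitSet> cache = new HashMap<>();
  private final int maxDoc;
  int evaluations = 0; // how many times a filter was actually run over the index

  ToyFilterCache(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  /** Expensive path: tests the filter against every document. */
  BitSet evaluate(IntPredicate filter) {
    evaluations++;
    BitSet bits = new BitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (filter.test(doc)) {
        bits.set(doc);
      }
    }
    return bits;
  }

  /** Cache-aware path: evaluates only on a miss, keyed by the filter string. */
  BitSet getOrEvaluate(String fq, IntPredicate filter) {
    return cache.computeIfAbsent(fq, key -> evaluate(filter));
  }
}
```

In this model, two calls to `getOrEvaluate` with the same filter string cost one evaluation, whereas two direct calls to `evaluate` cost two, which is the double processing described above.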
David, your help has been invaluable! I think I have found the perfect place for this fix, and it would literally require a few lines of code rather than the complicated methods that are in place now: org.apache.solr.search.QueryUtils#combineQueryAndFilter. Also, Lucene is now a separate project, so I would basically have to make the change in Lucene, wait for a Lucene release, include it in Solr, etc. So, long story short, my suggestion:
Let me know what you think!
Ok, I updated the PR, this is what I have done:
Unless there are any additional good ideas, I would go with this now.
Thanks for your compliments. Can't you simply use SolrIndexSearcher#getProcessedFilter now? As to your proposal, I am confused as to exactly where you propose inserting the logic you provided a snippet of. If you propose SolrIndexSearcher somewhere then I don't like it because it's clearly special casing a specific query which is a design problem. Are you trying to basically move certain FQs out of their top level position and into/embedded in a particular parsed query?
Hi @dsmiley,
Just to give an idea, the final code will look different as we will have to create a new instance of KnnVectorQuery using the getters of the old one.
And yes, in the workaround I can use getProcessedFilter and I will, but once we have the Lucene side it will go away.
A special case nearly anywhere (except directly in KNN oriented code of course) is a design/maintenance issue. Some special cases like MatchAllDocs are understandable but a check for KNN in QueryUtils... eh... :-/ Maybe you could show in a new PR what this would look like so I could see. Perhaps when I understand better what you are trying to accomplish, I'll see a better solution.
Could you respond to that please?
Ok, I'll produce another branch with the example code assuming Lucene changes are there. So what I am trying to accomplish:
Right now we process the filters at parsing time (and do it again in the Searcher). Hope this helps with context @dsmiley!
Here is a rough pull request, just to give you the idea, @dsmiley:
Refactored a bit, removed custom code and used getProcessedFilter.
Just wanted to comment what a great discussion this has been... I learned some new stuff...
Overall I think this is superior.
Yes, Solr will wind up calling getProcessedFilter twice but typically, filter queries are cached, and so looking up cached queries is nearly free. It's certainly better than the logic that was happening here before -- to actually evaluate those filter queries every time.
solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java
@@ -84,30 +84,20 @@ public Query parse() {
  }

  private Query getFilterQuery() throws SolrException {
    if (!isFilter()) {
    boolean isSubQuery = recurseCount != 0;
    if (!isFilter() && !isSubQuery) {
Why is isFilter being checked? Why is isSubQuery being checked? In my experience, use of these is extremely rare. isFilter... that would only be pertinent if the KNN query was specified in a filter query (fq), which is kind of surprising because AFAIK, KNN is for relevancy (scoring). Does this query need to operate differently or to make more optimal decisions if it's in a filter query (to only filter and not rank)?
For the isFilter:
I may want to run a KNN search as a filter and then score the documents by any lexical query. It's probably unlikely, but possible.
The check prevents you from ending up in an infinite recursion with org.apache.solr.search.QueryUtils#parseFilterQueries.
But the check also prevents a KnnQuery in the filters from capturing the other filter queries and using them as pre-filters, so it's not ideal anyway.
Tagging @eliaporciani in case he has a better idea.
I'll think about it.
For the "isSubQuery", I want to avoid the infinite recursion that frange and similar filters may end up in (the original issue in the ticket).
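The recursion concern can be sketched with a toy parser. This is an illustration under assumptions, not Solr code: `ToyKnnParser` and its `parse` signature are hypothetical, and only the guard condition (`!isFilter && !isSubQuery`, with `isSubQuery` derived from `recurseCount`) mirrors the diff above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the recursion guard; names mirror the diff but nothing here is Solr code. */
class ToyKnnParser {
  final List<String> fqs; // top-level filter queries of the request
  int parseCalls = 0;     // counts how often parse() runs

  ToyKnnParser(List<String> fqs) {
    this.fqs = fqs;
  }

  /** Parses one query string and returns the filters it captured as pre-filters. */
  List<String> parse(String q, boolean isFilter, int recurseCount) {
    parseCalls++;
    List<String> preFilters = new ArrayList<>();
    boolean isSubQuery = recurseCount != 0;
    // Only a top-level, non-filter KNN query pulls in the request's fqs.
    if (q.startsWith("{!knn") && !isFilter && !isSubQuery) {
      for (String fq : fqs) {
        // Each fq is parsed as a filter, so a KNN query inside fq
        // does not try to capture the fqs again.
        parse(fq, true, recurseCount + 1);
        preFilters.add(fq);
      }
    }
    return preFilters;
  }
}
```

Without the guard, a KNN query that appears in `fq` would parse every `fq` (including itself) again and never terminate. With it, that same KNN query captures no pre-filters at all, which is the trade-off noted above.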
    try {
      filters = QueryUtils.parseFilterQueries(req, true);
    } catch (SyntaxError e) {
    List<Query> filters = QueryUtils.parseFilterQueries(req, true);
Boolean parameters in methods are sadly confusing when it's not obvious in code review tools what they mean (and the method name isn't a hint either). Here, it is "fixNegativeQueries". It should be rare to need to do that. It's redundant/needless here because getProcessedFilter deals with the matter.
Ok, now I remember this bit. I was initially (and I am still) not a fan of that boolean param, as I tend to forget what it means and the name doesn't help either.
We discussed a better name (we couldn't find one). The reason it was put that way was that this piece of code was shared across various parts of Solr, and other reviewers thought it was better to isolate it and move it to QueryUtils.
To be honest, I'm not sure it's still the best solution.
I think this boolean could be removed (the logic would then be to not call makeQueryable inside). The only spot to be improved is then RealTimeGetComponent. Simply update the spot that loops over the filters (RealTimeGetComponent.process, line ~328) to ensure the query is "queryable".
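For readers unfamiliar with why a filter may need to be made "queryable": in Lucene, a boolean query made up only of prohibited clauses (for example a filter like `-field:value`) matches no documents on its own, so a match-all clause has to be added for the exclusion to have something to subtract from. The sketch below models that with doc-id bit sets. `ToyBoolQuery` and its `makeQueryable` are hypothetical stand-ins written for illustration and are not the Lucene or Solr classes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy boolean query over doc ids; a stand-in for illustration, not a Lucene class. */
class ToyBoolQuery {
  final List<BitSet> must = new ArrayList<>();
  final List<BitSet> mustNot = new ArrayList<>();

  /** A query with no positive clause has nothing to subtract from, so it matches nothing. */
  BitSet match(int maxDoc) {
    BitSet result = new BitSet(maxDoc);
    if (must.isEmpty()) {
      return result; // pure negative (or empty) query: no documents
    }
    result.set(0, maxDoc);
    for (BitSet m : must) {
      result.and(m);
    }
    for (BitSet n : mustNot) {
      result.andNot(n);
    }
    return result;
  }

  /** Hypothetical helper: adds a match-all clause to a pure negative query. */
  static ToyBoolQuery makeQueryable(ToyBoolQuery q, int maxDoc) {
    if (!q.must.isEmpty() || q.mustNot.isEmpty()) {
      return q; // already has a positive clause, or nothing to fix
    }
    ToyBoolQuery fixed = new ToyBoolQuery();
    BitSet all = new BitSet(maxDoc);
    all.set(0, maxDoc);
    fixed.must.add(all);
    fixed.mustNot.addAll(q.mustNot);
    return fixed;
  }
}
```

Doing this fix at the one call site that needs it, rather than behind a boolean flag on a shared parsing method, is the shape of the change suggested above.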
done!
Just needs a CHANGES.txt
I read your response to my question:
You didn't give a simple yes-no answer but my interpretation is definitely a "yes". Another possible approach is to make embedding specific filter queries very explicit -- put them in local-params of the KNN query parser instead of standard 'fq' at the top of the request. This makes it clear there's no double-interpretation problem and it provides a great deal of control to the Solr user/dev. But of course it's less friendly / more complex. I don't know enough about KNN to know if certain expensive filter queries (that are not post-filters) might be better off happening ~after the KNN instead of "pre". This is not supported by KNN today but could be.
Just to bring a "data-science KNN user" view on this. I think there are actually two related but different ways one could use the Dense Vectors and KNN features. The first one is the most obvious and straightforward: you use it as KNN per se, i.e. you retrieve the top K documents most similar to the target. The second way is to use the similarity score that is calculated with the vectors. In this case we use the vectors and their similarity for ranking instead of retrieval. There are also use cases where we could combine multiple similarity scores to create an aggregated score, or even combine the similarity score with the actual "lexical score" (BM25).
For this second use case we often use a very high K to guarantee that we calculate the similarity score for all the relevant documents. So, for this use case, having the ability to filter the documents based on the value of the similarity is very useful. And as previously mentioned, sometimes this is not even a single similarity score: it could be a combination of multiple scores, and we could apply the post-filter on top of the aggregated score.
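The two usages can be sketched with a brute-force search over a handful of vectors. This is a minimal illustration under assumptions: `ToyKnn`, `search`, and the `minScore` parameter are hypothetical names, cosine similarity is chosen only as an example, and real KNN search in Lucene is approximate rather than exhaustive.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy brute-force KNN with a similarity threshold; not Solr's or Lucene's implementation. */
class ToyKnn {
  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  /** Returns the ids of the top-k most similar docs, keeping only scores >= minScore. */
  static List<Integer> search(double[][] docs, double[] target, int k, double minScore) {
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < docs.length; i++) {
      ids.add(i);
    }
    // Sort by descending similarity to the target.
    ids.sort(Comparator.comparingDouble((Integer id) -> -cosine(docs[id], target)));
    List<Integer> result = new ArrayList<>();
    for (int id : ids.subList(0, Math.min(k, ids.size()))) {
      if (cosine(docs[id], target) >= minScore) {
        result.add(id); // post-filter on the similarity score
      }
    }
    return result;
  }
}
```

Calling `search` with a small `k` and a permissive `minScore` corresponds to the first usage (plain top-K retrieval). Calling it with a very high `k` and a meaningful `minScore` corresponds to the second, where the threshold on the score does the selecting.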
…er (#1245) * SOLR-16567: KnnQueryParser support for both pre-filters and post-filters(cost>0)
…er, compilation error fix (#1245)
https://issues.apache.org/jira/browse/SOLR-16567
Description
Frange, and post-filters in general, were abandoned for KNN search in favour of pre-filters.
This PR brings back support for post-filters when they have a cost>0 (in line with the rest of Solr).
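The practical difference between the two filter kinds can be shown with a small sketch. It assumes precomputed similarity scores per document and uses hypothetical names (`ToyPrePostFilter`, `preFilter`, `postFilter`); it does not model how Solr decides which filters run as post-filters.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

/** Toy comparison of pre- and post-filtering around a top-k selection. */
class ToyPrePostFilter {
  /** Doc ids by descending score, restricted to docs accepted by the candidate filter. */
  static List<Integer> topK(double[] scores, int k, IntPredicate candidates) {
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < scores.length; i++) {
      if (candidates.test(i)) {
        ids.add(i);
      }
    }
    ids.sort(Comparator.comparingDouble((Integer id) -> -scores[id]));
    return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
  }

  /** Pre-filter: the filter narrows the candidates before the top-k selection. */
  static List<Integer> preFilter(double[] scores, int k, IntPredicate filter) {
    return topK(scores, k, filter);
  }

  /** Post-filter: top-k is selected first, then non-matching hits are dropped. */
  static List<Integer> postFilter(double[] scores, int k, IntPredicate filter) {
    List<Integer> hits = topK(scores, k, doc -> true);
    hits.removeIf(doc -> !filter.test(doc));
    return hits;
  }
}
```

With the same scores, filter, and `k`, the pre-filtered search can still return `k` hits, while the post-filtered one may return fewer, because some of the top `k` are removed after the fact.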
Tests
Tests have been added.