SOLR-16567: KnnQueryParser support for both pre-filters and post-filter #1245

alessandrobenedetti · 2022-12-19T17:56:18Z

https://issues.apache.org/jira/browse/SOLR-16567

Description

Frange and in general post filters were abandoned for KNN search in favour of pre-filters.
This PR brings back supports for post-filters when they have a cost>0 (in line with the rest of Solr)

Tests

tests have been added

Checklist

Please review the following and check all that apply:

[ X] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
[ X] I have created a Jira issue and added the issue ID to my pull request title.
[ X] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
[ X] I have developed this patch against the main branch.
[ X] I have run ./gradlew check.
[ X] I have added tests for my changes.
I have added documentation for the Reference Guide

…ers(cost>0)

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java

dsmiley

I'm looking closer at KnnQParser; I've never reviewed it before. I see that it consumes the top-level request's fq params and it supplies them into the new KnnVectorQuery(...). I think this is a performance issue in that it ignores Solr's caching / interpretation of 'fq' into the well-known filter cache. I believe it ought to be using Solr's SolrIndexSearcher.getProcessedFilter method to get a combined Query of the parts, which is "cache-aware". Other features of Solr are similar (notably QueryElevationComponent & grouping) so you can get inspiration / understanding by looking at some callers of that method. Ultimately this should mean less code because getProcessedFilter is going to do most of the work, including understanding PostFilter.

A ramification of the lack of this use of getProcessedFilter is that typical filter queries are going to be processed twice -- once at the top level (perhaps retrieved from a cache), and again by this KNN stuff (not using a cache).

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java

alessandrobenedetti · 2022-12-21T10:04:07Z

I'm looking closer at KnnQParser; ...

David, your help has been invaluable!
You are absolutely right and this is an oversight of mine when I did the original review of the pre-filtering work(this naming is quite common in the neural search community).

I think that I have found the perfect place for this fix and it literally would require few lines of code, rather than the complicate methods that are in place now:

org.apache.solr.search.QueryUtils#combineQueryAndFilter
... } else if(scoreQuery instanceof KnnVectorQuery){ (KnnVectorQuery)scoreQuery.setFilter(filterQuery); } else { return new BooleanQuery.Builder() .add(scoreQuery, Occur.MUST) .add(filterQuery, Occur.FILTER) .build(); }
Basically when we combine the query with the filter, we manage the thing differently for KNN queries, and the filter(excluding post filters) is set in the Lucene KnnVectorQuery.
This happens after the org.apache.solr.search.SolrIndexSearcher#getProcessedFilter in the Searcher, so filters are already cached and separated from post-filters
So far so good, elegant and minimal code change, filters are processed, everyone is happy...
BUT
currently KnnVectorQuery in Lucene has no getters and setters and has all variable as final!!
So, we are where we are (to be honest, for a library class I would have gone with private variables with getters/setters since the beginning).

Also, now Lucene is a separate project, so I basically should do the change in Lucene, then wait for a Lucene release, include it in Solr ect ect

So, long story short, my suggestion:

we proceed with the current hack, renaming where necessary to make it nicer, but no massive change
contribute the change Lucene side, probably removing final, and adding getters/setters or if the community disagree, just adding getters so that It's possible to create a new class from the input one
once Lucene releases and we have the code in Solr, do the nice and clean implementation

let me know what do you think!

alessandrobenedetti · 2022-12-21T16:49:28Z

Ok, I updated the PR, this is what I have done:

workaround to fix the bug and renames suggested by @dsmiley
opened a pull request in Lucene to implement getters in the KnnVectorQuery: https://github.com/apache/lucene/pull/12029/files

Unless any additional good ideas, I would go with this now.
Then as soon as the Lucene side is sorted out and in Solr, I would implement the optimal approach, removing all the redundant code and just managing the filters in the org.apache.solr.search.QueryUtils#combineQueryAndFilter ( it will be so easy and clean, it's a shame I can't do it immediately)

dsmiley · 2022-12-21T17:58:54Z

Thanks for your complements.

Can't you simply use SolrIndexSearcher#getProcessedFilter now?

As to your proposal, I am confused as to exactly where you propose inserting the logic you provided a snippet of. If you propose SolrIndexSearcher somewhere then I don't like it because it's clearly special casing a specific query which is a design problem. Are you trying to basically move certain FQs out of their top level position and into/embedded in a particular parsed query?

alessandrobenedetti · 2022-12-21T18:33:28Z

Hi @dsmiley ,
once we have the Lucene changes, the idea is to change this method:
org.apache.solr.search.QueryUtils#combineQueryAndFilter
with an additional if clause:
else { return new BooleanQuery.Builder() .add(scoreQuery, Occur.MUST) .add(filterQuery, Occur.FILTER) .build(); }

  will become:
  `else if{scoreQuery instanceof KnnVectorQuery}{
  return new KnnVectorQuery(scoreQuery,filterQuery);} 
  else{
    return new BooleanQuery.Builder()
        .add(scoreQuery, Occur.MUST)
        .add(filterQuery, Occur.FILTER)
        .build();
  }`

Just to give an idea, the final code will look different as we will have to create a new instance of KnnVectorQuery using the getters of the old one.
With this change, we will be able to simplify the KnnQueryParser removing all the stuff for pre-filters and post-filters.
GetProcessedFilters will be called just once as usual and we'll get the benefit of caching and post-0filter separation automatically.
SolrIndexSearcher won't be touched at all.

alessandrobenedetti · 2022-12-21T18:34:05Z

And yes, in the workaround I can use the getProcessedFilters and I will, but once we have the Lucene side it will go away.

dsmiley · 2022-12-21T18:40:43Z

A special case nearly anywhere (except directly in KNN oriented code of course) is a design/maintenance issue. Some special cases like MatchAllDocs are understandable but a check for KNN in QueryUtils... eh... :-/ Maybe you could show in a new PR what this would look like so I could see. Perhaps when I understand better what you are trying to accomplish, I'll see a better solution.

(me:) Are you trying to basically move certain FQs out of their top level position and into/embedded in a particular parsed query?

Could you respond to that please?

alessandrobenedetti · 2022-12-21T18:51:49Z

Ok, I'll produce another branch with the example code assuming Lucene changes are there.

So what I am trying to accomplish:

Knn Query has a Query filter as a constructor parameter (and instance variable). This filter is meant to be a pre-filter(in Approximate Nearest Neighbour search means a filter that is executed before the Top K nearest neighbors are returned).
It is used internally in the approximate nearest neighbor search Lucene code to only accept certain neighbors from the graph(along with the alive bitSet)
In Apache Solr we need to make sure that all filter queries except explicit post-filters are processed and set in the Knn Query.
But parsing happens before the Searcher will process the filters. So if are able to modify the Lucene KnnQuery in org.apache.solr.search.QueryUtils#combineQueryAndFilter (or create a new one), we are done.

Right now we process the filters at parsing time (ad do it again in the Searcher).
Potentially we could process and remove the filters from the request in the query parser, but it seems nasty to me, hence my idea of modifying the combineQueryAndFilter.
Because that method has the responsibility of building a new Query, combining the main Query and all filters(except post-filters).
So it seems the perfect place for implementing the custom logic for the KnnQuery, that behaves differently when you combine it with filters.

Hope this helps with context @dsmiley !

alessandrobenedetti · 2022-12-21T19:39:30Z

here a rough pull request just to give you the idea @dsmiley :
https://github.com/apache/solr/pull/1246/files

alessandrobenedetti · 2022-12-22T09:35:20Z

Refactored a bit, removed custom code and used getProcessedFilter.
It looks much better now, but still, the getProcessedFilter is done twice.
It's probably an acceptable workaround at the moment?
I created the tentative branch to show what it would be like when we have access to the internal of the KNN Query (which I still believe to be ideal).
Open to better ideas and thanks again for the review @dsmiley !

epugh · 2022-12-22T12:37:23Z

Just wanted to comment what a great discussion this has been... I learned some new stuff...

dsmiley

Overall I think this is superior.
Yes, Solr will wind up calling getProcessedFilter twice but typically, filter queries are cached, and so looking up cached queries is nearly free. It's certainly better than the logic that was happening here before -- to actually evaluate those filter queries every time.

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java

dsmiley · 2022-12-22T14:31:32Z

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java

@@ -84,30 +84,20 @@ public Query parse() {
  }

  private Query getFilterQuery() throws SolrException {
-    if (!isFilter()) {
+    boolean isSubQuery = recurseCount != 0;
+    if (!isFilter() && !isSubQuery) {


Why is isFilter being checked? Why is isSubQuery being checked? In my experience, use of these is extremely rare. isFilter... that would only be pertinent if the KNN query was specified in a filter query (fq), which is kind of surprising because AFAIK, KNN is for relevancy (scoring). Does this query need to operate differently or to make more optimal decisions if it's in a filter query (to only filter and not rank)?

For the isFilter:
I may want to run a Knn search as a filter, to then score the documents by any lexical query, it's probably unlikely, but possible.
The check avoid you end up with an infinite recursion with org.apache.solr.search.QueryUtils#parseFilterQueries.
But the check also avoid a KnnQuery in the filters to capture the other filter queries and use them as pre-filters, so it's not ideal anyway.
Tagging @eliaporciani if he has any better idea.
I'll think about it

For the "isSubQuery", I want to avoid ending up with the infinite recursion frange and similar filters may end up (the original issue in the ticket).

dsmiley · 2022-12-22T15:48:13Z

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java

        try {
-          filters = QueryUtils.parseFilterQueries(req, true);
-        } catch (SyntaxError e) {
+          List<Query> filters = QueryUtils.parseFilterQueries(req, true);


Boolean parameters are kind of sadly confusing in methods when it's not obvious what it is in code review tools (because the method name isn't a hint either). Here, it is "fixNegativeQueries". It should be rare to need to do that. It's redundant/needless here because getProcessedFilter deals with the matter.

Ok, now I remember this bit, I was initially (and I am still) not a fan of that boolean param (as I tend to forget what it means and the name also doesn't help).
We discussed for a better name(we couldn't find) and the reason to put it that was was because that piece of code was shared across various Solr parts and other reviewers thought it was better to isolate it and move it to the queryUtils.

Not sure of course, if it's still the best solution to be honest

I think this boolean could be removed (logic is to not call makeQueryable inside). The only spot to be improved is then RealTimeGetComponent. Simply update the spot that loops over the filters (RealTimeGetComponent.process line ~ 328 to ensure the query is "queryable".

dsmiley

Just needs a CHANGES.txt

dsmiley · 2023-01-04T05:22:47Z

I read your response to my question:

(me:) Are you trying to basically move certain FQs out of their top level position and into/embedded in a particular parsed query?

You didn't give a simple yes-no answer but my interpretation is definitely a "yes". Another possible approach is to make embedding specific filter queries very explicit -- put them in local-params of the KNN query parser instead of standard 'fq' at the top of the request. This makes it clear there's no double-interpretation problem and it provides a great deal of control to the Solr user/dev. But of course it's less friendly / more complex. I don't know enough about KNN to know if certain expensive filter queries (that are not post-filters) might be better off happening ~after the KNN instead of "pre". This is not supported by KNN today but could be.

gabrielmagno · 2023-01-04T13:28:58Z

Just to bring a "data-science KNN user" view on this.

I think there is actually two related but different ways one could use the Dense Vectors and KNN features. The first one is the most obvious and straightforward: you use it as KNN per si, i.e. you bring the top K documents most similar to the target.

The second way to use the KNN is to use the similarity score that is calculated with the vectors. In this case we use the vectors and their similarity for ranking instead of retrieval. And there are some use cases that we could even combine multiple similarity scores to create an aggregated score, or even combine the similarity score with the actual "lexical score" (BM25). For this second use case we often use a very high K to guarantee that we calculate the similarity score for all the relevant documents.

So, considering this second use case of using the KNN for calculating the similarity score, having the ability to filter the documents based on the value of the similarity is very useful. And as previously mentioned, sometimes this is not even directly a single similarity score, it could be a combination of multiple scores, and we could apply the Post Filter on top of the aggregated score.

…er (#1245) * SOLR-16567: KnnQueryParser support for both pre-filters and post-filters(cost>0)

…er, compilation error fix (#1245)

eliaporciani and others added 2 commits December 19, 2022 18:49

SOLR-16567: KnnQueryParser support for both pre-filters and post-filt…

e683af1

…ers(cost>0)

formatting

999862c

alessandrobenedetti self-assigned this Dec 19, 2022

dsmiley reviewed Dec 19, 2022

View reviewed changes

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java Outdated Show resolved Hide resolved

better post-filter check according to review suggestions

88954c8

dsmiley requested changes Dec 20, 2022

View reviewed changes

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java Outdated Show resolved Hide resolved

solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java Outdated Show resolved Hide resolved

alessandrobenedetti mentioned this pull request Dec 21, 2022

introduce support in KnnVectorQuery for getters apache/lucene#12029

Merged

workaround to improve pre-filtering support for Knn search

84d9e71

re-using getProcessedFilter

7129a86

dsmiley reviewed Dec 22, 2022

View reviewed changes

SyntaxError propagation

d1e7c6a

dsmiley reviewed Dec 22, 2022

View reviewed changes

alessandrobenedetti added 2 commits December 23, 2022 10:05

Knn in filters

aa1b586

method boolean param removal

35dfd54

dsmiley approved these changes Dec 30, 2022

View reviewed changes

alessandrobenedetti added 3 commits January 4, 2023 19:25

Merge remote-tracking branch 'upstream/main' into jira/SOLR-16567

4e116b3

changes added

37a2b57

changes added

e92c4a0

alessandrobenedetti merged commit 257c93c into apache:main Jan 4, 2023

alessandrobenedetti added a commit that referenced this pull request Jan 4, 2023

SOLR-16567: KnnQueryParser support for both pre-filters and post-filt…

3c14fd8

…er (#1245) * SOLR-16567: KnnQueryParser support for both pre-filters and post-filters(cost>0)

alessandrobenedetti added a commit that referenced this pull request Jan 4, 2023

SOLR-16567: KnnQueryParser support for both pre-filters and post-filt…

d5811d8

…er (#1245) * SOLR-16567: KnnQueryParser support for both pre-filters and post-filters(cost>0)

alessandrobenedetti added a commit that referenced this pull request Jan 5, 2023

SOLR-16567: KnnQueryParser support for both pre-filters and post-filt…

e948e66

…er, compilation error fix (#1245)

alessandrobenedetti added a commit that referenced this pull request Jan 5, 2023

SOLR-16567: KnnQueryParser support for both pre-filters and post-filt…

33c9eb9

…er, compilation error fix (#1245)

alessandrobenedetti mentioned this pull request Jan 21, 2023

introduce support in KnnVectorQuery for getters apache/lucene#12099

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLR-16567: KnnQueryParser support for both pre-filters and post-filter #1245

SOLR-16567: KnnQueryParser support for both pre-filters and post-filter #1245

alessandrobenedetti commented Dec 19, 2022 •

edited

dsmiley left a comment

alessandrobenedetti commented Dec 21, 2022 •

edited

alessandrobenedetti commented Dec 21, 2022

dsmiley commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

dsmiley commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

alessandrobenedetti commented Dec 22, 2022

epugh commented Dec 22, 2022

dsmiley left a comment

dsmiley Dec 22, 2022

alessandrobenedetti Dec 22, 2022

dsmiley Dec 22, 2022

alessandrobenedetti Dec 22, 2022

dsmiley Dec 22, 2022

alessandrobenedetti Dec 29, 2022

dsmiley left a comment

dsmiley commented Jan 4, 2023

gabrielmagno commented Jan 4, 2023 •

edited

SOLR-16567: KnnQueryParser support for both pre-filters and post-filter #1245

SOLR-16567: KnnQueryParser support for both pre-filters and post-filter #1245

Conversation

alessandrobenedetti commented Dec 19, 2022 • edited

Description

Tests

Checklist

dsmiley left a comment

Choose a reason for hiding this comment

alessandrobenedetti commented Dec 21, 2022 • edited

alessandrobenedetti commented Dec 21, 2022

dsmiley commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

dsmiley commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

alessandrobenedetti commented Dec 21, 2022

alessandrobenedetti commented Dec 22, 2022

epugh commented Dec 22, 2022

dsmiley left a comment

Choose a reason for hiding this comment

dsmiley Dec 22, 2022

Choose a reason for hiding this comment

alessandrobenedetti Dec 22, 2022

Choose a reason for hiding this comment

dsmiley Dec 22, 2022

Choose a reason for hiding this comment

alessandrobenedetti Dec 22, 2022

Choose a reason for hiding this comment

dsmiley Dec 22, 2022

Choose a reason for hiding this comment

alessandrobenedetti Dec 29, 2022

Choose a reason for hiding this comment

dsmiley left a comment

Choose a reason for hiding this comment

dsmiley commented Jan 4, 2023

gabrielmagno commented Jan 4, 2023 • edited

alessandrobenedetti commented Dec 19, 2022 •

edited

alessandrobenedetti commented Dec 21, 2022 •

edited

gabrielmagno commented Jan 4, 2023 •

edited