Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16567: KnnQueryParser support for both pre-filters and post-filter #1245

Merged
merged 11 commits into from Jan 4, 2023

Conversation

alessandrobenedetti
Copy link
Contributor

@alessandrobenedetti alessandrobenedetti commented Dec 19, 2022

https://issues.apache.org/jira/browse/SOLR-16567

Description

Frange and in general post filters were abandoned for KNN search in favour of pre-filters.
This PR brings back supports for post-filters when they have a cost>0 (in line with the rest of Solr)

Tests

tests have been added

Checklist

Please review the following and check all that apply:

  • [ X] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [ X] I have created a Jira issue and added the issue ID to my pull request title.
  • [ X] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [ X] I have developed this patch against the main branch.
  • [ X] I have run ./gradlew check.
  • [ X] I have added tests for my changes.
  • I have added documentation for the Reference Guide

@alessandrobenedetti alessandrobenedetti self-assigned this Dec 19, 2022
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking closer at KnnQParser; I've never reviewed it before. I see that it consumes the top-level request's fq params and it supplies them into the new KnnVectorQuery(...). I think this is a performance issue in that it ignores Solr's caching / interpretation of 'fq' into the well-known filter cache. I believe it ought to be using Solr's SolrIndexSearcher.getProcessedFilter method to get a combined Query of the parts, which is "cache-aware". Other features of Solr are similar (notably QueryElevationComponent & grouping) so you can get inspiration / understanding by looking at some callers of that method. Ultimately this should mean less code because getProcessedFilter is going to do most of the work, including understanding PostFilter.

A ramification of the lack of this use of getProcessedFilter is that typical filter queries are going to be processed twice -- once at the top level (perhaps retrieved from a cache), and again by this KNN stuff (not using a cache).

@alessandrobenedetti
Copy link
Contributor Author

alessandrobenedetti commented Dec 21, 2022

I'm looking closer at KnnQParser; ...

David, your help has been invaluable!
You are absolutely right and this is an oversight of mine when I did the original review of the pre-filtering work(this naming is quite common in the neural search community).

I think that I have found the perfect place for this fix and it literally would require few lines of code, rather than the complicate methods that are in place now:

org.apache.solr.search.QueryUtils#combineQueryAndFilter
... } else if(scoreQuery instanceof KnnVectorQuery){ (KnnVectorQuery)scoreQuery.setFilter(filterQuery); } else { return new BooleanQuery.Builder() .add(scoreQuery, Occur.MUST) .add(filterQuery, Occur.FILTER) .build(); }
Basically when we combine the query with the filter, we manage the thing differently for KNN queries, and the filter(excluding post filters) is set in the Lucene KnnVectorQuery.
This happens after the org.apache.solr.search.SolrIndexSearcher#getProcessedFilter in the Searcher, so filters are already cached and separated from post-filters
So far so good, elegant and minimal code change, filters are processed, everyone is happy...
BUT
currently KnnVectorQuery in Lucene has no getters and setters and has all variable as final!!
So, we are where we are (to be honest, for a library class I would have gone with private variables with getters/setters since the beginning).

Also, now Lucene is a separate project, so I basically should do the change in Lucene, then wait for a Lucene release, include it in Solr ect ect

So, long story short, my suggestion:

  1. we proceed with the current hack, renaming where necessary to make it nicer, but no massive change
  2. contribute the change Lucene side, probably removing final, and adding getters/setters or if the community disagree, just adding getters so that It's possible to create a new class from the input one
  3. once Lucene releases and we have the code in Solr, do the nice and clean implementation

let me know what do you think!

@alessandrobenedetti
Copy link
Contributor Author

Ok, I updated the PR, this is what I have done:

  1. workaround to fix the bug and renames suggested by @dsmiley
  2. opened a pull request in Lucene to implement getters in the KnnVectorQuery: https://github.com/apache/lucene/pull/12029/files

Unless any additional good ideas, I would go with this now.
Then as soon as the Lucene side is sorted out and in Solr, I would implement the optimal approach, removing all the redundant code and just managing the filters in the org.apache.solr.search.QueryUtils#combineQueryAndFilter ( it will be so easy and clean, it's a shame I can't do it immediately)

@dsmiley
Copy link
Contributor

dsmiley commented Dec 21, 2022

Thanks for your complements.

Can't you simply use SolrIndexSearcher#getProcessedFilter now?

As to your proposal, I am confused as to exactly where you propose inserting the logic you provided a snippet of. If you propose SolrIndexSearcher somewhere then I don't like it because it's clearly special casing a specific query which is a design problem. Are you trying to basically move certain FQs out of their top level position and into/embedded in a particular parsed query?

@alessandrobenedetti
Copy link
Contributor Author

Hi @dsmiley ,
once we have the Lucene changes, the idea is to change this method:
org.apache.solr.search.QueryUtils#combineQueryAndFilter
with an additional if clause:
else { return new BooleanQuery.Builder() .add(scoreQuery, Occur.MUST) .add(filterQuery, Occur.FILTER) .build(); }

  will become:
  `else if{scoreQuery instanceof KnnVectorQuery}{
  return new KnnVectorQuery(scoreQuery,filterQuery);} 
  else{
    return new BooleanQuery.Builder()
        .add(scoreQuery, Occur.MUST)
        .add(filterQuery, Occur.FILTER)
        .build();
  }`

Just to give an idea, the final code will look different as we will have to create a new instance of KnnVectorQuery using the getters of the old one.
With this change, we will be able to simplify the KnnQueryParser removing all the stuff for pre-filters and post-filters.
GetProcessedFilters will be called just once as usual and we'll get the benefit of caching and post-0filter separation automatically.
SolrIndexSearcher won't be touched at all.

@alessandrobenedetti
Copy link
Contributor Author

And yes, in the workaround I can use the getProcessedFilters and I will, but once we have the Lucene side it will go away.

@dsmiley
Copy link
Contributor

dsmiley commented Dec 21, 2022

A special case nearly anywhere (except directly in KNN oriented code of course) is a design/maintenance issue. Some special cases like MatchAllDocs are understandable but a check for KNN in QueryUtils... eh... :-/ Maybe you could show in a new PR what this would look like so I could see. Perhaps when I understand better what you are trying to accomplish, I'll see a better solution.

(me:) Are you trying to basically move certain FQs out of their top level position and into/embedded in a particular parsed query?

Could you respond to that please?

@alessandrobenedetti
Copy link
Contributor Author

Ok, I'll produce another branch with the example code assuming Lucene changes are there.

So what I am trying to accomplish:

  1. Knn Query has a Query filter as a constructor parameter (and instance variable). This filter is meant to be a pre-filter(in Approximate Nearest Neighbour search means a filter that is executed before the Top K nearest neighbors are returned).
    It is used internally in the approximate nearest neighbor search Lucene code to only accept certain neighbors from the graph(along with the alive bitSet)
  2. In Apache Solr we need to make sure that all filter queries except explicit post-filters are processed and set in the Knn Query.
  3. But parsing happens before the Searcher will process the filters. So if are able to modify the Lucene KnnQuery in org.apache.solr.search.QueryUtils#combineQueryAndFilter (or create a new one), we are done.

Right now we process the filters at parsing time (ad do it again in the Searcher).
Potentially we could process and remove the filters from the request in the query parser, but it seems nasty to me, hence my idea of modifying the combineQueryAndFilter.
Because that method has the responsibility of building a new Query, combining the main Query and all filters(except post-filters).
So it seems the perfect place for implementing the custom logic for the KnnQuery, that behaves differently when you combine it with filters.

Hope this helps with context @dsmiley !

@alessandrobenedetti
Copy link
Contributor Author

here a rough pull request just to give you the idea @dsmiley :
https://github.com/apache/solr/pull/1246/files

@alessandrobenedetti
Copy link
Contributor Author

Refactored a bit, removed custom code and used getProcessedFilter.
It looks much better now, but still, the getProcessedFilter is done twice.
It's probably an acceptable workaround at the moment?
I created the tentative branch to show what it would be like when we have access to the internal of the KNN Query (which I still believe to be ideal).
Open to better ideas and thanks again for the review @dsmiley !

@epugh
Copy link
Contributor

epugh commented Dec 22, 2022

Just wanted to comment what a great discussion this has been... I learned some new stuff...

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this is superior.
Yes, Solr will wind up calling getProcessedFilter twice but typically, filter queries are cached, and so looking up cached queries is nearly free. It's certainly better than the logic that was happening here before -- to actually evaluate those filter queries every time.

@@ -84,30 +84,20 @@ public Query parse() {
}

private Query getFilterQuery() throws SolrException {
if (!isFilter()) {
boolean isSubQuery = recurseCount != 0;
if (!isFilter() && !isSubQuery) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is isFilter being checked? Why is isSubQuery being checked? In my experience, use of these is extremely rare. isFilter... that would only be pertinent if the KNN query was specified in a filter query (fq), which is kind of surprising because AFAIK, KNN is for relevancy (scoring). Does this query need to operate differently or to make more optimal decisions if it's in a filter query (to only filter and not rank)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the isFilter:
I may want to run a Knn search as a filter, to then score the documents by any lexical query, it's probably unlikely, but possible.
The check avoid you end up with an infinite recursion with org.apache.solr.search.QueryUtils#parseFilterQueries.
But the check also avoid a KnnQuery in the filters to capture the other filter queries and use them as pre-filters, so it's not ideal anyway.
Tagging @eliaporciani if he has any better idea.
I'll think about it

For the "isSubQuery", I want to avoid ending up with the infinite recursion frange and similar filters may end up (the original issue in the ticket).

try {
filters = QueryUtils.parseFilterQueries(req, true);
} catch (SyntaxError e) {
List<Query> filters = QueryUtils.parseFilterQueries(req, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boolean parameters are kind of sadly confusing in methods when it's not obvious what it is in code review tools (because the method name isn't a hint either). Here, it is "fixNegativeQueries". It should be rare to need to do that. It's redundant/needless here because getProcessedFilter deals with the matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I remember this bit, I was initially (and I am still) not a fan of that boolean param (as I tend to forget what it means and the name also doesn't help).
We discussed for a better name(we couldn't find) and the reason to put it that was was because that piece of code was shared across various Solr parts and other reviewers thought it was better to isolate it and move it to the queryUtils.

Not sure of course, if it's still the best solution to be honest

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this boolean could be removed (logic is to not call makeQueryable inside). The only spot to be improved is then RealTimeGetComponent. Simply update the spot that loops over the filters (RealTimeGetComponent.process line ~ 328 to ensure the query is "queryable".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just needs a CHANGES.txt

@dsmiley
Copy link
Contributor

dsmiley commented Jan 4, 2023

I read your response to my question:

(me:) Are you trying to basically move certain FQs out of their top level position and into/embedded in a particular parsed query?

You didn't give a simple yes-no answer but my interpretation is definitely a "yes". Another possible approach is to make embedding specific filter queries very explicit -- put them in local-params of the KNN query parser instead of standard 'fq' at the top of the request. This makes it clear there's no double-interpretation problem and it provides a great deal of control to the Solr user/dev. But of course it's less friendly / more complex. I don't know enough about KNN to know if certain expensive filter queries (that are not post-filters) might be better off happening ~after the KNN instead of "pre". This is not supported by KNN today but could be.

@gabrielmagno
Copy link
Contributor

gabrielmagno commented Jan 4, 2023

Just to bring a "data-science KNN user" view on this.

I think there is actually two related but different ways one could use the Dense Vectors and KNN features. The first one is the most obvious and straightforward: you use it as KNN per si, i.e. you bring the top K documents most similar to the target.

The second way to use the KNN is to use the similarity score that is calculated with the vectors. In this case we use the vectors and their similarity for ranking instead of retrieval. And there are some use cases that we could even combine multiple similarity scores to create an aggregated score, or even combine the similarity score with the actual "lexical score" (BM25). For this second use case we often use a very high K to guarantee that we calculate the similarity score for all the relevant documents.

So, considering this second use case of using the KNN for calculating the similarity score, having the ability to filter the documents based on the value of the similarity is very useful. And as previously mentioned, sometimes this is not even directly a single similarity score, it could be a combination of multiple scores, and we could apply the Post Filter on top of the aggregated score.

@alessandrobenedetti alessandrobenedetti merged commit 257c93c into apache:main Jan 4, 2023
alessandrobenedetti added a commit that referenced this pull request Jan 4, 2023
…er (#1245)

* SOLR-16567: KnnQueryParser support for both pre-filters and post-filters(cost>0)
alessandrobenedetti added a commit that referenced this pull request Jan 4, 2023
…er (#1245)

* SOLR-16567: KnnQueryParser support for both pre-filters and post-filters(cost>0)
alessandrobenedetti added a commit that referenced this pull request Jan 5, 2023
alessandrobenedetti added a commit that referenced this pull request Jan 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants