-
Notifications
You must be signed in to change notification settings - Fork 25.6k
POC - Automatic prefiltering for semantic_text queries #137467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
POC - Automatic prefiltering for semantic_text queries #137467
Conversation
This is a POC that implements automatic prefiltering for `semantic_text` queries in Query DSL. Semantic text queries for `text_embedding` tasks get rewritten as kNN vector queries. The latter support filters that are applied before search. In this PR we introduce a `Prefiltering` interface that gets implemented by query builders that need prefiltering (match/semantic) or need to propagate prefilters (all compound query builders). When compound queries get rewritten, we propagate filter queries to direct child queries that support prefiltering. When match or semantic queries get rewritten to kNN queries, we set their prefilters to the kNN query so that they are applied before search.
server/src/test/java/org/elasticsearch/index/query/PrefilteringTestUtils.java
Outdated
Show resolved
Hide resolved
Mikep86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! The POC is pretty clean overall and gives us a good foundation to build on 🚀
Can we add pre-filtering for NestedQueryBuilder? That's another query type that takes a child query.
server/src/main/java/org/elasticsearch/index/query/PrefilteredQuery.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/BoolQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/BoostingQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/DisMaxQueryBuilder.java
Show resolved
Hide resolved
benwtrent
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a huge change. Adding a new interface to our most common query builders requires some more feedback. Let me think about this.
| return new MatchAllQueryBuilder().boost(boost()).queryName(queryName()); | ||
| } | ||
|
|
||
| propagatePrefilters(Stream.concat(mustClauses.stream(), shouldClauses.stream()).toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should clauses, unless min_should_match: > 1 don't actually filter docs at all. Of course, we should consider pushing down filter, must_not, must.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't really about filtering. It is about executing kNN queries correctly. Think of a match query against a dense semantic_text field. It will not get the correct results if it does not have its filters be applied prior to the kNN search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimitris-athanasiou I understand this. I am struggling to see WHICH filters are being propagated and why.
Could we lay out an example of which filters need to get pushed down in the query tree? For example:
- Is it all filters that are applied to the immediate parent? Or up the entire query tree?
- Do we need to consider sibling filters? As technically these impact regular sibling branches of the query tree as well.
Semantic_text (and kNN for that matter), can be present in every clause kind. Consequently, every clause needs to think about accepting prefilters and applying them or results are impacted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any query that can have children queries needs to be able to propagate prefilters.
However, the actual prefilters are coming from:
- bool query: must, filter, must_not clauses all should be used for prefiltering. (there is still some discussion about how we deal with
mustqueries for other technical reasons). I am only propagatingfilterclauses for now but I'm working on adding the rest.
All the other compound queries do not add further queries as prefilters, they just need to be able to pass through any prefilters they were passed on from further up the tree.
Edit: nested query isn't actually adding additional prefilters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well stated @dimitris-athanasiou. This is basically a plumbing problem. To implement effective automatic pre-filtering, we need to collect all the filter queries applied through ancestors in query tree and route them to leaf queries that apply prefilters.
bool query is the only compound query type that allows users to apply new filter queries. All other compound query types just need to handle passing through prefilters to child queries.
| import static org.elasticsearch.search.fetch.subphase.InnerHitsContext.intersect; | ||
|
|
||
| public class NestedQueryBuilder extends AbstractQueryBuilder<NestedQueryBuilder> { | ||
| public class NestedQueryBuilder extends AbstractQueryBuilder<NestedQueryBuilder> implements Prefiltering<NestedQueryBuilder> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't immediately know how this should behave given nested could be within a boolean query, but the higher level filters are applied to the upper level of the nested context, but the inner filters would be applied to the lower level of the nested context.
Does this mean we try to push both down into something that is asking for pre-filtering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll work out an example of how this would work. Or realize it wouldn't :-)
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looking good - the idea is nicely captured.
Some things I'd like to follow up on:
- Rewriting of the filters seem necessary to me as part of the compound query rewriting.
- We can probably benefit from an abstract class that introduces getters / setters and propagation mechanism
- I think we are missing YAML tests and tests for the compound query builders. An abstract class would make sense here as well to me, to ensure we are checking prefilters consistently when adding them, on rewriting, etc
server/src/main/java/org/elasticsearch/index/query/Prefiltering.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/functionscore/FunctionScoreQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/BoolQueryBuilder.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public List<QueryBuilder> getPrefilters() { | ||
| return Stream.concat(prefilters.stream(), filterClauses.stream()).toList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should take into account must and must_not as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
must clauses we decided not to include. There are complications because for each must clause we need to apply the other n - 1 must clauses.
must_not we decided to add here. I'll do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmmm, I see. IIUC that should happen with must_not clauses as well?
One way to look at it would be to retrieve all must / must_not / filter clauses in a boolean query, and add them to the prefilters. That way we could use the same strategy for pushing only the prefilters that are different to the target query 🤔
We can iterate on this! No need to implement every corner case as of now - it will help to reason first on the base cases and do follow ups as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, yes, must_not can be propagated, not used as target queries. So, we propagate filter and must_not clauses to must and should clauses.
server/src/main/java/org/elasticsearch/index/query/functionscore/FunctionScoreQueryBuilder.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/query/Prefiltering.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/query/PrefilteringTestUtils.java
Outdated
Show resolved
Hide resolved
BASE=e486766b84b4a2c21b5f01e328bd764f1edd20b2 HEAD=22006a8e97999ad822bc1c965873787119340b35 Branch=main
8a28cfb to
22006a8
Compare
BASE=e486766b84b4a2c21b5f01e328bd764f1edd20b2 HEAD=22006a8e97999ad822bc1c965873787119340b35 Branch=main
This is a POC that implements automatic prefiltering for
semantic_textqueries in Query DSL.Semantic text queries for
text_embeddingtasks get rewritten as kNN vector queries. The latter support filters that are applied before search.In this PR we introduce a
Prefilteringinterface that gets implemented by query builders that need prefiltering (match/semantic) or need to propagate prefilters (all compound query builders).When compound queries get rewritten, we propagate filter queries to direct child queries that support prefiltering.
When match or semantic queries get rewritten to kNN queries, we set their prefilters to the kNN query so that they are applied before search.
The only query that actually produces prefilters is the
boolquery. Other compound queries simply pass throughprefilters to their child queries. For
bool, we consider as prefilters clauses that are run in the filter context, namelyfilterandmust_not. Therefore, we applyfilterandmust_notclauses as prefilters tomustandshouldclauses. This is clear and easy to reason for, and it avoids the complications of trying to applyn - 1must clauses to each other must clause.