Contributor

@dimitris-athanasiou dimitris-athanasiou commented Oct 31, 2025

This is a POC that implements automatic prefiltering for semantic_text queries in Query DSL.

Semantic text queries for text_embedding tasks get rewritten as kNN vector queries. The latter support filters that are applied before search.

In this PR we introduce a Prefiltering interface that gets implemented by query builders that need prefiltering (match/semantic) or need to propagate prefilters (all compound query builders).

When compound queries get rewritten, we propagate filter queries to direct child queries that support prefiltering.

When match or semantic queries get rewritten to kNN queries, we set their prefilters on the kNN query so that they are applied before search.

The only query that actually produces prefilters is the bool query. Other compound queries simply pass prefilters through to their child queries. For bool, we treat as prefilters the clauses that run in the filter context, namely filter and must_not. We therefore apply filter and must_not clauses as prefilters to must and should clauses. This is clear and easy to reason about, and it avoids the complications of trying to apply the other n - 1 must clauses to each must clause.
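
A minimal sketch of the bool rule described above, assuming a simplified stand-in in which clauses are plain strings. `BoolPrefilterSketch`, `prefilters()`, and `propagate()` are hypothetical names used only for illustration; they are not the `Prefiltering` interface from this PR. Rendering a must_not clause as `NOT(...)` is also an assumption about how a negated clause could be carried as a prefilter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a bool query; NOT the Elasticsearch BoolQueryBuilder API.
final class BoolPrefilterSketch {
    final List<String> must = new ArrayList<>();
    final List<String> should = new ArrayList<>();
    final List<String> filter = new ArrayList<>();
    final List<String> mustNot = new ArrayList<>();

    /** Prefilters this bool contributes: filter clauses as-is, must_not clauses negated. */
    List<String> prefilters() {
        List<String> out = new ArrayList<>(filter);
        for (String clause : mustNot) {
            out.add("NOT(" + clause + ")"); // assumed representation of a negated clause
        }
        return out;
    }

    /** Maps each must and should clause to the prefilters it would receive. */
    Map<String, List<String>> propagate() {
        Map<String, List<String>> targets = new LinkedHashMap<>();
        List<String> prefilters = prefilters();
        for (String clause : must) {
            targets.put(clause, prefilters);
        }
        for (String clause : should) {
            targets.put(clause, prefilters);
        }
        return targets;
    }

    /** Builds a small example query with one clause of each kind. */
    static BoolPrefilterSketch example() {
        BoolPrefilterSketch query = new BoolPrefilterSketch();
        query.must.add("semantic(body)");
        query.should.add("match(title)");
        query.filter.add("term(lang=en)");
        query.mustNot.add("term(status=draft)");
        return query;
    }

    public static void main(String[] args) {
        example().propagate().forEach((target, prefilters) ->
            System.out.println(target + " <- " + prefilters));
    }
}
```

In this sketch, filter and must_not clauses are never targets themselves; only must and should clauses receive prefilters, which mirrors the rule stated in the description.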

@dimitris-athanasiou dimitris-athanasiou added >non-issue :SearchOrg/Relevance Label for the Search (solution/org) Relevance team v9.3.0 labels Oct 31, 2025
Contributor

@Mikep86 Mikep86 left a comment

Nice work! The POC is pretty clean overall and gives us a good foundation to build on 🚀

Can we add pre-filtering for NestedQueryBuilder? That's another query type that takes a child query.

Member

@benwtrent benwtrent left a comment

This is a huge change. Adding a new interface to our most common query builders requires some more feedback. Let me think about this.

return new MatchAllQueryBuilder().boost(boost()).queryName(queryName());
}

propagatePrefilters(Stream.concat(mustClauses.stream(), shouldClauses.stream()).toList());
Member

should clauses don't actually filter docs at all unless minimum_should_match is at least 1. Of course, we should consider pushing down filter, must_not, and must.

Contributor Author

This isn't really about filtering. It is about executing kNN queries correctly. Think of a match query against a dense semantic_text field: it will not return the correct results unless its filters are applied before the kNN search.
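
To make the correctness point concrete, here is a toy sketch that compares filtering after a top-k search with filtering before it. Everything in it is an assumption made for illustration: brute-force search over one-dimensional values stands in for real vector search, and `KnnFilterOrderSketch`, `postFiltered`, and `preFiltered` are hypothetical names, not Elasticsearch APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Toy model only: doc id is the array index, the value is a 1-D "embedding".
final class KnnFilterOrderSketch {
    static final double[] VECTORS = {0.10, 0.12, 0.15, 0.50, 0.55, 0.90};

    /** Brute-force k nearest doc ids to the query among docs accepted by the candidate predicate. */
    static List<Integer> knn(double query, int k, IntPredicate candidate) {
        List<Integer> ids = new ArrayList<>();
        for (int doc = 0; doc < VECTORS.length; doc++) {
            if (candidate.test(doc)) {
                ids.add(doc);
            }
        }
        ids.sort(Comparator.comparingDouble(doc -> Math.abs(VECTORS[doc] - query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    /** Runs the top-k search over all docs, then drops hits that fail the filter. */
    static List<Integer> postFiltered(double query, int k, IntPredicate filter) {
        List<Integer> hits = knn(query, k, doc -> true);
        hits.removeIf(doc -> !filter.test(doc));
        return hits;
    }

    /** Restricts the candidate set first, then runs the top-k search. */
    static List<Integer> preFiltered(double query, int k, IntPredicate filter) {
        return knn(query, k, filter);
    }

    public static void main(String[] args) {
        IntPredicate filter = doc -> doc >= 3; // pretend docs 3..5 match the filter clause
        System.out.println(postFiltered(0.10, 2, filter)); // [] : both nearest docs were filtered out
        System.out.println(preFiltered(0.10, 2, filter));  // [3, 4]
    }
}
```

With the filter applied afterwards, the two nearest documents are discarded and nothing is returned, even though matching documents exist. Applying the filter first yields the two nearest matching documents.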

Member

@dimitris-athanasiou I understand this. I am struggling to see WHICH filters are being propagated and why.

Could we lay out an example of which filters need to get pushed down in the query tree? For example:

  • Is it all filters that are applied to the immediate parent? Or up the entire query tree?
  • Do we need to consider sibling filters? As technically these impact regular sibling branches of the query tree as well.

Semantic_text (and kNN for that matter) can be present in every kind of clause. Consequently, every clause needs to think about accepting prefilters and applying them, or results are impacted.

Contributor Author

@dimitris-athanasiou dimitris-athanasiou Nov 4, 2025

Any query that can have child queries needs to be able to propagate prefilters.

However, the actual prefilters come from:

  • bool query: must, filter, and must_not clauses should all be used for prefiltering (there is still some discussion about how we deal with must queries, for other technical reasons). I am only propagating filter clauses for now, but I'm working on adding the rest.

None of the other compound queries add further queries as prefilters; they just need to be able to pass through any prefilters handed down to them from further up the tree.

Edit: nested query isn't actually adding additional prefilters.

Contributor

Well stated @dimitris-athanasiou. This is basically a plumbing problem. To implement effective automatic pre-filtering, we need to collect all the filter queries applied through ancestors in the query tree and route them to the leaf queries that apply prefilters.

bool query is the only compound query type that allows users to apply new filter queries. All other compound query types just need to handle passing through prefilters to child queries.
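
The plumbing described here can be sketched as a tree walk that accumulates filters on the way down. This is an illustrative model under stated assumptions, not the PR's implementation: `Leaf`, `PassThrough`, `Bool`, and `collect` are hypothetical names, clauses are plain strings, and must_not handling is left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query-tree model for illustration; none of these types exist in Elasticsearch.
final class PrefilterPlumbingSketch {
    interface Node {}

    /** A leaf that can consume prefilters, e.g. a match or semantic query on a dense field. */
    static final class Leaf implements Node {
        final String name;
        Leaf(String name) { this.name = name; }
    }

    /** A compound query that adds no filters of its own and only forwards what it receives. */
    static final class PassThrough implements Node {
        final Node child;
        PassThrough(Node child) { this.child = child; }
    }

    /** A bool-like query: the only node kind that contributes new prefilters. */
    static final class Bool implements Node {
        final List<Node> scoringClauses;  // must and should
        final List<String> filterClauses; // filter (must_not omitted for brevity)
        Bool(List<Node> scoringClauses, List<String> filterClauses) {
            this.scoringClauses = scoringClauses;
            this.filterClauses = filterClauses;
        }
    }

    /** Walks the tree, carrying ancestor filters down and recording what each leaf receives. */
    static void collect(Node node, List<String> inherited, Map<String, List<String>> out) {
        if (node instanceof Leaf) {
            out.put(((Leaf) node).name, new ArrayList<>(inherited));
        } else if (node instanceof PassThrough) {
            collect(((PassThrough) node).child, inherited, out);
        } else if (node instanceof Bool) {
            Bool bool = (Bool) node;
            List<String> combined = new ArrayList<>(inherited);
            combined.addAll(bool.filterClauses);
            for (Node child : bool.scoringClauses) {
                collect(child, combined, out);
            }
        }
    }

    /** Outer bool with one filter; a pass-through wraps an inner bool with a second filter. */
    static Map<String, List<String>> example() {
        Node tree = new Bool(
            List.of(
                new PassThrough(new Bool(List.of(new Leaf("semantic(body)")), List.of("range(year>=2020)"))),
                new Leaf("match(title)")),
            List.of("term(lang=en)"));
        Map<String, List<String>> out = new LinkedHashMap<>();
        collect(tree, List.of(), out);
        return out;
    }

    public static void main(String[] args) {
        example().forEach((leaf, prefilters) -> System.out.println(leaf + " <- " + prefilters));
    }
}
```

In this example the inner leaf receives both the outer and the inner filter, while its sibling under the outer bool receives only the outer one.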

import static org.elasticsearch.search.fetch.subphase.InnerHitsContext.intersect;

- public class NestedQueryBuilder extends AbstractQueryBuilder<NestedQueryBuilder> {
+ public class NestedQueryBuilder extends AbstractQueryBuilder<NestedQueryBuilder> implements Prefiltering<NestedQueryBuilder> {
Member

@benwtrent benwtrent Nov 3, 2025

I don't immediately know how this should behave. A nested query could sit within a boolean query, in which case the higher-level filters are applied to the upper level of the nested context, while the inner filters would be applied to the lower level of the nested context.

Does this mean we try to push both down into something that is asking for pre-filtering?

Contributor Author

I'll work out an example of how this would work. Or realize it wouldn't :-)

Member

@carlosdelest carlosdelest left a comment

Overall looking good - the idea is nicely captured.

Some things I'd like to follow up on:

  • Rewriting of the filters seems necessary to me as part of the compound query rewriting.
  • We can probably benefit from an abstract class that introduces getters / setters and the propagation mechanism.
  • I think we are missing YAML tests and tests for the compound query builders. An abstract class would make sense here as well, to ensure we check prefilters consistently when adding them, on rewriting, etc.


@Override
public List<QueryBuilder> getPrefilters() {
return Stream.concat(prefilters.stream(), filterClauses.stream()).toList();
Member

We should take into account must and must_not as well, right?

Contributor Author

We decided not to include must clauses. They are complicated because each must clause would need the other n - 1 must clauses applied to it as prefilters.

must_not we decided to add here. I'll do so.

Member

Mmmm, I see. IIUC that should happen with must_not clauses as well?

One way to look at it would be to retrieve all must / must_not / filter clauses in a boolean query and add them to the prefilters. That way we could use the same strategy everywhere: push to each target query only the prefilters that differ from the target query itself 🤔

We can iterate on this! No need to implement every corner case as of now - it will help to reason first on the base cases and do follow ups as needed.
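
The "all clauses except the target" idea floated in this comment could look roughly like the sketch below. It only illustrates the suggestion and is not what the PR does (the author's position in this thread is that must clauses are not used as prefilters). `AllButSelfPrefilterSketch` and `prefiltersFor` are hypothetical names, clauses are plain strings compared by equality, and `NOT(...)` is an assumed rendering of a negated clause.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration of a suggestion from the review thread; not the behavior implemented in the PR.
final class AllButSelfPrefilterSketch {

    /** For each must clause, returns every pooled clause except that must clause itself. */
    static Map<String, List<String>> prefiltersFor(List<String> must, List<String> filter, List<String> mustNot) {
        // Pool all clauses that restrict the matching documents.
        List<String> pool = new ArrayList<>(must);
        pool.addAll(filter);
        for (String clause : mustNot) {
            pool.add("NOT(" + clause + ")"); // assumed representation of a negated clause
        }

        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String target : must) {
            List<String> others = new ArrayList<>(pool);
            others.remove(target); // drops the first clause equal to the target
            out.put(target, others);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> result = prefiltersFor(
            List.of("semantic(body)", "match(title)"),
            List.of("term(lang=en)"),
            List.of("term(status=draft)"));
        result.forEach((target, prefilters) -> System.out.println(target + " <- " + prefilters));
    }
}
```

Each of the n must clauses ends up with the other n - 1 must clauses plus the filter and negated must_not clauses, which is the complication mentioned earlier in the thread.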

Contributor Author

Sorry, yes, must_not can be propagated, not used as target queries. So, we propagate filter and must_not clauses to must and should clauses.

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025
BASE=e486766b84b4a2c21b5f01e328bd764f1edd20b2
HEAD=22006a8e97999ad822bc1c965873787119340b35
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 8, 2025
BASE=e486766b84b4a2c21b5f01e328bd764f1edd20b2
HEAD=22006a8e97999ad822bc1c965873787119340b35
Branch=main
4 participants