Description
We currently only push ES|QL conditions that have an exact equivalent to a Lucene query.
Take the following query as an example, where all conditions are pushed down to Lucene and the SORT is pushed down as well:
FROM wikipedia METADATA _score
| WHERE title:"europe"
| sort _score desc
| limit 10
This query will use the LuceneTopNSourceOperator and should be close in performance to running a match query in the DSL. The LuceneTopNSourceOperator will emit at most 10 rows that need to be processed on the compute service side.
The moment we have a filter condition that is not pushed down, we no longer use the LuceneTopNSourceOperator, but the LuceneSourceOperator instead. We still push down the match query, but the LuceneSourceOperator outputs all docs that match the query string. These docs are then processed on the compute service side, first to apply the non-pushable filter and then to sort them and get the top 10:
FROM wikipedia METADATA _score
| WHERE title:"europe" and length(title) > 10
| sort _score desc
| limit 10
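To make the cost difference concrete, here is a minimal Python sketch of the two execution shapes. It is a toy model, not the actual operator code: the `DOCS` list, its scores, and the helper names `matches`, `top_n_source`, and `source_then_filter` are all made up for illustration.

```python
import heapq

# Hypothetical toy index of (doc_id, title, score); the scores are invented.
DOCS = [
    (0, "europe", 3.0),
    (1, "history of europe", 2.5),
    (2, "europe (band)", 2.0),
    (3, "geography of europe", 1.5),
    (4, "asia", 0.0),
]


def matches(title):
    # Stand-in for the pushed-down match query title:"europe".
    return "europe" in title


def top_n_source(limit):
    """Roughly the LuceneTopNSourceOperator shape: at most `limit` rows leave the source."""
    hits = [d for d in DOCS if matches(d[1])]
    return heapq.nlargest(limit, hits, key=lambda d: d[2])


def source_then_filter(limit, predicate):
    """Roughly the LuceneSourceOperator shape: every match leaves the source,
    then the filter and the top-N run on the compute side."""
    emitted = [d for d in DOCS if matches(d[1])]
    kept = [d for d in emitted if predicate(d[1])]
    return len(emitted), heapq.nlargest(limit, kept, key=lambda d: d[2])


emitted, top = source_then_filter(2, lambda title: len(title) > 10)
print(emitted, [d[0] for d in top])
```

In this toy model the second shape hands four rows to the compute side to produce two results, and that gap grows with the number of docs matching the query string.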
We should look into whether we can push down more WHERE conditions as filters.
We will likely need a custom Lucene query for this that can evaluate an ES|QL expression.
We can start with something simple, such as pushing down only conditions that depend on literals and indexed fields (not on other runtime columns produced by EVALs).
If we first want to validate whether this would improve things, we can start with a simple prototype that pushes down a set of conditions like length(title) > 10 as a Painless script query, and then run a simple benchmark on a larger dataset (where we are more likely to see an improvement).
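The "literals and indexed fields only" rule could be checked with a small recursive walk over the condition. The sketch below is only an illustration in Python: the `Literal`, `Column`, and `Call` classes and the `partition` helper are hypothetical and do not correspond to the real ES|QL expression classes.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    fn: str
    args: Tuple["Expr", ...]


Expr = Union[Literal, Column, Call]


def is_pushable(expr, indexed_fields):
    """True if the expression only touches literals and indexed fields."""
    if isinstance(expr, Literal):
        return True
    if isinstance(expr, Column):
        # Columns produced by EVAL are not in indexed_fields, so they fail here.
        return expr.name in indexed_fields
    return all(is_pushable(arg, indexed_fields) for arg in expr.args)


def split_conjuncts(expr):
    """Flatten nested ANDs into a list of independent conditions."""
    if isinstance(expr, Call) and expr.fn == "and":
        parts = []
        for arg in expr.args:
            parts.extend(split_conjuncts(arg))
        return parts
    return [expr]


def partition(where, indexed_fields):
    """Split a WHERE condition into (pushable, remaining) conjuncts."""
    pushable, remaining = [], []
    for cond in split_conjuncts(where):
        (pushable if is_pushable(cond, indexed_fields) else remaining).append(cond)
    return pushable, remaining
```

For `length(title) > 10 and length(tmp) > 3`, where `tmp` comes from an EVAL, only the first conjunct would be classified as pushable; the second would stay on the compute side.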
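As a starting point for such a prototype, the DSL request that the pushed-down plan would need to approximate might look like the sketch below. It assumes a `title.keyword` subfield with doc values exists, which the issue does not state, and `build_prototype_query` is a hypothetical helper. Note also that Painless `length()` is Java's `String.length()`, which may not agree with ES|QL `length` for every string, so this is only suitable for a rough benchmark.

```python
import json


def build_prototype_query(match_field, match_text, keyword_field, min_length, size=10):
    """Build a bool query: scored match clause plus a non-scoring script filter."""
    return {
        "size": size,  # results are sorted by _score descending by default
        "query": {
            "bool": {
                "must": [{"match": {match_field: match_text}}],
                "filter": [
                    {
                        "script": {
                            "script": {
                                "lang": "painless",
                                # Guard against docs without a value before reading it.
                                "source": (
                                    "doc[params.field].size() > 0 && "
                                    "doc[params.field].value.length() > params.min_length"
                                ),
                                "params": {"field": keyword_field, "min_length": min_length},
                            }
                        }
                    }
                ],
            }
        },
    }


# Assumed field names; adjust to the actual mapping of the benchmark index.
body = build_prototype_query("title", "europe", "title.keyword", 10)
print(json.dumps(body, indent=2))
```

Comparing this request against the ES|QL query with the non-pushed `length(title) > 10` filter would give a first indication of how much there is to gain.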
EDIT: We should also look into adding a min competitive score optimization.
For example, we can add a callback between the LuceneSourceOperator and the TopNOperator, so that once the priority queue in the TopNOperator is full, it can set a min competitive score back on the LuceneSourceOperator.
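The callback idea can be sketched as follows. This is a simplified Python model with hypothetical `Source` and `TopN` classes that process one doc at a time; the real operators work on pages of docs, and a real implementation would presumably forward the threshold to the Lucene scorer rather than compare scores by hand.

```python
import heapq


class TopN:
    """Keeps the `limit` best-scoring rows and reports its lowest kept score."""

    def __init__(self, limit, on_full):
        self.limit = limit
        self.heap = []  # min-heap of (score, doc_id); heap[0] is the weakest entry
        self.on_full = on_full

    def add(self, doc_id, score):
        if len(self.heap) < self.limit:
            heapq.heappush(self.heap, (score, doc_id))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, doc_id))
        else:
            return
        if len(self.heap) == self.limit:
            # Queue is full: anything scoring below heap[0] can no longer qualify.
            self.on_full(self.heap[0][0])

    def results(self):
        return [(doc_id, score) for score, doc_id in sorted(self.heap, reverse=True)]


class Source:
    """Emits matching docs, skipping those below the min competitive score."""

    def __init__(self, scored_docs):
        self.scored_docs = scored_docs
        self.min_competitive = float("-inf")
        self.emitted = 0

    def set_min_competitive_score(self, score):
        self.min_competitive = score

    def run(self, predicate, sink):
        for doc_id, title, score in self.scored_docs:
            if score < self.min_competitive:
                continue  # cannot enter the top N, so never leaves the source
            self.emitted += 1
            if predicate(title):  # the non-pushed filter, applied on the compute side
                sink.add(doc_id, score)
```

Because the queue only ever contains rows that passed the filter, the threshold stays safe even when the filter is not pushed down: a doc scoring below it could not enter the top N whether or not it passes the filter.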