Use CoveringQuery to select percolator candidates? #26307
Comments
I worked on integrating the CoveringQuery into the percolator and it is a good fit! I tested this change on a percolator query set and query time was reduced by a factor of 3. I did run into two challenges:
[...]
Maybe we should still use a [...] and then fall back to using a constant [...]
Extract all clauses from a conjunction query.

When clauses are extracted from a conjunction, the number of clauses is also stored in an internal doc values field (the minimum_should_match field). The CoveringQuery uses this field, which lets the percolator reduce the number of false positives when selecting candidate matches and, in certain cases, be absolutely sure that a conjunction candidate will match, so that MemoryIndex validation can be skipped. This can greatly improve performance.

Before this change, only a single clause was extracted from a conjunction query. The percolator tried to extract the rarest clause (judged by term length) so that fewer candidate queries would be selected in the first place. However, with this method there is still a very high chance that candidate query matches are false positives.

This change also removes the influencing query extraction added via #26081, which is no longer needed now that all conjunction clauses are extracted: https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extraction

Closes #26307
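The difference between the old and new extraction can be sketched with a small, self-contained model. This is not Elasticsearch or Lucene code: plain Python sets stand in for the index, and the names ExtractedQuery, extract_rarest, extract_all and is_candidate are hypothetical, chosen only for illustration. The "longest term is rarest" rule is a simplification of the term-length heuristic described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedQuery:
    """Hypothetical stand-in for what gets indexed for one percolator query."""
    terms: frozenset            # clauses kept for candidate selection
    minimum_should_match: int   # how many of those clauses must match


def extract_rarest(conjunction):
    """Old behaviour (simplified): keep only the longest term,
    treating term length as a proxy for rarity."""
    rarest = max(conjunction, key=len)
    return ExtractedQuery(frozenset([rarest]), 1)


def extract_all(conjunction):
    """New behaviour (simplified): keep every clause and record the count."""
    terms = frozenset(conjunction)
    return ExtractedQuery(terms, len(terms))


def is_candidate(extracted, document_terms):
    """A query is a candidate when enough of its indexed terms
    occur in the document being percolated."""
    matched = len(extracted.terms & set(document_terms))
    return matched >= extracted.minimum_should_match


conjunction = ["elasticsearch", "percolator", "lucene"]
document = ["elasticsearch", "search"]

# Old extraction: selected as a candidate although the conjunction cannot match.
print(is_candidate(extract_rarest(conjunction), document))
# New extraction: rejected during candidate selection.
print(is_candidate(extract_all(conjunction), document))
```

In this model the single-clause extraction produces a false positive that would later have to be rejected by running the query against the in-memory index, while the all-clauses extraction filters it out up front.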
Currently, when extracting conjunctions such as a AND b AND c, we use heuristics to figure out which term is the rarest, and we index only that term. Lucene just got a new CoveringQuery, which allows configuring the number of required clauses on a per-document basis. This means that we could index all terms of a AND b AND c alongside a numeric field that would store 3, and then at query time the created CoveringQuery wouldn't have false positives: we would not even need to run the query through the memory index to check whether it matches.
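As a rough illustration of the per-document required count, the sketch below models the idea in plain Python rather than calling Lucene. The stored-query table, the query ids q1 and q2, and the functions covering_match and select_candidates are all assumptions made for this example; only the a AND b AND c conjunction and the stored value 3 come from the description above.

```python
def covering_match(indexed_terms, required, document_terms):
    """True when at least `required` of the indexed terms occur in the document."""
    return len(set(indexed_terms) & set(document_terms)) >= required


# Hypothetical stored percolator queries: id -> (indexed terms, required count).
# Each entry carries its own required count, mirroring the numeric field
# that would be indexed next to the terms.
STORED_QUERIES = {
    "q1": ({"a", "b", "c"}, 3),  # a AND b AND c
    "q2": ({"a", "d"}, 2),       # a AND d
}


def select_candidates(stored_queries, document_terms):
    """Return the ids of stored queries whose required count is met."""
    return sorted(
        query_id
        for query_id, (terms, required) in stored_queries.items()
        if covering_match(terms, required, document_terms)
    )


print(select_candidates(STORED_QUERIES, ["a", "b", "c", "x"]))  # only q1 is fully covered
print(select_candidates(STORED_QUERIES, ["a", "b"]))            # no query is fully covered
```

For pure term conjunctions like these, meeting the required count means every clause matched, which is why the memory index check could be skipped in that case.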