Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CoveringQuery to select percolator candidates? #26307

Closed
jpountz opened this issue Aug 21, 2017 · 3 comments
Closed

Use CoveringQuery to select percolator candidates? #26307

jpountz opened this issue Aug 21, 2017 · 3 comments
Labels
>enhancement :Search Relevance/Percolator Reverse search: find queries that match a document Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@jpountz
Copy link
Contributor

jpountz commented Aug 21, 2017

Currently when extracting conjunctions such as a AND b AND c, we use heuristics in order to figure out which term is the rarest and we only index this term.

Lucene just got a new CoveringQuery which allows to configure the number of required clauses on a per-document basis. This means that we could index all terms of a AND b AND c alongside a numeric field that would store 3 and then at query time the created CoveringQuery wouldn't have false positives: we would not even need to run the query through the memory index to check whether it matches.

@jpountz jpountz added :Search Relevance/Percolator Reverse search: find queries that match a document >enhancement labels Aug 21, 2017
@martijnvg
Copy link
Member

I worked on integrating the CoveringQuery in the percolator and it is a good fit! I tested this change out on a percolator query set and the query time was reduced by 3 times.

I did run into two challenges:

  • Currently all the extracted terms from the document to be percolated are being put in a TermInSetQuery instance. This is no longer possible with the CoveringQuery, because then all terms inside TermsInSetQuery count as single match. So currently in the commit mentioned above each query term is added as a separateTermQuery instance to the CoveringQuery. This is not ideal because if a large document is to be percolated then could cause a TooManyClauses exception. I need to think about this more on how to solve this.
  • In case of range queries each extraction does not simply increment the minimum_should_match for that percolator query like for a term based extraction, so that can lead to more false positives for percolator queries with range queries than term based queries. The is because the way number fields are extracted from the document to be percolated. Per field a single range is extracted and if a percolator query has two or more range queries on the same field than the the minimum should match can be higher than clauses in the CoveringQuery. Therefore right now the minimum should match is incremented once per number field when processing the percolator query at index time. I think that this is acceptable, percolator queries with range queries are more prone to false positives than percolator queries with term based queries to begin with, due to how the extracted ranges are stored.

@jpountz
Copy link
Contributor Author

jpountz commented Oct 6, 2017

This is not ideal because if a large document is to be percolated then could cause a TooManyClauses exception

Maybe we should still use a TermsQuery when percolated documents are large?

@martijnvg
Copy link
Member

Maybe we should still use a TermsQuery when percolated documents are large?

and than fallback using a constant LongValuesSource instance set to 1 as minimumNumberMatch?

martijnvg added a commit that referenced this issue Nov 10, 2017
extract all clauses from a conjunction query.

When clauses from a conjunction are extracted the number of clauses is
also stored in an internal doc values field (minimum_should_match field).
This field is used by the CoveringQuery and allows the percolator to
reduce the number of false positives when selecting candidate matches and
in certain cases be absolutely sure that a conjunction candidate match
will match and then skip MemoryIndex validation. This can greatly improve
performance.

Before this change only a single clause was extracted from a conjunction
query. The percolator tried to extract the clauses that was rarest in order
(based on term length) to attempt less candidate queries to be selected
in the first place. However this still method there is still a very high
chance that candidate query matches are false positives.

This change also removes the influencing query extraction added via #26081
as this is no longer needed because now all conjunction clauses are extracted.

https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extraction

Closes #26307
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Relevance/Percolator Reverse search: find queries that match a document Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch
Projects
None yet
Development

No branches or pull requests

3 participants