Use CoveringQuery to select percolator candidates? #26307

jpountz · 2017-08-21T14:38:19Z

Currently when extracting conjunctions such as a AND b AND c, we use heuristics in order to figure out which term is the rarest and we only index this term.

Lucene just got a new CoveringQuery which allows to configure the number of required clauses on a per-document basis. This means that we could index all terms of a AND b AND c alongside a numeric field that would store 3 and then at query time the created CoveringQuery wouldn't have false positives: we would not even need to run the query through the memory index to check whether it matches.

The text was updated successfully, but these errors were encountered:

martijnvg · 2017-10-06T16:02:28Z

I worked on integrating the CoveringQuery in the percolator and it is a good fit! I tested this change out on a percolator query set and the query time was reduced by 3 times.

I did run into two challenges:

Currently all the extracted terms from the document to be percolated are being put in a TermInSetQuery instance. This is no longer possible with the CoveringQuery, because then all terms inside TermsInSetQuery count as single match. So currently in the commit mentioned above each query term is added as a separateTermQuery instance to the CoveringQuery. This is not ideal because if a large document is to be percolated then could cause a TooManyClauses exception. I need to think about this more on how to solve this.
In case of range queries each extraction does not simply increment the minimum_should_match for that percolator query like for a term based extraction, so that can lead to more false positives for percolator queries with range queries than term based queries. The is because the way number fields are extracted from the document to be percolated. Per field a single range is extracted and if a percolator query has two or more range queries on the same field than the the minimum should match can be higher than clauses in the CoveringQuery. Therefore right now the minimum should match is incremented once per number field when processing the percolator query at index time. I think that this is acceptable, percolator queries with range queries are more prone to false positives than percolator queries with term based queries to begin with, due to how the extracted ranges are stored.

jpountz · 2017-10-06T16:09:56Z

This is not ideal because if a large document is to be percolated then could cause a TooManyClauses exception

Maybe we should still use a TermsQuery when percolated documents are large?

martijnvg · 2017-10-06T16:16:24Z

Maybe we should still use a TermsQuery when percolated documents are large?

and than fallback using a constant LongValuesSource instance set to 1 as minimumNumberMatch?

extract all clauses from a conjunction query. When clauses from a conjunction are extracted the number of clauses is also stored in an internal doc values field (minimum_should_match field). This field is used by the CoveringQuery and allows the percolator to reduce the number of false positives when selecting candidate matches and in certain cases be absolutely sure that a conjunction candidate match will match and then skip MemoryIndex validation. This can greatly improve performance. Before this change only a single clause was extracted from a conjunction query. The percolator tried to extract the clauses that was rarest in order (based on term length) to attempt less candidate queries to be selected in the first place. However this still method there is still a very high chance that candidate query matches are false positives. This change also removes the influencing query extraction added via #26081 as this is no longer needed because now all conjunction clauses are extracted. https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extraction Closes #26307

jpountz added :Search Relevance/Percolator Reverse search: find queries that match a document >enhancement labels Aug 21, 2017

martijnvg mentioned this issue Aug 21, 2017

Improve percolator performance #25445

Closed

9 tasks

martijnvg mentioned this issue Nov 6, 2017

Use Lucene's CoveringQuery to select percolate candidate matches #27271

Merged

martijnvg closed this as completed in b4048b4 Nov 10, 2017

javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use CoveringQuery to select percolator candidates? #26307

Use CoveringQuery to select percolator candidates? #26307

jpountz commented Aug 21, 2017

martijnvg commented Oct 6, 2017

jpountz commented Oct 6, 2017

martijnvg commented Oct 6, 2017

Use CoveringQuery to select percolator candidates? #26307

Use CoveringQuery to select percolator candidates? #26307

Comments

jpountz commented Aug 21, 2017

martijnvg commented Oct 6, 2017

jpountz commented Oct 6, 2017

martijnvg commented Oct 6, 2017