How to prune non-relevant top documents automatically #55603

jimczi · 2020-04-22T15:33:01Z

In 7.0 we added an optimization that allows pure disjunction (OR) queries to run without visiting all matches of the most frequent terms. Prior to this version, users had to remove the most frequent terms (stop word removal) or switch to the common terms query to get acceptable performance.
We decided to deprecate the common terms query for this reason. Users shouldn't rely on a cutoff_frequency to ensure fast disjunctions. The right cutoff_frequency changes as documents are added or deleted, and the frequency of the same term can differ even between replicas (since deleted docs are part of the count), which makes it slightly dangerous to use. A small change in your index can make some queries much slower because a high-frequency term no longer reaches the current cutoff_frequency.
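
To make the fragility concrete, here is a minimal Python sketch of how a term can fall on different sides of the cutoff on two copies of the same shard. It is a simplified model, not the actual Elasticsearch logic: the function `is_high_frequency`, the shard dictionaries, and all the numbers are assumptions made up for illustration. It treats a cutoff of 1 or more as an absolute document count and a smaller value as a fraction of the documents in the shard.

```python
import math


def is_high_frequency(doc_freq: int, max_doc: int, cutoff_frequency: float) -> bool:
    """Simplified model of the high/low frequency split (an assumption,
    not the real implementation)."""
    if cutoff_frequency >= 1.0:
        # Absolute cutoff: compare against a fixed document count.
        return doc_freq > cutoff_frequency
    # Relative cutoff: compare against a fraction of the shard's documents.
    return doc_freq > math.ceil(cutoff_frequency * max_doc)


# Hypothetical copies of one shard holding the same live documents. The
# replica has not merged away its deleted documents yet, so its counts
# are higher than the primary's.
primary = {"doc_freq": 98, "max_doc": 1000}
replica = {"doc_freq": 104, "max_doc": 1060}

cutoff = 100  # absolute cutoff chosen by the user
primary_high = is_high_frequency(primary["doc_freq"], primary["max_doc"], cutoff)
replica_high = is_high_frequency(replica["doc_freq"], replica["max_doc"], cutoff)

print("primary treats the term as frequent:", primary_high)
print("replica treats the term as frequent:", replica_high)
```

In this made-up scenario the same query is fast on one copy and slow on the other, which is the kind of instability described above.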

However, the common terms query is also sometimes used to improve the precision of search results. For instance, the query "the OR beatles" would return top documents containing only "the" if no document contains the term "beatles". Using the common terms query can ensure (assuming that the cutoff_frequency considers "the" as a frequent term) that no results are returned in this case. This looks like a valid use case for this query, so we're wondering whether we should un-deprecate it, since we don't have a direct replacement for this feature.
One thing that was raised during the initial discussion is that we should look at improving the detection of high-frequency terms without the need for users to provide a precise cutoff_frequency. We also think that it's worth discussing all options, which is why I am opening this issue and marking it as a blocker for 8.0.
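
The precision use case can be sketched with a toy in-memory index in Python. This is only an approximation of the behaviour described above, under stated assumptions: the corpus, the helper names (`doc_freq`, `plain_or`, `common_terms_like`), and the absolute cutoff of 1 are hypothetical, and treating a query made only of frequent terms as a conjunction is a simplification.

```python
from typing import Dict, List, Set

Docs = Dict[int, Set[str]]


def doc_freq(term: str, docs: Docs) -> int:
    """Number of documents that contain the term."""
    return sum(term in tokens for tokens in docs.values())


def plain_or(terms: List[str], docs: Docs) -> List[int]:
    """Pure disjunction: a document matches if it contains any term."""
    return sorted(i for i, tokens in docs.items() if any(t in tokens for t in terms))


def common_terms_like(terms: List[str], docs: Docs, cutoff: int) -> List[int]:
    """Only low-frequency terms decide which documents match; frequent
    terms would merely contribute to scoring (not modelled here)."""
    low = [t for t in terms if doc_freq(t, docs) <= cutoff]
    if not low:
        # Simplification: with frequent terms only, require all of them.
        return sorted(i for i, tokens in docs.items() if all(t in tokens for t in terms))
    return sorted(i for i, tokens in docs.items() if any(t in tokens for t in low))


docs: Docs = {
    1: {"the", "rolling", "stones"},
    2: {"the", "who"},
    3: {"pink", "floyd"},
}

or_hits = plain_or(["the", "beatles"], docs)
pruned_hits = common_terms_like(["the", "beatles"], docs, cutoff=1)
print(or_hits)      # documents matched through "the" alone
print(pruned_hits)  # nothing matches the rare term "beatles"
```

The plain disjunction returns documents 1 and 2 purely because of "the", while the cutoff-based variant returns nothing because no document contains "beatles".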

I am curious to hear thoughts from users of the common terms query, and particularly how you deal with changing indices when updating the cutoff_frequency.

elasticmachine · 2020-04-22T15:33:04Z

Pinging @elastic/es-search (:Search/Search)

mayya-sharipova · 2020-04-23T19:09:13Z

> we should look at improving the detection of high-frequency terms without the need for users to provide a precise cutoff_frequency.

Another idea is to have a dynamic min_score that is supplied as a percentage of the max_score. After we have collected the top hits and know their max_score and min_score, we can filter out hits whose scores are too low.
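
As a rough illustration of this idea, the filtering step could look like the following Python sketch, applied to hits that have already been collected. The function name `prune_by_relative_score`, the `min_score_ratio` parameter, and the sample scores are hypothetical and not an existing Elasticsearch option.

```python
from typing import Dict, List


def prune_by_relative_score(hits: List[Dict], min_score_ratio: float) -> List[Dict]:
    """Keep only hits scoring at least min_score_ratio * max_score.

    min_score_ratio is a hypothetical parameter in the range [0, 1].
    """
    if not hits:
        return []
    max_score = max(hit["_score"] for hit in hits)
    threshold = max_score * min_score_ratio
    return [hit for hit in hits if hit["_score"] >= threshold]


# Made-up top hits, e.g. as collected for a query such as "the OR beatles".
top_hits = [
    {"_id": "a", "_score": 12.0},
    {"_id": "b", "_score": 7.5},
    {"_id": "c", "_score": 0.9},
]

kept = prune_by_relative_score(top_hits, min_score_ratio=0.5)
print([hit["_id"] for hit in kept])
```

With a ratio of 0.5 the threshold is 6.0, so the low-scoring hit "c" is dropped. One open design question with this approach is that a fixed ratio may not suit every query, since score distributions vary with the number and rarity of query terms.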

markharwood · 2020-06-30T14:54:41Z

> However, the common terms query is also sometimes used to improve the precision of search results.

Dropping terms is one way to improve precision but there are others too:

- Switching from ngram fields to full-term fields (see app-search)
- Switching from OR to AND
- Switching from AND to phrase

What I was considering, when looking at ways to trim long tails of garbage, was how to tell when switching from a stricter search strategy to a weaker one breaks the meaning of the query. I think this may be detectable if you are lucky enough to have well-categorised data (many ecommerce vendors spend a lot of time on this). There can be a step change in the diversity of categories as you switch from a strong strategy to a weaker one: the count of categories can act as a measure of the number of different meanings a query clause has. Consider this analysis of high-scoring versus low-scoring results in an ecommerce query:

If we can organise results into buckets based on the query clause strictness (I used large boosts to separate two clauses in the above example) then we can use a count of categories in each bucket as a measure of the focus in each clause. Poorly focused clauses might be ones with hundreds of categories and would be ones we might choose to drop.
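
A minimal Python sketch of this bucketing idea follows, assuming each hit has already been tagged with the clause that matched it and with a category. The field names `clause` and `category`, the threshold `max_categories`, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def categories_per_clause(hits: List[Dict]) -> Dict[str, int]:
    """Count the distinct categories seen in each clause's bucket."""
    seen = defaultdict(set)
    for hit in hits:
        seen[hit["clause"]].add(hit["category"])
    return {clause: len(categories) for clause, categories in seen.items()}


def unfocused_clauses(hits: List[Dict], max_categories: int) -> List[str]:
    """Clauses whose results spread over more than max_categories categories."""
    counts = categories_per_clause(hits)
    return sorted(clause for clause, n in counts.items() if n > max_categories)


# Made-up results: a strict phrase clause and a weaker OR clause.
hits = [
    {"clause": "phrase", "category": "vacuum cleaners"},
    {"clause": "phrase", "category": "vacuum cleaners"},
    {"clause": "or", "category": "vacuum cleaners"},
    {"clause": "or", "category": "toys"},
    {"clause": "or", "category": "books"},
    {"clause": "or", "category": "garden"},
]

counts = categories_per_clause(hits)
to_drop = unfocused_clauses(hits, max_categories=2)
print(counts)
print(to_drop)
```

Here the phrase clause stays within one category while the OR clause spreads over four, so the OR clause is the candidate to drop.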

mabdelhedi · 2021-06-02T09:30:40Z

Hello,
I am also facing this problem after removing the cutoff_frequency from my multi-match query. For example, when searching a "street" field for "jump street", I get some documents that only match the word "street" (a frequent word)...

How can I really "cut off" the frequent term so that it is ignored by the query (so that I only get documents containing at least "jump") without having to look for all frequent words and define them as stopwords?

jimczi · 2021-10-15T13:43:29Z

I am removing the blocker label for now. We have not yet decided whether we should restore the functionality or provide a replacement.

elasticsearchmachine · 2022-07-27T15:00:41Z

Pinging @elastic/es-search (Team:Search)

jimczi added blocker :Search/Search Search-related issues that do not fall into other categories v8.0.0 labels Apr 22, 2020

mayya-sharipova added the feedback_needed label Apr 23, 2020

rjernst added the Team:Search Meta label for search team label May 4, 2020

markharwood mentioned this issue Jul 6, 2020

Consider support for an optional 'fallback query'. #51840

Open

jimczi removed the blocker label Oct 15, 2021

arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022

mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022

salvatore-campagna added v8.3.0 and removed v8.2.0 labels Mar 30, 2022

craigtaverner added v8.4.0 and removed v8.3.0 labels May 25, 2022

mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022

csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022

kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022

rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023

gmarouli removed the v8.8.0 label Apr 26, 2023

gmarouli added the v8.9.0 label Apr 26, 2023

pugnascotia added v8.10.0 and removed v8.9.0 labels Jun 22, 2023

quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023

mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023

brianseeders added v8.13.0 and removed v8.12.0 labels Dec 6, 2023

elasticsearchmachine added v8.14.0 and removed v8.13.0 labels Feb 14, 2024

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels Apr 17, 2024
