How to prune non-relevant top documents automatically #55603

Open
jimczi opened this issue Apr 22, 2020 · 6 comments
Labels
feedback_needed :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.15.0

Comments

@jimczi
Contributor

jimczi commented Apr 22, 2020

In 7.0 we added an optimization that allows pure disjunction (OR) queries to run without visiting all matches of the most frequent terms. Prior to this version, users had to either remove the most frequent terms (stop word removal) or switch to the common terms query to get acceptable performance.
We decided to deprecate the common terms query for this reason. Users shouldn't rely on a cutoff_frequency to ensure fast disjunctions. The right cutoff_frequency changes as documents are added or deleted, and the frequency of the same term can differ even between replicas (since deleted docs are part of the count), which makes it slightly dangerous to use. A small change in your index can make some queries much slower because a high-frequency term no longer reaches the current cutoff_frequency.
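
To make the fragility concrete, here is a minimal sketch of a relative cutoff check. It is a simplified model, not Elasticsearch's actual implementation: the function name `is_high_frequency` and all the counts are made up for illustration, and it assumes the cutoff is compared against the share of documents that contain the term.

```python
def is_high_frequency(doc_freq: int, max_doc: int, cutoff_frequency: float) -> bool:
    """Simplified model (an assumption): a term counts as 'high frequency'
    when the share of documents containing it exceeds a relative cutoff."""
    return doc_freq / max_doc > cutoff_frequency


CUTOFF = 0.01  # hypothetical relative cutoff_frequency

# Hypothetical shard copy A: 'street' appears in 101 of 10,000 docs.
copy_a = is_high_frequency(doc_freq=101, max_doc=10_000, cutoff_frequency=CUTOFF)

# Hypothetical shard copy B: same doc_freq, but 200 more docs are counted
# (for example because more documents were indexed, or because deleted
# docs have not been merged away yet on this copy).
copy_b = is_high_frequency(doc_freq=101, max_doc=10_200, cutoff_frequency=CUTOFF)

print(copy_a, copy_b)  # True False
```

With these toy numbers the same term is treated as frequent on one copy and as rare on the other, so the same query could take a fast or a slow path depending on where it runs.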

However, the common terms query is also sometimes used to improve the precision of search results. For instance, the query "the OR beatles" would return top documents containing only "the" if there are no documents containing the term "beatles". Using the common terms query can ensure (assuming that the cutoff_frequency treats "the" as a frequent term) that no results are returned in this case. This looks like a valid use case for this query, so we're wondering if we should un-deprecate it since we don't have a direct replacement for this feature.
One thing that was raised during the initial discussion is that we should look at improving the detection of high-frequency terms without the need for users to provide a precise cutoff_frequency. We also think that it's worth discussing all options, which is why I am opening this issue and marking it as a blocker for 8.0.
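
The precision use case can be illustrated with a small in-memory sketch. This is a toy model of the idea, not the common terms query itself: the function `common_terms_match`, the document set, and the cutoff are all hypothetical, and the fallback used when every query term is frequent is an arbitrary choice made for this sketch.

```python
from typing import Dict, List, Set


def common_terms_match(
    query_terms: List[str],
    docs: Dict[str, Set[str]],
    cutoff_frequency: float,
) -> List[str]:
    """Return ids of docs matching at least one low-frequency query term.

    High-frequency terms alone are not enough to match. If every query term
    is high frequency, fall back to a plain OR over them (sketch assumption).
    """
    n = len(docs)
    doc_freq = {
        t: sum(1 for terms in docs.values() if t in terms) for t in query_terms
    }
    low = [t for t in query_terms if doc_freq[t] / n <= cutoff_frequency]
    required = low if low else query_terms
    return [doc_id for doc_id, terms in docs.items()
            if any(t in terms for t in required)]


# Toy index with no document containing "beatles".
docs = {
    "d1": {"the", "rolling", "stones"},
    "d2": {"the", "who"},
    "d3": {"the", "kinks"},
}

# A plain OR would return d1, d2 and d3 because they all contain "the".
print(common_terms_match(["the", "beatles"], docs, cutoff_frequency=0.5))  # []

# Once a matching document exists, only that document is returned.
docs["d4"] = {"the", "beatles"}
print(common_terms_match(["the", "beatles"], docs, cutoff_frequency=0.5))  # ['d4']
```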

I am curious to hear thoughts from users of the common terms query, and particularly how you keep the cutoff_frequency up to date as your indices change.

@jimczi jimczi added blocker :Search/Search Search-related issues that do not fall into other categories v8.0.0 labels Apr 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@mayya-sharipova
Contributor

we should look at improving the detection of high-frequency terms without the need for users to provide a precise cutoff_frequency.

Another idea is to have a dynamic min_score that is supplied as a percentage of the max_score. Once we have collected the top hits and know their max_score and min_score, we can filter out hits whose scores are too low.
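
As a rough client-side sketch of this idea (not an existing Elasticsearch option): the helper `prune_by_relative_score`, the `min_score_ratio` parameter, and the scores below are all hypothetical, and the hits are assumed to be already collected as `(doc_id, score)` pairs.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (doc_id, score)


def prune_by_relative_score(hits: List[Hit], min_score_ratio: float) -> List[Hit]:
    """Keep only hits whose score is at least min_score_ratio * max_score."""
    if not hits:
        return []
    max_score = max(score for _, score in hits)
    threshold = max_score * min_score_ratio
    return [(doc_id, score) for doc_id, score in hits if score >= threshold]


# Made-up top hits with a long tail of weak matches.
top_hits = [("a", 12.0), ("b", 9.5), ("c", 1.1), ("d", 0.9)]

# Keep hits scoring at least 50% of the best hit (threshold = 6.0).
print(prune_by_relative_score(top_hits, min_score_ratio=0.5))
# [('a', 12.0), ('b', 9.5)]
```

Because the threshold is relative to the best hit, it adapts per query, but it also means a query whose best hit is itself weak still returns results.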

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@markharwood
Contributor

However, the common terms query is also sometimes used to improve the precision of search results.

Dropping terms is one way to improve precision but there are others too:

  • Switching from ngram fields to full-term fields (see app-search)
  • Switching from OR to AND
  • Switching from AND to phrase

When looking at ways to trim long tails of garbage, I was considering how to tell when switching from a stricter search strategy to a weaker one breaks the meaning of the query. I think this may be detectable if you are lucky enough to have well-categorised data (many ecommerce vendors spend a lot of time on this). There can be a step-change in the diversity of categories as you switch from a strong strategy to a weaker one: the count of categories can act as a measure of the number of different meanings a query clause has. Consider this analysis of high-scoring versus low-scoring results in an ecommerce query:

If we can organise results into buckets based on the query clause strictness (I used large boosts to separate two clauses in the above example) then we can use a count of categories in each bucket as a measure of the focus in each clause. Poorly focused clauses might be ones with hundreds of categories and would be ones we might choose to drop.
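
A minimal sketch of that category-count heuristic, assuming each hit has already been assigned to a bucket by clause strictness and carries a category label. The function names, bucket names, categories, and the `max_categories` threshold are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def categories_per_bucket(hits: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct categories per bucket from (bucket, category) pairs."""
    seen = defaultdict(set)
    for bucket, category in hits:
        seen[bucket].add(category)
    return {bucket: len(categories) for bucket, categories in seen.items()}


def unfocused_buckets(counts: Dict[str, int], max_categories: int) -> List[str]:
    """Buckets whose category diversity exceeds the (hypothetical) limit."""
    return [bucket for bucket, count in counts.items() if count > max_categories]


# Toy data: the strict clause stays within one category,
# while the weak clause scatters across many.
hits = [
    ("strict", "music"), ("strict", "music"),
    ("weak", "music"), ("weak", "garden"), ("weak", "toys"), ("weak", "tools"),
]

counts = categories_per_bucket(hits)
print(counts)                                       # {'strict': 1, 'weak': 4}
print(unfocused_buckets(counts, max_categories=2))  # ['weak']
```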

@mabdelhedi

Hello,
I am also facing this problem after removing the cutoff_frequency from my multi-match query. For example, when searching a "street" field for "jump street", I get some documents that only match the word "street" (a frequent word) ...

How can I really "cut off" the frequent term so that it is ignored by the query (so that I only get documents containing at least "jump"), without having to look for all frequent words and define them as stopwords?

@jimczi
Contributor Author

jimczi commented Oct 15, 2021

I am removing the blocker label for now. We have not yet decided whether to restore the functionality or provide a replacement.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@gmarouli gmarouli removed the v8.8.0 label Apr 26, 2023