Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow the `classic` (TF-IDF) similarity on 6.0 indices. #23208

Closed
jpountz opened this issue Feb 16, 2017 · 10 comments

Comments

Projects
None yet
6 participants
@jpountz
Copy link
Contributor

commented Feb 16, 2017

BM25 should generally perform better than TF-IDF. Moreover, Lucene is removing coords and query normalization in 7.0 (that Elasticsearch will be based on) so we should start deprecating the classic similarity.

@colings86

This comment has been minimized.

Copy link
Member

commented Mar 31, 2017

@jpountz I made this an adoptme as it seems like something we intend to do rather than something that needs further discussion

@Rezaak1024

This comment has been minimized.

Copy link

commented May 14, 2017

Hi, I'm new to the project and would like to get started on this issue if no one else is already working on it.

@colings86 colings86 added discuss and removed help wanted labels Jun 5, 2017

@jpountz

This comment has been minimized.

Copy link
Contributor Author

commented Jun 5, 2017

Adding the discuss label to figure out whether we should just disallow it on new indices, or also add it in a plugin for users who might really really need TF-IDF scoring.

@clintongormley

This comment has been minimized.

Copy link
Member

commented Jun 6, 2017

@jpountz can we really support it without query coordination? If not, then I'd opt for removing it. (Even if we can I'm leaning towards removing it)

@jpountz

This comment has been minimized.

Copy link
Contributor Author

commented Jun 7, 2017

The removal of query coordination might significantly decrease the quality of this similarity indeed. @eskibars Maybe you have some pespective on this, I think you suggested some users might rely specifically on TF-IDF in the last search&aggs meeting?

@eskibars

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2017

@eskibars Maybe you have some pespective on this, I think you suggested some users might rely specifically on TF-IDF in the last search&aggs meeting?

I'm certain some people rely on the full old/classic TF-IDF implementation, but in a few different ways that I'll tease apart.

  1. We've seen people in our forums mention they have regression tests that include ordering and my own history leads me to believe that while these users will be the minority, they are not terribly uncommon and they often have extensive UAT / and long UAT cycles, which is why they built the tests in the first place. For these users, dropping classic similarity is going to mean one of 2 things: simply avoiding the upgrade entirely and sticking around on 5.x for as long as possible or upgrading a test environment and then re-engaging their long UAT cycles, potentially failing them and having to go edit queries, etc as they go through these cycles.
  2. To a much lesser extent, I've anecdotally heard some users with a fixed catalog of data include the numeric score calculations of some specific queries in their regression tests.
  3. There are people that are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number times: sometimes where the user has done a lot of testing and shown that TF-IDF is better in one way or another for them and sometimes not. I also believe TF-IDF also has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name) so there may be some resistance purely due to how understood the two are relative to one another.

To be clear, I think that the vast majority of users do not fall in any of these categories and I've seen BM25 does perform better in the vast majority of UAT (especially in text search) and is built on sound principals avoiding keyword stuffing, etc. Given that Lucene is dropping coord, I'm not sure what we can do for users falling in categories 1 and 2 other than brace them for the impact ASAP. For users in the third category, one thing I was thinking was the possibility of pulling classic similarity out into a plugin that we could bootstrap for the community (and put it in the community's hands to support). No matter what, we also need to accept that for some portion of users, this change will delay (to change queries / go through UAT again) or entirely keep them from upgrading (if they decide they "need" to keep the old behavior).

@jpountz

This comment has been minimized.

Copy link
Contributor Author

commented Jun 15, 2017

I'm wondering whether the scripted similarity feature that we discussed a couple times could be a good workaround rather than creating a plugin. Reimplementing TF-IDF could actually be a good documentation example of it.

@ghost

This comment has been minimized.

Copy link

commented Jul 13, 2017

Hello I fall into this category and I have empirical testing to back it up. Please contact me if you have any questions. For my data/application for Elasticsearch TF-IDF produces a better MRR (mean reciprocal rank, a common IR score for search engines) than BM25. If TF-IDF is not retained then I will be stuck on version 5.

There are people that are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number times: sometimes where the user has done a lot of testing and shown that TF-IDF is better in one way or another for them and sometimes not. I also believe TF-IDF also has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name) so there may be some resistance purely due to how understood the two are relative to one another.

@jpountz

This comment has been minimized.

Copy link
Contributor Author

commented Aug 2, 2017

We will make this deprecation smoother by adding a scripted similarity: #25831.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Aug 16, 2017

Disallow the `classic` similarity on new indices.
The `classic` similarity used to rely on specific features like query
normalization and coord factors, which were specific to this similarity and
have been removed as the `bm25` similarity has now been the default similarity
for some time. As a consequence, using the `classic` similarity could lead
to depeptive scores, which is why we want to prevent users from using it on new
indices. `bm25` is generally considered a superior option.

Closes elastic#23208
@cbuescher

This comment has been minimized.

Copy link
Member

commented Mar 13, 2018

@jpountz I'm not entirely sure about the current state of this issue, is the "discuss" label still appropriate? What are the next steps if not?
cc @elastic/es-search-aggs

jpountz added a commit to jpountz/elasticsearch that referenced this issue Mar 21, 2018

Improve similarity integration.
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unkwown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes elastic#23208
Closes elastic#29035

jpountz added a commit that referenced this issue Apr 3, 2018

Improve similarity integration. (#29187)
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unkwown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes #23208
Closes #29035

jpountz added a commit to jpountz/elasticsearch that referenced this issue Apr 3, 2018

Improve similarity integration. (elastic#29187)
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unkwown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes elastic#23208
Closes elastic#29035

jpountz added a commit that referenced this issue Apr 4, 2018

Improve similarity integration. (#29187)
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unkwown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes #23208
Closes #29035

@jimczi jimczi removed this from Search & Aggs in Background tasks Jun 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.