Remove script access to term statistics #19462
You should also remove the docs: https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-advanced-scripting.html
Please could you change the title to something more meaningful, such as "Remove script access to term statistics"
The code changes LGTM
In scripts (at least in some of the languages), the terms dictionary and postings can be accessed with the special _index variable. This is for very advanced use cases that want to do their own scoring. The problem is that segment-level statistics must be recomputed for every document. Additionally, this is not friendly to terms index caching, as the order of looking up terms should be controlled by Lucene.
This change removes _index from scripts. Anyone using it can and should instead write a Similarity plugin, which is explicitly designed to allow the calculations needed for a relevance score.
closes elastic#19359
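To make the cost argument in the description concrete, here is a toy model in plain Java. It is not Lucene or Elasticsearch code: the class `SegmentStatsCostSketch`, its methods, and the placeholder value `42` are all hypothetical, and the sketch only counts how often a segment-level statistic gets computed under the two access patterns.

```java
// Toy model, not Lucene or Elasticsearch code: counts how often a
// segment-level statistic is computed under two access patterns.
class SegmentStatsCostSketch {
    private long computations = 0;

    // Stand-in for an expensive segment-level lookup, such as a term's
    // document frequency. The returned value is an arbitrary placeholder.
    private long computeDocFreq() {
        computations++;
        return 42L;
    }

    // The pattern described as the problem: the statistic is looked up
    // again for every document in the segment.
    long scorePerDocument(int docsInSegment) {
        long sum = 0;
        for (int doc = 0; doc < docsInSegment; doc++) {
            sum += computeDocFreq();
        }
        return sum;
    }

    // The alternative: look the statistic up once, then reuse it for
    // every document in the segment.
    long scoreOncePerSegment(int docsInSegment) {
        long docFreq = computeDocFreq();
        long sum = 0;
        for (int doc = 0; doc < docsInSegment; doc++) {
            sum += docFreq;
        }
        return sum;
    }

    long computations() {
        return computations;
    }
}
```

Both methods produce the same sum, but the first performs one lookup per document while the second performs a single lookup for the whole segment.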
4f8bf3d to 191cdaf
@clintongormley I removed those docs and updated the title as you suggested.
There's also a mention and link which you'll need to remove in this section: https://github.com/elastic/elasticsearch/blob/master/docs/reference/modules/scripting/fields.asciidoc#search-and-aggregation-scripts Would it be possible to add the appropriate deprecation logging in 2.4.0?
I'm not sure how to do that without creating potentially very large logs. We only know this is accessed when a script is being run, and it is called from the script. So, for example, if a script runs on a million docs, you would get a million deprecation messages?
Deprecation logging is off by default. But yes, I see what you mean. I wonder if we should be rate-limiting duplicate messages in the deprecation log infra itself.
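One way to picture suppressing duplicate messages in the logging layer is the sketch below. It is an assumption-laden illustration, not Elasticsearch's deprecation logger: the class `DedupDeprecationLog`, the `deprecated(key, message)` method, and the in-memory list standing in for a log appender are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Elasticsearch: emits the message for
// each distinct key at most once for the lifetime of this object.
class DedupDeprecationLog {
    private final Set<String> seenKeys = new HashSet<>();
    // Stand-in for a real log appender, so the behaviour can be inspected.
    private final List<String> emitted = new ArrayList<>();

    // Returns true if the message was emitted, false if it was
    // suppressed because the key had already been seen.
    synchronized boolean deprecated(String key, String message) {
        if (!seenKeys.add(key)) {
            return false;
        }
        emitted.add(message);
        return true;
    }

    // Returns a copy of everything emitted so far.
    synchronized List<String> emitted() {
        return new ArrayList<>(emitted);
    }
}
```

With this shape, a script that touches the deprecated feature on a million documents would produce one log entry instead of a million, at the price of an unbounded set of keys if the keys are not drawn from a small fixed vocabulary.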
Another way would be to do something like
in every method of LeafIndexLookup in order to have one message per segment, which would make the volume lower.
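The snippet from the comment above is not preserved on this page, so what follows is only a guess at the general idea: a per-instance flag checked at the top of every method, so that each per-segment object logs at most once. `SegmentLookupSketch`, its methods, the message text, and the `Consumer<String>` sink are all hypothetical and do not reproduce the real LeafIndexLookup API. The sketch also assumes each instance is used by a single thread.

```java
import java.util.function.Consumer;

// Hypothetical stand-in for a per-segment lookup object. The real
// LeafIndexLookup API is not reproduced here.
class SegmentLookupSketch {
    private final Consumer<String> deprecationSink;
    // Per-instance flag: one instance per segment means one message per segment.
    private boolean deprecationLogged = false;

    SegmentLookupSketch(Consumer<String> deprecationSink) {
        this.deprecationSink = deprecationSink;
    }

    // Called at the top of every public method. Not thread-safe by design;
    // assumes single-threaded use of each instance.
    private void logDeprecationOnce() {
        if (!deprecationLogged) {
            deprecationLogged = true;
            deprecationSink.accept("_index access from scripts is deprecated"); // illustrative text
        }
    }

    // Placeholder methods; the return values are meaningless.
    int docCount() {
        logDeprecationOnce();
        return 0;
    }

    long termFrequency(String term) {
        logDeprecationOnce();
        return 0L;
    }
}
```

The number of messages then scales with the number of segments a script touches rather than with the number of documents.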
I think we should rethink this PR given #19359 (comment)
I have seen scripts being used in our public community to retrieve terms' statistics and re-score the documents based on them (or sort the documents based on them). It is true that it is not used often, but I've seen it. Removing this possibility assumes the user will need to get a hold of Java and write code for the same thing that was possible in queries in a much simpler and more accessible way.
@rjernst is this PR still needed given Clint's earlier comment about rethinking it?
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?
@jpountz I've updated this PR now that index lookup is deprecated in 5.5. Can you take a look again?