Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove script access to term statistics #19462

Merged
merged 7 commits into from May 16, 2017

Conversation

rjernst
Copy link
Member

@rjernst rjernst commented Jul 15, 2016

In scripts (at least some of the languages), the terms dictionary and
postings can be access with the special _index variable. This is for
very advanced use cases which want to do their own scoring. The problem
is segment level statistics must be recomputed for every document.
Additionally, this is not friendly to the terms index caching as the
order of looking up terms should be controlled by lucene.

This change removes _index from scripts. Anyone using it can and should
instead write a Similarity plugin, which is explicitly designed to allow
doing the calculations needed for a relevance score.

closes #19359

@rjernst rjernst added >breaking :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache v5.0.0-alpha5 labels Jul 15, 2016
@clintongormley
Copy link

You should also remove the docs https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-advanced-scripting.html

Please could you change the title to something more meaningful, such as "Remove script access to term statistics"

@jpountz
Copy link
Contributor

jpountz commented Jul 18, 2016

The code changes LGTM

@rjernst rjernst changed the title Scripting: Removing _index access Remove script access to term statistics Jul 18, 2016
In scripts (at least some of the languages), the terms dictionary and
postings can be access with the special _index variable. This is for
very advanced use cases which want to do their own scoring. The problem
is segment level statistics must be recomputed for every document.
Additionally, this is not friendly to the terms index caching as the
order of looking up terms should be controlled by lucene.

This change removes _index from scripts. Anyone using it can and should
instead write a Similarity plugin, which is explicitly designed to allow
doing the calculations needed for a relevance score.

closes elastic#19359
@rjernst
Copy link
Member Author

rjernst commented Jul 18, 2016

@clintongormley I removed those docs and updated the title as you suggested.

@clintongormley
Copy link

There's also a mention and link which you'll need to remove in this section: https://github.com/elastic/elasticsearch/blob/master/docs/reference/modules/scripting/fields.asciidoc#search-and-aggregation-scripts

Would it be possible to add the appropriate deprecation logging in 2.4.0?

@rjernst
Copy link
Member Author

rjernst commented Jul 18, 2016

Would it be possible to add the appropriate deprecation logging in 2.4.0?

I'm not sure how to do that without creating potentially very large logs. We only know this is accessed when a script is being run, and it is called from the script. So eg if a script runs on a million docs you would get a million deprecation messages?

@clintongormley
Copy link

Deprecation logging is off by default. But yes, I see what you mean. I wonder if we should be rate-limiting duplicate messages in the deprecation log infra itself.

@jpountz
Copy link
Contributor

jpountz commented Jul 18, 2016

Another way would be to do something like

if (logged == false) {
  // log message
  logged = true;
}

in every method of LeafIndexLookup in order to have one message per segment, which would make the volume lower.

@clintongormley
Copy link

I think we should rethink this PR given #19359 (comment)

@astefan
Copy link
Contributor

astefan commented Aug 8, 2016

I have seen scripts being used for retrieving terms' statistics and re-scoring the documents based on them (or sorting the documents based on them) in our public community. It is true it is not often being used, but I've seen it. Removing this possibility assumes the user will need to get a hold of Java and write code for the same thing that was possible in queries in a much simpler and accessible way.

@dakrone
Copy link
Member

dakrone commented Sep 12, 2016

@rjernst is this PR still needed given Clint's earlier comment about rethinking it?

@rjernst
Copy link
Member Author

rjernst commented Sep 15, 2016

@dakrone I think this PR still makes sense, and I left a comment on #19359 explaining why.

@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@rjernst
Copy link
Member Author

rjernst commented May 16, 2017

@jpountz I've updated this PR now that index lookup is deprecated in 5.5. Can you take a look again?

@rjernst rjernst added the v6.0.0 label May 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache v6.0.0-alpha2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove ability to gather index statistics in scripts (aka IndexLookup)
6 participants