Add _terms_stats API #21886

Closed
shaie opened this issue Nov 30, 2016 · 19 comments
Labels
>feature :Search/Search Search-related issues that do not fall into other categories stalled

Comments

@shaie
Contributor

shaie commented Nov 30, 2016

Would be nice if ES exposed an API similar to _field_stats, to get individual term statistics.

One approach/workaround that I described here is to use the _termvectors API with an artificial document and request term_statistics. This however returns the statistics from one random shard, so you'd need to fire that to each of the index shards, using ?preference=_shards:{num}.
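A minimal sketch of that workaround in Python, assuming the 5.x-style `/{index}/{type}/_termvectors` endpoint. The helper names and the index, type, and field names are hypothetical, and no HTTP call is made: one function builds a request per shard, the other sums the `doc_freq` and `ttf` values found in the per-shard responses.

```python
import json
from collections import defaultdict


def build_shard_requests(index, doc_type, field, terms, num_shards):
    """Return one (url_path, json_body) pair per shard.

    The artificial document just contains the terms of interest, so that
    they appear in the returned term vector. Helper name is hypothetical.
    """
    body = json.dumps({
        "doc": {field: " ".join(terms)},   # artificial document
        "fields": [field],
        "term_statistics": True,           # ask for doc_freq and ttf
        "field_statistics": False,
        "positions": False,
        "offsets": False,
    }, sort_keys=True)
    return [
        ("/%s/%s/_termvectors?preference=_shards:%d" % (index, doc_type, shard), body)
        for shard in range(num_shards)
    ]


def sum_term_stats(responses, field):
    """Sum doc_freq and ttf per term across parsed per-shard responses."""
    totals = defaultdict(lambda: {"doc_freq": 0, "ttf": 0})
    for response in responses:
        terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            # Treat missing statistics as zero for that shard.
            totals[term]["doc_freq"] += stats.get("doc_freq", 0)
            totals[term]["ttf"] += stats.get("ttf", 0)
    return dict(totals)
```

Note that the artificial document goes through the field's analyzer, so the keys that come back are the analyzed terms, which may differ from the raw strings that were sent.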

A _term_stats API would essentially do that, only without the need to use an artificial document, termvectors etc. and would probably be much more efficient. The entire information is available in Lucene's TermsEnum already, so I hope this isn't a very complicated feature request.

@clintongormley

What exactly would the API look like? My concern here is in producing very large responses (esp when you have to combine results from different shards).

@shaie
Contributor Author

shaie commented Dec 3, 2016

Hey @clintongormley, sorry but somehow I didn't receive a notification about your response, so missed it.

I was thinking of an API very similar to _field_stats, that looks like /{index}/_term_stats?field=foo&terms=t1,t2,t3 (or have terms be of the form field:term,field:term or some other structure maybe JSON) and returns index/cluster-level stats of those terms, such as docFreq and totalTermFreq.

If it wasn't clear, I definitely don't think the API should return the stats of all terms of a field in the index, only select terms.

I don't see how this API can be more expensive than _field_stats which combines the results from many shards as well. You'd go to each shard once with all requested terms, get back the response and sum them all up, no?

We can also restrict the API to at most X terms (maybe 100? just throwing a number) so that the request/response doesn't become too heavy.
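To make the proposal concrete, here is a rough Python sketch of the coordinating logic, not an actual ES implementation. The function names, the `MAX_TERMS` constant, and the per-shard result shape (`{term: {"docFreq": ..., "totalTermFreq": ...}}`) are all assumptions for illustration: parse the `field`/`terms` parameters, reject oversized requests, and sum what each shard returns.

```python
from urllib.parse import parse_qs

MAX_TERMS = 100  # the "maybe 100" cap floated above; not a real ES setting


def parse_term_stats_query(query_string, max_terms=MAX_TERMS):
    """Parse 'field=foo&terms=t1,t2,t3' into (field, [terms])."""
    params = parse_qs(query_string)
    field = params["field"][0]
    terms = [t for t in params["terms"][0].split(",") if t]
    if len(terms) > max_terms:
        raise ValueError(
            "at most %d terms per request, got %d" % (max_terms, len(terms))
        )
    return field, terms


def merge_shard_stats(terms, shard_results):
    """Sum per-shard statistics; terms absent from a shard count as zero."""
    merged = {term: {"docFreq": 0, "totalTermFreq": 0} for term in terms}
    for result in shard_results:
        for term in terms:
            stats = result.get(term)
            if stats:
                merged[term]["docFreq"] += stats["docFreq"]
                merged[term]["totalTermFreq"] += stats["totalTermFreq"]
    return merged
```

Since each document lives in exactly one primary shard, summing `docFreq` across shards does not double count. Lucene's `docFreq` does not account for deleted documents, though, so the totals are approximate until segments are merged.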

@s1monw
Contributor

s1monw commented Dec 6, 2016

I have concerns too. This is really a low-level operation that should be executed locally, or it should be an implementation detail of another high-level feature. I don't see the usecase for such a primitive, and term level is certainly very, very expert. @shaie, what is the usecase for such a feature? If you want to use this for some kind of query postprocessing, then we could think about making the rescorer stuff pluggable. Another option where I can actually see this being useful is to extend our explain output somehow. When you search you can do &explain=true and you get Lucene-level explain output for each hit. Maybe there is a good way to extend this to also render the aggregated term statistics if they are present, or only those, e.g. explain_distributed_freqs; then the actual freqs used for the query come back and nobody needs a second round trip. I can see that being useful.

@shaie
Contributor Author

shaie commented Dec 7, 2016

The usecases involve indexing some data in an ES index/cluster, but accessing it from another system. In that case, being able to access the term statistics only from within the ES cluster is less useful. While your writeup about explain makes sense, I think your premise is that someone wants to access the term statistics to perform actions on the documents that are indexed within the same ES index, but this isn't the case.

So for example, consider that you're managing user profiles in an ES index. Each profile is indexed as one document and comprises a set of terms (and maybe even associated weights). Now consider another system which wants to answer a question like "how important are the terms t1 and t2 to users", where the importance might be measured by a combination of totalTermFreq and sumDocFreq which roughly tell how many users even care about the terms. It needs that in order to improve search quality for its data, which is indexed in a separate system (again, different data set, data source). It is not interested in a specific user's profile (that can be retrieved via _termvectors) but wants to use the user profiles information in order to apply better boosts to words that appear in queries that are sent to it.
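As a purely illustrative Python sketch of such a measure (the function name and the way the numbers are combined are arbitrary assumptions, not something ES or Lucene provides): given a term's document frequency, its total term frequency, and the number of profiles, one can derive the share of users who mention the term and how often they mention it on average.

```python
import math


def term_importance(doc_freq, total_term_freq, doc_count):
    """Derive simple importance signals for one term.

    doc_freq        -- number of profiles containing the term
    total_term_freq -- total occurrences of the term across all profiles
    doc_count       -- total number of profiles in the index
    """
    if doc_count <= 0 or doc_freq <= 0:
        return {"user_share": 0.0, "avg_freq": 0.0, "boost": 1.0}
    user_share = doc_freq / doc_count       # fraction of users with the term
    avg_freq = total_term_freq / doc_freq   # occurrences per interested user
    # Arbitrary combination, for illustration only.
    boost = 1.0 + user_share * math.log1p(avg_freq)
    return {"user_share": user_share, "avg_freq": avg_freq, "boost": boost}
```

For instance, a term with a docFreq of 250 and a totalTermFreq of 1000 in an index of 1000 profiles is mentioned by a quarter of the users, four times each on average.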

Another example is related to "learning". Think about indexing some data source (say Wikipedia) that you want to use in order to augment the search queries that are sent to your system, which may index Web news. Again, you're not interested in affecting the score of the Wikipedia results, but being able to access term statistics in the Wikipedia index can help you tune your search on Web news.

I can provide more examples for sure, but they all follow the same idea - you don't necessarily want to access the terms statistics in order to use them for the same index.

Out of curiosity, what's the usecase for the _field_stats API, and don't you think that it falls into the same category as _terms_stats?

And about expert use cases - I see it as a pro if ES addresses both regular and expert users' use cases. But of course I'm biased 😉. And I understand that this sort of feature can be developed as a plug-in, which needs access to low-level Lucene APIs; however, I think that having a REST API might help other (less expert) users use the data in order to do innovative things with ES.

I think I stated it somewhere, but if not - I am willing to do the work to add this API to ES (if you will agree to have it though)!

@clintongormley

> If it wasn't clear, I definitely don't think the API should return the stats of all terms of a field in the index, only select terms.

This was my main concern - it limits the impact of this API, even when pulling back term stats from all shards.

@clintongormley

That said, I don't understand the nitty gritty and so may be missing the reasons for @s1monw's concerns.

@shaie
Contributor Author

shaie commented Dec 7, 2016

> This was my main concern - it limits the impact of this API, even when pulling back term stats from all shards.

Indeed, it does not allow you to extract the statistics of all terms of a field, but _field_stats does not allow you to retrieve stats of all fields either, only select ones. I think this is OK because I don't see a case for fetching stats for all terms of a field. If you can obtain stats for select terms, and the field statistics, you have enough information at hand to tell something qualitative about the terms you're interested in, and how they relate to the rest of the terms in the field.

@s1monw
Contributor

s1monw commented Dec 15, 2016

> That said, I don't understand the nitty gritty and so may be missing the reasons for @s1monw's concerns

I think the biggest reason here is that it's very hard to remove any kind of API in a project like this, so we need to make the right decisions here. I am still not convinced that such an API is used by more than 1% of the userbase, and if that is the case I'd opt for a plugin. One thing we can do is to add a new misc plugin to elasticsearch core where we can add stuff like this. There we can define different BWC guarantees for the APIs and core remains lean. Ideally we could also move field_stats to this project.

@shaie
Contributor Author

shaie commented Dec 15, 2016

Thanks @s1monw.

> One thing we can do is to add a new misc plugin to elasticsearch core where we can add stuff like this

Having absolutely no experience with ES plugins, what does it mean? Can a plugin expose its own REST APIs? Would they then follow the same pattern as other ES APIs (e.g. /index/type/_term_stats?...)?

Will such a plugin be loaded by default with ES? I mean, does what you're proposing mostly concern BWC guarantees, or does it also mean that whoever wants to use this plugin will have to tell ES to load it?

@s1monw
Contributor

s1monw commented Dec 15, 2016

> Having absolutely no experience with ES plugins, what does it mean? Can a plugin expose its own REST APIs? Would they then follow the same pattern as other ES APIs (e.g. /index/type/_term_stats?...)?

yes it can do everything just like core here

> Will such a plugin be loaded by default with ES? I mean, does what you're proposing mostly concern BWC guarantees, or does it also mean that whoever wants to use this plugin will have to tell ES to load it?

correct, you have to run the plugin manager to install it. I don't even know if BWC is an issue here at this point

@shaie
Contributor Author

shaie commented Dec 15, 2016

OK, then what do you recommend? Are you in favor of adding such an API/plugin at all, or do you still need to think about it?

If you do support it, I wouldn't mind (as said above) taking a shot at it, but would appreciate a pointer on how to start developing plugins.

@s1monw
Contributor

s1monw commented Dec 15, 2016

I wonder what others think about such a misc plugin... @bleskes @jpountz @clintongormley

@clintongormley

> I think the biggest reason here is that it's very hard to remove any kind of API in a project like this, so we need to make the right decisions here. I am still not convinced that such an API is used by more than 1% of the userbase, and if that is the case I'd opt for a plugin.

I completely agree with the above - makes sense

> One thing we can do is to add a new misc plugin to elasticsearch core where we can add stuff like this. There we can define different BWC guarantees for the APIs and core remains lean. Ideally we could also move field_stats to this project.

Would the point of this misc plugin be to gain some insight into how many users are using the included features, so we can figure out which features should be moved to fully supported modules?

I'm not sure what a misc plugin gives you over having a plugin for a specific feature. Also, having a plugin per feature makes it easy to see how popular it is: how many people install the plugin?

Also note that Kibana depends heavily on the field stats API, so moving that to be an optional plugin would break Kibana completely.

@s1monw
Contributor

s1monw commented Dec 16, 2016

> I'm not sure what a misc plugin gives you over having a plugin for a specific feature. Also, having a plugin per feature makes it easy to see how popular it is: how many people install the plugin?

a plugin per feature is only useful IMO if the feature has a 3rd party dependency. Think of an analyzer etc. If we have 20 of these features we would have 20 plugins, what a mess.... I think we can have some kind of sandbox plugin that is marked experimental etc. where we can upgrade stuff to core if we think it's needed.

> Also note that Kibana depends heavily on the field stats API, so moving that to be an optional plugin would break Kibana completely.

really? :) If it wouldn't I'd have deleted it by now. I think we need to work towards fixing that eventually.

@jpountz
Contributor

jpountz commented Dec 20, 2016

I guess my biggest concern is about backward compatibility. For instance, we changed the way numerics are stored in 5.x, and the new data structure we use cannot give you things like doc freqs in constant time. It was hard already, and if we had had this API at that time, that would have been another break to deal with. So we should be totally clear about the fact that this feature may change in unexpected ways at any time. I think putting this kind of thing in a plugin helps convey this message.

> If it wouldn't I'd have deleted it by now. I think we need to work towards fixing that eventually.

+++++++

@s1monw
Contributor

s1monw commented Dec 21, 2016

@clintongormley I guess we can go with a plugin here. Let's make sure this only works for string/keyword fields and not for numerics, and get the semantics clear. @shaie do you wanna take a stab at it?

@shaie
Contributor Author

shaie commented Dec 22, 2016

I certainly would like to give it a try :). Can you point me to an example plugin, preferably one that adds REST APIs, so I can get a head start? Also, FYI that I'm going to be on some PTO in the coming weeks, so it might take me some time, but would definitely want to do it. @s1monw, any guidance will be appreciated!

@s1monw
Contributor

s1monw commented Dec 22, 2016

The closest I can think of is the reindex module, here is the plugin class for it. RestHandlers are registered here. If you start a plugin it should be under /plugins, not under /modules. In order to make the build pick it up you gotta edit this file too and add your plugin to it. Also, I'd just start by copying an existing one. It has rest-tests as well as unittests etc. and is minimal. I won't be around for a week or so, but feel free to ask questions or open a new PR to discuss.

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Term Vectors labels Feb 14, 2018
@talevy
Contributor

talevy commented Mar 26, 2018

This feature request is an interesting idea but since its opening we have not seen enough feedback that it is a feature we should pursue. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback on the proposal (including +1s).

@talevy talevy closed this as completed Mar 26, 2018

6 participants