Add _terms_stats API #21886
Comments
What exactly would the API look like? My concern here is in producing very large responses (esp when you have to combine results from different shards). |
Hey @clintongormley, sorry but somehow I didn't receive a notification about your response, so missed it. I was thinking of an API very similar to _field_stats, that looks like If it wasn't clear, I definitely don't think the API should return the stats of all terms of a field in the index, only select terms. I don't see how this API can be more expensive than _field_stats which combines the results from many shards as well. You'd go to each shard once with all requested terms, get back the response and sum them all up, no? We can also restrict the API to at most X terms (maybe 100? just throwing a number) so that the request/response doesn't become too heavy. |
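The "go to each shard once, then sum" idea from the comment above can be sketched roughly as follows. This is only an illustration, not a proposed implementation: the function name `merge_shard_term_stats`, the per-shard dictionary shape, and the `MAX_TERMS` cap of 100 (the number floated in the comment) are all assumptions made up for this example.

```python
MAX_TERMS = 100  # hypothetical cap, taken from the "maybe 100?" suggestion above


def merge_shard_term_stats(shard_responses, requested_terms, max_terms=MAX_TERMS):
    """Sum per-shard term statistics into index-wide totals.

    shard_responses: one dict per shard, mapping term -> {"doc_freq": int, "ttf": int}.
    This shape is an assumption for the sketch, not an actual ES response format.
    """
    if len(requested_terms) > max_terms:
        raise ValueError(f"at most {max_terms} terms may be requested")

    # Start every requested term at zero so terms missing from all shards still appear.
    totals = {term: {"doc_freq": 0, "ttf": 0} for term in requested_terms}

    for shard in shard_responses:
        for term, stats in shard.items():
            if term not in totals:
                continue  # ignore anything that was not asked for
            totals[term]["doc_freq"] += stats.get("doc_freq", 0)
            totals[term]["ttf"] += stats.get("ttf", 0)
    return totals
```

Because only the requested terms are tracked, the response size is bounded by the number of terms in the request rather than by the number of terms in the field.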
I have concerns too, this is really a low level operation that should be executed locally or it should be an implementation detail of another high-level feature. I don't see the usecase for such a primitive and term level is certainly very very expert. @shaie what is the usecase of such a feature? If you want to use this for some kind of query postprocessing then we can think about making the rescorer stuff pluggable? Another option where I can actually see this being useful is to extend our explain output somehow. When you search you can do |
The usecases involve indexing some data in an ES index/cluster, but accessing it from another system. In that case, being able to access the term statistics only from within the ES cluster is less useful. While your writeup about So for example, consider that you're managing user profiles in an ES index. Each profile is indexed as one document and comprises a set of terms (and maybe even associated weights). Now consider another system which wants to answer a question like "how important are the terms t1 and t2 to users", where the importance might be measured by a combination of Another example is related to "learning". Think about indexing some data source (say Wikipedia) and you want to use it in order to augment the search queries that are sent to your system, which may index Web news. Again, you're not interested in affecting the score of the Wikipedia results, but being able to access term statistics in the Wikipedia index can help you tune your search on Web news. I can provide more examples for sure, but they all follow the same idea - you don't necessarily want to access the terms statistics in order to use them for the same index. Out of curiosity, what's the usecase for the And about expert use cases - I see it as a pro if ES addresses both regular and expert users' use cases. But of course I'm biased 😉. And I understand that this sort of feature can be developed as a plug-in, which needs access to low-level Lucene API, however I think that having a REST API might help other (less expert) users use the data in order to do innovative things with ES. I think I stated it somewhere, but if not - I am willing to do the work to add this API to ES (if you will agree to have it though)! |
This was my main concern - it limits the impact of this API, even when pulling back term stats from all shards. |
That said, I don't understand the nitty gritty and so may be missing the reasons for @s1monw 's concerns |
Indeed, it does not allow you to extract the statistics of all terms of a field, but |
I think the biggest reason here is that it's very hard to remove any kind of API in a project like this so we need to make the right decisions here. I am still not convinced that such an API is used by more than 1% of the userbase and if that is the case I'd opt for the ask of a plugin. One thing we can do is to add a new |
Thanks @s1monw.
Having absolutely no experience with ES plugins, what does it mean? Can a plugin expose its own REST APIs? Would they then follow the same other ES APIs (e.g. Will such a plugin be loaded by default with ES? I mean, does what you're proposing mostly concern BWC guarantees, or does it also mean that whoever wants to use this plugin will have to tell ES to load it? |
yes it can do everything just like core here
correct, you have to run the plugin manager to install it. I don't even know if BWC is an issue here at this point |
OK then what do you recommend? Are you in favor at all of adding such an API/plugin, or do you still need to think about it? If you do support it, I wouldn't mind (as said above) taking a shot at it, but would appreciate a pointer on how to start developing plugins. |
I wonder what others think about such a misc plugin... @bleskes @jpountz @clintongormley |
I completely agree with the above - makes sense
Would the point of this misc plugin be to gain some insight into how many users are using the included features, so we can figure out which features should be moved to fully supported modules? I'm not sure what a Also note that Kibana depends heavily on the field stats API, so moving that to be an optional plugin would break Kibana completely. |
a plugin per feature is only useful IMO if the feature has a 3rd party dependency. Think of an analyzer etc. If we have 20 of these features we would have 20 plugins, what a mess.... I think we can have some kind of sandbox plugin that is marked experimental etc. where we can upgrade stuff to core if we think it's needed.
really? :) If it wouldn't I'd have deleted it by now. I think we need to work towards fixing that eventually. |
I guess my biggest concern is about backward compatibility. For instance we changed the way numerics are stored in 5.x, and the new data structure we use cannot give you things like doc freqs in constant time. It was hard already, and if we had had this API at that time, that would have been another break to deal with. So we should be totally clear about the fact that this feature may change in unexpected ways at any time. I think putting this kind of thing in a plugin helps convey this message.
+++++++ |
@clintongormley I guess we can go with a plugin here, let's make sure this only works for string/keyword fields and not for numerics and get the semantics clear. @shaie do you wanna take a stab at it? |
I certainly would like to give it a try :). Can you point me to an example plugin, preferably one that adds REST APIs, so I can get a head start? Also, FYI that I'm going to be on some PTO in the coming weeks, so it might take me some time, but would definitely want to do it. @s1monw, any guidance will be appreciated! |
The closest I can think of is the reindex module, here is the plugin class for it. RestHandlers are registered here. If you start a plugin it should be under |
This feature request is an interesting idea but since its opening we have not seen enough feedback that it is a feature we should pursue. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback on the proposal (including +1s). |
Would be nice if ES exposed an API similar to _field_stats, to get individual term statistics.

One approach/workaround that I described here is to use the _termvectors API with an artificial document and request term_statistics. This however returns the statistics from one random shard, so you'd need to fire that to each of the index shards, using ?preference=_shards:{num}.

A _term_stats API would essentially do that, only without the need to use an artificial document, termvectors etc. and would probably be much more efficient. The entire information is available in Lucene's TermsEnum already, so I hope this isn't a very complicated feature request.
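The workaround described above could be driven from a client roughly as sketched below. The helper names `build_shard_requests` and `extract_term_stats` are hypothetical, and several details are assumptions: the request path (some ES versions also require a document type in the URL), the number of shards being known up front, and the field's analyzer splitting the artificial document on whitespace so that each requested term survives as its own token. No HTTP call is made here; the sketch only builds the requests and reads the `term_vectors` part of a response.

```python
def build_shard_requests(index, field, terms, num_shards):
    """Return one (path, body) pair per shard for a _termvectors call
    on an artificial document containing the requested terms."""
    body = {
        # Artificial document: assumes the analyzer keeps each term as one token.
        "doc": {field: " ".join(terms)},
        "term_statistics": True,
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }
    # Path shape is an assumption; adjust to the ES version in use.
    return [
        (f"/{index}/_termvectors?preference=_shards:{shard}", body)
        for shard in range(num_shards)
    ]


def extract_term_stats(response, field):
    """Pull doc_freq and ttf for each term out of one shard's response."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {
        term: {"doc_freq": stats.get("doc_freq", 0), "ttf": stats.get("ttf", 0)}
        for term, stats in terms.items()
    }
```

Each shard's extracted statistics would then be summed on the client, which is the extra round-tripping a dedicated _term_stats API would avoid.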