Term Count API #640

Closed
tallpsmith opened this Issue Jan 20, 2011 · 1 comment

@tallpsmith

I couldn't find this in the docs or in the issue tracker, but I did come across this discussion through Google:

http://elasticsearch-users.115913.n3.nabble.com/Terms-API-for-Spellchecker-td1691838.html

It appears others would like a Term Count API as well (it apparently used to be in ES, if I read that correctly).

I understand that sharding makes this less simple than it might seem: in a pathological case where one shard holds a lot of terms and another does not, it's not easy to get an accurate term count without fetching the distinct term list from each shard and de-duplicating across them.

A simpler method may just be to expose a result from each shard, something like:

{
    "shards": {
        "shard1": 5,
        "shard2": 12,
        "shard37": 450
    },
    "range": {
        "min": 450,
        "max": 467
    }
}

The range follows from two bounds. The absolute minimum number of distinct terms has to be the maximum count from any individual shard (the case where shard37 holds all the unique terms and the other shards just hold a subset). The absolute maximum number of distinct terms can only be the sum of the shard counts (the pathological case where each shard stores terms no other shard has).
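To make the arithmetic concrete, here is a minimal Python sketch of how those two bounds could be derived from per-shard counts. The function name `distinct_term_bounds` and the dictionary layout are made up for illustration; nothing here is an existing ES API.

```python
def distinct_term_bounds(shard_counts):
    """Return (minimum, maximum) bounds on the distinct term count.

    shard_counts is a hypothetical mapping of shard name to the number
    of distinct terms that shard reports on its own.
    """
    counts = list(shard_counts.values())
    if not counts:
        return (0, 0)
    # Lower bound: the largest shard already contains every term.
    # Upper bound: no term is shared between any two shards.
    return (max(counts), sum(counts))


shard_counts = {"shard1": 5, "shard2": 12, "shard37": 450}
low, high = distinct_term_bounds(shard_counts)
print({"shards": shard_counts, "range": {"min": low, "max": high}})
```

With the counts from the example response, this gives a minimum of 450 and a maximum of 467.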

This would be very fast to compute and still useful, but may not satisfy all cases. The only alternative is to get a unique term stream from each shard, merge them into a distinct list, and count it. For very large numbers of terms that could prove a memory hog.
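As a rough sketch of that alternative, the merge does not necessarily need to hold the whole distinct list in memory. Assuming each shard can hand back its terms already sorted (an assumption for this example, as is the function name `count_distinct_terms`), a streaming merge only needs to remember the previous term:

```python
import heapq


def count_distinct_terms(shard_term_streams):
    """Count distinct terms across shards without building the full list.

    Assumes every stream yields its terms in sorted order; the streams
    stand in for whatever each shard would return.
    """
    previous = object()  # sentinel that never equals a real term
    count = 0
    # heapq.merge lazily interleaves the sorted inputs, keeping only
    # one pending item per stream in memory.
    for term in heapq.merge(*shard_term_streams):
        if term != previous:
            count += 1
            previous = term
    return count


shards = [
    iter(["apple", "banana"]),
    iter(["banana", "cherry"]),
    iter(["apple", "date"]),
]
print(count_distinct_terms(shards))
```

Memory use here grows with the number of shards rather than the number of terms, though every term still has to be shipped to the node doing the merge.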

If I knew where to start, I'd have a crack at this. With some pointers in the right direction I can start to attempt it.

@jpountz
Contributor
jpountz commented Mar 13, 2014

#5426 has just been resolved and allows computing unique counts.

@jpountz jpountz closed this Mar 13, 2014