Added an option to show the upper bound of the error for the terms aggregation #6778
...he terms aggregation.
This is only applicable when the order is set to _count. The upper bound of the error in a term's doc count is the sum of the doc counts of the last term returned by each shard that did not return that term. The implementation calculates this by summing the doc count of the last term on each shard for which the term IS returned and then subtracting this value from the sum of the doc counts of the last term from ALL shards.
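For readers following the arithmetic, here is a minimal Python sketch of the calculation described above. It is not the code from this PR: the function name `terms_with_error`, the input layout (one list of `(term, doc_count)` pairs per shard, sorted by descending count and already truncated to the shard size), and the choice to report the sum of all last-term doc counts as the aggregation-wide bound are assumptions made for illustration.

```python
from collections import defaultdict


def terms_with_error(shard_results):
    """Merge per-shard top terms and bound the doc count error (illustrative only).

    shard_results: one list per shard of (term, doc_count) pairs, sorted by
    doc_count descending, with each term appearing at most once per shard.
    Returns ({term: (doc_count, error_upper_bound)}, overall_bound).
    """
    # Doc count of the last (smallest) term each shard returned.
    last_counts = [shard[-1][1] if shard else 0 for shard in shard_results]
    total_last = sum(last_counts)

    doc_counts = defaultdict(int)
    returned_last = defaultdict(int)
    for shard, last in zip(shard_results, last_counts):
        for term, count in shard:
            doc_counts[term] += count
            # Accumulate the last-term count of shards that DID return the term.
            returned_last[term] += last

    # Subtracting from the total leaves the last-term counts of the shards
    # that did NOT return the term, i.e. the most docs that could be missing.
    per_term = {
        term: (doc_counts[term], total_last - returned_last[term])
        for term in doc_counts
    }
    # Assumption: the aggregation-wide bound is the sum over all shards.
    return per_term, total_last


# Hypothetical data: three shards, each returning its top two terms.
shards = [
    [("a", 10), ("b", 5)],
    [("a", 8), ("c", 4)],
    [("b", 7), ("a", 3)],
]
per_term, overall = terms_with_error(shards)
print(per_term)  # {'a': (21, 0), 'b': (12, 4), 'c': (4, 8)}
print(overall)   # 12
```

In this made-up data, term `a` was returned by every shard, so its error bound is 0, while `c` was returned by only one shard and could be missing up to 5 + 3 = 8 documents from the other two.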
I just played with it, and I think this is an interesting feature, both to raise awareness about the accuracy issues of the terms aggregation and as a way to test the impact of the shard_size parameter. The per-term error is interesting, but I think the global error that you added is also interesting because it gives information about terms that didn't make it into the top terms.
To move forward, I think it would be nice to have it on all sort orders (potentially by using a special value, e.g. -1, when the maximum error cannot be estimated or would be so large that it would not really be useful).
I largely agree, although during my testing of this feature I have had quite a few situations where the error for the whole aggregation was quite big relative to the doc count of the last returned term (e.g. an error of 3600 with a doc count of 5400 for the last returned term) but the error on all of the terms was 0. This seems confusing for a user? Although maybe this just highlights the importance of clearly explaining how the error is calculated and what it means?
Agree regarding your suggestions for moving forward and the issue around the default shard size.