Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slightly more accurate terms sorting on sub-aggs #72684

Closed
nik9000 opened this issue May 4, 2021 · 2 comments · Fixed by #79205
Closed

Slightly more accurate terms sorting on sub-aggs #72684

nik9000 opened this issue May 4, 2021 · 2 comments · Fixed by #79205
Assignees
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@nik9000
Copy link
Member

nik9000 commented May 4, 2021

Right now we default the shard_size to 1.5*size+10 in terms in an effort to keep the doc count errors lower. When you sort the terms by something other than doc_count descending the error is unbounded. But larger shard_size will lead to more accurate results because we ship more data back to the coordinating node. Maybe we should bump the shard_size if you are sorting on anything other than doc_count descending to get you more accurate results by default. Sure, it'll be slower, but it's the price you pay for a little more accuracy. And the performance when sorting by the default won't change.

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@nik9000
Copy link
Member Author

nik9000 commented Oct 13, 2021

A few of us go to talking about this and figured out we don't really know how often we get into this state. We document that this sort of thing can be an error. We're not super clear about the side effects here. I think we use kind of dense language. I think we can improve that.

Anyway! It'd be useful to collect a counter of how often this happens. I'm not sure it's worth the effort right now. We're kind of swamped and we're not sure that we'd really do with the counter.

We don't have a good way to fix this. The errors are still unbounded. Even if we bump the shard_size. It'd help. Some. But we don't know how much and it'd absolutely come with performance changes. And, even better, you can do this yourself. Just set the shard_size higher.

So! I'm going to convert this into a docs issue and just update those docs and leave the rest as it.

@nik9000 nik9000 self-assigned this Oct 13, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Oct 14, 2021
The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it.

Closes elastic#72684
elasticsearchmachine pushed a commit that referenced this issue Nov 1, 2021
The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes #72684
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Nov 1, 2021
The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes elastic#72684
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Nov 2, 2021
The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes elastic#72684
nik9000 added a commit that referenced this issue Nov 2, 2021
The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes #72684
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Nov 2, 2021
…elastic#80223)

The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes elastic#72684
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Nov 2, 2021
…elastic#80223)

The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes elastic#72684
nik9000 added a commit that referenced this issue Nov 2, 2021
… (#80227)

The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes #72684
nik9000 added a commit that referenced this issue Nov 2, 2021
… (#80226)

The `terms` agg picks the top `size` terms in a single scatter/gather
pass across all the shards. For the default `order` and if you `order`
by `_key` this works quite well. Some errors creep in, but it's fairly
easy to point to them and understand them. But ordering by doc count
ascending is like inviting the error vampire into your agg. It's super
easy to get inaccurate results. This updates the docs to be more stark
about it. Closes #72684
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants