Configurable shard_size default for term aggregations #84744

Open
jade-lucas opened this issue Mar 8, 2022 · 2 comments
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments


jade-lucas commented Mar 8, 2022

Description

Request:
The ability to set the default shard_size for the terms aggregation in index settings and/or in Kibana's advanced settings.

Problem Statement:
In our environment, we have user groups that prefer to use Lens to "slice and dice" their data. One common theme we are starting to see is that when these users use the terms aggregation, they often point out data discrepancies in averages, medians, and similar metrics. When these discrepancies are brought to our engineers, we lay out the reasons why, as described in the link below. We often direct the end user to an aggregation-based visualization in Kibana and provide a recommended shard_size for the input JSON section, which resolves the discrepancy almost all of the time. However, our user groups regularly tell us that they don't want to set shard_size every time they create a visualization: they often forget to specify it, they don't really understand what it does and misuse it, and some of them prefer to use Lens, which has no shard_size support.
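For illustration, this is the kind of explicit shard_size we end up recommending in the input JSON section of an aggregation-based visualization (the field name and values here are examples only, not a general recommendation):

{
  "aggs": {
    "top_hosts": {
      "terms": {
        "field": "host.name",
        "size": 10,
        "shard_size": 500
      }
    }
  }
}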

Our developers are responsible for defining index/component templates. It would be ideal if they could define a default shard_size as an index setting in an index/component template. If not, perhaps the advanced settings section in Kibana would suffice? I think that allowing advanced users (developers/engineers/admins) to optionally configure the default shard_size would result in fewer reported data discrepancies, less triage time for the technical teams, and a better experience for all.

Proposed template setting the default shard_size for terms aggregations:

{
    "template": {
      "settings": {
        "index": {
          "lifecycle": {
            "name": "my_ilm_policy"
          },
          "refresh_interval": "15s",
          "number_of_shards": "3",
          "number_of_replicas": "1",
          **"term_default_shard_size": "$size * 2.5 + 15"**
        }
      }
    },
    "_meta": {
      "description": "My Description"
    }
  }
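To make the intent concrete: with the example expression above, a visualization requesting size: 10 would resolve to a shard_size of 10 * 2.5 + 15 = 40, whereas the built-in default documented in the link below is size * 1.5 + 10, i.e. 25 for the same request.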

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-shard-size

@jade-lucas jade-lucas added >enhancement needs:triage Requires assignment of a team area label labels Mar 8, 2022
@DJRickyB DJRickyB added :Analytics/Aggregations Aggregations and removed needs:triage Requires assignment of a team area label labels Mar 8, 2022
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 8, 2022
elasticmachine (Collaborator) commented

Pinging @elastic/es-analytics-geo (Team:Analytics)

wchaparro (Contributor) commented

Hey there @jade-lucas,

Thanks for your request and the detailed description of your use case. Currently, there is no straightforward mechanism for calculating shard size. It's more complicated than performing a simple calculation, and we'd like to ensure that we do this right and in a general manner. The general solution is to calculate the aggregation more accurately, especially for things like the terms aggregation on rare terms, even if that means taking more time to do so. This also means we would determine the right shard_size for you to increase accuracy. We are considering this for our longer-term roadmap.

Doing the simple calculation based on the number of shards is something that can be done now. We are keeping this issue open and linking it to the related meta issue.
