Observability dependencies view broken for >= 90 days of historical data #178491

Closed
cachedout opened this issue Mar 12, 2024 · 7 comments · Fixed by #182884
Labels: apm:dependencies-ui, bug, Team:obs-ux-infra_services

Comments

cachedout (Contributor) commented Mar 12, 2024

Kibana version: Serverless build 03/12/24
Elasticsearch version: Serverless build 03/12/24
Server OS version: Serverless build 03/12/24
Browser version: N/A
Browser OS version: N/A
Original install method (e.g. download page, yum, from source, etc.): Serverless build 03/12/24
Describe the bug:
When using the Observability test cluster for Serverless QA and selecting 90 days of historical data in the Dependencies view, an error about too many buckets is displayed.

Steps to reproduce:

  1. Open the QA o11y test cluster
  2. Go to Applications -> Dependencies
  3. Select 90 days of historical data

Expected behavior:
No error
Screenshots (if relevant):
(Screenshot: "too many buckets" error, 2024-03-12)

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

cachedout added the bug and Team:obs-ux-infra_services labels on Mar 12, 2024
elasticmachine (Contributor) commented:

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

kpatticha (Contributor) commented:

Related ticket: #161239

smith added the needs-refinement label on Mar 22, 2024
neptunian self-assigned this on Apr 29, 2024
neptunian (Contributor) commented Apr 30, 2024

In #161239, we changed the composite size to 1500 with no pagination. However, over a wide enough time range with 1500 unique top-level buckets (service name, dependency name), it is still easy to exceed the default Elasticsearch limit of 65,536 buckets. In the query below, the histogram interval is daily (86400s) for a roughly 3-month time range: 1500 (service/dependency pairs) × 90 (days) = 135,000 buckets, not counting up to 3 extra buckets per day from the event.outcome terms sub-aggregation.
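
To make the arithmetic concrete, here is a rough bucket-count estimate. This is purely illustrative; the helper and its name are hypothetical, not Kibana code:

```ts
// Rough estimate: one date_histogram bucket per day per composite bucket,
// plus up to 3 event.outcome sub-buckets per histogram bucket.
function estimateBuckets(pairs: number, days: number, outcomesPerDay = 3): number {
  return pairs * days * (1 + outcomesPerDay);
}

const DEFAULT_MAX_BUCKETS = 65_536; // Elasticsearch's search.max_buckets default

console.log(estimateBuckets(1500, 90, 0)); // 135,000, already over the limit
console.log(estimateBuckets(1500, 90));    // 540,000 with event.outcome buckets
```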

Date histogram creating buckets per day over a 3-month time range:

        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "86400s",
            "extended_bounds": {
              "min": 1706629793149,
              "max": 1714488593149
            }
          },
Full query:

```json
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "metric"
            ]
          }
        },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                }
              }
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1706629793149,
                    "lte": 1714488593149,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "serviceName": {
              "terms": {
                "field": "service.name"
              }
            }
          },
          {
            "dependencyName": {
              "terms": {
                "field": "span.destination.service.resource"
              }
            }
          }
        ]
      },
      "aggs": {
        "sample": {
          "top_metrics": {
            "size": 1,
            "metrics": [
              {
                "field": "service.environment"
              },
              {
                "field": "agent.name"
              },
              {
                "field": "span.type"
              },
              {
                "field": "span.subtype"
              }
            ],
            "sort": {
              "@timestamp": "desc"
            }
          }
        },
        "total_latency_sum": {
          "sum": {
            "field": "span.destination.service.response_time.sum.us"
          }
        },
        "total_latency_count": {
          "sum": {
            "field": "span.destination.service.response_time.count"
          }
        },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "86400s",
            "extended_bounds": {
              "min": 1706629793149,
              "max": 1714488593149
            }
          },
          "aggs": {
            "latency_sum": {
              "sum": {
                "field": "span.destination.service.response_time.sum.us"
              }
            },
            "count": {
              "sum": {
                "field": "span.destination.service.response_time.count"
              }
            },
            "event.outcome": {
              "terms": {
                "field": "event.outcome"
              },
              "aggs": {
                "count": {
                  "sum": {
                    "field": "span.destination.service.response_time.count"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

Here are some options:

  • **Smarter time intervals for the date histogram.** If we widen the intervals so that, for example, 3 months produces 12 buckets (1 per week) instead of 90 (1 per day), there is less resolution in the charts (perhaps acceptable given how small these charts are), and we should be able to avoid the "too many buckets" exception. A good short-term solution if we can accept the tradeoff (see the sketch after this list).

  • **Separate histogram timeseries buckets from services.** @crespocarlos mentioned this in #161239 ([APM] Dependencies call can create too many buckets): one request to get the services, and a separate request to get timeseries data only for the visible services. Better real and perceived performance, since the list of services appears quickly, which may be all the user needs. This is how the Services Inventory works and is probably the best long-term solution. Note: this API is also used in the Services Overview, so we would need to check how it would be affected.

  • **Smaller composite size** to make the "too many buckets" exception less likely. Instead of 1500, use 500 and paginate 3 times to get 1500 results. The tradeoff is a slower query, since multiple requests are needed. A workable short-term solution, but I think we are less likely to accept a slower query.
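
A minimal sketch of the first option, assuming we pick the interval from a fixed ladder to keep the estimated bucket count under a budget. The ladder, function name, and headroom factor are all hypothetical, not the actual Kibana implementation:

```ts
// Hypothetical: choose the smallest "nice" fixed_interval that keeps the
// estimated bucket count under a budget, leaving headroom for the
// event.outcome sub-aggregation.
const INTERVAL_LADDER_S = [3_600, 21_600, 86_400, 604_800, 2_592_000]; // 1h .. 30d

function pickFixedInterval(
  durationS: number,
  compositeSize: number,      // e.g. 1500 service/dependency pairs
  bucketBudget = 65_536 / 2   // half the default limit, as headroom
): string {
  for (const interval of INTERVAL_LADDER_S) {
    const bucketsPerSeries = Math.ceil(durationS / interval);
    if (bucketsPerSeries * compositeSize <= bucketBudget) {
      return `${interval}s`;
    }
  }
  return `${INTERVAL_LADDER_S[INTERVAL_LADDER_S.length - 1]}s`;
}

// 90 days with 1500 pairs: daily would mean 135,000 buckets, so the
// function widens to weekly: 13 buckets x 1500 pairs = 19,500.
console.log(pickFixedInterval(90 * 86_400, 1500)); // "604800s"
```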

neptunian removed their assignment on May 1, 2024
neptunian (Contributor) commented:

Talked with @smith; we're going to go with the first option, larger time intervals, which means fewer buckets.

neptunian (Contributor) commented May 8, 2024

@chrisdistasio @paulb-elastic

There's a PR open here: #182884. This fix does not cover very large time ranges, e.g. 4+ years with the maximum number of dependencies (1500). My thought is that there should be a balance between how many buckets we try to stay under for any time range and letting the user choose to increase their bucket limit; in this case we can advise the user to increase their default max buckets. If we feel we should always stay under the max bucket limit, even in a scenario of several years, I can do that: currently the largest time interval the algorithm will choose is 30 days, which is still too small for something like 4 years, and we would need to switch to something like 3 months. If we want to do that, I'd prefer a separate PR, as it requires changes to a function used all over the APM UI and more in-depth testing. The better alternative would be to implement the second option, separating histogram timeseries buckets from services.

neptunian added a commit that referenced this issue on May 8, 2024:

…n_stats query (#182884)

Fixes #178491

## Summary
The user receives a `too_many_buckets` exception when querying for 90 days' worth of data, and for many other longer time ranges. This is because the date histogram within each service uses time intervals that are too small.

## Solution

Lowering `numBuckets` causes the time intervals to increase, because the algorithm divides the duration the user selects by this number (duration / numBuckets). The larger the time range, [the more likely it is to choose a larger interval](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/common/utils/get_bucket_size/calculate_auto.js#L11), resulting in fewer buckets per date histogram.

The exception can still be thrown for time ranges the algorithm doesn't handle well; e.g. selecting 4 years or more will cause the error if a user has around the max number of dependencies (1500). This is because our [maximum time interval is 30 days](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/common/utils/get_bucket_size/calculate_auto.js#L26), and that interval becomes too small relative to such a large time range (see the sketch below). In this case we can recommend increasing the max bucket limit in Elasticsearch. There needs to be a balance between how hard we try to stay under the default bucket limit and letting the user change that limit to get more data.
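
A simplified sketch of the rounding behavior described above. The interval ladder and `pickNearest` only approximate what `calculateAuto.near` does; this is not the actual implementation:

```ts
// Approximate calculateAuto.near: round duration / numBuckets to the
// closest "nice" interval, with the ladder topping out at 30 days.
const NICE_INTERVALS_MS = [
  60_000, 300_000, 1_800_000, 3_600_000, 21_600_000,
  86_400_000, 604_800_000, 2_592_000_000, // 1m .. 30d
];

const pickNearest = (targetMs: number): number =>
  NICE_INTERVALS_MS.reduce((best, i) =>
    Math.abs(i - targetMs) < Math.abs(best - targetMs) ? i : best
  );

const DAY = 86_400_000;

// 90-day range: numBuckets = 8 targets ~11.25 days, which rounds to 7 days,
// so each series produces ~13 buckets instead of 90.
console.log(pickNearest((90 * DAY) / 8) / DAY); // 7

// 4-year range: the ladder caps at 30 days, giving ~49 buckets per series;
// 49 x 1500 dependencies = ~73,500, still over the 65,536 default limit.
console.log(pickNearest((1461 * DAY) / 8) / DAY); // 30
```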

Scenarios of duration and numBuckets and the resulting number of buckets with the max of 1500 dependencies:

(Screenshot: https://github.com/elastic/kibana/assets/1676003/ab246534-7358-4372-bbce-09768eb4c341)


## Changes
- lower `numBuckets` to 8 when calling `calculateAuto.near`
- add unit tests to `calculateAuto.near` and `getBucketSize`

## Testing
1. Change the [many_dependencies.ts](https://github.com/elastic/kibana/blob/main/packages/kbn-apm-synthtrace/src/scenarios/many_dependencies.ts#L18-L19) synthtrace scenario to generate 1500 dependencies by changing these lines locally:
```ts
const NUMBER_OF_DEPENDENCIES_PER_SERVICE = 15;
const NUMBER_OF_SERVICES = 100;
```
2. Run `node scripts/synthtrace many_dependencies.ts --live --clean` locally
3. Run a local Kibana instance and navigate to the APM dependencies inventory: http://localhost:5601/app/apm/dependencies/inventory
4. Try various date ranges
paulb-elastic (Contributor) commented:

Thanks @neptunian, that seems a good and reasonable approach (@chrisdistasio do you see a need for such long time periods?)

@neptunian if the user does select a 4+ year range, what's the user experience? Do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like "please select a shorter time period", does that also fall into the bigger piece of work you mentioned?

neptunian (Contributor) commented May 9, 2024

> if the user does select a 4+ year range, what's the user experience, do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like please select a shorter time period, does that also fall into the bigger piece of work you mentioned?

Yes, they will still get the error, with a "failed to fetch" message in the table. With the "separate histogram timeseries buckets from services" approach I mentioned, they would be unlikely to get the error, because we'd only fetch timeseries data for the services they are looking at. A significant part of the problem is fetching timeseries data for ALL of their services, even though they can't view them all at once anyway (the table defaults to 25 items per page and can be set lower).

I think the current error telling them to adjust their settings to allow more buckets is helpful and we should keep it, but I understand they don't know exactly why it happens or what they can do to remedy it other than changing their bucket limit, so adding that kind of messaging could help: "There is too much data being returned. Adjust your cluster bucket size (same as the current messaging about adjusting bucket size) or try narrowing your time range." This message comes from Elasticsearch, so we'd have to parse it and append the extra suggestion to narrow the time range. It would show up for every ES query in APM that hits the exception, and may not be helpful in contexts where the time range is not a significant contributor to the bucket count.
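
A hypothetical sketch of that messaging idea. The error shape and helper name below are assumptions for illustration, not Kibana APIs:

```ts
// Detect Elasticsearch's too_many_buckets exception and append a hint to
// narrow the time range. Purely illustrative; real Kibana error handling
// and the exact error shape differ.
interface EsErrorLike {
  error?: { caused_by?: { type?: string; reason?: string } };
}

function withTimeRangeHint(err: EsErrorLike, message: string): string {
  if (err.error?.caused_by?.type === 'too_many_buckets_exception') {
    return `${message} Try narrowing your time range, or increase search.max_buckets in your cluster settings.`;
  }
  return message;
}
```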
