Observability dependencies view broken for >= 90 days of historical data #178491

Closed
cachedout opened this issue Mar 12, 2024 · 7 comments · Fixed by #182884
Labels: apm:dependencies-ui, bug, Team:obs-ux-infra_services

Comments

cachedout (Contributor) commented Mar 12, 2024

Kibana version: Serverless build 03/12/24
Elasticsearch version: Serverless build 03/12/24
Server OS version: Serverless build 03/12/24
Browser version: N/A
Browser OS version: N/A
Original install method (e.g. download page, yum, from source, etc.): Serverless build 03/12/24
Describe the bug:
When using the Observability test cluster for Serverless QA and selecting 90 days of historical data in the Dependencies view, an error about too many buckets is displayed.

Steps to reproduce:

  1. Open the QA o11y test cluster
  2. Go to Applications -> Dependencies
  3. Select 90 days of historical data

Expected behavior:
No error
Screenshots (if relevant):
(Screenshot: "too many buckets" error, 2024-03-12)

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

cachedout added the bug and Team:obs-ux-infra_services labels on Mar 12, 2024
elasticmachine (Contributor) commented:

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

kpatticha (Contributor) commented:

Related ticket: #161239

smith added the needs-refinement label on Mar 22, 2024
neptunian self-assigned this on Apr 29, 2024
neptunian (Contributor) commented Apr 30, 2024

In #161239, we changed the composite size to 1500 with no pagination. However, over a wide enough time range with 1500 unique top-level buckets (service name, dependency name), it is still easy to exceed the default Elasticsearch limit of 65,536 buckets. In the query below, the histogram interval is daily (86400s) for a roughly 3-month time range: 1500 (service/dependency pairs) × 90 (days) = 135,000 buckets, not counting up to 3 extra buckets per day from the event.outcome terms sub-aggregation.
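
To make the arithmetic concrete, here is a rough bucket-count estimate. This is purely illustrative; the helper and its name are hypothetical, not Kibana code:

```ts
// Rough estimate: one date_histogram bucket per day per composite bucket,
// plus up to 3 event.outcome sub-buckets per histogram bucket.
function estimateBuckets(pairs: number, days: number, outcomesPerDay = 3): number {
  return pairs * days * (1 + outcomesPerDay);
}

const DEFAULT_MAX_BUCKETS = 65_536; // Elasticsearch's search.max_buckets default

console.log(estimateBuckets(1500, 90, 0)); // 135,000, already over the limit
console.log(estimateBuckets(1500, 90));    // 540,000 with event.outcome buckets
```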

Date histogram creating buckets per day over a 3-month time range:

        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "86400s",
            "extended_bounds": {
              "min": 1706629793149,
              "max": 1714488593149
            }
          },
Full query:

```json
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "metric"
            ]
          }
        },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                }
              }
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1706629793149,
                    "lte": 1714488593149,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "serviceName": {
              "terms": {
                "field": "service.name"
              }
            }
          },
          {
            "dependencyName": {
              "terms": {
                "field": "span.destination.service.resource"
              }
            }
          }
        ]
      },
      "aggs": {
        "sample": {
          "top_metrics": {
            "size": 1,
            "metrics": [
              {
                "field": "service.environment"
              },
              {
                "field": "agent.name"
              },
              {
                "field": "span.type"
              },
              {
                "field": "span.subtype"
              }
            ],
            "sort": {
              "@timestamp": "desc"
            }
          }
        },
        "total_latency_sum": {
          "sum": {
            "field": "span.destination.service.response_time.sum.us"
          }
        },
        "total_latency_count": {
          "sum": {
            "field": "span.destination.service.response_time.count"
          }
        },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "86400s",
            "extended_bounds": {
              "min": 1706629793149,
              "max": 1714488593149
            }
          },
          "aggs": {
            "latency_sum": {
              "sum": {
                "field": "span.destination.service.response_time.sum.us"
              }
            },
            "count": {
              "sum": {
                "field": "span.destination.service.response_time.count"
              }
            },
            "event.outcome": {
              "terms": {
                "field": "event.outcome"
              },
              "aggs": {
                "count": {
                  "sum": {
                    "field": "span.destination.service.response_time.count"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

Here are some options:

  • **Smarter time intervals for the date histogram.** If we widen the intervals so that, for example, 3 months produces 12 buckets (1 per week) instead of 90 (1 per day), there is less resolution in the charts (perhaps acceptable given how small these charts are), and we should be able to avoid the "too many buckets" exception. A good short-term solution if we can accept the tradeoff (see the sketch after this list).

  • **Separate histogram timeseries buckets from services.** @crespocarlos mentioned this in #161239 ([APM] Dependencies call can create too many buckets): one request to get the services, and a separate request to get timeseries data only for the visible services. Better real and perceived performance, since the list of services appears quickly, which may be all the user needs. This is how the Services Inventory works and is probably the best long-term solution. Note: this API is also used in the Services Overview, so we would need to check how it would be affected.

  • **Smaller composite size** to make the "too many buckets" exception less likely. Instead of 1500, use 500 and paginate 3 times to get 1500 results. The tradeoff is a slower query, since multiple requests are needed. A workable short-term solution, but I think we are less likely to accept a slower query.
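
A minimal sketch of the first option, assuming we pick the interval from a fixed ladder to keep the estimated bucket count under a budget. The ladder, function name, and headroom factor are all hypothetical, not the actual Kibana implementation:

```ts
// Hypothetical: choose the smallest "nice" fixed_interval that keeps the
// estimated bucket count under a budget, leaving headroom for the
// event.outcome sub-aggregation.
const INTERVAL_LADDER_S = [3_600, 21_600, 86_400, 604_800, 2_592_000]; // 1h .. 30d

function pickFixedInterval(
  durationS: number,
  compositeSize: number,      // e.g. 1500 service/dependency pairs
  bucketBudget = 65_536 / 2   // half the default limit, as headroom
): string {
  for (const interval of INTERVAL_LADDER_S) {
    const bucketsPerSeries = Math.ceil(durationS / interval);
    if (bucketsPerSeries * compositeSize <= bucketBudget) {
      return `${interval}s`;
    }
  }
  return `${INTERVAL_LADDER_S[INTERVAL_LADDER_S.length - 1]}s`;
}

// 90 days with 1500 pairs: daily would mean 135,000 buckets, so the
// function widens to weekly: 13 buckets x 1500 pairs = 19,500.
console.log(pickFixedInterval(90 * 86_400, 1500)); // "604800s"
```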

neptunian removed their assignment on May 1, 2024
neptunian (Contributor) commented:

Talked with @smith; we're going to go with the first option, larger time intervals, which means fewer buckets.

neptunian (Contributor) commented May 8, 2024

@chrisdistasio @paulb-elastic

There's a PR open here: #182884. This fix does not cover very large time ranges, e.g. 4+ years with the maximum number of dependencies (1500). My thought is that there should be a balance between how many buckets we try to stay under for any time range and letting the user choose to increase their bucket limit; in this case we can advise the user to increase their default max buckets. If we feel we should always stay under the max bucket limit, even in a scenario of several years, I can do that: currently the largest time interval the algorithm will choose is 30 days, which is still too small for something like 4 years, and we would need to switch to something like 3 months. If we want to do that, I'd prefer a separate PR, as it requires changes to a function used all over the APM UI and more in-depth testing. The better alternative would be to implement the second option, separating histogram timeseries buckets from services.

neptunian added a commit that referenced this issue on May 8, 2024:

…n_stats query (#182884)

Fixes #178491

## Summary
The user receives a `too_many_buckets` exception when querying for 90 days' worth of data, and for many other longer time ranges. This is because the date histogram within each service uses time intervals that are too small.

## Solution

Lowering `numBuckets` causes the time intervals to increase, because the algorithm divides the duration the user selects by this number (duration / numBuckets). The larger the time range, [the more likely it is to choose a larger interval](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/common/utils/get_bucket_size/calculate_auto.js#L11), resulting in fewer buckets per date histogram.

The exception can still be thrown for time ranges the algorithm doesn't handle well; e.g. selecting 4 years or more will cause the error if a user has around the max number of dependencies (1500). This is because our [maximum time interval is 30 days](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/common/utils/get_bucket_size/calculate_auto.js#L26), and that interval becomes too small relative to such a large time range (see the sketch below). In this case we can recommend increasing the max bucket limit in Elasticsearch. There needs to be a balance between how hard we try to stay under the default bucket limit and letting the user change that limit to get more data.
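
A simplified sketch of the rounding behavior described above. The interval ladder and `pickNearest` only approximate what `calculateAuto.near` does; this is not the actual implementation:

```ts
// Approximate calculateAuto.near: round duration / numBuckets to the
// closest "nice" interval, with the ladder topping out at 30 days.
const NICE_INTERVALS_MS = [
  60_000, 300_000, 1_800_000, 3_600_000, 21_600_000,
  86_400_000, 604_800_000, 2_592_000_000, // 1m .. 30d
];

const pickNearest = (targetMs: number): number =>
  NICE_INTERVALS_MS.reduce((best, i) =>
    Math.abs(i - targetMs) < Math.abs(best - targetMs) ? i : best
  );

const DAY = 86_400_000;

// 90-day range: numBuckets = 8 targets ~11.25 days, which rounds to 7 days,
// so each series produces ~13 buckets instead of 90.
console.log(pickNearest((90 * DAY) / 8) / DAY); // 7

// 4-year range: the ladder caps at 30 days, giving ~49 buckets per series;
// 49 x 1500 dependencies = ~73,500, still over the 65,536 default limit.
console.log(pickNearest((1461 * DAY) / 8) / DAY); // 30
```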

Scenarios of duration and numBuckets and the resulting number of buckets with the max of 1500 dependencies:

(Screenshot: https://github.com/elastic/kibana/assets/1676003/ab246534-7358-4372-bbce-09768eb4c341)


## Changes
- lower `numBuckets` to 8 when calling `calculateAuto.near`
- add unit tests to `calculateAuto.near` and `getBucketSize`

## Testing
1. Change the [many_dependencies.ts](https://github.com/elastic/kibana/blob/main/packages/kbn-apm-synthtrace/src/scenarios/many_dependencies.ts#L18-L19) synthtrace scenario to generate 1500 dependencies by changing these lines locally:
```ts
const NUMBER_OF_DEPENDENCIES_PER_SERVICE = 15;
const NUMBER_OF_SERVICES = 100;
```
2. Run `node scripts/synthtrace many_dependencies.ts --live --clean` locally
3. Run a local Kibana instance and navigate to the APM dependencies inventory: http://localhost:5601/app/apm/dependencies/inventory
4. Try various date ranges
paulb-elastic (Contributor) commented:

Thanks @neptunian, that seems a good and reasonable approach (@chrisdistasio do you see a need for such long time periods?)

@neptunian if the user does select a 4+ year range, what's the user experience? Do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like "please select a shorter time period", does that also fall into the bigger piece of work you mentioned?

neptunian (Contributor) commented May 9, 2024

> if the user does select a 4+ year range, what's the user experience, do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like please select a shorter time period, does that also fall into the bigger piece of work you mentioned?

Yes, they will still get the error, with a "failed to fetch" message in the table. With the "separate histogram timeseries buckets from services" approach I mentioned, they would be unlikely to get the error, because we'd only fetch timeseries data for the services they are looking at. A significant part of the problem is fetching timeseries data for ALL of their services, even though they can't view them all at once anyway (the table defaults to 25 items per page and can be set lower).

I think the current error telling them to adjust their settings to allow more buckets is helpful and we should keep it, but I understand they don't know exactly why it happens or what they can do to remedy it other than changing their bucket limit, so adding that kind of messaging could help: "There is too much data being returned. Adjust your cluster bucket size (same as the current messaging about adjusting bucket size) or try narrowing your time range." This message comes from Elasticsearch, so we'd have to parse it and append the extra suggestion to narrow the time range. It would show up for every ES query in APM that hits the exception, and may not be helpful in contexts where the time range is not a significant contributor to the bucket count.
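
A hypothetical sketch of that messaging idea. The error shape and helper name below are assumptions for illustration, not Kibana APIs:

```ts
// Detect Elasticsearch's too_many_buckets exception and append a hint to
// narrow the time range. Purely illustrative; real Kibana error handling
// and the exact error shape differ.
interface EsErrorLike {
  error?: { caused_by?: { type?: string; reason?: string } };
}

function withTimeRangeHint(err: EsErrorLike, message: string): string {
  if (err.error?.caused_by?.type === 'too_many_buckets_exception') {
    return `${message} Try narrowing your time range, or increase search.max_buckets in your cluster settings.`;
  }
  return message;
}
```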
