Observability dependencies view broken for >= 90 days of historical data #178491
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
Related ticket: #161239
In #161239, we changed the composite size to 1500 with no pagination. However, with a wide enough time range and 1500 unique top-level buckets (service name, dependency name), it is still easy to exceed the default Elasticsearch limit of 65,536 buckets. In the query below, the date histogram interval is daily (86400s) over roughly a 3-month time range: 1500 (services/dependencies) × 90 (days) = 135,000 buckets, before counting any additional buckets the date histogram may create over that range.
full query:
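The full query from the comment is not reproduced here. As an illustration only, a minimal sketch of the aggregation shape that triggers this bucket explosion could look like the following; the index pattern and field names are assumptions, not the actual APM query:

```typescript
// Illustrative only: a composite aggregation over (service, dependency) pairs
// with a nested daily date_histogram. Index pattern and field names are
// assumptions, not the actual APM query.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function run() {
  const response = await client.search({
    index: 'metrics-apm*', // assumed index pattern
    size: 0,
    aggs: {
      connections: {
        composite: {
          size: 1500, // up to 1500 unique (service name, dependency name) pairs
          sources: [
            { serviceName: { terms: { field: 'service.name' } } },
            { dependencyName: { terms: { field: 'span.destination.service.resource' } } },
          ],
        },
        aggs: {
          timeseries: {
            date_histogram: {
              field: '@timestamp',
              fixed_interval: '86400s', // daily buckets; ~90 over a 3-month range
            },
          },
        },
      },
    },
  });
  // 1500 composite buckets * 90 daily buckets = 135,000 total buckets,
  // well over the default search.max_buckets limit of 65,536.
  console.log(response.aggregations);
}

run().catch(console.error);
```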
Here are some options:
Talked with @smith and we're going to go with the first option of using larger time intervals, which means fewer buckets.
There's a PR open here: #182884. This fix does not cover very large time ranges, like 4+ years with the maximum number of dependencies (1500). My thought is that there should be a balance between how many buckets we try to stay under for any time range vs. letting the user choose to increase their bucket limit. We can advise the user to increase their default max buckets in this case. If we feel that we should always aim to stay under the max bucket limit, even in a scenario of several years, I can do that. Currently the smallest time interval is 30 days. For something like 4 years this becomes too small and we should switch to something like 3 months. If we want to do this I'd prefer to do it in a separate PR, as it will require changes to a function used all over the APM UI and more in-depth testing. The better alternative would be to implement the second option, "Separate histogram timeseries buckets from service."
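As a rough back-of-the-envelope check of the numbers above (the helper below is hypothetical, not Kibana code):

```typescript
// Hypothetical helper: estimate the total bucket count for a given range,
// histogram interval, and number of top-level series.
function estimateBuckets(rangeDays: number, intervalDays: number, series: number): number {
  return Math.ceil(rangeDays / intervalDays) * series;
}

// 4 years at the current 30-day minimum interval, with 1500 dependencies:
console.log(estimateBuckets(4 * 365, 30, 1500)); // 73,500 > 65,536 default max_buckets

// The same range at a ~3-month (90-day) interval stays well under the limit:
console.log(estimateBuckets(4 * 365, 90, 1500)); // 25,500
```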
…n_stats query (#182884) Fixes #178491

## Summary

The user receives a `too_many_buckets` exception when querying for 90 days' worth of data, and in many other longer time ranges. This is due to the date histogram within each service having time intervals that are too small.

## Solution

Lowering `numBuckets` causes the time periods to increase, because the algorithm divides the duration the user selects by this number (duration / numBuckets). The larger the time range is, [the more likely it will choose an interval that is larger](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/common/utils/get_bucket_size/calculate_auto.js#L11), resulting in fewer buckets per date histogram. The exception can still be thrown when users select time ranges that aren't caught in the algorithm; e.g. selecting 4 years or more will cause the error should a user have around the max number of dependencies (1500). This is because our [smallest time interval is 30 days](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/apm/common/utils/get_bucket_size/calculate_auto.js#L26) and that interval becomes too small in a large time range. We can recommend in this case to increase the max bucket size in Elasticsearch. There needs to be a balance between how much we try to stay under the default bucket limit vs. letting the user change that size and get more data.

Scenarios of duration and numBuckets size and the resulting number of buckets with the max of 1500 dependencies:

[Screenshot: table of duration/numBuckets scenarios and the resulting bucket counts]

## Changes

- lower `numBuckets` to 8 when calling `calculateAuto.near`
- add unit tests to `calculateAuto.near` and `getBucketSize`

## Testing

1. Change the [many_dependencies.ts](https://github.com/elastic/kibana/blob/main/packages/kbn-apm-synthtrace/src/scenarios/many_dependencies.ts#L18-L19) synthtrace scenario to generate 1500 dependencies by changing these lines locally: `const NUMBER_OF_DEPENDENCIES_PER_SERVICE = 15;` and `const NUMBER_OF_SERVICES = 100;`
2. Run `node scripts/synthtrace many_dependencies.ts --live --clean` locally.
3. Run a local Kibana instance and navigate to the APM dependencies inventory: http://localhost:5601/app/apm/dependencies/inventory
4. Try various date ranges.
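For illustration, here is a minimal sketch of the interval-selection idea the PR describes: settle on a coarser interval from a ladder so that the bucket count stays at or under a target. The ladder values and the `pickInterval` helper are assumptions, not the actual `calculate_auto.js` implementation:

```typescript
import moment from 'moment';

// Illustrative interval ladder; the real calculate_auto.js has its own rules.
// The largest entry mirrors the 30-day floor mentioned above.
const INTERVALS: moment.Duration[] = [
  moment.duration(1, 'minute'),
  moment.duration(30, 'minutes'),
  moment.duration(1, 'hour'),
  moment.duration(12, 'hours'),
  moment.duration(1, 'day'),
  moment.duration(1, 'week'),
  moment.duration(30, 'days'),
];

// Pick the smallest interval whose bucket count stays at or under the target.
function pickInterval(duration: moment.Duration, targetBuckets: number): moment.Duration {
  for (const interval of INTERVALS) {
    if (duration.asMilliseconds() / interval.asMilliseconds() <= targetBuckets) {
      return interval;
    }
  }
  return INTERVALS[INTERVALS.length - 1];
}

// Lowering the target from, say, 20 to 8 pushes selection toward coarser
// intervals, so each date_histogram produces fewer buckets.
const ninetyDays = moment.duration(90, 'days');
console.log(pickInterval(ninetyDays, 20).asDays()); // 7 (weekly buckets)
console.log(pickInterval(ninetyDays, 8).asDays()); // 30 (monthly buckets)
```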
Thanks @neptunian, that seems a good and reasonable approach (@chrisdistasio do you see a need for such long time periods?). @neptunian, if the user does select a 4+ year range, what's the user experience? Do they still end up with the `too_many_buckets` error?
Yes, they will still get the error, with a "failed to fetch" in the table. With the "Separate histogram timeseries buckets from service" option I mentioned, they would be unlikely to get the error, because we'd only fetch timeseries data for the services they are actually looking at. A significant part of the problem is fetching timeseries data for ALL the services they have, even though they can't look at all of them anyway (the table defaults to 25 items per page and can be set lower). I think the current error that tells them to adjust their settings so they can get more buckets is helpful and we should keep it, but I understand they don't know exactly why it happens or what they can do to remedy it other than changing their bucket size. So adding that kind of messaging could be helpful: "There is too much data being returned. Adjust your cluster bucket size (same as the current messaging about adjusting bucket size) or try narrowing your time range." This messaging comes from Elasticsearch, so we'd have to parse it and append some extra messaging to suggest narrowing the time range. It would show up for all the ES queries that encounter the exception in APM, and it may not be helpful in some contexts where the time range is not a significant contributor to the bucket count.
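For illustration, a hedged sketch of what parsing the Elasticsearch exception and appending a suggestion could look like; the error shape, helper name, and strings here are assumptions, not existing Kibana code:

```typescript
// Hypothetical sketch: detect the too_many_buckets exception from an
// Elasticsearch error response and append a suggestion to narrow the
// time range. Error shape, helper name, and strings are assumptions.
interface EsErrorCause {
  type?: string;
  reason?: string;
}

function augmentBucketError(cause: EsErrorCause): string {
  if (cause.type === 'too_many_buckets_exception') {
    return (
      `${cause.reason ?? 'Too many buckets returned.'} ` +
      'Try narrowing your time range, or increase search.max_buckets on the cluster.'
    );
  }
  return cause.reason ?? 'Unknown Elasticsearch error';
}

// Example with a typical Elasticsearch reason string:
console.log(
  augmentBucketError({
    type: 'too_many_buckets_exception',
    reason: 'Trying to create too many buckets. Must be less than or equal to: [65536].',
  })
);
```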
Kibana version:
Serverless build 03/12/24
Elasticsearch version:
Serverless build 03/12/24
Server OS version:
Serverless build 03/12/24
Browser version:
N/A
Browser OS version:
N/A
Original install method (e.g. download page, yum, from source, etc.):
Serverless build 03/12/24
Describe the bug:
When using the Observability test cluster for Serverless QA and selecting 90 days of historical data, an error about too many buckets is displayed.
Steps to reproduce:
1. On the Observability test cluster for Serverless QA, open the Observability dependencies view.
2. Select 90 days (or more) of historical data.
Expected behavior:
No error