
[Telemetry] Add telemetry around the time it takes to grab the telemetry stats #119468

Closed
Bamieh opened this issue Nov 23, 2021 · 4 comments · Fixed by #132233
Labels
  • Feature:Telemetry
  • impact:high - Addressing this issue will have a high level of impact on the quality/strength of our product.
  • loe:medium - Medium Level of Effort
  • Team:Core - Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.

Comments

@Bamieh
Member

Bamieh commented Nov 23, 2021

[User story]

Summary

We need to monitor collector performance to ensure that the telemetry footprint stays low. We can surface these metrics in the usage data, in CI/tests, and to developers during local development.

Impact and Concerns

Labeling as impact:high since this ensures the future scalability of our telemetry and puts a system in place to enable performance optimizations for our collection methods. It also helps reduce the number of people opting out of telemetry in cases where collection causes significant resource spikes.

Acceptance criteria

  • Metrics around the time it takes for each collector, and for all collectors combined, to complete fetching the data.
  • Metrics around the number of requests per day against the stats endpoint.

Potential solutions

Monitor collector fetch performance and report the results in telemetry:

  • log a warning in dev when a collector exceeds a time threshold
  • list all cases where we do a PIT search inside collectors
  • collaborate with teams to check why they need that search instead of an aggregation
  • provide tools to solve these issues
  • set a timeline to disable PIT search in collectors
  • add telemetry to track how long each fetch request takes (see the sketch below)
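
A minimal sketch of the timing idea, assuming we can wrap each collector's fetch; the `Collector` shape, logger, and threshold here are illustrative, not the actual Kibana usage-collection internals:

```ts
// Illustrative sketch: wrap a collector's fetch() to record its duration and
// warn in dev when it crosses a threshold. Not the real Kibana API.
interface Collector<T> {
  type: string;
  fetch: () => Promise<T>;
}

const SLOW_FETCH_THRESHOLD_MS = 2_000; // assumed threshold for the dev warning

function withFetchTiming<T>(
  collector: Collector<T>,
  onDuration: (type: string, ms: number) => void,
  log: { warn: (msg: string) => void },
  isDev: boolean
): Collector<T> {
  return {
    ...collector,
    fetch: async () => {
      const start = performance.now();
      try {
        return await collector.fetch();
      } finally {
        const durationMs = performance.now() - start;
        onDuration(collector.type, durationMs); // report into the usage payload
        if (isDev && durationMs > SLOW_FETCH_THRESHOLD_MS) {
          log.warn(
            `Usage collector "${collector.type}" took ${durationMs.toFixed(0)}ms to fetch`
          );
        }
      }
    },
  };
}
```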

Notes

@Bamieh added the Team:Core, Feature:Telemetry, loe:large, and impact:high labels on Nov 23, 2021
@rudolf
Contributor

rudolf commented Nov 24, 2021

Although slow collectors could be a sign that collection is expensive, that alone doesn't give us the whole picture. A slow collector probably shows that the operation is expensive for Elasticsearch (i.e. an aggregation over a large amount of data), but it doesn't tell us how it impacts Kibana.

In addition, we should track the Elasticsearch response length, since serializing JSON has the biggest impact on Kibana performance, and inefficient code that loops over an ES response is likely to get slower the larger the response is.

That can help us diagnose the following:

  • Is collection slow because ES is slow or Kibana is slow?
  • Is Kibana slow because of telemetry collection or because of another problem?
  • Is collector X the cause of the performance problem or is it slow because another collector caused a performance problem?

Having telemetry on telemetry can give us helpful summary details, but we lose a lot of resolution by looking at snapshots. If we instrument proxy logs, we can do temporal analysis, e.g. does the event loop spike after refreshing the usage collection cache?
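
A minimal sketch of recording both the fetch duration and the serialized response size, under the assumption that we can wrap the collector's ES call; all names here are illustrative:

```ts
// Illustrative sketch: capture how long a collector's request takes and how
// large the serialized response is, since big responses dominate Kibana-side
// JSON serialization cost.
interface FetchStats {
  collectorType: string;
  durationMs: number;
  responseBytes: number;
}

async function measureFetch<T>(
  collectorType: string,
  doFetch: () => Promise<T>,
  record: (stats: FetchStats) => void
): Promise<T> {
  const start = performance.now();
  const response = await doFetch();
  const durationMs = performance.now() - start;
  // Serializing once just to measure size has its own cost; in practice the
  // content-length of the ES response would be a cheaper proxy.
  const responseBytes = Buffer.byteLength(JSON.stringify(response));
  record({ collectorType, durationMs, responseBytes });
  return response;
}
```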

@afharo
Member

afharo commented Feb 8, 2022

IMO, we should aim to keep this collector as simple as possible: this will help us detect the common offenders when requesting telemetry, i.e. #123154.

However, I agree that additional inputs, like #122516, will be critical.

How does it sound if we set the scope of this issue to collecting the time it takes for each collector to complete, and use other items like #122516 to understand the underlying requests and any event-loop delays derived from them?

@afharo
Member

afharo commented Mar 15, 2022

#122516 was done!

I'm wondering, though, whether we should implement this as telemetry or as APM transactions/spans. Ideally, we should catch and fix these issues before changes are released. Which approach would best help us track these metrics, and potentially fix them before we release the offending version?
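
A rough sketch of what the APM option could look like, assuming the elastic-apm-node agent that Kibana ships is already started; the collector shape and span naming are illustrative:

```ts
// Illustrative sketch: wrap the overall stats collection in an APM transaction
// and each collector fetch in a span, so slow collectors show up per release.
import apm from 'elastic-apm-node';

async function collectAllStats(
  collectors: Array<{ type: string; fetch: () => Promise<unknown> }>
) {
  const transaction = apm.startTransaction('usage-collection', 'telemetry');
  try {
    for (const collector of collectors) {
      const span = apm.startSpan(`collector:${collector.type}`, 'telemetry');
      try {
        await collector.fetch();
      } finally {
        span?.end();
      }
    }
  } finally {
    transaction?.end();
  }
}
```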

What do you think?

@Bamieh
Member Author

Bamieh commented Mar 17, 2022

@afharo Yeah, it is reasonable to have a way to catch issues before a release. However, grabbing telemetry performance data from the real world is invaluable: it gives us deep insights and allows us to be proactive, catching niche issues before they grow to the point of affecting average clusters.
