
[Telemetry] Add telemetry around the time it takes to grab the telemetry stats #119468

Closed
Bamieh opened this issue Nov 23, 2021 · 4 comments · Fixed by #132233
Labels
  • Feature:Telemetry
  • impact:high - Addressing this issue will have a high level of impact on the quality/strength of our product.
  • loe:medium - Medium Level of Effort
  • Team:Core - Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.

Comments

@Bamieh
Member

Bamieh commented Nov 23, 2021

[User story]

Summary

We need to monitor collector performance to ensure that the telemetry footprint stays low. We can surface these metrics in the usage data, in CI/tests, and to developers during local development.

Impact and Concerns

Labeling as impact:high since this ensures the future scalability of our telemetry and puts a system in place to enable performance optimizations for our collection methods. It also helps reduce the number of people opting out of telemetry in cases where collection causes significant resource spikes.

Acceptance criteria

  • Metrics around the time it takes for each collector, and for all collectors combined, to complete fetching the data.
  • Metrics around the number of requests per day against the stats endpoint.

Potential solutions

Monitor collector fetch performance and report the results in telemetry:

  • log a warning in dev when a collector exceeds a time threshold
  • list all cases where we do a PIT search inside collectors
  • collaborate with teams to check why they need that search instead of an aggregation
  • provide tools to solve these issues
  • set a timeline to disable PIT search in collectors
  • add telemetry to track how long each fetch request takes (see the sketch below)
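
A minimal sketch of the timing idea, assuming we can wrap each collector's fetch; the `Collector` shape, logger, and threshold here are illustrative, not the actual Kibana usage-collection internals:

```ts
// Illustrative sketch: wrap a collector's fetch() to record its duration and
// warn in dev when it crosses a threshold. Not the real Kibana API.
interface Collector<T> {
  type: string;
  fetch: () => Promise<T>;
}

const SLOW_FETCH_THRESHOLD_MS = 2_000; // assumed threshold for the dev warning

function withFetchTiming<T>(
  collector: Collector<T>,
  onDuration: (type: string, ms: number) => void,
  log: { warn: (msg: string) => void },
  isDev: boolean
): Collector<T> {
  return {
    ...collector,
    fetch: async () => {
      const start = performance.now();
      try {
        return await collector.fetch();
      } finally {
        const durationMs = performance.now() - start;
        onDuration(collector.type, durationMs); // report into the usage payload
        if (isDev && durationMs > SLOW_FETCH_THRESHOLD_MS) {
          log.warn(
            `Usage collector "${collector.type}" took ${durationMs.toFixed(0)}ms to fetch`
          );
        }
      }
    },
  };
}
```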

Notes

@Bamieh added the Team:Core, Feature:Telemetry, loe:large, and impact:high labels on Nov 23, 2021
@rudolf
Contributor

rudolf commented Nov 24, 2021

Although slow collectors could be a sign that collection is expensive, that alone doesn't give us the whole picture. A slow collector probably shows that the operation is expensive for Elasticsearch (i.e. an aggregation over a large amount of data), but it doesn't tell us how it impacts Kibana.

In addition, we should track the Elasticsearch response length, since serializing JSON has the biggest impact on Kibana performance, and inefficient code that loops over an ES response is likely to get slower the larger the response is.

That can help us diagnose the following:

  • Is collection slow because ES is slow or Kibana is slow?
  • Is Kibana slow because of telemetry collection or because of another problem?
  • Is collector X the cause of the performance problem or is it slow because another collector caused a performance problem?

Having telemetry on telemetry can give us helpful summary details, but we lose a lot of resolution by looking at snapshots. If we instrument proxy logs, we can do temporal analysis, e.g. does the event loop spike after refreshing the usage collection cache?
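
A minimal sketch of recording both the fetch duration and the serialized response size, under the assumption that we can wrap the collector's ES call; all names here are illustrative:

```ts
// Illustrative sketch: capture how long a collector's request takes and how
// large the serialized response is, since big responses dominate Kibana-side
// JSON serialization cost.
interface FetchStats {
  collectorType: string;
  durationMs: number;
  responseBytes: number;
}

async function measureFetch<T>(
  collectorType: string,
  doFetch: () => Promise<T>,
  record: (stats: FetchStats) => void
): Promise<T> {
  const start = performance.now();
  const response = await doFetch();
  const durationMs = performance.now() - start;
  // Serializing once just to measure size has its own cost; in practice the
  // content-length of the ES response would be a cheaper proxy.
  const responseBytes = Buffer.byteLength(JSON.stringify(response));
  record({ collectorType, durationMs, responseBytes });
  return response;
}
```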

@afharo
Member

afharo commented Feb 8, 2022

IMO, we should aim to keep this collector as simple as possible: this will help us detect the common offenders when requesting telemetry, i.e. #123154.

However, I agree that additional inputs, like #122516, will be critical.

How does it sound if we set the scope of this issue to collecting the time it takes for each collector to complete, and use other items like #122516 to understand the underlying requests and any event-loop delays derived from them?

@afharo
Member

afharo commented Mar 15, 2022

#122516 was done!

I'm wondering, though, whether we should implement this as telemetry or as APM transactions/spans. Ideally, we should catch and fix these issues before changes are released. Which approach would best help us track these metrics, and potentially fix them before we release the offending version?
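
A rough sketch of what the APM option could look like, assuming the elastic-apm-node agent that Kibana ships is already started; the collector shape and span naming are illustrative:

```ts
// Illustrative sketch: wrap the overall stats collection in an APM transaction
// and each collector fetch in a span, so slow collectors show up per release.
import apm from 'elastic-apm-node';

async function collectAllStats(
  collectors: Array<{ type: string; fetch: () => Promise<unknown> }>
) {
  const transaction = apm.startTransaction('usage-collection', 'telemetry');
  try {
    for (const collector of collectors) {
      const span = apm.startSpan(`collector:${collector.type}`, 'telemetry');
      try {
        await collector.fetch();
      } finally {
        span?.end();
      }
    }
  } finally {
    transaction?.end();
  }
}
```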

What do you think?

@Bamieh
Member Author

Bamieh commented Mar 17, 2022

@afharo Yeah, it is reasonable to have a way to catch issues before a release. However, grabbing telemetry performance data from the real world is invaluable: it gives us deep insights and allows us to be proactive, catching niche issues before they grow to the point of affecting average clusters.
