
[Infrastructure UI] Add logging to Inventory Threshold Rule #127838

Merged

Conversation

simianhacker
Member

@simianhacker simianhacker commented Mar 15, 2022

Summary

This PR adds logging for the execution time (debug) along with the Elasticsearch requests/responses (trace) to the Inventory Threshold Rule.

To enable the logging for the Inventory Threshold rule, add the following snippet to your config/kibana.dev.yml:

logging.loggers:
  - name: plugins.infra.inventoryRule
    level: trace
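
For reference, the logger name in the snippet above comes from how the plugin scopes its logger. The sketch below is only an illustration of that wiring, assuming Kibana's standard Logger/LoggerFactory interfaces from the plugin initializer context; it is not the PR's actual code, and the class, method names, and import path are placeholders.

// Illustrative sketch only (not the PR's code): how a logger scoped to
// "plugins.infra.inventoryRule" could be obtained in the infra plugin.
// The import path and class shape are assumptions for this example.
import type { Logger, PluginInitializerContext } from 'kibana/server';

export class ExamplePlugin {
  private readonly inventoryRuleLogger: Logger;

  constructor(initContext: PluginInitializerContext) {
    // Loggers retrieved from the plugin's initializer context are scoped under
    // "plugins.<pluginId>", so .get('inventoryRule') matches the config above.
    this.inventoryRuleLogger = initContext.logger.get('inventoryRule');
  }

  logExample() {
    this.inventoryRuleLogger.debug('Scheduled 0 actions in 39ms'); // execution-time logging
    this.inventoryRuleLogger.trace('Request: {"index":"metrics-*"}'); // request/response logging
  }
}

Setting the level to trace also enables debug, so both the timing lines and the request/response lines are emitted.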

Example Log Format

[2022-03-17T12:02:11.722-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Request: {"allow_no_indices":true,"ignore_unavailable":true,"index":"metrics-*,metricbeat-*","body":{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":1647540071721,"lte":1647540131721,"format":"epoch_millis"}}}]}},"aggs":{"nodes":{"composite":{"size":5000,"sources":[{"node":{"terms":{"field":"host.name"}}}]},"aggs":{"cpu_user":{"avg":{"field":"system.cpu.user.pct"}},"cpu_system":{"avg":{"field":"system.cpu.system.pct"}},"cpu_cores":{"max":{"field":"system.cpu.cores"}},"cpu":{"bucket_script":{"buckets_path":{"user":"cpu_user","system":"cpu_system","cores":"cpu_cores"},"script":{"source":"(params.user + params.system) / params.cores","lang":"painless"},"gap_policy":"skip"}},"selectedBucket":{"bucket_selector":{"buckets_path":{"shouldWarn":"shouldWarn","shouldTrigger":"shouldTrigger"},"script":"(params.shouldWarn != null && params.shouldWarn > 0) || (params.shouldTrigger != null && params.shouldTrigger > 0)"}},"shouldWarn":{"bucket_script":{"buckets_path":{},"script":"0"}},"shouldTrigger":{"bucket_script":{"buckets_path":{"value":"cpu"},"script":"params.value > 0.01 ? 1 : 0"}}}}}}}
[2022-03-17T12:02:11.724-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Response: {"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0,"hits":[]}}
[2022-03-17T12:02:11.724-06:00][DEBUG][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Scheduled 0 actions in 39ms

@simianhacker simianhacker added release_note:enhancement Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.2.0 labels Mar 15, 2022
@simianhacker simianhacker requested a review from a team March 15, 2022 23:21
@simianhacker simianhacker requested a review from a team as a code owner March 15, 2022 23:21
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@jasonrhodes
Member

@kobelb can you weigh in on this? We want to be able to easily rule out our executor processing if a customer reports rules running slowly. Is there any way to get at that information via the event log events? I don't see any harm in landing this kind of debug flag, but I was curious if we can already get at this from event log events.

@mgiota mgiota self-requested a review March 16, 2022 15:36
@simianhacker
Member Author

simianhacker commented Mar 16, 2022

@kobelb Just to give more context...

I currently have a rule with 10K alerts (firing) but no actions. The average execution time in the Inventory Threshold code is ~2 seconds and the execution time listed in the event log is ~6 seconds. The difference, ~4 seconds, is the time spent by the Alerting Framework. As these rules scale up, I feel like it's going to be important to understand how much time is spent in the executor vs how much time is spent in the framework. This will be important when we troubleshoot future issues, since the tendency right now is to assume the executor is the issue, and for good reason, since historically it has been. With the changes in #125034, the majority of the clock time is now spent scheduling the actions in the executor.

Here are the timings from 50K hosts

Inventory Threshold Rule: 7.471s
├── evaluateCondition: 1.771s
│   ├── getData: 1.502s
│   │   ├── esClient.search: 913.582ms
│   │   ├── esClient.search: 133.364ms
│   │   ├── esClient.search: 136.914ms
│   │   ├── esClient.search: 135.469ms
│   │   ├── esClient.search: 141.311ms
│   │   └── esClient.search: 14.546ms
│   └── mapValues: 268.85ms
└── scheduleActions: 5.684s

This change will also make it very easy to troubleshoot the Elasticsearch request/response since the workload is now pushed down to the Elasticsearch layer. We will be able to ask the customer to increase the log level to trace for this specific rule and the request/responses will be displayed in the logs. That should make it very clear why rules matched or why they didn't. It will also help with troubleshooting "Slow ES Queries" vs "Slow Results Processing" because the same queries can be run inside the Dev Console.
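
To make that concrete, here is a rough sketch (not the merged code) of what logging around the search call could look like. The names logger, esClient, searchParams, and scheduleAllActions, as well as the import paths, are assumptions for the example.

// Rough sketch under assumed names: the trace lines give a request that can be
// replayed verbatim in Dev Console, and the debug line times the whole pass.
import type { Logger } from '@kbn/logging';
import type { ElasticsearchClient } from 'kibana/server';
import type { estypes } from '@elastic/elasticsearch';

async function evaluateAndSchedule(
  logger: Logger,
  esClient: ElasticsearchClient,
  searchParams: estypes.SearchRequest,
  scheduleAllActions: (response: unknown) => Promise<number> // hypothetical helper
): Promise<void> {
  const start = Date.now();
  // Trace the outgoing request so it can be copied into Dev Console as-is.
  logger.trace(`Request: ${JSON.stringify(searchParams)}`);
  const response = await esClient.search(searchParams);
  logger.trace(`Response: ${JSON.stringify(response)}`);
  const scheduled = await scheduleAllActions(response);
  // Debug-level timing, matching the "Scheduled N actions in Xms" line above.
  logger.debug(`Scheduled ${scheduled} actions in ${Date.now() - start}ms`);
}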

@kobelb
Contributor

kobelb commented Mar 16, 2022

TL;DR - the event log does not include enough information to determine the time spent in the "Alerting framework".

The event log stores quite a few other durations:

  • total execution time
  • time spent executing Elasticsearch HTTP requests, including serializing/deserializing requests/responses and network latency between ES and Kibana
  • total time Elasticsearch itself spends executing searches

IMO, we (Elastic) definitely want some visibility into the time spent in the Alerting framework. However, I don't know whether or not we want to include this in the event log, as our users shouldn't need to care about this. Including this in the debug logs for the time being seems completely reasonable to me.

Contributor

@mgiota mgiota left a comment

LGTM

@simianhacker
Member Author

@jasonrhodes @kobelb Here is the new log format:

[2022-03-17T12:02:11.722-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Request: {"allow_no_indices":true,"ignore_unavailable":true,"index":"metrics-*,metricbeat-*","body":{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":1647540071721,"lte":1647540131721,"format":"epoch_millis"}}}]}},"aggs":{"nodes":{"composite":{"size":5000,"sources":[{"node":{"terms":{"field":"host.name"}}}]},"aggs":{"cpu_user":{"avg":{"field":"system.cpu.user.pct"}},"cpu_system":{"avg":{"field":"system.cpu.system.pct"}},"cpu_cores":{"max":{"field":"system.cpu.cores"}},"cpu":{"bucket_script":{"buckets_path":{"user":"cpu_user","system":"cpu_system","cores":"cpu_cores"},"script":{"source":"(params.user + params.system) / params.cores","lang":"painless"},"gap_policy":"skip"}},"selectedBucket":{"bucket_selector":{"buckets_path":{"shouldWarn":"shouldWarn","shouldTrigger":"shouldTrigger"},"script":"(params.shouldWarn != null && params.shouldWarn > 0) || (params.shouldTrigger != null && params.shouldTrigger > 0)"}},"shouldWarn":{"bucket_script":{"buckets_path":{},"script":"0"}},"shouldTrigger":{"bucket_script":{"buckets_path":{"value":"cpu"},"script":"params.value > 0.01 ? 1 : 0"}}}}}}}
[2022-03-17T12:02:11.724-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Response: {"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0,"hits":[]}}
[2022-03-17T12:02:11.724-06:00][DEBUG][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Scheduled 0 actions in 39ms

Member

@jasonrhodes jasonrhodes left a comment

Ran this locally, looks super useful. Thanks!

We'll probably want to write up some kind of doc ... SKI? I'm not sure, but something so that how to turn this on is super obvious when the time comes to need it.

const formatMessage = (msg: string) =>
  `[AlertId: ${alertExecutionDetails.alertId}][ExecutionId: ${alertExecutionDetails.executionId}] ${msg}`;
return {
  ...scopedLogger,
Member

Don't block the merge for this because it's not a huge deal either way, but should we even include the other methods of this logger here? If we do, and someone uses one later, it won't have the associated details but I doubt the person would notice. If we don't include them, they would at least be prompted to dig back through to see why they aren't included...

Member Author

I can add them

Member Author

I added the message formatting to warn|error|fatal only when the user logs a string instead of an Error object. Unfortunately, the logger's context is private so I can't add the formatting when they pass an Error object.
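
For readers following along, a rough sketch of that approach might look like the following. This is not the merged code; the wrapper name and details shape are assumptions. String messages to warn/error/fatal get the prefix, while Error objects are forwarded unchanged because the logger's internal context can't be modified from outside.

// Sketch only: wrap a scoped Kibana Logger so string messages carry the
// [AlertId][ExecutionId] prefix seen in the example log format above.
// Mirrors the spread-over-scopedLogger pattern from the reviewed snippet.
import type { Logger, LogMeta } from '@kbn/logging';

interface ExecutionDetails {
  alertId: string;
  executionId: string;
}

const createScopedRuleLogger = (scopedLogger: Logger, details: ExecutionDetails): Logger => {
  const formatMessage = (msg: string) =>
    `[AlertId: ${details.alertId}][ExecutionId: ${details.executionId}] ${msg}`;
  return {
    ...scopedLogger,
    // trace/debug/info only take strings, so they are always prefixed.
    trace: (msg: string, meta?: LogMeta) => scopedLogger.trace(formatMessage(msg), meta),
    debug: (msg: string, meta?: LogMeta) => scopedLogger.debug(formatMessage(msg), meta),
    info: (msg: string, meta?: LogMeta) => scopedLogger.info(formatMessage(msg), meta),
    // warn/error/fatal accept string | Error; only strings get the prefix.
    warn: (msg: string | Error, meta?: LogMeta) =>
      typeof msg === 'string' ? scopedLogger.warn(formatMessage(msg), meta) : scopedLogger.warn(msg, meta),
    error: (msg: string | Error, meta?: LogMeta) =>
      typeof msg === 'string' ? scopedLogger.error(formatMessage(msg), meta) : scopedLogger.error(msg, meta),
    fatal: (msg: string | Error, meta?: LogMeta) =>
      typeof msg === 'string' ? scopedLogger.fatal(formatMessage(msg), meta) : scopedLogger.fatal(msg, meta),
  };
};

Callers would then use the wrapped logger like any other Logger, e.g. ruleLogger.debug('Scheduled 0 actions in 39ms'), and the prefix is added automatically.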

@simianhacker simianhacker enabled auto-merge (squash) March 17, 2022 19:08
@simianhacker simianhacker merged commit 04ccd4d into elastic:main Mar 17, 2022
@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when they are determined to be missing a backport. label Mar 21, 2022
@kibanamachine
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 127838 or prevent reminders by adding the backport:skip label.

1 similar comment

@smith smith added backport:skip This commit does not require backporting and removed backport missing Added to PRs automatically when they are determined to be missing a backport. labels Mar 23, 2022
@simianhacker simianhacker deleted the inventory-thresold-rule-logging branch April 17, 2024 15:37