
[Infrastructure UI] Add logging to Inventory Threshold Rule #127838

Merged

Conversation

simianhacker
Member

@simianhacker simianhacker commented Mar 15, 2022

Summary

This PR adds logging for the execution time (debug) along with the Elasticsearch requests/responses (trace) to the Inventory Threshold Rule.

To enable the logging for the Inventory Threshold rule, add the following snippet to your config/kibana.dev.yml:

logging.loggers:
  - name: plugins.infra.inventoryRule
    level: trace
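
For reference, the logger name in the snippet above comes from how the plugin scopes its logger. The sketch below is only an illustration of that wiring, assuming Kibana's standard Logger/LoggerFactory interfaces from the plugin initializer context; it is not the PR's actual code, and the class, method names, and import path are placeholders.

// Illustrative sketch only (not the PR's code): how a logger scoped to
// "plugins.infra.inventoryRule" could be obtained in the infra plugin.
// The import path and class shape are assumptions for this example.
import type { Logger, PluginInitializerContext } from 'kibana/server';

export class ExamplePlugin {
  private readonly inventoryRuleLogger: Logger;

  constructor(initContext: PluginInitializerContext) {
    // Loggers retrieved from the plugin's initializer context are scoped under
    // "plugins.<pluginId>", so .get('inventoryRule') matches the config above.
    this.inventoryRuleLogger = initContext.logger.get('inventoryRule');
  }

  logExample() {
    this.inventoryRuleLogger.debug('Scheduled 0 actions in 39ms'); // execution-time logging
    this.inventoryRuleLogger.trace('Request: {"index":"metrics-*"}'); // request/response logging
  }
}

Setting the level to trace also enables debug, so both the timing lines and the request/response lines are emitted.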

Example Log Format

[2022-03-17T12:02:11.722-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Request: {"allow_no_indices":true,"ignore_unavailable":true,"index":"metrics-*,metricbeat-*","body":{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":1647540071721,"lte":1647540131721,"format":"epoch_millis"}}}]}},"aggs":{"nodes":{"composite":{"size":5000,"sources":[{"node":{"terms":{"field":"host.name"}}}]},"aggs":{"cpu_user":{"avg":{"field":"system.cpu.user.pct"}},"cpu_system":{"avg":{"field":"system.cpu.system.pct"}},"cpu_cores":{"max":{"field":"system.cpu.cores"}},"cpu":{"bucket_script":{"buckets_path":{"user":"cpu_user","system":"cpu_system","cores":"cpu_cores"},"script":{"source":"(params.user + params.system) / params.cores","lang":"painless"},"gap_policy":"skip"}},"selectedBucket":{"bucket_selector":{"buckets_path":{"shouldWarn":"shouldWarn","shouldTrigger":"shouldTrigger"},"script":"(params.shouldWarn != null && params.shouldWarn > 0) || (params.shouldTrigger != null && params.shouldTrigger > 0)"}},"shouldWarn":{"bucket_script":{"buckets_path":{},"script":"0"}},"shouldTrigger":{"bucket_script":{"buckets_path":{"value":"cpu"},"script":"params.value > 0.01 ? 1 : 0"}}}}}}}
[2022-03-17T12:02:11.724-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Response: {"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0,"hits":[]}}
[2022-03-17T12:02:11.724-06:00][DEBUG][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Scheduled 0 actions in 39ms

@simianhacker simianhacker added release_note:enhancement Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.2.0 labels Mar 15, 2022
@simianhacker simianhacker requested a review from a team March 15, 2022 23:21
@simianhacker simianhacker requested a review from a team as a code owner March 15, 2022 23:21
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@jasonrhodes
Member

@kobelb can you weigh in on this? We want to be able to easily rule out our executor processing if a customer reports rules running slowly. Is there any way to get at that information via the event log events? I don't see any harm in landing this kind of debug flag, but I was curious if we can already get at this from event log events.

@mgiota mgiota self-requested a review March 16, 2022 15:36
@simianhacker
Member Author

simianhacker commented Mar 16, 2022

@kobelb Just to give more context...

I currently have a rule with 10K alerts (firing) but no actions. The average execution time in the Inventory Threshold code is ~2 seconds and the execution time listed in the event log is ~6 seconds. The difference, ~4 seconds, is the time spent by the Alerting Framework. As these rules scale up, I feel like it's going to be important to understand how much time is spent in the executor vs how much time is spent in the framework. This will be important when we troubleshoot future issues, since the tendency right now is to assume the executor is the issue, and for good reason, since historically it has been. With the changes in #125034, the majority of the clock time is now spent scheduling the actions in the executor.

Here are the timings from 50K hosts

Inventory Threshold Rule: 7.471s
├── evaluateCondition: 1.771s
│   ├── getData: 1.502s
│   │   ├── esClient.search: 913.582ms
│   │   ├── esClient.search: 133.364ms
│   │   ├── esClient.search: 136.914ms
│   │   ├── esClient.search: 135.469ms
│   │   ├── esClient.search: 141.311ms
│   │   └── esClient.search: 14.546ms
│   └── mapValues: 268.85ms
└── scheduleActions: 5.684s

This change will also make it very easy to troubleshoot the Elasticsearch request/response since the workload is now pushed down to the Elasticsearch layer. We will be able to ask the customer to increase the log level to trace for this specific rule and the request/responses will be displayed in the logs. That should make it very clear why rules matched or why they didn't. It will also help with troubleshooting "Slow ES Queries" vs "Slow Results Processing" because the same queries can be run inside the Dev Console.
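
To make that concrete, here is a rough sketch (not the merged code) of what logging around the search call could look like. The names logger, esClient, searchParams, and scheduleAllActions, as well as the import paths, are assumptions for the example.

// Rough sketch under assumed names: the trace lines give a request that can be
// replayed verbatim in Dev Console, and the debug line times the whole pass.
import type { Logger } from '@kbn/logging';
import type { ElasticsearchClient } from 'kibana/server';
import type { estypes } from '@elastic/elasticsearch';

async function evaluateAndSchedule(
  logger: Logger,
  esClient: ElasticsearchClient,
  searchParams: estypes.SearchRequest,
  scheduleAllActions: (response: unknown) => Promise<number> // hypothetical helper
): Promise<void> {
  const start = Date.now();
  // Trace the outgoing request so it can be copied into Dev Console as-is.
  logger.trace(`Request: ${JSON.stringify(searchParams)}`);
  const response = await esClient.search(searchParams);
  logger.trace(`Response: ${JSON.stringify(response)}`);
  const scheduled = await scheduleAllActions(response);
  // Debug-level timing, matching the "Scheduled N actions in Xms" line above.
  logger.debug(`Scheduled ${scheduled} actions in ${Date.now() - start}ms`);
}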

@kobelb
Contributor

kobelb commented Mar 16, 2022

TL;DR - the event log does not include enough information to determine the time spent in the "Alerting framework".

The event log stores quite a few other durations:

  • total execution time
  • time spent executing Elasticsearch HTTP requests, including serializing/deserializing requests/responses and network latency between ES and Kibana
  • total time Elasticsearch itself spends executing searches

IMO, we (Elastic) definitely want some visibility into the time spent in the Alerting framework. However, I don't know whether or not we want to include this in the event log, as our users shouldn't need to care about this. Including this in the debug logs for the time being seems completely reasonable to me.

Contributor

@mgiota mgiota left a comment

LGTM

@simianhacker
Member Author

@jasonrhodes @kobelb Here is the new log format:

[2022-03-17T12:02:11.722-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Request: {"allow_no_indices":true,"ignore_unavailable":true,"index":"metrics-*,metricbeat-*","body":{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":1647540071721,"lte":1647540131721,"format":"epoch_millis"}}}]}},"aggs":{"nodes":{"composite":{"size":5000,"sources":[{"node":{"terms":{"field":"host.name"}}}]},"aggs":{"cpu_user":{"avg":{"field":"system.cpu.user.pct"}},"cpu_system":{"avg":{"field":"system.cpu.system.pct"}},"cpu_cores":{"max":{"field":"system.cpu.cores"}},"cpu":{"bucket_script":{"buckets_path":{"user":"cpu_user","system":"cpu_system","cores":"cpu_cores"},"script":{"source":"(params.user + params.system) / params.cores","lang":"painless"},"gap_policy":"skip"}},"selectedBucket":{"bucket_selector":{"buckets_path":{"shouldWarn":"shouldWarn","shouldTrigger":"shouldTrigger"},"script":"(params.shouldWarn != null && params.shouldWarn > 0) || (params.shouldTrigger != null && params.shouldTrigger > 0)"}},"shouldWarn":{"bucket_script":{"buckets_path":{},"script":"0"}},"shouldTrigger":{"bucket_script":{"buckets_path":{"value":"cpu"},"script":"params.value > 0.01 ? 1 : 0"}}}}}}}
[2022-03-17T12:02:11.724-06:00][TRACE][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Response: {"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0,"hits":[]}}
[2022-03-17T12:02:11.724-06:00][DEBUG][plugins.infra.inventoryRule] [AlertId: ef390150-a614-11ec-9e13-59145509815a][ExecutionId: 58db3d55-cce6-4930-9d44-6b22ed68dd86] Scheduled 0 actions in 39ms

Member

@jasonrhodes jasonrhodes left a comment

Ran this locally, looks super useful. Thanks!

We'll probably want to write up some kind of doc ... SKI? I'm not sure, but something so that how to turn this on is super obvious when the time comes to need it.

const formatMessage = (msg: string) =>
  `[AlertId: ${alertExecutionDetails.alertId}][ExecutionId: ${alertExecutionDetails.executionId}] ${msg}`;
return {
  ...scopedLogger,
Member

Don't block the merge for this because it's not a huge deal either way, but should we even include the other methods of this logger here? If we do, and someone uses one later, it won't have the associated details but I doubt the person would notice. If we don't include them, they would at least be prompted to dig back through to see why they aren't included...

Member Author

I can add them

Member Author

I added the message formatting to warn|error|fatal only when the user logs a string instead of an Error object. Unfortunately, the logger's context is private so I can't add the formatting when they pass an Error object.
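
For readers following along, a rough sketch of that approach might look like the following. This is not the merged code; the wrapper name and details shape are assumptions. String messages to warn/error/fatal get the prefix, while Error objects are forwarded unchanged because the logger's internal context can't be modified from outside.

// Sketch only: wrap a scoped Kibana Logger so string messages carry the
// [AlertId][ExecutionId] prefix seen in the example log format above.
// Mirrors the spread-over-scopedLogger pattern from the reviewed snippet.
import type { Logger, LogMeta } from '@kbn/logging';

interface ExecutionDetails {
  alertId: string;
  executionId: string;
}

const createScopedRuleLogger = (scopedLogger: Logger, details: ExecutionDetails): Logger => {
  const formatMessage = (msg: string) =>
    `[AlertId: ${details.alertId}][ExecutionId: ${details.executionId}] ${msg}`;
  return {
    ...scopedLogger,
    // trace/debug/info only take strings, so they are always prefixed.
    trace: (msg: string, meta?: LogMeta) => scopedLogger.trace(formatMessage(msg), meta),
    debug: (msg: string, meta?: LogMeta) => scopedLogger.debug(formatMessage(msg), meta),
    info: (msg: string, meta?: LogMeta) => scopedLogger.info(formatMessage(msg), meta),
    // warn/error/fatal accept string | Error; only strings get the prefix.
    warn: (msg: string | Error, meta?: LogMeta) =>
      typeof msg === 'string' ? scopedLogger.warn(formatMessage(msg), meta) : scopedLogger.warn(msg, meta),
    error: (msg: string | Error, meta?: LogMeta) =>
      typeof msg === 'string' ? scopedLogger.error(formatMessage(msg), meta) : scopedLogger.error(msg, meta),
    fatal: (msg: string | Error, meta?: LogMeta) =>
      typeof msg === 'string' ? scopedLogger.fatal(formatMessage(msg), meta) : scopedLogger.fatal(msg, meta),
  };
};

Callers would then use the wrapped logger like any other Logger, e.g. ruleLogger.debug('Scheduled 0 actions in 39ms'), and the prefix is added automatically.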

@simianhacker simianhacker enabled auto-merge (squash) March 17, 2022 19:08
@simianhacker simianhacker merged commit 04ccd4d into elastic:main Mar 17, 2022
@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when they are determined to be missing a backport. label Mar 21, 2022
@kibanamachine
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 127838 or prevent reminders by adding the backport:skip label.

1 similar comment

@smith smith added backport:skip This commit does not require backporting and removed backport missing Added to PRs automatically when they are determined to be missing a backport. labels Mar 23, 2022
@simianhacker simianhacker deleted the inventory-thresold-rule-logging branch April 17, 2024 15:37