[Infrastructure UI] Add logging to Inventory Threshold Rule #127838
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
@kobelb can you weigh in on this? We want to be able to easily rule out our executor processing if a customer reports rules running slowly. Is there any way to get that information from the event log events? I don't see any harm in landing this kind of debug flag, but I was curious whether the event log already covers it.
@kobelb Just to give more context... I currently have a rule with 10K alerts (firing) but no actions. The average execution time in the Inventory Threshold code is ~2 seconds and the execution time listed in the event log is ~6 seconds. The difference, ~4 seconds, is the time spent by the Alerting Framework. As these rules scale up, I feel like it's going to be important to understand how much time is spent in the executor vs. how much time is spent in the framework. This will be important when we troubleshoot future issues, since the tendency right now is to assume the executor is the issue, and for good reason, since historically it has been. With the changes in #125034, the majority of the clock time is now spent scheduling the actions in the executor. Here are the timings from 50K hosts
This change will also make it very easy to troubleshoot the Elasticsearch request/response since the workload is now pushed down to the Elasticsearch layer. We will be able to ask the customer to increase the log level to
TL;DR - the event log does not include enough information to determine the time spent in the "Alerting framework". The event-log stores quite a few other durations:
IMO, we (Elastic) definitely want some visibility into the time spent in the Alerting framework. However, I don't know whether or not we want to include this in the event log, as our users shouldn't need to care about this. Including this in the debug logs for the time being seems completely reasonable to me.
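To illustrate the executor-vs-framework split discussed above, here is a minimal sketch of timing an executor and logging the result at debug level. The `DebugLogger` interface, `withExecutionTiming`, and `frameworkOverheadMs` are hypothetical names for illustration, not the code in this PR or the real Kibana `Logger` API.

```typescript
// Hypothetical minimal logger shape; the real Kibana Logger has more methods.
interface DebugLogger {
  debug(message: string): void;
}

// Runs `executor` and logs its wall-clock duration at debug level,
// whether the executor resolves or throws.
async function withExecutionTiming<T>(
  logger: DebugLogger,
  label: string,
  executor: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await executor();
  } finally {
    const elapsedMs = Date.now() - start;
    logger.debug(`${label} took ${elapsedMs}ms`);
  }
}

// Framework overhead is whatever the event log reports beyond the
// executor's own time (assumes both durations are in milliseconds).
function frameworkOverheadMs(
  eventLogDurationMs: number,
  executorDurationMs: number
): number {
  return Math.max(0, eventLogDurationMs - executorDurationMs);
}
```

With the rough numbers quoted in the thread, `frameworkOverheadMs(6000, 2000)` gives 4000, i.e. the ~4 seconds attributed to the Alerting Framework.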
LGTM
.../infra/server/lib/alerting/inventory_metric_threshold/inventory_metric_threshold_executor.ts
x-pack/plugins/infra/server/lib/alerting/inventory_metric_threshold/lib/get_data.ts
@jasonrhodes @kobelb Here is the new log format:
Ran this locally, looks super useful. Thanks!
We'll probably want to write up some kind of doc (an SKI? I'm not sure) so that how to turn this on is super obvious when the time comes to use it.
const formatMessage = (msg: string) =>
  `[AlertId: ${alertExecutionDetails.alertId}][ExecutionId: ${alertExecutionDetails.executionId}] ${msg}`;
return {
  ...scopedLogger,
Don't block the merge for this because it's not a huge deal either way, but should we even include the other methods of this logger here? If we do, and someone uses one later, it won't have the associated details but I doubt the person would notice. If we don't include them, they would at least be prompted to dig back through to see why they aren't included...
I can add them
I added the message formatting to warn|error|fatal only when the user logs a string instead of an Error object. Unfortunately, the logger's context is private, so I can't add the formatting when they pass an Error object.
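The approach described in this thread could look roughly like the sketch below: string messages get the alert/execution prefix, while `Error` objects are passed through untouched. `SimpleLogger`, `AlertExecutionDetails`, and `createScopedLogger` are simplified, hypothetical stand-ins and not the actual Kibana types or the merged code.

```typescript
// Hypothetical, simplified logger interface for illustration only.
type LogInput = string | Error;

interface SimpleLogger {
  debug(message: string): void;
  warn(message: LogInput): void;
  error(message: LogInput): void;
  fatal(message: LogInput): void;
}

// Assumed shape of the details used in the prefix.
interface AlertExecutionDetails {
  alertId: string;
  executionId: string;
}

function createScopedLogger(
  logger: SimpleLogger,
  details: AlertExecutionDetails
): SimpleLogger {
  const formatMessage = (msg: string) =>
    `[AlertId: ${details.alertId}][ExecutionId: ${details.executionId}] ${msg}`;

  // Only strings can be prefixed; Error objects are forwarded unchanged.
  const formatIfString = (msg: LogInput): LogInput =>
    typeof msg === 'string' ? formatMessage(msg) : msg;

  return {
    debug: (msg) => logger.debug(formatMessage(msg)),
    warn: (msg) => logger.warn(formatIfString(msg)),
    error: (msg) => logger.error(formatIfString(msg)),
    fatal: (msg) => logger.fatal(formatIfString(msg)),
  };
}
```

Listing each method explicitly, rather than spreading the underlying logger, means a method that lacks the prefix cannot be used by accident, which is the concern raised in the review comment.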
Summary
This PR adds logging for the execution time (debug) along with the Elasticsearch request/responses (trace) to the Inventory Threshold Rule.
To enable the logging for the Inventory Threshold rule, add the following snippet to your config/kibana.dev.yml:
Example Log Format