-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Infrastructure UI][Rules] Refactor Metric Threshold rule to push evaluations to Elasticsearch #126214
[Infrastructure UI][Rules] Refactor Metric Threshold rule to push evaluations to Elasticsearch #126214
Conversation
…ule and move evaluations to ES
retest |
@elasticmachine merge upstream |
merge conflict between base and head |
retest |
1 similar comment
retest |
0a71ac3
to
7317156
Compare
@stevedodson I'm not very familiar with the mechanics of |
@simianhacker - as long as 'Build and Deploy to Cloud' succeeds the tests don't need to pass. I've now got this PR running in cloud. Thank you! |
@@ -177,15 +179,6 @@ export const createMetricThresholdExecutor = (libs: InfraBackendLibs) => | |||
}) | |||
) | |||
.join('\n'); | |||
/* | |||
* Custom recovery actions aren't yet available in the alerting framework | |||
* Uncomment the code below once they've been implemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simianhacker I see you removed the commented code for recovery actions. Do we have a ticket to implement custom recovery actions and build recovered alert reason? I don't want us to forget implementing this, now that we removed this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -212,20 +205,16 @@ export const createMetricThresholdExecutor = (libs: InfraBackendLibs) => | |||
.filter((result) => result[group].isNoData) | |||
.map((result) => buildNoDataAlertReason({ ...result[group], group })) | |||
.join('\n'); | |||
} else if (nextState === AlertStates.ERROR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simianhacker Don't we have an error state anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in the same sense. There was this error condition that existed in the old implementation that no longer exists in the new methodology. Errors from this point forward are going to be exceptions caught by the framework.
const actionGroupId = | ||
nextState === AlertStates.OK | ||
? RecoveredActionGroup.id | ||
: nextState === AlertStates.NO_DATA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simianhacker AlertsState refers actually to RulesState, right? The naming confuses me. Probably refactoring alerts to rules is out of scope of this PR. Shall I create another ticket to refactor wrong uses of alerts to rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should create a new ticket for standardizing variable names
@simianhacker I did a bit of testing and alerts got triggered fine. I created a rule to Then I started metricbeat again and I started getting following alert: What I was wondering though is if we should have an extra recovered alert for the group that started reporting data again. Most probably currently generated alerts are fine, this is just a thought I made and I put it here for possible consideration. On another note, I found another bug where Last updated value in the flyout is wrong. Instead of showing the last updated value, it shows when alert was started (You can see the bug in the 2 screenshots I posted above. Both screenshots have the same value, whereas they shouldn't). I'll create another issue for this. |
A heads up that we're seeing quite a few restarts on this Cloud instance due to it running out of memory. I haven't looked through the changes to see if it could be the cause, or if it's unrelated but wanted to raise it.
|
@simianhacker I added this review to our board as an External Review and it's now in our External Review queue. I bumped it up above the 3 other reviews requested from AO because it sounds more urgent, let me know if that's not the case. We're trying to do one ER at a time to limit how much effect they have on team output. cc: @smith |
💚 Build SucceededMetrics [docs]Async chunks
History
To update your PR or re-run it, just comment with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -48,6 +57,8 @@ describe("The Metric Threshold Alert's getElasticsearchMetricQuery", () => { | |||
expressionParams, | |||
timeframe, | |||
100, | |||
true, | |||
void 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL on void 0
vs undefined
👍🏻
Summary
This PR pushes ALL the processing down to Elasticsearch including the group by tracking. With this PR instead of gathering all the groupings, we only need to detect the groups that were either excluded or new between the previous run and the current run. We do this by extending the time frame of the run to include the previous run and the current run. Then we create 2 buckets that represent each period (
previousPeriod
andcurrentPeriod
) and compare the document counts for each groups to determine if the group has gone missing or has either returned or is new. If the groups has gone missing, we track it in in the state. Once the group re-appears, we remove it from the rule state. If the group is new but hasn't triggered the conditions, we ignore it.This PR also closes #118820 by refactoring the rate aggregations to use 2 filter buckets with a range and a bucket script to calculate the derivative instead of using a
date_histogram
.Along with this change, I also refactored the evaluations to happen inside of Elasticsearch instead of in Kibana. This is done using a combination of
bucket_scripts
and abucket_selector
. Thebucket_selector
is only employed when the user has unchecked "Alert me if a group stops reporting data". If a user doesn't need to track missing groups, they will get a performance boost because the query only returns the groups that match the conditions. For high cardinality datasets, this will significantly reduce the load on the alerting framework due to tracking missing groups and sending notifications for them.Here is a sample query with the rate aggregation with a group by on
host.name
and theAlert me if a group stops reporting data
is unchecked:There is a caveat with this approach, when there is "no data" for the time range and we are using a document count, the
shouldTrigger
andshouldWarn
bucket scripts will be missing. For "non group by" queries, this means we need to treat the document count as ZERO and the evaluation must be done in Kibana in case the user hasdoc_count < 1
ordoc_count == 0
for the condition. Fortunately, the performance cost is non-existent in this scenario since we are only looking at a single bucket.This PR also includes a change to the way we report missing groups in a document count condition. Prior to this PR, we would backfill missing groups with ZERO for document count rules and NULL for aggregated metrics. This is actually a bug because the user asked "Alert me if a group stops reporting data". When we backfill with ZERO but the condition is
doc_count > 1
the user would not get any notification for the missing groups. With this change, we trigger a NO DATA alert regardless of the condition or metric for missing groups which matches the intent of "Alert me if a group stops reporting data" option.This PR also removes the "Drop Partial Buckets" functionality since we've moved away from using the
date_histogram
for rate aggregations.