
[Infrastructure UI][Rules] Refactor Metric Threshold rule to push evaluations to Elasticsearch #126214

Merged · 32 commits · May 17, 2022

Conversation

@simianhacker (Member) commented Feb 22, 2022

Summary

This PR pushes ALL the processing down to Elasticsearch, including the group-by tracking. With this PR, instead of gathering all the groupings, we only need to detect the groups that have gone missing or appeared between the previous run and the current run. We do this by extending the time frame of the query to cover both the previous run and the current run. Then we create two buckets that represent each period (previousPeriod and currentPeriod) and compare the document counts for each group to determine whether the group has gone missing, has returned, or is new. If a group has gone missing, we track it in the state. Once the group re-appears, we remove it from the rule state. If the group is new but hasn't triggered the conditions, we ignore it.
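
As a rough sketch of that comparison (the group-by field, window length, and aggregation names here are assumptions for illustration, not the exact query this PR builds):

// Hypothetical sketch of the missing/new group detection; field names and
// window boundaries are illustrative only.
const intervalMs = 5 * 60 * 1000; // assumed rule execution window
const now = Date.now();

const groupTrackingAggs = {
  groupings: {
    composite: {
      size: 10000,
      sources: [{ groupBy0: { terms: { field: 'host.name' } } }],
    },
    aggs: {
      // Documents from the previous execution window.
      previousPeriod: {
        filter: {
          range: {
            '@timestamp': {
              gte: now - 2 * intervalMs,
              lt: now - intervalMs,
              format: 'epoch_millis',
            },
          },
        },
      },
      // Documents from the current execution window.
      currentPeriod: {
        filter: {
          range: {
            '@timestamp': {
              gte: now - intervalMs,
              lte: now,
              format: 'epoch_millis',
            },
          },
        },
      },
    },
  },
};

// A group with previousPeriod.doc_count > 0 and currentPeriod.doc_count === 0
// has gone missing and is tracked in the rule state; the reverse means it has
// returned (or is new) and is removed from the state.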

This PR also closes #118820 by refactoring the rate aggregations to use two filter buckets with range filters and a bucket_script to calculate the derivative, instead of using a date_histogram.

Along with this change, I also refactored the evaluations to happen inside Elasticsearch instead of in Kibana. This is done using a combination of bucket_script aggregations and a bucket_selector. The bucket_selector is only employed when the user has unchecked "Alert me if a group stops reporting data". If a user doesn't need to track missing groups, they get a performance boost because the query only returns the groups that match the conditions. For high-cardinality datasets, this significantly reduces the load on the alerting framework that would otherwise come from tracking missing groups and sending notifications for them.

Here is a sample query using the rate aggregation with a group by on host.name, with "Alert me if a group stops reporting data" unchecked:

{
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": 1644959592366,
              "lte": 1644959892366,
              "format": "epoch_millis"
            }
          }
        },
        {
          "exists": {
            "field": "system.network.in.bytes"
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "groupings": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "groupBy0": {
              "terms": {
                "field": "host.name"
              }
            }
          }
        ]
      },
      "aggs": {
        "aggregatedValue_first_bucket": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": 1644959592366,
                "lt": 1644959742366,
                "format": "epoch_millis"
              }
            }
          },
          "aggs": {
            "maxValue": {
              "max": {
                "field": "system.network.in.bytes"
              }
            }
          }
        },
        "aggregatedValue_second_bucket": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": 1644959742366,
                "lt": 1644959892366,
                "format": "epoch_millis"
              }
            }
          },
          "aggs": {
            "maxValue": {
              "max": {
                "field": "system.network.in.bytes"
              }
            }
          }
        },
        "aggregatedValue": {
          "bucket_script": {
            "buckets_path": {
              "first": "aggregatedValue_first_bucket.maxValue",
              "second": "aggregatedValue_second_bucket.maxValue"
            },
            "script": "params.second > 0.0 && params.first > 0.0 && params.second > params.first ? (params.second - params.first) / 150: null"
          }
        },
        "shouldWarn": {
          "bucket_script": {
            "buckets_path": {},
            "script": "0"
          }
        },
        "shouldTrigger": {
          "bucket_script": {
            "buckets_path": {
              "value": "aggregatedValue"
            },
            "script": "params.value > 150000 ? 1 : 0"
          }
        },
        "selectedBucket": {
          "bucket_selector": {
            "buckets_path": {
              "shouldWarn": "shouldWarn",
              "shouldTrigger": "shouldTrigger"
            },
            "script": "params.shouldWarn > 0 || params.shouldTrigger > 0"
          }
        }
      }
    }
  }
}

There is a caveat with this approach: when there is "no data" for the time range and the condition uses a document count, the shouldTrigger and shouldWarn bucket scripts will be missing. For "non group by" queries, this means we need to treat the document count as ZERO and do the evaluation in Kibana, in case the user's condition is doc_count < 1 or doc_count == 0. Fortunately, the performance cost is non-existent in this scenario since we are only looking at a single bucket.
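
A minimal sketch of that fallback (the function, parameter names, and response shape below are assumptions for illustration, not the PR's actual code):

// Hedged sketch: when no buckets come back for an ungrouped document count
// query, treat the count as zero and evaluate the condition in Kibana.
interface UngroupedDocCountResponse {
  aggregations?: {
    shouldTrigger?: { value: number };
  };
}

type Comparator = '<' | '<=' | '>' | '>=';

function evaluateUngroupedDocCount(
  response: UngroupedDocCountResponse,
  comparator: Comparator,
  threshold: number
): boolean {
  const shouldTriggerAgg = response.aggregations?.shouldTrigger;
  if (shouldTriggerAgg) {
    // Elasticsearch already evaluated the condition via the bucket_script.
    return shouldTriggerAgg.value > 0;
  }
  // No data: the bucket scripts are missing, so treat the document count as
  // ZERO and evaluate here so conditions like doc_count < 1 still trigger.
  const docCount = 0;
  switch (comparator) {
    case '<':
      return docCount < threshold;
    case '<=':
      return docCount <= threshold;
    case '>':
      return docCount > threshold;
    default:
      return docCount >= threshold;
  }
}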

This PR also includes a change to the way we report missing groups for a document count condition. Prior to this PR, we would backfill missing groups with ZERO for document count rules and NULL for aggregated metrics. This is actually a bug, because the user asked to "Alert me if a group stops reporting data". When we backfill with ZERO but the condition is doc_count > 1, the user would not get any notification for the missing groups. With this change, we trigger a NO DATA alert for missing groups regardless of the condition or metric, which matches the intent of the "Alert me if a group stops reporting data" option.
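
A minimal sketch of the new behavior (the result shape and names are illustrative, not the PR's exact types):

// Instead of backfilling ZERO / NULL and re-evaluating the condition, missing
// groups are reported as NO DATA directly.
interface MissingGroupResult {
  group: string;
  isNoData: boolean;
  shouldTrigger: boolean;
}

function resultsForMissingGroups(missingGroups: string[]): MissingGroupResult[] {
  return missingGroups.map((group) => ({
    group,
    // NO DATA fires regardless of the configured condition or metric,
    // matching the intent of "Alert me if a group stops reporting data".
    isNoData: true,
    shouldTrigger: false,
  }));
}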

This PR also removes the "Drop Partial Buckets" functionality since we've moved away from using the date_histogram for rate aggregations.

@stevedodson (Contributor)

retest

@stevedodson (Contributor)

@elasticmachine merge upstream

@kibanamachine (Contributor)

merge conflict between base and head

@stevedodson (Contributor)

retest

1 similar comment
@stevedodson (Contributor)

retest

@simianhacker simianhacker force-pushed the issue-118820-refactor-group-by branch from 0a71ac3 to 7317156 Compare March 1, 2022 20:39
@simianhacker (Member, Author)

@stevedodson I'm not very familiar with the mechanics of ci:cloud-deploy. Do ALL the tests need to be passing for that to work? If so, I'll probably need another day or so to sort that out, since this is a pretty big change.

@stevedodson (Contributor) commented Mar 2, 2022

@simianhacker - as long as 'Build and Deploy to Cloud' succeeds, the tests don't need to pass. I've now got this PR running in cloud. Thank you!

@simianhacker simianhacker requested a review from a team March 9, 2022 17:54
@simianhacker simianhacker added Feature:Metrics UI Metrics UI feature Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.2.0 release_note:fix labels Mar 9, 2022
@@ -177,15 +179,6 @@ export const createMetricThresholdExecutor = (libs: InfraBackendLibs) =>
})
)
.join('\n');
/*
* Custom recovery actions aren't yet available in the alerting framework
* Uncomment the code below once they've been implemented
@mgiota (Contributor) commented Mar 21, 2022

@simianhacker I see you removed the commented code for recovery actions. Do we have a ticket to implement custom recovery actions and build a recovered alert reason? I don't want us to forget to implement this, now that we removed this comment.


@@ -212,20 +205,16 @@ export const createMetricThresholdExecutor = (libs: InfraBackendLibs) =>
.filter((result) => result[group].isNoData)
.map((result) => buildNoDataAlertReason({ ...result[group], group }))
.join('\n');
} else if (nextState === AlertStates.ERROR) {
Contributor

@simianhacker Don't we have an error state anymore?

@simianhacker (Member, Author) commented Mar 28, 2022

Not in the same sense. The error condition that existed in the old implementation no longer exists in the new methodology. Errors from this point forward are going to be exceptions caught by the framework.

const actionGroupId =
nextState === AlertStates.OK
? RecoveredActionGroup.id
: nextState === AlertStates.NO_DATA
Contributor

@simianhacker AlertStates actually refers to RulesState, right? The naming confuses me. Refactoring alerts to rules is probably out of scope for this PR. Shall I create another ticket to refactor the incorrect uses of alerts to rules?

@simianhacker (Member, Author)

Yes, we should create a new ticket for standardizing variable names.

@mgiota (Contributor) commented Mar 21, 2022

@simianhacker I did a bit of testing and alerts got triggered fine. I created a rule to alert me if a group stops reporting data, stopped metricbeat, and successfully got the following alert:

[Screenshot 2022-03-21 at 21 19 04]

Then I started metricbeat again and began getting the following alert:

[Screenshot 2022-03-21 at 21 21 08]

What I was wondering, though, is whether we should have an extra recovered alert for the group that started reporting data again. The currently generated alerts are most probably fine; this is just a thought I'm putting here for possible consideration.

On another note, I found another bug where the Last updated value in the flyout is wrong. Instead of showing the last updated value, it shows when the alert was started (you can see the bug in the two screenshots I posted above: both have the same value, whereas they shouldn't). I'll create another issue for this.

@tylersmalley (Contributor) commented Mar 23, 2022

A heads up that we're seeing quite a few restarts on this Cloud instance due to it running out of memory. I haven't looked through the changes to see if they could be the cause or if it's unrelated, but I wanted to raise it.

A node in your cluster 4d45dd6050a1423aa6617eddf193f7dd (kibana-pr-126214) ran out of
memory at 05:42 on March 23, 2022. As a precaution, we automatically restarted the node
instance-0000000000.

@simianhacker simianhacker changed the title [Infrastructure UI] Refactor Metric Threshold rule to push evaluations to Elasticsearch [Infrastructure UI][Rules] Refactor Metric Threshold rule to push evaluations to Elasticsearch Apr 29, 2022
@jasonrhodes (Member)

@simianhacker I added this review to our board as an External Review and it's now in our External Review queue. I bumped it up above the 3 other reviews requested from AO because it sounds more urgent; let me know if that's not the case. We're trying to do one ER at a time to limit how much effect they have on team output.

cc: @smith

@kibana-ci (Collaborator)

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
infra 1002.6KB 1001.8KB -845.0B


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@matschaffer (Contributor) left a comment

A little tough to weigh in on such a big change, but it seems good overall. I tried to exercise it with a local kibana & metricbeat but ended up getting pretty confused by the rules/alerts UI flow in general. I think it was working? Not sure. But I guess that's an issue for another PR :)

[Screen Shot 2022-05-17 at 15 33 31]

@@ -48,6 +57,8 @@ describe("The Metric Threshold Alert's getElasticsearchMetricQuery", () => {
expressionParams,
timeframe,
100,
true,
void 0,
Contributor

TIL on void 0 vs undefined 👍🏻

@simianhacker simianhacker merged commit aa3ace8 into elastic:main May 17, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 17, 2022
@simianhacker simianhacker deleted the issue-118820-refactor-group-by branch April 17, 2024 15:37

Successfully merging this pull request may close these issues.

[Metrics UI] Refactor rate aggregation for Metric Threshold Alerts to eliminate "Drop Partial Buckets"
9 participants