Adding Anomaly Detection Functionality #43

Open
ToriBench opened this issue Aug 15, 2022 · 0 comments


Describe the Feature

CloudWatch alarms support anomaly detection, but the aws_cloudwatch_metric_alarm resource in this module does not yet expose it. This request is for an option to add anomaly detection to a metric alarm.
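
For context, at the provider level an anomaly detection alarm is expressed with `metric_query` blocks and `threshold_metric_id` rather than a static `threshold`. The following is a minimal sketch of that pattern; the resource name, alarm name, and instance ID are placeholders, not values from this module:

```hcl
# Sketch only: the names and the instance ID below are placeholders.
resource "aws_cloudwatch_metric_alarm" "cpu_anomaly" {
  alarm_name          = "cpu-anomaly-example"
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 2

  # Points at the metric_query that produces the anomaly band;
  # used in place of a static `threshold`.
  threshold_metric_id = "e1"

  # Expected-value band computed from the metric with id "m1".
  metric_query {
    id          = "e1"
    expression  = "ANOMALY_DETECTION_BAND(m1)"
    label       = "CPUUtilization (Expected)"
    return_data = true
  }

  # The raw metric that the band is modelled on.
  metric_query {
    id          = "m1"
    return_data = true

    metric {
      metric_name = "CPUUtilization"
      namespace   = "AWS/EC2"
      period      = 120
      stat        = "Average"

      dimensions = {
        InstanceId = "i-xxxxxxxxxxxxx"
      }
    }
  }
}
```

The alarm fires when the metric rises above the upper edge of the band, so no absolute value has to be chosen up front.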

Expected Behavior

The Clouddrove module has this functionality. Its usage is as follows:

Basic Example

    module "alarm" {
      source                    = "clouddrove/cloudwatch-alarms/aws"
      version                   = "1.0.1"
      name                      = "alarm"
      environment               = "test"
      label_order               = ["name", "environment"]
      alarm_name                = "cpu-alarm"
      comparison_operator       = "LessThanThreshold"
      evaluation_periods        = 2
      metric_name               = "CPUUtilization"
      namespace                 = "AWS/EC2"
      period                    = "60"
      statistic                 = "Average"
      threshold                 = "40"
      alarm_description         = "This metric monitors ec2 cpu utilization"
      alarm_actions             = ["arn:aws:sns:eu-west-1:xxxxxxxxxxx:test"]
      actions_enabled           = true
      insufficient_data_actions = []
      ok_actions                = []
      dimensions                = {
        InstanceId = "i-xxxxxxxxxxxxx"
      }
    }

Anomaly Example

    module "alarm" {
      source                    = "clouddrove/cloudwatch-alarms/aws"
      version                   = "1.0.1"
      name                      = "alarm"
      environment               = "test"
      label_order               = ["name", "environment"]
      alarm_name                = "cpu-alarm"
      comparison_operator       = "GreaterThanUpperThreshold"
      evaluation_periods        = 2
      threshold_metric_id       = "e1"
      query_expressions         = [{
        id          = "e1"
        expression  = "ANOMALY_DETECTION_BAND(m1)"
        label       = "CPUUtilization (Expected)"
        return_data = "true"
      }]
      query_metrics             = [{
        id          = "m1"
        return_data = "true"
        metric_name = "CPUUtilization"
        namespace   = "AWS/EC2"
        period      = "120"
        stat        = "Average"
        unit        = "Count"
        dimensions  = {
          InstanceId = module.ec2.instance_id[0]
        }
      }]
      alarm_description         = "This metric monitors ec2 cpu utilization"
      alarm_actions             = []
      actions_enabled           = true
      insufficient_data_actions = []
      ok_actions                = []
    }

Expression Example

    module "alarm" {
      source                    = "clouddrove/cloudwatch-alarms/aws"
      version                   = "1.0.1"
      name                      = "alarm"
      environment               = "test"
      label_order               = ["name", "environment"]
      expression_enabled        = true
      alarm_name                = "cpu-alarm"
      comparison_operator       = "GreaterThanUpperThreshold"
      evaluation_periods        = 2
      threshold                 = 40
      query_expressions         = [{
        id          = "e1"
        expression  = "ANOMALY_DETECTION_BAND(m1)"
        label       = "CPUUtilization (Expected)"
        return_data = "true"
      }]
      query_metrics             = [{
        id          = "m1"
        return_data = "true"
        metric_name = "CPUUtilization"
        namespace   = "AWS/EC2"
        period      = "120"
        stat        = "Average"
        unit        = "Count"
        dimensions  = {
          InstanceId = module.ec2.instance_id[0]
        }
      }]
      alarm_description         = "This metric monitors ec2 cpu utilization"
      alarm_actions             = []
      actions_enabled           = true
      insufficient_data_actions = []
      ok_actions                = []
    }

Use Case

We want to add anomaly detection to our CIS benchmark alarms because we are often spammed with email alerts about insignificant issues, and we end up not reading any of the hundreds of alert emails we receive.

Describe Ideal Solution

I am not super familiar with Terraform, but I would expect the solution in alarms.tf to look something like this:

resource "aws_cloudwatch_metric_alarm" "default" {
  for_each            = local.enabled ? var.metrics : {}
  alarm_name          = each.value.alarm_name
  comparison_operator = each.value.alarm_comparison_operator
  evaluation_periods  = each.value.alarm_evaluation_periods
  metric_name         = each.value.metric_name
  namespace           = each.value.metric_namespace
  period              = each.value.alarm_period
  statistic           = each.value.alarm_statistic
  treat_missing_data  = each.value.alarm_treat_missing_data
  threshold           = each.value.alarm_threshold
  alarm_description   = each.value.alarm_description
  alarm_actions       = local.endpoints
  tags                = module.this.tags

  dynamic "metric_query" {
    for_each = var.query_expressions
    content {
      id          = metric_query.value.id
      expression  = metric_query.value.expression
      label       = metric_query.value.label
      return_data = metric_query.value.return_data
    }
  }

  dynamic "metric_query" {
    for_each = var.query_metrics
    content {
      id          = metric_query.value.id
      return_data = metric_query.value.return_data
      metric {
        metric_name = metric_query.value.metric_name
        namespace   = metric_query.value.namespace
        period      = metric_query.value.period
        stat        = metric_query.value.stat
        unit        = metric_query.value.unit

        dimensions = metric_query.value.dimensions
      }
    }
  }
}
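
One caveat with the sketch above: as far as I understand the AWS provider, the single-metric arguments (`metric_name`, `namespace`, `period`, `statistic`) conflict with `metric_query`, and `threshold` conflicts with `threshold_metric_id`. A possible way around this is to null out the static arguments whenever queries are supplied. This is only a sketch under assumptions: `var.threshold_metric_id` and `local.use_queries` are hypothetical names, and `var.query_expressions` and `var.query_metrics` are assumed to be declared as lists of objects defaulting to `[]`:

```hcl
# Hypothetical input: id of the metric_query that yields the anomaly band.
variable "threshold_metric_id" {
  type    = string
  default = null
}

locals {
  # True when the caller supplied at least one query (hypothetical helper).
  use_queries = length(var.query_expressions) + length(var.query_metrics) > 0
}

resource "aws_cloudwatch_metric_alarm" "default" {
  for_each            = local.enabled ? var.metrics : {}
  alarm_name          = each.value.alarm_name
  comparison_operator = each.value.alarm_comparison_operator
  evaluation_periods  = each.value.alarm_evaluation_periods
  treat_missing_data  = each.value.alarm_treat_missing_data
  alarm_description   = each.value.alarm_description
  alarm_actions       = local.endpoints
  tags                = module.this.tags

  # Static single-metric arguments are set only when no queries are given.
  metric_name = local.use_queries ? null : each.value.metric_name
  namespace   = local.use_queries ? null : each.value.metric_namespace
  period      = local.use_queries ? null : each.value.alarm_period
  statistic   = local.use_queries ? null : each.value.alarm_statistic

  # Exactly one of threshold / threshold_metric_id ends up non-null.
  threshold           = var.threshold_metric_id == null ? each.value.alarm_threshold : null
  threshold_metric_id = var.threshold_metric_id

  dynamic "metric_query" {
    for_each = var.query_expressions
    content {
      id          = metric_query.value.id
      expression  = metric_query.value.expression
      label       = metric_query.value.label
      return_data = metric_query.value.return_data
    }
  }

  dynamic "metric_query" {
    for_each = var.query_metrics
    content {
      id          = metric_query.value.id
      return_data = metric_query.value.return_data
      metric {
        metric_name = metric_query.value.metric_name
        namespace   = metric_query.value.namespace
        period      = metric_query.value.period
        stat        = metric_query.value.stat
        unit        = metric_query.value.unit
        dimensions  = metric_query.value.dimensions
      }
    }
  }
}
```

With both query variables left at `[]` and `threshold_metric_id` left at `null`, this should behave like a plain static-threshold alarm, so existing users would not need to change anything. Note that the same queries would be applied to every entry in `var.metrics`; a real implementation would probably move them into each metric's definition.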

Alternatives Considered

We have thought about adjusting each metric's static threshold value, but we have multiple accounts that require different thresholds, and these thresholds are expected to change over time. Anomaly detection would let us flag unusual alert activity dynamically in each environment.
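
With anomaly detection, the per-account tuning knob becomes the width of the band rather than an absolute value. `ANOMALY_DETECTION_BAND` takes an optional second argument, the number of standard deviations, which is 2 when omitted. A minimal sketch of exposing that as an input follows; `anomaly_band_width` and `anomaly_expression` are hypothetical names, and `m1` is assumed to be the id of the metric query being modelled:

```hcl
# Hypothetical variable: width of the anomaly band in standard deviations.
variable "anomaly_band_width" {
  type        = number
  default     = 2
  description = "Standard deviations used for the anomaly detection band."
}

locals {
  # Builds the metric math expression, e.g. "ANOMALY_DETECTION_BAND(m1, 2)".
  anomaly_expression = "ANOMALY_DETECTION_BAND(m1, ${var.anomaly_band_width})"
}

output "anomaly_expression" {
  value = local.anomaly_expression
}
```

A noisier account could then set a wider band without anyone having to maintain absolute threshold values per environment.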

Additional Context

We wanted to use CloudWatch alarm anomaly detection for CIS benchmark alerts across several accounts, but found that we cannot add it in our infrastructure code because this module does not expose the necessary variables. Clouddrove seems to have the functionality we are after, but we prefer CloudPosse because of its listed security standard compliance and its good documentation. We hope this can be addressed in the future, because anomaly detection is a useful feature for CloudWatch monitoring.
