Add AWS ELB unhealthy instances detector #298

Draft pull request, 1 commit, base: master
1 change: 1 addition & 0 deletions docs/severity.md
@@ -212,6 +212,7 @@
|AWS ELB backend 4xx error rate|X|X|-|-|-|
|AWS ELB backend 5xx error rate|X|X|-|-|-|
|AWS ELB backend latency|X|X|-|-|-|
|AWS ELB unhealthy instances|X|-|-|-|-|


## aws-kinesis-firehose
3 changes: 2 additions & 1 deletion modules/integration_aws-elb/README.md
@@ -57,7 +57,7 @@ Note the following parameters:

These 3 parameters along with all variables defined in [common-variables.tf](common-variables.tf) are common to all
[modules](../) in this repository. Other variables, specific to this module, are available in
-[variables.tf](variables.tf).
+[variables.tf](variables.tf) and [variables-gen.tf](variables-gen.tf).
In general, the default configuration "works" but all of these Terraform
[variables](https://www.terraform.io/docs/configuration/variables.html) make it possible to
customize the detectors behavior to better fit your needs.
@@ -82,6 +82,7 @@ This module creates the following SignalFx detectors which could contain one or
|AWS ELB backend 4xx error rate|X|X|-|-|-|
|AWS ELB backend 5xx error rate|X|X|-|-|-|
|AWS ELB backend latency|X|X|-|-|-|
|AWS ELB unhealthy instances|X|-|-|-|-|

## How to collect required metrics?

12 changes: 12 additions & 0 deletions modules/integration_aws-elb/conf/01-unhealthy-instances.yaml
@@ -0,0 +1,12 @@
module: "AWS ELB"
name: "unhealthy instances"
id: unhealthy_instances_absolute
transformation: ".min(over='10m')"
signals:
  signal:
    metric: UnHealthyHostCount
    filter: "filter('namespace', 'AWS/ELB') and filter('stat', 'upper') and (not filter('AvailabilityZone', '*'))"
rules:
  critical:
    threshold: 1
    comparator: ">="
25 changes: 25 additions & 0 deletions modules/integration_aws-elb/detectors-gen.tf
@@ -0,0 +1,25 @@
resource "signalfx_detector" "unhealthy_instances_absolute" {
  name = format("%s %s", local.detector_name_prefix, "AWS ELB unhealthy instances")

  authorized_writer_teams = var.authorized_writer_teams
  teams                   = try(coalescelist(var.teams, var.authorized_writer_teams), null)
  tags                    = compact(concat(local.common_tags, local.tags, var.extra_tags))

  program_text = <<-EOF
    signal = data('UnHealthyHostCount', filter=filter('namespace', 'AWS/ELB') and filter('stat', 'upper') and (not filter('AvailabilityZone', '*')) and ${module.filtering.signalflow})${var.unhealthy_instances_absolute_aggregation_function}${var.unhealthy_instances_absolute_transformation_function}.publish('signal')
    detect(when(signal >= ${var.unhealthy_instances_absolute_threshold_critical})).publish('CRIT')
EOF

  rule {
    description           = "is too high >= ${var.unhealthy_instances_absolute_threshold_critical}"
    severity              = "Critical"
    detect_label          = "CRIT"
    disabled              = coalesce(var.unhealthy_instances_absolute_disabled, var.detectors_disabled)
    notifications         = coalescelist(lookup(var.unhealthy_instances_absolute_notifications, "critical", []), var.notifications.critical)
    runbook_url           = try(coalesce(var.unhealthy_instances_absolute_runbook_url, var.runbook_url), "")
    tip                   = var.unhealthy_instances_absolute_tip
    parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
    parameterized_body    = var.message_body == "" ? local.rule_body : var.message_body
  }
}

5 changes: 5 additions & 0 deletions modules/integration_aws-elb/outputs.tf
@@ -33,3 +33,8 @@ output "no_healthy_instances" {
  value = signalfx_detector.no_healthy_instances
}

output "unhealthy_instances_absolute" {
  description = "Detector resource for unhealthy_instances_absolute"
  value       = signalfx_detector.unhealthy_instances_absolute
}
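
For reviewers, a minimal sketch of how a calling configuration might consume the new output. The module label `aws_elb_detectors` and the output name are hypothetical; `id` is the standard identifier attribute every Terraform resource exposes.

```hcl
# Hypothetical root-module output; assumes a module "aws_elb_detectors" block
# that instantiates modules/integration_aws-elb.
output "elb_unhealthy_instances_detector_id" {
  description = "ID of the AWS ELB unhealthy instances detector"
  value       = module.aws_elb_detectors.unhealthy_instances_absolute.id
}
```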

44 changes: 44 additions & 0 deletions modules/integration_aws-elb/variables-gen.tf
@@ -0,0 +1,44 @@
# unhealthy_instances_absolute detector

variable "unhealthy_instances_absolute_notifications" {
  description = "Notification recipients list per severity overridden for unhealthy_instances_absolute detector"
  type        = map(list(string))
  default     = {}
}

variable "unhealthy_instances_absolute_aggregation_function" {
  description = "Aggregation function and group by for unhealthy_instances_absolute detector (i.e. \".mean(by=['host'])\")"
  type        = string
  default     = ""
}

variable "unhealthy_instances_absolute_transformation_function" {
  description = "Transformation function for unhealthy_instances_absolute detector (i.e. \".mean(over='5m')\")"
  type        = string
  default     = ".min(over='10m')"
}

variable "unhealthy_instances_absolute_tip" {
  description = "Suggested first course of action or any note useful for incident handling"
  type        = string
  default     = ""
}

variable "unhealthy_instances_absolute_runbook_url" {
  description = "URL like SignalFx dashboard or wiki page which can help to troubleshoot the incident cause"
  type        = string
  default     = ""
}

variable "unhealthy_instances_absolute_disabled" {
  description = "Disable all alerting rules for unhealthy_instances_absolute detector"
  type        = bool
  default     = null
}

variable "unhealthy_instances_absolute_threshold_critical" {
  description = "Critical threshold for unhealthy_instances_absolute detector"
  type        = number
  default     = 1
}
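
A rough sketch of how these variables might be overridden from a calling configuration, in case the defaults (threshold `1`, `.min(over='10m')`) turn out to be too sensitive. The module label, the relative `source` path, and the tip text are assumptions for illustration only. The required common variables from common-variables.tf are deliberately left out.

```hcl
module "aws_elb_detectors" {
  # Hypothetical local path; point this at wherever the module actually lives.
  source = "./modules/integration_aws-elb"

  # Required common variables (see common-variables.tf) are omitted in this sketch.

  # Fire only when at least 2 hosts stayed unhealthy for a full 15 minutes:
  # the minimum over the window must reach the threshold.
  unhealthy_instances_absolute_threshold_critical      = 2
  unhealthy_instances_absolute_transformation_function = ".min(over='15m')"

  # Optional hint shown with the alert (illustrative wording).
  unhealthy_instances_absolute_tip = "Check the health check configuration of the affected load balancer."
}
```

Because the transformation is a rolling minimum, widening the window makes the detector slower to fire but less prone to flapping on short health check blips.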