Starting Kubernetes Observability through basic SLIs/SLOs #4997

Closed
gizas opened this issue Jan 13, 2023 · 19 comments
@gizas
Contributor

gizas commented Jan 13, 2023

Context

Observability includes much more than the collection of metrics and logs. We should provide our users with a holistic approach in which functionalities like alerting and SLO definition are part of the Kubernetes observability experience. In more detail, when users onboard a new Kubernetes cluster, we should help them configure specific Alerts and SLOs and link those with specific instances of the collected data.

Action

As part of this story we should:

  • List representative categories of SLIs/SLOs for Kubernetes
  • List basic starting alerts for the categories defined in the previous step
  • Provide Config-as-Code samples for applying the above suggestions
  • Open new enhancement stories (for Kibana or the Actionable-Observability team, in case basic functionality is missing) where the UX needs improvement. The main goal is to link Alerting and SLOs with the data collected by the Kubernetes integration

Useful Links

SLO API: https://github.com/kdelemme/kibana/blob/main/x-pack/plugins/observability/dev_docs/slo.md#L0-L1
Alert API: https://www.elastic.co/guide/en/kibana/current/alerting-apis.html

Deliverables

  • Document categories of SLIs and SLOs for Kubernetes
  • Configuration files for Alerts and SLOs
  • Any feature stories for UX enhancement: (Add alerts/watchers as assets in packages package-spec#484)
  • Demoing the e2e experience and probably updating relevant blogs for Kubernetes Observability through SLIs/SLOs

TODOs

  1. Investigate and collect categories of SLIs and SLOs for Kubernetes ✅
  2. Investigate what alerts can be created based on the currently collected fields
    a. collect which metrics need to be collected/improved ✅
    b. Create snippets to install these alerts through the API ✅
  3. Document the usage of these alerts and find a proper place for official documentation ✅
    a. PR to add them in k8s package docs: Add documentation for k8s alerts installation #5364
    b. PR to add them in automation repo: Add k8s monitoring watchers k8s-integration-infra#16

Extras
5. Investigate how alerts can be part of the package: ✅ Issue created -> elastic/package-spec#484
6. Demo/blogpost: https://github.com/elastic/blog/issues/1869
7. Any follow up documentation.

@gizas added the Team:Cloudnative-Monitoring label on Jan 13, 2023
@ChrsMark self-assigned this on Feb 2, 2023
@ChrsMark
Member

One basic SLI/SLO should be about leadership switches.

Kubernetes native components, including cluster-autoscaler, kube-controller-manager, and kube-scheduler, use the leader-with-lease mechanism from client-go.

What we want to monitor here is:

  1. (SLI_a) Leaderless: when there is no leader (we need to define the SLI; this would indicate a critical error)
  2. (SLI_b) Time of leadership switch (we need to define the SLI; this would be a warning)

Being without a leader for a period of time is quite critical for a production Kubernetes cluster, hence we need to define proper SLIs/SLOs based on these observations.

This information can be retrieved from the kube-state-metrics Service and looks like the following:

# HELP kube_lease_owner Information about the Lease's owner.
# TYPE kube_lease_owner gauge
kube_lease_owner{lease="kind-control-plane",owner_kind="Node",owner_name="kind-control-plane"} 1
# HELP kube_lease_renew_time Kube lease renew time.
# TYPE kube_lease_renew_time gauge
kube_lease_renew_time{lease="kind-control-plane"} 1.676268601e+09

SLO_a: kube_lease_owner should not be equal to zero for more than 30 seconds. That should indicate a CRITICAL error.
SLO_b: avg(kube_lease_renew_time) should not be greater than 0.5s over the last 10 minutes. That should indicate a WARNING.

At the moment we don't have a specific metricset/data_stream that collects this information from kube-state-metrics. However, we can create a PoC by leveraging the Prometheus collector package.
@gizas what do you think? Shall we create a separate issue for implementing the metricset to collect from kube-state-metrics, or can we just use the Prometheus one?
I would be ok with doing both: creating an issue for the metricset implementation and experimenting with the Prometheus package in the meantime.
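
For a quick look at what kube-state-metrics exposes for leases, one can port-forward its Service and filter the output. This is a minimal sketch; the namespace, Service name, and port are assumptions based on a default kube-state-metrics deployment and may need adjusting:

# Assumes kube-state-metrics runs in kube-system and listens on port 8080 (adjust to your deployment)
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep -E '^kube_lease_(owner|renew_time)'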

@gizas
Contributor Author

gizas commented Feb 13, 2023

Leader SLIs are something we need, I agree. We have seen this in an SDH recently.

But before discussing specific SLIs, I would suggest trying to cover the big categories of SLIs as described by Google here: https://sre.google/workbook/implementing-slos/
Let's write a doc that suggests and specifies indicators for each category. This way we will be specific and cover the different types (availability, quality, freshness, correctness, coverage, durability). Can we have 1-2 indicators per category?

Now to your question:

  • Let's initially create indicators for what we can support now
  • For any other cases, agreed, let's create tickets and mention them here.

or we can just use the Prometheus one?

Does this mean that we will force users to install the Prometheus integration if they want to have alerts, until we implement something? If yes, I don't like it so much.

@ChrsMark
Member

As a first step I will focus on identifying the high level categories/groups of SLIs/SLOs.

One good approach is to follow the Four Golden Signals used by Google SRE: Latency, Traffic, Errors and Saturation.

A more specific approach is the RED method developed by Tom Wilkie, a former Google SRE who used the Four Golden Signals. The RED method drops saturation because it is meant for more advanced cases, and people remember things that come in threes better :-). However, I would personally be in favor of adding the saturation/utilization aspect into the game.

In summary I would propose the following groups of SLIs/SLOs:
Group_1: latency (of control plane components, like the apiserver, controller-manager, etc.)
Group_2: resource utilisation (how much CPU, memory, etc. is used on nodes)
Group_3: errors (errors in logs or events, or error counts from components, network, etc.)

If we agree on this first step as a base, I will then go ahead with defining specific SLIs/SLOs per group based on the metrics/data that we currently collect for the k8s cluster. @gizas what do you think about this plan?

@gizas
Contributor Author

gizas commented Feb 15, 2023

Yes of course Chris. I think we are aligned; that is why I had added the 3rd link above in the description, to guide the assignee of the story to group the SLOs and justify the choices.

However I would personally be in favor of adding the saturation/utilization aspect into the game.

You suggest this and yet you propose 3 groups :) I got confused.

I would suggest (see examples here: https://sysdig.com/blog/golden-signals-kubernetes/):
Group_4: saturation (CPU usage, disk space, memory usage, percentage of pods on Nodes, etc.)

Just a reminder that configuring those through the API is part of the story, so let's think about how we should save this information.
Maybe in a new folder called alerts inside the package? YAML or JSON to support the API calls? (Script: this was used to transform to JSON in previous relevant API work.)
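
If the alerts end up being authored in YAML, converting them into the JSON bodies the API expects could be as simple as the following sketch (assuming the mikefarah yq v4 CLI; the alert file name is a hypothetical placeholder):

# Convert a YAML-authored alert into the JSON body expected by the Watcher API
yq -o=json '.' apiserver-latency-alert.yaml > apiserver-latency-alert.json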

@ChrsMark
Member

Yes of course Chris. I think we are aligned; that is why I had added the 3rd link above in the description, to guide the assignee of the story to group the SLOs and justify the choices.

However I would personally be in favor of adding the saturation/utilization aspect into the game.

You suggest this and yet you propose 3 groups :) I got confused.

I would suggest (see examples here: https://sysdig.com/blog/golden-signals-kubernetes/): Group_4: saturation (CPU usage, disk space, memory usage, percentage of pods on Nodes, etc.)

You see 3 groups because I grouped latency and traffic into one group. In any case, this is just a conceptual grouping and 3 or 4 doesn't really matter in the end.

Just a reminder that configuring those through the API is part of the story, so let's think about how we should save this information. Maybe in a new folder called alerts inside the package? YAML or JSON to support the API calls? (Script: this was used to transform to JSON in previous relevant API work.)

Yep, that should happen at some point; it's also included in the TODOs of the issue.

@ChrsMark
Member

ChrsMark commented Feb 20, 2023

Heads up on this. After struggling a bit with ES queries and aggregations, I managed to create a first alert using the k8s API request duration. One major issue here is that this metric comes as a histogram from kube-state-metrics (a Prometheus metric), but we do not store it using our ES types, which makes it hard to calculate percentiles or anything else useful.

See how we store these metrics at https://github.com/elastic/beats/blob/b0bbd16ed0064a4ef480ed15c96d0b954f0748a7/metricbeat/module/kubernetes/apiserver/_meta/testdata/docs.plain-expected.json#L12. @gizas we had discussed this in the past with @MichaelKatsoulis and concluded that we may need to add type support for the metricsets which use the Prometheus helper. Storing these fields as ES histograms would then help us query for percentiles and build simpler ES queries. In my case I had to use runtime fields to overcome this.

I took the idea from the Average Apiserver Request Latency per Resource [Metrics Kubernetes] visualization in our k8s dashboard, which calculates the latency in milliseconds as (sum(kubernetes.apiserver.request.duration.us.sum) / sum(kubernetes.apiserver.request.duration.us.count)) / 1000.

That's the very first alert, which I put here as an example. I will continue adding similar ones based on the currently available metrics and will keep updating this comment with the upcoming queries/alerts.

Group_1: latency (of control plane, like apiserver, controllermanager etc)

k8s API request duration alert
PUT _watcher/watch/apiserver-latency
{
"trigger": {
  "schedule": {
    "interval": "10m"
  }
},
"input": {
  "search": {
    "request": {
      "search_type": "query_then_fetch",
      "indices": [
        "*"
      ],
      "rest_total_hits_as_int": true,
      "body": {
        "query": {
          "bool": {
            "must": [
              {
                "range": {
                  "@timestamp": {
                    "gte": "now-10m",
                    "lte": "now",
                    "format": "strict_date_optional_time"
                  }
                }
              },
              {
                "bool": {
                  "must": [
                    {
                      "query_string": {
                        "query": "event.module:kubernetes AND metricset.name: apiserver AND NOT (kubernetes.apiserver.request.verb: WATCH or kubernetes.apiserver.request.verb: CONNECT)",
                        "analyze_wildcard": true
                      }
                    },
                    {
                      "exists": {
                        "field": "kubernetes.apiserver.request.duration"
                      }
                    }
                  ],
                  "filter": [],
                  "should": [],
                  "must_not": []
                }
              }
            ],
            "filter": [],
            "should": [],
            "must_not": []
          }
        },
        "size": 0,
        "runtime_mappings": {
          "avg_duration": {
            "type": "double",
            "script": {
              "source": "emit(doc['kubernetes.apiserver.request.duration.us.sum'].value / doc['kubernetes.apiserver.request.duration.us.count'].value / 1000 )"
            }
          }
        },
        "aggs": {
          "avg_duration": {
            "avg": {
              "field": "avg_duration"
            }
          }
        }
      }
    }
  }
},
"condition": {
  "script": {
    "source": "return ctx.payload.aggregations.avg_duration.value > params.threshold",
    "params": {
      "threshold": 10
    }
  }
},
"actions": {
  "my-logging-action": {
    "logging": {
      "level": "error",
      "text": "The average request duration for k8s API Server is {{ctx.payload.aggregations.avg_duration}} ms. Threshold is 10 ms."
    }
  }
},
"metadata": {
  "xpack": {
    "type": "json"
  }
}
}
k8s Controller manager request duration alert
PUT _watcher/watch/Controller-Manager-Request-Latency
{
"trigger": {
  "schedule": {
    "interval": "10m"
  }
},
"input": {
  "search": {
    "request": {
      "body": {
        "query": {
          "bool": {
            "must": [
              {
                "range": {
                  "@timestamp": {
                    "gte": "now-10m",
                    "lte": "now",
                    "format": "strict_date_optional_time"
                  }
                }
              },
              {
                "bool": {
                  "must": [
                    {
                      "query_string": {
                        "query": "event.module:kubernetes AND metricset.name: controllermanager",
                        "analyze_wildcard": true
                      }
                    },
                    {
                      "exists": {
                        "field": "kubernetes.controllermanager.client.request.duration"
                      }
                    }
                  ],
                  "filter": [],
                  "should": [],
                  "must_not": []
                }
              }
            ],
            "filter": [],
            "should": [],
            "must_not": []
          }
        },
        "size": 0,
        "runtime_mappings": {
          "avg_duration": {
            "type": "double",
            "script": {
              "source": "emit(doc['kubernetes.controllermanager.client.request.duration.us.sum'].value / doc['kubernetes.controllermanager.client.request.duration.us.count'].value / 1000 )"
            }
          }
        },
        "aggs": {
          "avg_duration": {
            "avg": {
              "field": "avg_duration"
            }
          }
        }
      },
      "indices": [
        "*"
      ]
    }
  }
},
"condition": {
  "script": {
    "source": "return ctx.payload.aggregations.avg_duration.value > params.threshold",
    "params": {
      "threshold": 10
    }
  }
},
"actions": {
  "my-logging-action": {
    "logging": {
      "text": "The average request duration for k8s Controller Manager is {{ctx.payload.aggregations.avg_duration}} ms. Threshold is 10."
    }
  }
},
"metadata": {
  "xpack": {
    "type": "json"
  },
  "name": "Controller Manager Request Latency"
}
}

Group_2: resource saturation (disk pressure etc)

k8s Node CPU utilization alert
PUT _watcher/watch/Node-CPU-Usage
{
"trigger": {
  "schedule": {
    "interval": "10m"
  }
},
"input": {
  "search": {
    "request": {
      "body": {
        "size": 0,
        "query": {
          "bool": {
            "must": [
              {
                "range": {
                  "@timestamp": {
                    "gte": "now-10m",
                    "lte": "now",
                    "format": "strict_date_optional_time"
                  }
                }
              },
              {
                "bool": {
                  "must": [
                    {
                      "query_string": {
                        "query": "event.module:kubernetes AND (metricset.name:node OR metricset.name: state_node) ",
                        "analyze_wildcard": true
                      }
                    }
                  ],
                  "filter": [],
                  "should": [],
                  "must_not": []
                }
              }
            ],
            "filter": [],
            "should": [],
            "must_not": []
          }
        },
        "aggs": {
          "nodes": {
            "terms": {
              "field": "kubernetes.node.name",
              "size": "10000",
              "order": {
                "_key": "asc"
              }
            },
            "aggs": {
              "nodeUsage": {
                "max": {
                  "field": "kubernetes.node.cpu.usage.nanocores"
                }
              },
              "nodeCap": {
                "max": {
                  "field": "kubernetes.node.cpu.capacity.cores"
                }
              },
              "nodeCPUUsagePCT": {
                "bucket_script": {
                  "buckets_path": {
                    "nodeUsage": "nodeUsage",
                    "nodeCap": "nodeCap"
                  },
                  "script": {
                    "source": "( params.nodeUsage / 1000000000 ) / params.nodeCap",
                    "lang": "painless",
                    "params": {
                      "_interval": 10000
                    }
                  },
                  "gap_policy": "skip"
                }
              }
            }
          }
        }
      },
      "indices": [
        "metrics-kubernetes*"
      ]
    }
  }
},
"condition": {
  "array_compare": {
    "ctx.payload.aggregations.nodes.buckets": {
      "path": "nodeCPUUsagePCT.value",
      "gte": {
        "value": 80
      }
    }
  }
},
"actions": {
  "log_hits": {
    "foreach": "ctx.payload.aggregations.nodes.buckets",
    "max_iterations": 500,
    "logging": {
      "text": "Kubernetes node found with high CPU usage: {{ctx.payload.key}} -> {{ctx.payload.nodeCPUUsagePCT.value}}"
    }
  }
},
"metadata": {
  "xpack": {
    "type": "json"
  },
  "name": "Node CPU Usage"
}
}
k8s Node Memory Usage alert
PUT _watcher/watch/Node-Memory-Usage
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND (metricset.name:node OR metricset.name: state_node) ",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "size": "10000",
                "order": {
                  "_key": "asc"
                }
              },
              "aggs": {
                "nodeUsage": {
                  "max": {
                    "field": "kubernetes.node.memory.usage.bytes"
                  }
                },
                "nodeCap": {
                  "max": {
                    "field": "kubernetes.node.memory.capacity.bytes"
                  }
                },
                "nodeMemoryUsagePCT": {
                  "bucket_script": {
                    "buckets_path": {
                      "nodeUsage": "nodeUsage",
                      "nodeCap": "nodeCap"
                    },
                    "script": {
                      "source": "( params.nodeUsage ) / params.nodeCap",
                      "lang": "painless",
                      "params": {
                        "_interval": 10000
                      }
                    },
                    "gap_policy": "skip"
                  }
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes*"
        ]
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.nodes.buckets": {
        "path": "nodeMemoryUsagePCT.value",
        "gte": {
          "value": 80
        }
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found with high memory usage: {{ctx.payload.key}} -> {{ctx.payload.nodeMemoryUsagePCT.value}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node Memory Usage"
  }
}
Node unschedulable alert
PUT _watcher/watch/Node-Unschedulable
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_node",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.node.status.unschedulable"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.node.status.unschedulable: true",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.state_node-default"
        ]
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found unschedulable: {{ctx.payload.key}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node Unschedulable"
  }
}
Node not ready alert
PUT _watcher/watch/Node-Not-Ready
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_node",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.node.status.ready"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.node.status.ready: false",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.state_node-default"
        ]
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found not ready: {{ctx.payload.key}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node Not Ready"
  }
}
Node memory_pressure
PUT _watcher/watch/Node-Memory-Pressure
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_node",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.node.status.memory_pressure"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.node.status.memory_pressure: true",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.state_node-default"
        ]
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found on memory pressure: {{ctx.payload.key}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node Memory Pressure"
  }
}
Node disk_pressure
PUT _watcher/watch/Node-Disk_Pressure
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_node",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.node.status.disk_pressure"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.node.status.disk_pressure: true",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.state_node-default"
        ]
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found on disk pressure: {{ctx.payload.key}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node Disk Pressure"
  }
}
Node out_of_disk
PUT _watcher/watch/Node-Out-Of-Disk
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_node",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.node.status.out_of_disk"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.node.status.out_of_disk: true",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.state_node-default"
        ]
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found out of disk: {{ctx.payload.key}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node Out Of Disk"
  }
}
Node pid_pressure
PUT _watcher/watch/Node-Pid-Pressure
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_node",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.node.status.pid_pressure"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.node.status.pid_pressure: true",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.state_node-default"
        ]
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found on pid pressure: {{ctx.payload.key}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node Pid Pressure"
  }
}

Group_3: errors (container OOM etc)

Pod RX Error Rate
PUT _watcher/watch/Pod-RX-Error-Rate
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND (metricset.name:pod) ",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "rx_error_rates": {
              "terms": {
                "field": "kubernetes.pod.name",
                "size": "10000",
                "order": {
                  "_key": "asc"
                }
              },
              "aggs": {
                "by_minute": {
                  "date_histogram": {
                    "field": "@timestamp",
                    "calendar_interval": "minute"
                  },
                  "aggs": {
                    "my_rate": {
                      "rate": {
                        "field": "kubernetes.pod.network.rx.errors",
                        "unit": "minute"
                      }
                    }
                  }
                },
                "avg_minute_rate": {
                  "avg_bucket": {
                    "buckets_path": "by_minute>my_rate",
                    "gap_policy": "skip"
                  }
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.pod-default"
        ]
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.rx_error_rates.buckets": {
        "path": "avg_minute_rate.value",
        "gt": {
          "value": 1
        }
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.rx_error_rates.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes Pod found with high rx error rate: {{ctx.payload.key}} -> {{ctx.payload.avg_minute_rate.value}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Pod RX Error Rate"
  }
}
Pod TX Error Rate
PUT _watcher/watch/Pod-TX-Error-Rate
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND (metricset.name:pod) ",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "tx_error_rates": {
              "terms": {
                "field": "kubernetes.pod.name",
                "size": "10000",
                "order": {
                  "_key": "asc"
                }
              },
              "aggs": {
                "by_minute": {
                  "date_histogram": {
                    "field": "@timestamp",
                    "calendar_interval": "minute"
                  },
                  "aggs": {
                    "my_rate": {
                      "rate": {
                        "field": "kubernetes.pod.network.tx.errors",
                        "unit": "minute"
                      }
                    }
                  }
                },
                "avg_minute_rate": {
                  "avg_bucket": {
                    "buckets_path": "by_minute>my_rate",
                    "gap_policy": "skip"
                  }
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes.pod-default"
        ]
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.tx_error_rates.buckets": {
        "path": "avg_minute_rate.value",
        "gt": {
          "value": 1
        }
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.tx_error_rates.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes Pod found with high tx error rate: {{ctx.payload.key}} -> {{ctx.payload.avg_minute_rate.value}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Pod TX Error Rate"
  }
}
Pod Terminated with Error
PUT _watcher/watch/Pod-Terminated-Error
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_container",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.container.status.last_terminated_reason"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.container.status.last_terminated_reason: Error",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "pods": {
              "terms": {
                "field": "kubernetes.pod.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "*"
        ]
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.pods.buckets": {
        "path": "doc_count",
        "gte": {
          "value": 1
        }
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.pods.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Pod {{ctx.payload.key}} was terminated with status Error"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Pod Terminated Error"
  }
}
Pod Terminated OOMKilled
PUT _watcher/watch/Pod-Terminated-OOMKilled
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes AND metricset.name: state_container",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.container.status.last_terminated_reason"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.container.status.last_terminated_reason: OOMKilled",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "pods": {
              "terms": {
                "field": "kubernetes.pod.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        },
        "indices": [
          "*"
        ]
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.pods.buckets": {
        "path": "doc_count",
        "gte": {
          "value": 1
        }
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.pods.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Pod {{ctx.payload.key}} was terminated with status OOMKilled"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Pod Terminated OOMKilled"
  }
}

Group_4: Cluster Connectivity

Kubernetes Shipped Docs During last minute
PUT _watcher/watch/Kubernetes-Shipped-Docs
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-1m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "event.module:kubernetes",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          }
        },
        "indices": [
          "metrics-kubernetes*"
        ]
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "lte": 10
      }
    }
  },
  "actions": {
    "my-logging-action": {
      "logging": {
        "text": "{{ctx.payload.hits.total}} documents were shiped in your Kubernetes indexes during the last minute. Threshold is 10."
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Kubernetes Shipped Docs"
  }
}

@ChrsMark
Member

ChrsMark commented Feb 23, 2023

@gizas I think we already have a good set of Alerts listed in the above comment. All these are fully functional and ready to be used.

What would be the ideal place to "store" those? Maybe somewhere inside the Kubernetes package? Note that due to the long JSON bodies it would be better to store them in standalone files.

I tried to put them under the kubernetes package and it is possible to store them under the _dev/deploy/alerts subdirectory.
See an example draft PR that illustrates the point https://github.com/elastic/integrations/pull/5364/files.

I also tried to put them under docs folder but elastic-package complains about this with item [alerts] is not allowed in folder [/home/chrismark/go/src/github.com/elastic/integrations/build/packages/kubernetes-1.31.2.zip/docs].

For now it would be ok to store them under this subdir and maybe reference those files from our docs? However, this would be problematic when it comes to version synchronization etc. @jsoriano any thoughts here on how we could approach this better?

@gizas
Contributor Author

gizas commented Feb 23, 2023

Really great progress here Chris!!!!

I reviewed them and don't think we are missing anything important for a first iteration.
Some additional ideas:

  • A missing-agent alert could be a way to identify that a cluster has been disconnected: an alert on a specific agent name that checks the rate of incoming data. I don't know how much noise this would produce, especially if the agent restarts. Do we want something like this?
  • The scenario of no data coming from the agent at all can also be tackled with latency on system metrics; we can assume that we almost always collect system metrics.
  • How about the scenario where we stop receiving logs? Can we also have something for Kubernetes logs?

Now as far as the folder is concerned, I guess _dev/deploy/alerts is fine for now. We could include an extra file there to track the ES and package versions we have tested with. I would suggest opening a new ticket to the elastic-package team to start discussing support for alerts as assets: a new story where alerts become part of a new folder (like dashboards) inside each package.

Reminder that we would like a script to automate the installation of all of this in one run, ideally letting the user choose which alerts to install; see the sketch below.
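
A minimal sketch of such an installer, assuming the watcher bodies are saved as one JSON file per alert in a local alerts/ directory (the directory layout, file names, and credentials here are placeholders, not part of the package yet):

#!/usr/bin/env bash
# Install every watcher JSON found in ./alerts, asking before each one.
ES_URL="${ES_URL:-https://localhost:9200}"
ES_AUTH="${ES_AUTH:-elastic:changeme}"

for f in alerts/*.json; do
  name="$(basename "$f" .json)"
  read -r -p "Install watcher '$name'? [y/N] " answer
  [[ "$answer" == [yY] ]] || continue
  # PUT the watch under the file's base name
  curl -k -u "$ES_AUTH" -X PUT "$ES_URL/_watcher/watch/$name" \
       -H 'Content-Type: application/json' -d @"$f"
  echo
done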

@jsoriano
Member

What would be the ideal place to "store" those? Maybe somewhere inside the Kubernetes package? Note that due to the long JSON bodies it would be better to store them in standalone files.

I tried to put them under the kubernetes package and it is possible to store them under the _dev/deploy/alerts subdirectory. See an example draft PR that illustrates the point https://github.com/elastic/integrations/pull/5364/files.

I also tried to put them under docs folder but elastic-package complains about this with item [alerts] is not allowed in folder [/home/chrismark/go/src/github.com/elastic/integrations/build/packages/kubernetes-1.31.2.zip/docs].

For now it would be ok to store them under this subdir and maybe reference those files from our docs? However, this would be problematic when it comes to version synchronization etc. @jsoriano any thoughts here on how we could approach this better?

Are these alerts intended for documentation purposes only, or do we want them installed when installing the packages?

If they are only used for documentation, they could be included in the docs folder, inside markdown files with more information about how to use them.

If we want them to be installed when installing the package, then we should add them to their own place in the package-spec, and we should have a follow-up task in Kibana to install them when available. This process would start with a new issue in the package-spec repository.

@gizas
Contributor Author

gizas commented Feb 23, 2023

The end goal would be for them to be installed when installing the package!
So we would like to initiate the discussion of treating alerts as assets of a package.

For now, I would suggest including them in the docs in order to proceed with our implementation and following up with the stories as you suggest.

@jsoriano
Member

The end goal would be for them to be installed when installing the package!
So we would like to initiate the discussion of treating alerts as assets of a package.

Perfect, please create an issue in the package-spec repository to discuss this.

@ChrsMark
Member

ChrsMark commented Feb 24, 2023

Thanks for your feedback here folks! I will file an issue in package-spec.

As for placing them in the package docs for now, this seems to be problematic because of the {{ctx.payload}} references inside the JSON request bodies. It seems that elastic-package identifies these as a "function" (like the { event } one) and fails to compile the file.
See https://github.com/elastic/integrations/pull/5364/files.
The error I get is:

elastic-package build                  
Build the package
Error: updating files failed: updating readme file apiserver-latency-alert.md failed: rendering Readme failed: parsing README template failed (path: /home/chrismark/go/src/github.com/elastic/integrations/packages/kubernetes/_dev/build/docs/apiserver-latency-alert.md): template: apiserver-latency-alert.md:91: function "ctx" not defined

I wonder if we can escape this somehow 🤔 @jsoriano, has this been encountered before in any of the packages?

@jsoriano
Member

Files in _dev/build/docs are expected to be Go templates that are rendered in docs, but I think you can put files directly in docs. If you don't need these files to be templates, try to place them in docs directly.
If that doesn't work, or you need them to be templates for some reason, you can leave them where they are, but you will need to use raw strings for the code with embedded templates ({{`{{ctx.payload}}`}}).

@ChrsMark
Member

ChrsMark commented Feb 28, 2023

Thanks @jsoriano, having them under docs directly works fine!

@gizas I have filed a PR to add them inside the package: #5364
I will proceed with adding them as automation also at https://github.com/elastic/k8s-integration-infra: elastic/k8s-integration-infra#16

Issue in package-spec: elastic/package-spec#484

@gizas
Contributor Author

gizas commented Mar 1, 2023

One last thing from me here @ChrsMark. I see we now have a full first version of the alerts. Can we distinguish those from SLOs?

I mean we can keep the current watchers folder for the alerts (I don't know if we should rename it to alerts) and create a new folder called slos.
There we could create 1-2 basic SLO examples for our users. SLOs are not so important for now, as they should come from product requirements, but it would be good to showcase the difference and the usage of the SLO API, and to guide users on how to build them. What do you think? Maybe you can open a new story for this.

@ChrsMark
Member

ChrsMark commented Mar 1, 2023

Good point @gizas! One basic limitation I found in all APIs/methods other than Watchers is that they do not provide the flexibility to define advanced ES queries with aggregations.

For example, if you check the examples at https://github.com/kdelemme/kibana/blob/main/x-pack/plugins/observability/dev_docs/slo.md, SLOs can only be defined on specific fields like cpu_usage or error_rate. I'm not sure how useful this is for a Kubernetes cluster, since we need to group metrics by objects like Nodes or Pods, so there is no way to make use of the SLO API for that.

So in order to implement what we need to conceptually cover the monitoring/alerting of a k8s cluster, we have to go with the advanced queries that Watchers allow.

One thing here is that it wouldn't be great for now to have 12 Watchers/Alerts and only 2 SLOs; to my mind it is better to be consistent.
However, I can spend some time checking whether there are k8s fields that could be used directly in SLOs, like in the examples at https://github.com/kdelemme/kibana/blob/main/x-pack/plugins/observability/dev_docs/slo.md.

@ChrsMark
Member

ChrsMark commented Mar 7, 2023

Hey @gizas, I tried to come up with a basic SLO and the only thing I could create looks like the following:

curl -i -k --request POST \
  --url https://elastic:changeme@0.0.0.0:5601/api/observability/slos \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: oui' \
  --data '{
        "name": "Kubernetes Data",
        "description": "My SLO Description",
        "indicator": {
                "type": "sli.kql.custom",
                "params": {
                        "index": "metrics*",
                        "good": "data_stream.dataset: kubernetes.node",
                        "total": "",
                        "filter": ""
                }
        },
        "timeWindow": {
                "duration": "1d",
                "isRolling": true
        },
        "budgetingMethod": "occurrences",
        "objective": {
                "target": 0.001
        }
}'

Again, the API looks quite limited at the moment; for example, a numerator query is required while I only wanted to perform a doc-count query.
Also, the duration cannot be set in minutes. I guess the concept of SLOs implies that monitoring is scoped to a wider timespan, not a minute or second scale.

Having said this, I feel that we need to wait for https://github.com/elastic/actionable-observability/issues/26 before investing much time in creating SLOs that are not actually useful for the k8s use case. See the plans for the "Custom Metric SLI", which is most probably what we would need -> https://github.com/elastic/actionable-observability/issues/26.

Update:
If we wanted to create an SLO for a k8s metric, we could use something like the Node's network errors:

curl -i -k --request POST \
  --url https://elastic:changeme@0.0.0.0:5601/api/observability/slos \
  --header 'Authorization: Basic ZWxhc3RpYzpjaGFuZ2VtZQ==' \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: oui' \
  --data '{
        "name": "Kubernetes Node Network Errors",
        "description": "Kubernetes Node Network Errors",
        "indicator": {
                "type": "sli.kql.custom",
                "params": {
                        "index": "metrics-kubernetes*",
                        "good": "",
                        "total": "kubernetes.node.network.rx.errors <= 5 and kubernetes.node.network.tx.errors <= 5",
                        "filter": ""
                }
        },
        "timeWindow": {
                "duration": "1d",
                "isRolling": true
        },
        "budgetingMethod": "occurrences",
        "objective": {
                "target": 0.995
        }
}'

But again, since we cannot group by the Node's name, this indicator is not very useful for k8s cluster monitoring/alerting either.

@ChrsMark
Member

ChrsMark commented Mar 8, 2023

A possible demo would look like the following:

  1. Install the Watcher for OOMKilled pods:
curl -X PUT "https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty" -k -H 'Content-Type: application/json' -d'
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-1m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "data_stream.dataset: kubernetes.state_container",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.container.status.last_terminated_reason"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.container.status.last_terminated_reason: OOMKilled",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "pods": {
              "terms": {
                "field": "kubernetes.pod.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.pods.buckets": {
        "path": "doc_count",
        "gte": {
          "value": 1,
          "quantifier": "some"
        }
      }
    }
  },
  "actions": {
    "ping_slack": {
      "foreach": "ctx.payload.aggregations.pods.buckets",
      "max_iterations": 500,
      "webhook": {
        "method": "POST",
        "url": "https://hooks.slack.com/services/T04SW3JHXasdfasdfasdfasdfasdf",
        "body": "{\"channel\": \"#k8s-alerts\", \"username\": \"k8s-cluster-alerting\", \"text\": \"Pod {{ctx.payload.key}} was terminated with status OOMKilled.\"}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Pod Terminated OOMKilled"
  }
}
'
  2. Then deploy the following Pod, which will be OOMKilled:
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
    command: ["stress"]
    args: ["--vm", "4", "--vm-bytes", "750M", "--vm-hang", "4"]

The watcher will then be activated and a Slack message will be sent:
[screenshot: Slack alert notification]

Note that an active Slack webhook is required.
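
To verify that the watch actually fired, one can fetch it back through the Watcher API and inspect its status (a quick sanity check, not part of the demo steps above; credentials match the earlier example):

# Returns the watch definition plus a status section with fields such as last_checked and last_met_condition
curl -k -u elastic:changeme "https://localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty"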

cc: @gizas @mlunadia

@ChrsMark
Member

The blog post was released: https://www.elastic.co/blog/enable-kubernetes-alerting-elastic-observability. Closing this one for now.
