
[Stack Monitoring] [Test Scenario] Out of the box alerting #93072

Closed
21 of 24 tasks
chrisronline opened this issue Mar 1, 2021 · 28 comments
Assignees
Labels
Team:Monitoring Stack Monitoring team test-plan

Comments

@chrisronline
Contributor

chrisronline commented Mar 1, 2021

Summary

Stack Monitoring provides a set of out-of-the-box alerts, created by simply loading the Stack Monitoring UI within Kibana. The default action for each alert is a server log and the action messaging is controlled by the Stack Monitoring UI code directly.

PRs

Original, and CPU alert: #68805
Disk usage alert: #75419
JVM memory usage alert: #79039
Missing monitoring data alert: #78208
Threadpool rejections alert: #79433

Testing

Creation

  • Ensure alerts are created upon visiting the Stack Monitoring UI
  • Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts
    • ⚠️ this requires extra permissions, which should be documented (and reflected in the test scenario)

Management

UX

Specific alerts

  • Ensure you can properly trigger and see the server log for the CPU usage alert
  • Ensure you can properly trigger and see the server log for the disk usage alert
  • Ensure you can properly trigger and see the server log for the jvm memory alert
  • Ensure you can properly trigger and see the server log for the missing monitoring data alert (Note: This alert is only concerned with Elasticsearch now and no longer looks at other stack products [Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud #83309)
  • Ensure you can properly trigger and see the server log for both threadpool rejection alerts
    • 🤔 I was unable to create the conditions for this.

Information on reproducing legacy alerts -> #87377

  • Ensure you can properly trigger and see the server log for the legacy cluster health alert
  • Ensure you can properly trigger and see the server log for the legacy nodes change alert
  • Ensure you can properly trigger and see the server log for the legacy Elasticsearch version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy Kibana version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy Logstash version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy license expiration alert (This cannot be tested on cloud)
    • 🤔 I was unable to create the conditions for this.

Edge cases

Previous issue: #85841

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@weltenwort weltenwort self-assigned this Mar 10, 2021
@weltenwort
Member

Ensure Stack Monitoring alerts are not editable or createable in the Alerts & Management UI

Do I understand correctly that these should not be available?

(screenshot)

@weltenwort
Member

Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts

What would this minimum set of permissions be? Only the monitoring_user role and nothing else?

@weltenwort
Member

Ensure that you can view and manage these alerts when Setup Mode is active

Is it a known issue that saving a changed alert and clicking "edit" again immediately after the success toast shows the second-to-last state?

@weltenwort
Member

Ensure a dedicated monitoring cluster environment works as expected

This might be a stupid question, but can the dedicated monitoring cluster monitor itself in addition to the remote clusters?

@chrisronline
Contributor Author

chrisronline commented Mar 10, 2021

#93072 (comment)

Yes, it's being changed now -> #94167. We actually never updated the test for this change, but we are reverting it, so it's accurate now and should be available to test soon.

#93072 (comment)

Correct. It is monitoring_user and kibana_admin

#93072 (comment)

It is not. Worth filing a ticket for sure

#93072 (comment)

Absolutely not stupid! Yes that's fine as well.

@weltenwort
Member

Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts

It seems a user with only monitoring_user and kibana_admin can cause the initial alerts to be created, but can't enter setup mode:

(screenshot)

But

GET /_security/user/_has_privileges
{
  "index": [
    {
      "names": [
        ".monitoring-*-6-*,.monitoring-*-7-*"
      ],
      "privileges": [
        "read"
      ]
    }
  ]
}

yields

{
  "username" : "monitoring-only",
  "has_all_requested" : true,
  "cluster" : { },
  "index" : {
    ".monitoring-*-6-*,.monitoring-*-7-*" : {
      "read" : true
    }
  },
  "application" : { }
}

Is that expected?

@weltenwort
Member

Ensure a CCS environment works as expected

What is the interaction between stack monitoring alerting and CCS? When does stack monitoring access monitoring data via CCS?

@weltenwort
Member

🪲 I filed "Alert editing fly-out shows outdated alert parameters after editing" (#94440)

@chrisronline
Contributor Author

#93072 (comment)

Yes, this is actually expected! Setup mode requires additional permissions that are not available to the basic permission set. See #50421

#93072 (comment)

Apologies for the confusion.

There are three ways in which the Stack Monitoring UI can find monitoring data.

  1. Looking at the default cluster the Kibana instance knows about (elasticsearch.hosts kibana setting)
  2. Connected to a separate ES cluster (through monitoring.ui.elasticsearch.hosts kibana setting)
  3. Connected to a separate ES cluster through remote clusters (ccs)

The intention of the bullet in this test scenario is to ensure that when looking at monitoring data through a CCS "way", the alerts work as expected. We've seen a few regressions around this "way" which is why I added it.
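For reference, the first two of these map to kibana.yml settings along these lines (the hostnames are illustrative):

```yaml
# kibana.yml — hostnames are illustrative
# Way 1: the default cluster Kibana itself connects to
elasticsearch.hosts: ["http://localhost:9200"]

# Way 2: a separate monitoring cluster used only by the Stack Monitoring UI
monitoring.ui.elasticsearch.hosts: ["http://monitoring-cluster:9200"]
```

Way 3 needs no Kibana-side setting; it goes through the remote clusters configured in Elasticsearch.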

@weltenwort
Member

The intention of the bullet in this test scenario is to ensure that when looking at monitoring data through a CCS "way", the alerts work as expected. We've seen a few regressions around this "way" which is why I added it.

Thanks for the clarification. So how do I configure it to access the monitoring data from a remote cluster via CCS?

@chrisronline
Contributor Author

  1. Locally, start up two separate ES clusters, ensuring they aren't both running on the same port(s).
  2. Use a single Kibana and point it to one of the clusters
  3. In that Kibana, go to Stack Management -> Remote Clusters and add the second cluster as a remote cluster
  4. Using curl, enable Monitoring on the second cluster through:
PUT _cluster/settings
{
  "transient": {
    "xpack.monitoring.collection.enabled": true
  }
}
  5. Navigate to the single Kibana and verify you can see monitoring data for the second ES cluster

Now it's set up with CCS.
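If you want to confirm the remote cluster connection before checking the UI, you can query the cluster Kibana points at (the alias in the response will be whatever you named the remote cluster in step 3):

```
GET _remote/info
```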

@weltenwort
Member

Ah, so it's automatically looking at all the remote clusters. Thanks!

@weltenwort
Member

Whenever I visit the stack monitoring app on my cloud test cluster, a toast is displayed that informs me about newly created alerts. All 14 alerting rules were already present before, though, and no new alerting rules show up.

(screenshot)

Would that warrant a bug report or would you consider this a non-issue since the toast is just informative?

@weltenwort
Member

🪲 I filed "Alert grouping toggle never renders as enabled" (#94556)

@chrisronline
Contributor Author

Whenever I visit the stack monitoring app on my cloud test cluster

Are you hard refreshing the page? If so, that's expected. It shouldn't show up again in the same session (provided you don't ever hard refresh). The logic that avoids showing it again is only an in-memory variable.

@weltenwort
Member

Are you hard refreshing the page?

Yes, because without the refresh switching to a cluster takes a long time. Thanks for the clarification.

@chrisronline
Contributor Author

without the refresh switching to a cluster takes a long time

This seems like a problem too. Is this in a cloud environment I can see?

@weltenwort
Member

weltenwort commented Mar 15, 2021

https://p.elstc.co/paste/aPPgTbob#UNKkP9yx2rmOX6kOZYd4nXDeiol-wVXH2orboTaykYI

Sometimes switching to a cluster view takes 10 seconds or longer. And switching back to the list doesn't work reliably either. I think I already filed an issue about that during last release's test phase, though.

@chrisronline
Contributor Author

@weltenwort That's fine. I think the slowness is a separate issue and we can follow up later.

@weltenwort
Member

I somehow can't get the license expiration to trigger. I use an ingest processor that forces expiry_date_in_millis to a near-future value, but the alert doesn't fire even though I set its check frequency to 1 minute. Copying the query from the code leads to hits with license in their _source like the following:

          "license" : {
            "uid" : "fb7a9815-8b0a-4608-b786-049fbec4a4a8",
            "expiry_date_in_millis" : 1616682712000,
            "issue_date" : "2020-03-24T00:00:00.000Z",
            "start_date_in_millis" : 1614556800000,
            "issued_to" : "Elastic Cloud",
            "expiry_date" : "2022-06-30T00:00:00.000Z",
            "max_nodes" : null,
            "issue_date_in_millis" : 1585008000000,
            "type" : "enterprise",
            "max_resource_units" : 100000,
            "issuer" : "API",
            "status" : "active"
          },

This looks like the expiry_date_in_millis is being modified by the pipeline, but the alert doesn't trigger. What am I doing wrong?
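For the record, the pipeline I'm applying is roughly of this shape (the pipeline id and the one-minute offset here are illustrative, not the exact values I used):

```
PUT _ingest/pipeline/force-license-expiry
{
  "description": "Illustrative sketch: push the reported license expiry into the near future",
  "processors": [
    {
      "script": {
        "if": "ctx.license != null",
        "source": "ctx.license.expiry_date_in_millis = System.currentTimeMillis() + 60000L"
      }
    }
  ]
}
```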

@weltenwort
Member

I went through every item, but (as noted in the description) was unable to produce the necessary conditions for some. I'd appreciate it if someone with more in-depth knowledge could double-check those.

I linked the issues I created underneath the corresponding item in the description.

@chrisronline
Contributor Author

@weltenwort Here is the pipeline I use for license expiration: https://gist.github.com/chrisronline/9d4d3d740e535d3c01410cac2cc74653

Does this match what you have?

@weltenwort
Member

Yes, that's what I used. The resulting document looks fine, but it doesn't trigger the alert somehow.

@chrisronline
Contributor Author

@weltenwort Are you trying on cloud? I imagine you are running up against this work: #84361. There is a config option to hide the license expiration logic from the SM UI, and I think cloud sets it by default. I'll make sure to note this in the test in the future. Apologies for the confusion.

@weltenwort
Member

Yes, I was testing mostly on cloud. The alert is visible and enabled in the UI, but never fires.

@chrisronline
Contributor Author

I opened #95090 to address that issue

@igoristic
Contributor

@weltenwort Testing thread pool rejections is a bit cumbersome. The only viable way is to set your max queue size to 0 via: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html. I was only able to achieve this locally on a bare-bones/throwaway cluster, though. For some reason it did not work on cloud; I imagine that's because the rejections are collected from the container and not the machine itself.

The other way is to create a pipeline that accumulates those counts (since it's a derivative) in incoming node_stats documents. The path looks like this: _source.node_stats.thread_pool.search.rejected. This is a more complicated process, though, so let me know if the first method does not work and we'll take this route.
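As a minimal sketch of the first method (assuming you target the search pool; any pool with a configurable queue works), the static node setting would look like this:

```yaml
# elasticsearch.yml — static setting, requires a node restart; use a throwaway cluster
# With the queue size at 0, search requests are rejected as soon as all
# search threads are busy, which drives the rejected counter up quickly.
thread_pool.search.queue_size: 0
```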
