
[Stack Monitoring] [Test Scenario] Out of the box alerting #93072

Closed
21 of 24 tasks
chrisronline opened this issue Mar 1, 2021 · 28 comments
Assignees
Labels
Team:Monitoring Stack Monitoring team test-plan

Comments

@chrisronline
Contributor

chrisronline commented Mar 1, 2021

Summary

Stack Monitoring provides a set of out-of-the-box alerts, created by simply loading the Stack Monitoring UI within Kibana. The default action for each alert is a server log and the action messaging is controlled by the Stack Monitoring UI code directly.

PRs

Original, and CPU alert: #68805
Disk usage alert: #75419
JVM memory usage alert: #79039
Missing monitoring data alert: #78208
Threadpool rejections alert: #79433

Testing

Creation

  • Ensure alerts are created upon visiting the Stack Monitoring UI
  • Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts
    • ⚠️ this requires extra permissions, which should be documented (and reflected in the test scenario)

Management

UX

Specific alerts

  • Ensure you can properly trigger and see the server log for the CPU usage alert
  • Ensure you can properly trigger and see the server log for the disk usage alert
  • Ensure you can properly trigger and see the server log for the jvm memory alert
  • Ensure you can properly trigger and see the server log for the missing monitoring data alert (Note: This alert is only concerned with Elasticsearch now and no longer looks at other stack products [Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud #83309)
  • Ensure you can properly trigger and see the server log for both threadpool rejection alerts
    • 🤔 I was unable to create the conditions for this.

Information on reproducing legacy alerts -> #87377

  • Ensure you can properly trigger and see the server log for the legacy cluster health alert
  • Ensure you can properly trigger and see the server log for the legacy nodes change alert
  • Ensure you can properly trigger and see the server log for the legacy Elasticsearch version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy Kibana version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy Logstash version mismatch alert
  • Ensure you can properly trigger and see the server log for the legacy license expiration alert (This cannot be tested on cloud)
    • 🤔 I was unable to create the conditions for this.

Edge cases

Previous issue: #85841

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@weltenwort weltenwort self-assigned this Mar 10, 2021
@weltenwort
Member

Ensure Stack Monitoring alerts are not editable or createable in the Alerts & Management UI

Do I understand correctly that these should not be available?

(screenshot)

@weltenwort
Member

Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts

What would this minimum set of permissions be? Only the monitoring_user role and nothing else?

@weltenwort
Member

Ensure that you can view and manage these alerts when Setup Mode is active

Is it a known issue that saving a changed alert and clicking "edit" again immediately after the success toast shows the second-to-last state?

@weltenwort
Member

Ensure a dedicated monitoring cluster environment works as expected

This might be a stupid question, but can the dedicated monitoring cluster monitor itself in addition to the remote clusters?

@chrisronline
Contributor Author

chrisronline commented Mar 10, 2021

#93072 (comment)

Yes, it's being changed now -> #94167. We actually never updated the test for this change, but we are reverting it, so it's accurate now and should be available to test soon.

#93072 (comment)

Correct. It is monitoring_user and kibana_admin

#93072 (comment)

It is not. Worth filing a ticket for sure

#93072 (comment)

Absolutely not stupid! Yes that's fine as well.

@weltenwort
Member

Ensure a user with the minimum set of monitoring permissions is able to create and manage alerts

It seems a user with only monitoring_user and kibana_admin can cause the initial alerts to be created, but can't enter setup mode:

(screenshot)

But

GET /_security/user/_has_privileges
{
  "index": [
    {
      "names": [
        ".monitoring-*-6-*,.monitoring-*-7-*"
      ],
      "privileges": [
        "read"
      ]
    }
  ]
}

yields

{
  "username" : "monitoring-only",
  "has_all_requested" : true,
  "cluster" : { },
  "index" : {
    ".monitoring-*-6-*,.monitoring-*-7-*" : {
      "read" : true
    }
  },
  "application" : { }
}

Is that expected?

@weltenwort
Member

Ensure a CCS environment works as expected

What is the interaction between stack monitoring alerting and CCS? When does stack monitoring access monitoring data via CCS?

@weltenwort
Member

🪲 I filed "Alert editing fly-out shows outdated alert parameters after editing" (#94440)

@chrisronline
Contributor Author

#93072 (comment)

Yes, this is actually expected! Setup mode requires additional permissions that are not available to the basic permission set. See #50421

#93072 (comment)

Apologies for the confusion.

There are three ways in which the Stack Monitoring UI can find monitoring data.

  1. Looking at the default cluster the Kibana instance knows about (elasticsearch.hosts kibana setting)
  2. Connected to a separate ES cluster (through monitoring.ui.elasticsearch.hosts kibana setting)
  3. Connected to a separate ES cluster through remote clusters (ccs)

The intention of the bullet in this test scenario is to ensure that when looking at monitoring data through a CCS "way", the alerts work as expected. We've seen a few regressions around this "way" which is why I added it.
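For reference, the first two of these map to kibana.yml settings along these lines (the hostnames are illustrative):

```yaml
# kibana.yml — hostnames are illustrative
# Way 1: the default cluster Kibana itself connects to
elasticsearch.hosts: ["http://localhost:9200"]

# Way 2: a separate monitoring cluster used only by the Stack Monitoring UI
monitoring.ui.elasticsearch.hosts: ["http://monitoring-cluster:9200"]
```

Way 3 needs no Kibana-side setting; it goes through the remote clusters configured in Elasticsearch.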

@weltenwort
Member

The intention of the bullet in this test scenario is to ensure that when looking at monitoring data through a CCS "way", the alerts work as expected. We've seen a few regressions around this "way" which is why I added it.

Thanks for the clarification. So how do I configure it to access the monitoring data from a remote cluster via CCS?

@chrisronline
Contributor Author

  1. Locally, start up two separate ES clusters, ensuring they aren't both running on the same port(s).
  2. Use a single Kibana and point it to one of the clusters
  3. In that Kibana, go to Stack Management -> Remote Clusters and add the second cluster as a remote cluster
  4. Using curl, enable Monitoring on the second cluster through:
PUT _cluster/settings
{
  "transient": {
    "xpack.monitoring.collection.enabled": true
  }
}
  5. Navigate to the single Kibana and verify you can see monitoring data for the second ES cluster

Now it's set up with CCS.
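If you want to confirm the remote cluster connection before checking the UI, you can query the cluster Kibana points at (the alias in the response will be whatever you named the remote cluster in step 3):

```
GET _remote/info
```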

@weltenwort
Member

Ah, so it's automatically looking at all the remote clusters. Thanks!

@weltenwort
Member

Whenever I visit the stack monitoring app on my cloud test cluster, a toast is displayed that informs me about newly created alerts. All 14 alerting rules were already present before, though, and no new alerting rules show up.

(screenshot)

Would that warrant a bug report or would you consider this a non-issue since the toast is just informative?

@weltenwort
Member

🪲 I filed "Alert grouping toggle never renders as enabled" (#94556)

@chrisronline
Contributor Author

Whenever I visit the stack monitoring app on my cloud test cluster

Are you hard refreshing the page? If so, that's expected. It shouldn't show up again in the same session (provided you don't ever hard refresh). The logic that avoids showing it again is only an in-memory variable.

@weltenwort
Member

Are you hard refreshing the page?

Yes, because without the refresh switching to a cluster takes a long time. Thanks for the clarification.

@chrisronline
Contributor Author

without the refresh switching to a cluster takes a long time

This seems like a problem too. Is this in a cloud environment I can see?

@weltenwort
Member

weltenwort commented Mar 15, 2021

https://p.elstc.co/paste/aPPgTbob#UNKkP9yx2rmOX6kOZYd4nXDeiol-wVXH2orboTaykYI

Sometimes switching to a cluster view takes 10 seconds or longer. And switching back to the list doesn't work reliably either. I think I already filed an issue about that during last release's test phase, though.

@chrisronline
Contributor Author

@weltenwort That's fine. I think the slowness is a separate issue and we can follow up later.

@weltenwort
Member

I somehow can't get the license expiration to trigger. I use an ingest processor that forces expiry_date_in_millis to a near-future value, but the alert doesn't fire even though I set its check frequency to 1 minute. Copying the query from the code leads to hits with license in their _source like the following:

          "license" : {
            "uid" : "fb7a9815-8b0a-4608-b786-049fbec4a4a8",
            "expiry_date_in_millis" : 1616682712000,
            "issue_date" : "2020-03-24T00:00:00.000Z",
            "start_date_in_millis" : 1614556800000,
            "issued_to" : "Elastic Cloud",
            "expiry_date" : "2022-06-30T00:00:00.000Z",
            "max_nodes" : null,
            "issue_date_in_millis" : 1585008000000,
            "type" : "enterprise",
            "max_resource_units" : 100000,
            "issuer" : "API",
            "status" : "active"
          },

This looks like the expiry_date_in_millis is being modified by the pipeline, but the alert doesn't trigger. What am I doing wrong?
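For the record, the pipeline I'm applying is roughly of this shape (the pipeline id and the one-minute offset here are illustrative, not the exact values I used):

```
PUT _ingest/pipeline/force-license-expiry
{
  "description": "Illustrative sketch: push the reported license expiry into the near future",
  "processors": [
    {
      "script": {
        "if": "ctx.license != null",
        "source": "ctx.license.expiry_date_in_millis = System.currentTimeMillis() + 60000L"
      }
    }
  ]
}
```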

@weltenwort
Member

I went through every item, but (as noted in the description) was unable to produce the necessary conditions for some. I'd appreciate it if someone with more in-depth knowledge could double-check those.

I linked the issues I created underneath the corresponding item in the description.

@chrisronline
Contributor Author

@weltenwort Here is the pipeline I use for license expiration: https://gist.github.com/chrisronline/9d4d3d740e535d3c01410cac2cc74653

Does this match what you have?

@weltenwort
Member

Yes, that's what I used. The resulting document looks fine, but it doesn't trigger the alert somehow.

@chrisronline
Contributor Author

@weltenwort Are you trying on cloud? I imagine you are running up against this work: #84361. There is a config option to hide the license expiration logic from the SM UI, and I think cloud sets it by default. I'll make sure to note this in the test in the future. Apologies for the confusion.

@weltenwort
Member

Yes, I was testing mostly on cloud. The alert is visible and enabled in the UI, but never fires.

@chrisronline
Contributor Author

I opened #95090 to address that issue

@igoristic
Contributor

@weltenwort Testing thread pool rejections is a bit cumbersome. The only viable way is to set your max queue size to 0 via: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html. I was only able to achieve this locally on a bare-bones/throwaway cluster, though. For some reason it did not work on cloud; I imagine that's because the rejections are collected from the container and not the machine itself.

The other way is to create a pipeline that accumulates those counts (since it's a derivative) in incoming node_stats documents. The path looks like this: _source.node_stats.thread_pool.search.rejected. This is a more complicated process, though, so let me know if the first method does not work and we'll take this route.
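As a minimal sketch of the first method (assuming you target the search pool; any pool with a configurable queue works), the static node setting would look like this:

```yaml
# elasticsearch.yml — static setting, requires a node restart; use a throwaway cluster
# With the queue size at 0, search requests are rejected as soon as all
# search threads are busy, which drives the rejected counter up quickly.
thread_pool.search.queue_size: 0
```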
