
[Monitoring] Kibana Alerting #49219

Closed

Conversation


@chrisronline chrisronline commented Oct 24, 2019

Relates to #42960

This PR lays the groundwork for migrating cluster alerts to Kibana alerting. For now, that involves migrating a single alert: xpack_license_expiration.

In addition to migrating the xpack_license_expiration alert, this PR exposes a UI that walks the user through the migration. We discussed this and decided to require the user to click a button to initiate the migration; that button is included in this PR.

We also need a way to disable the watcher-based cluster alerts. This PR does nothing for that effort, as the plan is to disable them all once every alert is migrated. That will involve a PR to Elasticsearch at some point. (See elastic/elasticsearch#50032)

Now, you may be wondering why we would merge this PR with these new UIs right now. We want to progressively merge the migration work into master, so we added a new constant in the code that indicates whether Kibana alerting is enabled for monitoring. Devs and testers will need to change this to true for any of this code to work. Once all alerts are migrated, we can simply remove that constant.

Screenshots

Setup


Testing

[Screenshot: Screen Shot 2019-10-24 at 12 44 32 PM]

[Screenshot: Screen Shot 2019-10-24 at 12 44 43 PM]

Testing

  • You'll need to use a PR from Elasticsearch that blacklists the necessary cluster alerts. Ensure you have ES cloned (do so, if not) in a sibling directory to kibana (like /dev/repos/kibana and /dev/repos/elasticsearch), pull down this PR, then run yarn es source inside of kibana to use that custom ES build.
  • Add these to your kibana.dev.yml:
xpack.alerting.enabled: true
xpack.actions.enabled: true
  • Add this to your elasticsearch.yml:
xpack.security.authc.api_key.enabled: true
  • Flip the constant to true
  • Uncomment this line and comment this line out
  • Ensure you have a trial, gold, or platinum license active
  • Ensure internal monitoring is enabled (Fairly sure that Metricbeat monitoring does not have a way to initiate the cluster alerts, cc @cachedout to verify)
  • Go to dev tools and ensure you only see 5 watches: GET .watches/_search?filter_path=hits.total.value
  • Use the UI to create an email action and migrate the xpack license expiration alert
  • For creating the action, if you are using gmail, I'd recommend reading this and this
  • Simulate a license expiration to receive an email. One way is to:
  1. Add an ingest pipeline:
PUT _ingest/pipeline/force_license_expiration
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.license.expiry_date_in_millis = Instant.ofEpochMilli(new Date().getTime()).plusSeconds(60 * 60 * 24 * 3).getEpochSecond() * 1000;"
      }
    }
  ]
}
  2. Disable the default pipeline:
PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.exporters": { 
      "local": {
        "type": "local",
        "use_ingest": false
      }
    }
  }
}
  3. Set this as the required pipeline for .monitoring-es-7-* indices:
PUT .monitoring-es-7-*/_settings
{
  "index.required_pipeline": "force_license_expiration"
}
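If it's unclear whether the override took effect, a couple of optional sanity checks in dev tools (assuming the pipeline and index setting above were applied as written):

```
GET _ingest/pipeline/force_license_expiration

GET .monitoring-es-7-*/_settings?filter_path=*.settings.index.required_pipeline
```

The second request should show force_license_expiration on each matching index while the simulation is active.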

This should cause the alert to fire. You can get it to the "resolved" state by removing the required pipeline setting from the index:

PUT .monitoring-es-7-*/_settings
{
  "index.required_pipeline": null
}

TODO

  • Add unit tests for the UI
  • Once migrated, all stored alerts in .monitoring-alerts-* should be removed
  • Only users with the right permissions should be able to interact with this UI, see [Monitoring] Only elevated permission users should see Setup Mode #50327
  • Should we perform CCS queries when searching for monitoring data in the alert?
  • Create follow up issues to remove the default_admin_email logic since we aren't using that in the Kibana alerting world

@chrisronline chrisronline added review Team:Monitoring Stack Monitoring team v8.0.0 release_note:skip Skip the PR/issue when compiling release notes Feature:Stack Monitoring v7.6.0 labels Oct 24, 2019
@chrisronline chrisronline self-assigned this Oct 24, 2019
@chrisronline
Contributor Author

When a required field isn't filled in, all other fields display warnings:

Nice find! Fixed in 3c4924b

@chrisronline
Contributor Author

The icon here suggests that clicking on it will launch a new window or a different tab but instead, clicking on it displays the content in the same page.

Fixed.

@cachedout
Contributor

She wanted to make it more user friendly by parsing out the actual error - Maybe we can have both by showing what we have now, but offering a button/toggle for the user to see the raw message?

That would work for me! :)

@andreadelrio
Contributor

andreadelrio commented Dec 2, 2019

I think we could improve our error handling story here. In this case, I intentionally wrote out a config which won't work and I'm presented with a generic error but I don't know what's wrong. It would be good if we could return more details, like what the response from the mail server was when trying to send an alert:

The main issue is the error message comes back in a non user-friendly format.

For example:

"error in action \"57914bf9-dabd-4e7f-99b4-e10798ca80d6\" sending email: Invalid login: 535-5.7.8 Username and Password not accepted. Learn more at\n535 5.7.8  https://support.google.com/mail/?p=BadCredentials j9sm371252qtr.57 - gsmtp"

We could perhaps try to parse this, but it feels brittle. Here is the structure of the error message, but that might change over time. @mikecote any thoughts/plans to make these error messages more user-friendly? Or some way to handle this well in a UI?

@chrisronline What if we just show them {errorMessage}? Is showing the actionId really necessary here?

@chrisronline
Contributor Author

Screen Shot 2019-12-02 at 4 31 26 PM

Okay what if we show this for now? And then it will only get better once the alerting team makes this error more user friendly?

@andreadelrio
Contributor

andreadelrio commented Dec 2, 2019

Screen Shot 2019-12-02 at 4 31 26 PM

Okay what if we show this for now? And then it will only get better once the alerting team makes this error more user friendly?

@chrisronline I would only show from Invalid login onwards unless they need to see the actionId for something

@elasticmachine
Contributor

💚 Build Succeeded

@elasticmachine
Contributor

💚 Build Succeeded

@chrisronline
Contributor Author

I would only show from Invalid login onwards unless they need to see the actionId for something

My fear with trying to do this is that the string that comes before it is localized.

I think we should leave it as-is, knowing that it will get addressed before this feature is GA. Thoughts?

@pmuellr
Member

pmuellr commented Dec 3, 2019

The problem with errors from actions like this is that most of the time the action isn't going to know exactly what the problem is; it's encoded in the message coming back from the service, in this case gmail. Yahoo's email service likely returns something similar, yet completely different, so parsing the interesting bit out of the service message (line split here for readability) is going to be impossible:

Invalid login: 535-5.7.8 Username and Password not accepted.
Learn more at\n535 5.7.8 https://support.google.com/mail/?p=BadCredentials j9sm371252qtr.57 - gsmtp

Somehow, we need to make that message available to users, but we could potentially do a better job in the action result to accommodate this.

Here's what the action currently returns:

{
  status: 'error',
  message: "error in action \"{action-id}\" sending email: {service-message}"
}

We could send the following instead:

{
  status: 'error',
  message: "error sending email",
  actionId: {action-id},
  actionName: {just a suggestion, could be useful to have},
  detail: {service-message}
}
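To make the proposal concrete, here is a minimal sketch of what a UI could do with the richer result shape suggested above. The interface and field names mirror the suggestion in this comment, not the actual Kibana alerting types, so treat them as hypothetical:

```typescript
// Hypothetical shape of the proposed action error result; field names are
// suggestions from this thread, not actual Kibana alerting types.
interface ActionErrorResult {
  status: 'error';
  message: string;      // short, generic summary (safe to localize)
  actionId: string;
  actionName?: string;  // "just a suggestion" in the proposal above
  detail?: string;      // raw service message, e.g. the SMTP response
}

// A UI could then surface the raw service detail directly, with no
// string parsing, falling back to the generic message when absent.
function formatForUi(result: ActionErrorResult): string {
  return result.detail ?? result.message;
}

const result: ActionErrorResult = {
  status: 'error',
  message: 'error sending email',
  actionId: '57914bf9-dabd-4e7f-99b4-e10798ca80d6',
  detail: 'Invalid login: 535-5.7.8 Username and Password not accepted.',
};

console.log(formatForUi(result));
```

The point of the `detail` field is that the UI never has to guess where the localized prefix ends, which is exactly the brittleness discussed earlier in this thread.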

@chrisronline
Contributor Author

@pmuellr Makes sense. I think that little change (including the raw error message on the error object) will help us here, and it's exactly what we're looking for.

@chrisronline
Contributor Author

Question for @elastic/stack-monitoring

What was our intention with handling multi-cluster monitoring and cluster alerts? Do we want to require the user to go to each cluster's overview page and create Kibana alerts? Or do we want to be able to create these alerts and have them function for every monitored cluster (this would be dynamic in that our alerting code will just do a terms agg across cluster_uuid)?
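For reference, a rough sketch of the kind of query the second option implies, grouping monitoring documents by cluster_uuid in one request. The index pattern and fields follow the monitoring indices, but this is illustrative, not the actual alerting code:

```
POST .monitoring-es-*/_search
{
  "size": 0,
  "query": { "term": { "type": "cluster_stats" } },
  "aggs": {
    "clusters": {
      "terms": { "field": "cluster_uuid", "size": 100 }
    }
  }
}
```

Each bucket would then be evaluated independently, so one alert instance could cover every monitored cluster.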

@cachedout
Contributor

I prefer the second option. I don't think a user is going to want to go cluster-by-cluster, and I think it simplifies things.

@elasticmachine
Contributor

💔 Build Failed

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@cachedout cachedout removed their request for review December 27, 2019 10:09
@chrisronline
Contributor Author

The comments are a bit much and stale at this point. I'm going to close and re-open a new one for this work.
