[Stack Monitoring] Alerting Phase 1 #42960

Open · 8 of 10 tasks
cachedout opened this issue Aug 8, 2019 · 12 comments
Labels: enhancement (New value added to drive a business result), Meta, Team:Monitoring (Stack Monitoring team)

@cachedout (Contributor) commented Aug 8, 2019

This ticket tracks the work that needs to be completed to achieve Phase 1, which is outlined in the proposal document.

To complete this phase, we need to build out the plumbing to connect the Stack Monitoring application to the Kibana Alerting Framework.

All watches need to be present and functional using the new framework:

cachedout added the Meta, enhancement, and Team:Monitoring labels on Aug 8, 2019
@elasticmachine (Contributor)

Pinging @elastic/stack-monitoring

@chrisronline (Contributor)

Update here.

I found a couple of blockers while taking a first stab at this and raised them here: #45571

@chrisronline (Contributor) commented Oct 16, 2019

The effort is going well here. I don't have a PR ready yet, but I hope to have it this week. (Update: Draft PR available)

Some updated notes on this effort:

  • We need to figure out how we handle the state of alerts firing - with Watcher, we write to the .monitoring-alerts-* index, but I think we can avoid an additional index by leveraging the persisted state for actions (see the sketch after this list). We are blocked on this because we need a way to access this state; see Ability to fetch alert state / alert instance state #48442
  • We need to figure out the right way to disable cluster alerts (watches). I've outlined some thoughts on this issue
  • I'm thinking we'll want to progressively add these into master (instead of one big merge). If so, we should decide whether to keep them disabled until they are all in, or enable at least one from the start and have it co-exist with the other watches.
  • With Watcher, we require users to specify an email address to receive alerts in their kibana.yml. We can continue this trend, or we can allow them to specify it in the UI when they enable Kibana alerts and store it in a saved object or something.
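
On the first point, here is a rough sketch, assuming the executor / alertInstanceFactory shape the Kibana alerting plugin exposes; the alert type id, params, state fields, and `checkLicenseExpiration` helper are made up for illustration and are not the actual implementation:

```ts
// Hypothetical sketch: keep firing/resolved status in the framework's persisted
// state instead of writing documents to .monitoring-alerts-*.
declare const alerting: { registerType(alertType: unknown): void }; // stand-in for the plugin contract
declare function checkLicenseExpiration(
  callCluster: (endpoint: string, options?: unknown) => Promise<unknown>,
  clusterUuid: string
): Promise<number>; // stand-in helper that reads the license doc from monitoring indices

interface LicenseAlertState {
  isFiring: boolean;
  lastCheckedMS: number;
  lastFiredMS?: number;
  resolvedMS?: number;
}

alerting.registerType({
  id: 'monitoring_alert_license_expiration', // illustrative id
  name: 'X-Pack license expiration',
  actionGroups: ['default'],
  async executor({ services, params, state }: any): Promise<LicenseAlertState> {
    const prev: Partial<LicenseAlertState> = state ?? {};
    const expiresInDays = await checkLicenseExpiration(services.callCluster, params.clusterUuid);
    const isFiring = expiresInDays < params.thresholdDays;

    if (isFiring && !prev.isFiring) {
      // Schedule actions (e.g. email) only on the transition into the firing state.
      services.alertInstanceFactory(params.clusterUuid).scheduleActions('default', { expiresInDays });
    }

    // The returned object is what the framework persists between runs; reading it
    // back (issue #48442) is what would let the UI show fired/resolved status.
    return {
      isFiring,
      lastCheckedMS: Date.now(),
      lastFiredMS: isFiring ? Date.now() : prev.lastFiredMS,
      resolvedMS: !isFiring && prev.isFiring ? Date.now() : prev.resolvedMS,
    };
  },
});
```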

@igoristic (Contributor)

Nice work @chrisronline 💪 Can't wait to see it!

We need to figure out how we handle the state of alerts firing - with Watcher, we write to the .monitoring-alerts-* index

Once "Kibana Alerting" is live are we completely deprecating/removing the current/old Alerting?

I think we might still want a new index, just in case some setups still have the old .monitoring-alerts-* with legacy documents (or for some reason we need to support both ES and Kibana alerting). We can suffix it with something like -kb, like we do -mb for Metricbeat.

I'm thinking we'll want to progressively add these into master (instead of one big merge)

💯

With Watcher, we require users to specify an email address to receive alerts in their kibana.yml

I prefer the Kibana UI, just because it's more user friendly and they can modify the info without restarting, but I don't mind continuing the yml trend.

@chrisronline (Contributor)

Thanks for the thoughts @igoristic!

Once "Kibana Alerting" is live are we completely deprecating/removing the current/old Alerting?

I guess it depends on whether we want a slow rollout of these migrations. If so, we will be living in a world where both are running at the same time (not for the same alert check, but we'll have some Watcher-based cluster alerts and some Kibana alerts).

I think we might still want a new index, just in case some setups still have the old .monitoring-alerts-* with legacy documents (or for some reason we need to support both ES and Kibana alerting). We can suffix it with something like -kb, like we do -mb for Metricbeat.

You don't think we can accomplish the same UI just by using the state provided by the alerting framework? I think that's really all we need, since we'll store data in there that tells us when the alert fired and whether it's been resolved yet.
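
For instance, if #48442 lands, the UI side could read that persisted state back instead of querying a dedicated index. A hypothetical consumer (the fetch function does not exist yet; it is exactly the gap tracked in that issue):

```ts
// Hypothetical UI-side helper; `fetchAlertState` is a stand-in for whatever API
// issue #48442 eventually provides, and the state shape mirrors the executor
// sketch earlier in this thread.
declare function fetchAlertState(alertId: string): Promise<{
  isFiring: boolean;
  lastFiredMS?: number;
  resolvedMS?: number;
}>;

export async function getAlertStatusLabel(alertId: string): Promise<string> {
  const state = await fetchAlertState(alertId);
  if (state.isFiring) {
    return `Firing since ${new Date(state.lastFiredMS ?? Date.now()).toISOString()}`;
  }
  return state.resolvedMS ? `Resolved at ${new Date(state.resolvedMS).toISOString()}` : 'OK';
}
```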

I prefer the Kibana UI, just because it's more user friendly and they can modify the info without restarting, but I don't mind continuing the yml trend.

Yea, I agree the UI route is better, but if we do a slow rollout, it might be confusing for folks who already have the kibana.yml config set. I think we need to make a call on the slow rollout, and that will help inform how we handle these other issues.

@igoristic (Contributor)

You don't think we can accomplish the same UI just by using the state provided by the alerting framework? I think that's really all we need, since we'll store data in there that tells us when the alert fired and whether it's been resolved yet.

I guess I don't really know the current implementation well enough to validate my concern. My worry is that if an ES alert is triggered, it'll be added to the index, which will then be picked up by both ES alerts and Kibana alerts, and might duplicate some actions like sending two emails, etc.

I just think a new index can help avoid any of these issues we might not yet foresee (maybe for the same reason Metricbeat has its own -mb indices?).

This is all based on speculation though.

@chrisronline (Contributor)

I guess I don't really know the current implementation well enough to validate my concern. My worry is that if an ES alert is triggered, it'll be added to the index, which will then be picked up by both ES alerts and Kibana alerts, and might duplicate some actions like sending two emails, etc.

Ah, I see the confusion here.

Part of this work involves disabling (or blacklisting, per @cachedout's idea) the cluster alert when we enable the Kibana alert. We'd never (intentionally) have a situation where both the cluster alert for X-Pack license expiration and the Kibana alert for X-Pack license expiration are running at the same time.
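
As a rough sketch of one way that could work (not a settled design: the actual mechanism, e.g. a blacklist setting, is still being discussed, and the watch id and mapping below are illustrative only):

```ts
// Hypothetical: deactivate the legacy watch when its Kibana alert counterpart is
// enabled, using the legacy ES client's generic transport escape hatch so no
// specific watcher client method is assumed.
declare function callCluster(endpoint: string, options?: unknown): Promise<unknown>;

// Illustrative mapping from a Kibana alert type to the legacy watch it replaces.
const WATCH_ID_BY_ALERT_TYPE: Record<string, string> = {
  monitoring_alert_license_expiration: 'CLUSTER_UUID_xpack_license_expiration',
};

export async function deactivateLegacyWatch(alertTypeId: string): Promise<void> {
  const watchId = WATCH_ID_BY_ALERT_TYPE[alertTypeId];
  if (!watchId) {
    return; // nothing to disable for this alert type
  }
  // PUT _watcher/watch/<id>/_deactivate keeps the watch but stops it from running,
  // so only the Kibana alert fires for this check.
  await callCluster('transport.request', {
    method: 'PUT',
    path: `/_watcher/watch/${encodeURIComponent(watchId)}/_deactivate`,
  });
}
```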

@cachedout (Contributor, Author)

I'm thinking we'll want to progressively add these into master (instead of one big merge). If so, we should decide whether to keep them disabled until they are all in, or enable at least one from the start and have it co-exist with the other watches.

I think that gradually merging these and leaving them disabled until we are ready to switch the new alerting on in the application is the right thing to do. It gives us time to develop and test the alerts while minimizing the disruption for the user.

@ypid-geberit

I was forwarded to this issue from elastic/elasticsearch#34814 (comment). The "Phase 1 which is outlined in the proposal document" is not linked, so I don't have knowledge of it; excuse me if this is beyond the scope of "Phase 1".

As an Elastic Stack admin, I feel that Stack Monitoring falls short compared to other monitoring systems. For example, there is no concept of hard and soft states, and I am not convinced that it would be a good idea to replicate this using Elastic Watcher (I tried for my own use and failed). See elastic/elasticsearch#34814 (comment) for more details.

@igoristic (Contributor) commented Feb 16, 2021

Thank you @ypid-geberit for your feedback

As an Elastic Stack admin, I feel that Stack Monitoring falls short compared to other monitoring systems. For example, there is no concept of hard and soft states

I think this is a good feature request, but perhaps out of scope for this ticket.

@ravikesarwani Maybe this is something we can add a ticket for in the SM feature-request roadmap.

igoristic reopened this on Feb 16, 2021
@ravikesarwani (Contributor)

Many of the out-of-the-box Stack Monitoring alerts give users full flexibility to control the notifications (including which method they are notified with, based on license level) and when they are generated. For example, "CPU Usage" defaults to alerting when CPU is over 85%, averaged over the last 5 minutes. Both the 85% threshold and the 5-minute duration can easily be adjusted by users.

Also, with #91145 we will allow users to create multiple alerts and handle something similar to soft and hard states. For example: say a user wants to alert and send an email when CPU is over 75% for the last 5 minutes, and send a PagerDuty alert when it's over 85% for the last 10 minutes.
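
Expressed loosely, that two-rule setup might look like the sketch below; the field names are illustrative, not the actual alert params schema:

```ts
// Illustrative only: two hypothetical CPU usage rule configurations of the kind
// #91145 would allow, expressed as plain objects rather than real API calls.
interface CpuUsageRuleConfig {
  name: string;
  threshold: number;  // percent CPU, averaged over the duration
  duration: string;   // look-back window
  actions: Array<{ type: 'email' | 'pagerduty'; to?: string }>;
}

const cpuRules: CpuUsageRuleConfig[] = [
  {
    // "Soft" state: warn early by email.
    name: 'CPU usage warning',
    threshold: 75,
    duration: '5m',
    actions: [{ type: 'email', to: 'ops@example.com' }],
  },
  {
    // "Hard" state: page someone.
    name: 'CPU usage critical',
    threshold: 85,
    duration: '10m',
    actions: [{ type: 'pagerduty' }],
  },
];
```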

@ypid-geberit

Sounds like what @ravikesarwani wrote addresses it. I am looking forward to it :)
