Is the alerting framework the solution for ETL? #92197

Closed
mikecote opened this issue Feb 22, 2021 · 5 comments
Labels
discuss, Feature:Alerting/RulesFramework, Feature:Alerting, Meta, Team:ResponseOps

Comments

@mikecote
Contributor

I've noticed a few places using the alerting framework as a solution to do ETL (Maps, Security Detections, etc.). I created this issue to discuss where we draw the line with the Kibana Alerting framework.

Some alert types leverage the alerting framework to do some data processing for other alerts to use as inputs (see below).
[Screenshot: Screen Shot 2021-02-22 at 9 49 25 AM]

The question is, is our framework the place to do ETL? If so, should we architect something to do ETL (example below)?
[Screenshot: Screen Shot 2021-02-22 at 9 51 00 AM]

Some of the issues we have today:

  • Alerts can only run so frequently without causing contention
  • They're currently "at least once", so duplicate documents are possible
  • They are inherently delayed
  • Alerting is meant to alert users of something happening. In some situations, we're not just alerting someone on a specific cadence. We also want to have a visualization that reflects when this occurs outside of the particulars of alerting. It's not just alerting. It's alerting + data transformation.
  • Alerts would have to look at all the data in a window (interval) to capture all the events while only alerting about one of them if they relate to the same instance (Ex: vehicle being contained in two boundaries; see the sketch after this list).
  • Alerts will waste calculations if multiple alerts monitor the same thing (compared to using a calculated data source)
  • Alert simulations won’t have the data upfront to show the user what would happen
  • Alerting framework is currently not designed to be a monitoring framework
  • It would still be nice to have alert executors idempotent and not create, update, delete operations when they run. If that was the case, we could do something cool with alert simulation / explain plans by doing dry runs
  • A lot of data would be travelling between Kibana and Elasticsearch (I’m sure there are costs on Cloud for this?)
  • Alert instances don’t know exactly when something happened, but they know when it was detected
  • Alerting framework isn’t designed to handle a large number of instances at this time
  • Alerting framework isn’t designed for alert instances to always be active
  • Alternatives lack alerting functionality (ex: task manager w/ API keys and UI forms)
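
To make the windowing point above concrete, here is a minimal sketch of the kind of grouping an alert executor would need to do; the types and field names are illustrative, not an existing Kibana API:

```ts
// Hypothetical event shape for the "vehicle contained in two boundaries" example.
interface GeoEvent {
  vehicleId: string;
  boundaryId: string;
  timestamp: string;
}

// Group every event in the window by instance (vehicle), so the executor can
// look at all events but raise at most one alert instance per vehicle.
function groupEventsByInstance(eventsInWindow: GeoEvent[]): Map<string, GeoEvent[]> {
  const byVehicle = new Map<string, GeoEvent[]>();
  for (const event of eventsInWindow) {
    const existing = byVehicle.get(event.vehicleId) ?? [];
    existing.push(event);
    byVehicle.set(event.vehicleId, existing);
  }
  return byVehicle;
}
```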
@mikecote added the discuss, Feature:Alerting, and Team:ResponseOps labels on Feb 22, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor Author

I also wondered why we allow alerts not to have any actions. 🤔

@gmmorris
Contributor

gmmorris commented Feb 23, 2021

Thanks for posting this issue Mike, it's something I've been a little worried about for a while.
If TM is being used for ETL, then I think we need to prioritise improving its scalability, how it handles scheduling/timing, the priority of different task types, etc.

It doesn't make sense for long running ETL tasks to keep notifications of detections in the ETLed data from firing, for example, but that could easily happen today.

I also wondered why we allow alerts not to have any actions. 🤔

I think that's a separate question tbh.
Once we have a richer experience in Alert Details, I can totally see an analyst wanting to evaluate whether a certain detection makes sense before attaching behaviour (actions) to it.

I don't think preview is quite enough for that if you want to leave a few detections running in parallel for a while and compare them later.
Thinking back to my last job, that would have been super useful and something we lacked in Splunk.

@gmmorris
Contributor

gmmorris commented Feb 23, 2021

I just thought I'd add some context that might be missing for people less familiar with Task Manager. 😄

Some of the issues we have today:

  • Alerts can only run so frequently without causing contention

I recently documented how these things work, and what the default scale of TM is.

From the docs:

By default Kibana polls for tasks at a rate of 10 tasks every 3 seconds.
This means that if many tasks have been scheduled to run at the same time, pending tasks will queue in Elasticsearch. Each Kibana instance then polls for pending tasks at a rate of up to 10 tasks at a time, at 3 second intervals. It is possible for pending tasks in the queue to exceed this capacity and run late as a result.

This means a long running ETL task might take up a slot for an extended period of time.
If we end up with many such ETL tasks, this could easily clog up the queue.
It's also very easy for a naive user to configure a bunch of alerts with a 1s interval, essentially clogging up the entire system.
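
To put rough numbers on that, here is a back-of-the-envelope calculation using the documented defaults (10 tasks per poll, 3 second poll interval); this is a sketch of the ceiling, not a precise model of Task Manager:

```ts
// Upper bound on task throughput per Kibana instance with default settings.
const maxTasksPerPoll = 10;   // tasks claimed per poll cycle
const pollIntervalMs = 3000;  // time between poll cycles

const tasksPerSecond = maxTasksPerPoll / (pollIntervalMs / 1000); // ≈ 3.3
const tasksPerMinute = tasksPerSecond * 60;                       // ≈ 200

console.log(`~${tasksPerMinute} task executions per minute, per Kibana instance`);
// A single long running ETL task holds one of those 10 slots for its entire
// duration, so a handful of them noticeably lowers this ceiling.
```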

We can always push these default numbers higher, but that comes at a cost.

Scaling a system is always hard... doing so in a general purpose manner that is adaptive is even harder.
Addressing this is of course possible, we just need to agree that it's the priority. 🤔

  • They are inherently delayed

Tasks are scheduled for a certain time and it's only after they exceed that time that they are picked up by one of the Kibana instances.
As Task Manager is designed to ensure no two Kibana instances will run the same task in parallel, we are forced into using Elasticsearch as a queue (and as a locking mechanism) to coordinate which Kibana will claim a task and execute it.
This process repeats at each polling interval, and is only applied to tasks that have already expired (as in, their scheduled time to run is in the past).
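
For anyone unfamiliar with what that coordination looks like, here's a heavily simplified sketch of the claiming idea using Elasticsearch's optimistic concurrency control. The real claim cycle uses an update-by-query over the task index, so treat this as an illustration of the principle rather than the actual implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Only one Kibana instance can "win" a task, because the update below is
// conditional on the seq_no/primary_term observed when the task was read.
async function claimTask(taskId: string, kibanaUuid: string): Promise<boolean> {
  const { body: task } = await client.get({ index: '.kibana_task_manager', id: taskId });

  if (task._source.task.status !== 'idle') {
    return false; // already claimed or running
  }

  try {
    await client.update({
      index: '.kibana_task_manager',
      id: taskId,
      if_seq_no: task._seq_no,
      if_primary_term: task._primary_term,
      body: { doc: { task: { status: 'claiming', ownerId: kibanaUuid } } },
    });
    return true; // we claimed it; this instance will run the task
  } catch (e) {
    return false; // 409 version conflict: another instance claimed it first
  }
}
```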

We could in theory pick tasks up preemptively, but there's a lot of complexity to that:

  1. We don't know how long it'll take to get through tasks that have already been picked up. Do we pick up new tasks preemptively? What if mine run over and another Kibana instance is free? We'd be unintentionally late because we preempted.
  2. What if a Kibana instance claims a task early and then crashes before it had the chance to run that task?
  3. etc.

At first glance these seem like simple problems, but in fact they are complex due to the distributed nature of Kibana and the requirements at play here.

Another thing worth understanding is that Task Manager prioritises system stability over schedule accuracy.
We only pick up work that we have capacity to execute, and if that means picking work up late, then that's what it does.
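
In pseudocode-ish TypeScript, that "stability over accuracy" trade-off boils down to something like this (a simplification of the behaviour described above, not the real claiming code):

```ts
// Claim only as many overdue tasks as there are free workers; anything beyond
// that stays in the queue and simply runs late on a subsequent poll cycle.
function tasksToClaim(overdueTaskIds: string[], maxWorkers: number, runningTasks: number): string[] {
  const freeWorkers = Math.max(0, maxWorkers - runningTasks);
  return overdueTaskIds.slice(0, freeWorkers);
}
```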

We have discussed ideas such as reactively scaling vertically when there are resources available, but we haven't moved ahead with that kind of work yet.

As I said before: Addressing these complexities is of course possible, we just need to agree that it's the priority over other things.
I'm definitely not trying to be a naysayer (I'd love to prioritise this work!), but rather just trying to provide the context as to why these limitations and concerns exist. 😬

  • Alerting is meant to alert users of something happening. In some situations, we're not just alerting someone on a specific cadence. We also want to have a visualization that reflects when this occurs outside of the particulars of alerting. It's not just alerting. It's alerting + data transformation.
  • Alerts would have to look at all the data in a window (interval) to capture all the events while only alerting about one of them if they relate to the same instance (Ex: vehicle being contained in two boundaries).
  • Alerts will waste calculations if multiple alerts monitor the same thing (compared to using a calculated data source)
  • Alerting framework is currently not designed to be a monitoring framework
  • Alert simulations won’t have the data upfront to show the user what would happen

These are all related, I believe.
Generally speaking, and this is just how I have been taught to think about alerting: it should run as a separate system from the monitoring. You want to ensure the detection is quick, otherwise you might be notified of the problem too late while it collects the data.
Keeping alerting separate from the monitoring makes sense in that regard, and the situation we have now, where a delay in monitoring can mean a delay in alerting (due to both using the same task queue), isn't ideal and could cause downstream problems.
Keep in mind I'm not necessarily referring to a delay in alerting on the monitored data. For example, a long running ETL task in Observability could cause an alert in Security to run late. (cc @spong as you were interested in long running tasks).

  • It would still be nice to have alert executors idempotent and not create, update, delete operations when they run. If that was the case, we could do something cool with alert simulation / explain plans by doing dry runs

It does occur to me that if solutions need ETL, they can use Task Manager for it directly (assuming we do some more work to improve scalability), rather than by going via the alerting framework.
This would allow them to make Alert Executors more idempotent...
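
As a concrete example of what "more idempotent" could look like for an ETL-style task: derive the document _id from the source event, so that an "at least once" re-run overwrites the same document instead of producing a duplicate. This is a sketch with made-up index and field names, not an existing API:

```ts
import { createHash } from 'crypto';
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Writing a transformed event with a deterministic _id: re-running the task
// over the same source data re-indexes the same documents, so "at least once"
// execution no longer creates duplicates.
async function writeTransformedEvent(event: { entityId: string; timestamp: string; value: number }) {
  const docId = createHash('sha256')
    .update(`${event.entityId}:${event.timestamp}`)
    .digest('hex');

  await client.index({
    index: 'etl-transformed-events', // illustrative index name
    id: docId,
    body: event,
  });
}
```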

  • Alerting framework isn’t designed to handle a large number of instances at this time

As we documented here:

Kibana Task Manager, like the rest of the Elastic Stack, has been designed to scale horizontally, and we recommend taking advantage of this ability to ensure mission critical services such as Alerting and Reporting always have the capacity they need.

but...

Scaling horizontally requires a higher degree of coordination between Kibana instances.

I think we can improve this by working with the ES team to address some of the limitations and rethinking some of our task ownership strategies.

I've been playing around with coordination methods between Kibana instances (leader elected via a SO or long running ownership of tasks, for example) that would allow us to reduce this kind of coordination. This is a large, complex problem that would require us to think about a wide range of concerns around scheduling, workload balancing, etc.
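
To illustrate the "leader elected via a SO" idea (purely exploratory, not how Task Manager works today; the index and field names here are made up), a lease-style sketch could look like:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

const LEASE_INDEX = 'kibana-coordination'; // hypothetical
const LEASE_ID = 'task-manager-leader';
const LEASE_TTL_MS = 30_000;

// Instances race to create a single lease document; client.create fails with
// a 409 if the document already exists, so only one instance becomes leader.
// Renewal and expiry handling are omitted to keep the sketch short.
async function tryBecomeLeader(kibanaUuid: string): Promise<boolean> {
  try {
    await client.create({
      index: LEASE_INDEX,
      id: LEASE_ID,
      body: { ownerId: kibanaUuid, expiresAt: Date.now() + LEASE_TTL_MS },
    });
    return true; // we hold the lease until it expires
  } catch (e) {
    return false; // another instance already holds the lease
  }
}
```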

  • Alerting framework isn’t designed for alert instances to always be active

There are a few aspects to this beyond the technical (such as the UX around that), but from a scaling standpoint, the main concern here is the constant firing of actions and how this might clog up Task Manager.
We have an issue that could reduce that concern, but it won't reduce the overall load of work on Kibana or the UX concerns.

@gmmorris added the Feature:Alerting/RulesFramework, RFC, non-issue, and Meta labels and removed the RFC and non-issue labels on Jul 1, 2021
@mikecote
Contributor Author

Closing due to lack of activity.

@kobelb added the needs-team label on Jan 31, 2022
@botelastic (bot) removed the needs-team label on Jan 31, 2022