Is the alerting framework the solution for ETL? #92197

Closed
mikecote opened this issue Feb 22, 2021 · 5 comments
Labels
discuss, Feature:Alerting/RulesFramework, Feature:Alerting, Meta, Team:ResponseOps

Comments

@mikecote
Contributor

I've noticed a few places using the alerting framework as a solution to do ETL (Maps, Security Detections, etc.). I created this issue to discuss where we draw the line with the Kibana Alerting framework.

Some alert types leverage the alerting framework to do some data processing for other alerts to use as inputs (see below).
[Screenshot: Screen Shot 2021-02-22 at 9 49 25 AM]

The question is, is our framework the place to do ETL? If so, should we architect something to do ETL (example below)?
[Screenshot: Screen Shot 2021-02-22 at 9 51 00 AM]

Some of the issues we have today:

  • Alerts can only run so frequently without causing contention
  • They're currently "at least once", so duplicate documents are possible
  • They are inherently delayed
  • Alerting is meant to alert users of something happening. In some situations, we're not just alerting someone on a specific cadence. We also want to have a visualization that reflects when this occurs outside of the particulars of alerting. It's not just alerting. It's alerting + data transformation.
  • Alerts would have to look at all the data in a window (interval) to capture all the events while only alerting about one of them if they relate to the same instance (Ex: vehicle being contained in two boundaries; see the sketch after this list).
  • Alerts will waste calculations if multiple alerts monitor the same thing (compared to using a calculated data source)
  • Alert simulations won’t have the data upfront to show the user what would happen
  • Alerting framework is currently not designed to be a monitoring framework
  • It would still be nice to have alert executors idempotent and not create, update, delete operations when they run. If that was the case, we could do something cool with alert simulation / explain plans by doing dry runs
  • A lot of data would be travelling between Kibana and Elasticsearch (I’m sure there are costs on Cloud for this?)
  • Alert instances don’t know exactly when something happened, but they know when it was detected
  • Alerting framework isn’t designed to handle a large number of instances at this time
  • Alerting framework isn’t designed for alert instances to always be active
  • Alternatives lack alerting functionality (ex: task manager w/ API keys and UI forms)
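
To make the windowing point above concrete, here is a minimal sketch of the kind of grouping an alert executor would need to do; the types and field names are illustrative, not an existing Kibana API:

```ts
// Hypothetical event shape for the "vehicle contained in two boundaries" example.
interface GeoEvent {
  vehicleId: string;
  boundaryId: string;
  timestamp: string;
}

// Group every event in the window by instance (vehicle), so the executor can
// look at all events but raise at most one alert instance per vehicle.
function groupEventsByInstance(eventsInWindow: GeoEvent[]): Map<string, GeoEvent[]> {
  const byVehicle = new Map<string, GeoEvent[]>();
  for (const event of eventsInWindow) {
    const existing = byVehicle.get(event.vehicleId) ?? [];
    existing.push(event);
    byVehicle.set(event.vehicleId, existing);
  }
  return byVehicle;
}
```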
@mikecote added the discuss, Feature:Alerting, and Team:ResponseOps labels on Feb 22, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor Author

I also wondered why we allow alerts not to have any actions. 🤔

@gmmorris
Contributor

gmmorris commented Feb 23, 2021

Thanks for posting this issue Mike, it's something I've been a little worried about for a while.
If TM is being used for ETL, then I think we need to prioritise improving its scalability, how it handles scheduling/timing, the priority of different task types, etc.

It doesn't make sense for long running ETL tasks to keep notifications of detections in the ETLed data from firing, for example, but that could easily happen today.

I also wondered why we allow alerts not to have any actions. 🤔

I think that's a separate question tbh.
Once we have a richer experience in Alert Details, I can totally see an analyst wanting to evaluate whether a certain detection makes sense before attaching behaviour (actions) to it.

I don't think preview is quite enough for that if you want to leave a few detections running in parallel for a while and compare them later.
Thinking back to my last job, that would have been super useful and something we lacked in Splunk.

@gmmorris
Contributor

gmmorris commented Feb 23, 2021

I just thought I'd add some context that might be missing for people less familiar with Task Manager. 😄

Some of the issues we have today:

  • Alerts can only run so frequently without causing contention

I recently documented how these things work, and what the default scale of TM is.

From the docs:

By default Kibana polls for tasks at a rate of 10 tasks every 3 seconds.
This means that if many tasks have been scheduled to run at the same time, pending tasks will queue in Elasticsearch. Each Kibana instance then polls for pending tasks at a rate of up to 10 tasks at a time, at 3 second intervals. It is possible for pending tasks in the queue to exceed this capacity and run late as a result.

This means a long running ETL task might take up a slot for an extended period of time.
If we end up with many such ETL tasks, this could easily clog up the queue.
It's also very easy for a naive user to configure a bunch of alerts with a 1s interval, essentially clogging up the entire system.
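
To put rough numbers on that, here is a back-of-the-envelope calculation using the documented defaults (10 tasks per poll, 3 second poll interval); this is a sketch of the ceiling, not a precise model of Task Manager:

```ts
// Upper bound on task throughput per Kibana instance with default settings.
const maxTasksPerPoll = 10;   // tasks claimed per poll cycle
const pollIntervalMs = 3000;  // time between poll cycles

const tasksPerSecond = maxTasksPerPoll / (pollIntervalMs / 1000); // ≈ 3.3
const tasksPerMinute = tasksPerSecond * 60;                       // ≈ 200

console.log(`~${tasksPerMinute} task executions per minute, per Kibana instance`);
// A single long running ETL task holds one of those 10 slots for its entire
// duration, so a handful of them noticeably lowers this ceiling.
```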

We can always push these default numbers higher, but that comes at a cost.

Scaling a system is always hard... doing so in a general purpose manner that is adaptive is even harder.
Addressing this is of course possible, we just need to agree that it's the priority. 🤔

  • They are inherently delayed

Tasks are scheduled for a certain time and it's only after they exceed that time that they are picked up by one of the Kibana instances.
As Task Manager is designed to ensure no two Kibana instances will run the same task in parallel, we are forced into using Elasticsearch as a queue (and as a locking mechanism) to coordinate which Kibana will claim a task and execute it.
This process repeats at each polling interval, and is only applied to tasks that have already expired (as in, their scheduled time to run is in the past).
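
For anyone unfamiliar with what that coordination looks like, here's a heavily simplified sketch of the claiming idea using Elasticsearch's optimistic concurrency control. The real claim cycle uses an update-by-query over the task index, so treat this as an illustration of the principle rather than the actual implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Only one Kibana instance can "win" a task, because the update below is
// conditional on the seq_no/primary_term observed when the task was read.
async function claimTask(taskId: string, kibanaUuid: string): Promise<boolean> {
  const { body: task } = await client.get({ index: '.kibana_task_manager', id: taskId });

  if (task._source.task.status !== 'idle') {
    return false; // already claimed or running
  }

  try {
    await client.update({
      index: '.kibana_task_manager',
      id: taskId,
      if_seq_no: task._seq_no,
      if_primary_term: task._primary_term,
      body: { doc: { task: { status: 'claiming', ownerId: kibanaUuid } } },
    });
    return true; // we claimed it; this instance will run the task
  } catch (e) {
    return false; // 409 version conflict: another instance claimed it first
  }
}
```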

We could in theory pick tasks up preemptively, but there's a lot of complexity to that:

  1. We don't know how long it'll take to get through tasks that have already been picked up. Do we pick up new tasks preemptively? What if mine run over and another Kibana instance is free? We'd be unintentionally late because we preempted.
  2. What if a Kibana instance claims a task early and then crashes before it had the chance to run that task?
  3. etc.

At first glance these seem like simple problems, but in fact they are complex due to the distributed nature of Kibana and the requirements at play here.

Another thing worth understanding is that Task Manager prioritises system stability over schedule accuracy.
We only pick up work that we have capacity to execute, and if that means picking work up late, then that's what it does.
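
In pseudocode-ish TypeScript, that "stability over accuracy" trade-off boils down to something like this (a simplification of the behaviour described above, not the real claiming code):

```ts
// Claim only as many overdue tasks as there are free workers; anything beyond
// that stays in the queue and simply runs late on a subsequent poll cycle.
function tasksToClaim(overdueTaskIds: string[], maxWorkers: number, runningTasks: number): string[] {
  const freeWorkers = Math.max(0, maxWorkers - runningTasks);
  return overdueTaskIds.slice(0, freeWorkers);
}
```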

We have discussed ideas such as reactively scaling vertically when there are resources available, but we haven't moved ahead with that kind of work yet.

As I said before: Addressing these complexities is of course possible, we just need to agree that it's the priority over other things.
I'm definitely not trying to be a naysayer (I'd love to prioritise this work!), but rather just trying to provide the context as to why these limitations and concerns exist. 😬

  • Alerting is meant to alert users of something happening. In some situations, we're not just alerting someone on a specific cadence. We also want to have a visualization that reflects when this occurs outside of the particulars of alerting. It's not just alerting. It's alerting + data transformation.
  • Alerts would have to look at all the data in a window (interval) to capture all the events while only alerting about one of them if they relate to the same instance (Ex: vehicle being contained in two boundaries).
  • Alerts will waste calculations if multiple alerts monitor the same thing (compared to using a calculated data source)
  • Alerting framework is currently not designed to be a monitoring framework
  • Alert simulations won’t have the data upfront to show the user what would happen

These are all related, I believe.
Generally speaking, and this is just how I have been taught to think about alerting: it should run as a separate system from the monitoring. You want to ensure the detection is quick, otherwise you might be notified of the problem too late while it collects the data.
Keeping alerting separate from the monitoring makes sense in that regard, and the situation we have now, where a delay in monitoring can mean a delay in alerting (due to both using the same task queue), isn't ideal and could cause downstream problems.
Keep in mind I'm not necessarily referring to a delay in alerting on the monitored data. For example, a long running ETL task in Observability could cause an alert in Security to run late. (cc @spong as you were interested in long running tasks).

  • It would still be nice to have alert executors idempotent and not create, update, delete operations when they run. If that was the case, we could do something cool with alert simulation / explain plans by doing dry runs

It does occur to me that if solutions need ETL, they can use Task Manager for it directly (assuming we do some more work to improve scalability), rather than by going via the alerting framework.
This would allow them to make Alert Executors more idempotent...
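
As a concrete example of what "more idempotent" could look like for an ETL-style task: derive the document _id from the source event, so that an "at least once" re-run overwrites the same document instead of producing a duplicate. This is a sketch with made-up index and field names, not an existing API:

```ts
import { createHash } from 'crypto';
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Writing a transformed event with a deterministic _id: re-running the task
// over the same source data re-indexes the same documents, so "at least once"
// execution no longer creates duplicates.
async function writeTransformedEvent(event: { entityId: string; timestamp: string; value: number }) {
  const docId = createHash('sha256')
    .update(`${event.entityId}:${event.timestamp}`)
    .digest('hex');

  await client.index({
    index: 'etl-transformed-events', // illustrative index name
    id: docId,
    body: event,
  });
}
```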

  • Alerting framework isn’t designed to handle a large number of instances at this time

As we documented here:

Kibana Task Manager, like the rest of the Elastic Stack, has been designed to scale horizontally, and we recommend taking advantage of this ability to ensure mission critical services such as Alerting and Reporting always have the capacity they need.

but...

Scaling horizontally requires a higher degree of coordination between Kibana instances.

I think we can improve this by working with the ES team to address some of the limitations and rethinking some of our task ownership strategies.

I've been playing around with coordination methods between Kibana instances (leader elected via a SO or long running ownership of tasks, for example) that would allow us to reduce this kind of coordination. This is a large, complex problem that would require us to think about a wide range of concerns around scheduling, workload balancing, etc.
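
To illustrate the "leader elected via a SO" idea (purely exploratory, not how Task Manager works today; the index and field names here are made up), a lease-style sketch could look like:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

const LEASE_INDEX = 'kibana-coordination'; // hypothetical
const LEASE_ID = 'task-manager-leader';
const LEASE_TTL_MS = 30_000;

// Instances race to create a single lease document; client.create fails with
// a 409 if the document already exists, so only one instance becomes leader.
// Renewal and expiry handling are omitted to keep the sketch short.
async function tryBecomeLeader(kibanaUuid: string): Promise<boolean> {
  try {
    await client.create({
      index: LEASE_INDEX,
      id: LEASE_ID,
      body: { ownerId: kibanaUuid, expiresAt: Date.now() + LEASE_TTL_MS },
    });
    return true; // we hold the lease until it expires
  } catch (e) {
    return false; // another instance already holds the lease
  }
}
```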

  • Alerting framework isn’t designed for alert instances to always be active

There are a few aspects to this beyond the technical (such as the UX around that), but from a scaling standpoint, the main concern here is the constant firing of actions and how this might clog up Task Manager.
We have an issue that could reduce that concern, but it won't reduce the overall load of work on Kibana or the UX concerns.

@gmmorris added the Feature:Alerting/RulesFramework, RFC, non-issue, and Meta labels and removed the RFC and non-issue labels on Jul 1, 2021
@mikecote
Contributor Author

Closing due to lack of activity.

@kobelb added the needs-team label on Jan 31, 2022
@botelastic (bot) removed the needs-team label on Jan 31, 2022