Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Explanation and approach to removing the "siem.notification" saved object and rule type #112209

Closed
FrankHassanabad opened this issue Sep 15, 2021 · 14 comments
Assignees
Labels
discuss Team:Detections and Resp Security Detection Response Team

Comments

@FrankHassanabad
Copy link
Contributor

FrankHassanabad commented Sep 15, 2021

We removed the legacy notification system during part 1 here:
#109722

Which included the code for the rule type of:

siem.notifications

You can see this alert/rule type through the query:

GET .kibana/_search
{
  "query": {
    "term": {
      "alert.alertTypeId": "siem.notifications"
    }
  }
}

Starting in 7.16.0 we now use the proper Kibana alerting notification system for all newly created alerts which does not require us to carry around this special rule type anymore 🎉 .

However, for the siem.notifications alert type we would like to ensure its removal/retirement to:

  • Remove the document altogether from the .kibana index if this is possible.
  • Ensure that there is not a case of ghost firing and errors since the code that backs the siem.notifications is now gone. I have not tested that ghost fires would happen on an upgrade but I want to ask is it possible? At worst if so, I would expect just errors of the form, has resulted in Error: Rule type "siem.notifications" is not registered. but those errors might repeat endlessly. I have not seen this on my system since I removed the code fwiw locally.
  • We do not want future code/migrations to do something unexpected by keeping this old type and document around by doing broad manipulations of alerting during migrations.

Hence we feel it's important enough to try to do a removal of this saved object during an upgrade to 7.16.0+

I do not know if there is a suggested/recommended way for removing alert types that are no longer used or if there is a kibana core way that is recommended to remove saved objects in general. We have seen this slice of code:
https://github.com/elastic/kibana/blob/master/src/core/server/saved_objects/types.ts#L285

Called excludeOnUpgrade as one viable option and read through the reasons why it was added for actions and task manager.

However it appears to run on each upgrade and that you cannot target a specific upgrade meaning it would run on each and every upgrade. Also it looks like you can only have one instance of it for your entire saved object which means if you have multiple rule types in the future you have to keep appending/changing the query or create a type of "append" query pipe specific to alert types you want to retire and then provide us a way to plug into that.

But taking the query above of:

{
  "query": {
    "term": {
      "alert.alertTypeId": "siem.notifications"
    }
  }
}

And making a pull request to add that to the excludeOnUpgrade for the alerting saved object would on upgrades remove our siem.notifications

For task manager it does spawn the task as seen with this query:

{
  "query": {
    "term": {
      "task.taskType" : "alerting:siem.notifications"
    }
  }
}

And we would like to add this query to the task manager's existing query to remove tasks/clean up tasks involving the siem notifications to avoid issues unless there is a better suggested way to let task manager clean things up.

Approaches:

🌟 1. We could do nothing for 7.16.0. This seems the safest and if we do not see repeating error messages our risks are just on other upgrades something goes wrong with the dead SO objects such as getting in the way of some other migration. This seems a low risk though. From conversations below and through other communications it seems like these would be just left over dead SO objects. Later when the excludeOnUpgrade is either solidified or another one is added we would then complete this work. If kibana-alerting makes a more generic routine way for us to plug into then we don't have to do any work on our side.
2. We could use the excludeOnUpgrade but with the understanding that it will go away in 8.0.0 and be replaced with something else we migrate to. Risks are that excludeOnUpgrade runs on each upgrade cycle. Also we have to make a PR against alerting and push it together with any others they already have for their task SO and their alert SO's.
3. We could use a task manager and during our plugin "start up" run a one-off task that will clean these up for us. Risk is we are maintaining code that could get in the way later or malfunction and remove all tasks or something silly. We would have to write more code for this effort or utilize our small startup framework migration we have in draft mode for this work.

So far it seems like most people are leaning towards approach 🌟 1 unless during an upgrade or testing we do see error messages repeating.

@FrankHassanabad FrankHassanabad added the Team:Detections and Resp Security Detection Response Team label Sep 15, 2021
@FrankHassanabad FrankHassanabad self-assigned this Sep 15, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@ymao1
Copy link
Contributor

ymao1 commented Sep 15, 2021

Ensure that there is not a case of ghost firing and errors since the code that backs the siem.notifications is now gone.

Task manager will mark unregistered task types as 'unrecognized' and stop claiming them. Here is the PR that introduced that behavior: #84273

So if you removed the task definition for siem.notifications, it should no longer fire.

@ymao1
Copy link
Contributor

ymao1 commented Sep 15, 2021

We have some precedent for using a task manager task to clean up saved objects. Here is the PR for that: #96971

The reason we did this was because previously, if an action execution failed, we kept the action task and its associated action_task_param around in case of retries. This would lead to a large buildup of these types of saved objects. @mikecote introduced this PR to run a background task to clean up these saved objects (This was before the excludeOnUpgrade work was done).

Perhaps a similar approach could be taken to clean up these siem.notifications tasks?

@FrankHassanabad
Copy link
Contributor Author

From discussions, if we delay this work past 7.16.0 (or not ever) the expected behaviors would be that we would have at most 3 dead Saved Objects and task manager would stop doing claims against siem.notifications and we would not get forever error messages. Just at most a few at startup.

@FrankHassanabad FrankHassanabad changed the title [Alerting] Explanation and approach to removing the siem.notification saved object and rule type [Alerting] Explanation and approach to removing the "siem.notification" saved object and rule type Sep 15, 2021
@rudolf
Copy link
Contributor

rudolf commented Sep 15, 2021

Reading through past discussions it seems like we introduced excludeOnUpgrade primarily to avoid upgrade downtime due to migrations taking very long because of unneeded data. Because of the urgency this API wasn't implemented quite the way we would want to design it. Improving it is tracked in #106991 but given the alerting team's progress, there's no longer a need for this API so our plan seems to be to remove this in 8.0 instead of fixing/improving this API.

Given that we expect there to be only 3 saved objects, conceptually this fits in better with a task that does cleanup after Kibana has started, there's no strong reason to do this during the migration.

Pragmatically, even though we're removing the API in 8.0 we could use this in 7.16 to cleanup siem.notification which because of the upgrade path to 8.0 would mean we can continue to remove this API in 8.0.

Would be interested to hear what the rest of @elastic/kibana-core thinks.

@pmuellr
Copy link
Member

pmuellr commented Sep 15, 2021

Perhaps a similar approach (using a task manager task to clean up saved objects.) could be taken to clean up these siem.notifications tasks?

Ya, seems like that's the safest thing to do, and maybe it's time to have some sort of generalized "cleanup" task for alerting, to handle multiple types of things, and hopefully only run once per migration every Kibana version (or on a very long interval).

@kobelb
Copy link
Contributor

kobelb commented Sep 16, 2021

Reading through past discussions it seems like we introduced excludeOnUpgrade primarily to avoid upgrade downtime due to migrations taking very long because of unneeded data. Because of the urgency this API wasn't implemented quite the way we would want to design it. Improving it is tracked in #106991 but given the alerting team's progress, there's no longer a need for this API so our plan seems to be to remove this in 8.0 instead of fixing/improving this API.

@rudolf why would we remove this API in 8.0? Users are able to upgrade from 7.x to 8.x without going through 7.last. Wouldn't we put these upgrades at risk?

@mikecote
Copy link
Contributor

mikecote commented Sep 16, 2021

Reading through past discussions it seems like we introduced excludeOnUpgrade primarily to avoid upgrade downtime due to migrations taking very long because of unneeded data. Because of the urgency this API wasn't implemented quite the way we would want to design it. Improving it is tracked in #106991 but given the alerting team's progress, there's no longer a need for this API so our plan seems to be to remove this in 8.0 instead of fixing/improving this API.

@rudolf why would we remove this API in 8.0? Users are able to upgrade from 7.x to 8.x without going through 7.last. Wouldn't we put these upgrades at risk?

@kobelb Further context can be found here: #106991 (comment).

@FrankHassanabad
Copy link
Contributor Author

FrankHassanabad commented Sep 16, 2021

Thanks everyone so far, unless there is any strong objections to the current 🌟 approach of 1. which is delay the work and not do it for 7.16.0 I think from our end we are completely fine with the approach and that is our current choice or at the very least this is moved to the bottom of our 7.16.0 backlog.

But please! Keep talking, and we will keep reading.

@pgayvallet
Copy link
Contributor

Reading through past discussions it seems like we introduced excludeOnUpgrade primarily to avoid upgrade downtime due to migrations taking very long because of unneeded data.

That's true, OTOH the second main usage of unusedTypesQuery is also to filter out unknown doc types, so in a way it's already used for cleaning purposes.

This is definitely not a great way to handle such cleanup but AFAIK atm thats the only one we got.

Pragmatically, even though we're removing the API in 8.0 we could use this in 7.16 to cleanup siem.notification which because of the upgrade path to 8.0 would mean we can continue to remove this API in 8.0.

If we were to use this API to let type owners register 'cleanup' tasks, we could just make this more official and just keep this API for this specific purpose on 8+?

@rudolf
Copy link
Contributor

rudolf commented Sep 21, 2021

If we were to use this API to let type owners register 'cleanup' tasks, we could just make this more official and just keep this API for this specific purpose on 8+?

Yeah, it's impossible to remove the risk that plugins at some point create huge amount of saved objects and we've seen this happen at least 3 times during 7.x. We could hack a fix into core the next time there's a bug, but having an API that makes it easy to prevent upgrade downtime seems worth keeping.

So in the short term it's maybe more a question of priorities, is it worth spending time on this now given that there's only a few teams that could/would use this.

@pgayvallet
Copy link
Contributor

So in the short term it's maybe more a question of priorities, is it worth spending time on this now given that there's only a few teams that could/would use this.

The API is already in place. What needs to be done apart from deciding to keep it in 8.0+?

@mikecote
Copy link
Contributor

@rylnd is this issue still necessary given your recent update here: #112327 (comment)?

@rylnd
Copy link
Contributor

rylnd commented Sep 22, 2021

@rylnd is this issue still necessary given your recent update here: #112327 (comment)?

@mikecote I think not. We've committed to maintaining these legacy saved objects for now, and migrating them to proper actions as users touch the rules. I believe this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Team:Detections and Resp Security Detection Response Team
Projects
None yet
Development

No branches or pull requests

10 participants