[Security Solution][Detections] Migrate from ruleStatusSavedObjectType to Alerting Event Log for Rule Monitoring #91265

Closed
spong opened this issue Feb 12, 2021 · 12 comments · Fixed by #121644
Assignees
Labels
enhancement New value added to drive a business result Feature:Detection Rules Anything related to Security Solution's Detection Rules Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. technical debt Improvement of the software architecture and operational architecture v8.1.0

Comments

@spong
Member

spong commented Feb 12, 2021

Same as: #83235

Currently the Detection Engine uses a few separate SOs (saved objects) for managing additional state, sometimes as a stopgap while other solutions were under development. All SOs managed by Detections are described in this comment: #60053 (comment).

One of these SOs, the ruleStatusSavedObjectType, was created to store the last 5 rule failures while support for additional monitoring was being added to the alerting framework. This SO is used to display a Rule's most recent failures on the Rule Details page, and is also joined with the Rule SO for display within the Rules and Monitoring tables. That join quickly becomes a performance issue as the number of rules grows, given the limitations of the SO client.

Since the original implementation, the alerting framework has gained a dedicated index called the Event Log for writing Rule (alert) state. In recent discussions with the @elastic/kibana-alerting-services team it sounds like we should be able to migrate from our sidecar SO model of managing status to leveraging the Event Log. In doing so we should be able to remove the maintenance burden of managing another SO type, increase performance of the Rules/Monitoring tables by optimizing the join, and better prepare the Detection Engine for the unified alerting architecture.

Useful notes from our meeting with the alerting team:

  • The Event Log's security model requires that you provide the SO you're trying to write/retrieve the event for. As of 7.11, bulk retrieval is possible by sending an array of SOs.
  • Potential for us to leverage the existing status field on the Alert SO so we don't have to do a join when fetching records for the Rules/Monitoring tables.
  • The Event Log can be written to from within the execution context, or outside. Logs are scheduled and written shortly after.
  • Schema covers base fields, but may be expanded if needed (will need to determine all necessary monitoring fields to adapt from the ruleStatusSavedObjectType)
ruleStatusSavedObjectType

export const type: SavedObjectsType = {
  name: ruleStatusSavedObjectType,
  hidden: false,
  namespaceType: 'single',
  mappings: ruleStatusSavedObjectMappings,
};

Event Log PR's

https://github.com/elastic/kibana/pulls?q=is%3Apr+label%3AFeature%3AAlerting+Event+Log+is%3Aclosed

How to remove existing SO's (mapping and existing objects on upgrade)

I checked with the platform team and this was our guidance for removing the existing SO:

Your plugin should stop registering the saved object type which will remove its mappings.
Then you need to add the type to this block list:

'apm-services-telemetry',
'background-session',
'cases-sub-case',
'file-upload-telemetry',
// https://github.com/elastic/kibana/issues/91869
'fleet-agent-events',
// Was removed in 7.12
'ml-telemetry',
'server',
// https://github.com/elastic/kibana/issues/95617
'tsvb-validation-telemetry',
// replaced by osquery-manager-usage-metric
'osquery-usage-metric',
// Was removed in 7.16
'timelion-sheet',

On the next migration, all saved objects of this type won't be migrated over to the new version index
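
To make the effect of that guidance concrete, here is a minimal sketch of how a block list of removed types could drop documents during a migration. This is not the actual Kibana core migration code: `RawDoc`, `REMOVED_TYPES` and `dropRemovedTypes` are hypothetical names, and the set below is only an excerpt of the list above plus the type this issue proposes to remove.

```typescript
// Hypothetical sketch; the real logic lives in Kibana core and may differ.
interface RawDoc {
  id: string;
  type: string;
}

// Excerpt of the block list above, plus the legacy rule status type (assumed addition).
const REMOVED_TYPES: ReadonlySet<string> = new Set([
  'fleet-agent-events',
  'timelion-sheet',
  'siem-detection-engine-rule-status',
]);

// Keeps only documents whose type is not on the block list.
function dropRemovedTypes(docs: RawDoc[]): RawDoc[] {
  return docs.filter((doc) => !REMOVED_TYPES.has(doc.type));
}

const migrated = dropRemovedTypes([
  { id: 'a', type: 'alert' },
  { id: 'b', type: 'siem-detection-engine-rule-status' },
]);
console.log(migrated.map((doc) => doc.id)); // only 'a' is carried over
```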

@spong spong added enhancement New value added to drive a business result technical debt Improvement of the software architecture and operational architecture Feature:Detection Rules Anything related to Security Solution's Detection Rules Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v7.13.0 labels Feb 12, 2021
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@banderror
Contributor

banderror commented Feb 15, 2021

Some notes after formalizing our requirements and checking the event_log plugin.

Our data model

Currently, we don't use any built-in features in Alerting to represent our detection rule execution statuses. We have custom statuses stored as Saved Objects in a separate SO of type siem-detection-engine-rule-status.

These are attributes of this SO (field types are simplified for readability):

interface IRuleStatusSOAttributes {
  alertId: string;
  status: 'going to run' | 'succeeded' | 'failed' | 'partial failure';
  statusDate: string;
  lastFailureAt: string;
  lastFailureMessage: string;
  lastSuccessAt: string;
  lastSuccessMessage: string;
  lastLookBackDate: string;
  gap: string;
  bulkCreateTimeDurations: string[];
  searchAfterTimeDurations: string[];
}

Our needs

We need to be able to store and fetch 2 types of entities: current rule execution status (or state if you will), and rule execution log (history of statuses or even better, a more generic log).

Current rule execution status. We need to be able to:

  • Store a custom status object (with arbitrary fields known to Security Solution) bound to a rule (alert).
  • Retrieve it when fetching a single rule.
    • with the rule itself (alertsClient.getRule) (btw there could be separate alertsClient.getRuleParams)
    • separately (alertsClient.getRuleState)
  • Retrieve N current statuses when fetching N rules at once - in a single query, 1 current status per each rule.
    • rules + execution states all at once (alertsClient.find, alertsClient.findRules etc)
    • separately (alertsClient.bulkGetRuleStates)
  • Update it from the executor function.
    • return state from the function?
  • Update it not only as a result of the rule execution, but also in the middle of the execution, maybe even multiple times.
    • Smth like services.currentState.update()
    • I need to sync with my team on how important this is for us.
  • Update it outside of the executor function.
    • I need to sync with my team on how important this is for us.
  • Filter and sort by fields of the current status object. Maybe even search.
  • Migrate it when the schema changes.

Rule execution log. We need to be able to:

  • Store a collection of custom objects (with arbitrary fields known to Security Solution) bound to a rule (alert).
  • Retrieve the collection when fetching a single rule (fetch by rule id).
  • Paginate this log. Sort by date/timestamp. Sort and filter by custom fields.
  • Set the number of items in the collection. Optional, nice to have.
  • Migrate it when the schema changes.

Some things to clarify

Alerting-related questions:

  • Can we leverage "alert type" and/or "alert instance" states for storing the current rule status?
  • How do they work? Where are they stored in Elasticsearch?
  • Can we fetch 1 current state when fetching 1 rule?
  • Can we bulk fetch N current states when fetching N rules?

Event log-related questions:

  • How do we get an event log scoped to a single rule?
  • Why are events in the Event Log limited to ECS fields?
  • How do we store and retrieve custom arbitrary fields?

Data model TO BE

Roughly, as I would imagine at this point:

interface IRuleExecutionState {
  status: 'going to run' | 'succeeded' | 'failed' | 'partial failure';
  statusDate: string;
  statusMessage: string;
  lastFailureAt: string;
  lastFailureMessage: string;
  lastSuccessAt: string;
  lastSuccessMessage: string;
  lastLookBackDate: string;
  gap: string;
  bulkCreateTimeDurations: string[];
  searchAfterTimeDurations: string[];
}

// This represents "status updated" event.
// Probably we'd like to log other events as well, with other fields.
interface IRuleExecutionLogEntry {
  status: 'going to run' | 'succeeded' | 'failed' | 'partial failure' | null | undefined;
  statusDate: string;
  statusMessage: string | null | undefined;
}
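
As a rough illustration of how the two models relate, the sketch below folds a list of log entries into a "current state" object. It is only an assumption about how this could work, not an agreed design: `LogEntry`, `DerivedState` and `deriveState` are hypothetical names, and treating only `'failed'` and `'succeeded'` as updating the last failure/success fields is my simplification.

```typescript
type RuleStatus = 'going to run' | 'succeeded' | 'failed' | 'partial failure';

interface LogEntry {
  status: RuleStatus;
  statusDate: string; // ISO 8601 UTC timestamp
  statusMessage: string;
}

interface DerivedState extends LogEntry {
  lastFailureAt?: string;
  lastFailureMessage?: string;
  lastSuccessAt?: string;
  lastSuccessMessage?: string;
}

// Replays the log in chronological order; the last entry wins for the status fields.
function deriveState(entries: LogEntry[]): DerivedState | undefined {
  // ISO 8601 UTC timestamps in the same format compare correctly as plain strings.
  const sorted = [...entries].sort((a, b) =>
    a.statusDate < b.statusDate ? -1 : a.statusDate > b.statusDate ? 1 : 0
  );
  let state: DerivedState | undefined;
  for (const entry of sorted) {
    const next: DerivedState = { ...state, ...entry };
    if (entry.status === 'failed') {
      next.lastFailureAt = entry.statusDate;
      next.lastFailureMessage = entry.statusMessage;
    }
    if (entry.status === 'succeeded') {
      next.lastSuccessAt = entry.statusDate;
      next.lastSuccessMessage = entry.statusMessage;
    }
    state = next;
  }
  return state;
}

const sampleLog: LogEntry[] = [
  { status: 'going to run', statusDate: '2021-02-15T10:10:00Z', statusMessage: '' },
  { status: 'failed', statusDate: '2021-02-15T10:00:00Z', statusMessage: 'timeout' },
  { status: 'succeeded', statusDate: '2021-02-15T10:05:00Z', statusMessage: 'ok' },
];
const current = deriveState(sampleLog);
console.log(current?.status, current?.lastFailureAt, current?.lastSuccessAt);
```

In a real implementation this folding would more likely be done with an Elasticsearch aggregation than in memory.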

@banderror
Contributor

So we talked about all the points mentioned above with the Alerting team and then synced within our Detection and Response team. I tried to reconcile all info about available options, our needs, decisions and action items. Here's a summary. Trying to be concise, still there's a lot to read, sorry about that.

If you have any comments, please post here in the ticket.

@dhurley14 @FrankHassanabad @gmmorris @mikecote @peluja1012 @pmuellr @spong

Option 1: alert state

Rule (alert) state can be used for storing the current rule execution status (see above comment), as well as a list of the 5 latest failures and other data if needed. We don't currently use alert state in detection rules. You can read about it in the docs, see "alert type level state".

Filtering, sorting, searching over state:

  • This state is stored in the task manager index, separately from the alert object which is in the .kibana index.

    # Detection rule execution states (native alert states)
    # See "_source.task.state".
    GET .kibana_task_manager/_search
    {
      "query": {
        "bool": {
          "filter": [
            {
              "match": {
                "task.taskType": "alerting:siem.signals"
              }
            }
          ]
        }
      }
    }
    
  • This state is just a string. Proper filtering, sorting, and searching is not possible by fields of this state.

  • Alerting team is not working on making it filterable. E.g. Make alert params searchable #50213 is only about params, but not state.

  • We can ask Alerting to implement this feature. Effort: high.

Fetching execution states:

  • There's an existing API for fetching state of a single rule by its id: AlertsClient.getAlertState({ id })
  • We'd need at least these new APIs:
    • API for bulk fetching N states of N rules by rule ids. Something like AlertsClient.bulkGetRuleStates.
    • API for bulk fetching N rules themselves by rule ids. Because they have their own native execution statuses, and we already merge them with our custom statuses to generate a final API response.
  • Optionally, more APIs could be useful: find rules + their states in a single method, etc.
  • Alerting team is OK to add these new APIs for us. Effort: low.

Updating execution states:

  • Currently, update is only supported via a return value from the executor function. So the state can be updated only once at the end of the rule execution.
  • Multiple updates during the rule execution are not supported atm. E.g., when a rule is running for a long time, and we want to be able to update its state in the middle of the execution, so the user could see the updated state in the UI even if the rule is still running. This is not supported via state.
  • Multiple updates from the executor API standpoint could look like using the injected services.currentState.update().
  • We can ask Alerting to implement this feature. Effort: ?
  • Updates from the outside of the executor function are not supported. Nor really needed. When we enable a rule via our API, we also update its custom status to going to run. Instead of doing this, we could just read the native alert.executionStatus. Supported native statuses: ['ok', 'active', 'error', 'pending', 'unknown']. After enabling a rule it transitions to a pending status which we could map to going to run on our side.
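
The mapping from native statuses to our custom ones, mentioned in the last bullet above, could look roughly like this. Only `pending` → `going to run` comes from the discussion; the other cases are my assumptions for illustration, and `toCustomStatus` is a hypothetical helper.

```typescript
type NativeStatus = 'ok' | 'active' | 'error' | 'pending' | 'unknown';
type CustomStatus = 'going to run' | 'succeeded' | 'failed' | 'partial failure';

// Maps alert.executionStatus values to detection rule statuses.
function toCustomStatus(native: NativeStatus): CustomStatus | undefined {
  switch (native) {
    case 'pending':
      return 'going to run'; // stated in the discussion above
    case 'ok':
    case 'active':
      return 'succeeded'; // assumption
    case 'error':
      return 'failed'; // assumption
    case 'unknown':
      return undefined; // assumption: no custom equivalent
  }
}

console.log(toCustomStatus('pending'));
```

Note that nothing in this mapping produces `partial failure`, so that status would still need a source of its own.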

Option 2: event log

Event Log is the new API created by the Alerting team:

  • A separate plugin x-pack/plugins/event_log. See README.
  • alerts and alerting framework agnostic (almost).

It can be used to store the rule execution log (status history etc, see above comment). Moreover, in theory, with proper support for aggregations, it could be used to retrieve the current rule execution status and other "current"-like information.

  • Events logged via the log must conform to ECS. Currently - a subset of ECS. Alerting team can expand the schema to other ECS fields if needed.
  • Events logged can carry custom objects. Their schema is currently hardcoded and rigid.
  • Currently, it's not possible to store and retrieve custom arbitrary fields. We can ask Alerting to implement this feature. Effort: ?
  • Currently, there's only findEventsBySavedObjectIds API. It has some support for filtering and sorting.
  • Currently, no support for aggregations, searching, filtering can be limited (filter is a string - KQL?), sorting is limited.
  • It's possible to scope a log to a single rule by logging events and specifying rule id via kibana.saved_objects[].id.
  • Alerting team is willing to develop this plugin further and they are good with accepting feature requests. Because event log uses a separate index, everything is possible in theory.
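
To show what scoping by `kibana.saved_objects[].id` means in practice, here is a small in-memory sketch. The document shape is a simplified assumption, not the actual event log schema; `buildStatusEvent`, `eventsForRule`, the `'status-change'` action name and the `'alert'` saved object type are all hypothetical choices for illustration.

```typescript
interface SavedObjectRef {
  type: string;
  id: string;
}

// Simplified, assumed shape of an ECS-like event; the real schema has more fields.
interface ExecutionEvent {
  '@timestamp': string;
  message: string;
  event: { action: string };
  kibana: { saved_objects: SavedObjectRef[] };
}

// Builds an event that references the rule it belongs to.
function buildStatusEvent(ruleId: string, message: string, timestamp: string): ExecutionEvent {
  return {
    '@timestamp': timestamp,
    message,
    event: { action: 'status-change' }, // hypothetical action name
    kibana: { saved_objects: [{ type: 'alert', id: ruleId }] },
  };
}

// Returns only the events that reference the given rule id.
function eventsForRule(events: ExecutionEvent[], ruleId: string): ExecutionEvent[] {
  return events.filter((e) => e.kibana.saved_objects.some((so) => so.id === ruleId));
}

const allEvents = [
  buildStatusEvent('rule-1', 'succeeded', '2021-02-17T10:00:00Z'),
  buildStatusEvent('rule-2', 'failed', '2021-02-17T10:01:00Z'),
  buildStatusEvent('rule-1', 'failed', '2021-02-17T10:05:00Z'),
];
console.log(eventsForRule(allEvents, 'rule-1').map((e) => e.message));
```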

Our needs

As a minimum, now we need:

  • Fast and efficient retrieval of 1 current execution state of 1 rule (rule details page).
  • Fast and efficient retrieval of N current execution states of N rules (rule management table).
  • Fast and efficient retrieval of N rules by their ids (rule management table).
  • Fast and efficient retrieval of execution log of 1 single rule (rule details page).
  • Support for long-running rules (10s of seconds is already our reality, minutes are probably possible). Which means that at least:
    • long-running rules should be supported by the infrastructure (alerting, task manager) without issues
    • long-running rules should be able to post multiple updates during the execution (update current state, log to event log, whatever), and we should be able to retrieve it fast and efficiently both for 1 (rule details page) and N rules (rule management & monitoring tables)

Ultimately, we need:

  • Filtering, sorting, searching and aggregations over both rule execution states and execution log. Which technically can be implemented via the same technology / stored in the same place (e.g. event log). Or can be 2 different things. I think, this might be of comparable importance as Make alert params searchable #50213
  • Advanced support for long-running rules:
    • cancellation
    • run time estimation (e.g. based on history of executions)
    • handling race conditions (e.g. when a rule is updated during its execution)
  • Migrations for rule params and state.

We don't need:

  • Ability to update alert state from outside of the executor function (e.g. from our endpoint handlers).
  • Migrations for event log (for custom fields schema).

Decisions and next steps

At this point we think that Event Log might be promising for our needs.

  • We might use it to cover both rule execution log and current execution state use-cases, given aggregations and everything else become supported.
  • Logging in ECS looks compelling, because in theory that means we could incorporate this kind of logs into threat investigative workflows, integrate into existing Security Solution logic/UI, create rules running on top of execution logs, etc.

We'll put the Alert State-based implementation on hold for now, but we may get back to it later if not everything turns out to be possible with Event Log, or if there are performance or other considerations.

Next steps for me:

  1. I'll start playing with ECS and event log. I'll try to map our current custom status SO to a series of different events; the SO fields to ECS fields.
  2. I'll come up with queries to Elasticsearch we'd need to be able to filter, sort, search, aggregate and retrieve all the needed data on our side.
  3. I'll come up with examples of event log APIs that are currently missing for these needs.
  4. Hopefully then we'll be able to create some tickets and sync up on this again.

@gmmorris
Contributor

Regarding:

Updating execution states:
...

  • We can ask Alerting to implement this feature. Effort: ?
    ...

I'd say effort is high, as we'd have to change how tasks throughout the system are executed.

How long do these tasks tend to run in production environments? Is it so long that incrementally updating throughout the task execution is unusually important?

@pmuellr
Member

pmuellr commented Feb 18, 2021

For the event log:

Currently, no support for aggregations, searching, filtering can be limited (filter is a string - KQL?), sorting is limited.

In progress PR for that here: #91731

@banderror
Contributor

@pmuellr Oh, that's awesome, thank you for pointing out this PR.

@banderror
Contributor

How long do these tasks tend to run in production environments? Is it so long that incrementally updating throughout the task execution is unusually important?

@gmmorris So regarding "10s of seconds". On one hand, it's my bad, apparently I looked at the wrong part of the response from https://kibana.siem.estc.dev/api/task_manager/_health 🤦

This is the correct data (today):

        "execution": {
          "duration": {
            "alerting:siem.signals": {
              "p50": 2385,
              "p90": 2775.5,
              "p95": 3028,
              "p99": 4523
            },

This is probably what I was looking at yesterday:

        "drift_by_type": {
          "alerting:siem.signals": {
            "p50": 18014,
            "p90": 24058,
            "p95": 24180,
            "p99": 24183
          },

So it's likely just seconds in the dev environment, not tens of seconds or minutes.

On the other hand, I have no data from production environments. Some of our rules are much more expensive than the others, and the execution time might depend on the rule type, parameters, and source event indices. We allow creating custom rules, so...

@FrankHassanabad is it possible to find some numbers from production envs? Or maybe you can give any general comments on that.

@gmmorris
Contributor

gmmorris commented Feb 18, 2021

Support for long-running rules

Just to be clear - Task Manager does support long running tasks.
You can configure your task's timeout to be a far higher number than the default, if you so wish, at which point tasks can run for as long as you'd like.
What I would say though is that we try to discourage that if possible, as the impact of that is that the long running tasks take up workers for long periods of time (at the expense of other tasks). Until we improve the scaling experience (better observability, auto scaling etc.) I'd definitely not want us to do this if it's avoidable.

@pmuellr
Member

pmuellr commented Feb 19, 2021

What I would say though is that we try to discourage that if possible

Perhaps we could have a constraint like - if you set a task's timeout higher than default, you must provide max concurrency as well.
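
Such a constraint could be checked with something like the sketch below. This is only an illustration of the idea, not Task Manager code: `TaskTypeSettings`, `validateTaskTypeSettings` and the millisecond-based `timeoutMs` field are hypothetical, and the 5 minute default is an assumption of the sketch.

```typescript
// Assumed default timeout of 5 minutes, expressed in milliseconds.
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

// Hypothetical settings shape; real task definitions may look different.
interface TaskTypeSettings {
  timeoutMs?: number;
  maxConcurrency?: number;
}

// Returns a list of validation errors; an empty list means the settings are acceptable.
function validateTaskTypeSettings(settings: TaskTypeSettings): string[] {
  const errors: string[] = [];
  const timeout = settings.timeoutMs ?? DEFAULT_TIMEOUT_MS;
  if (timeout > DEFAULT_TIMEOUT_MS && settings.maxConcurrency === undefined) {
    errors.push('maxConcurrency is required when timeoutMs exceeds the default');
  }
  return errors;
}

console.log(validateTaskTypeSettings({ timeoutMs: 15 * 60 * 1000 }));
console.log(validateTaskTypeSettings({ timeoutMs: 15 * 60 * 1000, maxConcurrency: 2 }));
```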

@banderror
Contributor

@gmmorris @pmuellr thank you, I see.

As far as I can see from the task_manager's README, the default timeout is 5 minutes. Is this correct, is this what alerting framework uses for alerts under the hood? Is there a way to specify the timeout values as well as max concurrency via the alerting API?

I hope 5 minutes should not be an issue in most (if not all) environments. I'll try to figure out if this is possible and how likely it is to happen.

@banderror banderror added v8.1.0 and removed v7.13.0 Theme: rac label obsolete labels Jan 12, 2022
banderror added a commit that referenced this issue Jan 20, 2022
)

**Epic:** #118324
**Tickets:** #119603, #119597, #91265, #118511

## Summary

The legacy rule execution logging implementation is replaced by a new one that introduces a new model for execution-related data, a new saved object and a new, cleaner interface and implementation.

- [x] The legacy data model is deleted (`IRuleStatusResponseAttributes`, `IRuleStatusSOAttributes`)
- [x] The legacy `siem-detection-engine-rule-status` saved object type is deleted and marked as deleted in `src/core`
- [x] A new data model is introduced (`x-pack/plugins/security_solution/common/detection_engine/schemas/common/rule_monitoring.ts`). This data model doesn't contain a mixture of successful and failed statuses, which should simplify client-side code (e.g. the code of Rule Management and Monitoring tables, as well as Rule Details page).
- [x] A new `siem-detection-engine-rule-execution-info` saved object is introduced (`x-pack/plugins/security_solution/server/lib/detection_engine/rule_execution_log/rule_execution_info/saved_object.ts`).
  - [x] This SO has 1:1 association with the rule SO, so every rule can have 0 or 1 execution info associated with it. This SO is used in order to 1) update the last execution status and metrics and 2) fetch execution data for N rules more efficiently compared to the legacy SO.
  - [x] The logic of creating or updating these SOs is based on the "upsert" approach (planned in #118511). It no longer fetches the SO by rule id before updating it.
- [x] Rule execution logging logic is rewritten (see `x-pack/plugins/security_solution/server/lib/detection_engine/rule_execution_log`). The previous rule execution log client is split into two objects: `IRuleExecutionLogClient` for using it from route handlers, and `IRuleExecutionLogger` for writing logs from rule executors.
  - [x] `IRuleExecutionLogger` instance is scoped to the currently executing rule and space id. There's no need to pass rule id, name, type etc to `.logStatusChange()` every time.
- [x] Rule executors and related functions are updated.
- [x] API routes are updated, including the rule preview route which uses a special "spy" implementation of `IRuleExecutionLogger`. A rule returned from an API endpoint now has optional `execution_summary` field of type `RuleExecutionSummary`.
- [x] UI is updated to use the new data model of `RuleExecutionSummary`:
  - [x] Rule Management and Monitoring tables
  - [x] Rule Details page
- [x] A new API route is introduced for fetching rule execution events: `/internal/detection_engine/rules/{ruleId}/execution/events`. It is used for rendering the Failure History tab (last 5 failures) and is intended to be used in the coming UI of Rule Execution Log on the Details page.
- [x] Rule Details page and Failure History tab are updated to use the new data models and API routes.
- [x] I used `react-query` for fetching execution events
  - [x] See `x-pack/plugins/security_solution/public/detections/containers/detection_engine/rules/use_rule_execution_events.tsx`
  - [x] The lib is updated to the latest version
- [x] Tests are fixed and updated according to all the changes
- [x] Components related to rule execution statuses are all moved to `x-pack/plugins/security_solution/public/detections/components/rules/rule_execution_status`.
- [x] I left a lot of `// TODO: #121644` comments in the code which I'm planning to address and remove in a follow-up PR. Lots of clean up work is needed, but I'd like to unblock the work on Rule Execution Log UI.
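
For readers unfamiliar with the "scoped logger" idea from the list above, here is a minimal sketch of the pattern. It is not the actual `IRuleExecutionLogger` implementation: `RuleContext`, `ScopedLogger`, `createScopedLogger` and the argument shape of `logStatusChange` are assumptions made for illustration.

```typescript
type Status = 'going to run' | 'succeeded' | 'failed' | 'partial failure';

// Rule details captured once, when the logger is created.
interface RuleContext {
  ruleId: string;
  ruleName: string;
  ruleType: string;
  spaceId: string;
}

interface StatusChange {
  newStatus: Status;
  message?: string;
}

type LoggedEvent = RuleContext & StatusChange;

interface ScopedLogger {
  logStatusChange(change: StatusChange): void;
}

// Binds the rule context so callers only pass what changes between calls.
// `sink` stands in for whatever storage the real logger writes to.
function createScopedLogger(context: RuleContext, sink: LoggedEvent[]): ScopedLogger {
  return {
    logStatusChange(change: StatusChange): void {
      sink.push({ ...context, ...change });
    },
  };
}

const written: LoggedEvent[] = [];
const logger = createScopedLogger(
  { ruleId: 'rule-1', ruleName: 'My rule', ruleType: 'query', spaceId: 'default' },
  written
);
logger.logStatusChange({ newStatus: 'going to run' });
logger.logStatusChange({ newStatus: 'failed', message: 'index not found' });
console.log(written.map((e) => `${e.ruleId}: ${e.newStatus}`));
```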

## In the next episodes

- Address and remove `// TODO: #121644` comments in the code
- Make sure that SO id generation for `siem-detection-engine-rule-execution-info` is safe and future-proof. Sync with the Core team. If there are risks, we will need to choose between risks and performance (reading the SO before updating it). It would be easy to submit a fix if needed.
- Add APM integration. Use `withSecuritySpan` in methods of `rule_execution_log` citizens.
- Add comments to the code and README.
- Add test coverage.
- Etc...
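
To illustrate why SO id generation matters for the "upsert" approach, here is a minimal in-memory sketch. It assumes the id is derived deterministically from the space id and rule id, which lets a write overwrite the existing object without reading it first. `executionInfoId`, `upsertExecutionInfo` and the `spaceId:ruleId` scheme are hypothetical; the real id generation may differ.

```typescript
interface ExecutionInfo {
  ruleId: string;
  lastStatus: string;
  lastStatusAt: string;
}

// Stand-in for saved object storage, keyed by SO id.
const executionInfoStore = new Map<string, ExecutionInfo>();

// Hypothetical deterministic id: the same rule in the same space always maps to the same SO.
function executionInfoId(spaceId: string, ruleId: string): string {
  return `${spaceId}:${ruleId}`;
}

// Creates or overwrites in one write; no lookup by rule id is needed beforehand.
function upsertExecutionInfo(spaceId: string, info: ExecutionInfo): string {
  const id = executionInfoId(spaceId, info.ruleId);
  executionInfoStore.set(id, info);
  return id;
}

upsertExecutionInfo('default', {
  ruleId: 'rule-1',
  lastStatus: 'going to run',
  lastStatusAt: '2022-01-20T10:00:00Z',
});
upsertExecutionInfo('default', {
  ruleId: 'rule-1',
  lastStatus: 'succeeded',
  lastStatusAt: '2022-01-20T10:00:03Z',
});
console.log(executionInfoStore.size, executionInfoStore.get('default:rule-1')?.lastStatus);
```

The risk mentioned above is visible here: if two different rules could ever produce the same id, one would silently overwrite the other, which is why the scheme needs to be checked for safety.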

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] Any UI touched in this PR is usable by keyboard only (learn more about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures (run axe in browser: [FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/), [Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [x] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive layout. (You can test this [in your browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser compatibility](https://www.elastic.co/support/matrix#matrix_browsers)

### For maintainers

- [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
@banderror banderror linked a pull request Jan 20, 2022 that will close this issue
28 tasks
@banderror
Contributor

The last piece has been implemented in #121644. With this implementation:

  • we write rule execution logs to .kibana-event-log
  • in addition, we store last execution info (last rule status and metrics) in the new siem-detection-engine-rule-execution-info saved object
  • the legacy siem-detection-engine-rule-status saved object is removed from the codebase

The next major step would be to address #112193 which will give us a chance to get rid of siem-detection-engine-rule-execution-info saved object as well.
