[Discuss] alerting event log: stand-alone ES index or SavedObjects? #51223

Closed
pmuellr opened this issue Nov 20, 2019 · 10 comments

Comments

@pmuellr
Member

pmuellr commented Nov 20, 2019

For the alerting event log, we're going to need to have a persistent log of alerting and action events, for both UI and general introspection. We've already settled on using ES to maintain this log, where every document will be a log entry, with an ECS-compatible shape.

Now the question is, do we roll our own index in ES, or do we use SavedObjects (SO's) to manage the index.

Some additional context:

  • currently the needs for the event log are from alerting and actions, which are both space-specific entities, so we will need (at least) space-specific event logs
  • there are other security aspects to this, as we would want a way to configure alerts in a space such that user A could create alerts with app X and user B could create alerts with app Y, all in the same space, but A may not be able to see B's alerts if they don't have "access" to app Y, as an example
  • the plan is to have ILM set up for this index, out of the box, with some reasonable default for rollovers
  • the access pattern for the log is append-only (and there is an append-only mode coming to ES indices soon!)
  • the general access pattern used for SO's today is as a key/value store
  • the log will potentially have 1,000,000's of documents

The first two notes (space-specific and security) are either already handled by SO's, or are known issues (sub-features to handle the user A/app X, user B/app Y case), so we get that support for free when we use SO's. Big win here, because if we roll our own index, we'll have to replicate all this.

The remaining notes are the problem/risk areas for using SO's.

The current implementation of the event log rolls its own index with ILM support. But let's go with the assumption we want to use SO's, and see what we need to "fix" to make that happen.

current event log ES resource creation

At start up time, the event log code goes through the following process:

  • create an ILM policy for the event log indices, if it doesn't already exist
  • create an index template for the event log indices, if it doesn't already exist
  • create the initial index with an ILM-friendly name, if no event log indices exist (based on nothing aliased to it)

The index template is pretty key here, as ILM will be creating new indices per roll-over settings, and the new index settings/mappings/etc will all be coming from there. The alias is also set in the template, and ILM does some management of that as well (dealing with the write_index, etc).

After the start up time processing has completed, operations against the event log will run against the alias - basically appending new log entries (via index) and searching.
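
The start-up steps above can be sketched as three Elasticsearch requests. This is only an illustration of the shape of those requests, not the event log plugin's actual code: the alias name, policy name, template name, and rollover thresholds are all made-up placeholders, and the "if it doesn't already exist" checks are left out.

```typescript
// Hypothetical names; the thread does not give the real ones.
const EVENT_LOG_ALIAS = 'event_log';
const EVENT_LOG_ILM_POLICY = 'event_log_policy';

interface EsRequestSketch {
  method: 'PUT';
  path: string;
  body: Record<string, unknown>;
}

// Builds the three requests in the order the start-up steps describe.
// A real implementation would first check whether each resource exists.
function buildStartupRequests(mappings: Record<string, unknown>): EsRequestSketch[] {
  return [
    // 1. ILM policy with an illustrative rollover default
    {
      method: 'PUT',
      path: '/_ilm/policy/' + EVENT_LOG_ILM_POLICY,
      body: {
        policy: {
          phases: {
            hot: { actions: { rollover: { max_size: '5gb', max_age: '30d' } } },
          },
        },
      },
    },
    // 2. Index template, so rolled-over indices pick up settings and mappings
    {
      method: 'PUT',
      path: '/_template/' + EVENT_LOG_ALIAS + '_template',
      body: {
        index_patterns: [EVENT_LOG_ALIAS + '-*'],
        settings: {
          'index.lifecycle.name': EVENT_LOG_ILM_POLICY,
          'index.lifecycle.rollover_alias': EVENT_LOG_ALIAS,
        },
        mappings,
      },
    },
    // 3. Initial index with an ILM-friendly name, marked as the write index
    {
      method: 'PUT',
      path: '/' + EVENT_LOG_ALIAS + '-000001',
      body: { aliases: { [EVENT_LOG_ALIAS]: { is_write_index: true } } },
    },
  ];
}
```

Once these exist, appends and searches go through the alias, and ILM moves the write index forward on rollover.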

current support for self-named saved object indices

Today you can create your own saved object index, with your own name, via the existing saved object APIs. The event log as a saved object would end up using this mechanism to create a new saved-object index specifically for the event log.

I wrote some code, stolen from the Task Manager code that also creates its own SO index, to see what ends up getting created for such indexes.

Say I've specified the indexPattern to use in the savedObjectSchema to be "event_log" (not the real name!). After startup, there will be an index created named event_log_1 and an alias set up of event_log. So we've already got a problem there - SO suffixing the alias name with a _1 for migration reasons will be a problem since ILM wants a differently shaped name (event_log-000001 or such). But there's a semantic problem as well, since:

  • we probably don't ever want to migrate the event logs (too much data, more on this below)
  • SO's point the alias at a single index, but we want an alias that spans indices - perhaps solvable with a separate alias

Let's also remember that we want an index template, so when ILM rolls over an index, the new index gets created the way we want. Presumably that index template would be created before the saved object index was created, and so it's not clear what effect that might have on the initial index creation.
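
To make the naming mismatch concrete, here is a small sketch of the rule rollover applies to index names: the name has to end in "-" followed by digits, and the next index gets that number incremented and zero-padded to six digits. The helper names are invented for illustration.

```typescript
// Rollover expects the write index name to end in "-" plus digits.
const ROLLOVER_NAME_PATTERN = /^(.*)-(\d+)$/;

function isRolloverFriendly(indexName: string): boolean {
  return ROLLOVER_NAME_PATTERN.test(indexName);
}

// Mirrors how rollover derives the next name: increment the number and
// zero-pad it to six digits. Returns null when the name does not qualify.
function nextRolloverName(indexName: string): string | null {
  const match = ROLLOVER_NAME_PATTERN.exec(indexName);
  if (match === null) return null;
  let digits = String(Number(match[2]) + 1);
  while (digits.length < 6) digits = '0' + digits;
  return match[1] + '-' + digits;
}
```

With this check, the SO-created name event_log_1 does not qualify, while event_log-000001 does and would roll over to event_log-000002.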

suggested approach

It seems like the ideal situation in my mind would be to have the event log plugin continue to create the ES resources it needs, and then create a saved object store that references the resulting alias for document CRUD operations. But somehow have the saved object library NOT do any of the actual index management operations, while still supporting all the document CRUD operations (currently just index and search).

At a high level, this seems like another option we could add to savedObjectSchemas, via a new property like selfManaged below:

savedObjectSchemas: {
  logEntry: {
    selfManaged: true,
    hidden: true,
    isNamespaceAgnostic: false,
    indexPattern: 'event_log',
  }
}

Imagine that this nullifies all the work saved objects does for index-level operations (creating indices, aliases, migration), as opposed to document-level operations. Then we could keep the existing event log ES resource creation code as is, and then create the saved object store which would give us all the document level access. More on migration below.
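
A rough sketch of the split being proposed, purely to illustrate the idea. selfManaged is the option suggested in this issue and does not exist in the saved objects library; the step names and the function are invented.

```typescript
interface SavedObjectSchemaSketch {
  indexPattern: string;
  hidden?: boolean;
  isNamespaceAgnostic?: boolean;
  selfManaged?: boolean; // proposed in this issue; not an existing option
}

type IndexLevelStep = 'createIndex' | 'createAlias' | 'runMigrations';

// Which index-level steps the saved object library would still perform.
function indexLevelSteps(schema: SavedObjectSchemaSketch): IndexLevelStep[] {
  if (schema.selfManaged === true) {
    // The owning plugin creates the ILM policy, template, index and alias.
    return [];
  }
  return ['createIndex', 'createAlias', 'runMigrations'];
}

// Document-level operations the event log needs, available either way.
function documentLevelOperations(): string[] {
  return ['index', 'search'];
}
```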

As a further point to make this work, the index template contains the mappings, so we'd need a way to get the existing "envelope" mappings that saved objects already adds. Here's what it currently looks like:

saved objects "envelope" mappings
{
  "mapping": {
    "dynamic": "strict",
    "_meta": {
      "migrationMappingPropertyHashes": {
        "migrationVersion": "4a1746014a75ade3a714e1db5763276f",
        "updated_at": "00da57df13e94e9d98437d13ace4bfe0",
        "references": "7997cf5a56cc02bdc9c93361bde732b0",
        "namespace": "2f4316de49999235636386fe51dc06c1",
        "logEntry": "b45b7ebb43da5ca5200beb42158b38c5",
        "type": "2f4316de49999235636386fe51dc06c1",
        "config": "87aca8fdb053154f11383fce3dbf3edf"
      }
    },
    "properties": {
      [saved-object-type]: {mappings for saved-object-type}
      "config": {
        "dynamic": "true",
        "properties": { "buildNum": { "type": "keyword" } }
      },
      "migrationVersion": { "type": "object", "dynamic": "true" },
      "namespace": { "type": "keyword" },
      "references": {
        "type": "nested",
        "properties": {
          "id": { "type": "keyword" },
          "name": { "type": "keyword" },
          "type": { "type": "keyword" }
        }
      },
      "type": { "type": "keyword" },
      "updated_at": { "type": "date" }
    }
  }
}

Note that currently the plan is to only have one saved object "type" - to facilitate searching across all the log data based on ECS properties.

Assuming we also want to opt-out of migration, some of these properties may not be needed. But presumably, we'd need a lot of the other fields for saved objects to operate correctly, and so would need a way to get that mapping from the saved object framework, so we could write it into the mappings in our index template.
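
One way this could look, assuming the saved object framework could hand us its envelope mappings: merge them with the event log type's own mappings and drop whatever fields turn out not to be needed. The helper below is hypothetical; no such API exists in the framework today.

```typescript
type MappingProperties = Record<string, unknown>;

interface EnvelopeMappingsSketch {
  dynamic: string;
  properties: MappingProperties;
}

// Hypothetical helper: combines envelope mappings obtained from the saved
// object framework with the single event log type's own mappings.
function buildTemplateMappings(
  envelope: EnvelopeMappingsSketch,
  typeName: string,
  typeMappings: MappingProperties,
  omitFields: string[] = []
): EnvelopeMappingsSketch {
  const properties: MappingProperties = {};
  for (const key of Object.keys(envelope.properties)) {
    // Skip envelope fields the event log opts out of (e.g. migration fields).
    if (omitFields.indexOf(key) === -1) {
      properties[key] = envelope.properties[key];
    }
  }
  properties[typeName] = typeMappings;
  return { dynamic: envelope.dynamic, properties };
}
```

The result would be written into the mappings section of the index template, so every rolled-over index carries the same envelope.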

what about migration

There are two aspects to migration

  • typical SO types of migrations, where the shape of a SO "application data" changes for a new release
  • migrations involving the SO "envelope" mappings and perhaps index settings

The first I think we won't have to deal with, short term. We're currently looking at using a subset of ECS as the mappings for the event log data, with some extensions specific to the event log itself. The subset is pretty small (~10 properties), but I expect we will be adding more over time, as clients want to add more. Most properties will be optional anyway, at an API level. Adding more, as "possibly null" values, isn't a real migration concern.

In addition, we wouldn't want to do a real migration of the event log data anyway, since in theory there could be years worth of it, and there would have to be a pretty hard requirement (and associated hard work) to make that happen.

The more concerning one is when the mappings for the SO "envelope" change. Current thinking is that this could change with SO's that can be shared across spaces (a possible feature in the future), but we don't know what that shape change would be. What happens when the SO envelope mappings change across releases?

We'll need to figure out this story, but for the initial releases of event log, we may have to live with a story (worst case!) that for new releases, old event logs may no longer be searchable. Somehow we'd need to identify that the SO envelope mappings have changed, update the index template with those changes, and do a rollover, before writing new entries. Perhaps also removing the alias from the older indices. Or something. Depending on the change, older indices may be searchable, or maybe not.
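
As a sketch of the "identify that the envelope mappings have changed" step, one option would be to compare per-property hashes like the migrationMappingPropertyHashes shown earlier. This assumes the event log keeps a copy of the hashes it last wrote into its template, which is an invention for illustration.

```typescript
type PropertyHashes = Record<string, string>;

// Returns the names of mapped properties whose hash differs between what the
// index template was written with and what the current release produces.
// Added and removed properties count as changed.
function changedEnvelopeProperties(stored: PropertyHashes, current: PropertyHashes): string[] {
  const changed: string[] = [];
  const keys = Object.keys(stored).concat(Object.keys(current));
  for (const key of keys) {
    if (stored[key] !== current[key] && changed.indexOf(key) === -1) {
      changed.push(key);
    }
  }
  return changed.sort();
}

// If anything changed, update the template and roll over before writing.
function needsTemplateUpdateAndRollover(stored: PropertyHashes, current: PropertyHashes): boolean {
  return changedEnvelopeProperties(stored, current).length > 0;
}
```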

In the same vein of issues, imagine SO changes to use document- or field-level security for SO's. If any of that would depend on special index-specific settings, this could be troublesome to deal with.

I'll admit I'm still a n00b on a lot of the Kibana (Saved Object) and ES (ILM, security) aspects of this, so maybe some of these "issues" are either non-issues, or show-stoppers.

Please chime in ...

@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services (Team:Stack Services)

@mikecote mikecote added this to Backlog in Make it Action Nov 20, 2019
@kobelb
Contributor

kobelb commented Nov 20, 2019

There is another option which we haven't discussed yet, which is essentially what ML is planning to do: treat the event-log entries as "linked" to the alert and action saved-objects. This conceptually enforces the security and spaces constraints of the alert and action saved-objects upon the related event-log entries. If each event-log entry stores a reference to the associated alert or action, then we can create a dedicated API endpoint to return the associated event-log entries. The following is the pseudo-code for what this API endpoint would look like:

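// pseudo-code for the endpoint's handler; `id` is the alert id taken from the request
async function handler(request) {
  const savedObjectsClient = server.savedObjects.getScopedSavedObjectsClient(request);
  // this will enforce the spaces and security constraints
  const savedObject = await savedObjectsClient.get('alert', id);
  // only reached if the user can access the alert, so the internal user can run the search
  return await clusterClient.callAsInternalUser('search', {
    index: '.kibana-alerting-event-log',
    body: {
      query: {
        term: {
          alertId: savedObject.id,
        },
      },
    },
  });
}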

This approach becomes much more complicated if we intend for the user to be able to access event-log entries for all alerts that the user has access to, but as long as we're primarily concerned with the event-log entries for a single alert, it has its benefits.

@pmuellr
Member Author

pmuellr commented Nov 20, 2019

We will need to support search - in fact that's the primary use case - "show me what alerts and actions were doing in this time frame" - the other use case being adding a log entry. I don't think this story would work out well for search, but maybe I'm missing something.

Also not clear to me how this would interact w/ILM. Presumably w/ILM, customers can customize the policy to delete old rolled-over indices at some point, and that happens pretty much without our knowledge. I think we'd need to recognize an index was deleted, then delete the SO objects that linked to them, which seems hard. - woops mis-understood something in Brandon's explanation, my stricken-out bits don't make sense to me any more :-)

@mikecote mikecote moved this from Backlog to In progress in Make it Action Nov 22, 2019
@mikecote
Contributor

mikecote commented Nov 22, 2019

After looking into this, I think the approach @kobelb proposes makes sense from my perspective as well. That approach also means we don't use saved objects for the log itself; we keep it as a stand-alone index / indices and manage it ourselves.

I don't think we'll be able to support displaying the entire activity log of all the alerts unless the user has access to everything. This problem applies regardless of whether we're using saved objects or not. The reason is that we would need to apply inherited access at query time, and we don't have features today that would let us do that in a performant manner at basic+ license tiers (after talking with @kobelb). We will still be able to filter the activity log down to a specific alert and display everything logged for it.

For the migrations, I agree that since we're using ECS, we won't have to change field mappings, only add new ones as we log more information. Using templates would apply the new mappings to new indices only, which is fine as they would be the ones getting the updated data. For reference, here is how task manager used templates: https://github.com/elastic/kibana/blob/7.2/x-pack/plugins/task_manager/task_store.ts#L119-L216.

I want to cc @epixa and @peterschretlen to make sure we're making the right decision here.

@epixa
Contributor

epixa commented Nov 22, 2019

I support that approach. To be frank, I don't see how we could ever possibly scale saved objects to support the amount of data you're talking about here. Even if it were desirable to use them for this, I think we'd need to make a more practical decision anyway. This will mean we have certain limitations when working with the data, it just is what it is.

If people are going to use this data in SIEM, Logging, dashboard, canvas, etc., then they're going to need to query directly on the raw data anyway. We don't really get any of the benefits of ECS if we're not treating this stuff like data for these more advanced use cases anyway.

@peterschretlen
Contributor

I think the linking via API covers a lot of the common use cases. Restricting to a single alert or action actually works well with some of @mdefazio's recent work that has the activity showing in a flyout.

@pmuellr makes a valid point about search, ideally the activity log view will allow search and filtering. Could document level security be used in that case to enforce space and app access controls? That would restrict a full event log UI to platinum, which I think we can accept if that's the only way to do it ( @alexfrancoeur? ). Users would still be able to fall back on explicitly granting index privileges to the activity log, with the caveat that there are no access controls.
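
For a sense of what the document level security option might involve, here is a sketch of an Elasticsearch role body that limits reads on the event log to one space. It assumes each log entry carries its space id in a field called kibana.namespace, which is a made-up field name, and it says nothing about how such roles would be created or kept in sync.

```typescript
interface RoleIndexPrivilegeSketch {
  names: string[];
  privileges: string[];
  query: Record<string, unknown>;
}

// Body for an Elasticsearch role whose read access to the event log is
// limited by a document level security query. `kibana.namespace` is a
// hypothetical field that would hold the space id of each log entry.
function spaceScopedEventLogRole(
  indexPattern: string,
  spaceId: string
): { indices: RoleIndexPrivilegeSketch[] } {
  return {
    indices: [
      {
        names: [indexPattern],
        privileges: ['read'],
        query: { term: { 'kibana.namespace': spaceId } },
      },
    ],
  };
}
```

A role like this would be needed per space (and per app, for the feature-level controls), which gives a feel for the setup effort involved.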

@kobelb
Contributor

kobelb commented Nov 22, 2019

> To be frank, I don't see how we could ever possibly scale saved objects to support the amount of data you're talking about here.

This is likely a tangential topic, but are there specific aspects of saved-objects which you anticipate not scaling?

> Could document level security be used in that case to enforce space and app access controls?

I think it's worth exploring. My greatest concern is the level of effort to set this up...

> If people are going to use this data in SIEM, Logging, dashboard, canvas, etc., then they're going to need to query directly on the raw data anyway. We don't really get any of the benefits of ECS if we're not treating this stuff like data for these more advanced use cases anyway.

This is a rather new requirement which I've just started hearing discussed more and more... I feel like we're verging on a new feature/requirement here to really give the users what they're asking for. Ideally, we'd allow the users to view this data in Kibana without end-users needing Elasticsearch privileges to the entire data indices, and it'd respect their Kibana privileges. For the time being, we have to choose between using an internal/system index which end-users can't access directly and we apply our own authorization rules; and a data index where end-users can access all of the data. Neither of which is ideal.

@pmuellr
Member Author

pmuellr commented Nov 22, 2019

Perhaps we can just treat the "need to search across all events" as a new type of permission required, which wouldn't be needed by most mortal users, but useful for admin-types of people. Or could be a permission that could be applied in cases where the customers don't care about data leaking through space/feature control aspects (probably the case for a lot of users). But still have it "locked down" by default, for those customers that care.

@pmuellr
Member Author

pmuellr commented Nov 25, 2019

Also worth mentioning that in the latest version of the event log code, I had added a kibana.saved_objects field (keyword) with the thought that it would be used as a mechanism to "link" event log entries back to a "source" that they are related to.

This seems to jibe perfectly with the current thinking here (except in Brandon's comment, the query term would be kibana.saved_objects instead of alertId).

Since in theory the event log may need to reference saved objects not in .kibana (eg, task manager records), somehow those saved_objects should also reference the relevant saved object store, which should be interesting ...
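
A sketch of what a log entry with such links might look like. @timestamp and event.action are ECS fields; the "<store>/<type>/<id>" encoding for kibana.saved_objects is only one possible way to also name the saved object store, invented here for illustration.

```typescript
interface EventLogEntrySketch {
  '@timestamp': string;
  event: { action: string };
  kibana: { saved_objects: string[] };
}

// Hypothetical encoding: "<store>/<type>/<id>", where <store> names the saved
// object index (for example ".kibana") so objects held elsewhere can be
// linked as well.
function savedObjectRef(store: string, type: string, id: string): string {
  return [store, type, id].join('/');
}

// Builds a log entry linked to the saved objects it relates to.
function createLogEntry(action: string, refs: string[], when: Date): EventLogEntrySketch {
  return {
    '@timestamp': when.toISOString(),
    event: { action },
    kibana: { saved_objects: refs },
  };
}
```

Because the field is a keyword, a term query on one encoded reference finds every entry linked to that object.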

@pmuellr
Member Author

pmuellr commented Dec 9, 2019

Consensus seems to be to go with the "linked SO" approach Brandon mentioned in #51223 (comment). The current event log PR #45081 contains some initial support for storing saved object ids in the event log documents, to allow searching for specific alerts / actions to get their history.

We'll plan on not exposing a general search facility via HTTP, and forcing the use of a saved object id when searching via the plugin-provided event log service. We can expand the search capabilities later, as needed, when we can do it in a secure manner.
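
A minimal sketch of that restriction, assuming the service builds the search body itself: the saved object reference is mandatory, and an optional time range narrows the results. The function and option names are invented; only kibana.saved_objects comes from the event log code mentioned above.

```typescript
interface EventLogSearchOptions {
  savedObjectRef: string; // value stored in kibana.saved_objects; required
  start?: string; // ISO timestamps bounding @timestamp
  end?: string;
}

// Builds the search body; refuses to search without a saved object reference.
function buildEventLogSearchBody(options: EventLogSearchOptions): Record<string, unknown> {
  if (options.savedObjectRef.trim() === '') {
    throw new Error('a saved object id is required to search the event log');
  }
  const filter: Array<Record<string, unknown>> = [
    { term: { 'kibana.saved_objects': options.savedObjectRef } },
  ];
  if (options.start !== undefined || options.end !== undefined) {
    const range: Record<string, string> = {};
    if (options.start !== undefined) range['gte'] = options.start;
    if (options.end !== undefined) range['lte'] = options.end;
    filter.push({ range: { '@timestamp': range } });
  }
  return {
    query: { bool: { filter } },
    sort: [{ '@timestamp': { order: 'asc' } }],
  };
}
```

The caller is still expected to have fetched the saved object through a scoped client first, so that space and security checks happen before the search runs.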

@pmuellr pmuellr closed this as completed Dec 9, 2019
@pmuellr pmuellr moved this from In Progress to Done (Ordered by most recent) in Make it Action Dec 9, 2019
@mikecote mikecote added the v7.6.0 label Dec 9, 2019
6 participants