- Start Date: 2019-09-12
- RFC PR: currently https://github.com/elastic/kibana/pull/45081
- Kibana Issue: (leave this empty)

See also issue https://github.com/elastic/kibana/issues/45083, which is an
umbrella issue tracking ongoing initial work.


# Summary

For the Make It Action alerting / action plugins, we will need a way to
persist data regarding alerts and actions, for UI and investigative purposes.
We're referring to this persisted data as "events"; it will be persisted to
a new elasticsearch index referred to as the "event log".

Example events are actions firing, alerts running their scheduled functions,
alerts scheduling actions to run, etc.

This functionality will be provided in a new NP plugin `event_log`, which will
provide server-side plugin APIs to write to the event log and run queries
against it. We'll also want some query capability exposed as an HTTP endpoint.

The current clients for the event log are the actions and alerting plugins;
however, the event log currently has nothing specific to them and is general
purpose, so it can be used by any plugin to "log events".

We currently assume that there may be many events logged, and that (some)
customers may not be interested in "old" events, so to keep the event log from
consuming too much disk space, we'll set it up with ILM and some kind of
reasonable default policy that can be customized by the user. This also implies
the use of rollover, setting a write index alias upon rollover, and
that searches for events will be done via an ES index pattern / alias to search
across event log indices with a wildcard.

# Basic example

When an action is executed, an event should be written to the event log.

Here's a [`kbn-action` command](https://github.com/pmuellr/kbn-action) to
execute a "server log" action (writes a message to the Kibana log):

```console
$ kbn-action execute d90f5563-ab7c-4292-b56d-c68166c09a59 '{message: hallo}'
{
  "status": "ok"
}
```

Here's the event written to the event log index:

```json
{
  "_index": ".kibana-event-log-000001",
  "_type": "_doc",
  "_id": "d2CXT20BPOpswQ8vgXp5",
  "_score": 1,
  "_source": {
    "timestamp": "2019-09-20T16:53:12.177Z",
    "pluginId": "actions",
    "spaceId": "default",
    "username": "elastic",
    "type": "action",
    "subType": "executed",
    "data": {
      "actionId": "d90f5563-ab7c-4292-b56d-c68166c09a59",
      "actionTypeId": ".server-log",
      "config": {},
      "params": {
        "message": "hallo"
      },
      "description": "server-log",
      "result": {
        "status": "ok"
      }
    }
  }
}
```


# Motivation

The existing designs for Make It Action describe UIs which show some amount
of alert "history". Until now, we had nowhere to write this history,
and so a new elasticsearch index is required for it. Because we already
have two known clients of this plugin, which are related but different,
the design of the event documents is very generic, which means it could easily
be used by other plugins that would like to record "events" for the same
general purposes.


# Detailed design

## API

```typescript
export interface IEventLog {
  registerEventType(eventType: string, subTypes: string[]): void;
  getEventTypes(): Map<string, Set<string>>;

  registerTags(tags: string[]): void;
  getTags(): Set<string>;

  getLogger(properties: Partial<IEvent>): IEventLogger;

  searchEvents(search: any): Promise<any>;
}

export interface IEventLogger {
  logEvent(properties: Partial<IEvent>): void;
}

export interface IEvent {
  timestamp: Date;
  spaceId: string;
  username: string;
  pluginId: string;
  type: string;
  subType: string;
  tags: string[];
  data: any;
}
```

The plugin exposes an `IEventLog` object to plugins that pre-req it. Those
plugins need to call `registerEventType()` and `registerTags()` to indicate
the `type`, `subType`, and `tags` values they will be using when logging
events.

The pre-registration helps in two ways:

- dealing with misspelled values
- constraining the values that will end up being indexed; for example, we
probably don't want people creating tags that contain a lot of uuids or
dates, to prevent high-cardinality issues with the index.

Once the values are registered, the plugin will get an `IEventLogger` instance
by passing a set of default properties, to be used for all its logging,
to the `getLogger()` method. For instance, the `actions` plugin creates a
logger with its plugin id and the only `type` value that it uses.
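
As a concrete illustration, here's a minimal sketch of what that setup might
look like; the import path, the setup function, and the specific type / tag
values are hypothetical, not part of this proposal:

```typescript
import { IEventLog, IEventLogger } from '../event_log/types'; // hypothetical path

// register the type/subType and tag values up front, then create a logger
// preloaded with default properties for all subsequent logEvent() calls
export function setupEventLogging(eventLog: IEventLog): IEventLogger {
  eventLog.registerEventType('action', ['executed', 'scheduled']);
  eventLog.registerTags(['actions']);

  return eventLog.getLogger({ pluginId: 'actions', type: 'action' });
}
```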

The `IEventLogger` object can be cached at the plugin level and accessed by
any code in the plugin. It has a single method to write an event log entry,
`logEvent()`, which is passed specific properties for the event.

The final data written is a combination of the data passed to `getLogger()`
when creating the logger and the data passed on the `logEvent()` call; that
result is then validated to ensure it's complete and valid. Errors will be
logged to the server log.

The `logEvent()` method returns no value and is itself not asynchronous.
It's a "fire and forget" kind of thing. The method itself will arrange
to have the ultimate document written to the index asynchronously. It's designed
this way because it's not clear what a client would do with a result from this
method, nor what it would do if the method threw an error. All the error
processing involved with getting the data into the index is handled internally,
and logged to the server log as appropriate.
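
For example, assuming the logger created in the sketch above, a single
fire-and-forget call might look like this (the property values are
illustrative):

```typescript
// merged with the logger's defaults (pluginId, type); returns immediately,
// and the document is written to the index asynchronously
eventLogger.logEvent({
  subType: 'executed',
  tags: ['actions'],
  data: { actionId: 'd90f5563-ab7c-4292-b56d-c68166c09a59' },
});
```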


## Stored data

The elasticsearch index for the event log will have ILM and rollover support,
as customers may decide to only keep recent event documents, wanting indices
with older event documents deleted, turned cold, frozen, etc. We'll supply
some default values, but customers will be able to tweak these.

The index template for these indices looks like this:

```js
export function getIndexTemplate(esNames: EsNames, ilmExists: boolean) {
  const mapDate = { type: 'date' };
  const mapObject = { type: 'object', enabled: false };
  const mapKeyword = { type: 'keyword', ignore_above: 256 };

  return {
    index_patterns: [esNames.indexPattern],
    aliases: {
      [esNames.alias]: {
        is_write_index: true
      },
    },
    settings: {
      number_of_shards: 1,
      number_of_replicas: 1,
      'index.lifecycle.name': esNames.ilmPolicy,
      'index.lifecycle.rollover_alias': esNames.alias,
    },
    mappings: {
      dynamic: 'strict',
      properties: {
        timestamp: mapDate,
        type: mapKeyword,
        subType: mapKeyword,
        pluginId: mapKeyword,
        spaceId: mapKeyword,
        username: mapKeyword,
        tags: mapKeyword,
        data: mapObject,
      },
    },
  };
}
```

See [ilm rollover action docs][] for more info on the `is_write_index` and `index.lifecycle.*` properties.

[ilm rollover action docs]: https://www.elastic.co/guide/en/elasticsearch/reference/current/_actions.html#ilm-rollover-action

Of particular note in the `mappings`:

- `dynamic: 'strict'` implies users can't add new fields
- all the `properties` are indexed, except `data`, which contains
event-specific data, and so event-specific data is not directly searchable
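
For example, here's a sketch of querying the indexed fields via
`searchEvents()`; the shape of the `search` argument isn't specified yet, so
an ES query body is assumed here:

```typescript
// find recent executions logged by the actions plugin; only the indexed
// top-level fields (not `data`) can be filtered on like this
async function recentActionExecutions(eventLog: IEventLog) {
  return await eventLog.searchEvents({
    query: {
      bool: {
        filter: [
          { term: { pluginId: 'actions' } },
          { term: { subType: 'executed' } },
          { range: { timestamp: { gte: 'now-1d' } } },
        ],
      },
    },
  });
}
```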

Long-term we may want a story where event log clients can add their own
mappings in some new `object` field that we do enable for indexing. It's not
clear if that's required yet.

From the "Basic Example" above, you can see that the action execution records
everything about the action itself and it's execution - action id, description,
config, **not the secrets** _of course_ :-), the action execution parameters,
and the result of the execution. Complete denormalization. Whether or not
this is appropriate or not is TBD, but seems like it might be fine.


## ILM setup

We'll want to provide a default ILM policy; this seems like a reasonable first
attempt:

```
PUT _ilm/policy/event_log_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "5GB",
            "max_age": "30d"
          }
        }
      }
    }
  }
}
```

This means that ILM would "rollover" the current index, say
`.kibana-event-log-000001`, by creating a new index `.kibana-event-log-000002`,
which would "inherit" everything from the index template, and then ILM will
set the write index of the alias to the new index. This would happen
when the original index grew past 5 GB or was created more than 30 days ago.
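
Searches then use the wildcard index pattern rather than a concrete index
name, so events remain queryable across rollovers; for example (index names
follow the example above, and the exact naming is still subject to change):

```
GET .kibana-event-log-*/_search
{
  "query": { "term": { "pluginId": "actions" } },
  "sort": [{ "timestamp": "desc" }]
}
```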

For more relevant information on ILM, see:
[getting started with ILM doc][] and [write index alias behavior][]:

[getting started with ILM doc]: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html
[write index alias behavior]: https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-rollover-index.html#indices-rollover-is-write-index


# Drawbacks

TBD


# Alternatives

## Using Saved Objects instead of a new index

There are some potential advantages to using Saved Objects instead of a new
index as the persistence store:

- setup would be easier, compared to the setup required using a new index
with ILM, etc
- support for per-event indexed properties
- support for migrations
- potential use of references to link events together (eg, an alert will
write a log entry for scheduling an action to run, and that action execution
also writes a log entry, so it would be nice to "link" them together)
- any other good things Saved Objects gives us

On the other hand, there are some known unknowns:

- how would ILM rollover and Kibana migrations work, when both want to "own"
the index name suffix of rolled-over / migrated indices
- not clear if we will need more elaborate querying than Saved Objects
provides
- unknown unknowns, since the attempt at implementing a saved object persistence
solution was never completed (and it was getting pretty complicated)

## Concern with overlap with eventual audit log capability

In an early review, concern was raised that there is a lot of overlap with
where we will eventually be with audit logging.

The issues with that are:

- audit logs are currently just an `audit-log` tag on messages written to the
standard Kibana log; there is no infrastructure to allow them to be queried

- security differences; the users able to read audit logs are probably much
more constrained than users that need to be able to read these event logs

So, we don't really have an _alternative_ here, but ... a lot of overlap.
Long-term, it may make sense to make the eventual audit logging code
parameterizable, so that the core code could be used for other purposes, like
this event log. We're just not there yet. Or implement the audit log with
the event log :-)

## Write events to a file and ingest via FileBeat

This came up as _"why don't we use our own stuff to do this, since that's what
it's designed for"_. Yup! However, the only things that really change compared
to writing to our own index are:

- we'd write to some other log (could be console.log, or a file) instead of
directly to an index
- we'd have to arrange to run FileBeat alongside Kibana, ingesting these
logged events, which would send them into elasticsearch

We'd still need to arrange to initialize all the elasticsearch resources (ILM,
index template, etc).

The big kicker though is: where would we write these so that FileBeat could
find them? Writing to a file might be problematic for users that run
Kibana with no writable file system, including folks using docker-based
deployments.

Lots of unknown unknowns, it seems.

Interestingly enough, this kind of story might work out extremely well for
Elastic Cloud, where we have a lot more control over the deployments!

# Adoption strategy

**If we implement this proposal, how will existing Kibana developers adopt it?
Is this a breaking change? Can we write a codemod? Should we coordinate with
other projects or libraries?**

The API surface has been kept to a minimum to make it easy for other plugins
to make use of it. But that's just a nice-to-have right now anyway, since
the only known clients will be actions and alerting.


# How we teach this

**What names and terminology work best for these concepts and why? How is this
idea best presented? As a continuation of existing Kibana patterns?**

"event" and "log" are probably the most controversial names in use here.
Previously, we've referred to this as an "audit log" or "history log".
Audit log is not a great name due to the existance of the audit logging
facility already available in Kibana. History or History log seems like
a closer fit. Event log works for me, as it provides just a touch more
information than history - it's a history of events - but of course "events"
as suitable vague enough that it doesn'tt add thatt much more.

**Would the acceptance of this proposal mean the Kibana documentation must be
re-organized or altered? Does it change how Kibana is taught to new developers
at any level?**

No.

**How should this feature be taught to existing Kibana developers?**

Follow examples of other plugins using it.

The API surface is very small; it's designed to be easy for other developers
to reuse.


# Unresolved questions

**Optional, but suggested for first drafts. What parts of the design are still
TBD?**

- Will we need more indexable fields in the event documents?

- Will we need a way to link events together? For example, an alert schedules
an action to run, so at least two related events will be written, with no
direct (eg, indexable) linkage between them currently possible.

- What to do when writing a log event document to an index encounters an
error related to elasticsearch being down? Long-term we want to buffer
these in a file, but we should have some short-term solution until we get
there. That will probably mean providing a bounded queue of event documents
to be written when elasticsearch is available again; a sketch of such a
queue appears at the end of this section.

  Not a great story, but it's also not clear, in practice, how many events
  will need to be logged while elasticsearch is down. For the most part,
  anything getting logged for actions and alerting will happen AFTER an
  elasticsearch-related query anyway, meaning ... there just may not be
  that many cases where this would happen in real life. We may in fact
  need to do something special about "elasticsearch is down" kinds of
  errors - like specifically telling plugin users not to create such events -
  as they could cause a thundering herd issue when elasticsearch does
  go down. Will be some fun experiments to run!
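
A minimal sketch of what such a bounded queue might look like; the class and
its defaults are purely illustrative, not part of the proposed API:

```typescript
import { IEvent } from '../event_log/types'; // hypothetical path

// Hypothetical bounded FIFO for buffering events while elasticsearch is
// down; drops the oldest event when full so memory use stays bounded.
class BoundedEventQueue {
  private events: Partial<IEvent>[] = [];

  constructor(private readonly maxSize: number = 1000) {}

  push(event: Partial<IEvent>): void {
    if (this.events.length >= this.maxSize) {
      this.events.shift(); // discard the oldest buffered event
    }
    this.events.push(event);
  }

  // drain everything for indexing once elasticsearch is reachable again
  drain(): Partial<IEvent>[] {
    return this.events.splice(0);
  }
}
```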