[EEM][POC] The POC for creating entity-centric indices using entity definitions (#183205)

## Summary

This is a "proof of concept" for generating entity-centric indices for the OAM. It exposes an API (`/api/entities`) for creating "asset definitions" (`EntityDefinition`). Each definition manages a transform and an ingest pipeline that write documents to an index, which could be used to build a search experience or lookups for different services.

### Features

- Data schema agnostic; works with known schemas OR custom logs
- Supports defining multiple `identityFields` along with an `identityTemplate` for formatting the `asset.id`
- Supports optional `identityFields` using a `{ "field": "path-to-field", "optional": true }` definition instead of a `string`
- Supports defining key `metrics` with equations which are compatible with the SLO product
- Supports adding `metadata` fields which will include multiple values
- Supports re-mapping `metadata` fields to a new destination path using a `{ "source": "path-to-source-field", "limit": 1000, "destination": "path-to-destination-in-output" }` definition instead of a `string`
- Supports adding `staticFields` which can also use template variables
- Supports fine-grained control over the frequency and sync settings for the underlying transform
- Installs the index template components and index template settings for the destination index
- Allows the user to configure the index patterns and timestamp field along with the lookback
- The documents for each definition are stored in their own index (`.entities-observability.summary-v1.{definition.id}`)

### Notes

- We are currently considering adding a historical index which will track changes to the assets over time. If we choose to do this, the summary index would remain the same, but we'd add a second transform with a group_by on the `definition.timestampField` and break the indices into monthly indices (configurable in the settings).
- We are looking into ways to add `firstSeenTimestamp`; this is difficult due to scaling issues. Essentially, we would need to find the `minimum` timestamp for each entity, which could be extremely costly on large datasets.
- There is nothing stopping you from creating an asset definition that uses the `.entities-observability.summary-v1.*` index pattern to create summaries of summaries... it can be very "meta".

### API

- `POST /api/entities/definition` - Creates a new asset definition and starts the indexing. See examples below.
- `DELETE /api/entities/definition/{id}` - Deletes the asset definition and cleans up the transform, ingest pipeline, and destination index.
- `POST /api/entities/definition/{id}/_reset` - Resets the transform, ingest pipeline, and destination index. This is useful for upgrading asset definitions to new features.

## Example Definitions and Output

Here is a definition for creating services for each of the custom log sources in the `fake_stack` dataset from `x-pack/packages/data-forge`.

```JSON
POST kbn:/api/entities/definition
{
  "id": "admin-console-logs-service",
  "name": "Services for Admin Console",
  "type": "service",
  "indexPatterns": ["kbn-data-forge-fake_stack.*"],
  "timestampField": "@timestamp",
  "lookback": "5m",
  "identityFields": ["log.logger"],
  "identityTemplate": "{{log.logger}}",
  "metadata": [
    "tags",
    "host.name"
  ],
  "metrics": [
    {
      "name": "logRate",
      "equation": "A / 5",
      "metrics": [
        {
          "name": "A",
          "aggregation": "doc_count",
          "filter": "log.level: *"
        }
      ]
    },
    {
      "name": "errorRate",
      "equation": "A / 5",
      "metrics": [
        {
          "name": "A",
          "aggregation": "doc_count",
          "filter": "log.level: \"ERROR\""
        }
      ]
    }
  ]
}
```

Which produces:

```JSON
{
  "host": {
    "name": [
      "admin-console.prod.020",
      "admin-console.prod.010",
      "admin-console.prod.011",
      "admin-console.prod.001",
      "admin-console.prod.012",
      "admin-console.prod.002",
      "admin-console.prod.013",
      "admin-console.prod.003",
      "admin-console.prod.014",
      "admin-console.prod.004",
      "admin-console.prod.015",
      "admin-console.prod.016",
      "admin-console.prod.005",
      "admin-console.prod.017",
      "admin-console.prod.006",
      "admin-console.prod.018",
      "admin-console.prod.007",
      "admin-console.prod.019",
      "admin-console.prod.008",
      "admin-console.prod.009"
    ]
  },
  "entity": {
    "latestTimestamp": "2024-05-10T22:04:51.481Z",
    "metric": {
      "logRate": 37.4,
      "errorRate": 1
    },
    "identity": {
      "log": {
        "logger": "admin-console"
      }
    },
    "id": "admin-console",
    "indexPatterns": [
      "kbn-data-forge-fake_stack.*"
    ],
    "definitionId": "admin-console-logs-service"
  },
  "event": {
    "ingested": "2024-05-10T22:05:51.955691Z"
  },
  "tags": [
    "infra:admin-console"
  ]
}
```

Here is an example of a definition for APM Services:

```JSON
POST kbn:/api/entities/definition
{
  "id": "apm-services",
  "name": "Services for APM",
  "type": "service",
  "indexPatterns": ["logs-*", "metrics-*"],
  "timestampField": "@timestamp",
  "lookback": "5m",
  "identityFields": ["service.name", "service.environment"],
  "identityTemplate": "{{service.name}}:{{service.environment}}",
  "metadata": [
    "tags",
    "host.name"
  ],
  "metrics": [
    {
      "name": "latency",
      "equation": "A",
      "metrics": [
        {
          "name": "A",
          "aggregation": "avg",
          "field": "transaction.duration.histogram"
        }
      ]
    },
    {
      "name": "throughput",
      "equation": "A / 5",
      "metrics": [
        {
          "name": "A",
          "aggregation": "doc_count"
        }
      ]
    },
    {
      "name": "failedTransRate",
      "equation": "A / B",
      "metrics": [
        {
          "name": "A",
          "aggregation": "doc_count",
          "filter": "event.outcome: \"failure\""
        },
        {
          "name": "B",
          "aggregation": "doc_count",
          "filter": "event.outcome: *"
        }
      ]
    }
  ]
}
```

Which produces:

```JSON
{
  "host": {
    "name": [
      "simianhacker's-macbook-pro"
    ]
  },
  "entity": {
    "latestTimestamp": "2024-05-10T21:38:22.513Z",
    "metric": {
      "latency": 615276.8812785388,
      "throughput": 50.6,
      "failedTransRate": 0.0091324200913242
    },
    "identity": {
      "service": {
        "environment": "development",
        "name": "admin-console"
      }
    },
    "id": "admin-console:development",
    "indexPatterns": [
      "logs-*",
      "metrics-*"
    ],
    "definitionId": "apm-services"
  },
  "event": {
    "ingested": "2024-05-10T21:39:33.636225Z"
  },
  "tags": [
    "_geoip_database_unavailable_GeoLite2-City.mmdb"
  ]
}
```

### Getting Started

The easiest way to get started is to use the `kbn-data-forge` config below. Save this YAML to `~/Desktop/fake_stack.yaml`, then run `node x-pack/scripts/data_forge.js --config ~/Desktop/fake_stack.yaml`. Then create a definition using the first example above.

```YAML
---
elasticsearch:
  installKibanaUser: false

kibana:
  installAssets: true
  host: "http://localhost:5601/kibana"

indexing:
  dataset: "fake_stack"
  eventsPerCycle: 50
  reduceWeekendTrafficBy: 0.5

schedule:
  # Start with good events
  - template: "good"
    start: "now-1d"
    end: "now-20m"
    eventsPerCycle: 50
    randomness: 0.8
  - template: "bad"
    start: "now-20m"
    end: "now-10m"
    eventsPerCycle: 50
    randomness: 0.8
  - template: "good"
    start: "now-10m"
    end: false
    eventsPerCycle: 50
    randomness: 0.8
```

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
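To make the relationship between `identityFields`, `identityTemplate`, and the resulting `entity.id` concrete, here is a minimal TypeScript sketch that renders a template against a document. It is an illustration only, not the plugin's implementation (which this commit describes as a transform plus an ingest pipeline). The names `EntityDefinitionSketch`, `getPath`, and `renderEntityId` are hypothetical, and rendering a missing optional field as an empty string is an assumption.

```typescript
// Hypothetical types inferred from the example definitions; not the plugin's actual schema.
type IdentityField = string | { field: string; optional: boolean };

interface EntityDefinitionSketch {
  id: string;
  identityFields: IdentityField[];
  identityTemplate: string;
}

// Read a dotted path such as "service.name" from a nested document.
function getPath(doc: Record<string, unknown>, path: string): unknown {
  let current: unknown = doc;
  for (const key of path.split('.')) {
    if (current === null || typeof current !== 'object') return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Replace each {{path}} placeholder with the value found at that path.
// Assumption: a missing required field is an error, and a missing optional
// field renders as an empty string.
function renderEntityId(
  def: EntityDefinitionSketch,
  doc: Record<string, unknown>
): string {
  const optionalFields = new Set(
    def.identityFields
      .filter(
        (f): f is { field: string; optional: boolean } =>
          typeof f !== 'string' && f.optional
      )
      .map((f) => f.field)
  );
  return def.identityTemplate.replace(
    /\{\{\s*([^}\s]+)\s*\}\}/g,
    (_match: string, path: string) => {
      const value = getPath(doc, path);
      if (value === undefined || value === null) {
        if (optionalFields.has(path)) return '';
        throw new Error(`Missing identity field: ${path}`);
      }
      return String(value);
    }
  );
}

// The identity settings from the APM Services example definition.
const apmServices: EntityDefinitionSketch = {
  id: 'apm-services',
  identityFields: ['service.name', 'service.environment'],
  identityTemplate: '{{service.name}}:{{service.environment}}',
};

const entityId = renderEntityId(apmServices, {
  service: { name: 'admin-console', environment: 'development' },
});
console.log(entityId); // admin-console:development
```

The printed value matches the `entity.id` shown in the APM Services output above, where `entity.identity` holds the same two fields.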