[ingest] Enrich documents prior to indexing #32789

jakelandis · 2018-08-10T16:20:48Z

Enrichment at ingest

This issue describes a project that will leverage the ingest node to allow for enrichment of documents before they are indexed.

Below is a diagram that highlights the workflow. The red parts are new components.

.enrich-* - index(es) managed by Elasticsearch that contain a highly optimized subset of the source data used for enrichment.
source index - a normal index managed externally (i.e. not by Elasticsearch) that contains the data used for enrichment.
enrich policy - a policy that describes how to synchronize the source index with the .enrich-* index. The policy will describe which fields to copy and how often to copy the fields.
decorate processor - an ingest node processor that reads from a .enrich-* index to mutate the raw data before it is indexed. The .enrich-* index data will be local to the decorate processor.
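
The synchronization step that an enrich policy describes can be sketched as a toy in-memory model. This is not Elasticsearch code: `EnrichPolicy`, `sync_enrich_index`, and the sample documents are hypothetical names and data chosen for illustration, and a plain dict stands in for the .enrich-* index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrichPolicy:
    """Hypothetical stand-in for an enrich policy: which field to match on
    and which fields to copy from the source index."""
    match_field: str
    enrich_fields: tuple


def sync_enrich_index(source_docs, policy):
    """Build the lookup subset (the '.enrich-*' role in this toy model):
    a dict keyed by the match field that holds only the policy's fields."""
    enrich_index = {}
    for doc in source_docs:
        key = doc.get(policy.match_field)
        if key is None:
            continue  # documents without the match field cannot be looked up
        enrich_index[key] = {f: doc[f] for f in policy.enrich_fields if f in doc}
    return enrich_index


# Sample source data (made up for this sketch).
source_index = [
    {"id": "p1", "first": "Ada", "last": "Lovelace", "notes": "not copied"},
    {"id": "p2", "first": "Alan", "last": "Turing"},
    {"first": "NoId"},
]
policy = EnrichPolicy(match_field="id", enrich_fields=("first", "last"))
enrich_index = sync_enrich_index(source_index, policy)
```

Only the fields named in the policy reach the lookup structure, which is the sense in which the .enrich-* index is a subset of the source data.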

There are many moving parts so this issue will serve as a central place to track them.

Tasks

Enrich policy definition

Define the enrich policy (@martijnvg) Added enrich policy definition. #41003
Rename enrich_key to match_field and enrich_values to enrich_fields.
Remove type field and make the type a top level json object that contains all the configuration of an enrich policy. Change how type is stored in an enrich policy. #45789

{
  "exact_match": {
    "match_field": "prsnl.id",
    "enrich_fields": [
      "prsnl.name.first",
      "prsnl.name.last"
    ],
    "indices": [
      "bar*",
      "foo"
    ],
    "query": {}
  }
}

instead of:

{
  "type": "exact_match",
  "indices": [
    "bar*",
    "foo"
  ],
  "match_field": "prsnl.id",
  "enrich_fields": [
    "prsnl.name.first",
    "prsnl.name.last"
  ],
  "query": {}
}
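
The difference between the two shapes is mechanical, which a short sketch can make concrete. The helper name `to_nested_policy` is hypothetical; it only illustrates moving the `type` value to the top level and is not part of any Elasticsearch API.

```python
def to_nested_policy(flat_policy):
    """Move the value of "type" to the top level and nest the rest under it."""
    remaining = dict(flat_policy)        # copy, so the caller's dict is untouched
    policy_type = remaining.pop("type")  # raises KeyError if "type" is absent
    return {policy_type: remaining}


old_format = {
    "type": "exact_match",
    "indices": ["bar*", "foo"],
    "match_field": "prsnl.id",
    "enrich_fields": ["prsnl.name.first", "prsnl.name.last"],
    "query": {},
}
new_format = to_nested_policy(old_format)
```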

Enrich processor

Policy management

APIs

Misc

Restart qa test
Documentation
Enable / Disable settings
HLRC
~~update Kibana roles for new role, to be done after the feature branch is merged to master~~ obsoleted by Role Management - use ES Builtin Privilege API to drive list of privileges kibana#40270
update stack docs for the new role, to be done after the feature branch is merged to master
Transport client support. (@hub-cap) Add enrich transport client support #46002
Integration with xpack usage api.

EDITS:

2019-04-08: Changed the original description of this issue to reflect the current direction.
2019-05-07: Updated after planning meeting.


jakelandis · 2018-08-23T19:31:18Z

Closing as better alternatives for these use cases have been discussed.

jakelandis · 2019-01-28T20:08:30Z

Re-opening per further discussion.

elasticmachine · 2019-01-28T20:13:40Z

Pinging @elastic/es-core-features

dadoonet · 2019-03-06T11:24:16Z

@jakelandis When I closed #20340 I started working on a JDBC ingest plugin that basically did lookups against a 3rd party database. I designed it to rely heavily on caching so that lookups run as fast as possible against local data.

2 strategies at the time:

cache hit by hit. The more you call ingest-jdbc, the more data is cached and the faster it runs.
cache on ingest startup. It starts an embedded in-memory database, creates a schema identical to the source one, and imports the table data into memory.

Of course with cache eviction and memory usage protection (i.e. don't load more than x kb/mb of data...).

Is that one of the things you have in mind?
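
The first strategy (hit-by-hit caching with eviction) might look roughly like the sketch below. It is not taken from the ingest-jdbc plugin: `LookupCache` and `remote_table` are hypothetical, the eviction policy is assumed to be least-recently-used, and the memory limit is approximated by an entry count rather than a byte size.

```python
from collections import OrderedDict


class LookupCache:
    """Hypothetical hit-by-hit cache with least-recently-used eviction."""

    def __init__(self, fetch, max_entries):
        self._fetch = fetch              # slow lookup, e.g. a query to the remote database
        self._max_entries = max_entries  # crude stand-in for a memory limit
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value


# A dict plays the role of the third-party database in this sketch.
remote_table = {"a": 1, "b": 2, "c": 3}
cache = LookupCache(fetch=remote_table.__getitem__, max_entries=2)
results = [cache.get(k) for k in ("a", "a", "b", "c", "a")]
```

In this run the second lookup of `"a"` is served from the cache, while the last one misses again because `"a"` was evicted when `"c"` arrived.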

gmoskovicz · 2019-03-06T19:53:40Z

This would be beneficial to do real-time lookups within Elasticsearch.


The enrich processor uses a field value from the document being enriched to do a lookup in the locally allocated enrich index shard. If there is a match, it retrieves the source of the enrich document from the enrich source field. This is a special binary doc values field. The document being enriched then gets values from the enrich document based on the configured decorate fields. The policy contains the information about which field in the enrich index to query and which fields are available to decorate a document being enriched with.

The enrich processor has the following configuration options:

* `policy_name` - the name of the policy this processor should use
* `enrich_key_field` - the field in the document being enriched that holds the lookup value
* `enrich_key_field_ignore_missing` - whether to allow the key field to be missing
* `enrich_values` - a list of fields to decorate the document being enriched with. Each entry holds a source field and a target field. The source field indicates which decorate field to use that is available in the policy. The target field controls the field name to use in the document being enriched. The source and target fields can be the same.

Example pipeline config:

```
{
  "processors": [
    {
      "policy_name": "my_policy",
      "key": "host_name",
      "values": [
        {
          "source": "globalRank",
          "target": "global_rank"
        }
      ]
    }
  ]
}
```

In the above example, documents are being enriched with a global rank value. For each document that has a match in the enrich index based on its host_name field, the document gets a global rank field value, which is fetched from the `globalRank` field in the enrich index and saved as `global_rank` in the document being enriched.

The enrich source field mapper is an internal field mapper meant to be used by enrich exclusively.

Relates to elastic#32789
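
The lookup and source/target mapping described in this commit message can be modeled with a small sketch. This is a toy, not the processor's implementation: the function `enrich`, the dict used as the enrich index, and the rank value are all hypothetical, and raising an error when the key field is missing (and not ignored) is an assumption, since the text does not say what happens in that case.

```python
def enrich(document, enrich_index, enrich_key_field, enrich_values,
           enrich_key_field_ignore_missing=False):
    """Toy model of the lookup; returns a new, possibly enriched document."""
    if enrich_key_field not in document:
        if enrich_key_field_ignore_missing:
            return dict(document)
        # Assumption: a missing key field is an error unless explicitly ignored.
        raise KeyError(f"field [{enrich_key_field}] is missing")
    match = enrich_index.get(document[enrich_key_field])
    enriched = dict(document)
    if match is None:
        return enriched  # no match: the document is left unchanged
    for mapping in enrich_values:
        if mapping["source"] in match:
            enriched[mapping["target"]] = match[mapping["source"]]
    return enriched


# Sample enrich data; the rank value is made up.
enrich_index = {"elastic.co": {"globalRank": 25}}
values = [{"source": "globalRank", "target": "global_rank"}]

hit = enrich({"host_name": "elastic.co"}, enrich_index, "host_name", values)
miss = enrich({"host_name": "example.org"}, enrich_index, "host_name", values)
skipped = enrich({"other": 1}, enrich_index, "host_name", values,
                 enrich_key_field_ignore_missing=True)
```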

Changed the signature of AbstractResponseTestCase#createServerTestInstance(...) to include the randomly selected xcontent type. This is needed for creating a server response instance with a query, which is represented as a BytesReference. Maybe this should go into a different change? This PR also includes HLRC docs for the get policy api. Relates to #32789

If an ingest processor factory relies on other configuration in the cluster state in order to construct a processor instance, it is currently undetermined whether the processor factory is notified about a change when multiple cluster state updates are bundled together and a processor implements the `ClusterStateApplier` interface (IngestService implements this interface too). The idea with the ingest cluster state listener is that it is guaranteed to update the processor factory first, before the ingest service creates pipelines with their respective processor instances. Currently this concept is used in the enrich branch: https://github.com/elastic/elasticsearch/blob/enrich/x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichProcessorFactory.java#L21 In this case, a processor factory is interested in the enrich indices' _meta mapping fields. This is the third PR that merges changes made to the server module from the enrich branch (see elastic#32789) into the master branch. Changes to the server module are merged separately from the PR that will merge enrich into master, so that these changes can be reviewed in isolation.
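
The ordering guarantee described here can be illustrated with a toy model. Nothing below is the real `IngestService` API: `ToyIngestService`, `ToyProcessorFactory`, and `add_ingest_cluster_state_listener` are hypothetical names, and the only point of the sketch is that registered listeners run before pipelines are rebuilt.

```python
class ToyIngestService:
    """Hypothetical model: listeners are notified before pipelines are rebuilt."""

    def __init__(self):
        self._listeners = []
        self.events = []  # records the order of operations for inspection

    def add_ingest_cluster_state_listener(self, listener):
        self._listeners.append(listener)

    def apply_cluster_state(self, state):
        # Step 1: let every registered factory see the new state.
        for listener in self._listeners:
            listener(state)
        # Step 2: only now rebuild pipelines, so factories are up to date.
        self.events.append("pipelines rebuilt")


class ToyProcessorFactory:
    """Hypothetical factory that needs policy data from the cluster state."""

    def __init__(self, events):
        self.policies = {}
        self._events = events

    def on_cluster_state(self, state):
        self.policies = dict(state.get("policies", {}))
        self._events.append("factory updated")


service = ToyIngestService()
factory = ToyProcessorFactory(service.events)
service.add_ingest_cluster_state_listener(factory.on_cluster_state)
service.apply_cluster_state({"policies": {"my_policy": {"match_field": "host_name"}}})
```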

This change also slightly modifies the stats response so that it can be consumed more easily by monitoring and other users (coordinator stats are now in a list instead of a map and have an additional field for the node id). Relates to elastic#32789
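
The map-to-list change could look like the sketch below. The field names (`node_id`, `queue_size`) and the helper `coordinator_stats_as_list` are hypothetical and not taken from the actual stats response; sorting by node id is only there to make the output deterministic.

```python
def coordinator_stats_as_list(stats_by_node):
    """Turn {node_id: stats} into a list of entries that carry their node id."""
    return [
        {"node_id": node_id, **stats}
        for node_id, stats in sorted(stats_by_node.items())
    ]


# Made-up sample stats keyed by node id (the "map" shape).
as_map = {"node-b": {"queue_size": 0}, "node-a": {"queue_size": 3}}
as_list = coordinator_stats_as_list(as_map)
```

A list of uniform entries is generally easier to index as documents than a map whose keys are node ids, which is presumably why monitoring benefits from it.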


This PR changes ingest execution to be non-blocking by adding an additional method to the Processor interface that accepts a BiConsumer as handler, and by changing IngestService#executeBulkRequest(...) to ingest documents in a non-blocking fashion if and only if a processor executes in a non-blocking fashion. This is the second PR that merges changes made to the server module from the enrich branch (see #32789) into the master branch. The plan is to merge changes made to the server module separately from the PR that will merge enrich into master, so that these changes can be reviewed in isolation. This change originates from the enrich branch and was introduced there in #43361.
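
The callback style described here can be sketched as follows. This is a Python toy, not the Java `Processor` interface: `execute_async`, `UppercaseProcessor`, and `DeferredProcessor` are hypothetical names, and the `(result, error)` handler only mirrors the idea of a BiConsumer that receives either a document or an exception.

```python
class Processor:
    """Toy processor with a blocking and a callback-based entry point."""

    def execute(self, document):
        raise NotImplementedError

    def execute_async(self, document, handler):
        # Default: adapt the blocking method to the callback style.
        try:
            result = self.execute(document)
        except Exception as exc:
            handler(None, exc)
        else:
            handler(result, None)


class UppercaseProcessor(Processor):
    """Blocking processor; inherits the default callback adapter."""

    def execute(self, document):
        return {**document, "message": document["message"].upper()}


class DeferredProcessor(Processor):
    """Non-blocking processor: returns immediately and completes later."""

    def __init__(self):
        self._pending = []

    def execute_async(self, document, handler):
        self._pending.append((document, handler))

    def complete_all(self):
        pending, self._pending = self._pending, []
        for document, handler in pending:
            handler({**document, "deferred": True}, None)


outcomes = []


def record(result, error):
    outcomes.append((result, error))


UppercaseProcessor().execute_async({"message": "hi"}, record)
UppercaseProcessor().execute_async({}, record)  # missing field -> error path
deferred = DeferredProcessor()
deferred.execute_async({"message": "later"}, record)
count_before = len(outcomes)  # the deferred call has not completed yet
deferred.complete_all()
```

The caller never waits on `DeferredProcessor`: its result arrives through the handler only once `complete_all()` runs, which is the behavior a lookup-based processor needs to avoid blocking the calling thread.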


Backport of #46241 (the non-blocking ingest change described above).


This PR also includes HLRC docs for the enrich stats api. Relates to elastic#32789


which adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices. Besides the new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner. Closes #32789

which is a backport merge and adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices. Besides the new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner. Related to #32789

jakelandis added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Aug 10, 2018

jakelandis closed this as completed Aug 23, 2018

jakelandis reopened this Jan 28, 2019

jakelandis changed the title ~~[ingest] reference data backed enriching processor(s)~~ [ingest] lookup processor / data backed enriching processor(s) Jan 28, 2019

jakelandis added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Jan 28, 2019

jakelandis mentioned this issue Jan 28, 2019

Elasticsearch filter/query processor #37635

Closed

talevy mentioned this issue Feb 12, 2019

Add Logstash-filter-translate as an Ingest Processor #29096

Closed

martijnvg self-assigned this Mar 1, 2019

jakelandis assigned hub-cap and jbaiera Apr 8, 2019

jakelandis changed the title ~~[ingest] lookup processor / data backed enriching processor(s)~~ [ingest] Enrich documents prior to indexing Apr 8, 2019

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 9, 2019

Added enrich policy definition.

492936d

Relates to elastic#32789

martijnvg mentioned this issue Apr 9, 2019

Added enrich policy definition. #41003

Merged

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 9, 2019

Expose Engine.Searcher provider to ingest plugins.

d5f8e1d

Relates to elastic#32789

martijnvg mentioned this issue Apr 9, 2019

Expose Engine.Searcher provider to ingest plugins. #41010

Merged

martijnvg added a commit that referenced this issue Apr 12, 2019

Added enrich policy definition. (#41003)

7ea14fd

Relates to #32789

martijnvg added a commit that referenced this issue Apr 12, 2019

Added enrich policy definition. (#41003)

d01c1f3

Relates to #32789

jakelandis added the 7x label Apr 15, 2019

martijnvg added a commit that referenced this issue Apr 24, 2019

Expose Engine.Searcher provider to ingest plugins. (#41010)

284c508

Relates to #32789

martijnvg added a commit that referenced this issue Apr 24, 2019

Expose Engine.Searcher provider to ingest plugins. (#41010)

a61ec11

Relates to #32789

martijnvg mentioned this issue Apr 25, 2019

Add enrich processor and enrich source meta field mapper. #41521

Closed

jakelandis mentioned this issue Apr 26, 2019

Add enrich processor #41532

Merged

martijnvg mentioned this issue Sep 12, 2019

Add ingest cluster state listeners #46650

Merged

martijnvg mentioned this issue Sep 13, 2019

Expose enrich stats api to monitoring. #46708

Merged

martijnvg mentioned this issue Sep 25, 2019

Backport: allow ingest processors to execute in a non blocking manner #47122

Merged

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Sep 30, 2019

Add HLRC support for enrich stats API

db9ca43

This PR also includes HLRC docs for the enrich stats api. Relates to elastic#32789

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Sep 30, 2019

Add HLRC support for enrich stats API

79bfbdf

This PR also includes HLRC docs for the enrich stats api. Relates to elastic#32789

martijnvg mentioned this issue Sep 30, 2019

Add HLRC support for enrich stats API #47306

Merged

martijnvg added a commit that referenced this issue Oct 10, 2019

Add HLRC support for enrich stats API (#47306)

0caca2f

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

martijnvg added a commit that referenced this issue Oct 10, 2019

Add HLRC support for enrich stats API (#47306)

aace42d

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

tlrx mentioned this issue Oct 10, 2019

[Feature] extract bots in useragent module #47568

Closed

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Oct 14, 2019

Add HLRC support for enrich execute policy API

750c9ce

This PR also includes HLRC docs for the enrich stats api. Relates to elastic#32789

martijnvg mentioned this issue Oct 14, 2019

Add HLRC support for enrich execute policy API #47991

Merged

martijnvg added a commit that referenced this issue Oct 14, 2019

Add HLRC support for enrich execute policy API (#47991)

6ed7d69

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

martijnvg added a commit that referenced this issue Oct 14, 2019

Add HLRC support for enrich execute policy API (#47991)

7cc73f6

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

martijnvg mentioned this issue Oct 15, 2019

Add enrich processor #48039

Merged

martijnvg closed this as completed in #48039 Oct 15, 2019

cjcenizal mentioned this issue Oct 15, 2019

Enrich Policies app elastic/kibana#46987

Closed

Mpdreamz mentioned this issue Nov 19, 2019

[meta] 7.5 release elastic/elasticsearch-net#4232

Closed

24 tasks
