Add enrich processor #48039

martijnvg · 2019-10-15T08:09:54Z

This PR adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices.

Besides a new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.

Closes #32789

Relates to #32789

This allows the transport client to use this class in enrich APIs. Relates to #40997

There is no need to create an enrich store component for the transport layer, since the inner components of the store are available either in the master node calls or via an already injected ClusterService. This commit cleans up the class, adds the forthcoming delete call and tests the new code.

Relates to #32789

This commit wires up the Rest calls and Transport calls for PUT enrich policy, as well as tests and rest spec additions.

Move the put policy API YAML test to this REST module. The main benefit is that all tests will then be run when running: `./gradlew -p x-pack/plugin/enrich check`

The REST QA module starts a node with the default distribution and a basic license. This QA module will also be used for adding other REST tests (not YAML), for example the REST tests needed for #41532. When we start working on security integration, we can add a security QA module under the qa folder. At some point we should also add a multi-node QA module.

The enrich processor performs a lookup in a locally allocated enrich index shard using a field value from the document being enriched. If there is a match, the `_source` of the enrich document is fetched. The document being enriched then receives the decorate values from the enrich document, based on the decorate fields configured in the pipeline.

Note that using the `_source` field is temporary until the enrich source field that is part of #41521 is merged into the enrich branch. Using the `_source` field involves significant decompression, which is not desired for enrich use cases.

The policy specifies which field in the enrich index to query and which fields are available to decorate a document being enriched.

The enrich processor has the following configuration options:

* `policy_name` - the name of the policy this processor should use
* `enrich_key` - the field in the document being enriched that holds the lookup value
* `ignore_missing` - whether to allow the key field to be missing
* `enrich_values` - a list of fields to decorate the document being enriched with. Each entry holds a source field and a target field. The source field indicates which decorate field available in the policy to use. The target field controls the field name to use in the document being enriched. The source and target fields can be the same.

Example pipeline config:

```
{
  "processors": [
    {
      "policy_name": "my_policy",
      "enrich_key": "host_name",
      "enrich_values": [
        {
          "source": "globalRank",
          "target": "global_rank"
        }
      ]
    }
  ]
}
```

In the above example, documents are being enriched with a global rank value. Each document that has a match in the enrich index based on its `host_name` field gets a global rank value, which is fetched from the `globalRank` field in the enrich index and saved as `global_rank` in the document being enriched.

This PR is part one of #41521
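The lookup-and-decorate flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the enrich index is modelled as an in-memory dict keyed by the lookup value, and `ENRICH_INDEX` and `enrich_document` are hypothetical names, not part of Elasticsearch.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for an enrich index: lookup value -> _source of the enrich document.
ENRICH_INDEX: Dict[str, Dict[str, Any]] = {
    "elastic.co": {"globalRank": 25, "tld": "co"},
}


def enrich_document(
    doc: Dict[str, Any],
    enrich_key: str,
    enrich_values: List[Dict[str, str]],
    ignore_missing: bool = False,
    index: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Copy the configured source fields of the matching enrich document into doc."""
    index = ENRICH_INDEX if index is None else index
    if enrich_key not in doc:
        if ignore_missing:
            return doc  # key field absent and tolerated: nothing to do
        raise KeyError(f"field [{enrich_key}] is missing")
    match = index.get(doc[enrich_key])
    if match is None:
        return doc  # no enrich document matched: leave the document unchanged
    for entry in enrich_values:
        if entry["source"] in match:
            doc[entry["target"]] = match[entry["source"]]
    return doc


enriched = enrich_document(
    {"host_name": "elastic.co"},
    enrich_key="host_name",
    enrich_values=[{"source": "globalRank", "target": "global_rank"}],
)
```

Here `enriched` carries the original `host_name` plus a `global_rank` field copied from `globalRank`, mirroring the example pipeline config.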

This commit wires up the Rest calls and Transport calls for listing all enrich policies, as well as tests and rest spec additions.

This commit wires up the Rest calls and Transport calls for DELETE enrich policy, as well as tests and rest spec additions.

Adds the foundation of the execution logic to execute an enrich policy. Validates the source index existence as well as mappings, creates a new enrich index for the policy, reindexes the source index into the new enrich index, and swaps the enrich alias for the policy to the new index.
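The four steps listed above (validate, create, reindex, swap alias) can be sketched against an in-memory model. This is a rough illustration, not the policy runner itself: `FakeCluster`, `execute_policy`, and the `.enrich-<policy>-<timestamp>` naming used here are assumptions made for the example.

```python
from typing import Any, Dict, List


class FakeCluster:
    """Hypothetical in-memory stand-in for a cluster: indices, mappings and aliases."""

    def __init__(self) -> None:
        self.indices: Dict[str, List[Dict[str, Any]]] = {}
        self.mappings: Dict[str, Dict[str, str]] = {}
        self.aliases: Dict[str, str] = {}


def execute_policy(
    cluster: FakeCluster,
    policy_name: str,
    source_index: str,
    enrich_key: str,
    now_millis: int,
) -> str:
    """Run one policy execution and return the name of the new enrich index."""
    # 1. Validate that the source index exists and maps the enrich key field.
    if source_index not in cluster.indices:
        raise ValueError(f"source index [{source_index}] does not exist")
    if enrich_key not in cluster.mappings.get(source_index, {}):
        raise ValueError(f"field [{enrich_key}] is not mapped in [{source_index}]")

    # 2. Create a fresh enrich index for this execution (naming is an assumption).
    alias = f".enrich-{policy_name}"
    new_index = f"{alias}-{now_millis}"
    cluster.indices[new_index] = []

    # 3. Reindex: copy every source document into the new enrich index.
    cluster.indices[new_index].extend(dict(d) for d in cluster.indices[source_index])

    # 4. Swap the policy alias so readers see the new index from now on.
    cluster.aliases[alias] = new_index
    return new_index


cluster = FakeCluster()
cluster.indices["hosts"] = [{"host_name": "elastic.co", "globalRank": 25}]
cluster.mappings["hosts"] = {"host_name": "keyword"}
first = execute_policy(cluster, "my_policy", "hosts", "host_name", now_millis=1)
second = execute_policy(cluster, "my_policy", "hosts", "host_name", now_millis=2)
```

Because readers go through the alias, a second execution can build its index in full before the alias moves to it.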

…41839) its own helper method to determine alias / policy base name. This way both the enrich processor and policy runner use the same logic to determine the alias to use. Relates to #32789

Relates to #32789

) Reindex uses scroll searches to read the source data. It is more efficient to read more data in one search scroll round than in several. I think 10000 is a good sweet spot. Relates to #32789
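As a back-of-the-envelope illustration of why the larger batch helps, the number of scroll rounds needed is roughly the document count divided by the batch size, rounded up. The helper name `scroll_rounds` and the one-million-document figure are made up for this sketch, and it ignores the final empty page a real scroll would return.

```python
def scroll_rounds(total_docs: int, batch_size: int) -> int:
    """Number of scroll pages needed to read total_docs at batch_size docs per page."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Integer ceiling division avoids floating point rounding.
    return (total_docs + batch_size - 1) // batch_size


rounds_before = scroll_rounds(1_000_000, 1_000)   # old fetch size
rounds_after = scroll_rounds(1_000_000, 10_000)   # new fetch size
```

For the assumed one million source documents this is a tenfold reduction in round trips.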

)

The policy runner now keeps track of the enrich key field in the `_meta` field of the enrich index. The ingest processor uses the field name defined in the enrich index `_meta` field and not the one in the policy. This avoids problems if the policy is changed without a new enrich index being created. It also completely decouples EnrichPolicy from ExactMatchProcessor.

Without this change, the following scenario results in failure:

1) Create policy
2) Execute policy
3) Create pipeline with enrich processor
4) Use pipeline
5) Update enrich key in policy
6) Use pipeline, which then fails.
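The scenario can be replayed with a small sketch. It assumes a simplified `_meta` layout with a single `enrich_key` entry; `build_enrich_index` and `lookup` are hypothetical helpers standing in for the policy runner and the processor.

```python
from typing import Any, Dict, List, Optional


def build_enrich_index(policy: Dict[str, str], source_docs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical policy runner: copies the docs and records the key field in _meta."""
    return {
        "_meta": {"enrich_key": policy["enrich_key"]},  # assumed _meta layout
        "docs": [dict(d) for d in source_docs],
    }


def lookup(enrich_index: Dict[str, Any], value: Any) -> Optional[Dict[str, Any]]:
    """Hypothetical processor lookup: the match field comes from _meta, not from the policy."""
    key_field = enrich_index["_meta"]["enrich_key"]
    for d in enrich_index["docs"]:
        if d.get(key_field) == value:
            return d
    return None


policy = {"enrich_key": "host"}                       # 1) create policy
enrich_index = build_enrich_index(                    # 2) execute policy
    policy, [{"host": "elastic.co", "globalRank": 25}]
)
policy["enrich_key"] = "domain"                       # 5) update enrich key in policy
match = lookup(enrich_index, "elastic.co")            # 6) lookup still uses the key from _meta
```

Since the lookup never consults `policy`, the edit in step 5 has no effect until the policy is executed again.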

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

Prior to this change, the `target_field` would always be a JSON array field in the document being ingested. This was to account for the possibility that multiple enrich documents could be inserted into the `target_field`. However, the default `max_matches` is `1`, meaning that by default only a single enrich document would be added to the `target_field` JSON array. This commit changes this: if `max_matches` is set to `1`, the single enrich document is added to the `target_field` as a JSON object, and if it is configured to a higher value, the enrich documents are added as a JSON array (even if only a single enrich document happens to match).
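The rule can be expressed as a short function. This is a sketch only; `set_target_field` is a hypothetical helper, and the decision depends on the configured `max_matches`, not on how many documents actually matched.

```python
from typing import Any, Dict, List


def set_target_field(
    doc: Dict[str, Any],
    target_field: str,
    matches: List[Dict[str, Any]],
    max_matches: int = 1,
) -> Dict[str, Any]:
    """Store enrich matches as an object when max_matches is 1, otherwise as an array."""
    if max_matches < 1:
        raise ValueError("max_matches must be at least 1")
    limited = matches[:max_matches]
    if not limited:
        return doc  # nothing matched: leave the document unchanged
    doc[target_field] = limited[0] if max_matches == 1 else limited
    return doc


hit = {"globalRank": 25}
as_object = set_target_field({}, "enriched", [hit], max_matches=1)
as_array = set_target_field({}, "enriched", [hit], max_matches=3)
```

With the same single match, `as_object["enriched"]` is a dict while `as_array["enriched"]` is a one-element list.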

This PR adds the ability to run the enrich policy execution task in the background, returning a task id instead of waiting for the completed operation.

…es is 1.

elasticmachine · 2019-10-15T08:09:56Z

Pinging @elastic/es-core-features (:Core/Features/Ingest)


martijnvg and others added 30 commits April 5, 2019 09:31

first commit

35af474

Merge remote-tracking branch 'es/master' into enrich

0ae4546

Fixed build issue after merge

c905559

Merge remote-tracking branch 'es/master' into enrich

018b1e2

Added enrich policy definition. (#41003)

7ea14fd

Relates to #32789

Move the policy class to xpack core module. (#41311)

def1024

This allows the transport client to use this class in enrich APIs. Relates to #40997

Merge remote-tracking branch 'upstream/master' into enrich

2e9e480

Merge remote-tracking branch 'upstream/master' into enrich

ba32255

Expose Engine.Searcher provider to ingest plugins. (#41010)

284c508

Relates to #32789

Add enrich policy PUT API (#41383)

1c28f30

This commit wires up the Rest calls and Transport calls for PUT enrich policy, as well as tests and rest spec additions.

Merge remote-tracking branch 'es/master' into enrich

8c8e3e0

Add enrich policy list API (#41553)

c999c09

This commit wires up the Rest calls and Transport calls for listing all enrich policies, as well as tests and rest spec additions.

Merge remote-tracking branch 'es/master' into enrich

50f3177

comment out part of test until delete policy api had been added

593a1c1

Add enrich policy DELETE API (#41495)

83617e8

This commit wires up the Rest calls and Transport calls for DELETE enrich policy, as well as tests and rest spec additions.

Change policy runner to use helper method on EnrichPolicy instead of (#…

33fddef

…41839) its own helper method to determine alias / policy base name. This way both the enrich processor and policy runner use the same logic to determine the alias to use. Relates to #32789

Rename enrich policy index_pattern field to indices. (#41836)

ecffd73

Relates to #32789

Change the reindex fetch in policy runner from 1000 to 10000 and (#41838

0bf0f52

) Reindex uses scroll searches to read the source data. It is more efficient to read more data in one search scroll round than in several. I think 10000 is a good sweet spot. Relates to #32789

Merge remote-tracking branch 'upstream/master' into enrich

5a02999

Merge remote-tracking branch 'es/master' into enrich

97d658e

Enrich store should only update the policies via an update task. (#41944

28c529f

)

Tidy up EnrichPolicy class (#41877)

4ebee27

Merge remote-tracking branch 'es/master' into enrich

df3a3f3

Merge branch 'master' into enrich

4dde9e0

Remove schedule field from EnrichPolicy (#42143)

ceab8ee

martijnvg and others added 8 commits October 14, 2019 19:44

Add HLRC support for enrich execute policy API (#47991)

6ed7d69

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

Add wait for completion for Enrich policy execution (#47886)

b0ccce2

This PR adds the ability to run the enrich policy execution task in the background, returning a task id instead of waiting for the completed operation.

Fix broken test

382f264

Fixed test, take into account that Map can be the result if max_match…

7d68935

…es is 1.

remove eclipse conditional

6b0cfb5

Merge remote-tracking branch 'es/master' into enrich

85ad27e

adjusted minimal supported version

1fcadbb

martijnvg added >feature, release highlight, :Data Management/Ingest Node, v8.0.0, v7.5.0 labels Oct 15, 2019

martijnvg removed the v7.5.0 label Oct 15, 2019

martijnvg mentioned this pull request Oct 15, 2019

Backport: Add enrich processor #48040

Merged

fixed invalid reference

d941e1b

martijnvg merged commit d941e1b into master Oct 15, 2019

martijnvg added backport pending v7.5.0 and removed backport pending labels Oct 15, 2019

martijnvg mentioned this pull request Oct 16, 2019

Geo-Match Enrich Processor #47243

Merged

astefan mentioned this pull request Oct 17, 2019

[CI] CliSecurityIT testDescribeDocumentExcluded failed with missing audit log #48117

Closed

Mpdreamz mentioned this pull request Nov 19, 2019

[meta] 7.5 release elastic/elasticsearch-net#4232

Closed

24 tasks

tahaderouiche mentioned this pull request Dec 11, 2019

Enrich processor: allow scheduling of policy executions #50071

Open

colings86 deleted the enrich branch May 27, 2020 07:43

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

joegallo mentioned this pull request Jan 13, 2022

Filter enrich policy index deletes to just the policy's associated indices #82568

Merged
