Backport: Add enrich processor #48040

martijnvg · 2019-10-15T08:13:06Z

This is a backport of #48039

~~Added a discuss label, because we still need to think about whether the enrich processor should be part of 7.5 or 7.6.~~

This PR adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices.

Besides a new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.

Relates to #32789

This allows the transport client to use this class in enrich APIs. Relates to #40997

There is no need to create an enrich store component for the transport layer, since the inner components of the store are available either in the master node calls or via an already injected ClusterService. This commit cleans up the class, adds the forthcoming delete call and tests the new code.

Relates to #32789

This commit wires up the Rest calls and Transport calls for PUT enrich policy, as well as tests and rest spec additions.

Move the put policy API YAML test to this REST module. The main benefit is that all tests will then be run when running `./gradlew -p x-pack/plugin/enrich check`. The REST QA module starts a node with the default distribution and a basic license. This QA module will also be used for adding different REST tests (not YAML), for example the REST tests needed for #41532. When we start working on security integration, we can add a security QA module under the qa folder. At some point we should also add a multi-node QA module.

The enrich processor performs a lookup in a locally allocated enrich index shard using a field value from the document being enriched. If there is a match, the _source of the enrich document is fetched. The document being enriched then gets the decorate values from the enrich document, based on the decorate fields configured in the pipeline.

Note that the usage of the _source field is temporary until the enrich source field that is part of #41521 is merged into the enrich branch. Using the _source field involves significant decompression, which is not desired for enrich use cases.

The policy contains the information about which field in the enrich index to query and which fields are available to decorate a document being enriched.

The enrich processor has the following configuration options:

* `policy_name` - the name of the policy this processor should use
* `enrich_key` - the field in the document being enriched that holds the lookup value
* `ignore_missing` - whether to allow the key field to be missing
* `enrich_values` - a list of fields to decorate the document being enriched with. Each entry holds a source field and a target field. The source field indicates which decorate field available in the policy to use. The target field controls the field name to use in the document being enriched. The source and target fields can be the same.

Example pipeline config:

```
{
  "processors": [
    {
      "policy_name": "my_policy",
      "enrich_key": "host_name",
      "enrich_values": [
        {
          "source": "globalRank",
          "target": "global_rank"
        }
      ]
    }
  ]
}
```

In the above example, documents are being enriched with a global rank value. Each document that has a match in the enrich index based on its host_name field gets a global rank field value, which is fetched from the `globalRank` field in the enrich index and saved as `global_rank` in the document being enriched.

This PR is part one of #41521
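The lookup-and-decorate behaviour described above can be illustrated with a minimal Python sketch. It is not the processor's implementation: a plain dictionary stands in for the enrich index shard, the function name `enrich_document` is hypothetical, and the sample host name and rank value are made up. Only the option names (`enrich_key`, `enrich_values`, `ignore_missing`) come from the description.

```python
from typing import Any, Dict, List


def enrich_document(document: Dict[str, Any],
                    enrich_index: Dict[Any, Dict[str, Any]],
                    enrich_key: str,
                    enrich_values: List[Dict[str, str]],
                    ignore_missing: bool = False) -> Dict[str, Any]:
    """Copy the configured fields of a matching enrich document into `document`.

    `enrich_index` is a hypothetical stand-in for the enrich index:
    it maps a lookup value to the enrich document stored for it.
    """
    if enrich_key not in document:
        if ignore_missing:
            return document  # key field absent and that is allowed
        raise KeyError(f"field [{enrich_key}] is missing from the document")

    match = enrich_index.get(document[enrich_key])
    if match is None:
        return document  # no match: the document is left untouched

    for entry in enrich_values:
        # "source" names the field in the enrich document,
        # "target" names the field to write in the ingested document.
        if entry["source"] in match:
            document[entry["target"]] = match[entry["source"]]
    return document


# Made-up sample data mirroring the example pipeline config.
sample_index = {"example.org": {"globalRank": 25}}
sample_values = [{"source": "globalRank", "target": "global_rank"}]
enriched = enrich_document({"host_name": "example.org"}, sample_index,
                           "host_name", sample_values)
print(enriched)
```

In this sketch a document without a match passes through unchanged, which is an assumption about the no-match case rather than something the description states.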

This commit wires up the Rest calls and Transport calls for listing all enrich policies, as well as tests and rest spec additions.

This commit wires up the Rest calls and Transport calls for DELETE enrich policy, as well as tests and rest spec additions.

Backports #41088. Adds the foundation of the logic to execute an enrich policy: validates the source index existence as well as its mappings, creates a new enrich index for the policy, reindexes the source index into the new enrich index, and swaps the enrich alias for the policy to the new index.
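The four execution steps listed above can be sketched against an in-memory stand-in for a cluster. Everything here is an assumption made for illustration: the `FakeCluster` class, the `run_policy` function, the `enrich-<policy>` alias naming and the timestamp suffix are hypothetical and are not taken from the actual policy runner.

```python
from typing import Dict, Iterable, List, Set


class FakeCluster:
    """Hypothetical in-memory cluster: mappings, documents and aliases."""

    def __init__(self) -> None:
        self.mappings: Dict[str, Set[str]] = {}
        self.documents: Dict[str, List[dict]] = {}
        self.aliases: Dict[str, str] = {}

    def create_index(self, name: str, fields: Iterable[str],
                     docs: Iterable[dict] = ()) -> None:
        self.mappings[name] = set(fields)
        self.documents[name] = [dict(d) for d in docs]


def run_policy(cluster: FakeCluster, policy_name: str, source_index: str,
               required_fields: Iterable[str], timestamp: int) -> str:
    # 1. Validate that the source index exists and maps the needed fields.
    if source_index not in cluster.mappings:
        raise ValueError(f"source index [{source_index}] does not exist")
    missing = set(required_fields) - cluster.mappings[source_index]
    if missing:
        raise ValueError(f"unmapped fields in source index: {sorted(missing)}")

    # 2. Create a fresh enrich index for this run (naming is an assumption).
    alias = f"enrich-{policy_name}"
    new_index = f"{alias}-{timestamp}"
    cluster.create_index(new_index, cluster.mappings[source_index])

    # 3. Reindex: copy every source document into the new enrich index.
    for doc in cluster.documents[source_index]:
        cluster.documents[new_index].append(dict(doc))

    # 4. Swap the policy alias so readers see the new index from now on.
    cluster.aliases[alias] = new_index
    return new_index


cluster = FakeCluster()
cluster.create_index("hosts", {"host_name", "globalRank"},
                     [{"host_name": "example.org", "globalRank": 25}])
first = run_policy(cluster, "my_policy", "hosts", ["host_name"], timestamp=1)
second = run_policy(cluster, "my_policy", "hosts", ["host_name"], timestamp=2)
print(cluster.aliases)
```

Because readers only ever resolve the alias, each run can build its index in full before the swap makes it visible.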

…41839) its own helper method to determine alias / policy base name. This way both the enrich processor and policy runner use the same logic to determine the alias to use. Relates to #32789

Relates to #32789

) Reindex uses scroll searches to read the source data. It is more efficient to read more data in one search scroll round than in several. I think 10000 is a good sweet spot. Relates to #32789

)

The enrich key field is tracked in the _meta field of the enrich index by the policy runner. The ingest processor uses the field name defined in the enrich index _meta field and not the one in the policy. This avoids problems if the policy is changed without a new enrich index being created. This also completely decouples EnrichPolicy from ExactMatchProcessor. The following scenario results in failure without this change: 1) Create policy 2) Execute policy 3) Create pipeline with enrich processor 4) Use pipeline 5) Update enrich key in policy 6) Use pipeline, which then fails.
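The failure scenario can be reproduced in miniature. In this hedged Python sketch the enrich index is a dictionary with a `_meta` entry and a list of documents; the function names and the data are hypothetical, and the real lookup is a query against an index rather than a list scan.

```python
from typing import Any, Dict, List


def execute_policy(policy: Dict[str, str],
                   source_docs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a hypothetical enrich index that records the key field in _meta."""
    return {
        "_meta": {"enrich_key": policy["enrich_key"]},
        "docs": [dict(d) for d in source_docs],
    }


def lookup(enrich_index: Dict[str, Any], value: Any) -> List[Dict[str, Any]]:
    """Query the field named in the index _meta, not the one in the policy."""
    field = enrich_index["_meta"]["enrich_key"]
    return [d for d in enrich_index["docs"] if d.get(field) == value]


policy = {"enrich_key": "host_name"}                        # 1) create policy
index = execute_policy(policy, [{"host_name": "example.org",
                                 "globalRank": 25}])         # 2) execute policy
before = lookup(index, "example.org")                        # 4) use pipeline
policy["enrich_key"] = "domain"                              # 5) update the policy
after = lookup(index, "example.org")                         # 6) use pipeline again
print(before == after)
```

The second lookup still matches because the key field is read from the index that was actually built, so an edited policy cannot get out of step with it.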

This commit introduces a geo-match enrich processor that looks up a specific `geo_point` field in the enrich-index for all entries that have a geo_shape match field that meets some specific relation criteria with the input field. For example, the enrich index may contain documents with zipcodes and their respective geo_shape. Documents ingested with a geo_point field can then be enriched with the zipcode of the shape that contains the point. This commit also refactors MatchProcessor by moving much of the shared code to AbstractEnrichProcessor. Closes #42639.
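As a rough illustration of the zipcode example, the sketch below matches a point against polygons with a standard ray-casting test. It assumes simple planar polygons given as lists of `(lon, lat)` pairs and supports only a "point is contained in shape" relation; the zipcodes, coordinates and function names are made up, and the real processor relies on geo_shape queries instead.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a ray going east."""
    lon, lat = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def geo_match(point: Point, enrich_index: List[Dict]) -> List[str]:
    """Return the zipcodes of all enrich entries whose shape contains the point."""
    return [entry["zipcode"] for entry in enrich_index
            if point_in_polygon(point, entry["shape"])]


# Made-up enrich entries: two square "zipcode areas".
zip_index = [
    {"zipcode": "00001", "shape": [(0, 0), (10, 0), (10, 10), (0, 10)]},
    {"zipcode": "00002", "shape": [(20, 0), (30, 0), (30, 10), (20, 10)]},
]
print(geo_match((5, 5), zip_index))
```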

to index monitoring docs.

Adds a check when running an Enrich policy to make sure that an Enrich index is force merged down to one segment, and if it was not fully merged, attempts the merge again, up to a configurable number of times.
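A retry loop of this kind might look like the following Python sketch. The callables, the `FlakyIndex` test double and the decision to raise once the attempts are exhausted are assumptions for illustration; the description only says that the merge is retried up to a configurable number of times.

```python
from typing import Callable, List


def force_merge_with_retry(force_merge: Callable[[], None],
                           segment_count: Callable[[], int],
                           max_attempts: int) -> int:
    """Force merge until one segment remains; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        force_merge()
        if segment_count() == 1:
            return attempt
    # Assumption: give up with an error once all attempts are used.
    raise RuntimeError(
        f"index still has {segment_count()} segments after {max_attempts} attempts")


class FlakyIndex:
    """Hypothetical index whose merges do not always reach one segment."""

    def __init__(self, counts_after_each_merge: List[int]) -> None:
        self._counts = list(counts_after_each_merge)
        self.segments = 8

    def force_merge(self) -> None:
        self.segments = self._counts.pop(0)


flaky = FlakyIndex([3, 1])  # first merge leaves 3 segments, second leaves 1
attempts_used = force_merge_with_retry(flaky.force_merge,
                                       lambda: flaky.segments,
                                       max_attempts=3)
print(attempts_used)
```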

…#47359) The current shard selection logic selects a random shard copy, instead of selecting the local shard copy and falling back to a random shard copy only if no local copy is available. The latter is the desired behaviour for enrich. By reusing `OperationRouting#searchShards(...)` we get the desired behaviour and reuse the same logic that the search API uses.
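The preference order described here (local copy first, random copy otherwise) can be written down in a few lines. This Python sketch is not how `OperationRouting#searchShards(...)` is implemented; shard copies are reduced to the ids of the nodes holding them, and all names are hypothetical.

```python
import random
from typing import Optional, Sequence


def select_shard_copy(copy_node_ids: Sequence[str],
                      local_node_id: str,
                      rng: Optional[random.Random] = None) -> str:
    """Pick the node to read a shard from, preferring the local copy."""
    if not copy_node_ids:
        raise ValueError("no shard copies available")
    if local_node_id in copy_node_ids:
        return local_node_id  # a local copy avoids a network hop
    # No local copy: fall back to a random copy.
    chooser = rng if rng is not None else random
    return chooser.choice(list(copy_node_ids))


local_pick = select_shard_copy(["node-a", "node-b"], "node-b")
remote_pick = select_shard_copy(["node-a", "node-b"], "node-c",
                                rng=random.Random(0))
print(local_pick, remote_pick)
```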

) Currently, if the document being ingested contains a field value other than a string, the processor fails with an error. This commit changes the match processor to handle number values and array values correctly. If a JSON array is detected, the `terms` query is used instead of the `term` query.
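The choice between the two query types can be sketched as a small function that builds the query body as a Python dictionary. The function name is hypothetical; the `term` and `terms` body shapes follow the Elasticsearch query DSL.

```python
from typing import Any, Dict


def build_match_query(field: str, value: Any) -> Dict[str, Any]:
    """Use a `terms` query for array values and a `term` query otherwise."""
    if isinstance(value, (list, tuple)):
        return {"terms": {field: list(value)}}
    # Strings and numbers are passed through unchanged.
    return {"term": {field: value}}


print(build_match_query("host_name", "example.org"))
print(build_match_query("port", 9200))
print(build_match_query("host_name", ["example.org", "example.com"]))
```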

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

Changes the execution logic to create a new task using the execute request, and attaches the new task to the policy runner to be updated. Also, a new response is now returned from the execute api, which contains either the task id of the execution, or the completed status of the run. The fields are mutually exclusive to make it easier to discern what type of response it is.
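A response with two mutually exclusive fields can be modelled as below. This is only a sketch of the idea: the class name, the field names `task_id` and `status`, and the sample values are assumptions, not the actual response class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecutePolicyResponse:
    """Holds either a task id (still running) or a completed status."""

    task_id: Optional[str] = None
    status: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one of the two fields must be set.
        if (self.task_id is None) == (self.status is None):
            raise ValueError("exactly one of task_id or status must be set")

    @property
    def is_async(self) -> bool:
        return self.task_id is not None


background = ExecutePolicyResponse(task_id="node-1:42")  # made-up task id
finished = ExecutePolicyResponse(status="COMPLETE")      # made-up status value
print(background.is_async, finished.is_async)
```

With the invariant enforced at construction, a caller only needs to check which field is present to know what kind of response it received.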

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

Prior to this change, the `target_field` would always be a JSON array field in the document being ingested. This was to account for the fact that multiple enrich documents could be inserted into the `target_field`. However, the default `max_matches` is `1`, meaning that by default only a single enrich document would be added to the `target_field` JSON array. This commit changes this: if `max_matches` is set to `1`, the single enrich document is added to the `target_field` as a JSON object, and if it is configured to a higher value, the enrich documents are added as a JSON array (even if only a single enrich document happens to match).
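The resulting shape rule can be captured in a short Python sketch. The function name and the sample data are hypothetical; only `target_field` and `max_matches` come from the description, and leaving the document untouched when nothing matches is an assumption.

```python
from typing import Any, Dict, List


def set_target_field(document: Dict[str, Any], target_field: str,
                     matches: List[Dict[str, Any]],
                     max_matches: int = 1) -> Dict[str, Any]:
    """Store matches as an object when max_matches is 1, else as an array."""
    if not matches:
        return document  # assumption: nothing is written without a match
    limited = matches[:max_matches]
    document[target_field] = limited[0] if max_matches == 1 else limited
    return document


single = set_target_field({}, "host", [{"globalRank": 25}])
plural = set_target_field({}, "host", [{"globalRank": 25}], max_matches=3)
print(single)
print(plural)
```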

This PR adds the ability to run the enrich policy execution task in the background, returning a task id instead of waiting for the completed operation.

…es is 1.

elasticmachine · 2019-10-15T08:13:08Z

Pinging @elastic/es-core-features (:Core/Features/Ingest)

martijnvg and others added 30 commits April 5, 2019 09:32

first commit

e6bdfea

Merge remote-tracking branch 'es/7.x' into enrich-7.x

5a1d5cc

Fixed build issue after merge

5aecd9d

Merge remote-tracking branch 'es/7.x' into enrich-7.x

b66ad34

Added enrich policy definition. (#41003)

d01c1f3

Relates to #32789

Move the policy class to xpack core module. (#41311)

d99249c

This allows the transport client to use this class in enrich APIs. Relates to #40997

Merge remote-tracking branch 'upstream/7.x' into enrich-7.x

860e783

Merge remote-tracking branch 'upstream/7.x' into enrich-7.x

38e6dcd

Expose Engine.Searcher provider to ingest plugins. (#41010)

a61ec11

Relates to #32789

Add enrich policy PUT API (#41383)

fad45ea

This commit wires up the Rest calls and Transport calls for PUT enrich policy, as well as tests and rest spec additions.

Merge remote-tracking branch 'es/7.x' into enrich-7.x

eb9618f

fixed compile error after merging in the 7.x branch

57adee0

Merge remote-tracking branch 'es/7.x' into enrich-7.x

e429cd7

Add enrich policy list API (#41553)

2978ac3

This commit wires up the Rest calls and Transport calls for listing all enrich policies, as well as tests and rest spec additions.

Add enrich policy DELETE API (#41495)

5d53706

This commit wires up the Rest calls and Transport calls for DELETE enrich policy, as well as tests and rest spec additions.

Change policy runner to use helper method on EnrichPolicy instead of (#…

f366f56

…41839) its own helper method to determine alias / policy base name. This way both the enrich processor and policy runner use the same logic to determine the alias to use. Relates to #32789

Rename enrich policy index_pattern field to indices. (#41836)

d709b8b

Relates to #32789

Change the reindex fetch in policy runner from 1000 to 10000 and (#41838

1b00e7f

) Reindex uses scroll searches to read the source data. It is more efficient to read more data in one search scroll round than in several. I think 10000 is a good sweet spot. Relates to #32789

Merge remote-tracking branch 'upstream/7.x' into enrich-7.x

202a840

Merge remote-tracking branch 'es/7.x' into enrich-7.x

44f09a9

Enrich store should only update the policies via an update task. (#41944

299ff70

)

Tidy up EnrichPolicy class (#41877)

2daf568

Merge remote-tracking branch 'es/7.x' into enrich-7.x

855f5cc

Merge branch '7.x' into enrich-7.x

323251c

Remove schedule field from EnrichPolicy (#42143)

9e514cb

martijnvg and others added 20 commits October 7, 2019 10:07

Merge remote-tracking branch 'es/7.x' into enrich-7.x

f2f2304

Don't remove indices to avoid monitoring from intermittently failing

8b7100e

to index monitoring docs.

Add retry to force merge operation in EnrichPolicyRunner (#47178)

b9fb354

Adds a check when running an Enrich policy to make sure that an Enrich index is force merged down to one segment, and if it was not fully merged, attempts the merge again, up to a configurable number of times.

Merge remote-tracking branch 'es/7.x' into enrich-7.x

da1e2ea

required change after merging in 7 dot x branch

be0e177

[DOCS] Add docs for geo_match enrich policy type (#47745)

65f8294

Add HLRC support for enrich stats API (#47306)

aace42d

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

Merge remote-tracking branch 'es/7.x' into enrich-7.x

102016d

Merge remote-tracking branch 'es/7.x' into enrich-7.x

d4901a7

Add HLRC support for enrich execute policy API (#47991)

7cc73f6

This PR also includes HLRC docs for the enrich stats api. Relates to #32789

Add wait for completion for Enrich policy execution (#47886)

18d7e32

This PR adds the ability to run the enrich policy execution task in the background, returning a task id instead of waiting for the completed operation.

Fixed test, take into account that Map can be the result if max_match…

c4b1a30

…es is 1.

remove eclipse conditional

51c33f3

Merge remote-tracking branch 'es/7.x' into enrich-7.x

cc4b6c4

adjusted minimal supported version

77164e9

martijnvg added the discuss, :Data Management/Ingest Node, backport, and v7.5.0 labels Oct 15, 2019

fixed invalid reference

31e41d4

martijnvg removed the discuss label Oct 15, 2019

martijnvg merged commit 31e41d4 into 7.x Oct 15, 2019

martijnvg deleted the enrich-7.x branch October 16, 2019 07:34
