Backport: Add enrich processor #48040

Merged
merged 162 commits into from Oct 15, 2019

@martijnvg martijnvg commented Oct 15, 2019

This is a backport of #48039

Added a discuss label, because we still need to think about whether the enrich processor should be part of 7.5 or 7.6.

This PR adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices.

Besides a new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.

martijnvg and others added 30 commits April 5, 2019 09:32
This allows the transport client to use this class in enrich APIs.

Relates to #40997
There is no need to create an enrich store component for the transport
layer since the inner components of the store are either present in the
master node calls or via an already injected ClusterService. This commit
cleans up the class, adds the forthcoming delete call and tests the new
code.
This commit wires up the Rest calls and Transport calls for PUT enrich
policy, as well as tests and rest spec additions.
Move the put policy API YAML test to this rest module.

The main benefit is that all tests will then be run when running:
`./gradlew -p x-pack/plugin/enrich check`

The rest qa module starts a node with default distribution and basic
license.

This qa module will also be used for adding different rest tests (not yaml),
for example rest tests needed for #41532

When we work on security integration, we can add a security qa module
under the qa folder. At some point we should also add a multi-node
qa module.
The enrich processor performs a lookup in a locally allocated
enrich index shard using a field value from the document being enriched.
If there is a match then the _source of the enrich document is fetched.
The document being enriched then gets the decorate values from the
enrich document based on the configured decorate fields in the pipeline.

Note that the usage of the _source field is temporary until the enrich
source field that is part of #41521 is merged into the enrich branch.
Using the _source field involves significant decompression, which is not
desired for enrich use cases.

The policy specifies which field in the enrich index to query and
which fields are available to decorate a document being enriched
with.

The enrich processor has the following configuration options:
* `policy_name` - the name of the policy this processor should use
* `enrich_key` - the field in the document being enriched that holds the lookup value
* `ignore_missing` - Whether to allow the key field to be missing
* `enrich_values` - a list of fields to decorate the document being enriched with.
                    Each entry holds a source field and a target field.
                    The source field indicates what decorate field to use that is available in the policy.
                    The target field controls the field name to use in the document being enriched.
                    The source and target fields can be the same.

Example pipeline config:

```
{
   "processors": [
      {
         "enrich": {
            "policy_name": "my_policy",
            "enrich_key": "host_name",
            "enrich_values": [
               {
                  "source": "globalRank",
                  "target": "global_rank"
               }
            ]
         }
      }
   ]
}
```

In the above example, documents are enriched with a global rank value.
Each document that has a match in the enrich index based on its host_name field
gets a global rank value, which is fetched from the `globalRank` field in the
enrich index and saved as `global_rank` in the document being enriched.
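
The lookup and decoration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the enrich index is modelled as a plain dict (`ENRICH_INDEX`, a hypothetical name), and the `enrich` function is not the processor's actual implementation. The option names mirror the configuration options listed above.

```python
# Hypothetical in-memory stand-in for an enrich index, keyed by the value of
# the enrich key. The real enrich index is a locally allocated shard.
ENRICH_INDEX = {
    "elastic.co": {"globalRank": 25, "tld": "co"},
}


def enrich(document, enrich_key, enrich_values, ignore_missing=False):
    """Return a copy of `document` decorated with values from ENRICH_INDEX."""
    if enrich_key not in document:
        if ignore_missing:
            return dict(document)
        raise KeyError(f"field [{enrich_key}] is missing")

    enriched = dict(document)
    match = ENRICH_INDEX.get(document[enrich_key])
    if match is None:
        return enriched  # no match: the document passes through unchanged

    # Copy each configured source field into its target field.
    for entry in enrich_values:
        if entry["source"] in match:
            enriched[entry["target"]] = match[entry["source"]]
    return enriched


doc = enrich(
    {"host_name": "elastic.co"},
    enrich_key="host_name",
    enrich_values=[{"source": "globalRank", "target": "global_rank"}],
)
```

Only the fields listed in `enrich_values` are copied, so `tld` in the sample enrich document is left out of the result.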

This PR is part one of #41521
This commit wires up the Rest calls and Transport calls for listing all
enrich policies, as well as tests and rest spec additions.
This commit wires up the Rest calls and Transport calls for DELETE enrich
policy, as well as tests and rest spec additions.
Backports #41088

Adds the foundation of the execution logic to execute an enrich policy. Validates
the source index existence as well as mappings, creates a new enrich index for
the policy, reindexes the source index into the new enrich index, and swaps the
enrich alias for the policy to the new index.
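
A rough sketch of those four steps, assuming a toy in-memory cluster: the `cluster` dict layout, the `run_policy` function, the `generation` parameter and the index naming scheme are all hypothetical and only illustrate the order of operations, not the policy runner itself.

```python
def run_policy(cluster, policy_name, policy, generation):
    """Validate the source, build a new enrich index, then swap the alias."""
    source = policy["source_index"]

    # 1. Validate that the source index exists and maps the enrich key.
    if source not in cluster["indices"]:
        raise ValueError(f"source index [{source}] does not exist")
    if policy["enrich_key"] not in cluster["mappings"].get(source, set()):
        raise ValueError(f"enrich key [{policy['enrich_key']}] is not mapped")

    # 2. Create a new enrich index for this run (naming is an assumption).
    alias = f".enrich-{policy_name}"
    new_index = f"{alias}-{generation}"

    # 3. "Reindex" by copying the source documents into the new index.
    cluster["indices"][new_index] = [dict(d) for d in cluster["indices"][source]]

    # 4. Point the policy's alias at the new index.
    cluster["aliases"][alias] = new_index
    return new_index


cluster = {
    "indices": {"hosts": [{"host_name": "elastic.co", "globalRank": 25}]},
    "mappings": {"hosts": {"host_name", "globalRank"}},
    "aliases": {},
}
policy = {"source_index": "hosts", "enrich_key": "host_name"}
first = run_policy(cluster, "my_policy", policy, generation=1)
second = run_policy(cluster, "my_policy", policy, generation=2)
```

Because the alias is only moved after the new index is fully populated, readers of the alias never see a half-built index in this model.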
…41839)

its own helper method to determine alias / policy base name.

This way both the enrich processor and policy runner use the same logic
to determine the alias to use.

Relates to #32789
)

Reindex uses scroll searches to read the source data. It is more efficient
to read more data in one search scroll round than in several. I think 10000
is a good sweet spot.

Relates to #32789
The enrich key field is tracked in the _meta field by the policy runner.
The ingest processor uses the field name defined in the enrich index's _meta field
and not the one in the policy. This avoids problems if the policy is changed
without a new enrich index being created.

This also completely decouples EnrichPolicy from ExactMatchProcessor.

The following scenario results in failure without this change:
1) Create policy
2) Execute policy
3) Create pipeline with enrich processor
4) Use pipeline
5) Update enrich key in policy
6) Use pipeline, which then fails.
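
The scenario can be modelled with a small sketch. The dict layout and the `lookup_field` helper are hypothetical; the point is only that the processor reads the key recorded with the enrich index at execution time, so a later policy edit cannot desynchronize the two.

```python
def lookup_field(enrich_index):
    """Read the enrich key from the index metadata, not from the policy."""
    return enrich_index["_meta"]["enrich_key"]


# Steps 1 and 2: create and execute the policy. The runner records the key
# it used in the enrich index's _meta (layout here is an assumption).
policy = {"enrich_key": "host_name"}
enrich_index = {
    "_meta": {"enrich_key": policy["enrich_key"]},
    "docs": [{"host_name": "elastic.co", "globalRank": 25}],
}

# Step 5: the policy is updated, but no new enrich index has been created.
policy["enrich_key"] = "domain"

# Step 6: the processor still queries the field the index was built with.
field = lookup_field(enrich_index)
matches = [d for d in enrich_index["docs"] if d.get(field) == "elastic.co"]
```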
martijnvg and others added 20 commits October 7, 2019 10:07
This commit introduces a geo-match enrich processor that looks up a specific
`geo_point` field in the enrich index for all entries that have a geo_shape match field
that meets some specific relation criteria with the input field.

For example, the enrich index may contain documents with zipcodes and their respective
geo_shape. Ingested documents with a geo_point field can then be enriched with the zipcode
of the shape that contains their point.
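
As a rough illustration of the zipcode example, the sketch below uses a plain ray-casting point-in-polygon test. This is a stand-in technique chosen for brevity, not how the geo-match processor or geo_shape queries are implemented; `point_in_polygon`, `zipcode_for` and the sample shapes are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count edge crossings of a ray going right from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def zipcode_for(point, shapes):
    """Return the first zipcode whose shape contains the point, else None."""
    lon, lat = point
    for zipcode, polygon in shapes.items():
        if point_in_polygon(lon, lat, polygon):
            return zipcode
    return None


# Two made-up square "zipcode" shapes as (lon, lat) vertex lists.
shapes = {
    "00001": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "00002": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
found = zipcode_for((1, 1), shapes)
```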

This commit also refactors some of the MatchProcessor by moving a lot of the shared code to
AbstractEnrichProcessor.

Closes #42639.
Adds a check when running an Enrich policy to make sure that an Enrich index
is force merged down to one segment, and if it was not fully merged, attempts
the merge again, up to a configurable number of times.
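
A minimal sketch of that retry loop, assuming the caller supplies two callables (one that triggers a force merge, one that reports the segment count). `ensure_single_segment`, `FakeIndex` and `max_attempts` are hypothetical names, and the fake index simply halves its segment count per merge.

```python
def ensure_single_segment(force_merge, count_segments, max_attempts):
    """Force merge until one segment remains, or give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        force_merge()
        if count_segments() <= 1:
            return attempt  # number of merge attempts that were needed
    raise RuntimeError(
        f"index still has {count_segments()} segments after {max_attempts} attempts"
    )


class FakeIndex:
    """Toy index whose segment count halves on every force merge."""

    def __init__(self, segments):
        self.segments = segments

    def force_merge(self):
        self.segments = max(1, self.segments // 2)

    def count_segments(self):
        return self.segments


index = FakeIndex(segments=4)
attempts = ensure_single_segment(index.force_merge, index.count_segments, max_attempts=3)
```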
…#47359)

The current shard selection logic selects a random shard copy. The
desired behaviour for enrich is to select the local shard copy and to
fall back to a random shard copy only if no local copy is
available.

By reusing `OperationRouting#searchShards(...)` we get the desired
behaviour and reuse the same logic that the search api is using.
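
The desired preference order can be sketched as follows. This is not the `OperationRouting` logic itself, only an illustration of "local copy first, random copy otherwise"; the `select_shard_copy` function and the dict shape of a shard copy are assumptions.

```python
import random


def select_shard_copy(copies, local_node_id, rng=random):
    """Prefer a shard copy on the local node; otherwise pick one at random."""
    local = [copy for copy in copies if copy["node"] == local_node_id]
    if local:
        return local[0]
    return rng.choice(copies)


copies = [
    {"node": "node-a", "primary": True},
    {"node": "node-b", "primary": False},
]
chosen_local = select_shard_copy(copies, local_node_id="node-b")
chosen_remote = select_shard_copy(copies, local_node_id="node-c")
```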
)

Currently, if the field in the document being ingested holds a value
other than a string, the processor fails with an error.

This commit changes the match processor to handle number values
and array values correctly.

If a json array is detected then the `terms` query is used instead
of the `term` query.
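
The choice between the two query types might look like the sketch below, which builds the query body as a plain dict. `build_match_query` is a hypothetical helper, not the processor's code; the `term` and `terms` bodies follow the usual query DSL shapes.

```python
def build_match_query(field, value):
    """Use `terms` for array values and `term` for single values."""
    if isinstance(value, (list, tuple)):
        return {"terms": {field: list(value)}}
    # Strings and numbers are passed through as a single term.
    return {"term": {field: value}}


single = build_match_query("host_name", "elastic.co")
number = build_match_query("port", 9200)
many = build_match_query("host_name", ["elastic.co", "example.org"])
```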
This PR also includes HLRC docs for the enrich stats api.

Relates to #32789
Changes the execution logic to create a new task using the execute request,
and attaches the new task to the policy runner to be updated. Also, a new
response is now returned from the execute api, which contains either the task
id of the execution, or the completed status of the run. The fields are mutually
exclusive to make it easier to discern what type of response it is.
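
One way to picture the mutually exclusive fields is a small value class that rejects responses carrying both or neither. The class and field names (`ExecuteResponse`, `task_id`, `status`) are hypothetical and do not mirror the actual response class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecuteResponse:
    """Holds either a task id (background run) or a completed status."""

    task_id: Optional[str] = None
    status: Optional[str] = None

    def __post_init__(self):
        # Exactly one of the two fields must be set.
        if (self.task_id is None) == (self.status is None):
            raise ValueError("exactly one of task_id or status must be set")

    @property
    def is_background(self):
        return self.task_id is not None


background = ExecuteResponse(task_id="node-1:42")
completed = ExecuteResponse(status="COMPLETE")
```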
This PR also includes HLRC docs for the enrich stats api.

Relates to #32789
Prior to this change, the `target_field` would always be a json array
field in the document being ingested. This was to take into account that
multiple enrich documents could be inserted into the `target_field`.

However, the default `max_matches` is `1`, meaning that by default
only a single enrich document would be added to the `target_field` json
array field.

This commit changes this: if `max_matches` is set to `1`, the single
enrich document is added as a json object to the `target_field`, and
if it is configured to a higher value, the enrich documents are
added as a json array (even if only a single enrich document happens to
match).
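
The resulting shape of `target_field` can be sketched as below. `set_target_field` is a hypothetical helper and the documents are plain dicts; only the object-versus-array rule is taken from the description above.

```python
def set_target_field(document, target_field, matches, max_matches=1):
    """Store matches as an object when max_matches is 1, else as an array."""
    if not matches:
        return document  # nothing matched: leave the document untouched
    limited = matches[:max_matches]
    if max_matches == 1:
        document[target_field] = limited[0]
    else:
        # An array is used even when only one document matched.
        document[target_field] = limited
    return document


match = {"globalRank": 25}
as_object = set_target_field({}, "enriched", [match], max_matches=1)
as_array = set_target_field({}, "enriched", [match], max_matches=5)
```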
This PR adds the ability to run the enrich policy execution task in the background,
returning a task id instead of waiting for the completed operation.
@martijnvg martijnvg added discuss :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP backport v7.5.0 labels Oct 15, 2019
@elasticmachine

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@martijnvg martijnvg removed the discuss label Oct 15, 2019
@martijnvg martijnvg merged commit 31e41d4 into 7.x Oct 15, 2019
martijnvg added a commit that referenced this pull request Oct 15, 2019
which is a backport merge and adds a new ingest processor, named the enrich processor,
that allows documents being ingested to be enriched with data from other indices.

Besides a new enrich processor, this PR adds several APIs to manage an enrich policy.
An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.

Related to #32789
@martijnvg martijnvg deleted the enrich-7.x branch October 16, 2019 07:34
Labels
backport :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.5.0

6 participants