[ingest] Enrich documents prior to indexing #32789
Comments
Closing as better alternatives for these use cases have been discussed.
Re-opening per further discussion.
Pinging @elastic/es-core-features
@jakelandis When I closed #20340 I started to work on a JDBC ingest plugin which basically did lookups against a third-party database. I designed it to rely heavily on caching so that lookups run as fast as possible against local data. Two strategies at the time:
Of course with cache eviction and memory usage protection (i.e. don't load more than x kb/mb of data...). Is that one of the things you have in mind?
This would be beneficial for doing real-time lookups within Elasticsearch.
Relates to elastic#32789
The enrich processor takes a field value from the document being enriched and uses it to do a lookup in the locally allocated enrich index shard. If there is a match, it retrieves the source of the enrich document from the enrich source field, which is a special binary doc values field. The document being enriched then gets values from the enrich document based on the configured decorate fields. The policy specifies which field in the enrich index to query and which fields are available to decorate a document with.
The enrich processor has the following configuration options:
* `policy_name` - the name of the policy this processor should use
* `enrich_key_field` - the field in the document being enriched that holds the lookup value
* `enrich_key_field_ignore_missing` - whether to allow the key field to be missing
* `enrich_values` - a list of fields to decorate the document being enriched with. Each entry holds a source field and a target field. The source field indicates which decorate field available in the policy to use. The target field controls the field name to use in the document being enriched. The source and target fields can be the same.
Example pipeline config:
```
{
  "processors": [
    {
      "policy_name": "my_policy",
      "key": "host_name",
      "values": [
        {
          "source": "globalRank",
          "target": "global_rank"
        }
      ]
    }
  ]
}
```
In the above example, documents are enriched with a global rank value. Each document that has a match in the enrich index based on its `host_name` field gets a global rank value, which is fetched from the `globalRank` field in the enrich index and saved as `global_rank` in the document being enriched.
The enrich source field mapper is an internal field mapper meant to be used by enrich exclusively. Relates to elastic#32789
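The lookup-and-decorate flow described above can be sketched in plain Java. This is only an illustration under simplifying assumptions: the enrich index shard is replaced by an in-memory `Map`, documents are plain maps, and the class name `EnrichLookupSketch` and its constructor parameters are hypothetical rather than part of the enrich plugin.

```java
import java.util.Map;

/** Illustrative stand-in for the enrich processor; not the Elasticsearch implementation. */
final class EnrichLookupSketch {
    // Assumption: the enrich index is modeled as "key value -> enrich document source".
    private final Map<String, Map<String, Object>> enrichIndex;
    private final String keyField;            // field in the incoming document holding the lookup value
    private final boolean ignoreMissing;      // whether a missing key field is tolerated
    private final Map<String, String> values; // source field (enrich doc) -> target field (incoming doc)

    EnrichLookupSketch(Map<String, Map<String, Object>> enrichIndex, String keyField,
                       boolean ignoreMissing, Map<String, String> values) {
        this.enrichIndex = enrichIndex;
        this.keyField = keyField;
        this.ignoreMissing = ignoreMissing;
        this.values = values;
    }

    /** Decorates the document in place and returns it. */
    Map<String, Object> enrich(Map<String, Object> document) {
        Object key = document.get(keyField);
        if (key == null) {
            if (ignoreMissing) {
                return document;
            }
            throw new IllegalArgumentException("field [" + keyField + "] is missing");
        }
        Map<String, Object> match = enrichIndex.get(key.toString());
        if (match == null) {
            return document; // no match in the enrich index: leave the document unchanged
        }
        for (Map.Entry<String, String> entry : values.entrySet()) {
            Object value = match.get(entry.getKey());
            if (value != null) {
                document.put(entry.getValue(), value); // copy source field into target field
            }
        }
        return document;
    }
}
```

With an enrich entry keyed by a host name and a single `globalRank` to `global_rank` mapping, a document whose `host_name` matches ends up with a `global_rank` field, mirroring the pipeline example.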
Changed the signature of AbstractResponseTestCase#createServerTestInstance(...) to include the randomly selected xcontent type. This is needed for creating a server response instance with a query that is represented as a BytesReference. Maybe this should go into a different change? This PR also includes HLRC docs for the get policy API. Relates to #32789
If an ingest processor factory relies on other configuration in the cluster state in order to construct a processor instance, it is currently undetermined whether the processor factory is notified about a change when multiple cluster state updates are bundled together and the processor implements the `ClusterStateApplier` interface (`IngestService` implements this interface too). The idea with the ingest cluster state listener is that it is guaranteed to update the processor factory first, before the ingest service creates a pipeline with its respective processor instances. Currently this concept is used in the enrich branch: https://github.com/elastic/elasticsearch/blob/enrich/x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichProcessorFactory.java#L21 In this case the processor factory is interested in the enrich indices' `_meta` mapping fields. This is the third PR that merges changes made to the server module from the enrich branch (see elastic#32789) into the master branch. Changes to the server module are merged separately from the PR that will merge enrich into master, so that these changes can be reviewed in isolation.
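The ordering guarantee described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the server code: the cluster state is reduced to a map from policy name to match field, a "processor" is just a descriptive string, and the class names `PolicyAwareFactorySketch` and `IngestServiceSketch` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical factory that needs policy metadata from the cluster state to build processors. */
final class PolicyAwareFactorySketch implements Consumer<Map<String, String>> {
    private Map<String, String> policies = new HashMap<>(); // policy name -> match field

    @Override
    public void accept(Map<String, String> clusterState) {
        policies = new HashMap<>(clusterState); // refresh the factory's view of the policies
    }

    String create(String policyName) {
        String matchField = policies.get(policyName);
        if (matchField == null) {
            throw new IllegalArgumentException("policy [" + policyName + "] does not exist");
        }
        return "enrich[" + policyName + " on " + matchField + "]";
    }
}

/** Simplified ingest service: listeners always run before pipelines are rebuilt. */
final class IngestServiceSketch {
    private final List<Consumer<Map<String, String>>> listeners = new ArrayList<>();
    private final PolicyAwareFactorySketch factory;
    private final List<String> pipeline = new ArrayList<>();

    IngestServiceSketch(PolicyAwareFactorySketch factory) {
        this.factory = factory;
        listeners.add(factory); // the factory registers itself as an ingest cluster state listener
    }

    void applyClusterState(Map<String, String> clusterState, List<String> pipelinePolicies) {
        // Step 1: update every listener, so factories see the new state first.
        for (Consumer<Map<String, String>> listener : listeners) {
            listener.accept(clusterState);
        }
        // Step 2: only now rebuild the pipeline using the refreshed factory.
        pipeline.clear();
        for (String policyName : pipelinePolicies) {
            pipeline.add(factory.create(policyName));
        }
    }

    List<String> pipeline() {
        return new ArrayList<>(pipeline);
    }
}
```

Because both steps happen inside one `applyClusterState` call, a single bundled update that adds a policy and a pipeline referencing it still succeeds: the factory already knows the policy when the pipeline is built.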
This change also slightly modifies the stats response, so that it can be consumed more easily by monitoring and other users (coordinator stats are now in a list instead of a map, and each entry has an additional field for the node id). Relates to elastic#32789
This PR changes ingest execution to be non-blocking by adding an additional method to the Processor interface that accepts a BiConsumer as handler, and by changing IngestService#executeBulkRequest(...) to ingest documents in a non-blocking fashion iff a processor executes in a non-blocking fashion. This is the second PR that merges changes made to the server module from the enrich branch (see #32789) into the master branch. The plan is to merge changes made to the server module separately from the PR that will merge enrich into master, so that these changes can be reviewed in isolation. This change originates from the enrich branch and was introduced there in #43361.
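A minimal sketch of the handler-based execution described above follows. It is a simplification under stated assumptions: documents are plain maps rather than ingest documents, and `ProcessorSketch` and `PipelineSketch` are hypothetical names, not the server classes. The default method lets blocking processors keep working, while a processor that needs asynchronous I/O could override it and call the handler later.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Simplified processor contract with a blocking and a handler-based variant. */
interface ProcessorSketch {
    /** Blocking variant: returns the (possibly modified) document. */
    Map<String, Object> execute(Map<String, Object> document) throws Exception;

    /** Handler-based variant; by default it delegates to the blocking method. */
    default void execute(Map<String, Object> document,
                         BiConsumer<Map<String, Object>, Exception> handler) {
        Map<String, Object> result;
        try {
            result = execute(document);
        } catch (Exception e) {
            handler.accept(null, e); // report the failure instead of throwing
            return;
        }
        handler.accept(result, null);
    }
}

/** Runs processors one after another, continuing from inside each handler. */
final class PipelineSketch {
    static void run(List<ProcessorSketch> processors, int index, Map<String, Object> document,
                    BiConsumer<Map<String, Object>, Exception> handler) {
        if (index == processors.size()) {
            handler.accept(document, null); // all processors completed
            return;
        }
        processors.get(index).execute(document, (result, e) -> {
            if (e != null) {
                handler.accept(null, e); // stop at the first failure
            } else {
                run(processors, index + 1, result, handler);
            }
        });
    }
}
```

The calling thread never waits on a processor: each step resumes the chain from its own callback, which is what allows a lookup-style processor to complete on another thread.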
Backport of #46241, the non-blocking ingest execution change described above. Relates to #32789.
This PR also includes HLRC docs for the enrich stats api. Relates to elastic#32789
which adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices. Besides the new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner. Closes #32789
which is a backport merge and adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices. Besides the new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner. Related to #32789
Enrichment at ingest
This issue describes a project that will leverage the ingest node to allow for enrichment of documents before they are indexed.
Below is a diagram that highlights the workflow. The red parts are new components.
- `.enrich-*` - index(es) managed by Elasticsearch that contain a highly optimized subset of the source data used for enrichment.
- `enrich policy` - a policy that describes how to synchronize the source index with the `.enrich-*` index. The policy will describe which fields to copy and how often to copy the fields.
- `decorate processor` - an ingest node processor that reads from a `.enrich-*` index to mutate the raw data before it is indexed. The `.enrich-*` index will be data local to the `decorate processor`.
There are many moving parts so this issue will serve as a central place to track them.
Tasks
Enrich policy definition
- `enrich policy` (@martijnvg) Added enrich policy definition. #41003
- Rename `enrich_key` to `match_field` and `enrich_values` to `enrich_fields`.
- `type` field and make the type a top level json object that contains all the configuration of an enrich policy. Change how type is stored in an enrich policy. #45789
instead of:
Enrich processor
- (Currently if multiple CS updates are combined then enrich policy changes may not be visible)
- `IngestService` to register components that are updated before the processor factories.
- `EnrichProcessorFactory` as component that keeps track of the policies.
- Rename `enrich_key` option to `field` in enrich processor configuration. Enrich processor configuration changes #45466
- `set_from` and `targets` options and introduce `target_field` option that is in line with what the `geoip` processor is doing. The entire looked-up document is placed as a json object under the `target_field`. Enrich processor configuration changes #45466
- `EnrichPolicy` instance. Just on the policy name. From the policy name, the enrich index alias can be resolved, and from that the currently active enrich index. The enrich index should have the `match_field` of the policy stored in the meta mapping; this is the only piece of information required to do the enrichment at ingest time. Decouple enrich processor factory from enrich policy #45826
Policy management
- (add created version to EnrichPolicy?) (@jbaiera) Add the cluster version to enrich policies #45021
- The background process should mark indices for deletion first, and remove them in the next execution (to avoid deleting indices that have been freshly retired from the enrich alias and are still potentially in use). Also, the background process should not delete any indices that are tied to policies currently being executed - we don't want to throw out new indices that are currently being populated by a policy execution. (@jbaiera) Add Enrich index background task to cleanup old indices #43746
- field are not inside an array of objects (nested). (@jbaiera) Enrich validate nested mappings #42452
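The mark-then-delete behaviour of the background cleanup task described above can be sketched as follows. This is an illustrative model only: index names are plain strings, the sets of active indices and of indices belonging to executing policies are passed in by the caller, and the class name `EnrichIndexCleanupSketch` is hypothetical rather than taken from the enrich plugin.

```java
import java.util.HashSet;
import java.util.Set;

/** Two-phase cleanup: an index is marked on one run and deleted on a later run. */
final class EnrichIndexCleanupSketch {
    private final Set<String> markedForDeletion = new HashSet<>();

    /**
     * One execution of the background task.
     *
     * @param allEnrichIndices      every enrich index that currently exists
     * @param activeIndices         indices the enrich aliases currently point to
     * @param executingPolicyIndices indices being populated by running policy executions
     * @return the indices that should be deleted in this run
     */
    Set<String> run(Set<String> allEnrichIndices, Set<String> activeIndices,
                    Set<String> executingPolicyIndices) {
        Set<String> toDelete = new HashSet<>();
        Set<String> nextMarked = new HashSet<>();
        for (String index : allEnrichIndices) {
            if (activeIndices.contains(index) || executingPolicyIndices.contains(index)) {
                continue; // still in use or still being built: never touch, and drop any old mark
            }
            if (markedForDeletion.contains(index)) {
                toDelete.add(index);   // was already marked in a previous run: safe to delete now
            } else {
                nextMarked.add(index); // first time seen as unused: only mark it
            }
        }
        markedForDeletion.clear();
        markedForDeletion.addAll(nextMarked);
        return toDelete;
    }
}
```

The delay of one run gives in-flight ingest requests time to finish with an index that was just retired from the alias, and the check against executing policies keeps freshly created indices from being removed before they are promoted.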
APIs
- `MetaDataCreateIndexService#validateIndexOrAliasName`) (@martijnvg)
- `GET _enrich/policy/users-policy` (specific policy) and `GET _enrich/policy` (all policies). Both variants should always return a list of objects. And later also support: `GET _enrich/policy/users-*` and `GET _enrich/policy/users-policy,users2-policy`. (@hub-cap) Consolidate enrich list all and get by name APIs #45705
- `enrich policy` (@hub-cap) _enrich/policy/name
- `.enrich-policies`?) instead of in the cluster state. (@hub-cap) Use an index to store enrich policies #47475
Misc
- update Kibana roles for new role, to be done after the feature branch is merged to master (obsoleted by Role Management - use ES Builtin Privilege API to drive list of privileges kibana#40270)
EDITS: