Add only enrich_fields to documents when using Enrich Processors #74217

Open
ajosh0504 opened this issue Jun 16, 2021 · 3 comments
Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >enhancement, Team:Data Management (Meta label for data/management team)

Comments

@ajosh0504

As the enrich processor works today, the match_field is added to input documents along with the enrich_fields. Per the docs, only the enrich_fields should be added to input documents as enrichments.

Below are two cases for an example enrich policy, showing the effect of adding the match_field to the enriched output:

  • The field corresponding to match_field in the enrich policy appears in both the input documents and the enrich documents
  • The field corresponding to match_field in the enrich policy is not present in the input document

For example, for an enrich policy set up as follows:

{
  "match": {
    "indices": "test-*",
    "match_field": "full_url",
    "enrich_fields": ["label"]
  }
}
  • Case 1: The field corresponding to match_field in the enrich policy appears in both the input documents and the enrich documents
    Sample input document:
{
  "full_url": "www.google.com"
}

Corresponding enrich processor:

{
  "processors" : [
    {
      "enrich" : {
        "policy_name": "test-enrich-policy",
        "field" : "full_url",
        "target_field": "enrichments",
        "max_matches": "1"
      }
    }
  ]
}

Resulting enriched document:

{
  "_index": "test",
  "_type": "_doc",
  "_id": "80yIFnoBIIJ28p1MmMF-",
  "_score": 1,
  "_source": {
    "enrichments": {
      "label": 0,
      "full_url": "www.google.com"
    }
  }
}

Observation: The match_field is merged with the input field and appears alongside the enrich_fields under the target_field in the enriched document. This is not ideal for users who want the original fields in their input documents left untouched.

  • Case 2: The field corresponding to match_field in the enrich policy is not present in the input document
    Sample input document:
{
  "original_url": "www.google.com"
}

Corresponding enrich processor:

{
  "processors" : [
    {
      "enrich" : {
        "policy_name": "test-enrich-policy",
        "field" : "original_url",
        "target_field": "enrichments",
        "max_matches": "1"
      }
    }
  ]
}

Resulting enriched document:

{
  "_index": "test",
  "_type": "_doc",
  "_id": "K0x5FXoBIIJ28p1MscE6",
  "_score": 1,
  "_source": {
    "original_url": "www.google.com",
    "enrichments": {
      "label": 0,
      "full_url": "www.google.com"
    }
  }
}

Observation: The match_field appears alongside the enrich_fields under the target_field in the enriched document. Adding the match_field to enriched documents is unnecessary; by definition, the enrich_fields are the only fields that input documents should be enriched with.
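
To make the difference between the current and the requested behaviour concrete, here is a toy Python model of Case 2. It is only a sketch and not Elasticsearch code: the `enrich` function, its parameters, and the `include_match_field` switch are hypothetical names invented for illustration, and the enrich index is modelled as a plain list of dicts.

```python
# Toy model of the behaviour described in this issue; not Elasticsearch code.
# `enrich` and all of its parameters are hypothetical names for illustration.

def enrich(doc, enrich_index, field, target_field, match_field, enrich_fields,
           include_match_field=True):
    """Return a copy of `doc` with the first matching enrich document attached."""
    result = dict(doc)
    for enrich_doc in enrich_index:
        if enrich_doc.get(match_field) == doc.get(field):
            keys = list(enrich_fields)
            if include_match_field:
                keys.append(match_field)  # the behaviour reported above
            result[target_field] = {k: enrich_doc[k] for k in keys if k in enrich_doc}
            break  # models max_matches = 1
    return result


# Enrich index contents assumed from the policy and the Case 2 output above.
enrich_index = [{"full_url": "www.google.com", "label": 0}]
doc = {"original_url": "www.google.com"}

current = enrich(doc, enrich_index, "original_url", "enrichments",
                 "full_url", ["label"])
requested = enrich(doc, enrich_index, "original_url", "enrichments",
                   "full_url", ["label"], include_match_field=False)

print(current)    # enrichments contains both label and full_url
print(requested)  # enrichments contains only label
```

With `include_match_field=True` the model reproduces the Case 2 output; with `False` it produces what this issue asks for, where only the enrich_fields end up under the target_field.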

cc: @martijnvg

@ajosh0504 added the >enhancement, :Core/Infra/REST API, and Team:Core/Infra labels Jun 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ajosh0504 changed the title from "Include only the enrich_fields in documents when using Enrich Processors" to "Add only enrich_fields to documents when using Enrich Processors" Jun 16, 2021
@martijnvg
Member

Thanks for opening this issue @ajosh0504.

Part of the issue is that the match field is currently copied from the source index into the enrich index so that it can be queried. Either the enrich processor should exclude the match field at enrich time, or the enrich policy should exclude the match field from the _source only when it creates the enrich index (by using something like _source exclude).
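
As a rough illustration of the two options, here is a toy Python model. It is a sketch under stated assumptions and not how Elasticsearch implements enrich indices: `build_enrich_index`, `lookup`, and their parameters are hypothetical names, and each index entry is modelled as a searchable key plus a stored source that is returned on a match.

```python
# Toy model of the two options; not Elasticsearch internals.
# All function and parameter names here are hypothetical.

def build_enrich_index(source_docs, match_field, enrich_fields,
                       exclude_match_from_source=False):
    """Build entries holding a searchable key and the stored source for a match."""
    entries = []
    for doc in source_docs:
        if match_field not in doc:
            continue  # nothing to match on, so the document is skipped
        stored = {k: doc[k] for k in enrich_fields if k in doc}
        if not exclude_match_from_source:
            stored[match_field] = doc[match_field]  # current behaviour
        # The key stays searchable even when it is left out of the stored source.
        entries.append({"key": doc[match_field], "source": stored})
    return entries


def lookup(entries, value, drop_field=None):
    """Return the stored source of the first match, optionally dropping one field."""
    for entry in entries:
        if entry["key"] == value:
            return {k: v for k, v in entry["source"].items() if k != drop_field}
    return None


source_docs = [{"full_url": "www.google.com", "label": 0}]

# Option 1: keep the index as it is and drop the match field at enrich time.
option_1 = lookup(build_enrich_index(source_docs, "full_url", ["label"]),
                  "www.google.com", drop_field="full_url")

# Option 2: leave the match field out of the stored source at index build time.
option_2 = lookup(build_enrich_index(source_docs, "full_url", ["label"],
                                     exclude_match_from_source=True),
                  "www.google.com")

print(option_1, option_2)
```

In this model both options return only the enrich_fields. The model does not handle a policy that lists the match field under enrich_fields explicitly; a real fix would presumably need to keep the field in that case.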

@martijnvg added the :Data Management/Ingest Node label and removed the :Core/Infra/REST API label Jun 17, 2021
@elasticmachine added the Team:Data Management label and removed the Team:Core/Infra label Jun 17, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)
