Inclusion of match_field in enriched documents #49251

Open
bcamper opened this issue Nov 18, 2019 · 4 comments
Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
Team:Data Management (Meta label for data/management team)


bcamper commented Nov 18, 2019

Elasticsearch version: 7.5.0

An enrich policy includes both a match_field (to match documents against for potential enrichment) and a list of enrich_fields to be appended to any matching documents. Currently, the match_field is ALSO appended to enriched docs, even if it doesn't appear in enrich_fields.

This may not be desirable in all cases, e.g. for the geo_match processor, you may be matching points against very detailed shapes like administrative area polygons, and probably don't need to copy the entire boundary of a country for every street address inside of it.
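To make the reported behavior concrete, here is a toy model in Python. It is not Elasticsearch code; it only mirrors what this report describes, and the policy contents (`admin_areas`, `boundary`, `country_name`, `iso_code`) are made-up names for illustration.

```python
# Toy model of the behavior described in this issue; NOT Elasticsearch code.
# The index and field names below are hypothetical.
policy = {
    "geo_match": {
        "indices": "admin_areas",
        "match_field": "boundary",  # detailed polygon, potentially very large
        "enrich_fields": ["country_name", "iso_code"],
    }
}


def fields_copied_to_document(policy_body):
    """Return the fields an enriched document receives, per the report above."""
    (settings,) = policy_body.values()  # the single key is the policy type
    copied = [settings["match_field"]]  # copied even when not requested
    for name in settings["enrich_fields"]:
        if name not in copied:
            copied.append(name)
    return copied


print(fields_copied_to_document(policy))
# prints ['boundary', 'country_name', 'iso_code']
```

So in this sketch every enriched street address would also carry the full `boundary` shape, which is the cost being questioned here.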

Is it worth considering an option for this behavior or rethinking the default (I was surprised to find the match_field included even though I didn't ask for it, but I assume it is intentional behavior)? As it stands today, @talevy points out that you can always add a remove processor to get rid of the copied shape field.
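A minimal sketch of that workaround, written as a Python helper that builds an ingest pipeline body. The helper name and the argument values are hypothetical. It assumes the default of a single match, so the copied match field sits at `<target_field>.<match_field>`; with multiple matches the target is an array and this path would need adjusting.

```python
import json


def enrich_pipeline_without_match_field(policy_name, source_field, target_field, match_field):
    """Build a pipeline body: enrich first, then drop the copied match field.

    Hypothetical helper. Assumes one match per document, so the copied
    field is found at "<target_field>.<match_field>".
    """
    return {
        "description": "Enrich, then remove the copied match_field",
        "processors": [
            {
                "enrich": {
                    "policy_name": policy_name,
                    "field": source_field,
                    "target_field": target_field,
                }
            },
            {"remove": {"field": f"{target_field}.{match_field}"}},
        ],
    }


# Example values are made up for illustration.
pipeline = enrich_pipeline_without_match_field(
    policy_name="admin_area_policy",
    source_field="location",
    target_field="admin_area",
    match_field="boundary",
)
print(json.dumps(pipeline, indent=2))
```

The printed body could then be sent as the payload of a create-pipeline request. Note that the shape is still copied during enrichment and only dropped afterwards, so this trims the stored document but not the work done per document.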

(Also see #49208 to clarify existing behavior.)

bcamper added the :Data Management/Ingest Node label Nov 18, 2019
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (:Core/Features/Ingest)

martijnvg (Member) commented

I think it is worth considering excluding the match_field. The reason it is included now is that it is easy to remove with a remove processor, and it is a field the enrich processor queries. But we can exclude it at policy execution time by configuring an exclude on the _source field mapping in the .enrich index, while still being able to query on it.

martijnvg (Member) commented

We have discussed this and agreed that it is unexpected that the match field is included in the document being ingested (since it is already there, otherwise the enrich doc couldn't be fetched).

We should introduce a flag that controls whether the match field is included and default it to true, so that we don't break backwards compatibility. In 8.0 we can change the default to false.

We should ensure that the match_field doesn't end up in the _source of documents in the .enrich index to begin with. We can do that with the include / exclude _source feature. The flag should then be an option in the enrich policy.
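A rough sketch of that idea in Python, under stated assumptions: the function name, the `include_match_field` parameter, and the example field are hypothetical, since no flag name has been decided in this thread, and the real .enrich index mapping may contain more than shown. The only part taken from existing functionality is the `_source` `excludes` mapping setting, which keeps a field indexed and queryable while leaving it out of the stored source.

```python
def enrich_index_mapping(match_field, match_field_type, include_match_field=True):
    """Sketch of a mapping for the .enrich index.

    `include_match_field` stands in for the proposed policy flag; its real
    name is not specified in this issue. It defaults to True so that the
    current behavior is preserved.
    """
    mapping = {
        "properties": {
            # The match field stays mapped, so it can still be queried.
            match_field: {"type": match_field_type},
        }
    }
    if not include_match_field:
        # Drop the match field from the stored _source only.
        mapping["_source"] = {"excludes": [match_field]}
    return mapping


# Current behavior: nothing is excluded from _source.
print(enrich_index_mapping("boundary", "geo_shape"))

# Proposed opt-out: still queryable, but not returned with the enrich doc.
print(enrich_index_mapping("boundary", "geo_shape", include_match_field=False))
```

Because the field never reaches `_source`, nothing has to be copied and then removed at ingest time, which is the advantage over the remove processor workaround.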

alexfrancoeur commented Feb 9, 2020

Just ran into this while trying to ingest 20 years' worth of docs, and it caught me by surprise: it significantly increases the index size, as you might imagine 😀 Great to hear that remove is a good workaround, but the example docs should at least be updated to reflect this (https://www.elastic.co/guide/en/elasticsearch/reference/7.5/geo-match-enrich-policy-type.html). If the match field is not listed in enrich_fields, it doesn't feel intuitive to include it.

rjernst added the Team:Data Management label May 4, 2020