
Fix missing support for setting document id in decoder_json pr… #15859

Merged

urso merged 3 commits into elastic:master from the metadata-id-compat branch on Jan 28, 2020

Conversation

@urso commented Jan 27, 2020

  • Breaking change
  • Enhancement

What does this PR do?

Update processors, output, and JSON parser to store the document ID in
@metadata._id.
Also add the missing document_id option to the decode_json_fields processor, giving
users the chance to set the document ID when the JSON document is
embedded in another JSON document.
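As an illustration of the intended flow, here is a minimal sketch, not the actual processor code; the helper name, error handling, and import paths are assumptions:

```go
package example

// Import paths follow the pre-7.6 source layout and may differ between
// Beats versions (assumption).
import (
	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// applyDocumentID is an illustrative helper, not the actual decode_json_fields
// implementation: it looks up the configured field in the decoded JSON,
// removes it from the document, and stores it via SetID, which after this PR
// writes the value to @metadata._id.
func applyDocumentID(event *beat.Event, decoded common.MapStr, documentIDField string) {
	if documentIDField == "" {
		return
	}
	raw, err := decoded.GetValue(documentIDField)
	if err != nil {
		return // field not present; leave the event untouched
	}
	if id, ok := raw.(string); ok && id != "" {
		_ = decoded.Delete(documentIDField) // the ID should not remain in the event body
		event.SetID(id)                     // consumed by the Elasticsearch output as _id
	}
}
```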

Why is it important?

  • This ensures better compatibility with existing Logstash inputs/filters that already use @metadata._id.
  • It fixes missing support for extracting document IDs via decode_json_fields.

About the breaking change: the document_id setting on the JSON decoder was introduced in 7.5, but the overall effort to support handling of event duplication was only finalized in 7.6. This means that the switch to @metadata._id is a breaking change. However, the feature was barely documented, and proper documentation on how to configure Beats + Elasticsearch for handling data duplication is planned for 7.6.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • [ ]

How to test this PR locally

  1. Create a file with a few custom JSON lines (4 or 5 events) containing a field that is supposed to act as the document ID. For example: {"myid": "id1", "log": "..."}
  2. Configure Filebeat to collect from said file and configure the decode_json_fields processor:
processors:
  - decode_json_fields:
      document_id: "myid"
      fields: ["message"]
      max_depth: 1
      target: ""
  3. Run Filebeat (publishing to Elasticsearch) with -d 'publish' and check in the debug logs that @metadata._id is set on the events to be published. The myid field should be removed from the event.
  4. Query the events from Elasticsearch and check that _id matches the original contents of myid (a helper sketch for this check follows below).
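For step 4, a small stdlib-only Go helper along these lines can dump the _id of each indexed event for comparison; the Elasticsearch host, index pattern, and result size are assumptions, so adjust them to your test setup:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// searchResponse models only the parts of the Elasticsearch _search reply
// needed for this check.
type searchResponse struct {
	Hits struct {
		Hits []struct {
			ID     string                 `json:"_id"`
			Source map[string]interface{} `json:"_source"`
		} `json:"hits"`
	} `json:"hits"`
}

func main() {
	// Assumes Elasticsearch on localhost:9200 and the default filebeat-* indices.
	resp, err := http.Get("http://localhost:9200/filebeat-*/_search?size=10")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var sr searchResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		log.Fatal(err)
	}
	for _, h := range sr.Hits.Hits {
		// With this PR applied, _id should equal the original myid value and
		// myid should no longer appear in _source.
		fmt.Printf("_id=%s  _source=%v\n", h.ID, h.Source)
	}
}
```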

Related issues

@urso requested a review from ycombinator January 27, 2020 13:35
@urso changed the title from "Change to metadata._id" to "Fix missing support for setting document id in decoder_json processor and switch to metadata._id" Jan 27, 2020
@urso added the needs_backport label (PR is waiting to be backported to other branches) Jan 27, 2020
@andresrc added the [zube]: Inbox, Team:Services (Deprecated) (label for the former Integrations-Services team), and [zube]: In Review labels and removed the [zube]: Inbox label Jan 27, 2020
@@ -51,7 +51,7 @@ func (e *Event) SetID(id string) {
	if e.Meta == nil {
		e.Meta = common.MapStr{}
	}
	e.Meta["id"] = id
Contributor

Given the special nature of this field name and the desire to keep it consistent in multiple places, do you think we should make it an exported const?

Author

We have more than one field that is special to meta. Let's clean these up (the other fields as well) in a follow up PR.
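Purely to illustrate the suggestion above, such an exported constant might look like this; the name is an assumption, and this is not code from the PR or the planned follow-up:

```go
package example

// MetadataFieldID would name the @metadata key the Elasticsearch output reads
// the document ID from, so SetID and the output share one definition instead
// of repeating the literal "_id".
const MetadataFieldID = "_id"

// Usage inside SetID would then be: e.Meta[MetadataFieldID] = id
```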

@ycombinator (Contributor)

Should we add a CHANGELOG entry, maybe especially since it's technically a breaking change?

@urso (Author) commented Jan 27, 2020

> Should we add a CHANGELOG entry, maybe especially since it's technically a breaking change?

Oops, added changelog.

@ycombinator (Contributor) left a comment

LGTM.

@urso (Author) commented Jan 28, 2020

beats-ci failure due to timeouts downloading dependencies. All related tests passed on Travis. Merging.

@urso changed the title from "Fix missing support for setting document id in decoder_json processor and switch to metadata._id" to "Fix missing support for setting document id in decoder_json pr…" Jan 28, 2020
@urso merged commit d60b04a into elastic:master Jan 28, 2020
@urso deleted the metadata-id-compat branch January 28, 2020 20:05
urso pushed a commit to urso/beats that referenced this pull request Jan 28, 2020
…tic#15859)

* Change to metadata._id

(cherry picked from commit d60b04a)
@urso added the v7.7.0 label and removed the needs_backport label (PR is waiting to be backported to other branches) Jan 28, 2020
urso pushed a commit to urso/beats that referenced this pull request Jan 28, 2020
…tic#15859)

* Change to metadata._id

(cherry picked from commit d60b04a)
@urso added the test-plan label (add this PR to the manual test plan) Jan 28, 2020
urso pushed a commit that referenced this pull request Jan 28, 2020
* Change to metadata._id

(cherry picked from commit d60b04a)
urso pushed a commit that referenced this pull request Jan 28, 2020
* Change to metadata._id

(cherry picked from commit d60b04a)
@urso removed the blocker label Feb 3, 2020
@faec (Contributor) commented Feb 6, 2020

Testing turned up an oversight in this PR: document_id wasn't added to the list of allowed fields for the processor, so while the other logic (and the id / _id fix) looks correct, libbeat rejected configurations that include document_id in decode_json_fields. We consider this a non-blocking bug. The fix is in review at #16156 and will be backported to 7.6 and 7.x shortly.
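To illustrate the kind of omission described above, a generic allowed-fields check might look like the sketch below; the names and structure are assumptions, not the actual libbeat validation code:

```go
package example

import "fmt"

// allowedFields is a generic stand-in for the processor's allowed-settings
// list; the real libbeat check lives elsewhere and is shaped differently.
var allowedFields = map[string]bool{
	"fields":      true,
	"max_depth":   true,
	"target":      true,
	"document_id": true, // the fix in #16156 roughly amounts to allowing this option
}

// checkConfig rejects any setting not present in the allowed list. Without
// "document_id" in that list, this is the rejection users would hit.
func checkConfig(settings map[string]interface{}) error {
	for key := range settings {
		if !allowedFields[key] {
			return fmt.Errorf("unexpected option %q in decode_json_fields", key)
		}
	}
	return nil
}
```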

@andresrc added the Team:Integrations label (label for the Integrations team) Mar 6, 2020
andrewkroh added a commit to andrewkroh/beats that referenced this pull request Mar 16, 2020
In elastic#15859 the Elasticsearch output was changed to read from the @metadata._id field when it had been using @metadata.id.
The s3 and googlepubsub inputs had both been setting @metadata.id, but were not updated with that change.

This updates the s3 and googlepubsub inputs to use `beat.Event#SetID()` rather than creating the metadata object themselves.
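A rough sketch of the before/after that commit message describes follows; the helper names, fields, and import paths are illustrative, and only Meta, SetID(), and the id/_id keys come from this PR and the message above:

```go
package example

// Import paths are assumptions based on the 7.x source layout.
import (
	"time"

	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// buildEventOld shows the pattern the commit replaces: the input assembled the
// metadata map by hand with the old "id" key, which the Elasticsearch output
// no longer reads after #15859.
func buildEventOld(fields common.MapStr, messageID string) beat.Event {
	return beat.Event{
		Timestamp: time.Now(),
		Fields:    fields,
		Meta:      common.MapStr{"id": messageID},
	}
}

// buildEventNew shows the replacement: call SetID and let libbeat store the
// ID under the key the output actually consumes (@metadata._id as of this PR).
func buildEventNew(fields common.MapStr, messageID string) beat.Event {
	event := beat.Event{
		Timestamp: time.Now(),
		Fields:    fields,
	}
	event.SetID(messageID)
	return event
}
```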
andrewkroh added a commit that referenced this pull request Mar 18, 2020
andrewkroh added a commit to andrewkroh/beats that referenced this pull request Mar 19, 2020

(cherry picked from commit 304eca4)
andrewkroh added a commit that referenced this pull request Mar 19, 2020

(cherry picked from commit 304eca4)
@andresrc added the test-plan-added label (this PR has been added to the test plan) Mar 21, 2020
Labels: review, Team:Integrations (label for the Integrations team), Team:Services (Deprecated) (label for the former Integrations-Services team), test-plan (add this PR to the manual test plan), test-plan-added (this PR has been added to the test plan), v7.6.0, v7.7.0
4 participants