
Fix missing support for setting document id in decoder_json pr… #15859

Merged

urso merged 3 commits into elastic:master from the metadata-id-compat branch on Jan 28, 2020

Conversation

@urso commented Jan 27, 2020

  • Breaking change
  • Enhancement

What does this PR do?

Update processors, output, and JSON parser to store the document ID in
@metadata._id.
Also add the missing document_id option to the decode_json_fields processor, giving
users the chance to set the document ID when the JSON document is
embedded in another JSON document.
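As an illustration of the intended flow, here is a minimal sketch, not the actual processor code; the helper name, error handling, and import paths are assumptions:

```go
package example

// Import paths follow the pre-7.6 source layout and may differ between
// Beats versions (assumption).
import (
	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// applyDocumentID is an illustrative helper, not the actual decode_json_fields
// implementation: it looks up the configured field in the decoded JSON,
// removes it from the document, and stores it via SetID, which after this PR
// writes the value to @metadata._id.
func applyDocumentID(event *beat.Event, decoded common.MapStr, documentIDField string) {
	if documentIDField == "" {
		return
	}
	raw, err := decoded.GetValue(documentIDField)
	if err != nil {
		return // field not present; leave the event untouched
	}
	if id, ok := raw.(string); ok && id != "" {
		_ = decoded.Delete(documentIDField) // the ID should not remain in the event body
		event.SetID(id)                     // consumed by the Elasticsearch output as _id
	}
}
```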

Why is it important?

  • This ensures better compatibility with existing Logstash inputs/filters that already use @metadata._id.
  • It fixes missing support for extracting document IDs via decode_json_fields.

About the breaking change: the document_id setting on the JSON decoder was introduced in 7.5, but the overall effort to support handling of event duplication was only finalized in 7.6. This means that the switch to @metadata._id is a breaking change. However, the feature was barely documented, and proper documentation on how to configure Beats + Elasticsearch for handling data duplication is planned for 7.6.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • [ ]

How to test this PR locally

  1. Create a file with a few custom JSON lines (4 or 5 events) containing a field that is supposed to act as the document ID. For example: {"myid": "id1", "log": "..."}
  2. Configure Filebeat to collect from said file and configure the decode_json_fields processor:
processors:
  - decode_json_fields:
      document_id: "myid"
      fields: ["message"]
      max_depth: 1
      target: ""
  3. Run Filebeat (publishing to Elasticsearch) with -d 'publish' and check in the debug logs that @metadata._id is set on the events to be published. The myid field should be removed from the event.
  4. Query the events from Elasticsearch and check that _id matches the original contents of myid (a helper sketch for this check follows below).
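For step 4, a small stdlib-only Go helper along these lines can dump the _id of each indexed event for comparison; the Elasticsearch host, index pattern, and result size are assumptions, so adjust them to your test setup:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// searchResponse models only the parts of the Elasticsearch _search reply
// needed for this check.
type searchResponse struct {
	Hits struct {
		Hits []struct {
			ID     string                 `json:"_id"`
			Source map[string]interface{} `json:"_source"`
		} `json:"hits"`
	} `json:"hits"`
}

func main() {
	// Assumes Elasticsearch on localhost:9200 and the default filebeat-* indices.
	resp, err := http.Get("http://localhost:9200/filebeat-*/_search?size=10")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var sr searchResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		log.Fatal(err)
	}
	for _, h := range sr.Hits.Hits {
		// With this PR applied, _id should equal the original myid value and
		// myid should no longer appear in _source.
		fmt.Printf("_id=%s  _source=%v\n", h.ID, h.Source)
	}
}
```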

Related issues

@urso requested a review from ycombinator January 27, 2020 13:35
@urso changed the title from "Change to metadata._id" to "Fix missing support for setting document id in decoder_json processor and switch to metadata._id" Jan 27, 2020
@urso added the needs_backport label (PR is waiting to be backported to other branches) Jan 27, 2020
@andresrc added the [zube]: Inbox, Team:Services (Deprecated) (label for the former Integrations-Services team), and [zube]: In Review labels and removed the [zube]: Inbox label Jan 27, 2020
@@ -51,7 +51,7 @@ func (e *Event) SetID(id string) {
	if e.Meta == nil {
		e.Meta = common.MapStr{}
	}
	e.Meta["id"] = id
Contributor

Given the special nature of this field name and the desire to keep it consistent in multiple places, do you think we should make it an exported const?

Author

We have more than one field that is special to meta. Let's clean these up (the other fields as well) in a follow up PR.
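Purely to illustrate the suggestion above, such an exported constant might look like this; the name is an assumption, and this is not code from the PR or the planned follow-up:

```go
package example

// MetadataFieldID would name the @metadata key the Elasticsearch output reads
// the document ID from, so SetID and the output share one definition instead
// of repeating the literal "_id".
const MetadataFieldID = "_id"

// Usage inside SetID would then be: e.Meta[MetadataFieldID] = id
```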

@ycombinator (Contributor)

Should we add a CHANGELOG entry, maybe especially since it's technically a breaking change?

@urso (Author) commented Jan 27, 2020

> Should we add a CHANGELOG entry, maybe especially since it's technically a breaking change?

Oops, added changelog.

@ycombinator (Contributor) left a comment

LGTM.

@urso (Author) commented Jan 28, 2020

beats-ci failure due to timeouts downloading dependencies. All related tests passed on Travis. Merging.

@urso changed the title from "Fix missing support for setting document id in decoder_json processor and switch to metadata._id" to "Fix missing support for setting document id in decoder_json pr…" Jan 28, 2020
@urso merged commit d60b04a into elastic:master Jan 28, 2020
@urso deleted the metadata-id-compat branch January 28, 2020 20:05
urso pushed a commit to urso/beats that referenced this pull request Jan 28, 2020
…tic#15859)

* Change to metadata._id

(cherry picked from commit d60b04a)
@urso added the v7.7.0 label and removed the needs_backport label (PR is waiting to be backported to other branches) Jan 28, 2020
urso pushed a commit to urso/beats that referenced this pull request Jan 28, 2020
…tic#15859)

* Change to metadata._id

(cherry picked from commit d60b04a)
@urso added the test-plan label (add this PR to the manual test plan) Jan 28, 2020
urso pushed a commit that referenced this pull request Jan 28, 2020
* Change to metadata._id

(cherry picked from commit d60b04a)
urso pushed a commit that referenced this pull request Jan 28, 2020
* Change to metadata._id

(cherry picked from commit d60b04a)
@urso removed the blocker label Feb 3, 2020
@faec (Contributor) commented Feb 6, 2020

Testing turned up an oversight in this PR: document_id wasn't added to the list of allowed fields for the processor, so while the other logic (and the id / _id fix) looks correct, libbeat rejected configurations that include document_id in decode_json_fields. We consider this a non-blocking bug. The fix is in review at #16156 and will be backported to 7.6 and 7.x shortly.
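To illustrate the kind of omission described above, a generic allowed-fields check might look like the sketch below; the names and structure are assumptions, not the actual libbeat validation code:

```go
package example

import "fmt"

// allowedFields is a generic stand-in for the processor's allowed-settings
// list; the real libbeat check lives elsewhere and is shaped differently.
var allowedFields = map[string]bool{
	"fields":      true,
	"max_depth":   true,
	"target":      true,
	"document_id": true, // the fix in #16156 roughly amounts to allowing this option
}

// checkConfig rejects any setting not present in the allowed list. Without
// "document_id" in that list, this is the rejection users would hit.
func checkConfig(settings map[string]interface{}) error {
	for key := range settings {
		if !allowedFields[key] {
			return fmt.Errorf("unexpected option %q in decode_json_fields", key)
		}
	}
	return nil
}
```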

@andresrc added the Team:Integrations label (label for the Integrations team) Mar 6, 2020
andrewkroh added a commit to andrewkroh/beats that referenced this pull request Mar 16, 2020
In elastic#15859 the Elasticsearch output was changed to read from the @metadata._id field when it had been using @metadata.id.
The s3 and googlepubsub inputs had both been setting @metadata.id, but were not updated with that change.

This updates the s3 and googlepubsub inputs to use `beat.Event#SetID()` rather than creating the metadata object themselves.
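A rough sketch of the before/after that commit message describes follows; the helper names, fields, and import paths are illustrative, and only Meta, SetID(), and the id/_id keys come from this PR and the message above:

```go
package example

// Import paths are assumptions based on the 7.x source layout.
import (
	"time"

	"github.com/elastic/beats/libbeat/beat"
	"github.com/elastic/beats/libbeat/common"
)

// buildEventOld shows the pattern the commit replaces: the input assembled the
// metadata map by hand with the old "id" key, which the Elasticsearch output
// no longer reads after #15859.
func buildEventOld(fields common.MapStr, messageID string) beat.Event {
	return beat.Event{
		Timestamp: time.Now(),
		Fields:    fields,
		Meta:      common.MapStr{"id": messageID},
	}
}

// buildEventNew shows the replacement: call SetID and let libbeat store the
// ID under the key the output actually consumes (@metadata._id as of this PR).
func buildEventNew(fields common.MapStr, messageID string) beat.Event {
	event := beat.Event{
		Timestamp: time.Now(),
		Fields:    fields,
	}
	event.SetID(messageID)
	return event
}
```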
andrewkroh added a commit that referenced this pull request Mar 18, 2020
andrewkroh added a commit to andrewkroh/beats that referenced this pull request Mar 19, 2020

(cherry picked from commit 304eca4)
andrewkroh added a commit that referenced this pull request Mar 19, 2020

(cherry picked from commit 304eca4)
@andresrc added the test-plan-added label (this PR has been added to the test plan) Mar 21, 2020
Labels: review, Team:Integrations (label for the Integrations team), Team:Services (Deprecated) (label for the former Integrations-Services team), test-plan (add this PR to the manual test plan), test-plan-added (this PR has been added to the test plan), v7.6.0, v7.7.0
4 participants