From fea94e5731946a74ff782bfb9a9c19189985572f Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 17 Dec 2019 16:39:18 -0800 Subject: [PATCH 1/7] [DOCS] Add new topic about data deduplication --- filebeat/docs/configuring-howto.asciidoc | 3 + libbeat/docs/shared-deduplication.asciidoc | 81 ++++++++++++++++++++++ 2 files changed, 84 insertions(+) create mode 100644 libbeat/docs/shared-deduplication.asciidoc diff --git a/filebeat/docs/configuring-howto.asciidoc b/filebeat/docs/configuring-howto.asciidoc index 08f61a4f060..a5d8f381428 100644 --- a/filebeat/docs/configuring-howto.asciidoc +++ b/filebeat/docs/configuring-howto.asciidoc @@ -30,6 +30,7 @@ The following topics describe how to configure Filebeat: * <> * <> * <> +* <<{beatname_lc}-deduplication>> * <> * <<{beatname_lc}-geoip>> * <> @@ -68,6 +69,8 @@ include::{libbeat-dir}/shared-ssl-config.asciidoc[] include::./filebeat-filtering.asciidoc[] +include::{libbeat-dir}/shared-deduplication.asciidoc[] + include::{libbeat-dir}/shared-config-ingest.asciidoc[] include::{libbeat-dir}/shared-geoip.asciidoc[] diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc new file mode 100644 index 00000000000..da0be787a51 --- /dev/null +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -0,0 +1,81 @@ +[id="{beatname_lc}-deduplication"] +== Data deduplication + +The {beats} framework guarantees at-least-once delivery to ensure that no data +is lost when events are sent to {es}. This is great if everything goes as +planned. But if {beatname_uc} shuts down during processing, or the connection is +lost before events are acknowledged, you can end up with duplicate events in +{es}. + +[float] +=== What causes duplicates? + +The {beats} retry mechanism may result in duplicate data in {es}. + +When an output is blocked, {beatname_uc} will attempt to resend events until +they are acknowledged by the output. 
If the output receives the data, but is +unable to send an acknowledgement, the data may be sent to {es} multiple times. +When {es} processes the data, it looks for a document ID. If the ID exists, +{es} overwrites the existing document. If not, {es} creates a new document. +Because document IDs are typically set by {es} (by default), this problem is +common for data sent by {beats} or {ls}. + +[float] +=== How can I avoid duplicates? + +Rather than allowing {es} to set the document ID, set the ID in {beats}. The ID +is stored in the {beats} `@metadata.id` field where it can be used to set the +document ID during indexing. That way, if {beats} sends the same event to {es} +more than once, {es} overwrites the existing document rather than creating a new +one. + +The `@metadata.id` field is passed along with the event so that you can use +it to set the document ID later in your processing pipeline, for example, in +{ls}. + +There are several methods available for setting the document ID in {beats}. The +one you use depends on your specific use case: + +TODO: Need some realistic examples to flesh out the following sections. Also need to test these...haha. + +* *`add_id` processor* ++ +Use the <> processor when your logs have no natural key field, +and you can’t derive a unique key from existing fields. ++ +This example generates a unique ID for each event and adds it to the +`@metadata.id` field: ++ +[source,yaml] +---- +processors: + - add_id: ~ +---- + +* *`fingerprint` processor* ++ +Use the <> processor when you want to derive a unique +key from multiple existing fields. ++ +This example combines the values of `field1` and `field2` to create a unique key +that it adds to the `_id` field: ++ +[source,yaml] +---- +processors: + - fingerprint: + fields: ["field1", "field2", ...] + target_field: "_id" +---- ++ +TODO: Test the syntax. I’m guessing here. 
+ +* *JSON input settings* ++ +Use the `json.document_id` input setting if you’re ingesting JSON-formatted +data, and the data has a natural key field. ++ +This example sets the document ID to the value of field1 from the JSON document. ++ +TODO: Add an example here. Should show the input config in addition to the JSON +settings to provide context. From 6748be5a3d3d7a075656923818033b84e7680dbe Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Thu, 9 Jan 2020 17:08:21 -0800 Subject: [PATCH 2/7] More updates --- libbeat/docs/shared-deduplication.asciidoc | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index da0be787a51..3191782042c 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -4,21 +4,17 @@ The {beats} framework guarantees at-least-once delivery to ensure that no data is lost when events are sent to {es}. This is great if everything goes as planned. But if {beatname_uc} shuts down during processing, or the connection is -lost before events are acknowledged, you can end up with duplicate events in +lost before events are acknowledged, you can end up with duplicate data in {es}. [float] === What causes duplicates? -The {beats} retry mechanism may result in duplicate data in {es}. - -When an output is blocked, {beatname_uc} will attempt to resend events until -they are acknowledged by the output. If the output receives the data, but is -unable to send an acknowledgement, the data may be sent to {es} multiple times. -When {es} processes the data, it looks for a document ID. If the ID exists, -{es} overwrites the existing document. If not, {es} creates a new document. -Because document IDs are typically set by {es} (by default), this problem is -common for data sent by {beats} or {ls}. 
+When an output is blocked, the retry mechanism in {beatname_uc} attempts to +resend events until they are acknowledged by the output. If the output receives +the events, but is unable to acknowledge them, the data might be sent to {es} +multiple times. Because document IDs are typically set by {es} _after_ it +receives the data from {beats}, the duplicate events are indexed as new data. [float] === How can I avoid duplicates? @@ -33,8 +29,7 @@ The `@metadata.id` field is passed along with the event so that you can use it to set the document ID later in your processing pipeline, for example, in {ls}. -There are several methods available for setting the document ID in {beats}. The -one you use depends on your specific use case: +There are several ways to set the document ID in {beats}: TODO: Need some realistic examples to flesh out the following sections. Also need to test these...haha. From 1c6f94d291b2a522f553dfc2d00bd0fff5854971 Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Fri, 24 Jan 2020 17:51:22 -0800 Subject: [PATCH 3/7] Corrections from testing --- libbeat/docs/shared-deduplication.asciidoc | 73 ++++++++++++++++------ 1 file changed, 53 insertions(+), 20 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index 3191782042c..e708738785e 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -20,19 +20,16 @@ receives the data from {beats}, the duplicate events are indexed as new data. === How can I avoid duplicates? Rather than allowing {es} to set the document ID, set the ID in {beats}. The ID -is stored in the {beats} `@metadata.id` field where it can be used to set the -document ID during indexing. That way, if {beats} sends the same event to {es} -more than once, {es} overwrites the existing document rather than creating a new -one. +is stored in the {beats} `@metadata.id` field and used to set the document ID +during indexing. 
That way, if {beats} sends the same event to {es} more than +once, {es} overwrites the existing document rather than creating a new one. The `@metadata.id` field is passed along with the event so that you can use -it to set the document ID later in your processing pipeline, for example, in -{ls}. +it to set the document ID later in your processing pipeline, for example, +in {ls}. See <>. There are several ways to set the document ID in {beats}: -TODO: Need some realistic examples to flesh out the following sections. Also need to test these...haha. - * *`add_id` processor* + Use the <> processor when your logs have no natural key field, @@ -44,33 +41,69 @@ This example generates a unique ID for each event and adds it to the [source,yaml] ---- processors: - - add_id: ~ + - add_id: ~ ---- * *`fingerprint` processor* + -Use the <> processor when you want to derive a unique -key from multiple existing fields. +Use the <> processor to derive a unique key from +multiple existing fields. + -This example combines the values of `field1` and `field2` to create a unique key -that it adds to the `_id` field: +This example uses the values of `field1` and `field2` to derive a unique key +that it adds to the `@metadata.id` field: + [source,yaml] ---- processors: - - fingerprint: - fields: ["field1", "field2", ...] - target_field: "_id" + - fingerprint: + fields: ["field1", "field2"] + target_field: "@metadata.id" ---- + -TODO: Test the syntax. I’m guessing here. * *JSON input settings* + Use the `json.document_id` input setting if you’re ingesting JSON-formatted data, and the data has a natural key field. + -This example sets the document ID to the value of field1 from the JSON document. +This example takes the value of `key1` from the JSON document and stores it in +the `@metadata.id` field: + -TODO: Add an example here. Should show the input config in addition to the JSON -settings to provide context. 
+[source,yaml] +---- +filebeat.inputs: +- type: log + paths: + - /path/to/json.log + json.document_id: "key1" +---- + +[float] +[[ls-doc-id]] +=== {ls} pipeline example + +For this example, assume that you've used one of the approaches described +earlier to store the document ID in the {beats} `@metadata.id` field. In +the {logstash-ref}/plugins-outputs-elasticsearch.html[{es} output], set +the `document_id` field based on the `@metadata.id` field: + +[source,json] +---- +input { + beats { + port => 5044 + } +} + +output { + elasticsearch { + hosts => ["http://localhost:9200"] + document_id => "%{[@metadata][id]}" <1> + index => "%{[@metadata][beat]}-%{[@metadata][version]}" + } +} +---- +<1> Sets the `document_id` field to the value stored in `@metadata.id`. + +When {es} indexes the document, it sets the document ID to the specified value, +preserving the ID passed from {beats}. From 9e2f14ea97b6ba0a9bc28ad41a407abbdb0ae2c4 Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Mon, 3 Feb 2020 18:43:05 -0800 Subject: [PATCH 4/7] Add more changes from the review --- libbeat/docs/shared-deduplication.asciidoc | 59 ++++++++++++---------- 1 file changed, 33 insertions(+), 26 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index e708738785e..4bf977078de 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -2,41 +2,42 @@ == Data deduplication The {beats} framework guarantees at-least-once delivery to ensure that no data -is lost when events are sent to {es}. This is great if everything goes as -planned. But if {beatname_uc} shuts down during processing, or the connection is -lost before events are acknowledged, you can end up with duplicate data in -{es}. +is lost when events are sent to outputs that support acknowledgement, such as +{es}, {ls}, Kafka, and Redis. This is great if everything goes as planned. 
But +if {beatname_uc} shuts down during processing, or the connection is lost before +events are acknowledged, you can end up with duplicate data. [float] -=== What causes duplicates? +=== What causes duplicates in {es}? When an output is blocked, the retry mechanism in {beatname_uc} attempts to resend events until they are acknowledged by the output. If the output receives -the events, but is unable to acknowledge them, the data might be sent to {es} -multiple times. Because document IDs are typically set by {es} _after_ it -receives the data from {beats}, the duplicate events are indexed as new data. +the events, but is unable to acknowledge them, the data might be sent to the +output multiple times. Because document IDs are typically set by {es} _after_ it +receives the data from {beats}, the duplicate events are indexed as new +documents. [float] === How can I avoid duplicates? Rather than allowing {es} to set the document ID, set the ID in {beats}. The ID -is stored in the {beats} `@metadata.id` field and used to set the document ID +is stored in the {beats} `@metadata._id` field and used to set the document ID during indexing. That way, if {beats} sends the same event to {es} more than once, {es} overwrites the existing document rather than creating a new one. -The `@metadata.id` field is passed along with the event so that you can use -it to set the document ID later in your processing pipeline, for example, -in {ls}. See <>. +The `@metadata._id` field is passed along with the event so that you can use +it to set the document ID after the event has been published by {beatname_uc} +but before it's received by {es}. For example, see <>. There are several ways to set the document ID in {beats}: * *`add_id` processor* + -Use the <> processor when your logs have no natural key field, +Use the <> processor when your data has no natural key field, and you can’t derive a unique key from existing fields. 
+ This example generates a unique ID for each event and adds it to the -`@metadata.id` field: +`@metadata._id` field: + [source,yaml] ---- @@ -47,19 +48,18 @@ processors: * *`fingerprint` processor* + Use the <> processor to derive a unique key from -multiple existing fields. +one or more existing fields. + This example uses the values of `field1` and `field2` to derive a unique key -that it adds to the `@metadata.id` field: +that it adds to the `@metadata._id` field: + [source,yaml] ---- processors: - fingerprint: fields: ["field1", "field2"] - target_field: "@metadata.id" + target_field: "@metadata._id" ---- -+ * *JSON input settings* + @@ -67,7 +67,7 @@ Use the `json.document_id` input setting if you’re ingesting JSON-formatted data, and the data has a natural key field. + This example takes the value of `key1` from the JSON document and stores it in -the `@metadata.id` field: +the `@metadata._id` field: + [source,yaml] ---- @@ -83,9 +83,9 @@ filebeat.inputs: === {ls} pipeline example For this example, assume that you've used one of the approaches described -earlier to store the document ID in the {beats} `@metadata.id` field. In +earlier to store the document ID in the {beats} `@metadata._id` field. 
In the {logstash-ref}/plugins-outputs-elasticsearch.html[{es} output], set -the `document_id` field based on the `@metadata.id` field: +the `document_id` field based on the `@metadata._id` field: [source,json] ---- @@ -96,14 +96,21 @@ input { } output { - elasticsearch { - hosts => ["http://localhost:9200"] - document_id => "%{[@metadata][id]}" <1> - index => "%{[@metadata][beat]}-%{[@metadata][version]}" + if [@metadata][_id] { + elasticsearch { + hosts => ["http://localhost:9200"] + document_id => "%{[@metadata][_id]}" <1> + index => "%{[@metadata][beat]}-%{[@metadata][version]}" + } + } else { + elasticsearch { + hosts => ["http://localhost:9200"] + index => "%{[@metadata][beat]}-%{[@metadata][version]}" + } } } ---- -<1> Sets the `document_id` field to the value stored in `@metadata.id`. +<1> Sets the `document_id` field to the value stored in `@metadata._id`. When {es} indexes the document, it sets the document ID to the specified value, preserving the ID passed from {beats}. From 04af9dce3dd36735bb56998bfa763cf23d5cd41f Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 4 Feb 2020 16:35:59 -0800 Subject: [PATCH 5/7] Add more fixes from the review --- libbeat/docs/shared-deduplication.asciidoc | 33 ++++++++++++++++++---- 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index 4bf977078de..46fab7f8714 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -61,6 +61,27 @@ processors: target_field: "@metadata._id" ---- +* *`decode_json_field` processor* ++ +Use the `document_id` setting in the <> +processor when you're decoding a JSON string that contains a natural key field. ++ +For this example, assume that the `message` field contains the JSON string +`{"myid": "100", "text": "Some text"}`. 
This example takes the value of `myid`
+from the JSON string and stores it in the `@metadata._id` field:
++
+[source,yaml]
+----
+processors:
+  - decode_json_fields:
+      document_id: "myid"
+      fields: ["message"]
+      max_depth: 1
+      target: ""
+----
++
+The resulting document ID is `100`.
+
 * *JSON input settings*
 +
 Use the `json.document_id` input setting if you’re ingesting JSON-formatted
@@ -83,9 +104,9 @@ filebeat.inputs:
 === {ls} pipeline example
 
 For this example, assume that you've used one of the approaches described
-earlier to store the document ID in the {beats} `@metadata._id` field. In
-the {logstash-ref}/plugins-outputs-elasticsearch.html[{es} output], set
-the `document_id` field based on the `@metadata._id` field:
+earlier to store the document ID in the {beats} `@metadata._id` field. To
+preserve the ID when you send {beats} data through {ls} en route to {es},
+set the `document_id` field in the {ls} pipeline:
 
 [source,json]
 ----
@@ -93,7 +114,7 @@ input {
 beats {
 port => 5044
 }
 }
 
 output {
 if [@metadata][_id] {
@@ -110,7 +131,9 @@ output {
 }
 }
 ----
-<1> Sets the `document_id` field to the value stored in `@metadata._id`.
+<1> Sets the `document_id` field in the
+{logstash-ref}/plugins-outputs-elasticsearch.html[{es} output] to the value
+stored in `@metadata._id`.
 
 When {es} indexes the document, it sets the document ID to the specified value,
 preserving the ID passed from {beats}. 
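The behavior these patches document — a stable, client-supplied document ID turning a retried delivery into an overwrite instead of a duplicate — can be sketched in a few lines of Python. This is a simplified illustration, not the Beats implementation: the SHA-256-over-joined-fields encoding and the field names `host` and `message` are assumptions standing in for what the `fingerprint` processor actually does.

```python
import hashlib

def fingerprint(event, fields):
    # Derive a stable ID from selected event fields, in the spirit of
    # the fingerprint processor (the real hash/encoding may differ).
    joined = "|".join(str(event[f]) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# A plain dict stands in for an Elasticsearch index: indexing a
# document under an existing ID overwrites rather than appends.
index = {}

def deliver(event):
    doc_id = fingerprint(event, ["host", "message"])
    index[doc_id] = event

event = {"host": "web-1", "message": "GET /health 200"}
deliver(event)
deliver(event)  # redelivered after a lost acknowledgement

print(len(index))  # one document, not two
```

Because the ID is a pure function of the event's content, any number of redeliveries converge on a single stored document — the same property the `@metadata._id` field gives you in {es}.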
From 407a6918f9a0dc47bb94b34bdc770829fbbe288e Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 4 Feb 2020 17:02:54 -0800 Subject: [PATCH 6/7] Fix typo in processor name --- libbeat/docs/shared-deduplication.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index 46fab7f8714..1f8ab85385c 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -61,7 +61,7 @@ processors: target_field: "@metadata._id" ---- -* *`decode_json_field` processor* +* *`decode_json_fields` processor* + Use the `document_id` setting in the <> processor when you're decoding a JSON string that contains a natural key field. From cf4fe6dd81e5a939c11ab30bee62a1fba10699b8 Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 4 Feb 2020 17:57:04 -0800 Subject: [PATCH 7/7] Update notice --- NOTICE.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/NOTICE.txt b/NOTICE.txt index e8412d03955..14bcd6878c4 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -1,5 +1,5 @@ Elastic Beats -Copyright 2014-2019 Elasticsearch BV +Copyright 2014-2020 Elasticsearch BV This product includes software developed by The Apache Software Foundation (http://www.apache.org/).
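The failure mode described under "What causes duplicates?" — at-least-once redelivery combined with server-assigned IDs — can be simulated end to end with a toy index. This is a hypothetical model for illustration only; `FakeIndex`, `deliver_at_least_once`, and the ID `"web-1-0"` are invented names, not part of any Beats or {es} API.

```python
import uuid

class FakeIndex:
    """Toy stand-in for an index: when the client supplies no ID,
    every delivery is stored as a brand-new document."""
    def __init__(self):
        self.docs = {}

    def index(self, event, doc_id=None):
        self.docs[doc_id or uuid.uuid4().hex] = dict(event)

def deliver_at_least_once(idx, event, doc_id=None):
    # The first attempt reaches the index, but the acknowledgement
    # is lost, so the sender retries and the event arrives twice.
    idx.index(event, doc_id)
    idx.index(event, doc_id)

event = {"message": "disk full on web-1"}

auto = FakeIndex()
deliver_at_least_once(auto, event)               # server-assigned IDs
print(len(auto.docs))                            # duplicated

stable = FakeIndex()
deliver_at_least_once(stable, event, "web-1-0")  # client-supplied ID
print(len(stable.docs))                          # overwritten
```

With server-assigned IDs the retry lands as a second document; with a stable client-supplied ID the retry overwrites the first copy, which is exactly the trade the `add_id`, `fingerprint`, and `json.document_id` approaches make.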