From fea94e5731946a74ff782bfb9a9c19189985572f Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 17 Dec 2019 16:39:18 -0800 Subject: [PATCH 1/7] [DOCS] Add new topic about data deduplication --- filebeat/docs/configuring-howto.asciidoc | 3 + libbeat/docs/shared-deduplication.asciidoc | 81 ++++++++++++++++++++++ 2 files changed, 84 insertions(+) create mode 100644 libbeat/docs/shared-deduplication.asciidoc diff --git a/filebeat/docs/configuring-howto.asciidoc b/filebeat/docs/configuring-howto.asciidoc index 08f61a4f060..a5d8f381428 100644 --- a/filebeat/docs/configuring-howto.asciidoc +++ b/filebeat/docs/configuring-howto.asciidoc @@ -30,6 +30,7 @@ The following topics describe how to configure Filebeat: * <> * <> * <> +* <<{beatname_lc}-deduplication>> * <> * <<{beatname_lc}-geoip>> * <> @@ -68,6 +69,8 @@ include::{libbeat-dir}/shared-ssl-config.asciidoc[] include::./filebeat-filtering.asciidoc[] +include::{libbeat-dir}/shared-deduplication.asciidoc[] + include::{libbeat-dir}/shared-config-ingest.asciidoc[] include::{libbeat-dir}/shared-geoip.asciidoc[] diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc new file mode 100644 index 00000000000..da0be787a51 --- /dev/null +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -0,0 +1,81 @@ +[id="{beatname_lc}-deduplication"] +== Data deduplication + +The {beats} framework guarantees at-least-once delivery to ensure that no data +is lost when events are sent to {es}. This is great if everything goes as +planned. But if {beatname_uc} shuts down during processing, or the connection is +lost before events are acknowledged, you can end up with duplicate events in +{es}. + +[float] +=== What causes duplicates? + +The {beats} retry mechanism may result in duplicate data in {es}. + +When an output is blocked, {beatname_uc} will attempt to resend events until +they are acknowledged by the output. 
If the output receives the data, but is +unable to send an acknowledgement, the data may be sent to {es} multiple times. +When {es} processes the data, it looks for a document ID. If the ID exists, +{es} overwrites the existing document. If not, {es} creates a new document. +Because document IDs are typically set by {es} (by default), this problem is +common for data sent by {beats} or {ls}. + +[float] +=== How can I avoid duplicates? + +Rather than allowing {es} to set the document ID, set the ID in {beats}. The ID +is stored in the {beats} `@metadata.id` field where it can be used to set the +document ID during indexing. That way, if {beats} sends the same event to {es} +more than once, {es} overwrites the existing document rather than creating a new +one. + +The `@metadata.id` field is passed along with the event so that you can use +it to set the document ID later in your processing pipeline, for example, in +{ls}. + +There are several methods available for setting the document ID in {beats}. The +one you use depends on your specific use case: + +TODO: Need some realistic examples to flesh out the following sections. Also need to test these...haha. + +* *`add_id` processor* ++ +Use the <> processor when your logs have no natural key field, +and you can’t derive a unique key from existing fields. ++ +This example generates a unique ID for each event and adds it to the +`@metadata.id` field: ++ +[source,yaml] +---- +processors: + - add_id: ~ +---- + +* *`fingerprint` processor* ++ +Use the <> processor when you want to derive a unique +key from multiple existing fields. ++ +This example combines the values of `field1` and `field2` to create a unique key +that it adds to the `_id` field: ++ +[source,yaml] +---- +processors: + - fingerprint: + fields: ["field1", "field2", ...] + target_field: "_id" +---- ++ +TODO: Test the syntax. I’m guessing here. 
+ +* *JSON input settings* ++ +Use the `json.document_id` input setting if you’re ingesting JSON-formatted +data, and the data has a natural key field. ++ +This example sets the document ID to the value of field1 from the JSON document. ++ +TODO: Add an example here. Should show the input config in addition to the JSON +settings to provide context. From 6748be5a3d3d7a075656923818033b84e7680dbe Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Thu, 9 Jan 2020 17:08:21 -0800 Subject: [PATCH 2/7] More updates --- libbeat/docs/shared-deduplication.asciidoc | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index da0be787a51..3191782042c 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -4,21 +4,17 @@ The {beats} framework guarantees at-least-once delivery to ensure that no data is lost when events are sent to {es}. This is great if everything goes as planned. But if {beatname_uc} shuts down during processing, or the connection is -lost before events are acknowledged, you can end up with duplicate events in +lost before events are acknowledged, you can end up with duplicate data in {es}. [float] === What causes duplicates? -The {beats} retry mechanism may result in duplicate data in {es}. - -When an output is blocked, {beatname_uc} will attempt to resend events until -they are acknowledged by the output. If the output receives the data, but is -unable to send an acknowledgement, the data may be sent to {es} multiple times. -When {es} processes the data, it looks for a document ID. If the ID exists, -{es} overwrites the existing document. If not, {es} creates a new document. -Because document IDs are typically set by {es} (by default), this problem is -common for data sent by {beats} or {ls}. 
+When an output is blocked, the retry mechanism in {beatname_uc} attempts to +resend events until they are acknowledged by the output. If the output receives +the events, but is unable to acknowledge them, the data might be sent to {es} +multiple times. Because document IDs are typically set by {es} _after_ it +receives the data from {beats}, the duplicate events are indexed as new data. [float] === How can I avoid duplicates? @@ -33,8 +29,7 @@ The `@metadata.id` field is passed along with the event so that you can use it to set the document ID later in your processing pipeline, for example, in {ls}. -There are several methods available for setting the document ID in {beats}. The -one you use depends on your specific use case: +There are several ways to set the document ID in {beats}: TODO: Need some realistic examples to flesh out the following sections. Also need to test these...haha. From 1c6f94d291b2a522f553dfc2d00bd0fff5854971 Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Fri, 24 Jan 2020 17:51:22 -0800 Subject: [PATCH 3/7] Corrections from testing --- libbeat/docs/shared-deduplication.asciidoc | 73 ++++++++++++++++------ 1 file changed, 53 insertions(+), 20 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index 3191782042c..e708738785e 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -20,19 +20,16 @@ receives the data from {beats}, the duplicate events are indexed as new data. === How can I avoid duplicates? Rather than allowing {es} to set the document ID, set the ID in {beats}. The ID -is stored in the {beats} `@metadata.id` field where it can be used to set the -document ID during indexing. That way, if {beats} sends the same event to {es} -more than once, {es} overwrites the existing document rather than creating a new -one. +is stored in the {beats} `@metadata.id` field and used to set the document ID +during indexing. 
That way, if {beats} sends the same event to {es} more than +once, {es} overwrites the existing document rather than creating a new one. The `@metadata.id` field is passed along with the event so that you can use -it to set the document ID later in your processing pipeline, for example, in -{ls}. +it to set the document ID later in your processing pipeline, for example, +in {ls}. See <>. There are several ways to set the document ID in {beats}: -TODO: Need some realistic examples to flesh out the following sections. Also need to test these...haha. - * *`add_id` processor* + Use the <> processor when your logs have no natural key field, @@ -44,33 +41,69 @@ This example generates a unique ID for each event and adds it to the [source,yaml] ---- processors: - - add_id: ~ + - add_id: ~ ---- * *`fingerprint` processor* + -Use the <> processor when you want to derive a unique -key from multiple existing fields. +Use the <> processor to derive a unique key from +multiple existing fields. + -This example combines the values of `field1` and `field2` to create a unique key -that it adds to the `_id` field: +This example uses the values of `field1` and `field2` to derive a unique key +that it adds to the `@metadata.id` field: + [source,yaml] ---- processors: - - fingerprint: - fields: ["field1", "field2", ...] - target_field: "_id" + - fingerprint: + fields: ["field1", "field2"] + target_field: "@metadata.id" ---- + -TODO: Test the syntax. I’m guessing here. * *JSON input settings* + Use the `json.document_id` input setting if you’re ingesting JSON-formatted data, and the data has a natural key field. + -This example sets the document ID to the value of field1 from the JSON document. +This example takes the value of `key1` from the JSON document and stores it in +the `@metadata.id` field: + -TODO: Add an example here. Should show the input config in addition to the JSON -settings to provide context. 
+[source,yaml] +---- +filebeat.inputs: +- type: log + paths: + - /path/to/json.log + json.document_id: "key1" +---- + +[float] +[[ls-doc-id]] +=== {ls} pipeline example + +For this example, assume that you've used one of the approaches described +earlier to store the document ID in the {beats} `@metadata.id` field. In +the {logstash-ref}/plugins-outputs-elasticsearch.html[{es} output], set +the `document_id` field based on the `@metadata.id` field: + +[source,json] +---- +input { + beats { + port => 5044 + } +} + +output { + elasticsearch { + hosts => ["http://localhost:9200"] + document_id => "%{[@metadata][id]}" <1> + index => "%{[@metadata][beat]}-%{[@metadata][version]}" + } +} +---- +<1> Sets the `document_id` field to the value stored in `@metadata.id`. + +When {es} indexes the document, it sets the document ID to the specified value, +preserving the ID passed from {beats}. From 9e2f14ea97b6ba0a9bc28ad41a407abbdb0ae2c4 Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Mon, 3 Feb 2020 18:43:05 -0800 Subject: [PATCH 4/7] Add more changes from the review --- libbeat/docs/shared-deduplication.asciidoc | 59 ++++++++++++---------- 1 file changed, 33 insertions(+), 26 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index e708738785e..4bf977078de 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -2,41 +2,42 @@ == Data deduplication The {beats} framework guarantees at-least-once delivery to ensure that no data -is lost when events are sent to {es}. This is great if everything goes as -planned. But if {beatname_uc} shuts down during processing, or the connection is -lost before events are acknowledged, you can end up with duplicate data in -{es}. +is lost when events are sent to outputs that support acknowledgement, such as +{es}, {ls}, Kafka, and Redis. This is great if everything goes as planned. 
But +if {beatname_uc} shuts down during processing, or the connection is lost before +events are acknowledged, you can end up with duplicate data. [float] -=== What causes duplicates? +=== What causes duplicates in {es}? When an output is blocked, the retry mechanism in {beatname_uc} attempts to resend events until they are acknowledged by the output. If the output receives -the events, but is unable to acknowledge them, the data might be sent to {es} -multiple times. Because document IDs are typically set by {es} _after_ it -receives the data from {beats}, the duplicate events are indexed as new data. +the events, but is unable to acknowledge them, the data might be sent to the +output multiple times. Because document IDs are typically set by {es} _after_ it +receives the data from {beats}, the duplicate events are indexed as new +documents. [float] === How can I avoid duplicates? Rather than allowing {es} to set the document ID, set the ID in {beats}. The ID -is stored in the {beats} `@metadata.id` field and used to set the document ID +is stored in the {beats} `@metadata._id` field and used to set the document ID during indexing. That way, if {beats} sends the same event to {es} more than once, {es} overwrites the existing document rather than creating a new one. -The `@metadata.id` field is passed along with the event so that you can use -it to set the document ID later in your processing pipeline, for example, -in {ls}. See <>. +The `@metadata._id` field is passed along with the event so that you can use +it to set the document ID after the event has been published by {beatname_uc} +but before it's received by {es}. For example, see <>. There are several ways to set the document ID in {beats}: * *`add_id` processor* + -Use the <> processor when your logs have no natural key field, +Use the <> processor when your data has no natural key field, and you can’t derive a unique key from existing fields. 
+ This example generates a unique ID for each event and adds it to the -`@metadata.id` field: +`@metadata._id` field: + [source,yaml] ---- @@ -47,19 +48,18 @@ processors: * *`fingerprint` processor* + Use the <> processor to derive a unique key from -multiple existing fields. +one or more existing fields. + This example uses the values of `field1` and `field2` to derive a unique key -that it adds to the `@metadata.id` field: +that it adds to the `@metadata._id` field: + [source,yaml] ---- processors: - fingerprint: fields: ["field1", "field2"] - target_field: "@metadata.id" + target_field: "@metadata._id" ---- -+ * *JSON input settings* + @@ -67,7 +67,7 @@ Use the `json.document_id` input setting if you’re ingesting JSON-formatted data, and the data has a natural key field. + This example takes the value of `key1` from the JSON document and stores it in -the `@metadata.id` field: +the `@metadata._id` field: + [source,yaml] ---- @@ -83,9 +83,9 @@ filebeat.inputs: === {ls} pipeline example For this example, assume that you've used one of the approaches described -earlier to store the document ID in the {beats} `@metadata.id` field. In +earlier to store the document ID in the {beats} `@metadata._id` field. 
In the {logstash-ref}/plugins-outputs-elasticsearch.html[{es} output], set -the `document_id` field based on the `@metadata.id` field: +the `document_id` field based on the `@metadata._id` field: [source,json] ---- @@ -96,14 +96,21 @@ input { } output { - elasticsearch { - hosts => ["http://localhost:9200"] - document_id => "%{[@metadata][id]}" <1> - index => "%{[@metadata][beat]}-%{[@metadata][version]}" + if [@metadata][_id] { + elasticsearch { + hosts => ["http://localhost:9200"] + document_id => "%{[@metadata][_id]}" <1> + index => "%{[@metadata][beat]}-%{[@metadata][version]}" + } + } else { + elasticsearch { + hosts => ["http://localhost:9200"] + index => "%{[@metadata][beat]}-%{[@metadata][version]}" + } } } ---- -<1> Sets the `document_id` field to the value stored in `@metadata.id`. +<1> Sets the `document_id` field to the value stored in `@metadata._id`. When {es} indexes the document, it sets the document ID to the specified value, preserving the ID passed from {beats}. From 04af9dce3dd36735bb56998bfa763cf23d5cd41f Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 4 Feb 2020 16:35:59 -0800 Subject: [PATCH 5/7] Add more fixes from the review --- libbeat/docs/shared-deduplication.asciidoc | 33 ++++++++++++++++++---- 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index 4bf977078de..46fab7f8714 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -61,6 +61,27 @@ processors: target_field: "@metadata._id" ---- +* *`decode_json_field` processor* ++ +Use the `document_id` setting in the <> +processor when you're decoding a JSON string that contains a natural key field. ++ +For this example, assume that the `message` field contains the JSON string +`{"myid": "100", "text": "Some text"}`. 
This example takes the value of `myid`
+from the JSON string and stores it in the `@metadata._id` field:
++
+[source,yaml]
+----
+processors:
+  - decode_json_fields:
+      document_id: "myid"
+      fields: ["message"]
+      max_depth: 1
+      target: ""
+----
++
+The resulting document ID is `100`.
+
 * *JSON input settings*
 +
 Use the `json.document_id` input setting if you’re ingesting JSON-formatted
@@ -83,9 +104,9 @@ filebeat.inputs:
 === {ls} pipeline example
 
 For this example, assume that you've used one of the approaches described
-earlier to store the document ID in the {beats} `@metadata._id` field. In
-the {logstash-ref}/plugins-outputs-elasticsearch.html[{es} output], set
-the `document_id` field based on the `@metadata._id` field:
+earlier to store the document ID in the {beats} `@metadata._id` field. To
+preserve the ID when you send {beats} data through {ls} en route to {es},
+set the `document_id` field in the {ls} pipeline:
 
 [source,json]
 ----
@@ -93,7 +114,7 @@ input {
 beats {
 port => 5044
 }
 }
 
 output {
 if [@metadata][_id] {
@@ -110,7 +131,9 @@ output {
 }
 }
 ----
-<1> Sets the `document_id` field to the value stored in `@metadata._id`.
+<1> Sets the `document_id` field in the
+{logstash-ref}/plugins-outputs-elasticsearch.html[{es} output] to the value
+stored in `@metadata._id`.
 
 When {es} indexes the document, it sets the document ID to the specified value,
 preserving the ID passed from {beats}. 
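The behavior these patches document — a stable, client-supplied document ID turning a retried delivery into an overwrite instead of a duplicate — can be sketched in a few lines of Python. This is a simplified illustration, not the Beats implementation: the SHA-256-over-joined-fields encoding and the field names `host` and `message` are assumptions standing in for what the `fingerprint` processor actually does.

```python
import hashlib

def fingerprint(event, fields):
    # Derive a stable ID from selected event fields, in the spirit of
    # the fingerprint processor (the real hash/encoding may differ).
    joined = "|".join(str(event[f]) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# A plain dict stands in for an Elasticsearch index: indexing a
# document under an existing ID overwrites rather than appends.
index = {}

def deliver(event):
    doc_id = fingerprint(event, ["host", "message"])
    index[doc_id] = event

event = {"host": "web-1", "message": "GET /health 200"}
deliver(event)
deliver(event)  # redelivered after a lost acknowledgement

print(len(index))  # one document, not two
```

Because the ID is a pure function of the event's content, any number of redeliveries converge on a single stored document — the same property the `@metadata._id` field gives you in {es}.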
From 407a6918f9a0dc47bb94b34bdc770829fbbe288e Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 4 Feb 2020 17:02:54 -0800 Subject: [PATCH 6/7] Fix typo in processor name --- libbeat/docs/shared-deduplication.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libbeat/docs/shared-deduplication.asciidoc b/libbeat/docs/shared-deduplication.asciidoc index 46fab7f8714..1f8ab85385c 100644 --- a/libbeat/docs/shared-deduplication.asciidoc +++ b/libbeat/docs/shared-deduplication.asciidoc @@ -61,7 +61,7 @@ processors: target_field: "@metadata._id" ---- -* *`decode_json_field` processor* +* *`decode_json_fields` processor* + Use the `document_id` setting in the <> processor when you're decoding a JSON string that contains a natural key field. From cf4fe6dd81e5a939c11ab30bee62a1fba10699b8 Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 4 Feb 2020 17:57:04 -0800 Subject: [PATCH 7/7] Update notice --- NOTICE.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/NOTICE.txt b/NOTICE.txt index e8412d03955..14bcd6878c4 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -1,5 +1,5 @@ Elastic Beats -Copyright 2014-2019 Elasticsearch BV +Copyright 2014-2020 Elasticsearch BV This product includes software developed by The Apache Software Foundation (http://www.apache.org/).
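The failure mode described under "What causes duplicates?" — at-least-once redelivery combined with server-assigned IDs — can be simulated end to end with a toy index. This is a hypothetical model for illustration only; `FakeIndex`, `deliver_at_least_once`, and the ID `"web-1-0"` are invented names, not part of any Beats or {es} API.

```python
import uuid

class FakeIndex:
    """Toy stand-in for an index: when the client supplies no ID,
    every delivery is stored as a brand-new document."""
    def __init__(self):
        self.docs = {}

    def index(self, event, doc_id=None):
        self.docs[doc_id or uuid.uuid4().hex] = dict(event)

def deliver_at_least_once(idx, event, doc_id=None):
    # The first attempt reaches the index, but the acknowledgement
    # is lost, so the sender retries and the event arrives twice.
    idx.index(event, doc_id)
    idx.index(event, doc_id)

event = {"message": "disk full on web-1"}

auto = FakeIndex()
deliver_at_least_once(auto, event)               # server-assigned IDs
print(len(auto.docs))                            # duplicated

stable = FakeIndex()
deliver_at_least_once(stable, event, "web-1-0")  # client-supplied ID
print(len(stable.docs))                          # overwritten
```

With server-assigned IDs the retry lands as a second document; with a stable client-supplied ID the retry overwrites the first copy, which is exactly the trade the `add_id`, `fingerprint`, and `json.document_id` approaches make.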