elastic · dedemorton · Feb 5, 2020 · Dec 18, 2019 · Jan 10, 2020 · Jan 25, 2020
diff --git a/filebeat/docs/configuring-howto.asciidoc b/filebeat/docs/configuring-howto.asciidoc
@@ -30,6 +30,7 @@ The following topics describe how to configure Filebeat:
 * <<load-balancing>>
 * <<configuration-ssl>>
 * <<filtering-and-enhancing-data>>
+* <<{beatname_lc}-deduplication>>
 * <<configuring-ingest-node>>
 * <<{beatname_lc}-geoip>>
 * <<configuration-path>>
@@ -68,6 +69,8 @@ include::{libbeat-dir}/shared-ssl-config.asciidoc[]
 
 include::./filebeat-filtering.asciidoc[]
 
+include::{libbeat-dir}/shared-deduplication.asciidoc[]
+
 include::{libbeat-dir}/shared-config-ingest.asciidoc[]
 
 include::{libbeat-dir}/shared-geoip.asciidoc[]

@@ -0,0 +1,109 @@
+[id="{beatname_lc}-deduplication"]
+== Data deduplication
+
+The {beats} framework guarantees at-least-once delivery to ensure that no data
+is lost when events are sent to {es}. This works well when every event is
+acknowledged on the first attempt. But if {beatname_uc} shuts down during
+processing, or the connection is lost before events are acknowledged, you can
+end up with duplicate data in {es}.
+
+[float]
+=== What causes duplicates?
+
+When an output is blocked, the retry mechanism in {beatname_uc} attempts to
+resend events until they are acknowledged by the output. If the output receives
+the events, but is unable to acknowledge them, the data might be sent to {es}
+multiple times. Because document IDs are typically set by {es} _after_ it
+receives the data from {beats}, the duplicate events are indexed as new data.
+
+[float]
+=== How can I avoid duplicates?
+
+Rather than allowing {es} to set the document ID, set the ID in {beats}. The ID
+is stored in the {beats} `@metadata.id` field and used to set the document ID
+during indexing. That way, if {beats} sends the same event to {es} more than
+once, {es} overwrites the existing document rather than creating a new one.
+
+The `@metadata.id` field is passed along with the event so that you can use
+it to set the document ID later in your processing pipeline, for example,
+in {ls}. See <<ls-doc-id>>. 
+
+There are several ways to set the document ID in {beats}:
+
+* *`add_id` processor*
++
+Use the <<add-id,`add_id`>> processor when your logs have no natural key field,
+and you can’t derive a unique key from existing fields. 
++
+This example generates a unique ID for each event and adds it to the
+`@metadata.id` field:
++
+[source,yaml]
+----
+processors:
+  - add_id: ~
+----
+
+* *`fingerprint` processor*
++
+Use the <<fingerprint,`fingerprint`>> processor to derive a unique key from
+multiple existing fields.
++
+This example uses the values of `field1` and `field2` to derive a unique key
+that it adds to the `@metadata.id` field:
++
+[source,yaml]
+----
+processors:
+  - fingerprint:
+      fields: ["field1", "field2"]
+      target_field: "@metadata.id"
+----
+
+* *JSON input settings*
++
+Use the `json.document_id` input setting if you’re ingesting JSON-formatted
+data, and the data has a natural key field.
++
+This example takes the value of `key1` from the JSON document and stores it in
+the `@metadata.id` field:
++
+[source,yaml]
+----
+filebeat.inputs:
+- type: log 
+  paths:
+    - /path/to/json.log
+  json.document_id: "key1"
+----
+
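+The `fingerprint` example shown earlier relies on the processor's defaults. As
+a sketch only, the following variant spells out the hash method and encoding.
+The field names `field1` and `field2` are placeholders carried over from that
+example, and the option values shown are assumed to match the defaults. Check
+the <<fingerprint,`fingerprint`>> reference for the options that your version
+supports:
+
+[source,yaml]
+----
+processors:
+  - fingerprint:
+      fields: ["field1", "field2"]   # placeholder field names
+      target_field: "@metadata.id"
+      method: "sha256"               # assumed default hash method
+      encoding: "hex"                # assumed default encoding
+      ignore_missing: false          # assumed default: a missing field is an error
+----
+
+Choose fields whose combined values identify exactly one event. If two distinct
+events produce the same fingerprint, they share a document ID, and {es}
+overwrites the first document with the second.
+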
+[float]
+[[ls-doc-id]]
+=== {ls} pipeline example
+
+For this example, assume that you've used one of the approaches described
+earlier to store the document ID in the {beats} `@metadata.id` field. In
+the {logstash-ref}/plugins-outputs-elasticsearch.html[{es} output], set
+the `document_id` option based on the `@metadata.id` field:
+
+[source,json]
+----
+input {
+  beats {
+    port => 5044
+  }
+}
+
+output {
+  elasticsearch {
+    hosts => ["http://localhost:9200"]
+    document_id => "%{[@metadata][id]}" <1>
+    index => "%{[@metadata][beat]}-%{[@metadata][version]}" 
+  }
+}
+----
+<1> Sets the `document_id` option to the value stored in `@metadata.id`.
+
+When {es} indexes the document, it sets the document ID to the specified value,
+preserving the ID passed from {beats}.
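+
+On the {beatname_uc} side, the matching configuration might look like the
+following sketch. It assumes that {ls} runs on `localhost` and listens on port
+5044, as in the pipeline above, and it uses the `add_id` processor to populate
+`@metadata.id`. Substitute whichever approach described earlier fits your data:
+
+[source,yaml]
+----
+processors:
+  - add_id: ~                  # generates a unique ID and stores it in @metadata.id
+
+output.logstash:
+  hosts: ["localhost:5044"]    # assumed host; the port matches the beats input above
+----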