elasticsearch output plugin: rejected docs in bulk indexing partial failure are silently lost #1631

Closed
colinsurprenant opened this Issue Aug 14, 2014 · 13 comments

@colinsurprenant
Contributor

colinsurprenant commented Aug 14, 2014

If the bulk indexing request returns a 200 OK but contains partial failures, these rejected docs are silently lost.

Some documents in a bulk request can fail indexing with an "EsRejectedExecutionException[rejected execution (queue capacity XX)" error, but the bulk request response still returns a 200 success. Here's an example of such a response:

{"took":5324,"errors":true,"items”:[
  {"create":{"_index":"logstash 2014.08.11.16","_type":"logs","_id":"HwBm8IAZT1ycsdJidIPlOQ","_version":1,"status":201}},
  {"create":{"_index":"logstash-2014.08.11.16","_type":"logs","_id":"i6Xy10NUTyyd4V-ns Dt0w","_version":1,"status":201}},
  …
  {"create":{"_index":"logstash 2014.08.11.16","_type":"logs","_id":"4RjTBGGhSaCVNSiMFyf9PA","status":503,"error":"RemoteTransportException[[elasticsearch-d6.pardot.com][inet[/10.60.57.163:9300]][bulk/shard]]; nested: EsRejectedExecutionException[rejected execution (queue
capacity 200) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@3da71c15]; "}},
  …
]}

This could be handled with a retry queue plus an exponential backoff, with a time & size limit after which documents could be routed to a "rejections" file output, for example. Doing this would handle temporary hiccups and would avoid losing trace of rejected documents. Users could then easily re-feed the rejected documents once the situation is resolved.
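
To make the idea concrete, here is a minimal Ruby sketch of that flow, not actual plugin code: bulk_send is a hypothetical helper standing in for however the output submits a bulk request, and the retry/backoff limits are arbitrary.

require "json"

# Hypothetical helper: submits the given bulk actions and returns the parsed
# response hash ({"errors" => true/false, "items" => [...]}), as in the example above.
def bulk_send(actions)
  # ... perform the _bulk request and JSON.parse the body ...
end

def index_with_retries(actions, max_attempts: 5, rejections_path: "rejections.jsonl")
  attempt = 0
  while !actions.empty? && attempt < max_attempts
    response = bulk_send(actions)
    break unless response["errors"]        # everything was accepted

    # Keep only the actions whose matching response item reports a failure;
    # _bulk returns one item per action, in order.
    failed = []
    actions.zip(response["items"]).each do |action, item|
      failed << action if item.values.first["status"] >= 300
    end
    actions = failed

    attempt += 1
    sleep(2 ** attempt)                    # exponential backoff between retries
  end

  # Whatever is still failing after the limits is written to a "rejections" file
  # so it can be re-fed later instead of being silently dropped.
  unless actions.empty?
    File.open(rejections_path, "a") do |f|
      actions.each { |action| f.puts(action.to_json) }
    end
  end
end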

@colinsurprenant
Contributor

colinsurprenant commented Aug 14, 2014

Actually, handling of the failed documents could be much simpler: simply add the failed documents to the next bulk request (while still honouring the max bulk size), or have the next bulk request contain only the failed documents. This avoids having to handle a parallel process, and it is compatible with our implicit back-pressure mechanism with the sized queues.

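A rough sketch of that variant (buffer and helper names are invented here, not the plugin's real internals): documents rejected by the previous bulk are pushed onto a carry-over buffer and drained first on the next flush, so the configured bulk size is still respected.

# Sketch only: @retry_buffer, bulk_send and flush_size are hypothetical stand-ins.
def next_batch(new_actions, flush_size)
  @retry_buffer ||= []
  batch = @retry_buffer.shift(flush_size)              # previously rejected docs go first
  batch + new_actions.take(flush_size - batch.size)    # top up with fresh events
end

def remember_failures(batch, response)
  return unless response["errors"]
  batch.zip(response["items"]).each do |action, item|
    # Which failures deserve a retry is discussed further below (429/503 vs. mapping errors).
    (@retry_buffer ||= []) << action if item.values.first["status"] >= 300
  end
end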

jordansissel added this to the v1.5.0 milestone Aug 14, 2014

jordansissel added bug and removed enhancement labels Aug 14, 2014

@jordansissel
Contributor

jordansissel commented Aug 14, 2014

+1. This was my mistake assuming '200 OK' from Elasticsearch meant the request was successful.

This probably impacts the transport and node protocols as well, as I'm guessing no exception is thrown in this condition.

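In other words, even a 200 response has to be unpacked per item before declaring victory. A small sketch, assuming the parsed JSON body of the bulk response:

# A 200 OK only means the bulk request was accepted as a whole; each action
# reports its own status, so failures have to be collected per item.
def failed_items(response)
  return [] unless response["errors"]
  response["items"].select { |item| item.values.first["status"] >= 300 }
end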

@colinsurprenant
Contributor

colinsurprenant commented Aug 15, 2014

True, we need to verify what the node/transport client returns upon partial failure. I assume it will be a success result, so we'll need to introspect the response structure, same as with HTTP, to identify retryable documents.

For the record: with HTTP, as was discussed/confirmed internally, the bulk partial-failure result codes that should be retried are 429 and 503.

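A sketch of that classification, with 429/503 treated as retryable and everything else (e.g. 400 mapping errors) treated as a hard failure:

RETRYABLE_CODES = [429, 503]   # transient "queue capacity" style rejections

# Split the per-item bulk results into actions worth retrying and hard failures.
def partition_failures(actions, response)
  retryable, failed = [], []
  actions.zip(response["items"]).each do |action, item|
    status = item.values.first["status"]
    next if status < 300                                     # indexed fine
    (RETRYABLE_CODES.include?(status) ? retryable : failed) << action
  end
  [retryable, failed]
end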

@jordansissel
Contributor

jordansissel commented Aug 15, 2014

This perhaps is getting into what really constitutes a "permanent failure" but, imo, mapping failures can be repaired in Elasticsearch and a retry may succeed in the future. Mapping errors return code 400 in ES 1.3.1:

% curl -D- -XPOST localhost:9200/foo/bar/_bulk?pretty -d '
{ "index": { "_index": "foo", "_type": "bar" } }
{ "foo": "foo" }
'
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 315

{
  "took" : 8,
  "errors" : true,
  "items" : [ {
    "create" : {
      "_index" : "foo",
      "_type" : "bar",
      "_id" : "uPLP2AJ4Ti2xpRxDMXU_Tg",
      "status" : 400,
      "error" : "MapperParsingException[failed to parse [foo]]; nested: NumberFormatException[For input string: \"foo\"]; "
    }
  } ]
}

@colinsurprenant
Contributor

colinsurprenant commented Aug 15, 2014

Well, this is obviously a gray zone, but for me a mapping error is clearly a case of non-transient error: if you automatically retry, it will fail systematically until the mapping is repaired, which is typically when a real human, who usually drinks lots of coffee, fixes the template. But for the 429 and 503 "queue capacity" errors, retrying will likely succeed. Granted, it may point at a "permanent" problematic condition on your cluster that will also require the action of a coffee drinker, in which case logging and alerting should help bootstrap the correct sequence of drinking coffee + solving the problem.


@colinsurprenant
Contributor

colinsurprenant commented Aug 15, 2014

will bots ever drink coffee?


suyograo self-assigned this Sep 3, 2014

@jordansissel
Contributor

jordansissel commented Oct 4, 2014

Based on some discussion off-ticket, I think the consensus is:

  • The current behavior is terrible and a bug
  • We should retry only failed documents
  • We should never retry Mapping failures (because they practically cannot be resolved externally)
  • We should add a dead letter queue to store these 'never retry' documents
  • We should add a 'dead letter' input that allows processing of these dead/failed documents

Other thoughts: a dead letter queue is a significant behavior change from the past, where Logstash generally stalls (maybe forever) on delivery failures. @colinsurprenant's recent work on a disk-persistent queue will be useful here.

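A very rough sketch of the dead-letter idea from the list above (file layout and names are invented here, not a committed design): never-retry documents are appended one JSON object per line, so a future 'dead letter' input could simply replay the file.

require "json"

# Hypothetical dead letter queue: append events that will never be retried
# (e.g. mapping failures) so a dedicated input can re-process them later.
class DeadLetterQueue
  def initialize(path)
    @path = path
  end

  def push(action, reason)
    File.open(@path, "a") do |f|
      f.puts({ "reason" => reason, "action" => action }.to_json)
    end
  end
end

dlq = DeadLetterQueue.new("dead_letters.jsonl")
dlq.push({ "index" => { "_index" => "foo", "_type" => "bar" } }, "MapperParsingException")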

@jordansissel
Contributor

jordansissel commented Oct 6, 2014

I may have been over-zealous in my judgement of consensus based on some offline discussion with @colinsurprenant.

We'll discuss this more this week and come up with a solution.


jordansissel assigned talevy and unassigned suyograo Oct 20, 2014

@talevy
Contributor

talevy commented Oct 21, 2014

Do we know if we want to set a specific max-retries so that we do not block on consistent failures when bulk-sending the events?

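For example (sketch only; max_retries is an illustrative option name, and submit_bulk stands in for whatever actually performs the request), the retry loop could simply give up after a configurable number of attempts instead of blocking forever:

TransientBulkError = Class.new(StandardError)

# Hypothetical sender: raises TransientBulkError on 429/503-style rejections.
def submit_bulk(actions)
  # ... send the bulk request ...
end

def send_with_limit(actions, max_retries: 3)
  attempts = 0
  begin
    submit_bulk(actions)
  rescue TransientBulkError
    attempts += 1
    retry if attempts <= max_retries
    # out of retries: fall through so the caller can dead-letter what is left
  end
end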

@talevy
Contributor

talevy commented Oct 21, 2014

Here is an initial stub of how this could possibly be tackled. Still not entirely sure how to test these scenarios.

talevy@6d86e8e

To handle the errors (e.g. 429) that were referenced above, we would have to bump the version of the ES jar packaged with Logstash. This error code was introduced after v1.1.


@colinsurprenant
Contributor

colinsurprenant commented Oct 24, 2014

@talevy I commented on your commit.


@talevy
Contributor

talevy commented Oct 27, 2014

Pending #1777, FTW will be no more. So a normalization of the bulk response for both the Node and HTTP (es-ruby) clients will be done, so that the es-output can extract which actions did not succeed.

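A sketch of what that normalization could expose to the output (the method name is hypothetical, not the actual client API): whichever client produced the response, each bulk item is reduced to a uniform [status, action] pair.

# `response` is assumed to be in the HTTP _bulk shape ({"errors" => ..., "items" => [...]});
# a Node/transport response would be converted into the same shape by the client wrapper.
def normalize_bulk_response(actions, response)
  actions.zip(response["items"]).map do |action, item|
    [item.values.first["status"], action]
  end
end

# The output can then pick out what did not succeed, regardless of client:
#   failures = normalize_bulk_response(actions, response).select { |status, _| status >= 300 }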

@talevy
Contributor

talevy commented Jan 14, 2015
