
Should count as success if any records shipped (?) #42

Closed
wryun opened this issue Nov 5, 2015 · 8 comments

wryun commented Nov 5, 2015

Consider this situation: a shard is close to saturated. One fluentd node gets unlucky and has to retry a few times (this can happen just because of poor distribution). Because it's retrying, the next chunk to ship is much bigger (i.e. up to the capacity of put_records).

Now that it has a bigger chunk, it's much more likely to fail to ship (more records, and more likely to hit capacity), so it will need even more retries to succeed, and because of the exponential back-off it will take a long time to do so. In other words, the throughput of this particular fluentd node will be much lower than that of the others, because the more records you try to ship in a single put_records attempt, the longer you will spend backing off.

Suggested fix: set retry_count to 0 if any records were pushed, which means the sleep will be ~0.5 secs (the smallest possible sleep).

Also, it would be great to log how many records were pushed at this point (and how many were attempted).
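
For illustration, here's a rough Ruby sketch (not the plugin's actual code) of a retry loop with both suggestions applied: it resets the backoff whenever any records shipped, and logs how many were pushed versus attempted. The method and parameter names are made up for this example.

```ruby
# Sketch only: names are illustrative, not the plugin's real API.
# `put_records_call` is assumed to take an array of records and return the
# subset that failed to ship.
def ship_with_retries(records, put_records_call, max_retries: 10, base_sleep: 0.5)
  retry_count = 0
  until records.empty? || retry_count > max_retries
    failed  = put_records_call.call(records)
    shipped = records.size - failed.size
    puts "put_records: shipped #{shipped}/#{records.size} records"

    # The suggestion: any shipped record counts as success, so reset the
    # backoff and retry the leftovers almost immediately.
    retry_count = shipped.positive? ? 0 : retry_count + 1
    records = failed
    sleep(base_sleep * 2**retry_count) unless records.empty?
  end
  records # anything left over would go back to fluentd's buffer for retry
end
```

Resetting retry_count on partial success keeps the sleep near the minimum as long as the shard is draining at all, which is the point of the suggested fix.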


wryun commented Nov 7, 2015

(This is more of a problem in our use case because a single node always ships to the same shard, i.e. a single PutRecords request, which can carry up to 5 MB, is guaranteed to run up against the 1 MB/sec shard limit!)


wryun commented Nov 10, 2015

Actually, in some ways this is worse if you use a random partition key. Because every PutRecords request now covers all the possible shards, it will fail if you have any hot shard.


wryun commented Nov 10, 2015

(and then back off)


riywo commented Jan 30, 2016

Basically, this plugin is designed to avoid losing any input data as much as possible, using the fluentd buffer system.

I think we can add an option to choose the strategy: "at least once" (current) or "at most once" (your suggestion). So I'll keep this issue open, but I haven't started coding yet. For more throughput, KPL support in v1.0.0 would be a better way.


wryun commented Jan 30, 2016

I'm not suggesting that it should drop the records; I'm just suggesting it should be more aggressive with the retries. Say my backoff was 2 minutes, and I have 400 records to ship. With 0.4.0 (I haven't checked the latest), if 399 succeed, the plugin backs off for 4 minutes to ship the last record! Instead, it should consider any shipped records a success in terms of retries and stop backing off. So instead of waiting 4 minutes to ship that single record, you do it immediately.

Example of how it all goes wrong:

  • shipping 10 records every 10 secs happily
  • shard becomes busy, starts backing off; takes a few retries to ship 10
  • now the next chunk is much bigger, say 100 records, so there is less chance that all of them fit, and it will retry for longer
  • this continues until chunks are the maximum size (5000 records), at which point even on a fully recovered stream you're likely to fail, since you're trying to ship so many records at once
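
To make the timing concrete, here's a rough sketch of how the sleep grows per consecutive failed attempt, assuming a simple base * 2**retry_count backoff with a ~0.5 s base (the plugin's exact formula may differ):

```ruby
# Assumed backoff formula, for illustration only.
base = 0.5
(0..9).each do |retry_count|
  printf("retry %d: sleep %.1f s\n", retry_count, base * 2**retry_count)
end
# By the 9th retry the sleep is already over four minutes; without a reset
# on partial success, a chunk that keeps partially failing waits longer and
# longer to ship its remaining records.
```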


wryun commented Jan 30, 2016

(and there is no way to cap the maximum backoff like there is with the standard fluentd retry mechanism)

@riywo riywo added this to the v1.0.0 milestone Mar 14, 2016
@riywo riywo removed the help wanted label Mar 14, 2016

riywo commented Mar 15, 2016

For v1.0.0, I added three more parameters to handle this scenario. I also reduced the minimum sleep to about 0.3 s, because the AWS SDK for Ruby uses 0.3 as the scaling factor.

reset_backoff_if_success

Boolean, default true. If enabled, each retry checks the number of records that succeeded in the previous batch request and resets the exponential backoff if there was any success. Because a batch request can be composed of records going to different shards, simple exponential backoff for the whole batch request doesn't work in some cases.

batch_request_max_count

Integer, default 500. The maximum number of records in a batch request built from a record chunk. It can't exceed the default value because of the API limit.

batch_request_max_size

Integer, default 5 * 1024 * 1024. The maximum size in bytes of a batch request built from a record chunk. It can't exceed the default value because of the API limit.
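
For reference, a minimal fluentd match section using the three parameters might look like the sketch below. The output type, stream name, and region are assumptions for illustration only, and the parameter values shown are just the defaults described above:

```
<match kinesis.**>
  # Output type and destination are assumptions, not taken from this thread.
  type kinesis_streams
  stream_name my-stream
  region us-east-1

  # v1.0.0 parameters described above, shown at their default values.
  reset_backoff_if_success true
  batch_request_max_count 500
  batch_request_max_size 5242880
</match>
```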

@riywo riywo closed this as completed Mar 15, 2016

wryun commented Jun 5, 2016

Thanks!
