
Should count as success if any records shipped (?) #42

Closed
wryun opened this issue Nov 5, 2015 · 8 comments

wryun commented Nov 5, 2015

Consider this situation: a shard is close to saturated. One fluentd node gets unlucky and has to retry a few times (this can happen just because of poor distribution). Because it's retrying, the next chunk to ship is much bigger (i.e. up to the capacity of put_records).

Now that it has a bigger chunk, it's much more likely to fail to ship (more records, and more likely to hit capacity), so it will need even more retries to succeed, and because of the exponential back-off it will take a long time to do so. In other words, the throughput of this particular fluentd node will be much lower than that of the others, because the more records you try to ship in a single put_records attempt, the longer you will spend backing off.

Suggested fix: set retry_count to 0 if any records were pushed, which means the sleep will be ~0.5 secs (the smallest possible sleep).

Also, it would be great to log how many records were pushed at this point (and how many were attempted).
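
For illustration, here's a rough Ruby sketch (not the plugin's actual code) of a retry loop with both suggestions applied: it resets the backoff whenever any records shipped, and logs how many were pushed versus attempted. The method and parameter names are made up for this example.

```ruby
# Sketch only: names are illustrative, not the plugin's real API.
# `put_records_call` is assumed to take an array of records and return the
# subset that failed to ship.
def ship_with_retries(records, put_records_call, max_retries: 10, base_sleep: 0.5)
  retry_count = 0
  until records.empty? || retry_count > max_retries
    failed  = put_records_call.call(records)
    shipped = records.size - failed.size
    puts "put_records: shipped #{shipped}/#{records.size} records"

    # The suggestion: any shipped record counts as success, so reset the
    # backoff and retry the leftovers almost immediately.
    retry_count = shipped.positive? ? 0 : retry_count + 1
    records = failed
    sleep(base_sleep * 2**retry_count) unless records.empty?
  end
  records # anything left over would go back to fluentd's buffer for retry
end
```

Resetting retry_count on partial success keeps the sleep near the minimum as long as the shard is draining at all, which is the point of the suggested fix.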


wryun commented Nov 7, 2015

(This is more of a problem in our use case because a single node always ships to the same shard, i.e. a single PutRecords request, which can carry up to 5 MB, is guaranteed to run up against the 1 MB/sec shard limit!)


wryun commented Nov 10, 2015

Actually, in some ways this is worse if you use a random partition key. Because every PutRecords request now covers all the possible shards, it will fail if you have any hot shard.


wryun commented Nov 10, 2015

(and then back off)


riywo commented Jan 30, 2016

Basically, this plugin is designed to avoid losing any input data as much as possible, using the fluentd buffer system.

I think we can add an option to choose the strategy: "at least once" (current) or "at most once" (your suggestion). So I'll keep this issue open, but I haven't started coding yet. For more throughput, KPL support in v1.0.0 would be a better way.


wryun commented Jan 30, 2016

I'm not suggesting that it should drop the records; I'm just suggesting it should be more aggressive with the retries. Say my backoff was 2 minutes, and I have 400 records to ship. With 0.4.0 (I haven't checked the latest), if 399 succeed, the plugin backs off for 4 minutes to ship the last record! Instead, it should consider any shipped records a success in terms of retries and stop backing off. So instead of waiting 4 minutes to ship that single record, you do it immediately.

Example of how it all goes wrong:

  • shipping 10 records every 10 secs happily
  • shard becomes busy, starts backing off; takes a few retries to ship 10
  • now the next chunk is much bigger, say 100 records, so there is less chance that all of them fit, and it will retry for longer
  • this continues until chunks are the maximum size (5000 records), at which point even on a fully recovered stream you're likely to fail, since you're trying to ship so many records at once
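
To make the timing concrete, here's a rough sketch of how the sleep grows per consecutive failed attempt, assuming a simple base * 2**retry_count backoff with a ~0.5 s base (the plugin's exact formula may differ):

```ruby
# Assumed backoff formula, for illustration only.
base = 0.5
(0..9).each do |retry_count|
  printf("retry %d: sleep %.1f s\n", retry_count, base * 2**retry_count)
end
# By the 9th retry the sleep is already over four minutes; without a reset
# on partial success, a chunk that keeps partially failing waits longer and
# longer to ship its remaining records.
```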


wryun commented Jan 30, 2016

(and there is no way to cap the maximum backoff like there is with the standard fluentd retry mechanism)

@riywo riywo added this to the v1.0.0 milestone Mar 14, 2016
@riywo riywo removed the help wanted label Mar 14, 2016

riywo commented Mar 15, 2016

For v1.0.0, I added three more parameters to handle this scenario. I also reduced the minimum sleep to about 0.3 s, because the AWS SDK for Ruby uses 0.3 as the scaling factor.

reset_backoff_if_success

Boolean, default true. If enabled, each retry checks the number of records that succeeded in the previous batch request and resets the exponential backoff if there was any success. Because a batch request can be composed of records going to different shards, simple exponential backoff for the whole batch request doesn't work in some cases.

batch_request_max_count

Integer, default 500. The maximum number of records in a batch request built from a record chunk. It can't exceed the default value because of the API limit.

batch_request_max_size

Integer, default 5 * 1024 * 1024. The maximum size in bytes of a batch request built from a record chunk. It can't exceed the default value because of the API limit.
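
For reference, a minimal fluentd match section using the three parameters might look like the sketch below. The output type, stream name, and region are assumptions for illustration only, and the parameter values shown are just the defaults described above:

```
<match kinesis.**>
  # Output type and destination are assumptions, not taken from this thread.
  type kinesis_streams
  stream_name my-stream
  region us-east-1

  # v1.0.0 parameters described above, shown at their default values.
  reset_backoff_if_success true
  batch_request_max_count 500
  batch_request_max_size 5242880
</match>
```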

@riywo riywo closed this as completed Mar 15, 2016

wryun commented Jun 5, 2016

Thanks!
