Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async producer retries for failed messages #331

Merged
merged 22 commits into from Jun 4, 2015

Conversation

vshlapakov
Copy link
Contributor

Another approach to add retries for failed messages (for asynchronous mode only).

  1. I've extended ProduceRequest a little bit by adding complete retries count to the fields.
    It's backward compatible, you can miss this param when creating initial ProduceRequest, it's 0 initially.
    (It's not very clean solution in terms of code, but storing retries count per request sounds reasonable.)
  2. I've added several retry params for producers also:
    • batch_retry_backoff_ms - backoff timeout to wait before next retry (300 by default);
    • batch_retries_limit - upper limit of retries per request
      (5 by default, None means infinite retries, 0 means no retries).
  3. I've used FailedPayloadsException from Improve fault tolerance by handling leadership election and other metadata changes #55 to get failed requests and increment retries for it, or drop it depending on settings.

What do you think about the approach? I'm going to add some additional tests for it asap.

@dpkp dpkp added the producer label Mar 3, 2015
@dpkp
Copy link
Owner

dpkp commented Mar 9, 2015

There are several exceptions that could be raised within the send_produce_request() call chain besides FailedPayloadsError:

NotLeaderForPartitionError -- metadata is likely out of date and should be refreshed before retry
LeaderNotAvailableError -- leader election is underway, backoff + refresh metadata before retry
UnknownTopicOrPartitionError -- broker is neither a leader or a replica, possible that a metadata refresh and retry would fix, but also possible that topic just doesnt exist and auto-topic-creation is not enabled.
ConnectionError -- socket / connection error, backoff + refresh metadata + retry
KafkaUnavailableError -- all broker nodes are offline... backoff
RequestTimedOutError -- broker was unable to get all ACKs specified by the Produce request w/i the required timeout. If the user is ok w/ possible duplicate messages, a retry is fine.
Then also a bunch of message related errors: InvalidMessage, InvalidMessageSize, MessageSizeTooLarge -- which generally should not be retried

@vshlapakov
Copy link
Contributor Author

Thanks a lot for the clarifications, it seems that we need more smart retrying logic, I'll think about it. So far I've updated the code a little bit to retry all failed requests without exclusions.

@dpkp
Copy link
Owner

dpkp commented Mar 24, 2015

I dont think we can do blanket retries. Many of the errors require resyncing cluster metadata before retry. The error handling logic should handle that.

@dpkp dpkp added this to the 0.9.4 Release milestone Mar 24, 2015
@vshlapakov vshlapakov closed this Apr 21, 2015
@vshlapakov vshlapakov deleted the feature-producer-retries branch April 21, 2015 07:21
@vshlapakov vshlapakov restored the feature-producer-retries branch April 21, 2015 07:21
@vshlapakov vshlapakov reopened this Apr 21, 2015
@vshlapakov
Copy link
Contributor Author

Sorry, removed/restored the branch by accident, I'm on this issue.

@vshlapakov vshlapakov force-pushed the feature-producer-retries branch 2 times, most recently from 400f10e to 59faeca Compare April 21, 2015 08:17
@vshlapakov vshlapakov changed the title Async producer retries for failed messages Async producer retries for failed messages [WIP] Apr 21, 2015
@vshlapakov vshlapakov changed the title Async producer retries for failed messages [WIP] [WIP] Async producer retries for failed messages Apr 21, 2015
@vshlapakov vshlapakov changed the title [WIP] Async producer retries for failed messages Async producer retries for failed messages Apr 21, 2015
@vshlapakov
Copy link
Contributor Author

As this PR is directly related with async producer and retries, I also added an ability to limit async queue size (used some commits from #283 & #304 - look like abandoned).

@vshlapakov
Copy link
Contributor Author

@dpkp I'm still having some odd errors from Travis, but could you check the approach please? I'm not sure if we're fine with having a named tuple as an argument for a producer.

except Exception:
log.exception("Unable to send message")

except FailedPayloadsError as ex:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that #366 is merged, we can pass fail_on_error=False to client.send_produce_request and get a list of responses/errors. This would allow us to loop through the responses and only retry the reqs that failed. Currently this exception would cause all reqs in the batch to retry, even if only a single req caused the error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for pointing me, I missed that PR!

@vshlapakov
Copy link
Contributor Author

@dpkp Thanks for all your comments, they're extremely helpful, I'll try to fix everything asap.

@vshlapakov
Copy link
Contributor Author

Rebased and prepared a draft for response.error check.

@dpkp
Copy link
Owner

dpkp commented Jun 3, 2015

awesome! will take a look shortly!

@dpkp
Copy link
Owner

dpkp commented Jun 4, 2015

There are a handful of issues remaining -- need to reenable the test_switch_leader_async test in test/test_failover_integration.py and then fix it along w/ the lint issues failing travis build.

But you've put in a ton of work and I think it is basically there, so I am going to merge and push a few fixes directly. Thanks so much for working on this one!

dpkp added a commit that referenced this pull request Jun 4, 2015
Async producer retries for failed messages
@dpkp dpkp merged commit 474aeaa into dpkp:master Jun 4, 2015
@dpkp dpkp mentioned this pull request Jun 7, 2015
@vshlapakov
Copy link
Contributor Author

Cool, thank you @dpkp! If I have some additional time, I'll fix other issues (like additional tests and docstrings) as well in a separate PR.

@dpkp
Copy link
Owner

dpkp commented Jun 7, 2015

Thanks again for your help on this. I put together some edits in #388 (merged yesterday).

@vshlapakov vshlapakov deleted the feature-producer-retries branch June 10, 2015 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants