Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry committing for transient exceptions #1584

Merged
merged 2 commits into from
Nov 28, 2022

Conversation

omeraha
Copy link
Contributor

@omeraha omeraha commented Nov 18, 2022

Background

Offset commits might fail due to timeout on the Kafka broker, in which case the broker returns an error stating the operation timed out. Timeouts are transient in nature, so it might be worth to retry the commit.

Changes

  • Attempt to retry a commit that failed due to TimeoutException exception.
  • Add timeout exception scenario to the commit retry test.

References #1111

@lightbend-cla-validator

Hi @omeraha,

Thank you for your contribution! We really value the time you've taken to put this together.

Before we proceed with reviewing this pull request, please sign the Lightbend Contributors License Agreement:

https://www.lightbend.com/contribute/cla

@@ -588,7 +588,7 @@ import scala.util.control.NonFatal
progressTracker.committed(offsets)
replyTo.foreach(_ ! Done)

case e: RebalanceInProgressException => retryCommits(duration, e)
case e @ (_: RebalanceInProgressException | _: TimeoutException) => retryCommits(duration, e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your comment in #266 (comment) that this could handle all kinds of RetriableException here instead. That would even catch RetriableCommitFailedException below.
I assume most of the subclasses to RetriableException can not happen just here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ennru I'll fix it and commit. I'll also take a look at the classes that (currently) extend RetriableException. But it should be safe to retry all of them inside a callback on the commit result (as this means they must have been thrown during the commit itself)

Copy link
Contributor Author

@omeraha omeraha Nov 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ennru The following (specific) exceptions will be handled if we decide to go for RetriableException:

  1. CoordinatorLoadInProgressException
  2. UnknownTopicOrPartitionException
  3. CoordinatorNotAvailableException
  4. NotCoordinatorException
  5. TimeoutException

While TimeoutException and the CoordinatorLoadInProgressException are clearly safe to retry (we've seen CoordinatorLoadInProgressException in our services a couple of times), the other three are a bit more tricky, for instance: the UnknownTopicOrPartitionException comment states that

This exception is used in contexts where a topic doesn't seem to exist based on possibly stale metadata.
This exception is retriable because the topic or partition might subsequently be created.

Which in our case might lead to endless retires (as currently retryCommits() doesn't have a limit on the number of retries, if I understand the code correctly).
I've added CoordinatorLoadInProgressException to the code, and think that we should consider adding additional exceptions upon request from the community

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking closely at those sub-exceptions. It was the endless retrying I was afraid of, stale metadata sounds like a reason to try, but if the topic or partition is never created we it will loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ennru Thanks. Anything else on my side?

Copy link
Member

@ennru ennru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ennru ennru changed the title retry committing an offset if previous commit attmept timed out retry committing for transient exceptions Nov 28, 2022
@ennru ennru merged commit 62373b6 into akka:master Nov 28, 2022
@ennru
Copy link
Member

ennru commented Nov 28, 2022

Thank you for improving Alpakka Kafka!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants