
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread #6283

Merged
merged 5 commits into apache:trunk on Aug 13, 2019

Conversation

@pgwhalen (Contributor) commented Feb 18, 2019

  • When calling readLogToEnd(), the KafkaBasedLog worker thread should
    catch TimeoutException, which can occur if brokers are unavailable,
    and log a warning; otherwise the worker thread terminates.
  • Includes an enhancement to MockConsumer that allows simulating
    exceptions not just when polling but also when querying for offsets,
    which is necessary for testing the fix.
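The shape of the fix can be illustrated with a minimal, self-contained sketch (the class, method, and counter names below are hypothetical stand-ins, not the actual KafkaBasedLog code): the worker thread's read loop catches TimeoutException, logs a warning, and keeps going instead of dying.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerLoopSketch {
    // Stand-in for org.apache.kafka.common.errors.TimeoutException.
    static class TimeoutException extends RuntimeException {
        TimeoutException(String msg) { super(msg); }
    }

    static final AtomicInteger attempts = new AtomicInteger();

    // Stand-in for KafkaBasedLog.readToLogEnd(): time out twice, then
    // succeed, simulating brokers that are briefly unavailable.
    static void readToLogEnd() {
        if (attempts.incrementAndGet() <= 2)
            throw new TimeoutException("Failed to get offsets by times");
    }

    public static void main(String[] args) {
        boolean done = false;
        while (!done) {
            try {
                readToLogEnd();
                done = true; // reached the end of the log
            } catch (TimeoutException e) {
                // Before the fix this exception escaped and killed the worker
                // thread; with the fix it is logged and the read is retried.
                System.out.println("WARN Timeout while reading log to end: " + e.getMessage());
            }
        }
        System.out.println("attempts=" + attempts.get());
    }
}
```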

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread
 - When calling readLogToEnd(), the KafkaBasedLog worker thread should
 catch TimeoutException and log a warning, which can occur if brokers
 are unavailable, otherwise the worker thread terminates.
 - Includes an enhancement to MockConsumer that allows simulating
 exceptions not just when polling but also when querying for offsets,
 which is necessary for testing the fix.
@pgwhalen (Contributor, Author) commented Feb 18, 2019

This is my first contribution and I haven't exactly pored over the entirety of the Connect runtime, so I'll be curious to see if I'm missing something entirely.

A few things I'd like reviewers to note that I thought about myself:

  • Is WARN an appropriate log level, or should there even be a log at all?
  • Given that the problem is essentially "uncaught exception causes important thread to terminate," should we "catch and continue" for yet more exceptions to prevent similar issues?
  • More broadly - is there a way to alert on the health of Connect overall, perhaps through the REST interface? Maybe the biggest issue with the bug was that it was hard to detect: tasks still reported themselves as "running," meaning the only way to discover that the worker thread had died was by monitoring the expected output of our connectors (and noticing that there was none).
@pgwhalen (Contributor, Author) commented Mar 23, 2019

retest this please

@pgwhalen (Contributor, Author) commented Apr 25, 2019

Ping @mjsax @kkonstantine @wicknicks @rhauch (not totally sure, but I've seen these folks reviewing other connect PRs)

@rhauch (Contributor) left a comment

Thanks, @pgwhalen -- nice catch! I think the reason this was not considered an issue is that the internal poll(Integer.MAX_VALUE) call in readToEnd() should never time out. But as you pointed out, there are other method invocations, and timeouts on those should be handled as well.

I have a minor suggestion to help make the log message more readable, and (if you're in the code) another to move the two setter methods on MockConsumer together. Otherwise, this looks great.

@@ -312,6 +313,9 @@ public void run() {
try {
readToLogEnd();
log.trace("Finished read to end log for topic {}", topic);
} catch (TimeoutException e) {
log.warn("Timeout while reading log to end {}", e.getMessage());

@rhauch (Contributor) commented May 16, 2019

How about putting the topic name in the message (like on line 315), noting that this will be retried automatically, and maybe mentioning why this might happen?

Suggested change
log.warn("Timeout while reading log to end {}", e.getMessage());
log.warn("Timeout while reading log to end for topic '{}'. Retrying automatically. This may occur when brokers are unavailable or unreachable.", topic, e.getMessage());

@pgwhalen (Contributor, Author) commented May 17, 2019

Good idea, though it looks like this suggested change will exclude the exception message (it doesn't have another {} placeholder); I'll fix that.

public synchronized void setOffsetsException(KafkaException exception) {
this.offsetsException = exception;
}

@rhauch (Contributor) commented May 16, 2019

Nit: It's probably a bit better to have this be near the setPollException(...) method.

@pgwhalen (Contributor, Author) commented May 17, 2019

Agreed, not sure why I put it there.

@rhauch (Contributor) commented May 16, 2019

It is a bit surprising this hasn't come up more since AK 2.0 was released, but it would only occur when brokers become unavailable. It looks like KAFKA-6608 changed the consumer.endOffsets(...) methods in AK 2.0 to add an optional timeout parameter; the existing method that KafkaBasedLog was using now defaults to the value of the request.timeout.ms consumer (client) property, which itself defaults to 30 seconds. This introduced the possibility of a TimeoutException on these methods, which did not exist before AK 2.0.

I documented this on KAFKA-7941, but one workaround is to set the request.timeout.ms property in the Connect worker's configuration, which affects the worker's consumer used for offsets and other internal topics. Note that doing this will affect the producers for internal topics, too, and unless it is overridden via the worker's producer.request.timeout.ms or consumer.request.timeout.ms properties it will also apply to the producers and consumers used for connectors.
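As a concrete illustration of that workaround, the worker configuration might look like the following sketch (the values are hypothetical examples, not recommendations):

```properties
# connect-distributed.properties (illustrative values only)

# Raise the timeout used by the worker's own clients for offsets and
# other internal topics (the client default is 30000 ms):
request.timeout.ms=60000

# Unless overridden, the worker-level setting also flows to the clients
# used for connectors; pin those back explicitly if that is not desired:
producer.request.timeout.ms=30000
consumer.request.timeout.ms=30000
```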

This fix should be ported back to AK 2.0.

Code review tweaks
 - Enhance logging on TimeoutException
 - Move setOffsetsException method to sensible place in MockConsumer
@pgwhalen (Contributor, Author) commented May 17, 2019

Thanks for picking up the review, @rhauch!

I'm surprised it hasn't come up more as well, actually; my team experiences it often enough that we went to the trouble of adding some additional out-of-process monitoring specifically to catch this error, and we now restart our Connect instance nightly to get it back into a good state when it hits this. Regardless, glad we're fixing it now!

For backporting to 2.0, is that something I should help with by opening other PRs, or will you/others handle it while merging?

@rhauch (Contributor) commented May 23, 2019

@pgwhalen, my comment about backporting was for posterity. PRs should be based on trunk (like this one), and committers will backport to older branches when merging.

@rhauch (Contributor) left a comment

Thanks, @pgwhalen. One more suggested change, and after that I think we're ready to merge.

@@ -213,8 +214,12 @@ public synchronized void addRecord(ConsumerRecord<K, V> record) {
recs.add(record);
}

public synchronized void setException(KafkaException exception) {
this.exception = exception;

@rhauch (Contributor) commented May 23, 2019

@pgwhalen, after thinking about this some more, I think we should keep this older signature, deprecate it, and have it call setPollException(exception). Although MockConsumer is strictly speaking a non-public API, doing this will help avoid problems should people use the MockConsumer class in their tests. And it'd probably help to add JavaDoc that says to use setPollException(...) instead.
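A sketch of that compatibility shim, using a hypothetical simplified MockConsumer (field and method names chosen for illustration; not the real class):

```java
public class MockConsumerSketch {
    private RuntimeException pollException;
    private RuntimeException offsetsException;

    /**
     * @deprecated Use {@link #setPollException(RuntimeException)} instead.
     */
    @Deprecated
    public synchronized void setException(RuntimeException exception) {
        // Keep the old signature for existing tests, but delegate to the
        // new, more precisely named method.
        setPollException(exception);
    }

    // Thrown from the next poll() call.
    public synchronized void setPollException(RuntimeException exception) {
        this.pollException = exception;
    }

    // New hook: thrown from the next offset query (e.g. endOffsets()).
    public synchronized void setOffsetsException(RuntimeException exception) {
        this.offsetsException = exception;
    }

    synchronized RuntimeException pollException() { return pollException; }
    synchronized RuntimeException offsetsException() { return offsetsException; }
}
```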

@pgwhalen (Contributor, Author) commented May 24, 2019

Good point, I forgot it was a mildly public API. I added it back, should be good to merge now unless anything else stands out.

@wicknicks (Contributor) left a comment

thanks for the fix, @pgwhalen. one question.

@@ -312,6 +313,10 @@ public void run() {
try {
readToLogEnd();
log.trace("Finished read to end log for topic {}", topic);
} catch (TimeoutException e) {

@wicknicks (Contributor) commented May 24, 2019

there are a couple of other places where we call readToEnd() and poll() (links below). should we catch this exception in those places also?

@pgwhalen (Contributor, Author) commented May 25, 2019

Thanks for the additional eyes, @wicknicks!

The readToLogEnd() call is inside start(), so my thinking was that failure would happen on startup, which seems okay (if not desirable). If there were broker availability issues, it seems reasonable to me for Connect not to start. If we want to be more robust we could come up with some retry strategy, but I bet that would require changes in a number of other places.

The poll() method contains its own catch (KafkaException e), and KafkaException is a superclass of TimeoutException, so we ought to be okay there.
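That type relationship can be checked with a tiny stand-alone example (the classes below are simplified stand-ins; in the real client library, org.apache.kafka.common.errors.TimeoutException ultimately extends KafkaException):

```java
public class PollCatchSketch {
    static class KafkaException extends RuntimeException { }
    static class TimeoutException extends KafkaException { }

    // Mirrors the structure of KafkaBasedLog.poll(): a broad
    // catch (KafkaException e) also catches the TimeoutException subclass.
    public static String poll() {
        try {
            throw new TimeoutException(); // simulate a timed-out fetch
        } catch (KafkaException e) {
            return "caught: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(poll());
    }
}
```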

@wicknicks (Contributor) left a comment

thanks, @pgwhalen! lgtm.

@jgoldrich commented Jun 19, 2019

👍 this affects us, please merge!

@adithyant commented Jun 19, 2019

can we please merge this?

@rhauch (Contributor) approved these changes Jun 20, 2019

Thanks, @pgwhalen. This looks good to me, but since we want to backport this all the way back to the 2.0 branch, I've been waiting to merge until we get out of the freeze on the 2.3 branch.

At least approving this. I'll merge this and backport it as soon as we have a vote to release AK 2.3.0.

@reshetu commented Jul 18, 2019

Wondering what the timeline is for this to be integrated/backported into a production release. It's affecting our current project. Sorry if this is not the right place to ask this question.

@pgwhalen (Contributor, Author) commented Jul 31, 2019

Giving this one another bump as 2.3 has been out for a bit now.

} catch (TimeoutException e) {
log.warn("Timeout while reading log to end for topic '{}'. Retrying automatically. " +
"This may occur when brokers are unavailable or unreachable. Reason: {}", topic, e.getMessage());
continue;

@qinghui-xu (Contributor) commented Aug 12, 2019

I would suggest backing off before retrying: the timeout probably means the resource is unavailable (it might be the Kafka broker itself, the network, or something else), so an immediate retry will likely not succeed.

@pgwhalen (Contributor, Author) commented Aug 12, 2019

In practice, actual retries to the network will still be beholden to the request.timeout.ms and retry.backoff.ms inside the consumer, so that isn't an issue here.

@qinghui-xu (Contributor) commented Aug 13, 2019

That's true, thanks for the reminder. Then there's no need to back off here.

@rhauch rhauch merged commit e2c8612 into apache:trunk Aug 13, 2019

2 of 3 checks passed

  • JDK 8 and Scala 2.11: FAILURE (9951 tests run, 65 skipped, 0 failed)
  • JDK 11 and Scala 2.12: SUCCESS (11269 tests run, 67 skipped, 0 failed)
  • JDK 11 and Scala 2.13: SUCCESS (11680 tests run, 67 skipped, 0 failed)
rhauch added a commit that referenced this pull request Aug 13, 2019
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread (#6283)

When calling readLogToEnd(), the KafkaBasedLog worker thread should catch TimeoutException and log a warning, which can occur if brokers are unavailable, otherwise the worker thread terminates.

Includes an enhancement to MockConsumer that allows simulating exceptions not just when polling but also when querying for offsets, which is necessary for testing the fix.

Author: Paul Whalen <pgwhalen@gmail.com>
Reviewers: Randall Hauch <rhauch@gmail.com>, Arjun Satish <arjun@confluent.io>, Ryanne Dolan <ryannedolan@gmail.com>
rhauch added a commit that referenced this pull request Aug 13, 2019
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread (#6283)
rhauch added a commit that referenced this pull request Aug 13, 2019
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread (#6283)
@qinghui-xu (Contributor) commented Aug 13, 2019

Thanks for the contribution and the effort of merging it, @pgwhalen @rhauch!
I still wonder if my suggestion could be applied here (https://issues.apache.org/jira/browse/KAFKA-8790)? Basically, it's a patch based on this PR, with a worker shutdown mechanism in case the error is unrecoverable (or retries keep failing).

What are your thoughts? I think it would be especially helpful when workers' lifecycles are managed by a resource orchestrator (such as Mesos, K8s, or something else), which is our case. We would be glad to contribute if you think it's valuable for you as well.

rhauch added a commit that referenced this pull request Aug 13, 2019
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread (#6283)
xiowu0 added a commit to linkedin/kafka that referenced this pull request Aug 22, 2019
[LI-CHERRY-PICK] [2426926] KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread (apache#6283)
(cherry picked from commit 2426926)