KAFKA-4473: RecordCollector should handle retriable exceptions more strictly #2249
Conversation
TopicPartition tp = new TopicPartition(metadata.topic(), metadata.partition());
offsets.put(tp, metadata.offset());
} else {
    sendException = exception;
I don't see why we cannot just throw a StreamsException directly?
The callback happens on the IO thread of the producer
I see.
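To make the constraint concrete, here is a minimal sketch of the pattern the PR adopts, assuming a heavily simplified collector (the class name, the volatile qualifier, and the helper names are illustrative, not the actual Kafka Streams code): because the callback runs on the producer's IO thread, the exception is stashed in a field and rethrown on the stream thread on the next call to send/flush/close.

    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.streams.errors.StreamsException;

    // Sketch only: a simplified RecordCollector showing save-and-rethrow.
    public class RecordCollectorSketch {

        private final Producer<byte[], byte[]> producer;
        // Written by the producer IO thread, read by the stream thread;
        // volatile is used here for cross-thread visibility in this sketch.
        private volatile Exception sendException;

        public RecordCollectorSketch(final Producer<byte[], byte[]> producer) {
            this.producer = producer;
        }

        public void send(final ProducerRecord<byte[], byte[]> record) {
            checkForException(); // surface an earlier failure before sending more
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // We are on the producer IO thread: throwing here would not
                    // propagate to the stream thread, so remember the failure.
                    sendException = exception;
                }
            });
        }

        // Called from the stream thread in send(), flush(), and close().
        private void checkForException() {
            if (sendException != null) {
                throw new StreamsException("error sending record", sendException);
            }
        }
    }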
@Override
public void flush() {
    log.debug("{} Flushing producer", logPrefix);
    this.producer.flush();
    checkForException();
Should we check first and then flush?
I did have it checking before and after flush, but decided to flush and then check. The reason is that we probably want to close the producer, which will also cause the messages to be flushed.
Just for clarification: if we call flush, we know that there are no "dangling callbacks" for onCompletion()?
Yep - they have all completed by the time flush returns.
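A short usage sketch of why flush-then-check suffices (continuing the simplified collector above; record1, record2, and checkForException are the hypothetical names from that sketch): KafkaProducer.flush() blocks until all in-flight sends have completed, so every onCompletion callback has fired before it returns, and a single check afterwards observes any failure.

    // flush-then-check: by the time flush() returns, every callback has run,
    // so a failure recorded by any of them is visible to one final check.
    producer.send(record1, callback);
    producer.send(record2, callback);
    producer.flush();      // blocks until all in-flight sends complete
    checkForException();   // throws StreamsException if any callback failed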
LGTM
One minor comment, otherwise LGTM. I have a couple of other general comments which could be tackled in a separate PR if you agree with them.
@@ -79,9 +81,17 @@ public RecordCollectorImpl(Producer<byte[], byte[]> producer, String streamTaskI
@Override
public void onCompletion(RecordMetadata metadata, Exception exception) {
    if (exception == null) {
        if (sendException != null) {
            log.warn("{} not updating offset for topic {} partition {} due to previous exception",
In a batch of records, if the first record fails, it will cause all of the remaining records' callbacks to add a warn entry and hence swamp the log file. I feel it is better to just modify line 95 to something like "error sending to topic and partition; will not update the offset of this partition anymore, and this exception will eventually be thrown to the user".
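As a sketch of that suggestion (hypothetical code, not necessarily the merged change; offsets, log, and logPrefix are assumed fields of the simplified collector above): log at error level only for the first failure, then silently skip offset updates once sendException is set.

    public void onCompletion(final RecordMetadata metadata, final Exception exception) {
        if (exception == null) {
            if (sendException == null) {
                offsets.put(new TopicPartition(metadata.topic(), metadata.partition()),
                            metadata.offset());
            }
            // else: a previous send already failed; skip the update without logging.
        } else if (sendException == null) {
            sendException = exception;
            // Log once; metadata may be null on failure, so don't rely on it here.
            log.error("{} error sending record; no further offsets will be recorded "
                    + "and this exception will eventually be thrown to the user",
                    logPrefix, exception);
        }
    }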
@@ -110,6 +127,7 @@ public void flush() {
@Override
public void close() {
    producer.close();
    checkForException();
This is not introduced in this PR, but I realized this function is actually never called. Is it a bug?
Hmm, yes, it is only called from the test I wrote. The producer is closed during shutdown in StreamThread. Which I don't think would matter on its own, but it is semi-related to your comment above. During shutdown and suspend we should only commit offsets after we've flushed and/or closed the producer - otherwise we run the risk of violating at-least-once.
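A sketch of the ordering being argued for (a hypothetical shutdown sequence; the real StreamThread/StreamTask logic has more steps, and stateMgr, consumedOffsets, and checkForException are assumed names): flush state and the producer, verify no send failed, and only then commit consumed offsets.

    // Hypothetical shutdown path that preserves at-least-once:
    void closeTask() {
        stateMgr.flush();                     // 1. flush state stores
        producer.flush();                     // 2. push out all buffered records
        checkForException();                  // 3. abort before committing if a send failed
        consumer.commitSync(consumedOffsets); // 4. only now commit consumed offsets
        producer.close();                     // 5. release the producer
    }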
@Override
public void flush() {
    log.debug("{} Flushing producer", logPrefix);
    this.producer.flush();
    checkForException();
This is not introduced in this PR and we can tackle it in a separate PR if my reasoning is correct: in #1970 we did this ordering:
1. Commit tasks BUT only commit their consumers, do not flush states or flush producers.
2. Close tasks.
3. Flush stores: stateMgr.flush(processorContext).
4. Flush producers.
5. Close state manager: closeAllStateManagers().
6. Close producer / consumer / restore consumer.
I do not remember if there was any reason to do step 1 first, but it definitely voids the at-least-once guarantees if we have a failure after step 1 and before any of the other steps, right?
That is correct. I've raised: https://issues.apache.org/jira/browse/KAFKA-4561
@guozhangwang updated comment based on your feedback - thanks. I also raised another JIRA to track the other issue.
Merged to trunk.
KAFKA-4473: RecordCollector should handle retriable exceptions more strictly

The `RecordCollectorImpl` currently drops messages on the floor if an exception is non-null in the producer callback. This will result in message loss and violates at-least-once processing. Rather than just log an error in the callback, save the exception in a field. On subsequent calls to `send`, `flush`, or `close`, first check for the existence of an exception and throw a `StreamsException` if it is non-null. Also, in the callback, if an exception has already occurred, the `offsets` map should not be updated.

Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>

Closes apache#2249 from dguy/kafka-4473