
KAFKA-6120: RecordCollector should not retry sending #4148

Closed
wants to merge 2 commits into trunk from kafka-6120-recordCollector

Conversation

mjsax (Member) commented Oct 27, 2017

No description provided.

mjsax (PR author) commented Oct 27, 2017

Call for review @bbejeck @dguy @guozhangwang

dguy (Contributor) commented Oct 30, 2017

retest this please

The first review thread is anchored on this diff hunk in RecordCollectorImpl, where the PR removes the retry/back-off around producer.send():

Utils.sleep(SEND_RETRY_BACKOFF);
}
});
} catch (final TimeoutException e) {
guozhangwang (Contributor) commented Oct 31, 2017:

This is not really a comment on this line: I agree that we should get rid of the retry logic and rely only on producer internals for retries. One caveat of this change, though, is that if we did hit the issue that KIP-91 tries to solve, Streams will be less resilient than today (previously we could tolerate up to 3X downtime, while now it is X downtime), since increasing max.block.ms will not help here.

Just FYI to @xvrl
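
For context, here is a minimal sketch of the kind of send-with-retry loop this PR removes, reconstructed from the diff hunk above. This is not the actual RecordCollectorImpl code; the class name and the values of MAX_SEND_ATTEMPTS and SEND_RETRY_BACKOFF are assumed for illustration.

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.common.utils.Utils;
import org.apache.kafka.streams.errors.StreamsException;

// Sketch of the retry/back-off pattern being removed by this PR.
class RetrySendSketch {
    private static final int MAX_SEND_ATTEMPTS = 3;       // assumed value
    private static final long SEND_RETRY_BACKOFF = 100L;  // assumed value (ms)

    <K, V> void send(final Producer<K, V> producer, final ProducerRecord<K, V> record) {
        for (int attempt = 1; attempt <= MAX_SEND_ATTEMPTS; attempt++) {
            try {
                // the real code passes a callback that records any asynchronous send error
                producer.send(record, (metadata, exception) -> { });
                return;  // record enqueued; from here on the producer's own retries apply
            } catch (final TimeoutException e) {
                // producer buffer full and max.block.ms elapsed: back off and try again
                Utils.sleep(SEND_RETRY_BACKOFF);
            }
        }
        throw new StreamsException("Failed to send record after " + MAX_SEND_ATTEMPTS + " attempts");
    }
}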

A contributor replied:

Would it not help if max.block.ms was set to a reasonably large value?

A member replied:

@guozhangwang can you explain how this affects our ability to withstand downtime? You mentioned 3X vs. X downtime; what currently determines the value of X?

guozhangwang (Contributor) replied:

Actually it's not exactly 3X vs. X. Here is the difference:

Assuming the broker is down, then without this PR the producer would first use request.timeout to throw the exception for records in its accumulated queue; that exception is caught here and the send is retried, and on each retry it will wait up to max.block.ms (since the queue is full) and then throw the TimeoutException again, up to three times. So the total time it can endure the broker being down is

request.timeout + 3 * max.block.ms

With this PR it would be just request.timeout.

Note that the issue itself will only happen if we do not yet know the destination leader of the partition when the broker goes down, so the likelihood of hitting it is not 100%.

mjsax (PR author) replied:

I thought the call to send() will never throw a TimeoutException after request.timeout has passed. The TimeoutException will be provided to the callback handler, and Streams will throw it later on. This behavior does not change with this PR. One will always need to increase the producer config retries to be more resilient to this scenario.

Please correct me if I am wrong.
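
A minimal sketch of the callback pattern being described here (illustrative names, not the actual RecordCollectorImpl code): the send error is remembered in the callback and only surfaces as an exception on a later call.

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.streams.errors.StreamsException;

// Sketch: an async send failure is captured by the callback and rethrown on the next send().
class CallbackErrorCaptureSketch {
    private volatile Exception sendException;  // error remembered from an earlier send

    <K, V> void send(final Producer<K, V> producer, final ProducerRecord<K, V> record) {
        checkForException();  // an earlier failure is only thrown here, on the *next* call
        producer.send(record, new Callback() {
            @Override
            public void onCompletion(final RecordMetadata metadata, final Exception exception) {
                if (exception != null && sendException == null) {
                    sendException = exception;
                }
            }
        });
    }

    private void checkForException() {
        if (sendException != null) {
            throw new StreamsException("error sending record", sendException);
        }
    }
}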

mjsax (PR author) added:

The TimeoutException that we catch here should only originate from the scenario where the producer queue is full and max.block.ms has passed.

guozhangwang (Contributor) replied:

@mjsax That is right: the TimeoutException from Sender#failBatch() is returned as the callback's exception, which will only be thrown on the next call, and retries will not help here. So it is really max.block.ms vs. 3 * max.block.ms.

Currently this config's default is 60 seconds and Streams does not override it. So the effect is that if we do hit the issue KIP-91 is solving, the resilience is 60 seconds vs. 180 seconds.
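
For operators who want the pre-PR tolerance back, one option, sketched here under assumed values rather than as a recommendation from this thread, is to raise the producer's max.block.ms (and optionally retries, as mentioned above) through Streams' producer-prefixed configs:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ResilienceConfigSketch {
    public static Properties sketch() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical brokers
        // Default max.block.ms is 60s; before this PR the worst case was roughly 3 * 60s = 180s,
        // so 180000 here restores a comparable tolerance for the KIP-91 scenario (assumed value).
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 180000);
        // Producer-side retries, as mentioned earlier in the thread (assumed value).
        props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 10);
        return props;
    }
}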

The next review thread is anchored on this diff hunk in the RecordCollector test, where the mocked send() is changed to throw a KafkaException instead of a TimeoutException:

final RecordCollector collector = new RecordCollectorImpl(
    new MockProducer(cluster, true, new DefaultPartitioner(), byteArraySerializer, byteArraySerializer) {
        @Override
        public synchronized Future<RecordMetadata> send(final ProducerRecord record, final Callback callback) {
            throw new TimeoutException();   // before this PR
            throw new KafkaException();     // after this PR
A contributor commented on the diff:

Hmm, I'm wondering how we succeeded in this test case, since in the code above the send() call is only caught for TimeoutException. Note that we only set the KafkaException in the callback, while here we throw the exception directly. And in fact, you changed the expected exception from StreamsException to KafkaException in line 128 above.

mjsax (PR author) replied:

As Damian pointed out, the expected exception should still be StreamsException, and we need to wrap this exception with an additional catch block.
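
A minimal sketch of the wrapping catch block being suggested (names are illustrative; this is not the exact code that ended up in the PR):

import org.apache.kafka.common.KafkaException;
import org.apache.kafka.streams.errors.StreamsException;

// Sketch: rethrow StreamsException as-is, wrap any other KafkaException so callers
// (and the test) consistently see a StreamsException.
class WrapSendExceptionSketch {
    void sendWrapped(final Runnable doSend) {
        try {
            doSend.run();  // stands in for the producer.send(...) call
        } catch (final StreamsException e) {
            throw e;       // already the exception type Streams surfaces
        } catch (final KafkaException e) {
            throw new StreamsException("Error sending record", e);
        }
    }
}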

The related diff hunk on the test's expected exception, where the annotation is changed from StreamsException to KafkaException:

@SuppressWarnings("unchecked")
@Test(expected = StreamsException.class)   // before this PR
@Test(expected = KafkaException.class)     // after this PR
public void shouldThrowStreamsExceptionAfterMaxAttempts() {
dguy (Contributor) commented on the diff:

Should this be StreamsException?

mjsax (PR author) commented Nov 6, 2017

Updated this.

asfgit commented Nov 6, 2017

FAILURE. No test results found.

asfgit commented Nov 6, 2017

SUCCESS. 8049 tests run, 5 skipped, 0 failed.

1 similar comment from asfgit (same result).

guozhangwang (Contributor) commented:
Merged to trunk.

asfgit closed this in 2b5a213 on Nov 6, 2017.
mjsax deleted the kafka-6120-recordCollector branch on November 6, 2017 at 19:18.