
KAFKA-2837: fix transient failure of kafka.api.ProducerBounceTest > testBrokerFailure #648

Closed
wants to merge 10 commits

Conversation

@jinxing64

I can reproduce this transient failure, though it seldom happens.
The test code looks like this:

  // rolling bounce brokers
  for (i <- 0 until numServers) {
    for (server <- servers) {
      server.shutdown()
      server.awaitShutdown()
      server.startup()
      Thread.sleep(2000)
    }

    // Make sure the producer does not see any exception
    // in returned metadata due to broker failures
    assertTrue(scheduler.failed == false)

    // Make sure the leader still exists after bouncing brokers
    (0 until numPartitions).foreach(partition => TestUtils.waitUntilLeaderIsElectedOrChanged(zkUtils, topic1, partition))
  }

Brokers keep doing rolling restarts while the producer keeps sending messages.
In every loop iteration the test waits for the election of a partition leader.
But if the election is slow, more messages get buffered in the RecordAccumulator's BufferPool.
The buffer limit is set to 30000 bytes, so a TimeoutException("Failed to allocate memory within the configured max blocking time") shows up when the pool runs out of memory.
Since every broker restart is followed by a 2000 ms sleep, this transient failure seldom happens; but the more I reduce the sleep period, the bigger the chance of failure.
For example, if the broker acting as controller is restarted, it takes time to elect a new controller first and then elect the partition leaders, which leaves even more messages blocked in the KafkaProducer's RecordAccumulator BufferPool.
In this fix, I just enlarge the producer's buffer size to 1 MB.
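
As a rough sketch of what the fix amounts to (the variable name and setup here are illustrative, not the exact patch; the config key is the standard ProducerConfig one):

  import java.util.Properties
  import org.apache.kafka.clients.producer.ProducerConfig

  val producerProps = new Properties()
  // Grow the accumulator buffer from the test's 30000 bytes to 1 MB so that
  // slow leader elections do not exhaust the BufferPool while send() blocks.
  producerProps.put(ProducerConfig.BUFFER_MEMORY_CONFIG, (1024 * 1024).toString)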
@guozhangwang, could you give some comments?


val numServers = 2

val overridingProps = new Properties()
overridingProps.put(KafkaConfig.AutoCreateTopicsEnableProp, false.toString)
overridingProps.put(KafkaConfig.MessageMaxBytesProp, serverMessageMaxBytes.toString)
Why do you need to remove this config?

@guozhangwang (Contributor)

@ZoneMayor Thanks for looking into this. I think this reasoning makes sense.

In the current producer version we have already deprecated METADATA_FETCH_TIMEOUT_CONFIG and BLOCK_ON_BUFFER_FULL_CONFIG, so in order to eliminate this issue instead of just reducing its likelihood, we could choose to set neither of these two and instead set MAX_BLOCK_MS_CONFIG to Long.MAX_VALUE, so that the producer will block forever if there is not enough memory in the buffer.
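
A minimal sketch of that suggestion, assuming the standard ProducerConfig keys (not the exact patch):

  import java.util.Properties
  import org.apache.kafka.clients.producer.ProducerConfig

  val producerProps = new Properties()
  // Leave the deprecated metadata.fetch.timeout.ms and block.on.buffer.full
  // unset, and let max.block.ms alone govern how long send() may block.
  producerProps.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, Long.MaxValue.toString)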

@jeanlyn commented Dec 10, 2015

👍 The reasoning also makes sense to me.

@jinxing64 jinxing64 closed this Dec 10, 2015
@jinxing64 jinxing64 reopened this Dec 10, 2015
@jinxing64 jinxing64 closed this Dec 10, 2015
@jinxing64 jinxing64 reopened this Dec 10, 2015
@@ -455,7 +455,7 @@ object TestUtils extends Logging {
    */
  def createNewProducer(brokerList: String,
                        acks: Int = -1,
-                       metadataFetchTimeout: Long = 3000L,
+                       maxBlockMs: Long = Long.MaxValue,
                        blockOnBufferFull: Boolean = true,
Seems this variable is not used any more, could we remove it from createNewProducer as well?

@jinxing64 jinxing64 closed this Dec 11, 2015
@jinxing64 jinxing64 reopened this Dec 11, 2015
@asfgit asfgit closed this in 3fed579 Dec 13, 2015
@guozhangwang (Contributor)

LGTM. Merged to trunk.

@ijuma (Contributor) commented Dec 14, 2015

@guozhangwang I don't understand the reasoning for setting MAX_BLOCK_MS_CONFIG to Long.MaxValue by default. Doesn't this mean that the test suite could hang, instead of failing after a timeout, in case of bugs? Wouldn't it be better to set a timeout that is high enough for our tests but much lower than Long.MaxValue? Maybe 1 minute or something along those lines?
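
A minimal sketch of that alternative (the one-minute value is the suggestion above, not what was merged):

  import java.util.Properties
  import org.apache.kafka.clients.producer.ProducerConfig

  val producerProps = new Properties()
  // Block for at most one minute: generous for tests, but a stuck test then
  // fails with a TimeoutException instead of hanging the suite indefinitely.
  producerProps.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, (60 * 1000L).toString)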

@guozhangwang (Contributor)

@ijuma Yeah you are right. Setting it to MaxValue is not a good solution actually. I will submit a follow-up patch shortly.

efeg pushed a commit to efeg/kafka that referenced this pull request Jan 29, 2020
AnGg98 pushed a commit to AnGg98/kafka that referenced this pull request Jul 25, 2022