
KAFKA-2837: fix transient failure of kafka.api.ProducerBounceTest > testBrokerFailure #648

Closed
wants to merge 10 commits

Conversation

@jinxing64

I can reproduce this transient failure, though it seldom happens.
The test code looks like this:

  // rolling bounce brokers
  for (i <- 0 until numServers) {
    for (server <- servers) {
      server.shutdown()
      server.awaitShutdown()
      server.startup()
      Thread.sleep(2000)
    }

    // Make sure the producer does not see any exception
    // in returned metadata due to broker failures
    assertTrue(scheduler.failed == false)

    // Make sure the leader still exists after bouncing brokers
    (0 until numPartitions).foreach(partition => TestUtils.waitUntilLeaderIsElectedOrChanged(zkUtils, topic1, partition))
  }

Brokers keep doing rolling restarts while the producer keeps sending messages.
In every loop iteration the test waits for the election of a partition leader.
But if the election is slow, more messages get buffered in the RecordAccumulator's BufferPool.
The buffer limit is set to 30000 bytes, so a TimeoutException("Failed to allocate memory within the configured max blocking time") shows up when the pool runs out of memory.
Since every broker restart is followed by a 2000 ms sleep, this transient failure seldom happens; but the more I reduce the sleep period, the bigger the chance of failure.
For example, if the broker acting as controller is restarted, it takes time to elect a new controller first and then elect the partition leaders, which leaves even more messages blocked in the KafkaProducer's RecordAccumulator BufferPool.
In this fix, I just enlarge the producer's buffer size to 1 MB.
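
As a rough sketch of what the fix amounts to (the variable name and setup here are illustrative, not the exact patch; the config key is the standard ProducerConfig one):

  import java.util.Properties
  import org.apache.kafka.clients.producer.ProducerConfig

  val producerProps = new Properties()
  // Grow the accumulator buffer from the test's 30000 bytes to 1 MB so that
  // slow leader elections do not exhaust the BufferPool while send() blocks.
  producerProps.put(ProducerConfig.BUFFER_MEMORY_CONFIG, (1024 * 1024).toString)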
@guozhangwang, could you give some comments?


val numServers = 2

val overridingProps = new Properties()
overridingProps.put(KafkaConfig.AutoCreateTopicsEnableProp, false.toString)
overridingProps.put(KafkaConfig.MessageMaxBytesProp, serverMessageMaxBytes.toString)
Why do you need to remove this config?

@guozhangwang (Contributor)

@ZoneMayor Thanks for looking into this. I think this reasoning makes sense.

In the current producer version we have already deprecated METADATA_FETCH_TIMEOUT_CONFIG and BLOCK_ON_BUFFER_FULL_CONFIG, so in order to eliminate this issue instead of just reducing its likelihood, we could choose to set neither of these two and instead set MAX_BLOCK_MS_CONFIG to Long.MAX_VALUE, so that the producer will block forever if there is not enough memory in the buffer.
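
A minimal sketch of that suggestion, assuming the standard ProducerConfig keys (not the exact patch):

  import java.util.Properties
  import org.apache.kafka.clients.producer.ProducerConfig

  val producerProps = new Properties()
  // Leave the deprecated metadata.fetch.timeout.ms and block.on.buffer.full
  // unset, and let max.block.ms alone govern how long send() may block.
  producerProps.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, Long.MaxValue.toString)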

@jeanlyn commented Dec 10, 2015

👍 The reasoning also makes sense to me.

@jinxing64 jinxing64 closed this Dec 10, 2015
@jinxing64 jinxing64 reopened this Dec 10, 2015
@jinxing64 jinxing64 closed this Dec 10, 2015
@jinxing64 jinxing64 reopened this Dec 10, 2015
@@ -455,7 +455,7 @@ object TestUtils extends Logging {
    */
  def createNewProducer(brokerList: String,
                        acks: Int = -1,
-                       metadataFetchTimeout: Long = 3000L,
+                       maxBlockMs: Long = Long.MaxValue,
                        blockOnBufferFull: Boolean = true,
Seems this variable is not used any more, could we remove it from createNewProducer as well?

@jinxing64 jinxing64 closed this Dec 11, 2015
@jinxing64 jinxing64 reopened this Dec 11, 2015
@asfgit asfgit closed this in 3fed579 Dec 13, 2015
@guozhangwang (Contributor)

LGTM. Merged to trunk.

@ijuma (Contributor) commented Dec 14, 2015

@guozhangwang I don't understand the reasoning for setting MAX_BLOCK_MS_CONFIG to Long.MaxValue by default. Doesn't this mean that the test suite could hang, instead of failing after a timeout, in case of bugs? Wouldn't it be better to set a timeout that is high enough for our tests but much lower than Long.MaxValue? Maybe 1 minute or something along those lines?
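
A minimal sketch of that alternative (the one-minute value is the suggestion above, not what was merged):

  import java.util.Properties
  import org.apache.kafka.clients.producer.ProducerConfig

  val producerProps = new Properties()
  // Block for at most one minute: generous for tests, but a stuck test then
  // fails with a TimeoutException instead of hanging the suite indefinitely.
  producerProps.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, (60 * 1000L).toString)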

@guozhangwang (Contributor)

@ijuma Yeah you are right. Setting it to MaxValue is not a good solution actually. I will submit a follow-up patch shortly.

efeg pushed a commit to efeg/kafka that referenced this pull request Jan 29, 2020
AnGg98 pushed a commit to AnGg98/kafka that referenced this pull request Jul 25, 2022