Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINOR: Revert "KAFKA-13542: add rebalance reason in Kafka Streams (#11804)" #11873

Merged
merged 1 commit into from
Mar 10, 2022
Merged

MINOR: Revert "KAFKA-13542: add rebalance reason in Kafka Streams (#11804)" #11873

merged 1 commit into from
Mar 10, 2022

Conversation

wcarlson5
Copy link
Contributor

@wcarlson5 wcarlson5 commented Mar 9, 2022

This reverts commit 2ccc834. We were seeing serious regressions in our state heavy benchmarks. We saw that our state heavy benchmarks were experiencing a really bad regression. The State heavy benchmarks runs with rolling bounces with 10 nodes.

We regularly saw this exception

17165 java.lang.OutOfMemoryError: Java heap space                                                                                                                                                                                              
17166 Dumping heap to ./senrollingBounce10-suNONE-stMEM-cENABLED-lENABLED-snkNO_SINK-ks1000000-kdsequential-prt100-ec2i3.large-1,1-heap-dump-try-1 ...
17167 Heap dump file created [4129525648 bytes in 12.538 secs]
17168 217283 [kafka-coordinator-heartbeat-thread | benchmark] ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=benchmark-727ac648-f5d5-43fb-95c3-549dc2da033d-StreamThread-1-consumer, groupId=benchm      ark] Heartbeat thread failed due to unexpected error
17169 java.lang.OutOfMemoryError: Java heap space
17169 java.lang.OutOfMemoryError: Java heap space
17170         at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
17171         at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
17172         at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
17173         at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:113)
17174         at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:452)
17175         at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:402)
17176         at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:674)
17177         at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:576)
17178         at org.apache.kafka.common.network.Selector.poll(Selector.java:481)
17179         at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)
17180         at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265)
17181         at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:306)
17182         at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:1420)

I ran through a git bisect and found this commit. We verified that the commit right before did not have the same issues as this one did. I then reverted the problematic commit and ran the benchmarks again on this commit and did not see any more issues. We are still looking into the root cause, but for now since this isn't a critical improvement so we can remove it temporarily.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@wcarlson5
Copy link
Contributor Author

wcarlson5 commented Mar 9, 2022

@vvcephei @cadonna

Copy link
Contributor

@cadonna cadonna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wcarlson5 I approve the PR.

However, I -- as probably also you -- do not see the connection to the issue.

@cadonna
Copy link
Contributor

cadonna commented Mar 10, 2022

Restarted builds

@dajac
Copy link
Contributor

dajac commented Mar 10, 2022

@wcarlson5 It would be great if we could get to the bottom on this. I don't really understand how that could impact the performances.

@ijuma
Copy link
Contributor

ijuma commented Mar 10, 2022

Have you verified that this revert actually fixes the regression?

@ijuma
Copy link
Contributor

ijuma commented Mar 10, 2022

We should not revert this without providing more detailed information, the PR doesn't include enough information.

@ijuma
Copy link
Contributor

ijuma commented Mar 10, 2022

Is this method being called in a hot loop? One difference is that there is a string concatenation if the reason is not null.

@wcarlson5
Copy link
Contributor Author

wcarlson5 commented Mar 10, 2022

@dajac @ijuma Hey I expanded the PR description a bit to include everything we know. I also was not sure how this causes a performance impact. It seems innocuous but i has a clear impact on our benchmarks. @lihaosky Is looking to find out more. It looks like a buffer in the network client was overflowing maybe?

Commit 2ccc834 (the original)
image

Commit 59ca166 (the revert)
image

Let me know if there is anything else you would like to know

@ableegoldman
Copy link
Contributor

Test failures are unrelated. Going to merge

Thanks for digging into this @wcarlson5 and @lihaosky . I hope we can figure out the cause and fix it soon as I was looking forward to having the rebalance reason be available in the JoinGroup :/

@ijuma
Copy link
Contributor

ijuma commented Mar 10, 2022

Thanks for the details @wcarlson5

@ableegoldman ableegoldman merged commit 4d5a289 into apache:trunk Mar 10, 2022
@ableegoldman
Copy link
Contributor

Merged to trunk

@wcarlson5 wcarlson5 deleted the reverted_commit branch March 10, 2022 21:53
wcarlson5 added a commit to confluentinc/kafka that referenced this pull request Mar 22, 2022
)" (apache#11873)

This reverts commit 2ccc834.

This reverts commit 2ccc834. We were seeing serious regressions in our state heavy benchmarks. We saw that our state heavy benchmarks were experiencing a really bad regression. The State heavy benchmarks runs with rolling bounces with 10 nodes.

We regularly saw this exception:  java.lang.OutOfMemoryError: Java heap space                                                                                                                                                                                              

I ran through a git bisect and found this commit. We verified that the commit right before did not have the same issues as this one did. I then reverted the problematic commit and ran the benchmarks again on this commit and did not see any more issues. We are still looking into the root cause, but for now since this isn't a critical improvement so we can remove it temporarily.

Reviewers: Bruno Cadonna <cadonna@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>, David Jacot <djacot@confluent.io>, Ismael Juma <ismael@confluent.io>
wcarlson5 added a commit to confluentinc/kafka that referenced this pull request Mar 28, 2022
)" (apache#11873)

This reverts commit 2ccc834.

This reverts commit 2ccc834. We were seeing serious regressions in our state heavy benchmarks. We saw that our state heavy benchmarks were experiencing a really bad regression. The State heavy benchmarks runs with rolling bounces with 10 nodes.

We regularly saw this exception:  java.lang.OutOfMemoryError: Java heap space                                                                                                                                                                                              

I ran through a git bisect and found this commit. We verified that the commit right before did not have the same issues as this one did. I then reverted the problematic commit and ran the benchmarks again on this commit and did not see any more issues. We are still looking into the root cause, but for now since this isn't a critical improvement so we can remove it temporarily.

Reviewers: Bruno Cadonna <cadonna@confluent.io>, Anna Sophie Blee-Goldman <ableegoldman@apache.org>, David Jacot <djacot@confluent.io>, Ismael Juma <ismael@confluent.io>
dajac added a commit that referenced this pull request Mar 31, 2022
…upRequest (#11971)

KIP-800 introduced a mechanism to pass a reason in the join group request and in the leave group request. A default reason is used unless one is provided by the user. In this case, the custom reason is prefixed by the default one.

When we tried to used this in Kafka Streams, we noted a significant degradation of the performances, see #11873. It is not clear wether the prefixing is the root cause of the issue or not. To be on the safe side, I think that we should remove the prefixing. It does not bring much anyway as we are still able to distinguish a custom reason from the default one on the broker side.

This patch removes prefixing the user provided reasons. So if a the user provides a reason, the reason is used directly. If the reason is empty or null, the default reason is used.

Reviewers: Luke Chen <showuon@gmail.com>, <jeff.kim@confluent.io>, Hao Li <hli@confluent.io>
dajac added a commit that referenced this pull request Mar 31, 2022
…upRequest (#11971)

KIP-800 introduced a mechanism to pass a reason in the join group request and in the leave group request. A default reason is used unless one is provided by the user. In this case, the custom reason is prefixed by the default one.

When we tried to used this in Kafka Streams, we noted a significant degradation of the performances, see #11873. It is not clear wether the prefixing is the root cause of the issue or not. To be on the safe side, I think that we should remove the prefixing. It does not bring much anyway as we are still able to distinguish a custom reason from the default one on the broker side.

This patch removes prefixing the user provided reasons. So if a the user provides a reason, the reason is used directly. If the reason is empty or null, the default reason is used.

Reviewers: Luke Chen <showuon@gmail.com>, <jeff.kim@confluent.io>, Hao Li <hli@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants