New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-7196: Remove heartbeat delayed operation for those removed consumers at the end of each rebalance #5556
Conversation
…umers at the end of each rebalance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch. The fix LGTM. Left only one minor comment for the test code.
|
||
// the joined member should get heart beat response with no error. Let the new member keep heartbeating for a while | ||
// to verify that no new rebalance is triggered unexpectedly | ||
for ( _ <- 1 to 20) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help me understand why we choose 20
iteration and 500
ms per clock advance? What would be the minimum possible iteration to make sure that we catch this bug in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The minimum possible number of iteration to make sure we catch this bug is 3. Because the session timeout is 1000 ms. That means that the heartbeat of the previous consumer should timeout in 1000 ms and that would trigger another unnecessary rebalance if the heartbeat delayed operation is not removed at the end of the first rebalance.
1000 ms is 2 iterations. So for the code to verify the existence or absence of the error, we just need another extra iteration, which is (2 + 1) = 3.
I chose to make it 20 iteration is to verify the consumer group stays stable in a much longer period of time after the rebalance. Plus that running each iteration takes a negligible amount of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. Sounds good.
Thanks for the PR @Lincong. LGTM. Merged to 1.0, 1.1, 2.0 and trunk branches. |
…umers at the end of each rebalance During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it. Author: Lincong Li <lcli@linkedin.com> Reviewers: Dong Lin <lindong28@gmail.com> Closes #5556 from Lincong/remove_heartbeat_delayedOperation (cherry picked from commit 260b07a) Signed-off-by: Dong Lin <lindong28@gmail.com>
…umers at the end of each rebalance During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it. Author: Lincong Li <lcli@linkedin.com> Reviewers: Dong Lin <lindong28@gmail.com> Closes #5556 from Lincong/remove_heartbeat_delayedOperation (cherry picked from commit 260b07a) Signed-off-by: Dong Lin <lindong28@gmail.com>
…umers at the end of each rebalance During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it. Author: Lincong Li <lcli@linkedin.com> Reviewers: Dong Lin <lindong28@gmail.com> Closes #5556 from Lincong/remove_heartbeat_delayedOperation (cherry picked from commit 260b07a) Signed-off-by: Dong Lin <lindong28@gmail.com>
…umers at the end of each rebalance During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it. Author: Lincong Li <lcli@linkedin.com> Reviewers: Dong Lin <lindong28@gmail.com> Closes apache#5556 from Lincong/remove_heartbeat_delayedOperation
Description
During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it.
Summary of testing strategy
Extend one of the existing unit test "testRebalanceCompletesBeforeMemberJoins" to verify that the delayed heart beat operation has actually been removed so that it would not trigger another unnecessary group rebalance.
Committer Checklist (excluded from commit message)