Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-7196: Remove heartbeat delayed operation for those removed consumers at the end of each rebalance #5556

Closed
wants to merge 1 commit into from
Closed

KAFKA-7196: Remove heartbeat delayed operation for those removed consumers at the end of each rebalance #5556

wants to merge 1 commit into from

Conversation

Lincong
Copy link
Contributor

@Lincong Lincong commented Aug 22, 2018

Description

During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it.

Summary of testing strategy

Extend one of the existing unit test "testRebalanceCompletesBeforeMemberJoins" to verify that the delayed heart beat operation has actually been removed so that it would not trigger another unnecessary group rebalance.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@lindong28 lindong28 self-requested a review October 3, 2018 06:58
@lindong28 lindong28 changed the title KAFKA-7196: Remove heartbeat delayed operation for those removed cons… KAFKA-7196: Remove heartbeat delayed operation for those removed consumers at the end of each rebalance Oct 3, 2018
@lindong28 lindong28 self-assigned this Oct 3, 2018
Copy link
Member

@lindong28 lindong28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch. The fix LGTM. Left only one minor comment for the test code.


// the joined member should get heart beat response with no error. Let the new member keep heartbeating for a while
// to verify that no new rebalance is triggered unexpectedly
for ( _ <- 1 to 20) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help me understand why we choose 20 iteration and 500 ms per clock advance? What would be the minimum possible iteration to make sure that we catch this bug in the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum possible number of iteration to make sure we catch this bug is 3. Because the session timeout is 1000 ms. That means that the heartbeat of the previous consumer should timeout in 1000 ms and that would trigger another unnecessary rebalance if the heartbeat delayed operation is not removed at the end of the first rebalance.

1000 ms is 2 iterations. So for the code to verify the existence or absence of the error, we just need another extra iteration, which is (2 + 1) = 3.

I chose to make it 20 iteration is to verify the consumer group stays stable in a much longer period of time after the rebalance. Plus that running each iteration takes a negligible amount of time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. Sounds good.

@lindong28
Copy link
Member

lindong28 commented Oct 4, 2018

Thanks for the PR @Lincong. LGTM. Merged to 1.0, 1.1, 2.0 and trunk branches.

@lindong28 lindong28 closed this in 260b07a Oct 4, 2018
lindong28 pushed a commit that referenced this pull request Oct 4, 2018
…umers at the end of each rebalance

During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it.

Author: Lincong Li <lcli@linkedin.com>

Reviewers: Dong Lin <lindong28@gmail.com>

Closes #5556 from Lincong/remove_heartbeat_delayedOperation

(cherry picked from commit 260b07a)
Signed-off-by: Dong Lin <lindong28@gmail.com>
lindong28 pushed a commit that referenced this pull request Oct 4, 2018
…umers at the end of each rebalance

During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it.

Author: Lincong Li <lcli@linkedin.com>

Reviewers: Dong Lin <lindong28@gmail.com>

Closes #5556 from Lincong/remove_heartbeat_delayedOperation

(cherry picked from commit 260b07a)
Signed-off-by: Dong Lin <lindong28@gmail.com>
lindong28 pushed a commit that referenced this pull request Oct 4, 2018
…umers at the end of each rebalance

During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it.

Author: Lincong Li <lcli@linkedin.com>

Reviewers: Dong Lin <lindong28@gmail.com>

Closes #5556 from Lincong/remove_heartbeat_delayedOperation

(cherry picked from commit 260b07a)
Signed-off-by: Dong Lin <lindong28@gmail.com>
Pengxiaolong pushed a commit to Pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…umers at the end of each rebalance

During the consumer group rebalance, when the joining group phase finishes, the heartbeat delayed operation of the consumer that fails to rejoin the group should be removed from the purgatory. Otherwise, even though the member ID of the consumer has been removed from the group, its heartbeat delayed operation is still registered in the purgatory and the heartbeat delayed operation is going to timeout and then another unnecessary rebalance is triggered because of it.

Author: Lincong Li <lcli@linkedin.com>

Reviewers: Dong Lin <lindong28@gmail.com>

Closes apache#5556 from Lincong/remove_heartbeat_delayedOperation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants