KAFKA-14016: Revoke more partitions than expected in Cooperative rebalance #12348
…LANCE_IN_PROGRESS (apache#12140)" This reverts commit c23d60d.
The consumer gets REBALANCE_IN_PROGRESS when the current rebalance round has completed join_group and the group state has changed to completingRebalance, and then something triggers another rebalance (e.g. a new member joins), so the group state changes to preparingRebalance. Any sync_group sent during this window gets REBALANCE_IN_PROGRESS. This error does not come from the consumer sending sync_group with an old generation id; in that case it would receive ILLEGAL_GENERATION instead.
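To make the timing concrete, here is a minimal sketch of the decision described above. It is a simplified, hypothetical model (the class, enum, and method names are invented for illustration), not the actual coordinator code:

```java
// Simplified, hypothetical model of how a group coordinator might answer
// sync_group depending on group state. All names here are illustrative
// assumptions, not Kafka's real classes.
class SyncGroupModel {
    enum GroupState { PREPARING_REBALANCE, COMPLETING_REBALANCE, STABLE }

    enum SyncResult {
        ILLEGAL_GENERATION,       // request carries a generation id that is not the group's current one
        REBALANCE_IN_PROGRESS,    // a new rebalance started before this sync_group was answered
        AWAIT_LEADER_ASSIGNMENT,  // waiting for the leader's assignment
        ASSIGNMENT_RETURNED       // group is stable, assignment is handed out
    }

    static SyncResult handleSyncGroup(GroupState state, int groupGeneration, int requestGeneration) {
        // A stale generation id is rejected before the state is even considered.
        if (requestGeneration != groupGeneration) {
            return SyncResult.ILLEGAL_GENERATION;
        }
        switch (state) {
            case PREPARING_REBALANCE:
                // join_group finished, then e.g. a new member joined:
                // the sync_group arrives too late for this round.
                return SyncResult.REBALANCE_IN_PROGRESS;
            case COMPLETING_REBALANCE:
                return SyncResult.AWAIT_LEADER_ASSIGNMENT;
            case STABLE:
                return SyncResult.ASSIGNMENT_RETURNED;
            default:
                throw new IllegalStateException("Unexpected state: " + state);
        }
    }
}
```

In this model, a member with the current generation id that syncs while the group is back in preparingRebalance gets REBALANCE_IN_PROGRESS, while a member with an old generation id gets ILLEGAL_GENERATION regardless of state.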
So we want to make sure the ownedPartitions in the consumers are up to date. If the consumer leader has already calculated the assignment in this round, we should distribute it to the consumers, even though the next round of rebalance has already started.
Is my understanding correct?
If so, I think I agree with the solution. But I'd like to hear @guozhangwang 's opinion.
Thanks.
@showuon Yes. |
@aiquestion Thanks for the patch. Before settling in on the best approach to fix this, could we start by adding a unit test which reproduces the issue? Could you explain why we need to revert https://issues.apache.org/jira/browse/KAFKA-13891? In the scenario you describe in the Jira:
I suppose that this scenario only works if A1 is the leader, right? Otherwise, A1 would not have received the sync group response. |
@dajac Okay, I will add a unit test for it first.
|
@aiquestion Yeah, this is what I thought. All the members rejoining after the leader are affected. I think that the fundamental issue here is that we don't really enforce the synchronization barrier after each rebalance in the cooperative mode. We consider the rebalance completed as soon as the assignment provided by the leader is persisted, and we transition the group to Stable. The barrier is loose in that sense. So an alternative approach to your current proposal would be to really enforce that synchronization barrier. We could release the sync-group responses only when all the members have sent their sync-group request, instead of doing it when the assignment is persisted. The downside is that it would also impact the eager mode, which does not really require this. |
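As a rough illustration of the stricter barrier suggested above, the sketch below holds every sync-group response until the leader's assignment has arrived and all expected members have synced. The class and method names are hypothetical and the model ignores timeouts, persistence, and member failures:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a strict synchronization barrier for sync_group.
// Not Kafka code: names and behaviour are assumptions for illustration only.
class StrictSyncBarrier {
    private final Set<String> expectedMembers;
    private final Set<String> arrived = new LinkedHashSet<>();
    private boolean assignmentReceived = false;

    StrictSyncBarrier(Collection<String> members) {
        this.expectedMembers = new HashSet<>(members);
    }

    /**
     * Records one sync_group request. Returns the members whose responses
     * may be released now, or an empty list while the barrier is not met.
     */
    List<String> onSyncGroup(String memberId, boolean carriesLeaderAssignment) {
        if (!expectedMembers.contains(memberId)) {
            throw new IllegalArgumentException("Unknown member: " + memberId);
        }
        arrived.add(memberId);
        if (carriesLeaderAssignment) {
            assignmentReceived = true;
        }
        // A loose barrier would release as soon as assignmentReceived is true.
        // The strict variant additionally waits for every member to arrive.
        if (assignmentReceived && arrived.containsAll(expectedMembers)) {
            return new ArrayList<>(arrived);
        }
        return Collections.emptyList();
    }
}
```

With members A1 (leader), A2, and B1, nothing is released when A1 and A2 sync; all three responses are released together once B1 syncs. As noted in the comment, applying this to every group would also delay eager-mode rebalances.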
@aiquestion Any update on this one? |
@dajac Sorry for the delay. I think returning the assignment along with 'REBALANCE_IN_PROGRESS' can be a fix for it. I think we can just update the assignment's generation if the consumer gets a 'REBALANCE_IN_PROGRESS' error in syncGroup.
WDYT? |
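One way to picture this proposal is the consumer-side sketch below. It assumes a hypothetical sync response that can carry an assignment together with the REBALANCE_IN_PROGRESS error; the types and field names are invented for illustration and do not reflect the real consumer protocol or the eventual fix:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical consumer-side sketch: record the assignment and its generation
// even when sync_group fails with REBALANCE_IN_PROGRESS, then rejoin.
// All names are assumptions, not Kafka's actual classes.
class ConsumerSyncSketch {
    enum SyncError { NONE, REBALANCE_IN_PROGRESS }

    static final class SyncResponse {
        final SyncError error;
        final int generation;
        final List<String> assignment; // null when the coordinator sent none

        SyncResponse(SyncError error, int generation, List<String> assignment) {
            this.error = error;
            this.generation = generation;
            this.assignment = assignment;
        }
    }

    Set<String> ownedPartitions = new TreeSet<>();
    int ownedGeneration = -1;
    boolean rejoinNeeded = false;

    void onSyncResponse(SyncResponse response) {
        if (response.assignment != null) {
            // Keep ownedPartitions and their generation up to date so the next
            // join_group reports what this round actually assigned.
            ownedPartitions = new TreeSet<>(response.assignment);
            ownedGeneration = response.generation;
        }
        // The error still forces the member into the next rebalance round.
        rejoinNeeded = (response.error == SyncError.REBALANCE_IN_PROGRESS);
    }
}
```

The sketch leaves out partition revocation callbacks and error handling; it only shows the bookkeeping that the comment describes.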
Fixed by c6ad151. Closing it. |
With the latest trunk code, we found that in cooperative rebalance the consumer will revoke more partitions than expected. Details here: https://issues.apache.org/jira/browse/KAFKA-14016
So I want to start a PR to discuss the fix. Tests will be added later.