KAFKA-5154: Consumer fetches from revoked partitions when SyncGroup fails with disconnection [WIP] #3181
@guozhangwang @hachikuji this is not meant to be merged, but this test fails and is based on the logs I extracted from the corresponding JIRA.
Refer to this link for build results (access rights to CI server needed): |
Regarding the general issue that it may get exposed: I think the root cause is in
I think we should check all four variables above in the while condition instead.
Good catch. I wonder if we are setting
@guozhangwang @hachikuji I think I've fixed it by simply calling
Seems one fix is missing:
    coordinatorDead();
    requestRejoin();
I'm not sure we should call requestRejoin in the base class (CoordinatorResponseHandler). For example, HeartbeatResponseHandler also extends it, but if a heartbeat request gets a disconnect we should just mark the coordinator as dead so that it is rediscovered; once the new coordinator is found, we retry the heartbeat request and, if that succeeds, proceed as normal. Calling requestRejoin here would force a heartbeat disconnection to trigger a join group as well.
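The concern above can be pictured with a small toy model. This is only a sketch under assumptions: the classes ResponseHandler, HeartbeatHandler, SyncGroupHandler and Coordinator are hypothetical stand-ins that borrow names from this thread, not Kafka's actual classes, and it shows one possible split of responsibilities rather than the fix that was merged.

```java
// Toy model of the concern above; every class and method here is hypothetical
// and only mirrors the names used in this thread, not Kafka's real code.
public class DisconnectHandlingSketch {

    static class Coordinator {
        boolean coordinatorKnown = true;
        boolean rejoinNeeded = false;

        void coordinatorDead() { coordinatorKnown = false; }
        void requestRejoin() { rejoinNeeded = true; }
    }

    // Shared base: a disconnect only marks the coordinator as dead so that
    // it gets rediscovered. It does not force a rebalance.
    static abstract class ResponseHandler {
        final Coordinator coordinator;

        ResponseHandler(Coordinator coordinator) { this.coordinator = coordinator; }

        void onDisconnect() { coordinator.coordinatorDead(); }
    }

    // Heartbeat keeps the base behaviour: rediscover, retry, carry on.
    static class HeartbeatHandler extends ResponseHandler {
        HeartbeatHandler(Coordinator coordinator) { super(coordinator); }
    }

    // The SyncGroup response was lost, so this handler also asks for a rejoin.
    static class SyncGroupHandler extends ResponseHandler {
        SyncGroupHandler(Coordinator coordinator) { super(coordinator); }

        @Override
        void onDisconnect() {
            super.onDisconnect();
            coordinator.requestRejoin();
        }
    }

    public static void main(String[] args) {
        Coordinator afterHeartbeat = new Coordinator();
        new HeartbeatHandler(afterHeartbeat).onDisconnect();
        System.out.println("heartbeat disconnect, rejoinNeeded=" + afterHeartbeat.rejoinNeeded);

        Coordinator afterSync = new Coordinator();
        new SyncGroupHandler(afterSync).onDisconnect();
        System.out.println("sync disconnect, rejoinNeeded=" + afterSync.rejoinNeeded);
    }
}
```

In this model a heartbeat disconnect leaves rejoinNeeded untouched, while a SyncGroup disconnect sets it, which is the distinction the comment is asking for.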
I think @hachikuji's suggestion may be better: do not set AbstractCoordinator.this.rejoinNeeded = false; in JoinGroupResponseHandler#handle(), but in SyncGroupResponseHandler#handle().
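To see why the place where the flag is cleared matters, here is a minimal sketch. It assumes a single boolean rejoinNeeded and a rebalance made of one successful JoinGroup followed by a SyncGroup that either succeeds or is lost to a disconnect. The ClearAt enum and the method name are hypothetical and exist only for this illustration.

```java
// Minimal state model of a rebalance; ClearAt and the method below are
// hypothetical names used only to compare the two placements of the reset.
public class RejoinFlagSketch {

    enum ClearAt { JOIN_GROUP, SYNC_GROUP }

    // Returns the value of rejoinNeeded after a JoinGroup that succeeds,
    // followed by a SyncGroup that succeeds or is cancelled by a disconnect.
    static boolean rejoinNeededAfterRebalance(ClearAt clearAt, boolean syncSucceeds) {
        boolean rejoinNeeded = true; // a rebalance is in progress

        // The JoinGroup response arrives successfully.
        if (clearAt == ClearAt.JOIN_GROUP) {
            rejoinNeeded = false;
        }

        // The SyncGroup response arrives, or never does because of a disconnect.
        if (syncSucceeds && clearAt == ClearAt.SYNC_GROUP) {
            rejoinNeeded = false;
        }

        return rejoinNeeded;
    }

    public static void main(String[] args) {
        for (ClearAt clearAt : ClearAt.values()) {
            for (boolean syncSucceeds : new boolean[] {true, false}) {
                System.out.println("clear at " + clearAt
                        + ", sync succeeds=" + syncSucceeds
                        + " -> rejoinNeeded="
                        + rejoinNeededAfterRebalance(clearAt, syncSucceeds));
            }
        }
    }
}
```

When the flag is cleared at JoinGroup and the SyncGroup is lost, the model ends with rejoinNeeded == false even though no assignment was received. That matches the reported symptom of fetching from partitions that were already revoked.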
@@ -537,6 +536,7 @@ public void handle(SyncGroupResponse syncResponse,
     if (error == Errors.NONE) {
         sensors.syncLatency.record(response.requestLatencyMs());
         future.complete(syncResponse.memberAssignment());
+        AbstractCoordinator.this.rejoinNeeded = false;
On second thought... future.complete(syncResponse.memberAssignment()) above will trigger the onSuccess callback registered via joinFuture.addListener, which enables the heartbeat thread right away, and hence there is a (very small) race condition.
I think it is safer to move the above line inside onSuccess (line 395) so that the flag is set before the heartbeat thread is enabled; we would then not need the AbstractCoordinator.this prefix either.
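The ordering being proposed can be sketched as follows. This is an illustration under assumptions, not the merged code: the class, its fields and the events list are hypothetical, and a synchronized method stands in for whatever locking the real coordinator uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the joinFuture listener discussed above.
public class OnSuccessOrderingSketch {
    private boolean rejoinNeeded = true;
    private boolean heartbeatEnabled = false;

    // Records the order of state changes so the ordering can be inspected.
    final List<String> events = new ArrayList<>();

    // Clear the flag first, then enable the heartbeat thread, so the
    // heartbeat thread can never observe "enabled" with a stale flag.
    synchronized void onSuccess() {
        rejoinNeeded = false;
        events.add("rejoinNeeded=false");
        heartbeatEnabled = true;
        events.add("heartbeat enabled");
    }

    // What a heartbeat thread would see when it checks the shared state.
    synchronized boolean heartbeatSeesStableMember() {
        return heartbeatEnabled && !rejoinNeeded;
    }

    public static void main(String[] args) {
        OnSuccessOrderingSketch sketch = new OnSuccessOrderingSketch();
        sketch.onSuccess();
        System.out.println(sketch.events);
        System.out.println("stable member: " + sketch.heartbeatSeesStableMember());
    }
}
```

Because the flag lives in the same class as the callback in this sketch, no outer-class qualifier is needed, which mirrors the remark about dropping the AbstractCoordinator.this prefix.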
@dguy @hachikuji if it sounds good to you I can go ahead and make this change while merging.
LGTM
LGTM
…Group response handler

Scenario is as follows:
1. Consumer subscribes to topic t1 and begins consuming
2. heartbeat fails as the group is rebalancing
3. ConsumerCoordinator.onJoinGroupPrepare is called
3.1 onPartitionsRevoked is called
4. consumer becomes the group leader
5. sends sync group request
6. sync group is cancelled due to disconnection
7. fetch request is sent for partitions that have previously been revoked

Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

Closes #3181 from dguy/kafka-5154
Merged to trunk and cherry-picked to 0.11.0.