KAFKA-5154: Consumer fetches from revoked partitions when SyncGroup fails with disconnection [WIP] #3181

dguy · 2017-05-31T16:50:09Z

Scenario is as follows:

1. Consumer subscribes to topic t1 and begins consuming
2. heartbeat fails as the group is rebalancing
3. ConsumerCoordinator.onJoinGroupPrepare is called
3.1 onPartitionsRevoked is called
4. consumer becomes the group leader
5. sends sync group request
6. sync group is cancelled due to disconnection
7. fetch request is sent for partitions that have previously been revoked

dguy · 2017-05-31T16:55:56Z

@guozhangwang @hachikuji this is not to be merged, but this test fails and is based on the logs I extracted from the corresponding JIRA.
One simple fix for this specific problem is to clear the subscription in ConsumerCoordinator.onJoinPrepare. I think that is probably worth doing, but it possibly masks a bigger problem: once the SyncGroup is disconnected the rebalance is never completed, i.e., the consumer doesn't get any partitions assigned.

asfbot · 2017-05-31T17:18:51Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4638/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-31T17:23:46Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4653/
Test FAILed (JDK 7 and Scala 2.11).

asfbot · 2017-05-31T18:03:58Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4642/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-05-31T18:11:20Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4657/
Test FAILed (JDK 7 and Scala 2.11).

guozhangwang · 2017-06-01T02:24:28Z

Regarding the general issue that this may expose: I think the root cause is that in CoordinatorResponseHandler#onFailure() we only mark the coordinator as dead and do nothing else. For a syncGroupRequest, for example, the caller joinGroupIfNeeded() does not check for the DisconnectException again, and the while condition needRejoin() || rejoinIncomplete() will also pass although the actual state is:

state = MemberState.UNJOINED.
coordinator = null.
rejoinNeeded = false.
joinFuture = null.

I think we should check all four variables above in the while condition instead.
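To make the state above concrete, here is a minimal sketch of how the loop condition evaluates once the SyncGroup has been cancelled. It is a hypothetical stand-in class, not the real AbstractCoordinator: it assumes needRejoin() simply returns rejoinNeeded and rejoinIncomplete() returns joinFuture != null, and widenedCondition() is only one possible reading of "check all four variables".

```java
// Simplified, hypothetical model of the coordinator state listed above.
// Only the four fields named in the comment are mirrored here.
final class RejoinLoopSketch {
    enum MemberState { UNJOINED, REBALANCING, STABLE }

    MemberState state = MemberState.UNJOINED;
    Object coordinator = null;    // stands in for the coordinator node
    boolean rejoinNeeded = true;
    Object joinFuture = null;     // stands in for the in-flight join future

    // Assumed shape of the quoted condition: needRejoin() || rejoinIncomplete().
    boolean currentCondition() {
        return rejoinNeeded || joinFuture != null;
    }

    // One possible widening (an assumption): also keep looping while the member
    // is not stable or the coordinator is unknown.
    boolean widenedCondition() {
        return currentCondition() || state != MemberState.STABLE || coordinator == null;
    }

    // State left behind when JoinGroup succeeded (flag cleared) and the
    // SyncGroup request was then cancelled by a disconnect.
    static RejoinLoopSketch afterFailedSyncGroup() {
        RejoinLoopSketch s = new RejoinLoopSketch();
        s.state = MemberState.UNJOINED;
        s.coordinator = null;
        s.rejoinNeeded = false;
        s.joinFuture = null;
        return s;
    }

    public static void main(String[] args) {
        RejoinLoopSketch s = afterFailedSyncGroup();
        System.out.println("current condition keeps looping: " + s.currentCondition());
        System.out.println("widened condition keeps looping: " + s.widenedCondition());
    }
}
```

Under these assumptions the current condition evaluates to false for the listed state, so nothing drives another join attempt, while the widened condition stays true until the member is stable again.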

hachikuji · 2017-06-01T06:51:04Z

Good catch. I wonder if we are setting rejoinNeeded to false too early. Perhaps that should only come after the SyncGroup returns?

dguy · 2017-06-01T09:52:51Z

@guozhangwang @hachikuji I think I've fixed it by simply calling requestRejoin when the coordinator is marked as dead. The consumer will now attempt to rejoin the group. The test passes and everything appears to work as expected.

asfbot · 2017-06-01T09:54:01Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4702/
Test FAILed (JDK 8 and Scala 2.12).

asfbot · 2017-06-01T09:54:50Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4718/
Test FAILed (JDK 7 and Scala 2.11).

guozhangwang · 2017-06-01T16:11:44Z

Seems one fix missing:

scala2.11/clients/src/test/java/org/apache/kafka/clients/consumer/KafkaConsumerTest.java:1271: error: no suitable constructor found for Metadata(int,long)
        Metadata metadata = new Metadata(0, Long.MAX_VALUE);

guozhangwang · 2017-06-01T16:18:15Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

                coordinatorDead();
+                requestRejoin();


I'm not sure we should call requestRejoin in the base class (CoordinatorResponseHandler). For example, HeartbeatResponseHandler also extends it, but if that request gets a disconnect we should just mark the coordinator as dead in order to rediscover it; once the new coordinator is discovered, we retry the heartbeat request and, if that succeeds, proceed as normal. Calling it here will force a heartbeat disconnection to also trigger a join group.

I think @hachikuji's suggestion may be better: do not call

AbstractCoordinator.this.rejoinNeeded = false;

in JoinGroupResponseHandler#handle(), but in SyncGroupResponseHandler#handle().
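A rough sketch of the difference between the two placements, using a hypothetical stand-in class rather than the real handlers. The method names onJoinGroupSuccess, onSyncGroupSuccess and onDisconnect are illustrative only, and the disconnect is assumed to do nothing beyond marking the coordinator as unknown.

```java
// Hypothetical model: where rejoinNeeded is cleared decides whether a
// disconnected SyncGroup still leaves the member wanting to rejoin.
final class RejoinFlagPlacementSketch {
    enum ClearOn { JOIN_GROUP_SUCCESS, SYNC_GROUP_SUCCESS }

    private final ClearOn clearOn;
    boolean rejoinNeeded = true;
    boolean coordinatorKnown = true;

    RejoinFlagPlacementSketch(ClearOn clearOn) {
        this.clearOn = clearOn;
    }

    void onJoinGroupSuccess() {
        if (clearOn == ClearOn.JOIN_GROUP_SUCCESS) rejoinNeeded = false;
    }

    void onSyncGroupSuccess() {
        if (clearOn == ClearOn.SYNC_GROUP_SUCCESS) rejoinNeeded = false;
    }

    // A disconnect only marks the coordinator as dead; it never sets
    // rejoinNeeded, so a heartbeat disconnect does not force a rebalance.
    void onDisconnect() {
        coordinatorKnown = false;
    }

    // JoinGroup succeeds, then SyncGroup is cancelled by a disconnect.
    static boolean rejoinNeededAfterSyncGroupDisconnect(ClearOn clearOn) {
        RejoinFlagPlacementSketch c = new RejoinFlagPlacementSketch(clearOn);
        c.onJoinGroupSuccess();
        c.onDisconnect();
        return c.rejoinNeeded;
    }

    // Full rebalance completes, then a later heartbeat is disconnected.
    static boolean rejoinNeededAfterHeartbeatDisconnect(ClearOn clearOn) {
        RejoinFlagPlacementSketch c = new RejoinFlagPlacementSketch(clearOn);
        c.onJoinGroupSuccess();
        c.onSyncGroupSuccess();
        c.onDisconnect();
        return c.rejoinNeeded;
    }

    public static void main(String[] args) {
        for (ClearOn placement : ClearOn.values()) {
            System.out.println(placement
                + ": rejoin after SyncGroup disconnect = "
                + rejoinNeededAfterSyncGroupDisconnect(placement)
                + ", rejoin after heartbeat disconnect = "
                + rejoinNeededAfterHeartbeatDisconnect(placement));
        }
    }
}
```

In this model, clearing the flag only on SyncGroup success keeps rejoinNeeded true after a cancelled SyncGroup, and a heartbeat disconnect on a stable member still does not trigger a join group.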

asfbot · 2017-06-01T18:04:59Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/4717/
Test PASSed (JDK 8 and Scala 2.12).

asfbot · 2017-06-01T18:16:59Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/4733/
Test PASSed (JDK 7 and Scala 2.11).

guozhangwang · 2017-06-01T18:57:52Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

@@ -537,6 +536,7 @@ public void handle(SyncGroupResponse syncResponse,
            if (error == Errors.NONE) {
                sensors.syncLatency.record(response.requestLatencyMs());
                future.complete(syncResponse.memberAssignment());
+                AbstractCoordinator.this.rejoinNeeded = false;


On second thought... future.complete(syncResponse.memberAssignment()) above will trigger the onSuccess of the listener added via joinFuture.addListener, which will enable the heartbeat thread right away, and hence there is a (very small) race condition.

I think it is safer to just move the above line inside onSuccess (line 395) so that it is set before enabling the heartbeat thread; we would then not need the AbstractCoordinator.this prefix either.

@dguy @hachikuji if it sounds good to you I can go ahead and make this change while merging.
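The ordering concern can be illustrated with a single-threaded sketch. Everything here is hypothetical: enableHeartbeat stands in for enabling the heartbeat thread and just records the flag value it would observe, and running the Runnable stands in for future.complete(...) firing the onSuccess listener. The real race involves a separate thread, which this deliberately does not model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two orderings of "clear rejoinNeeded" versus
// "enable heartbeat" discussed above.
final class OnSuccessOrderingSketch {
    boolean rejoinNeeded = true;
    final List<String> events = new ArrayList<>();

    // Stand-in for enabling the heartbeat thread; records what it would see.
    private void enableHeartbeat() {
        events.add("heartbeat enabled, rejoinNeeded=" + rejoinNeeded);
    }

    // Ordering in the diff above: completing the future runs onSuccess (which
    // enables the heartbeat), and the flag is cleared only afterwards.
    void clearAfterComplete() {
        Runnable onSuccess = this::enableHeartbeat;
        onSuccess.run();        // models future.complete(...) firing the listener
        rejoinNeeded = false;   // cleared after the heartbeat is already enabled
    }

    // Suggested ordering: clear the flag inside onSuccess, before the
    // heartbeat is enabled.
    void clearInsideOnSuccess() {
        Runnable onSuccess = () -> {
            rejoinNeeded = false;
            enableHeartbeat();
        };
        onSuccess.run();
    }

    public static void main(String[] args) {
        OnSuccessOrderingSketch before = new OnSuccessOrderingSketch();
        before.clearAfterComplete();
        System.out.println(before.events);

        OnSuccessOrderingSketch after = new OnSuccessOrderingSketch();
        after.clearInsideOnSuccess();
        System.out.println(after.events);
    }
}
```

In the first ordering the heartbeat stand-in observes rejoinNeeded=true; in the second it observes rejoinNeeded=false, which is the window the suggested move is meant to close.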

…Group response handler

Scenario is as follows:
1. Consumer subscribes to topic t1 and begins consuming
2. heartbeat fails as the group is rebalancing
3. ConsumerCoordinator.onJoinGroupPrepare is called
3.1 onPartitionsRevoked is called
4. consumer becomes the group leader
5. sends sync group request
6. sync group is cancelled due to disconnection
7. fetch request is sent for partitions that have previously been revoked

Author: Damian Guy <damian.guy@gmail.com>

Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

Closes #3181 from dguy/kafka-5154

guozhangwang · 2017-06-01T20:33:29Z

Merged to trunk and cherry-picked to 0.11.0.

just a test for discussion

ddeb16f

dguy force-pushed the kafka-5154 branch from 8feaa7e to ddeb16f Compare May 31, 2017 17:40

requestRejoin when coordinator is marked as dead

5a06671

guozhangwang reviewed Jun 1, 2017

dguy added 2 commits June 1, 2017 18:00

Merge branch 'trunk' into kafka-5154

46938a8

set rejoinNeeded to false in SyncGroupResponsHandler#handle

9237cee

guozhangwang reviewed Jun 1, 2017

asfgit closed this in 1b16aca Jun 1, 2017

dguy deleted the kafka-5154 branch August 16, 2017 13:23
