KAFKA-2555: Infinite recursive function call when call commitSync in … #221
Conversation
kafka-trunk-git-pr #443 FAILURE
@becketqin I thought about this before, but the implications of removing the partition assignment check were not very clear to me. I think there are basically three cases to consider:
For (1) and (2), this change appears to work fine. The coordinator should accept the commits and we will have a chance to rebalance the next time poll() is invoked. However, for (3), I'm not sure we'll recover correctly, since the commits will always return either UNKNOWN_CONSUMER or ILLEGAL_GENERATION, both of which are designated retriable. The loop will spin forever since it no longer has a way to refresh the generation.

One of the problems is that the current "retriable" notion is pretty limited. In fact, those errors are only retriable in the context of a JoinGroup; for everything else, they are not retriable. My suggestion would be to make them non-retriable, and then add special checks for them in the join-group loop. This would mean that illegal-generation errors in commitSync would get propagated to the user, which is probably a lot better than trying to force through the old commit on a new generation. What do you think?

Also, I agree that the use of illegal generation is a little confusing. I would rather reserve that error for when the generation is officially superseded. Instead of using illegal generation in the heartbeat response to indicate a need to rebalance, maybe we can have a REJOIN_NEEDED error or something like that.
@hachikuji Thanks for the explanation. Case (3) is actually a failure case; the consumer has already been kicked out of the group. Letting an offset commit trigger a rebalance and proceed might cause issues. For example:
Personally I prefer to have the rebalance triggered by IllegalGenerationID happen only in the heartbeat. In other cases, it is actually an error the user needs to be aware of; throwing an exception is probably better than swallowing the error. I completely agree about the ambiguity of IllegalGenerationIdException. I have created a ticket, KAFKA-2557, and Onur has a PR for it.
@becketqin I think we agree that commitSync should not rebalance. However, in order to handle case (3), we need to make IllegalGenerationException and UnknownConsumerIdException not extend RetriableException. Otherwise, commitSync will be stuck in the loop (since it will keep retrying) and will never propagate the error to the user. Does that make sense?
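The retry trap described here can be sketched with a minimal stand-alone model; the class and method names below are illustrative stand-ins, not the actual Kafka client code:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in exception types modeling the retriable vs. fatal distinction.
class RetriableException extends RuntimeException {
    RetriableException(String msg) { super(msg); }
}

class IllegalGenerationException extends RuntimeException { // deliberately NOT retriable
    IllegalGenerationException(String msg) { super(msg); }
}

public class CommitRetryModel {
    // Models commitSync's loop: retriable errors are retried, anything else
    // is propagated to the caller. A null response means the commit succeeded.
    static int commitWithRetries(Iterator<RuntimeException> responses, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            RuntimeException error = responses.hasNext() ? responses.next() : null;
            if (error == null)
                return attempt;      // success
            if (!(error instanceof RetriableException))
                throw error;         // fatal: surfaces to the user immediately
            // retriable: fall through and try again
        }
        throw new RetriableException("retries exhausted");
    }

    public static void main(String[] args) {
        // Two transient failures, then success: three attempts in total.
        List<RuntimeException> transients = Arrays.<RuntimeException>asList(
                new RetriableException("coordinator not available"),
                new RetriableException("coordinator not available"),
                null);
        System.out.println(commitWithRetries(transients.iterator(), 10));

        // Because IllegalGenerationException is not retriable here, it is
        // thrown on the first attempt instead of spinning forever.
        try {
            commitWithRetries(Arrays.<RuntimeException>asList(
                    new IllegalGenerationException("generation superseded")).iterator(), 10);
        } catch (IllegalGenerationException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

If the exception were left retriable, the loop would never reach the throw and would spin until the retries were exhausted, which is the infinite-loop hazard discussed above.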
@hachikuji Yes, that makes sense. Good catch about the infinite while loop.
@becketqin Looks like you might need a rebase (sorry!), but the REBALANCE_IN_PROGRESS patch was merged, so you should be good to go. I think @guozhangwang maybe can help check in the fix when it's ready since he was also running into this problem.
@hachikuji I will need to change both UnknownConsumerException and IllegalGenerationIdException to be non-retriable. There are two scenarios:
Today a consumer is not able to distinguish between those two scenarios. But I think moving forward, we should persist the group metadata for coordinator failover. The coordinator failover should either throw a GroupMetadataLoadingInProgressException or hold the offset commit request until the failover finishes. If we follow that path, then making UnknownConsumerIdException and IllegalGenerationIdException non-retriable is reasonable. Today, making both of them non-retriable means that we expose the coordinator failover to the users in (2), so the user needs to rejoin the group and retry the commit with the correct offset. It is a little bit ugly but not too big a deal. Do you have any concerns about this?
@becketqin That's a good point about the coordinator failure case. Unless we have persistence, the new coordinator will have no choice but to reject the commits, and the only thing the client can do is propagate the error to the user. This pretty much guarantees that broker failures (or even clean broker shutdowns) will generally lead to duplicate consumption, which is unfortunate. Do you guys think this problem is serious enough to push for persistence in the initial release? Either way, it seems like making IllegalGeneration and UnknownConsumer non-retriable is the right thing to do since rebalancing in commit is not really an option.
Force-pushed from 77b6a59 to 72fb2a8
@hachikuji I don't have a strong opinion on whether we should include it in the initial release or not. It is ugly but not a disaster; users need to handle duplicates anyway. But we have to educate users on what to do when they receive those two exceptions.
@guozhangwang Could you help take a look at this patch? It seems both of us are blocked on this problem now.
kafka-trunk-git-pr #531 FAILURE
@becketqin Since we made those exceptions non-retriable, I think we also have to fix Coordinator.reassignPartitions() so that we don't propagate UnknownConsumerException when doing the actual rebalance. It's ugly, but something like this would probably work:

```java
RequestFuture<Void> future = sendJoinGroupRequest();
client.poll(future);

if (future.failed()) {
    if (future.exception() instanceof UnknownConsumerException)
        continue;
    else if (future.isRetriable())
        Utils.sleep(retryBackoffMs);
    else
        throw future.exception();
}
```
@hachikuji Good catch. Yes, it is a bit ugly to call them out and retry, but I think this is the right thing to do. Retriable means always retry, but the correct handling for these exceptions depends on the context, so we cannot put them into the retriable category.
kafka-trunk-git-pr #540 SUCCESS
```
@@ -197,7 +198,10 @@ private void reassignPartitions() {
        client.poll(future);

        if (future.failed()) {
            if (!future.isRetriable())
                if (future.exception() instanceof UnknownConsumerIdException
                        || future.exception() instanceof IllegalGenerationException)
```
Nit: I don't think illegal generation is possible in JoinGroup.
kafka-trunk-git-pr #541 FAILURE
LGTM |
Thanks for the discussion @hachikuji @becketqin, this is very helpful. A few thoughts:
a) It does not yet know that this group belongs to itself: hence it returns NOT_COORDINATOR_FOR_CONSUMER, which is retriable.
b) It has been notified by the controller and gone through becomeLeader, but the group is not created yet: hence it just blindly commits the offset, since it thinks this consumer does not use Kafka for membership.
c) The group is finally created, but the consumer id is unknown or the generation id is outdated: hence it will return UNKNOWN_CONSUMER or ILLEGAL_GENERATION, which is now non-retriable.

For me, the current case b) is really bad, as it will accept any commits, and this phase will only end when someone sends a JoinGroup; phase c) will now possibly result in consumption duplicates. So I feel
@guozhangwang Thanks for the comments. It seems like the problem in phase b) is that the coordinator blindly commits the offsets. Does it have to? Couldn't we have the coordinator check whether the generation is -1 (the default generation) in order to tell whether the consumer is using group membership? If the generation is greater than 0, then we could just return ILLEGAL_GENERATION. This obviously wouldn't prevent duplicates, but at least it would prevent the weird state where commits from old generations are accepted. That being said, I'm starting to think that persistence ought to be part of the first release. If not, then we should at least consider whether there's a way to prevent duplicates on clean broker shutdown, since that is probably the most common case where the coordinator would need to fail over.
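The generation check suggested here can be sketched as a small stand-alone rule. The names and the enum below are illustrative, not the actual broker implementation; the key idea is that a request generation of -1 marks a consumer that does not use Kafka-based membership:

```java
// Sketch of the proposed coordinator-side check. A generation of -1 means the
// consumer does not use Kafka-based group membership, so its commit is
// accepted without a membership check. Any positive generation must match the
// coordinator's current generation for the group; after a fail-over the new
// coordinator has no state yet (modeled as null) and rejects the commit
// instead of blindly accepting it.
public class OffsetCommitCheck {
    enum Result { ACCEPT, ILLEGAL_GENERATION }

    static Result validateCommit(int requestGeneration, Integer currentGeneration) {
        if (requestGeneration < 0)
            return Result.ACCEPT;                 // simple consumer: no membership
        if (currentGeneration == null || requestGeneration != currentGeneration)
            return Result.ILLEGAL_GENERATION;     // unknown group or stale generation
        return Result.ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(validateCommit(-1, null)); // ACCEPT: simple consumer
        System.out.println(validateCommit(5, null));  // ILLEGAL_GENERATION: no state yet
        System.out.println(validateCommit(5, 5));     // ACCEPT: generations match
    }
}
```

This closes the "phase b)" hole above: a commit carrying a real generation can no longer be blindly accepted by a coordinator that has not yet loaded the group.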
@hachikuji Good point, I think we can do that.
Had some offline discussion with @hachikuji and @junrao. I think we can check in this patch as-is, and there are a few follow-up work items:
LGTM. |
@hachikuji @ewencp I found this problem when adding the new consumer to mirror maker, which commits offsets in the rebalance callback. It is not clear to me why we are triggering a rebalance for commitSync() and fetchCommittedOffset(). Can you help review this to see if I missed something?
Regarding commitSync: after each poll(), the partitions will either be assigned to the consumer or already revoked. As long as the user is using the internal offset map, the offset map will always be valid; i.e., it will only contain the assigned partitions when commitSync is called. Hence there is no need to trigger a rebalance in commitSync().
The same guarantee also applies to fetchCommittedOffset(); isn't the only requirement to ensure we know the coordinator?
Another related issue is that today the IllegalGenerationIdException is a bit confusing. When we receive an IllegalGenerationIdException from a heartbeat, we need to use that same generation ID to commit offsets, and the coordinator will accept it. So the generation ID was not really illegal. I will file a ticket for this issue.
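The invariant described above (the internal offset map only ever contains currently-assigned partitions) can be modeled with a small stand-alone sketch; the names are illustrative, not the actual consumer internals:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: consumed offsets are tracked per partition and pruned whenever
// the assignment shrinks, so a commit issued from the rebalance callback can
// never reference a partition the consumer no longer owns.
public class OffsetMapModel {
    private final Map<String, Long> consumed = new HashMap<>();
    private Set<String> assignment = new HashSet<>();

    void rebalance(Set<String> newAssignment) {
        // Offsets for revoked partitions drop out of the map, keeping the
        // commit set a subset of the current assignment.
        consumed.keySet().retainAll(newAssignment);
        assignment = newAssignment;
    }

    void record(String partition, long offset) {
        if (!assignment.contains(partition))
            throw new IllegalStateException("not assigned: " + partition);
        consumed.put(partition, offset);
    }

    Map<String, Long> offsetsToCommit() {
        return new HashMap<>(consumed); // always a subset of the assignment
    }
}
```

Under this model, a commitSync() driven off the internal map has no way to submit offsets for revoked partitions, which is why no rebalance check is needed inside the commit path itself.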