KAFKA-2555: Infinite recursive function call when call commitSync in … #221
Conversation
kafka-trunk-git-pr #443 FAILURE
@becketqin I thought about this before, but the implications of removing the partition assignment check were not very clear to me. I think there are basically three cases to consider:
For (1) and (2), this change appears to work fine. The coordinator should accept the commits and we will have a chance to rebalance the next time poll() is invoked. However, for (3), I'm not sure we'll recover correctly, since the commits will always return either UNKNOWN_CONSUMER or ILLEGAL_GENERATION, both of which are designated retriable. The loop will spin forever since it no longer has a way to refresh the generation.

One of the problems is that the current "retriable" notion is pretty limited. In fact, those errors are only retriable in the context of a JoinGroup; for everything else, they are not retriable. My suggestion would be to make them non-retriable, and then add special checks for them in the join-group loop. This would mean that illegal-generation errors in commitSync would get propagated to the user, which is probably a lot better than trying to force through the old commit on a new generation. What do you think?

Also, I agree that the use of illegal generation is a little confusing. I would rather reserve that error for when the generation is officially superseded. Instead of using illegal generation in the heartbeat response to indicate a need to rebalance, maybe we can have a REJOIN_NEEDED error or something like that.
@hachikuji Thanks for the explanation. Case (3) is actually a failure case; the consumer has already been kicked out of the group. Letting an offset commit trigger a rebalance and proceed might cause issues. For example:
Personally I prefer to have the rebalance triggered by IllegalGenerationID happen only in the heartbeat. In other cases, it is actually an error the user needs to be aware of; throwing an exception is probably better than swallowing the error. I completely agree about the ambiguity of IllegalGenerationIdException. I have created a ticket, KAFKA-2557, and Onur has a PR for it.
@becketqin I think we agree that commitSync should not rebalance. However, in order to handle case (3), we need to make IllegalGenerationException and UnknownConsumerIdException not extend RetriableException. Otherwise, commitSync will be stuck in the loop (since it will keep retrying) and will never propagate the error to the user. Does that make sense?
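The retry trap described here can be sketched with a minimal stand-alone model; the class and method names below are illustrative stand-ins, not the actual Kafka client code:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in exception types modeling the retriable vs. fatal distinction.
class RetriableException extends RuntimeException {
    RetriableException(String msg) { super(msg); }
}

class IllegalGenerationException extends RuntimeException { // deliberately NOT retriable
    IllegalGenerationException(String msg) { super(msg); }
}

public class CommitRetryModel {
    // Models commitSync's loop: retriable errors are retried, anything else
    // is propagated to the caller. A null response means the commit succeeded.
    static int commitWithRetries(Iterator<RuntimeException> responses, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            RuntimeException error = responses.hasNext() ? responses.next() : null;
            if (error == null)
                return attempt;      // success
            if (!(error instanceof RetriableException))
                throw error;         // fatal: surfaces to the user immediately
            // retriable: fall through and try again
        }
        throw new RetriableException("retries exhausted");
    }

    public static void main(String[] args) {
        // Two transient failures, then success: three attempts in total.
        List<RuntimeException> transients = Arrays.<RuntimeException>asList(
                new RetriableException("coordinator not available"),
                new RetriableException("coordinator not available"),
                null);
        System.out.println(commitWithRetries(transients.iterator(), 10));

        // Because IllegalGenerationException is not retriable here, it is
        // thrown on the first attempt instead of spinning forever.
        try {
            commitWithRetries(Arrays.<RuntimeException>asList(
                    new IllegalGenerationException("generation superseded")).iterator(), 10);
        } catch (IllegalGenerationException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

If the exception were left retriable, the loop would never reach the throw and would spin until the retries were exhausted, which is the infinite-loop hazard discussed above.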
@hachikuji Yes, that makes sense. Good catch about the infinite while loop.
@becketqin Looks like you might need a rebase (sorry!), but the REBALANCE_IN_PROGRESS patch was merged, so you should be good to go. I think @guozhangwang maybe can help check in the fix when it's ready since he was also running into this problem.
@hachikuji I will need to change both UnknownConsumerException and IllegalGenerationIdException to be non-retriable. There are two scenarios:
Today a consumer is not able to distinguish between those two scenarios. But I think moving forward, we should persist the group metadata for coordinator failover. The coordinator failover should either throw a GroupMetadataLoadingInProgressException or hold the offset commit request until the failover finishes. If we follow that path, then making UnknownConsumerIdException and IllegalGenerationIdException non-retriable is reasonable. Today, making both of them non-retriable means that we expose the coordinator failover to the users in (2), so the user needs to rejoin the group and retry the commit with the correct offset. It is a little bit ugly but not too big a deal. Do you have any concerns about this?
@becketqin That's a good point about the coordinator failure case. Unless we have persistence, the new coordinator will have no choice but to reject the commits, and the only thing the client can do is propagate the error to the user. This pretty much guarantees that broker failures (or even clean broker shutdowns) will generally lead to duplicate consumption, which is unfortunate. Do you guys think this problem is serious enough to push for persistence in the initial release? Either way, it seems like making IllegalGeneration and UnknownConsumer non-retriable is the right thing to do since rebalancing in commit is not really an option.
Force-pushed from 77b6a59 to 72fb2a8
@hachikuji I don't have a strong opinion on whether we should include it in the initial release or not. It is ugly but not a disaster; users need to handle duplicates anyway. But we have to educate users on what to do when they receive those two exceptions.
@guozhangwang Could you help take a look at this patch? It seems both of us are blocked on this problem now.
kafka-trunk-git-pr #531 FAILURE
@becketqin Since we made those exceptions non-retriable, I think we also have to fix Coordinator.reassignPartitions() so that we don't propagate UnknownConsumerException when doing the actual rebalance. It's ugly, but something like this would probably work:

```java
RequestFuture<Void> future = sendJoinGroupRequest();
client.poll(future);

if (future.failed()) {
    if (future.exception() instanceof UnknownConsumerException)
        continue;
    else if (future.isRetriable())
        Utils.sleep(retryBackoffMs);
    else
        throw future.exception();
}
```
@hachikuji Good catch. Yes, it is a bit ugly to call them out and retry, but I think this is the right thing to do. Retriable means always retry, but the correct handling for these exceptions depends on the context, so we cannot put them into the retriable category.
kafka-trunk-git-pr #540 SUCCESS
```
@@ -197,7 +198,10 @@ private void reassignPartitions() {
        client.poll(future);

        if (future.failed()) {
            if (!future.isRetriable())
                if (future.exception() instanceof UnknownConsumerIdException
                        || future.exception() instanceof IllegalGenerationException)
```
Nit: I don't think illegal generation is possible in JoinGroup.
kafka-trunk-git-pr #541 FAILURE
LGTM |
Thanks for the discussion @hachikuji @becketqin, this is very helpful. A few thoughts:
a) It does not yet know that this group belongs to itself: hence it returns NOT_COORDINATOR_FOR_CONSUMER, which is retriable.
b) It has been notified by the controller and gone through becomeLeader, but the group is not created yet: hence it just blindly commits the offset, since it thinks this consumer does not use Kafka for membership.
c) The group is finally created, but the consumer id is unknown or the generation id is outdated: hence it will return UNKNOWN_CONSUMER or ILLEGAL_GENERATION, which is now non-retriable.

For me, the current case b) is really bad, as it will accept any commits, and this phase will only end when someone sends a JoinGroup; phase c) will now possibly result in consumption duplicates. So I feel
@guozhangwang Thanks for the comments. It seems like the problem in phase b) is that the coordinator blindly commits the offsets. Does it have to? Couldn't we have the coordinator check whether the generation is -1 (the default generation) in order to tell whether the consumer is using group membership? If the generation is greater than 0, then we could just return ILLEGAL_GENERATION. This obviously wouldn't prevent duplicates, but at least it would prevent the weird state where commits from old generations are accepted. That being said, I'm starting to think that persistence ought to be part of the first release. If not, then we should at least consider whether there's a way to prevent duplicates on clean broker shutdown, since that is probably the most common case where the coordinator would need to fail over.
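The generation check suggested here can be sketched as a small stand-alone rule. The names and the enum below are illustrative, not the actual broker implementation; the key idea is that a request generation of -1 marks a consumer that does not use Kafka-based membership:

```java
// Sketch of the proposed coordinator-side check. A generation of -1 means the
// consumer does not use Kafka-based group membership, so its commit is
// accepted without a membership check. Any positive generation must match the
// coordinator's current generation for the group; after a fail-over the new
// coordinator has no state yet (modeled as null) and rejects the commit
// instead of blindly accepting it.
public class OffsetCommitCheck {
    enum Result { ACCEPT, ILLEGAL_GENERATION }

    static Result validateCommit(int requestGeneration, Integer currentGeneration) {
        if (requestGeneration < 0)
            return Result.ACCEPT;                 // simple consumer: no membership
        if (currentGeneration == null || requestGeneration != currentGeneration)
            return Result.ILLEGAL_GENERATION;     // unknown group or stale generation
        return Result.ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(validateCommit(-1, null)); // ACCEPT: simple consumer
        System.out.println(validateCommit(5, null));  // ILLEGAL_GENERATION: no state yet
        System.out.println(validateCommit(5, 5));     // ACCEPT: generations match
    }
}
```

This closes the "phase b)" hole above: a commit carrying a real generation can no longer be blindly accepted by a coordinator that has not yet loaded the group.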
@hachikuji Good point, I think we can do that.
Had some offline discussion with @hachikuji and @junrao. I think we can check in this patch as-is, and there are a few follow-up work items:
LGTM. |
@hachikuji @ewencp I found this problem when adding the new consumer to mirror maker, which commits offsets in the rebalance callback. It is not clear to me why we are triggering a rebalance for commitSync() and fetchCommittedOffset(). Can you help review this to see if I missed something?
Regarding commitSync: after each poll(), the partitions will either be assigned to the consumer or already revoked. As long as the user is using the internal offset map, the offset map will always be valid; i.e., it will only contain the assigned partitions when commitSync is called. Hence there is no need to trigger a rebalance in commitSync().
The same guarantee also applies to fetchCommittedOffset(); isn't the only requirement to ensure we know the coordinator?
Another related issue is that today the IllegalGenerationIdException is a bit confusing. When we receive an IllegalGenerationIdException from a heartbeat, we need to use that same generation ID to commit offsets, and the coordinator will accept it. So the generation ID was not really illegal. I will file a ticket for this issue.
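The invariant described above (the internal offset map only ever contains currently-assigned partitions) can be modeled with a small stand-alone sketch; the names are illustrative, not the actual consumer internals:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: consumed offsets are tracked per partition and pruned whenever
// the assignment shrinks, so a commit issued from the rebalance callback can
// never reference a partition the consumer no longer owns.
public class OffsetMapModel {
    private final Map<String, Long> consumed = new HashMap<>();
    private Set<String> assignment = new HashSet<>();

    void rebalance(Set<String> newAssignment) {
        // Offsets for revoked partitions drop out of the map, keeping the
        // commit set a subset of the current assignment.
        consumed.keySet().retainAll(newAssignment);
        assignment = newAssignment;
    }

    void record(String partition, long offset) {
        if (!assignment.contains(partition))
            throw new IllegalStateException("not assigned: " + partition);
        consumed.put(partition, offset);
    }

    Map<String, Long> offsetsToCommit() {
        return new HashMap<>(consumed); // always a subset of the assignment
    }
}
```

Under this model, a commitSync() driven off the internal map has no way to submit offsets for revoked partitions, which is why no rebalance check is needed inside the commit path itself.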