Consumer continues to consume but cannot commit to wrong group coordinator #2630
Comments
Hmm, it seems the coordinator is not re-queried when we get a NOT_COORD_FOR_GROUP error for OffsetCommitRequests, but it is re-queried when we get the same error for HeartbeatRequests, so the situation should fix itself within the heartbeat interval (by default within a couple of seconds).
@edenhill: We might have run into a similar issue (we're in the process of reproducing it). In our case, we're calling assign() directly and not using the subscribe() mechanism, so there wouldn't be any HeartbeatRequests. Should the coordinator be re-queried at commit time too, to account for assign() use cases?
Duplicate of #2791 (or vice versa).
@keith-chew Can you please tell me the details of your fallback? I'm wondering if it is a good fallback to have regardless of the fix. Do you trap the event and reconnect immediately, or do you do something more sophisticated? I was thinking of something like detecting two consecutive such events before reconnecting.
Our workaround (within node-rdkafka) is as follows (simplified version):
You can adjust the 5 attempts to suit your application. Occasionally we will see some false positives, but we are happy with the reconnect when this happens. Periodically we will scan the logs and adjust the attempts accordingly. Note: Our app automatically reconnects on a disconnect, which is not shown in the snippet above.
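The original snippet is not preserved in this copy of the thread. As a rough illustration of the idea described (count consecutive commit failures and reconnect once a threshold is reached), a minimal sketch might look like the following. The class name `CommitFailureTracker`, its `record` method, and the reset-on-success behaviour are assumptions made for illustration; they are not taken from node-rdkafka or from the poster's code.

```javascript
// Hypothetical helper, not part of node-rdkafka or librdkafka.
// Tracks consecutive commit failures and signals when the caller
// should disconnect and reconnect the consumer.
class CommitFailureTracker {
  constructor(maxAttempts = 5) {
    this.maxAttempts = maxAttempts; // threshold, tune per application
    this.consecutiveFailures = 0;
  }

  // Pass the error from a commit attempt (null or undefined on success).
  // Returns true when the failure streak has reached the threshold.
  record(err) {
    if (!err) {
      this.consecutiveFailures = 0; // a successful commit clears the streak
      return false;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.maxAttempts) {
      this.consecutiveFailures = 0; // start fresh after the reconnect
      return true;
    }
    return false;
  }
}
```

An application would call `record(err)` from whatever commit-failure hook it uses and trigger its own disconnect and reconnect logic when the call returns true. A lower threshold reconnects sooner but is more likely to react to transient errors.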
@keith-chew Thank you, that looks great and is what I was thinking of doing! Will you keep this fallback even after the patched librdkafka is released?
Description
This issue is similar to #2214, but the error message is "Not coordinator for group" instead of "Waiting for coordinator".
How to reproduce
This case is quite hard to reproduce, but after several attempts at shutting down the main group coordinator and bringing it back up (effectively forcing a group coordinator change on the server side), we can see this issue.
Checklist
Please provide the following information:
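librdkafka version: 1.1.0
Apache Kafka version: 2.1.0
librdkafka client configuration: standard configuration
Operating system: rhel7
Provide logs (with debug=.. as necessary) from librdkafka
From client logs, we see: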
followed by (forever, until a restart is done on the client):
From the server side, we can confirm the offset is not being committed, so the consumer is consuming without any commits. The case above is with enable.auto.commit = true. This also happens with enable.auto.commit = false. Our workaround at the moment is:
(1) enable.auto.commit = true
Trap the COMMITFAIL event, then disconnect and reconnect the consumer.
(2) enable.auto.commit = false
For clients using commitSync(), disconnect and reconnect on exception. For clients using commitAsync(), register the offset_commit_cb and, on error, disconnect and reconnect the consumer.
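The manual-commit workaround in (2) could be wired up roughly as below. This is only a sketch under stated assumptions: `consumer` stands for any object exposing a `commitSync()` method that throws on failure, `reconnect` is an application-supplied function that disconnects and reconnects the consumer, and the callback shape `(err, topicPartitions)` is assumed rather than confirmed against a particular client version.

```javascript
// Hypothetical wiring; the helper names are illustrative only.

// Synchronous path: reconnect when the commit throws.
function commitSyncOrReconnect(consumer, reconnect) {
  try {
    consumer.commitSync();
    return true; // commit succeeded
  } catch (err) {
    reconnect(err); // application-defined disconnect + connect
    return false;
  }
}

// Asynchronous path: build a handler to register as the offset commit
// callback; it reconnects only when the commit reported an error.
function makeOffsetCommitHandler(reconnect) {
  return function onOffsetCommit(err, topicPartitions) {
    if (err) {
      reconnect(err);
    }
  };
}
```

Both paths reconnect on the first error. Reconnecting only after several consecutive failures, as discussed in the comments, would reduce reconnects caused by transient errors.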