
Consumer continues to consume but cannot commit to wrong group coordinator #2630

Closed
6 of 7 tasks
keith-chew opened this issue Nov 19, 2019 · 6 comments

@keith-chew

Description

This issue is similar to #2214, but the error message is "Not coordinator for group" instead of "Waiting for coordinator".

How to reproduce

This case is quite hard to reproduce, but after repeatedly shutting down the broker acting as group coordinator and bringing it back up (forcing a group coordinator change on the server side), we can trigger this issue.

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): 1.1.0
  • Apache Kafka version: 2.1.0
  • librdkafka client configuration: standard configuration
  • Operating system: rhel7
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue

From client logs, we see:

{"severity":4,"fac":"COMMITFAIL","message":"[thrd:main]: Offset commit (manual) failed for 1/1 partition(s): Broker: Request timed out: xxxxx[10]@97813025(Broker: Request timed out)"}

followed by (repeating forever, until the client is restarted):

{"severity":4,"fac":"COMMITFAIL","message":"[thrd:main]: Offset commit (manual) failed for 1/1 partition(s): Broker: Not coordinator for group: xxxxx[10]@14469297(Broker: Not coordinator for group)"}

From the server side, we can confirm the offset is not being committed, so the consumer is consuming without any commits. The case above is with enable.auto.commit = true; it also happens with enable.auto.commit = false. Our workaround at the moment is:

(1) enable.auto.commit = true
Trap the COMMITFAIL event, then disconnect and reconnect the consumer.

(2) enable.auto.commit = false
For clients using commitSync(), disconnect and reconnect on exception. For clients using commitAsync(), register the offset_commit_cb and, on error, disconnect and reconnect the consumer.
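The commitSync() variant of workaround (2) can be sketched as follows. This is a minimal sketch, not the real node-rdkafka API: the `Consumer` interface and `commitWithReconnect` helper are hypothetical names introduced here for illustration.

```typescript
// Hypothetical minimal consumer interface for the sketch below;
// the real node-rdkafka consumer has a richer API.
interface Consumer {
    commitSync(): void;
    disconnect(): void;
    connect(): void;
}

// Commit synchronously; if the commit throws (e.g. the broker keeps
// answering "Not coordinator for group" because the coordinator is not
// re-queried), bounce the connection so a fresh coordinator lookup happens.
function commitWithReconnect(consumer: Consumer): boolean {
    try {
        consumer.commitSync();
        return true;
    } catch (err) {
        consumer.disconnect();
        consumer.connect();
        return false;
    }
}
```

Reconnecting forces the client to rediscover the group coordinator, which is what unblocks the stuck commits in this issue.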

@edenhill
Contributor

Huhm, it seems like the coordinator is not re-queried when we get a NOT_COORD_FOR_GROUP error for OffsetCommitRequests, but it is re-queried when we get the same error for HeartbeatRequests, so the situation should fix itself within the heartbeat interval (by default, within a couple of seconds).
There are some fixes around Heartbeat error handling that are going into the upcoming v1.4.0 release.

@edenhill edenhill added the bug label Mar 12, 2020
@edenhill edenhill added this to the v1.5.0 milestone Mar 12, 2020
@mlongob

mlongob commented Mar 31, 2020

@edenhill: We might have run into a similar issue (we're in the process of reproducing it). In our case, we're calling assign() directly and not using the subscribe() mechanism, so there wouldn't be any HeartbeatRequests. Should the coordinator be re-queried at commit time too, to account for assign() use-cases?

@edenhill
Contributor

edenhill commented Apr 3, 2020

Duplicate of #2791 (or vice versa).

@edenhill edenhill closed this as completed Apr 3, 2020
@MaximGurschi

@keith-chew Can you please tell me the details of your fallback? I'm wondering if it is a good fallback to have regardless of the fix.

Do you trap the event and reconnect immediately or do you do something more sophisticated? I was thinking something like detecting two consecutive such events before reconnecting.

@keith-chew
Author

Hi @MaximGurschi

Our workaround (within node-rdkafka) is as follows (simplified version):

    this.optsConsumer.offset_commit_cb = (err: any, topicPartitions: any) => {
        if (err) {
            // Count consecutive commit failures.
            this.kafkaFailedToCommitCount += 1;
            if (this.kafkaFailedToCommitCount > 5) {
                // After more than 5 consecutive failures, assume the consumer
                // is stuck on a stale coordinator and drop the connection.
                this.disconnect();
            }
        } else {
            // Any successful commit resets the failure counter.
            this.kafkaFailedToCommitCount = 0;
        }
    };

You can adjust the threshold of 5 attempts to suit your application. Occasionally we see some false positives, but we are happy with the reconnect when this happens. Periodically we scan the logs and adjust the threshold accordingly.

Note: Our app automatically reconnects on a disconnect, which is not shown in the snippet above.

@MaximGurschi

@keith-chew Thank you, that looks great and is what I was thinking of doing!

Will you keep this fallback even after the patched librdkafka is released?
