
Consumer continues to consume but cannot commit to wrong group coordinator #2630

Closed
6 of 7 tasks
keith-chew opened this issue Nov 19, 2019 · 6 comments

@keith-chew

Description

This issue is similar to #2214, but the error message is "Not coordinator for group" instead of "Waiting for coordinator".

How to reproduce

This case is quite hard to reproduce, but after repeatedly shutting down the broker acting as group coordinator and bringing it back up (forcing a group coordinator change on the server side), we can trigger this issue.

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): 1.1.0
  • Apache Kafka version: 2.1.0
  • librdkafka client configuration: standard configuration
  • Operating system: rhel7
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue

From client logs, we see:

{"severity":4,"fac":"COMMITFAIL","message":"[thrd:main]: Offset commit (manual) failed for 1/1 partition(s): Broker: Request timed out: xxxxx[10]@97813025(Broker: Request timed out)"}

followed by (repeating forever, until the client is restarted):

{"severity":4,"fac":"COMMITFAIL","message":"[thrd:main]: Offset commit (manual) failed for 1/1 partition(s): Broker: Not coordinator for group: xxxxx[10]@14469297(Broker: Not coordinator for group)"}

From the server side, we can confirm the offset is not being committed, so the consumer is consuming without any commits. The case above is with enable.auto.commit = true; it also happens with enable.auto.commit = false. Our workaround at the moment is:

(1) enable.auto.commit = true
Trap the COMMITFAIL event, then disconnect and reconnect the consumer.

(2) enable.auto.commit = false
For clients using commitSync(), disconnect and reconnect on exception. For clients using commitAsync(), register the offset_commit_cb and, on error, disconnect and reconnect the consumer.
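The commitSync() variant of workaround (2) can be sketched as follows. This is a minimal sketch, not the real node-rdkafka API: the `Consumer` interface and `commitWithReconnect` helper are hypothetical names introduced here for illustration.

```typescript
// Hypothetical minimal consumer interface for the sketch below;
// the real node-rdkafka consumer has a richer API.
interface Consumer {
    commitSync(): void;
    disconnect(): void;
    connect(): void;
}

// Commit synchronously; if the commit throws (e.g. the broker keeps
// answering "Not coordinator for group" because the coordinator is not
// re-queried), bounce the connection so a fresh coordinator lookup happens.
function commitWithReconnect(consumer: Consumer): boolean {
    try {
        consumer.commitSync();
        return true;
    } catch (err) {
        consumer.disconnect();
        consumer.connect();
        return false;
    }
}
```

Reconnecting forces the client to rediscover the group coordinator, which is what unblocks the stuck commits in this issue.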

@edenhill
Contributor

Huhm, it seems like the coordinator is not re-queried when we get a NOT_COORD_FOR_GROUP error for OffsetCommitRequests, but it is re-queried when we get the same error for HeartbeatRequests, so the situation should fix itself within the heartbeat interval (by default, within a couple of seconds).
There are some fixes around Heartbeat error handling that are going into the upcoming v1.4.0 release.

@edenhill edenhill added the bug label Mar 12, 2020
@edenhill edenhill added this to the v1.5.0 milestone Mar 12, 2020
@mlongob

mlongob commented Mar 31, 2020

@edenhill: We might have run into a similar issue (we're in the process of reproducing it). In our case, we're calling assign() directly and not using the subscribe() mechanism, so there wouldn't be any HeartbeatRequests. Should the coordinator be re-queried at commit time too, to account for assign() use-cases?

@edenhill
Contributor

edenhill commented Apr 3, 2020

Duplicate of #2791 (or vice versa).

@edenhill edenhill closed this as completed Apr 3, 2020
@MaximGurschi

@keith-chew Can you please tell me the details of your fallback? I'm wondering if it is a good fallback to have regardless of the fix.

Do you trap the event and reconnect immediately or do you do something more sophisticated? I was thinking something like detecting two consecutive such events before reconnecting.

@keith-chew
Author

Hi @MaximGurschi

Our workaround (within node-rdkafka) is as follows (simplified version):

    this.optsConsumer.offset_commit_cb = (err: any, topicPartitions: any) => {
        if (err) {
            // Count consecutive commit failures.
            this.kafkaFailedToCommitCount += 1;
            if (this.kafkaFailedToCommitCount > 5) {
                // After more than 5 consecutive failures, assume the consumer
                // is stuck on a stale coordinator and drop the connection.
                this.disconnect();
            }
        } else {
            // Any successful commit resets the failure counter.
            this.kafkaFailedToCommitCount = 0;
        }
    };

You can adjust the threshold of 5 attempts to suit your application. Occasionally we see some false positives, but we are happy with the reconnect when this happens. Periodically we scan the logs and adjust the threshold accordingly.

Note: Our app automatically reconnects on a disconnect, which is not shown in the snippet above.

@MaximGurschi

@keith-chew Thank you, that looks great and is what I was thinking of doing!

Will you keep this fallback even after the patched librdkafka is released?
