
Heartbeat failed for group xxxWorker because it is rebalancing #1418

Closed
ealfatt opened this issue Mar 8, 2018 · 54 comments

@ealfatt

ealfatt commented Mar 8, 2018

Hi all,
I'm facing a problem with version 1.4.1 of kafka-python that is driving me crazy.
The steps I perform are:

  • create a consumer in this way (a fuller sketch follows after this list):
    KafkaConsumer(bootstrap_servers=kafka_multi_hosts,
                  auto_offset_reset='earliest',
                  enable_auto_commit=False,
                  group_id=group_name,
                  reconnect_backoff_ms=1,
                  consumer_timeout_ms=5000)
  • subscribe to a topic
  • finally call consumer.poll(500)
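
A minimal, self-contained sketch of the flow described above; the broker list, group name, and topic are placeholders standing in for the reporter's actual values:

from kafka import KafkaConsumer

kafka_multi_hosts = ['broker1:9092', 'broker2:9092']  # placeholder broker list
group_name = 'emsWorker'                              # placeholder, taken from the log below

consumer = KafkaConsumer(bootstrap_servers=kafka_multi_hosts,
                         auto_offset_reset='earliest',
                         enable_auto_commit=False,
                         group_id=group_name,
                         reconnect_backoff_ms=1,
                         consumer_timeout_ms=5000)
consumer.subscribe(['event'])  # topic name taken from the log below

while True:
    records = consumer.poll(500)  # timeout_ms=500; returns {TopicPartition: [messages]}
    for tp, messages in records.items():
        for message in messages:
            pass  # process message.value here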

No problem up to this point, but then in the log I see:

[INFO] 03/08/2018 02:52:53 PM Subscribe executed.
[INFO] 03/08/2018 02:52:53 PM Initialization pool executed.
[INFO] 03/08/2018 02:52:53 PM Subscribed to topic: event
[INFO] 03/08/2018 02:52:53 PM eventHandle connected
[WARNING] 03/08/2018 02:53:23 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:26 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:29 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:32 PM Heartbeat failed for group emsWorker because it is rebalancing
......
......
......
[WARNING] 03/08/2018 02:57:48 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:57:51 PM Heartbeat failed for group emsWorker because it is rebalancing
[INFO] 03/08/2018 02:57:53 PM Leaving consumer group (group_name).

Why?

I've also added the option max_poll_records=50 to the KafkaConsumer definition, but nothing changed.

@dpkp, can you help me?
Do you know if version 1.4.1 has a known problem in this area? I did not see this problem with the previous version.

Thanks in advance.

@andrewkowalik
Contributor

@ealfatt Yes, this is definitely newer behavior: before 1.4 there was no background heartbeat thread, so heartbeating basically happened during polling.

I am guessing after the last message your worker is not rejoining?

@dpkp
Owner

dpkp commented Mar 10, 2018

kafka-python 1.4 changes heartbeats to a background thread, and it enables additional timeout configurations in max_poll_interval_ms and session_timeout_ms. It is certainly possible there is a bug in the code, but all of our tests pass so I'm not aware of anything obvious. Can you enable debug logs and see if anything looks strange?

@ealfatt
Author

ealfatt commented Mar 12, 2018

@infecto Yes, after the last message my worker was not rejoining (thanks for the response).
Hi @dpkp, thanks for the response. I'll try to play with those two parameters and try to enable the debug logs (how can I do that? Is it explained in the documentation?)
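
For reference, a minimal way to turn on debug logging with the standard library logging module; kafka-python logs under the 'kafka.*' logger names seen in the output above (kafka.conn, kafka.coordinator, ...):

import logging

# Send all loggers, including kafka-python's, to stderr at DEBUG level.
logging.basicConfig(level=logging.DEBUG,
                    format='[%(levelname)s] %(asctime)s %(name)s %(message)s')
logging.getLogger('kafka').setLevel(logging.DEBUG)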

@ealfatt
Author

ealfatt commented Mar 14, 2018

Hi @dpkp
Below are two logs of what I see when the issue above happens.

log1.txt
log2.txt

Could you take a look and let me know your feedback?
Thanks a lot in advance.

@dpkp
Owner

dpkp commented Mar 22, 2018

This log entry, "Heartbeat poll expired, leaving group", means that your consumer application is not processing messages fast enough. What is your max_poll_interval_ms configuration? Is there something that might block your poll loop for longer than that amount?
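
A hedged sketch of one way to check this: set max_poll_interval_ms and max_poll_records explicitly and time how long each batch takes to process (broker, topic, and group values are illustrative):

import time
from kafka import KafkaConsumer

MAX_POLL_INTERVAL_MS = 300000  # kafka-python's default, 5 minutes

consumer = KafkaConsumer('event',                        # illustrative topic
                         bootstrap_servers='localhost:9092',
                         group_id='emsWorker',           # illustrative group
                         max_poll_interval_ms=MAX_POLL_INTERVAL_MS,
                         max_poll_records=50)            # smaller batches shorten the gap between polls

while True:
    batch = consumer.poll(timeout_ms=500)
    start = time.time()
    for tp, messages in batch.items():
        for message in messages:
            pass  # process the message here
    elapsed_ms = (time.time() - start) * 1000
    if elapsed_ms > 0.8 * MAX_POLL_INTERVAL_MS:
        print('batch took %.0f ms, close to max_poll_interval_ms' % elapsed_ms)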

@slash31

slash31 commented Oct 29, 2018

After upgrading to 1.4.3 from 1.3.5 a couple of weeks ago, I started seeing my consumer groups constantly rebalancing at the session timeout setting, no matter what I set it to. I tried:

  • adjusting the poll/heartbeat/session timers, as well as max recs polled
  • setting the assignment algorithm deterministically to round robin
  • turning off auto-commit and manually committing
  • upgrading Kafka to the latest available parcel

Eventually, I ended up writing a handler into the application where it would kill the consumer if the commit failed, then reincarnate it in the job queue and start it back up. You can imagine that with the default settings it was effectively in a perpetual loop of rebalancing, then reading a high volume of messages - the broker traffic graph looked like a perfect sawtooth.

I see heartbeat requests and successful responses, but the commit responses come back with Error 25 just before the error is thrown that the consumer can't commit successfully:

2018-10-28 20:35:06,563 kafka.conn DEBUG <BrokerConnection node_id=772 host=xvda-c14.xv.dc.openx.org:9092 [IPv4 ('10.11.3.14', 9092)]> Response 103 (0.808000564575 ms): OffsetCommitResponse_v2(topics=[(topic=u'netflow.raw', partitions=[(partition=0, error_code=25), (partition=5, error_code=25), (partition=10, error_code=25)])])
2018-10-28 20:35:06,563 kafka.conn DEBUG <BrokerConnection node_id=772 host=xvda-c14.xv.dc.openx.org:9092 [IPv4 ('10.11.3.14', 9092)]> Response 103 (0.808000564575 ms): OffsetCommitResponse_v2(topics=[(topic=u'netflow.raw', partitions=[(partition=0, error_code=25), (partition=5, error_code=25), (partition=10, error_code=25)])])
2018-10-28 20:35:06,563 kafka.coordinator.consumer DEBUG OffsetCommit for group netflow_raw failed: [Error 25] UnknownMemberIdError: netflow_raw
2018-10-28 20:35:06,563 kafka.coordinator.consumer DEBUG OffsetCommit for group netflow_raw failed: [Error 25] UnknownMemberIdError: netflow_raw

I downgraded the version to 1.3.5 again and everything is happy, all consumers have been running smoothly for hours now with no issue and the broker traffic graphs are flat. One other thing I noticed is a lot of wow/flutter on graphs of this data (netflow) when using 1.4.3, which smoothed out again after downgrading. The setup for each of my clusters is 10 consumers running on one host via multiprocess, consuming single topic with 10 partitions on three brokers. The application is very simple/lightweight; it reads 2k-4k messages per second depending on time of day, formats/augments some of the data, and then dumps it to CrateDB. With the 1.3.5 client, I can easily get > 20MBps throughput with this application (average is ~2MBps in production when the consumers are caught up), with each db write of 5k rows taking < 1s.

If you're interested in any of the other logs/graphs from either kafka-python or my brokers, I'd be happy to share.

@julienaubert

julienaubert commented Nov 7, 2018

I see this as well.

I have a set of consumers, each polling 500 records at a time, all part of group generation N.

Another consumer joins, causing a rebalance (generation N+1).

I can see in the logs that a consumer, while processing the 500 records it got at generation N, will log that the heartbeat failed due to rebalancing. But the processing code does not learn this, so it continues processing. (I should fix it so that the processing checks for this case and aborts; I'm not sure how. See the sketch at the end of this comment.)

I can see that the broker does not include this consumer in the rebalanced group (at N+1). I wonder why, since it was sending heartbeats? Is this considered a bug?

I only glanced quickly at the code. It seems this could happen: the heartbeat ends up in a "do not send more heartbeats" state due to rebalancing, and the broker takes sufficiently long to perform the rebalance. So the consumer has stopped sending heartbeats and is subsequently not included. It seems some special-case handling would be needed (e.g. resetting timeouts) when rebalancing occurs?

(The rebalancing keeps happening: when the consumer that did not become part of generation N+1 later tries to join again, the same thing happens, this time with some other consumer busy processing and not becoming part of generation N+2... and so on.)
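
One possible way to let the processing code learn about a rebalance, assuming kafka-python's ConsumerRebalanceListener interface; note that the callbacks fire from inside poll(), so this helps abort before working on the next (now stale) batch rather than strictly mid-batch:

import threading
from kafka import KafkaConsumer, ConsumerRebalanceListener

revoked_flag = threading.Event()

class StopOnRebalance(ConsumerRebalanceListener):
    # Called when the group starts rebalancing and this consumer's
    # partitions are revoked; the processing loop checks the flag and aborts.
    def on_partitions_revoked(self, revoked):
        revoked_flag.set()

    def on_partitions_assigned(self, assigned):
        revoked_flag.clear()

consumer = KafkaConsumer(bootstrap_servers='localhost:9092',  # illustrative values
                         group_id='mygroup')
consumer.subscribe(['mytopic'], listener=StopOnRebalance())

while True:
    for tp, messages in consumer.poll(timeout_ms=500).items():
        for message in messages:
            if revoked_flag.is_set():
                break  # partitions were revoked; stop working on this batch
            pass  # process the message here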

@88manpreet

We have also started seeing this error quite frequently after upgrading to 1.4.3. So far none of the tuning has turned out to be helpful.

@dpkp
Owner

dpkp commented Nov 21, 2018 via email

@rafenden

It happens to me as well; I'm running 1.4.4.

@goern

goern commented Dec 3, 2018

I can see it on 1.4.4 too.

@rabindrakm

Has there been a solution for this?

@licy121

licy121 commented Jan 8, 2019

It is similar to bug #1691, which I fixed. Do you think this is the same issue?

@dpkp
Owner

dpkp commented Jan 13, 2019

One point to keep in mind: heartbeats "failing" during rebalance is completely normal. The main java client treats this as a debug log, not a warning. If this is simply a case of spammy logs, we should probably return the log entry to debug.

However, if you are seeing unexpected rebalances and/or rebalances are not resolving correctly, then there may be something else going on. It is hard to say what that is from the issue detail so far. Some possibilities: your consumer is taking too long to process each batch, causing a max poll interval timeout. Or, there could be some bug or deadlock in kafka-python that is causing the consumer to stop processing records altogether. Do you have any more information on specifically what you are seeing in the consumer behavior apart from the heartbeat?
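
If the rebalances themselves turn out to be expected and only the log noise is the problem, one possible workaround in the meantime (a sketch, and only appropriate once you are sure the rebalances are benign) is to raise the level of the logger that emits this warning:

import logging

# The "Heartbeat failed for group ... because it is rebalancing" warning is
# emitted by the kafka.coordinator logger, as seen in the logs in this thread.
logging.getLogger('kafka.coordinator').setLevel(logging.ERROR)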

@vimal3271

I am seeing consumer rebalances even if there are no messages to consume. Start three consumers in a group, send some messages to the topic, and then stop the producer. The consumers will start seeing rebalances after 5-6 minutes.
Sample code here:
https://stackoverflow.com/questions/54908902/kafka-consumer-rebalance-occurs-even-if-there-is-no-message-to-consume

@dpkp
Owner

dpkp commented Feb 28, 2019

vimal: thanks for posting. I believe you may be hitting lock contention between an idle client.poll -- which can block and hold the client lock for the entire request_timeout_ms -- and the attempt by the heartbeat thread to send a new request. It seems to me that we may need to use KafkaClient.wakeup() to make sure that the polling thread drops the lock if/when we need to send a request from a different thread.

@vimal3271

@dpkp Thanks for the reply. Is there any workaround to avoid this?

@dpkp
Owner

dpkp commented Feb 28, 2019

This shouldn't be an issue when messages are flowing through your topics at a steady rate. If this is just a test environment, and you expect your production environment to have more steady live data, then you could just ignore the error in testing. But if you are managing a topic w/ very low traffic -- delays of minutes between consecutive messages, for example -- you might try to reduce the request_timeout_ms to something closer to the heartbeat_interval_ms, which should prevent the read side from blocking for much longer than the heartbeat timeout. But note that other timeouts may also need to change (max_poll_interval_ms and session_timeout_ms perhaps). Another workaround might be to reduce metadata_max_age_ms to something close / equal to your heartbeat_timeout_ms. This will cause more frequent metadata requests, but should unblock the send side when there is no socket data available for reads.
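
A sketch of those two workarounds expressed as consumer settings; the values are illustrative, and kafka-python generally requires request_timeout_ms to be larger than session_timeout_ms:

from kafka import KafkaConsumer

consumer = KafkaConsumer('low_traffic_topic',             # illustrative topic and group
                         bootstrap_servers='localhost:9092',
                         group_id='mygroup',
                         heartbeat_interval_ms=3000,      # default
                         session_timeout_ms=10000,        # default
                         request_timeout_ms=11000,        # workaround 1: keep idle reads from blocking too long
                         metadata_max_age_ms=10000)       # workaround 2: frequent metadata requests unblock the send side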

@vimal3271

@dpkp Thanks for the workaround. The rebalance is not happening anymore. I used the second approach, reducing metadata_max_age_ms to something close to the heartbeat timeout.

@mhorowitz0

vimal: thanks for posting. I believe you may be hitting lock contention between an idle client.poll -- which can block and hold the client lock for the entire request_timeout_ms -- and the attempt by the heartbeat thread to send a new request. It seems to me that we may need to use KafkaClient.wakeup() to make sure that the polling thread drops the lock if/when we need to send a request from a different thread.

@dpkp I think I am hitting this exact issue with the lock contention, and it happens quite frequently (using 1.4.4). I turned the log level to DEBUG, and I can see that the heartbeats were every 3 seconds, then it went into a _client.poll -> _selector.select -> _epoll.poll (confirmed in a pdb stack trace) and just hung for 5 minutes until it timed out, despite the fact that the log end offset was way ahead of the current offset. During that time no heartbeats were sent. I suspect the session timed out after 10 seconds, and the poll was left in the lurch until it reached request_timeout_ms, after which it recognized the heartbeat timeout, refreshed the metadata, and rejoined the consumer group. After that the heartbeats started again every 3 seconds. I saw another issue about the heartbeat not being started after JoinGroup, but from what I can see the heartbeats were regular since the last JoinGroup in the logs, and only stopped when it went into that poll.

The issue happens randomly, but it happens often. The result is that consumers periodically hang in polls, messages build up, and the lag becomes large; then the consumer has to rejoin the group and catch up on a backlog of messages. This problem becomes even worse when the poll goes for more than max_poll_interval_ms, causing runaway constant rebalancing, especially in groups with many consumers. I've mitigated the runaway rebalancing by upping max_poll_interval_ms to 1 hour, but I still see those constant poll hangs on partitions with plenty of lag, anywhere from 20 seconds to 5 minutes. I can only attribute them so far to the stoppage of the heartbeats and a resulting session timeout.

I was thinking about reducing the request_timeout_ms from the default of 5 minutes to 30 seconds, but that will only work around the issue, it seems the underlying root cause of those heartbeats stopping would still be there.

@jeffwidman
Collaborator

jeffwidman commented Mar 13, 2019

@mhorowitz0 / @vimal3271 / @slash31 / @88manpreet there were a number of fixes included in the 1.4.6 release that should alleviate these lock problems without requiring workarounds... do you mind trying them and seeing whether they fix these issues?

@jeffwidman
Collaborator

I'm going to close this; please try the new 1.4.6 release. If you still see issues, please open a new ticket.

@mhorowitz0

@jeffwidman I tried 1.4.5, and I still see the same symptoms. I will try 1.4.6.

@mhorowitz0

@jeffwidman I am seeing the same issue with 1.4.6. The issue is intermittent, so it is hard to catch, but from what I can see, even with a single consumer group a poll() can hang inexplicably for several minutes even when the topic/partition has a large lag. Even a hang as short as 12 seconds is enough to trigger a session timeout: during that time heartbeats stop being sent for longer than the session timeout, so a rebalance is triggered. I don't know why it hangs so long in poll() when most polls take less than a second, and I don't know why the heartbeats stop, but I can only guess it is the same root cause, most likely the lock contention @dpkp refers to above.

@artemistomaras

artemistomaras commented Apr 10, 2019

@jeffwidman I am having the same issue as @mhorowitz0 with 1.4.6 with the default provided settings for the consumer.

A consumer inside a consumer group fails with the following log when there are no messages in the topic partition it is currently assigned to.

2019 Apr 10 06:22:03  | 2019-04-10 06:22:02,491 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 06:22:06  | 2019-04-10 06:22:05,582 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 220
2019 Apr 10 06:22:06  | 2019-04-10 06:22:05,583 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=3)]
2019 Apr 10 06:22:06  | 2019-04-10 06:22:05,583 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,880 - kafka.coordinator - WARNING - Heartbeat session expired, marking coordinator dead
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,880 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: Heartbeat session expired.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,889 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,893 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,890 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,897 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,894 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,893 - kafka.coordinator.consumer - WARNING - Auto offset commit failed for group NSFW_CONSUMER: [Error 7] RequestTimedOutError: NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,898 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:48,003 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:47,901 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:53,149 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:48,003 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:53,252 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:58,254 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:53,252 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:58,358 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:58,358 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,382 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,485 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 221
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,381 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,486 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,486 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=3)]
2019 Apr 10 06:34:42  | 2019-04-10 06:34:42,897 - kafka.client - INFO - Closing idle connection 2, last active 540000 ms ago
2019 Apr 10 06:34:42  | 2019-04-10 06:34:42,898 - kafka.conn - INFO - <BrokerConnection node_id=2 host=tot-kafka-2.tot-kafka-headless.tot:9092 <connected> [IPv4 ('10.244.3.12', 9092)]>: Closing connection. 
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,707 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,717 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,718 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,990 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 222
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,994 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=3)]
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,996 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,457 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,483 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,482 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,802 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 223
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,802 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=2)]
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,803 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=2)} for group NSFW_CONSUMER
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,635 - kafka.coordinator - WARNING - Heartbeat session expired, marking coordinator dead
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,635 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: Heartbeat session expired.
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,643 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,644 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,648 - kafka.coordinator.consumer - WARNING - Auto offset commit failed for group NSFW_CONSUMER: CommitFailedError: Commit cannot be completed since the group has already
2019 Apr 10 07:36:02  |             rebalanced and assigned the partitions to another member.
2019 Apr 10 07:36:02  |             This means that the time between subsequent calls to poll()
2019 Apr 10 07:36:02  |             was longer than the configured max_poll_interval_ms, which
2019 Apr 10 07:36:02  |             typically implies that the poll loop is spending too much
2019 Apr 10 07:36:02  |             time message processing. You can address this either by
2019 Apr 10 07:36:02  |             increasing the rebalance timeout with max_poll_interval_ms,
2019 Apr 10 07:36:02  |             or by reducing the maximum size of batches returned in poll()
2019 Apr 10 07:36:02  |             with max_poll_records.
2019 Apr 10 07:36:02  |             
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,649 - kafka.coordinator.consumer - ERROR - Offset commit failed: This is likely to cause duplicate message delivery
2019 Apr 10 07:36:02  | Traceback (most recent call last):
  File "/opt/app/.venv/lib/python3.6/site-packages/kafka/coordinator/consumer.py", line 521, in _maybe_auto_commit_offsets_sync
    self.commit_offsets_sync(self._subscription.all_consumed_offsets())
  File "/opt/app/.venv/lib/python3.6/site-packages/kafka/coordinator/consumer.py", line 514, in commit_offsets_sync
    raise future.exception # pylint: disable-msg=raising-bad-type
kafka.errors.CommitFailedError: CommitFailedError: Commit cannot be completed since the group has already
2019 Apr 10 07:36:02  |             rebalanced and assigned the partitions to another member.
2019 Apr 10 07:36:02  |             This means that the time between subsequent calls to poll()
2019 Apr 10 07:36:02  |             was longer than the configured max_poll_interval_ms, which
2019 Apr 10 07:36:02  |             typically implies that the poll loop is spending too much
2019 Apr 10 07:36:02  |             time message processing. You can address this either by
2019 Apr 10 07:36:02  |             increasing the rebalance timeout with max_poll_interval_ms,
2019 Apr 10 07:36:02  |             or by reducing the maximum size of batches returned in poll()
2019 Apr 10 07:36:02  |             with max_poll_records.
2019 Apr 10 07:36:02  |             
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,649 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=2)} for group NSFW_CONSUMER
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,650 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:36:03  | 2019-04-10 07:36:03,920 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 225
2019 Apr 10 07:36:03  | 2019-04-10 07:36:03,921 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=4)} for group NSFW_CONSUMER
2019 Apr 10 07:36:03  | 2019-04-10 07:36:03,920 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=4)]
2019 Apr 10 07:36:04  | 2019-04-10 07:36:03,989 - kafka.conn - INFO - <BrokerConnection node_id=2 host=tot-kafka-2.tot-kafka-headless.tot:9092 <connecting> [IPv4 ('10.244.3.12', 9092)]>: connecting to tot-kafka-2.tot-kafka-headless.tot:9092 [('10.244.3.12', 9092) IPv4]
2019 Apr 10 07:36:04  | 2019-04-10 07:36:03,991 - kafka.conn - INFO - <BrokerConnection node_id=2 host=tot-kafka-2.tot-kafka-headless.tot:9092 <connecting> [IPv4 ('10.244.3.12', 9092)]>: Connection complete.
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,668 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,677 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=4)} for group NSFW_CONSUMER
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,700 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 226
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,684 - kafka.coordinator - INFO - Elected group leader -- performing partition assignments using range
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,678 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,701 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=4)} for group NSFW_CONSUMER
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,700 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=4)]

@ottomata

ottomata commented May 13, 2019

I'll just delete my comment since it is a bit irrelevant and incorrect with the wrong version numbers and results.

Our summary is:

1.4.1 was working fine for us, but another of our users wanted an upgrade of kafka-python to get the fix for #1628. We upgraded to 1.4.6, and saw the missed heartbeats and rebalances this issue is about. We incorrectly reverted to 1.4.3 with #1628 backported, and still saw missed heartbeats and rebalances. We then reverted to 1.4.3 without #1628 backported, and eventually also saw missed heartbeats. So, I was incorrect in guessing that the fix in #1628 may have caused this issue.

Once we reverted all the way back to 1.4.1, the missed heartbeats ceased happening.

I haven't seen other reports of performance/instability problems with 1.4.6.
We and others in this thread are reporting that unexpected missed heartbeats happen in 1.4.6.

Thanks, sorry for the confusion!

@rolisz

rolisz commented May 14, 2019

@jeffwidman I still encounter the problems described in this issue on 1.4.6. I can provide debug logs if they would be useful.

@jeffwidman
Collaborator

jeffwidman commented May 14, 2019

I can provide debug logs if they would be useful.

Yes please. In particular, please make sure your log formatter includes timings, as I suspect the logs will be useless otherwise for identifying the root cause.

@elukey

elukey commented May 14, 2019

@jeffwidman I can provide logs as well; I am testing 1.4.6 in a test environment with DEBUG logging enabled. Usually these are the logs that I see:

May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,170 [26126] (MainThread) kafka.client [DEBUG] Sending metadata request MetadataRequest_v1(topics=[u'eventlogging-client-side']) to node 2
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,170 [26126] (MainThread) kafka.protocol.parser [DEBUG] Sending request MetadataRequest_v1(topics=[u'eventlogging-client-side'])
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,170 [26126] (MainThread) kafka.conn [DEBUG] <BrokerConnection node_id=2 host=deployment-kafka-jumbo-2.deployment-prep.eqiad.wmflabs:9092 <connected> [IPv4 ('172.16.5.47', 9092)]> Request 117275: MetadataRequest_v1(topics=[u'eventlogging-client-side'])
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,171 [26126] (eventlogging_processor_client_side_00-heartbeat) kafka.coordinator [WARNING] Heartbeat session expired, marking coordinator dead
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,171 [26126] (eventlogging_processor_client_side_00-heartbeat) kafka.coordinator [WARNING] Marking the coordinator dead (node 1) for group eventlogging_processor_client_side_00: Heartbeat session expired.

The metadata request before the heartbeat failure seems to happen every time. How far back should I go in the logs to find the culprit of the Heartbeat session expired? Is it related to session_timeout_ms (we use the default of 10s)? Thanks a lot!

@ottomata

@jeffwidman are there other logs we should try and get for ya?

@tmcqueen-materials

We appear to be seeing this issue as well. It seems to happen when the processing code is also multithreaded/multiprocessed and is doing intensive work with large (0.5 MB) messages. Perhaps an issue where the background heartbeat thread gets starved or doesn't know it needs to do something?

@cvley

cvley commented Jun 1, 2019

I have encountered this problem: consumers keep rebalancing when the machine has a lot of network traffic. I tried setting heartbeat_interval_ms to 1000 ms; rebalancing no longer showed up in the log, but messages weren't consumed either.

After I downgraded to 1.3.5, everything was normal.

Just for reference.

@vimal3271

vimal3271 commented Jun 1, 2019

The configuration below worked for me:

'request_timeout_ms': 305000,
'max_poll_interval_ms': 300000,
'session_timeout_ms': 180000,    # increased to 3 min from the default 10 s
'heartbeat_interval_ms': 60000,  # increased to 1 min from the default 3 s
'metadata_max_age_ms': 50000

Please try it out. I am using 1.4.5.

@charettes

This is still happening on 1.4.5 and is preventing us from using this library with Python 3.7, because only < 1.4 works appropriately for us. When using < 1.4 we see frequent heartbeat failures because of rebalancing, and more importantly double consumption of messages, which is highly problematic.

This also happens when bumping session.timeout.ms to larger values, which eventually leads to spurious ValueError: I/O operation on closed epoll fd errors.

@piperck

piperck commented Aug 16, 2019

We use 1.4.6.
It looks like a poll can hang for a period, which means heartbeats cannot be sent.
Following this suggestion:

workaround might be to reduce metadata_max_age_ms to something close / equal to your heartbeat_timeout_ms. This will cause more frequent metadata requests, but should unblock the send side when there is no socket data available for reads.

everything is now fine.

@dingyo777

dingyo777 commented Sep 2, 2019

Hello guys, I also have the same issue with 1.4.6.
I've now tried the second workaround @dpkp suggested: reducing metadata_max_age_ms to heartbeat_interval_ms.
I'll report back later on whether it works.

@dingyo777

Result: the workaround works well. No heartbeat failures, no rebalances.

@dpkp
Owner

dpkp commented Sep 30, 2019

I made several improvements to KafkaConsumer and the underlying client that I believe should fix this issue. Please try 1.4.7 and reopen / file issues if this persists!

@dpkp dpkp closed this as completed Sep 30, 2019
@sangmeshcp

Still hitting this issue with 1.4.7:

[2019-10-02 15:28:24,488] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:27,587] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:30,680] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:33,780] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:36,882] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:39,975] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:43,074] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:46,165] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:49,164] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:52,266] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing

@jmgpeeters

1.4.7 appears to have fixed the issue for us.

@sangmeshcp

1.4.7 appears to have fixed the issue for us.

Does it work with multiple brokers as well?

@elukey

elukey commented Oct 8, 2019

1.4.7 appears to have fixed the issue for us.

Same thing from my side: I have been testing the new version in production for 24 hours now and everything went smoothly. I will keep monitoring and report back in case anything arises, but so far so good! Thanks!

@asalem

asalem commented Oct 16, 2019

Simply reduce max_poll_records; the default is 500, so you may tweak that to get rid of this warning.

@kennes913

kennes913 commented Oct 17, 2019

I am still having this issue using 1.4.7. Like @sangmeshcp, we have multiple brokers. I am beginning to think this may have something to do with the Kafka cluster configuration rather than kafka-python, since I was able to connect to the same topic in a different group with no ill effects.

Update:
We have some legacy processes that use 1.3.3 and python 3.4. I tested using that combination and everything ran fine. Not directly helpful -- just providing more data points.

@naoshadmehta

We're seeing this problem with 1.4.6 and 1.4.7. We have 3 broker nodes and 6 consumers per node. Our topic has 18 partitions and for about an hour or so, we see no issues. After an hour or so, we start seeing

server.log:[2019-10-29 01:28:16,927] INFO [GroupCoordinator 1]: Preparing to rebalance group tasks_group in state PreparingRebalance with old generation 9 (__consumer_offsets-17) (reason: removing member kafka-python-1.4.7-52039bb5-17c5-42d7-8af3-d95f0dbc3f3f on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)

A few seconds later, we see a new consumer being added:
server.log:[2019-10-29 01:29:00,612] INFO [GroupCoordinator 1]: Preparing to rebalance group tasks_group in state PreparingRebalance with old generation 10 (__consumer_offsets-17) (reason: Adding new member kafka-python-1.4.7-27f5bbac-b1e6-41a9-999b-abb56868e9e2) (kafka.coordinator.group.GroupCoordinator)

After the system gets into this state, we keep seeing removal and addition of members to the consumer group, and within a few hours it reaches a point where the rebalance takes more than 10 minutes to complete. By this time, the lag on the partitions is so high that it's impossible to recover from this situation.

Our broker config:
max.session.timeout.ms: 30000
heartbeat.interval.ms: 10000

Consumer config:
max.session.timeout.ms: 10000
heartbeat.interval.ms: 3000
consumer_timeout_ms: 2000

Everything else is left as default.

Any suggestions on what we could tweak in the broker or consumer configs to avoid seeing this issue?

jharting added a commit to jharting/insights-host-inventory that referenced this issue Dec 17, 2019
should help with frequent partition rebalances according to dpkp/kafka-python#1418
jharting added a commit to RedHatInsights/insights-host-inventory that referenced this issue Dec 19, 2019
should help with frequent partition rebalances according to dpkp/kafka-python#1418
@jmgpeeters

Is there any plan to upgrade kafka-python to 1.4.7 on conda-forge? https://anaconda.org/conda-forge/kafka-python/files currently still shows 1.4.6 as the latest.

@jeffwidman
Collaborator

@jmgpeeters please open a new ticket; this one is completely unrelated to conda-forge.

wmfgerrit pushed a commit to wikimedia/search-MjoLniR-deploy that referenced this issue Mar 2, 2020
Seeing problems with the eqiad group getting stuck with
"heartbeat failed from group xyz because it is rebalancing". Based
on dpkp/kafka-python#1418 this may
be resolved with a version update to 1.4.7.

Change-Id: I3896f52af54112ce69c8b60d67b8f8471bee8d65
@xiaoma-father

I used kafka-python 2.0.1, and I saw this warning in the log very frequently, nearly every second, and the client consumed messages from Kafka slowly. I solved it by creating several consumer groups to consume the topics. At first I consumed all the topics, nearly twelve, in one group; after looking for solutions on the web, I created five groups, each consuming the topics for a different function, and the problem was solved. I hope this helps.
