
Heartbeat failed for group xxxWorker because it is rebalancing #1418

Closed
ealfatt opened this issue Mar 8, 2018 · 54 comments

@ealfatt

ealfatt commented Mar 8, 2018

Hi all,
I'm facing a problem with version 1.4.1 of kafka-python that is driving me crazy.
The steps I perform are:

  • create a consumer in this way (a fuller sketch follows after this list):
    KafkaConsumer(bootstrap_servers=kafka_multi_hosts,
                  auto_offset_reset='earliest',
                  enable_auto_commit=False,
                  group_id=group_name,
                  reconnect_backoff_ms=1,
                  consumer_timeout_ms=5000)
  • subscribe to a topic
  • finally call consumer.poll(500)
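
A minimal, self-contained sketch of the flow described above; the broker list, group name, and topic are placeholders standing in for the reporter's actual values:

from kafka import KafkaConsumer

kafka_multi_hosts = ['broker1:9092', 'broker2:9092']  # placeholder broker list
group_name = 'emsWorker'                              # placeholder, taken from the log below

consumer = KafkaConsumer(bootstrap_servers=kafka_multi_hosts,
                         auto_offset_reset='earliest',
                         enable_auto_commit=False,
                         group_id=group_name,
                         reconnect_backoff_ms=1,
                         consumer_timeout_ms=5000)
consumer.subscribe(['event'])  # topic name taken from the log below

while True:
    records = consumer.poll(500)  # timeout_ms=500; returns {TopicPartition: [messages]}
    for tp, messages in records.items():
        for message in messages:
            pass  # process message.value here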

No problem up to this point, but then in the log I see:

[INFO] 03/08/2018 02:52:53 PM Subscribe executed.
[INFO] 03/08/2018 02:52:53 PM Initialization pool executed.
[INFO] 03/08/2018 02:52:53 PM Subscribed to topic: event
[INFO] 03/08/2018 02:52:53 PM eventHandle connected
[WARNING] 03/08/2018 02:53:23 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:26 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:29 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:32 PM Heartbeat failed for group emsWorker because it is rebalancing
......
......
......
[WARNING] 03/08/2018 02:57:48 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:57:51 PM Heartbeat failed for group emsWorker because it is rebalancing
[INFO] 03/08/2018 02:57:53 PM Leaving consumer group (group_name).

Why?

I've also added the option max_poll_records=50 to the KafkaConsumer definition, but nothing changed.

@dpkp, can you help me?
Do you know if version 1.4.1 has a known problem in this area? I did not see this problem with the previous version.

Thanks in advance.

@andrewkowalik
Contributor

@ealfatt Yes, this is definitely newer behavior: before 1.4 there was no background heartbeat thread, so heartbeating basically happened during polling.

I am guessing after the last message your worker is not rejoining?

@dpkp
Owner

dpkp commented Mar 10, 2018

kafka-python 1.4 changes heartbeats to a background thread, and it enables additional timeout configurations in max_poll_interval_ms and session_timeout_ms. It is certainly possible there is a bug in the code, but all of our tests pass so I'm not aware of anything obvious. Can you enable debug logs and see if anything looks strange?

@ealfatt
Author

ealfatt commented Mar 12, 2018

@infecto Yes, after the last message my worker was not rejoining (thanks for the response).
Hi @dpkp, thanks for the response. I'll try to play with those two parameters and try to enable the debug logs (how can I do that? Is it explained in the documentation?)
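
For reference, a minimal way to turn on debug logging with the standard library logging module; kafka-python logs under the 'kafka.*' logger names seen in the output above (kafka.conn, kafka.coordinator, ...):

import logging

# Send all loggers, including kafka-python's, to stderr at DEBUG level.
logging.basicConfig(level=logging.DEBUG,
                    format='[%(levelname)s] %(asctime)s %(name)s %(message)s')
logging.getLogger('kafka').setLevel(logging.DEBUG)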

@ealfatt
Author

ealfatt commented Mar 14, 2018

Hi @dpkp
Below are two logs of what I see when the issue above happens.

log1.txt
log2.txt

Could you take a look and let me know your feedback?
Thanks a lot in advance.

@dpkp
Owner

dpkp commented Mar 22, 2018

This log entry, "Heartbeat poll expired, leaving group", means that your consumer application is not processing messages fast enough. What is your max_poll_interval_ms configuration? Is there something that might block your poll loop for longer than that amount?
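
A hedged sketch of one way to check this: set max_poll_interval_ms and max_poll_records explicitly and time how long each batch takes to process (broker, topic, and group values are illustrative):

import time
from kafka import KafkaConsumer

MAX_POLL_INTERVAL_MS = 300000  # kafka-python's default, 5 minutes

consumer = KafkaConsumer('event',                        # illustrative topic
                         bootstrap_servers='localhost:9092',
                         group_id='emsWorker',           # illustrative group
                         max_poll_interval_ms=MAX_POLL_INTERVAL_MS,
                         max_poll_records=50)            # smaller batches shorten the gap between polls

while True:
    batch = consumer.poll(timeout_ms=500)
    start = time.time()
    for tp, messages in batch.items():
        for message in messages:
            pass  # process the message here
    elapsed_ms = (time.time() - start) * 1000
    if elapsed_ms > 0.8 * MAX_POLL_INTERVAL_MS:
        print('batch took %.0f ms, close to max_poll_interval_ms' % elapsed_ms)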

@slash31

slash31 commented Oct 29, 2018

After upgrading to 1.4.3 from 1.3.5 a couple of weeks ago, I started seeing my consumer groups constantly rebalancing at the session timeout setting, no matter what I set it to. I tried:

  • adjusting the poll/heartbeat/session timers, as well as max recs polled
  • setting the assignment algorithm deterministically to round robin
  • turning off auto-commit and manually committing
  • upgrading Kafka to the latest available parcel

Eventually, I ended up writing a handler into the application where it would kill the consumer if the commit failed, then reincarnate it in the job queue and start it back up. You can imagine that with the default settings it was effectively in a perpetual loop of rebalancing, then reading a high volume of messages - the broker traffic graph looked like a perfect sawtooth.

I see heartbeat requests and successful responses, but the commit responses come back with Error 25 just before the error is thrown that the consumer can't commit successfully:

2018-10-28 20:35:06,563 kafka.conn DEBUG <BrokerConnection node_id=772 host=xvda-c14.xv.dc.openx.org:9092 [IPv4 ('10.11.3.14', 9092)]> Response 103 (0.808000564575 ms): OffsetCommitResponse_v2(topics=[(topic=u'netflow.raw', partitions=[(partition=0, error_code=25), (partition=5, error_code=25), (partition=10, error_code=25)])])
2018-10-28 20:35:06,563 kafka.conn DEBUG <BrokerConnection node_id=772 host=xvda-c14.xv.dc.openx.org:9092 [IPv4 ('10.11.3.14', 9092)]> Response 103 (0.808000564575 ms): OffsetCommitResponse_v2(topics=[(topic=u'netflow.raw', partitions=[(partition=0, error_code=25), (partition=5, error_code=25), (partition=10, error_code=25)])])
2018-10-28 20:35:06,563 kafka.coordinator.consumer DEBUG OffsetCommit for group netflow_raw failed: [Error 25] UnknownMemberIdError: netflow_raw
2018-10-28 20:35:06,563 kafka.coordinator.consumer DEBUG OffsetCommit for group netflow_raw failed: [Error 25] UnknownMemberIdError: netflow_raw

I downgraded the version to 1.3.5 again and everything is happy, all consumers have been running smoothly for hours now with no issue and the broker traffic graphs are flat. One other thing I noticed is a lot of wow/flutter on graphs of this data (netflow) when using 1.4.3, which smoothed out again after downgrading. The setup for each of my clusters is 10 consumers running on one host via multiprocess, consuming single topic with 10 partitions on three brokers. The application is very simple/lightweight; it reads 2k-4k messages per second depending on time of day, formats/augments some of the data, and then dumps it to CrateDB. With the 1.3.5 client, I can easily get > 20MBps throughput with this application (average is ~2MBps in production when the consumers are caught up), with each db write of 5k rows taking < 1s.

If you're interested in any of the other logs/graphs from either kafka-python or my brokers, I'd be happy to share.

@julienaubert

julienaubert commented Nov 7, 2018

I see this as well.

I have a set of consumers, each polling 500 records at a time, all part of group generation N.

Another consumer joins, causing a rebalance (generation N+1).

I can see in the logs that a consumer, while processing the 500 records it got at generation N, will log that the heartbeat failed due to rebalancing. But the processing code does not learn this, so it continues processing. (I should fix it so that the processing checks for this case and aborts; I'm not sure how. See the sketch at the end of this comment.)

I can see that the broker does not include this consumer in the rebalanced group (at N+1). I wonder why, since it was sending heartbeats? Is this considered a bug?

I only glanced quickly at the code. It seems this could happen: the heartbeat ends up in a "do not send more heartbeats" state due to rebalancing, and the broker takes sufficiently long to perform the rebalance. So the consumer has stopped sending heartbeats and is subsequently not included. It seems some special-case handling would be needed (e.g. resetting timeouts) when rebalancing occurs?

(The rebalancing keeps happening: when the consumer that did not become part of generation N+1 later tries to join again, the same thing happens, this time with some other consumer busy processing and not becoming part of generation N+2... and so on.)
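
One possible way to let the processing code learn about a rebalance, assuming kafka-python's ConsumerRebalanceListener interface; note that the callbacks fire from inside poll(), so this helps abort before working on the next (now stale) batch rather than strictly mid-batch:

import threading
from kafka import KafkaConsumer, ConsumerRebalanceListener

revoked_flag = threading.Event()

class StopOnRebalance(ConsumerRebalanceListener):
    # Called when the group starts rebalancing and this consumer's
    # partitions are revoked; the processing loop checks the flag and aborts.
    def on_partitions_revoked(self, revoked):
        revoked_flag.set()

    def on_partitions_assigned(self, assigned):
        revoked_flag.clear()

consumer = KafkaConsumer(bootstrap_servers='localhost:9092',  # illustrative values
                         group_id='mygroup')
consumer.subscribe(['mytopic'], listener=StopOnRebalance())

while True:
    for tp, messages in consumer.poll(timeout_ms=500).items():
        for message in messages:
            if revoked_flag.is_set():
                break  # partitions were revoked; stop working on this batch
            pass  # process the message here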

@88manpreet

We have also started seeing this error quite frequently after upgrading to 1.4.3. So far none of the tuning has turned out to be helpful.

@dpkp
Owner

dpkp commented Nov 21, 2018 via email

@rafenden

It happens to me as well; I'm running 1.4.4.

@goern

goern commented Dec 3, 2018

I can see it on 1.4.4 too.

@rabindrakm

Has there been a solution for this?

@licy121

licy121 commented Jan 8, 2019

It is similar to bug #1691, which I fixed. Do you think this is the same issue?

@dpkp
Owner

dpkp commented Jan 13, 2019

One point to keep in mind: heartbeats "failing" during rebalance is completely normal. The main java client treats this as a debug log, not a warning. If this is simply a case of spammy logs, we should probably return the log entry to debug.

However, if you are seeing unexpected rebalances and/or rebalances are not resolving correctly, then there may be something else going on. It is hard to say what that is from the issue detail so far. Some possibilities: your consumer is taking too long to process each batch, causing a max poll interval timeout. Or, there could be some bug or deadlock in kafka-python that is causing the consumer to stop processing records altogether. Do you have any more information on specifically what you are seeing in the consumer behavior apart from the heartbeat?
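
If the rebalances themselves turn out to be expected and only the log noise is the problem, one possible workaround in the meantime (a sketch, and only appropriate once you are sure the rebalances are benign) is to raise the level of the logger that emits this warning:

import logging

# The "Heartbeat failed for group ... because it is rebalancing" warning is
# emitted by the kafka.coordinator logger, as seen in the logs in this thread.
logging.getLogger('kafka.coordinator').setLevel(logging.ERROR)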

@vimal3271

I am seeing consumer rebalances even if there are no messages to consume. Start three consumers in a group, send some messages to the topic, and then stop the producer. The consumers will start seeing rebalances after 5-6 minutes.
Sample code here:
https://stackoverflow.com/questions/54908902/kafka-consumer-rebalance-occurs-even-if-there-is-no-message-to-consume

@dpkp
Owner

dpkp commented Feb 28, 2019

vimal: thanks for posting. I believe you may be hitting lock contention between an idle client.poll -- which can block and hold the client lock for the entire request_timeout_ms -- and the attempt by the heartbeat thread to send a new request. It seems to me that we may need to use KafkaClient.wakeup() to make sure that the polling thread drops the lock if/when we need to send a request from a different thread.

@vimal3271

@dpkp Thanks for the reply. Is there any workaround to avoid this?

@dpkp
Owner

dpkp commented Feb 28, 2019

This shouldn't be an issue when messages are flowing through your topics at a steady rate. If this is just a test environment, and you expect your production environment to have more steady live data, then you could just ignore the error in testing. But if you are managing a topic w/ very low traffic -- delays of minutes between consecutive messages, for example -- you might try to reduce the request_timeout_ms to something closer to the heartbeat_interval_ms, which should prevent the read side from blocking for much longer than the heartbeat timeout. But note that other timeouts may also need to change (max_poll_interval_ms and session_timeout_ms perhaps). Another workaround might be to reduce metadata_max_age_ms to something close / equal to your heartbeat_timeout_ms. This will cause more frequent metadata requests, but should unblock the send side when there is no socket data available for reads.
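
A sketch of those two workarounds expressed as consumer settings; the values are illustrative, and kafka-python generally requires request_timeout_ms to be larger than session_timeout_ms:

from kafka import KafkaConsumer

consumer = KafkaConsumer('low_traffic_topic',             # illustrative topic and group
                         bootstrap_servers='localhost:9092',
                         group_id='mygroup',
                         heartbeat_interval_ms=3000,      # default
                         session_timeout_ms=10000,        # default
                         request_timeout_ms=11000,        # workaround 1: keep idle reads from blocking too long
                         metadata_max_age_ms=10000)       # workaround 2: frequent metadata requests unblock the send side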

@vimal3271

@dpkp Thanks for the workaround. The rebalance is not happening anymore. I used the second approach, reducing metadata_max_age_ms to something close to the heartbeat timeout.

@mhorowitz0

vimal: thanks for posting. I believe you may be hitting lock contention between an idle client.poll -- which can block and hold the client lock for the entire request_timeout_ms -- and the attempt by the heartbeat thread to send a new request. It seems to me that we may need to use KafkaClient.wakeup() to make sure that the polling thread drops the lock if/when we need to send a request from a different thread.

@dpkp I think I am hitting this exact issue with the lock contention, and it happens quite frequently (using 1.4.4). I turned the log level to DEBUG, and I can see that the heartbeats were every 3 seconds, then it went into a _client.poll -> _selector.select -> _epoll.poll (confirmed in a pdb stack trace) and just hung for 5 minutes until it timed out, despite the fact that the log end offset was way ahead of the current offset. During that time no heartbeats were sent. I suspect the session timed out after 10 seconds, and the poll was left in the lurch until it reached request_timeout_ms, after which it recognized the heartbeat timeout, refreshed the metadata, and rejoined the consumer group. After that the heartbeats started again every 3 seconds. I saw another issue about the heartbeat not being started after JoinGroup, but from what I can see the heartbeats were regular since the last JoinGroup in the logs, and only stopped when it went into that poll.

The issue happens randomly, but it happens often. The result is that consumers periodically hang in polls, messages build up, and the lag becomes large; then the consumer has to rejoin the group and catch up on a backlog of messages. This problem becomes even worse when the poll goes for more than max_poll_interval_ms, causing runaway constant rebalancing, especially in groups with many consumers. I've mitigated the runaway rebalancing by upping max_poll_interval_ms to 1 hour, but I still see those constant poll hangs on partitions with plenty of lag, anywhere from 20 seconds to 5 minutes. I can only attribute them so far to the stoppage of the heartbeats and a resulting session timeout.

I was thinking about reducing the request_timeout_ms from the default of 5 minutes to 30 seconds, but that will only work around the issue, it seems the underlying root cause of those heartbeats stopping would still be there.

@jeffwidman
Collaborator

jeffwidman commented Mar 13, 2019

@mhorowitz0 / @vimal3271 / @slash31 / @88manpreet there were a number of fixes included in the 1.4.6 release that should alleviate these lock problems without requiring workarounds... do you mind trying them and seeing whether they fix these issues?

@jeffwidman
Collaborator

I'm going to close this; please try the new 1.4.6 release. If you still see issues, please open a new ticket.

@mhorowitz0

@jeffwidman I tried 1.4.5, and I still see the same symptoms. I will try 1.4.6.

@mhorowitz0

@jeffwidman I am seeing the same issue with 1.4.6. The issue is intermittent, so it is hard to catch, but from what I can see, even with a single consumer group a poll() can hang inexplicably for several minutes even when the topic/partition has a large lag. Even a hang as short as 12 seconds is enough to trigger a session timeout: during that time heartbeats stop being sent for longer than the session timeout, so a rebalance is triggered. I don't know why it hangs so long in poll() when most polls take less than a second, and I don't know why the heartbeats stop, but I can only guess it is the same root cause, most likely the lock contention @dpkp refers to above.

@artemistomaras

artemistomaras commented Apr 10, 2019

@jeffwidman I am having the same issue as @mhorowitz0 with 1.4.6 with the default provided settings for the consumer.

A consumer inside a consumer group fails with the following log when there are no messages in the topic partition it is currently assigned to.

2019 Apr 10 06:22:03  | 2019-04-10 06:22:02,491 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 06:22:06  | 2019-04-10 06:22:05,582 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 220
2019 Apr 10 06:22:06  | 2019-04-10 06:22:05,583 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=3)]
2019 Apr 10 06:22:06  | 2019-04-10 06:22:05,583 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,880 - kafka.coordinator - WARNING - Heartbeat session expired, marking coordinator dead
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,880 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: Heartbeat session expired.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,889 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,893 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:37,890 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,897 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,894 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,893 - kafka.coordinator.consumer - WARNING - Auto offset commit failed for group NSFW_CONSUMER: [Error 7] RequestTimedOutError: NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:42,898 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:48,003 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:47,901 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:53,149 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:48,003 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:53,252 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:25:58,254 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: [Error 7] RequestTimedOutError.
2019 Apr 10 06:27:12  | 2019-04-10 06:25:53,252 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:58,358 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:25:58,358 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,382 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,485 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 221
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,381 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,486 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 06:27:12  | 2019-04-10 06:26:00,486 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=3)]
2019 Apr 10 06:34:42  | 2019-04-10 06:34:42,897 - kafka.client - INFO - Closing idle connection 2, last active 540000 ms ago
2019 Apr 10 06:34:42  | 2019-04-10 06:34:42,898 - kafka.conn - INFO - <BrokerConnection node_id=2 host=tot-kafka-2.tot-kafka-headless.tot:9092 <connected> [IPv4 ('10.244.3.12', 9092)]>: Closing connection. 
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,707 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,717 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,718 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,990 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 222
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,994 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=3)]
2019 Apr 10 07:21:58  | 2019-04-10 07:21:57,996 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,457 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,483 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,482 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=3)} for group NSFW_CONSUMER
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,802 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 223
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,802 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=2)]
2019 Apr 10 07:26:02  | 2019-04-10 07:26:02,803 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=2)} for group NSFW_CONSUMER
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,635 - kafka.coordinator - WARNING - Heartbeat session expired, marking coordinator dead
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,635 - kafka.coordinator - WARNING - Marking the coordinator dead (node 0) for group NSFW_CONSUMER: Heartbeat session expired.
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,643 - kafka.cluster - INFO - Group coordinator for NSFW_CONSUMER is BrokerMetadata(nodeId=0, host='tot-kafka-0.tot-kafka-headless.tot', port=9092, rack=None)
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,644 - kafka.coordinator - INFO - Discovered coordinator 0 for group NSFW_CONSUMER
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,648 - kafka.coordinator.consumer - WARNING - Auto offset commit failed for group NSFW_CONSUMER: CommitFailedError: Commit cannot be completed since the group has already
2019 Apr 10 07:36:02  |             rebalanced and assigned the partitions to another member.
2019 Apr 10 07:36:02  |             This means that the time between subsequent calls to poll()
2019 Apr 10 07:36:02  |             was longer than the configured max_poll_interval_ms, which
2019 Apr 10 07:36:02  |             typically implies that the poll loop is spending too much
2019 Apr 10 07:36:02  |             time message processing. You can address this either by
2019 Apr 10 07:36:02  |             increasing the rebalance timeout with max_poll_interval_ms,
2019 Apr 10 07:36:02  |             or by reducing the maximum size of batches returned in poll()
2019 Apr 10 07:36:02  |             with max_poll_records.
2019 Apr 10 07:36:02  |             
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,649 - kafka.coordinator.consumer - ERROR - Offset commit failed: This is likely to cause duplicate message delivery
2019 Apr 10 07:36:02  | Traceback (most recent call last):
  File "/opt/app/.venv/lib/python3.6/site-packages/kafka/coordinator/consumer.py", line 521, in _maybe_auto_commit_offsets_sync
    self.commit_offsets_sync(self._subscription.all_consumed_offsets())
  File "/opt/app/.venv/lib/python3.6/site-packages/kafka/coordinator/consumer.py", line 514, in commit_offsets_sync
    raise future.exception # pylint: disable-msg=raising-bad-type
kafka.errors.CommitFailedError: CommitFailedError: Commit cannot be completed since the group has already
2019 Apr 10 07:36:02  |             rebalanced and assigned the partitions to another member.
2019 Apr 10 07:36:02  |             This means that the time between subsequent calls to poll()
2019 Apr 10 07:36:02  |             was longer than the configured max_poll_interval_ms, which
2019 Apr 10 07:36:02  |             typically implies that the poll loop is spending too much
2019 Apr 10 07:36:02  |             time message processing. You can address this either by
2019 Apr 10 07:36:02  |             increasing the rebalance timeout with max_poll_interval_ms,
2019 Apr 10 07:36:02  |             or by reducing the maximum size of batches returned in poll()
2019 Apr 10 07:36:02  |             with max_poll_records.
2019 Apr 10 07:36:02  |             
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,649 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=2)} for group NSFW_CONSUMER
2019 Apr 10 07:36:02  | 2019-04-10 07:36:02,650 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:36:03  | 2019-04-10 07:36:03,920 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 225
2019 Apr 10 07:36:03  | 2019-04-10 07:36:03,921 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=4)} for group NSFW_CONSUMER
2019 Apr 10 07:36:03  | 2019-04-10 07:36:03,920 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=4)]
2019 Apr 10 07:36:04  | 2019-04-10 07:36:03,989 - kafka.conn - INFO - <BrokerConnection node_id=2 host=tot-kafka-2.tot-kafka-headless.tot:9092 <connecting> [IPv4 ('10.244.3.12', 9092)]>: connecting to tot-kafka-2.tot-kafka-headless.tot:9092 [('10.244.3.12', 9092) IPv4]
2019 Apr 10 07:36:04  | 2019-04-10 07:36:03,991 - kafka.conn - INFO - <BrokerConnection node_id=2 host=tot-kafka-2.tot-kafka-headless.tot:9092 <connecting> [IPv4 ('10.244.3.12', 9092)]>: Connection complete.
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,668 - kafka.coordinator - WARNING - Heartbeat failed for group NSFW_CONSUMER because it is rebalancing
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,677 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions {TopicPartition(topic='request_for_analyze', partition=4)} for group NSFW_CONSUMER
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,700 - kafka.coordinator - INFO - Successfully joined group NSFW_CONSUMER with generation 226
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,684 - kafka.coordinator - INFO - Elected group leader -- performing partition assignments using range
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,678 - kafka.coordinator - INFO - (Re-)joining group NSFW_CONSUMER
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,701 - kafka.coordinator.consumer - INFO - Setting newly assigned partitions {TopicPartition(topic='request_for_analyze', partition=4)} for group NSFW_CONSUMER
2019 Apr 10 07:43:54  | 2019-04-10 07:43:54,700 - kafka.consumer.subscription_state - INFO - Updated partition assignment: [TopicPartition(topic='request_for_analyze', partition=4)]

@ottomata

ottomata commented May 13, 2019

I'll just delete my comment since it is a bit irrelevant and incorrect with the wrong version numbers and results.

Our summary is:

1.4.1 was working fine for us, but another of our users wanted an upgrade of kafka-python to get the fix for #1628. We upgraded to 1.4.6, and saw the missed heartbeats and rebalances this issue is about. We incorrectly reverted to 1.4.3 with #1628 backported, and still saw missed heartbeats and rebalances. We then reverted to 1.4.3 without #1628 backported, and eventually also saw missed heartbeats. So, I was incorrect in guessing that the fix in #1628 may have caused this issue.

Once we reverted all the way back to 1.4.1, the missed heartbeats ceased happening.

I haven't seen other reports of performance/instability problems with 1.4.6.
We and others in this thread are reporting that unexpected missed heartbeats happen in 1.4.6.

Thanks, sorry for the confusion!

@rolisz

rolisz commented May 14, 2019

@jeffwidman I still encounter the problems described in this issue on 1.4.6. I can provide debug logs if they would be useful.

@jeffwidman
Collaborator

jeffwidman commented May 14, 2019

I can provide debug logs if they would be useful.

Yes please. In particular, please make sure your log formatter includes timings, as I suspect the logs will be useless otherwise for identifying the root cause.

@elukey

elukey commented May 14, 2019

@jeffwidman I can provide logs as well; I am testing 1.4.6 in a test environment with DEBUG logging enabled. Usually these are the logs that I see:

May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,170 [26126] (MainThread) kafka.client [DEBUG] Sending metadata request MetadataRequest_v1(topics=[u'eventlogging-client-side']) to node 2
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,170 [26126] (MainThread) kafka.protocol.parser [DEBUG] Sending request MetadataRequest_v1(topics=[u'eventlogging-client-side'])
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,170 [26126] (MainThread) kafka.conn [DEBUG] <BrokerConnection node_id=2 host=deployment-kafka-jumbo-2.deployment-prep.eqiad.wmflabs:9092 <connected> [IPv4 ('172.16.5.47', 9092)]> Request 117275: MetadataRequest_v1(topics=[u'eventlogging-client-side'])
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,171 [26126] (eventlogging_processor_client_side_00-heartbeat) kafka.coordinator [WARNING] Heartbeat session expired, marking coordinator dead
May 13 23:28:32 deployment-eventlog05 eventlogging-processor@client-side-01[26126]: 2019-05-13 23:28:32,171 [26126] (eventlogging_processor_client_side_00-heartbeat) kafka.coordinator [WARNING] Marking the coordinator dead (node 1) for group eventlogging_processor_client_side_00: Heartbeat session expired.

The metadata request before the heartbeat failure seems to happen every time. How far back should I go in the logs to find the culprit of the Heartbeat session expired? Is it related to session_timeout_ms (we use the default of 10s)? Thanks a lot!

@ottomata

@jeffwidman are there other logs we should try and get for ya?

@tmcqueen-materials

We appear to be seeing this issue as well. It seems to happen when the processing code is also multithreaded/multiprocessed and is doing intensive work with large (0.5 MB) messages. Perhaps an issue where the background heartbeat thread gets starved or doesn't know it needs to do something?

@cvley

cvley commented Jun 1, 2019

I have encountered this problem: consumers keep rebalancing when the machine has a lot of network traffic. I tried setting heartbeat_interval_ms to 1000 ms; rebalancing no longer showed up in the log, but messages weren't consumed either.

After I downgraded to 1.3.5, everything was normal.

Just for reference.

@vimal3271

vimal3271 commented Jun 1, 2019

The configuration below worked for me:

'request_timeout_ms': 305000,
'max_poll_interval_ms': 300000,
'session_timeout_ms': 180000,    # increased to 3 min from the default 10 s
'heartbeat_interval_ms': 60000,  # increased to 1 min from the default 3 s
'metadata_max_age_ms': 50000

Please try it out. I am using 1.4.5.

@charettes

This is still happening on 1.4.5 and is preventing us from using this library with Python 3.7, because only < 1.4 works appropriately for us. When using < 1.4 we see frequent heartbeat failures because of rebalancing, and more importantly double consumption of messages, which is highly problematic.

This also happens when bumping session.timeout.ms to larger values, which eventually leads to spurious ValueError: I/O operation on closed epoll fd errors.

@piperck

piperck commented Aug 16, 2019

We use 1.4.6.
It looks like a poll can hang for a period, which means heartbeats cannot be sent.
Following this suggestion:

workaround might be to reduce metadata_max_age_ms to something close / equal to your heartbeat_timeout_ms. This will cause more frequent metadata requests, but should unblock the send side when there is no socket data available for reads.

everything is now fine.

@dingyo777

dingyo777 commented Sep 2, 2019

Hello guys, I also have the same issue with 1.4.6.
I've now tried the second workaround @dpkp suggested: reducing metadata_max_age_ms to heartbeat_interval_ms.
I'll report back later on whether it works.

@dingyo777

Result: the workaround works well. No heartbeat failures, no rebalances.

@dpkp
Owner

dpkp commented Sep 30, 2019

I made several improvements to KafkaConsumer and the underlying client that I believe should fix this issue. Please try 1.4.7 and reopen / file issues if this persists!

@dpkp dpkp closed this as completed Sep 30, 2019
@sangmeshcp

Still hitting this issue with 1.4.7:

[2019-10-02 15:28:24,488] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:27,587] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:30,680] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:33,780] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:36,882] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:39,975] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:43,074] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:46,165] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:49,164] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
[2019-10-02 15:28:52,266] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing

@jmgpeeters

1.4.7 appears to have fixed the issue for us.

@sangmeshcp

1.4.7 appears to have fixed the issue for us.

Does it work with multiple brokers as well?

@elukey

elukey commented Oct 8, 2019

1.4.7 appears to have fixed the issue for us.

Same thing from my side: I have been testing the new version in production for 24 hours now and everything went smoothly. I will keep monitoring and report back in case anything arises, but so far so good! Thanks!

@asalem

asalem commented Oct 16, 2019

Simply reduce max_poll_records; the default is 500, so you may tweak that to get rid of this warning.

@kennes913

kennes913 commented Oct 17, 2019

I am still having this issue using 1.4.7. Like @sangmeshcp, we have multiple brokers. I am beginning to think this may have something to do with the Kafka cluster configuration rather than kafka-python, since I was able to connect to the same topic in a different group with no ill effects.

Update:
We have some legacy processes that use 1.3.3 and python 3.4. I tested using that combination and everything ran fine. Not directly helpful -- just providing more data points.

@naoshadmehta

We're seeing this problem with 1.4.6 and 1.4.7. We have 3 broker nodes and 6 consumers per node. Our topic has 18 partitions and for about an hour or so, we see no issues. After an hour or so, we start seeing

server.log:[2019-10-29 01:28:16,927] INFO [GroupCoordinator 1]: Preparing to rebalance group tasks_group in state PreparingRebalance with old generation 9 (__consumer_offsets-17) (reason: removing member kafka-python-1.4.7-52039bb5-17c5-42d7-8af3-d95f0dbc3f3f on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)

A few seconds later, we see a new consumer being added:
server.log:[2019-10-29 01:29:00,612] INFO [GroupCoordinator 1]: Preparing to rebalance group tasks_group in state PreparingRebalance with old generation 10 (__consumer_offsets-17) (reason: Adding new member kafka-python-1.4.7-27f5bbac-b1e6-41a9-999b-abb56868e9e2) (kafka.coordinator.group.GroupCoordinator)

After the system gets into this state, we keep seeing removal and addition of members to the consumer group, and within a few hours it reaches a point where the rebalance takes more than 10 minutes to complete. By this time, the lag on the partitions is so high that it's impossible to recover from this situation.

Our broker config:
max.session.timeout.ms: 30000
heartbeat.interval.ms: 10000

Consumer config:
max.session.timeout.ms: 10000
heartbeat.interval.ms: 3000
consumer_timeout_ms: 2000

Everything else is left as default.

Any suggestions on what we could tweak in the broker or consumer configs to avoid seeing this issue?

jharting added a commit to jharting/insights-host-inventory that referenced this issue Dec 17, 2019
should help with frequent partition rebalances according to dpkp/kafka-python#1418
jharting added a commit to RedHatInsights/insights-host-inventory that referenced this issue Dec 19, 2019
should help with frequent partition rebalances according to dpkp/kafka-python#1418
@jmgpeeters

Is there any plan to upgrade kafka-python to 1.4.7 on conda-forge? https://anaconda.org/conda-forge/kafka-python/files currently still shows 1.4.6 as the latest.

@jeffwidman
Collaborator

@jmgpeeters please open a new ticket; this one is completely unrelated to conda-forge.

wmfgerrit pushed a commit to wikimedia/search-MjoLniR-deploy that referenced this issue Mar 2, 2020
Seeing problems with the eqiad group getting stuck with
"heartbeat failed from group xyz because it is rebalancing". Based
on dpkp/kafka-python#1418 this may
be resolved with a version update to 1.4.7.

Change-Id: I3896f52af54112ce69c8b60d67b8f8471bee8d65
@xiaoma-father

I used kafka-python 2.0.1, and I saw this warning in the log very frequently, nearly every second, and the client consumed messages from Kafka slowly. I solved it by creating several consumer groups to consume the topics. At first I consumed all the topics, nearly twelve, in one group; after looking for solutions on the web, I created five groups, each consuming the topics for a different function, and the problem was solved. I hope this helps.
