Heartbeat failed for group xxxWorker because it is rebalancing #1418
@ealfatt Yeah, this is definitely newer behavior: before 1.4 there was no background heartbeat thread, so heartbeating basically happened during polling. I am guessing that after the last message your worker is not rejoining?
kafka-python 1.4 moves heartbeats to a background thread, and it enables additional timeout configuration via max_poll_interval_ms and session_timeout_ms. It is certainly possible there is a bug in the code, but all of our tests pass, so I'm not aware of anything obvious. Can you enable debug logs and see if anything looks strange?
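For example, a minimal sketch of turning those knobs and enabling debug logs (the topic/group names and values here are illustrative placeholders, not recommendations):

```python
import logging
from kafka import KafkaConsumer

# DEBUG logging surfaces the coordinator and heartbeat-thread activity.
logging.basicConfig(level=logging.DEBUG)

consumer = KafkaConsumer(
    'my-topic',                   # hypothetical topic name
    bootstrap_servers='localhost:9092',
    group_id='my-group',
    session_timeout_ms=10000,     # broker evicts the member if no heartbeat arrives in this window
    heartbeat_interval_ms=3000,   # cadence of the background heartbeat thread
    max_poll_interval_ms=300000,  # max gap between poll() calls before leaving the group
)
```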
This log entry:
After upgrading to 1.4.3 from 1.3.5 a couple of weeks ago, I started seeing my consumer groups constantly rebalancing at the session timeout setting, no matter what I set it to. I tried:
Eventually, I ended up writing a handler into the application that would kill the consumer if the commit failed, then reincarnate it in the job queue and start it back up. You can imagine that with the default settings it was effectively in a perpetual loop of rebalancing, then reading a high volume of messages; the broker traffic graph looked like a perfect sawtooth. I see heartbeat requests and successful responses, but the commit responses carry Error 25 just before the exception is raised saying the consumer can't commit successfully:
I downgraded the version to 1.3.5 again and everything is happy; all consumers have been running smoothly for hours now with no issue, and the broker traffic graphs are flat. One other thing I noticed is a lot of wow/flutter on graphs of this data (netflow) when using 1.4.3, which smoothed out again after downgrading. The setup for each of my clusters is 10 consumers running on one host via multiprocessing, consuming a single topic with 10 partitions on three brokers. The application is very simple/lightweight; it reads 2k-4k messages per second depending on time of day, formats/augments some of the data, and then dumps it to CrateDB. With the 1.3.5 client, I can easily get > 20MBps throughput with this application (average is ~2MBps in production when the consumers are caught up), with each db write of 5k rows taking < 1s. If you're interested in any of the other logs/graphs from either kafka-python or my brokers, I'd be happy to share.
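A minimal sketch of the kind of kill-and-restart handler described above (an assumption-laden illustration: process() and the topic/group names are placeholders, and it presumes kafka-python surfaces the failed commit as kafka.errors.CommitFailedError):

```python
from kafka import KafkaConsumer
from kafka.errors import CommitFailedError

def run_consumer():
    # Topic, group, and process() are placeholders for the real application.
    consumer = KafkaConsumer('netflow', group_id='netflow-workers',
                             bootstrap_servers='localhost:9092',
                             enable_auto_commit=False)
    try:
        for message in consumer:
            process(message)   # hypothetical per-message work
            consumer.commit()  # raises CommitFailedError after a rebalance
    except CommitFailedError:
        consumer.close()  # kill this consumer; the job queue reincarnates it
        raise
```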
I see this as well. I have a set of consumers, each polling 500 records at a time, as part of group generation N. Another consumer joins, causing a rebalance to generation N+1. I can see in the logs that a consumer, while processing the 500 records that it got at generation N, will log that the heartbeat failed due to rebalancing. But the processing code does not learn this, so it continues processing. (I should fix the processing so it checks for this case and aborts; not sure how? See the sketch below.)

I can see that the broker does not include this consumer in the rebalanced group (at N+1). I wonder why, as it was sending heartbeats? Is this considered a bug? I only quickly glanced at the code, but it seems that this could happen: the heartbeat thread ends up in a "do not send more heartbeats" state due to rebalancing, and the broker takes sufficiently long to perform the rebalance. So the consumer has stopped sending heartbeats, and is subsequently not included. It seems some special-case handling would be needed (e.g. resetting timeouts) when rebalancing occurs?

(The rebalancing keeps happening: when this consumer that did not become part of generation N+1 later tries to join again, the same thing happens, except this time some other consumer is busy processing and does not become part of generation N+2... and so on.)
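A hedged sketch of one way to make the processing loop aware of a rebalance, using kafka-python's ConsumerRebalanceListener (handle() and all names are placeholders; note that kafka-python invokes these callbacks from inside poll(), so the flag is only reliable at batch boundaries rather than truly mid-batch):

```python
import threading
from kafka import KafkaConsumer, ConsumerRebalanceListener

class RevocationFlag(ConsumerRebalanceListener):
    """Flags a rebalance so the processing loop can abandon stale work."""
    def __init__(self):
        self.revoked = threading.Event()

    def on_partitions_revoked(self, revoked):
        self.revoked.set()

    def on_partitions_assigned(self, assigned):
        self.revoked.clear()

listener = RevocationFlag()
consumer = KafkaConsumer(bootstrap_servers='localhost:9092', group_id='my-group')
consumer.subscribe(['my-topic'], listener=listener)

while True:
    batch = consumer.poll(timeout_ms=1000, max_records=500)
    records = [r for recs in batch.values() for r in recs]
    for record in records:
        if listener.revoked.is_set():
            break  # abandon the rest of the batch; our partitions were revoked
        handle(record)  # hypothetical per-record processing
```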
We have also started seeing this error quite frequently with the upgrade to 1.4.3. So far none of the tuning has helped.
Can you try 1.4.4? There are some fixes included that may help.
It happens to me as well, I'm running
I can see it on
Has there been a solution for this?
It is similar to bug #1691, which I fixed. Do you think this is the same issue?
One point to keep in mind: heartbeats "failing" during rebalance is completely normal. The main Java client treats this as a debug log, not a warning. If this is simply a case of spammy logs, we should probably return the log entry to debug.

However, if you are seeing unexpected rebalances and/or rebalances are not resolving correctly, then there may be something else going on. It is hard to say what that is from the issue detail so far. Some possibilities: your consumer is taking too long to process each batch, causing a max poll interval timeout. Or, there could be some bug or deadlock in kafka-python that is causing the consumer to stop processing records altogether. Do you have any more information on specifically what you are seeing in the consumer behavior apart from the heartbeat?
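To check the first possibility, instrumentation along these lines can show whether batch processing approaches the poll interval (a diagnostic sketch; handle() and the names are placeholders, and 300 s is kafka-python's default max_poll_interval_ms):

```python
import logging
import time
from kafka import KafkaConsumer

log = logging.getLogger(__name__)
MAX_POLL_INTERVAL_S = 300  # default max_poll_interval_ms / 1000

consumer = KafkaConsumer('my-topic', group_id='my-group',
                         bootstrap_servers='localhost:9092')
while True:
    batch = consumer.poll(timeout_ms=1000)
    started = time.time()
    for records in batch.values():
        for record in records:
            handle(record)  # hypothetical processing
    elapsed = time.time() - started
    if elapsed > 0.8 * MAX_POLL_INTERVAL_S:
        # Batches this slow will eventually exceed max_poll_interval_ms
        # and force the member out of the group.
        log.warning('batch took %.1fs, close to the max poll interval', elapsed)
```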
I am seeing consumer rebalances even when there are no messages to consume. Start three consumers in a group, send some messages to the topic, and then stop the producer. The consumers will start seeing rebalances after 5-6 minutes.
vimal: thanks for posting. I believe you may be hitting lock contention between an idle client.poll -- which can block and hold the client lock for the entire request_timeout_ms -- and the attempt by the heartbeat thread to send a new request. It seems to me that we may need to use KafkaClient.wakeup() to make sure that the polling thread drops the lock if/when we need to send a request from a different thread.
@dpkp Thanks for the reply. Is there any workaround to avoid this?
This shouldn't be an issue when messages are flowing through your topics at a steady rate. If this is just a test environment, and you expect your production environment to have more steady live data, then you could just ignore the error in testing. But if you are managing a topic w/ very low traffic -- delays of minutes between consecutive messages, for example -- you might try to reduce the
@dpkp Thanks for the workaround. The rebalance is not happening anymore. I used the second approach of reducing
@dpkp I think I am hitting this exact issue with the lock contention, and it happens quite frequently (using 1.4.4). I turned the log level to DEBUG, and I can see that the heartbeats were every 3 seconds; then it went into a _client.poll -> _selector.select -> _epoll.poll (confirmed in a pdb stack trace) and just hung for 5 minutes until it timed out, despite the fact that the log end offset was way ahead of the current offset. During that time no heartbeats were sent. I suspect the session timed out after 10 seconds, and the poll was left in the lurch until it reached request_timeout_ms, after which it recognized the heartbeat timeout and started refreshing the metadata and rejoining the consumer group. After that, the heartbeats resumed every 3 seconds. I saw another issue about the heartbeat not being started after JoinGroup, but from what I can see the heartbeats were regular since the last JoinGroup in the logs, and only stopped when it went into that poll.

The issue happens randomly, but it happens often. The result is that consumers periodically hang in polls, messages build up, and the lag becomes large; then the consumer has to rejoin the group and catch up on a backlog of messages. This becomes even worse when the poll runs past max_poll_interval_ms, causing runaway constant rebalancing, especially in groups with many consumers.

I've mitigated the runaway rebalancing by upping max_poll_interval_ms to 1 hour, but I still see those constant poll hangs on partitions with plenty of lag, anywhere from 20 seconds to 5 minutes. I can thus far only attribute them to the stoppage of the heartbeats and a resulting session timeout. I was thinking about reducing request_timeout_ms from the default of 5 minutes to 30 seconds, but that would only work around the issue; the underlying root cause of the heartbeats stopping would still be there.
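In consumer-config terms, those two mitigations look roughly like this (a sketch of the workaround described above, not a general recommendation; kafka-python requires request_timeout_ms to stay above session_timeout_ms, so 30 s still clears the 10 s default):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-topic',                    # placeholder topic/group names
    group_id='my-group',
    bootstrap_servers='localhost:9092',
    max_poll_interval_ms=3600000,  # 1 hour, to stop the runaway rebalancing
    request_timeout_ms=30000,      # 30s instead of the ~5-minute default, so a
                                   # hung client.poll gives up the lock (and the
                                   # heartbeat thread) much sooner
)
```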
@mhorowitz0 / @vimal3271 / @slash31 / @88manpreet there were a number of fixes included in the
I'm going to close this. Please try the new 1.4.6 release; if you still see issues, please open a new ticket.
@jeffwidman I tried 1.4.5, and I still see the same symptoms. I will try 1.4.6.
@jeffwidman I am seeing the same issue with 1.4.6. The issue is intermittent, so it is hard to catch, but from what I can see, even with a single consumer group, a poll() can hang inexplicably for several minutes even when the topic/partition has a large lag; a hang of as little as 12 seconds is enough to trigger a session timeout. During that time heartbeats stop being sent for longer than the session timeout, so a rebalance is triggered. I don't know why it hangs so long in poll() when most polls take less than a second, and I don't know why the heartbeats stop, but I can only guess it is the same root cause, most likely the lock contention @dpkp refers to above.
@jeffwidman I am having the same issue as @mhorowitz0 with 1.4.6, with the default settings for the consumer. A consumer inside a consumer group fails with the following log when there are no messages in the topic partition it is currently assigned to:
I'll just delete my comment since it is a bit irrelevant and incorrect, with the wrong version numbers and results. Our summary is: 1.4.1 was working fine for us, but another of our users wanted an upgrade of kafka-python to get the fix for #1628. We upgraded to 1.4.6 and saw the missed heartbeats and rebalances this issue is about. We incorrectly reverted to 1.4.3 with #1628 backported, and still saw missed heartbeats and rebalances. We then reverted to 1.4.3 without #1628 backported, and eventually also saw missed heartbeats. So, I was incorrect in guessing that the fix in #1628 may have caused this issue. Once we reverted all the way back to 1.4.1, the missed heartbeats stopped.
Thanks, sorry for the confusion!
@jeffwidman I still encounter the problems described in this issue on 1.4.6. I can provide debug logs if they would be useful.
Yes please. In particular, please make sure your log formatter includes timestamps, as I suspect the logs will otherwise be useless for identifying the root cause.
@jeffwidman I can provide logs as well; I am testing 1.4.6 in a testing environment with DEBUG logging enabled. Usually these are the logs that I see:
The metadata request before the heartbeat failure seems to happen every time. How far back should I go in the logs to find the culprit of
@jeffwidman are there other logs we should try and get for ya?
We appear to be seeing this issue as well. It seems to happen when the processing code is also multithreaded/multiprocessed and is doing intensive work with large (0.5 MB) messages. Perhaps an issue where the background heartbeat thread gets starved or doesn't know it needs to do something?
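If the heartbeat thread really is being starved by CPU-bound work (a plausible GIL effect with large payloads), one hedged workaround is to push the heavy processing into worker processes so the consumer's threads stay responsive; a sketch, with heavy_transform() and the names as placeholders:

```python
from concurrent.futures import ProcessPoolExecutor
from kafka import KafkaConsumer

def heavy_transform(value):
    """CPU-intensive work on a large (~0.5 MB) payload, run in a worker
    process so it cannot hold the GIL in the consumer process."""
    ...

consumer = KafkaConsumer('my-topic', group_id='my-group',
                         bootstrap_servers='localhost:9092')
with ProcessPoolExecutor(max_workers=4) as pool:
    for message in consumer:
        # submit() returns immediately, so poll()s stay frequent and the
        # heartbeat thread is not blocked behind CPU-bound work.
        pool.submit(heavy_transform, message.value)
```

Note the trade-off: fire-and-forget submit() gives up per-message commit guarantees, so this only fits workloads that tolerate reprocessing.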
I have encountered this problem: consumers keep rebalancing when the machine has a lot of network traffic. I tried to set the ... After I downgraded to ... Just for reference.
The configuration below worked for me:
Please try it out. I am using 1.4.5.
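The actual snippet did not survive in this copy of the thread; purely as an illustration of the kinds of settings being tuned throughout this issue (every name and value below is hypothetical, not the commenter's real configuration):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'my-topic',
    group_id='my-group',
    bootstrap_servers='localhost:9092',
    session_timeout_ms=30000,     # more slack before heartbeat expiration
    heartbeat_interval_ms=10000,  # conventionally ~1/3 of the session timeout
    max_poll_records=100,         # smaller batches, shorter gaps between polls
    max_poll_interval_ms=600000,
)
```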
This is still happening on 1.4.5 and is preventing us from using this library with Python 3.7, because only versions < 1.4 work properly. When using < 1.4 we see frequent heartbeat failures because of rebalancing, but more importantly double consumption of messages, which is highly problematic. This is also happening when bumping
We use 1.4.6 ... everything will be fine.
Hello guys, I also have the same issue with 1.4.6.
Result:
I made several improvements to KafkaConsumer and the underlying client that I believe should fix this issue. Please try 1.4.7 and reopen / file issues if this persists!
Still hitting this issue with 1.4.7:
[2019-10-02 15:28:24,488] {base.py:828} WARNING - Heartbeat failed for group ca because it is rebalancing
1.4.7 appears to have fixed the issue for us.
Will it work with multiple brokers?
Same thing from my side: I have been testing the new version for 24h now in production, and everything went smoothly. I will keep monitoring and report back in case anything arises, but so far so good! Thanks!
Simply reduce max_poll_records; the default is 500, so you may tweak that to get rid of this warning.
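That suggestion amounts to something like this (the value is an arbitrary example):

```python
from kafka import KafkaConsumer

# Fewer records per poll() means less processing time between polls, so the
# consumer is less likely to overrun max_poll_interval_ms.
consumer = KafkaConsumer('my-topic', group_id='my-group',
                         bootstrap_servers='localhost:9092',
                         max_poll_records=100)  # default is 500
```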
I am still having this issue using 1.4.7. Like @sangmeshcp, we have multiple brokers. I am beginning to think, though, that this may have something to do with the Kafka cluster configuration rather than kafka-python, since I was able to connect to the same topic in a different group with no ill effects. Update:
We're seeing this problem with 1.4.6 and 1.4.7. We have 3 broker nodes and 6 consumers per node. Our topic has 18 partitions, and for about an hour or so we see no issues. After an hour or so, we start seeing:

server.log: [2019-10-29 01:28:16,927] INFO [GroupCoordinator 1]: Preparing to rebalance group tasks_group in state PreparingRebalance with old generation 9 (__consumer_offsets-17) (reason: removing member kafka-python-1.4.7-52039bb5-17c5-42d7-8af3-d95f0dbc3f3f on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)

A few seconds later, we see a new consumer being added. After the system gets into this state, we keep seeing removal and addition of members in the consumer group, and in a few hours it reaches a point where the rebalance takes more than 10 minutes to complete. By this time, the lag on the partitions is so high that it's impossible to recover from this situation.

Our broker config:
Consumer config:
Everything else is left as default. Any suggestions on what we could tweak in the broker or consumer configs to avoid seeing this issue?
Is there any plan to upgrade kafka-python to 1.4.7 on conda-forge? https://anaconda.org/conda-forge/kafka-python/files currently still shows 1.4.6 as the latest.
@jmgpeeters please open a new ticket; this one is completely unrelated to conda-forge.
I used kafka-python 2.0.1, and I saw the warning in the log very frequently, nearly every second, and the client consumed messages from Kafka slowly. I solved this by creating several groups to consume the topics. At first I consumed all the topics, nearly twelve of them, in one group; after looking for a solution on the web, I created five groups, each consuming the topics for a different function, and the problem was solved. I hope this helps.
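A hedged sketch of that workaround: separate consumers, each with its own group_id, over disjoint subsets of the topics (all names below are invented):

```python
from kafka import KafkaConsumer

# Invented names: one group per functional slice of the topics.
TOPIC_GROUPS = {
    'billing-group': ['billing-events', 'billing-audit'],
    'metrics-group': ['metrics-raw', 'metrics-rollup'],
}

consumers = {
    group_id: KafkaConsumer(*topics, group_id=group_id,
                            bootstrap_servers='localhost:9092')
    for group_id, topics in TOPIC_GROUPS.items()
}
# Poll each consumer from its own thread or process, so a slow rebalance
# in one group no longer stalls consumption in the others.
```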
Hi all,
I'm facing a problem that is driving me crazy with version 1.4.1 of kafka-python.
The instructions that I perform are:
from kafka import KafkaConsumer

# kafka_multi_hosts and group_name are defined elsewhere in the application.
consumer = KafkaConsumer(bootstrap_servers=kafka_multi_hosts,
                         auto_offset_reset='earliest',
                         enable_auto_commit=False,
                         group_id=group_name,
                         reconnect_backoff_ms=1,
                         consumer_timeout_ms=5000)
No problem till now, but then in the log I see:
[INFO] 03/08/2018 02:52:53 PM Subscribe executed.
[INFO] 03/08/2018 02:52:53 PM Initialization pool executed.
[INFO] 03/08/2018 02:52:53 PM Subscribed to topic: event
[INFO] 03/08/2018 02:52:53 PM eventHandle connected
[WARNING] 03/08/2018 02:53:23 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:26 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:29 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:53:32 PM Heartbeat failed for group emsWorker because it is rebalancing
......
......
......
[WARNING] 03/08/2018 02:57:48 PM Heartbeat failed for group emsWorker because it is rebalancing
[WARNING] 03/08/2018 02:57:51 PM Heartbeat failed for group emsWorker because it is rebalancing
[INFO] 03/08/2018 02:57:53 PM Leaving consumer group (group_name).
Why?
I've also added the option max_poll_records=50 to the KafkaConsumer definition, but nothing changed.
@dpkp can you help me?
Do you know if version 1.4.1 has some problem related to this? I did not see this problem in the previous version.
Thanks in advance.