Help: Error sending GroupCoordinatorRequest_v0 to node #739
What is under those XXXs? In particular, the string in brackets should be the error type. What does it say?
I can't reproduce based on the information here. Please re-open if you have more!
Recently I also had the same problem. I'm working with Kafka 0.9.0.1 and kafka-python 1.2.1.
This error is logged when there is a networking failure. There isn't enough information here for me to provide any further advice. You might try enabling debug logs.
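For reference, kafka-python logs through the standard logging module, so debug output can be enabled in the consuming application like this:

import logging

# kafka-python uses the standard logging module; raising the 'kafka'
# logger to DEBUG surfaces connection and coordinator details.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('kafka').setLevel(logging.DEBUG)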
@dpkp, I have resolved this problem by setting request_timeout_ms to 300000 instead of the default 40000. So I think the reason is that I fetch too many messages from the Kafka broker at one time, and by the time I have processed all of them and call consumer.poll() again, it is beyond request_timeout_ms. What do you think? If you don't agree or can't be certain, I will attach more information here to help you analyze.
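For reference, a minimal sketch of the workaround described above; the topic name, bootstrap server, and process() handler are placeholders, and only request_timeout_ms is the relevant change:

import kafka

# Raise request_timeout_ms from the default 40000 ms to 300000 ms so that
# long gaps between poll() calls do not exceed the request timeout.
# Topic, server, and process() are placeholders.
consumer = kafka.KafkaConsumer(
    'my-topic',
    bootstrap_servers=['localhost:9092'],
    request_timeout_ms=300000,  # default: 40000
)
for record in consumer:
    process(record)  # hypothetical handler; may take a long time per batch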
+1, I got the same error.
I don't see how I can re-open this issue; there is nothing related to "re-open" in the GitHub UI. I'm getting this error on localhost with a single broker and zookeeper, on a fresh topic populated with some messages, so it's clearly not a networking issue.

import contextlib
import logging
import cPickle

import kafka

logger = logging.getLogger(__name__)

topic_partition = kafka.TopicPartition('test', 0)
consumer = kafka.KafkaConsumer(
    bootstrap_servers=['localhost'],
    client_id='kafka_cat',
    value_deserializer=cPickle.loads,
    auto_offset_reset='earliest',
    enable_auto_commit=False,
    consumer_timeout_ms=1000 * 3,
    api_version=(0, 10),
)
with contextlib.closing(consumer):
    consumer.assign([topic_partition])
    consumer.seek_to_beginning()
    logger.info("committed offset %s, next record offset to be fetched %s",
                consumer.committed(topic_partition), consumer.position(topic_partition))
    for record in consumer:
        print record

Here is a fragment of the debug log around the error message:
I've just tried to use group coordination without additional calls, with the same result: I reduced the code to just passing a topic to KafkaConsumer and then iterating on it. The code is basically the following (env is the same, kafka-python==1.3.1, a single localhost-only Kafka 0.10.1.0 broker):

import contextlib
import cPickle

import kafka

consumer = kafka.KafkaConsumer(
    'test',
    bootstrap_servers=['127.0.0.1'],
    client_id='kafka_cat',
    group_id='kafka_cat',
    value_deserializer=cPickle.loads,
    auto_offset_reset='earliest',
    enable_auto_commit=False,
    consumer_timeout_ms=1000 * 2,
    api_version=(0, 10),
)
with contextlib.closing(consumer):
    for record in consumer:
        print record

The head of the debug log (with the error wrapped in newlines):
This happens to me as well, every time after I change the consumer-group id to a new one. E.g. it works well with consumer group id N, then I decide to change to consumer group N+1 (which doesn't exist yet), and the consumer fails as follows.
One Kafka node only, on localhost, Kafka version kafka_2.11-0.10.1.0. The consumer is also on localhost, with auto_offset_reset=latest, max_poll_records=1, session_timeout_ms=60000.
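Note that with kafka-python's default request_timeout_ms of 40000 ms, the session_timeout_ms of 60000 above already violates the constraint identified later in this thread. A minimal reconstruction of the failing configuration (topic, group id, and server are placeholders):

import kafka

# Reconstruction of the failing setup described above; topic, group id,
# and server are placeholders. session_timeout_ms (60000) exceeds the
# default request_timeout_ms (40000) -- the misconfiguration identified
# further down in this thread.
consumer = kafka.KafkaConsumer(
    'my-topic',
    group_id='my-group',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='latest',
    max_poll_records=1,
    session_timeout_ms=60000,
)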
I have a full debug-level reproduction of this, attached. I renamed the topic and group-id names in a consistent way, to hide some business details. The relevant group id is topic-as-sp-group_5. While I was using topic-as-sp-group_4 all was well for a while, and when moving to topic-as-sp-group_5 this issue started to reproduce. When we moved from topic-as-sp-group_3 to topic-as-sp-group_4 we had the same issue, but at some point it subsided. Btw, this happens on multiple machines with separate Kafka installations. The Kafka server is running the default configuration, no changes (these are all Mac dev machines).
Thanks -- I'll take a look and see if this uncovers anything. But, FYI, changing consumer groups is not supported. You should always create a new consumer instance if you want to use a different group.
Thanks, that would be awesome. I'll explain what I meant regarding "changing" the consumer group: I wasn't referring to actually changing it at runtime, but to restarting the process with a new configuration that makes the consumer connect to a new group id. The logs I've attached are just the debug logs written while the process started and created a consumer as part of its startup.
Right on. Can you provide a short code snippet that might help reproduce?
I will try to arrange something that reproduces this. Btw, I've found this Kafka JIRA, which describes exactly the same problem I'm seeing (the reporter was also using kafka-python): https://issues.apache.org/jira/browse/KAFKA-4086. In my case it's totally unrelated to long processing, though. Actually, the message rate is very low; in most cases the consumer starts up when there are no new messages at all, and the problem still happens.
Found the root cause of the problem. The issue was that the session timeout was larger than the request timeout. Whenever we restarted a process, the previous process's session would live too long relative to the request timeout, leading to the request-timeout error. The standard Java Kafka client actually throws an exception when the session timeout is larger than the request timeout, or when the fetch max wait is larger than the request timeout (see the linked code). I've created a pull request that throws an error if these constraints are violated: #986. I ran the tests manually through pytest though, since tox wasn't working for me. Send me any comments on it if needed. Thanks.
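A sketch of the kind of fail-fast validation described above; this is a simplified assumption of what the PR does, not its exact code, using kafka-python's configuration key names:

# Simplified sketch of the fail-fast check described above -- an assumed
# form, not the PR's exact code. Key names follow kafka-python's
# consumer configuration.
def validate_timeout_config(config):
    request_timeout = config['request_timeout_ms']
    if config['session_timeout_ms'] > request_timeout:
        raise ValueError(
            'request_timeout_ms (%d) must be larger than '
            'session_timeout_ms (%d)'
            % (request_timeout, config['session_timeout_ms']))
    if config['fetch_max_wait_ms'] > request_timeout:
        raise ValueError(
            'request_timeout_ms (%d) must be larger than '
            'fetch_max_wait_ms (%d)'
            % (request_timeout, config['fetch_max_wait_ms']))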
Joining @harelba: it happened to me too and was a nightmare to solve until I understood the problem. By definition, there is no sense in having session_timeout_ms > request_timeout_ms.
Merged harelba's PR to fail fast if these values are misconfigured. Thanks for the debugging!
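For reference, a consumer configuration consistent with these constraints; the topic, group id, server, and specific timeout values are only illustrative:

import kafka

# Both session_timeout_ms and fetch_max_wait_ms stay below
# request_timeout_ms, per the constraint discussed above. Topic, group
# id, server, and the specific values are illustrative.
consumer = kafka.KafkaConsumer(
    'my-topic',
    group_id='my-group',
    bootstrap_servers=['localhost:9092'],
    session_timeout_ms=30000,
    fetch_max_wait_ms=500,
    request_timeout_ms=40000,
)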
Upgraded to kafka-python 1.2.2, running against a 3-node Kafka 0.10 cluster. We're receiving this error once on every subscribe action:
The node specified appears to be random.