KAFKA-6783: consumer poll(timeout) blocked infinitely #4861
The "consumer poll blocked" bug was found in 0.10.x versions in: Yesterday I found it in 1.1.0 as well and made a fix. |
@koqizhao Thanks for the patch. Can you separate this into two separate PRs please? |
@hachikuji , OK. Created another pull request, #4865, for 6784 and updated the current one to cover only 6783. |
d824d9a to aa75e8f
bootstrap server fix bug: FindCoordinatorResponse cannot be cast to FetchResponse
aa75e8f to 46380b9
Hey @koqizhao , thanks for the patch. I only just recently started working on this issue. It's causing https://issues.apache.org/jira/browse/KAFKA-5697 as well. I'm happy to yield to you. In particular, I like your new test and your approach to fixing the tests. I was just going to add 30s to every invocation of poll() in our tests, which isn't too principled. Feel free to grab any code from my PR to fill in the missing cases @hachikuji was talking about. Let me know if you want to continue with this PR, and I'll stop work on mine and switch to reviewing yours when it's ready! |
Oh, also, I still haven't decided about whether or not to write a KIP. The basic issue is whether we need to preserve the existing semantics in some way. Namely, whether we need to continue providing some way to just sync the metadata and to just "drain" previously fetched results without fetching another batch. The biggest place this would matter is when timeout = 0, so I did this search: https://www.google.com/search?q="consumer.poll%280%29"+site%3A%3Ahttps%3A%2F%2Fgithub.com Many of the hits seem to acknowledge that At the very least, the existence of calls to One option is to treat What do you think? |
Yeah, I think this is exactly right. At a minimum, having an API to do this would be useful in testing. Maybe something like this: Set<TopicPartition> awaitAssignment(long timeout, TimeUnit unit); It's kind of weird, though, and I'm not sure what kind of use cases it addresses outside of testing. That might make it a tough sell. Our case for not doing a KIP would be stronger if the change didn't break our own tests 😉. My feeling is we probably just have to do it, but I would like to be convinced otherwise. I almost hate to suggest it, but we could introduce a new config which controls whether or not poll() should block to join the group and fetch offsets. Maybe something like |
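To make the `awaitAssignment(long timeout, TimeUnit unit)` signature floated above concrete, here is a minimal sketch of the semantics under discussion: block until an assignment exists or the timeout elapses, and never fetch records. It is not Kafka code. `AssignmentWaiter` and `onAssigned` are hypothetical names, and partitions are plain strings standing in for `TopicPartition` so that the example needs only the JDK.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of the Kafka consumer API.
public class AssignmentWaiter {
    // Released once the first assignment has been received.
    private final CountDownLatch assigned = new CountDownLatch(1);
    private volatile Set<String> partitions = Collections.emptySet();

    // Hypothetical hook: called by whatever completes the group join.
    public void onAssigned(Set<String> newPartitions) {
        partitions = Collections.unmodifiableSet(new HashSet<>(newPartitions));
        assigned.countDown();
    }

    // Waits at most the given timeout for an assignment.
    // Returns the assigned partitions, or an empty set if the wait
    // timed out or the calling thread was interrupted.
    public Set<String> awaitAssignment(long timeout, TimeUnit unit) {
        try {
            if (assigned.await(timeout, unit)) {
                return partitions;
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        }
        return Collections.emptySet();
    }
}
```

With a method shaped like this, a test could wait up to a fixed bound for the assignment and fail fast on an empty result, instead of relying on the side effects of `poll(0)`.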
On the new config: if poll(0) is made equivalent to poll(max), the infinite block still happens, which users calling poll(0) do not expect. I suggest having a new config for poll(timeout > 0): always honor the timeout.
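One way to read "always honor the timeout" is as a single deadline shared by every blocking step inside a poll call. The sketch below illustrates that idea only; it is not how KafkaConsumer is implemented. `BoundedPoll`, `Phase`, and `runPhases` are hypothetical names, and each phase stands in for a step such as finding the coordinator, joining the group, or fetching.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from the Kafka code base.
public class BoundedPoll {
    // A blocking step that is told how long it may block.
    // Returns true if the step completed within that budget.
    public interface Phase {
        boolean run(long maxBlockMs);
    }

    // Runs the phases in order under one shared deadline.
    // Returns true only if every phase reports completion.
    public static boolean runPhases(List<Phase> phases, long timeoutMs) {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        for (Phase phase : phases) {
            // Compare nanoTime values by subtraction so wraparound is harmless.
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
            // With no budget left, the phase still gets one non-blocking attempt.
            if (!phase.run(Math.max(0L, remainingMs))) {
                return false;
            }
        }
        return true;
    }
}
```

The design choice is that no phase receives its own fresh timeout: time spent in an earlier step shrinks the budget of the later ones, so the call as a whole cannot block much longer than the caller asked for, provided each phase respects its budget.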
That sounds similar to one thought I had, which would be to add a new variant Plus, from my informal survey, it really seems like you would have two distinct use cases, I think whether it's a new method or a new config, it's probably KIP-worthy. I think we could forgive ourselves for sneaking something by if it's just purely a semantic change, but offering a new config is just as much a public interface change as a new method. From where I'm sitting, it seems like the new method is the better option. It will let people who just want to wait for an assignment and not get results to call I'm willing to write the KIP (which is good, because I'm arguing for it ;) ), but I'm also willing to take a back seat if @koqizhao wants to lead it. |
Great! Agree to a new method. Thanks @vvcephei. I would like you to write the KIP. Maybe |
Maybe add 2 methods: |
Ok! I'll do it today. Good call on the timeout, I think At an implementation level, we are still going to need to figure out what to do with all the tests, I think it boils down to either switching to |
I've created KIP-288 as we discussed. I also started a discussion thread on the mailing list (dev@kafka.apache.org). Please reply with your thoughts! Also, please reach out to anyone you think might have an opinion. Thanks, |
Actually, I've just learned of KIP-266, which also covers this issue. @koqizhao and @hachikuji , can you review KIP-266 and contribute thoughts to that discussion? |
retest this please |
These cases failed for the same reason. I'm not familiar with Kafka Streams. @vvcephei, would you please help? Thanks.
org.apache.kafka.streams.integration.ResetIntegrationTest.testReprocessingByDurationAfterResetWithoutIntermediateUserTopic
gradle streams:test --tests org.apache.kafka.streams.integration.ResetIntegrationTest.testReprocessingByDurationAfterResetWithoutIntermediateUserTopic
org.apache.kafka.streams.integration.ResetIntegrationTest > testReprocessingByDurationAfterResetWithoutIntermediateUserTopic FAILED |
Hey @koqizhao , I just saw this. I'm not too familiar with that test either. Does it fail in isolation? I would say just try it with an increased timeout, but I think 30s is the max timeout for the Streams tests. For the record, though, I got all the tests to pass on c5b19b5, which sets the metadata timeout to 30s for everything, which suggests that a 30s timeout should be long enough... It might be that some condition is missed in the metadata update. For example, in 12e0c9c I just fixed a problem where if a consumer group rejoin timed out, it wouldn't try again. If you can get the test running in your IDE, tracing through the KafkaConsumer part of the test execution might give you a clue about what's going wrong. For clarity: both of those commits are in #4855 , if you're looking for them. |
Hi, Any updates on this issue? We also faced it in production when we had unreachable Kafka brokers. |
Right now my fix works for the core functionality but causes Streams test cases to fail. I have been busy lately and have not focused on those failing cases.
There is already a KIP on this.
…________________________________
From: Alexander Guz <notifications@github.com>
Sent: May 9, 2018, 15:41
To: apache/kafka
Cc: Qiang; Mention
Subject: Re: [apache/kafka] KAFKA-6783: consumer poll(timeout) blocked infinitely (#4861)
Hi,
Any updates on this issue? We also faced it in production when we had unreachable Kafka brokers.
|
The bug is a duplicate of KAFKA-5697 and is fixed in 2.0.0. My code is partly included in the fix for KAFKA-5697. John Roesler resolved KAFKA-6783.
We've merged [https://github.com/apache/kafka/commit/c470ff70d3e829c8b12f6eb6cc812c4162071a1f] under KAFKA-5697, which should fix this issue. In retrospect, your ticket would have been a more appropriate scope for the work, but it's too late to change the commit title now. |
KAFKA-6783: consumer poll(timeout) blocked infinitely when no available bootstrap server