Consumer hangs if closing together with deleting topic #4362
Comments
I recently encountered the same issue. I traced it down to the fact that:
I patched this with the following change in
But it's my first foray into the deeper guts of rdkafka; I have no particular confidence that that's the best patch.
Hello, does it happen with versions >= 2.1.0? There's this fix in that version.
Yes, we're working off a fork of v2.1.0, which has #4187.
I've attached debug logs illustrating the issue in our reproduction.
Bumping for attention: I'm still seeing the same issue on v1.9.2 and v2.3.0 on Python 3.11 using confluent-kafka-python. Strangely enough, I didn't have this issue with v1.9.2 on Python 3.8. I don't have data to quantify this, but I do think that the aforementioned patch successfully reduced the incidence of consumers failing to close. I'll need to dig deeper to see what's going on.
Description
Consumer hangs if closing together with deleting topic
How to reproduce
Hello,
in ClickHouse we have an issue with an integration test if librdkafka master is more recent than 8e20e1e, IOW if librdkafka contains PR 4117.
Test scenario:
1. Six consumers consume messages from a topic with six partitions.
2. Delete the topic (not via librdkafka)
3. Close the consumers one by one
One of the consumers more or less reproducibly hangs during closing, while virtually anything helps: it is enough to add a sleep() between (2) and (3), or even to use a ClickHouse build with a sanitizer.
I tried to create a minimal reproducer that does not use ClickHouse, but did not succeed.
The scenario seems a bit insane, though it is crucial for us and effectively prevents us from using recent librdkafka.
How ClickHouse closes a consumer.
• unsubscribe
• drain queue
• free callbacks
• call rd_kafka_consumer_close
ClickHouse maintains a rebalance callback (actually, cppkafka does).
My investigations.
The problematic part of PR 4117 is the call to rd_kafka_toppar_keep(rktp),
specifically where it is called from rd_kafka_toppar_pause_resume to do a resume.
In rd_kafka_broker_thread_main we wait forever in while (!rd_kafka_broker_terminating(rkb)), where rd_kafka_broker_terminating(rkb) is actually rd_refcnt_get(&(rkb)->rkb_refcnt) <= 1.
REFCNT DEBUG output https://pastila.nl/?00659d03/ee47523355fd8a694171a23c8b2a48c6
Some ClickHouse logs https://pastila.nl/?002bbb54/247b8ebbb941432451f7ae5ce10f319b
Am I right in thinking that the problem is that there is no suitable counterpart to read RD_KAFKA_OP_BARRIER from the fetch queue?
Is it possible to resolve this problem on the application side?
Checklist
image: confluentinc/cp-kafka:5.2.0
ubuntu:22.04
debug=..
as necessary) from librdkafka