Consumer hangs if closing together with deleting topic #4362

Open · ilejn opened this issue Jul 21, 2023 · 5 comments

ilejn commented Jul 21, 2023

Description

Consumer hangs if closing together with deleting topic

How to reproduce

Hello,
in ClickHouse we have an issue with an integration test if librdkafka master is more recent than 8e20e1e, IOW if librdkafka contains PR 4117.
Test scenario:
1. Six consumers consume messages from a topic with six partitions.
2. Delete the topic (not via librdkafka)
3. Close the consumers one by one
One of the consumers more or less reproducibly hangs during closing, and virtually any perturbation hides the problem: adding a sleep() between (2) and (3), or even using a ClickHouse build with a sanitizer, is enough to make it pass.
I tried to create a minimal reproduction without ClickHouse, but did not succeed.

The scenario may seem a bit insane, but it is crucial for us and effectively prevents us from using recent librdkafka.
How ClickHouse closes a consumer (a rough sketch in plain C is shown after this list):
• unsubscribe
• drain the queue
• free callbacks
• call rd_kafka_consumer_close

ClickHouse maintains a rebalance callback (actually cppkafka does).
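For illustration, the same sequence expressed with the plain librdkafka C API would look roughly like this (only a sketch: the real code goes through cppkafka, and the close_consumer wrapper below is hypothetical):

```c
#include <librdkafka/rdkafka.h>

/* Rough sketch of the close sequence described above, assuming `rk` is an
 * already-subscribed consumer handle. Not the actual ClickHouse/cppkafka code. */
static void close_consumer(rd_kafka_t *rk) {
        rd_kafka_message_t *msg;

        /* 1. unsubscribe */
        rd_kafka_unsubscribe(rk);

        /* 2. drain the queue: poll until nothing more is returned */
        while ((msg = rd_kafka_consumer_poll(rk, 100 /* ms */)) != NULL)
                rd_kafka_message_destroy(msg);

        /* 3. callbacks are freed on the application side at this point */

        /* 4. close the consumer; this is the call that hangs in this report */
        rd_kafka_consumer_close(rk);

        rd_kafka_destroy(rk);
}
```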

My investigation:
The problematic part of PR 4117 is rd_kafka_toppar_keep(rktp), specifically where it is called from rd_kafka_toppar_pause_resume to perform the resume.
In rd_kafka_broker_thread_main we then wait forever in while (!rd_kafka_broker_terminating(rkb)), which is effectively rd_refcnt_get(&(rkb)->rkb_refcnt) <= 1.
REFCNT DEBUG output: https://pastila.nl/?00659d03/ee47523355fd8a694171a23c8b2a48c6
Some ClickHouse logs: https://pastila.nl/?002bbb54/247b8ebbb941432451f7ae5ce10f319b
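To make the hang mechanism concrete, here is a deliberately simplified, hypothetical model of that refcount-gated shutdown; none of this is librdkafka's actual code, it only mirrors the condition quoted above:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical simplification: a worker thread that keeps serving as long as
 * anyone besides the thread itself holds a reference to it. */
typedef struct broker {
        atomic_int refcnt; /* 1 == only the thread's own reference remains */
} broker_t;

bool broker_terminating(broker_t *rkb) {
        /* Analogue of rd_kafka_broker_terminating(): terminate once the
         * refcount drops to the thread's own reference. */
        return atomic_load(&rkb->refcnt) <= 1;
}

void broker_thread_main(broker_t *rkb) {
        while (!broker_terminating(rkb)) {
                /* ... serve ops, fetch, etc. ... */
        }
        /* If an op parked on a queue that nobody services keeps a partition
         * reference, and the partition keeps a broker reference, refcnt never
         * drops to 1, this loop never exits, and rd_kafka_destroy() hangs. */
}
```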

Am I right in thinking that the problem is that there is no suitable counterpart left to read RD_KAFKA_OP_BARRIER from the fetch queue? Is it possible to resolve this problem on the application side?



kwdubuc commented Jul 29, 2023

I recently encountered the same issue.

I traced it down to the fact that:

  1. rd_kafka_toppar_fetch_stop calls rd_kafka_q_fwd_set(rktp->rktp_fetchq, NULL);, which, at least in our application, disconnects the rktp_fetchq from anything that will service it.

  2. An immediately following rd_kafka_toppar_pause_resume call (from a previously queued RESUME op) calls rd_kafka_toppar_op_version_bump, which puts a BARRIER op on that rktp_fetchq. That op contains a reference to the rktp object, so if the rktp_fetchq is never serviced, the rktp object is never freed. The rktp object in turn contains a reference to its former rktp_broker object, so the rktp_broker's reference count never drops to 1, which in turn means that the broker's thread never terminates.

  3. An eventual application call to rd_kafka_destroy waits for the broker thread to terminate, which never happens; the application hangs.

I patched this with the following change in rd_kafka_toppar_broker_leave_for_remove:

diff --git a/src/rdkafka_partition.c b/src/rdkafka_partition.c
index 46d2fb3e..735c2bf9 100644
--- a/src/rdkafka_partition.c
+++ b/src/rdkafka_partition.c
@@ -1086,6 +1086,13 @@ void rd_kafka_toppar_broker_leave_for_remove(rd_kafka_toppar_t *rktp) {
                 rd_kafka_toppar_set_fetch_state(
                     rktp, RD_KAFKA_TOPPAR_FETCH_OFFSET_QUERY);
 
+        /* Purge any stale operations on the fetchq; nothing will serve them
+         * at this point. */
+        if (!rd_kafka_q_is_fwded(rktp->rktp_fetchq)) {
+                rd_kafka_q_disable(rktp->rktp_fetchq);
+                rd_kafka_q_purge(rktp->rktp_fetchq);
+        }
+
         rko           = rd_kafka_op_new(RD_KAFKA_OP_PARTITION_LEAVE);
         rko->rko_rktp = rd_kafka_toppar_keep(rktp);

But it's my first foray into the deeper guts of rdkafka, so I have no particular confidence that this is the best patch.

emasab (Contributor) commented Jul 31, 2023

Hello, does it happen with versions >= 2.1.0? There's this fix in that version:

A reference count issue was blocking the consumer from closing.
The problem would happen when a partition is lost, because forcibly
unassigned from the consumer or if the corresponding topic is deleted.
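As a quick way to confirm which librdkafka is actually linked at runtime, something like the following can help (rd_kafka_version()/rd_kafka_version_str() are the standard public API; the little program around them is just an illustration):

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        /* Reports the runtime-linked librdkafka version, which is what matters
         * for whether the 2.1.0 reference-count fix is present. */
        printf("librdkafka %s (0x%08x)\n",
               rd_kafka_version_str(), (unsigned int)rd_kafka_version());
        return 0;
}
```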


kwdubuc commented Jul 31, 2023

Yes, we're working off a fork of v2.1.0, which has #4187.


kwdubuc commented Aug 1, 2023

I've attached debug logs illustrating the issue in our reproduction.
confluentinc-librdkafka-4362.txt


wbarnha commented Jan 3, 2024

Bumping for attention, still seeing the same issue on v1.9.2 and v2.3.0 on Python 3.11 using confluent-kafka-python. Strangely enough, I didn't have this issue with v1.9.2 on Python 3.8.

I don't have data to quantify this, but I do think that the aforementioned patch did successfully reduce the incidence of consumers failing to close. I'll need to dig deeper to see what's going on.
