Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rd_kafka_consumer_close/rd_kafka_destroy remain blocked indefinitely if the broker is unreachable #4519

Open
3 of 5 tasks
andreamartini opened this issue Nov 22, 2023 · 3 comments

Comments

@andreamartini
Copy link

Description

I'm usinng librdkafka-2.0.2, and i'm facing with rd_kafka_consumer_close(myHandle) call blocking (once the broker become unreachable).
The same problem appears with rd_kafka_destroy(myHandle).
if i skip both api calls, a leakage appears each time a connection lost or unreachable broker happen.
Is there a way to avoid this call blocking (without running into memory leakage)?
thank you in advance

How to reproduce

Simulate lost connection (i,e, using clumsy) before invoke rd_kafka_consumer_close or rd_kafka_destroy.
Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): <v2.0.2>
  • [x ] Apache Kafka version: <v2.8.2>
  • librdkafka client configuration: <client.id="somevalue";bootstrap.servers= "my_Ip";group.id= "somevalue"; statistics.interval.ms=1000>
  • Operating system: <Windows10>
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • [x ] Critical issue
@jpiper
Copy link

jpiper commented Jan 10, 2024

Maybe try with the latest release of 2.3.0 first

@andreamartini
Copy link
Author

Thank you for your answer. I get the same problem with 2.3.0 version.

@davidhcefx
Copy link

davidhcefx commented Mar 14, 2024

Lately, we have also encountered this pthread_join deadlock issue since we upgraded from librdkafka version 1 to 2. It did not happen frequently, only something like once in several hundreds of API calls. However, the backtrace is always similar:

(gdb) bt 4
#0  0x00007fb495bda4a7 in __pthread_clockjoin_ex ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb495be57aa in thrd_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x0000560106309f68 in rd_kafka_destroy_app ()
#3  0x0000560106309ff4 in rd_kafka_destroy ()
(More stack frames follow...)

Seems like it is waiting for some thread to join, but either the target thread refused to stop or it did not exist. (blocked for over an hour!)

Version

  • librdkafka: 2.2.0
  • Apache kafka: 3.4.0
  • OS: Debian 11
  • glibc: ldd (Debian GLIBC 2.31-13+deb11u5) 2.31

Debugging

Some more Gdb context:

(gdb) fr 3
#3  0x0000560106309ff4 in rd_kafka_destroy ()
(gdb) x/3i rd_kafka_destroy+21
   0x560106309fec <rd_kafka_destroy+21>:        mov    rdi,rax
   0x560106309fef <rd_kafka_destroy+24>:
    call   0x5601063099e7 <rd_kafka_destroy_app>
=> 0x560106309ff4 <rd_kafka_destroy+29>:        nop
(gdb) fr 2
#2  0x0000560106309f68 in rd_kafka_destroy_app ()
(gdb) x/3i rd_kafka_destroy_app+1401
   0x560106309f60 <rd_kafka_destroy_app+1401>:  mov    rdi,rax
   0x560106309f63 <rd_kafka_destroy_app+1404>:
    call   0x56010627a0c0 <thrd_join@plt>
=> 0x560106309f68 <rd_kafka_destroy_app+1409>:  test   eax,eax
(gdb) fr 1
#1  0x00007fb495be57aa in thrd_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) x/3i thrd_join+34
   0x7fb495be57a2 <thrd_join+34>:       mov    rsi,rsp
   0x7fb495be57a5 <thrd_join+37>:
    call   0x7fb495bda270 <__pthread_clockjoin_ex>
=> 0x7fb495be57aa <thrd_join+42>:       test   rbx,rbx
(gdb) x/3i __pthread_clockjoin_ex+563
   0x7fb495bda4a3 <__pthread_clockjoin_ex+563>: mov    eax,ebx
   0x7fb495bda4a5 <__pthread_clockjoin_ex+565>: syscall
=> 0x7fb495bda4a7 <__pthread_clockjoin_ex+567>: mov    edi,r8d
  • I wonder if there's a way to find out the target thread ID it is waiting for?

  • With some more effort, I managed to recover the Posix thread_id from frame 2. It turns out that the blocking thread is "rdk:main", and here's its backtrace:
(gdb) bt
#0  0x00007fb495bda4a7 in __pthread_clockjoin_ex ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb495be57aa in thrd_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x000056010630a6b0 in rd_kafka_destroy_internal ()
#3  0x000056010630eac1 in rd_kafka_thread_main ()
#4  0x00007fb495bd8f5a in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fb4955a8a2f in clone ()
   from /lib/x86_64-linux-gnu/libc.so.6

Looks like it is waiting for another thread!
Using similar methods, I discovered the pthread struct from frame 2, where $rdi = *(*($rbp-0x20)):

(gdb) x/7i rd_kafka_destroy_internal+1664
   0x56010630a69a <rd_kafka_destroy_internal+1664>:
    mov    rax,QWORD PTR [rbp-0x20]
   0x56010630a69e <rd_kafka_destroy_internal+1668>:
    mov    rax,QWORD PTR [rax]
   0x56010630a6a1 <rd_kafka_destroy_internal+1671>:
    lea    rdx,[rbp-0x68]
   0x56010630a6a5 <rd_kafka_destroy_internal+1675>:
    mov    rsi,rdx
   0x56010630a6a8 <rd_kafka_destroy_internal+1678>:
    mov    rdi,rax
   0x56010630a6ab <rd_kafka_destroy_internal+1681>:
    call   0x56010627a0c0 <thrd_join@plt>
=> 0x56010630a6b0 <rd_kafka_destroy_internal+1686>:
    mov    rax,QWORD PTR [rbp-0x20]

This time the target thread "rdk:broker237" does not seem to be waiting for someone else, but instead in conditional wait:

(gdb) bt
#0  0x00007fb495bdfad8 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb495be5b39 in cnd_timedwait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x0000560106465256 in cnd_timedwait_abs ()
#3  0x000056010639a492 in rd_kafka_q_pop_serve ()
#4  0x000056010639a532 in rd_kafka_q_pop ()
#5  0x000056010632f777 in rd_kafka_broker_ops_serve ()
#6  0x000056010632f8bc in rd_kafka_broker_ops_io_serve ()
#7  0x00005601063313fa in rd_kafka_broker_consumer_serve ()
#8  0x0000560106331682 in rd_kafka_broker_serve ()
#9  0x00005601063318d6 in rd_kafka_broker_thread_main ()
#10 0x00007fb495bd8f5a in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007fb4955a8a2f in clone ()
   from /lib/x86_64-linux-gnu/libc.so.6

So perhaps the main handler failed to wake this thread up from conditional wait?
I'm not familiar with the code base, but I hope the above analysis can bring some insight into why it can be blocking on rd_kafka_destroy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants