The mosquitto_sub client can end in a deadlock in certain scenarios when the timeout option (`-W) is set.
The client utilizes alarm() to set a signal handler that triggers once the timeout expires. However, this can result in a deadlock condition in the following case:
This is somewhat difficult to reproduce, but I've found the issue will investigating what we think is #2887 and a variation of #2887 (comment) seems to be the best bet to end up in the same case.
(gdb) bt
#0 futex_wait (private=0, expected=2, futex_word=0x58816d092760) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x58816d092760, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007a62076a00f1 in lll_mutex_lock_optimized (mutex=0x58816d092760) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x58816d092760) at ./nptl/pthread_mutex_lock.c:93
#4 0x00007a6207857aac in log__printf (mosq=0x58816d0925f0, priority=16, fmt=0x7a620786987f "Client %s sending DISCONNECT") at /home/audric/Projets/mosquitto/lib/logging_mosq.c:40
#5 0x00007a6207862789 in send__disconnect (mosq=0x58816d0925f0, reason_code=4 '\004', properties=0x0) at /home/audric/Projets/mosquitto/lib/send_disconnect.c:54
#6 0x00007a62078552a4 in mosquitto_disconnect_v5 (mosq=0x58816d0925f0, reason_code=4, properties=0x0) at /home/audric/Projets/mosquitto/lib/connect.c:263
#7 0x000058816b69cf17 in my_signal_handler (signum=14) at /home/audric/Projets/mosquitto/client/sub_client.c:55
#8 <signal handler called>
#9 0x00007a620771c574 in __GI___libc_write (fd=1, buf=0x58816d094130, nbytes=86) at ../sysdeps/unix/sysv/linux/write.c:26
#10 0x00007a6207693965 in _IO_new_file_write (f=0x7a62078045c0 <_IO_2_1_stdout_>, data=0x58816d094130, n=86) at ./libio/fileops.c:1181
#11 0x00007a6207692561 in new_do_write (to_do=86, data=0x58816d094130 "Client null received PUBLISH (d0, q0, r1, m0, 'test-retained/18334', ... (500 bytes))\n", fp=0x7a62078045c0 <_IO_2_1_stdout_>) at ./libio/libioP.h:1030
#12 _IO_new_do_write (fp=fp@entry=0x7a62078045c0 <_IO_2_1_stdout_>, data=0x58816d094130 "Client null received PUBLISH (d0, q0, r1, m0, 'test-retained/18334', ... (500 bytes))\n", to_do=86) at ./libio/fileops.c:426
#13 0x00007a6207692ee3 in _IO_new_file_overflow (f=0x7a62078045c0 <_IO_2_1_stdout_>, ch=10) at ./libio/fileops.c:784
#14 0x00007a6207687dca in __GI__IO_puts (str=0x58816d094590 "Client null received PUBLISH (d0, q0, r1, m0, 'test-retained/18334', ... (500 bytes))") at ./libio/ioputs.c:41
#15 0x000058816b69d459 in my_log_callback (mosq=0x58816d0925f0, obj=0x58816b6ad060 <cfg>, level=16, str=0x58816d094590 "Client null received PUBLISH (d0, q0, r1, m0, 'test-retained/18334', ... (500 bytes))") at /home/audric/Projets/mosquitto/client/sub_client.c:176
#16 0x00007a6207857bbd in log__printf (mosq=0x58816d0925f0, priority=16, fmt=0x7a6207868220 "Client %s received PUBLISH (d%d, q%d, r%d, m%d, '%s', ... (%ld bytes))") at /home/audric/Projets/mosquitto/lib/logging_mosq.c:54
#17 0x00007a6207856462 in handle__publish (mosq=0x58816d0925f0) at /home/audric/Projets/mosquitto/lib/handle_publish.c:116
#18 0x00007a620786200d in handle__packet (mosq=0x58816d0925f0) at /home/audric/Projets/mosquitto/lib/read_handle.c:51
#19 0x00007a620785f88c in packet__read (mosq=0x58816d0925f0) at /home/audric/Projets/mosquitto/lib/packet_mosq.c:566
#20 0x00007a6207858935 in mosquitto_loop_read (mosq=0x58816d0925f0, max_packets=4096) at /home/audric/Projets/mosquitto/lib/loop.c:380
#21 0x00007a6207858140 in mosquitto_loop (mosq=0x58816d0925f0, timeout=-1, max_packets=1) at /home/audric/Projets/mosquitto/lib/loop.c:149
#22 0x00007a620785855d in mosquitto_loop_forever (mosq=0x58816d0925f0, timeout=-1, max_packets=1) at /home/audric/Projets/mosquitto/lib/loop.c:257
#23 0x000058816b69de5d in main (argc=10, argv=0x7ffc9a6c2078) at /home/audric/Projets/mosquitto/client/sub_client.c:403
#24 0x00007a620762a1ca in __libc_start_call_main (main=main@entry=0x58816b69db17 <main>, argc=argc@entry=10, argv=argv@entry=0x7ffc9a6c2078) at ../sysdeps/nptl/libc_start_call_main.h:58
#25 0x00007a620762a28b in __libc_start_main_impl (main=0x58816b69db17 <main>, argc=10, argv=0x7ffc9a6c2078, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc9a6c2068) at ../csu/libc-start.c:360
#26 0x000058816b69ce05 in _start ()
The mosquitto_sub client can end in a deadlock in certain scenarios when the timeout option (`-W) is set.
The client utilizes alarm() to set a signal handler that triggers once the timeout expires. However, this can result in a deadlock condition in the following case:
&mosq->log_callback_mutexis taken in log__printflog__printfwill try to acquire the lock on the same mutex from step 1: deadlock.This is somewhat difficult to reproduce, but I've found the issue will investigating what we think is #2887 and a variation of #2887 (comment) seems to be the best bet to end up in the same case.
So:
for i in {0..20000}; do echo $i; mosquitto_pub -t "retained/$i" -m $i -r -q 0 ; donemosquitto_sub --retained-only --remove-retained -W 1 -F %t -t "retained/+" -dEnvironment and version
Backtrace from GDB: