exec plugin sporadically hanging in getpwnam_r #229
I'm noticing similar behavior with only 2 exec plug-ins configured when collectd starts up on a machine under load.
Hi, can you create a core dump and provide a stack backtrace or a process that's hanging in Best regards,
This is the backtrace for a hung process:
This is on an AMD64 based system running Debian 6.0.7 with a 3.2 based kernel.
I'm seeing something similar. For some background, the name service switch library (that So, from an example gdb session:
So in this case, while our thread running the exec plugin is trying to fork a new child, thread 2 (incidentally running write_graphite) is trying to do a hostname lookup, which requires loading And so because we look up the user we want to run the process as in the child, which requires said mutex, we have a lovely recipe for a deadlock.

Looking at a Stack Overflow question about pthread_atfork, it's suggested that only async-signal-safe (re-entrant) functions are used in the child. I would suggest, then, that the fix is relatively easy: just move the lookup of the user to run as into the parent process (so, from
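A minimal sketch of that idea, not the actual patch: resolve the user with `getpwnam_r` in the parent before `fork`, then restrict the child to async-signal-safe calls (`setgid`, `setuid`, `execv`, `_exit`). The helper names `lookup_user` and `fork_as_user` and the 4096-byte buffer are assumptions made for illustration and do not come from exec.c.

```c
#include <assert.h>
#include <pwd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: resolve a user name to uid/gid. Intended to be
 * called in the parent, where blocking on NSS or loader locks is harmless
 * because the thread holding them is still running. */
static int lookup_user(const char *user, uid_t *uid, gid_t *gid)
{
  struct passwd pw;
  struct passwd *result = NULL;
  char buf[4096]; /* assumed large enough for a typical passwd entry */

  int status = getpwnam_r(user, &pw, buf, sizeof(buf), &result);
  if (status != 0 || result == NULL)
    return -1; /* lookup error or no such user */

  *uid = pw.pw_uid;
  *gid = pw.pw_gid;
  return 0;
}

/* Hypothetical helper: run argv[0] (an absolute path) as `user`.
 * Returns the child's pid in the parent, or -1 on failure. */
static pid_t fork_as_user(const char *user, char *const argv[])
{
  uid_t uid;
  gid_t gid;

  assert(user != NULL && argv != NULL && argv[0] != NULL);

  /* Do the lookup BEFORE forking. */
  if (lookup_user(user, &uid, &gid) != 0)
    return -1;

  pid_t pid = fork();
  if (pid != 0)
    return pid; /* parent, or -1 if fork failed */

  /* Child: only async-signal-safe calls from here on. */
  if (setgid(gid) != 0 || setuid(uid) != 0)
    _exit(126); /* could not drop privileges */
  execv(argv[0], argv);
  _exit(127); /* exec failed */
}
```

The parent would still need to `waitpid` on the returned pid. Supplementary groups are not handled in this sketch.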
Hi Ceri, thank you very much for your analysis and patch! I think a tricky issue such as this one deserves some recognition! If you email me your address and size (I have M, L, and XL) I'll send you one of the Appreciation Program t-shirts. Thanks and best regards,
There seems to be some sort of race condition. Execution hangs at:
status = getpwnam_r (pl->user, &sp, nambuf, sizeof (nambuf), &sp_ptr);
in exec.c, about 1/3 of the time. I have about 10 exec items, all running under the same user.
exec plugin conf:
LoadPlugin "exec"
<Plugin "exec">
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "users" "/usr/lib/nagios/plugins/check_users" "-w" "20" "-c" "30"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "load" "/usr/lib/nagios/plugins/check_load" "-w" "15,10,5" "-c" "30,20,10"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "all_disks" "/usr/lib/nagios/plugins/check_disk" "-w" "8%" "-c" "5%" "-A" "-x" "/dev/shm" "-X" "nfs" "-i" "/boot"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "zombie_procs" "--translate" "line_1_item_1:processes" "/usr/lib/nagios/plugins/check_procs" "-w" "5" "-c" "10" "-s" "Z"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "total_procs" "--translate" "line_1_item_1:processes" "/usr/lib/nagios/plugins/check_procs" "-w" "400" "-c" "700"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "swap" "/usr/lib/nagios/plugins/check_swap" "-w" "25%" "-c" "15%"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "mem" "/usr/lib/nagios/plugins/check_mem.sh" "-w" "250" "-c" "150" "-p"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "vmstat" "/usr/lib/nagios/plugins/check_vmstat.rb"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "tmpfiles" "/usr/lib/nagios/plugins/check_shell.rb" "'ls /tmp/ | wc -l'"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "clam_found" "/usr/lib/nagios/plugins/check_clam_found.sh"
</Plugin>
While stuck in getpwnam_r, the child process looks like this under strace:
[root@ads-test log]# ps -ef | grep 21676
root 21676 21664 0 19:52 ? 00:00:00 /opt/collectd/sbin/collectd -C /opt/collectd/etc/collectd-client.conf -P /var/run/collectd-client.pid
root 22020 10152 0 19:52 pts/0 00:00:00 grep 21676
[root@ads-test log]# strace -f -p 21676
Process 21676 attached - interrupt to quit
futex(0x7fafd4a095f0, FUTEX_WAIT_PRIVATE, 2, NULL
^C <unfinished ...>
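A `FUTEX_WAIT_PRIVATE` that never returns is what a thread blocked on a locked pthread mutex typically looks like under strace. One way this can happen after `fork` in a multithreaded process is that the child inherits a mutex that some other thread held at fork time; that thread does not exist in the child, so nothing ever unlocks it. The following is a small standalone demonstration of that effect, not collectd code. All names are made up for illustration, and it uses `pthread_mutex_trylock` in the child (not async-signal-safe, acceptable only for a demo) so that it reports the state instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int locked_pipe[2];  /* holder -> main: "mutex is now locked" */
static int release_pipe[2]; /* main -> holder: "fork is done, unlock" */

/* Takes the mutex and keeps it until main() has forked. */
static void *holder_thread(void *arg)
{
  char c = 'x';
  ssize_t n;
  (void)arg;

  int rc = pthread_mutex_lock(&demo_lock);
  assert(rc == 0);
  (void)rc;

  n = write(locked_pipe[1], &c, 1); /* tell main the lock is held */
  (void)n;
  n = read(release_pipe[0], &c, 1); /* block until main has forked */
  (void)n;

  pthread_mutex_unlock(&demo_lock);
  return NULL;
}

/* Returns 1 if the forked child found the mutex locked, 0 if it did not,
 * and -1 on setup errors. */
static int child_sees_locked_mutex(void)
{
  pthread_t tid;
  char c = 0;
  int status = 0;
  ssize_t n;

  if (pipe(locked_pipe) != 0 || pipe(release_pipe) != 0)
    return -1;
  if (pthread_create(&tid, NULL, holder_thread, NULL) != 0)
    return -1;

  /* Wait until the other thread definitely holds the mutex. */
  if (read(locked_pipe[0], &c, 1) != 1)
    return -1;

  pid_t pid = fork();
  if (pid == 0) {
    /* Child: the holder thread was not copied, but the lock state was.
     * A plain pthread_mutex_lock() here would block forever. */
    int busy = (pthread_mutex_trylock(&demo_lock) == EBUSY);
    _exit(busy ? 1 : 0);
  }

  /* Parent: let the holder thread unlock and finish. */
  c = 'x';
  n = write(release_pipe[1], &c, 1);
  (void)n;
  pthread_join(tid, NULL);

  close(locked_pipe[0]);
  close(locked_pipe[1]);
  close(release_pipe[0]);
  close(release_pipe[1]);

  if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
    return -1;
  return WEXITSTATUS(status);
}
```

On older glibc versions this needs to be built with `-pthread`.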
I have found that inserting a sleep(1) between thread creations works around the issue, though it is obviously not a proper fix.
patch to exec.c:
40a41
> #include <unistd.h>
828a851
> sleep(1);