Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exec plugin sporadically hanging in getpwnam_r #229

Closed
ghost opened this issue Jan 3, 2013 · 5 comments
Closed

exec plugin sporadically hanging in getpwnam_r #229

ghost opened this issue Jan 3, 2013 · 5 comments
Assignees
Labels
Bug A genuine bug
Milestone

Comments

@ghost
Copy link

ghost commented Jan 3, 2013

There seems to be some sort of race condition. Execution hangs at:

status = getpwnam_r (pl->user, &sp, nambuf, sizeof (nambuf), &sp_ptr);

in exec.c, about 1/3 of the time. I have about 10 exec items, all running under the same user.

exec plugin conf:

LoadPlugin "exec"

<Plugin "exec">
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "users" "/usr/lib/nagios/plugins/check_users" "-w" "20" "-c" "30"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "load" "/usr/lib/nagios/plugins/check_load" "-w" "15,10,5" "-c" "30,20,10"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "all_disks" "/usr/lib/nagios/plugins/check_disk" "-w" "8%" "-c" "5%" "-A" "-x" "/dev/shm" "-X" "nfs" "-i" "/boot"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "zombie_procs" "--translate" "line_1_item_1:processes" "/usr/lib/nagios/plugins/check_procs" "-w" "5" "-c" "10" "-s" "Z"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "total_procs" "--translate" "line_1_item_1:processes" "/usr/lib/nagios/plugins/check_procs" "-w" "400" "-c" "700"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "swap" "/usr/lib/nagios/plugins/check_swap" "-w" "25%" "-c" "15%"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "mem" "/usr/lib/nagios/plugins/check_mem.sh" "-w" "250" "-c" "150" "-p"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "vmstat" "/usr/lib/nagios/plugins/check_vmstat.rb"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "tmpfiles" "/usr/lib/nagios/plugins/check_shell.rb" "'ls /tmp/ | wc -l'"
Exec "nagios" "runner" "/opt/collectd/bin/nagios_to_collectd" "-rt" "-n" "clam_found" "/usr/lib/nagios/plugins/check_clam_found.sh"

While stuck in getpwnam_r in strace the child process looks like:
[root@ads-test log]# ps -ef | grep 21676
root 21676 21664 0 19:52 ? 00:00:00 /opt/collectd/sbin/collectd -C /opt/collectd/etc/collectd-client.conf -P /var/run/collectd-client.pid
root 22020 10152 0 19:52 pts/0 00:00:00 grep 21676
[root@ads-test log]# strace -f -p 21676
Process 21676 attached - interrupt to quit
futex(0x7fafd4a095f0, FUTEX_WAIT_PRIVATE, 2, NULL
^C <unfinished ...>

I have found inserting a sleep(1) between new thread creation fixes the issue, though obviously not a perfect solution.

patch to exec.c:

40a41
> #include <unistd.h>
828a851
> sleep(1);

@NeoPhi
Copy link

NeoPhi commented May 12, 2013

I'm noticing similar behavior with only 2 exec plug-ins configured when collectd starts up on a machine under load.

@octo
Copy link
Member

octo commented May 26, 2013

Hi,

can you create a core dump and provide a stack backtrace or a process that's hanging in getpwnam_r? E.g. by sending the process the ABRT or QUIT signal. Also, which operating system are you using?

Best regards,
—octo

@NeoPhi
Copy link

NeoPhi commented May 27, 2013

This is the backtrace for a hung process:

(gdb) bt
#0  0x00007f85e6e8f62e in ?? () from /lib/libc.so.6
#1  0x00007f85e6e93769 in ?? () from /lib/libc.so.6
#2  0x00007f85e6e9334e in __nss_database_lookup () from /lib/libc.so.6
#3  0x00007f85e6e93ee8 in __nss_passwd_lookup2 () from /lib/libc.so.6
#4  0x00007f85e6e52e7f in getpwnam_r () from /lib/libc.so.6
#5  0x00007f85e5f9f4bf in exec_child (pl=0x173f540,
    fd_in=<value optimized out>, fd_out=<value optimized out>,
    fd_err=0x7f85dbffeea8) at exec.c:303
#6  fork_child (pl=0x173f540, fd_in=<value optimized out>,
    fd_out=<value optimized out>, fd_err=0x7f85dbffeea8) at exec.c:508
#7  0x00007f85e5f9fbbb in exec_read_one (arg=0x7f85e7113560) at exec.c:559
#8  0x00007f85e75288ca in start_thread () from /lib/libpthread.so.0
#9  0x00007f85e6e82b6d in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)

This is on an AMD64 based system running Debian 6.0.7 with a 3.2 based kernel.

@cstorey
Copy link

cstorey commented Aug 14, 2013

I'm seeing something similar. For some background, the name service switch library (that getpwnam_r interface is part of) uses a static mutex to protect internal resources, and so will need to acquire this mutex temporarily on all lookups. Now, there's a chance that another thread will be running with the mutex held when our thread calls fork(2). If that does happen, then the mutex will be recorded as locked in the child, even though there are no other threads running in the child.

So, from an example gdb session:

(gdb) break fork
Breakpoint 1 at 0x405b90
(gdb) run -C /etc/collectd/collectd.conf -f
Starting program: /usr/sbin/collectd -C /etc/collectd/collectd.conf -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4f1e700 (LWP 12625)]
[New Thread 0x7ffff471d700 (LWP 12626)]
[New Thread 0x7ffff3f1c700 (LWP 12627)]
[New Thread 0x7ffff371b700 (LWP 12628)]
[New Thread 0x7ffff2f1a700 (LWP 12629)]
[New Thread 0x7ffff2719700 (LWP 12630)]
[New Thread 0x7ffff1f18700 (LWP 12631)]
[New Thread 0x7ffff1717700 (LWP 12632)]
[New Thread 0x7ffff0f16700 (LWP 12633)]
[New Thread 0x7fffdffff700 (LWP 12634)]
[New Thread 0x7fffdf7fe700 (LWP 12635)]
[Switching to Thread 0x7fffdf7fe700 (LWP 12635)]

Breakpoint 1, __fork () at ../nptl/sysdeps/unix/sysv/linux/pt-fork.c:26
26  ../nptl/sysdeps/unix/sysv/linux/pt-fork.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 12   Thread 0x7fffdf7fe700 (LWP 12635) "collectd" __fork () at ../nptl/sysdeps/unix/sysv/linux/pt-fork.c:26
  11   Thread 0x7fffdffff700 (LWP 12634) "collectd" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  10   Thread 0x7ffff0f16700 (LWP 12633) "collectd" __lll_lock_wait_private ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
  9    Thread 0x7ffff1717700 (LWP 12632) "collectd" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  8    Thread 0x7ffff1f18700 (LWP 12631) "collectd" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
  7    Thread 0x7ffff2719700 (LWP 12630) "collectd" _int_malloc (av=0x7fffec000020, bytes=32808)
    at malloc.c:3879
  6    Thread 0x7ffff2f1a700 (LWP 12629) "collectd" __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  5    Thread 0x7ffff371b700 (LWP 12628) "collectd" __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  4    Thread 0x7ffff3f1c700 (LWP 12627) "collectd" __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  3    Thread 0x7ffff471d700 (LWP 12626) "collectd" __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:132
  2    Thread 0x7ffff4f1e700 (LWP 12625) "collectd" 0x00007ffff7de0da2 in elf_get_dynamic_info (temp=0x0, 
    l=0x7fffc8001790) at dynamic-link.h:166
  1    Thread 0x7ffff7fed700 (LWP 12622) "collectd" 0x00007ffff79c452d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:82
(gdb) thread apply 2 bt

Thread 2 (Thread 0x7ffff4f1e700 (LWP 12625)):
#0  0x00007ffff7de0da2 in elf_get_dynamic_info (temp=0x0, l=0x7fffc8001790) at dynamic-link.h:166
#1  _dl_map_object_from_fd (name=0x7ffff4f1ca20 "libnss_files.so.2", fd=9, fbp=0x7ffff4f1c240, 
    realname=0x7fffc8000e90 "/lib/x86_64-linux-gnu/libnss_files.so.2", loader=<optimized out>, 
    l_type=<optimized out>, mode=-1879048191, stack_endp=0x7ffff4f1c5a0, nsid=0) at dl-load.c:1418
#2  0x00007ffff7de269a in _dl_map_object (loader=0x0, name=0x7ffff4f1ca20 "libnss_files.so.2", type=2, 
    trace_mode=0, mode=-1879048191, nsid=<optimized out>) at dl-load.c:2568
#3  0x00007ffff7ded84b in dl_open_worker (a=0x7ffff4f1c7d0) at dl-open.c:225
#4  0x00007ffff7de9176 in _dl_catch_error (objname=0x7ffff4f1c818, errstring=0x7ffff4f1c820, 
    mallocedp=0x7ffff4f1c82f, operate=0x7ffff7ded700 <dl_open_worker>, args=0x7ffff4f1c7d0) at dl-error.c:178
#5  0x00007ffff7ded31a in _dl_open (file=0x7ffff4f1ca20 "libnss_files.so.2", mode=-2147483647, 
    caller_dlopen=0x7ffff74f1a2e, nsid=-2, argc=4, argv=<optimized out>, env=0x7fffffffe5b0) at dl-open.c:639
#6  0x00007ffff751c612 in do_dlopen (ptr=0x7ffff4f1c9d0) at dl-libc.c:89
#7  0x00007ffff7de9176 in _dl_catch_error (objname=0x7ffff4f1ca00, errstring=0x7ffff4f1c9f0, 
    mallocedp=0x7ffff4f1ca0f, operate=0x7ffff751c5d0 <do_dlopen>, args=0x7ffff4f1c9d0) at dl-error.c:178
#8  0x00007ffff751c6d4 in dlerror_run (args=0x7ffff4f1c9d0, operate=0x7ffff751c5d0 <do_dlopen>)
    at dl-libc.c:48
#9  __GI___libc_dlopen_mode (name=<optimized out>, mode=<optimized out>) at dl-libc.c:165
#10 0x00007ffff74f1a2e in nss_load_library (ni=<optimized out>) at nsswitch.c:372
#11 0x00007ffff74f248d in __GI___nss_lookup_function (ni=0x7fffc80012b0, 
    fct_name=0x7ffff75672db "gethostbyname2_r") at nsswitch.c:474
#12 0x00007ffff74f269c in __GI___nss_lookup (ni=0x7ffff4f1cb60, fct_name=0x7ffff75672db "gethostbyname2_r", 
    fct2_name=0x0, fctp=0x7ffff4f1cb70) at nsswitch.c:202
#13 0x00007ffff74f91e0 in __gethostbyname2_r (name=0x639d40 "localhost", af=2, resbuf=0x7ffff4f1cf20, 
    buffer=0x7ffff4f1cbf0 "", buflen=512, result=0x7ffff4f1cf70, h_errnop=0x7ffff4f1cf90)
    at ../nss/getXXbyYY_r.c:203
#14 0x00007ffff74b7e1a in gaih_inet (name=0x639d40 "localhost", service=<optimized out>, req=0x7ffff4f1d0d0, 
    pai=<optimized out>, naddrs=0x7ffff4f1d180) at ../sysdeps/posix/getaddrinfo.c:581
#15 0x00007ffff74ba99e in __GI_getaddrinfo (name=0x639d40 "localhost", service=<optimized out>, 
    hints=0x7ffff4f1d0d0, pai=0x7ffff4f1d218) at ../sysdeps/posix/getaddrinfo.c:2423
#16 0x00007ffff5326768 in ?? () from /usr/lib/collectd/write_graphite.so
#17 0x00007ffff5326b78 in ?? () from /usr/lib/collectd/write_graphite.so
#18 0x000000000040db8d in plugin_write ()
#19 0x000000000040a9d5 in ?? ()
#20 0x000000000040acb5 in fc_process_chain ()
#21 0x000000000040ccc6 in ?? ()
#22 0x000000000040ec13 in ?? ()
#23 0x00007ffff79bce9a in start_thread (arg=0x7ffff4f1e700) at pthread_create.c:308
#24 0x00007ffff74dfccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#25 0x0000000000000000 in ?? ()
(gdb) 

So in this case, while our thread running the exec plugin is trying to fork a new child, thread 2 (incidentally running write_graphite) is trying to do a hostname lookup, which requires loading libnss_files.so.2, which is done whilst the nss mutex is held. So, because once the child process is running, this lock is never unlocked before we try to re-lock it, any attempts to use it will deadlock.

And so because we lookup the user we want to run the process as in the child, which requires said mutex, we have a lovely recipe for a deadlock.

So, looking at a stack overflow question about pthread_atfork, it's suggested that only async-signal-safe (re-entrant) functions are used in the child.

I would suggest then, that the fix is relatively easy--just move the lookup of the user to run as into the parent process (so, from exec_child to fork_child before the fork). Thinking about it, I might as well submit a PR for this. :)

@octo
Copy link
Member

octo commented Aug 16, 2013

Hi Ceri,

thank you very much for your analysis and patch! I think a tricky issue such as this one deserves some recognition! If you email me your address and size (I have M, L, and XL) I'll send you one of the Appreciation Program t-shirts.

Thanks and best regards,
—octo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug A genuine bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants