fluentd main process died unexpectedly with code=139 #769

Closed
stanhu opened this Issue Jan 11, 2016 · 7 comments

@stanhu

stanhu commented Jan 11, 2016

I've been seeing fluentd restart on an Ubuntu 14.04 system about once a day due to this error. Here are the debug logs:

2016-01-11 19:37:06 +0000 [debug]: plugin/input_session.rb:122:on_read: on_read
2016-01-11 19:37:07 +0000 [debug]: plugin/input_session.rb:122:on_read: on_read
2016-01-11 19:37:07 +0000 [debug]: plugin/input_session.rb:122:on_read: on_read
2016-01-11 19:37:08 +0000 [debug]: plugin/input_session.rb:122:on_read: on_read
2016-01-11 19:37:27 +0000 [info]: fluent/supervisor.rb:296:supervise: process finished code=139
2016-01-11 19:37:27 +0000 [error]: fluent/supervisor.rb:146:start: fluentd main process died unexpectedly. restarting.
2016-01-11 19:37:27 +0000 [info]: fluent/supervisor.rb:279:supervise: starting fluentd-0.12.12

I think the supervisor is reporting exit code 139, which would indicate a segmentation fault (128 + signal 11, SIGSEGV). I enabled core dumps, and the gdb backtrace looks like this:

#0  vm_throw (throwobj=8, throw_state=0, reg_cfp=0x7fc8ab305e20, th=0x7fc909c52000) at vm_insnhelper.c:748
#1  vm_exec_core (th=th@entry=0x7fc909c52000, initial=initial@entry=0) at insns.def:1126
#2  0x00007fc9188de1ff in vm_exec (th=0x7fc909c52000) at vm.c:1398
#3  0x00007fc9188e61c5 in invoke_block_from_c (defined_class=140501548692160, cref=0x0, blockptr=0x0, argv=0x0, argc=0, self=140500363872000, block=<optimized out>, th=<optimized out>) at vm.c:817
#4  vm_yield (argc=0, argv=0x0, th=<optimized out>) at vm.c:856
#5  rb_yield_0 (argv=0x0, argc=0) at vm_eval.c:938
#6  loop_i () at vm_eval.c:1008
#7  0x00007fc918789f3e in rb_rescue2 (b_proc=b_proc@entry=0x7fc9188e58c0 <loop_i>, data1=data1@entry=0, r_proc=r_proc@entry=0x0, data2=data2@entry=0) at eval.c:754
#8  0x00007fc9188d419e in rb_f_loop (self=140500363872000) at vm_eval.c:1042
#9  0x00007fc9188d58d4 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7fc8ab305f10, th=0x7fc909c52000) at vm_insnhelper.c:1489
#10 vm_call_cfunc (th=0x7fc909c52000, reg_cfp=0x7fc8ab305f10, ci=<optimized out>) at vm_insnhelper.c:1579
#11 0x00007fc9188daed5 in vm_exec_core (th=th@entry=0x7fc909c52000, initial=initial@entry=0) at insns.def:999
#12 0x00007fc9188de1ff in vm_exec (th=th@entry=0x7fc909c52000) at vm.c:1398
#13 0x00007fc9188df904 in vm_call0_body (th=0x7fc909c52000, ci=ci@entry=0x7fc8aadfc6f0, argv=0x7fc8ca4a38e8, argv@entry=0x7fc8aadfc730) at vm_eval.c:171
#14 0x00007fc9188e04a7 in vm_call0 (defined_class=140501548692160, me=<optimized out>, argv=argv@entry=0x7fc8aadfc730, argc=argc@entry=32713, id=<optimized out>, recv=<optimized out>, th=<optimized out>)
    at vm_eval.c:50
#15 rb_vm_call (th=<optimized out>, recv=<optimized out>, id=<optimized out>, argc=argc@entry=0, argv=argv@entry=0x7fc8ca4a38e8, me=<optimized out>, defined_class=140501548692160) at vm_eval.c:247
#16 0x00007fc918792c06 in rb_method_call_with_block (argc=argc@entry=0, argv=0x7fc8ca4a38e8, method=method@entry=140500363871840, pass_procval=pass_procval@entry=8) at proc.c:1820
#17 0x00007fc918792dc9 in bmcall (args=140500364048600, method=140500363871840, argc=0, argv=<optimized out>, passed_proc=8) at proc.c:2295
#18 0x00007fc9188e1a9e in vm_yield_with_cfunc (blockargptr=<optimized out>, argv=<optimized out>, argc=0, self=<optimized out>, block=0x7fc909e89ab0, th=0x7fc909c52000) at vm_insnhelper.c:2115
#19 invoke_block_from_c (th=0x7fc909c52000, block=0x7fc909e89ab0, self=<optimized out>, argc=0, argv=<optimized out>, blockptr=<optimized out>, cref=0x0, defined_class=140501648546800) at vm.c:828
#20 0x00007fc9188e1e9b in vm_invoke_proc (th=th@entry=0x7fc909c52000, proc=proc@entry=0x7fc909e89ab0, self=140500363871840, defined_class=140501648546800, argc=0, argv=0x7fc8ca478558, 
    blockptr=blockptr@entry=0x0) at vm.c:881
#21 0x00007fc9188e1f4a in rb_vm_invoke_proc (th=th@entry=0x7fc909c52000, proc=proc@entry=0x7fc909e89ab0, argc=<optimized out>, argv=<optimized out>, blockptr=blockptr@entry=0x0) at vm.c:900
#22 0x00007fc9188f7cc0 in thread_start_func_2 (th=th@entry=0x7fc909c52000, stack_start=<optimized out>) at thread.c:535
#23 0x00007fc9188f812b in thread_start_func_1 (th_ptr=0x7fc909c52000) at thread_pthread.c:803
#24 0x00007fc91812f182 in start_thread (arg=0x7fc8aadfd700) at pthread_create.c:312
#25 0x00007fc91843f47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

I'm not sure if this is a bug with ruby, fluentd, or the interaction of the two. I noticed that currently ruby 2.1.5p273 is being used; perhaps it's wise to update to ruby 2.1.8?
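As a sanity check on that reading, code=139 is a shell-style exit status: 128 plus the terminating signal number, which Ruby can decode directly:

```ruby
# Sanity check: code=139 is a shell-style exit status, i.e. 128 + signal
# number. 139 - 128 = 11, and Ruby can name the signal for us.
code = 139
sig  = code - 128
puts Signal.signame(sig)  # "SEGV", i.e. a segmentation fault
```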

UPDATE: Updated to td-agent 2.3.0 with ruby 2.1.8.

@repeatedly

Member

repeatedly commented Jan 12, 2016

Is the above gdb log from the supervisor or a worker?

@repeatedly

Member

repeatedly commented Jan 12, 2016

Ah, the log is from a core dump, so this is a worker process's gdb log.

@repeatedly repeatedly assigned repeatedly and nurse and unassigned repeatedly Jan 12, 2016

@nurse

Contributor

nurse commented Jan 13, 2016

#0 vm_throw (throwobj=8, throw_state=0, reg_cfp=0x7fc8ab305e20, th=0x7fc909c52000) at vm_insnhelper.c:748

Since the value 8 in throwobj=8 means nil, the memory seems to be corrupted. Such a situation is usually caused by an extension library, though it could also be a VM bug.

Could you reproduce this with dummy data? I'd like to inspect it with a customized Ruby that shows more information.

@stanhu


stanhu commented Jan 13, 2016

I can't reproduce it yet; it happens every few days or so. It's possible that our log files have garbage lines (e.g. non-null values) that may be confusing the extension library. I'm not sure. It hasn't happened again since I updated to td-agent 2.3.0 and ruby 2.1.8, though. Let me know if there's any additional debugging I can enable.

@nurse

Contributor

nurse commented Jan 14, 2016

On a crash, Ruby writes its runtime information to STDERR. To capture that log, change the code to reopen STDERR on a file instead of /dev/null.
But be careful about the file size, since some processes may write a lot of logs.

diff --git a/lib/fluent/supervisor.rb b/lib/fluent/supervisor.rb
index 5be4f5a..f18cb23 100644
--- a/lib/fluent/supervisor.rb
+++ b/lib/fluent/supervisor.rb
@@ -265,7 +265,7 @@ module Fluent
       if @wait_daemonize_pipe_w
         STDIN.reopen("/dev/null")
         STDOUT.reopen("/dev/null", "w")
-        STDERR.reopen("/dev/null", "w")
+        STDERR.reopen("/home/foo/td-agent-#{Process.pid}.log", "w")
         @wait_daemonize_pipe_w.write @supervisor_pid.to_s
         @wait_daemonize_pipe_w.close
         @wait_daemonize_pipe_w = nil
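A standalone sketch of what the patch does: reopen STDERR onto a real file so that a later crash report is captured instead of vanishing into /dev/null. Tempfile stands in here for the hard-coded path in the diff:

```ruby
require "tempfile"

# Sketch of the patch above: reopen this process's STDERR onto a real
# file so a later crash report is captured rather than discarded.
# Tempfile stands in for the hard-coded path in the diff.
tmp = Tempfile.new("td-agent-stderr")
STDERR.reopen(tmp.path, "w")
STDERR.sync = true

warn "crash reports would now land in #{tmp.path}"  # goes to the file

puts File.read(tmp.path)
```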
@repeatedly

Member

repeatedly commented Apr 8, 2016

PacketZoom implements its own specific aggregator, so we never got the crash log.
Currently this problem happens only in their environment, so I'm closing this.

@repeatedly repeatedly closed this Apr 8, 2016

@nurse

Contributor

nurse commented Apr 11, 2016

As a note, you can also capture Ruby's crash log with strace.
