Fluentd stuck/hangs because of infinite regexp (99.9%), please improve detection/validation #2464

Closed · kvborodin opened this issue Jun 18, 2019 · 9 comments · Fixed by #2513
Labels: bug (Something isn't working)

kvborodin commented Jun 18, 2019

First of all, I'd like to thank all the developers who made this piece of software run for at least 23 hours in production without any issues. My issue happens after 24 hours.

Describe the bug
I'm using the 'multi_format' plugin (made by @repeatedly) inside a 'tail' source.

My "parse".."/parse" code inside 'tail' plugin looks like this structure, I have 10-15 various expressions to catch my multi-format logs:

    # PATTERN: pq_rules_log
    <pattern>
      format regexp
      expression /(?<sql1>\s*(INSERT|REPLACE)\s+(DELAYED\s*|IGNORE\s*)?INTO)\s+(?<db>[^\s]+?)\.(?<table>.+?)[(\s]((.*)(?<decider>filter_from)(.*))VALUES\s*\(FROM_UNIXTIME\((?<logged>\d+)\),\s\'(?<api_key>[^,]+)\',\s?(?<api_key_id>[^,]+),\s\'(?<page_id>[^,]+)\',\s\'(?<action>[^,]+)\',\s\'(?<query>.*)\',\s\'(?<tag>[^,]*)\',\s\'(?<query_id>[^']*)\',\s\'(?<api_section>[^,]+)\',\s\'(?<filter_from>[^,]+)\',\s\'(?<rule_type>[^,]*)\'/im
    </pattern>
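
For illustration, the nested unbounded quantifiers in the group ((.*)(?<decider>filter_from)(.*)) above are a classic catastrophic-backtracking shape: on a line that almost matches, the engine must try exponentially many ways to split the text between the two .* groups. A minimal Ruby sketch of the same blow-up, using a deliberately simplified toy pattern rather than the production one:

    # Ambiguous quantifier: (a|aa)+ can split a run of 'a's in exponentially
    # many ways, and on failure the engine has to try them all before giving
    # up (the trailing 'b' guarantees the match always fails).
    pattern = /^(a|aa)+$/
    input   = "a" * 40 + "b"

    # On Rubies without match memoization (added much later, in 3.2), each
    # extra 'a' multiplies the matching time by roughly 1.6x; at real
    # log-line lengths this pins a parser thread at 100% CPU.
    puts pattern.match?(input)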

Every day in production the fluentd process gets stuck/hangs and consumes 100% CPU.
But it's not just the high CPU: fluentd also stops consuming/tailing my logs at the same time, which is horrible across my 200 production bare-metal servers, you know.
I only have 2 log files open at any given time, just 2 files that I want this software to consume:

<source>
  @type tail
  @log_level trace
  read_lines_limit 50
  enable_stat_watcher false
  refresh_interval 600
  path /boardreader/log/performance/log_*
  pos_file /boardreader/log/fluentd_all.log.pos
</source>

When Fluentd hangs, I am not able to get a diagnostic report by sending the SIGCONT signal to the PID.
strace shows the same data whether I trace a well-working fluentd or a stuck one:

futex(0x7f6e116274b4, FUTEX_WAIT_BITSET_PRIVATE, 346209, {28652557, 138109776}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f6e11627528, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f6e116274b4, FUTEX_WAIT_BITSET_PRIVATE, 346211, {28652557, 438309940}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f6e11627528, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f6e116274b4, FUTEX_WAIT_BITSET_PRIVATE, 346213, {28652557, 738578707}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f6e11627528, FUTEX_WAKE_PRIVATE, 1) = 0

But pstack shows the same interesting information every time, even after I stop/start fluentd. Look at frames #0 and #1: the thread is stuck inside the regexp engine (match_at / onig_search_gpos), matching an INSERT line against my expression.

Thread 3 (Thread 0x7f45513ff700 (LWP 20109)):
#0  0x00007f4567024a63 in match_at (reg=reg@entry=0x7f455a3a6300, str=str@entry=0x7f454e2f7b80 "INSERT DELAYED INTO logs.pq_rules_log_190618 (`logged`, `api_key`, `api_key_id`, `page_id`, `action`, `query`, `tag`, `query_id`, `api_section`, `filter_from`, `rule_type`)", ' ' <repeats 17 times>, "VALUES (FRO"..., end=end@entry=0x7f454e325bcb "", sstart=sstart@entry=0x7f454e2f7b80 "INSERT DELAYED INTO logs.pq_rules_log_190618 (`logged`, `api_key`, `api_key_id`, `page_id`, `action`, `query`, `tag`, `query_id`, `api_section`, `filter_from`, `rule_type`)", ' ' <repeats 17 times>, "VALUES (FRO"..., sprev=<optimized out>, msa=msa@entry=0x7f45513fa8d0) at regexec.c:3139
#1  0x00007f4567027b7a in onig_search_gpos (reg=reg@entry=0x7f455a3a6300, str=0x7f454e2f7b80 "INSERT DELAYED INTO logs.pq_rules_log_190618 (`logged`, `api_key`, `api_key_id`, `page_id`, `action`, `query`, `tag`, `query_id`, `api_section`, `filter_from`, `rule_type`)", ' ' <repeats 17 times>, "VALUES (FRO"..., end=0x7f454e325bcb "", global_pos=<optimized out>, start=<optimized out>, range=range@entry=0x7f454e325bcb "", region=region@entry=0x7f45513fa9a0, option=option@entry=0) at regexec.c:4416
#2  0x00007f456702812b in onig_search (reg=reg@entry=0x7f455a3a6300, str=<optimized out>, end=<optimized out>, start=<optimized out>, range=range@entry=0x7f454e325bcb "", region=region@entry=0x7f45513fa9a0, option=option@entry=0) at regexec.c:4145
#3  0x00007f456700e7b2 in rb_reg_search0 (set_backref_str=1, reverse=<optimized out>, pos=0, str=139935792957960, re=139935871612520) at re.c:1531
#4  rb_reg_search (re=re@entry=139935871612520, str=str@entry=139935792957960, pos=pos@entry=0, reverse=reverse@entry=0) at re.c:1588
#5  0x00007f456700ec3f in reg_match_pos (pos=0, strp=0x7f45513faa70, re=139935871612520) at re.c:3022
#6  rb_reg_match_m (argc=<optimized out>, argv=<optimized out>, re=139935871612520) at re.c:3197
#7  0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f455a42f580, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150d9b0, th=0x7f45593ac000) at vm_insnhelper.c:1769
#8  vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150d9b0, calling=<optimized out>, ci=0x7f455a42f580, cc=<optimized out>) at vm_insnhelper.c:1864
#9  0x00007f45670b6229 in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:1066
#10 0x00007f45670ba647 in vm_exec (th=th@entry=0x7f45593ac000) at vm.c:1775
#11 0x00007f45670bb201 in invoke_block (captured=<optimized out>, opt_pc=<optimized out>, type=572653569, cref=0x0, self=139935865021480, iseq=0x7f455a44d238, th=0x7f45593ac000) at vm.c:973
#12 invoke_iseq_block_from_c (th=0x7f45593ac000, captured=<optimized out>, self=139935865021480, argc=<optimized out>, argv=<optimized out>, passed_block_handler=<optimized out>, cref=0x0, splattable=0, is_lambda=0) at vm.c:1019
#13 0x00007f45670bb8bd in invoke_block_from_c_splattable (argc=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>, splattable=<optimized out>, argv=<optimized out>, block_handler=<optimized out>, th=<optimized out>) at vm.c:1037
#14 vm_yield (argc=1, argv=0x7f45513fb168, th=0x7f45593ac000) at vm.c:1079
#15 rb_yield_0 (argv=0x7f45513fb168, argc=1) at vm_eval.c:1010
#16 rb_yield_1 (val=139935871666800) at vm_eval.c:1016
#17 rb_yield (val=<optimized out>) at vm_eval.c:1026
#18 0x00007f4566ed6b5d in rb_ary_each (ary=139935865013440) at array.c:1824
#19 0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f4559316320, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150da40, th=0x7f45593ac000) at vm_insnhelper.c:1769
#20 vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150da40, calling=<optimized out>, ci=0x7f4559316320, cc=<optimized out>) at vm_insnhelper.c:1864
#21 0x00007f45670b6ece in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:967
#22 0x00007f45670ba647 in vm_exec (th=th@entry=0x7f45593ac000) at vm.c:1775
#23 0x00007f45670bb201 in invoke_block (captured=<optimized out>, opt_pc=<optimized out>, type=572653569, cref=0x0, self=139935844059400, iseq=0x7f455a4df7f0, th=0x7f45593ac000) at vm.c:973
#24 invoke_iseq_block_from_c (th=0x7f45593ac000, captured=<optimized out>, self=139935844059400, argc=<optimized out>, argv=<optimized out>, passed_block_handler=<optimized out>, cref=0x0, splattable=0, is_lambda=0) at vm.c:1019
#25 0x00007f45670bb8bd in invoke_block_from_c_splattable (argc=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>, splattable=<optimized out>, argv=<optimized out>, block_handler=<optimized out>, th=<optimized out>) at vm.c:1037
#26 vm_yield (argc=1, argv=0x7f45513fb868, th=0x7f45593ac000) at vm.c:1079
#27 rb_yield_0 (argv=0x7f45513fb868, argc=1) at vm_eval.c:1010
#28 rb_yield_1 (val=139935792957960) at vm_eval.c:1016
#29 rb_yield (val=<optimized out>) at vm_eval.c:1026
#30 0x00007f4566ed6b5d in rb_ary_each (ary=139935845464800) at array.c:1824
#31 0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f4559315590, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150db00, th=0x7f45593ac000) at vm_insnhelper.c:1769
#32 vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150db00, calling=<optimized out>, ci=0x7f4559315590, cc=<optimized out>) at vm_insnhelper.c:1864
#33 0x00007f45670b6ece in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:967
#34 0x00007f45670ba647 in vm_exec (th=0x7f45593ac000) at vm.c:1775
#35 0x00007f45670bea4a in vm_call0_body (th=<optimized out>, calling=calling@entry=0x7f45513fbec0, ci=ci@entry=0x7f45513fbeb0, cc=cc@entry=0x7f45513fbee0, argv=<optimized out>) at vm_eval.c:177
#36 0x00007f45670bee70 in vm_call0 (me=<optimized out>, argv=<optimized out>, argc=<optimized out>, id=<optimized out>, recv=<optimized out>, th=<optimized out>) at vm_eval.c:62
#37 rb_vm_call (th=<optimized out>, recv=<optimized out>, id=<optimized out>, argc=<optimized out>, argv=<optimized out>, me=<optimized out>) at vm_eval.c:257
#38 0x00007f45670abc91 in vm_call_cfunc_with_frame (ci=0x7f4559910d50, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7f455150db60, th=0x7f45593ac000) at vm_insnhelper.c:1769
#39 vm_call_cfunc (th=0x7f45593ac000, reg_cfp=0x7f455150db60, calling=<optimized out>, ci=0x7f4559910d50, cc=<optimized out>) at vm_insnhelper.c:1864
#40 0x00007f45670b6229 in vm_exec_core (th=th@entry=0x7f45593ac000, initial=initial@entry=0) at insns.def:1066
#41 0x00007f45670ba647 in vm_exec (th=0x7f45593ac000) at vm.c:1775
#42 0x00007f45670bea4a in vm_call0_body (th=<optimized out>, calling=calling@entry=0x7f45513fc530, ci=ci@entry=0x7f45513fc520, cc=cc@entry=0x7f45513fc550, argv=<optimized out>) at vm_eval.c:177
#43 0x00007f45670bee70 in vm_call0 (me=<optimized out>, argv=<optimized out>, argc=<optimized out>, id=<optimized out>, recv=<optimized out>, th=<optimized out>) at vm_eval.c:62

This makes me think Fluentd has no way to detect runaway regexp matching (catastrophic backtracking), something even https://rubular.com/ handles, for example.

To Reproduce
I can privately send you my config and a log containing the unknown log line that makes fluentd hang.

Expected behavior
Warn to the log file and continue processing logs.
The warning must include the entire log line, as you already do for unmatched lines.
There should also be a way to catch these lines and apply another regexp (i.e. retry with a simpler pattern); see the sketch just below.
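
A sketch of that fallback idea: the multi_format parser tries its <pattern> blocks in order, so a permissive catch-all as the last pattern at least keeps unmatched lines flowing (illustrative config; note that this only helps lines that fail to match quickly, it cannot interrupt a match that is already backtracking):

    <parse>
      @type multi_format
      <pattern>
        format regexp
        expression /.../          # strict production patterns first
      </pattern>
      <pattern>
        format none               # catch-all: pass the raw line through
      </pattern>
    </parse>
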
Your Environment

  • Fluentd or td-agent version: td-agent-3.4.1-0.el7.x86_64 (from rpm -qa | grep td-agent)

Also tested with Fluentd 1.4 and 1.5 (and other versions back to 1.0): same issue.
I don't know why the old Ruby 2.4.6 is used in production when it's EOL:

# /opt/td-agent/embedded/bin/ruby -v
ruby 2.4.6p354 (2019-04-01 revision 67394) [x86_64-linux]

  • Operating system: CentOS Linux 7 (Core) (from /etc/os-release)

  • Kernel version: 4.17.8-1.el7.elrepo.x86_64 (from uname -r)

Your Configuration

Part of the config was already shown above.
Because of an NDA I can't paste the full config here, but I can share it privately with a developer, along with example log lines that cause fluentd to hang.

Your Error Log

In debug mode: no new log output at all, no new chunks created, etc.
In trace mode: the same.
kvborodin commented Jun 23, 2019

Meanwhile, Logstash has no issues consuming my 2 log files with the same regex.
This makes me think the current v1.x of Fluentd is not production-ready at all.
I'm going to switch to Logstash for all my projects, guys.
You should remove "production-ready" from your official site.
Thanks

@repeatedly (Member) commented

Hmm... to debug this, we need your actual log for the investigation.
Logstash uses a Java-based regexp engine, while Ruby (CRuby) uses a different regexp engine.
That seems to explain the different behaviour.

@kvborodin (Author) commented

Cool! I'll send it to you privately (repeatedly@gmail.com) once I get approval from my PM/TL.

@kvborodin (Author) commented

What about the SIGCONT signal during the hang? Any thoughts on why it always works but not in my situation?

@repeatedly (Member) commented

"Any thoughts on why it always works but not in my situation?"

This is a log content / regexp combination issue, or your regexp hits a bug in Ruby's regexp engine.
Each regexp engine has its own algorithm, and a specific pattern can sometimes trigger this behaviour. For example, Java's regexp engine has a similar problem: https://stackoverflow.com/questions/37484078/regexp-causes-to-hang-infinitely

@kvborodin (Author) commented

@repeatedly look, Logstash has Grok, and Grok has a tunable timeout on each regex matching operation. Logstash also logs unmatched/timed-out lines to a separate log file for later investigation, so the failing regex can be fixed afterwards. Without this ability you can't iterate on log-parsing patterns, right? I mean, how can I figure out which line of my multi-format log made Fluentd hang?

I suggest adding timeouts and logging for users who rely on regex patterns!

The core problem is Fluentd hanging on runaway regex matching, and the improvement I suggest would solve all these problems with hung/stuck processing of input records!

@kvborodin (Author) commented

I was talking about this feature: https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#plugins-filters-grok-timeout_millis

Also, please take a look at these features:

  • tag_on_failure
  • tag_on_timeout

They are so helpful!

Thanks!

@repeatedly (Member) commented

Grok is a collection of regexps, so I assume Logstash can also hit a similar regexp problem.
Ruby's thread model is different from Java's, but I will check whether we can use similar handling for regexps.
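
For reference, one possible shape of such handling, as a minimal sketch only (not the actual patch, which became #2513): wrap each match in Timeout.timeout, relying on CRuby checking thread interrupts at checkpoints inside the regexp match loop.

    require 'timeout'

    # Hypothetical guard (illustrative names): abort a single pathological
    # match instead of letting it pin the parser thread forever.
    REGEXP_TIMEOUT = 1 # seconds; a real plugin would make this configurable

    def guarded_match(regexp, line)
      Timeout.timeout(REGEXP_TIMEOUT) { regexp.match(line) }
    rescue Timeout::Error
      # Log the offending line (as requested above) and treat it as unmatched.
      warn "regexp timed out, skipping line: #{line[0, 200]}"
      nil
    end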

repeatedly added commits that referenced this issue on Jul 22, 2019 (Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>)
@repeatedly (Member) commented

Wrote patch for this issue: #2513

repeatedly added a commit that referenced this issue on Jul 24, 2019:
parser: Add timeout parameter to avoid parsing stuck. fix #2464
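
Going by that commit message, usage presumably looks something like the following. The parameter name (timeout) comes from the commit title; the value and its placement inside <parse> are illustrative assumptions:

    <parse>
      @type regexp
      expression /.../
      timeout 1s    # assumed usage: give up on one line after 1s instead of hanging
    </parse>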