caught signal (SIGSEGV) with output elasticsearch #4412

Closed · GuillaumeSmaha opened this issue Dec 9, 2021 · 4 comments · Fixed by #4414

Comments

GuillaumeSmaha (Contributor) commented Dec 9, 2021

Bug Report

Describe the bug
I am seeing a SEGV after a few minutes on some pods, and I think it is related to PR #3788.
In some cases, append_size is still smaller than required, which means memcpy writes outside the buffer and can cause a SEGV.
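
For illustration, here is a minimal sketch of growth logic that restores the invariant that at least `required` bytes are free before the memcpy calls. The helper name bulk_ensure_space is hypothetical, and the struct fields, flb_realloc()/flb_errno() and ES_BULK_CHUNK are assumed to match es_bulk.c as shown in the backtrace; this is a sketch, not the actual patch merged in #4414:

    /* sketch, not the upstream patch: grow the bulk buffer so that at
     * least `required` bytes are free; growing by a fixed ES_BULK_CHUNK
     * is not enough when a single record is larger than the chunk */
    static int bulk_ensure_space(struct es_bulk *bulk, size_t required)
    {
        size_t available = bulk->size - bulk->len;
        size_t append_size;
        char *ptr;

        if (available >= required) {
            return 0;
        }
        /* grow by at least the missing amount, never less than a chunk */
        append_size = required - available;
        if (append_size < ES_BULK_CHUNK) {
            append_size = ES_BULK_CHUNK;
        }
        ptr = flb_realloc(bulk->ptr, bulk->size + append_size);
        if (!ptr) {
            flb_errno();
            return -1;
        }
        bulk->ptr   = ptr;
        bulk->size += append_size;
        return 0;
    }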

On version 1.8.10, I added a log to check for this case (GuillaumeSmaha@5b908dd), and I hit that log a few times with a SEGV right after:

[2021/12/08 11:55:05] [debug] [out_es] converted_size is 0
[2021/12/08 11:55:05] [debug] [out_es] required size (197133, append: 4096) is still superior to available buffer (37999, previous available: 33903)

(Here the buffer was grown by ES_BULK_CHUNK = 4096 bytes, from 33903 to 37999 bytes available, but the incoming record needs 197133 bytes, so the following memcpy runs past the end of the buffer.)

Thread 2 "flb-pipeline" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff61ff700 (LWP 45)]
__memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
262	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
#1  0x00005555556fc274 in es_bulk_append (bulk=0x7ffff5400d60, 
    index=0x7ffff4ac0a40 "{\"create\":{\"_index\":\"logstash-2021.12.08\",\"_type\":\"_doc\",\"_id\":\"a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f\"}}\n", i_len=104, 
    json=0x7ffff253fad0 "{\"@timestamp\":\"2021-12-08T11:54:07.027Z\",\"time\":\"2021-12-08T11:54:07.027621077Z\",\"stream\":\"stdout\",\"_p\":\"P\",\"log\":\"\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\"..., j_len=196967, whole_size=33903, converted_size=0)
    at /tmpdir/fluent-bit/plugins/out_es/es_bulk.c:111
#2  0x00005555556f9af9 in elasticsearch_format (config=0x7ffff6632f80, 
    ins=0x7ffff6648f00, plugin_context=0x7ffff54a9600, flush_ctx=0x0, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, data=0x7ffff2341f00, bytes=33903, out_data=0x7ffff4ac0f60, 
    out_size=0x7ffff4ac0f58) at /tmpdir/fluent-bit/plugins/out_es/es.c:537
#3  0x00005555556fa8c5 in cb_es_flush (data=0x7ffff2341f00, bytes=33903, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, ins=0x7ffff6648f00, out_context=0x7ffff54a9600, 
    config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/out_es/es.c:772
#4  0x000055555566b2df in output_pre_cb_flush ()
    at /tmpdir/fluent-bit/include/fluent-bit/flb_output.h:514
#5  0x0000555555b69d47 in co_init ()
    at /tmpdir/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x0000000000000000 in ?? ()
(gdb) 

gdb-fluent-bit-4vljw.log


I also tried reverting the change by using:

    if (available < required) {
        /* grow by more than the missing amount: the new free space is
         * available + append_size, which is always >= required here */
        append_size = available + required + ES_BULK_CHUNK;
        if (append_size < ES_BULK_CHUNK) {
            /* append at least ES_BULK_CHUNK size */
            append_size = ES_BULK_CHUNK;
        }

So far it has been running for 20 hours without a SEGV (it was crashing after a few minutes).

To Reproduce

  • Install the latest Elasticsearch using its Helm chart
  • Use the following configuration for fluent-bit
  • After a few minutes of data being sent by fluent-bit, a segfault happens

Expected behavior
No SEGV occurs.

Your Environment

  • Version used: 1.8.10
  • Configuration:

    [SERVICE]
        Config_Watch  On

        # Flush
        # =====
        # set the interval in seconds before flushing records to a destination
        flush        1

        # Daemon
        # ======
        # instruct Fluent Bit to run in foreground or background mode.
        daemon       Off

        # Log_Level
        # =========
        # Set the verbosity level of the service, values can be:
        #
        # - error
        # - warning
        # - info
        # - debug
        # - trace
        #
        # by default 'info' is set, which means it includes 'error' and 'warning'.
        log_level    trace

        # Parsers File
        # ============
        # specify an optional 'Parsers' configuration file
        parsers_file parsers.conf
        parsers_file  custom_parsers.conf

        # HTTP Server
        # ===========
        # Enable/Disable the built-in HTTP Server for metrics
        http_server  On
        http_listen  0.0.0.0
        http_port    2020

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        # Exclude fluent-bit's own logs to avoid a feedback loop
        Exclude_Path      ["*fluent-bit*.log"]
        DB                /var/log/fluentbit/flb_kube.db
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        # Parser for containerd (Container Runtime Interface)
        multiline.parser  cri
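        # For reference, a CRI log line looks like this (illustrative
        # sample, not taken from the incident):
        #   2021-12-08T11:54:07.027621077Z stdout P some partial output
        # which is the source of the time/stream/_p/log keys visible in
        # the crash payload above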

    [INPUT]
        Name              tail
        Tag               system.*
        Path              /var/log/syslog
        Parser            rsyslog
        DB                /var/log/fluentbit/flb_syslog.db
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Refresh_Interval  10

  ## https://docs.fluentbit.io/manual/pipeline/filters
  filters: |
    [FILTER]
        Name           kubernetes
        Match          kube.*
        Kube_URL       https://kubernetes.default.svc:443

    [OUTPUT]
        Name            es
        Match           *
        Host            elasticsearch-master
        Port            9200
        tls             Off
        tls.verify      Off
        Generate_ID     On
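        # Note: Generate_ID generates an _id for each outgoing record
        # (the "_id" visible in the bulk headers above), which prevents
        # duplicate documents when a flush is retried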
        Logstash_Format On

  #  [OUTPUT]
  #      Name            stdout
  #      Match           *

    [PARSER]
        Name        rsyslog
        Format      regex
        Regex       (?<syslog_time>^[A-Za-z]+ [ 1-3][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<log>.*)$
        Time_Key    syslog_time
        Time_Format %b %d %H:%M:%S
        Time_Keep   On
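        # Example line matched by the Regex above (hypothetical values):
        #   Dec  9 11:36:48 node01 sshd[1234]: Accepted publickey for ubuntu
        # -> syslog_time="Dec  9 11:36:48", host="node01", ident="sshd",
        #    pid="1234", log="Accepted publickey for ubuntu"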
  • Environment name and version (e.g. Kubernetes? What version?):
    Kubernetes 1.21.1
  • Server type and version: Bare-metal
  • Operating System and version:
    Ubuntu 18.04 or Ubuntu 20.04
  • Filters and plugins: kubernetes

GuillaumeSmaha (Contributor, Author) commented:

I tried 1.8.10 without reverting PR #3788 but with PR #4361 applied, and it does not solve my issue.
I will try to create a test case.

logs:

[2021/12/09 11:36:48] [debug] [out_es] converted_size is 0
[2021/12/09 11:36:48] [debug] [http_client] not using http_proxy for header
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] trying 140 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [fd 108] write_async(2)=140 (140/140)
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] ret=140 total=140/140
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] trying 99169 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [fd 108] write_async(2)=99169 (99169/99169)
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] ret=99169 total=99169/99169
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_read] try up to 4095 bytes
[2021/12/09 11:36:48] [trace] [engine] resuming coroutine=0x7f63e7896988
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] ret=4095
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] try up to 32768 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] ret=32768
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] try up to 32768 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] ret=31083
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] try up to 1685 bytes
[2021/12/09 11:36:48] [engine] caught signal (SIGSEGV)
#0  0x7f63ea4a86ff      in  ???() at ultiarch/memmove-vec-unaligned-erms.S:262
#1  0x56262442a98a      in  flb_sds_cat() at src/flb_sds.c:132
#2  0x56262442a9f4      in  flb_sds_cat_safe() at src/flb_sds.c:145
#3  0x56262444e327      in  package_content() at src/multiline/flb_ml.c:324
#4  0x56262444e7d5      in  process_append() at src/multiline/flb_ml.c:495
#5  0x56262444e9f3      in  ml_append_try_parser() at src/multiline/flb_ml.c:564
#6  0x56262444ebfb      in  flb_ml_append() at src/multiline/flb_ml.c:630
#7  0x56262448c661      in  process_content() at plugins/in_tail/tail_file.c:375
#8  0x56262448eb8f      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#9  0x562624486ab4      in  in_tail_collect_pending() at plugins/in_tail/tail.c:91
#10 0x56262442ed96      in  flb_input_collector_fd() at src/flb_input.c:1108
#11 0x5626244419eb      in  flb_engine_handle_event() at src/flb_engine.c:411
#12 0x5626244419eb      in  flb_engine_start() at src/flb_engine.c:696
#13 0x56262442383e      in  flb_lib_worker() at src/flb_lib.c:627
#14 0x7f63ead64fa2      in  start_thread() at /glibc-2.28/nptl/pthread_create.c:486
#15 0x7f63ea4454ce      in  ???() at v/linux/x86_64/clone.S:95
#16 0xffffffffffffffff  in  ???() at ???:0

or

[2021/12/09 11:41:43] [engine] caught signal (SIGSEGV)
#0  0x55bc9121adac      in  atomic_load_p() at lib/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#1  0x55bc9121adac      in  rtree_leaf_elm_bits_read() at lib/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:175
#2  0x55bc9121adac      in  rtree_szind_slab_read() at lib/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:500
#3  0x55bc9121adac      in  ifree() at lib/jemalloc-5.2.1/src/jemalloc.c:2570
#4  0x55bc9121adac      in  je_free_default() at lib/jemalloc-5.2.1/src/jemalloc.c:2790
#5  0x55bc91293583      in  flb_free() at include/fluent-bit/flb_mem.h:122
#6  0x55bc912946b6      in  flb_sds_destroy() at src/flb_sds.c:408
#7  0x55bc912c8f1a      in  flb_kv_item_destroy() at src/flb_kv.c:83
#8  0x55bc912c8f9c      in  flb_kv_release() at src/flb_kv.c:102
#9  0x55bc912cf83d      in  http_headers_destroy() at src/flb_http_client.c:998
#10 0x55bc912d02fd      in  flb_http_client_destroy() at src/flb_http_client.c:1324
#11 0x55bc9132915f      in  cb_es_flush() at plugins/out_es/es.c:885
#12 0x55bc9129940e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#13 0x55bc91797f66      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

or

[2021/12/09 11:45:09] [debug] [out_es] converted_size is 0
[2021/12/09 11:45:09] [engine] caught signal (SIGSEGV)
[2021/12/09 11:45:09] [debug] [http_client] not using http_proxy for header
#0  0x7f6adaf2c1a7      in  ???() at ultiarch/strlen-avx2.S:96
#1  0x56434df09a1c      in  add_host_and_content_length() at src/flb_http_client.c:571
#2  0x56434df0a06a      in  flb_http_client() at src/flb_http_client.c:751
#3  0x56434df63bc6      in  cb_es_flush() at plugins/out_es/es.c:801
#4  0x56434ded440e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#5  0x56434e3d2f66      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

or

[2021/12/09 11:48:13] [engine] caught signal (SIGSEGV)
#0  0x55a017769d1f      in  pcache1FetchNoMutex() at lib/sqlite-amalgamation-3330000/sqlite3.c:50773
#1  0x55a017769d97      in  pcache1Fetch() at lib/sqlite-amalgamation-3330000/sqlite3.c:50829
#2  0x55a0177681c8      in  sqlite3PcacheFetch() at lib/sqlite-amalgamation-3330000/sqlite3.c:49266
#3  0x55a01776f558      in  getPageNormal() at lib/sqlite-amalgamation-3330000/sqlite3.c:57178
#4  0x55a01776f9b5      in  sqlite3PagerGet() at lib/sqlite-amalgamation-3330000/sqlite3.c:57354
#5  0x55a017778ac5      in  getAndInitPage() at lib/sqlite-amalgamation-3330000/sqlite3.c:66663
#6  0x55a01777c77e      in  moveToChild() at lib/sqlite-amalgamation-3330000/sqlite3.c:69593
#7  0x55a01777d7e5      in  btreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70333
#8  0x55a01777d8fd      in  sqlite3BtreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70367
#9  0x55a01777d888      in  btreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70350
#10 0x55a01777d8fd      in  sqlite3BtreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70367
#11 0x55a017799dec      in  sqlite3VdbeExec() at lib/sqlite-amalgamation-3330000/sqlite3.c:90902
#12 0x55a017790f90      in  sqlite3Step() at lib/sqlite-amalgamation-3330000/sqlite3.c:83691
#13 0x55a0177911eb      in  sqlite3_step() at lib/sqlite-amalgamation-3330000/sqlite3.c:83748
#14 0x55a0175caa56      in  db_file_exists() at plugins/in_tail/tail_db.c:112
#15 0x55a0175cad0c      in  flb_tail_db_file_set() at plugins/in_tail/tail_db.c:181
#16 0x55a0175cf44d      in  set_file_position() at plugins/in_tail/tail_file.c:733
#17 0x55a0175cffb0      in  flb_tail_file_append() at plugins/in_tail/tail_file.c:970
#18 0x55a0175d1985      in  flb_tail_file_rotated() at plugins/in_tail/tail_file.c:1534
#19 0x55a0175c9f07      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:205
#20 0x55a017570d96      in  flb_input_collector_fd() at src/flb_input.c:1108
#21 0x55a0175839eb      in  flb_engine_handle_event() at src/flb_engine.c:411
#22 0x55a0175839eb      in  flb_engine_start() at src/flb_engine.c:696
#23 0x55a01756583e      in  flb_lib_worker() at src/flb_lib.c:627
#24 0x7fcf79bd0fa2      in  start_thread() at /glibc-2.28/nptl/pthread_create.c:486
#25 0x7fcf792b14ce      in  ???() at v/linux/x86_64/clone.S:95
#26 0xffffffffffffffff  in  ???() at ???:0

or

[2021/12/09 11:48:34] [debug] [out_es] converted_size is 0
[2021/12/09 11:48:35] [engine] caught signal (SIGSEGV)
#0  0x7fb2e8ede6ff      in  ???() at ultiarch/memmove-vec-unaligned-erms.S:262
#1  0x55bfa7c0f49d      in  es_bulk_append() at plugins/out_es/es_bulk.c:105
#2  0x55bfa7c0cd63      in  elasticsearch_format() at plugins/out_es/es.c:551
#3  0x55bfa7c0db51      in  cb_es_flush() at plugins/out_es/es.c:787
#4  0x55bfa7b7e40e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#5  0x55bfa807cf66      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

GuillaumeSmaha added a commit to GuillaumeSmaha/fluent-bit that referenced this issue Dec 9, 2021
In some case, ES_BULK_CHUNK is not enough to increase size of the
buffer

Fix fluent#4412

Signed-off-by: Guillaume Smaha <guillaume.smaha@gmail.com>

edsiper pushed a commit that referenced this issue Dec 9, 2021

fhitchen commented:

I have tried fluent-bit versions 1.7.x and 1.8.x, and they all crash when using the tail input and elastic output with heavy JSON-formatted logs. I tried the change in commit 0a061fb; fluent-bit was a bit more stable but still crashed. I have gone back to version 1.6.10, which "just runs".

I have noticed another issue with the 1.8.x series: it continually logs errors on writes to elastic, but the "status" element in the elastic response only ever shows 201, which means "document created" and is not an error. When I see errors reported with 1.6, I always see a real error status in the response, such as a 429. I think the elastic response handling has also been broken.
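
For reference, an Elasticsearch bulk response where every item succeeded has roughly the following shape (illustrative values, reusing the index and _id from the backtrace above); a per-item "status" of 201 together with "errors": false indicates the document was created, not a failure:

    {
      "took": 30,
      "errors": false,
      "items": [
        {
          "create": {
            "_index": "logstash-2021.12.08",
            "_type": "_doc",
            "_id": "a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f",
            "status": 201
          }
        }
      ]
    }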

edsiper (Member) commented Dec 14, 2021

@fhitchen if you have a simple repro case for the crash, please provide the payload and config used.

0Delta pushed a commit to 0Delta/fluent-bit that referenced this issue Jan 20, 2022

mimmus commented Jan 18, 2023

What's the status of this issue?
