caught signal (SIGSEGV) with output elasticsearch #4412

Closed · GuillaumeSmaha opened this issue Dec 9, 2021 · 4 comments · Fixed by #4414

Comments

GuillaumeSmaha (Contributor) commented Dec 9, 2021

Bug Report

Describe the bug
I am seeing a SEGV after a few minutes on some pods, and I think it is related to PR #3788.
In some cases, append_size is still smaller than required, which means memcpy writes outside the buffer and can cause a SEGV.
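
For illustration, here is a minimal sketch of growth logic that restores the invariant that at least `required` bytes are free before the memcpy calls. The helper name bulk_ensure_space is hypothetical, and the struct fields, flb_realloc()/flb_errno() and ES_BULK_CHUNK are assumed to match es_bulk.c as shown in the backtrace; this is a sketch, not the actual patch merged in #4414:

    /* sketch, not the upstream patch: grow the bulk buffer so that at
     * least `required` bytes are free; growing by a fixed ES_BULK_CHUNK
     * is not enough when a single record is larger than the chunk */
    static int bulk_ensure_space(struct es_bulk *bulk, size_t required)
    {
        size_t available = bulk->size - bulk->len;
        size_t append_size;
        char *ptr;

        if (available >= required) {
            return 0;
        }
        /* grow by at least the missing amount, never less than a chunk */
        append_size = required - available;
        if (append_size < ES_BULK_CHUNK) {
            append_size = ES_BULK_CHUNK;
        }
        ptr = flb_realloc(bulk->ptr, bulk->size + append_size);
        if (!ptr) {
            flb_errno();
            return -1;
        }
        bulk->ptr   = ptr;
        bulk->size += append_size;
        return 0;
    }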

On version 1.8.10, I added a log to check for this case (GuillaumeSmaha@5b908dd), and I hit that log a few times with a SEGV right after:

[2021/12/08 11:55:05] [debug] [out_es] converted_size is 0
[2021/12/08 11:55:05] [debug] [out_es] required size (197133, append: 4096) is still superior to available buffer (37999, previous available: 33903)

(Here the buffer was grown by ES_BULK_CHUNK = 4096 bytes, from 33903 to 37999 bytes available, but the incoming record needs 197133 bytes, so the following memcpy runs past the end of the buffer.)

Thread 2 "flb-pipeline" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff61ff700 (LWP 45)]
__memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
262	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
#1  0x00005555556fc274 in es_bulk_append (bulk=0x7ffff5400d60, 
    index=0x7ffff4ac0a40 "{\"create\":{\"_index\":\"logstash-2021.12.08\",\"_type\":\"_doc\",\"_id\":\"a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f\"}}\n", i_len=104, 
    json=0x7ffff253fad0 "{\"@timestamp\":\"2021-12-08T11:54:07.027Z\",\"time\":\"2021-12-08T11:54:07.027621077Z\",\"stream\":\"stdout\",\"_p\":\"P\",\"log\":\"\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\"..., j_len=196967, whole_size=33903, converted_size=0)
    at /tmpdir/fluent-bit/plugins/out_es/es_bulk.c:111
#2  0x00005555556f9af9 in elasticsearch_format (config=0x7ffff6632f80, 
    ins=0x7ffff6648f00, plugin_context=0x7ffff54a9600, flush_ctx=0x0, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, data=0x7ffff2341f00, bytes=33903, out_data=0x7ffff4ac0f60, 
    out_size=0x7ffff4ac0f58) at /tmpdir/fluent-bit/plugins/out_es/es.c:537
#3  0x00005555556fa8c5 in cb_es_flush (data=0x7ffff2341f00, bytes=33903, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, ins=0x7ffff6648f00, out_context=0x7ffff54a9600, 
    config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/out_es/es.c:772
#4  0x000055555566b2df in output_pre_cb_flush ()
    at /tmpdir/fluent-bit/include/fluent-bit/flb_output.h:514
#5  0x0000555555b69d47 in co_init ()
    at /tmpdir/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x0000000000000000 in ?? ()
(gdb) 

gdb-fluent-bit-4vljw.log


I also tried reverting the change by using:

    if (available < required) {
        /* grow by more than the missing amount: the new free space is
         * available + append_size, which is always >= required here */
        append_size = available + required + ES_BULK_CHUNK;
        if (append_size < ES_BULK_CHUNK) {
            /* append at least ES_BULK_CHUNK size */
            append_size = ES_BULK_CHUNK;
        }

So far it has been running for 20 hours without a SEGV (it was crashing after a few minutes).

To Reproduce

  • Install the latest Elasticsearch using its Helm chart
  • Use the following configuration for fluent-bit
  • After a few minutes of data being sent by fluent-bit, a segfault happens

Expected behavior
No SEGV occurs.

Your Environment

  • Version used: 1.8.10
  • Configuration:

    [SERVICE]
        Config_Watch  On

        # Flush
        # =====
        # set the interval in seconds before flushing records to a destination
        flush        1

        # Daemon
        # ======
        # instruct Fluent Bit to run in foreground or background mode.
        daemon       Off

        # Log_Level
        # =========
        # Set the verbosity level of the service, values can be:
        #
        # - error
        # - warning
        # - info
        # - debug
        # - trace
        #
        # by default 'info' is set, which means it includes 'error' and 'warning'.
        log_level    trace

        # Parsers File
        # ============
        # specify an optional 'Parsers' configuration file
        parsers_file parsers.conf
        parsers_file  custom_parsers.conf

        # HTTP Server
        # ===========
        # Enable/Disable the built-in HTTP Server for metrics
        http_server  On
        http_listen  0.0.0.0
        http_port    2020

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        # Exclude fluent-bit's own logs to avoid a feedback loop
        Exclude_Path      ["*fluent-bit*.log"]
        DB                /var/log/fluentbit/flb_kube.db
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        # Parser for containerd (Container Runtime Interface)
        multiline.parser  cri
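        # For reference, a CRI log line looks like this (illustrative
        # sample, not taken from the incident):
        #   2021-12-08T11:54:07.027621077Z stdout P some partial output
        # which is the source of the time/stream/_p/log keys visible in
        # the crash payload above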

    [INPUT]
        Name              tail
        Tag               system.*
        Path              /var/log/syslog
        Parser            rsyslog
        DB                /var/log/fluentbit/flb_syslog.db
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Refresh_Interval  10

  ## https://docs.fluentbit.io/manual/pipeline/filters
  filters: |
    [FILTER]
        Name           kubernetes
        Match          kube.*
        Kube_URL       https://kubernetes.default.svc:443

    [OUTPUT]
        Name            es
        Match           *
        Host            elasticsearch-master
        Port            9200
        tls             Off
        tls.verify      Off
        Generate_ID     On
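        # Note: Generate_ID generates an _id for each outgoing record
        # (the "_id" visible in the bulk headers above), which prevents
        # duplicate documents when a flush is retried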
        Logstash_Format On

  #  [OUTPUT]
  #      Name            stdout
  #      Match           *

    [PARSER]
        Name        rsyslog
        Format      regex
        Regex       (?<syslog_time>^[A-Za-z]+ [ 1-3][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<log>.*)$
        Time_Key    syslog_time
        Time_Format %b %d %H:%M:%S
        Time_Keep   On
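        # Example line matched by the Regex above (hypothetical values):
        #   Dec  9 11:36:48 node01 sshd[1234]: Accepted publickey for ubuntu
        # -> syslog_time="Dec  9 11:36:48", host="node01", ident="sshd",
        #    pid="1234", log="Accepted publickey for ubuntu"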
  • Environment name and version (e.g. Kubernetes? What version?):
    Kubernetes 1.21.1
  • Server type and version: Bare-metal
  • Operating System and version:
    Ubuntu 18.04 or Ubuntu 20.04
  • Filters and plugins: kubernetes

GuillaumeSmaha (Contributor, Author) commented:

I tried 1.8.10 without reverting PR #3788 but with PR #4361 applied, and it does not solve my issue.
I will try to create a test case.

logs:

[2021/12/09 11:36:48] [debug] [out_es] converted_size is 0
[2021/12/09 11:36:48] [debug] [http_client] not using http_proxy for header
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] trying 140 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [fd 108] write_async(2)=140 (140/140)
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] ret=140 total=140/140
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] trying 99169 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [fd 108] write_async(2)=99169 (99169/99169)
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_write] ret=99169 total=99169/99169
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896178] [net_read] try up to 4095 bytes
[2021/12/09 11:36:48] [trace] [engine] resuming coroutine=0x7f63e7896988
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] ret=4095
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] try up to 32768 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] ret=32768
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] try up to 32768 bytes
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] ret=31083
[2021/12/09 11:36:48] [trace] [io coro=0x7f63e7896988] [net_read] try up to 1685 bytes
[2021/12/09 11:36:48] [engine] caught signal (SIGSEGV)
#0  0x7f63ea4a86ff      in  ???() at ultiarch/memmove-vec-unaligned-erms.S:262
#1  0x56262442a98a      in  flb_sds_cat() at src/flb_sds.c:132
#2  0x56262442a9f4      in  flb_sds_cat_safe() at src/flb_sds.c:145
#3  0x56262444e327      in  package_content() at src/multiline/flb_ml.c:324
#4  0x56262444e7d5      in  process_append() at src/multiline/flb_ml.c:495
#5  0x56262444e9f3      in  ml_append_try_parser() at src/multiline/flb_ml.c:564
#6  0x56262444ebfb      in  flb_ml_append() at src/multiline/flb_ml.c:630
#7  0x56262448c661      in  process_content() at plugins/in_tail/tail_file.c:375
#8  0x56262448eb8f      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#9  0x562624486ab4      in  in_tail_collect_pending() at plugins/in_tail/tail.c:91
#10 0x56262442ed96      in  flb_input_collector_fd() at src/flb_input.c:1108
#11 0x5626244419eb      in  flb_engine_handle_event() at src/flb_engine.c:411
#12 0x5626244419eb      in  flb_engine_start() at src/flb_engine.c:696
#13 0x56262442383e      in  flb_lib_worker() at src/flb_lib.c:627
#14 0x7f63ead64fa2      in  start_thread() at /glibc-2.28/nptl/pthread_create.c:486
#15 0x7f63ea4454ce      in  ???() at v/linux/x86_64/clone.S:95
#16 0xffffffffffffffff  in  ???() at ???:0

or

[2021/12/09 11:41:43] [engine] caught signal (SIGSEGV)
#0  0x55bc9121adac      in  atomic_load_p() at lib/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#1  0x55bc9121adac      in  rtree_leaf_elm_bits_read() at lib/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:175
#2  0x55bc9121adac      in  rtree_szind_slab_read() at lib/jemalloc-5.2.1/include/jemalloc/internal/rtree.h:500
#3  0x55bc9121adac      in  ifree() at lib/jemalloc-5.2.1/src/jemalloc.c:2570
#4  0x55bc9121adac      in  je_free_default() at lib/jemalloc-5.2.1/src/jemalloc.c:2790
#5  0x55bc91293583      in  flb_free() at include/fluent-bit/flb_mem.h:122
#6  0x55bc912946b6      in  flb_sds_destroy() at src/flb_sds.c:408
#7  0x55bc912c8f1a      in  flb_kv_item_destroy() at src/flb_kv.c:83
#8  0x55bc912c8f9c      in  flb_kv_release() at src/flb_kv.c:102
#9  0x55bc912cf83d      in  http_headers_destroy() at src/flb_http_client.c:998
#10 0x55bc912d02fd      in  flb_http_client_destroy() at src/flb_http_client.c:1324
#11 0x55bc9132915f      in  cb_es_flush() at plugins/out_es/es.c:885
#12 0x55bc9129940e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#13 0x55bc91797f66      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

or

[2021/12/09 11:45:09] [debug] [out_es] converted_size is 0
[2021/12/09 11:45:09] [engine] caught signal (SIGSEGV)
[2021/12/09 11:45:09] [debug] [http_client] not using http_proxy for header
#0  0x7f6adaf2c1a7      in  ???() at ultiarch/strlen-avx2.S:96
#1  0x56434df09a1c      in  add_host_and_content_length() at src/flb_http_client.c:571
#2  0x56434df0a06a      in  flb_http_client() at src/flb_http_client.c:751
#3  0x56434df63bc6      in  cb_es_flush() at plugins/out_es/es.c:801
#4  0x56434ded440e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#5  0x56434e3d2f66      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

or

[2021/12/09 11:48:13] [engine] caught signal (SIGSEGV)
#0  0x55a017769d1f      in  pcache1FetchNoMutex() at lib/sqlite-amalgamation-3330000/sqlite3.c:50773
#1  0x55a017769d97      in  pcache1Fetch() at lib/sqlite-amalgamation-3330000/sqlite3.c:50829
#2  0x55a0177681c8      in  sqlite3PcacheFetch() at lib/sqlite-amalgamation-3330000/sqlite3.c:49266
#3  0x55a01776f558      in  getPageNormal() at lib/sqlite-amalgamation-3330000/sqlite3.c:57178
#4  0x55a01776f9b5      in  sqlite3PagerGet() at lib/sqlite-amalgamation-3330000/sqlite3.c:57354
#5  0x55a017778ac5      in  getAndInitPage() at lib/sqlite-amalgamation-3330000/sqlite3.c:66663
#6  0x55a01777c77e      in  moveToChild() at lib/sqlite-amalgamation-3330000/sqlite3.c:69593
#7  0x55a01777d7e5      in  btreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70333
#8  0x55a01777d8fd      in  sqlite3BtreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70367
#9  0x55a01777d888      in  btreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70350
#10 0x55a01777d8fd      in  sqlite3BtreePrevious() at lib/sqlite-amalgamation-3330000/sqlite3.c:70367
#11 0x55a017799dec      in  sqlite3VdbeExec() at lib/sqlite-amalgamation-3330000/sqlite3.c:90902
#12 0x55a017790f90      in  sqlite3Step() at lib/sqlite-amalgamation-3330000/sqlite3.c:83691
#13 0x55a0177911eb      in  sqlite3_step() at lib/sqlite-amalgamation-3330000/sqlite3.c:83748
#14 0x55a0175caa56      in  db_file_exists() at plugins/in_tail/tail_db.c:112
#15 0x55a0175cad0c      in  flb_tail_db_file_set() at plugins/in_tail/tail_db.c:181
#16 0x55a0175cf44d      in  set_file_position() at plugins/in_tail/tail_file.c:733
#17 0x55a0175cffb0      in  flb_tail_file_append() at plugins/in_tail/tail_file.c:970
#18 0x55a0175d1985      in  flb_tail_file_rotated() at plugins/in_tail/tail_file.c:1534
#19 0x55a0175c9f07      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:205
#20 0x55a017570d96      in  flb_input_collector_fd() at src/flb_input.c:1108
#21 0x55a0175839eb      in  flb_engine_handle_event() at src/flb_engine.c:411
#22 0x55a0175839eb      in  flb_engine_start() at src/flb_engine.c:696
#23 0x55a01756583e      in  flb_lib_worker() at src/flb_lib.c:627
#24 0x7fcf79bd0fa2      in  start_thread() at /glibc-2.28/nptl/pthread_create.c:486
#25 0x7fcf792b14ce      in  ???() at v/linux/x86_64/clone.S:95
#26 0xffffffffffffffff  in  ???() at ???:0

or

[2021/12/09 11:48:34] [debug] [out_es] converted_size is 0
[2021/12/09 11:48:35] [engine] caught signal (SIGSEGV)
#0  0x7fb2e8ede6ff      in  ???() at ultiarch/memmove-vec-unaligned-erms.S:262
#1  0x55bfa7c0f49d      in  es_bulk_append() at plugins/out_es/es_bulk.c:105
#2  0x55bfa7c0cd63      in  elasticsearch_format() at plugins/out_es/es.c:551
#3  0x55bfa7c0db51      in  cb_es_flush() at plugins/out_es/es.c:787
#4  0x55bfa7b7e40e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#5  0x55bfa807cf66      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

GuillaumeSmaha added a commit to GuillaumeSmaha/fluent-bit that referenced this issue Dec 9, 2021
In some case, ES_BULK_CHUNK is not enough to increase size of the
buffer

Fix fluent#4412

Signed-off-by: Guillaume Smaha <guillaume.smaha@gmail.com>

edsiper pushed a commit that referenced this issue Dec 9, 2021

fhitchen commented:

I have tried fluent-bit versions 1.7.x and 1.8.x, and they all crash when using the tail input and elastic output with heavy JSON-formatted logs. I tried the change in commit 0a061fb; fluent-bit was a bit more stable but still crashed. I have gone back to version 1.6.10, which "just runs".

I have noticed another issue with the 1.8.x series: it continually logs errors on writes to elastic, but the "status" element in the elastic response only ever shows 201, which means "document created" and is not an error. When I see errors reported with 1.6, I always see a real error status in the response, such as a 429. I think the elastic response handling has also been broken.
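
For reference, an Elasticsearch bulk response where every item succeeded has roughly the following shape (illustrative values, reusing the index and _id from the backtrace above); a per-item "status" of 201 together with "errors": false indicates the document was created, not a failure:

    {
      "took": 30,
      "errors": false,
      "items": [
        {
          "create": {
            "_index": "logstash-2021.12.08",
            "_type": "_doc",
            "_id": "a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f",
            "status": 201
          }
        }
      ]
    }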

edsiper (Member) commented Dec 14, 2021

@fhitchen if you have a simple repro case for the crash, please provide the payload and config used.

0Delta pushed a commit to 0Delta/fluent-bit that referenced this issue Jan 20, 2022

mimmus commented Jan 18, 2023

What's the status of this issue?
