Halt at output_pre_cb_flush within connection cleaning up #2826

@bigangryrobot

Description

Bug Report

Describe the bug
We have been seeing an intermittent halt on Fluent Bit versions 1.2.0 through 1.6.6 across 4 independent, separately managed clusters over the past 2 weeks. While I suspect that Fluent Bit is reacting to upstream pressure such as latency or network retries, the process appears to halt while it is cleaning up connections in this environment.

Functional log: this appears to be a normal filter and flush to the upstream client

[2020/12/02 05:02:31] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:31] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e8ce8a0 id=1 OK
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e8ce9e0 id=356 OK
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e8ceb20 id=357 OK
--- snip ---
[2020/12/02 05:02:31] [debug] [task] created task=0x7f452788b760 id=434 OK
[2020/12/02 05:02:31] [debug] [task] created task=0x7f452788b9e0 id=437 OK
[2020/12/02 05:02:31] [debug] [upstream] KA connection #432 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:31] [debug] [upstream] KA connection #434 is in a failed state to: <snip>:8088, cleaning up
--- snip ---
[2020/12/02 05:02:31] [debug] [upstream] KA #465 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:31] [debug] [upstream] KA connection #498 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e918f00 id=438 OK
[2020/12/02 05:02:31] [debug] [input:tail:tail.0] inode=3648827372 events: IN_MODIFY
[2020/12/02 05:02:31] [debug] [task] destroy task=0x7f4550c55a00 (task_id=3)
[2020/12/02 05:02:31] [debug] [http_client] not using http_proxy for header
[2020/12/02 05:02:31] [debug] [http_client] header=POST /services/collector/event HTTP/1.1
Host: <snip>:8088
Content-Length: 1501
User-Agent: Fluent-Bit
Authorization: Splunk <snip>

Halt point: this follows the normal flush sequence up until the upstream cleanup

[2020/12/02 05:02:32] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:32] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:32] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:32] [debug] [task] created task=0x7f454e91a620 id=3 OK
[2020/12/02 05:02:32] [debug] [task] created task=0x7f454e91a6c0 id=439 OK
--- snip ---
[2020/12/02 05:02:32] [debug] [task] created task=0x7f4527524f80 id=516 OK
[2020/12/02 05:02:32] [debug] [task] created task=0x7f45275250c0 id=517 OK
[2020/12/02 05:02:32] [debug] [upstream] KA connection #328 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:32] [debug] [upstream] KA connection #329 is in a failed state to: <snip>:8088, cleaning up
--- snip ---
[2020/12/02 05:02:32] [debug] [upstream] KA connection #410 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:32] [debug] [upstream] KA connection #464 is in a failed state to: <snip>:8088, cleaning up

And here is what strace shows happening around that time, when the process lands in an endless `[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0` loop:

[pid 52876] socket(PF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 444
[pid 52876] connect(444, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.25.10")}, 16) = 0
[pid 52876] poll([{fd=444, events=POLLOUT}], 1, 0) = 1 ([{fd=444, revents=POLLOUT}])
[pid 52876] sendmmsg(444, {{{msg_name(0)=NULL, msg_iov(1)=[{"\250U\1\0\0\1\0\0\0\0\0\0\<snip>"..., 52}], msg_controllen=0, msg_flags=0}, 52}, {{msg_name(0)=NULL, msg_iov(1)=[{"8X\1\0\0\1\0\0\0\0\0\0\<snip>"..., 52}], msg_controllen=0, msg_flags=0}, 52}}, 2, MSG_NOSIGNAL) = 2
[pid 52876] poll([{fd=444, events=POLLIN}], 1, 5000) = 1 ([{fd=444, revents=POLLIN}])
[pid 52876] ioctl(444, FIONREAD, [138]) = 0
[pid 52876] recvfrom(444, "8X\205\200\0\1\0\0\0\1\0\0\<snip>"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.25.10")}, [16]) = 138
[pid 52876] poll([{fd=444, events=POLLIN}], 1, 4999) = 1 ([{fd=444, revents=POLLIN}])
[pid 52876] ioctl(444, FIONREAD, [102]) = 0
[pid 52876] recvfrom(444, "\250U\205\200\0\1\0\1\0\0\0\0\<snip>"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.25.10")}, [16]) = 102
[pid 52876] close(444)                  = 0
[pid 52876] connect(443, {sa_family=AF_INET, sin_port=htons(8088), sin_addr=inet_addr("<snip>")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 52876] epoll_ctl(8, EPOLL_CTL_ADD, 443, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=263755968, u64=140630377928896}}) = 0
[pid 52876] write(301, "\0@}\20\2\0\0\0", 8 <unfinished ...>
[pid 52799] <... nanosleep resumed> 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0

The stack at that point

(gdb) bt
#0  0x00007f8178bb54a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000563a8ef92391 in flb_output_return (ret=1, th=0x7f816059a2c0) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:570
#2  0x0000563a8ef92479 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:601
#3  0x0000563a8ef92601 in cb_counter_flush (data=0x7f8146a05800, bytes=101903,
    tag=0x7f81760965e0 "<snip>-95f3076c5e07e6d1286eb15c0bae601626cd9234907c50a730286c"..., tag_len=214, i_ins=0x7f8177287180, out_context=0x7f81760470f8, config=0x7f8177232c40) at /tmp/fluent-bit/plugins/out_counter/counter.c:78
#4  0x0000563a8ef3550f in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:472
#5  0x0000563a8f389907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0xac6c6174656d6572 in ?? ()
#7  0x74736f486562754b in ?? ()
<snip>
#9  0x6862756b31703430 in ?? ()
#174 0xa982736c6562616c in ?? ()

This is the SIGCONT debug data

[2020/12/02 06:51:39] [debug] [upstream] KA connection #330 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 06:51:39] [debug] [upstream] KA connection #333 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 06:51:39] [debug] [upstream] KA connection #329 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 06:52:31] [engine] caught signal (SIGINT)
[2020/12/02 06:52:31] [ info] [input] pausing tail.0
[2020/12/02 06:52:31] [ info] [input:tail:tail.0] flushing pending docker mode data...
[2020/12/02 06:52:31] [ info] [input] pausing systemd.1
[2020/12/02 06:54:03] [engine] caught signal (SIGCONT)
[2020/12/02 06:54:03] Fluent Bit Dump

===== Input =====
tail.0 (tail)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 62.9M (65989114 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 479
│  ├─ new           : 30
│  ├─ running       : 449
│  └─ size          : 62.9M (65989114 bytes)
│
└─ chunks
   └─ total chunks  : 479
      ├─ up chunks  : 479
      ├─ down chunks: 0
      └─ busy chunks: 479
         ├─ size    : 62.9M (65989114 bytes)
         └─ size err: 0

systemd.1 (systemd)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 526.2K (538861 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 6
│  ├─ new           : 0
│  ├─ running       : 6
│  └─ size          : 526.2K (538861 bytes)
│
└─ chunks
   └─ total chunks  : 6
      ├─ up chunks  : 6
      ├─ down chunks: 0
      └─ busy chunks: 6
         ├─ size    : 526.2K (538861 bytes)
         └─ size err: 0


===== Storage Layer =====
total chunks     : 485
├─ mem chunks    : 485
└─ fs chunks     : 0
   ├─ up         : 0
   └─ down       : 0

Secondary bug?
As an attempt to avoid this I tried `net.keepalive false` in my config per this doc https://docs.fluentbit.io/manual/administration/networking#concepts, but it does not seem to disable the feature, as possibly also noted in #2550. I also tried `KeepAlive On` as noted in #2189, but that too does not seem to change the behavior of the error or the connection usage.
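For reference, this is where the setting was placed (the networking doc documents the value as On/Off; I used the boolean spelling `false`, which may or may not be parsed equivalently):

```
[OUTPUT]
    Name           splunk
    Match          *
    Host           ${SPLUNK_HOST}
    Port           8088
    # documented form is "off"; "false" was what I actually tried
    net.keepalive  off
```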

Your Environment
Kubernetes, multiple versions, deployed via various Helm charts; the latest testing across 4 clusters uses 1.6.6 with the config below:

  fluent-bit.conf: |
    [SERVICE]
        HTTP_Server    On
        Config_Watch   Off
        HTTP_Listen    0.0.0.0
        HTTP_PORT      2020
        Flush          1
        Daemon         Off
        Log_Level      trace
        Parsers_File   parsers.conf
    [INPUT]
        Name                tail
        Tag                 kube.*
        Path                /var/log/containers/*.log
        Parser              docker
        DB                  /var/log/flb_kube.db
        Buffer_Chunk_Size   4k
        Buffer_Max_Size     1G
        DB.Sync             Normal
        Docker_Mode         On
        Docker_Mode_Flush   4
    [INPUT]
        Name               systemd
        Tag                host.*
        Path               /var/log/journal
        DB                 /var/log/flb_systemd.db
        Strip_Underscores  true
        Systemd_Filter     _SYSTEMD_UNIT=kubelet.service
        Systemd_Filter     _SYSTEMD_UNIT=docker.service
        Systemd_Filter     _SYSTEMD_UNIT=containerd.service
    [FILTER]
        Name           kubernetes
        Match          kube.*
        Kube_URL       https://kubernetes.default.svc:443
        Merge_Log On
        Keep_Log Off
    [FILTER]
        Name modify
        Match *
        Set LoggingAgent fluent-bit
        Set KubeClusterName ${CLUSTER_NAME}
        Set KubeClusterAbbreviation ${CLUSTER_NAME}
        Set KubeClusterRegion ${CLUSTER_REGION}
        Set KubeHostname ${HOST_NAME}
        Set KubeHostIP ${HOST_IP}
    [FILTER]
        Name        nest
        Match       *
        Operation   nest
        Wildcard    *
        Nest_under  event
    [FILTER]
        Name modify
        Match *
        Set index ${SPLUNK_INDEX}
    [OUTPUT]
        Name counter
        Match *
    [OUTPUT]
        Name splunk
        Match *
        Host ${SPLUNK_HOST}
        Splunk_Token ${SPLUNK_TOKEN}
        Port 8088
        net.keepalive false
        TLS Off
        TLS.Verify Off
        Splunk_Send_Raw On
  parsers.conf: |
    [PARSER]
        Name        apache
        Format      regex
        Regex       ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        apache2
        Format      regex
        Regex       ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        apache_error
        Format      regex
        Regex       ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

    [PARSER]
        Name        nginx
        Format      regex
        Regex       ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        syslog
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S

    [PARSER]
        Name        json-test
        Format      json
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Keep   On
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L

Startup Log

Fluent Bit v1.6.6
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/12/02 07:09:28] [ info] Configuration:
[2020/12/02 07:09:28] [ info]  flush time     | 1.000000 seconds
[2020/12/02 07:09:28] [ info]  grace          | 5 seconds
[2020/12/02 07:09:28] [ info]  daemon         | 0
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info]  inputs:
[2020/12/02 07:09:28] [ info]      tail
[2020/12/02 07:09:28] [ info]      systemd
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info]  filters:
[2020/12/02 07:09:28] [ info]      kubernetes.0
[2020/12/02 07:09:28] [ info]      modify.1
[2020/12/02 07:09:28] [ info]      nest.2
[2020/12/02 07:09:28] [ info]      modify.3
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info]  outputs:
[2020/12/02 07:09:28] [ info]      counter.0
[2020/12/02 07:09:28] [ info]      splunk.1
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info]  collectors:
[2020/12/02 07:09:28] [ info] [engine] started (pid=1)
[2020/12/02 07:09:28] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/12/02 07:09:28] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/12/02 07:09:28] [debug] [storage] [cio stream] new stream registered: systemd.1
[2020/12/02 07:09:28] [ info] [storage] version=1.0.6, initializing...
[2020/12/02 07:09:28] [ info] [storage] in-memory
[2020/12/02 07:09:28] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/12/02 07:09:28] [debug] [input:tail:tail.0] inotify watch fd=27
[2020/12/02 07:09:28] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log

Additional context
These are large clusters on a variety of node sizes, running 24 to 72 cores, 50 to 500 GB of memory, and 90 to 200 pods per node at a time.

No common indications of system or cgroup pressure are seen across the nodes hitting this halt, and even nodes with fewer than 100 pods have been observed entering this loop, though fewer pods seems to mean it runs longer before dying.

Nodes have on average 3 to 5 GB of logs on disk in /var/log/containers.

This does not seem related to #2110 or #1108, as I see Fluent Bit following log rotations successfully.

This originally presented in issue #2661 with the following stack:

(gdb) bt
#0 0x00005630b94bea6f in create_conn (u=0x7f511f285620) at /tmp/fluent-bit/src/flb_upstream.c:158
#1 0x00005630b94becdd in flb_upstream_conn_get (u=0x7f511f285620) at /tmp/fluent-bit/src/flb_upstream.c:236
#2 0x00005630b95206e9 in cb_splunk_flush (data=0x7f50f678e980, bytes=127312,
tag=0x7f50f75d2320 "<snip>faba71d739f58eed031b45fe84fbebda35c1a8cc4851de4435eefb559df057f5.log", tag_len=146,
i_ins=0x7f511f260180, out_context=0x7f511f22a8d0, config=0x7f511f232000) at /tmp/fluent-bit/plugins/out_splunk/splunk.c:169
#3 0x00005630b94baa5d in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:371
#4 0x00005630b98980a7 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#5 0x0000000000000000 in ?? ()
(gdb)

After applying the fix in 1.6.6, as well as an experimental build I made before its release with the patch applied over the 1.6.5 code, we saw the same halt pattern.

I have run this with the original config and with the output swapped to be a counter only, which progressed fine through the read and write pumps.

The system logs at the expected rate for what appears to be at least 1 or 2 flushes on the host before halting, about 30 seconds of runtime.

No OOM, CPU, or other issues were encountered or logged in any of the containers.
