Bug Report
Describe the bug
We have been seeing an intermittent halt on Fluent Bit versions 1.2.0 through 1.6.6 across 4 independent and separately managed clusters over the past 2 weeks. While I suspect Fluent Bit in this case is reacting to upstream pressure such as latency or network retries, the process appears to halt while it is cleaning up keepalive connections in this environment
Functional Log: this seems like a normal filter and flush to the upstream client
[2020/12/02 05:02:31] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:31] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e8ce8a0 id=1 OK
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e8ce9e0 id=356 OK
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e8ceb20 id=357 OK
--- snip ---
[2020/12/02 05:02:31] [debug] [task] created task=0x7f452788b760 id=434 OK
[2020/12/02 05:02:31] [debug] [task] created task=0x7f452788b9e0 id=437 OK
[2020/12/02 05:02:31] [debug] [upstream] KA connection #432 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:31] [debug] [upstream] KA connection #434 is in a failed state to: <snip>:8088, cleaning up
--- snip ---
[2020/12/02 05:02:31] [debug] [upstream] KA #465 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:31] [debug] [upstream] KA connection #498 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:31] [debug] [task] created task=0x7f454e918f00 id=438 OK
[2020/12/02 05:02:31] [debug] [input:tail:tail.0] inode=3648827372 events: IN_MODIFY
[2020/12/02 05:02:31] [debug] [task] destroy task=0x7f4550c55a00 (task_id=3)
[2020/12/02 05:02:31] [debug] [http_client] not using http_proxy for header
[2020/12/02 05:02:31] [debug] [http_client] header=POST /services/collector/event HTTP/1.1
Host: <snip>:8088
Content-Length: 1501
User-Agent: Fluent-Bit
Authorization: Splunk <snip>
Halt point: this follows the same path as the normal flush sequence up until the upstream cleanup
[2020/12/02 05:02:32] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:32] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:32] [debug] [filter:modify:modify.3] Input map size 1 elements, output map size 2 elements
[2020/12/02 05:02:32] [debug] [task] created task=0x7f454e91a620 id=3 OK
[2020/12/02 05:02:32] [debug] [task] created task=0x7f454e91a6c0 id=439 OK
--- snip ---
[2020/12/02 05:02:32] [debug] [task] created task=0x7f4527524f80 id=516 OK
[2020/12/02 05:02:32] [debug] [task] created task=0x7f45275250c0 id=517 OK
[2020/12/02 05:02:32] [debug] [upstream] KA connection #328 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:32] [debug] [upstream] KA connection #329 is in a failed state to: <snip>:8088, cleaning up
--- snip ---
[2020/12/02 05:02:32] [debug] [upstream] KA connection #410 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 05:02:32] [debug] [upstream] KA connection #464 is in a failed state to: <snip>:8088, cleaning up
Here is what strace shows happening around that time, when the process lands in an endless [pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0 loop
[pid 52876] socket(PF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 444
[pid 52876] connect(444, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.25.10")}, 16) = 0
[pid 52876] poll([{fd=444, events=POLLOUT}], 1, 0) = 1 ([{fd=444, revents=POLLOUT}])
[pid 52876] sendmmsg(444, {{{msg_name(0)=NULL, msg_iov(1)=[{"\250U\1\0\0\1\0\0\0\0\0\0\<snip>"..., 52}], msg_controllen=0, msg_flags=0}, 52}, {{msg_name(0)=NULL, msg_iov(1)=[{"8X\1\0\0\1\0\0\0\0\0\0\<snip>"..., 52}], msg_controllen=0, msg_flags=0}, 52}}, 2, MSG_NOSIGNAL) = 2
[pid 52876] poll([{fd=444, events=POLLIN}], 1, 5000) = 1 ([{fd=444, revents=POLLIN}])
[pid 52876] ioctl(444, FIONREAD, [138]) = 0
[pid 52876] recvfrom(444, "8X\205\200\0\1\0\0\0\1\0\0\<snip>"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.25.10")}, [16]) = 138
[pid 52876] poll([{fd=444, events=POLLIN}], 1, 4999) = 1 ([{fd=444, revents=POLLIN}])
[pid 52876] ioctl(444, FIONREAD, [102]) = 0
[pid 52876] recvfrom(444, "\250U\205\200\0\1\0\1\0\0\0\0\<snip>"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.25.10")}, [16]) = 102
[pid 52876] close(444) = 0
[pid 52876] connect(443, {sa_family=AF_INET, sin_port=htons(8088), sin_addr=inet_addr("<snip>")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 52876] epoll_ctl(8, EPOLL_CTL_ADD, 443, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=263755968, u64=140630377928896}}) = 0
[pid 52876] write(301, "\0@}\20\2\0\0\0", 8 <unfinished ...>
[pid 52799] <... nanosleep resumed> 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0
[pid 52799] nanosleep({1, 0}, 0x7fff9c6cb340) = 0
The stack at that point
(gdb) bt
#0 0x00007f8178bb54a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x0000563a8ef92391 in flb_output_return (ret=1, th=0x7f816059a2c0) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:570
#2 0x0000563a8ef92479 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:601
#3 0x0000563a8ef92601 in cb_counter_flush (data=0x7f8146a05800, bytes=101903,
tag=0x7f81760965e0 "<snip>-95f3076c5e07e6d1286eb15c0bae601626cd9234907c50a730286c"..., tag_len=214, i_ins=0x7f8177287180, out_context=0x7f81760470f8, config=0x7f8177232c40) at /tmp/fluent-bit/plugins/out_counter/counter.c:78
#4 0x0000563a8ef3550f in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:472
#5 0x0000563a8f389907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6 0xac6c6174656d6572 in ?? ()
#7 0x74736f486562754b in ?? ()
<snip>
#9 0x6862756b31703430 in ?? ()
#174 0xa982736c6562616c in ?? ()
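My read of the strace plus this backtrace is that the output coroutine is stuck in a blocking write() to an internal pipe/event channel that nothing is draining, while the main thread keeps spinning in the 1-second nanosleep loop. That is an assumption on my part; I have not confirmed it in the Fluent Bit source. The standalone C sketch below only reproduces that generic pattern (a writer hanging on a full pipe while another thread sleeps), it is not Fluent Bit code:

/* build: cc -pthread repro.c  (repro.c is just a hypothetical name) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

static int pipefd[2];

static void *sleeper(void *arg)
{
    (void) arg;
    /* Stands in for pid 52799: sleeps forever, never drains the pipe,
     * showing up in strace as an endless nanosleep({1, 0}, ...) loop */
    for (;;) {
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    char buf[4096];
    ssize_t n;

    memset(buf, 'x', sizeof(buf));

    /* Blocking pipe; default capacity is typically 64 KiB on Linux */
    if (pipe(pipefd) != 0) {
        perror("pipe");
        return 1;
    }

    pthread_create(&tid, NULL, sleeper, NULL);

    /* Stands in for pid 52876: keeps writing while nothing reads.
     * Once the pipe buffer fills, write() blocks indefinitely and this
     * thread never makes progress again, matching the
     * "write(301, ..., 8 <unfinished ...>" seen in the strace. */
    for (;;) {
        n = write(pipefd[1], buf, sizeof(buf));
        if (n < 0) {
            perror("write");
            return 1;
        }
        fprintf(stderr, "wrote %zd bytes\n", n);
    }
    return 0;
}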
This is the SIGCONT debug data
[2020/12/02 06:51:39] [debug] [upstream] KA connection #330 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 06:51:39] [debug] [upstream] KA connection #333 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 06:51:39] [debug] [upstream] KA connection #329 is in a failed state to: <snip>:8088, cleaning up
[2020/12/02 06:52:31] [engine] caught signal (SIGINT)
[2020/12/02 06:52:31] [ info] [input] pausing tail.0
[2020/12/02 06:52:31] [ info] [input:tail:tail.0] flushing pending docker mode data...
[2020/12/02 06:52:31] [ info] [input] pausing systemd.1
[2020/12/02 06:54:03] [engine] caught signal (SIGCONT)
[2020/12/02 06:54:03] Fluent Bit Dump
===== Input =====
tail.0 (tail)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 62.9M (65989114 bytes)
│ └─ mem limit : 0b (0 bytes)
│
├─ tasks
│ ├─ total tasks : 479
│ ├─ new : 30
│ ├─ running : 449
│ └─ size : 62.9M (65989114 bytes)
│
└─ chunks
└─ total chunks : 479
├─ up chunks : 479
├─ down chunks: 0
└─ busy chunks: 479
├─ size : 62.9M (65989114 bytes)
└─ size err: 0
systemd.1 (systemd)
│
├─ status
│ └─ overlimit : no
│ ├─ mem size : 526.2K (538861 bytes)
│ └─ mem limit : 0b (0 bytes)
│
├─ tasks
│ ├─ total tasks : 6
│ ├─ new : 0
│ ├─ running : 6
│ └─ size : 526.2K (538861 bytes)
│
└─ chunks
└─ total chunks : 6
├─ up chunks : 6
├─ down chunks: 0
└─ busy chunks: 6
├─ size : 526.2K (538861 bytes)
└─ size err: 0
===== Storage Layer =====
total chunks : 485
├─ mem chunks : 485
└─ fs chunks : 0
├─ up : 0
└─ down : 0
Secondary Bug?
In an attempt to avoid this I have tried net.keepalive false in my config, per this doc https://docs.fluentbit.io/manual/administration/networking#concepts, but it does not seem to disable the feature, as possibly also noted in #2550. I also tried KeepAlive On as noted in #2189, but that too does not seem to change the behavior of the error or the connection usage (see the placement sketch below)
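For reference, a minimal sketch of where that option sits in the output section, using the On/Off spelling from the networking doc rather than the false I used in the full config further down; this is only to illustrate the documented placement, not a confirmed workaround:

[OUTPUT]
    Name            splunk
    Match           *
    Host            ${SPLUNK_HOST}
    Port            8088
    Splunk_Token    ${SPLUNK_TOKEN}
    # per-output network option from the networking doc;
    # Off should disable keepalive according to that page
    net.keepalive   Off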
Your Environment
Kubernetes, multiple versions, deployed with various Helm charts; the latest testing was across 4 clusters running 1.6.6 with the config below
fluent-bit.conf: |
[SERVICE]
HTTP_Server On
Config_Watch Off
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
Flush 1
Daemon Off
Log_Level trace
Parsers_File parsers.conf
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Buffer_Chunk_Size 4k
Buffer_Max_Size 1G
DB.Sync Normal
Docker_Mode On
Docker_Mode_Flush 4
[INPUT]
Name systemd
Tag host.*
Path /var/log/journal
DB /var/log/flb_systemd.db
Strip_Underscores true
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Systemd_Filter _SYSTEMD_UNIT=docker.service
Systemd_Filter _SYSTEMD_UNIT=containerd.service
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Merge_Log On
Keep_Log Off
[FILTER]
Name modify
Match *
Set LoggingAgent fluent-bit
Set KubeClusterName ${CLUSTER_NAME}
Set KubeClusterAbbreviation ${CLUSTER_NAME}
Set KubeClusterRegion ${CLUSTER_REGION}
Set KubeHostname ${HOST_NAME}
Set KubeHostIP ${HOST_IP}
[FILTER]
Name nest
Match *
Operation nest
Wildcard *
Nest_under event
[FILTER]
Name modify
Match *
Set index ${SPLUNK_INDEX}
[OUTPUT]
Name counter
Match *
[OUTPUT]
Name splunk
Match *
Host ${SPLUNK_HOST}
Splunk_Token ${SPLUNK_TOKEN}
Port 8088
net.keepalive false
TLS Off
TLS.Verify Off
Splunk_Send_Raw On
parsers.conf: |
[PARSER]
Name apache
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache2
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache_error
Format regex
Regex ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name syslog
Format regex
Regex ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
[PARSER]
Name json-test
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name docker
Format json
Time_Keep On
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Startup Log
Fluent Bit v1.6.6
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2020/12/02 07:09:28] [ info] Configuration:
[2020/12/02 07:09:28] [ info] flush time | 1.000000 seconds
[2020/12/02 07:09:28] [ info] grace | 5 seconds
[2020/12/02 07:09:28] [ info] daemon | 0
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info] inputs:
[2020/12/02 07:09:28] [ info] tail
[2020/12/02 07:09:28] [ info] systemd
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info] filters:
[2020/12/02 07:09:28] [ info] kubernetes.0
[2020/12/02 07:09:28] [ info] modify.1
[2020/12/02 07:09:28] [ info] nest.2
[2020/12/02 07:09:28] [ info] modify.3
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info] outputs:
[2020/12/02 07:09:28] [ info] counter.0
[2020/12/02 07:09:28] [ info] splunk.1
[2020/12/02 07:09:28] [ info] ___________
[2020/12/02 07:09:28] [ info] collectors:
[2020/12/02 07:09:28] [ info] [engine] started (pid=1)
[2020/12/02 07:09:28] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/12/02 07:09:28] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/12/02 07:09:28] [debug] [storage] [cio stream] new stream registered: systemd.1
[2020/12/02 07:09:28] [ info] [storage] version=1.0.6, initializing...
[2020/12/02 07:09:28] [ info] [storage] in-memory
[2020/12/02 07:09:28] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/12/02 07:09:28] [debug] [input:tail:tail.0] inotify watch fd=27
[2020/12/02 07:09:28] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log
Additional context
These are large clusters with a variety of node sizes, running between 24 and 72 cores, 50 to 500 GB of memory, and between 90 and 200 pods per node at a time
No common indications of system or cgroup pressure are seen across the nodes hitting this halt, and even systems with fewer than 100 pods have been observed entering this loop, though fewer pods seem to let it run longer before dying
Nodes have on average 3 to 5 GB of logs on disk in /var/log/containers
Does not seem related to #2110 or #1108, as I see Fluent Bit following log rotations successfully
This originally presented in issue #2661 with the following stack
(gdb) bt
#0 0x00005630b94bea6f in create_conn (u=0x7f511f285620) at /tmp/fluent-bit/src/flb_upstream.c:158
#1 0x00005630b94becdd in flb_upstream_conn_get (u=0x7f511f285620) at /tmp/fluent-bit/src/flb_upstream.c:236
#2 0x00005630b95206e9 in cb_splunk_flush (data=0x7f50f678e980, bytes=127312,
tag=0x7f50f75d2320 "<snip>faba71d739f58eed031b45fe84fbebda35c1a8cc4851de4435eefb559df057f5.log", tag_len=146,
i_ins=0x7f511f260180, out_context=0x7f511f22a8d0, config=0x7f511f232000) at /tmp/fluent-bit/plugins/out_splunk/splunk.c:169
#3 0x00005630b94baa5d in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:371
#4 0x00005630b98980a7 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#5 0x0000000000000000 in ?? ()
(gdb)
After applying the fix in 1.6.6, as well as an experimental build I made before its release with the patch applied on top of the 1.6.5 code, we saw the same halt pattern
I have run this with the original config but with the output swapped to a counter only (see the snippet at the end of this section), which progressed fine through the read and write pumps
The system logs at the expected rate for what appears to be at least 1 or 2 flushes on the host before shutting down, roughly 30 seconds of runtime
No OOM, CPU, or other issues were encountered or logged in any of the containers
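For clarity, the counter-only isolation test mentioned above simply replaces the two output blocks with the counter output, with everything else left exactly as in the config above (a sketch of the swap):

[OUTPUT]
    Name    counter
    Match   *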