fluent-bit with storage enabled completely freezes (due to sockets not being closed properly(?)) #2950

Closed
sjentzsch opened this issue Jan 15, 2021 · 19 comments

@sjentzsch

Bug Report

Describe the bug

Under certain circumstances fluent-bit can completely freeze and stop processing logs. We have seen this behavior for quite some time now, across many pods, appearing with a certain randomness.

We do rely on the storage configuration, and my concern is that this bug can only appear with storage enabled, probably in combination with higher load and a certain (mis?)configuration (see our config below). Unfortunately it is a serious issue for us: storage buffering should make us more resilient to logging outages, but here it is actually the other way round :-)

I could reproduce the issue by running a log generator that emits 200 messages of 1 kB per second while denying all outgoing traffic from the fluent-bit pod. After a while, roughly one out of three times, our pod froze and did not process any more logs at all.
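
For anyone trying to reproduce: one way to deny the outgoing traffic is a node-side iptables drop rule (sketch only; it assumes iptables access on the node, a pod that is not on hostNetwork, and a placeholder pod IP). A Kubernetes NetworkPolicy that denies egress should work just as well.

# drop everything the fluent-bit pod sends out (10.2.131.200 is a placeholder pod IP)
iptables -I FORWARD -s 10.2.131.200 -j DROP
# ... let the retries pile up for a while, then restore connectivity
iptables -D FORWARD -s 10.2.131.200 -j DROP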

With the fluent-bit debug image at hand, I made the following interesting discovery: at some point all sockets end up in the CLOSE_WAIT state. Here is my netstat debugging while it occurred:

/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 fluent-bit-jfz6p:60382  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59954  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59962  dtag-laas.netic.dk:https ESTABLISHED 
[...]
tcp        0      0 fluent-bit-jfz6p:60380  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60054  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60172  dtag-laas.netic.dk:https ESTABLISHED 
tcp    10106      0 fluent-bit-jfz6p:59820  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60470  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60370  dtag-laas.netic.dk:https ESTABLISHED 
tcp      641      0 fluent-bit-jfz6p:59802  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59952  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60068  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:2020   10.2.131.169:40492      TIME_WAIT   
tcp        0      0 fluent-bit-jfz6p:60638  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60292  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      1 fluent-bit-jfz6p:60614  dtag-laas.netic.dk:https SYN_SENT    
tcp        0      0 fluent-bit-jfz6p:60362  dtag-laas.netic.dk:https ESTABLISHED 
tcp    11355      0 fluent-bit-jfz6p:60716  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60308  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59836  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59882  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60586  dtag-laas.netic.dk:https ESTABLISHED 
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node Path
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
199
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
7
/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp       32      0 fluent-bit-jfz6p:60382  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:59954  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        1      0 fluent-bit-jfz6p:59962  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60354  dtag-laas.netic.dk:https CLOSE_WAIT  
[...]
tcp       32      0 fluent-bit-jfz6p:60068  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        0      0 fluent-bit-jfz6p:2020   10.2.131.169:40492      TIME_WAIT   
tcp        1      0 fluent-bit-jfz6p:60292  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        0      0 fluent-bit-jfz6p:60614  dtag-laas.netic.dk:https ESTABLISHED 
tcp       32      0 fluent-bit-jfz6p:59976  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60620  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60362  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp    11387      0 fluent-bit-jfz6p:60716  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        1      0 fluent-bit-jfz6p:60298  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:59882  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60586  dtag-laas.netic.dk:https CLOSE_WAIT  

Note that I stripped a few lines; there were actually around 200 sockets, all in the same respective state.
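
To get the numbers without eyeballing the whole list, a per-state summary can be produced with the BusyBox tools in the debug image (sketch):

# count TCP sockets per state (skip the two netstat header lines)
netstat -tn | awk 'NR>2 {print $6}' | sort | uniq -c
# or just count the half-closed ones
netstat -tn | grep -c CLOSE_WAIT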

While that happens, our /flb-storage/ usually contains a few hundred files that no longer change, since the pod is frozen.

The logs of such an affected pod look like this:

[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632705.229524992.flb', retry in 8 seconds: task_id=37, input=tail.0 > output=es.2
[2021/01/14 13:58:38] [error] [upstream] connection #170 to dtag-laas.netic.dk:443 timed out after 10 seconds
[...]
[2021/01/14 13:58:38] [error] [upstream] connection #1271 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [error] [upstream] connection #896 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632704.928894552.flb', retry in 7 seconds: task_id=18, input=tail.0 > output=es.0
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[...]
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.103450536.flb', retry in 10 seconds: task_id=28, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.26765971.flb', retry in 9 seconds: task_id=24, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604215 watch_fd=97 name=/var/log/containers/admin-management-59b684bf76-7dbqk_staging_admin-management-025be4bb58a5958704f1f80e8ed72322c98402e2f756e71708e4d0c56e5b842c.log
[...]
[2021/01/14 13:58:40] [ warn] [engine] failed to flush chunk '1-1610632705.269473234.flb', retry in 9 seconds: task_id=39, input=tail.0 > output=es.0
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604213 watch_fd=98 name=/var/log/containers/skill-e2e-tests-86598c5668-2qxpg_staging_skill-e2e-tests-e1e165a7ed49643b0e645fe5ad7f111d0e0472d1d38ab9875cf43da7210782b4.log
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=208216295 watch_fd=99 name=/var/log/containers/skill-e2e-tests-birthday-5cd88db447-p6x86_prodshadow_skill-e2e-tests-birthday-99aa9d08d371222835ab92a28fcfe0ea3245459c30efa271db31c854bd8b622d.log
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632706.515332169.flb', retry in 10 seconds: task_id=108, input=tail.0 > output=es.2
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.205745559.flb', retry in 11 seconds: task_id=34, input=tail.0 > output=es.0
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.249240345.flb', retry in 9 seconds: task_id=38, input=tail.0 > output=es.0

... with no more logs ever following (= pod froze completely).

I assume this is a bug in fluent-bit: it does not close sockets properly and thus somehow blocks itself. Once I even saw 731 sockets in CLOSE_WAIT within a single pod.

Note that I saw both cases: a few times a pod restart would help, but in most cases even restarting the pod would not. Only when I cleared the storage folder /flb-storage/ and restarted the pod was there a high chance of resurrection.
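
The recovery procedure looks roughly like this (sketch; namespace and pod name are placeholders, it assumes the debug image with a shell, and that /flb-storage/ is a pod-local volume; with a hostPath volume the directory has to be cleared on the node instead):

# wipe the filesystem buffer, then let the DaemonSet recreate the pod
kubectl -n logging exec fluent-bit-jfz6p -- sh -c 'rm -rf /flb-storage/*'
kubectl -n logging delete pod fluent-bit-jfz6p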

From my subjective judgement, the issue appears more often when the load is high and fluent-bit has to process many container log files. We discovered this issue for the first time when our backend (Elasticsearch) was offline for a few hours; about 30% of our pods froze.

Screenshots
[screenshot "no-output": a short spike of data, then no further output]

As seen in the screenshot, we have a short spike of data (disregard the time before the spike; the pod spawned together with the spike), and then no output anymore. Fluent-bit won't process any more data.

Your Environment

  • Version used: tested with both v1.6.9 and v1.6.10 which both show the bug
  • Configuration (ConfigMap in Kubernetes):
  custom_parsers.conf: |
    [PARSER]
        Name kube-svh
        Format regex
        Regex ((?<tag>[^.]+)\.)?(?<namespace_name>[^.]+)\.(?<pod_name>[^.]+)\.(?<container_name>[^.]+)\.(?<docker_id>[^.]+)$
  fluent-bit.conf: |
    [SERVICE]
        Flush 1
        Daemon Off
        Log_Level info
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        storage.path /flb-storage/
        storage.sync normal
        storage.checksum off
        storage.max_chunks_up 2048
        storage.backlog.mem_limit 64M
        storage.metrics on

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        DB.Sync           Normal
        DB.locking        false
        Refresh_Interval  5
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})
        Buffer_Chunk_Size 32k
        Buffer_Max_Size   4M
        storage.type      filesystem

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_Tag_Prefix     kube.
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Regex_Parser        kube-svh
        Merge_Log           Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              Off
        Annotations         Off

    [FILTER]
        Name modify
        Match kube.*
        Remove_Wildcard attrs

    [OUTPUT]
        Name  es
        Match kube.staging.*
        Host  XYZ
        Port  443
        Path  /api/v1/ingest/elastic-bulk
        Logstash_Format On
        Retry_Limit False
        Type  _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix staging-4

        HTTP_User XYZ
        HTTP_Passwd XYZ

        Buffer_Size False

        tls on
        tls.verify on
        tls.debug 1

        storage.total_limit_size 30G
    [OUTPUT]
        Name  es
        Match kube.eventrouter.*
        Host  XYZ
        Port  443
        Path  /api/v1/ingest/elastic-bulk
        Logstash_Format On
        Retry_Limit False
        Type  _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix staging-4

        HTTP_User XYZ
        HTTP_Passwd XYZ

        Buffer_Size False

        tls on
        tls.verify on
        tls.debug 1

        storage.total_limit_size 30G
    [...]

Note: we have a few more [OUTPUT] sections that basically follow the pattern shown here (that's why the snippet ends with [...]).

Bonus Question: Looking at the configuration, do those values sound reasonable? We were not too sure about storage.max_chunks_up and storage.backlog.mem_limit and how they relate to the buffer sizes specified in the tail input.

  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.18
  • Server type and version: n/a
  • Operating System and version: n/a
  • Filters and plugins: (see configuration)

Thanks a lot for investigating! Looking forward to your feedback.

@sjentzsch
Author

Let me mention one more issue I saw while testing things out. On a few pods that were not affected by the freeze, I saw the following log line being spammed over and over for hours:

[2021/01/13 22:47:54] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/13 22:47:56] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/13 22:47:57] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds

The same connection #id, every few seconds, without ever ending. Only a pod restart could "resolve" it. Any idea what that issue might be related to?

I don't think our network itself is broken: a few pods were always running fine, and only changes to or inside fluent-bit (a combination of pod restarts and flushing the storage folder) could make the affected ones work properly again.

@sjentzsch
Author

Even with storage disabled, pods randomly freeze (it occurred once yesterday and once today). I suspect this is a general bug in recent versions of fluent-bit, maybe related to recent changes in socket handling.

@wstarzak

wstarzak commented Feb 2, 2021

I have the same issue. Also, there is no way to force a health check to restart the pod on timeouts.
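
As a stop-gap, an exec liveness probe that fails once sockets pile up in CLOSE_WAIT could force the restart (sketch only; it assumes the debug image, which ships a shell and netstat, and the threshold of 50 is arbitrary):

# exit non-zero (probe failure) when 50 or more half-closed sockets remain
sh -c '[ "$(netstat -tn | grep -c CLOSE_WAIT)" -lt 50 ]'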

@apollo13

@sjentzsch Regarding your timeout errors in the log: I am seeing the same thing on plain machines (i.e. not in k8s), and attaching strace -f -p THEPID 2>&1 | grep connect shows that connect errors out with EINPROGRESS, which makes sense for non-blocking sockets. The funny thing is that as soon as I stop strace, the log entries also vanish. It looks like interrupting the thread with strace somehow fixes the issue :) If I attach strace a second time, I no longer see the connect failures.
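
For anyone who wants to repeat this, the same trace can be narrowed down to the connect calls inside strace itself (sketch; the binary name passed to pidof is an assumption, adjust it to your setup):

# attach to the running daemon and trace only connect() calls, with timestamps
strace -f -tt -e trace=connect -p "$(pidof fluent-bit)"
# detach with Ctrl-C; oddly enough, the repeated timeout log lines stop after detaching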

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 18, 2021
@knutwannheden

AFAICT there have been no comments by committers here. What does it say about the state of a project when the maintainers don't take the time to respond to a very detailed bug report like this one?

@github-actions github-actions bot removed the Stale label Mar 20, 2021
@nokute78
Collaborator

nokute78 commented Mar 31, 2021

The patch 1938ec8 will be included in v1.7.3 (not released yet).

Similar to #3192

@nokute78
Collaborator

v1.7.3 has been released. Can you check it?

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 16, 2021
@pierluigilenoci
Contributor

@sjentzsch could you please follow up?

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 17, 2021
@pierluigilenoci
Contributor

@sjentzsch could you please follow up?

@yteraoka

The too-many-CLOSE_WAIT issue also occurred on the Anthos bare metal stackdriver-log-forwarder (container image gcr.io/anthos-baremetal-release/fluent-bit:v1.7.7-gke.4). This fluent-bit sends logs to Google Cloud Logging using the filesystem buffer.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 15, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@JeffLuoo
Contributor

May I get the steps for reproducing this issue?

@data-dude

data-dude commented Feb 11, 2022

This is still a problem with 1.8.12. With a plain vanilla configuration, fluent-bit just stops after 5 minutes.

Edit: This may not be a fluent-bit problem. See my comment below if interested.

@leonardo-albertovich
Collaborator

Do you have reproduction instructions @data-dude?

@data-dude

data-dude commented Feb 11, 2022

This may not be a fluent-bit problem. I have a plain configuration: the input Path reads from /var/log/messages and output goes to stdout. The health API fails right away, and after 3 or 5 minutes it stops sending output, yet there are no errors in the logs. The same setup works on multiple other nodes, so I think this one server has an issue. I've been thinking of building a Docker image with trace enabled and trying that out. My setup runs on a closed network, so I can't pull the configuration down to show you.

Fluentd works fine on this server, so I'm not sure what causes fluent-bit to just stop with no errors. Unfortunately strace isn't installed on the server.
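
Even without strace, the /proc filesystem gives a rough picture of whether the process is stuck and whether descriptors are leaking (sketch; the pidof lookup assumes the binary is named fluent-bit):

PID=$(pidof fluent-bit)
# number of open file descriptors (sockets included)
ls /proc/$PID/fd | wc -l
# process state and thread count
grep -E 'State|Threads' /proc/$PID/status
# kernel-side stack of every thread (needs root); a thread parked in the same spot forever is suspicious
cat /proc/$PID/task/*/stack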
