fluent-bit with storage enabled completely freezes (due to sockets not being closed properly(?)) #2950

Closed
sjentzsch opened this issue Jan 15, 2021 · 19 comments

@sjentzsch

Bug Report

Describe the bug

Under certain circumstances fluent-bit can completely freeze and stop processing logs. We have seen this behavior for quite some time now, across many pods, appearing with a certain randomness.

We do rely on the storage configuration, and my concern is that this bug can only appear with storage enabled, probably in combination with higher load and a certain (mis?)configuration (see our config below). Unfortunately it is a serious issue for us: storage buffering should make us more resilient to logging outages, but here it is actually the other way round :-)

I could reproduce the issue by running a log generator that emits 200 messages of 1 kB per second while denying all outgoing traffic from the fluent-bit pod. After a while, roughly one out of three times, our pod froze and did not process any more logs at all.
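
For anyone trying to reproduce: one way to deny the outgoing traffic is a node-side iptables drop rule (sketch only; it assumes iptables access on the node, a pod that is not on hostNetwork, and a placeholder pod IP). A Kubernetes NetworkPolicy that denies egress should work just as well.

# drop everything the fluent-bit pod sends out (10.2.131.200 is a placeholder pod IP)
iptables -I FORWARD -s 10.2.131.200 -j DROP
# ... let the retries pile up for a while, then restore connectivity
iptables -D FORWARD -s 10.2.131.200 -j DROP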

With the fluent-bit debug image at hand, I made the following interesting discovery: at some point all sockets end up in the CLOSE_WAIT state. Here is my netstat debugging while it occurred:

/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 fluent-bit-jfz6p:60382  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59954  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59962  dtag-laas.netic.dk:https ESTABLISHED 
[...]
tcp        0      0 fluent-bit-jfz6p:60380  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60054  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60172  dtag-laas.netic.dk:https ESTABLISHED 
tcp    10106      0 fluent-bit-jfz6p:59820  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60470  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60370  dtag-laas.netic.dk:https ESTABLISHED 
tcp      641      0 fluent-bit-jfz6p:59802  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59952  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60068  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:2020   10.2.131.169:40492      TIME_WAIT   
tcp        0      0 fluent-bit-jfz6p:60638  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60292  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      1 fluent-bit-jfz6p:60614  dtag-laas.netic.dk:https SYN_SENT    
tcp        0      0 fluent-bit-jfz6p:60362  dtag-laas.netic.dk:https ESTABLISHED 
tcp    11355      0 fluent-bit-jfz6p:60716  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60308  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59836  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59882  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60586  dtag-laas.netic.dk:https ESTABLISHED 
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node Path
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
199
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
7
/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp       32      0 fluent-bit-jfz6p:60382  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:59954  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        1      0 fluent-bit-jfz6p:59962  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60354  dtag-laas.netic.dk:https CLOSE_WAIT  
[...]
tcp       32      0 fluent-bit-jfz6p:60068  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        0      0 fluent-bit-jfz6p:2020   10.2.131.169:40492      TIME_WAIT   
tcp        1      0 fluent-bit-jfz6p:60292  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        0      0 fluent-bit-jfz6p:60614  dtag-laas.netic.dk:https ESTABLISHED 
tcp       32      0 fluent-bit-jfz6p:59976  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60620  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60362  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp    11387      0 fluent-bit-jfz6p:60716  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        1      0 fluent-bit-jfz6p:60298  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:59882  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60586  dtag-laas.netic.dk:https CLOSE_WAIT  

Note that I stripped a few lines; there were actually around 200 sockets, all in the same respective state.
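
To get the numbers without eyeballing the whole list, a per-state summary can be produced with the BusyBox tools in the debug image (sketch):

# count TCP sockets per state (skip the two netstat header lines)
netstat -tn | awk 'NR>2 {print $6}' | sort | uniq -c
# or just count the half-closed ones
netstat -tn | grep -c CLOSE_WAIT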

While that happens, our /flb-storage/ usually contains a few hundred files that no longer change, since the pod is frozen.

The logs of such an affected pod look like this:

[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632705.229524992.flb', retry in 8 seconds: task_id=37, input=tail.0 > output=es.2
[2021/01/14 13:58:38] [error] [upstream] connection #170 to dtag-laas.netic.dk:443 timed out after 10 seconds
[...]
[2021/01/14 13:58:38] [error] [upstream] connection #1271 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [error] [upstream] connection #896 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632704.928894552.flb', retry in 7 seconds: task_id=18, input=tail.0 > output=es.0
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[...]
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.103450536.flb', retry in 10 seconds: task_id=28, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.26765971.flb', retry in 9 seconds: task_id=24, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604215 watch_fd=97 name=/var/log/containers/admin-management-59b684bf76-7dbqk_staging_admin-management-025be4bb58a5958704f1f80e8ed72322c98402e2f756e71708e4d0c56e5b842c.log
[...]
[2021/01/14 13:58:40] [ warn] [engine] failed to flush chunk '1-1610632705.269473234.flb', retry in 9 seconds: task_id=39, input=tail.0 > output=es.0
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604213 watch_fd=98 name=/var/log/containers/skill-e2e-tests-86598c5668-2qxpg_staging_skill-e2e-tests-e1e165a7ed49643b0e645fe5ad7f111d0e0472d1d38ab9875cf43da7210782b4.log
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=208216295 watch_fd=99 name=/var/log/containers/skill-e2e-tests-birthday-5cd88db447-p6x86_prodshadow_skill-e2e-tests-birthday-99aa9d08d371222835ab92a28fcfe0ea3245459c30efa271db31c854bd8b622d.log
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632706.515332169.flb', retry in 10 seconds: task_id=108, input=tail.0 > output=es.2
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.205745559.flb', retry in 11 seconds: task_id=34, input=tail.0 > output=es.0
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.249240345.flb', retry in 9 seconds: task_id=38, input=tail.0 > output=es.0

... with no more logs ever following (= pod froze completely).

I assume this is a bug in fluent-bit: it does not close sockets properly and thus somehow blocks itself. Once I even saw 731 sockets in CLOSE_WAIT within a single pod.

Note that I saw both cases: a few times a pod restart would help, but in most cases even restarting the pod would not. Only when I cleared the storage folder /flb-storage/ and restarted the pod was there a high chance of resurrection.
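
The recovery procedure looks roughly like this (sketch; namespace and pod name are placeholders, it assumes the debug image with a shell, and that /flb-storage/ is a pod-local volume; with a hostPath volume the directory has to be cleared on the node instead):

# wipe the filesystem buffer, then let the DaemonSet recreate the pod
kubectl -n logging exec fluent-bit-jfz6p -- sh -c 'rm -rf /flb-storage/*'
kubectl -n logging delete pod fluent-bit-jfz6p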

From my subjective judgement, the issue appears more often when the load is high and fluent-bit has to process many container log files. We discovered this issue for the first time when our backend (Elasticsearch) was offline for a few hours; about 30% of our pods froze.

Screenshots
[screenshot "no-output": a short spike of data, then no further output]

As seen in the screenshot, we have a short spike of data (disregard the time before the spike; the pod spawned together with the spike), and then no output anymore. Fluent-bit won't process any more data.

Your Environment

  • Version used: tested with both v1.6.9 and v1.6.10 which both show the bug
  • Configuration (ConfigMap in Kubernetes):
  custom_parsers.conf: |
    [PARSER]
        Name kube-svh
        Format regex
        Regex ((?<tag>[^.]+)\.)?(?<namespace_name>[^.]+)\.(?<pod_name>[^.]+)\.(?<container_name>[^.]+)\.(?<docker_id>[^.]+)$
  fluent-bit.conf: |
    [SERVICE]
        Flush 1
        Daemon Off
        Log_Level info
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        storage.path /flb-storage/
        storage.sync normal
        storage.checksum off
        storage.max_chunks_up 2048
        storage.backlog.mem_limit 64M
        storage.metrics on

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        DB.Sync           Normal
        DB.locking        false
        Refresh_Interval  5
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})
        Buffer_Chunk_Size 32k
        Buffer_Max_Size   4M
        storage.type      filesystem

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_Tag_Prefix     kube.
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Regex_Parser        kube-svh
        Merge_Log           Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              Off
        Annotations         Off

    [FILTER]
        Name modify
        Match kube.*
        Remove_Wildcard attrs

    [OUTPUT]
        Name  es
        Match kube.staging.*
        Host  XYZ
        Port  443
        Path  /api/v1/ingest/elastic-bulk
        Logstash_Format On
        Retry_Limit False
        Type  _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix staging-4

        HTTP_User XYZ
        HTTP_Passwd XYZ

        Buffer_Size False

        tls on
        tls.verify on
        tls.debug 1

        storage.total_limit_size 30G
    [OUTPUT]
        Name  es
        Match kube.eventrouter.*
        Host  XYZ
        Port  443
        Path  /api/v1/ingest/elastic-bulk
        Logstash_Format On
        Retry_Limit False
        Type  _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix staging-4

        HTTP_User XYZ
        HTTP_Passwd XYZ

        Buffer_Size False

        tls on
        tls.verify on
        tls.debug 1

        storage.total_limit_size 30G
    [...]

Note: we have a few more [OUTPUT] sections that basically follow the pattern shown here (that's why the snippet ends with [...]).

Bonus Question: Looking at the configuration, do those values sound reasonable? We were not too sure about storage.max_chunks_up and storage.backlog.mem_limit and how they relate to the buffer sizes specified in the tail input.

  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.18
  • Server type and version: n/a
  • Operating System and version: n/a
  • Filters and plugins: (see configuration)

Thanks a lot for investigating! Looking forward to your feedback.

@sjentzsch
Author

Let me mention one more issue I saw while testing things out. On a few pods that were not affected by the freeze, I saw the following log line being spammed over and over for hours:

[2021/01/13 22:47:54] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/13 22:47:56] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/13 22:47:57] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds

The same connection #id, every few seconds, without ever ending. Only a pod restart could "resolve" it. Any idea what that issue might be related to?

I don't think our network itself is broken: a few pods were always running fine, and only changes to or inside fluent-bit (a combination of pod restarts and flushing the storage folder) could make the affected ones work properly again.

@sjentzsch
Author

Even with storage disabled, pods randomly freeze (it occurred once yesterday and once today). I suspect this is a general bug in recent versions of fluent-bit, maybe related to recent changes in socket handling.

@wstarzak

wstarzak commented Feb 2, 2021

I have the same issue. Also, there is no way to force a health check to restart the pod on timeouts.
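
As a stop-gap, an exec liveness probe that fails once sockets pile up in CLOSE_WAIT could force the restart (sketch only; it assumes the debug image, which ships a shell and netstat, and the threshold of 50 is arbitrary):

# exit non-zero (probe failure) when 50 or more half-closed sockets remain
sh -c '[ "$(netstat -tn | grep -c CLOSE_WAIT)" -lt 50 ]'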

@apollo13

@sjentzsch Regarding your timeout errors in the log: I am seeing the same thing on plain machines (i.e. not in k8s), and attaching strace -f -p THEPID 2>&1 | grep connect shows that connect errors out with EINPROGRESS, which makes sense for non-blocking sockets. The funny thing is that as soon as I stop strace, the log entries also vanish. It looks like interrupting the thread with strace somehow fixes the issue :) If I attach strace a second time, I no longer see the connect failures.
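
For anyone who wants to repeat this, the same trace can be narrowed down to the connect calls inside strace itself (sketch; the binary name passed to pidof is an assumption, adjust it to your setup):

# attach to the running daemon and trace only connect() calls, with timestamps
strace -f -tt -e trace=connect -p "$(pidof fluent-bit)"
# detach with Ctrl-C; oddly enough, the repeated timeout log lines stop after detaching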

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 18, 2021
@knutwannheden

AFAICT there have been no comments by committers here. What does it say about the state of a project when the maintainers don't take the time to respond to a very detailed bug report like this one?

@github-actions github-actions bot removed the Stale label Mar 20, 2021
@nokute78
Collaborator

nokute78 commented Mar 31, 2021

The patch 1938ec8 will be included in v1.7.3 (not released yet).

Similar to #3192

@nokute78
Collaborator

v1.7.3 has been released. Can you check it?

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 16, 2021
@pierluigilenoci
Contributor

@sjentzsch could you please follow up?

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 17, 2021
@pierluigilenoci
Contributor

@sjentzsch could you please follow up?

@yteraoka

The too-many-CLOSE_WAIT issue also occurred on the Anthos bare metal stackdriver-log-forwarder (container image gcr.io/anthos-baremetal-release/fluent-bit:v1.7.7-gke.4). This fluent-bit sends logs to Google Cloud Logging using the filesystem buffer.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 15, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@JeffLuoo
Contributor

May I get the steps for reproducing this issue?

@data-dude

data-dude commented Feb 11, 2022

This is still a problem with 1.8.12. With a plain vanilla configuration, fluent-bit just stops after 5 minutes.

Edit: This may not be a fluent-bit problem. See my comment below if interested.

@leonardo-albertovich
Collaborator

Do you have reproduction instructions @data-dude?

@data-dude

data-dude commented Feb 11, 2022

This may not be a fluent-bit problem. I have a plain configuration: the input Path reads from /var/log/messages and output goes to stdout. The health API fails right away, and after 3 or 5 minutes it stops sending output, yet there are no errors in the logs. The same setup works on multiple other nodes, so I think this one server has an issue. I've been thinking of building a Docker image with trace enabled and trying that out. My setup runs on a closed network, so I can't pull the configuration down to show you.

Fluentd works fine on this server, so I'm not sure what causes fluent-bit to just stop with no errors. Unfortunately strace isn't installed on the server.
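
Even without strace, the /proc filesystem gives a rough picture of whether the process is stuck and whether descriptors are leaking (sketch; the pidof lookup assumes the binary is named fluent-bit):

PID=$(pidof fluent-bit)
# number of open file descriptors (sockets included)
ls /proc/$PID/fd | wc -l
# process state and thread count
grep -E 'State|Threads' /proc/$PID/status
# kernel-side stack of every thread (needs root); a thread parked in the same spot forever is suspicious
cat /proc/$PID/task/*/stack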
