fluent-bit with storage enabled completely freezes (due to sockets not being closed properly(?)) #2950
Comments
Let me mention one more issue I saw when testing things out. On a few pods that were not affected, I saw the following log line being spammed over and over for hours:
The same connection #id, every few seconds, without ever ending. Only a pod restart could "resolve" it. Any idea what that issue might be related to? I don't think our network itself is corrupted, as a few pods were always running fine, and only changes to or inside fluent-bit could make them work properly again (a combination of pod restarts and flushing the storage folder).
Even with storage disabled, pods randomly freeze (occurred once yesterday and once today). I suspect this is a general bug in recent versions of fluent-bit, maybe related to recent changes in socket handling.
I have the same issue, and there is no way to force a healthcheck to restart the pod on timeouts.
@sjentzsch Regarding your timeout errors in the log: I am seeing the same thing on normal machines (i.e. not in k8s) and attaching
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
AFAICT there were no comments by committers here. What does it mean for the state of a project when the maintainers don't take the time to respond to a very detailed bug report like this one?
v1.7.3 has been released. Can you check it?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@sjentzsch could you please follow up?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@sjentzsch could you please follow up?
Too many `CLOSE_WAIT` sockets occurred on the Anthos baremetal stackdriver-log-forwarder (container image
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
May I get the steps for reproducing this issue?
This is still a problem with 1.8.12. With a plain vanilla configuration, fluent-bit just stops after 5 minutes. Edit: This may not be a fluent-bit problem. See the comment below if interested.
Do you have reproduction instructions, @data-dude?
This may not be a fluent-bit problem. I have a plain configuration: the path reads from /var/log/messages and output goes to stdout. The health API is failing right away, and after 3 or 5 minutes it stops sending output, but there are no errors in the logs. The same setup works on multiple other nodes, so I think this one server has an issue. I've been thinking of building a docker image with trace enabled and trying that out. My setup runs on a closed network, so I can't pull the configuration down to show you. Fluentd works fine on this server, so I'm not sure what causes fluent-bit to just stop with no errors. Unfortunately strace isn't installed on the server.
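For reference, a minimal configuration along the lines described in the previous comment might look roughly like this (the tag, port, and health-check setup are assumptions for illustration, not the poster's actual file):

```
[SERVICE]
    # Built-in HTTP server exposes /api/v1/health when Health_Check is enabled
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020
    Health_Check  On

[INPUT]
    Name  tail
    Path  /var/log/messages
    Tag   messages

[OUTPUT]
    Name   stdout
    Match  *
```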
Bug Report
Describe the bug
Under certain circumstances fluent-bit can completely freeze and stop processing any logs.
We have seen this behavior for quite some time now, on many pods, appearing with a certain randomness.
We rely on the storage configuration, and my suspicion is that this bug can only appear with storage enabled, probably in combination with higher load and a certain (mis?)configuration (see our config below). Unfortunately it's a serious issue for us: storage buffering is supposed to make us more resilient to logging outages, but with this bug it's actually the other way round :-)
I could reproduce the issue by running a log generator producing 200 messages of ~1 kB per second while denying all outgoing traffic from the fluent-bit pod. After a while, in roughly one out of three attempts, the pod froze and did not process any more logs at all.
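As a rough, hypothetical stand-in for that log generator (the actual generator is not shown in this report; the target path and message size are assumptions), a shell loop like the following writes ~200 lines of ~1 kB per second:

```sh
#!/bin/sh
# Write roughly 200 log lines of ~1 kB each per second into a file fluent-bit tails.
# Egress from the fluent-bit pod has to be blocked separately (e.g. via a NetworkPolicy
# or firewall rule) to simulate the backend outage described above.
LINE=$(head -c 1024 /dev/zero | tr '\0' 'x')   # ~1 kB payload
while true; do
  i=0
  while [ "$i" -lt 200 ]; do
    echo "$(date) $LINE" >> /var/log/containers/loadtest.log
    i=$((i + 1))
  done
  sleep 1
done
```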
With the debug image of fluent-bit at hand, I made an interesting discovery: at some point all sockets ended up in `CLOSE_WAIT` state; see my `netstat` output captured while it occurred. Note that I stripped a few lines; there were actually around ~200 sockets, all in the same state.
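Counting the stuck sockets from inside the pod can be done with a one-liner like the following (assuming `netstat` or `ss` is present in the debug image):

```sh
# Count TCP sockets currently stuck in CLOSE_WAIT
netstat -tan 2>/dev/null | grep -c CLOSE_WAIT
# or, with iproute2:
ss -tan state close-wait | tail -n +2 | wc -l
```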
While that happens, our `/flb-storage/` directory usually contains a few hundred files that no longer change, since the pod is frozen. The logs of such an affected pod look like this:
... with no more log lines ever following (i.e. the pod froze completely).
I assume this is a bug in fluent-bit: it does not close sockets properly and thus somehow blocks itself. Once I even saw 731 sockets in `CLOSE_WAIT` within one pod.
Note that I saw both cases: a few times a pod restart would help; at other times (in most cases) even restarting the pod would not help. Only when I cleared the storage folder `/flb-storage/` and restarted the pod was there a high chance of resurrection.
From my subjective judgement, the issue appears more often when the load is high and fluent-bit has to process many container log files. We discovered this issue for the first time when our backend (Elasticsearch) was offline for a few hours (about 30% of our pods were frozen).
Screenshots
As seen in the screenshot, we have a short spike of data (disregard the time before the spike; the pod spawned with the spike), and then no output at all. Fluent-bit won't process any more data.
Your Environment
Note: We have a few output sections that basically follow the pattern shown here (that's why the file ends with `[...]`).
Bonus question: Looking at the configuration, do those values sound reasonable? We were not too sure about `storage.max_chunks_up`, `storage.backlog.mem_limit`, and their relation to the buffer sizes specified in the tail input.
Thanks a lot for your investigation! Looking forward to it.
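For reference, these settings live in the `[SERVICE]` and `[INPUT]` sections; the sketch below only illustrates where each knob goes, with placeholder values rather than our actual configuration:

```
[SERVICE]
    storage.path               /flb-storage/
    # Max number of chunks allowed to be "up" in memory at once
    storage.max_chunks_up      128
    # Memory limit used when replaying backlog chunks from the filesystem at start
    storage.backlog.mem_limit  5M

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    # Per-input memory budget for buffered data
    Mem_Buf_Limit     5MB
    # Buffer chunks on the filesystem instead of memory only
    storage.type      filesystem
```

Roughly speaking, `storage.max_chunks_up` caps how many chunks may sit in memory overall, `storage.backlog.mem_limit` caps the memory used while replaying filesystem-backlog chunks at startup, and `Mem_Buf_Limit` is the per-input memory budget of the tail plugin.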