Fluentd buffer overflows or queue limit size gets filled #7089
Hi @adrdimitrov Thanks so much for the detailed information!! We mainly provide support in this repository to solve problems related to the Bitnami charts or the containers. For information regarding the application itself or customization of the content within the application, we highly recommend checking the forums and user guides made available by the project behind the application. In your case, I recommend posting your question in the forum below; you'll find more experienced users there who can help you customize the configuration:
Hello @juan131, Thanks, I asked about this in the fluentd channel and in a few other places and got this: uken/fluent-plugin-elasticsearch#909. The issue seems to be caused by the Ruby version and is fixed upstream, but it still seems that the Helm chart is deploying an old Ruby version that has this issue. Is it possible to update the Ruby version for this chart? It is not very pleasant to deploy via Helm charts and then upgrade versions by hand, and doing this for multiple Kubernetes clusters will not scale.
Hey @juan131, have you managed to check the above?
Hi @adrdimitrov Using the latest image available (tag 1.13.3-debian-10-r0), I checked the Ruby and excon versions inside the container:
$ docker run --rm -it bitnami/fluentd:1.13.3-debian-10-r0 -- bash
$ ruby --version
ruby 2.6.8p205 (2021-07-07 revision 67951) [x86_64-linux]
$ gem list excon
*** LOCAL GEMS ***
excon (0.85.0)
It's supposed to be fine and it shouldn't be affected by the issue, unless I'm missing something. What version of the container and chart are you using?
Hey @juan131, I am currently using helm_chart_version_fluentd = "3.7.5", so I will update and report back. Thanks very much for your prompt response.
Thanks! Please keep us updated on how it goes with the latest chart version.
Hello @juan131 I redeployed my Fluentd with helm chart version 4.1.3 and the result is the same. I saw that even with the 4.1.3 helm chart the Ruby version is still the old one. Please note that I did not update it in place; I removed Fluentd and redeployed it using Terraform. Am I doing something wrong?
Hi @adrdimitrov We're including the latest Ruby version available in that branch, and the current chart is pointing to that image version. By the way, I thought the problem was related to the excon Ruby gem.
Hey @juan131, Just saw your answer; it turned out that I was still using the old Debian image. I did not see that the image is set in the values file. I changed it and now I am running Fluentd with Ruby 2.6.8. I cannot confirm whether 2.7.x is required or whether 2.6.8 fixes the issue, but I am now running 2.6.8, will leave it like this, and will report back on whether it fixes the issue. Thanks a lot for your time!
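(For reference, the override is a single block in the chart's values file; a minimal sketch, assuming the top-level image section used by the Bitnami fluentd chart and the tag discussed above:)
# values.yaml (sketch): pin the Fluentd container image deployed by the chart
image:
  registry: docker.io
  repository: bitnami/fluentd
  tag: 1.13.3-debian-10-r0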
Thanks so much! Please keep us updated about your insights.
Hello @juan131, I managed to deploy the latest helm chart version and, as said above, changed the values file to use 1.13.3-debian-10-r0. I haven't faced the issue since then (5 days), so I guess I can now confirm that this issue is fixed with Ruby 2.6.8. Thanks a lot for your time and effort! It's appreciated.
👏 !!! That's great!!! I'm very glad the problem was fixed using the latest image! Thanks so much for sharing your insights @adrdimitrov. I'll keep the issue open for a few more days just in case you face it again.
Hey @juan131, Some bad news: although I haven't stopped receiving logs (or at least I don't see gaps), the issue with 100% CPU utilization of the pod is still there, and the pod frequently gets restarted with SIGKILL. I will monitor this closely and try to catch the errors and the exact behaviour. Maybe upgrading to Ruby 2.7.x is a good idea.
Hello again, As mentioned yesterday I am still facing the CPU issue, but I left Fluentd running to see how it deals with it. Unfortunately, last night it stopped in a scenario similar to the one before: it suddenly filled the queued chunks limit and stopped sending logs to ES; the total buffer size is 40MB. The moment we hit 100% CPU we lost the logs. I don't see errors in the logs; it just dies without any notification.
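(A way to watch the buffer filling up, assuming the in_monitor_agent plugin is enabled on port 24220 in this deployment, is to poll the plugin metrics endpoint; the hostname here is an assumption:)
$ curl -s http://fluentd-aggregator:24220/api/plugins.json \
    | jq '.plugins[] | select(.type == "elasticsearch") | {buffer_queue_length, buffer_total_queued_size, retry_count}'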
Hi @adrdimitrov I can build an exact copy of the current container replacing Ruby 2.6.x with 2.7.x. What do you think? Does it make sense?
Hey @juan131 Sorry for the late response, I was on vacation. Yes, it makes sense and would be great! Meanwhile my colleagues and I are monitoring this, and so far we haven't seen the issue again. I am not sure whether this happens under specific circumstances or is completely random. Will keep you posted.
Thanks so much @adrdimitrov By the way, I built and published an image based on Ruby 2.7.x; you can use it with the following image values:
registry: docker.io
repository: juanariza131/fluentd
tag: development
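(A possible way to apply these overrides, assuming the chart was installed from the Bitnami repository under the release name fluentd and the values above live in values.yaml:)
$ helm upgrade fluentd bitnami/fluentd -f values.yaml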
Hey @juan131, Quick update: I managed to deploy this custom image and it has been running for a week now without issues. I will continue monitoring it for another week and keep you posted.
Hey @juan131, It has been two weeks now with the image based on Ruby 2.7.x and I have faced no issues. I believe this is solved.
That's great news!!! I'll make the required changes in our system to release a new revision.
@adrdimitrov a new revision of the container image is now available. Please give it a try when you have a chance! I'll proceed to close the issue as "solved", but please feel free to reopen it if you require further assistance.
Hello team,
I am testing Fluentd for our logging purposes and I am facing an issue with my buffer configuration (I guess). The setup is as follows:
Deployment: I am deploying Fluentd on our Kubernetes cluster, which consists of 4 nodes, one of them generating almost 90% of the logs. Everything is done using Terraform and the bitnami/fluentd Helm chart. Fluentd is in the kube-system namespace. It is sending logs to AWS Elasticsearch.
Input config on the forwarders:
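(The snippet itself was not captured here; below is a minimal sketch of a typical forwarder input that tails container logs, where the path, pos_file, tag, and parser are illustrative assumptions rather than the exact config used:)
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>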
Output config on aggregator:
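(Also not captured; below is a representative aggregator output to Elasticsearch with a file buffer, using the buffer parameters discussed in the next paragraph. The host, paths, and limit values are illustrative assumptions:)
<match **>
  @type elasticsearch
  # assumption: the AWS Elasticsearch endpoint for this cluster
  host my-domain.eu-west-1.es.amazonaws.com
  port 443
  scheme https
  logstash_format true
  <buffer>
    @type file
    path /opt/bitnami/fluentd/logs/buffers/logs.buffer
    # cap on total buffer size on disk
    total_limit_size 40MB
    # illustrative cap on the number of queued chunks
    queued_chunks_limit_size 100
    # drop the oldest chunk instead of erroring when the buffer is full
    overflow_action drop_oldest_chunk
    flush_thread_count 2
    flush_interval 5s
    retry_forever true
  </buffer>
</match>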
Issue: Fluentd works fine for hours and then one of two things happens: either the buffer total_limit_size gets reached and Fluentd stops working (even though I have set overflow_action to drop_oldest_chunk), or the queued_chunks_limit_size gets reached and again Fluentd stops sending. I have tried a lot of different configurations, including the default ones and using memory as a buffer; in every case I hit one of the two issues. Using the above configuration (this is my latest test) I got queued_chunks_limit_size reached (over 8200 files including the .meta files). The only errors I see in the logs are for slow_flush_threshold from time to time. During the failure I am not observing excessive memory or CPU usage on either side, Fluentd or Elasticsearch. It seems like the connection is just lost and never regained.
Restarting the pod gets Fluentd back to a normal working state, but this way I am losing logs, and it is not sustainable to restart it manually every time it stops sending logs.
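(For reference, the manual recovery described above boils down to something like the following; the workload names are assumptions based on the kube-system deployment described earlier and a release named fluentd:)
$ kubectl -n kube-system rollout restart statefulset/fluentd-aggregator
$ kubectl -n kube-system rollout restart daemonset/fluentd-forwarder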