out_forward to a server that's down makes the whole server go down #523
Comments
How about
No, it doesn't work. It still emits too many logs.
Sorry for the delay.
Different plugin_id in each log is weird. I will check it later.
Ah, okay. num_thread prevents stacktrace suppression.
In line with this issue, a warning is also emitted far too many times, which fills the hard disk. The warning is about the buffer queue size exceeding its limit.
There are 11 lines emitted per second; you can imagine how long it takes to fill an 8 GB hard disk.
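As a rough back-of-envelope check of that claim (the line rate is from the report above; the average bytes per line is an assumption, since warning lines with stack traces vary a lot in size):

```python
# Estimate how long repeated warnings take to fill a disk.
# lines_per_second is from the report above; bytes_per_line is an
# assumed average for one warning line plus a short stack trace.
bytes_per_line = 2048
lines_per_second = 11
disk_bytes = 8 * 1024**3  # 8 GB disk

seconds_to_full = disk_bytes / (bytes_per_line * lines_per_second)
print(round(seconds_to_full / 3600, 1))  # → 105.9 hours, under five days
```

With full stack traces of hundreds of lines per warning, the effective write rate would be far higher and the disk would fill in hours rather than days.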
It seems your fluentd instance receives lots of requests or sends events to a stuck plugin.
That's a good idea, but should one plugin failure bring the whole server down? Currently we are not investing in adding more nodes (at least until we get bigger traffic).
It should not go down. In this problem, a long-lasting destination error combined with a small buffer capacity causes lots of BufferQueueLimitError. Separating the stream relieves this problem.
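As a sketch of that stream separation in v0.12-era syntax (the tags, hosts, and limits below are hypothetical examples, not taken from this thread): each `<match>` block gets its own buffer, so a stuck destination on one stream cannot exhaust the queue used by the other.

```
<match app.critical.**>
  type forward
  buffer_chunk_limit 8m
  buffer_queue_limit 256
  <server>
    host forwarder-a.example.com
  </server>
</match>

<match app.bulk.**>
  type forward
  buffer_chunk_limit 8m
  buffer_queue_limit 64
  <server>
    host forwarder-b.example.com
  </server>
</match>
```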
How about adding a max log size option? Then, whatever the error, the log would never exceed a specific size (say 1 GB, but it should be configurable).
What does that line mean? Should the input plugin block when buffers are filled?
> What does that line mean? Should the input plugin block when buffers are filled?

I think what he means is to set a maximum size the log can grow to. Beyond that size, older events would be removed (either deleted or sent to a backup/archive file) and new events added, to keep the log size below the max size setting. Ultimately the size on disk of the log files needs to be controlled, so if sending to a backup/archive, the size and number of those files need to be controlled as well.
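The behaviour described above is essentially size-based rotation with a bounded number of files. A minimal sketch using Python's standard library (not Fluentd itself; the maxBytes and backupCount values are stdlib parameters chosen arbitrarily for the demo):

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "capped.log")

# Keep at most 4 files (capped.log plus 3 backups) of ~10 KB each,
# deleting the oldest backup on each rollover: total size on disk stays
# bounded no matter how many events are logged.
handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=10_000, backupCount=3)
logger = logging.getLogger("capped")
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the demo output off stderr

for i in range(10_000):
    logger.warning("BufferQueueLimitError: queue size exceeds limit (%d)", i)

files = [f for f in os.listdir(log_dir) if f.startswith("capped.log")]
total = sum(os.path.getsize(os.path.join(log_dir, f)) for f in files)
print(len(files), total)  # 4 files, total well under 50 KB
```

Applying the same idea to Fluentd's own log would cap disk usage even when an error repeats indefinitely.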
Hmm... I see. We are now developing a new plugin API, including buffers.
Is this thread about the logs of Fluentd itself? The new Buffer plugin API has nothing to do with logging.
Hmm... from "Beyond that size, older events would be removed (either deleted or sent to a backup/archive file) and new events added to keep the log size below the max size setting.", I assume he means the buffer's queue. And I think dieend's intention is also the buffer.
We also have seen similar issues, as described in the original report - it will keep logging some message repeatedly, using a lot of CPU in the process, and eventually filling up the disk. I think this is partially an issue with the volume of log messages fluentd generates (which can be mitigated to some extent), but my bigger concern is whether some of these issues are independent of logging, i.e. even if we disabled logging, are there situations where fluentd gets stuck in a tight loop, trying to do something that fails repeatedly due to external factors? Looks like there are a few things that can mitigate the log impact, at least:
Ah, I see 'emit_error_log_interval' is mentioned in http://docs.fluentd.org/articles/config-file, but it is not explained. I tried it, and it does work, except it only suppresses specific messages. For instance, in my case I saw 'emit transaction failed' only every 10s (as I had configured), but I still see 'syslog failed to emit' every time.
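For reference, a sketch of where that setting can live; the `<system>` placement and the 10s value below are assumptions based on the commenter's description, so verify against the docs for your Fluentd version:

```
<system>
  # emit identical emit-error logs at most once per interval
  emit_error_log_interval 10s
  suppress_repeated_stacktrace true
</system>
```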
Yeah, we need to improve the logging mechanism for better control.
It should not happen in fluentd's main thread. In a plugin thread, it may happen due to a plugin bug.
We are now using
From v0.12, this is true by default.
Ah! I will update documents later.
Hmm... currently, fluentd uses suppression per thread.
I ran into this in 0.12.[16|18] even using
and once the buffers fill up, the logs explode with error messages, journald pegs a core trying to keep up, and the disk eventually fills up. What also concerned me is that the server kept accepting new messages on the tcp input. This all occurs even if

Short term fix

```shell
# consul uses nagios like checks so `exit 1` is a warning and `exit 2` is failure
( $(curl -s http://127.0.0.1:24220/api/plugins.json | jq -e '[.plugins[].buffer_queue_length] | sort | reverse | .[0] < 5') && echo 'running fine (buffers within limits)') || (echo 'getting too far behind; buffer is full.' && exit 2)
```

QUESTION

This surprised me, as it seems like a common error condition. Am I missing anything? I can recreate it pretty easily, so I suspect not, but I would love to be proven wrong. And how can I have fluentd stop accepting incoming tcp requests if the buffer fills up?

Production Outage

This issue actually took down our entire cluster of fluentd aggregators in a few minutes:
Fluentd's buffer is one of the throttling mechanisms.
It needs several modifications. Michael suggests a block option, or having the input plugin handle the buffer-full error.
For such cases, a small retry_limit and using
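A sketch of that combination in v0.12 syntax (the tag, host, and path below are placeholder examples): once retry_limit is exhausted, failed chunks are handed to the `<secondary>` output instead of being retried forever.

```
<match app.**>
  type forward
  retry_limit 5  # give up on a dead destination sooner
  <server>
    host forwarder.example.com
  </server>
  <secondary>
    # chunks that exhausted their retries are written to local files
    type file
    path /var/log/fluent/forward-failed
  </secondary>
</match>
```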
Did you set
No responses or updates, so let me close this now. Please reopen if the problem still exists.
The log outputs this warning with a stack trace, and it repeats hundreds of thousands of times until the disk is full and the server goes down.
I tried to set log_level to error, but it still happened. I tried to add the --suppress-repeated-stacktrace option; the messages were reduced, but it is still outputting many, many logs. It only suppresses the stacktrace, not the messages. The only workaround is setting the global log level with -qq. With this I lose my ability to follow warnings and errors. Is there any way for me to receive as much log as possible with as little repetition as possible? (I mean, a way to get the logs without them repeating infinitely.)
My partial configuration: