
No internal queueing anymore when all write plugins fail? #1543

Open
GuusHoutzager opened this issue Feb 8, 2016 · 7 comments

@GuusHoutzager

Hi,

I remember collectd eating all available memory when all write plugins failed, back when I last used it; I used the WriteQueueLimitHigh and WriteQueueLimitLow parameters to limit that. Now, however, collectd does not seem to queue at all anymore. I've tested with 5.4, 5.5 and 5.5.1: if I break the connection to Graphite, collectd's memory usage does not go up, and with the CollectInternalStats option in 5.5 and 5.5.1 the write queue doesn't grow either. As a result I get holes in my graphs once the connection to Graphite is restored, since the older data no longer exists.
I tested with 5.4 from Ubuntu 14.04 in both the default and a custom config, as well as with 5.5 and 5.5.1 on CentOS 7 with our Puppet-generated config.
Am I doing something stupid, or is something broken? Please let me know what you need from me to help debug this; I have various test environments to mess around with. Thanks!
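
For reference, the options involved look roughly like this (a minimal sketch only; the Graphite host/port and the limit values are placeholders, not our actual settings):

```
# Global options in collectd.conf
WriteQueueLimitHigh 1000000   # above this queue length, new values start getting dropped
WriteQueueLimitLow   800000   # below this length, everything is accepted again
CollectInternalStats true     # 5.5+: exposes write queue length/drops as metrics

LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"   # placeholder
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>
```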

Cheers,

Guus

@unixsurfer

I have the same issue. Can we have some attention please?

@mfournier

If a write plugin fails, collectd is expected to bail out and start dropping metrics sent to that destination. Such an event will be mentioned in the logs. Do you see that in your respective cases?

Up to 42a62db (part of 5.5.1), collectd didn't identify a certain type of TCP-related failure as a genuine error and kept submitting data until the kernel timed out the stalled socket (which on Linux defaults to roughly 2 hours), at which point collectd's error-handling code kicked in. The workaround to keep collectd from ballooning and consuming a ton of memory was to set the WriteQueueLimit* options.

My understanding is that the write queue is there to handle temporary backpressure from blocking destinations (disk I/O saturation preventing rrdtool from fsyncing in a timely manner, network congestion between collectd and the Graphite server, etc.). It's not designed to ensure no data gets lost when the target has a hard failure.

IMHO a better way to get a resilient setup is to use a dedicated queueing system (the amqp or write_kafka plugins, or mqtt in the upcoming 5.6 release) and/or to configure several concurrent write plugins for redundancy.
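
For illustration, a redundant setup along those lines could look roughly like this (a sketch only; the host and broker names are placeholders, and the write_kafka options follow the collectd.conf(5) example as far as I recall):

```
LoadPlugin write_graphite
LoadPlugin write_kafka

<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"   # placeholder
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>

<Plugin write_kafka>
  Property "metadata.broker.list" "kafka1.example.com:9092,kafka2.example.com:9092"
  <Topic "collectd">
    Format JSON
  </Topic>
</Plugin>
```

Each value is handed to every registered write plugin independently, so one failing destination doesn't keep the other from receiving data.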

@unixsurfer

Why can't we use the write queue for hard failures (the remote TCP endpoint went away) and retry the connection later? We can build a very resilient setup, but there will always be temporary failures (failure is not an option, and yet it is always there :-)), so I would expect collectd to buffer metrics (within some limits) and retry establishing the connection every X seconds.

@dreddynarahari

We are still facing this issue. Is there a solution or a bug fix in recent versions of collectd for this?

@octo
Member

octo commented Nov 8, 2018

No, current versions still drop metrics on failure.

collectd doesn't retry failing write callbacks, and never did; the metric was always dropped. However, failing write callbacks often have severely increased latency, for example when they try to re-establish a network connection. This call latency is what fills the queue.

We could retry failing write calls, now that there's an upper bound on the queue length / memory consumption. This doesn't necessarily achieve the desired effect, though. Some plugins, write_http for example, build a buffer and try to send it once it's full. If that fails, retrying the last metric is not going to make much of a difference.

Alternatively, we could change the write callbacks to retry temporary failures internally (i.e. without returning), potentially blocking for a long while. This causes problems when you have more than one write plugin, because it starves the write threads: previously, when one plugin failed, the others continued to function; now one failing plugin would halt all reporting.
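
To make that trade-off concrete, here is a rough C sketch of what such an internally-retrying write callback might look like (not actual collectd code; send_one_value() is a hypothetical stand-in for a plugin's real transmit logic):

```c
#include "collectd.h"
#include "plugin.h"   /* plugin_register_write(), data_set_t, value_list_t */

#include <unistd.h>   /* sleep() */

/* Hypothetical helper standing in for the plugin's real transmit code;
 * returns zero on success, non-zero on (temporary) failure. */
static int send_one_value(const data_set_t *ds, const value_list_t *vl);

static int retrying_write(const data_set_t *ds, const value_list_t *vl,
                          user_data_t *ud) {
  (void)ud;
  for (int attempt = 0; attempt < 5; attempt++) {
    if (send_one_value(ds, vl) == 0)
      return 0;
    /* While we sleep here, this write thread stays occupied: with a small
     * thread pool, a single broken destination can stall dispatch to every
     * other (healthy) write plugin as well. */
    sleep(10);
  }
  return -1; /* give up; this value is dropped after all */
}

void module_register(void) {
  plugin_register_write("retrying_example", retrying_write,
                        /* user data = */ NULL);
}
```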

All of these issues are solvable, but this problem is certainly near the "high complexity" end of the spectrum and that's probably why nobody has tackled it yet. If somebody wants to give this a shot and come up with a good design, I'm happy to review and give advice.

Best regards,
—octo

@rpv-tomsk
Contributor

These are possibly related to this issue: #2104, #2486, #2480.

@sunkuranganath
Member

This issue was discussed on the community bi-weekly call, as noted in https://collectd.org/wiki/index.php/MinutesApr17th20

The discussion notes were attached as an image (not reproduced here).
