
No internal queueing anymore when all write plugins fail? #1543

Open
GuusHoutzager opened this issue Feb 8, 2016 · 7 comments

@GuusHoutzager

Hi,

I remember collectd eating all available memory when all write plugins failed, back when I last used it; I used the WriteQueueLimitHigh and WriteQueueLimitLow parameters to limit that. Now, however, collectd does not seem to queue at all anymore. I've tested with 5.4, 5.5 and 5.5.1: if I break the connection to Graphite, collectd's memory usage does not go up, and with the CollectInternalStats option in 5.5 and 5.5.1 the write queue doesn't grow either. As a result I get holes in my graphs once the connection to Graphite is restored, since the older data no longer exists.
I tested with 5.4 from Ubuntu 14.04 in both the default and a custom config, as well as with 5.5 and 5.5.1 on CentOS 7 with our Puppet-generated config.
Am I doing something stupid, or is something broken? Please let me know what you need from me to help debug this; I have various test environments to mess around with. Thanks!
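
For reference, the options involved look roughly like this (a minimal sketch only; the Graphite host/port and the limit values are placeholders, not our actual settings):

```
# Global options in collectd.conf
WriteQueueLimitHigh 1000000   # above this queue length, new values start getting dropped
WriteQueueLimitLow   800000   # below this length, everything is accepted again
CollectInternalStats true     # 5.5+: exposes write queue length/drops as metrics

LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"   # placeholder
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>
```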

Cheers,

Guus

@unixsurfer

I have the same issue. Can we have some attention please?

@mfournier

If a write plugin fails, collectd is expected to bail out and start dropping metrics sent to that destination. Such an event will be mentioned in the logs. Do you see that in your respective cases?

Up to 42a62db (part of 5.5.1), collectd didn't identify a certain type of TCP-related failure as a genuine error and kept submitting data until the kernel timed out the stalled socket (which on Linux defaults to roughly 2 hours), at which point collectd's error-handling code kicked in. The workaround to keep collectd from ballooning and consuming a ton of memory was to set the WriteQueueLimit* options.

My understanding is that the write queue is there to handle temporary backpressure from blocking destinations (disk I/O saturation preventing rrdtool from fsyncing in a timely manner, network congestion between collectd and the Graphite server, etc.). It's not designed to ensure no data gets lost when the target has a hard failure.

IMHO a better way to get a resilient setup is to use a dedicated queueing system (the amqp or write_kafka plugins, or mqtt in the upcoming 5.6 release) and/or to configure several concurrent write plugins for redundancy.
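
For illustration, a redundant setup along those lines could look roughly like this (a sketch only; the host and broker names are placeholders, and the write_kafka options follow the collectd.conf(5) example as far as I recall):

```
LoadPlugin write_graphite
LoadPlugin write_kafka

<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"   # placeholder
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>

<Plugin write_kafka>
  Property "metadata.broker.list" "kafka1.example.com:9092,kafka2.example.com:9092"
  <Topic "collectd">
    Format JSON
  </Topic>
</Plugin>
```

Each value is handed to every registered write plugin independently, so one failing destination doesn't keep the other from receiving data.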

@unixsurfer

Why can't we use the write queue for hard failures (the remote TCP endpoint went away) and retry the connection later? We can build a very resilient setup, but there will always be temporary failures (failure is not an option, and yet it is always there :-)), so I would expect collectd to buffer metrics (within some limits) and retry establishing the connection every X seconds.

@dreddynarahari

We are still facing this issue. Is there a solution or a bug fix in recent versions of collectd for this?

@octo
Member

octo commented Nov 8, 2018

No, current versions still drop metrics on failure.

collectd doesn't retry failing write callbacks, and never did; the metric was always dropped. However, failing write callbacks often have severely increased latency, for example when they try to re-establish a network connection. This call latency is what fills the queue.

We could retry failing write calls, now that there's an upper bound on the queue length / memory consumption. This doesn't necessarily achieve the desired effect, though. Some plugins, write_http for example, build a buffer and try to send it once it's full. If that fails, retrying the last metric is not going to make much of a difference.

Alternatively, we could change the write callbacks to retry temporary failures internally (i.e. without returning), potentially blocking for a long while. This causes problems when you have more than one write plugin, because it starves the write threads: previously, when one plugin failed, the others continued to function; now one failing plugin would halt all reporting.
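
To make that trade-off concrete, here is a rough C sketch of what such an internally-retrying write callback might look like (not actual collectd code; send_one_value() is a hypothetical stand-in for a plugin's real transmit logic):

```c
#include "collectd.h"
#include "plugin.h"   /* plugin_register_write(), data_set_t, value_list_t */

#include <unistd.h>   /* sleep() */

/* Hypothetical helper standing in for the plugin's real transmit code;
 * returns zero on success, non-zero on (temporary) failure. */
static int send_one_value(const data_set_t *ds, const value_list_t *vl);

static int retrying_write(const data_set_t *ds, const value_list_t *vl,
                          user_data_t *ud) {
  (void)ud;
  for (int attempt = 0; attempt < 5; attempt++) {
    if (send_one_value(ds, vl) == 0)
      return 0;
    /* While we sleep here, this write thread stays occupied: with a small
     * thread pool, a single broken destination can stall dispatch to every
     * other (healthy) write plugin as well. */
    sleep(10);
  }
  return -1; /* give up; this value is dropped after all */
}

void module_register(void) {
  plugin_register_write("retrying_example", retrying_write,
                        /* user data = */ NULL);
}
```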

All of these issues are solvable, but this problem is certainly near the "high complexity" end of the spectrum and that's probably why nobody has tackled it yet. If somebody wants to give this a shot and come up with a good design, I'm happy to review and give advice.

Best regards,
—octo

@rpv-tomsk
Contributor

These are possibly related to this issue: #2104, #2486, #2480.

@sunkuranganath
Member

This issue was discussed on the community bi-weekly call, as noted in https://collectd.org/wiki/index.php/MinutesApr17th20

The discussion notes were attached as an image (not reproduced here).
