Corrupted values being sent from collectd to itself. #2209

Closed
iain-buclaw-sociomantic opened this issue Mar 8, 2017 · 17 comments

@iain-buclaw-sociomantic
Contributor

iain-buclaw-sociomantic commented Mar 8, 2017

  • Version of collectd: 5.7.1
  • Operating system / distribution: Ubuntu 14.04

Expected behavior

A collectd instance that is also configured as a server occasionally sends nonsense values to itself from common plugins such as cpu, disk, or swap. However, the secondary server receives the values uncorrupted.

Both servers should receive the same uncorrupted values.

This is the (abridged) config for server 10.0.0.2; the reverse configuration is used on 10.0.0.3.

LoadPlugin network

<Plugin network>
	# client setup:
	<Server "10.0.0.3" "65246">
	</Server>
	# server setup for carbon:
	<Listen "10.0.0.2" "65246">
	</Listen>
	Forward false
	ReportStats true
</Plugin>

LoadPlugin "write_graphite"

<Plugin "write_graphite">
	<Carbon>
		Host "localhost"
		Port "2003"
		Protocol "tcp"
		EscapeCharacter "_"
		StoreRates true
		AlwaysAppendDS true
	</Carbon>
</Plugin>

Actual behavior

tcpdump between 10.0.0.2 and localhost relay:

host-001.cpu-4.cpu-system.value 37.633281 1488989202
host-001.cpu-4.cpu-wait.value 0.100000 1488989202
host-001.cpu-4.cpu-user.value 673898617 1488989202  <---
host-001.cpu-4.cpu-nice.value 0.000000 1488989202
host-001.cpu-4.cpu-interrupt.value 0.000000 1488989202
host-001.cpu-4.cpu-softirq.value 0.233333 1488989202
host-001.cpu-4.cpu-steal.value 0.000000 1488989202
host-001.cpu-4.cpu-idle.value 26.699969 1488989202

tcpdump between secondary server 10.0.0.3 and relay

host-001.cpu-4.cpu-user.value 34.766620 1488989202  <---
host-001.cpu-4.cpu-system.value 37.633281 1488989202
host-001.cpu-4.cpu-wait.value 0.100000 1488989202
host-001.cpu-4.cpu-interrupt.value 0.000000 1488989202
host-001.cpu-4.cpu-softirq.value 0.233333 1488989202
host-001.cpu-4.cpu-steal.value 0.000000 1488989202
host-001.cpu-4.cpu-idle.value 26.699969 1488989202
host-001.cpu-4.cpu-nice.value 0.000000 1488989202

Steps to reproduce

Happened since upgrade to 5.7.1.

@iain-buclaw-sociomantic
Contributor Author

For clarification, it only seems to affect DERIVE metrics, not GAUGE.

@iain-buclaw-sociomantic
Contributor Author

Did some debug printfs inside the write_graphite plugin.

DERIVE:506208817
host-001.cpu-0.cpu-system.value 46.538426 1489137805^M
--
DERIVE:506210167
host-001.cpu-0.cpu-system.value 44.988956 1489137835^M
--
DERIVE:506211438
host-001.cpu-0.cpu-system.value 42.383147 1489137865^M
--
DERIVE:506212594
host-001.cpu-0.cpu-system.value 506212594 1489137895^M   <---  This is where it goes wrong.
--
DERIVE:506213858
host-001.cpu-0.cpu-system.value nan 1489137925^M
--
DERIVE:506215216
host-001.cpu-0.cpu-system.value 45.267625 1489137955^M
--
DERIVE:506216616
host-001.cpu-0.cpu-system.value 46.665085 1489137985^M
--

@iain-buclaw-sociomantic
Contributor Author

Ah, yes, I am indeed looking in the wrong place. This seems to be related to StoreRates.
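
Doing the arithmetic on the debug readings above supports that (assuming StoreRates turns the raw DERIVE counter into a per-second rate over the 30 s interval):

(506211438 - 506210167) / 30 = 42.37  -> matches the 42.383147 written at 1489137865
(506212594 - 506211438) / 30 = 38.53  -> but the raw counter 506212594 is written instead

So the one bad sample is the raw DERIVE value leaking through, followed by a nan.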

@iain-buclaw-sociomantic
Contributor Author

iain-buclaw-sociomantic commented Mar 10, 2017

A couple more printfs:

DERIVE:49110088:RATE:0.266667
host-001.cpu-2.cpu-wait.value 0.266667 1489162150
--
DERIVE:49110096:RATE:(none+flag)
host-001.cpu-2.cpu-wait.value 49110096 1489162180
--
DERIVE:49110128:RATE:nan
host-001.cpu-2.cpu-wait.value nan 1489162210
--
DERIVE:49110164:RATE:1.200000
host-001.cpu-2.cpu-wait.value 1.200000 1489162240
--

So the call to uc_get_rate(ds, vl) returns NULL.
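
For anyone following along, when rates is NULL the per-value formatting falls through to printing the raw counter. A simplified sketch of that branch (paraphrased from src/utils_format_graphite.c; function name, arguments, and buffer handling are simplified, not verbatim):

#include <inttypes.h> /* PRIi64, PRIu64 */
#include <stdio.h>    /* snprintf */
/* data_set_t, value_list_t, gauge_t and the DS_TYPE_* constants come from collectd's plugin.h */

static int gr_format_value(char *ret, size_t ret_len, size_t i,
                           data_set_t const *ds, value_list_t const *vl,
                           gauge_t const *rates) {
  if (ds->ds[i].type == DS_TYPE_GAUGE)
    return snprintf(ret, ret_len, "%f", vl->values[i].gauge);
  else if (rates != NULL)
    /* StoreRates path: write the precomputed rate */
    return snprintf(ret, ret_len, "%f", rates[i]);
  else if (ds->ds[i].type == DS_TYPE_COUNTER)
    return snprintf(ret, ret_len, "%llu", vl->values[i].counter);
  else if (ds->ds[i].type == DS_TYPE_DERIVE)
    /* rates == NULL: the raw, ever-growing DERIVE counter is written out */
    return snprintf(ret, ret_len, "%" PRIi64, vl->values[i].derive);
  else if (ds->ds[i].type == DS_TYPE_ABSOLUTE)
    return snprintf(ret, ret_len, "%" PRIu64, vl->values[i].absolute);
  return -1;
}

That fall-through is what the patch in the next comment guards against.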

@iain-buclaw-sociomantic
Contributor Author

I think something like this would prevent collectd from sending the raw derive value if the call to look up the stored rates fails for some unknown reason:

diff --git a/src/utils_format_graphite.c b/src/utils_format_graphite.c
index 4f509f4..8ae03e9 100644
--- a/src/utils_format_graphite.c
+++ b/src/utils_format_graphite.c
@@ -180,8 +180,13 @@ int format_graphite(char *buffer, size_t buffer_size, data_set_t const *ds,
   int buffer_pos = 0;
 
   gauge_t *rates = NULL;
-  if (flags & GRAPHITE_STORE_RATES)
+  if (flags & GRAPHITE_STORE_RATES) {
     rates = uc_get_rate(ds, vl);
+    if (rates == NULL) {
+      ERROR("format_graphite: error with uc_get_rate");
+      return (-1);
+    }
+  }
 
   for (size_t i = 0; i < ds->ds_num; i++) {
     char const *ds_name = NULL;

@octo - Any thoughts yet?

@Farfaday

Hi,
we also have a similar issue, but only with the cpu plugin. We do not use the disk plugin. No wrong values are seen on swap, but our servers are not swapping, so those values are constant.

We do not use the "network" plugin, but only "write_graphite".

This is our write_graphite config:

<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.xxx.com"
    Port "2013"
    Prefix "collectd."
    EscapeCharacter "_"
    StoreRates true
    AlwaysAppendDS false
    SeparateInstances false
    LogSendErrors true
    Protocol "udp"
  </Node>
</Plugin>

@halcyonardency

We're also seeing intermittent unreasonable values for CPU.


CPU plugin has no special parameters.

write_graphite config:

LoadPlugin write_graphite
<Plugin write_graphite>
  <Carbon>
    Host "somehost"
    Port "2003"
    Protocol "tcp"
    Prefix "enterprise.collectd."
    Postfix ""
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
    SeparateInstances true
  </Carbon>
</Plugin>

Package: collectd-5.7.1.13.gf7e2d82-1.el7.centos.x86_64.rpm
OS: CentOS 7-latest

Installed from a mirror of the CI builds.

@Farfaday

Compiled version 5.7.1 from source at about 17:00 (hence the small gap in the graph at that time).
The issue is still there, but occurs far less often.
(screenshot: screenshot_collectd_cpu)

@iain-buclaw-sociomantic
Contributor Author

iain-buclaw-sociomantic commented May 18, 2017

I've started noticing this for all StoreRates graphs, not just on the collectd server receiving them over the network. Each time I look at this I can't pin down a concrete cause, but there have been several large refactorings, any one of which could have introduced it; then again, I'm not sure what reasons there could be for the cache being invalidated.

936c450
2ede5e7
2761915
a9f509b
a762b2b

I can't see anything else that has changed around StoreRates, GRAPHITE_STORE_RATES, uc_get_rate, or uc_update. I am making a heavy assumption here that one of these areas is the main suspect.

octo self-assigned this May 18, 2017
octo added the Bug label May 18, 2017
@octo
Member

octo commented May 18, 2017

Hi @iain-buclaw-sociomantic,

thank you very much for reporting this and for your analysis! I think not checking the return value of uc_get_rate() (see your comment above) explains the behavior you're observing: when the rates pointer is NULL, the following code assumes that the StoreRates option is false and submits the raw counter value instead of the rate. Do you want to turn that diff into a Pull Request?

Thanks and best regards,
—octo

@iain-buclaw-sociomantic
Contributor Author

Sure, that would be the corrective logic, I think. It would still be nice to find out why it started returning NULL in the first place, however.

@octo
Member

octo commented May 18, 2017

Great, thanks! Yes, that would also be interesting to know. Buffers / caching in the network plugin and the cache entry timeout would be the first thing I'd look into. How many metrics are you sending? Do you fill an entire buffer on each iteration?

@iain-buclaw-sociomantic
Contributor Author

iain-buclaw-sociomantic commented May 19, 2017

How many metrics are you sending?

There are two servers receiving metrics: one receives 30k metrics per second, the other 8k per second. Both exhibit the same bug.

I don't think it's an issue with sending metrics, as between the two servers only one of them sends a bad value to Graphite.

Do you fill an entire buffer on each iteration?

Very likely; we compile with -DWG_SEND_BUF_SIZE=65464 because we have some pretty large custom types. However, none of those are DERIVE, so we don't see problems with them. Only built-in types like cpu, swap, network, processes, and disk are affected.

@magnetik

magnetik commented Jun 1, 2017

Seeing the same issue with DERIVE values on the apache plugin (with apache_requests) and the interface plugin (with if_octets).

I'm willing to help. Is there anything we can do?

@iain-buclaw-sociomantic
Contributor Author

The immediate fix should have been released in 5.7.2, so I think this is good to close.

@octo
Member

octo commented Jul 20, 2017

@magnetik this was an error in the JSON formatting, i.e. in the output plugins, and it will affect all (derive) metrics, including those from the Apache plugin. If you're still having this issue after 5.7.2, please open a separate issue.

octo closed this as completed Jul 20, 2017
@magnetik

Indeed, no problem in 5.7.2. Thanks a lot to everyone who contributed to finding/fixing it, you guys are awesome. 👍
