Corrupted values being sent from collectd to itself. #2209

Closed
iain-buclaw-sociomantic opened this issue Mar 8, 2017 · 17 comments

@iain-buclaw-sociomantic
Contributor

iain-buclaw-sociomantic commented Mar 8, 2017

  • Version of collectd: 5.7.1
  • Operating system / distribution: Ubuntu 14.04

Expected behavior

A collectd instance that is also configured as a server occasionally sends nonsense values to itself from common plugins such as cpu, disk, or swap. However, the secondary server receives the values uncorrupted.

Both servers should receive the same uncorrupted values.

This is the (abridged) config for server 10.0.0.2; the reverse configuration is used on 10.0.0.3.

LoadPlugin network

<Plugin network>
	# client setup:
	<Server "10.0.0.3" "65246">
	</Server>
	# server setup for carbon:
	<Listen "10.0.0.2" "65246">
	</Listen>
	Forward false
	ReportStats true
</Plugin>

LoadPlugin "write_graphite"

<Plugin "write_graphite">
	<Carbon>
		Host "localhost"
		Port "2003"
		Protocol "tcp"
		EscapeCharacter "_"
		StoreRates true
		AlwaysAppendDS true
	</Carbon>
</Plugin>

Actual behavior

tcpdump between 10.0.0.2 and localhost relay:

host-001.cpu-4.cpu-system.value 37.633281 1488989202
host-001.cpu-4.cpu-wait.value 0.100000 1488989202
host-001.cpu-4.cpu-user.value 673898617 1488989202  <---
host-001.cpu-4.cpu-nice.value 0.000000 1488989202
host-001.cpu-4.cpu-interrupt.value 0.000000 1488989202
host-001.cpu-4.cpu-softirq.value 0.233333 1488989202
host-001.cpu-4.cpu-steal.value 0.000000 1488989202
host-001.cpu-4.cpu-idle.value 26.699969 1488989202

tcpdump between secondary server 10.0.0.3 and relay

host-001.cpu-4.cpu-user.value 34.766620 1488989202  <---
host-001.cpu-4.cpu-system.value 37.633281 1488989202
host-001.cpu-4.cpu-wait.value 0.100000 1488989202
host-001.cpu-4.cpu-interrupt.value 0.000000 1488989202
host-001.cpu-4.cpu-softirq.value 0.233333 1488989202
host-001.cpu-4.cpu-steal.value 0.000000 1488989202
host-001.cpu-4.cpu-idle.value 26.699969 1488989202
host-001.cpu-4.cpu-nice.value 0.000000 1488989202

Steps to reproduce

Happened since upgrade to 5.7.1.

@iain-buclaw-sociomantic
Contributor Author

For clarification, it only seems to affect DERIVE metrics, not GAUGE.

@iain-buclaw-sociomantic
Contributor Author

Did some debug printfs inside the write_graphite plugin.

DERIVE:506208817
host-001.cpu-0.cpu-system.value 46.538426 1489137805^M
--
DERIVE:506210167
host-001.cpu-0.cpu-system.value 44.988956 1489137835^M
--
DERIVE:506211438
host-001.cpu-0.cpu-system.value 42.383147 1489137865^M
--
DERIVE:506212594
host-001.cpu-0.cpu-system.value 506212594 1489137895^M   <---  This is where it goes wrong.
--
DERIVE:506213858
host-001.cpu-0.cpu-system.value nan 1489137925^M
--
DERIVE:506215216
host-001.cpu-0.cpu-system.value 45.267625 1489137955^M
--
DERIVE:506216616
host-001.cpu-0.cpu-system.value 46.665085 1489137985^M
--

@iain-buclaw-sociomantic
Contributor Author

Ah, yes, I am indeed looking in the wrong place. This seems to be related to StoreRates.
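
Doing the arithmetic on the debug readings above supports that (assuming StoreRates turns the raw DERIVE counter into a per-second rate over the 30 s interval):

(506211438 - 506210167) / 30 = 42.37  -> matches the 42.383147 written at 1489137865
(506212594 - 506211438) / 30 = 38.53  -> but the raw counter 506212594 is written instead

So the one bad sample is the raw DERIVE value leaking through, followed by a nan.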

@iain-buclaw-sociomantic
Contributor Author

iain-buclaw-sociomantic commented Mar 10, 2017

A couple more printfs:

DERIVE:49110088:RATE:0.266667
host-001.cpu-2.cpu-wait.value 0.266667 1489162150
--
DERIVE:49110096:RATE:(none+flag)
host-001.cpu-2.cpu-wait.value 49110096 1489162180
--
DERIVE:49110128:RATE:nan
host-001.cpu-2.cpu-wait.value nan 1489162210
--
DERIVE:49110164:RATE:1.200000
host-001.cpu-2.cpu-wait.value 1.200000 1489162240
--

So the call to uc_get_rate(ds, vl) returns NULL.
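
For anyone following along, when rates is NULL the per-value formatting falls through to printing the raw counter. A simplified sketch of that branch (paraphrased from src/utils_format_graphite.c; function name, arguments, and buffer handling are simplified, not verbatim):

#include <inttypes.h> /* PRIi64, PRIu64 */
#include <stdio.h>    /* snprintf */
/* data_set_t, value_list_t, gauge_t and the DS_TYPE_* constants come from collectd's plugin.h */

static int gr_format_value(char *ret, size_t ret_len, size_t i,
                           data_set_t const *ds, value_list_t const *vl,
                           gauge_t const *rates) {
  if (ds->ds[i].type == DS_TYPE_GAUGE)
    return snprintf(ret, ret_len, "%f", vl->values[i].gauge);
  else if (rates != NULL)
    /* StoreRates path: write the precomputed rate */
    return snprintf(ret, ret_len, "%f", rates[i]);
  else if (ds->ds[i].type == DS_TYPE_COUNTER)
    return snprintf(ret, ret_len, "%llu", vl->values[i].counter);
  else if (ds->ds[i].type == DS_TYPE_DERIVE)
    /* rates == NULL: the raw, ever-growing DERIVE counter is written out */
    return snprintf(ret, ret_len, "%" PRIi64, vl->values[i].derive);
  else if (ds->ds[i].type == DS_TYPE_ABSOLUTE)
    return snprintf(ret, ret_len, "%" PRIu64, vl->values[i].absolute);
  return -1;
}

That fall-through is what the patch in the next comment guards against.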

@iain-buclaw-sociomantic
Contributor Author

I think something like this would prevent collectd from sending the raw derive value if the call to look up the stored rates fails for some unknown reason:

diff --git a/src/utils_format_graphite.c b/src/utils_format_graphite.c
index 4f509f4..8ae03e9 100644
--- a/src/utils_format_graphite.c
+++ b/src/utils_format_graphite.c
@@ -180,8 +180,13 @@ int format_graphite(char *buffer, size_t buffer_size, data_set_t const *ds,
   int buffer_pos = 0;
 
   gauge_t *rates = NULL;
-  if (flags & GRAPHITE_STORE_RATES)
+  if (flags & GRAPHITE_STORE_RATES) {
     rates = uc_get_rate(ds, vl);
+    if (rates == NULL) {
+      ERROR("format_graphite: error with uc_get_rate");
+      return (-1);
+    }
+  }
 
   for (size_t i = 0; i < ds->ds_num; i++) {
     char const *ds_name = NULL;

@octo - Any thoughts yet?

@Farfaday

Hi,
we also have a similar issue, but only with the cpu plugin. We do not use the disk plugin. No wrong values are seen on swap, but our servers are not swapping, so those values are constant.

We do not use the "network" plugin, but only "write_graphite".

This is our write_graphite config:

<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.xxx.com"
    Port "2013"
    Prefix "collectd."
    EscapeCharacter "_"
    StoreRates true
    AlwaysAppendDS false
    SeparateInstances false
    LogSendErrors true
    Protocol "udp"
  </Node>
</Plugin>

@halcyonardency

We're also seeing intermittent unreasonable values for CPU.


CPU plugin has no special parameters.

write_graphite config:

LoadPlugin write_graphite
<Plugin write_graphite>
  <Carbon>
    Host "somehost"
    Port "2003"
    Protocol "tcp"
    Prefix "enterprise.collectd."
    Postfix ""
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
    SeparateInstances true
  </Carbon>
</Plugin>

Package: collectd-5.7.1.13.gf7e2d82-1.el7.centos.x86_64.rpm
OS: CentOS 7-latest

Installed from a mirror of the CI builds.

@Farfaday

Compiled version 5.7.1 from source at about 17:00 (hence the small gap in the graph at that time).
The issue is still there, but occurs far less often.
(screenshot: screenshot_collectd_cpu)

@iain-buclaw-sociomantic
Contributor Author

iain-buclaw-sociomantic commented May 18, 2017

I've started noticing this for all StoreRates graphs, not just on the collectd server receiving them over the network. Each time I look at this I can't pin down a concrete cause, but there have been several large refactorings, any one of which could have introduced it; then again, I'm not sure what reasons there could be for the cache being invalidated.

936c450
2ede5e7
2761915
a9f509b
a762b2b

I can't see anything else that has changed around StoreRates, GRAPHITE_STORE_RATES, uc_get_rate, or uc_update. I am making a heavy assumption here that one of these areas is the main suspect.

octo self-assigned this May 18, 2017
octo added the Bug label May 18, 2017
@octo
Member

octo commented May 18, 2017

Hi @iain-buclaw-sociomantic,

thank you very much for reporting this and for your analysis! I think not checking the return value of uc_get_rate() (see your comment above) explains the behavior you're observing: when the rates pointer is NULL, the following code assumes that the StoreRates option is false and submits the raw counter value instead of the rate. Do you want to turn that diff into a Pull Request?

Thanks and best regards,
—octo

@iain-buclaw-sociomantic
Contributor Author

Sure, that would be the corrective logic, I think. It would still be nice to find out why it started returning NULL in the first place, however.

@octo
Member

octo commented May 18, 2017

Great, thanks! Yes, that would also be interesting to know. Buffers / caching in the network plugin and the cache entry timeout would be the first thing I'd look into. How many metrics are you sending? Do you fill an entire buffer on each iteration?

@iain-buclaw-sociomantic
Contributor Author

iain-buclaw-sociomantic commented May 19, 2017

How many metrics are you sending?

There are two servers receiving metrics: one receives 30k metrics per second, the other 8k per second. Both exhibit the same bug.

I don't think it's an issue with sending metrics, as between the two servers only one of them sends a bad value to Graphite.

Do you fill an entire buffer on each iteration?

Very likely; we compile with -DWG_SEND_BUF_SIZE=65464 because we have some pretty large custom types. However, none of those are DERIVE, so we don't see problems with them. Only built-in types like cpu, swap, network, processes, and disk are affected.

@magnetik

magnetik commented Jun 1, 2017

Seeing the same issue with DERIVE values on the apache plugin (with apache_requests) and the interface plugin (with if_octets).

I'm willing to help. Is there anything we can do?

@iain-buclaw-sociomantic
Contributor Author

The immediate fix should have been released in 5.7.2, so I think this is good to close.

@octo
Member

octo commented Jul 20, 2017

@magnetik this was an error in the JSON formatting, i.e. in the output plugins, and it will affect all (derive) metrics, including those from the Apache plugin. If you're still having this issue after 5.7.2, please open a separate issue.

octo closed this as completed Jul 20, 2017
@magnetik

Indeed, no problem in 5.7.2. Thanks a lot to everyone who contributed to finding/fixing it, you guys are awesome. 👍
