
Statsd intermittently outputs two data points for the same metric at the same timestamp, the second of which is always zero. #551

Closed
stevencherry opened this issue Nov 3, 2015 · 7 comments
Labels
bug A bug within statsd or it's bundled proxies

Comments

@stevencherry

This results in what look like metric drop-outs in our Graphite graphs. See my debugging below:

[root@graphite ~]# tshark -l -f "udp port 2003 or tcp port 2003 or tcp port 2004" -i lo -E separator=, -Tfields -e data | xxd -r -p | grep "Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate"
Running as user "root" and group "root". This could be dangerous.
Capturing on lo
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486516
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 1.9 1446486526
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486536
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486546
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486556
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486566
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486576
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486576
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486586
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486596

Taken from above, the duplicate data points:

Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486576
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486576

The zero data point always follows the first valid data point; as a result, the valid point is replaced by zero, hence the apparent drop-outs.
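
To illustrate why the second point "wins" (a minimal sketch of the last-write-wins behaviour at a single timestamp, not statsd or carbon code):

// Sketch: whisper keeps one slot per timestamp, so a later write at the
// same timestamp overwrites the earlier value.
var series = {};                 // timestamp -> value
series[1446486496] = 2.2;        // first (valid) flush
series[1446486496] = 0;          // second flush, same timestamp, overwrites
console.log(series[1446486496]); // 0 -> rendered as an apparent drop-out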

@coykitten coykitten added the bug A bug within statsd or it's bundled proxies label Nov 5, 2015
@coykitten
Contributor

Hi, this seems like a potentially bad bug with statsd. What kind of load (UDP packets per second) does this happen under, and is there some way for us to recreate this situation for debugging?

Also, what does your environment look like?
Statsd version?
node version?
OS?

@stevencherry
Author

Hi, here are the answers to your questions.

As far as our environment is concerned: we have 5 physical statsd servers, all running matching versions of the OS, statsd, node and carbon, on identical hardware. Each statsd instance delivers its data points to a carbon relay agent running locally on the same server. In turn, the carbon relay distributes the metrics, according to a consistent-hashing algorithm, to one of the same 5 servers, all of which run carbon writers. The metrics are then written to whisper files by the carbon writer. In front of the cluster sits an F5 load balancer that distributes statsd traffic across the nodes. The load balancer is configured with a persistence profile, ensuring statsd traffic from a given source address is consistently delivered to the same server.
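
For reference, a minimal config.js sketch of the per-node layout described above (the hostnames, ports and flush interval here are illustrative assumptions, not copied from our actual config):

{
  port: 8125,                 // UDP port statsd listens on behind the F5
  mgmt_port: 8126,            // TCP management port
  graphiteHost: "127.0.0.1",  // local carbon relay on the same server
  graphitePort: 2003,         // relay consistent-hashes on to the 5 carbon writers
  flushInterval: 10000        // 10 s, matching the 10 s spacing in the captures
}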

[root@graphite ~]# cat /etc/redhat-release 
CentOS release 6.5 (Final)
[root@graphite statsd]# node -v
v0.13.0-pre
[root@graphite statsd]# cat package.json 
{
  "name": "statsd",
  "description": "A simple, lightweight network daemon to collect metrics over UDP",
  "author": "Etsy",
  "scripts": {
    "test": "./run_tests.sh",
    "start": "node stats.js config.js",
    "install-windows-service": "node_modules\\.bin\\winser -i",
    "uninstall-windows-service": "node_modules\\.bin\\winser -r"
  },
  "repository": {
    "type": "git",
    "url": "https://github.com/etsy/statsd.git"
  },
  "version": "0.7.2",
  "dependencies": {
  },
  "devDependencies": {
    "nodeunit": "0.7.x",
    "underscore": "1.4.x",
    "temp": "0.4.x"
  },
  "optionalDependencies": {
    "node-syslog":"1.1.7",
    "hashring":"1.0.1",
    "winser": "=0.1.6"
  },
  "engines": {
    "node" : ">=0.8"
  },
  "bin": { "statsd": "./bin/statsd" }
}

The total input rate is approx. 7000 data points per second, spread across thousands of metrics using 'c' and 'ms' data types, measured over a 5-minute window; see the working below:

[root@graphite ~]# tshark -l -f "tcp port 8126 or udp port 8125" -i em1 -E separator=, -Tfields -e data | xxd -r -p | tr "|" "\n" | pv --line-mode --rate -i 300 > /dev/null
Running as user "root" and group "root". This could be dangerous.
Capturing on em1
[6.94k/s] 

I've written a test script to deliver data points to statsd and managed to reproduce the error.

Test script

#!/bin/bash
# Send a single counter increment to statsd over UDP at random intervals
# of up to 0.75 s; the second echo just logs what was sent to the terminal.
PORT=8125
SERVER=x.x.x.x
while true
do
        echo "Ops.statsd-test:1|c" | nc -w0 -u ${SERVER} ${PORT}
        echo "Ops.statsd-test:1|c"
        sleep $(python -c "import random;print(random.uniform(0.0, 0.75))")
done
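
As a rough sanity check on the numbers (a sketch, not statsd code): statsd reports a counter's .rate as the flushed total divided by the flush interval in seconds, so with the random sleeps above, values around 2-2.6/s are what I'd expect in the capture below:

var flushInterval = 10000;                 // ms (assumed 10 s flush interval)
var count = 22;                            // increments received in one window
var rate = count / (flushInterval / 1000); // => 2.2, matching the capture
// Average sleep is ~0.375 s plus nc/python start-up overhead per packet,
// so ~2-2.6 packets per second is the expected order of magnitude.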

Debugging results:

[root@graphite ~]# tshark -l -f "udp port 2003 or tcp port 2003 or tcp port 2004" -i lo -E separator=, -Tfields -e data | xxd -r -p | grep "Ops.statsd-test.rate"
Running as user "root" and group "root". This could be dangerous.
Capturing on lo
Ops.statsd-test.rate 2.6 1447076111
Ops.statsd-test.rate 3 1447076121
Ops.statsd-test.rate 2.4 1447076131
Ops.statsd-test.rate 0 1447076131
Ops.statsd-test.rate 2.2 1447076141
Ops.statsd-test.rate 2.2 1447076151
Ops.statsd-test.rate 2.3 1447076161
Ops.statsd-test.rate 2.2 1447076171
Ops.statsd-test.rate 2 1447076181
Ops.statsd-test.rate 2.3 1447076191
Ops.statsd-test.rate 2.9 1447076201
Ops.statsd-test.rate 2.5 1447076211
Ops.statsd-test.rate 2.5 1447076221
Ops.statsd-test.rate 2.5 1447076231
Ops.statsd-test.rate 2 1447076241
Ops.statsd-test.rate 2.2 1447076251
Ops.statsd-test.rate 2.5 1447076261
Ops.statsd-test.rate 0 1447076261
Ops.statsd-test.rate 0.7 1447076271

In particular, the duplicate data points:

Ops.statsd-test.rate 2.4 1447076131
Ops.statsd-test.rate 0 1447076131

and

Ops.statsd-test.rate 2.5 1447076261
Ops.statsd-test.rate 0 1447076261

@coykitten coykitten self-assigned this Mar 18, 2016
@coykitten
Contributor

@stevencherry Are you still having these problems? I tried to reproduce this bug this week, but failed. I'm wondering if this is a bug stemming from a particular configuration.

I tested on CentOS 7, node 0.10.16. Next chance I get, I'll test this under the same node runtime.

@deltaroe

Just noticed this issue in our environment as well. Let me know what information you need to help track this down.

@lghamie

lghamie commented Apr 20, 2016

I do believe this is related to this code:
function flushMetrics() {
  // ...
  setTimeout(flushMetrics, getFlushTimeout(flushInterval));
}

and this function:

function getFlushTimeout(interval) {
  return interval - (new Date().getTime() - startup_time * 1000) % flushInterval;
}

This used to be done using setInterval, but has since been changed to setTimeout (I don't understand why).

It seems to me that setTimeout is not that exact, and even though you're correcting its offset, that might be causing this problem.
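
A sketch of the failure mode I have in mind (illustrative numbers, not traced through statsd itself): if the timer fires a few milliseconds early, the corrected delay comes out near zero, two flushes land inside the same one-second timestamp, and the second one reports counters that were already reset:

var flushInterval = 10000;                    // ms
var startup_time_ms = 0;                      // pretend statsd started at t = 0

function getFlushTimeout(nowMs) {
  // same formula as above, with "now" passed in so it can be simulated
  return flushInterval - (nowMs - startup_time_ms) % flushInterval;
}

console.log(getFlushTimeout(29995));          // 5  -> timer fired ~5 ms early
console.log(Math.round(29995 / 1000));        // 30 -> timestamp of first flush
console.log(Math.round((29995 + 5) / 1000));  // 30 -> second flush, same second
// The second flush reports counters the first flush already reset to 0.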

I tested this with 0.7.2 (which uses setInterval) and it worked like a charm.

@coykitten
Contributor

coykitten commented May 6, 2016

@lghamie setInterval causes statsd to suffer from bucket drift, as the metrics aggregation and flush time is not factored into the interval. More here: #489
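
To make the contrast concrete (a sketch of the two scheduling strategies, not the exact code from either release):

// setInterval: the delay never accounts for time spent aggregating and
// flushing, so the actual flush moment drifts off the wall-clock bucket
// boundaries over long uptimes (the problem described in #489).
setInterval(flushMetrics, flushInterval);

// setTimeout + getFlushTimeout: the delay to the next boundary is
// recomputed after every flush, which stays aligned -- but, as reported
// above, a timer that fires early can schedule a second flush almost
// immediately, inside the same one-second timestamp.
setTimeout(flushMetrics, getFlushTimeout(flushInterval));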

It's worth mentioning that this issue was opened well before we merged #489, and that 0.7.2 is a 2-year-old release.

@coykitten
Contributor

Reproduced in #574 and fixed by #575.
