
Statsd intermittently outputs two data points for the same metric at the same timestamp, the second of which is always zero. #551

Closed
stevencherry opened this issue Nov 3, 2015 · 7 comments
Labels
bug A bug within statsd or it's bundled proxies

Comments

@stevencherry

This results in what look like metric drop-outs in our Graphite graphs. See my debugging below:

[root@graphite ~]# tshark -l -f "udp port 2003 or tcp port 2003 or tcp port 2004" -i lo -E separator=, -Tfields -e data | xxd -r -p | grep "Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate"
Running as user "root" and group "root". This could be dangerous.
Capturing on lo
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486516
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 1.9 1446486526
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486536
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486546
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486556
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486566
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486576
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486576
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486586
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486596

Taken from above, the duplicate data points:

Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486496
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.1 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486506
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 2.2 1446486576
Apps.DataAcquisition-Scheduler-VWP-SHAREDCL101.vwp-sharedcl101.4.Counters.UrlToProcess.rate 0 1446486576

The zero data point always follows the first valid data point; as a result, the valid point is replaced by zero, hence the apparent drop-outs.
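
To illustrate why the second point "wins" (a minimal sketch of the last-write-wins behaviour at a single timestamp, not statsd or carbon code):

// Sketch: whisper keeps one slot per timestamp, so a later write at the
// same timestamp overwrites the earlier value.
var series = {};                 // timestamp -> value
series[1446486496] = 2.2;        // first (valid) flush
series[1446486496] = 0;          // second flush, same timestamp, overwrites
console.log(series[1446486496]); // 0 -> rendered as an apparent drop-out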

@coykitten coykitten added the bug A bug within statsd or it's bundled proxies label Nov 5, 2015
@coykitten
Contributor

Hi, this seems like a potentially bad bug with statsd. What kind of load (UDP packets per second) does this happen under, and is there some way for us to recreate this situation for debugging?

Also, what does your environment look like?
Statsd version?
node version?
OS?

@stevencherry
Author

Hi, here are the answers to your questions.

As far as our environment is concerned: we have 5 physical statsd servers, all running matching versions of the OS, statsd, node and carbon, on identical hardware. Each statsd instance delivers its data points to a carbon relay agent running locally on the same server. In turn, the carbon relay distributes the metrics, according to a consistent-hashing algorithm, to one of the same 5 servers, all of which run carbon writers. The metrics are then written to whisper files by the carbon writer. In front of the cluster sits an F5 load balancer that distributes statsd traffic across the nodes. The load balancer is configured with a persistence profile, ensuring statsd traffic from a given source address is consistently delivered to the same server.
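
For reference, a minimal config.js sketch of the per-node layout described above (the hostnames, ports and flush interval here are illustrative assumptions, not copied from our actual config):

{
  port: 8125,                 // UDP port statsd listens on behind the F5
  mgmt_port: 8126,            // TCP management port
  graphiteHost: "127.0.0.1",  // local carbon relay on the same server
  graphitePort: 2003,         // relay consistent-hashes on to the 5 carbon writers
  flushInterval: 10000        // 10 s, matching the 10 s spacing in the captures
}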

[root@graphite ~]# cat /etc/redhat-release 
CentOS release 6.5 (Final)
[root@graphite statsd]# node -v
v0.13.0-pre
[root@graphite statsd]# cat package.json 
{
  "name": "statsd",
  "description": "A simple, lightweight network daemon to collect metrics over UDP",
  "author": "Etsy",
  "scripts": {
    "test": "./run_tests.sh",
    "start": "node stats.js config.js",
    "install-windows-service": "node_modules\\.bin\\winser -i",
    "uninstall-windows-service": "node_modules\\.bin\\winser -r"
  },
  "repository": {
    "type": "git",
    "url": "https://github.com/etsy/statsd.git"
  },
  "version": "0.7.2",
  "dependencies": {
  },
  "devDependencies": {
    "nodeunit": "0.7.x",
    "underscore": "1.4.x",
    "temp": "0.4.x"
  },
  "optionalDependencies": {
    "node-syslog":"1.1.7",
    "hashring":"1.0.1",
    "winser": "=0.1.6"
  },
  "engines": {
    "node" : ">=0.8"
  },
  "bin": { "statsd": "./bin/statsd" }
}

The total input rate is approx. 7000 data points per second, spread across thousands of metrics using 'c' and 'ms' data types, measured over a 5-minute window; see the working below:

[root@graphite ~]# tshark -l -f "tcp port 8126 or udp port 8125" -i em1 -E separator=, -Tfields -e data | xxd -r -p | tr "|" "\n" | pv --line-mode --rate -i 300 > /dev/null
Running as user "root" and group "root". This could be dangerous.
Capturing on em1
[6.94k/s] 

I've written a test script to deliver data points to statsd and managed to reproduce the error.

Test script

#!/bin/bash
# Send a single counter increment to statsd over UDP at random intervals
# of up to 0.75 s; the second echo just logs what was sent to the terminal.
PORT=8125
SERVER=x.x.x.x
while true
do
        echo "Ops.statsd-test:1|c" | nc -w0 -u ${SERVER} ${PORT}
        echo "Ops.statsd-test:1|c"
        sleep $(python -c "import random;print(random.uniform(0.0, 0.75))")
done
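
As a rough sanity check on the numbers (a sketch, not statsd code): statsd reports a counter's .rate as the flushed total divided by the flush interval in seconds, so with the random sleeps above, values around 2-2.6/s are what I'd expect in the capture below:

var flushInterval = 10000;                 // ms (assumed 10 s flush interval)
var count = 22;                            // increments received in one window
var rate = count / (flushInterval / 1000); // => 2.2, matching the capture
// Average sleep is ~0.375 s plus nc/python start-up overhead per packet,
// so ~2-2.6 packets per second is the expected order of magnitude.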

Debugging results:

[root@graphite ~]# tshark -l -f "udp port 2003 or tcp port 2003 or tcp port 2004" -i lo -E separator=, -Tfields -e data | xxd -r -p | grep "Ops.statsd-test.rate"
Running as user "root" and group "root". This could be dangerous.
Capturing on lo
Ops.statsd-test.rate 2.6 1447076111
Ops.statsd-test.rate 3 1447076121
Ops.statsd-test.rate 2.4 1447076131
Ops.statsd-test.rate 0 1447076131
Ops.statsd-test.rate 2.2 1447076141
Ops.statsd-test.rate 2.2 1447076151
Ops.statsd-test.rate 2.3 1447076161
Ops.statsd-test.rate 2.2 1447076171
Ops.statsd-test.rate 2 1447076181
Ops.statsd-test.rate 2.3 1447076191
Ops.statsd-test.rate 2.9 1447076201
Ops.statsd-test.rate 2.5 1447076211
Ops.statsd-test.rate 2.5 1447076221
Ops.statsd-test.rate 2.5 1447076231
Ops.statsd-test.rate 2 1447076241
Ops.statsd-test.rate 2.2 1447076251
Ops.statsd-test.rate 2.5 1447076261
Ops.statsd-test.rate 0 1447076261
Ops.statsd-test.rate 0.7 1447076271

In particular, the duplicate data points:

Ops.statsd-test.rate 2.4 1447076131
Ops.statsd-test.rate 0 1447076131

and

Ops.statsd-test.rate 2.5 1447076261
Ops.statsd-test.rate 0 1447076261

@coykitten coykitten self-assigned this Mar 18, 2016
@coykitten
Contributor

@stevencherry Are you still having these problems? I tried to reproduce this bug this week, but failed. I'm wondering if this is a bug stemming from a particular configuration.

I tested on CentOS 7, node 0.10.16. Next chance I get, I'll test this under the same node runtime.

@deltaroe

Just noticed this issue in our environment as well. Let me know what information you need to help track this down.

@lghamie

lghamie commented Apr 20, 2016

I do believe this is related to this code:
function flushMetrics() {
  // ...
  setTimeout(flushMetrics, getFlushTimeout(flushInterval));
}

and this function:

function getFlushTimeout(interval) {
  return interval - (new Date().getTime() - startup_time * 1000) % flushInterval;
}

This used to be done using setInterval, but has since been changed to setTimeout (I don't understand why).

It seems to me that setTimeout is not that exact, and even though you're correcting its offset, that might be causing this problem.
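
A sketch of the failure mode I have in mind (illustrative numbers, not traced through statsd itself): if the timer fires a few milliseconds early, the corrected delay comes out near zero, two flushes land inside the same one-second timestamp, and the second one reports counters that were already reset:

var flushInterval = 10000;                    // ms
var startup_time_ms = 0;                      // pretend statsd started at t = 0

function getFlushTimeout(nowMs) {
  // same formula as above, with "now" passed in so it can be simulated
  return flushInterval - (nowMs - startup_time_ms) % flushInterval;
}

console.log(getFlushTimeout(29995));          // 5  -> timer fired ~5 ms early
console.log(Math.round(29995 / 1000));        // 30 -> timestamp of first flush
console.log(Math.round((29995 + 5) / 1000));  // 30 -> second flush, same second
// The second flush reports counters the first flush already reset to 0.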

I tested this with 0.7.2 (which uses setInterval) and it worked like a charm.

@coykitten
Contributor

coykitten commented May 6, 2016

@lghamie setInterval causes statsd to suffer from bucket drift, as the metrics aggregation and flush time is not factored into the interval. More here: #489
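
To make the contrast concrete (a sketch of the two scheduling strategies, not the exact code from either release):

// setInterval: the delay never accounts for time spent aggregating and
// flushing, so the actual flush moment drifts off the wall-clock bucket
// boundaries over long uptimes (the problem described in #489).
setInterval(flushMetrics, flushInterval);

// setTimeout + getFlushTimeout: the delay to the next boundary is
// recomputed after every flush, which stays aligned -- but, as reported
// above, a timer that fires early can schedule a second flush almost
// immediately, inside the same one-second timestamp.
setTimeout(flushMetrics, getFlushTimeout(flushInterval));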

It's worth mentioning that this issue was opened well before we merged #489, and that 0.7.2 is a 2-year-old release.

@coykitten
Contributor

Reproduced in #574 and fixed by #575.
