Query about timestamp clamping #390

Closed
ciaranj opened this Issue Jan 31, 2014 · 5 comments

4 participants

@ciaranj

I'm playing with statsd & ceres at the moment (specifically a fork+port of ceres to node.js for funs) and I noted a problem with my collection.

I've configured everything to flush @ 60s intervals and ceres' timestep to also be 60s.

Everything works most of the time :)

However, sometimes the setInterval execution time 'jitters'. This is not usually an issue for jitter in the middle of an interval, say 4:30:31 or 4:30:29 instead of 4:30:30, since ceres considers these all to be part of 4:30:00's timestep.

It is a problem, though, if statsd's flush intervals come in around the ':00' second boundary, for example 4:30:00, 4:29:59, 4:30:01. Whilst all of these are clearly the 'same' flush as far as statsd is concerned, ceres considers them to belong to the 4:30:00, 4:29:00 and 4:30:00 'steps' respectively.

In reality this means I get dropped data-points when a sequence such as 4:30:00 and 4:30:59 occurs, and empty (inserted) datapoints when a sequence such as 4:30:59 and then 4:32:00 occurs.
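
To make the bucketing concrete, this is roughly what the timestep assignment boils down to (just a sketch, not the actual ceres code):

```javascript
// Sketch of the timestep bucketing that causes this: a flush timestamp
// is assigned to the step it falls in by rounding down to a multiple
// of the timestep (60s here).
var timeStep = 60; // seconds

function toTimestep(unixSeconds) {
  // round down to the start of the containing timestep
  return Math.floor(unixSeconds / timeStep) * timeStep;
}

// Mid-interval jitter is harmless: 4:30:29, 4:30:30 and 4:30:31 all map
// to the 4:30:00 step. Boundary jitter is not: 4:29:59 maps to 4:29:00
// while 4:30:00 and 4:30:01 map to 4:30:00, so two flushes can collide
// in one step and leave a neighbouring step empty.
```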

Now, I don't think this is particularly a bug in either bit of code! They both do what they do. What I want to know is: in my ceres back-end [that converts the statsd flushes into ceres node-writes], would it be wrong to do some 'clamping' of the timestamps received from statsd's flush_stats? Since I know that every call to flush_stats conforms to a particular flushInterval, I can choose an initial arbitrary time and manually increment my internal counter by flushInterval on each flush, effectively ignoring the timestamp statsd provides.

Mostly I'm concerned about the drift over time this might introduce, but if it slips out by more than flushInterval/2 I could just reset the process?
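
Something like this is what I have in mind (just a sketch; onFlush and writeToCeres stand in for my backend's actual flush handler and node write):

```javascript
// Sketch of the 'clamping' idea; illustrative names only.
var flushInterval = 60; // seconds; must match statsd's flushInterval
var clampedTs = null;

function onFlush(actualTs, metrics) {
  if (clampedTs === null) {
    // anchor the internal timeline on the first flush
    clampedTs = Math.floor(actualTs / flushInterval) * flushInterval;
  } else {
    // ignore statsd's timestamp and just advance our own counter
    clampedTs += flushInterval;
  }

  // drift guard: if we've slipped by more than half an interval,
  // re-anchor to the real clock rather than restarting the process
  if (Math.abs(actualTs - clampedTs) > flushInterval / 2) {
    clampedTs = Math.floor(actualTs / flushInterval) * flushInterval;
  }

  writeToCeres(clampedTs, metrics); // hypothetical write helper
}
```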

@draco2003
@ciaranj

Hmmm, as far as I can see that would suffer from the same issue as I currently have :/ e.g. 4:19:59 and 4:21:01 would end up 2 minutes apart rather than just 60 seconds apart :/

My example wip is at:
ciaranj@f016306

@mrtazz
Etsy, Inc. member

FYI this is something we see on the graphite backend with regular graphite as well. I introduced the timestamp lag metric some time back to at least have a way of tracking it. One way to minimize the effect of this will likely be making the metric calculation in StatsD more efficient.
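
For reference, the lag gauge is essentially measuring how far each flush drifts from the configured interval, along these lines (a sketch, not the exact StatsD code):

```javascript
// Sketch of what a timestamp-lag gauge measures (not the exact StatsD
// implementation; the metric name also depends on your prefix config).
var flushInterval = 60; // seconds
var lastFlush = null;

function recordTimestampLag(flushTs, gauges) {
  if (lastFlush !== null) {
    // positive when a flush arrives late, negative when it arrives early
    gauges['statsd.timestamp_lag'] = (flushTs - lastFlush) - flushInterval;
  }
  lastFlush = flushTs;
}
```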

@ciaranj

@mrtazz I did notice that gauge, but I couldn't see any backends using it. I'm torn between the approach of specifying an absolute timeline (as in my code) and 'correcting' as in the lag approach.

I did initially try to use the lag value, but was concerned that over time (because it re-calculates the 'new' timestamp from the old one on each flush) the buckets could still end up being skipped (see the sketch after the example below).

I.e. (assuming flush interval of 60s, and times are MM:SS)

  • 00:59 (initial, so has a lag of 00:00, backend chooses 00:00)
  • 02:00 (61s later so has lag of 00:01, backend considers it to be (02:00 - 00:01 = 01:59) => 01:00)
  • 03:02 (62s later so has lag of 00:02, backend considers it to be: (03:02 - 00:02 = 03:00) => 03:00) [Skipped a slot!]
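
In code, the correction I was attempting is roughly this (a sketch; timestampLag stands in for whatever value the backend reads from the lag gauge):

```javascript
// Sketch of the lag-correction attempt.
var flushInterval = 60; // seconds

function correctedTimestep(flushTs, timestampLag) {
  var corrected = flushTs - timestampLag;
  return Math.floor(corrected / flushInterval) * flushInterval;
}

// Walking through the bullets above (times as seconds past the hour):
correctedTimestep(59, 0);   //  59 - 0 =  59 => 00:00
correctedTimestep(120, 1);  // 120 - 1 = 119 => 01:00
correctedTimestep(182, 2);  // 182 - 2 = 180 => 03:00  (02:00 skipped!)
```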

Is this how you envisaged the lag being used, or was it more as a post-event diagnostic to see 'ah there was a lag here, we can't trust this data-boundary' type thingy?

(Obviously my 'maths' could be dumb! :) )

@mrtazz
Etsy, Inc. member

Right now I don't think correcting the timestamp is wise as it might lead to a lot of differences and confusion. I introduced the metric merely to have it graphed and get a better picture of how often it happens and how much impact it has.
