I'm playing with statsd & ceres at the moment (specifically a fork+port of ceres to node.js, for funs) and I've noticed a problem with my collection.
I've configured everything to flush @ 60s intervals and ceres' timestep to also be 60s.
Everything works most of the time :)
However, sometimes the setInterval execution time 'jitters'. That's usually not an issue: jitter around, say, 4:30:30 (giving 4:30:29 or 4:30:31) doesn't matter, as ceres considers all of those part of 4:30:00's timestep.
It is a problem, though, when statsd's flush intervals land around the ':00' second boundary, for example 4:30:00, 4:29:59, 4:30:01. Whilst all of these are clearly the 'same' flush as far as statsd is concerned, ceres considers them as belonging to the 4:30:00, 4:29:00 and 4:30:00 'steps' respectively.
In practice this means I get dropped data-points when a sequence such as 4:30:00 then 4:30:59 occurs (both land in the 4:30:00 step), and empty (inserted) data-points when a sequence such as 4:30:59 then 4:32:00 occurs (nothing lands in the 4:31:00 step).
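For concreteness, here's the bucketing as I understand it (a sketch; I'm assuming ceres floors each timestamp to its timestep boundary, the way whisper does):

```js
var timestep = 60; // seconds, matching the ceres node's timestep

// map a timestamp to the start of its containing timestep
function toStep(ts) {
  return ts - (ts % timestep);
}

// times below are seconds since midnight, for readability
console.log(toStep(16229)); // 4:30:29 -> 16200 (4:30:00)
console.log(toStep(16231)); // 4:30:31 -> 16200 (4:30:00)  harmless jitter
console.log(toStep(16199)); // 4:29:59 -> 16140 (4:29:00)  boundary jitter
console.log(toStep(16201)); // 4:30:01 -> 16200 (4:30:00)  splits the steps
```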
Now, I don't think this is particularly a bug in either bit of code! They both do what they do. But what I want to know is: in my ceres back-end [which converts the statsd flushes into ceres node-writes], would it be wrong to do some 'clamping' of the timestamps received from statsd's flush_stats? Since I know that every call to flush_stats corresponds to one flushInterval, I can choose an arbitrary initial time and manually increment my internal counter by flushInterval on each flush, effectively ignoring the timestamp statsd provides.
Mostly I'm concerned about the drift this might introduce over time, but if it slips out by more than flushInterval/2 I could just reset the process? Something like the sketch below.
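A minimal sketch of what I mean, backend-side (the names here are mine, not statsd's backend API):

```js
var flushInterval = 60; // seconds; must match statsd's flushInterval
var internalTs = null;  // our synthetic timeline

function clampedTimestamp(statsdTs) {
  if (internalTs === null) {
    // anchor the synthetic timeline on the first flush we see,
    // floored to a step boundary
    internalTs = statsdTs - (statsdTs % flushInterval);
    return internalTs;
  }
  var expected = internalTs + flushInterval;
  if (Math.abs(statsdTs - expected) > flushInterval / 2) {
    // slipped more than half an interval from reality:
    // re-anchor (or just restart the process)
    internalTs = statsdTs - (statsdTs % flushInterval);
  } else {
    // the normal case: ignore statsd's timestamp entirely
    internalTs = expected;
  }
  return internalTs;
}
```

With this, flushes at 4:30:00, 4:30:59 and 4:32:00 come out exactly 60s apart instead of straddling step boundaries.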
Hmmm, as far as I can see that would suffer from the same issue I currently have :/ e.g. 4:19:59 and 4:21:01 would end up 2 minutes apart rather than just 60 seconds apart :/
My example wip is at:
FYI this is something we see on the graphite backend with regular graphite as well. I introduced the timestamp lag metric some time back to at least have a way of tracking it. One way to minimize the effect of this will likely be making the metric calculation in StatsD more efficient.
@mrtazz I did notice that gauge, but I couldn't see any backends using it. I'm torn between the approach of specifying an absolute timeline (as in my code) and 'correcting' as in the lag approach.
I did initially try to use the lag value, but was concerned that over time (because the lag is re-calculated against the 'new' *old* timestamp on each flush) buckets could still end up being skipped.
I.e. (assuming a flush interval of 60s, and times as MM:SS), something like the sketch below.
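Here's a made-up simulation of the concern; the lag formula is my guess at how statsd computes it (actual flush time minus the previous flush's timestamp plus one interval), and the steady +1s drift is contrived for illustration:

```js
// simulate flushes that each land 1s later than the one before, so the
// lag gauge reads +1 on every flush and we 'correct' by subtracting it
var interval = 60;
var prev = -interval; // so the first flush measures zero lag

for (var n = 0; n <= 61; n++) {
  var ts = n * (interval + 1);      // actual flush time, drifting +1s per flush
  var lag = ts - (prev + interval); // re-calculated against the *new* old timestamp
  var corrected = ts - lag;         // so it only ever wins back that 1 second
  var bucket = corrected - (corrected % interval);
  console.log('flush', n, '-> bucket', bucket);
  prev = ts;
}

// buckets come out 00:00, 01:00, ..., 60:00, then jump straight to 62:00:
// once the per-flush drift accumulates to a whole interval, the raw
// timestamps skip a bucket and the per-flush lag correction never notices
```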
Is this how you envisaged the lag being used, or was it more as a post-event diagnostic to see 'ah there was a lag here, we can't trust this data-boundary' type thingy?
(Obviously my 'maths' could be dumb! :) )
Right now I don't think correcting the timestamp is wise as it might lead to a lot of differences and confusion. I introduced the metric merely to have it graphed and get a better picture of how often it happens and how much impact it has.